re: [gutvol-d] new thread for noring
jon said:
I hope you get an error rate that is one per ten pages
i'll do my best. :+)
even if you do, I still believe a DP-like process is necessary to catch errors that OCR can't handle
human readers will _always_ be necessary. (and easy enough to find. if no one wants to read a book, there's little call to digitize it.) thus a system of "continuous proofreading" will be quite good enough if we can make the computer-guided processing accurate enough.
and for someone to properly assemble the pages, structure the document, etc., after the OCRing/proofing is complete.
that's part of what i include in "post-o.c.r. processing".
I don't quite put the same level of faith in OCR as you seem to.
except that once you see the evidence i lay out, you will realize "faith" has nothing to do with it. as i've been saying all along, for professionally typeset books, the structure is _in_ the presentation. so o.c.r. gives you all the information you need, if you know how to look for it, and do so diligently.
Btw, I believe as you do that an error reporting system is a good idea so readers may submit errors they find in the texts they use -- sort of an ongoing post-DP proofing process.
post-d.p.? i see it _replacing_ d.p. for most books. and good thing, too. once the coming avalanche of scanned-books engulfs us, it'll be the only way most books have a chance to surface. that will take the pressure off distributed proofreaders, and they'll be able to focus on the books that _need_ them.
Obviously, it is necessary to make available the page scans of the source document to aid in this process. How can an error be properly verified and corrected when the source work is not available?
i've always said i think that page-scans should be publicly available. particularly if your mission is _transcribing_an_existing_edition_. (although, to remind people again, copyism is _not_ the mission that michael hart chose to embed within his project gutenberg.) but even in the case of project gutenberg's "amalgamated" e-texts, i believe that a page-image graphic-version should be made available. this would allow people to view it on a dvd-player, just as an example.
Scanning took quite a while (much more than four hours)
that doesn't surprise me. nonetheless, i'll limit myself to 4 hours. that's quite enough time to devote to it. and to prove the point too.
I deemed it important for processing purposes that the name of the image contain semantic information of what it represents, and that naming be consistent for file sorting purposes.
as one improvement, i would suggest _not_ using "001.png", etc. instead, preface each one with a string that will make it _unique_, such as "ma2005feb001.png". it's easy to tell that to the o.c.r. app -- you just type it in one time -- and it's an unmistakable stamp. and of course, if you're going to do hundreds or thousands of books, you want to cook up a naming convention that conveys information. on big multimedia projects, it is not at all uncommon to have one _full-time_ employee dedicated _solely_ to maintaining filenames. because if things go wrong, it can waste a whole lot of man-hours. oh yeah, one more suggestion. your front-matter filenames were prefaced with an "r". my typical recommendation is that they be prefaced with an "f", and that the regular pages be named with a "p", so the front-matter files will sort _on_top_of_ the regular pages. i want to be able to depend on the operating-system filename sort to give me pages in the exact order they appear in the book itself. so i use a "q" on back-matter files, so they will drop to the bottom. for illustration plates, i use a name that sorts _them_ correctly; for instance, if an illustration page is between pages 168 and 169, name it "p168a.png". (and don't forget the blank verso side either!, which you will name "p168b.png".)
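The sorting behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `page_filename`, the hyphen between the book prefix and the section letter, and the three-digit zero padding are not from the post. Only the "f"/"p"/"q" letters, the "ma2005feb" prefix, and the "p168a"/"p168b" plate names are.

```python
# Hypothetical helper. The f/p/q letters and the plate suffixes come from
# the post above; the hyphen and the 3-digit padding are assumptions.
SECTION_LETTER = {"front": "f", "body": "p", "back": "q"}

def page_filename(book_id, section, number, plate="", ext="png"):
    """Build a name such as 'ma2005feb-p168a.png'."""
    letter = SECTION_LETTER[section]
    return f"{book_id}-{letter}{number:03d}{plate}.{ext}"

# Deliberately out of order, to show that a plain sort restores book order.
names = [
    page_filename("ma2005feb", "back", 1),
    page_filename("ma2005feb", "body", 169),
    page_filename("ma2005feb", "body", 168, plate="b"),  # blank verso of the plate
    page_filename("ma2005feb", "front", 2),
    page_filename("ma2005feb", "body", 168),
    page_filename("ma2005feb", "body", 168, plate="a"),  # illustration plate
    page_filename("ma2005feb", "front", 1),
]

for name in sorted(names):
    print(name)
```

A plain string sort puts the front-matter first, the plate and its blank verso between pages 168 and 169, and the back-matter last. The zero padding matters: without it, "p10" would sort ahead of "p2".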
The publisher simply chose to start at page 3. Was this common?
it's not uncommon. oftentimes there is a "title-page", consisting of nothing more than the name of the book, which is considered "page 1", with its blank verso being "page 2", so chapter 1 starts on "page 3". sometimes chapter 1 starts on page 7. or page 11. publishers are weird.
Maybe there was an intent to insert a page there, which after typesetting it was decided not to.)
sometimes that happens too, yep. an "unnecessary" page gets dropped when the typesetter realizes they didn't plan the signatures correctly. or when the preface runs two pages longer than was originally intended. or any number of other snafus spring up. shit happens.
It was my intent to reproduce each page for direct reading purposes -- that is, if somebody wanted to read the book as it was printed, then they could.
yeah, and sometimes people want to do exactly that. which is why the page-images should be made available. for many illustrated books, the text alone is not enough. you want to be able to see the pages as they were printed. my viewer-program will work with either, text or images. it'll even work in "hybrid" mode, so you can display the text in one of the 2-up pages, and the page-image on the other side. (and of course that is the mode which is used for proofreading.) that's why things like _blank_pages_ are so important to include. because if you toss them out, you screw up the left/right sequence. a convention of paper-books is that odd pages always go on the right. screw that up and you make yourself look silly. anyway, that's all for now. -bowerbird
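The point about blank pages and the left/right sequence can be checked with a small sketch. It assumes the first page in the list is a recto (right-hand page) and that pages are given as plain integers; the function names `two_up` and `misplaced` are made up for this illustration and are not taken from any viewer-program.

```python
def two_up(pages):
    """Pair pages into (left, right) spreads; the first page is assumed to be a recto."""
    spreads = [(None, pages[0])] if pages else []
    rest = pages[1:]
    for i in range(0, len(rest), 2):
        left = rest[i]
        right = rest[i + 1] if i + 1 < len(rest) else None
        spreads.append((left, right))
    return spreads

def misplaced(spreads):
    """Return page numbers that break the odd-pages-on-the-right convention."""
    bad = []
    for left, right in spreads:
        if isinstance(left, int) and left % 2 == 1:
            bad.append(left)   # odd page ended up on the left
        if isinstance(right, int) and right % 2 == 0:
            bad.append(right)  # even page ended up on the right
    return bad

complete = [1, 2, 3, 4, 5, 6]
without_blank = [1, 3, 4, 5, 6]  # blank page 2 thrown away

print(misplaced(two_up(complete)))       # no pages out of place
print(misplaced(two_up(without_blank)))  # every page after the gap is on the wrong side
```

Dropping a single blank page shifts every later page to the opposite side of the spread, which is why the blank scans need to stay in the set.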
Bowerbird wrote:
as one improvement, i would suggest _not_ using "001.png", etc. instead, preface each one with a string that will make it _unique_, such as "ma2005feb001.png". it's easy to tell that to the o.c.r. app -- you just type it in one time -- and it's an unmistakable stamp.
Yes, a very good suggestion, and one that is being planned. I held off because we are still thinking through the exact syntax of the book identifier, although it *might* be based somewhat on the WEMI (Work/Expression/Manifestation/Item) principle. The LibraryCity ID used at the current "My Antonia" site is just a quick improvisation of the WEMI principle.

For example:

  Work:          "Frankenstein" by Mary Shelley
  Expression:    Second edition (which differs a lot from the First)
  Manifestation: 1895 printing edited by John Doe (just a dummy example)
  Item:          XHTML

So in Trusted Editions, filed under the WorkID for "Frankenstein", we could have multiple Expressions each with its own ExprID, e.g. First Edition, Second Edition, a lost manuscript for a third edition, etc. (Many books will have only one Expression since they did not become popular and no author manuscript exists.) Under Manifestation we could have several (with ManfID's) based on later edited editions as well as a modern "Michael Hart" style amalgamated/edited edition. And then for each Manifestation we can have several formats (Items, ItemID -- yeah, this is a small twist on WEMI as it officially exists since 'item' in the pbook world usually refers to a particular printed copy of a Manifestation, with coffee stains and page rips and all -- but this works well for ebooks/etexts where each item is a duplicatable digital format derived from the paper Manifestation. This is not yet etched in concrete -- it is still in the idea stage.)

So, as an example, we might have for Identifiers:

  WorkID: 00000000025 (enough for 100 billion general Works.)
  ExprID: 02
  ManfID: 03
  ItemID: 008 (referring to some standardized list which expands over time)

So the overall ID for a particular format of a particular source paper book might be:

  00000000025-02-03-008 (yeah, it's long)

Page scans only need the WEM portion of the ID for prefixing on the filename:

  00000000025-02-03-p295.png

(If we only care about 100 million Works, then we may have: 00000025-02-03-p295.png )

Of course, the WEM-ID itself does not contain any metadata other than identifiers, but that would mesh with a database. It is very problematic to include any Dublin Core type of metadata within an identifier. It might be understandable to use the first letters of the first two words of the title (ignoring articles), such as MA for "My Antonia", but that's as far as I'd go.
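As a rough sketch of how such identifiers could be assembled, assuming the field widths shown in the examples above (11 or 8 digits for the Work, 2 for Expression and Manifestation, 3 for Item); the function names are hypothetical and nothing here is a settled scheme:

```python
def wem_id(work, expr, manf, work_digits=11):
    """Zero-padded Work-Expression-Manifestation prefix, e.g. '00000000025-02-03'."""
    return f"{work:0{work_digits}d}-{expr:02d}-{manf:02d}"

def full_id(work, expr, manf, item, work_digits=11):
    """Overall ID including the Item (format) code, e.g. '00000000025-02-03-008'."""
    return f"{wem_id(work, expr, manf, work_digits)}-{item:03d}"

def scan_filename(work, expr, manf, page, work_digits=11, ext="png"):
    """Page-scan filename that needs only the WEM portion of the ID."""
    return f"{wem_id(work, expr, manf, work_digits)}-{page}.{ext}"

print(full_id(25, 2, 3, 8))
print(scan_filename(25, 2, 3, "p295"))
print(scan_filename(25, 2, 3, "p295", work_digits=8))
```

Keeping every field at a fixed width means the filenames also sort consistently, and the ID carries no descriptive metadata, so that lookup stays in the database.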
and of course, if you're going to do hundreds or thousands of books, you want to cook up a naming convention that conveys information. on big multimedia projects, it is not at all uncommon to have one _full-time_ employee dedicated _solely_ to maintaining filenames. because if things go wrong, it can waste a whole lot of man-hours.
Every scanned image is a unique digital object, so it needs to have a unique identifier in the object's file name, applied when it is created, along with a metadata record somewhere to describe and keep track of it. The catalogers will take care of the identifiers and metadata, which go hand in hand.
oh yeah, one more suggestion. your front-matter filenames were prefaced with an "r". my typical recommendation is that they be prefaced with an "f", and that the regular pages be named with a "p", so the front-matter files will sort _on_top_of_ the regular pages. i want to be able to depend on the operating-system filename sort to give me pages in the exact order they appear in the book itself. so i use a "q" on back-matter files, so they will drop to the bottom. for illustration plates, i use a name that sorts _them_ correctly; for instance, if an illustration page is between pages 168 and 169, name it "p168a.png". (and don't forget the blank verso side either!, which you will name "p168b.png".)
Also an excellent suggestion. The 'r' stands for "Roman", but I noticed in sorting that the pages are not ordered, so the front-/body-/end-matter approach makes sense. Too bad 'b' comes before 'f', as you noted.
It was my intent to reproduce each page for direct reading purposes -- that is, if somebody wanted to read the book as it was printed, then they could.
that's why things like _blank_pages_ are so important to include. because if you toss them out, you screw up the left/right sequence. a convention of paper-books is that odd pages always go on the right. screw that up and you make yourself look silly.
Definitely! I will certainly need to relook at what I did to make sure it's all there. Handling inserted illustrations is a problem name-wise since in "My Antonia", the illustrations were inserts between numbered pages. So for naming/sorting purposes that will need to be worked out. Thanks for the ideas. Jon
The publisher simply chose to start at page 3. Was this common?
it's not uncommon. oftentimes there is a "title-page", consisting of nothing more than the name of the book, which is considered "page 1",
Those are called "half-titles," btw.
participants (3): Bowerbird@aol.com, D Garcia, Jon Noring