Re: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting

jon said:
Not possible, unless one bought the *big buck* (above office-level) sheet feed or page turning scanners, or one simply used a photocopy machine, and captured the low-rez images it produces.
my girlfriend's office has a $10,000 lanier just down the hall. that's the kind of machine i was talking about. their website says that their high-end machines can scan 60+ pages a minute. but i grant you that a scanning time of a few hours (or more) is much more in line with what most normal people can attain, even those with lots of experience like yourself...
Yes, but this is not for the average, ordinary Joe working in his basement. This requires a lot of $$$ in upfront investment to get this fancy equipment and software.
i think you might be surprised in the coming months, jon.
There's still need for the whiz-bang scan cleanup software, which I know is expensive.
donovan was working on some open-source deskewing routines. might want to check that out. and i'm told that abbyy does a fairly good job setting brightness and contrast automatically. so the other thing that needs to be done is to standardize the placement of each scan relative to each other, which isn't hard. (removing curvature is a bear, but the best new scanner out -- the optik? -- lets you lay the book on the edge of the bed, which i understand effectively cures the curvature problems.)
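just to show the flavor of it, here's a bare-bones python sketch of the usual projection-profile trick for deskewing -- to be clear, this is _not_ donovan's code, the file names are made up, and a real tool would need to handle illustrations and noise much more carefully:

    # rough sketch of projection-profile deskewing (illustration only)
    import numpy as np
    from PIL import Image

    def deskew(path, max_angle=5.0, step=0.25):
        img = Image.open(path).convert("L")                    # grayscale page
        small = img.resize((img.width // 4, img.height // 4))  # shrink for speed
        best_angle, best_score = 0.0, -1.0
        for angle in np.arange(-max_angle, max_angle + step, step):
            rotated = small.rotate(angle, fillcolor=255)
            ink = 255 - np.asarray(rotated, dtype=np.float64)  # dark pixels = ink
            profile = ink.sum(axis=1)                          # ink per scan line
            score = profile.var()                              # straight text -> sharp peaks
            if score > best_score:
                best_angle, best_score = float(angle), score
        return img.rotate(best_angle, expand=True, fillcolor=255)

    deskew("page_0001.png").save("page_0001_deskewed.png")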
in "My Antonia" a lot of pages were not numbered at all
that's not uncommon.
(such as the last page in each chapter).
yes, i noticed that. _that_ is a little uncommon. but like i said earlier, publishers can be weird.
I had to be especially careful not to mess up and lose which page is which.
it's _fairly_ easy to do each page in sequence -- just have to pay some attention turning the page -- and then using the auto-increment-name option will ensure that all of the files are named correctly.
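and if somebody's scanning software doesn't have that option, it's a tiny script anyway. here's a rough python sketch (the folder name, the .png pattern, and the "antonia" prefix are all made-up examples):

    # rename scans into a zero-padded sequence (illustration only)
    from pathlib import Path

    scans = sorted(Path("raw_scans").glob("*.png"))     # sorted by the scanner-assigned names
    for i, f in enumerate(scans, start=1):
        f.rename(f.with_name(f"antonia_{i:04d}.png"))   # antonia_0001.png, antonia_0002.png, ...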
Hmmm, this is a lot like what James Linden is developing, which may be incorporated into PG of Canada's operations. <smile/>
if you check the archives you'll find i'm the one who posted it. i also offered to write all the software. all that was ignored. doesn't matter though, i'm proceeding to build my own system. if james took my post to heart, then he's smart. :+)
Doing 15,000 texts, or a million texts, still needs some manual processing.
if you're manually opening every file, and manually summoning every scan you need to check, you're going to burn yourself out. _plus_ expose yourself to the risk of inadvertent changes. you have to have a system that tracks every change that's made, so you can review the log to make sure it was the correct change, and that nothing else was changed. reviewing the log is "manual", and so is the decision as to _approval/rejection_ of the change, but the change itself should be totally automated.
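to make the shape of that concrete, here's a bare-bones python sketch of what i mean -- every proposed fix goes into a log, a human approves or rejects it, and only approved fixes ever touch the file. (the file names and log format here are invented purely for illustration.)

    # minimal sketch of logged, reviewable corrections (names are invented)
    import json, time
    from pathlib import Path

    LOG = Path("corrections.log")

    def propose(etext, old, new, reporter):
        """record a proposed correction; nothing is changed yet."""
        entry = {"etext": etext, "old": old, "new": new,
                 "reporter": reporter, "time": time.time(), "status": "pending"}
        with LOG.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def review():
        """human pass: approve or reject each pending entry, then apply approved ones."""
        entries = [json.loads(line) for line in LOG.read_text().splitlines()]
        for e in entries:
            if e["status"] != "pending":
                continue
            answer = input(f'{e["etext"]}: "{e["old"]}" -> "{e["new"]}" ? (y/n) ')
            e["status"] = "approved" if answer.lower() == "y" else "rejected"
            if e["status"] == "approved":
                path = Path(e["etext"])
                text = path.read_text()
                if e["old"] in text:                    # apply exactly one automated change
                    path.write_text(text.replace(e["old"], e["new"], 1))
        LOG.write_text("".join(json.dumps(e) + "\n" for e in entries))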
That's why, to me, it is more important to redo the collection, put it on a common, surer footing (including building trust), before launching into doing a lot more texts.
the library needs to be _corrected_, yes, but _not_ "redone". and i think you do more damage than good when you talk about e-texts being done "incorrectly", when what you _really_ mean is that an edition was used that you don't happen to approve of, or that metadata isn't included, just to use a couple of examples. there are _real_ errors in the e-texts. honest-to-goodness mistakes. we need to concentrate on _those_, not on some edition that uses the british spellings instead of american ones. (even if that _was_ silly.) but distributed proofreaders is more interested in doing new books than fixing old ones. they're volunteers who set their own priorities.
Imagine how difficult it would be to process one million texts if they were produced in the same ad-hoc fashion, without following some common standards.
i don't have to "imagine" it. that's the way the library is now. and i made my fair share of efforts to try and convince the powers that that situation needed to be addressed with some standardization. but the difficulty of doing it with the type of heavy-markup that you like has held up that whole darn process. if we would have proceeded with the "zen markup language" that i like, the library would have been clean now.
PG's ad hoc approach up to now (which DP has partly fixed)
the d.p. e-texts still exhibit a large degree of inconsistencies. and contrary to what you imply, they are not generally error-free. some are, but others are not. the same is true of earlier e-texts. the quality has improved, yes, surely. it is still not of the highest quality. but they are volunteers, and thus they set their own bar for quality. and they certainly deliver quality that is high enough that we could use "continuous proofreading" and have the public zoom us to perfection.
it can't be done using any plain text regularization scheme
you're wrong. dead wrong. *** anyway, jon, thanks for the information on your scanning experience. i come away from hearing it with an even firmer conclusion that scanning and image-cleanup is indeed the biggest part of the process. -bowerbird

Bowerbird wrote:
jon said:
Not possible, unless one bought the *big buck* (above office-level) sheet feed or page turning scanners, or one simply used a photocopy machine, and captured the low-rez images it produces.
my girlfriend's office has a $10,000 lanier just down the hall. that's the kind of machine i was talking about. their website says that their high-end machines can scan 60+ pages a minute.
But what resolution? With scanners that move something with respect to the page, the higher the resolution, the slower it is. (On the other hand, today's 12 megapixel digital cameras, which for "My Antonia" would produce approximately 600 dpi quality, take a snapshot of the whole page, and can transfer the file in a very short time, shorter than it takes to turn the page.)
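Just to show where that rough figure comes from, here is the back-of-envelope arithmetic; the 3:2 sensor aspect ratio and the 5 x 7.5 inch printed-page area below are my assumptions, not measurements of the actual book:

    # back-of-envelope check of the "approximately 600 dpi" figure
    # (the 3:2 sensor aspect and 5 x 7.5 inch page area are assumed values)
    pixels = 12_000_000                      # 12-megapixel sensor
    long_px = (pixels * 3 / 2) ** 0.5        # ~4243 px on the long side
    short_px = pixels / long_px              # ~2828 px on the short side
    page_h, page_w = 7.5, 5.0                # assumed page area, in inches

    dpi = min(long_px / page_h, short_px / page_w)
    print(round(dpi))                        # ~566 dpi, i.e. roughly 600 dpi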
but i grant you that a scanning time of a few hours (or more) is much more in line with what most normal people can attain, even those with lots of experience like yourself...
Well, I'm not an experienced scanner (there's a difference between understanding the principles, and actual experience), but I think by the time I got finished with My Antonia, I gained a few stripes. <smile/>
There's still need for the whiz-bang scan cleanup software, which I know is expensive.
donovan was working on some open-source deskewing routines. might want to check that out.
O.k., thanks. Open source, high-quality deskewing routines are definitely needed! Now it's a matter of also getting a high-quality open source cropping and normalization application.
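For what it's worth, the core operations are not exotic. Here is a rough sketch of the idea using the open source Pillow imaging library; the threshold, border, and file names are placeholders, and a real application would need far more care around illustrations and uneven lighting:

    # rough sketch of auto-crop plus contrast normalization (illustration only)
    from PIL import Image, ImageOps

    def crop_and_normalize(path, border=20, threshold=200):
        page = Image.open(path).convert("L")
        # treat near-white pixels as background; find the bounding box of the ink
        ink = page.point(lambda p: 255 if p < threshold else 0)
        box = ink.getbbox()
        if box:
            left, top, right, bottom = box
            page = page.crop((max(left - border, 0), max(top - border, 0),
                              min(right + border, page.width),
                              min(bottom + border, page.height)))
        # stretch the histogram so faint scans get consistent contrast
        return ImageOps.autocontrast(page, cutoff=1)

    crop_and_normalize("page_0001.png").save("page_0001_clean.png")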
and i'm told that abbyy does a fairly good job setting brightness and contrast automatically. so the other thing that needs to be done is to standardize the placement of each scan relative to each other, which isn't hard. (removing curvature is a bear, but the best new scanner out -- the optik? -- lets you lay the book on the edge of the bed, which i understand effectively cures the curvature problems.)
Yes, I've heard of these book-oriented scanners, which are more gentle on bindings (but even here the binding is stressed). There's a web site somewhere giving a review of the model you describe, but I don't have the URL handy.
but distributed proofreaders is more interested in doing new books than fixing old ones. they're volunteers who set their own priorities.
Yes, that is true. There is a lot of interest in DP to redo a lot of the pre-DP classics in the PG corpus, from what I understand, so it may get done anyway even if PG does not encourage it. Jon

"Jon" == Jon Noring <jon@noring.name> writes:
Jon> Yes, I've heard of these book-oriented scanners, which are more
Jon> gentle on bindings (but even here the binding is stressed). There's
Jon> a web site somewhere giving a review of the model you describe, but
Jon> I don't have the URL handy.

I have one, a Plustek OpticBook 3600, and I am very much satisfied with it, but scanning books in book mode trims away at least 1 cm in the middle, so it can be used only if the margins are generous. To use it you have to open the book at 90 degrees, which is usually possible. I am satisfied nevertheless with the speed, the depth of the scan (there is almost no shadow in the gutter), and the overall quality for the price. I see it quoted now at $239, but it is difficult to find in online shops: apparently there is much demand.

However, in my experience, the limit is not scanning quality: it is print quality. OCR quality is pretty good on modern editions, but old books, often stained, and even more often with defective print, give rise to a lot of errors. Often you don't have the choice of a better print.

Carlo

Carlo Traverso wrote:
I have one, a Plustek OpticBook 3600, and I am very much satisfied with it, but scanning books in book mode trims away at least 1 cm in the middle, so it can be used only if the margins are generous. To use it you have to open the book at 90 degrees, which is usually possible.
I use the same model, and am very happy with its speed; for 300 dpi images of 8vo sized books, I have clocked myself at 300 pages per hour on a book with a good binding. I don't know what software you use it with, but if you have Abbyy, you might do what I do and run it through Abbyy's interface rather than its own "book mode" interface. The Abbyy driver should capture the entire platen rather than throwing away the outer cm. My experience is that having the book only 90 degrees open eliminates much of the gutter shadow on its own, and the additional processing that "book mode" does is largely unnecessary.
However, in my experience, the limit is not scanning quality: it is print quality. OCR quality is pretty good on modern editions, but old books, often stained, and even more often with defective print, give rise to a lot of errors. Often you don't have the choice of a better print.
This can't be helped. However, the other issue that produces problematic raw OCR is that even when character recognition is good, layout detection can be poor. Sidenotes, multi-column text, and the like can be blended in with the main text, corners might be chopped off, and in older printings where the inter-line spacing is not exactly constant, whole lines can be elided. If I'm going to exert more effort in getting images and OCR, I've found that the place where it pays off the most is in previewing and correcting the recognition areas before letting the OCR do its work. -- RS
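P.S. For anyone without ABBYY who wants a quick preview of what the layout analysis found before trusting the output, here is a rough sketch using the open-source tesseract engine via pytesseract. It is a different tool than the one I use, and the file name is a placeholder; the point is only to eyeball the detected blocks for merged sidenotes or dropped lines.

    # dump the text blocks tesseract detects so merged sidenotes or dropped
    # lines can be spotted before the real OCR pass (illustration only)
    import pytesseract
    from pytesseract import Output
    from PIL import Image

    data = pytesseract.image_to_data(Image.open("page_0001.png"),
                                     output_type=Output.DICT)
    blocks = {}
    for i, text in enumerate(data["text"]):
        if text.strip():
            key = (data["block_num"][i], data["par_num"][i])
            blocks.setdefault(key, []).append(text)

    for (block, par), words in sorted(blocks.items()):
        print(f"block {block}, paragraph {par}: {' '.join(words)[:60]}")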
participants (4)
- Bowerbird@aol.com
- Carlo Traverso
- Jon Noring
- Robert Shimmin