
jon said:
>   Not possible, unless one bought the *big buck* (above office-level) sheet feed or page turning scanners, or one simply used a photocopy machine, and captured the low-rez images it produces.
my girlfriend's office has a $10,000 lanier just down the hall. that's the kind of machine i was talking about. their website says that their high-end machines can scan 60+ pages a minute. but i grant you that a scanning time of a few hours (or more) is much more in line with what most normal people can attain, even those with lots of experience like yourself...
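just to put rough numbers on the gap between that machine and a hand-fed flatbed -- the 60-pages-a-minute rate is from the discussion here, but the book length and the flatbed rate below are my own guesses:

```python
# back-of-the-envelope scan times.  the 60-pages-a-minute rate comes
# from the discussion above; the 300-page book and the 2-pages-a-minute
# flatbed rate are guesses, just to show the scale of the difference.

def scan_minutes(pages, pages_per_minute):
    """minutes needed to push `pages` through at a given rate."""
    return pages / pages_per_minute

book = 300  # pages in a typical novel (assumed)

print(scan_minutes(book, 60))  # high-end sheet-feed machine: 5 minutes
print(scan_minutes(book, 2))   # hand-fed flatbed: 150 minutes, i.e. a few hours
```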
>   Yes, but this is not for the average, ordinary Joe working in his basement. This requires a lot of $$$ in upfront investment to get this fancy equipment and software.
i think you might be surprised in the coming months, jon.
>   There's still a need for the whiz-bang scan-cleanup software, which I know is expensive.
donovan was working on some open-source deskewing routines. might want to check that out. and i'm told that abbyy does a fairly good job of setting brightness and contrast automatically. the other thing that needs to be done is to standardize the placement of each scan relative to the others, which isn't hard. (removing curvature is a bear, but the best new scanner out -- the optik? -- lets you lay the book on the edge of the bed, which i understand effectively cures the curvature problems.)
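just to make the deskewing step concrete, here's a rough sketch of one common approach -- this is _not_ donovan's code, the filename is made up, and it assumes pillow and numpy are installed. the idea: try a range of small rotations and keep the one whose ink-per-row profile is sharpest, which is what you get when the text lines sit level.

```python
# rough deskew sketch -- not donovan's routines, just the general idea.
# assumes pillow and numpy are installed; the filename is made up.

import numpy as np
from PIL import Image

def row_profile_score(gray_img):
    """variance of ink per row; higher means crisper, level text lines."""
    pixels = np.asarray(gray_img, dtype=float)
    ink_per_row = (255.0 - pixels).sum(axis=1)
    return ink_per_row.var()

def deskew(img, max_angle=3.0, step=0.25):
    """try rotations within +/- max_angle degrees, return the best one."""
    gray = img.convert("L")
    best_angle = 0.0
    best_score = row_profile_score(gray)
    for angle in np.arange(-max_angle, max_angle + step, step):
        # small angles only, so clipping at the corners is negligible;
        # fillcolor keeps the exposed corners white instead of black.
        candidate = gray.rotate(angle, fillcolor=255)
        score = row_profile_score(candidate)
        if score > best_score:
            best_angle, best_score = angle, score
    return gray.rotate(best_angle, fillcolor=255)

if __name__ == "__main__":
    page = Image.open("scan_0042.png")          # made-up filename
    deskew(page).save("scan_0042_deskewed.png")
```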
>   in "My Antonia" a lot of pages were not numbered at all
that's not uncommon.
>   (such as the last page in each chapter).
yes, i noticed that. _that_ is a little uncommon. but like i said earlier, publishers can be weird.
>   I had to be especially careful not to mess up and lose track of which page was which.
it's _fairly_ easy to do each page in sequence -- you just have to pay some attention when turning the pages -- and then using the auto-increment-name option will ensure that all of the files are named correctly.
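for anyone whose scanner software lacks that option, here's roughly what auto-increment naming amounts to, done after the fact in python. the folder names are just placeholders; it sorts the raw scans by capture time and renames them 001.png, 002.png, ... so file order matches page order.

```python
# after-the-fact version of the auto-increment-name idea, for scanner
# software that lacks the option.  folder names are placeholders.

import os

src = "raw_scans"       # wherever the scanner dropped its files
dst = "numbered_scans"  # where the sequentially named files go
os.makedirs(dst, exist_ok=True)

scans = [f for f in os.listdir(src) if f.lower().endswith(".png")]
scans.sort(key=lambda name: os.path.getmtime(os.path.join(src, name)))

for page_number, name in enumerate(scans, start=1):
    new_name = f"{page_number:03d}.png"
    os.rename(os.path.join(src, name), os.path.join(dst, new_name))
    print(f"{name} -> {new_name}")
```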
>   Hmmm, this is a lot like what James Linden is developing, which may be incorporated into PG of Canada's operations. <smile/>
if you check the archives, you'll find i'm the one who posted that idea. i also offered to write all the software. all of that was ignored. doesn't matter, though; i'm proceeding to build my own system. if james took my post to heart, then he's smart. :+)
>   Doing 15,000 texts, or a million texts, still needs some manual processing.
if you're manually opening every file, and manually summoning every scan you need to check, you're going to burn yourself out. _plus_ expose yourself to the risk of inadvertent changes. you have to have a system that tracks every change that's made, so you can review the log to make sure it was the correct change, and that nothing else was changed. reviewing the log is "manual", and so is the decision as to _approval/rejection_ of the change, but the change itself should be totally automated.
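here's a bare-bones sketch of the kind of thing i mean -- the filenames and the log format are made up for illustration, but the point stands: the program makes the change _and_ writes down exactly what it touched, with context, so a human only has to approve or reject entries in the log afterward.

```python
# bare-bones change-tracking sketch.  filenames and log format are
# made up.  the program applies each correction and logs every spot
# it touched, with context, so a reviewer approves or rejects from
# the log -- nobody edits the text file by hand.

import json
import re

def apply_correction(text, pattern, replacement, etext_id, log_path="changes.log"):
    """apply one automated correction and log every place it changed."""
    entries = []

    def record(match):
        start, end = match.span()
        entries.append({
            "etext": etext_id,
            "old": match.group(0),
            "new": replacement,
            "context": text[max(0, start - 30):end + 30],
            "status": "pending-review",
        })
        return replacement

    new_text = re.sub(pattern, record, text)

    with open(log_path, "a", encoding="utf-8") as log:
        for entry in entries:
            log.write(json.dumps(entry) + "\n")

    return new_text

# example: fix the classic "tbe" scanno in one e-text (filenames made up)
with open("myantonia.txt", encoding="utf-8") as f:
    original = f.read()

corrected = apply_correction(original, r"\btbe\b", "the", etext_id="myantonia")

with open("myantonia_corrected.txt", "w", encoding="utf-8") as f:
    f.write(corrected)
```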
>   That's why, to me, it is more important to redo the collection, put it on a common, surer footing (including building trust), before launching into doing a lot more texts.
the library needs to be _corrected_, yes, but _not_ "redone". and i think you do more damage than good when you talk about e-texts being done "incorrectly", when what you _really_ mean is that an edition was used that you don't happen to approve of, or that metadata isn't included, just to use the most common examples. there are _real_ errors in the e-texts. honest-to-goodness mistakes. we need to concentrate on _those_, not on some edition that uses british spellings instead of american ones. (even if that _was_ silly.) but distributed proofreaders is more interested in doing new books than fixing old ones. they're volunteers who set their own priorities.
>   Imagine how difficult it would be to process one million texts if they were produced in the same ad-hoc fashion, without following some common standards.
i don't have to "imagine" it. that's the way the library is now. and i made my fair share of efforts to convince the powers that be that the situation needed to be addressed with some standardization. but the difficulty of doing it with the type of heavy markup that you like has held up that whole darn process. if we had proceeded with the "zen markup language" that i like, the library would be clean by now.
>   PG's ad hoc approach up to now (which DP has partly fixed)
the d.p. e-texts still exhibit a large degree of inconsistency. and contrary to what you imply, they are not generally error-free. some are, but others are not. the same is true of earlier e-texts. the quality has improved, yes, surely. it is still not the highest quality. but they are volunteers, and thus they set their own bar for quality. and they certainly deliver quality that is high enough that we could use "continuous proofreading" and have the public zoom us to perfect.
>   it can't be done using any plain text regularization scheme
you're wrong. dead wrong.

***

anyway, jon, thanks for the information on your scanning experience. i come away from hearing it with an even firmer conclusion that scanning and image-cleanup are indeed the biggest part of the process.

-bowerbird