
16 Oct 2011, 4:38 a.m.
and of the ~15,000 lines in this book, approximately 9,000 have identical o.c.r. across all 4 digitizations, with another 3,000 that have identical o.c.r. across 3 of the 4 digitizations, meaning that the "problem lines" can be boiled down to a fairly small subset: at most roughly 3,000 lines, or about a fifth of the book.
Or one can actually go to archive.org, take a look at the o.c.r. text files, and evaluate for oneself how much work this is, or isn't, going to be. [Not exactly sure how BB comes up with his measures!]
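For anyone who wants to reproduce that kind of count, here is a rough sketch in Python. It is not the tool used for the numbers above. It assumes the four o.c.r. texts have already been split into lines and aligned, so that line i in each version corresponds to the same printed line; real scans would need an alignment step first. The names `agreement_tally` and `problem_lines` are made up for illustration.

```python
from collections import Counter


def best_agreement(readings):
    """Size of the largest group of versions that read a line identically."""
    # Trailing/leading whitespace is ignored; everything else must match exactly.
    counts = Counter(r.strip() for r in readings)
    return counts.most_common(1)[0][1]


def agreement_tally(versions):
    """Count lines by how many digitizations agree on them.

    versions: list of line lists, one per digitization, assumed pre-aligned.
    Returns a Counter mapping agreement size (e.g. 4, 3, 2, 1) to line count.
    """
    tally = Counter()
    for readings in zip(*versions):
        tally[best_agreement(readings)] += 1
    return tally


def problem_lines(versions, min_agree=3):
    """Indices of lines where fewer than min_agree versions agree."""
    return [
        i
        for i, readings in enumerate(zip(*versions))
        if best_agreement(readings) < min_agree
    ]


if __name__ == "__main__":
    # Toy data standing in for four digitizations of the same four lines.
    versions = [
        ["the cat", "sat on", "the mat", "today"],
        ["the cat", "sat on", "the rnat", "to-day"],
        ["the cat", "sat on", "the mat", "tod ay"],
        ["the cat", "sat 0n", "the mat", "today"],
    ]
    print(agreement_tally(versions))
    print(problem_lines(versions))
```

With 4-of-4 and 3-of-4 agreement treated as settled, only the indices returned by `problem_lines` would need a human look.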