hello. i hope you're having an enjoyable holiday weekend. :+)
as i said, the thrust of the thread is how to digitize a book.
if anyone had suggested a book, i woulda used their choice;
but since they didn't, i decided to do an old standby again...
"books and culture", by hamilton mabie, is a pleasant book.
it's 280 pages, with about 220k of text, so it's average size.
i used a version scanned by the internet archive last november.
find it by searching archive.org for "booksculture00mabiuoft".
as many of you will remember, we've looked at this book a lot.
in fact, it was one of the examples which i just gave benjamin.
this means that i can test my efforts in this current digitization
by comparing my results to a very-well-proofed criterion-file...
i'll share all my in-process versions with you later in the week,
but for now i can post some preliminary results. as i expected,
the o.c.r. results were very good. ignoring the front-matter pages,
the body of the book contained ~5700 lines of text.
i did a few standard cleanup routines on the o.c.r. file, such as
closing up spacey punctuation and converting utf8 em-dashes.
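those two cleanup steps are simple enough to sketch in python.
(this is my own minimal version, not the actual routines i ran;
the function name and the exact punctuation set are mine.)

```python
import re

def clean_ocr_line(line):
    # close up "spacey punctuation": drop the stray space that
    # o.c.r. often leaves before a comma, period, and the like
    line = re.sub(r' +([,.;:!?])', r'\1', line)
    # convert utf8 em-dashes (u+2014) to the double-hyphen convention
    line = line.replace('\u2014', '--')
    return line

print(clean_ocr_line('books , and culture \u2014 a pleasant read .'))
# -> books, and culture -- a pleasant read.
```

run each line of the o.c.r. file through a routine like this
before you start counting which lines are right and wrong.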
after that cleaning, over 97% of the ~5700 lines were correct...
specifically, about 5,555 were correct, and about 123 were not.
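the comparison against the criterion-file is just a line-by-line
match-count. a minimal sketch (my own function name, and it assumes
the two files have already been brought into line-for-line sync):

```python
def line_accuracy(ocr_lines, criterion_lines):
    # count o.c.r. lines that exactly match the well-proofed criterion;
    # assumes both lists are the same length and in the same order
    correct = sum(1 for a, b in zip(ocr_lines, criterion_lines) if a == b)
    total = len(criterion_lines)
    return correct, total, 100.0 * correct / total

correct, total, pct = line_accuracy(['a', 'b', 'x'], ['a', 'b', 'c'])
print(correct, total)   # -> 2 3
```

with numbers like 5,555 correct out of 5,678, that works out
to the "over 97%" figure above.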
that, in a nutshell, is the problem with the workflow at pgdp.net.
d.p. has volunteers examine every line in a file, multiple times,
rather than laser-focusing on the 3% which are "problem lines".
as i have shown before, and will show once again, it's rather easy
to find those "problem lines". heck, most show up in spellcheck!
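a spellcheck-style filter for "problem lines" can be sketched
in a few lines of python. (the function name and the toy
dictionary here are mine; a real run would load a full wordlist.)

```python
import re

def find_problem_lines(lines, dictionary):
    # flag any line containing a word not in the dictionary --
    # a crude spellcheck, but it surfaces most o.c.r. errors
    flagged = []
    for num, line in enumerate(lines, start=1):
        words = re.findall(r"[a-z']+", line.lower())
        if any(w.strip("'") not in dictionary for w in words):
            flagged.append((num, line))
    return flagged

dictionary = {'the', 'quick', 'brown', 'fox'}
lines = ['the quick brown fox',
         'tbe qnick brown fox']   # typical o.c.r. scannos
print(find_problem_lines(lines, dictionary))
# -> [(2, 'tbe qnick brown fox')]
```

a proofer then looks only at the flagged lines, instead of
re-reading the 97% that were already right.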
after the cleanup, it's a breeze to turn the file into a nice e-book.
-bowerbird