these grapes are sweet -- lesson #02

hello. i hope you're having an enjoyable holiday weekend. :+)

as i said, the thrust of the thread is how to digitize a book. if anyone had suggested a book, i woulda used their choice; but since they didn't, i decided to do an old standby again...

"books and culture", by hamilton mabie, is a pleasant book. it's 280 pages, with about 220k of text, so it's average size. i used a version scanned by the internet archive last november. find it by searching archive.org for "booksculture00mabiuoft".

as many of you will remember, we've looked at this book a lot. in fact, it was one of the examples which i just gave benjamin. this means that i can test my efforts in this current digitization by comparing my results to a very-well-proofed criterion-file...

i'll share all my in-process versions with you later in the week, but for now i can post some preliminary results.

as i expected, the o.c.r. results were very good. ignoring front-matter pages, the text in the body of the book contained ~5700 lines of text.

i did a few standard cleanup routines on the o.c.r. file, such as closing up spacey punctuation and converting utf8 em-dashes.

after that cleaning, of the ~5700 lines, over 97% were correct... specifically, about 5,555 were correct, and about 123 were not.

that, in a nutshell, is the problem with the workflow at pgdp.net. d.p. has volunteers examine every line in a file, multiple times!, rather than laser-focusing on the 3% which are "problem lines".

as i have shown before, and will show once again, it's rather easy to find those "problem lines". heck, most show up in spellcheck!

after the cleanup, it's a breeze to turn the file into a nice e-book.

-bowerbird
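
for readers who want to try the comparison described above, here is a minimal sketch in python. it is not the actual routine used for this book; the function names `clean_line` and `problem_lines` are hypothetical, the choice of "--" as the em-dash replacement is an assumption, and it assumes the o.c.r. file and the criterion-file are already aligned line for line.

```python
import re


def clean_line(line: str) -> str:
    """apply two simple cleanups to one o.c.r. line (illustrative only)."""
    # convert a utf8 em-dash (U+2014) to a double hyphen; the target form is an assumption
    line = line.replace("\u2014", "--")
    # close up "spacey" punctuation: drop spaces that sit before , ; : ! ?
    line = re.sub(r" +([,;:!?])", r"\1", line)
    return line.rstrip()


def problem_lines(ocr_lines, criterion_lines):
    """return (line_number, cleaned_ocr, criterion) for each line that still differs.

    assumes both lists are aligned, one entry per printed line.
    """
    problems = []
    for number, (ocr, good) in enumerate(zip(ocr_lines, criterion_lines), start=1):
        cleaned = clean_line(ocr)
        if cleaned != good:
            problems.append((number, cleaned, good))
    return problems


# made-up sample lines, not taken from the book
sample_ocr = [
    "It is a book ; a good one !",
    "He paused\u2014then spoke.",
    "Tlie end.",
]
sample_criterion = [
    "It is a book; a good one!",
    "He paused--then spoke.",
    "The end.",
]
sample_problems = problem_lines(sample_ocr, sample_criterion)
```

with the made-up sample above, only the third line is flagged, since the first two match the criterion once cleaned. on a real book the same idea narrows proofing to the small share of lines that still differ after cleanup.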
participants (1)
-
Bowerbird@aol.com