
james-
work is coming along fine on the file, but i'm sorry that i haven't gotten it turned back to you sooner than this...
i got caught up in some other things, plus the clean-up is taking longer than i'd hoped it would, mostly because the typography of the book really wasn't quite up to snuff.
in the meantime, you can always work on the family-trees.
***
to make this post more widely applicable, i can probably reiterate the point i usually describe at the very beginning: take care to select a well-done book which will scan nicely. or, if you're using pre-existing o.c.r., make sure it's good.
otherwise, you might well have to spend a lot more time.
for instance, on this book which james is digitizing now, he faces the triple whammy -- poor p-book typography, bad scans, and internet archive's careless text handling (where the book is missing em-dashes, utf8, and italics).
so even after all the time he and i have spent cleaning it, my guess is that he would still be finished with it sooner if he were to start from scratch and do the job correctly...
-bowerbird
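
One rough way to act on the "make sure it's good" advice is to count the features named above (em-dashes, non-ASCII characters, italics) before committing to a pre-existing OCR text. The Python sketch below is only an illustration, not a tool either poster describes: the function name `ocr_text_report` is hypothetical, and it assumes italics are marked with underscores and that a double hyphen may be standing in for a lost em-dash.

```python
import re


def ocr_text_report(text: str) -> dict:
    """Count a few markers that hint at how carefully OCR text was handled."""
    return {
        # Real em-dash characters (U+2014) that survived text extraction.
        "em_dashes": text.count("\u2014"),
        # Double hyphens, assumed here to be stand-ins for lost em-dashes.
        "double_hyphens": text.count("--"),
        # Any character outside ASCII, e.g. accented letters.
        "non_ascii_chars": sum(1 for ch in text if ord(ch) > 127),
        # Spans wrapped in underscores, the assumed italics convention.
        "italic_spans": len(re.findall(r"_[^_\n]+_", text)),
    }


if __name__ == "__main__":
    sample = "He said -- _quite_ plainly -- that the cafe was closed."
    print(ocr_text_report(sample))
```

A long book whose report shows zero em-dashes, zero non-ASCII characters, and zero italic spans has probably lost them somewhere in text handling, which suggests more clean-up time ahead.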

Bowerbird,
I'm not exactly sure what "start from scratch and do the job correctly" would mean in my case. I do not own and don't intend to buy ABBYY FineReader. I did attempt to make my own page-at-a-time OCR with Tesseract, and the ABBYY OCR from archive.org was far, far better. Obviously there is room for improvement, but it is much better than what I started with. Putting in em-dashes and accents is painful, but realistically what alternative do I have?
My understanding of what you're currently trying to do is get my partially corrected text to match line-by-line and page-by-page with the page images. Once you have that, I'm supposed to go through page by page and change the text to look exactly like the page images. I add underscores for italics, but I do not re-wrap the text, remove page numbers, join paragraphs split between pages, or do anything else I've been doing so far. I just correct spelling and accents.
After I've done all that, I go back and put in the ZML stuff like extra blank lines for page headings. Then the magic happens and we get several formats of the book from one source. I'm not entirely clear on that part.
James Simmons
On Wed, Jan 11, 2012 at 2:59 PM, <Bowerbird@aol.com> wrote:
james-
work is coming along fine on the file, but i'm sorry that i haven't gotten it turned back to you sooner than this...
i got caught up in some other things, plus the clean-up is taking longer than i'd hoped it would, mostly because the typography of the book really wasn't quite up to snuff.
in the meantime, you can always work on the family-trees.
***
to make this post more widely applicable, i can probably reiterate the point i usually describe at the very beginning: take care to select a well-done book which will scan nicely. or, if you're using pre-existing o.c.r., make sure it's good.
otherwise, you might well have to spend a lot more time.
for instance, on this book which james is digitizing now, he faces the triple whammy -- poor p-book typography, bad scans, and internet archive's careless text handling (where the book is missing em-dashes, utf8, and italics).
so even after all the time he and i have spent cleaning it, my guess is that he would still be finished with it sooner if he were to start from scratch and do the job correctly...
-bowerbird
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d