
James:

re em-dashes - you have to consider how much your time is worth, i.e. is it more cost/time-effective to buy and use FineReader, or to use something that may be free but does a less than satisfactory job, requiring more of your time to compensate for its shortcomings? If your time is worth, say, $50/hour, how many hours of replacing em-dashes would it take to justify FineReader's cost? Probably not many.

re submitting scansets to DP/DPC - I've submitted dozens of scansets to DPC, and personally, I couldn't care less how long it takes them to produce ebooks from those scansets. I look on DPC as a fire-and-forget thing--I fire the scanset at them, and forget it. It's not as if there aren't millions (literally) of other books to deal with.

Al

-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of James Simmons
Sent: Wednesday, December 21, 2011 3:06 PM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] book of james -- 002

Bowerbird,

I have noticed the lack of em-dashes. They are painful to put back in. If I had UTF-8 output that would be even better. I understand the DP guys swear by ABBYY FineReader, and I can see why they would, but I'm an open source kind of guy and Tesseract worked really well on most of my earlier books.

I would have submitted page images to DP for this book if not for the fact that I want the book to be finished while I'm still alive. I submitted stuff to both DP and DP Canada over a year ago and it's still working its way through the queues. I've got three Raymond Chandler novels and two Robert C. Benchley humor collections at DP Canada now.

I want to make it clear that I'm going to finish this book as I started it, by fixing up the archive.org text file. I only offered it up as an example of a really hard book to do. If you can make your approach work on this it will work on anything.
I will help with this in any way I can short of starting the whole process over again. If I can do half the new way and half the old I'll do it.

James Simmons

On Wed, Dec 21, 2011 at 1:37 PM, <Bowerbird@aol.com> wrote:

james said:
> I started doing the page-at-a-time thing and gave up. Your pages are already better than mine because I used Tesseract and archive.org uses ABBYY FineReader.

except the o.c.r. from archive.org, in this case, is screwed up. it's missing its em-dashes. i've dealt with this problem before, and it's less work to re-do the o.c.r. than to fix the em-dashes. will someone with a good version of abbyy please re-do this o.c.r.?

> 2). This book really requires a way to enter UTF-8 characters.

if someone does the o.c.r. for you, they can specify utf8 output...

> If I could just stick a circumflex above a's, u's, and i's (both lower and upper case) that would be 99% of what I need.

if you can pull out a list of the words that require circumflexes, we can create a script that does a global change in one swoop.

> (after de-hyphenating

do not dehyphenate! the program will do that for you.

> re-wrapping

do not rewrap! if you need to rewrap, the program can do it. rewrapping is evil. it just makes it harder for the next guy...

-bowerbird

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d
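
For illustration, the "global change in one swoop" from a word list, as suggested above, might look something like the sketch below. This is not bowerbird's program. The word list, the function names, and the file handling are all assumptions made for the example, and the sample words are placeholders rather than words taken from the book. The sketch replaces whole words only, is case-sensitive (so upper- and lower-case forms each get their own entry), and leaves line breaks alone, so nothing is rewrapped.

```python
import re

# Hypothetical word list: plain OCR spelling -> spelling with circumflexes.
# These entries are placeholders, not words taken from the book.
REPLACEMENTS = {
    "chateau": "château",
    "Chateau": "Château",
    "diner": "dîner",
    "flute": "flûte",
}


def build_pattern(words):
    """Compile one regex that matches any listed word as a whole word."""
    # Longest first, so a longer entry wins if two entries share a prefix.
    alternatives = sorted(words, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")


def add_circumflexes(text, replacements=REPLACEMENTS):
    """Replace every listed word in a single pass; other text is untouched."""
    pattern = build_pattern(replacements)
    return pattern.sub(lambda match: replacements[match.group(0)], text)


def convert_file(src_path, dst_path, replacements=REPLACEMENTS):
    """Read a UTF-8 text file, apply the word list, write a UTF-8 result."""
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(add_circumflexes(text, replacements))
```

One limitation of this sketch: a listed word that is split by an end-of-line hyphen in the OCR text will not match, so those cases would still have to be handled by whatever tool does the dehyphenation.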