James:
re em-dashes - you have to consider how much your time is worth, i.e.
is it more cost/time-effective to buy and use FineReader, or to use
something that may be free but does a less than satisfactory job,
requiring more of your time to compensate for its shortcomings? If your
time is worth, say, $50/hour, how many hours of replacing em-dashes
would it take to justify FineReader's cost? Probably not many.
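The break-even point in that question is just the software's price divided by the hourly rate. A minimal sketch of the arithmetic follows; the $50/hour rate is the figure used above, while the $200 price is a made-up placeholder and not FineReader's actual price.

```python
def break_even_hours(software_cost: float, hourly_rate: float) -> float:
    """Return the hours of manual cleanup whose value equals the software's price."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return software_cost / hourly_rate


# $50/hour comes from the message above; $200 is a hypothetical price
# chosen only to illustrate the calculation.
print(break_even_hours(200.0, 50.0))  # prints 4.0
```

With those assumed numbers, any cleanup job longer than four hours would cost more in time than the software does.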
re submitting scansets to DP/DPC - I've submitted dozens of scansets to
DPC, and personally, I couldn't care less how long it takes them to
produce ebooks from those scansets. I look on DPC as a fire-and-forget
thing--I fire the scanset at them, and forget it. It's not as if there
aren't millions (literally) of other books to deal with.
Al
Bowerbird,
I have noticed the lack of em-dashes. They are painful to put back in.
If I had UTF-8 output that would be even better. I understand the DP
guys swear by ABBYY FineReader, and I can see why they would, but I'm an
open source kind of guy and Tesseract worked really well on most of my
earlier books. I would have submitted page images to DP for this book if
not for the fact that I want the book to be finished while I'm still
alive. I submitted stuff to both DP and DP Canada over a year ago and
it's still working its way through the queues. I've got three Raymond
Chandler novels and two Robert C. Benchley humor collections at DP
Canada now.
I want to make it clear that I'm going to finish this book as I started
it, by fixing up the archive.org text file. I only offered it up as an
example of a really hard book to do. If you can make your approach work
on this it will work on anything. I will help with this in any way I
can short of starting the whole process over again. If I can do half the
new way and half the old I'll do it.
James Simmons
On Wed, Dec 21, 2011 at 1:37 PM, <Bowerbird@aol.com> wrote:
james said:
> I started doing the page-at-a-time thing and gave up.
> Your pages are already better than mine because
> I used Tesseract and archive.org uses ABBYY FineReader.
except the o.c.r. from archive.org, in this case, is screwed up.
it's missing its em-dashes. i've dealt with this problem before,
and it's less work to re-do the o.c.r. than to fix the em-dashes.
will someone with a good version of abbyy please re-do this o.c.r.?
> 2). This book really requires a way to enter UTF-8 characters.
if someone does the o.c.r. for you, they can specify utf8 output...
> If I could just stick a circumflex above a's, u's, and i's
> (both lower and upper case) that would be 99% of what I need.
if you can pull out a list of the words that require circumflexes,
we can create a script that does a global change in one swoop.
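A minimal sketch of such a global-change script, assuming the word list is kept as a mapping from the plain OCR spelling to the corrected spelling. The function name `build_replacer` and the sample words are hypothetical illustrations, not words taken from the book. Matching is case-sensitive and limited to whole words, so upper-case forms need their own entries.

```python
import re


def build_replacer(word_map):
    """Return a function that applies every word_map substitution in one pass.

    word_map maps the plain OCR spelling to the corrected spelling,
    e.g. {"Rama": "Râma"} (hypothetical sample entry).
    """
    if not word_map:
        # Nothing to replace: hand back the text unchanged.
        return lambda text: text

    # Longest words first so a longer entry is tried before a shorter prefix.
    words = sorted(word_map, key=len, reverse=True)
    # \b restricts matches to whole words, so "Rama" does not hit "Ramadan".
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")

    def replace(text):
        return pattern.sub(lambda m: word_map[m.group(1)], text)

    return replace


# Hypothetical word list; the real one would be pulled from the book.
fix = build_replacer({"Rama": "Râma", "RAMA": "RÂMA"})
print(fix("Rama and RAMA, not Ramadan"))
```

The text file would need to be read and written as UTF-8 for the accented letters to survive, e.g. with `open(path, encoding="utf-8")`.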
> (after de-hyphenating
do not dehyphenate! the program will do that for you.
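How the program mentioned above handles this is not described in the thread. As a rough illustration only, one naive way a script could rejoin words split across line ends is sketched below; the function name `dehyphenate` is hypothetical, and the approach cannot tell a line-break hyphen from a real one (a compound such as "fire-and-forget" broken at a hyphen would lose it), which a dictionary check would be needed to handle.

```python
import re


def dehyphenate(text):
    """Naively rejoin words that were hyphenated across a line break.

    The second half of the word (plus any punctuation attached to it) is
    pulled up onto the first line, and the line break is placed after it.
    Assumption: every hyphen at a line end is a line-break hyphen, which
    is not always true for genuine hyphenated compounds.
    """
    return re.sub(r"(\w)-\n(\S+)[ \t]*\n?", r"\1\2\n", text)


sample = "the de-\nhyphenating, and more\n"
print(dehyphenate(sample))
```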
> re-wrapping
do not rewrap! if you need to rewrap, the program can do it.
rewrapping is evil. it just makes it harder for the next guy...
-bowerbird
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d