james said:
>   I'm not exactly sure what
>   "start from scratch and do the job correctly"
>   would mean in my case.

well, if it's any consolation, i didn't mean to imply
that anything done "incorrectly" was _your_ fault...

mostly this is a botched job by the internet archive.

they did a lousy job of scanning the book, and then
their stupid workflow went and sabotaged the o.c.r.

so i consider you to be an innocent victim of them...

the one thing you did do wrong was the rewrapping.

but again, your guide in that was a misguided policy
from project gutenberg, which _might_ have made
some kind of sense back when it was established, but
has outlived all of its usefulness by now, to the point
it's a bad practice which can only be preached against.

so, you know, once again, you are the victim here...

having said all of that, though, my advice to you has
not changed since i first gave it to you a while back,
which is that you would be better off starting over...

you could spend a little time cleaning up the scans,
and then ensure that the o.c.r. gets done correctly,
with the options specified intelligently, and output
saved properly, and _still_ finish the thing _faster_
than continuing to work with the file as it now exists.
(and that's _after_ we've both put in a bunch of work!)

i'm just about done doing my part, so i no longer have
a dog in this fight, so my advice is for _your_ benefit...
nobody else here will care.  they aren't being affected.

but that's my advice.  for you.

and i feel some kind of urgency and obligation to tell
everyone down the line to avoid a situation like this...

because it simply doesn't need to be _this_ difficult...

you're willing to take on the burden, and that's fine, but
even then, it's my opinion that your time would be better
spent doing four books the easy way than one the hard way.


>   I do not own and don't intend to buy ABBYY Fine Reader.

i guess mr. haines didn't convince you with his argument
that -- at a value of $50 an hour for your time -- it would
_pay_ for itself very quickly...                     :+)

truth is, even at $1/hour, it will pay for itself soon enough,
if you think you're gonna keep doing more digitizations...

you can also pick up a "used" copy at a really low price...

and it's not just the money; it's your _sanity_ you're saving...

but that's beside the point, too.

there are enough people here that _do_ own finereader that
you should be able to find one of them to do you the favor
of doing the o.c.r. for you.  you could do them a favor back...

heck, if you woulda asked me, in the first place, i would have
done it with my mac copy of finereader (which is way behind,
yet would still probably give you better results than you have).
but you wanted to pursue the file you had, so i went that way.
if i had it to do over, however, i would reverse that decision...


>   I did attempt to make my own page-at-a-time OCR
>   with Tesseract

yeah, tesseract is an even bigger waste of time and energy.


>   Obviously room for improvement, but
>   much better than what I started with.

there might be different classes of trash, but it's all garbage.

don't start comparing one pile of garbage to another one, or
you're just gonna convince yourself to end up with garbage...


>   Putting in em-dashes and accents is painful,
>   but realistically what alternative do I have?

re-doing the o.c.r., so those things are _retained_, instead of
being thrown away.  that is exactly what i am saying, james...

you _did_ have an alternative.  but you chose not to use it...
because you didn't want to throw away the time you'd spent.
so, instead, you will end up throwing away even more time,
working on a decidedly-inferior file.  i told you that goin' in,
and now that i've spent a bunch of time with the file myself,
i can confirm that my original advice was extremely sound...
and i shoulda stuck with it, until i eventually convinced you.
i took up the challenge because i knew i could learn from it,
and i did, so it wasn't a waste of _my_ time.  but i can tell you
that you'll be wasting your time, doing unnecessary hard work.

and let me reiterate:  you _still_ have
that better alternative!

you can still find someone to do the o.c.r. for you, and you
can clean up the scans first, so they give you better results.
and then, because you will not do the stupid stuff that was
done by internet archive, you won't have to waste your time
undoing the damage that they did.  you still have a choice...

and it's not just you, james.  there are a whole lot of people
who're beating their heads against the wall to digitize books,
because they think they have "no alternative" to head-beating.


>   My understanding of what you're currently trying to do
>   is get my partially corrected text to match line-by-line
>   and page-by-page with the page images.

i'm not "trying" to do it.  i'm actually doing it.  and i'm also
doing _some_ cleaning of the text, and a lot of formatting.

but you're going to have to do a whole lot _more_ cleaning...
much of the work will be painful, and some of it unnecessary.

oh, and just for the record, i have spent a good many hours
getting those pagebreaks and linebreaks back like they were.

it might not have taken you more than a minute to rewrap it,
but it's taking me hours and hours and hours to undo that...

it's not a job that i _like_ doing...  in fact, i totally detest it...
and the reason i hate it so much is that it is _unnecessary_...
you should have never rewrapped that text before proofing.

but let me tell you why i did that job.  i did it for you, james.
i did it because i knew that every hour i spent "fixing" that
would save you 3-to-7 hours of proofing time, and _sanity_.

so i _spent_ my own time, james, so you could save _yours_.

(i wouldn't have done it if it would've been a 1-to-1 payoff;
but with a 3-to-1 payoff, maybe even 7-1, i decided to do it.)

so i think i've "earned" the right to give you some good advice.
(as if i really had to earn something like that in the first place.)

the important thing is that you learn that you should _never_
do a book this way ever again.  it's simply not worth the work.


>   Once you have that I'm supposed to go through
>   page by page and change the text to look
>   exactly like the page images.

i will have done most of that by the time i hand it over to you.

once the original linebreaks are restored, it's a piece of cake.
that's why i was able to use the text from the internet archive
to show well-formatted model-pages right from the outset...


>   I add underscores for italics,

right, and that's something you shouldn't have had to do,
since abbyy does a good job of catching most of the italics.

but then the internet archive went and threw them away,
along with the em-dashes, which you must also restore...

in the future, i'll make available an app that will go and grab
that information -- it's sitting there, in one of the data-files
at archive.org -- so people don't have to suffer because of
a bad decision made outside of their control, and i could've
helped _you_ to avoid that unnecessary work as well, but...

one thing you will not have to do is to worry about diacritics.
at least for the names you've done thus far, i can capture 'em,
and make a global search-and-replace to put in the diacritics.
but that's another thing you shouldn't have had to worry about,
because i'm pretty sure abbyy can recognize those characters...
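
to give a rough idea of what a global search-and-replace for
diacritics can look like, here is a minimal python sketch.
it is not the actual routine; the names in the table and the
function name are made up for illustration, and the real list
would have to come from the book itself.

```python
import re

# hypothetical table: plain spelling -> spelling with diacritics.
# the real entries would be collected from the book being proofed.
NAME_FIXES = {
    "Francois": "François",
    "Rene": "René",
}

def restore_diacritics(text, fixes=NAME_FIXES):
    """Replace each whole-word plain name with its accented form."""
    # longest names first, so a short name never shadows a longer one
    names = sorted(fixes, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: fixes[m.group(1)], text)
```

the word-boundary test is there so that a name like "Rene"
does not get changed when it is only part of a longer word.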


>   but I do not re-wrap the text, remove page numbers,
>   join paragraphs split between pages, or anything else
>   I've been doing so far.

that's right.  don't do any of that.  the machine can do that.
don't waste any of your time doing what the machine can do.
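
as an illustration of the kind of thing the machine can do,
here is a minimal python sketch of rewrapping done _after_
proofing, by joining each paragraph's lines back together.
it assumes that paragraphs are separated by blank lines and
that a single hyphen at the end of a line marks a word split
across lines; both are assumptions, not a description of
any actual tool, and the function name is invented.

```python
import re

def unwrap(page_text):
    """Join the lines of each blank-line-separated paragraph into one line."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", page_text.strip()):
        joined = ""
        for line in block.splitlines():
            line = line.strip()
            if not line:
                continue
            if joined.endswith("-") and not joined.endswith("--"):
                # assumption: a lone trailing hyphen is an end-of-line
                # word break, so drop it and glue the halves together
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        if joined:
            paragraphs.append(joined)
    return paragraphs
```

a real routine would need a check for words that keep their
hyphen (compounds that just happen to break at the hyphen),
which this sketch does not attempt.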


>   I just correct spelling and accents.

no accents either.  but it's more than just "spelling".
indeed, the "spelling" part is the easy part, because
the spellcheck routine flags most of the bad words.

but where the bulk of the work exists for this text is
in the correction of the punctuation.  the scans were
bad, so that means the o.c.r. of punctuation is bad...

but the problem is exacerbated because the book was
typeset very poorly, with lots of punctuation glitches...

and the exacerbation continues, because many of the
punctuation glitches _can't_ be found with algorithms.

and, just as another example of a task that took hours,
the doublequote checking was a bear, because the o.c.r.
was bad, but -- even worse -- because the typographer
made lots and lots and lots of doublequote mistakes...
this includes a failure to mark continuing dialog with
an open-quote at the start of subsequent paragraphs.
so i installed most of them, to the best of my abilities,
but it's very difficult sometimes to make that judgment
when you're not actually absorbing the book's content.

and this means i also introduced lots of doublequotes
that were _not_ in the p-book itself, and you might be
uncomfortable with that, and want to go delete them...
(which is fine, by the way; i have no attachment to 'em;
if you decide that's what you want to do, i can probably
write a routine that could locate them rapidly for you.)
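
for what it's worth, a first-pass doublequote check can be
sketched in a few lines of python.  this is only a guess at
the general shape of such a routine, assuming straight
doublequotes and blank-line-separated paragraphs; the
function name is invented.

```python
def flag_odd_quotes(text):
    """Return (paragraph number, opening snippet) for each paragraph
    holding an odd number of straight doublequote characters."""
    flagged = []
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for number, para in enumerate(paragraphs, start=1):
        if para.count('"') % 2 == 1:
            # keep a short snippet so the paragraph is easy to find
            flagged.append((number, para.strip()[:40]))
    return flagged
```

an odd count is not always a mistake, since a paragraph of
continuing dialog correctly opens with a quote and does not
close it, so each flagged paragraph still needs a human call.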

but most of your work will revolve around punctuation.
and whether you'll want to standardize inconsistencies
in the book's typography, some of which are... weird...

and this is picayune stuff.  it's... _irritating_.  not fun.

now, this punctuation work would have been necessary
even if you would have had reasonably good o.c.r. text.

but after having done all the unnecessary work required,
i'm not sure that i would have the stomach to be able to
face the stresses of handling those types of judgments...

so i suggest that you gird yourself...


>   After I've done all that I go back and put in
>   the ZML stuff like extra blank lines for page headings.

again, most of that will have already been done.


>   Then the magic happens and we get several formats
>   of the book from one source.  I'm not entirely clear
>   on that part.

you don't need to be clear.  you just need to click a button.

-bowerbird