alex said:
>   The script uses only the gzipped
>   abbyy output to generate the html
>   (ie, what's available here

alex, i'll let you sort out lee's questions with him.

good luck with that...

but...

from the standpoint of what can be "generated"
from the abbyy files, you _could_ make an .rtf...

and when i say that "you" could do it, what i mean
is that you could simply instruct abbyy to output it.
no coding, no nothing.  just tell abbyy to output .rtf.
and that .rtf file would be enough to _build_ from...

i have detailed the problems with your .txt files:
1.  pagebreaks have been, for the most part, lost.
2.  styling is lost (like italics, bold, and fontsize).
3.  in some of your books, em-dashes were lost.

it's worth noting that _all_ of these problems
were introduced by the archive.org workflow.

_none_ of 'em are endemic to abbyy, or its output.

indeed, the abbyy file you've just referenced above
-- like all abbyy output -- contains _all_ this stuff.

i can write a program (already wrote it, long ago)
which pulls all of the relevant data out of that file.

the problem is that "the relevant data" is buried in
~60megs of x.m.l. crud.  it only takes a few minutes
to pull out "the gold" from this one file, but when we
multiply those "few minutes" by two _million_ books
-- call it three minutes a book, and that's six million
minutes, over eleven _years_ of nonstop processing --
the task becomes undoable.  that's the problem, alex.

the file that contains the data also contains _crud_.
and when you output the data, you eliminated the
crud, yes, but you also eliminated important data.
that is, you threw the baby out with the bath water.

also, generating .html from the abbyy output is _not_
a good solution.  because that .html is _not_ going to
be "semantically accurate".  for the obvious example,
the headers are _not_ gonna be _tagged_ as headers.
they're just going to be marked as 24-point bold text.
so you're _still_ gonna have to do some analysis of it.
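to make that concrete, here's the kind of guessing that "analysis" involves, as a minimal python sketch -- the run shape (text, fontsize, bold) is whatever your extractor produced, and the size thresholds are my assumptions, not abbyy's:

```python
# a sketch of the analysis you'd still need: map styling runs
# to real header tags.  thresholds are made-up, not gospel.
def tag_run(text, fontsize, bold, body_size=11.0):
    """wrap one run in html, guessing headers from size and weight."""
    # note: no html-escaping here; a real pass would need it
    if bold and fontsize >= 2.0 * body_size:
        return f"<h1>{text}</h1>"      # 24-point bold lands here
    if bold and fontsize >= 1.5 * body_size:
        return f"<h2>{text}</h2>"
    return f"<p>{text}</p>"
```

the point being: the semantics aren't in the abbyy file, so _somebody_ has to guess them.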

don't get sucked into a wild goose chase...

-bowerbird