somebody at archive.org said:
> KH: It's a big file because it includes
> lots of the original geometry information and
> wraps each word in its own span to support corrections.
just another example of their ludicrous workflow over there...
the important content from the o.c.r. process -- the text -- gets
muddled inside a huge amount of other data which they consider
to be "necessary" solely because their major focus is on the scans.
and, just in case you didn't realize it, _this_ is why the _search_
capability at archive.org is so darn slow, often to the point that
the machine gives up, meaning you get no results at all, because
the search routine has to plow through so much irrelevant data.
-bowerbird
p.s. in the rare case when one needs to know the coordinates
of a word's placement on the scan, one can easily determine them.
the routine for finding a line's baseline is simple, and very fast,
and -- turned on its side -- works fine for locating each word.
i once posted a sample graphic, but i can't run it down now.
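
for the curious, here is a minimal sketch of the kind of routine described above. it is not the original code, just one common way to do it (a projection profile), and it assumes the scan has already been reduced to a binary grid where 1 is ink and 0 is paper. the names row_profile, col_profile, runs, baseline and word_boxes are made up for illustration, and the "half the peak" baseline rule and the min_gap threshold are assumptions you would tune per scan.

```python
def row_profile(image):
    """Count ink pixels in each row (image is a list of rows of 0/1 ints)."""
    return [sum(row) for row in image]


def col_profile(image):
    """The same count 'turned on its side': ink pixels in each column."""
    return [sum(col) for col in zip(*image)]


def runs(profile, min_gap=1):
    """Return (start, end) spans of nonzero entries, end exclusive.

    Blank stretches shorter than min_gap are treated as part of the span,
    so gaps between letters do not split a word.
    """
    spans, start, gap = [], None, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans


def baseline(image, top, bottom):
    """Estimate the baseline row of the text line spanning rows top..bottom.

    Assumed heuristic: the lowest row holding at least half the peak ink
    count, since descenders contribute only a little ink below the baseline.
    """
    profile = row_profile(image[top:bottom])
    peak = max(profile)
    for r in range(len(profile) - 1, -1, -1):
        if profile[r] * 2 >= peak:
            return top + r


def word_boxes(image, min_gap=2):
    """Return (left, top, right, bottom) boxes, one per word, in reading order."""
    boxes = []
    for top, bottom in runs(row_profile(image)):
        band = image[top:bottom]
        for left, right in runs(col_profile(band), min_gap):
            boxes.append((left, top, right, bottom))
    return boxes
```

each pass is a single sweep over the pixels, which is why this sort of thing runs fast. a real scan would need deskewing and some noise filtering first, which this sketch skips.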
but for heaven's sake, if you feel that you _must_ save the data,
then at least have the good sense to store it in a separate file...
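
a rough sketch of what "a separate file" could look like, under stated assumptions: the o.c.r. output is taken to be a list of lines, each a list of (word, box) pairs, and the geometry goes into a json sidecar keyed by line and word position. this is a hypothetical layout for illustration, not archive.org's actual format, and split_ocr is a made-up name.

```python
import json


def split_ocr(lines):
    """Split o.c.r. output into plain text plus a separate geometry record.

    lines: list of lines, each a list of (word, (left, top, right, bottom)).
    Returns (text, geometry_json); the text carries no coordinates at all.
    """
    text_lines = []
    geometry = []
    for line_no, line in enumerate(lines):
        text_lines.append(" ".join(word for word, _box in line))
        for word_no, (_word, box) in enumerate(line):
            # line and word indexes are enough to tie a box back to the text
            geometry.append({"line": line_no, "word": word_no, "box": list(box)})
    return "\n".join(text_lines), json.dumps(geometry)
```

the text half is what a search routine would read, and the sidecar only gets opened when somebody actually needs to highlight a word on the scan.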