
There's nothing stopping someone writing an abbyy.gz to plain (non-encumbered) html converter, aside from effort, you know...

(not trying to snark at BB)

Alex

--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc
Media and Hosting Solutions
+1(703)445-3391
+1(480)253-9640
+1(703)919-8090
abuie@kwdservices.com

On Mon, Oct 24, 2011 at 4:19 PM, <Bowerbird@aol.com> wrote:
somebody at archive.org said:
KH: It's a big file because it includes lots of the original geometry information and wraps each word in its own span to support corrections.
just another example of their ludicrous workflow over there...
the important content from the o.c.r. process -- the text -- gets muddled inside a huge amount of other data which they consider to be "necessary" solely because their major focus is on the scans.
and, just in case you didn't realize it, _this_ is why the _search_ capability at archive.org is so darn slow, often to the point that the machine gives up, meaning you get no results at all, because the search routine has to plow through so much irrelevant data.
-bowerbird
p.s. in the rare case when one needs to know the coordinates of a word's placement on the scan, one can easily determine it. the routine for finding a line's baseline is simple, and very fast, and -- turned on its side -- works fine for locating each word. i once posted a sample graphic, but i can't run it down now. but for heaven's sake, if you feel that you _must_ save the data, then at least have the good sense to store it in a separate file...
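For readers who want to see the idea in the p.s. spelled out, here is a minimal sketch, not the routine bowerbird refers to (that one was never posted in this thread). It assumes the scan has already been binarized into rows of 0/1 pixels (1 = ink), that text lines are separated by at least one blank pixel row, and that the baseline is the row just above the sharpest drop in ink. The function names and the `min_word_gap` parameter are made up for illustration.

```python
def ink_profile(image, axis):
    """Count ink pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def find_runs(profile, min_gap=1):
    """Return (start, end) pairs, end exclusive, of stretches that contain ink.

    Stretches separated by fewer than min_gap empty entries are merged.
    """
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def estimate_baseline(image, top, bottom):
    """Assumed heuristic: the baseline is the last row before the largest
    drop in ink within the line band (descenders carry little ink)."""
    rows = ink_profile(image[top:bottom], axis=0) + [0]
    drops = [rows[i] - rows[i + 1] for i in range(len(rows) - 1)]
    return top + drops.index(max(drops))


def locate_words(image, min_word_gap=2):
    """Return (left, top, right, bottom, baseline) for each word.

    Lines come from the row profile; words come from the same routine
    "turned on its side", i.e. the column profile of each line band.
    """
    boxes = []
    for top, bottom in find_runs(ink_profile(image, axis=0)):
        band = image[top:bottom]
        baseline = estimate_baseline(image, top, bottom)
        for left, right in find_runs(ink_profile(band, axis=1), min_word_gap):
            boxes.append((left, top, right, bottom, baseline))
    return boxes
```

Calling `locate_words(image)` on a binarized page returns one tuple per word. `min_word_gap` has to be larger than the gap between letters and no larger than the gap between words, so it would need tuning per scan. Skewed or noisy scans would need deskewing and a noise threshold first, which this sketch leaves out.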
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d