Re: [gutvol-d] Fwd: Re: epubeditor.sourceforge.net

24 Oct 2011


There's nothing stopping someone from writing a converter from abbyy.gz
to plain (unencumbered) HTML, aside from the effort, you know...

(not trying to snark at BB)
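
For a sense of the effort involved, here is a minimal sketch of such a converter in Python. It assumes the decompressed file is ABBYY FineReader-style XML in which each character sits in a `charParams` element nested inside `line` and `par` elements; that layout is an assumption here and should be checked against a real file. The function name `abbyy_to_html` is made up for illustration. All geometry attributes are dropped and only the text survives.

```python
import gzip
import html
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def abbyy_to_html(stream):
    """Turn a decompressed ABBYY-style XML stream into bare <p> paragraphs.

    Assumed layout (not verified against a real file): characters live in
    <charParams> elements, grouped into <line> and then <par> elements.
    """
    paragraphs = []
    line_chars = []   # characters of the line being read
    par_lines = []    # finished lines of the paragraph being read
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "charParams":
            line_chars.append(elem.text or " ")
        elif name == "line":
            par_lines.append("".join(line_chars).strip())
            line_chars = []
        elif name == "par":
            text = " ".join(l for l in par_lines if l)
            if text:
                paragraphs.append("<p>" + html.escape(text) + "</p>")
            par_lines = []
            elem.clear()  # free memory; these files are large
    return "\n".join(paragraphs)
```

It could be called as `abbyy_to_html(gzip.open("book_abbyy.gz", "rb"))`, where the file name is only a placeholder. The sketch streams the XML instead of loading it whole, and it does not rejoin words hyphenated across line ends, which a real converter would need to handle.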

Alex
--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc
Media and Hosting Solutions
+1(703)445-3391
+1(480)253-9640
+1(703)919-8090
abuie@kwdservices.com


On Mon, Oct 24, 2011 at 4:19 PM,  <Bowerbird@aol.com> wrote:
...
somebody at archive.org said:
...
   KH: It's a big file because it includes
   lots of the original geometry information and
   wraps each word in its own span to support corrections.
just another example of their ludicrous workflow over there...
the important content from the o.c.r. process -- the text -- gets
muddled inside a huge amount of other data which they consider
to be "necessary" solely because their major focus is on the scans.
and, just in case you didn't realize it, _this_ is why the _search_
capability at archive.org is so darn slow, often to the point that
the machine gives up, meaning you get no results at all, because
the search routine has to plow through so much irrelevant data.
-bowerbird
p.s.  in the rare case when one needs to know the coordinates
of a word's placement on the scan, one can easily determine it.
the routine for finding a line's baseline is simple, and very fast,
and -- turned on its side -- works fine for locating each word.
i once posted a sample graphic, but i can't run it down now.
but for heaven's sake, if you feel that you _must_ save the data,
then at least have the good sense to store it in a separate file...
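
bowerbird's routine is not shown in this thread, so the following is only a guess at the kind of approach described: a projection-profile sketch in Python. Summing the ink in each pixel row separates text lines, and the lower edge of each band serves as a crude baseline (descenders are ignored). The same sum taken column by column, "turned on its side", separates words. The image is assumed to be a binarized page given as a list of rows of 0/1 values, and the names `runs`, `find_lines` and `find_words` are hypothetical.

```python
def runs(profile, min_gap=1):
    """Return (start, end) spans of non-zero values, end exclusive.

    Stretches of fewer than min_gap zeros inside a span are bridged.
    """
    spans = []
    start = None
    gap = 0
    for i, value in enumerate(profile):
        if value:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans


def find_lines(img):
    """Row spans (top, bottom) of text lines; bottom approximates the baseline."""
    row_ink = [sum(row) for row in img]
    return runs(row_ink)


def find_words(img, top, bottom, min_gap=2):
    """Column spans (left, right) of words within rows top..bottom-1."""
    width = len(img[0])
    col_ink = [sum(img[y][x] for y in range(top, bottom)) for x in range(width)]
    return runs(col_ink, min_gap)
```

On a real scan the `min_gap` threshold would have to be tuned to the typeface so that the spaces between letters are bridged while the spaces between words are not, and skewed pages would need deskewing first.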