Re: [gutvol-d] Fwd: Re: epubeditor.sourceforge.net

somebody at archive.org said:
KH: It's a big file because it includes lots of the original geometry information and wraps each word in its own span to support corrections.
just another example of their ludicrous workflow over there...

the important content from the o.c.r. process -- the text -- gets muddled inside a huge amount of other data which they consider to be "necessary" solely because their major focus is on the scans.

and, just in case you didn't realize it, _this_ is why the _search_ capability at archive.org is so darn slow, often to the point that the machine gives up, meaning you get no results at all, because the search routine has to plow through so much irrelevant data.

-bowerbird

p.s. in the rare case when one needs to know the coordinates of a word's placement on the scan, one can easily determine it. the routine for finding a line's baseline is simple, and very fast, and -- turned on its side -- works fine for locating each word. i once posted a sample graphic, but i can't run it down now. but for heaven's sake, if you feel that you _must_ save the data, then at least have the good sense to store it in a separate file...
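[bowerbird never posts the routine itself; the standard technique he appears to be describing is a projection profile: count ink per row to find line spans (the bottom of each span approximates the baseline), then run the same count per column inside a line to find word gaps. The sketch below is a reconstruction under that assumption, not his code; the function names and thresholds are invented for illustration.]

```python
import numpy as np

def find_lines(binary, min_ink=1):
    """Return (top, bottom) row spans of text lines on a binarized page.

    `binary` is a 2-D array where nonzero pixels are ink. A row belongs
    to a line when its ink count reaches `min_ink`; the bottom row of
    each span approximates that line's baseline.
    """
    ink_per_row = (binary != 0).sum(axis=1)
    rows = ink_per_row >= min_ink
    spans, start = [], None
    for y, has_ink in enumerate(rows):
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            spans.append((start, y - 1))
            start = None
    if start is not None:
        spans.append((start, len(rows) - 1))
    return spans

def find_words(binary, line, min_gap=3):
    """The same profile 'turned on its side': split one line span into
    word boxes. A run of `min_gap` or more empty columns separates two
    words. Returns (left, right, top, bottom) boxes.
    """
    top, bottom = line
    strip = binary[top:bottom + 1]
    ink_per_col = (strip != 0).sum(axis=0)
    boxes, start, gap = [], None, 0
    for x, ink in enumerate(ink_per_col):
        if ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                boxes.append((start, x - gap, top, bottom))
                start, gap = None, 0
    if start is not None:
        boxes.append((start, len(ink_per_col) - 1, top, bottom))
    return boxes
```

Both passes are a single sum over the image plus one linear scan, which is consistent with the claim that the coordinates can be recomputed cheaply on demand rather than stored alongside the text.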

There's nothing stopping someone from writing an abbyy.gz to plain (non-encumbered) html converter, aside from effort, you know... (not trying to snark at BB)

Alex

--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc, Media and Hosting Solutions
+1(703)445-3391 / +1(480)253-9640 / +1(703)919-8090
abuie@kwdservices.com

On Mon, Oct 24, 2011 at 4:19 PM, <Bowerbird@aol.com> wrote:
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d
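[The converter Alex describes is indeed small. Here is a minimal sketch, assuming the gzipped ABBYY FineReader XML that archive.org ships, whose text sits in page / block / par / line / charParams elements; the XML namespace differs between FineReader versions, so this matches on local element names only and simply discards every geometry attribute.]

```python
import gzip
import xml.etree.ElementTree as ET
from html import escape

def abbyy_to_html(path):
    """Pull just the text out of a gzipped ABBYY FineReader XML file
    and emit minimal HTML, dropping all per-character geometry."""
    def local(tag):
        # strip the '{namespace}' prefix ElementTree puts on tags
        return tag.rsplit('}', 1)[-1]

    with gzip.open(path, 'rb') as fh:
        tree = ET.parse(fh)

    paragraphs = []
    for par in tree.iter():
        if local(par.tag) != 'par':
            continue
        lines = []
        for line in par:
            if local(line.tag) != 'line':
                continue
            # charParams hold one character each; geometry attributes
            # (l/t/r/b) are simply ignored here
            chars = [cp.text or '' for cp in line.iter()
                     if local(cp.tag) == 'charParams']
            lines.append(''.join(chars))
        if any(lines):
            paragraphs.append(' '.join(lines))

    body = '\n'.join('<p>%s</p>' % escape(p) for p in paragraphs)
    return '<html><body>\n%s\n</body></html>' % body
```

A real converter would also want to rejoin words hyphenated across lines and keep page-break markers, but even this much demonstrates that the text can be separated from the geometry after the fact.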

On 10/24/2011 2:19 PM, Bowerbird@aol.com wrote:
but for heaven's sake, if you feel that you _must_ save the data, then at least have the good sense to store it in a separate file...
Ahh, but what he provided me /was/ the separate file. /Your/ file is http://ia700600.us.archive.org/16/items/artofbook00holm/artofbook00holm_djvu.... There it is, the text, the whole text, and nothing but the text.

I'm just being a little more demanding. What /I/ want is the output from FineReader as though the "Save as HTML" option was selected, with all the markup that FineReader was able to intuit, together with information about line breaks, page breaks and soft hyphens, but without any of the geometry data.

Now I believe that no file is really good enough to be published without some human attention and refinement. So my next "demand" would be for this FineReader HTML output to be placed in an environment where it /could/ be refined; at this point the autogenerated formats, such as ePub and Kindle, should be generated from the refined file, not the raw file.

Like you, I have no expectation that IA will create any kind of environment where human "beans" can refine the texts. But by giving us the output of the PHP script, it should now be possible to off-load the refinement and publication of digital texts to a third-party organization. I think it should be fairly easy to set up a rudimentary system to do this. Does anyone want to furnish me a *nix server with a fat pipe?
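[The per-word spans KH mentioned at the top of the thread are exactly what such a refinement environment would unwrap first. A small sketch of that clean-up step, using Python's html.parser: it drops the <span> wrappers while keeping their text and keeps structural tags minus inline geometry attributes. The set of attribute names treated as geometry here is an assumption, not IA's actual attribute list.]

```python
from html.parser import HTMLParser

class SpanStripper(HTMLParser):
    """Unwrap per-word <span> tags and drop inline geometry attributes,
    keeping the structural markup (headings, paragraphs, emphasis) a
    human refiner actually wants to edit."""
    GEOMETRY_ATTRS = {'style', 'title', 'data-coords'}  # assumed names

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            return  # per-word wrapper: keep its text, drop the tag
        kept = ['%s="%s"' % (k, v or '') for k, v in attrs
                if k not in self.GEOMETRY_ATTRS]
        self.out.append('<%s%s>' % (tag, ' ' + ' '.join(kept) if kept else ''))

    def handle_endtag(self, tag):
        if tag != 'span':
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

def strip_spans(html):
    p = SpanStripper()
    p.feed(html)
    p.close()
    return ''.join(p.out)
```

The ePub and Kindle builds would then be generated from the refined output of a pass like this, not from the raw geometry-laden file.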
participants (3)
- Alex Buie
- Bowerbird@aol.com
- Lee Passey