
Comment from the current author: [11:47:01 AM] KH: It's a big file because it includes lots of the original geometry information and wraps each word in its own span to support corrections. He didn't say this, but you're right that it also helps support the flippy setup. As for the script, the actual PHP itself is still internal-only, as it's part of the main codebase (which isn't open). Perhaps I can see about making it public. You can use it on any item, however, by visiting "http://www-kenh.archive.org/download/<IDENTIFIER>/<IDENTIFIER>_abbyy.html"

Alex

--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc
Media and Hosting Solutions
+1(703)445-3391
+1(480)253-9640
+1(703)919-8090
abuie@kwdservices.com

On Fri, Oct 21, 2011 at 11:33 PM, Lee Passey <lee@novomail.net> wrote:
On 10/21/2011 9:58 AM, Alex Buie wrote:
Lee,
There's a very alpha script we're testing you can try out. (No guarantee, may burn the house down, etc)
http://www-kenh.archive.org/download/artofbook00holm/artofbook00holm_abbyy.h...
What we have here is not the script, but the output of the script as applied to "The Art of the Book." As you might expect, it's a mess. Like all FineReader output, it is SGML-style HTML, which XML parsers cannot consume. It appears that in standard usage at the Internet Archive, the primary (perhaps only) reason for doing OCR is to support searching inside the FlipBooks (or the PDF files, which appear to be the same content in a different format).
What I see in this file is that /every/ word has a surrounding <span> of class "abbyyword" whose title attribute contains the coordinates indicating where that word appeared in the image. The end of every detected line is marked by <br class="abbyybreak">, and every line begins with an <a>nchor of class "abbyyline" that not only has an identifier but also carries data indicating where the line starts in the image.
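To make that concrete, a single line of text presumably comes out looking something like this (a hypothetical reconstruction from the description above; the coordinate format and id scheme are my guesses, not copied from the actual file):

<a class="abbyyline" id="line_0042" title="left:210 top:345"></a>
<span class="abbyyword" title="210 345 298 372">THE</span>
<span class="abbyyword" title="310 345 402 372">ART</span>
<span class="abbyyword" title="414 345 470 372">OF</span>
<span class="abbyyword" title="482 345 560 372">THE</span>
<span class="abbyyword" title="572 345 690 372">BOOK</span>
<br class="abbyybreak">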
The file is full of this sort of cruft; if we want data that is not inextricably tied to a picture, we don't need any of this stuff.
BUT...
If this is the best I can get, I'll take it. I think I can write some XSL scripts, maybe together with some real programming, that can reduce the garbage to a manageable level. I would much rather have too much data than too little, because it's easier to ignore irrelevant data than it is to guess at the existence of unknown data.
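For what it's worth, here is a minimal sketch of what such an XSL script might look like, assuming the file has first been run through something like HTML Tidy to make it well-formed XML, and assuming the tidied output carries no XHTML namespace (if it does, the match patterns below need a namespace prefix). It unwraps each word span, drops the per-line anchors, and turns the break markers into plain spaces:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity transform: copy everything through by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- Unwrap each word span, keeping only its text content. -->
  <xsl:template match="span[@class='abbyyword']">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

  <!-- Drop the per-line anchors entirely. -->
  <xsl:template match="a[@class='abbyyline']"/>

  <!-- Replace the line-break markers with a single space. -->
  <xsl:template match="br[@class='abbyybreak']">
    <xsl:text> </xsl:text>
  </xsl:template>

</xsl:stylesheet>

Any XSLT 1.0 processor should do, e.g. "xsltproc strip-abbyy.xsl artofbook00holm_abbyy.xhtml" (file names illustrative).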
Whoever wrote this should be encouraged to continue improving it; don't let her/him become discouraged. And if IA doesn't have the resources to pursue it, please make it available in the condition it's in now.
If you could provide a pointer to the actual script, that would be really great.