
Comment from the current author: [11:47:01 AM] KH: It's a big file because it includes lots of the original geometry information and wraps each word in its own span to support corrections. He didn't say this, but you're right that it also helps support the flippy setup. As for the script, the actual PHP itself is still internal-only, as it's part of the main codebase (which isn't open). Perhaps I can see about making it public. You can use it on any item, however, by visiting "http://www-kenh.archive.org/download/<IDENTIFIER>/<IDENTIFIER>_abbyy.html"

Alex

--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc
Media and Hosting Solutions
+1(703)445-3391
+1(480)253-9640
+1(703)919-8090
abuie@kwdservices.com

On Fri, Oct 21, 2011 at 11:33 PM, Lee Passey <lee@novomail.net> wrote:
On 10/21/2011 9:58 AM, Alex Buie wrote:
Lee,
There's a very alpha script we're testing you can try out. (No guarantee, may burn the house down, etc)
http://www-kenh.archive.org/download/artofbook00holm/artofbook00holm_abbyy.h...
What we have here is not the script, but the output of the script as applied to "The Art of the Book." As you might expect, it's a mess. Like all FineReader output, it is SGML-style HTML, which XML parsers cannot consume. It appears that in standard usage at the Internet Archive, the primary (perhaps only) reason for doing OCR is to support searching inside the FlipBooks (or the PDF files, which appear to be the same content in a different format).
What I see in this file is that /every/ word has a surrounding <span> of class "abbyyword" whose title attribute contains the coordinates indicating where that word appeared in the image. The end of every detected line is marked by <br class="abbyybreak">, and every line begins with an <a>nchor of class "abbyyline" that not only has an identifier but also carries data indicating where the line starts in the image.
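To make that concrete, a single line of text presumably comes out looking something like this (a hypothetical reconstruction from the description above; the coordinate format and id scheme are my guesses, not copied from the actual file):

<a class="abbyyline" id="line_0042" title="left:210 top:345"></a>
<span class="abbyyword" title="210 345 298 372">THE</span>
<span class="abbyyword" title="310 345 402 372">ART</span>
<span class="abbyyword" title="414 345 470 372">OF</span>
<span class="abbyyword" title="482 345 560 372">THE</span>
<span class="abbyyword" title="572 345 690 372">BOOK</span>
<br class="abbyybreak">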
The file is full of this sort of cruft; if we want data that is not inextricably tied to a picture, we don't need any of this stuff.
BUT...
If this is the best I can get, I'll take it. I think I can write some XSL scripts, maybe together with some real programming, that can reduce the garbage to a manageable level. I would much rather have too much data than too little, because it's easier to ignore irrelevant data than it is to guess at the existence of unknown data.
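For what it's worth, here is a minimal sketch of what such an XSL script might look like, assuming the file has first been run through something like HTML Tidy to make it well-formed XML, and assuming the tidied output carries no XHTML namespace (if it does, the match patterns below need a namespace prefix). It unwraps each word span, drops the per-line anchors, and turns the break markers into plain spaces:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity transform: copy everything through by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- Unwrap each word span, keeping only its text content. -->
  <xsl:template match="span[@class='abbyyword']">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

  <!-- Drop the per-line anchors entirely. -->
  <xsl:template match="a[@class='abbyyline']"/>

  <!-- Replace the line-break markers with a single space. -->
  <xsl:template match="br[@class='abbyybreak']">
    <xsl:text> </xsl:text>
  </xsl:template>

</xsl:stylesheet>

Any XSLT 1.0 processor should do, e.g. "xsltproc strip-abbyy.xsl artofbook00holm_abbyy.xhtml" (file names illustrative).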
Whoever wrote this should be encouraged to continue improving it; don't let her/him become discouraged. And if IA doesn't have the resources to pursue it, please make it available in the condition it's in now.
If you could provide a pointer to the actual script, that would be really great.