I might be able to hook you up on one of my internal dev VMs, so the IA requests don't even have to go over the internet.
What kind of software do you need installed?
On Tuesday, January 3, 2012, Lee Passey <lee@novomail.net> wrote:
> There have been many complaints here (quite a few from yours truly) about the
> difficulty of harvesting more than the raw text of a work from the Internet
> Archive. As a first step to resolving some of these complaints I have created
> a servlet which parses the Abbyy output file and converts it to HTML.
>
> The resultant HTML identifies words which Abbyy could not find in its internal
> dictionary, and words which contain uncertain characters.
>
> Some attempt is made to identify blocks of text which are centered, and to set
> the relative font size for those spans of text which are significantly larger
> or smaller than the norm.
>
> Existing line breaks are preserved by the addition of a <br class="ocr"/>
> element. Line-ending soft hyphens (as identified by Abbyy) are replaced by
> "­~".
>
> Word coordinates are maintained (or perhaps more accurately, recomputed) and
> are attached to each word as a "title" attribute, e.g. <span class="word"
> title="(760,843),(791,996)">, where the first cartesian pair is the upper left
> coordiate, and the second pair is the lower right coordinate. (Technically,
> the coordinates are presented as (y,x). Should I change this to (x,y)?)
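> 
> For anyone curious how the recomputation works, the gist is just a min/max
> pass over the character boxes. This is a minimal sketch, not the servlet's
> actual code, and it assumes the Abbyy charParams elements carry l, t, r and
> b pixel attributes:
> 
>     // Recompute a word's bounding box from its Abbyy charParams elements.
>     // Assumption: each <charParams> carries l, t, r, b pixel attributes.
>     static String wordTitle(java.util.List<org.w3c.dom.Element> chars) {
>         int left = Integer.MAX_VALUE, top = Integer.MAX_VALUE;
>         int right = 0, bottom = 0;
>         for (org.w3c.dom.Element c : chars) {
>             left   = Math.min(left,   Integer.parseInt(c.getAttribute("l")));
>             top    = Math.min(top,    Integer.parseInt(c.getAttribute("t")));
>             right  = Math.max(right,  Integer.parseInt(c.getAttribute("r")));
>             bottom = Math.max(bottom, Integer.parseInt(c.getAttribute("b")));
>         }
>         // (y,x) ordering, upper left pair first, as described above
>         return "(" + top + "," + left + "),(" + bottom + "," + right + ")";
>     }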
>
> A link to "archive.css" is added to the beginning of the file. This allows an
> end user to, among other things, highlight or otherwise mark words which are
> uncertain or not in the dictionary. To view the document without the original
> line breaks, add "br.ocr { display:none }" to the .css file.
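> 
> For example, an archive.css along these lines will highlight the flagged
> words and hide the OCR line breaks; again, the two highlight selectors use
> placeholder class names, so substitute whatever classes appear in the file
> you actually download:
> 
>     span.uncertain       { background-color: yellow }
>     span.notInDictionary { background-color: orange }
>     br.ocr               { display: none }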
>
> My next step will be to allow the document to be gzipped before downloading.
> After that, I will add an option to break the HTML file into multiple files,
> each of which matches a single page image, and return the collection as a zip
> archive.
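> 
> The gzip step is only a few lines inside the servlet's doGet(); here is a
> rough sketch, with exception handling omitted and "document" standing in for
> the DOM the servlet has already built (the real code will also set
> Content-Type and negotiate Accept-Encoding more carefully):
> 
>     // Gzip the serialized HTML when the client says it accepts gzip.
>     String accept = request.getHeader("Accept-Encoding");
>     java.io.OutputStream out = response.getOutputStream();
>     if (accept != null && accept.contains("gzip")) {
>         response.setHeader("Content-Encoding", "gzip");
>         out = new java.util.zip.GZIPOutputStream(out);
>     }
>     javax.xml.transform.TransformerFactory.newInstance().newTransformer()
>         .transform(new javax.xml.transform.dom.DOMSource(document),
>                    new javax.xml.transform.stream.StreamResult(out));
>     out.close();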
>
> To use this service, in a browser navigate to
> "http://www.ebookcoop.net/ebookcoop/FromIA?[iaid]", where [iaid] is the
> Internet Archive IDentifier for a specific work. For example,
> "http://www.ebookcoop.net/ebookcoop/FromIA?cu31924097556546" will return the
> _Writings of Henry David Thoreau_, and
> "http://www.ebookcoop.net/ebookcoop/FromIA?tarzanofapes00burruoft" will return
> _Tarzan of the Apes_.
>
> This servlet makes an HTTP connection to the Internet Archive to download the
> *_abbyy.gz file, builds a DOM in memory, does an overall evaluation, some
> transformations, then serializes it to the servlet output. It is running on an
> old Pentium III in my basement at the end of a DSL line, so expect it to be
> quite slow (it could require a matter of minutes to construct the file). Also
> be kind; try not to overload or monopolize it. If the Internet Archive would
> like to give me access to a servlet engine on a fast server with a fat pipe
> I'm sure performance would be vastly improved.
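> 
> If anyone wants to run something similar locally, the fetch-and-parse step
> is straightforward. A rough sketch follows; note that the download URL is
> the generic archive.org pattern, and the *_abbyy.gz file name inside an item
> does not always match the item's identifier, so treat both as assumptions to
> verify:
> 
>     // Fetch the gzipped Abbyy XML for an item and build a DOM from it.
>     String id = "cu31924097556546";
>     java.net.URL url = new java.net.URL(
>             "http://www.archive.org/download/" + id + "/" + id + "_abbyy.gz");
>     java.io.InputStream in = new java.util.zip.GZIPInputStream(url.openStream());
>     org.w3c.dom.Document abbyy = javax.xml.parsers.DocumentBuilderFactory
>             .newInstance().newDocumentBuilder().parse(in);
>     in.close();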
>
> Feedback is encouraged.
>
> Cheers,
> Lee
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d@lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc
Media and Hosting Solutions
+1(703)445-3391
+1(480)253-9640
+1(703)919-8090
abuie@kwdservices.com