Re: [gutvol-d] Recording the quality of a book's OCR

10 Jan 2012

      On 1/3/2012 12:52 PM, Lee Passey wrote:
...
On Tue, January 3, 2012 11:46 am, Edward Betts wrote:
...
I agree we need a way to correct our books. I find it difficult to
read the ePub or Kindle versions of scanned books because of the
OCR errors. I'm sure the OCR errors are irritating when using
text-to-speech.
As a first step to move this process along I have created a servlet
which parses the Abbyy output file and converts it to HTML.
[snip]
...
My next step will be to allow the document to be gzipped before
downloading. After that, I will add an option to break the HTML file
into multiple files, each of which matches a single page image, and
return the collection as a zip archive.
This step is now complete. Additionally, there is now an option to omit
(or include, depending on your perspective) word coordinates.
...
To use this service, in a browser navigate to
"http://www.ebookcoop.net/ebookcoop/FromIA?[iaid]", where [iaid] is
the Internet Archive IDentifier for a specific work.
By default, Abbyy output is returned as a single HTML file without word
coordinates. To include coordinates, add "&coords" to the query string; e.g.
"http://www.ebookcoop.net/ebookcoop/FromIA?cu31924097556546&coords". To
return the file as a gzipped HTML file, add "&gzip" to the query string.
To return a zip archive of HTML files, where each file represents a
single page (and should follow the same naming convention as the image
files), add "&zip" to the query string.
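
For anyone scripting against the service, the query-string rules above can be sketched in a few lines of Python. This is only an illustration: the function name build_url and its keyword arguments are made up for this example, and it assumes the order of the flags after the identifier does not matter to the servlet.

```python
# Base address of the service, as given in the message.
BASE_URL = "http://www.ebookcoop.net/ebookcoop/FromIA"


def build_url(iaid, coords=False, gzip=False, zip_pages=False):
    """Build a FromIA request URL for an Internet Archive identifier.

    Hypothetical helper: the identifier comes first in the query string,
    followed by the bare flags "coords", "gzip" and "zip", joined by "&".
    """
    flags = []
    if coords:
        flags.append("coords")   # include word coordinates
    if gzip:
        flags.append("gzip")     # single gzipped HTML file
    if zip_pages:
        flags.append("zip")      # zip archive, one HTML file per page
    return BASE_URL + "?" + "&".join([iaid] + flags)


if __name__ == "__main__":
    print(build_url("cu31924097556546", coords=True))
```

Called with no flags, the helper returns the plain "FromIA?[iaid]" form, which corresponds to the default single HTML file without coordinates.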

Note that "&zip" and "&gzip" are incompatible, with "&zip" taking
precedence; if you use both options, "&gzip" will be ignored. If you
were building an online editing tool you would probably want to use a
query string like this:

"http://www.ebookcoop.net/ebookcoop/FromIA?tarzanofapes00burruoft&coords&zip"
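
On the receiving end, an editing tool would need to split the archive back into pages. Here is a rough Python sketch of that step, under a few assumptions that the service itself does not promise: that every archive member is one page of HTML, that the text is UTF-8, and that sorting the member names gives page order. The function name pages_from_zip is hypothetical.

```python
import io
import zipfile


def pages_from_zip(archive_bytes):
    """Map each member name in a zip archive to its decoded HTML text.

    Assumes one HTML file per page and UTF-8 content; undecodable bytes
    are replaced rather than raising an error.
    """
    pages = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if name.endswith("/"):
                continue  # skip directory entries, if any
            raw = archive.read(name)
            pages[name] = raw.decode("utf-8", errors="replace")
    return pages
```

The bytes passed in would be the body of the response to the request above, for example from urllib.request.urlopen(url, timeout=...).read() with a generous timeout.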

Again, let me remind you that this service is running on an old Pentium
III in my basement at the end of a DSL line, so expect it to be quite
slow (it could take several minutes to construct the file).

As always, feedback is encouraged.

Cheers,
Lee