
On 1/3/2012 12:52 PM, Lee Passey wrote:
> On Tue, January 3, 2012 11:46 am, Edward Betts wrote:
>> I agree we need a way to correct our books. I find it difficult to read the ePub or Kindle versions of scanned books because of the OCR errors. I'm sure the OCR errors are irritating when using text-to-speech.
> As a first step to move this process along I have created a servlet which parses the Abbyy output file and converts it to HTML.
> [snip]
> My next step will be to allow the document to be gzipped before downloading. After that, I will add an option to break the HTML file into multiple files, each of which matches a single page image, and return the collection as a zip archive.
This step is now complete. Additionally, there is now an option to omit (or include, depending on your perspective) word coordinates.
To use this service, navigate in a browser to "http://www.ebookcoop.net/ebookcoop/FromIA?[iaid]", where [iaid] is the Internet Archive IDentifier for a specific work.
By default, Abbyy output is returned as a single HTML file without word coordinates. To add coordinates, add "&coords" to the query string; e.g. "http://www.ebookcoop.net/ebookcoop/FromIA?cu31924097556546&coords".

To return the file as a gzipped HTML file, add "&gzip" to the query string. To return the file as a zip archive of HTML files, where each file represents a single page (and should have the same naming convention as the image files), add "&zip" to the query string. Note that "&zip" and "&gzip" are incompatible, with "&zip" taking precedence; if you use both options, "&gzip" will be ignored.

If you were building an online editing tool you would probably want to use a query string like this: "http://www.ebookcoop.net/ebookcoop/FromIA?tarzanofapes00burruoft&coords&zip".

Again, let me remind you that this service is running on an old Pentium /// in my basement at the end of a DSL line, so expect it to be quite slow (it could require a matter of minutes to construct the file).

As always, feedback is encouraged.

Cheers,
Lee
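
For anyone who would rather script this than type URLs into a browser, here is a rough Python sketch of how the query string described above could be assembled. The function names (build_url, fetch) and the 600-second timeout are my own assumptions for illustration and are not part of the service; only the base URL and the "&coords", "&gzip" and "&zip" flags come from the message. The helper simply leaves "&gzip" off when "&zip" is requested, since the service would ignore it anyway.

```python
import urllib.request
from urllib.parse import quote

# Base URL as given in the message.
BASE_URL = "http://www.ebookcoop.net/ebookcoop/FromIA"


def build_url(iaid, coords=False, as_gzip=False, as_zip=False):
    """Assemble the request URL: identifier first, then bare option flags.

    Hypothetical helper; the parameter names are illustrative only.
    """
    parts = [quote(iaid, safe="")]
    if coords:
        parts.append("coords")
    if as_zip:
        # "&zip" takes precedence over "&gzip", so "&gzip" is omitted here.
        parts.append("zip")
    elif as_gzip:
        parts.append("gzip")
    return BASE_URL + "?" + "&".join(parts)


def fetch(url, timeout=600):
    """Download the response body as raw bytes.

    The generous timeout is an assumption, chosen because building the
    file may take minutes. The bytes are returned untouched; whether a
    gzipped response needs explicit decompression is not known here.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# Build (but do not fetch) the URL suggested for an online editing tool.
print(build_url("tarzanofapes00burruoft", coords=True, as_zip=True))
```

Calling fetch() on the resulting URL and writing the bytes to a file would give you the zip archive of per-page HTML files, assuming the service is reachable.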