Re: [gutvol-d] Recording the quality of a book's OCR

On 1/3/2012 12:52 PM, Lee Passey wrote:
On Tue, January 3, 2012 11:46 am, Edward Betts wrote:
I agree we need a way to correct our books. I find it difficult to read the ePub or Kindle versions of scanned books because of the OCR errors. I'm sure the OCR errors are irritating when using text-to-speech.
As a first step to move this process along I have created a servlet which parses the Abbyy output file and converts it to HTML.
[snip]
My next step will be to allow the document to be gzipped before downloading. After that, I will add an option to break the HTML file into multiple files, each of which matches a single page image, and return the collection as a zip archive.
This step is now complete. Additionally, there is now an option to omit (or include, depending on your perspective) word coordinates.
To use this service, in a browser navigate to "http://www.ebookcoop.net/ebookcoop/FromIA?[iaid]", where [iaid] is the Internet Archive identifier for a specific work.
By default, Abbyy output is returned as a single HTML file without word coordinates. To add coordinates, add "&coords" to the query string; e.g. "http://www.ebookcoop.net/ebookcoop/FromIA?cu31924097556546&coords". To return the file as a gzipped HTML file, add "&gzip" to the query string. To return the file as a zip archive of HTML files, where each file represents a single page (and should have the same naming convention as the image files), add "&zip" to the query string.
Note that "&zip" and "&gzip" are incompatible, with "&zip" taking precedence; if you use both options, "&gzip" will be ignored. If you were building an online editing tool you would probably want to use a query string like this:
"http://www.ebookcoop.net/ebookcoop/FromIA?tarzanofapes00burruoft&coords&zip"
Again, let me remind you that this service is running on an old Pentium III in my basement at the end of a DSL line, so expect it to be quite slow (it could take several minutes to construct the file).
As always, feedback is encouraged.
Cheers,
Lee

On 1/10/2012 2:26 PM, Lee Passey wrote:
Thanks to Mr. Newby's good offices, this service is now running at a much higher performance level at http://readingroo.ms:8080. Thus, to see Tarzan (what I believe is a non-bowdlerized version) go to http://readingroo.ms:8080/ebookcoop/FromIA?tarzanofapes00burruoft.
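For anyone scripting against the service rather than typing URLs by hand, here is a minimal Python sketch of how the query string described in this thread could be assembled. The function name build_fromia_url and its parameter names are hypothetical; only the host, path, and the "coords", "zip", and "gzip" options come from the messages above, and the service may no longer be reachable at this address.

```python
from urllib.parse import quote

# Host and path as given in the thread; treat availability as an assumption.
DEFAULT_BASE = "http://readingroo.ms:8080/ebookcoop/FromIA"


def build_fromia_url(iaid, coords=False, zip_pages=False, gzip=False,
                     base=DEFAULT_BASE):
    """Return a FromIA URL for an Internet Archive identifier.

    The identifier is the first query-string item; options follow as
    bare "&name" flags, matching the examples in the thread.
    """
    if not iaid:
        raise ValueError("an Internet Archive identifier is required")
    flags = []
    if coords:
        flags.append("coords")
    if zip_pages:
        # "&zip" is described as taking precedence over "&gzip",
        # so "gzip" is simply left out when both are requested.
        flags.append("zip")
    elif gzip:
        flags.append("gzip")
    return base + "?" + "&".join([quote(iaid, safe="")] + flags)


if __name__ == "__main__":
    print(build_fromia_url("tarzanofapes00burruoft", coords=True, zip_pages=True))
```

Dropping "gzip" on the client side when "zip" is also requested only mirrors the precedence rule stated above; sending both flags should behave the same way.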

What are word coordinates? Can you give an example of how we might use this?

On 2/28/2012 7:19 PM, don kretz wrote:
What are word coordinates? Can you give an example of how we might use this?
They give the location on the original page image where each word was found. This is most useful when you generate a zip file, as the file names in the .zip mirror the file names of the images at IA. Thus, tarzanofapes00burruoft_0010.html matches tarzanofapes00burruoft_0010.jp2, and <span class="word" title="(1540,510),(1606,740)">Emma</span> indicates that Abbyy derived that word from the stated coordinates. (tlbr? ltrb? Can't remember off the top of my head). Because the coordinates are stored in the title attribute, you can also see them in a browser by hovering the mouse over a word.
Hopefully, this will be useful in creating a side-by-side, web-based editing system that mirrors the Abbyy UI. An end user should be able to click on a word in the structured text editor, and via JavaScript the associated area on the image will be highlighted (or reversed, or something; image processing in JavaScript is not yet one of my areas of expertise). The reverse process might also be possible.
The key concept, though, is to synchronize an area of the image with text in a structured text editor.
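To make the markup above concrete, here is a rough Python sketch that pulls each word and its four coordinate values out of a page file and maps a page file name to its image name. It assumes the title attribute always has the "(a,b),(c,d)" shape of the Emma example and that word spans are not nested. The names WordSpanParser, extract_words, and image_name_for are made up for illustration. Since the ordering of the values (tlbr or ltrb) is not settled above, the four numbers are returned in the order they appear and left uninterpreted.

```python
import re
from html.parser import HTMLParser

# Matches the title format shown in the example: "(1540,510),(1606,740)"
COORDS_RE = re.compile(r"^\((\d+),(\d+)\),\((\d+),(\d+)\)$")


class WordSpanParser(HTMLParser):
    """Collect (text, (a, b, c, d)) for each <span class="word"> element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._coords = None   # set only while inside a word span
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "word":
            match = COORDS_RE.match(attrs.get("title") or "")
            if match:
                self._coords = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._coords is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans are not nested inside one another.
        if tag == "span" and self._coords is not None:
            self.words.append(("".join(self._text), self._coords))
            self._coords = None


def extract_words(html):
    """Return a list of (word, coordinates) pairs found in an HTML string."""
    parser = WordSpanParser()
    parser.feed(html)
    parser.close()
    return parser.words


def image_name_for(page_name):
    """Map a page file name from the zip to the matching image file name."""
    suffix = ".html"
    if not page_name.endswith(suffix):
        raise ValueError("expected a file name ending in .html")
    return page_name[:-len(suffix)] + ".jp2"


if __name__ == "__main__":
    sample = '<p><span class="word" title="(1540,510),(1606,740)">Emma</span></p>'
    print(extract_words(sample))
    print(image_name_for("tarzanofapes00burruoft_0010.html"))
```

An editing tool would still need to work out which of the four numbers are x and which are y before drawing a highlight box on the page image.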