
On Thu, November 10, 2011 4:42 pm, Alex Buie wrote:
> On Wed, Nov 9, 2011 at 12:28 AM, Lee Passey <lee@novomail.net> wrote:
>> In the IA output, I'm discovering that that data has been preserved. I think with some effort, it would be possible to use this data to build a web interface substantially identical to the proofreading interface provided by FineReader.
>> So Alex, all that talk a while back about how I wanted a "leaner, meaner" file? Forget about it. I think I like it just the way it is. I can select out what I need, and it has some potential.
> Here's something someone at the archive is working on (after hours, since it's not an official project yet). He'd love to hear your thoughts.
> http://edwardbetts.com/correct
> (Note you can't actually submit edits yet, but he's working on getting there)
> Already I find that proofing like that is a LOT easier than proofing on pgdp, but maybe that's me.
This is interesting, but not exactly what I had in mind. What Mr. Betts has developed here (I use the past tense because I obviously don't know what he has in mind for the future, only what has already been accomplished) is another mechanism for comparing a line of text with a segment of an image that purports to represent that text. Even when the text becomes editable, you still only have a mechanism to help ensure that the ASCII encoding of an image is accurate. This is just a refinement of the Distributed Proofreaders proofing model, variations of which others have done before.

On the other hand, if you have used the Abbyy FineReader interface, you know it provides two side-by-side windows, one containing an image of a scanned page and the other a representation of the text derived from the OCR. It is more than just a method of editing the text, however. For one thing, it shows paragraphs as paragraphs, usually indented. It can also highlight, using various user-definable colors, words that it could not find in its dictionary and "uncertain" words (FineReader's "best guess," but not meeting a certain threshold of reliability). When you select a word in the editing window, the image of that word is highlighted on the scanned page, and usually there is a "zoom" window at the bottom of the screen that shows the selected text on a single "zoomed" line.

Having carefully looked through the contents of one of the "*_abbyy.xml" files made available on IA, I have come to the conclusion that the results of Ken H.'s script are /not/ derived from the data in those files. Not only does Ken's script preserve geometry information (which permits a "click on a word and see it highlighted on the scan" function), it also shows which words FineReader could not find in its dictionary and which words FineReader was uncertain of. It also marks line-ending hyphenation, and whether FineReader considered it "soft" hyphenation, which should be removed if word wrapping no longer puts it at the end of a line, or "hard" hyphenation, which should be displayed in all cases. I suspect that the file also contains other useful information that I simply haven't discovered yet.

With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface which could mimic the behavior of the FineReader interface; I've appended a couple of rough sketches of what I have in mind. It is also interesting to note that the script output is /not/ valid XHTML, but it is virtually identical to the SGML/HTML produced by FineReader.

I'm forced to conclude that the source for Ken's fromabbyy.php script is /not/ any of the files in the standard download location. If the files he uses /are/ publicly available, we don't yet know where they are stored. For the last couple of weeks the script has been unresponsive. It would be nice if we could encourage Ken to troubleshoot the script so that it works again; better still would be getting access to the files which form the input source for his script in the first place.
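
To make the "click on a word and see it highlighted on the scan" idea a little more concrete, here is a rough sketch of how the geometry data could drive such an interface. I'm inventing the markup: the data-* attribute names, the coordinate units (pixels in the original scan), and the class names are my own assumptions for illustration, not what Ken's script or the abbyy.xml files actually emit. The point is only that any per-word bounding box is enough.

  <img id="scan" src="page0042.jpg">

  <div id="text">
    <!-- one span per OCR word; l/t/r/b give the word's box on the scan -->
    <span class="word" data-l="312" data-t="508" data-r="401" data-b="540">receive</span>
    ...
  </div>

  <!-- the box we move around on top of the scan -->
  <div id="highlight" style="position:absolute; border:2px solid red; display:none;"></div>

  <script>
  var scan = document.getElementById('scan');
  var box  = document.getElementById('highlight');

  document.getElementById('text').addEventListener('click', function (e) {
    var w = e.target;
    if (!w.classList || !w.classList.contains('word')) return;

    // scale the word's scan coordinates to the size the image is displayed at
    var scale = scan.clientWidth / scan.naturalWidth;
    var pos   = scan.getBoundingClientRect();

    box.style.left    = (window.scrollX + pos.left + w.dataset.l * scale) + 'px';
    box.style.top     = (window.scrollY + pos.top  + w.dataset.t * scale) + 'px';
    box.style.width   = ((w.dataset.r - w.dataset.l) * scale) + 'px';
    box.style.height  = ((w.dataset.b - w.dataset.t) * scale) + 'px';
    box.style.display = 'block';
  });
  </script>

The dictionary and "uncertain" flags could be handled the same way: put a class on the span and let a stylesheet color it, much as FineReader's user-definable highlighting does.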
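The hyphenation marks are what would let such an interface show paragraphs as paragraphs rather than as raw OCR lines. Again, the field names below are invented and the code is untested; it just illustrates the rule I described: drop a "soft" hyphen and rejoin the word, keep a "hard" hyphen but still join the two halves.

  // Rejoin OCR lines into paragraph text. Each line is assumed to arrive
  // as { text: "...", softHyphen: true|false } -- my own invented shape,
  // not the actual output of Ken's script.
  function joinLines(lines) {
    var text = '';
    lines.forEach(function (line) {
      if (line.softHyphen) {
        // soft hyphen: an artifact of the original line wrap, so drop it
        // and glue the two halves of the word back together
        text += line.text.replace(/-$/, '');
      } else if (/-$/.test(line.text)) {
        // hard hyphen: part of the word itself, so keep it, but still
        // join without an intervening space
        text += line.text;
      } else {
        text += line.text + ' ';
      }
    });
    return text.trim();
  }

  // joinLines([{text: 'under-',        softHyphen: true},
  //            {text: 'stand twenty-', softHyphen: false},
  //            {text: 'five rules',    softHyphen: false}])
  //   ==> "understand twenty-five rules"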