
On Mon, November 21, 2011 4:19 pm, Edward Betts wrote: [NB: all of Mr. Betts' reply has been included here, even those parts which I am not replying to, so everyone can get the benefit of his response.]
Hi Lee, I sent the message below to the mailing list; it is waiting to be moderated.
On 21/11/11 12:25, Lee Passey wrote:
This is interesting, but not exactly what I had in mind. What Mr. Betts has developed here (I use the past tense, because I obviously don't know what he has in mind for the future, only what has already been accomplished) is another mechanism for comparing a line of text with a segment of an image that purports to represent that text.
Still working on it. The code is here: https://github.com/edwardbetts/corrections
You can now login using an Open Library or Internet Archive account. Saving works, but fixes aren't yet visible on the edit screen. I need to add more editing beyond changing single words.
My emphasis is on building something that will save edits and maybe generate epubs using these fixes. I'm keen to retain word coordinates with edits, which makes things harder. With coordinates we can highlight corrected words in the original page images when using the Internet Archive book reader's search and read-aloud features.
Having carefully looked through the contents of one of the "*_abbyy.xml" files made available on IA, I have come to the conclusion that the results of Ken H.'s script are /not/ derived from the data in those files. Not only does Ken's script preserve geometry information (which permits a "click on a word, see it highlighted on the scan" function), it also shows which words FineReader could not find in its dictionary, and which words FineReader was uncertain of. It also marks line-ending hyphenation, and whether FineReader considered it "soft" hyphenation, which should be removed if word wrapping no longer puts it at the end of a line, or "hard" hyphenation, which should be displayed in all cases. I suspect that the file also contains other useful information that I simply haven't discovered yet.
All this information is in the abbyy file; it is in the charParams tags, for example:
<charParams l="316" t="51" r="336" b="76" suspicious="true" wordStart="true" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="50" serifProbability="12" wordPenalty="36" meanStrokeWidth="107">J</charParams> <charParams l="331" t="46" r="343" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="53" serifProbability="100" wordPenalty="36" meanStrokeWidth="107">t</charParams> <charParams l="339" t="57" r="348" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="44" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">i</charParams> <charParams l="343" t="53" r="384" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="48" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">o</charParams> <charParams l="379" t="46" r="402" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="40" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">K</charParams>
The interesting attributes are:
wordStart: tells you if abbyy thinks a character is the first letter in a word
wordFromDictionary: tells you if the word is in the abbyy dictionary
wordIdentifier: tells you if abbyy thinks the word is some kind of identifier
charConfidence: gives a score for how confident abbyy is about the character
These are certainly /some/ interesting attributes, but they are by no means the /only/ interesting attributes. At least as interesting are "suspicious" and "proofed". And what the heck do the values for "l", "t", "r" and "b" mean? The complete schema definition for the Abbyy.xml file can be found at http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml, but it suffers from the serious flaw that most schema definitions do: while it explains what the various components /are/, it contains little explanation of what those components /mean/. I'm pretty sure FineReader understands what all those elements mean, but is there some official way for /me/ to learn what they mean?
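In the meantime, here is my best guess at how to put those attributes to use, as a short Python sketch (the file name and function names are my own inventions, not anything from Ken's or Mr. Betts' code). It walks the charParams elements, uses wordStart to group characters into words, and treats l/t/r/b as the left/top/right/bottom pixel coordinates of each character's bounding box; that reading of the coordinates is an assumption on my part, not something I have seen documented.

import xml.etree.ElementTree as ET

def local(tag):
    # tags arrive as '{namespace}charParams'; keep only the local name
    return tag.rsplit('}', 1)[-1]

def words(abbyy_path):
    """Yield (word_text, attributes_of_first_char) for each word."""
    word, first = [], None
    for _, elem in ET.iterparse(abbyy_path):
        if local(elem.tag) != 'charParams':
            continue
        ch = elem.text or ''
        if not ch.strip():          # skip the inter-word space characters
            continue
        if elem.get('wordStart') == 'true' and word:
            yield ''.join(word), first
            word = []
        if not word:
            first = dict(elem.attrib)
        word.append(ch)
    if word:
        yield ''.join(word), first

for text, attrs in words('example_abbyy.xml'):
    # l/t/r/b appear to be left/top/right/bottom pixel coordinates of the
    # first character's bounding box; charConfidence is a 0-100 score
    if attrs.get('wordFromDictionary') == 'false':
        print(text, attrs.get('l'), attrs.get('t'), attrs.get('charConfidence'))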
You can use wordStart to find the line-ending hyphenation.
Actually, you can't. If a <charParams> element contains a single hyphen, /and/ if it is the last <charParams> element of a <line> element, /then/ you can conclude that it is line-ending hyphenation. But just knowing it is line-ending isn't enough. The FineReader interface (and the HTML file generated from the "fromabbyy.php" script) also indicates whether line-ending hyphenation is hard or soft. In the file I'm looking at right now, I can't see any way to distinguish between the hyphen in "hard-working" (which is hard) and the hyphen in "prop-erty" (which is soft). How do I use the Abbyy markup to make this distinction?
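The detection rule itself is simple enough to sketch (again in Python, with invented file and function names; the element names come from the schema). Note that it finds the line-ending hyphens without telling you whether they are hard or soft, which is exactly the problem:

import xml.etree.ElementTree as ET

def local(tag):
    return tag.rsplit('}', 1)[-1]

def line_ending_hyphens(abbyy_path):
    """Yield the final charParams of each line whose last character is '-'."""
    tree = ET.parse(abbyy_path)
    for line in tree.iter():
        if local(line.tag) != 'line':
            continue
        chars = [c for c in line.iter() if local(c.tag) == 'charParams']
        if chars and chars[-1].text == '-':
            yield chars[-1]

for hyphen in line_ending_hyphens('example_abbyy.xml'):
    # hard or soft? nothing in these attributes seems to say
    print('line-ending hyphen, box:', hyphen.attrib)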
With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface that could mimic the behavior of the FineReader interface.
I think the hard part is how to store the edits.
Storing the edits is the easiest part (depending, of course, on how the edits are created). All you need is an unambiguous way to link a character in the editor with a character in the file. The "l,t,r,b" attributes may provide just such a unique identifier; it's also possible that such a thing could be accomplished using XPath. Setting the "proofed" attribute on a <charParams> element should be a signal that "a human being has accepted this, don't mess with it again."

When I built my proof-of-concept web-based editing system last year, I simply split the document into page-sized HTML files and passed an entire page file to Kupu, a JavaScript-based HTML editor (part of one of the Apache projects, I forget which). When the user was done editing and pressed the "Save" button, the HTML document was posted back to the server, where a very simple Java servlet committed the file to a CVS repository. The working directory of the repository was set as an Apache document directory, so CVS would maintain a version history (to protect against cyber-vandalism) while the web server would always return the most current version.

Saving incremental edits is a problem that has been solved over and over again. By leveraging other people's efforts, I'm sure that this would be the easiest part of the entire scheme.
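As a rough illustration of the linking idea (the function name, the sample file name, and the example correction are all invented; I am also assuming the file's namespace matches the schema URL given earlier), an edit could be expressed as a page index plus an (l,t,r,b) box and applied like this:

import xml.etree.ElementTree as ET

# keep the file's own namespace on output, assuming it matches the schema URL
ET.register_namespace('', 'http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml')

def local(tag):
    return tag.rsplit('}', 1)[-1]

def apply_edit(abbyy_path, page_index, box, new_char):
    # box is an (l, t, r, b) tuple of attribute strings, exactly as they
    # appear in the file, serving as the unique identifier for one character
    tree = ET.parse(abbyy_path)
    pages = [e for e in tree.iter() if local(e.tag) == 'page']
    for c in pages[page_index].iter():
        if local(c.tag) != 'charParams':
            continue
        if (c.get('l'), c.get('t'), c.get('r'), c.get('b')) == box:
            c.text = new_char
            c.set('proofed', 'true')  # "a human being has accepted this"
            break
    tree.write(abbyy_path, encoding='utf-8', xml_declaration=True)

# purely for illustration: correct the suspicious "J" from the sample above
apply_edit('example_abbyy.xml', 0, ('316', '51', '336', '76'), 'I')

Addressing the character by its raw coordinates avoids any dependence on element order; an XPath expression pinpointing the same node would serve just as well.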
I'm forced to conclude that the source for Ken's fromabbyy.php script is /not/ any of the files available at the standard download location. If the files he uses /are/ publicly available, so far we don't know where they are stored.
The abbyy.xml file is the source for Ken's fromabbyy.php.
If you say so, I've no choice but to believe you. But the output from the script has FineReader fingerprints all over it. It is so similar to the HTML output I see directly from FineReader that I have a hard time believing that FineReader itself did not generate the HTML. Why would the script take well-formed XML and convert it to non-well-formed SGML/HTML? If the input to the "fromabbyy.php" script is a *_abbyy.xml file, then the script must be passing that file to the FineReader engine to produce the HTML output. How is that done? I own a copy of FineReader; can I do it too?