
On 21/11/11 12:25, Lee Passey wrote:
This is interesting, but not exactly what I had in mind. What Mr. Betts has developed here (I use the past tense, because I obviously don't know what he has in mind for the future, only what has already been accomplished) is another mechanism for comparing a line of text with a segment of an image that purports to represent that text.
Still working on it. The code is here: https://github.com/edwardbetts/corrections

You can now log in using an Open Library or Internet Archive account. Saving works, but fixes aren't yet visible on the edit screen. I need to add more editing beyond changing single words. My emphasis is on building something that will save edits and maybe generate epubs using these fixes. I'm keen to retain word coordinates with edits, which makes things harder. With coordinates we can highlight corrected words in the original page images when using the Internet Archive book reader's search and read-aloud features.
Having carefully looked through the contents of one of the "*_abbyy.xml" files made available on IA, I have come to the conclusion that the results of Ken H.'s script are /not/ derived from the data from those files. Not only does Ken's script preserve geometry information (which permits a "click on a word, see it highlighted on the scan" function), it also shows which words FineReader could not find in its dictionary, and which words FineReader was uncertain of. It also marks line-ending hyphenation, and whether FineReader considered it "soft" hyphenation, which should be removed if word wrapping no longer puts it at the end of a line, or "hard" hyphenation, which should be displayed in all cases. I suspect that the file also contains other useful information that I simply haven't discovered yet.
All this information is in the abbyy file, in the charParams tags. For example:

<charParams l="316" t="51" r="336" b="76" suspicious="true" wordStart="true" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="50" serifProbability="12" wordPenalty="36" meanStrokeWidth="107">J</charParams>
<charParams l="331" t="46" r="343" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="53" serifProbability="100" wordPenalty="36" meanStrokeWidth="107">t</charParams>
<charParams l="339" t="57" r="348" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="44" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">i</charParams>
<charParams l="343" t="53" r="384" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="48" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">o</charParams>
<charParams l="379" t="46" r="402" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="40" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">K</charParams>

The interesting attributes are:

wordStart: tells you whether abbyy thinks a character is the first letter of a word
wordFromDictionary: tells you whether the word is in the abbyy dictionary
wordIdentifier: tells you whether abbyy thinks the word is some kind of identifier
charConfidence: gives a score for how confident abbyy is about the character

You can use wordStart to find the line-ending hyphenation.
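As a rough illustration of how these attributes can be used, here is a minimal Python sketch that groups charParams elements into words and keeps a bounding box for each one. It is not the code from the corrections repository. The `<line>` wrapper, the function names `local_name` and `words_from_line`, and the rule that a whitespace character ends a word are assumptions made for the example. The sample data is the five characters quoted above, trimmed to the attributes the sketch reads.

```python
import xml.etree.ElementTree as ET

# The five characters quoted above, trimmed to the attributes read below.
# The <line> wrapper is an assumption made to get a well-formed sample.
SAMPLE = """<line>
<charParams l="316" t="51" r="336" b="76" wordStart="true" wordFromDictionary="false" charConfidence="50">J</charParams>
<charParams l="331" t="46" r="343" b="76" wordStart="false" wordFromDictionary="false" charConfidence="53">t</charParams>
<charParams l="339" t="57" r="348" b="76" wordStart="false" wordFromDictionary="false" charConfidence="44">i</charParams>
<charParams l="343" t="53" r="384" b="76" wordStart="false" wordFromDictionary="false" charConfidence="48">o</charParams>
<charParams l="379" t="46" r="402" b="76" wordStart="false" wordFromDictionary="false" charConfidence="40">K</charParams>
</line>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}charParams'."""
    return tag.rsplit("}", 1)[-1]


def words_from_line(line_elem):
    """Group charParams elements into words, each with a bounding box."""
    words = []
    current = None
    for elem in line_elem.iter():
        if local_name(elem.tag) != "charParams":
            continue
        char = elem.text or ""
        if not char.strip():
            current = None  # assumption: whitespace ends the current word
            continue
        l, t, r, b = (int(elem.get(k)) for k in ("l", "t", "r", "b"))
        conf = int(elem.get("charConfidence", "0"))
        if current is None or elem.get("wordStart") == "true":
            current = {
                "text": "",
                "box": [l, t, r, b],
                "min_confidence": conf,
                "in_dictionary": elem.get("wordFromDictionary") == "true",
            }
            words.append(current)
        current["text"] += char
        # Grow the word box so it covers every character seen so far.
        box = current["box"]
        current["box"] = [min(box[0], l), min(box[1], t),
                          max(box[2], r), max(box[3], b)]
        current["min_confidence"] = min(current["min_confidence"], conf)
    return words


if __name__ == "__main__":
    for word in words_from_line(ET.fromstring(SAMPLE)):
        print(word)
```

For the sample this gives a single word, "JtioK", with box 316, 46, 402, 76 and a lowest character confidence of 40, which is the kind of word an editing interface would want to flag for review.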
With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface which could mimic the behavior of the FineReader interface.
I think the hard part is how to store the edits.
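One possible shape for a stored edit, purely as a sketch and not what the corrections code does: record the page, the word's bounding box, the original OCR text and the corrected text, so the fix can be applied later without losing the coordinates. The `WordEdit` class, the `apply_edits` function, the page number and the corrected value "example" are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class WordEdit:
    page: int        # hypothetical page index within the scan
    box: tuple       # (l, t, r, b) of the word in the page image
    original: str    # text as OCR produced it
    corrected: str   # text supplied by the person editing


def apply_edits(words, edits):
    """Return a copy of words with matching edits applied.

    A word matches when its page and box agree with the edit and its
    text is still the original OCR text. Coordinates are never changed.
    """
    by_key = {(e.page, tuple(e.box)): e for e in edits}
    result = []
    for word in words:
        edit = by_key.get((word["page"], tuple(word["box"])))
        if edit is not None and edit.original == word["text"]:
            word = {**word, "text": edit.corrected}
        result.append(word)
    return result


if __name__ == "__main__":
    # "example" is a made-up correction; the real word is unknown here.
    edit = WordEdit(page=1, box=(316, 46, 402, 76),
                    original="JtioK", corrected="example")
    stored = json.dumps([asdict(edit)])            # what would be saved
    loaded = [WordEdit(**d) for d in json.loads(stored)]
    words = [{"page": 1, "box": [316, 46, 402, 76], "text": "JtioK"}]
    print(apply_edits(words, loaded))
```

Keying on the box only handles single-word changes. Splitting, joining or inserting words would need something richer, which is where it gets hard.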
I'm forced to conclude that the source for Ken's fromabbyy.php script is /not/ any of the files that are part of the standard download location. If the files he uses /are/ publicly available, so far we don't know where they are stored.
The abbyy.xml is the source for Ken's fromabbyy.php

-- 
Edward.