
On Mon, November 21, 2011 4:19 pm, Edward Betts wrote: [NB: all of Mr. Betts' reply has been included here, even those parts which I am not replying to, so everyone can get the benefit of his response.]
Hi Lee, I sent the message below to the mailing list; it is waiting to be moderated.
On 21/11/11 12:25, Lee Passey wrote:
This is interesting, but not exactly what I had in mind. What Mr. Betts has developed here (I use the past tense, because I obviously don't know what he has in mind for the future, only what has already been accomplished) is another mechanism for comparing a line of text with a segment of an image that purports to represent that text.
Still working on it. The code is here: https://github.com/edwardbetts/corrections
You can now login using an Open Library or Internet Archive account. Saving works, but fixes aren't yet visible on the edit screen. I need to add more editing beyond changing single words.
My emphasis is on building something that will save edits and maybe generate epubs using these fixes. I'm keen to retain word coordinates with edits, which makes things harder. With coordinates we can highlight corrected words in the original page images when using the Internet Archive book reader's search and read-aloud features.
Having carefully looked through the contents of one of the "*_abbyy.xml" files made available on IA, I have come to the conclusion that the results of Ken H.'s script are /not/ derived from the data in those files. Not only does Ken's script preserve geometry information (which permits a "click on a word, see it highlighted on the scan" function), it also shows which words FineReader could not find in its dictionary, and which words FineReader was uncertain of. It also marks line-ending hyphenation, and whether FineReader considered it "soft" hyphenation, which should be removed if word wrapping no longer puts it at the end of a line, or "hard" hyphenation, which should be displayed in all cases. I suspect that the file also contains other useful information that I simply haven't discovered yet.
All this information is in the abbyy file; it is in the charParams tags, for example:
<charParams l="316" t="51" r="336" b="76" suspicious="true" wordStart="true" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="50" serifProbability="12" wordPenalty="36" meanStrokeWidth="107">J</charParams> <charParams l="331" t="46" r="343" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="53" serifProbability="100" wordPenalty="36" meanStrokeWidth="107">t</charParams> <charParams l="339" t="57" r="348" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="44" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">i</charParams> <charParams l="343" t="53" r="384" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="48" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">o</charParams> <charParams l="379" t="46" r="402" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="40" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">K</charParams>
The interesting attributes are:
wordStart: tells you if abbyy thinks a character is the first letter in a word
wordFromDictionary: tells you if the word is in the abbyy dictionary
wordIdentifier: tells you if abbyy thinks the word is some kind of identifier
charConfidence: gives a score for how confident abbyy is about the character
These are certainly /some/ interesting attributes, but they are by no means the /only/ interesting attributes. At least as interesting are "suspicious" and "proofed". And what the heck do the values for "l", "t", "r" and "b" mean? The complete schema definition for the Abbyy.xml file can be found at http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml, but it suffers from the serious flaw that most schema definitions do: while it explains what the various components /are/, it contains little explanation of what those components /mean/. I'm pretty sure FineReader understands what all those elements mean, but is there some official way for /me/ to learn what they mean?
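In the meantime, here is my best guess at how to put those attributes to use, as a short Python sketch (the file name and function names are my own inventions, not anything from Ken's or Mr. Betts' code). It walks the charParams elements, uses wordStart to group characters into words, and treats l/t/r/b as the left/top/right/bottom pixel coordinates of each character's bounding box; that reading of the coordinates is an assumption on my part, not something I have seen documented.

import xml.etree.ElementTree as ET

def local(tag):
    # tags arrive as '{namespace}charParams'; keep only the local name
    return tag.rsplit('}', 1)[-1]

def words(abbyy_path):
    """Yield (word_text, attributes_of_first_char) for each word."""
    word, first = [], None
    for _, elem in ET.iterparse(abbyy_path):
        if local(elem.tag) != 'charParams':
            continue
        ch = elem.text or ''
        if not ch.strip():          # skip the inter-word space characters
            continue
        if elem.get('wordStart') == 'true' and word:
            yield ''.join(word), first
            word = []
        if not word:
            first = dict(elem.attrib)
        word.append(ch)
    if word:
        yield ''.join(word), first

for text, attrs in words('example_abbyy.xml'):
    # l/t/r/b appear to be left/top/right/bottom pixel coordinates of the
    # first character's bounding box; charConfidence is a 0-100 score
    if attrs.get('wordFromDictionary') == 'false':
        print(text, attrs.get('l'), attrs.get('t'), attrs.get('charConfidence'))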
You can use wordStart to find the line-ending hyphenation.
Actually, you can't. If a <charParams> element contains a single hyphen, /and/ if it is the last <charParams> element of a <line> element, /then/ you can conclude that it is line-ending hyphenation. But just knowing it is line-ending isn't enough. The FineReader interface (and the HTML file generated from the "fromabbyy.php" script) also indicates whether line-ending hyphenation is hard or soft. In the file I'm looking at right now, I can't see any way to distinguish between the hyphen in "hard-working" (which is hard) and the hyphen in "prop-erty" (which is soft). How do I use the Abbyy markup to make this distinction?
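The detection rule itself is simple enough to sketch (again in Python, with invented file and function names; the element names come from the schema). Note that it finds the line-ending hyphens without telling you whether they are hard or soft, which is exactly the problem:

import xml.etree.ElementTree as ET

def local(tag):
    return tag.rsplit('}', 1)[-1]

def line_ending_hyphens(abbyy_path):
    """Yield the final charParams of each line whose last character is '-'."""
    tree = ET.parse(abbyy_path)
    for line in tree.iter():
        if local(line.tag) != 'line':
            continue
        chars = [c for c in line.iter() if local(c.tag) == 'charParams']
        if chars and chars[-1].text == '-':
            yield chars[-1]

for hyphen in line_ending_hyphens('example_abbyy.xml'):
    # hard or soft? nothing in these attributes seems to say
    print('line-ending hyphen, box:', hyphen.attrib)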
With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface that could mimic the behavior of the FineReader interface.
I think the hard part is how to store the edits.
Storing the edits is the easiest part (depending, of course, on how the edits are created). All you need is an unambiguous way to link a character in the editor with a character in the file. The "l,t,r,b" attributes may provide just such a unique identifier; it's also possible that such a thing could be accomplished using XPath. Setting the "proofed" attribute on a <charParams> element should be a signal that "a human being has accepted this, don't mess with it again."

When I built my proof-of-concept web-based editing system last year, I simply split the document into page-sized HTML files and passed an entire page file to Kupu, a JavaScript-based HTML editor (part of one of the Apache projects, I forget which). When the user was done editing and pressed the "Save" button, the HTML document was posted back to the server, where a very simple Java servlet committed the file to a CVS repository. The working directory of the repository was set as an Apache document directory, so CVS would maintain a version history (to protect against cyber-vandalism) while the web server would always return the most current version.

Saving incremental edits is a problem that has been solved over and over again. By leveraging other people's efforts, I'm sure that this would be the easiest part of the entire scheme.
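As a rough illustration of the linking idea (the function name, the sample file name, and the example correction are all invented; I am also assuming the file's namespace matches the schema URL given earlier), an edit could be expressed as a page index plus an (l,t,r,b) box and applied like this:

import xml.etree.ElementTree as ET

# keep the file's own namespace on output, assuming it matches the schema URL
ET.register_namespace('', 'http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml')

def local(tag):
    return tag.rsplit('}', 1)[-1]

def apply_edit(abbyy_path, page_index, box, new_char):
    # box is an (l, t, r, b) tuple of attribute strings, exactly as they
    # appear in the file, serving as the unique identifier for one character
    tree = ET.parse(abbyy_path)
    pages = [e for e in tree.iter() if local(e.tag) == 'page']
    for c in pages[page_index].iter():
        if local(c.tag) != 'charParams':
            continue
        if (c.get('l'), c.get('t'), c.get('r'), c.get('b')) == box:
            c.text = new_char
            c.set('proofed', 'true')  # "a human being has accepted this"
            break
    tree.write(abbyy_path, encoding='utf-8', xml_declaration=True)

# purely for illustration: correct the suspicious "J" from the sample above
apply_edit('example_abbyy.xml', 0, ('316', '51', '336', '76'), 'I')

Addressing the character by its raw coordinates avoids any dependence on element order; an XPath expression pinpointing the same node would serve just as well.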
I'm forced to conclude that the source for Ken's fromabbyy.php script is /not/ any of the files available at the standard download location. If the files he uses /are/ publicly available, so far we don't know where they are stored.
The abbyy.xml file is the source for Ken's fromabbyy.php.
If you say so, I've no choice but to believe you. But the output from the script has FineReader fingerprints all over it. It is so similar to the HTML output I see directly from FineReader that I have a hard time believing that FineReader itself did not generate the HTML. Why would the script take well-formed XML and convert it to non-well-formed SGML/HTML? If the input to the "fromabbyy.php" script is a *_abbyy.xml file, then the script must be passing that file to the FineReader engine to produce the HTML output. How is that done? I own a copy of FineReader; can I do it too?