
On Thu, November 10, 2011 4:42 pm, Alex Buie wrote:
> On Wed, Nov 9, 2011 at 12:28 AM, Lee Passey <lee@novomail.net> wrote:
>> In the IA output, I'm discovering that that data has been preserved. I think with some effort, it would be possible to use this data to build a web interface substantially identical to the proofreading interface provided by FineReader.
>> So Alex, all that talk a while back about how I wanted a "leaner, meaner" file? Forget about it. I think I like it just the way it is. I can select out what I need, and it has some potential.
> Here's something someone at the archive is working on (after hours, since it's not an official project yet). He'd love to hear your thoughts.
> http://edwardbetts.com/correct
> (Note you can't actually submit edits yet, but he's working on getting there)
> Already I find that proofing like that is a LOT easier than proofing on pgdp, but maybe that's me.
This is interesting, but not exactly what I had in mind. What Mr. Betts has developed here (I use the past tense because I obviously don't know what he has in mind for the future, only what has already been accomplished) is another mechanism for comparing a line of text with a segment of an image that purports to represent that text. Even when the text becomes editable, you still only have a mechanism to help ensure that the ASCII encoding of an image is accurate. This is just a refinement of the Distributed Proofreaders proofing model, variations of which others have done before.

On the other hand, if you have used the Abbyy FineReader interface, you know it provides two side-by-side windows, one containing an image of a scanned page and the other a representation of the text derived from the OCR. It is more than just a method of editing the text, however. For one thing, it shows paragraphs as paragraphs, usually indented. It can also highlight, using various user-definable colors, words that it could not find in its dictionary and "uncertain" words (FineReader's "best guess," but not meeting a certain threshold of reliability). When you select a word in the editing window, the image of that word is highlighted on the scanned page, and usually there is a "zoom" window at the bottom of the screen that shows the selected text on a single "zoomed" line.

Having carefully looked through the contents of one of the "*_abbyy.xml" files made available on IA, I have come to the conclusion that the results of Ken H.'s script are /not/ derived from the data in those files. Not only does Ken's script preserve geometry information (which permits a "click on a word and see it highlighted on the scan" function), it also shows which words FineReader could not find in its dictionary and which words FineReader was uncertain of. It also marks line-ending hyphenation, and whether FineReader considered it "soft" hyphenation, which should be removed if word wrapping no longer puts it at the end of a line, or "hard" hyphenation, which should be displayed in all cases. I suspect that the file also contains other useful information that I simply haven't discovered yet.

With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface which could mimic the behavior of the FineReader interface; I've appended a couple of rough sketches of what I have in mind. It is also interesting to note that the script output is /not/ valid XHTML, but it is virtually identical to the SGML/HTML produced by FineReader.

I'm forced to conclude that the source for Ken's fromabbyy.php script is /not/ any of the files in the standard download location. If the files he uses /are/ publicly available, we don't yet know where they are stored. For the last couple of weeks the script has been unresponsive. It would be nice if we could encourage Ken to troubleshoot the script so that it works again; better still would be getting access to the files which form the input source for his script in the first place.
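
To make the "click on a word and see it highlighted on the scan" idea a little more concrete, here is a rough sketch of how the geometry data could drive such an interface. I'm inventing the markup: the data-* attribute names, the coordinate units (pixels in the original scan), and the class names are my own assumptions for illustration, not what Ken's script or the abbyy.xml files actually emit. The point is only that any per-word bounding box is enough.

  <img id="scan" src="page0042.jpg">

  <div id="text">
    <!-- one span per OCR word; l/t/r/b give the word's box on the scan -->
    <span class="word" data-l="312" data-t="508" data-r="401" data-b="540">receive</span>
    ...
  </div>

  <!-- the box we move around on top of the scan -->
  <div id="highlight" style="position:absolute; border:2px solid red; display:none;"></div>

  <script>
  var scan = document.getElementById('scan');
  var box  = document.getElementById('highlight');

  document.getElementById('text').addEventListener('click', function (e) {
    var w = e.target;
    if (!w.classList || !w.classList.contains('word')) return;

    // scale the word's scan coordinates to the size the image is displayed at
    var scale = scan.clientWidth / scan.naturalWidth;
    var pos   = scan.getBoundingClientRect();

    box.style.left    = (window.scrollX + pos.left + w.dataset.l * scale) + 'px';
    box.style.top     = (window.scrollY + pos.top  + w.dataset.t * scale) + 'px';
    box.style.width   = ((w.dataset.r - w.dataset.l) * scale) + 'px';
    box.style.height  = ((w.dataset.b - w.dataset.t) * scale) + 'px';
    box.style.display = 'block';
  });
  </script>

The dictionary and "uncertain" flags could be handled the same way: put a class on the span and let a stylesheet color it, much as FineReader's user-definable highlighting does.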
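The hyphenation marks are what would let such an interface show paragraphs as paragraphs rather than as raw OCR lines. Again, the field names below are invented and the code is untested; it just illustrates the rule I described: drop a "soft" hyphen and rejoin the word, keep a "hard" hyphen but still join the two halves.

  // Rejoin OCR lines into paragraph text. Each line is assumed to arrive
  // as { text: "...", softHyphen: true|false } -- my own invented shape,
  // not the actual output of Ken's script.
  function joinLines(lines) {
    var text = '';
    lines.forEach(function (line) {
      if (line.softHyphen) {
        // soft hyphen: an artifact of the original line wrap, so drop it
        // and glue the two halves of the word back together
        text += line.text.replace(/-$/, '');
      } else if (/-$/.test(line.text)) {
        // hard hyphen: part of the word itself, so keep it, but still
        // join without an intervening space
        text += line.text;
      } else {
        text += line.text + ' ';
      }
    });
    return text.trim();
  }

  // joinLines([{text: 'under-',        softHyphen: true},
  //            {text: 'stand twenty-', softHyphen: false},
  //            {text: 'five rules',    softHyphen: false}])
  //   ==> "understand twenty-five rules"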