
On 2011-11-22 10:07, Lee Passey wrote:
These are certainly /some/ interesting attributes, but they are by no means the /only/ interesting attributes. At least as interesting are "suspicious" and "proofed". And what the heck do the values for "l", "t", "r" and "b" mean?
Coordinates of the character in the source image. l: left, t: top, r: right, b: bottom
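For anyone who wants to see them, here is a minimal Python sketch that pulls those boxes out of the XML (the file name is hypothetical, and matching tags by local name is just a way to sidestep the ABBYY namespace):

    import xml.etree.ElementTree as ET

    tree = ET.parse("book_abbyy.xml")        # hypothetical file name
    for elem in tree.iter():
        if elem.tag.rsplit("}", 1)[-1] == "charParams":   # ignore namespace
            box = tuple(int(elem.get(k)) for k in ("l", "t", "r", "b"))
            print(repr(elem.text), box)      # the character and its pixel box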
You can use wordStart to find the line-ending hyphenation.
Actually, you can't. If a <charParams> element contains a single hyphen, /and/ if it is the last <charParams> element of a <line> element, /then/ you can conclude that it is line-ending hyphenation. But just knowing it is line-ending isn't enough. The FineReader interface (and the HTML file generated from the "fromabbyy.php" script) also indicates whether line-ending hyphenation is hard or soft. In the file I'm looking at right now, I can't see any way to distinguish between the hyphen in "hard-working" (which is hard) and the hyphen in "prop-erty" (which is soft). How do I use the Abbyy markup to make this distinction?
The 'w' in hard-working (the first character after the line break) is tagged with wordStart="true", which marks a hard hyphen. The 'e' in prop-erty is tagged with wordStart="false" because the second line continues the same word, so the hyphen is soft.
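So the test is on the first character of the /following/ line. A rough sketch in Python (the naive pairing of consecutive lines and the namespace handling are my assumptions):

    import xml.etree.ElementTree as ET

    def local(tag):
        return tag.rsplit("}", 1)[-1]        # strip the ABBYY namespace

    tree = ET.parse("book_abbyy.xml")        # hypothetical file name
    lines = [e for e in tree.iter() if local(e.tag) == "line"]

    for cur, nxt in zip(lines, lines[1:]):   # naive: ignores block/page breaks
        chars = [c for c in cur.iter() if local(c.tag) == "charParams"]
        after = [c for c in nxt.iter() if local(c.tag) == "charParams"]
        if not chars or not after or chars[-1].text != "-":
            continue                         # line does not end in a hyphen
        # wordStart on the first character of the next line decides the kind
        kind = "hard" if after[0].get("wordStart") == "true" else "soft"
        print(kind, "".join(c.text or "" for c in chars)[-10:])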
With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface that could mimic the behavior of the FineReader interface.
I think the hard part is how to store the edits.
Storing the edits is the easiest part (depending, of course, on how the edits are created). All you need is an unambiguous way to link a character in the editor with a character in the file. The "l,t,r,b" attributes may provide just such a unique identifier. It's also possible that such a thing could be accomplished using XPath. Setting the "proofed" attribute on a <charParams> element should be a signal that "a human being has accepted this, don't mess with it again."
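A sketch of what I have in mind, using the box as the key (Python; the edit format and the coordinates are invented for illustration):

    import xml.etree.ElementTree as ET

    def local(tag):
        return tag.rsplit("}", 1)[-1]        # strip the ABBYY namespace

    def apply_edit(tree, box, new_char):
        # `box` is the (l, t, r, b) tuple, as strings, that the editor sent
        # back with the correction -- my assumed edit format.
        for elem in tree.iter():
            if local(elem.tag) != "charParams":
                continue
            if tuple(elem.get(k) for k in ("l", "t", "r", "b")) == box:
                elem.text = new_char
                elem.set("proofed", "true")  # "a human has accepted this"
                return True
        return False

    tree = ET.parse("book_abbyy.xml")        # hypothetical file name
    if apply_edit(tree, ("120", "88", "131", "102"), "e"):  # made-up box
        tree.write("book_abbyy.xml")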
When I built my proof-of-concept web-based editing system last year, I simply split the document into page-sized HTML files and passed an entire page file to Kupu, which is a JavaScript-based HTML editor (part of one of the Apache projects, I forget which). When the user was done editing and pressed the "Save" button, the HTML document was posted back to the server, where a very simple Java servlet committed the file into a CVS repository. The working directory of the repository was set as an Apache document directory, so CVS would maintain a versioning history (to protect against cyber-vandalism) while the web server would always return the most current version.
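The plumbing is small enough to sketch. Here is the same idea as a toy Python server rather than a Java servlet (it assumes the cvs client is on the PATH, the page name arrives in the URL, and the document root is a checked-out working copy):

    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DOCROOT = "/var/www/pages"   # CVS working copy, also served by Apache

    class SaveHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            name = self.path.lstrip("/")     # e.g. POST /page0042.html
            body = self.rfile.read(int(self.headers["Content-Length"]))
            with open(f"{DOCROOT}/{name}", "wb") as f:
                f.write(body)                # overwrite the working copy
            # every save becomes a CVS revision, as with the servlet
            subprocess.run(["cvs", "commit", "-m", "web edit", name],
                           cwd=DOCROOT, check=True)
            self.send_response(204)          # saved; nothing to return
            self.end_headers()

    HTTPServer(("", 8080), SaveHandler).serve_forever()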
Saving incremental edits is a problem that has been solved over and over again. By leveraging other people's efforts, I'm sure that this would be the easiest part of the entire scheme.
Can you maintain the coordinates of words on the source image using this scheme?
The abbyy.xml is the source for Ken's fromabbyy.php
If you say so, I've no choice but to believe you. But the output from the script has FineReader fingerprints all over it. It is so similar to the HTML output I see directly from FineReader that I have a hard time believing that FineReader itself did not generate the HTML. Why would the script take well-formed XML and convert it to non-well-formed SGML/HTML?
If the input to the "fromabbyy.php" script is a *_abbyy.xml file, then the script must be passing that file to the FineReader engine to produce the HTML output. How is that done? I own a copy of FineReader; can I do it too?
The script does not pass the file to the FineReader engine to produce the HTML output. -- Edward.