
On 2011-11-22 10:07, Lee Passey wrote:
These are certainly /some/ interesting attributes, but they are by no means the /only/ interesting attributes. At least as interesting are "suspicious" and "proofed". And what the heck do the values for "l", "t", "r" and "b" mean?
Coordinates of the character in the source image. l: left, t: top, r: right, b: bottom
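For anyone who wants to see them, here is a minimal Python sketch that pulls those boxes out of the XML (the file name is hypothetical, and matching tags by local name is just a way to sidestep the ABBYY namespace):

    import xml.etree.ElementTree as ET

    tree = ET.parse("book_abbyy.xml")        # hypothetical file name
    for elem in tree.iter():
        if elem.tag.rsplit("}", 1)[-1] == "charParams":   # ignore namespace
            box = tuple(int(elem.get(k)) for k in ("l", "t", "r", "b"))
            print(repr(elem.text), box)      # the character and its pixel box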
You can use wordStart to find the line-ending hyphenation.
Actually, you can't. If a <charParams> element contains a single hyphen, /and/ if it is the last <charParams> element of a <line> element, /then/ you can conclude that it is line-ending hyphenation. But just knowing it is line-ending isn't enough. The FineReader interface (and the HTML file generated from the "fromabbyy.php" script) also indicates whether line-ending hyphenation is hard or soft. In the file I'm looking at right now, I can't see any way to distinguish between the hyphen in "hard-working" (which is hard) and the hyphen in "prop-erty" (which is soft). How do I use the Abbyy markup to make this distinction?
The 'w' in hard-working (the first character after the line break) is tagged with wordStart="true", which marks a hard hyphen. The 'e' in prop-erty is tagged with wordStart="false" because the second line continues the same word, so the hyphen is soft.
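So the test is on the first character of the /following/ line. A rough sketch in Python (the naive pairing of consecutive lines and the namespace handling are my assumptions):

    import xml.etree.ElementTree as ET

    def local(tag):
        return tag.rsplit("}", 1)[-1]        # strip the ABBYY namespace

    tree = ET.parse("book_abbyy.xml")        # hypothetical file name
    lines = [e for e in tree.iter() if local(e.tag) == "line"]

    for cur, nxt in zip(lines, lines[1:]):   # naive: ignores block/page breaks
        chars = [c for c in cur.iter() if local(c.tag) == "charParams"]
        after = [c for c in nxt.iter() if local(c.tag) == "charParams"]
        if not chars or not after or chars[-1].text != "-":
            continue                         # line does not end in a hyphen
        # wordStart on the first character of the next line decides the kind
        kind = "hard" if after[0].get("wordStart") == "true" else "soft"
        print(kind, "".join(c.text or "" for c in chars)[-10:])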
With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface that could mimic the behavior of the FineReader interface.
I think the hard part is how to store the edits.
Storing the edits is the easiest part (depending, of course, on how the edits are created). All you need is an unambiguous way to link a character in the editor with a character in the file. The "l,t,r,b" attributes may provide just such a unique identifier. It's also possible that such a thing could be accomplished using XPath. Setting the "proofed" attribute on a <charParams> element should be a signal that "a human being has accepted this, don't mess with it again."
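A sketch of what I have in mind, using the box as the key (Python; the edit format and the coordinates are invented for illustration):

    import xml.etree.ElementTree as ET

    def local(tag):
        return tag.rsplit("}", 1)[-1]        # strip the ABBYY namespace

    def apply_edit(tree, box, new_char):
        # `box` is the (l, t, r, b) tuple, as strings, that the editor sent
        # back with the correction -- my assumed edit format.
        for elem in tree.iter():
            if local(elem.tag) != "charParams":
                continue
            if tuple(elem.get(k) for k in ("l", "t", "r", "b")) == box:
                elem.text = new_char
                elem.set("proofed", "true")  # "a human has accepted this"
                return True
        return False

    tree = ET.parse("book_abbyy.xml")        # hypothetical file name
    if apply_edit(tree, ("120", "88", "131", "102"), "e"):  # made-up box
        tree.write("book_abbyy.xml")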
When I built my proof-of-concept web-based editing system last year, I simply split the document into page-sized HTML files and passed an entire page file to Kupu, which is a JavaScript-based HTML editor (part of one of the Apache projects, I forget which). When the user was done editing and pressed the "Save" button, the HTML document was posted back to the server, where a very simple Java servlet committed the file into a CVS repository. The working directory of the repository was set as an Apache document directory, so CVS would maintain a versioning history (to protect against cyber-vandalism) while the web server would always return the most current version.
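The plumbing is small enough to sketch. Here is the same idea as a toy Python server rather than a Java servlet (it assumes the cvs client is on the PATH, the page name arrives in the URL, and the document root is a checked-out working copy):

    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DOCROOT = "/var/www/pages"   # CVS working copy, also served by Apache

    class SaveHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            name = self.path.lstrip("/")     # e.g. POST /page0042.html
            body = self.rfile.read(int(self.headers["Content-Length"]))
            with open(f"{DOCROOT}/{name}", "wb") as f:
                f.write(body)                # overwrite the working copy
            # every save becomes a CVS revision, as with the servlet
            subprocess.run(["cvs", "commit", "-m", "web edit", name],
                           cwd=DOCROOT, check=True)
            self.send_response(204)          # saved; nothing to return
            self.end_headers()

    HTTPServer(("", 8080), SaveHandler).serve_forever()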
Saving incremental edits is a problem that has been solved over and over again. By leveraging other people's efforts, I'm sure that this would be the easiest part of the entire scheme.
Can you maintain the coordinates of words on the source image using this scheme?
The abbyy.xml is the source for Ken's fromabbyy.php
If you say so, I've no choice but to believe you. But the output from the script has FineReader fingerprints all over it. It is so similar to the HTML output I see directly from FineReader that I have a hard time believing that FineReader itself did not generate the HTML. Why would the script take well-formed XML and convert it to non-well-formed SGML/HTML?
If the input to the "fromabbyy.php" script is a *_abbyy.xml file, then the script must be passing that file to the FineReader engine to produce the HTML output. How is that done? I own a copy of FineReader; can I do it too?
The script does not pass the file to the FineReader engine to produce the HTML output. -- Edward.