
On Thu, December 8, 2011 12:14 am, Keith J. Schultz wrote:
On December 8, 2011, at 03:39, Lee Passey wrote:
Here's my challenge to anyone who thinks this is an easy nut to crack.
I will take a moderately complex e-book, one containing at least as much markup as PG 31103 (which BowerBird seems to have a great deal of respect for) but also containing lists and tables (I'm thinking maybe the Autobiography of Benjamin Franklin, or maybe Pudd'nhead Wilson). I will mark that book up with reStructuredText, Markdown, and z.m.l. (just to keep the playing field level, I will start with identical HTML in all cases and do an automated conversion, sketched below).
I will pay $100 to the first person who can come up with a computer program that can reconstruct substantially identical HTML from those three text formats plus the Project Gutenberg version, without knowing which file is which and without human intervention.
When you have your program in a form that can be run by an independent third party, I will provide the files for testing.
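(For the automated conversion I have in mind nothing fancier than pandoc, which can already turn HTML into Markdown and reStructuredText; z.m.l. would need a hand-rolled script, since pandoc knows nothing about it. A rough sketch, assuming pandoc is installed and on the PATH:)

    #!/usr/bin/env python3
    """Sketch of the 'automated conversion' step: one HTML source file is
    converted to Markdown and reStructuredText via pandoc.  z.m.l. has no
    off-the-shelf converter, so that step is left as a stub."""

    import subprocess
    import sys
    from pathlib import Path

    def convert(html_path: Path) -> None:
        for fmt, suffix in (("markdown", ".md"), ("rst", ".rst")):
            out = html_path.with_suffix(suffix)
            # pandoc reads the HTML and writes the lighter-weight markup
            subprocess.run(
                ["pandoc", str(html_path), "-f", "html", "-t", fmt, "-o", str(out)],
                check=True,
            )
        # z.m.l. would need a hand-rolled converter of its own.

    if __name__ == "__main__":
        convert(Path(sys.argv[1]))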
Hi Lee,
For $100??? You can't be serious! Make that $1000 and I might think about it!
Fair enough. $1000 donated in your name to the non-profit organization of your choice. I believe that this kind of program is possible, and $1000 is about all my wife will let me put up. Maybe others would like to help out with their own pledges to sweeten the pot? But one further condition: once validated, the source code must be placed into the public domain and made available at an accessible location on the internet.
If you had said substantially identical screen output, that would be simple enough. But substantially identical HTML? That is a very tall order. The problem is not the algorithm as such, but the heuristics involved in transforming the input into an intermediate form.
True. And yet, that is exactly what BowerBird has said: 1. it is possible, 2. it is easy, and 3. he has already done it. If he has, and it can be verified, then the $1000 is his.
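For what it is worth, the "intermediate form" Keith mentions need not be anything exotic; I picture a small document tree along these lines (a sketch of my own, not anything BowerBird has described):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        """One structural unit of the document, independent of the input markup."""
        kind: str                      # "heading", "paragraph", "list", "table", ...
        level: int = 0                 # heading depth, list nesting, etc.
        text: str = ""                 # inline content, if any
        children: List["Node"] = field(default_factory=list)

    # Every input -- RST, Markdown, z.m.l., or bare PG text -- would be parsed
    # into a tree of Nodes; a single serializer then walks the tree and emits
    # the HTML, so the output is identical no matter what came in.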
Anyone who has used commercial web editors can easily see that simple conversions can cause quite drastic changes in the mark-up from just subtle differences. The rendering in a browser will give you an almost identical display, yet the mark-up elements may be completely different.
But I'm not interested in display, I'm interested in document structure. Let me say right now that an HTML document which consists entirely of <p> elements with style attributes will not satisfy the requirements. Headers must be constructed as headers, lists must be constructed as lists, tables must be constructed as tables, and so on.
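To make that criterion concrete, a crude acceptance check (a sketch of my own, not part of the official rules) might simply refuse any output that contains no real structural elements:

    from html.parser import HTMLParser
    from collections import Counter

    STRUCTURAL = {"h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li",
                  "table", "tr", "td", "th", "blockquote"}

    class TagCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.counts = Counter()
        def handle_starttag(self, tag, attrs):
            self.counts[tag] += 1

    def looks_structural(html_text: str) -> bool:
        """True if the document uses real structural elements, not just styled <p>s."""
        counter = TagCounter()
        counter.feed(html_text)
        return any(counter.counts[t] for t in STRUCTURAL)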
What would have to be done is a style analysis of the input. Maybe one could render the input to PDF and run that through the kind of style analyzer an OCR engine uses. Sounds promising.
FineReader actually does a pretty good job of intuiting document structure from its OCRed images, so if you could duplicate the kind of heuristics that FineReader performs you might be well on your way. (FineReader has the additional advantage of being able to detect font sizes, weights and styles, data which is unavailable in most simplified text markup.) The problem as I see it is that for these kinds of heuristics to be useful you must first normalize the display, which leads you back to the first stumbling block.

I suppose one approach may be to develop heuristics to determine what kind of /markup/ had been applied to a file, and then branch into code specific to that markup (converting one variety of markup to another is a fairly straightforward process). The kicker comes when the program has to convert the Project Gutenberg file, which has /inconsistent/ markup, if it has any at all. But the whole goal of this exercise is to create a /generic/ program which can be used to convert /any/ markup, or /no/ markup, to uniformly consistent HTML (which can then be used to create ePub, Kindle, eReader, etc. formats).
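To give a flavor of the "detect the markup, then branch" idea, the first stage might be no smarter than this (a rough sketch only; the real work is in the branch that handles inconsistent or absent markup, and z.m.l. detection is left out entirely):

    import re

    def sniff_markup(text: str) -> str:
        """Guess which markup a file uses; crude line-based heuristics only.
        Anything unrecognized is treated as plain (PG-style) text."""
        if re.search(r"<(p|h[1-6]|table|ul|ol)\b", text, re.IGNORECASE):
            return "html"
        if re.search(r"^\.\. \w+::", text, re.MULTILINE):
            return "rst"                 # reStructuredText directives
        if re.search(r"^#{1,6} |\[[^\]]+\]\([^)]+\)", text, re.MULTILINE):
            return "markdown"            # ATX headings or inline links
        return "plain"

    def to_html(text: str) -> str:
        """Branch into format-specific code once the format is known."""
        fmt = sniff_markup(text)
        if fmt == "html":
            return text
        # Each branch would be its own converter; the "plain" branch is where
        # all the ugly heuristics for inconsistent PG texts would have to live.
        raise NotImplementedError(f"no converter written yet for {fmt!r}")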