
Hi Lee,

For $100??? You can't be serious! Make that $1000 and I might think about it!

If you had said substantially identical screen output, that would be simple enough. But substantially identical HTML is a very tall order.

The problem is not the algorithm as such, but the heuristics involved in transforming the input into an intermediate form. Anyone who has used commercial web editors can easily see that simple conversions can cause quite drastic changes in the mark-up from just subtle differences. The rendering in a browser will give you an almost identical display, yet the mark-up elements may be completely different.

What would have to be done is a style analysis of the input. Maybe one could render the input to PDF and run that through a style analyzer for OCR. Sounds promising.

regards
Keith.

On 08.12.2011 at 03:39, Lee Passey wrote:
Here's my challenge to anyone who thinks this is an easy nut to crack.
I will take a moderately complex e-book (containing at least as much markup as PG 31103, which BowerBird seems to have a great deal of respect for) but also containing lists and tables (I'm thinking maybe the Autobiography of Benjamin Franklin, or maybe Pudd'nhead Wilson). I will mark those books up with reStructuredText, Markdown, and z.m.l. (just to keep the playing field level, I will start with identical HTML in all cases, and do an automated conversion).
I will pay $100 to the first person who can come up with a computer program that can reconstruct substantially identical HTML from those three text formats plus the Project Gutenberg version, without knowing which file is which and without human intervention.
When you have your program in a form that can be run by an independent third party, I will provide the files for testing.
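
As a small illustration of the gap between "substantially identical screen output" and "substantially identical HTML" raised in the reply above, here is a minimal Python sketch. The two HTML fragments, the `TextExtractor` class, and the `visible_text` helper are made up for this example and are not taken from any of the books or tools mentioned; comparing extracted text is only a rough stand-in for comparing what a browser would display.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only character data and ignores every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace runs, roughly as a browser does for ordinary text.
    return " ".join("".join(parser.chunks).split())


# Two invented fragments: same words, same italic emphasis, different mark-up.
variant_a = "<p><em>Poor Richard</em> says</p>"
variant_b = '<div class="para"><span style="font-style: italic">Poor Richard</span> says</div>'

same_text = visible_text(variant_a) == visible_text(variant_b)
same_markup = variant_a == variant_b
print(same_text, same_markup)
```

Both fragments yield the same text, yet the mark-up strings differ, so a program scored on HTML identity would have to guess which of many equivalent encodings the original author chose.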