
On Thu, December 8, 2011 12:14 am, Keith J. Schultz wrote:
On December 8, 2011, at 03:39, Lee Passey wrote:
Here's my challenge to anyone who thinks this is an easy nut to crack.
I will take a moderately complex e-book, one containing at least as much markup as PG 31103 (which BowerBird seems to have a great deal of respect for) but also containing lists and tables (I'm thinking maybe the Autobiography of Benjamin Franklin, or maybe Pudd'nhead Wilson). I will mark that book up with reStructuredText, Markdown, and z.m.l. (just to keep the playing field level, I will start with identical HTML in all cases and do an automated conversion, sketched below).
I will pay $100 to the first person who can come up with a computer program that can reconstruct substantially identical HTML from those three text formats plus the Project Gutenberg version, without knowing which file is which and without human intervention.
When you have your program in a form that can be run by an independent third party, I will provide the files for testing.
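(For the automated conversion I have in mind nothing fancier than pandoc, which can already turn HTML into Markdown and reStructuredText; z.m.l. would need a hand-rolled script, since pandoc knows nothing about it. A rough sketch, assuming pandoc is installed and on the PATH:)

    #!/usr/bin/env python3
    """Sketch of the 'automated conversion' step: one HTML source file is
    converted to Markdown and reStructuredText via pandoc.  z.m.l. has no
    off-the-shelf converter, so that step is left as a stub."""

    import subprocess
    import sys
    from pathlib import Path

    def convert(html_path: Path) -> None:
        for fmt, suffix in (("markdown", ".md"), ("rst", ".rst")):
            out = html_path.with_suffix(suffix)
            # pandoc reads the HTML and writes the lighter-weight markup
            subprocess.run(
                ["pandoc", str(html_path), "-f", "html", "-t", fmt, "-o", str(out)],
                check=True,
            )
        # z.m.l. would need a hand-rolled converter of its own.

    if __name__ == "__main__":
        convert(Path(sys.argv[1]))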
Hi Lee,
For $100??? You can't be serious! Make that $1000 and I might think about it!
Fair enough. $1000 donated in your name to the non-profit organization of your choice. I believe that this kind of program is possible, and $1000 is about all my wife will let me put up. Maybe others would like to help out with their own pledges to sweeten the pot? But one further condition: once validated, the source code must be placed into the public domain and made available at an accessible location on the internet.
If you had said substantially identical screen output, that would be simple enough. But substantially identical HTML? That is a very tall order. The problem is not the algorithm as such, but the heuristics involved in transforming the input into an intermediate form.
True. And yet, that is exactly what BowerBird has said: 1. it is possible, 2. it is easy, and 3. he has already done it. If he has, and it can be verified, then the $1000 is his.
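For what it is worth, the "intermediate form" Keith mentions need not be anything exotic; I picture a small document tree along these lines (a sketch of my own, not anything BowerBird has described):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        """One structural unit of the document, independent of the input markup."""
        kind: str                      # "heading", "paragraph", "list", "table", ...
        level: int = 0                 # heading depth, list nesting, etc.
        text: str = ""                 # inline content, if any
        children: List["Node"] = field(default_factory=list)

    # Every input -- RST, Markdown, z.m.l., or bare PG text -- would be parsed
    # into a tree of Nodes; a single serializer then walks the tree and emits
    # the HTML, so the output is identical no matter what came in.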
Anyone who has used commercial web editors can easily see that simple conversions can cause quite drastic changes in the mark-up from just subtle differences. The rendering in a browser will give you an almost identical display, yet the mark-up elements may be completely different.
But I'm not interested in display, I'm interested in document structure. Let me say right now that an HTML document which consists entirely of <p> elements with style attributes will not satisfy the requirements. Headers must be constructed as headers, lists must be constructed as lists, tables must be constructed as tables, and so on.
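To make that criterion concrete, a crude acceptance check (a sketch of my own, not part of the official rules) might simply refuse any output that contains no real structural elements:

    from html.parser import HTMLParser
    from collections import Counter

    STRUCTURAL = {"h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li",
                  "table", "tr", "td", "th", "blockquote"}

    class TagCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.counts = Counter()
        def handle_starttag(self, tag, attrs):
            self.counts[tag] += 1

    def looks_structural(html_text: str) -> bool:
        """True if the document uses real structural elements, not just styled <p>s."""
        counter = TagCounter()
        counter.feed(html_text)
        return any(counter.counts[t] for t in STRUCTURAL)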
What would have to be done is a style analysis of the input. Maybe one could render the input to PDF and run that through the kind of style analyzer an OCR engine uses. Sounds promising.
FineReader actually does a pretty good job of intuiting document structure from its OCRed images, so if you could duplicate the kind of heuristics that FineReader performs you might be well on your way. (FineReader has the additional advantage of being able to detect font sizes, weights and styles, data which is unavailable in most simplified text markup.) The problem as I see it is that for these kinds of heuristics to be useful you must first normalize the display, which leads you back to the first stumbling block.

I suppose one approach may be to develop heuristics to determine what kind of /markup/ had been applied to a file, and then branch into code specific to that markup (converting one variety of markup to another is a fairly straightforward process). The kicker comes when the program has to convert the Project Gutenberg file, which has /inconsistent/ markup, if it has any at all. But the whole goal of this exercise is to create a /generic/ program which can be used to convert /any/ markup, or /no/ markup, to uniformly consistent HTML (which can then be used to create ePub, Kindle, eReader, etc. formats).
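To give a flavor of the "detect the markup, then branch" idea, the first stage might be no smarter than this (a rough sketch only; the real work is in the branch that handles inconsistent or absent markup, and z.m.l. detection is left out entirely):

    import re

    def sniff_markup(text: str) -> str:
        """Guess which markup a file uses; crude line-based heuristics only.
        Anything unrecognized is treated as plain (PG-style) text."""
        if re.search(r"<(p|h[1-6]|table|ul|ol)\b", text, re.IGNORECASE):
            return "html"
        if re.search(r"^\.\. \w+::", text, re.MULTILINE):
            return "rst"                 # reStructuredText directives
        if re.search(r"^#{1,6} |\[[^\]]+\]\([^)]+\)", text, re.MULTILINE):
            return "markdown"            # ATX headings or inline links
        return "plain"

    def to_html(text: str) -> str:
        """Branch into format-specific code once the format is known."""
        fmt = sniff_markup(text)
        if fmt == "html":
            return text
        # Each branch would be its own converter; the "plain" branch is where
        # all the ugly heuristics for inconsistent PG texts would have to live.
        raise NotImplementedError(f"no converter written yet for {fmt!r}")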