Roger,

This looks interesting. I like the idea of having an intermediate file you can correct before generating the HTML. I'm still plugging away at the Bhagavata Purana text file, but when it's done it should be a good test for your conversion. It has family trees, tables, footnotes, poetry, and who knows what else.

James Simmons


On Wed, Jan 4, 2012 at 6:22 AM, Roger Frank <rfrank@rfrank.net> wrote:
When I look at a page from a book, I can figure out what is
poetry, what is a block quote, what is a chapter title, and
so forth. I got to thinking: if I can do that, why can't a
computer? How powerful it would be, especially for a new
post processor, if all they had to do was get the text to
match the book and have the computer make all the decisions
to make the HTML and other formats! I've been experimenting
with that and have generated a few books for PG using a
“zero” markup language.

The books I work on tend to be simple and the computer
doesn't have to think very hard to recognize the
typographical constructions. Thanks to BB for his
test-suite-2012.txt file, which is a good collection of what
my generator should handle. I have some coding to do to be
able to handle the more complicated situations.

For those who are interested in technical details, here is
more information. The way I did this was to generate an
intermediate file showing results of the analysis by the
computer. Here is a paragraph from the intermediate file:

p    | It was not Sammy who awoke the next time, but
p    | Tess. She became wide awake in a moment, hearing
p    | a sound from somewhere outside of the cave.
p    | She sat up to hear it repeated.

and here is poetry:

v    |   “‘Katie Beardie had a grice,
v    |     It could skate upon the ice;
v    |   Wasna that a dainty grice?
v    |     Dance, Katie Beardie!

This allows me to see how accurate the program is and adjust
the coding appropriately. The marks in the left column
describe the text in the right column, and this file is then
used to generate the HTML or other output formats. As of
now, it's a two-step process: generating the intermediate
file marked up as shown above and then generating the output
files. A key
point here is that everything to the right of the vertical
bar is text exactly as a non-technical post-processor would
want to produce it. There is essentially no markup to the
right of the vertical bar.
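
As a rough illustration of the second step, here is a minimal Python sketch that reads lines in the "mark | text" layout shown above and emits HTML. It is not Roger's generator. The function names (parse_line, to_html), the choice of <pre class="verse"> for poetry, and the rule that a blank line ends a block are all assumptions made for the example; only the "p" and "v" marks come from the samples above.

```python
import html
from itertools import groupby


def parse_line(line):
    """Split 'mark | text' at the first vertical bar.

    Returns (None, '') for lines without a bar (assumed to be separators).
    """
    if "|" not in line:
        return None, ""
    mark, _, text = line.partition("|")
    # Drop only the single space after the bar so verse indentation survives.
    if text.startswith(" "):
        text = text[1:]
    return mark.strip(), text.rstrip("\n")


def to_html(lines):
    """Group consecutive lines that share a mark and render each group."""
    records = [parse_line(line) for line in lines]
    out = []
    for mark, group in groupby(records, key=lambda r: r[0]):
        if mark is None:
            continue  # blank separator between blocks
        texts = [html.escape(text, quote=False) for _, text in group]
        if mark == "p":
            # Paragraph: rejoin the wrapped lines into one run of text.
            out.append("<p>" + " ".join(t.strip() for t in texts) + "</p>")
        elif mark == "v":
            # Verse: keep line breaks and leading spaces as written.
            out.append('<pre class="verse">' + "\n".join(texts) + "</pre>")
        else:
            # Marks not covered by this sketch are flagged, not guessed at.
            out.append("<!-- unhandled mark: " + mark + " -->")
    return "\n".join(out)


sample = [
    "p    | She sat up to hear it repeated.",
    "",
    "v    |   Wasna that a dainty grice?",
    "v    |     Dance, Katie Beardie!",
]
print(to_html(sample))
```

Because the text to the right of the bar is left untouched, correcting a wrong decision only means changing the mark in the left column and running the second step again.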

To contrast this with what I think it would be in z.m.l.,
consider a signature line on a letter that is in a block
quote, which is fairly common in the books I work on. With
"zero" markup, everything to the right of the vertical bar
would be just as it appeared in the original book. In the
left margin would be "b-0" which means the computer decided
it was a block quote and right justified, indented by 0
spaces. In z.m.l., I believe this would have been
"~tab~~tab~~tab~name~tab~". I believe that for many people
who might try post-processing, the simpler form is more
approachable.
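
To make the left-margin mark concrete, here is a small Python sketch of how a mark such as "b-0" might be split into its parts. The exact grammar of the marks is not given above, so this is a guess: it assumes a leading letter code for the kind of block, an optional "-" meaning right justified, and a trailing number for the indent in spaces. The names Mark and parse_mark are hypothetical.

```python
import re
from typing import NamedTuple


class Mark(NamedTuple):
    kind: str              # e.g. "b" for block quote, "p" for paragraph
    right_justified: bool  # assumed to be signalled by the "-"
    indent: int            # indent in spaces; 0 when no number is given


# Assumed grammar: letters, then optionally "-" followed by digits.
_MARK_RE = re.compile(r"^([a-z]+)(?:(-)(\d+))?$")


def parse_mark(mark):
    """Parse a left-column mark under the assumed grammar above."""
    match = _MARK_RE.match(mark.strip())
    if match is None:
        raise ValueError("unrecognized mark: %r" % mark)
    kind, dash, number = match.groups()
    return Mark(kind, dash is not None, int(number) if number else 0)


print(parse_mark("b-0"))
print(parse_mark("p"))
```

Whatever the real encoding is, the point stands: the formatting decision lives entirely in the mark, and the signature text itself stays exactly as printed in the book.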

My “zero markup” is not robust. In every book I've produced
with this so far, I have had to edit the intermediate file.
But the goal of having the computer make human-like
decisions about formatting seems so worthwhile that I will
continue to pursue it.

--Roger
