
Roger,

This looks interesting. I like the idea of having an intermediate file you can correct before generating the HTML. I'm still plugging away at the Bhagavata Purana text file, but when it's done it should be a good test for your conversion. It has family trees, tables, footnotes, poetry and who knows what else.

James Simmons

On Wed, Jan 4, 2012 at 6:22 AM, Roger Frank <rfrank@rfrank.net> wrote:
When I look at a page from a book, I can figure out what is poetry, what is a block quote, what is a chapter title, and so forth. I got to thinking: if I can do that, why can't a computer? How powerful it would be, especially for a new post-processor, if all they had to do was get the text to match the book and have the computer make all the decisions to make the HTML and other formats! I've been experimenting with that and have generated a few books for PG using a “zero” markup language.
The books I work on tend to be simple and the computer doesn't have to think very hard to recognize the typographical constructions. Thanks to BB for his test-suite-2012.txt file, which is a good collection of what my generator should handle. I have some coding to do to be able to handle the more complicated situations.
For those that are interested in technical details, here is more information. The way I did this was to generate an intermediate file showing results of the analysis by the computer. Here is a paragraph from the intermediate file:
p | It was not Sammy who awoke the next time, but
p | Tess. She became wide awake in a moment, hearing
p | a sound from somewhere outside of the cave.
p | She sat up to hear it repeated.
and here is poetry:
v | “‘Katie Beardie had a grice,
v | It could skate upon the ice;
v | Wasna that a dainty grice?
v | Dance, Katie Beardie!
This allows me to see how accurate the program is and adjust the coding appropriately. The marks in the left column describe the text in the right column, and the program then generates the HTML or other output formats from this file. As of now, it's a two-step process: generating the intermediate file marked up as shown above and then generating the output files. A key point here is that everything to the right of the vertical bar is text exactly as a non-technical post-processor would want to produce it. There is essentially no markup to the right of the vertical bar.
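For anyone who wants to picture the second step, here is a minimal sketch in Python of reading such an intermediate file and turning it into HTML. It is not the actual generator. Only the "p" and "v" marks come from the samples above; the function names, the rule that a blank line ends a block, and the HTML chosen for each mark are all assumptions made for illustration.

```python
import html


def parse_intermediate(text):
    """Group consecutive lines that share a mark into (mark, [lines]) blocks.

    Assumption: each line looks like "mark | text", and a blank line
    ends the current block.
    """
    blocks = []
    prev_mark = None
    for raw in text.splitlines():
        if not raw.strip() or "|" not in raw:
            prev_mark = None  # blank or unmarked line: close the block
            continue
        mark, _, content = raw.partition("|")
        mark = mark.strip()
        content = content.strip()
        if mark == prev_mark:
            blocks[-1][1].append(content)
        else:
            blocks.append((mark, [content]))
        prev_mark = mark
    return blocks


def render_html(blocks):
    """Render blocks as HTML; the tags used here are illustrative choices."""
    out = []
    for mark, lines in blocks:
        escaped = [html.escape(line, quote=False) for line in lines]
        if mark == "p":
            # Paragraph lines are rejoined into running text.
            out.append("<p>" + " ".join(escaped) + "</p>")
        elif mark == "v":
            # Verse keeps its line breaks.
            out.append('<div class="poem">' + "<br>".join(escaped) + "</div>")
        else:
            out.append("<!-- unhandled mark: " + html.escape(mark) + " -->")
    return "\n".join(out)


sample = """p | It was not Sammy who awoke the next time, but
p | Tess. She became wide awake in a moment, hearing

v | Wasna that a dainty grice?
v | Dance, Katie Beardie!
"""

blocks = parse_intermediate(sample)
page = render_html(blocks)
print(page)
```

Because the text to the right of the bar is left alone until rendering, a correction made by hand in the intermediate file only ever touches the mark in the left column.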
To contrast this to what I think it would be in z.m.l., consider a signature line on a letter that is in a block quote, which is fairly common in the books I work on. With "zero" markup, everything to the right of the vertical bar would be just as it appeared in the original book. In the left margin would be "b-0" which means the computer decided it was a block quote and right justified, indented by 0 spaces. In z.m.l., I believe this would have been "~tab~~tab~~tab~name~tab~." I believe that for many people who might try post-processing, the simpler form is more approachable.
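A mark like "b-0" carries two pieces of information, a kind and an indent. As a rough sketch only, it could be split apart as below. The general shape assumed here (lowercase letters, optionally followed by a hyphen and a number) is a guess based on the three marks mentioned in this message, "p", "v" and "b-0", and parse_mark is a hypothetical name, not part of the real program.

```python
import re


def parse_mark(mark):
    """Split a mark such as "b-0" into (kind, indent).

    Assumed shape: letters, then an optional "-<number>" indent.
    Marks without an indent, such as "p", return None for it.
    """
    m = re.fullmatch(r"([a-z]+)(?:-(\d+))?", mark.strip())
    if m is None:
        raise ValueError("unrecognized mark: %r" % mark)
    kind, indent = m.group(1), m.group(2)
    return kind, (int(indent) if indent is not None else None)
```

With this, parse_mark("b-0") gives the kind "b" with an indent of 0, while parse_mark("v") gives the kind "v" with no indent.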
My “zero markup” is not robust. In every book I've produced with this so far, I have had to edit the intermediate file. But the goal, having the computer make human-like decisions about formatting, seems so worthwhile that I will continue to pursue it.
--Roger
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d