
john, thanks for joining the betty lee fun parade! :+)

can we retire the "mount high horse" header, though? it was an honorific for 2011, intended to end an era...

i've looked at your work, and have a few notes for you.

first of all, those hyphens are not "spurious", they're the hyphens which actually exist in the printed book. but yes, you want to get rid of them, and here's how... (first i'll tell you that i will make web-apps available to perform these routines for people automatically, so it's not like you really _need_ this documentation, but i happily tell you this information if you want it.)

here's how you might wanna rework a .zml text-file...

1. a hyphen at the very end of the line is a "soft" one, meaning it can -- should -- be eliminated on unwrap. you can do this programmatically. in python, it'd be:

   thebook = re.sub("-\n", "", thebook)

but note that you _must_ attend to em-dashes _first_:

   thebook = re.sub("--\n", "—", thebook)

2. if there's a hyphen-tilde combo at the end of a line, this indicates a _hard_ hyphen that should be retained. the tilde is eliminated. (think of it as "sacrificing itself", with a heroic gesture, so that the hyphen can be saved.)

   thebook = re.sub("-~\n", "-", thebook)

3. you'll also find cases of a tilde-tilde pair at line-end, or -- to be more accurate -- at the end of a _paragraph_. this tilde-tilde pair indicates a doublequote mark which was _dropped_ because the speaker's dialog _continues_ in the next paragraph. (this will enable you to perform a check on the balancing of doublequotes, but otherwise, the only thing to do with the tilde-tilde pair is delete it.)

4. you'll need to do the other lines to finish an unwrap. all lines should be unwrapped _except_ blank lines and any lines which start with a space as their first character. i do this as a multi-step procedure in my wordprocessor, but i'm sure some reg-ex person can "show us the way". any linebreak with whitespace on either side is retained; all other linebreaks are deleted. shouldn't be that hard. (but if no one spills, i will give you my python routines. a regex sketch follows this list.)

5. "{{" lines give the filename of the _scan_ for that page. "[[" lines are the pagenumber of the page, printed or not. the "[[" lines can be preceded by up to 3 blank lines, and _all_ of those blank lines must be deleted on an unwrap, as should the "[[" lines themselves (although you _might_ wanna save some kind of reference to the pagenumber). "{{" lines should be deleted as well, of course, and they are always followed by at least _one_ blank line, which must also _always_ be deleted. sometimes there'll be _more_ than one blank line following a "{{" line, _but_ additional blank lines after the first one _must_ be retained. you'll understand these rules _implicitly_ once you've been told the _reasons_ for them, but that's for later.

6. looks like you figured out everything else you need. i especially like how you handled the letter from rose... the p-book didn't really set it off much, so neither did i, but i generally think that such material should be set off. as criticism, i'd say that your table of contents is skimpy.
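(pulling rules 1 through 5 together, here's a minimal python sketch -- one reading of the rules above, not the promised routines. the page-furniture patterns and the choice to join unwrapped lines with a space are guesses at the details.)

   import re

   def unwrap_zml(thebook):
       # rule 5 first, so page furniture never sits between a hyphen
       # and its continuation word on the next page:
       # "[[" pagenumber lines, plus up to 3 blank lines before them
       thebook = re.sub(r"\n{1,4}\[\[[^\n]*\n", "\n", thebook)
       # "{{" scan-filename lines, plus exactly one following blank line
       thebook = re.sub(r"^\{\{[^\n]*\n\n", "", thebook, flags=re.M)
       # rule 1, em-dash case first, so "-\n" can't eat half of "--\n"
       thebook = re.sub("--\n", "—", thebook)
       # rule 2: the hard hyphen is kept; the tilde sacrifices itself
       thebook = re.sub("-~\n", "-", thebook)
       # rule 1: a soft hyphen at line-end is dropped on unwrap
       thebook = re.sub("-\n", "", thebook)
       # rule 3: delete the tilde-tilde pair at paragraph-end
       thebook = re.sub(r"~~$", "", thebook, flags=re.M)
       # rule 4: join a linebreak only when there is no whitespace on
       # either side of it (a space is inserted to keep the joined
       # words apart, assuming wrapped lines carry no trailing spaces)
       thebook = re.sub(r"(?<=\S)\n(?=\S)", " ", thebook)
       return thebook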
> The conversions are with a python script, as you will have guessed. If you want it, you can have it---Now!
you guys should take up john on his open-source offer.
> The HTML looks generally acceptable to me, but the images are missing.
the scans are in the same directory as the .zml file.
i leave my book subdirectories (under "go") wide open, and name my files intelligently, so it should all be clear.
> BB, could you be persuaded to add the HTML to your treasury, so that we can see the result in context?
do you want me to add _my_ .html file to my site? if so, then that will be coming very soon, john, yes. or do you want me to add _your_ .html to my site? i'm willing to do that, if you'd consider it an honor, or something, but i'll assure you that it's really not. your .html version was serviceable, but i believe you could probably do a better job on a conversion now, by using the information which i've given you above. a book this simple isn't really much of a test, however. try working on the test-suite which you can find here:
-bowerbird

When I look at a page from a book, I can figure out what is poetry, what is a block quote, what is a chapter title, and so forth. I got to thinking: if I can do that, why can't a computer? How powerful it would be, especially for a new post-processor, if all they had to do was get the text to match the book and have the computer make all the decisions to produce the HTML and other formats! I've been experimenting with that and have generated a few books for PG using a “zero” markup language.

The books I work on tend to be simple and the computer doesn't have to think very hard to recognize the typographical constructions. Thanks to BB for his test-suite-2012.txt file, which is a good collection of what my generator should handle. I have some coding to do to be able to handle the more complicated situations.

For those who are interested in technical details, here is more information. The way I did this was to generate an intermediate file showing the results of the analysis by the computer. Here is a paragraph from the intermediate file:

p | It was not Sammy who awoke the next time, but
p | Tess. She became wide awake in a moment, hearing
p | a sound from somewhere outside of the cave.
p | She sat up to hear it repeated.

and here is poetry:

v | “‘Katie Beardie had a grice,
v | It could skate upon the ice;
v | Wasna that a dainty grice?
v | Dance, Katie Beardie!

This allows me to see how accurate the program is and adjust the coding appropriately. The marks in the left column describe the text in the right column, and then this file generates the HTML or other output formats. As of now, it's a two-step process: generating the intermediate file marked up as shown above and then generating the output files. A key point here is that everything to the right of the vertical bar is text exactly as a non-technical post-processor would want to produce it. There is essentially no markup to the right of the vertical bar.

To contrast this with what I think it would be in z.m.l., consider a signature line on a letter that is in a block quote, which is fairly common in the books I work on. With “zero” markup, everything to the right of the vertical bar would be just as it appeared in the original book. In the left margin would be "b-0", which means the computer decided it was a block quote, right-justified, indented by 0 spaces. In z.m.l., I believe this would have been "~tab~~tab~~tab~name~tab~." I believe that for many people who might try post-processing, the simpler form is more approachable.

My “zero markup” is not robust. In every book I've produced with this so far, I have had to edit the intermediate file. But the goal of having the computer make human-like decisions about formatting seems so worthwhile that I will continue to pursue it.

--Roger
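(For a feel of the first step, a toy classifier in that spirit might look like the sketch below. The length-based verse heuristic and the function name are invented for illustration; they are not the actual program's rules.)

   def classify_blocks(text, verse_max_len=45):
       # Emit 'p | ...' for prose and 'v | ...' for verse, one mark
       # per line, in the style of the intermediate file shown above.
       out = []
       for block in text.split("\n\n"):
           lines = [ln for ln in block.splitlines() if ln.strip()]
           if not lines:
               continue
           # toy heuristic: a run of uniformly short lines is probably
           # poetry; real heuristics would also weigh indentation,
           # capitalization, and line-length variance
           is_verse = len(lines) >= 2 and all(len(ln) <= verse_max_len for ln in lines)
           mark = "v" if is_verse else "p"
           out.extend(f"{mark} | {ln.strip()}" for ln in lines)
           out.append("")
       return "\n".join(out)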

Roger,

This looks interesting. I like the idea of having an intermediate file you can correct before generating the HTML. I'm still plugging away at the Bhagavata Purana text file, but when it's done it should be a good test for your conversion. It has family trees, tables, footnotes, poetry, and who knows what else.

James Simmons

Hi Roger,

I believe I can help you with setting up the heuristics you will be needing, as this kind of work is right up my alley. If you care to, you can back-channel me. What I am uncertain about is whether you are using scans or text files.

Regards,
Keith

On Jan 5, 2012, at 2:30 AM, Keith J. Schultz wrote:
> I believe I can help you with setting up the heuristics you will be needing, as this kind of work is right up my alley ... What I am uncertain about is whether you are using scans or text files.
Hi Keith. I'm using a single text file. In particular, I made it work with the concatenated text file that comes out of DP. My goal is to make it easier for someone to get into post-processing. I know many want to but are overwhelmed by PPing--especially the formats other than text.

The format recognition code actually works pretty well already, and I'm surprised how fast I can go from DP text file to HTML. I haven't written back-end generators for other formats, but that would be next. Since it is a two-step process and since I can manually mark up the infrequent special cases, it's not that important to work on the heuristics to get it all just right automagically. It would be important if this were going to be applied as part of a script to a whole catalog collection, which I won't be doing anytime soon.

So thanks for the offer to improve the first pass (text to 'p-code' equivalent). I could send you the source code for this (Python 3) anytime; I would just have to document the intermediate form, and you could make it more accurate. However, unless that is just academically interesting to you, it's probably not necessary.

--Roger
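(For the curious, the second step might look something like this minimal sketch. The mark set, the tag choices, and the function name are assumptions for illustration; only 'p' and 'v' appear in the examples earlier in the thread, and the real generator surely handles much more.)

   import html

   def intermediate_to_html(lines):
       # Collect consecutive lines carrying the same mark into one
       # block, then emit a <p> for prose or a verse <div> for poetry.
       out, buf, mark = [], [], None

       def flush():
           if not buf:
               return
           if mark == "v":
               body = "<br/>\n".join(html.escape(l) for l in buf)
               out.append('<div class="verse">\n' + body + '\n</div>')
           else:
               out.append("<p>" + html.escape(" ".join(buf)) + "</p>")
           buf.clear()

       for line in lines:
           if "|" not in line:   # a blank or unmarked line ends a block
               flush()
               mark = None
               continue
           m, _, text = line.partition("|")
           m = m.strip()
           if m != mark:
               flush()
               mark = m
           buf.append(text.strip())
       flush()
       return "\n".join(out)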
participants (4):
- Bowerbird@aol.com
- James Simmons
- Keith J. Schultz
- Roger Frank