Re: [gutvol-d] seriously, feel free to ask any questions

because, you know, if this kind of text-to-html thing really _was_ "impossible" -- as you were once told -- then it seems we're now doing something "impossible", and you should be asking questions, like "how did you do that?" and "what kind of magic are you using there?"

but i think we all know that this isn't really "impossible", and it never was. ok, i suppose before .html was invented, people would have looked at you strangely if you had told them that they could "convert" plain-text to "html", but still...

***

indeed, you know that #013 post i sent out this morning? it had this in it:
now, i didn't specify that u.r.l. as a link. i didn't create any "markup" around it or anything; i just copied it in. but take a look at that message on the pglaf website:
http://lists.pglaf.org/mailman/private/gutvol-d/2011-December/008447.html

what you will see is that the u.r.l. to the wikipedia page was turned into a link. you can click on it, and it'll work. that is, [a href=] anchor markup was inserted around it, so that your web-browser would treat it as an active link.

so _that_ is an example of a text-to-html conversion! indeed, the entire plain-text contents of that e-mail to this listserve were converted from text into .html, so that it could be mounted on the web. imagine that!

and not only that, but the entire plain-text message of _every_ e-mail to this listserve is _routinely_ converted into .html, so the post can be mounted in the archive...

do you want to have to code your e-mail in .html format? heck no! but do you want clickable links? heck yes! do you want archives available on the web? heck yes! and -- ever since august 2004 -- the archives _have_ been available on the web... (due to some sloppiness, the archives between 2006 and 2009 were _lost_, but the e-mail during that time had been converted to .html.)

the package which does this conversion is "pipermail" (the mailman edition), and it's open-source software. pipermail includes other "conversion" routines too...
for instance, because right angle-brackets preceded each line here, these lines will be rendered in italics: the conversion routine presumes they should be treated as a blockquote, since that's a common convention in e-mail apps.
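to make the mechanics concrete, here's a minimal sketch -- in python -- of the _kind_ of rewriting an archiver like pipermail does. (the regex and the tag choices here are my illustrative guesses, not pipermail's actual code.)

    import re
    import html

    URL_RE = re.compile(r'(https?://[^\s<>"]+)')

    def text_to_html(message):
        """convert a plain-text e-mail body to simple html:
        escape markup characters, wrap bare u.r.l.s in anchor
        tags, and italicize '>'-quoted lines as blockquote text."""
        out = []
        for line in message.splitlines():
            if line.startswith('>'):
                # e-mail quoting convention: render quoted lines in italics
                out.append('<i>%s</i><br>' % html.escape(line.lstrip('> ')))
            else:
                body = html.escape(line)
                # wrap anything that looks like a u.r.l. in [a href=] markup
                body = URL_RE.sub(r'<a href="\1">\1</a>', body)
                out.append(body + '<br>')
        return '\n'.join(out)

    print(text_to_html('see http://www.gutenberg.org today\n> a quoted line'))

that's the whole trick: a handful of line-level conventions, mechanically rewritten.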
so it's not as if this is some exotic new capability which has only recently become manifest in our consciousness. on this very listserve, where people were insisting that conversion from plain-text into .html was "impossible", _the_archive_itself_ proved those people were _stupid_, and it used their own ignorant e-mails as its "pudding".

so yeah, i think that all of this is fairly _obvious_, if you just think about it for half-a-second or so. but _really_, if you have _any_ questions, do feel free to ask them, ok?

-bowerbird

p.s. if you're interested in a rst-2-epub converter, try:

On 12/7/2011 4:47 PM, Bowerbird@aol.com wrote:
on this very listserve, where people were insisting that conversion from plain-text into .html was "impossible",
1. All z.m.l. can be easily converted to HTML.
2. All z.m.l. is plain text.
Therefore:
3. All plain text can easily be converted to HTML.

1. Socrates is a man.
2. All men are mortal.
Therefore:
3. All men are Socrates.

Same, same. Even if I admit BowerBird's second postulate (which I do not), he has still constructed a logical fallacy.

Now I don't claim that it is impossible to convert all of PG's e-texts to meaningful HTML. It is, however, very, very difficult, probably requiring a Watson-like computer system (hardware + software). All BowerBird has done is demonstrate that one markup system can be converted to another markup system if the systems are "close enough" (the question of whether z.m.l. is "close enough" to HTML is left as an exercise for the reader).

Here's my challenge to anyone who thinks this is an easy nut to crack. I will take a moderately complex e-book (containing at least as much markup as PG 31103, which BowerBird seems to have a great deal of respect for) but also containing lists and tables (I'm thinking maybe the Autobiography of Benjamin Franklin, or maybe Pudd'nhead Wilson). I will mark those books up with reStructuredText, Markdown, and z.m.l. (just to keep the playing field level, I will start with identical HTML in all cases and do an automated conversion).

I will pay $100 to the first person who can come up with a computer program that can reconstruct substantially identical HTML from those three text formats plus the Project Gutenberg version, without knowing which file is which and without human intervention.

When you have your program in a form that can be run by an independent third party, I will provide the files for testing.
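To see concretely why "substantially identical HTML" is a high bar, compare what two off-the-shelf converters emit for the same one-line document. (A sketch, assuming the python-markdown and docutils packages are installed; the outputs shown in the comments are approximate.)

    # pip install markdown docutils  (assumed available)
    import markdown
    from docutils.core import publish_parts

    source = "A Heading\n=========\n"

    # python-markdown reads the '=' underline as a setext heading
    print(markdown.markdown(source))
    # -> <h1>A Heading</h1>

    # docutils promotes the same line to a document title and wraps it
    # in its own scaffolding, so the markup differs structurally
    print(publish_parts(source, writer_name="html")["html_body"])
    # -> <div class="document" id="a-heading"><h1 class="title">A Heading</h1>...

Even before z.m.l. or the Project Gutenberg file enter the picture, two "close enough" systems already disagree about the skeleton of the document.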

Hi Lee,

For $100??? You can't be serious! Make that $1000 and I might think about it!

If you had said substantially identical screen output, that would be simple enough. But substantially identical HTML? That is a very tall order. The problem is not the algorithm as such, but the heuristics involved in transforming the input into an intermediate form.

Anyone who has used commercial web editors can easily see that simple conversions can cause quite drastic changes in the mark-up from just subtle differences. The rendering in a browser will give you almost identical display, yet the mark-up elements may be completely different.

What would have to be done is a style analysis of the input. Maybe one could render the input to PDF and run that through a style analyzer for OCR. Sounds promising.

regards
Keith.

On 08.12.2011 at 03:39, Lee Passey wrote:
Here's my challenge to anyone who thinks this is an easy nut to crack.
I will take a moderately complex e-book (containing at least as much markup as PG 31103, which BowerBird seems to have a great deal of respect for) but also containing lists and tables (I'm thinking maybe the Autobiography of Benjamin Franklin, or maybe Pudd'nhead Wilson). I will mark those books up with reStructuredText, Markdown, and z.m.l. (just to keep the playing field level, I will start with identical HTML in all cases and do an automated conversion).
I will pay $100 to the first person who can come up with a computer program that can reconstruct substantially identical HTML from those three text formats plus the Project Gutenberg version, without knowing which file is which and without human intervention.
When you have your program in a form that can be run by an independent third party, I will provide the files for testing.

On Thu, December 8, 2011 12:14 am, Keith J. Schultz wrote:
On 08.12.2011 at 03:39, Lee Passey wrote:
Here's my challenge to anyone who thinks this is an easy nut to crack.
I will take a moderately complex e-book (containing at least as much markup as PG 31103, which BowerBird seems to have a great deal of respect for) but also containing lists and tables (I'm thinking maybe the Autobiography of Benjamin Franklin, or maybe Pudd'nhead Wilson). I will mark those books up with reStructuredText, Markdown, and z.m.l. (just to keep the playing field level, I will start with identical HTML in all cases and do an automated conversion).
I will pay $100 to the first person who can come up with a computer program that can reconstruct substantially identical HTML from those three text formats plus the Project Gutenberg version, without knowing which file is which and without human intervention.
When you have your program in a form that can be run by an independent third party, I will provide the files for testing.
Hi Lee,
For $100??? You can't be serious! Make that $1000 and I might think about it!
Fair enough. $1000 donated in your name to the non-profit organization of your choice. I believe that this kind of program is possible, and $1000 is about all my wife will let me put up. Maybe others would like to help out with their own pledges to sweeten the pot?

But one further condition: once validated, the source code must be placed into the public domain and posted in an accessible location on the internet.
If you had said substantially identical screen output, that would be simple enough. But substantially identical HTML? That is a very tall order. The problem is not the algorithm as such, but the heuristics involved in transforming the input into an intermediate form.
True. And yet that is exactly what BowerBird has said: 1. it is possible, 2. it is easy, and 3. he has done it. If he has, and it can be verified, then the $1000 is his.
Anyone who has used commercial web editors can easily see that simple conversions can cause quite drastic changes in the mark-up from just subtle differences. The rendering in a browser will give you almost identical display, yet the mark-up elements may be completely different.
But I'm not interested in display, I'm interested in document structure. Let me say right now that an HTML document which consists entirely of <p> elements with style attributes will not satisfy the requirements. Headers must be constructed as headers, lists must be constructed as lists, tables must be constructed as tables, and so on.
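One way to make that requirement testable is to take a census of the element types in the output. A rough sketch using only the Python standard library (the tag list and the pass/fail rule are my own assumptions, not part of the challenge):

    from html.parser import HTMLParser
    from collections import Counter

    STRUCTURAL = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
                  'ul', 'ol', 'li', 'table', 'tr', 'td', 'th', 'blockquote'}

    class TagCensus(HTMLParser):
        """Count the element types used in an HTML document."""
        def __init__(self):
            super().__init__()
            self.tags = Counter()
        def handle_starttag(self, tag, attrs):
            self.tags[tag] += 1

    def is_structural(html_text):
        """Reject documents that are nothing but styled <p> soup:
        require at least one genuine structural element."""
        census = TagCensus()
        census.feed(html_text)
        return any(census.tags[t] for t in STRUCTURAL)

    print(is_structural('<p style="font-size:2em">Chapter I</p>'))    # False
    print(is_structural('<h1>Chapter I</h1><p>It was a dark...</p>')) # True

A real validator would compare the reconstructed document tree against the reference HTML, but even this crude census is enough to rule out all-<p> output.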
What would have to be done is a style analysis of the input. Maybe one could render the input to PDF and run that through a style analyzer for OCR. Sounds promising.
FineReader actually does a pretty good job of intuiting document structure from its OCRed images, so if you could duplicate the kind of heuristics that FineReader performs you might be well on your way. (FineReader has the additional advantage of being able to detect font sizes, weights, and styles, data which is unavailable in most simplified text markup.) The problem as I see it is that for these kinds of heuristics to be useful you must first normalize the display, which leads you back to the first stumbling block.

I suppose one approach may be to develop heuristics to determine what kind of /markup/ had been applied to a file, and then branch into code specific to that markup (converting one variety of markup to another is a fairly straightforward process). The kicker comes when that program has to convert the Project Gutenberg file, which has /inconsistent/ markup, if it has any at all.

But the whole goal of this exercise is to create a /generic/ program which can be used to convert /any/ markup, or /no/ markup, to uniformly consistent HTML (which can then be used to create ePub, Kindle, eReader, etc. formats).
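The "detect the markup, then branch" idea could start with nothing fancier than feature counting. A sketch under that assumption (the features and the scoring are illustrative guesses, not a tested classifier; z.m.l. is omitted because its conventions aren't specified here):

    import re

    def guess_markup(text):
        """Score a plain-text file for tell-tale markup features and
        return the best guess: 'rst', 'markdown', or 'unknown'."""
        scores = {'rst': 0, 'markdown': 0}
        # reStructuredText: directives and section underlines
        scores['rst'] += len(re.findall(r'^\.\. \w+::', text, re.M))
        scores['rst'] += len(re.findall(r'^={3,}\s*$', text, re.M))
        # Markdown: ATX headings and inline link syntax
        scores['markdown'] += len(re.findall(r'^#{1,6} ', text, re.M))
        scores['markdown'] += len(re.findall(r'\[[^\]]+\]\([^)]+\)', text))
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else 'unknown'

    # dispatch into a converter specific to the detected markup
    # (the converters themselves are left as stubs here)
    CONVERTERS = {'rst': lambda t: '...', 'markdown': lambda t: '...'}

A Project Gutenberg file with inconsistent or no markup would score near zero on every feature and fall out as 'unknown', which is exactly the kicker described above.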

I will bet $100 against nothing for the first person on this forum who makes a new tool, with source publicly available, after this date which actually helps the efforts of PG/DP volunteer transcribers to be more fun and more successful. Incremental improvements on existing tools do not count. The winner of the bet simply makes such a new tool as wins wide acclaim on PG and/or DP -- whether I agree with that tool or not.

If I pay the $100 then I win -- because I want PG/DP transcribers to be more successful and have more fun. If I don't pay the $100 then I win -- because I make the point that there is a ton of hell and fury on this forum which amounts to *no-thing.*
participants (4)
- Bowerbird@aol.com
- James Adcock
- Keith J. Schultz
- Lee Passey