
[Same message as sent earlier, but without attachments just in case]

On 1/24/2012 1:46 AM, Marcello Perathoner wrote:
On 01/24/2012 12:25 AM, Lee Passey wrote:
On Mon, January 23, 2012 3:48 pm, Marcello Perathoner wrote:
Please show me how.
You do it the same way you do TEI, except you map the tags differently.
Ok. Now put that into code, runnable on an ubuntu box,
I see. You don't just want me to show you how; you actually want me to do all the work. I'm not sure I want to go to that much effort simply to demonstrate to you that it's possible. What I will do for you is send you the C++ code that you can compile and install yourself.

Attached are two zip files: one contains an early release of a C++ version of Tidy (circa 2002), the other contains the additional files used to create the html2txt executable. html2txt.zip contains a file named "filelist.txt" which lists the files from each archive necessary to build the program. I neglected at that time to add the "readme.txt" to the zip file, so I am attaching it here separately. I don't know if the gutvol-d list software strips off attachments or not (if it doesn't, it should). If anyone else would like this code and it doesn't come through, contact me directly.

The theory of operation of converting HTML to text is really quite simple, and there are plenty of ways to skin this particular cat. If I were doing this again I would probably use Java, as it has all the DOM parsing and manipulating functions necessary, if not built in then readily available. With Java it could easily be done in a couple of hundred lines of code and would "run everywhere." The method is so simple and straightforward that probably even BowerBird could do it in Python, and I'm sure it's doable as an XSL script as well.
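
For anyone who wants to see the general idea without building the C++ code, here is a rough Python sketch of the "map the tags" approach. It is not the html2txt program and makes no claim to match its output. The names BLANK_LINES_BEFORE, SKIPPED, HtmlToText and html_to_text are made up for illustration, and the blank-line counts per tag are assumptions, not anything the white-washers have specified. It uses only the standard library (html.parser and textwrap).

```python
from html.parser import HTMLParser
import textwrap

# Assumed mapping: how many empty lines to emit before each block-level tag.
# These numbers are illustrative only; adjust them to taste.
BLANK_LINES_BEFORE = {"h1": 4, "h2": 4, "h3": 2, "p": 1, "blockquote": 1, "li": 1}

# Elements whose text content should never reach the output.
SKIPPED = {"head", "script", "style"}


class HtmlToText(HTMLParser):
    """Collects (blank_lines_before, text) pairs, one per block element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished blocks
        self.current = []    # text fragments of the block being collected
        self.pending = 1     # empty lines to put before the next block
        self.skip_depth = 0  # > 0 while inside a SKIPPED element

    def flush(self):
        # Collapse all runs of whitespace, including source line breaks.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append((self.pending, text))
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in BLANK_LINES_BEFORE:
            self.flush()
            self.pending = BLANK_LINES_BEFORE[tag]

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLANK_LINES_BEFORE:
            self.flush()
            self.pending = 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)


def html_to_text(html, width=72):
    """Convert an HTML string to wrapped plain text."""
    parser = HtmlToText()
    parser.feed(html)
    parser.close()
    parser.flush()
    out = []
    for i, (blanks, text) in enumerate(parser.blocks):
        if i > 0:
            out.extend([""] * blanks)  # empty lines between blocks
        out.append(textwrap.fill(text, width=width))
    return "\n".join(out) + "\n"
```

Calling html_to_text on the contents of an HTML file returns the text as a string. Inline markup such as <i> and <b> is simply dropped here; a real converter would map those tags too.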
and give it to the WWers to evaluate.
LOL! I'm not convinced that any of the white-washers could even spell ubuntu, let alone compile, install and use a Linux program. For them I've attached a MSWindows executable built from the attached code.
Take a hundred random samples from the archive and pipe the HTML file thru your device and see if something very close to the posted txt file comes out. (You may safely ignore where the lines break, but not the number of empty lines between blocks.)
This is an exercise left to the reader. Of course, the real test is not equivalency, but whether the output is something the white-washers would accept; no one can judge whether this requirement is met except the white-washers themselves.
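
For a reader who does take up the exercise, one possible way to compare a converter's output against the posted txt file is sketched below in Python. It ignores where the lines break inside a block but does count the empty lines between blocks, as the challenge specifies. The function names block_structure and same_layout are invented for this sketch, and it tests for an exact match rather than "something very close," so treat it as a starting point only.

```python
def block_structure(text):
    """Return [(empty_lines_before, normalized_text), ...] for a text file.

    Line breaks inside a block are ignored; the number of empty lines
    between blocks is preserved.
    """
    blocks = []
    blanks = 0   # empty lines seen since the previous block ended
    gap = 0      # empty lines that preceded the block being collected
    lines = []   # lines of the block being collected
    for line in text.splitlines():
        if line.strip():
            if not lines:
                gap = blanks
                blanks = 0
            lines.append(line)
        else:
            if lines:
                blocks.append((gap, " ".join(" ".join(lines).split())))
                lines = []
            blanks += 1
    if lines:
        blocks.append((gap, " ".join(" ".join(lines).split())))
    return blocks


def same_layout(generated, posted):
    """True if both texts have the same blocks and the same gaps between them."""
    return block_structure(generated) == block_structure(posted)
```

Note that empty lines before the first block are counted as that block's gap, so leading blank lines in either file will register as a difference.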