Re: [gutvol-d] a review of some digitization tools -- 022

keith said:
Why do you not ask john if pandoc can do it without human intervention!
"without human intervention"? whatever in the world could that possibly mean? -bowerbird

Well, just submit pandoc, and we will see what Lee has to say.
regards
Keith.
On 19.12.2011 at 17:31, Bowerbird@aol.com wrote:
keith said:
Why do you not ask john if pandoc can do it without human intervention!
"without human intervention"?
whatever in the world could that possibly mean?
-bowerbird

On Mon, December 19, 2011 11:40 pm, Keith J. Schultz wrote:
Well, just submit pandoc, and we will see what Lee has to say.
Nothing I have seen about pandoc suggests that it would be capable of satisfying my requirements, in two regards particularly: it has to be able to do the conversion without being told what kind of text input file it has been given, and it has to perform the conversion on a Gutentext file of my choice.
In other words, given 4 text files using Markdown, ReStructured Text, s.m.l., and impoverished text:
"pandoc File1.txt File1.html"
"pandoc File2.txt File2.html"
"pandoc File3.txt File3.html"
"pandoc File4.txt File4.html"
should all produce virtually the same output (identical is not necessary, I'll accept "close enough"), using HTML semantic structures, without the program having been told what the input format is for any particular file.
Pandoc doesn't look very interesting to me. Converting from one markup language to another tends to be a very straightforward task; the degree of success depends in large part on how closely the two markup languages match (PDF to HTML is a real pain, as PDF is almost exclusively presentational [how things look] and HTML is mostly semantic [what things are]). As near as I can tell, pandoc is just a collection of these one-to-one conversion routines.
Because of what I view as a low probability of success, I'm not going to start looking at pandoc unless someone else comes to me and says, "I've tried it under your stipulations and it worked for me."
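To make the first stipulation concrete: a converter that is not told the input format has to guess it from the text itself. The following is only a rough sketch of such a guess, not anything pandoc or any tool in this thread is known to do. The function name `guess_markup` and the scoring rules are hypothetical, and it only distinguishes Markdown, ReStructured Text and unmarked text (s.m.l. is not covered).

```python
import re

# Hypothetical heuristics: each pattern is a construct that is
# characteristic of one markup language and rare in the other.
RST_DIRECTIVE = re.compile(r"\.\. [A-Za-z][\w-]*::")   # e.g. ".. note::"
MD_ATX_HEADING = re.compile(r"#{1,6} \S")              # e.g. "# Title"
MD_INLINE_LINK = re.compile(r"\[[^\]]+\]\([^)]+\)")    # e.g. "[text](url)"


def guess_markup(text: str) -> str:
    """Return 'rst', 'markdown', 'ambiguous' or 'plain' for the given text."""
    rst_score = 0
    md_score = 0
    for line in text.splitlines():
        if RST_DIRECTIVE.match(line):
            rst_score += 1
        elif line.rstrip().endswith("::"):
            # A paragraph ending in "::" introduces a literal block in reST.
            rst_score += 1
        if MD_ATX_HEADING.match(line):
            md_score += 1
        if MD_INLINE_LINK.search(line):
            md_score += 1
    if rst_score == 0 and md_score == 0:
        return "plain"
    if rst_score == md_score:
        return "ambiguous"
    return "rst" if rst_score > md_score else "markdown"


if __name__ == "__main__":
    print(guess_markup("# Title\n\nSee [the list](http://example.org).\n"))
    print(guess_markup("Title\n=====\n\n.. note::\n\n   Some text.\n"))
    print(guess_markup("Just a paragraph of text.\n"))
```

Underlined titles are deliberately ignored here because both Markdown and ReStructured Text accept them, which is a small illustration of why guessing the format is harder than converting a known one.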

Keith > Well, just submit pandoc, and we will see what Lee has to say.
As a quick check of the value or lack thereof of pandoc I installed it, downloaded a recent HTML (and images) posting on PG "at random" and attempted to convert it to epub format using pandoc as follows:
pandoc -f html -t epub -o test.epub 38423-h.htm
I have posted the result at: http://freekindlebooks.org/Dev/38423.epub
By comparing to the similar PG epub posting, one can see that while pandoc does something credible, it is not nearly as useful in practice as that which Marcello is already producing.

On Wed, December 28, 2011 10:35 pm, James Adcock wrote:
As a quick check of the value or lack thereof of pandoc I installed it, downloaded a recent HTML (and images) posting on PG "at random" and attempted to convert it to epub format using pandoc as follows:
pandoc -f html -t epub -o test.epub 38423-h.htm
Did you try the reverse process? That is, converting from HTML to one of the subtle markup languages, like ReStructured Text or Markdown? If so, does the output appear to be something that would pass the whitewashers' scrutiny? It seems to me that the greatest value in pandoc may be in using it to downgrade HTML to a structured text format rather than in using it to create an encapsulated HTML format like .epub or .mobi.

pandoc -f html -t epub -o test.epub 38423-h.htm
Lee> Did you try the reverse process? That is, converting from HTML to one of the subtle markup languages, like ReStructured Text or Markdown? If so, does the output appear to be something that would pass the whitewashers' scrutiny?
At your request I tried the following output formats:
RST: pandoc churns for literally a couple minutes and then dies a horrible death based on memory exhaustion.
MARKDOWN: pandoc produces something which looks text-like and has forced linebreaks similar to PG txt70 but it doesn't particularly look like markdown to me. It would have to be extensively edited to pass muster with the WW.
PLAIN: pandoc produces something which looks text-like and has forced linebreaks similar to PG txt70 but it doesn't particularly look like PG txt70. It would have to be extensively edited to pass muster with the WW.
RTF: Produces gibberish which none of my rtf programs recognize as being RTF.
In Summary "plain" output format might be vaguely useful to some people as an aid in moving from html-based development to txt70 in order to jump the hoop.
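For anyone wanting to repeat this comparison, here is a small sketch that builds one pandoc command per output format and runs it. The exact command lines used for these four tests were not posted, so this assumes the same form as the EPUB command above with only the `-t` value and output file changed; the helper name `build_commands` and the `test.*` output names are made up for illustration.

```python
import shutil
import subprocess

# pandoc writer name -> file extension for the output (extensions are a guess).
FORMATS = {"rst": "rst", "markdown": "md", "plain": "txt", "rtf": "rtf"}


def build_commands(source, formats):
    """Return one pandoc argument list per output format."""
    commands = []
    for writer, extension in formats.items():
        output = f"test.{extension}"
        commands.append(
            ["pandoc", "-f", "html", "-t", writer, "-o", output, source]
        )
    return commands


if __name__ == "__main__":
    if shutil.which("pandoc") is None:
        print("pandoc is not installed; commands that would be run:")
        for command in build_commands("38423-h.htm", FORMATS):
            print(" ".join(command))
    else:
        for command in build_commands("38423-h.htm", FORMATS):
            result = subprocess.run(command, capture_output=True, text=True)
            print(" ".join(command), "-> exit code", result.returncode)
```

One untested guess about the RTF result: pandoc writes only a document fragment unless `-s` (`--standalone`) is given, so an RTF file produced without it may lack the header that RTF readers expect.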

On 12/30/2011 12:44 PM, Jim Adcock wrote:
In Summary "plain" output format might be vaguely useful to some people as an aid in moving from html-based development to txt70 in order to jump the hoop.
This was very interesting, thanks. (The developer of pandoc might be especially interested in your results for ReStructured Text).
Rummaging around on my hard drive I came across html2txt.exe that I apparently wrote back in 2001-2002. Would you be interested in trying that as well? Just in case, I have posted the .exe to http://www.passkeysoft.com/~lee/html2txt.exe. If you're suspicious of my program (I know /I/ would be) I'd be happy to e-mail you the source as well. It's built on top of the Aug 2000 Tidy code base, so you'd have to have that as well.
Thanks again for taking the time to do this.
participants (5)
- Bowerbird@aol.com
- James Adcock
- Jim Adcock
- Keith J. Schultz
- Lee Passey