Re: [gutvol-d] a review of some digitization tools -- 022

keith said:

...
Why do you not ask john if pandoc can do it without human intervention!

"without human intervention"?

whatever in the world could that possibly mean?

-bowerbird

Lee Passey

6:10 p.m.

On Mon, December 19, 2011 11:40 pm, Keith J. Schultz wrote:

...

Well, just submit pandoc, and we will see what Lee has to say.

Nothing I have seen about pandoc suggests that it would be capable of satisfying my requirements, in two regards particularly: it has to be able to do the conversion without being told what kind of text input file it has been given, and it has to perform the conversion on a Gutentext file of my choice. In other words, given 4 text files using Markdown, ReStructured Text, s.m.l., and impoverished text:

"pandoc File1.txt File1.html"
"pandoc File2.txt File2.html"
"pandoc File3.txt File3.html"
"pandoc File4.txt File4.html"

should all produce virtually the same output (identical is not necessary, I'll accept "close enough"), using HTML semantic structures, without the program having been told what the input format is for any particular file.

Pandoc doesn't look very interesting to me. Converting from one markup language to another tends to be a very straightforward task; the degree of success depends in large part on how closely the two markup languages match (PDF to HTML is a real pain, as PDF is almost exclusively presentational [how things look] and HTML is mostly semantic [what things are]). As near as I can tell, pandoc is just a collection of these one-to-one conversion routines.

Because of what I view as a low probability of success, I'm not going to start looking at pandoc unless someone else comes to me and says, "I've tried it under your stipulations and it worked for me."
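For anyone who wants to try the stipulation above, here is a minimal Python sketch; it is not something posted in this thread. It assumes pandoc is on the PATH and reuses the hypothetical file names from the quoted commands. The output is written with `-o`, since pandoc, as far as I know, treats a second bare file name as another input rather than as the output. With no `-f` option pandoc guesses the input format from the file extension and, as I read its manual, falls back to Markdown for an extension such as .txt, which is exactly the behavior the stipulation probes. The 0.9 threshold for "close enough" is an arbitrary assumption.

```python
import difflib
import subprocess
from pathlib import Path

# Hypothetical file names, taken from the commands quoted above.
SOURCES = ["File1.txt", "File2.txt", "File3.txt", "File4.txt"]


def build_command(source: str) -> list[str]:
    """Build a pandoc call that names no input format (no -f option)."""
    target = str(Path(source).with_suffix(".html"))
    return ["pandoc", source, "-o", target]


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def close_enough(outputs: list[str], threshold: float = 0.9) -> bool:
    """True if every output is at least `threshold` similar to the first."""
    first = outputs[0]
    return all(similarity(first, other) >= threshold for other in outputs[1:])


def convert_all(sources: list[str]) -> list[str]:
    """Run pandoc on each source and return the generated HTML texts."""
    results = []
    for source in sources:
        command = build_command(source)
        subprocess.run(command, check=True)  # raises if pandoc fails
        results.append(Path(command[-1]).read_text(encoding="utf-8"))
    return results
```

Calling `close_enough(convert_all(SOURCES))` would then give a rough yes/no answer. A character-level ratio is a crude stand-in for "uses the same HTML semantic structures", so a real check would probably compare the element trees instead.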

James Adcock

29 Dec

5:35 a.m.

Keith> Well, just submit pandoc, and we will see what Lee has to say.

As a quick check of the value or lack thereof of pandoc I installed it, downloaded a recent html (and images) posting on PG "at random" and attempted to convert it to epub format using pandoc as follows:

pandoc -f html -t epub -o test.epub 38423-h.htm

I have posted the result at: http://freekindlebooks.org/Dev/38423.epub

By comparing to the similar PG epub posting, one can see that while pandoc does something credible, it is not nearly as useful in practice as that which Marcello is already producing.
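The comparison between the two epub files can be started mechanically. The sketch below is an illustration only, not the method used in this message: it relies on the fact that an EPUB file is a ZIP archive and lists which member files the two books share. The function names are made up for this example, and any file names passed in (for instance test.epub and a locally saved copy of the PG epub) are assumptions.

```python
import zipfile


def epub_members(path) -> dict[str, int]:
    """Map each file inside an EPUB (a ZIP archive) to its uncompressed size."""
    with zipfile.ZipFile(path) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}


def compare_epubs(path_a, path_b) -> dict[str, set[str]]:
    """Report which member names are unique to each archive or shared."""
    names_a = set(epub_members(path_a))
    names_b = set(epub_members(path_b))
    return {
        "only_a": names_a - names_b,
        "only_b": names_b - names_a,
        "both": names_a & names_b,
    }
```

This only shows structural differences such as missing images, stylesheets, or a table of contents; judging how usable each book is still means opening both in a reader.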

Lee Passey

6:18 p.m.

On Wed, December 28, 2011 10:35 pm, James Adcock wrote:

...

As a quick check of the value or lack thereof of pandoc I installed it, downloaded a recent html (and images) posting on PG "at random" and attempted to convert it to epub format using pandoc as follows:

pandoc -f html -t epub -o test.epub 38423-h.htm

Did you try the reverse process? That is, converting from HTML to one of the subtle markup languages, like ReStructured Text or Markdown? If so, does the output appear to be something that would pass the whitewashers' scrutiny? It seems to me that the greatest value in pandoc may be in using it to downgrade HTML to a structured text format rather than in using it to create an encapsulated HTML format like .epub or .mobi.

Jim Adcock

30 Dec

7:44 p.m.

...

...
pandoc -f html -t epub -o test.epub 38423-h.htm

Lee> Did you try the reverse process? That is, converting from HTML to one of the subtle markup languages, like ReStructured Text or Markdown? If so, does the output appear to be something that would pass the whitewashers' scrutiny?

At your request I tried the following output formats:

RST: pandoc churns for literally a couple minutes and then dies a horrible death based on memory exhaustion.

MARKDOWN: pandoc produces something which looks text-like and has forced linebreaks similar to PG txt70 but it doesn't particularly look like markdown to me. It would have to be extensively edited to pass muster with the WW.

PLAIN: pandoc produces something which looks text-like and has forced linebreaks similar to PG txt70 but it doesn't particularly look like PG txt70. It would have to be extensively edited to pass muster with the WW.

RTF: Produces gibberish which none of my rtf programs recognize as being RTF.

In Summary "plain" output format might be vaguely useful to some people as an aid in moving from html-based development to txt70 in order to jump the hoop.
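To repeat this experiment without babysitting each run, something like the following Python sketch could be used. It is an illustration under assumptions, not what was actually run here: it assumes pandoc is on the PATH, the function names and output file names are made up, and the 120-second timeout is arbitrary. The timeout only bounds running time; it does not cap memory use. The sketch also passes `--standalone` for RTF because pandoc writes document fragments by default, and a fragment without the RTF header may be one reason an RTF program rejects the file. That is a guess, not something confirmed in this thread.

```python
import subprocess

# Writer names for the formats tried above, mapped to assumed file extensions.
FORMATS = {"rst": "rst", "markdown": "md", "plain": "txt", "rtf": "rtf"}


def conversion_command(source: str, writer: str, extension: str) -> list[str]:
    """Build a pandoc call converting an HTML source to the given writer."""
    command = ["pandoc", "-f", "html", "-t", writer, "-o", f"test.{extension}"]
    if writer == "rtf":
        # Assumption: ask for a complete document rather than a fragment.
        command.append("--standalone")
    command.append(source)
    return command


def try_format(source: str, writer: str, extension: str,
               timeout: float = 120.0) -> str:
    """Run one conversion and describe the outcome as a short string."""
    command = conversion_command(source, writer, extension)
    try:
        subprocess.run(command, check=True, timeout=timeout,
                       capture_output=True)
    except subprocess.TimeoutExpired:
        return "timed out"
    except subprocess.CalledProcessError as error:
        return f"failed with exit code {error.returncode}"
    except FileNotFoundError:
        return "pandoc not found"
    return "ok"


def try_all(source: str) -> dict[str, str]:
    """Try every format in FORMATS and collect the outcomes."""
    return {writer: try_format(source, writer, extension)
            for writer, extension in FORMATS.items()}
```

`try_all("38423-h.htm")` would return one outcome per format, which makes it easier to report a failure such as the RST one to the pandoc developer along with the exact command that triggered it.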

Lee Passey

31 Dec

11:01 p.m.

On 12/30/2011 12:44 PM, Jim Adcock wrote:

...

In Summary "plain" output format might be vaguely useful to some people as an aid in moving from html-based development to txt70 in order to jump the hoop.

This was very interesting, thanks. (The developer of pandoc might be especially interested in your results for ReStructured Text).

Rummaging around on my hard drive I came across html2txt.exe that I apparently wrote back in 2001-2002. Would you be interested in trying that as well? Just in case, I have posted the .exe to http://www.passkeysoft.com/~lee/html2txt.exe. If you're suspicious of my program (I know /I/ would be) I'd be happy to e-mail you the source as well. It's built on top of the Aug 2000 Tidy code base, so you'd have to have that as well.

Thanks again for taking the time to do this.
