Re: [gutvol-d] About the XML debate

I played a little with the ReaderWorks converter for HTML to LIT. The biggest limitation is that the LIT format supports a nice Table of Contents feature which a basic HTML to LIT conversion doesn't support. The LIT specs are supposedly free (and under a Free License) but I haven't checked into it any further than that. I suppose after TXT, HTML and PDF are working in the PG mainstream, I'll move on to other formats like the Palm and Reader formats.
----- Original Message ----- From: "Marcello Perathoner" <marcello@perathoner.de>
Joshua Hutchinson wrote:
Marcello had a Palm format working at one point, if I remember correctly.
I dropped it because pluckering the html file gives you a better experience at a smaller file size.
The same conversion should be possible for Pocket-PC formats, but I'm not going to buy one just to test this.
-- Marcello Perathoner webmaster@gutenberg.org
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d

Joshua Hutchinson wrote:
I played a little with the ReaderWorks converter for HTML to LIT. The biggest limitation is that the LIT format supports a nice Table of Contents feature which a basic HTML to LIT conversion doesn't support. The LIT specs are supposedly free (and under a Free License) but I haven't checked into it any further than that. I suppose after TXT, HTML and PDF are working in the PG mainstream, I'll move on to other formats like the Palm and Reader formats.
Plucker lets you download a web site (and thus also an HTML ebook) to your Palm. Links and images still work. It's GPLed. But it's PalmOS only. AvantGo does the same for PocketPC. But it is payware. We need a reader for PocketPC (and Symbian) and an HTML converter that runs on (at least) Linux. Both must be open source. Any suggestions?
-- Marcello Perathoner webmaster@gutenberg.org

Plucker lets you download a web site (and thus also an HTML ebook) to your Palm. Links and images still work. It's GPLed. But it's PalmOS only.
Incorrect. Plucker runs on PalmOS, PocketPC, Windows Mobile, Linux and on non-PDA desktop machines. There are ports of the viewer for those platforms, many of which we carry in CVS.
AvantGo does the same for PocketPC. But it is payware.
AvantGo lacks about 40 of Plucker's core features.
We need a reader for PocketPC (and Symbian) and an html converter that runs on (at least) linux. Both must be open source.
Plucker and Vade Mecum (the PocketPC viewer based on Plucker) are the tools you need.
David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com

David A. Desrosiers wrote:
Incorrect. Plucker runs on PalmOS, PocketPC, Windows Mobile, Linux and on non-PDA desktop machines. There are ports of the viewer for those platforms, many of which we carry in CVS.
2.1 What platforms does Plucker run on?
The viewer should run on any PalmOS® device running version 2.0.4 or higher of PalmOS, while the desktop tools are supported on Linux, Windows, Mac OS X, and OS/2.
---- http://www.plkr.org/faq/2.1
And, no, I won't tell Aunt Tillie that she just has to pull the sources from CVS and compile if she wants to read a book.
Plucker and Vade Mecum (the PocketPC viewer based on Plucker) are the tools you need.
Is this thing GPLed? Why don't I find any reference to this on the Plucker site?
-- Marcello Perathoner webmaster@gutenberg.org

The viewer should run on any PalmOS® device running version 2.0.4 or higher of PalmOS, while the desktop tools are supported on Linux, Windows, Mac OS X, and OS/2.
As you know, the documentation is the last thing to be updated, and we can never track every single project out there using Plucker as an engine (there are now over two dozen of them, commercial and non-commercial).
And, no, I won't tell Aunt Tillie that she just has to pull the sources from CVS and compile if she wants to read a book.
Of course not; download the binaries provided on the other websites. In the case of Linux-based PDAs, use the reader packaged for those platforms (we don't provide packages for them, of course; that's not our job). The same goes for the PocketPC and Windows Mobile versions. I'm not sure about a Symbian version, but I know Plucker runs on that new Nokia/Linux tablet device.
Is this thing GPLed? Why don't I find any reference to this on the plucker site?
Perhaps you didn't look? It's been there for almost exactly 2 years: http://www.plkr.org/news/31
As for the "cobwebs" on the site, the Plucker site is being rewritten from the ground up, and that includes catching up on about 30 news articles that have to be made public as well. We all have day jobs and that takes away from our time to play with these kinds of things.
I've recently been asking the community to help us bring the docs and FAQ and other bits up to date, but the response has been depressingly light. http://code.plkr.org/docwiki/
And some things I've been working on are over here: http://code.plkr.org/
David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com

Thanks to some back and forth with David Widger, we have posted a text to the PG archives that consists of the XML master plus the TXT, HTML and PDF files converted straight from it. http://www.gutenberg.org/1/6/5/2/16523
For those interested: This book (Kitab-i-Aqdas) is a religious book from the Baha'i Faith. The text is freely available from the Baha'i website with a usage license that allows us to post the text to our archive as long as we don't make any content changes. I've basically converted it from the Microsoft Word format they posted it in to a PGTEI-based master and used that to create text in UTF-8, Latin-1 and 7-bit ASCII, HTML and PDF.
Regarding the XML. The XML file can be found in the 16523-x subdirectory. These files are not designed to be read directly in a web browser like IE or Firefox. They are plain text files and open just fine in Notepad or vi or any other text editor of choice. For those wishing to play with the XML, our online validator and conversion tools can be found here: http://www.gutenberg.org/tei
Besides wanting to celebrate the first XML posting ;) ... I'm also looking for constructive criticism. What doesn't look right? What problems do you see with the results?
Thanks for your attention, Joshua Hutchinson
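The archive URL in the message above nests one directory per digit of the ebook number, leaving out the last digit. A minimal Python sketch of that layout follows. It assumes the pattern visible in this single URL holds for other multi-digit ebook numbers; the function name pg_archive_url and the base parameter are made up for illustration.

```python
def pg_archive_url(ebook_no: int, base: str = "http://www.gutenberg.org") -> str:
    """Build the archive directory URL for an ebook number.

    Assumption: every digit except the last becomes one directory level,
    as in /1/6/5/2/16523 for ebook 16523.
    """
    digits = str(ebook_no)
    if len(digits) < 2:
        # Single-digit numbers are outside what the thread shows.
        raise ValueError("this sketch only covers numbers with two or more digits")
    # "1652" unpacks to "1", "6", "5", "2", followed by the full number.
    return "/".join([base, *digits[:-1], digits])


print(pg_archive_url(16523))
```

For 16523 this reproduces the URL quoted in the message.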

Besides wanting to celebrate the first XML posting ;) ... I'm also looking for constructive criticism. What doesn't look right? What problems do you see with the results?
Other than the Unicode changes, what is the difference between 16523-0.txt and 16523-8.txt? They appear to contain identical content.
David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com

David A. Desrosiers wrote:
Besides wanting to celebrate the first XML posting ;) ... I'm also looking for constructive criticism. What doesn't look right? What problems do you see with the results?
Other than the Unicode changes, what is the difference between 16523-0.txt and 16523-8.txt? They appear to contain identical content.
16523-0.txt is UTF-8 encoding
16523-8.txt is Latin-1 encoding
16523-7.txt is ASCII encoding
The content should otherwise be identical.
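The three files named above carry the same text and differ only in byte encoding. A small Python sketch of that idea, under stated assumptions: the sample string is invented, and replacing unencodable characters with "?" is a stand-in, since the thread does not say how the real converter handles characters that Latin-1 or ASCII cannot represent.

```python
# File-name suffixes as given in the thread, mapped to Python codec names.
SUFFIX_TO_CODEC = {"-0": "utf-8", "-8": "latin-1", "-7": "ascii"}


def encode_variants(text: str) -> dict:
    """Encode one string once per suffix.

    Assumption: characters missing from the target encoding become "?".
    """
    return {
        suffix: text.encode(codec, errors="replace")
        for suffix, codec in SUFFIX_TO_CODEC.items()
    }


sample = "Kit\u00e1b-i-Aqdas"  # invented sample with one accented letter
for suffix, data in encode_variants(sample).items():
    print(suffix, len(data), data)
```

The accented letter takes two bytes in UTF-8 and one in Latin-1, and it is replaced in ASCII, which is why the files differ in size even though the content is meant to be the same.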

Joshua Hutchinson wrote:
Besides wanting to celebrate the first XML posting ;) ... I'm also looking for constructive criticism. What doesn't look right? What problems do you see with the results?
1. The TEI files would be better named .tei and put into a 16523-tei/ directory. We have other types of XML files (MusicXML) and we don't want to get confused. Besides, TEI is a more specific name than XML.
2. The PDF shows some overly long page headlines. The page headline in PDF is taken from the TOC entry ... Maybe I should change that to be taken from the PDF bookmarks, so you have a little more control over it. Personally I would just not include all the "notes" in the TOC. This is allowed by the license ("in whole or in part").
3. PDF again. The "Synopsis and Codification" section is not indented like in TXT and HTML. That is probably a bug in the converter. I'll look into it.
4. PDF again. Some chapter names contain Unicode characters like em-dash and pretty quotes. These are not supported by PDF bookmarks. You have to provide a `dumbed-down' title for the bookmark with: <index index="pdf"> before the <head>.
-- Marcello Perathoner webmaster@gutenberg.org
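Point 4 above asks for an ASCII-only title for each PDF bookmark. A rough Python sketch of how such a title could be derived from the original heading is given here. It is only an illustration: the replacement table and the function name ascii_title are assumptions, and how the result is attached to the <index index="pdf"> element is not shown because the thread does not spell out that markup.

```python
import unicodedata

# Typographic characters mapped to plain ASCII stand-ins (illustrative choice).
REPLACEMENTS = {
    "\u2014": "--",  # em dash
    "\u2013": "-",   # en dash
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
}


def ascii_title(title: str) -> str:
    """Return an ASCII-only version of a heading."""
    for src, dst in REPLACEMENTS.items():
        title = title.replace(src, dst)
    # Split accented letters into base letter plus combining mark,
    # then drop everything that is still outside ASCII.
    decomposed = unicodedata.normalize("NFKD", title)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(ascii_title("Notes \u2014 \u201cSynopsis\u201d"))
```

Dropping leftover characters silently is a deliberate simplification; a production converter would probably want to warn about them instead.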

On Sun, Aug 21, 2005 at 10:01:10AM -0400, Joshua Hutchinson wrote:
Thanks to some back and forth with David Widger, we have posted a text to the PG archives that consists of the XML master plus the TXT, HTML and PDF files converted straight from it.
Thanks, Joshua. This is major!! I'm still ready to post Gilgamesh, too (and in fact, had been thinking of just "going for it"). I hope you'll be able to work on it soon. Today (yesterday?) will stand as a great day in Project Gutenberg history. XML as the base format for these "static" and forthcoming "dynamic" conversions is what we've been talking about for years. It's the key to many of the activities we've anticipated. Congratulations!!! -- Greg

Hurray, more XML! Some time ago (in February 2004), I had already prepared The Einstein Theory of Relativity by H.A. Lorentz as http://www.gutenberg.org/etext/11335 This was also posted in XML with derived text and HTML.
Jeroen.
Joshua Hutchinson wrote:
Thanks to some back and forth with David Widger, we have posted a text to the PG archives that consists of the XML master plus the TXT, HTML and PDF files converted straight from it.
http://www.gutenberg.org/1/6/5/2/16523
For those interested: This book (Kitab-i-Aqdas) is a religious book from the Baha'i Faith. The text is freely available from the Baha'i website with a usage license that allows us to post the text to our archive as long as we don't make any content changes. I've basically converted it from the Microsoft Word format they posted it in to a PGTEI-based master and used that to create text in UTF-8, Latin-1 and 7-bit ASCII, HTML and PDF.
Regarding the XML. The XML file can be found in the 16523-x subdirectory. These files are not designed to be read directly in a web browser like IE or Firefox. They are plain text files and open just fine in Notepad or vi or any other text editor of choice. For those wishing to play with the XML, our online validator and conversion tools can be found here: http://www.gutenberg.org/tei
Besides wanting to celebrate the first XML posting ;) ... I'm also looking for constructive criticism. What doesn't look right? What problems do you see with the results?
Thanks for your attention, Joshua Hutchinson

Joshua wrote:
I played a little with the ReaderWorks converter for HTML to LIT. The biggest limitation is that the LIT format supports a nice Table of Contents feature which a basic HTML to LIT conversion doesn't support. The LIT specs are supposedly free (and under a Free License) but I haven't checked into it any further than that. I suppose after TXT, HTML and PDF are working in the PG mainstream, I'll move on to other formats like the Palm and Reader formats.
LIT is essentially an encapsulated OEBPS 1.0.1 Publication. What ReaderWorks does is take HTML, "conform" it internally to OEBPS, and then convert it to LIT using Microsoft's litgen.dll.
Microsoft has a Reader SDK which includes a "demo" to convert OEBPS 1.0.1 into LIT. I've taken that demo, tweaked the C++ code some, and compiled it to generate a "production" level converter which I use for my publishing business.
ReaderWorks has some bugs that keep it from using the full power of OEBPS that LIT supports. The LIT format supports the OEBPS Tours and "out-of-spine" features (where "out-of-spine" content is presented in "pagelets"). Most publishers who produce LIT (using either ReaderWorks or, heaven forbid, Word HTML as the input) are totally unaware of these cool features. I use Tours and "out-of-spine" content a lot in my ebooks (e.g., I put all footnotes into popup pagelets).
Joshua, I'd be happy to share my OEBPS to LIT converter, as well as a sample OEBPS Publication. You can use the Package supplied in the sample Publication as a template to build your own Packages and implement Tours and "out-of-spine" content. Let me know...
Jon Noring

Jon Noring wrote:
LIT is essentially an encapsulated OEBPS 1.0.1 Publication. What ReaderWorks does is take HTML, "conform" it internally to OEBPS, and then convert it to LIT using Microsoft's litgen.dll.
I'll add new formats to the PGTEI converter on the condition that:
1. all components of the converter MUST be open source,
2. all components of the converter MUST run under Linux,
3. the new format SHOULD be documented and be an open standard,
4. there SHOULD be at least one free-as-in-beer reader.
Ad 1. The converter must run on servers at ibiblio. We cannot afford server licenses. Besides, I'm a narrow-minded free software bigot bastard and proud of it.
Ad 2. The converter must run on ibiblio servers, which run Linux.
Ad 3. Ideally the format should be an open standard like HTML. I personally won't do any work on undocumented formats. But if anybody else takes the trouble I'm not going to stand in their way.
Ad 4. Ideally the viewer should be open source, but I'll settle for a free-beer one. It just feels wrong to make people pay for a viewer to read free books on.
-- Marcello Perathoner webmaster@gutenberg.org

Marcello wrote:
Jon Noring wrote:
LIT is essentially an encapsulated OEBPS 1.0.1 Publication. What ReaderWorks does is take HTML, "conform" it internally to OEBPS, and then convert it to LIT using Microsoft's litgen.dll.
I'll add new formats to the PGTEI converter on the condition that:
1. all components of the converter MUST be open source,
2. all components of the converter MUST run under Linux,
3. the new format SHOULD be documented and be an open standard,
4. there SHOULD be at least one free-as-in-beer reader.
Well, that pretty much leaves LIT out of the picture (essentially by 3 and 4). However, OEBPS 1.0.1 would be a viable format to produce (and quite easy if the book's documents validate as XHTML 1.0 Strict). Then end-users can produce LIT if they so choose. (I'd also produce an OEBPS 1.2 Publication version as well -- there are subtle differences between the two.)
As a format, OEBPS fulfills all the openness requirements. There are a couple of primitive viewers (still under development) for OEBPS 1.0.1 and 1.2. This includes the "OpenBerg" project. OpenReader (the format) is planning on embracing OEBPS 1.2 and later a selected subset of TEI (PGTEI?).
Jon Noring
participants (6): David A. Desrosiers, Greg Newby, Jeroen Hellingman (Mailing List Account), Jon Noring, Joshua Hutchinson, Marcello Perathoner