Re: [gutvol-d] jeroen's even-handed analysis

Steve Thomas writes:
Most users of PG don't go around grumbling about the lack of XML or the ability to output as PDF. They're just stoked to be able to find the text online.
That's why they're users of PG. If they needed XML or PDF, they go elsewhere. And frankly, I've heard many complaints about how hard it is to process PG texts and how much information is lost. I've personally found it a pain to produce good printed versions of the PG etexts. I think it a bad idea to start saying that "Most users ... don't go around grumbling", because many of those who would grumble will go elsewhere, and many of those who do grumble don't do so where we can hear.

At 07:10 PM 10/19/2004 -0800, you wrote:
Steve Thomas writes:
Most users of PG don't go around grumbling about the lack of XML or the ability to output as PDF. They're just stoked to be able to find the text online.
That's why they're users of PG. If they needed XML or PDF, they go elsewhere. And frankly, I've heard many complaints about how hard it is to process PG texts and how much information is lost.
I don't want to add to the flame war here, but I can say this, which has been said here before. Sometimes I will find a PG text which I would like more information about, so I will go to Google and search for it. In almost all cases, I have found tons of sites which somehow convert the books into HTML or a similar format. blackmask.com immediately comes to mind, but there are lots of others. Many don't give credit to PG at all. My point is that yes, I agree with gutenberg9443 in that I would much rather have plain text first and worry about the rest later, but many people don't need to complain to PG about plain text only, for the simple reason that they can look for almost anything on Google and find a more nicely formatted version.

I would like to see PG eventually go to XML, not because I particularly like the format but because the new DAISY standard for digital talking books for the blind uses a form of XML. It should, in theory, be possible to convert HTML to DAISY, but how well that would work I don't know. If anyone wants to analyze a set of DAISY files, go to http://bookshare.org/ and search for an early PG title. I say "early" because they apparently quit adding the newer titles. I think there might be a demo link on there just for public domain books.

I will make one other comment on accents. Yes, I can see the importance of 8-bit files. I have a local mirror of almost all of PG on my system and I finally switched to getting 8-bit files only for non-English works. However, since I am blind and I read with speech, the accents really don't matter since the synthesizer doesn't pronounce them anyway. If it sees a letter in the high ASCII range, it skips it. This is especially bad, for example, with the works of Tolkien because accents are used so heavily.
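To make the accent problem concrete, here is a minimal Python sketch. It assumes the synthesizer simply drops every character above 7-bit ASCII, which is a guess at the behavior described above, and it contrasts that with folding accented letters to their base letters. The function names and the sample sentence are made up for illustration.

```python
import unicodedata


def drop_high_chars(text: str) -> str:
    """Mimic a reader that skips any character outside 7-bit ASCII (assumed behavior)."""
    return "".join(ch for ch in text if ord(ch) < 128)


def fold_accents(text: str) -> str:
    """Decompose accented letters and keep only the unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Sample with two accented letters, written as escapes so the file stays ASCII.
sample = "E\u00e4rendil sailed to N\u00famenor"

print(drop_high_chars(sample))  # Erendil sailed to Nmenor
print(fold_accents(sample))     # Earendil sailed to Numenor
```

Skipping the characters damages the words, while folding keeps them pronounceable, which is one reason a 7-bit edition prepared with care can serve a speech user better than an 8-bit file read naively.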

At 07:10 PM 10/19/2004 -0800, somebody wrote:
Steve Thomas writes:
Most users of PG don't go around grumbling about the lack of XML or the ability to output as PDF. They're just stoked to be able to find the text online.
That's why they're users of PG. If they needed XML or PDF, they go elsewhere.
That's not the point. People don't go to PG thinking, "hmmm, I wonder if they have any XML files". They go looking for a book. If you want the text of a particular book, you'll use it whatever format it comes in, so long as you have the software to handle that format. Nobody "needs" XML or PDF. They "need" the words of the book. Formats are secondary.

One of the original ideals of PG was that there had to be a plain text version, on the basis that everyone had at least the tools to handle plain text. Nowadays, almost everyone has a web browser, so HTML comes second on the accessibility list. Very few people, I imagine, have the necessary tools to work with a TEI or SGML file.

Now, there's nothing wrong with the notion of converting all PG texts to some XML master format, and then exporting that to umpteen other formats on demand. Practically though, that's a lot of work -- a *lot* of work -- and I don't yet see any signs of that progressing. Commercially (if one were to do this commercially -- this is a hypothetical), I'd estimate such a conversion task, for 10,000 books, to cost around $1,000,000 in salaries alone.

Of course, there's always volunteer effort. But if volunteers are busy converting plain texts to XML so that they can be output as plain text (or HTML/PDF/...), does that reduce the effort put into scanning/OCR/proof-reading? Could it be better to put the PG effort into getting plain text editions out, and leave it to others to do the extra conversion to XML etc.? This is a model that has worked really very well for quite a few years, without complaint from any but a few tech-enthusiasts.

-- Stephen Thomas, Senior Systems Analyst, Adelaide University Library ADELAIDE UNIVERSITY SA 5005 AUSTRALIA Tel: +61 8 8303 5190 Fax: +61 8 8303 4369 Email: stephen.thomas@adelaide.edu.au URL: http://staff.library.adelaide.edu.au/~sthomas/

Steve Thomas wrote:
Nobody "needs" XML or PDF. They "need" the words of the book.
Nobody "needs" television or cars. All they "need" is a newspaper and a pair of shoes.
Very few people, I imagine, have the necessary tools to work with a TEI or SGML file.
TEI is not intended as an end-user format. End-users should grab the generated HTML file.
Now, there's nothing wrong with the notion of converting all PG texts to some XML master format, and then exporting that to umpteen other formats on demand. [...] I'd estimate such a conversion task, for 10,000 books, to cost around $1,000,000 in salaries alone.
So think of what great value we would donate to the world.
Could it be better to put the PG effort into getting plain text editions out, and leave it to others to do the extra conversion to XML etc.? This is a model that has worked really very well for quite a few years, without complaint from any but a few tech-enthusiasts.
The main downside is that they mark up a *copy* of the text. When the original gets updated, the marked-up copy falls out of sync, and so do all the generated formats. This problem can only be obviated if PG marks up the original.
-- Marcello Perathoner webmaster@gutenberg.org
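The single-master idea argued over in this thread (one marked-up original, with every other format generated from it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names `book`, `title`, `p` and `i`, and the functions `to_plain_text` and `to_html`, are hypothetical and are not TEI or any actual PG format.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical miniature "master" file; real TEI is far richer than this.
MASTER = """<book>
  <title>A Sample Book</title>
  <p>First paragraph, with <i>emphasis</i> kept in the master.</p>
  <p>Second paragraph.</p>
</book>"""


def text_of(elem):
    """Flatten an element to plain text, marking <i> spans as _italic_."""
    parts = [elem.text or ""]
    for child in elem:
        inner = text_of(child)
        parts.append(f"_{inner}_" if child.tag == "i" else inner)
        parts.append(child.tail or "")
    return "".join(parts)


def to_plain_text(root):
    """Generate a plain text edition from the master."""
    lines = [root.findtext("title").upper(), ""]
    for p in root.iter("p"):
        lines.extend([text_of(p), ""])
    return "\n".join(lines).rstrip() + "\n"


def to_html(root):
    """Generate an HTML edition; <p> and <i> happen to be valid HTML already."""
    title = escape(root.findtext("title"))
    paras = "\n".join(
        ET.tostring(p, encoding="unicode").strip() for p in root.iter("p")
    )
    return (
        f"<html><head><title>{title}</title></head><body>\n"
        f"<h1>{title}</h1>\n{paras}\n</body></html>\n"
    )


root = ET.fromstring(MASTER)
print(to_plain_text(root))
print(to_html(root))
```

Because both outputs are regenerated from the same source, a correction made in the master reaches every format on the next run, which is the synchronization point made above. The sketch says nothing about the cost of producing the markup in the first place, which is the other half of the disagreement.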
participants (4)
- D. Starner
- Marcello Perathoner
- Steve Thomas
- Tony Baechler