Re: [gutvol-d] jeroen's even-handed analysis

In a message dated 10/19/2004 12:16:07 AM Mountain Standard Time, Bowerbird@aol.com writes: Doing fully automatic conversion to good paged PDFs for printing nice copies (and I mean good, as different from workable) will probably always remain a dream
This is A goal. It is not, and cannot be, THE goal. It would be great to have everything in printable PDF for people who want printable PDF. But if you want to keep ten thousand books on your computer, printable PDF isn't worth the end product of bovine digestion.

I loathe PDF. I'm sure I'm not the only person who uses Gutenberg who is in my situation: I'm going blind--slowly, fortunately, unlike a neighbor who went blind overnight--and I can't get PDF documents on my Rocket. That means that as my vision continues to deteriorate, I'm going to have to read sitting in front of my computer unless a book is available in a format I can convert to text or HTML in order to convert it to Rocket.

I agree with Michael. Post everything in TXT first AND THEN do anything else you want to do with it. I believe that is one of the goals of the DP team, which has all the scanned pages on computer to work from. HTML, even the "Save As" kind of HTML, can maintain formatting if you tell it to; I know because I've done it often.

A basic problem in this entire discussion is that there are a lot of people here who are program-happy, as opposed to computer-happy. I'm computer-happy, but like the vast majority of people who use Gutenberg, I'm really not interested in umpteen different programs. I just want a book I can read. As a scholar, I might at times need the specific coding that tells me about this punctuation mark or that other detail which doesn't come across in TXT, but if I need that, I can obtain the book some way and reinsert the punctuation and formatting myself.

The village schoolmaster in a third-world village, who has two hours of electricity a day, one cellular phone for the entire village, and an obsolete laptop donated to him by a first-world company, with a connection from the phone to the laptop cobbled together by a gadget-minded Peace Corps volunteer or church or UN aid worker, doesn't give a squiddly about umlauts and grave accents.

He just wants BOOKS that he can READ to his students during the two hours a day that the electricity is on. The cowboy who's going to be stuck all winter in a back-country cabin looking after a herd of cattle in a snowed-in high pasture, or the astronaut, or the submariner, or the scientists in a South Pole research station, or the kids going to bush-school on the radio in Australia or Alaska--these people don't need pretty pages. THEY NEED BOOKS. They need good books. That's all.

If we go back to the very basics, this is the goal of Project Gutenberg. It is no mistake that the very first things Michael posted were the most important documents of freedom. An educated populace can be kept enslaved for only so long, and then the privy hits the fan. We are the world's free public library. We do not serve, nor do we even NEED to serve, the few people in elite professions who want, and need, to be able to account for every comma and every umlaut.

People who are arguing their heads off about ten different ways to format are losing sight of the goal. It is hard to remember that your goal was draining the swamp if you are up to your a** in alligators. Stop creating alligators. If YOU--whoever YOU happens to be--want to create all kinds of pretty formats, do it. That's grand. But don't try to inflict your vision on all of PGLAF. The TXT versions MUST come first. Then people can be joyfully reading the new books, while other people create other formats for those nice new books.

Now can we go back to draining the swamp? Notice I said "can," not "may." We Ph.D.s in English know our grammar. I MEANT "can."

Anne

Gutenberg9443@aol.com wrote:
The TXT versions MUST come first. Then people can be joyfully reading the new books, while other people create other formats for those nice new books.
Anne
For a novel, you may be right, but complex texts, like most teaching materials, require something more than that: illustrations, tables, maybe formulas, etc. HTML is really the bottom line here. Also, since it is normally easier to throw something away than to add it, I prefer to go to XML first, and then create HTML and text from that. Jeroen.

Gutenberg9443@aol.com wrote:
The village schoolmaster in a third world village, who has two hours of electricity a day, one cellular phone for the entire village, and an obsolete laptop donated to him by a first world company with a connection from the phone to the laptop cobbled together by a gadget-minded Peace Corps volunteer or church or UN aid worker, doesn't give a squiddly about umlauts and grave accents.
True, if he happens to speak or teach English. If he happens to speak or teach any other language of the world he will care very much about accents, squigglies and umlauts. I wouldn't want to teach my pupils e.g. French from a book without accents. And don't get me started about schoolmasters who speak or teach Chinese, Korean, Japanese, Hebrew, Arabic, Vietnamese, Thai etc. Actually the 7-bit craze at PG at some point went so far as to convert Chinese etexts to 7-bit, completely mangling the text. Now suppose some Chinese reader did actually download one of those garbled texts and tried to make it work. Not good for the image of PG. Fortunately those bogus files have since been tossed out. Personally I hold that all the 7-bit files of foreign books are useless and dangerous, because people may get hold of them instead of the 8-bit files they can use. Notice I said "can," not "could."
He just wants BOOKS that he can READ to his students during the two hours a day that the electricity is on.
In this case he should trade the notebook for a PDA with a solar-cell charger. Then he could read 24 hours a day. Ah, and, of course, he would have to download the HTML or PDB file, because the hard-wrapped TXT files are very hard to read on a PDA.
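A side note on those hard-wrapped TXT files: re-flowing them for a small screen is a small job in any scripting language. Below is a minimal Python sketch, assuming the common PG plain-text convention that a blank line separates paragraphs; real PG files have headers and other wrinkles this ignores.

```python
import re

def unwrap(text):
    """Rejoin hard-wrapped lines within each paragraph, keeping blank
    lines as paragraph breaks (assumes blank-line paragraph breaks)."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse each paragraph's internal line breaks into single spaces.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

wrapped = "It was the best\nof times, it was\nthe worst of times.\n\nSecond paragraph\nhere."
print(unwrap(wrapped))
```

After that, the PDA's own word-wrap does the rest.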
The cowboy who's going to be stuck all winter in a back-country cabin looking after a herd of cattle in a snowed-in high pasture, or the astronaut, or the submariner, or the scientists in a South Pole research station,
Actually the South Pole research stations have a pretty fat pipe and plenty of the latest and greatest in computer gadgets. Wish I had.
If we go back to the very basics, this is the goal of Project Gutenberg. It is no mistake that the very first things Michael posted were the most important documents of freedom.
You are very America-centric, aren't you? The most important documents of freedom are those of the French Revolution (with accents). And if I were Chinese I would probably hold that the most important etc. etc. is Mao's Little Red Book (Unicode). Importance is relative. Michael posted those first because the computer he was on couldn't hold any longer texts.
An educated populace can be kept enslaved for only so long, and then the privy hits the fan.
Don't kid yourself. You can fool almost all of the people almost all of the time. The rest you shoot. Or, how do you explain that the most "civilized" countries of this world still use war as an instrument for "solving" international conflicts. -- Marcello Perathoner webmaster@gutenberg.org

Thanks Anne for reminding us all about the original objective of PG -- making texts available for people to read, wherever, and with whatever equipment. OK, you've somewhat overstated the case, and I think by now we'd all agree that "8-bit" characters are important. But it is a shame that most of the geeks -- no offence, I count myself as one -- on this list immediately skipped your main point to whine about the need for accents and foreign scripts. You guys can't seem to see the wood for the trees.

Personally, I've seen the debate about XML (not to mention z.m.l.) somewhere before -- oh, wait up, it was on THIS list, what, about eight months back? And didn't Jon go and set up a pgxml list for that discussion to continue? And didn't that list go strangely quiet shortly thereafter? You can draw your own conclusions from that. Me? I decided that my own project -- building a library of high-quality HTML "web books" -- was more important than trying to get a room full of experts to agree on even basic things like whether we should use TEI Lite or invent our own DTD.

Basically, Anne is right -- who cares about this stuff? Only the few enthusiasts on this list. Most users of PG don't go around grumbling about the lack of XML or the ability to output as PDF. They're just stoked to be able to find the text online.

And on the subject of PDF, I agree with Anne -- it sucks. Why? Well, apart from being too fuzzy to read on screen, it locks the user into a format that's chosen by the engine which created it. Want a different font or type size? Too bad; whoever wrote the XSLT decided that for you. But create an HTML file, properly, and then the user can do what they like with it. Want to print it out in Georgia 24pt? No problem. Your choice.

Anyway ... think I'll go and convert a few more books now.

Steve

Gutenberg9443@aol.com wrote:
...
A basic problem in this entire discussion is that there are a lot of people here who are program-happy, as opposed to computer-happy. I'm computer-happy, but like the vast majority of people who use Gutenberg, I'm really not interested in umpteen different programs. I just want a book I can read. ...
-- Stephen Thomas, Senior Systems Analyst, Adelaide University Library ADELAIDE UNIVERSITY SA 5005 AUSTRALIA Tel: +61 8 8303 5190 Fax: +61 8 8303 4369 Email: stephen.thomas@adelaide.edu.au URL: http://staff.library.adelaide.edu.au/~sthomas/

Steve Thomas <stephen.thomas@adelaide.edu.au> writes:
But create an HTML file, properly, and then the user can do what they like with it. Want to print it out in Georgia 24pt? No problem. Your choice.
HTML isn't the best choice if you are interested in printing. Define "properly created" :) XML plus a customizable stylesheet (XSL or DSSSL) is better. For those who do not want to create a printable PDF file on their own, offer a pre-generated PDF file.
--
| ,__o
| _-\_<,    http://www.gnu.franken.de/ke/
| (*)/'(*)

Hello Karl! Wednesday, October 20, 2004, 3:38:52 AM, you wrote:
HTML isn't the best choice if you are interested in printing.
IMO: HTML isn't the best choice if you are interested in anything. This obsolete format should not be considered, let alone mentioned. Use XHTML + CSS instead, if you are allergic to XML. With properly written XHTML and customizable CSS the user can do whatever he wishes with the files and things are still as they should be. -- Skippi mailto:skip@nextra.sk

IMO: HTML isn't the best choice if you are interested in anything.
This obsolete format should not be considered, let alone mentioned. Use XHTML + CSS instead, if you are allergic to XML.
XHTML is HTML 4 reformulated as an XML application. In fact, there aren't a lot of differences between HTML and XHTML. Of course, XHTML, HTML, and XML are all "children" of SGML anyway, so we're all talking about generalized markup in some form or another. You can't set up a rigid set of rules that will apply across all past, present and future versions of printed works in electronic format. Whatever format you choose must be extensible enough to scale to future capabilities, and able to handle the capabilities of documents created in the past as well.
With properly written XHTML and customizable CSS user can do what ever he wishes with the files and the things are still as they should be.
Almost whatever s/he wishes. There are limitations in every format, depending on how broadly you want to consider using it. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com

Skippi wrote:
This obsolete format should not be considered, let alone mentioned. Use XHTML + CSS instead, if you are allergic to XML. With properly written XHTML and customizable CSS the user can do whatever he wishes with the files and things are still as they should be.
I've heard some mobile devices with limited memory aren't able to parse XHTML files. Someone using an older browser may not be able to either, and the CSS may not even be of any use to them. I don't see why we should exclude these people. HTML has its uses, just like XHTML, XML and PDF do. The ideal would be to have the texts in an XML-based format so that transforming to standards-compliant HTML (and XHTML/PDF) is trivial. An XML-to-HTML converter could then be written by anyone who cares enough about it to do so. Cheers, Holden
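Such a converter really is a small exercise. As a rough illustration only: the tiny <book> vocabulary and the tag mapping below are hypothetical, not any format PG or DP actually uses, and Python's standard library does all the work.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal book vocabulary -- <book><title>...</title>
# <p>...</p></book> -- mapped to plain HTML elements.
TAG_MAP = {"title": "h1", "p": "p"}

def book_to_html(xml_string):
    """Convert the toy <book> XML into a bare-bones HTML page."""
    root = ET.fromstring(xml_string)
    parts = ["<html><body>"]
    for child in root:
        html_tag = TAG_MAP.get(child.tag, "div")  # unknown tags become divs
        parts.append("<%s>%s</%s>" % (html_tag, child.text or "", html_tag))
    parts.append("</body></html>")
    return "".join(parts)

sample = "<book><title>A Christmas Carol</title><p>Marley was dead.</p></book>"
print(book_to_html(sample))
```

A real converter would also handle nested elements and attributes, but the shape of the job is the same.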

As usual, people have missed the point of the original post (Anne's), which was that we need to remember the *user* -- that guy in Africa with only 2 hours of electricity each day. Anne suggested (I think) that he uses a laptop, but more likely he's using a worn-out IBM 486 running Windows 3, so all this geek-talk about XML and XSLT etc. is irrelevant to him -- he'll be lucky if he can run a standard web browser. [I can't believe that people still think they're doing good by shipping old 486s to Africa -- but apparently it's true. I recently donated some old Pentium IIs to a charity, and they couldn't believe their luck.] Anyway: Karl Eichwalder wrote:
Steve Thomas <stephen.thomas@adelaide.edu.au> writes:
But create an HTML file, properly, and then the user can do what they like with it. Want to print it out in Georgia 24pt? No problem. Your choice.
HTML isn't the best choice if you are interested in printing. Define "properly created" :) XML plus a customizable stylesheet (XSL or DSSSL) is better. For those who do not want to create a printable PDF file on their own offer a pre-generated PDF file.
My definition of "properly created" HTML would be HTML 4 Strict, plus CSS. I was trying to avoid obvious detail. And HTML is the *best* choice for printing if you don't have the in-depth knowledge of XML/XSL etc. or the tools to make that happen. Anyone with IE6 can make a pretty good print of my HTML books, straight from the browser. Skippi wrote:
This obsolete format should not be considered, let alone mentioned. Use XHTML + CSS instead, if you are allergic to XML. With properly written XHTML and customizable CSS the user can do whatever he wishes with the files and things are still as they should be.
XHTML is -- for practical purposes -- the same as HTML 4 Strict, except that it enforces good practice, whereas HTML allows the author some latitude. The important difference, to a user, is that HTML is pretty much guaranteed to work in all browsers, whereas XHTML can be "difficult" in some circumstances -- e.g. if you include the <?xml?> declaration, it can foul up IE6. A while ago, I started converting all my ebooks to XHTML, but immediately ran into problems that weren't worth my time to fix.

One day, browsers will commonly deal correctly with XML of any type, with an appropriate style sheet. Right now, HTML is the format that works best. This is, I know, very boring to those of us who like playing with the latest gizmos and formats. But the reality is, if you want the widest possible audience, you've got to give them the format that's easiest for them.

For a more detailed discussion of this topic, see the archives, about January this year if memory serves, where you'll probably find me -- and you -- saying much the same things.

-- Stephen Thomas, Senior Systems Analyst, Adelaide University Library ADELAIDE UNIVERSITY SA 5005 AUSTRALIA Tel: +61 8 8303 5190 Fax: +61 8 8303 4369 Email: stephen.thomas@adelaide.edu.au URL: http://staff.library.adelaide.edu.au/~sthomas/

On Wed, 20 Oct 2004, Steve Thomas wrote:
[I can't believe that people still think they're doing good by shipping old 486s to Africa -- but apparently it's true. I recently donated some old Pentium IIs to a charity, and they couldn't believe their luck.]
My Linux users group installs thin-client computer labs for schools. We happily accept PIIs, but turn down 486s. We use PIIs and PIIIs as thin clients, removing the hard drives and installing bootable network cards, and connect them to a fast server running K12LTSP Linux. We can create a usable 30-client computer lab for $3000 or so, since the clients are all donations. Currently we're preparing clients for our first foreign lab, to be run by Peace Corps Volunteers in Western Samoa. If that works -- then perhaps Africa :) -- Karen Lofstrom Zora on DP

Karen Lofstrom <lofstrom@lava.net> writes:
On Wed, 20 Oct 2004, Steve Thomas wrote:
[I can't believe that people still think they're doing good by shipping old 486s to Africa -- but apparently it's true. I recently donated some old Pentium IIs to a charity, and they couldn't believe their luck.]
My Linux users group installs thin-client computer labs for schools. We happily accept PIIs, but turn down 486s. We use PIIs and PIIIs as thin clients, removing the hard drives and installing bootable network cards, and connect them to a fast server running K12LTSP Linux. We can create a usable 30-client computer lab for $3000 or so, since the clients are all donations.
I can't speak for Africa, but I have spent the last 14 years living in the deepest parts of China, Laos, and Cambodia. In the last 7 years I have not seen anything older than a PII except in a few old government systems and ancient bank computer networks running OS/2. And I'm not talking about the big cities like Vientiane; I'm talking about villages which barely have electricity, and other odd corners with flaky old generators grumbling in blackened, soot-covered back sheds. Just because the electricity is only on for a few hours a day doesn't mean that people don't have access to okay technology. Hell, I've seen rice farmers along the Mekong River using picture phones to send pictures of babies to relatives in Bangkok. The third world ain't always as backward as people in the first world think. There are large areas that are that bad, but then ebooks will not be an option for them until they have bridges connecting them to settled areas, or proper water, bottled gas for cooking.... b/ -- Brad Collins <brad@chenla.org>, Bangkok, Thailand

"Brad" == Brad Collins <brad@chenla.org> writes:
Brad> I can't speak for Africa, but I have spent the last 14 years
Brad> living in the deepest parts of China, Laos, and Cambodia.
Brad> In the last 7 years I have not seen anything older than a
Brad> PII except in a few old government systems and ancient bank
Brad> computer networks running OS/2.
Brad> And I'm not talking about the big cities like Vientiane, I'm
Brad> talking about villages which barely have electricity, and
Brad> other odd corners with flaky old generators grumbling in
Brad> blackened, soot-covered back sheds.

When I started an EU-financed international research project on symbolic computation, some 12 years ago, the computers that we were using were 486s. And they were running Linux (Slackware) and X, and I was able to run TeX and view high-quality output. I am still using and developing the software that we wrote in this project. There would have to be something wrong if, after 12 years, what was good enough for a half-million-dollar research project isn't even good enough for a forest village. The same goes for the following generation of processors (the Pentium I). Carlo

Ack! This is a looong post.... and I'd promised myself I wouldn't get dragged into this flame-fest :( Steve Thomas <stephen.thomas@adelaide.edu.au> writes:
OK, you've somewhat overstated the case, and I think by now we'd all agree that "8-bit" characters are important. But it is a shame that most of the geeks -- no offence, I count myself as one -- on this list immediately skipped your main point to whine about the need for accents and foreign scripts. You guys can't seem to see the wood for the trees.
You're right, it's not just about accents, and it's not just about consistently converting texts into different formats, though these are both important issues in their own right. That aside, it's you who have it backwards. You keep talking about the end use of the text, which is opening up a file and reading it. But it's far from being that simple. XML is not meant for humans; it is meant for software. The XML will be converted to plain text, HTML and PDF for humans, but mostly the XML will be used by the applications humans need to find texts and determine if they are worth reading in the first place.

If you have a small library with 10,000 books in it, and the library is shelved roughly by category, you can easily get to know it just by glancing over the spines. You could even have a rough list that breaks down the books by title, author and category. But if you have 100,000 or 1,000,000 books in your library, your job of finding things becomes a lot more difficult. Keyword searching a la Google will never cut it. Google gives you a means of finding your car keys -- you know what you are looking for and you ask it to look for places which it thinks might have them.

Ask Google for a list of the works by Charles Dickens and you will get a list of web pages it thinks have lists of Charles Dickens' works. Ask the LOC Catalog the same question and it will return a list of items in their catalog which claim to have been written by Charles Dickens. But this list would be huge, because of duplicate editions of individual works. A Christmas Carol alone turns up a couple hundred items. But what if you could ask this same question and it would return a list of works (not web pages, or different editions) by Charles Dickens, organized in any way you want? And even that is an easy example. Can you ask for a list of all the characters in Great Expectations? Can you search for all contemporary obituaries of Charles Dickens?
To build applications which answer these types of questions requires more than a good cataloging system (though the FRBR approach goes a long way in this regard): you need the table of contents of each work (a TOC is a description of the structure of a text) and you need a good index of what is in the text. A back-of-book index is more than just a matter of keywords; it is a form of semantic markup. It maps concepts, people, places and events to the text itself. By combining the catalog metadata, the table of contents, and a good-quality index, we have the basic tools for finding a book and determining if it is worth reading.

We do this today in libraries, but it is a slow, laborious task which requires going to a catalog looking for possible candidates, then retrieving each candidate and scanning its TOC, preface, dust-jacket blurb, introduction or index to determine if it's worth reading. Traditional libraries are restricted by the physical medium that books are published in. But if you could pull all of these elements together into a consistent framework, you would have a remarkable resource which would transform an archive of books into a repository of knowledge far more valuable and powerful than the sum of its parts.

Semantic markup like TEI is needed not only for creating this kind of library, but for creating services which will be needed as the amount of information on the Net grows beyond what even monster search services like Google can handle. You talk about missing the forest for the trees, but you forget that a large part of the forest is a tangled root system deep underground which the end user will never see. Without that root system the forest will die. Structured, semantic markup and rich cataloging are the root system of a library. Anyone who says, "I don't care about the technical stuff, just give me what I want," doesn't understand that it's the technical stuff which enables them to get the stuff they want.

Is this hard work?
Hell yes, and it should be. Understanding, evaluating and making sense of the world around us is the most difficult thing humans do. But saying that it's not worth doing because it's hard is simply pathetic. Look at works like the OED. Would they have been created if the attitude was, "Oh, it's too hard to build a dictionary based on historical principles, and I don't read the quotes much anyway, so just give me a list of words"? Even if you don't read the quotes, the unabridged OED and the unabridged Webster's, or the Century Dictionary, were used to create brilliant concise works like Merriam-Webster's Collegiate or the Concise Oxford English Dictionary. The OED, and the massive collection of research and material that was created to write it, is the root system for all the dictionaries Oxford produces.

The more important question we should be asking is what role PG and even DP should be playing in all of this. It's reasonable to ask that PG produce basic structured markup which shows the basic structure and important elements in each text. This is no more difficult than HTML. I believe that a new group needs to be established which will take the simple TEI produced by PG and DP and do more complex cataloging, indexing and semantic markup, which will then be sent back to PG to be released as new editions.

The TEI documentation (which is 1,400 pages -- not 14,000 as Bowerbird exaggerated) recommends that markup be done in several passes. Start with simple structural markup (which, as I said, is about the same as HTML), and then pass it on to another team which can do a second, more detailed pass, and so on until it's complete. In this way you have a means of creating texts which will be gradually woven into the library, but everyone will be using a consistent and interoperable format which can be as simple or as complex as anyone requires. If everything is in basic TEI Lite, it will be easy for smaller specialized groups to come along and do this additional markup.
A group could form around a single author like Mark Twain, or around a category of works like mathematics. Then it will be easy for them to donate their work back to PG, making the texts richer, rather than their work becoming a separate branch of texts which aren't interoperable with the PG editions. Plain text, HTML and PDF can't do this, because they are display formats for human consumption. Each has its uses and its markets. TEI is used for the root system, which needs to be grown, tended and cared for as the forest grows, even if 90% of people aren't even aware it's there, or don't understand that the applications they depend on to find any particular tree in the forest, and see if it's the tree they need, wouldn't work without it.

To understand where I'm coming from on all of this, I should mention (plug plug plug) that I've been working on just such a system (http://www.chenla.org), which is divided into two parts: the Burr Metadata Framework (BMF), which is meant to be a sort of Wiki markup for both integration of and export to TEI and MARC, and the Librarium, which uses BMF to integrate the catalog with the works in a library. We have recently put up our first experimental record (an authority record for Charles Dickens), which has been converted into HTML and plain text. Conversion to TEI and MARC is coming.

Taken together, the system can be used to integrate library catalogs with books and other texts and reference works, all together with authority data for persons and groups, geographic locations, events and concepts. We don't intend to be a service for the general public, but rather to create a catalog and content for use in other libraries and web sites. The site is hosted at ibiblio. In the next few weeks we should have enough documentation and another 30 or more records (which we call Burrs) online to make a general announcement of the project.
I'm still ironing out some bugs in the version control software, and still need to do a lot of work to complete a general introduction to the design, but it's all getting there. At the moment we can convert BMF to Emacs-Wiki format, which I then use to publish to Blosxom, which delivers basic HTML. BMF was designed with conversion to TEI in mind, though this might seem hard to believe when you look at the BMF source for the first time (there is a link to a pretty-printed version of the Dickens source).

So what's in it for PG? The Librarium will be developing detailed authority and bibliographic records for all PG material, and it's hoped that PG can eventually draw on our catalog material for its own authority records and catalog. This should be a help both for books already in PG's collection and for copyright clearance for new books, and should free up resources for putting out more books, with better metadata. b/ -- Brad Collins <brad@chenla.org>, Bangkok, Thailand

Steve Thomas wrote:
Basically, Anne is right -- who cares about this stuff?
That is the exact same answer Tim Berners-Lee got when he first presented his stuff. :-) "I can view a text file with "more", I just hit the space bar until I get to the right page. With your new-fangled format I need a -- what? -- browser? I don't have one. Why should I need a `browser' just to read some text?"
Only the few enthusiasts on this list. Most users of PG don't go around grumbling about the lack of XML or the ability to output as PDF. They're just stoked to be able to find the text online.
I can assure you that some do. Many start their own projects to mark up PG texts; most of them don't go very far, though. One example: http://gutenberg.hwg.org/
And on the subject of PDF, I agree with Anne -- it sucks.
It's the only format we have today to bring mathematics to the unsophisticated user. If you don't want to install TeX, PDF is the only way. PDF is not so bad. It is widely accepted and well documented, and free tools exist to generate PDFs. It has all the limitations of paper books, though. You cannot resize a printed book, or change the font, etc. The same limitations apply to PDF. That hasn't stopped people from buying paper books. -- Marcello Perathoner webmaster@gutenberg.org

--- Marcello Perathoner <marcello@perathoner.de> wrote:
Steve Thomas wrote:
Basically, Anne is right -- who cares about this stuff?
That is the exact same answer Tim Berners-Lee got when he first presented his stuff. :-)
Indeed. Over at DP we're progressing, in small baby steps, toward producing decently marked-up editions of all new material we produce. And when we at DP find a markup format we're comfortable with, then PG had better get comfortable with it as well, because we now produce the vast majority of all PG material. -- Jon Ingram

Jonathan Ingram wrote:
Indeed. Over at DP we're progressing, in small baby steps, toward producing decently marked-up editions of all new material we produce. And when we at DP find a markup format we're comfortable with, then PG had better get comfortable with it as well, because we now produce the vast majority of all PG material.
A simple XSLT will convert your format into TEI. -- Marcello Perathoner webmaster@gutenberg.org
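Marcello doesn't show the stylesheet itself. As a rough illustration of the kind of mechanical tag mapping such an XSLT would express declaratively, here is the same idea sketched with Python's standard library; the DP-style tag names on the left are hypothetical, not DP's actual markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical DP-style tags on the left, TEI-like equivalents on the right.
RENAME = {"chapter": "div", "heading": "head", "para": "p"}

def to_tei(elem):
    """Recursively rename elements -- the mechanical mapping an XSLT
    stylesheet would express as a set of templates."""
    new = ET.Element(RENAME.get(elem.tag, elem.tag))
    new.text, new.tail = elem.text, elem.tail
    for child in elem:
        new.append(to_tei(child))
    return new

src = ET.fromstring("<chapter><heading>I</heading><para>Text.</para></chapter>")
print(ET.tostring(to_tei(src), encoding="unicode"))
```

A real transform would also have to handle attributes and the TEI header, but element-to-element mapping is the bulk of the work.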

--- Marcello Perathoner <marcello@perathoner.de> wrote:
A simple XSLT will convert your format into TEI.
I'm not sure any use of XSLT can be called simple :). I've tried reading the spec, and I'm still recovering from the headaches. Fortunately there are easier ways to style (rather than transform) XML, using CSS. This is very well supported in all the Mozilla derivatives. While XSLT is something I'm going to have to look at eventually, for the moment I'm happy with CSS :). If you want an example of what I'm playing with at the moment: recently I and another DP volunteer have been kicking around some ideas for semantic markup of drama. While initially we were working with straight HTML, this quickly gets annoying, due to the amount of messing around with divs involved, and the need to consider how the output will be displayed on browsers with poor support for CSS. I've found it much easier to investigate options by working with an 'HTML+extra tags' markup. You can see my current working by looking at the blah.* files here: http://www.pgdp.net/phpBB2/viewtopic.php?p=94734 Save each file to the name given in its post subject heading. Any Mozilla derivative should show the .xml file styled in a way which almost exactly replicates the .html file. The source for the XML edition is much easier to read. Those of you who know TEI can probably tell that 'my' markup is very similar to TEI markup (although a little more verbose). Much of it was arrived at independently, which makes me more confident that this styling approach is relatively sensible. The example demonstrates markup of drama and poetry, with decent handling of line continuations and line numbers in poetry, and stage directions in drama. I've used the HTML 'edition' of this poetry markup for quite a while now in texts I've PPed for PG. Note that this is still a work in progress, so resist the tempation to criticise the minutiae of my CSS :). One of the other reasons I think a simple XML-style is useful is that we're currently planning to seperate the proofreading rounds from the markup rounds at DP. 
Every page of a DP project currently goes through two 'rounds' of processing. In each round proofers are expected to not only detect OCR errors, but add inline markup for italic, bold, material in non-Latin alphabets, etc., and add block markup for poetry, tables, and so on. This will be split into an initial two rounds only concerned with the text, plus an extra procedure to mark the text correctly.

At the moment the markup we use is homegrown and kludgy -- we have a great opportunity at the moment to move to something more sensible, and I strongly believe that some simple XML-derivative is the markup we need. I'm even more convinced of the utility of XML for DP now that I've seen how easy it is to style it.

One of the problems of relying on something like XSLT is that it can be hard to go backwards from errors in the output to find the corresponding error in the original XML input. Being able to get direct feedback by viewing a styled version of the XML makes life much easier.

-- Jon Ingram

Jonathan Ingram wrote:
I'm not sure any use of XSLT can be called simple :). I've tried reading the spec, and I'm still recovering from the headaches. Fortunately there are easier ways to style (rather than transform) XML, using CSS. This is very well supported in all the Mozilla derivatives. While XSLT is something I'm going to have to look at eventually, for the moment I'm happy with CSS :).
CSS, while simpler, is less powerful and gives you only HTML.
If you want an example of what I'm playing with at the moment: recently I and another DP volunteer have been kicking around some ideas for semantic markup of drama.
What I've done with Faust is to reformat the text file in a sensible way and then use perl to automatically add TEI markup. I advise using a perl script to add the basic markup, and then refining the markup in a second markup-proofing step.
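Marcello's tool was a perl script; purely to illustrate the same two-step idea, here is a Python sketch. The input convention (speaker names in capitals at the head of each blank-line-separated speech) is an assumption invented for this example, not his actual setup, and the emitted markup is a deliberately minimal TEI-ish skeleton meant to be refined in the later markup-proofing step.

```python
import re

def add_basic_tei(text):
    """Wrap each speech of a sensibly reformatted play in minimal TEI-style
    drama markup. Assumes one speech per blank-line-separated block, with
    the speaker's name in capitals (followed by a period) on the first line.
    Blocks that don't match are treated as stage directions."""
    speeches = []
    for block in text.strip().split("\n\n"):
        lines = block.split("\n")
        m = re.match(r"([A-Z][A-Z ]+)\.\s*(.*)", lines[0])
        if not m:
            speeches.append(f"<stage>{block}</stage>")
            continue
        speaker, first = m.group(1), m.group(2)
        verse = [first] + lines[1:]
        body = "\n".join(f"  <l>{l}</l>" for l in verse if l.strip())
        speeches.append(f"<sp>\n  <speaker>{speaker}</speaker>\n{body}\n</sp>")
    return "<div type='scene'>\n" + "\n".join(speeches) + "\n</div>"

sample = """FAUST. Habe nun, ach! Philosophie,
Juristerei und Medizin

MEPHISTOPHELES. Von Zeit zu Zeit seh ich den Alten gern"""
print(add_basic_tei(sample))
```

The point of the two-step design is that a crude script like this gets the bulk of the tagging done mechanically, leaving only exceptions for a human to fix.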
Those of you who know TEI can probably tell that 'my' markup is very similar to TEI markup (although a little more verbose). Much of it was arrived at independently, which makes me more confident that this styling approach is relatively sensible.
Why do people keep reinventing the wheel? TEI is perfectly good and designed explicitly for the task we have at hand. And it is a standard that is already in use in many e-libraries worldwide. I don't think we'll get PG to post texts in non-standard cooked-up formats. They are already making enough fuss over perfectly valid TEI files. -- Marcello Perathoner webmaster@gutenberg.org

Marcello wrote:
Jonathan Ingram wrote:
I'm not sure any use of XSLT can be called simple :). I've tried reading the spec, and I'm still recovering from the headaches. Fortunately there are easier ways to style (rather than transform) XML, using CSS. This is very well supported in all the Mozilla derivatives. While XSLT is something I'm going to have to look at eventually, for the moment I'm happy with CSS :).
CSS, while simpler, is less powerful and gives you only HTML.
CSS can be applied to any XML markup for viewing on web-standards browsers, but because current CSS is limited there are certain HTML functions (tags) it simply won't be able to enable, or to enable cleanly. In CSS 'display', for instance, there is no value for identifying when some XML element represents an object/image, nor a hypertext link/anchor. Obviously, there is no 'display' value for an inline note because HTML never supported this. (We find markup for inline notes in the TEI and DocBook vocabularies. Note that CSS can move a span of inline text to the side in its own box, I've tested it out myself, but IE6 unfortunately does not recognize the needed CSS, so in IE6 the inline note stays inline, not a good thing.)

Of course, one can use XLink for object/image embedding and anchors (and XLink makes more sense anyway than using CSS, since it is a vocabulary-independent means to embed objects and enable links), but then current web browsers are very deficient in XLink support. (Mozilla has very limited XLink support -- haven't tested Firefox yet -- while IE and Opera have zero XLink support.)

The OpenReader System 1.0, should it become a reality (and we are working on it -- we've made great strides in the last few weeks in garnering fairly high-level support), intends to fully support the more important parts of the XLink specification in version 1.0. We may also add one or more custom CSS values to 'display' to emulate links/anchors, objects/images and inline notes (OpenReader will include a facility to open 'booklets' to display non-inline content, in part to support OEBPS, which enables this cool ebook feature.) We also plan to investigate a future version of OpenReader to *natively* support TEI-Lite or some subset of TEI (including handling inline notes, which will be trivial for OpenReader to handle.)
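The side-box trick Jon mentions can be sketched roughly like this -- my reconstruction, not his tested stylesheet. Given a hypothetical inline <note> element in the XML, Gecko-based browsers will float it into a marginal box; IE6, lacking the needed support, leaves it inline:

```css
/* Assumed markup: <p>Some text<note>a marginal comment</note> more text.</p> */
note {
  display: block;       /* promote the inline note to its own box          */
  float: right;         /* pull it out of the text flow, to the side       */
  clear: right;         /* stack successive notes below one another        */
  width: 10em;
  margin: 0 0 0.5em 1em;
  padding: 0.25em;
  border: 1px solid #999;
  font-size: smaller;
}
```

This is styling only: the note is repositioned visually, but nothing tells the browser it is semantically a note, which is exactly the 'display' gap being described.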
We may even develop a next-generation styling language to address the deficiencies of current CSS2 and CSS3 but which doesn't have the complexity of XSLT/XSL-FO. The problem with CSS is its ties to the HTML paradigm and legacy support. In OpenReader, we are freeing ourselves from these legacy issues and thus can think outside the box and move on to the next generation web browser -- in essence to go beyond HTML. Jon Noring OpenReader: http://www.openreader.org/

Jon Noring wrote:
The OpenReader System 1.0, should it become a reality (and we are working on it -- we've made great strides in the last few weeks in garnering fairly high-level support), intends to fully support the more important parts of the XLink specification in version 1.0. We may also add one or more custom CSS values to 'display' to emulate links/ anchors, objects/images and inline notes (OpenReader will include a facility to open 'booklets' to display non-inline content, in part to support OEBPS which enables this cool ebook feature.) We also plan to investigate a future version of OpenReader to *natively* support TEI-Lite or some subset of TEI (including handling inline notes which will be trivial for OpenReader to handle.) We may even develop a next-generation styling language to address the deficiencies of current CSS2 and CSS3 but which doesn't have the complexity of XSLT/XSL-FO. The problem with CSS is its ties to the HTML paradigm and legacy support. In OpenReader, we are freeing ourselves from these legacy issues and thus can think outside the box and move on to the next generation web browser -- in essence to go beyond HTML.
May I ask how many people are working on this and what the time frame may be? -- Marcello Perathoner webmaster@gutenberg.org

--- Marcello Perathoner <marcello@perathoner.de> wrote:
I don't think we'll get PG to post texts in non-standard cooked-up formats.
Neither do I, and I don't want them to. Hopefully we'll either use TEI, or a markup which can easily and losslessly be transformed into TEI.

However, there are a lot of people out there, including a lot of DP volunteers, who are unconvinced about the utility of XML, and one of the best ways to *fail* to change their mind is to plonk 1400 pages of documentation in front of them and say 'here's what you should be using, come back when you've finished reading' -- this is true even of TEI-lite, which has some foibles you have to see past (overly terse tags, for example -- at least to my mind :) ).

I used to be one of the members of the 'undecided about XML' camp myself. I've gradually changed my mind, and I'm working on helping to change the minds of those I'm working with. I also don't just automatically accept the TEI way as being best for the applications I wish to use it for -- so I've been developing my own structured markup which I'm happy with, and which happens to have converged very closely to the corresponding TEI markup. As I said in my previous email, this makes me much more confident to accept the use of TEI-style markup in areas where I haven't had the time to investigate alternatives.

Those of you who aren't involved in DP will probably see nothing more about this until the day that 85% of new PG ebooks come with a TEI edition :).

-- Jon Ingram

Jonathan Ingram wrote:
and one of the best ways to *fail* to change their mind is to plonk 1400 pages of documentation in front of them and say 'here's what you should be using,
Then don't do that. You don't plonk the IBM PC Technical Reference Manual (5000 pages) in front of your secretary if you want her to type a few pages in M$-Word. You just give her a "Word for Dummies" book and that is all she needs. She doesn't need to know about the difference between the AGP and PCI-X bus. The full TEI spec explains the DTD and what not. Nobody needs that except the implementors. There are many gentle introductions to TEI-Lite floating around. And that's another advantage of using a standard: you don't have to write that stuff yourself. -- Marcello Perathoner webmaster@gutenberg.org

--- Marcello Perathoner <marcello@perathoner.de> wrote:
Jonathan Ingram wrote:
and one of the best ways to *fail* to change their mind is to plonk 1400 pages of documentation in front of them and say 'here's what you should be using,
Then don't do that.
You don't plonk the IBM PC Technical Reference Manual (5000 pages) in front of your secretary if you want her to type a few pages in M$-Word. You just give her a "Word for Dummies" book and that is all she needs. She don't need to know about the difference between AGP and PCI-X bus.
You're quite right. I let the current confrontational 'vibe' of this mailing list get the better of me. Sorry.

The point I was trying to make is that there are many people, myself included, who need to be given real arguments in favour of using something like TEI, and who won't accept that TEI does things the right way just because it's been around for a while :). There's quite a few people like me at DP, and I imagine there are quite a few more reading gutvol-d. As I've convinced myself that, at least in the areas I've investigated, TEI's methods seem quite sensible, I'm more open to 'trusting' the rest of it... and I thought some people would be interested in joining me on this journey.

As gutvol-d is being a little too confrontational for me at the moment, I'll probably go back to exhibiting my enthusiasm in the more congenial atmosphere of DP.

-- Jon Ingram

On Wed, 20 Oct 2004 17:25:29 +0200, Marcello Perathoner <marcello@perathoner.de> wrote:
I don't think we'll get PG to post texts in non-standard cooked-up formats. They are already making enough fuss over perfectly valid TEI files.
That last is, if not inaccurate, at least misleading. And I think you mean, by "PG" and "they" above, the WWs. So let's get down to it.

Nobody has an objection to valid TEI texts, but valid TEI texts alone _are not enough_. An XML file that cannot be read (by an actual human) is as useful as a lock with no key. We need the key as well as the lock. I really no longer give any headroom at all to the approach "Post XML Now Because That Is The One True Way And We'll Figure Out How To Read It Later." If for no other reason, then because the most important part of the WW job is to check the texts before posting, and if we can't read it, we can't find the errors, and if we can't find the errors, we can't fix 'em.

We WWs would all LOVE to have only one format (XML) uploaded, and generate all posting files from that. It would cut out an amazing amount of work and uncertainty. Further down the line, we can get to looking at posting just the XML, and generating other formats on the fly, but let's take one step at a time. Considering that this step to date has already taken three years or so, that's not overly cautious!

The first thing we need to do is get substantial agreement on a flavor of XML -- not ruling out the addition of future flavors, you understand, but we need to get at least one of them bedded down before we attack others. Teixlite seems to be the majority choice among those relatively few volunteers who are enthusiastic about XML, so let's say, for the purpose of this discussion, that that's the one we're working on.

Next, we need a process for adding the header and footer for PG texts for the selected flavor. That shouldn't be a problem; if we can agree how to tag them, we can automate that. (We don't actually _have_ agreement about tagging them, but I can't believe that could end up being a problem, once we settle on the rest.)
Next, we need a process, using open-source, cross-platform tools -- the standarder the better -- to convert that XML into, at a minimum, plain text and HTML. Other formats are welcome but optional. That process must work for _all_ teixlite files, not just ones that are specially cooked, using constraints not specified within the chosen DTD. Here's where we hit the rocks today.

I give considerable credit to you, Marcello, and to Jeroen, as the only people I know of who have come up with at least partial answers and approaches to this. Maybe you have refined your processes, but the last time I tried, I couldn't put Jeroen's files through your process and get the expected results. I think you have most of it down, though. Is it close enough to try again?

I don't want to imply specific means from which this process is to be constructed. Obviously XSLT is one possible approach, but I certainly do not want to imply limitations on what that process should use. The only things we must have -- both for our own internal practical purposes and for the use of future readers -- is that it should work reliably on _all_ texts that conform to the XML DTD chosen, be open source, and be cross-platform. A reader needs to be able to tweak the transform and re-run it on her own desktop. And just re-reading that last, when I say "must work reliably on ALL texts" I do not mean to imply that the same XSLT must be used for all texts, though obviously that would be of benefit, if we can manage it.

I've held just about every position on XML at one time or another, and I'm all XMLed out. I no longer believe it is worth spending my time on, until somebody (else!) solves the issues I've just laid out.

jim
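As a toy illustration of the XML-to-plain-text half of the process being asked for -- not any of the actual conversion chains discussed in this thread -- the following Python sketch pulls headings and paragraphs out of a TEI-Lite-ish file and re-wraps them as plain text using only the standard library. A real chain would use XSLT (Saxon, xsltproc, etc.) and handle vastly more of the vocabulary; the element names assumed here (<head>, <p>) are from TEI-Lite, but the sample document is invented.

```python
import textwrap
import xml.etree.ElementTree as ET

def tei_to_text(xml_string, width=70):
    """Flatten <head> and <p> elements of a TEI-like document into
    re-wrapped plain text. A sketch only: real TEI has many more
    structures (verse, quotes, notes, front and back matter...)."""
    root = ET.fromstring(xml_string)
    blocks = []
    for el in root.iter():
        tag = el.tag.rsplit('}', 1)[-1]       # strip any XML namespace
        if tag in ('head', 'p'):
            flat = ' '.join(''.join(el.itertext()).split())
            blocks.append(textwrap.fill(flat, width=width))
    return '\n\n'.join(blocks)

sample = """<TEI.2><text><body><div type="chapter">
  <head>Chapter I</head>
  <p>Alice was beginning to get very tired of sitting by her sister
     on the bank, and of having nothing to do.</p>
</div></body></text></TEI.2>"""
print(tei_to_text(sample))
```

Even a sketch this small shows why the "works on all valid files" requirement bites: the converter only knows the elements it was written for, so any file using constraints outside the agreed DTD falls through the cracks.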

Jim Tinsley wrote:
Nobody has an objection to valid TEI texts, but valid TEI texts alone _are not enough_. An XML file that cannot be read (by an actual human) is as useful as a lock with no key.
Not so. Having a TEI text posted would enable third-party developers to come up with their own converter solutions even if we didn't get very far with ours. There are a lot of people around who already convert the text files into other formats. Their jobs would get much easier.
I really no longer give any headroom at all to the approach "Post XML Now Because That Is The One True Way And We'll Figure Out How To Read It Later." If for no other reason, then because the most important part of the WW job is to check the texts before posting, and if we can't read it, we can't find the errors, and if we can't find the errors, we can't fix 'em.
A TEI text is basically a text file. So you can read it in any editor. If you use emacs you can also validate the TEI file against the DTD without leaving the editor. A perfectly valid TEI file with no spelling errors should be good enough to post.

What you expect from us TEI developers is that we produce the 150% perfect solution before you even consider starting to post files. That is not the way software development works. And this attitude is in my opinion the main cause why we have gotten nowhere with TEI in the last 3 years.

Let's start now with a version 0.0.1 of the TEI process. Of course at some later time we'll have to do all the posted files over again. Probably more than once. But it's better than sitting here and playing with bowerbird because we are bored.
Next, we need a process, using open-source, cross-platform tools -- the standarder the better -- to convert that XML into, at a minimum, plain text and HTML. Other formats are welcome but optional. That process must work for _all_ teixlite files, not just ones that are specially cooked, using constraints not specified within the chosen DTD. Here's where we hit the rocks today.
TEI defines a standard way to extend the DTD. I used this standard way to extend the TEI DTD into what I called PGTEI. This still is a perfectly valid TEI DTD according to the TEI specs.
I don't want to imply specific means from which this process is to be constructed. Obviously XSLT is one possible approach, but I certainly do not want to imply limitations on what that process should use. The only things we must have -- both for our own internal practical purposes and for the use of future readers -- is that it should work reliably on _all_ texts that conform to the XML DTD chosen, be open source, and be cross-platform. A reader needs to be able to tweak the transform and re-run on her own desktop.
You misunderstand what a DTD is. It just gives you syntactical correctness. I can cook up a perfectly valid XHTML file which is semantically bogus:

  <div><h6>1</h6>
    <div><h5>1.1</h5>
      <div><h4>1.1.1</h4>
        ...
      </div>
    </div>
  </div>

This is valid HTML (didn't bother to check) but will not render well. You cannot build a conversion tool that will produce good results on all syntactically valid TEI files, just as you cannot build a browser that will make sense out of semantically bogus HTML files.

Furthermore TEI is geared towards marking up existing texts, so scholars can study the text without having to get the physical book. It is not so good as a master format for print processing. That's why I had to add some more tags and attributes to my DTD. (Which doesn't make any text that uses my DTD less standard, because TEI is expressly designed to be extensible. But I'm repeating myself.)
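The syntactic-versus-semantic point can be demonstrated mechanically. This sketch uses Python's standard XML parser (which checks well-formedness, not DTD validity, so it is a looser check than the one Marcello describes): markup with heading levels running backwards parses without complaint, because the nonsense is purely semantic.

```python
import xml.etree.ElementTree as ET

# Well-formed markup whose heading levels run backwards (h6 above h5
# above h4). The parser -- and, against a permissive enough DTD, a
# validator -- accepts it happily; only a human eye or extra semantic
# checks will notice that the structure makes no sense.
bogus = ("<div><h6>1</h6>"
         "<div><h5>1.1</h5>"
         "<div><h4>1.1.1</h4></div></div></div>")
tree = ET.fromstring(bogus)        # no exception raised
print([el.tag for el in tree.iter()])
```

A conversion tool sits in the same position as the parser: it can only trust the tags it is given, which is why "valid" alone cannot guarantee good output.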
And just re-reading that last, when I say "must work reliably on ALL texts" I do not mean to imply that the same XSLT must be used for all texts, though obviously that would be of benefit, if we can manage it.
So why not start posting texts marked up in PGTEI, which will by definition work well in my conversion chain? And at the same time start posting Jeroen's texts, which will convert fine in his chain? This way we could both start putting up an automatic online conversion chain. (The guy who did this already in Java has somehow vanished, so I think we have to start over again.)

For a start I will act as interim Post-Processor for people wanting to post PGTEI and pass on to you only the perfectly good ones. You'll just have to stick in the etext number where I put 5 asterisks. I claim the .pgtei file extension, Jeroen can claim whatever extension he sees fit for his files. So we can have both an alice30.pgtei and an alice30.jtei.

-- Marcello Perathoner webmaster@gutenberg.org

Marcello wrote:
TEI defines a standard way to extend the DTD. I used this standard way to extend the TEI DTD into what I called PGTEI. This still is a perfectly valid TEI DTD according to the TEI specs.
I probably missed it from one of your prior messages, but do you have your PGTEI documented anywhere? Have you put together an actual Schema/DTD which can be used to validate documents against PGTEI? And a list of your custom vocabulary extensions? Also, another question to ask is whether it is documented anywhere how Jeroen's version of TEI compares with your PGTEI? Thanks! Jon

Jon Noring wrote:
I probably missed it from one of your prior messages, but do you have your PGTEI documented anywhere? Have you put together an actual Schema/DTD which can be used to validate documents for validity to PGTEI? And a list of your custom vocabulary extensions?
Start here: http://www.gutenberg.org/tei/
Also, another question to ask is if it is documented anywhere how Jeroen's version of TEI compares with your PGTEI?
No. -- Marcello Perathoner webmaster@gutenberg.org

On Wed, Oct 20, 2004 at 09:35:11PM +0200, Marcello Perathoner wrote:
Jim Tinsley wrote:
Nobody has an objection to valid TEI texts, but valid TEI texts alone _are not enough_. An XML file that cannot be read (by an actual human) is as useful as a lock with no key.
Not so. Having a TEI text posted would enable third-party developers to come up with their own converter solutions even if we didn't get very far with ours. There are a lot of people around who already convert the text files into other formats. Their jobs would get much easier.
I really do not mean to be disrespectful when I -- speaking for myself -- say that I'm not interested in spending my time making developers' jobs easier. That's not what I'm here for. We have text, and HTML, both proven and well-supported formats that we know how to work with and for which we know there is a demand. I'll stick to those until we can see a way clear through to making successful XML.
I really no longer give any headroom at all to the approach "Post XML Now Because That Is The One True Way And We'll Figure Out How To Read It Later." If for no other reason, then because the most important part of the WW job is to check the texts before posting, and if we can't read it, we can't find the errors, and if we can't find the errors, we can't fix 'em.
A TEI text is basically a text file. So you can read it in any editor. If you use emacs you can also validate the TEI file against the DTD without leaving the editor.
A perfectly valid TEI file with no spelling errors should be good enough to post.
Correct spelling is necessary but not sufficient. I don't know about other people, but I most commonly find errors by skimming the text. I can't do that with XML. Also, the validity of the XML gives me no comfort at all that, say, paragraphs are sensibly separated. I can do that with text or HTML to a high degree of accuracy, because I can read them naturally in a viewer program. There are many such types of problems that I can detect by eye quite quickly -- provided I am seeing the text laid out in a natural way.
What you expect from us TEI developers is that we produce the 150% perfect solution before you even consider starting to post files. That is not the way software development works.
Not 150%, surely! :-) And it may not be the way software development works, but then we're not a software development project. HTML already works. TeX already works. I've spent enough of my hours trying to get XML to work; I now leave that to others.
And this attitude is in my opinion the main cause why we have gotten nowhere with TEI in the last 3 years.
Let's start now with a version 0.0.1 of the TEI process. Of course at some later time we'll have to do all the posted files over again. Probably more than once. But it's better than sitting here and playing with bowerbird
. . . or vice-versa? :-) . . .
because we are bored.
Anyway, I disagree with your substantive point above. I say that until we have (or SOMEBODY has) a . . . . OK, a 90% solution, we should not post.
Next, we need a process, using open-source, cross-platform tools -- the standarder the better -- to convert that XML into, at a minimum, plain text and HTML. Other formats are welcome but optional. That process must work for _all_ teixlite files, not just ones that are specially cooked, using constraints not specified within the chosen DTD. Here's where we hit the rocks today.
TEI defines a standard way to extend the DTD. I used this standard way to extend the TEI DTD into what I called PGTEI. This still is a perfectly valid TEI DTD according to the TEI specs.
I don't want to imply specific means from which this process is to be constructed. Obviously XSLT is one possible approach, but I certainly do not want to imply limitations on what that process should use. The only things we must have -- both for our own internal practical purposes and for the use of future readers -- is that it should work reliably on _all_ texts that conform to the XML DTD chosen, be open source, and be cross-platform. A reader needs to be able to tweak the transform and re-run on her own desktop.
You misunderstand what a DTD is. It just gives you syntactical correctness. I can cook up a perfectly valid XHTML file which is semantically bogus:
<div><h6>1</h6>
  <div><h5>1.1</h5>
    <div><h4>1.1.1</h4>
      ...
    </div>
  </div>
</div>
This is valid HTML (didn't bother to check) but will render not so well.
You cannot build a conversion tool that will produce good results on all syntactically valid TEI files, like you cannot build a browser that will make sense out of semantically bogus HTML files.
I think one of us is not understanding the other, or perhaps both. I'm pretty sure I did not misunderstand what a DTD is. I do understand that an XML file that is valid just means that it is syntactically correct. This is actually the same point I made above: the fact that the XML is valid does not mean that paragraph breaks are in the right place -- which is one of the reasons why I must be able to convert it to something I can read in order to check it.

I certainly do not require a conversion tool that will correct misplacement of paragraph marks (though it would be nice! :-) -- I just require that the process for, say, teixlite will work reliably on all teixlite files; that it will produce syntactically valid HTML, and, I suppose you might reasonably say, "syntactically valid" text. Actually, now that I say that, I recall a case where syntactically valid XML made invalid HTML through a bug. Anyway, that's not the problem.

If the process we agree on for teixlite is, say, run it through Saxon, then I expect to be able to run all teixlite files through Saxon, and not have a submitter say "oh, no, you must use Xalan for this file, and not just any Xalan, but one with my patch in it." I have no objection to requiring, say, a patched version of Saxon, but if so I expect that patched version to be stable, to work for all teixlite files submitted, to be open-source, and to be cross-platform.
Furthermore TEI is geared towards marking up existent texts, so scholars can study the text without having to get the physical book. It is not so good as a master format for print processing. That's why I had to add some more tags and attributes to my DTD. (Which doesn't make any text that uses my DTD less standard, because TEI is expressly designed to be extensible. But I'm repeating myself.)
And just re-reading that last, when I say "must work reliably on ALL texts" I do not mean to imply that the same XSLT must be used for all texts, though obviously that would be of benefit, if we can manage it.
So why not start posting texts marked up in PGTEI, which will by definition work well in my conversion chain?
I think we were very close to that a year and a half ago. I had a request in to you to fix the "blockquote" thing, Greg had laid down the requirements for the license. And if anyone has followed up any of that, they didn't copy me on it.

Does anyone apart from you favor using PGTEI? In principle, of course, it doesn't matter, but in practice, we really couldn't cope with multiple XSLT conversion methods all happening at the same time. Your chain was, at least, rather difficult to implement. I haven't checked to see whether it still is. Can it be implemented on a Mac? On Win32? Is there a stable tarball somewhere?

You see, we appear to differ very fundamentally on one point. It's my lock and key analogy again. I do not want to start down the road of producing posted files from an XML if the transform will be, for any reason, not repeatable in a year's time, or five, or ten. I do not want to start down the road of producing posted files from XML if an end-user who wants to -- on whatever platform -- cannot replicate the process. I think that you don't care about this, or at least, it's not a priority for you, but it is one for me.
And at the same time start posting Jeroens texts, which will convert fine in his chain?
What we said last year still holds: we need somebody -- who is not me, not any of us WWs -- to create the process. The one that I defined in my earlier posting today. When we've got that, stable and documented, or at least understood, I really think we can proceed. But _I_, at least, have not got the time to spend experimenting, and I _know_ that David Widger doesn't.
This way we could both start putting up an automatic online conversion chain. (The guy who did this already in Java has somehow vanished, so I think we have to start over again.)
For the start I will act as interim Post-Processor for people wanting to post PGTEI and pass on to you only the perfectly good ones. You'll just have to stick in the etext number where I put 5 asterisks.
No; I, at least, don't want to work with an experimental process in which each text is an exception. I want a process in which the text comes in, I add the header, I run the conversion process and I check the resulting files. If we can't get to that point, I don't, as I said before, want to spend time on it. If _you_ can do this, then there is no reason, given a stable process, why _I_ can't. When somebody gets to this point, please let me know.
I claim the .pgtei file extension, Jeroen can claim what extension he sees fit for his files. So we can have bith an alice30.pgtei and an alice30.jtei.
Why can't we just name them .xml? I see no reason to invent extensions. _Is_ there one? Not that it matters much, just curious why you would think this a good idea. jim

--- Jim Tinsley <jtinsley@pobox.com> wrote:
Correct spelling is necessary but not sufficient. I don't know about other people, but I most commonly find errors by skimming the text. I can't do that with XML.
As my post earlier on today indicates, this isn't true. Assume that PG starts accepting some TEI-related schema. All you need is a relatively simple CSS stylesheet, and you can open the XML and view it perfectly directly. See http://faculty.washington.edu/dillon/xml/ for some examples where you can view styled (XML-conformant) TEI directly in your browser, with no intermediate transformations required. -- Jon Ingram

On Wed, Oct 20, 2004 at 02:14:31PM -0700, Jonathan Ingram wrote:
--- Jim Tinsley <jtinsley@pobox.com> wrote:
Correct spelling is necessary but not sufficient. I don't know about other people, but I most commonly find errors by skimming the text. I can't do that with XML.
As my post earlier on today indicates, this isn't true.
If I may nit-pick, I think it more correct to say that it isn't _always_ true. That is, it is not true when there exists a CSS that works with the XML. Jeroen provided XML like this, which I thought was very good indeed. For any of you who haven't seen it, please point your browsers to http://www.gutenberg.org/dirs/1/1/3/3/11335/11335-x/11335-x.xml which is an absolute pleasure to read. (Well, if you're a geek, that is, and if you ain't, whatcha doin' here?? :-) I said before, and I say again, that where such an XML is provided, HTML is probably redundant. ("Probably" because a significant use of HTML is as input to PDA readers like, say, Mobipocket, and I'm not sure if they would swallow this XML without requiring a Heimlich.) I know of no CSS for Marcello's PGTEI. Perhaps one could be crafted for it.
Assume that PG starts accepting some TEI-related schema. All you need is a relatively simple CSS stylesheet, and you can open the XML and view it perfectly directly.
See http://faculty.washington.edu/dillon/xml/ for some examples where you can view styled (XML-conformant) TEI directly in your browser, with no intermediate transformations required.
It does still leave the plain-text question hanging, but I do think that XML+CSS is a Good Thing, even if the XML is also destined to go through XSLT as well. jim

I was just reading over my last posting, hoping I wasn't the one sending bad vibes to Jonathan, who is exactly the kind of person we _need_ in a discussion like this, when I came across something else that Marcello said, that I didn't comment on first time round:
Let's start now with a version 0.0.1 of the TEI process. Of course at some later time we'll have to do all the posted files over again.
Now, please don't take this as a policy statement or anything, but I really, really HATE doing anything KNOWING that it's wrong and will have to be done again. I mean, bone-deep HATE it. Factor that in however you will. An argument against setting up an experiment in a production environment, or a personal foible? I report -- you decide! :-) jim

I've read almost everything that's been sent to the gutvol-d list in the recent burst of messages. I think it may be worthwhile trying to place everything that's been said in the larger perspective... Throughout much of its history, PG as an organization has been open to posting texts with formatting details or additional file formats done as volunteers wished to contribute them. Some examples of closed, proprietary formats (such as .prc and .lit) can even be found. This freedom has led to a wonderful array of inconsistencies and differences of approach which are probably most fully realized only by those who try to analyze, or convert, large portions of the PG collection. (At least a few people involved in these recent discussions fall into that category.) I would argue that if we go about posting various people's implementations of markup using XML, we risk forming an increasingly incompatible jumble of formats. Andrew

Jim Tinsley wrote:
Let's start now with a version 0.0.1 of the TEI process. Of course at some later time we'll have to do all the posted files over again.
Now, please don't take this as a policy statement or anything, but I really, really HATE doing anything KNOWING that it's wrong and will have to be done again. I mean, bone-deep HATE it.
*Why* is it wrong? If having to redo something is an indication of a premature start, we shouldn't have posted the first 10,000 books, because we have to repost them all. I have been architecting and programming for 20 years now, and I cannot remember a single instance where I got it 100% right the first time. -- Marcello Perathoner webmaster@gutenberg.org

Jim Tinsley wrote:
On Wed, Oct 20, 2004 at 02:14:31PM -0700, Jonathan Ingram wrote:
--- Jim Tinsley <jtinsley@pobox.com> wrote:
Jeroen provided XML like this, which I thought was very good indeed. For any of you who haven't seen it, please point your browsers to http://www.gutenberg.org/dirs/1/1/3/3/11335/11335-x/11335-x.xml which is an absolute pleasure to read. (Well, if you're a geek, that is, and if you ain't, whatcha doin. here?? :-)
First off, let me say that ... is a beautiful e-text. I really like the look; thanks to Jeroen for producing it and to Jim for pointing it out!

And next, let me make a modest proposal. Jon (in the DP forums) is making some progress toward an XML/CSS standard of sorts. I'm going to be watching closely (and helping as much as I can). One of the things I'm going to be pushing for is TEI-Lite compliance as much as possible. Since Marcello has his PGTEI document guidelines on the web site, I'll be looking through that for ideas and such.

I'll be going over this with Jon when I can, but my early idea is that we work on a couple of DP e-texts (the two of us have TONS to choose from!) and improve the XML markup standard enough for basic work. In a few weeks or so, I'd like to get a few projects posted to PG that use XML (TEI) with a CSS style sheet in place of the normal HTML that we always produce on our projects. The normal text file will of course be created. Once we have a canon of TEI to work with, hopefully the developers out there can start working on tools to help produce HTML or TEXT or PDF directly from the master.

It seems to me that the XML/CSS process is the best method to incrementally approach an XML master. Marcello has a point that if we wait until we have a 100% solution, we may never get there... But an XML/CSS process is doable now, and it gets us closer.

Now... everyone let me know where my logic fails. (Everyone but bowerbird... don't even bother to respond, please... I'm trying to actually get something going besides a diverting flame war!) Josh

--- Joshua Hutchinson <joshua@hutchinson.net> wrote:
I'll be going over this with Jon when I can, but my early idea is that we work on a couple of DP e-texts (the two of us have TONS to choose from!) and improve the XML markup standard enough for basic work. In a few weeks or so, I'd like to get a few projects posted to PG that use XML (TEI) with a CSS style sheet in place of the normal HTML that we always produce on our projects. The normal text file will of course be created. Once we have a canon of TEI to work with, hopefully the developers out there can start working on tools to help produce HTML or TEXT or PDF directly from the master.
Just to put people's minds at rest, I don't believe we should post XML+CSS without (at the very least) an HTML edition -- certainly not until we have agreement on a common base of XML to use, and well-tested tools to convert from this to (at the minimum) an HTML edition that displays acceptably on a wide range of browsers. Even if we do end up using a 'nonstandard' XML markup at DP, I agree with Joshua that we should try as hard as possible to ensure it can be converted easily to TEI (derivatives of which seem to be in favour around here). People at PG will not see the DP-internal markup, only our output, which will conform to the standards we will hopefully agree on at some point :). -- Jon Ingram

Even if we do end up using a 'nonstandard' XML markup at DP, I agree with Joshua that we should try as hard as possible to ensure it can be converted easily to TEI (derivatives of which seem to be in favour around here). People at PG will not see the DP-internal markup, only our output, which will conform to the standards we will hopefully agree on at some point :).
It doesn't matter if it is "non-standard XML" (of course, there is no such thing, as long as it is a well-formed XML document). Once the format is in something like XML, we are all free to create our own output from that base, including "correcting" the XML to output a different form of XML from it. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com
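The kind of "correcting" transform David mentions can be sketched as an XSLT identity transform with one override; the nonstandard <quote2> element renamed here is purely hypothetical:

```xml
<?xml version="1.0"?>
<!-- Copy everything unchanged, except rename a hypothetical
     nonstandard <quote2> element to the TEILite <q>. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity template: copy nodes and attributes as-is -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- override: rename the nonstandard element -->
  <xsl:template match="quote2">
    <q><xsl:apply-templates select="@*|node()"/></q>
  </xsl:template>
</xsl:stylesheet>
```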

Jim Tinsley wrote:
I know of no CSS for Marcello's PGTEI. Perhaps one could be crafted for it.
It already works pretty well with Jeroen's XSL: http://www.gutenberg.org/tei/examples/css/lmiss.xml I had to replace all named entities (like &mdash;) with numeric ones. I did that manually, so maybe I got some of them wrong. All quotation signs are missing because I replace quotation signs with <q> </q> and Jeroen does not. But this should be very easy to add to Jeroen's XSL. -- Marcello Perathoner webmaster@gutenberg.org
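The named-to-numeric entity replacement Marcello did by hand is easy to script; a sketch in Python, with only a small sample of the entity table (a real run would load the full ISO entity sets):

```python
import re

# A small sample of the named-to-numeric mapping;
# the full ISO Latin/Publishing entity sets have many more.
NAMED_TO_NUMERIC = {
    "mdash": "#8212",
    "ndash": "#8211",
    "ldquo": "#8220",
    "rdquo": "#8221",
    "aelig": "#230",
}

def numeric_entities(text):
    """Replace known named entities with numeric character references.

    Unknown entity names are left untouched.
    """
    def repl(match):
        name = match.group(1)
        return "&%s;" % NAMED_TO_NUMERIC.get(name, name)
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)
```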

Jim Tinsley wrote:
If I may nit-pick, I think it more correct to say that it
isn't _always_ true. That is, it is not true when there exists a CSS that works with the XML.
Jeroen provided XML like this, which I thought was very good indeed. For any of you who haven't seen it, please point your browsers to http://www.gutenberg.org/dirs/1/1/3/3/11335/11335-x/11335-x.xml which is an absolute pleasure to read. (Well, if you're a geek, that is, and if you ain't, whatcha doin. here?? :-)
I said before, and I say again, that where such an XML is provided, HTML is probably redundant. ("Probably" because a significant use of HTML is as input to PDA readers like, say, Mobipocket, and I'm not sure if they would swallow this XML without requiring a Heimlich.)
I know of no CSS for Marcello's PGTEI. Perhaps one could be crafted for it.
One additional note: before the XML of this text is rendered in your browser, it is fed through an XSLT stylesheet, which turns it into HTML, and then, to that HTML, CSS is applied. The entire process is done for you by your browser. The XML follows TEILite, and validates on a validating parser; the HTML should validate on a validating HTML parser. Marcello's PGTEI is close enough to TEI that this will probably give very decent results on his files too. He basically added a few small extensions to TEILite, which are "documented" in his well-commented DTD or XSLT sheets. (But that is stuff for specialists, really.) Jeroen.
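Concretely, the only thing that triggers this in-browser pipeline is a processing instruction at the top of the XML file (the stylesheet name here is illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The browser fetches the XSLT, transforms the TEI to HTML,
     then applies whatever CSS the generated HTML links to. -->
<?xml-stylesheet type="text/xsl" href="teilite2html.xsl"?>
<TEI.2>
  <!-- ... TEILite document ... -->
</TEI.2>
```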

Jim Tinsley wrote:
I really do not mean to be disrespectful when I -- speaking for myself -- say that I'm not interested in spending my time making developers' jobs easier. That's not what I'm here for.
On one hand you wonder why we developers are not able to come up with a solution; on the other hand you are not disposed to give one inch on your developer-unfriendly position. Your policy of not posting TEI files is at present the main roadblock. It's like requesting the final release of the product before allowing a beta test. I have been doing other (hopefully useful) work and have not looked at the TEI code for about a year now because I don't see a way to get it to work with this `moratorium' in place.
We have text, and HTML, both proven and well-supported formats that we know how to work with and for which we know there is a demand. I'll stick to those until we can see a way clear through to making successful XML.
You sure know how to work with PLAIN ALL CAPS ASCII TEXT FILES but that's not a reason to shun all progress since.
Correct spelling is necessary but not sufficient. I don't know about other people, but I most commonly find errors by skimming the text. I can't do that with XML.
After a few weeks you'll skim thru TEI like you skim thru plain text. (Use an editor that highlights the tags and use a low contrast color for the tags.)
And it may not be the way software development works, but then we're not a software development project.
But you depend on software. DP is 250,000 lines of code. If it were not for software you wouldn't have much to do.
that's not the problem. If the process we agree for teixlite is, say, run it through Saxon, then I expect to be able to run all teixlite files through Saxon, and not have a submitter say "oh, no, you must use Xalan for this file, and not just any Xalan, but one with my patch in it."
You have to use PGTEI stylesheets to convert PGTEI text. You can use them with any XSLT 1.0 compliant processor.
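For example, the same stylesheet should run under any of the usual XSLT 1.0 processors; the stylesheet and file names below are illustrative:

```shell
# xsltproc (libxslt): stylesheet first, then the document
xsltproc pgtei2html.xsl book.xml > book.html

# Saxon 6 (the XSLT 1.0 line): document first, then the stylesheet
java -jar saxon.jar book.xml pgtei2html.xsl > book.html

# Xalan-Java
java org.apache.xalan.xslt.Process -IN book.xml -XSL pgtei2html.xsl -OUT book.html
```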
You see, we appear to differ very fundamentally on one point. It's my lock and key analogy again. I do not want to start down the road of producing posted files from an XML if the transform will be, for any reason, not repeatable in a year's time, or five, or ten.
This amounts to the same as: never start at all. Remember: the first files were uppercase ascii. We *had* to do them over again. We *are* doing all pre-10K texts over again. We *will* have to do the TEI files over again, maybe more than once. That's only being realistic.
I do not want to start down the road of producing posted files from XML if an end-user who wants to -- on whatever platform -- cannot replicate the process.
Then you should also post all the scanned pages so a user can redo the OCR on her platform if she wants to. I think we can postpone this, because the user can grab the converted files. And if converting at home is an issue for him -- hey, the tools are Free Software. He can change them until they work on his platform and submit the patches to us.
For the start I will act as interim Post-Processor for people wanting to post PGTEI and pass on to you only the perfectly good ones. You'll just have to stick in the etext number where I put 5 asterisks.
No; I, at least, don't want to work with an experimental process in which each text is an exception.
Is there some qualifying exam to become a whitewasher? I ask, because by now I'm so desperate that I'm quite willing to become a whitewasher myself just to see some TEI texts posted.
Why can't we just name them .xml? I see no reason to invent extensions. _Is_ there one? Not that it matters much, just curious why you would think this a good idea.
Because there ain't such a thing as an XML file. XML is just a framework for building applications. XHTML is an XML application, SVG is an XML application, TEI is an XML application, OpenOffice file format is an XML application ... Labelling a file .xml is like labelling a Word file .bytes -- Marcello Perathoner webmaster@gutenberg.org

On Thu, Oct 21, 2004 at 12:31:59PM +0200, Marcello Perathoner wrote:
Jim Tinsley wrote:
I really do not mean to be disrespectful when I -- speaking for myself -- say that I'm not interested in spending my time making developers' jobs easier. That's not what I'm here for.
One one hand you wonder why we developers are not able to come up with a solution on the other hand you are not disposed to get one inch back on your developer-unfriendly position.
Your policy of not posting TEI files is at present the main roadblock.
It's like requesting the final release of the product before allowing a beta test.
I have been doing other (hopefully useful) work and have not looked at the TEI code for about a year now because I don't see a way to get it to work with this `moratorium' in place.
I don't understand this limitation, so will rephrase what we're waiting for. It was among the first messages in this thread.

** What we want is an automatic means of generating canonical ** documents from an XML master.

The minimums are: XML --> HTML and XML --> text (yes, it's ok to go via HTML). Displaying XML directly in a browser is not a requirement, but is nice to have. There are a few subsidiary requirements, like incorporating the header materials in a sanely marked up way (trivial with teixlite.dtd, but not unambiguous). Both Marcello and Jeroen have demonstrated techniques for these, but neither is quite ready.

We do have several XML documents online, we also have this list (also gutvol-p), and there are a couple of demonstration pages. Your claim that we need to start posting more stuff in XML in order to achieve the ** goal above does not make sense to me. I do not see the logic.

I'm personally not strongly opposed to doing all sorts of experimentation, and do NOT feel the urge to get it right from the start. I also am certain that there is not going to be a one-size-fits-all technical solution for all of our content.

I've asked both Marcello & Jeroen for updates & ideas in the past months. Maybe they did not get my messages. My belief is that there is a definite commitment at PG (including DP) to creating XML masters. I also believe that TEI-lite encoding will work well for the majority of our content.
From my point of view, I'd rather see the gutvol-d group of highly motivated & talented individuals focused on solving the remaining challenges for the solutions that Marcello and Jeroen have in place already. Arguing about whether we'll use XML is a waste of time: we will. The challenges before us are primarily technical, not policy. -- Greg
PS: No, I have not read every message in the threads over the past few days. If there's another solution somewhere, I hope someone can point it out to me.

Greg Newby <gbnewby@pglaf.org> writes:
Your claim that we need to start posting more stuff in XML in order to achieve the ** goal above does not make sense to me. I do not see the logic.
I am very interested in XML files. Please post them even if they look useless to you. -- | ,__o | _-\_<, http://www.gnu.franken.de/ke/ | (*)/'(*)

Greg Newby wrote:
I don't understand this limitation, so will rephrase what we're waiting for. It was among the first messages in this thread.
** What we want is an automatic means of generating canonical ** documents from an XML master.
The minimums are: XML --> HTML and XML --> text (yes, it's ok to go via HTML)
You already got that. There are 2 different ways to do this, both of them mature enough for beta testing. The roadblock is that a 100% correct and complete solution was requested by Jim before he considered starting to post TEI texts.

Now, we don't have a toolchain for the whitewashers that is equivalent to the one already in place for TXT and HTML files. That's why I volunteered to act as "interim" whitewasher: to manually go thru the steps needed to post a TEI file and derivative formats, to understand how this toolchain needs to be built. I will only take a few texts (maybe a dozen) from a few selected sources.

Some of the objections raised by Jim will not go away real soon. He says he cannot skim thru a TEI file like thru a TXT or HTML file. But there are at present no readers that accept TEI as a native file format. If we had to build that first (Jon Noring is trying), we will likely never start posting. I feel Jim is raising artificial objections he knows we cannot overcome. If he doesn't want to learn TEI and he doesn't feel like proofing a TEI text in emacs, fine. But then he should step aside and let other people do this work.

Now for another thing. Jim fears that we will end up with a lot of files marked up in differing TEI dialects. OTOH, the moratorium has actively encouraged this. People being eager to try TEI and there being no official place to post TEI files, everybody has posted the files they have marked up in a different place. I have been working on my dialect, Jeroen on his, and DP is cooking up another one. There is no central "clearing house" where we can see the other guys' work. I don't say it would be impossible for me to obtain a glimpse of the TEI texts the folks at DP are working on; it would just be much easier if I could get them from the archive.

At this point we need to set a signal that the TEI era has started. We don't need more discussion about whether TEI is the right language; I think we are all agreed on that.
pgxml.org is dead and ZML is good for laughs. What we need now is to compare notes, all of us who have been doing TEI, and come to an agreement on which dialect to use. That can best be reached if we all post samples of our work and try to run the other guys' markup thru our XSL, etc. -- Marcello Perathoner webmaster@gutenberg.org

People being eager to try TEI and there being no official place to post TEI files, everybody has posted the files they have marked up in a different place. I have been working on my dialect, Jeroen on his and DP is cooking up another one. There is no central "clearing house" where we can see the other guys work. I don't say it would be impossible for me to obtain a glimpse of the TEI texts the folks at DP are working on, it would just be much easier if I could get them from the archive.
Personally, I try to stick as closely to TEILite as possible. I can add extensions to it, but then can easily produce XSLT to pull out those extensions before posting. I think a few of Marcello's extensions for PGTEI are not needed, as elements exist to encode the same information, or alternative mechanisms can be devised within TEILite.

But even if you stick to pure TEILite, you will need to agree on conventions. For example, I leave in quotation marks (as I have numerous old works that deal with these in a very irregular way; turning them into <q> and </q> would be difficult), while Marcello leaves them out. We can fix an XSLT to re-supply them, and even an XSLT to supply them only if they are removed (given we agree on a standard way of documenting this fact). And that is what you need if you're working on a certain project using TEI: a gentle introduction and some guidelines. A few very nice ones are on the Net.

If people wish, I can set up a website with TEI versions of _all_ my posted texts, both in my original master SGML and converted to XML. Gives everybody something to experiment with. Jeroen.
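The re-supplying transform Jeroen mentions could look something like this: an identity transform whose only override turns each <q> element back into literal curly quotes (choosing the right quote characters per language is left out for brevity):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity copy for everything else -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- replace each <q> with its content wrapped in curly quotes -->
  <xsl:template match="q">
    <xsl:text>&#8220;</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>&#8221;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```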

On Thu, 21 Oct 2004 18:54:35 +0200, Marcello Perathoner <marcello@perathoner.de> wrote:
I feel Jim is raising artificial objections he knows we cannot overcome. If he doesn't want to learn TEI and he doesn't feel like proofing a TEI text in emacs, fine. But then, he should step aside and let other people do this work.
I find this very offensive. I came home, and was reading happily enough through the threads until this. I differ with you quite profoundly about the implementation of XML, and, I'm sure, several other issues. But my opinions are honest, and based on what I believe is best for PG as a whole. I do not "raise artificial objections" -- these are the expectations I have had for XML as far back as I can remember, and they are expectations regularly assumed, if not met, by people who evangelize XML. I "learned TEI" (not all of it, of course) with the hope of using it in PG, in late 2001/early 2002, and I marked up my first book in XML in February, 2002, which was long before I ever heard your name. If you can't accept that I am debating these issues in good faith, there is no point in continuing this discussion. jim

Jim Tinsley wrote:
I feel Jim is raising artificial objections he knows we cannot overcome. If he doesn't want to learn TEI and he doesn't feel like proofing a TEI text in emacs, fine. But then, he should step aside and let other people do this work.
I find this very offensive.
I came home, and was reading happily enough through the threads until this.
I am sorry if I spoilt your evening and I apologize for that. I said "I feel" and that's the truth. Maybe it's just my fault.
these are the expectations I have had for XML as far back as I can remember, and they are expectations regularly assumed, if not met, by people who evangelize XML.
Some of your expectations cannot be met. Some would imply an enormous expense of time on the developers' part to save relatively little time on your part.
I "learned TEI" (not all of it, of course) with the hope of using it in PG, in late 2001/early 2002, and I marked up my first book in XML in February, 2002, which was long before I ever heard your name.
I have marked up 25 books, prose, lyrics and plays. And I transformed all of them successfully to HTML, TXT, PDF and PalmDoc. That was a year ago. I could have done more but I felt that it was better to go public with what I had, to get comments and suggestions from other people. I thought if PG posted some of those files I would get comments. Since then I have been waiting. I think I have done my part. My files are done better than many I see posted. Even if we had to fix them later, the philosophy of PG did at some point expressly allow the posting of preliminary files. I cannot see why this simple request should cause so much trouble and fear today. These are some of your expectations that you should reconsider:
I really no longer give any headroom at all to the approach "Post XML Now Because That Is The One True Way And We'll Figure Out How To Read It Later." If for no other reason, then because the most important part of the WW job is to check the texts before posting, and if we can't read it, we can't find the errors, and if we can't find the errors, we can't fix 'em.
You can read a TEI file in an editor. You can spell-check it. You can validate it. You can find the errors. The process is just a bit different from what you have now, and will always be until some native TEI readers crop up.
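The checking part of that process is mechanical. A sketch of a well-formedness check in Python -- note this is only well-formedness, not validation; validating against the TEI DTD needs a validating parser such as xmllint or nsgmls:

```python
import xml.sax

def well_formed(xml_bytes):
    """Return (True, None) if the document is well-formed XML,
    else (False, parser message). DTD validation is NOT done here.
    """
    try:
        # A bare ContentHandler ignores content; the parser's default
        # error handler raises SAXParseException on any fatal error.
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler())
        return True, None
    except xml.sax.SAXParseException as e:
        return False, str(e)
```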
That process must work for _all_ teixlite files, not just ones that are specially cooked, using constraints not specified within the chosen DTD. Here's where we hit the rocks today.
Impossible. There are things you cannot specify in a DTD but that still must be followed to get a semantically correct file. (This holds for every XML application, not just for PGTEI.) You always have to obey some extra rules besides validity. These are put down in the PGTEI guide.
The only things we must have -- both for our own internal practical purposes and for the use of future readers -- is that it should work reliably on _all_ texts that conform to the XML DTD chosen, be open source, and be cross-platform. A reader needs to be able to tweak the transform and re-run on her own desktop.
Same as above. The DTD is not strict enough (RelaxNG will be better, but it's still early). There will always be valid TEI files that do not transform to `correct' output files. I don't see why it is necessary for the conversion tools to run on everybody's desktop before we can start posting files. If the tools run on pglaf.org and gutenberg.org, that is more than enough for a start. The tools can be fixed later. That won't make posted valid TEI files invalid. -- Marcello Perathoner webmaster@gutenberg.org

On Fri, Oct 22, 2004 at 03:41:18PM +0200, Marcello Perathoner wrote:
Jim Tinsley wrote:
I feel Jim is raising artificial objections he knows we cannot overcome. If he doesn't want to learn TEI and he doesn't feel like proofing a TEI text in emacs, fine. But then, he should step aside and let other people do this work.
I find this very offensive.
I came home, and was reading happily enough through the threads until this.
I am sorry if I spoilt your evening and I apologize for that.
You didn't spoil my evening; just my participation in the thread. Having been so accused of evilly blocking the righteous progress of destiny because of my own hidden agenda and neuroses, it's hard to say constructive things.

But I will say one more thing: if you read what I actually _said_ you will realize that almost everyone in this thread -- I would think -- could fairly easily create an XML and transform that meets the criteria I laid down. I could myself. So could you, or Jeroen, for sure. Josh and Jon, no problem. Anyone I've left out?

Of course, none of us could do it for ALL texts. Not yet. But it doesn't need to be done for all texts; that was explicitly stated. If somebody wants to set up a standard that works for prose texts containing Title, Author, Chapter Heads, Paragraphs, Verses, Letter Headings and Signatures -- plus emphasis and languages -- and try to work with that for a while, that would do. And which of us could NOT do that with just Xalan or Saxon, a simple XSLT, and quite a limited HTML-to-text converter?

Of course, it wouldn't handle Alice. It wouldn't handle footnotes or tables. But for books that don't need these features it would work fine. There would be some details to work out in how the PG header works with them, and maybe the XML file itself should contain a description of how the HTML and text formats were derived, so that when we fixed the texts we would know how to remake them, or so that some future reader could re-do the transform to their own tastes.

And it would be good if, having got all that straight, we could set it up and document it as a standard so that other people wouldn't need to reinvent that wheel. It may be limited, but nobody said that we have to have a standard to cover all cases before tackling any. And then the people who are interested could go on to add more features, enlarging the standard. I'm surprised, after last year, that nobody has done this already.
I'm surprised that you and Jeroen, who, in your different ways, had the best shot at XML, didn't get together on it. Certainly, Greg has been asking you both about it. It _would_ be nice if we had a few people working together on it, so we get a shared understanding and consensus. Frankly, what is going to happen is that a few people at DP are going to forge a workable standard between them. Others will take it up, and then everyone will be doing it, so personally I'm just waiting for it to happen. jim

Two points to mention before you read my actual reply: 1) My apologies for the delay in responding. A configuration issue has caused all my email to bounce from pglaf.org's mail server for the past couple weeks, and only yesterday was I able to get it rectified. I will be posting many delayed replies for the next few days. 2) While the original email that I'm replying to here was written by Marcello, my replies are not directed at him personally, but rather, at everyone in the community. ---------------------------- Of the many assumptions being made on the list these days, here are three of the most erroneous -- and they all came in a single paragraph.
We don't need more discussion about whether TEI is the right language, I think we are all agreed on that.
Not everyone agrees with the use of TEI. I'm not even going to begin my arguments again -- there would be no point. Some of us, such as myself, foresee issues, based on our own experiences, with using TEI (or a variant) for PG work. I'm resigned to the fact that it may be the only way to get XML into PG at all, so I'll just deal with the issues on my own time at that point. Simply put, TEI is one of the most verbose markup vocabularies available, and using it for PG is going to turn off a LOT of people to XML. A simpler, more concise vocabulary would be less intimidating!
pgxml.org is dead
PGXML is not dead at all. Just like other XML stuff in PG, it's been on hiatus, mostly for two reasons: 1) lack of agreement in the community, and 2) lack of personal time to work on it. I will be meeting with the other co-founder of pgxml.org (Ben Crowder) in November, after which time, we hope to present a definitive plan for pgxml.org to the community.
and ZML is good for laughs.
You can laugh at ZML all you want, but from the examples and personal discussion with Bowerbird, I have learned that ZML is not at all what most people think it is. From the examples that I have seen, ZML is basically PG vanilla text format, but cleaned up and normalized.

----------------------------

The entire rest of this email is a rant, so please feel free to skip it. You have been given the choice!

----------------------------

Maybe if you learned to listen to other people, you'd not make such erroneous assumptions. Maybe, just maybe, other people do have a clue, and you aren't the only one who knows something.

Yes, I have blocked Bowerbird from joining the PGXML list, but he is the _only_ person that I've blocked. The only reason for this is because of other people's reactions to him, not because of Bowerbird himself. While he can be irritating and annoying, Bowerbird does have a clue about some of the issues we have to deal with in PG work, particularly in converting to other formats, etc. If you don't like his attitude, ignore his posts. You can at least try to extract the useful information that he does give from the flame wars it often comes in. This way, you might actually LEARN something.

More mud is slung on this list than on ANY other list that I'm subscribed to, but I have to admit, I'm only subscribed to about 200 active lists, so I may be missing the mud-slinging ones. If you don't like the way something is being done in PG, don't throw a hissy-fit. Get off your arse and do something about it, or sit down and shut up, and let other people do what they think should be done.
I've made no secret of my personal opinions of PG:

1) the website is a disgrace
2) the archive is poorly organized
3) the catalog system is a hack job done by unqualified people
4) the PG text format is extremely disgusting
5) PG makes volunteers work uphill to get anything done
6) the lack of quality in our content offsets any gain from it

Just because I have these opinions doesn't automatically mean I A) know what I'm talking about, or B) have an alternative solution. Some of the biggest technological innovations came not from people who had a better idea, but from people who knew the current idea wasn't that good, and were open-minded when the better idea came along.

As a whole, I find that PG is not a very open-minded community. As a community, we regularly discourage people from volunteering, mostly because we don't support them well. At times, not only do we not support them, but we actively, and publicly, bash their skulls. We lie to the general public about PG on a regular basis. When posting ebooks, we ignore the wishes of the volunteers who made the texts. We don't even provide well-suited tools for the volunteers to use to improve PG, because, oh my god, maybe the tool isn't 100% open-source! Maybe the tool has been offered to PG on a perpetual right-to-use-for-PG basis, but oh, lordy, that's just not good enough. We regularly tell some of our hardest-working volunteers that they are full of crap (more or less). True, it's usually a paragraph-long description of what they are doing "wrong", but it's basically telling them their work is unwanted.

Do you realize that out of over 300 librarians I've talked to personally, the first thing that came to mind for "ebooks" was the University of Virginia's eText Library? PG simply is not a mover or a shaker in the ebook world, regardless of the hocus pocus you might hear to the contrary.

----------------------------

And, to all these things I've said, I'm a victim of some, but a perpetrator of others.
I'm just as guilty as you are. I have no soapbox, only a conscience and a desire to change things. ---------------------------- So, dangit all to heck, we don't have to be like this. PG _could_ be great. PG SHOULD be great. -- James

James Linden wrote:
pgxml.org is dead
PGXML is not dead at all.
I stand corrected. Some time ago I noticed the domain had expired, but somebody re-registered it just yesterday. But I gather it wasn't you. Who is this?
$ whois pgxml.org
Domain ID:D105038884-LROR
Domain Name:PGXML.ORG
Created On:22-Oct-2004 20:47:27 UTC
Last Updated On:22-Oct-2004 20:47:48 UTC
Expiration Date:22-Oct-2005 20:47:27 UTC
Sponsoring Registrar:Go Daddy Software, Inc. (R91-LROR)
Status:TRANSFER PROHIBITED
Registrant ID:GODA-08593143
Registrant Name:Registration Private
Registrant Organization:Domains by Proxy, Inc.
Registrant Street1:15111 N Hayden Rd., Suite 160
Registrant Street2:PMB353
Registrant Street3:
Registrant City:Scottsdale
Registrant State/Province:Arizona
Registrant Postal Code:85260
Registrant Country:US
Registrant Phone:+1.4806242599
Registrant Phone Ext.:
Registrant FAX:
Registrant FAX Ext.:
Registrant Email:PGXML.ORG@domainsbyproxy.com
...
-- Marcello Perathoner webmaster@gutenberg.org

Oh geez... I'm going to ask Ben what happened. :-( Ok, so as far as I know pgxml.org doesn't belong to us anymore, but that doesn't make it dead, just the domain will have to change. -- James

James Linden wrote: <snipped a lot of stuff>

There are two things I wanted to say to this.

One is that, other than Bowerbird, the rest of the discussions, while sometimes passionate, are not typically mean-spirited. We just have people who feel passionate about their particular vision of what PG should be (as it looks like you do as well, from your post). In the end, most of the people who write the most respect each other and each other's opinions. And discussion of the topics at hand is the only method we have of moving forward.

The second is that while a simpler XML markup, like what you loosely described as PGXML, sounds wonderful on the surface ... it means, once again, largely reinventing the wheel AND not being compatible with a standard that seems to be gaining momentum out there. Granted, the very nature of XML makes converting from a home-grown markup to TEI a possibility, but removing the need to convert would seem to be the wiser path.

Josh

The second is that while a simpler XML markup, like what you loosely described as PGXML, sounds wonderful on the surface ... it means, once again, largely reinventing the wheel
"Reinventing the wheel" is often something to be avoided, but I'm not sure it's a compelling issue here. First, there are other models to use, e.g. XHTML. Second, the most important standard is XML itself. That's what enables an incredible variety of tools and platforms; the specific DTD is much less important. (In fact, XML's designers made sure it was useful even without a DTD.) Third, TEI was created for a very different world: scholarly publishing. If PG's markup was going to be done by paid experts, TEI would probably be the best choice. But I'm not convinced it's appropriate for a volunteer organization. XML can be much simpler than HTML, yet TEI is (IMHO) more complex not less. I just finished converting The Wonderful World of Oz to PGTEI. (I'll post it on Classicosm.com once I have a chance to write up my impressions.) During my learning process, I came across an interesting comparison of Shakespeare marked up using TEI and an "ad hoc" markup used by Jon Bosak (a key inventor of XML). Though the comparison was done by a TEI advocate, I think Jon's is a much better model for our purpose. http://www.tei-c.org.uk/Sample_Manuals/mueller-main.htm A very gentle introduction to the TEI (the comparison is near the end -- look for the garish background colors)
Granted, the very nature of XML makes converting from a home-grown markup to TEI a possibility, but removing the need to convert would seem to be the wiser path.
The whole point of a master format is that PG is going to convert to other useful formats. If TEI is useful in and of itself, that can be just another conversion. -- Scott Practical Software Innovation (tm), http://ProductArchitect.com/
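The "master format, then convert" idea can be sketched in a few lines of code. This is only an illustration under assumed names: the elements `<book>`, `<chapter>`, `<head>`, and `<p>` are invented for the example, not any actual PG, PGXML, or TEI schema, and the output style (caps headings, blank-line-separated paragraphs) merely imitates PG plain text.

```python
# Sketch: converting a hypothetical minimal XML master to PG-style plain text.
# Element names (<book>, <chapter>, <head>, <p>) are invented for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """<book>
  <chapter>
    <head>Chapter 1</head>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </chapter>
</book>"""

def to_plain_text(xml_source: str) -> str:
    root = ET.fromstring(xml_source)
    lines = []
    for chapter in root.iter("chapter"):
        for elem in chapter:
            # Collapse internal whitespace from the pretty-printed source.
            text = " ".join((elem.text or "").split())
            if elem.tag == "head":
                lines.append(text.upper())   # headings in caps, PG-text style
            elif elem.tag == "p":
                lines.append(text)
            lines.append("")                 # blank line between blocks
    return "\n".join(lines).rstrip() + "\n"

print(to_plain_text(SAMPLE))
# prints:
# CHAPTER 1
#
# First paragraph.
#
# Second paragraph.
```

The point of the sketch is that the conversion logic is trivial once a master exists; a TEI output would just be another small function alongside `to_plain_text`.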

James Linden wrote:
You can laugh at ZML all you want, but from the examples and personal discussion with Bowerbird, I have learned that ZML is not at all what most people think it is. From the examples that I have seen, ZML is basically PG vanilla text format, but cleaned up and normalized.
Because Bowerbird is more enamoured of hearing himself talk than of doing any research whatsoever, ZML doesn't address the simplest issues of text markup. Bowerbird just somehow found the DP rules for formatting proofed texts and amplified them with some rather sub-optimal ad hoc extensions, like the use of tabs for marking centered text, etc. Altogether, ZML is not much better (more likely worse) than what DP is outputting right now.
Yes, I have blocked Bowerbird from joining the PGXML list, but he is the _only_ person that I've blocked. The only reason for this is because of other people's reactions to him, not because of Bowerbird himself.
That is an original way of seeing things ...
I've made no secret of my personal opinions of PG:
1) the website is a disgrace
3) the catalog system is a hack job done by unqualified people
That's what I've fixed in the last year.
4) the PG text format is extremely disgusting
That's what I tried to fix. But I ran up against 5).
5) PG makes volunteers work uphill to get anything done
-- Marcello Perathoner webmaster@gutenberg.org

I've made no secret of my personal opinions of PG:
1) the website is a disgrace
3) the catalog system is a hack job done by unqualified people
That's what I've fixed in the last year.
Yes, I have to admit that you've been working VERY hard on the site, but in my own opinion, it is still not up to par. That's not really any fault of yours, though. I wouldn't call your new catalog system "fixed", but it IS better than the old CGI one. :-)
4) the PG text format is extremely disgusting
That's what I tried to fix. But I ran up against 5).
5) PG makes volunteers work uphill to get anything done
Hey, I'm with you there. :-| -- James
participants (20)
- Andrew Sly
- Brad Collins
- Carlo Traverso
- Dave Fawthrop
- David A. Desrosiers
- Greg Newby
- Gutenberg9443@aol.com
- Holden McGroin
- James Linden
- Jeroen Hellingman
- Jim Tinsley
- Jon Noring
- Jonathan Ingram
- Joshua Hutchinson
- Karen Lofstrom
- Karl Eichwalder
- Marcello Perathoner
- Scott Lawton
- Skippi
- Steve Thomas