re: [gutvol-d] About the XML debate

brad said:
As has already been mentioned, ASCII is an encoding and plaintext is a format.
i fail to see how this distinction has any importance to the original point. the user wants the words free of markup.
And ASCII is being replaced with Unicode. Some decades from now ASCII will gradually go the way of the Dodo.
well, if you want to get into this kind of doubletalk -- which i don't because, as i just said, it has no importance -- then it is inaccurate to say that ascii is "being replaced" by unicode, since the bottom 128 characters of unicode are the same 128 ascii characters we've come to know. if we give the original poster a unicode-aware text-editor, and a file that contains no heavy markup, he will be happy. he wants the words, all the words, and nothing but the words.
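(A minimal sketch in Python of that compatibility point -- the sample string here is arbitrary:)

    # The first 128 Unicode code points are the 128 ASCII characters,
    # so a pure-ASCII file is already valid UTF-8, byte for byte.
    text = "the words, all the words, and nothing but the words"
    assert text.encode("ascii") == text.encode("utf-8")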
As for plaintext, one of the core design goals for XML is that you'll be able to open it in any text editor and read it.
ok, and now here you seem to be trying to say that an x.m.l. file is a plain-text file. it's not. it might consist of nothing more than those 128 ascii characters, but it is decidedly not a plain-text file. the original poster knows it's not plain-text. so does michael hart. most people do. including, i suspect, you. why confuse the issue?
If a file is human readable when it's opened in a text editor then it's a type of plain text.
again, this subterfuge is dishonest. first, it's inaccurate to say that an x.m.l. file is "human readable". and second, it's misleading to say it is "a type of plain text". it might be an ascii file, but it's decidedly _not_ "plain-text".
All XML does is place tags around text in order to give the text a structure that machines can understand.
you give machines far too little credit. they can be made to be far smarter than a dirt-dumb x.m.l. processor, which can _only_ be made to "understand" the structure of text _if_ it is tagged.
As long as you have a text editor, you'll be able to read XML.
let's give the original poster an x.m.l. file, and have _him_ say whether he is able to "read it". just because you can load a file into a text-editor doesn't mean you'll actually be able to figure out _how_ to edit the darn thing in the way you want. and _that_ is the real topic at hand here... these semantic games do nothing but cloud the discussion.
A good text editor can clean out all of the tags with a simple regular expression like "<[^>]*>".
ok, well at least now you're starting to talk about _issues_. but of course, you're glossing over the reality even here. the inference you are trying to get us to make is that "cleaning out all the tags" will convert an x.m.l. file into a plain-text file, magically. it won't. not in all cases anyway. not unless the x.m.l. file was created -- carefully -- with that specific conversion in mind. i've been writing a separate post that will give details on how this careful consideration and crafting must be done. (some hints: whitespace, quotemarks, and tables.)
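(A minimal sketch in Python of the pitfall described above, using a trivial, made-up fragment -- a real conversion needs per-tag rules, not one deletion pass:)

    import re

    # A hypothetical XML fragment; the element names are illustrative only.
    xml = "<p>she said, <q>hello</q>.</p><p>he left.</p>"

    # Strip every tag with a simple regular expression.
    stripped = re.sub(r"<[^>]*>", "", xml)

    print(stripped)  # -> "she said, hello.he left."
    # The </p><p> boundary carried the paragraph break, and <q>...</q>
    # carried the quotemarks a renderer would supply; deleting the tags
    # loses both, which is exactly the whitespace/quotemark problem.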
Script languages like perl, python, ruby or any other language likely to come down the pike will be able to process XML and convert it into whatever comes along in the future.
it's telling how all of the hype about x.m.l. is in the present-tense, but when you focus down to particulars, it moves to future-tense. pay attention to this, lurkers! it's a sure sign of vapor-ware!
Very few applications render XML directly (except perhaps word processors), everyone else converts it into html, pdf or other formats for display.
ask yourself why this is the case. the answer is interesting.
SGML (XML's older sister) has been around for, what, twenty years or more? And all SGML documents are easily converted into XML. XML is simpler and designed to be around as an archive format for far longer than that.
in its day, s.g.m.l. made all the same promises as x.m.l. does now. it couldn't keep them, so s.g.m.l. people had to invent a variant, so they could regenerate all their hype from scratch and reuse it. and sure enough, the public is gullible enough to believe it all again. of course, the same difficulties that thwarted s.g.m.l. back in the day -- sabotaging all their hype -- will return and bite x.m.l. in the butt. but by the time we figure out how we've been had this time around, all the x.m.l. proponents will have carted off their consultant cash...
Most people will never know about the master version in XML, they only will see the file formats they use to read books.
they'll "know about" that x.m.l. version indirectly; it will be the reason their books are so expensive. due to all that cash those consultants carted away.
XML is only a long term and safe archive format
hype and marketing.
Once you understand that XML is just plain text, you can use any software for processing text to work with it.
you can save a spreadsheet in "plain-text" form too, and then "use any software for processing" that too. but you're going to find yourself coming up short. likewise when working with an x.m.l. file in a plain-text editor; yes, it can be done, but you will find yourself coming up short. but x.m.l. people will continue telling us this untruth, because they want us to believe that x.m.l. is really simple. but it's not.
As long as there is a text editor, an XML document will never be lost.
of course, if it ain't human-readable in that form, it doesn't really matter if it "will never be lost". it won't need to be "lost" once it has been "tossed"...

***

i will repeat: make x.m.l. work if you want us to respect it. don't come and _tell_ us how wonderful it will be; show us. the proof is in the pudding. not in the hype and marketing.

-bowerbird

I do understand that unicode is the next generation of ascii, which as bowerbird pointed out includes the standard ascii characters and is enhanced from there. While we could make XML the standard, why shouldn't we just include it alongside the plaintext human readable revision without markup tags? If the work is so easy, it would be negligible effort to keep both revisions around from the initial editing.

I remember years ago when PG first started (I was purely an outside reader then) that they chose plain ascii text so as not to get mired in any particular format that might be lost. Hence the stone tablets of the computer world. To one of the people that commented on my email -- yes, I want everyone to eat rocks. I believe in PG as an archival society. The reason the format is so successful, even in this day and age, is that it is ubiquitous. It is this commonality that makes it flexible.

Do I really care if there is a separate XML revision from the plaintext? No, I do not. I don't care if we make Adobe PageMaker versions. I just don't want to lose the plaintext. PG hasn't been futurist in the sense of betting on what is going to be common in the coming decades. In any discussion where we consider replacing the plaintext revisions with XML, it needs to be asked whether PG is a futurist or an archival society.

If I ever manage to get my hands on the first edition of Dumas's Count of Monte Cristo like I want, I am not going to complain that it is in French and I cannot read French. In that sense it is not usable to me (it wouldn't matter in book form or any other), since I don't read French. But for some reason the desire to pretty up the archives with a replacement format is just that. I would have a book that has survived 150 years (give or take a decade or two) that can still be translated and worked from easily. Unicode will last at least that long. I guarantee that whichever XML revision is chosen will be replaced within 150 years and made obsolete. Unicode, on the other hand, will still be around, because it is the workhorse of a computer society.

Finally, let me leave this with a bit of my own dealing with XML. I work for a company that produces a major application; we moved from standard plaintext config files and plaintext logfiles to XML-based ones. This in turn made tweaking and troubleshooting much more difficult than it was worth. There were other problems that arose with that as well. The cumbersome activities of our development staff turned a lot of people away, and gained a lot of new customers only at great cost (I don't want to go into too much detail about my company or product). The main difference, though, is that my company is supposed to be forward-thinking and trying to keep up with the Joneses. PG has no Joneses to compete with; it is a single entity above that petty bickering. It is a beautiful idea: preserving civilization for future generations, surviving copyright laws, and making the works available.

I also believe that new books should be edited now by PG and locked in a storage vault for release at a later date, so the books themselves survive even if the print copies don't. I add that to show I don't follow a straight PG line, but I'm all about keeping this information in existence for future generations. I'm about the information and the access to it and its survival. You can beautify it all you want, but I strongly feel the essence and soul should be maintained as it is now, before we lose what makes us special.

While we could make XML the standard, why shouldn't we just include it alongside the plaintext human readable revision without markup tags?
I agree. Use the XML version as the base format, and transform that XML into plain text (or PDF, JPG, PostScript, etc.) from there. Great solution, and I believe that is what this discussion is leading to.
Do I really care if there is a separate XML revision from the plaintext? No, I do not. I don't care if we make Adobe PageMaker versions. I just don't want to lose the plaintext.
Exactly. That's what the XML version provides: one consistent base format through which all others are derived, making the final text, Adobe PageMaker, whatever... versions identical in content to the original XML version. Plain text is one of those formats, and if you prefer to read it in that format, you can do so.
Finally, let me leave this with a bit of my own dealing with XML. I work for a company that produces a major application; we moved from standard plaintext config files and plaintext logfiles to XML-based ones. This in turn made tweaking and troubleshooting much more difficult than it was worth.
Why did your company move from plain text to XML? What tools were you using to process the XML? Moving to XML "Just Because(tm)" is not a good reason to move in that direction. There's a lot of "XML is the Future" FUD flying around, and too many people are believing it. Without a solid reason for migrating to XML (as with the config files in your case), it's the wrong solution. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com

Brent Gueth wrote:
I do understand that unicode is the next generation of ascii, which as bowerbird pointed out includes the standard ascii characters and is enhanced from there.
While we could make XML the standard, why shouldn't we just include it alongside the plaintext human readable revision without markup tags?
Let me try to give the reasons why *I* started pursuing XML and see if that doesn't help allay some of your concerns.

My main involvement with PG texts comes from a DP background. I'm one of the folks that help put the PG texts in place. So my perspective is not so much from the point of reading the texts as it is producing the texts. This isn't to say I don't consider the reader, but everyone tries to scratch their own itches first, and my itches are from a producer's point of view.

When you create a PG text nowadays, most people create multiple "versions." At the most basic, people usually create the text version and an HTML version. Text because that is the minimum required at PG, and HTML because there is a lot of information that cannot be well represented by a plain text file opened in Notepad. Images are the first example that comes to mind.

Then, there are some texts which require, or practically beg for, additional "versions". We have scientific texts that really need a LaTeX master document that is rendered to PDF. Languages Other Than English (LOTE) texts require a larger character set than ASCII, so you might do a UTF-8 encoded text.

The problem is, once you've created the first version (let's say it is the UTF-8 encoded plaintext format), you now have to do the manual work for the other formats. Sometimes this is trivial, sometimes it is not. But to make matters worse, it is not uncommon to notice a typo in the HTML that you didn't fix earlier. Now, you have to go back to the other versions and make the same "fix". This very quickly becomes an organizational nightmare, as I'm sure you can imagine.

XML solves this to a large extent. I create one "master" document and then literally click a button and I get a UTF-8 encoded .txt file, a Latin-1 encoded .txt file, an ASCII encoded .txt file, an HTML encoded file, and a PDF file. I post all of them to the ww'ers in a fraction of the time. Plus, if someone down the road finds a problem in the text, the fix can be applied to the master XML and the other files can be regenerated.

We are not doing away with the .txt files you want. We are coming up with a more efficient way to create them (along with the many other document formats people want).

Oh, and yes, it is possible to create conversion routines for other formats as well. Marcello had a Palm format working at one point, if I remember correctly. An MS Reader .LIT is possible (the specs are freely available and under a free license; we just need someone to take the time to create the converter). Rocket eBook reader and others should all be possible as long as the spec for the format is freely available.

Please feel free to ask any questions you want on the subject. I'll be happy to run at the mouth all you want! ;)

Josh
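(A minimal sketch of the one-master, many-outputs workflow Josh describes, assuming hypothetical file names and XSLT stylesheets -- the actual PG/DP toolchain is not shown here:)

    from lxml import etree

    # Hypothetical master document and one stylesheet per output format.
    master = etree.parse("master.xml")

    for stylesheet, outfile in [("to-html.xsl", "book.html"),
                                ("to-text.xsl", "book.txt")]:
        transform = etree.XSLT(etree.parse(stylesheet))
        result = transform(master)
        # A fix applied to master.xml propagates to every output on rerun.
        with open(outfile, "wb") as f:
            f.write(etree.tostring(result))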

Joshua wrote: [keeping his whole reply intact]
My main involvement with PG texts comes from a DP background. I'm one of the folks that help put the PG texts in place. So my perspective is not so much from the point of reading the texts as it is producing the texts. This isn't to say I don't consider the reader, but everyone tries to scratch their own itches first, and my itches are from a producer's point of view.
When you create a PG text nowadays, most people create multiple "versions." At the most basic, people usually create the text version and an HTML version. Text because that is the minimum required at PG, and HTML because there is a lot of information that cannot be well represented by a plain text file opened in Notepad. Images are the first example that comes to mind.
Then, there are some texts which require, or practically beg for, additional "versions". We have scientific texts that really need a LaTeX master document that is rendered to PDF. Languages Other Than English (LOTE) texts require a larger character set than ASCII, so you might do a UTF-8 encoded text.
The problem is, once you've created the first version (let's say it is the UTF-8 encoded plaintext format), you now have to do the manual work for the other formats. Sometimes this is trivial, sometimes it is not. But to make matters worse, it is not uncommon to notice a typo in the HTML that you didn't fix earlier. Now, you have to go back to the other versions and make the same "fix". This very quickly becomes an organizational nightmare, as I'm sure you can imagine.
XML solves this to a large extent. I create one "master" document and then literally click a button and I get a UTF-8 encoded .txt file, a Latin-1 encoded .txt file, an ASCII encoded .txt file, an HTML encoded file, and a PDF file. I post all of them to the ww'ers in a fraction of the time. Plus, if someone down the road finds a problem in the text, the fix can be applied to the master XML and the other files can be regenerated.
We are not doing away with the .txt files you want. We are coming up with a more efficient way to create them (along with the many other document formats people want).
Oh, and yes, it is possible to create conversion routines for other formats as well. Marcello had a Palm format working at one point, if I remember correctly. An MS Reader .LIT is possible (the specs are freely available and under a free license; we just need someone to take the time to create the converter). Rocket eBook reader and others should all be possible as long as the spec for the format is freely available.
Please feel free to ask any questions you want on the subject. I'll be happy to run at the mouth all you want! ;)
Kudos! This is by far the best reply I've yet seen on the practical benefits of XML for producing structured digital texts. Cogent, simple, and to the point, backed up by real-world experience. Joshua, you might consider submitting what you wrote to David Rothman's TeleRead blog as a guest article (his blog is one of the more popular blogs on the Internet, and by far the most-read blog regarding ebooks and digital libraries). Let me know -- I will be glad to assist. Jon

Joshua Hutchinson wrote:
Marcello had a Palm format working at one point, if I remember correctly.
I dropped it because pluckering the HTML file gives you a better experience at a smaller file size. The same conversion should be possible for Pocket-PC formats, but I'm not going to buy one just to test this. -- Marcello Perathoner webmaster@gutenberg.org

On Sat, 20 Aug 2005, Joshua Hutchinson wrote:
The problem is, once you've created the first version (let's say it is the UTF-8 encoded plaintext format), you now have to do the manual work for the other formats. Sometimes this is trivial, sometimes it is not. But to make matters worse, it is not uncommon to notice a typo in the HTML that you didn't fix earlier. Now, you have to go back to the other versions and make the same "fix". This very quickly becomes an organizational nightmare, as I'm sure you can imagine.
XML solves this to a large extent. I create one "master" document and then literally click a button and I get a UTF-8 encoded .txt file, a Latin-1 encoded .txt file, an ASCII encoded .txt file, an HTML encoded file, and a PDF file. I post all of them to the ww'ers in a fraction of the time. Plus, if someone down the road finds a problem in the text, the fix can be applied to the master XML and the other files can be regenerated.
I'll add this to Josh's well-worded message. For the whitewashers and anyone doing maintenance on the PG files, having a variety of file formats to deal with can sometimes be quite a headache. Recently, I was making some corrections in a text that was in the collection in txt, htm, and rtf formats, and I can tell you that editing rtf manually is not fun. Also, a note on the example Josh mentioned above: after he submits the files, a whitewasher will review them with some automatic checking before they are posted, and any corrections will need to be made individually in each file format. Andrew

Brent Gueth wrote:
While we could make XML the standard, why shouldn't we just include it alongside the plaintext human readable revision without markup tags?
That's just what we were going to do.
To one of the people that commented on my email -- yes, I want everyone to eat rocks.
Tip: don't open a restaurant. -- Marcello Perathoner webmaster@gutenberg.org

i fail to see how this distinction has any importance to the original point. the user wants the words free of markup.
[snip]
if we give the original poster a unicode-aware text-editor, and a file that contains no heavy markup, he will be happy. he wants the words, all the words, and nothing but the words.
[snip]
ok, and now here you seem to be trying to say that an x.m.l. file is a plain-text file. it's not. it might consist of nothing more than
You spelled XML incorrectly again.
again, this subterfuge is dishonest.
first, it's inaccurate to say that an x.m.l. file is "human readable". and second, it's misleading to say it is "a type of plain text". it might be an ascii file, but it's decidedly _not_ "plain-text".
Are graphical buttons that contain letters "human readable"? What about product labels? Billboard signs? None of those are "human readable" (at least not in the sense that, say, an OCR application could decipher their meaning).
let's give the original poster an x.m.l. file, and have _him_ say whether he is able to "read it".
Sure, you can read the XML file with a browser, if you have the appropriate stylesheet that goes with it. A text editor does nothing more than "render" the text to the user's screen. Markup is the set of semantic instructions that describes exactly how that text is going to be rendered. A "text editor" that understands XML can easily make those tags invisible to the end user, or fold the sections, etc. This is all just a silly argument, and by your definition, your own wacky ZML format is not human readable either. What exactly is your point with this diatribe anyway? You're not going to save the world from XML, and you're certainly not going to convince others here who use it in their daily jobs. So what exactly is your point?
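(A small illustration of that stylesheet point -- the file and element names here are hypothetical. An xml-stylesheet processing instruction is all a browser needs to render an XML file directly:)

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/css" href="book.css"?>
    <book>
      <title>An Example Title</title>
      <p>Ordinary prose, rendered according to the rules in book.css.</p>
    </book>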
just because you can load a file into a text-editor doesn't mean you'll actually be able to figure out _how_ to edit the darn thing in the way you want.
First it was about giving a 'user' the XML file to read, and now it's about editing the file? Which is it? If you're trying to edit the file, you should be expected to have the necessary tools and skills to do so. "Users" shouldn't be expected to build software on their machines without the proper development tools and environment set up to do so. Which brings me to another point: Is source code "human readable"? It's marked up in a way that provides instructions to the user's editor and compiler. By your definition, it too can't be considered "human readable" unless we remove those instructions. Removing them, however... fundamentally changes how the "text file" is handled by the reader. And also by your definition, since an XML file is not "human readable", it must fail the test for GPL compliance. How would you provide a person with the "human readable" format of the source, to remain in compliance with that license? Would you consider XML the "machine readable" source instead?
these semantic games do nothing but cloud the discussion.
By trying to assert that XML isn't plain text, you are the one confusing the issue. Since < and > are within the 0-127 character limit, XML is actually ascii text. That means it is "plain text". You lose this argument based on your own conclusions.
the inference you are trying to get us to make is that "cleaning out all the tags" will convert an x.m.l. file into a plain-text file, magically. it won't.
It won't, because it already is a "plain-text" file. Cleaning them out just removes some of the plain text, leaving other plain text behind. Removing <this> and <that> from the text is no different from removing (this) and (that) from the text.
i've been writing a separate post that will give details how this careful consideration and crafting must be done. (some hints: whitespace, quotemarks, and tables.)
Does it pass an XML validator? Is it well-formed? If not, then it isn't XML, and it is some other plain-text format with whitespace, quotemarks and tables.
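(A minimal sketch of the well-formedness half of that test, using Python's standard library and a hypothetical file name; validation against a DTD or schema needs a separate validating parser:)

    import xml.etree.ElementTree as ET

    try:
        ET.parse("candidate.xml")
        print("well-formed XML")
    except ET.ParseError as err:
        # Mismatched tags, stray '<', bad entities, etc. all land here.
        print("not XML:", err)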
pay attention to this, lurkers! it's a sure sign of vapor-ware!
What is the vaporware? I haven't seen it yet. XML exists; it's not vaporware. I use it quite heavily to store Palm records with pilot-link. It's a great medium for atomic, record-level data in that specific case. But I'm seeing that your argument is full of hot air... or vapor, if you wish to use proper semantics. ;)
in its day, s.g.m.l. made all the same promises as x.m.l. does now. it couldn't keep them, so s.g.m.l. people had to invent a variant, so they could regenerate all their hype from scratch and reuse it.
No, SGML is completely different in goal and purpose from XML.
and sure enough, the public is gullible enough to believe it all again.
When you believe the hype that XML has anything at all to do with the "Web", then you're the gullible one. XML is an empty bucket, nothing more. It simply "holds". That's it. This whole "XML is the future of the web" business is all just hype pushed by companies trying to sell you products based on XML that intersect with the web.
of course, the same difficulties that thwarted s.g.m.l. back in the day -- sabotaging all their hype -- will return and bite x.m.l. in the butt.
You're spelling SGML and XML incorrectly again. For someone who is trying to defend what is, and what is not "plain text" or "ascii" or "unicode", you certainly don't know how to use grammar and spelling correctly. You would add significant weight to your arguments if you were able to articulate them using proper English.
they'll "know about" that x.m.l. version indirectly; it will be the reason their books are so expensive. due to all that cash those consultants carted away.
Excuse me? How does storing a textual work in XML in any way increase its price? In fact, it should dramatically decrease the "price", because it requires less handling to convert to any of a dozen or more formats. Having to recreate a work in Word, PDF, XML, text, and so on is much more "interactive" work if your base format is something other than XML. It requires much more "carbon-based" handling to maintain in those formats (not to mention additional storage, processing, and maintenance at update time).
XML is only a long term and safe archive format
hype and marketing.
And your solution is what? Your wacky ZML answer? Please.
you can save a spreadsheet in "plain-text" form too, and then "use any software for processing" that too. but you're going to find yourself coming up short.
Not by your definition of "plain text".
likewise when working with an x.m.l. file in a plain-text editor; yes, it can be done, but you will find yourself coming up short.
Funny, not a single anti-XML argument I've ever read (and I've read hundreds) has ever said "XML is hard to work with because it's not plain text". Except here, of course.
but x.m.l. people will continue telling us this untruth, because they want us to believe that x.m.l. is really simple. but it's not.
Just because you're the only one who doesn't seem to grasp the means by which XML can be used, edited, and converted does not mean the format suffers or is lacking in any way. The "X" in XML stands for Extensible. So extend it to suit your needs, or use something else. Nobody is twisting your arm.
of course, if it ain't human-readable in that form, it doesn't really matter if it "will never be lost".
Right, since XML is plain and simple and human readable, the document's contents will never be lost or buried in an unparsable format or a format that requires specialized tools to edit or maintain.
i will repeat: make x.m.l. work if you want us to respect it. don't come and _tell_ us how wonderful it will be; show us. the proof is in the pudding. not in the hype and marketing.
Have you ever read an XML file that is properly styled, in an editor that properly renders it with that styling intact? XML was not meant to be "read" by human eyes. It's a bucket; it "holds". You process it to turn it into something that can be read by humans or other machines or whatever. It is "source code" in that respect, to the "compiler" (XSLT, DOM, parsers) that is used to read it. And as much as I hate to bring it up, how many times have you openly exclaimed that you were leaving for good, and failed to do so? More hype and marketing? David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com
participants (7)
- Andrew Sly
- Bowerbird@aol.com
- Brent Gueth
- David A. Desrosiers
- Jon Noring
- Joshua Hutchinson
- Marcello Perathoner