
On Wed, Oct 20, 2004 at 09:35:11PM +0200, Marcello Perathoner wrote:
Jim Tinsley wrote:
Nobody has an objection to valid TEI texts, but valid TEI texts alone _are not enough_. An XML file that cannot be read (by an actual human) is as useful as a lock with no key.
Not so. Having a TEI text posted would enable third-party developers to come up with their own converter solutions even if we didn't get very far with ours. There are a lot of people around who already convert the text files into other formats. Their jobs would get much easier.
I really do not mean to be disrespectful when I -- speaking for myself -- say that I'm not interested in spending my time making developers' jobs easier. That's not what I'm here for. We have text, and HTML, both proven and well-supported formats that we know how to work with and for which we know there is a demand. I'll stick to those until we can see a way clear through to making successful XML.
I really no longer give any headroom at all to the approach "Post XML Now Because That Is The One True Way And We'll Figure Out How To Read It Later." If for no other reason, then because the most important part of the WW job is to check the texts before posting, and if we can't read it, we can't find the errors, and if we can't find the errors, we can't fix 'em.
A TEI text is basically a text file. So you can read it in any editor. If you use emacs you can also validate the TEI file against the DTD without leaving the editor.
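To illustrate -- this is only a minimal sketch, with TEI Lite element names and the header cut down to a stub -- a TEI file is nothing more exotic than:

    <TEI.2>
      <teiHeader>
        <fileDesc>...</fileDesc>
      </teiHeader>
      <text>
        <body>
          <div type="chapter">
            <head>Chapter I</head>
            <p>Alice was beginning to get very tired of sitting by
               her sister on the bank...</p>
          </div>
        </body>
      </text>
    </TEI.2>

Anybody can open that in a plain editor and skim it.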
A perfectly valid TEI file with no spelling errors should be good enough to post.
Correct spelling is necessary but not sufficient. I don't know about other people, but I most commonly find errors by skimming the text. I can't do that with XML. Also, the validity of the XML gives me no comfort at all that, say, paragraphs are sensibly separated. I can do that with text or HTML to a high degree of accuracy, because I can read them naturally in a viewer program. There are many such types of problems that I can detect by eye quite quickly -- provided I am seeing the text laid out in a natural way.
What you expect from us TEI developers is that we produce the 150% perfect solution before you even consider starting to post files. That is not the way software development works.
Not 150%, surely! :-) And it may not be the way software development works, but then we're not a software development project. HTML already works. TeX already works. I've spent enough of my hours trying to get XML to work; I now leave that to others.
And this attitude is, in my opinion, the main reason why we have gotten nowhere with TEI in the last 3 years.
Let's start now with a version 0.0.1 of the TEI process. Of course, at some later time we'll have to do all the posted files over again. Probably more than once. But it's better than sitting here and playing with bowerbird
. . . or vice-versa? :-) . . .
because we are bored.
Anyway, I disagree with your substantive point above. I say that until we have (or SOMEBODY has) a . . . . OK, a 90% solution, we should not post.
Next, we need a process, using open-source, cross-platform tools -- the more standard the better -- to convert that XML into, at a minimum, plain text and HTML. Other formats are welcome but optional. That process must work for _all_ teixlite files, not just ones that are specially cooked, using constraints not specified within the chosen DTD. Here's where we hit the rocks today.
TEI defines a standard way to extend the DTD. I used this standard way to extend the TEI DTD into what I called PGTEI. This still is a perfectly valid TEI DTD according to the TEI specs.
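Roughly -- and this is from memory, with placeholder file names -- the standard mechanism is to point the TEI extension parameter entities at your own files from the document's DOCTYPE:

    <!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
      <!ENTITY % TEI.prose "INCLUDE">
      <!ENTITY % TEI.extensions.ent SYSTEM "pgtei.ent">
      <!ENTITY % TEI.extensions.dtd SYSTEM "pgtei.dtd">
    ]>

The added tags and attributes live in those two extension files; the base TEI DTD itself is never touched.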
I don't want to imply specific means from which this process is to be constructed. Obviously XSLT is one possible approach, but I certainly do not want to imply limitations on what that process should use. The only things we must have -- both for our own internal practical purposes and for the use of future readers -- are that it should work reliably on _all_ texts that conform to the chosen XML DTD, be open source, and be cross-platform. A reader needs to be able to tweak the transform and re-run it on her own desktop.
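By way of illustration only -- this is a toy, not a proposal, and it assumes plain TEI Lite element names -- the sort of thing a reader should be able to open, tweak, and re-run is no more than:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>

      <!-- wrap the TEI body in a bare HTML page -->
      <xsl:template match="/">
        <html><body><xsl:apply-templates select="//body"/></body></html>
      </xsl:template>

      <!-- a TEI paragraph becomes an HTML paragraph -->
      <xsl:template match="p">
        <p><xsl:apply-templates/></p>
      </xsl:template>

      <!-- a heading becomes an h2; change it here and re-run -->
      <xsl:template match="head">
        <h2><xsl:apply-templates/></h2>
      </xsl:template>
    </xsl:stylesheet>

The real stylesheets will be much bigger, but they have to stay just as open to that kind of tinkering.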
You misunderstand what a DTD is. It just gives you syntactical correctness. I can cook up a perfectly valid XHTML file which is semantically bogus:
<div><h6>1</h6> <div><h5>1.1</h5> <div><h4>1.1.1</h4> ... </div> </div> </div>
This is valid HTML (I didn't bother to check), but it will not render well.
You cannot build a conversion tool that will produce good results on all syntactically valid TEI files, just as you cannot build a browser that will make sense of semantically bogus HTML files.
I think one of us is not understanding the other, or perhaps both of us are. I'm pretty sure I did not misunderstand what a DTD is. I do understand that a valid XML file is merely syntactically correct. This is actually the same point I made above: the fact that the XML is valid does not mean that paragraph breaks are in the right place -- which is one of the reasons why I must be able to convert it into something I can read in order to check it.

I certainly do not require a conversion tool that will correct misplaced paragraph marks (though it would be nice! :-) -- I just require that the process for, say, teixlite will work reliably on all teixlite files; that it will produce syntactically valid HTML, and, I suppose you might reasonably say, "syntactically valid" text. Actually, now that I say that, I recall a case where syntactically valid XML made invalid HTML through a bug. Anyway, that's not the problem.

If the process we agree on for teixlite is, say, "run it through Saxon", then I expect to be able to run all teixlite files through Saxon, and not have a submitter say "oh, no, you must use Xalan for this file, and not just any Xalan, but one with my patch in it." I have no objection to requiring, say, a patched version of Saxon, but if so I expect that patched version to be stable, to work for all submitted teixlite files, to be open source, and to be cross-platform.
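To make that concrete: what I expect is one repeatable invocation, something along the lines of

    java -jar saxon.jar -o alice30.html alice30.xml tei2html.xsl

where the file and stylesheet names above are only placeholders. The point is that the same command, with the same publicly available tools, works for every submitted file on every platform.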
Furthermore, TEI is geared towards marking up existing texts, so scholars can study the text without having to get the physical book. It is not so good as a master format for print processing. That's why I had to add some more tags and attributes to my DTD. (Which doesn't make any text that uses my DTD less standard, because TEI is expressly designed to be extensible. But I'm repeating myself.)
And just re-reading that last, when I say "must work reliably on ALL texts" I do not mean to imply that the same XSLT must be used for all texts, though obviously that would be of benefit, if we can manage it.
So why not start posting texts marked up in PGTEI, which will by definition work well in my conversion chain?
I think we were very close to that a year and a half ago. I had a request in to you to fix the "blockquote" thing, and Greg had laid down the requirements for the license. If anyone has followed up on any of that, they didn't copy me on it.

Does anyone apart from you favor using PGTEI? In principle, of course, it doesn't matter, but in practice we really couldn't cope with multiple XSLT conversion methods all happening at the same time. Your chain was, at least, rather difficult to implement; I haven't checked whether it still is. Can it be implemented on a Mac? On Win32? Is there a stable tarball somewhere?

You see, we appear to differ very fundamentally on one point. It's my lock-and-key analogy again. I do not want to start down the road of producing posted files from XML if the transform will not, for any reason, be repeatable in a year's time, or five, or ten. I do not want to start down that road if an end-user who wants to -- on whatever platform -- cannot replicate the process. I think you don't care about this, or at least it's not a priority for you, but it is one for me.
And at the same time start posting Jeroen's texts, which will convert fine in his chain?
What we said last year still holds: we need somebody -- who is not me, not any of us WWs -- to create the process. The one that I defined in my earlier posting today. When we've got that, stable and documented, or at least understood, I really think we can proceed. But _I_, at least, have not got the time to spend experimenting, and I _know_ that David Widger doesn't.
This way we could both start putting up an automatic online conversion chain. (The guy who did this already in Java has somehow vanished, so I think we have to start over again.)
To start with, I will act as interim Post-Processor for people wanting to post PGTEI and pass on to you only the perfectly good ones. You'll just have to stick in the etext number where I put 5 asterisks.
No; I, at least, don't want to work with an experimental process in which each text is an exception. I want a process in which the text comes in, I add the header, I run the conversion process and I check the resulting files. If we can't get to that point, I don't, as I said before, want to spend time on it. If _you_ can do this, then there is no reason, given a stable process, why _I_ can't. When somebody gets to this point, please let me know.
I claim the .pgtei file extension, and Jeroen can claim whatever extension he sees fit for his files. So we can have both an alice30.pgtei and an alice30.jtei.
Why can't we just name them .xml? I see no reason to invent extensions. _Is_ there one? Not that it matters much, just curious why you would think this a good idea.

jim