
On Wed, Oct 20, 2004 at 09:35:11PM +0200, Marcello Perathoner wrote:
Jim Tinsley wrote:
Nobody has an objection to valid TEI texts, but valid TEI texts alone _are not enough_. An XML file that cannot be read (by an actual human) is as useful as a lock with no key.
Not so. Having a TEI text posted would enable third-party developers to come up with their own converter solutions even if we didn't get very far with ours. There are a lot of people around who already convert the text files into other formats. Their jobs would get much easier.
I really do not mean to be disrespectful when I -- speaking for myself -- say that I'm not interested in spending my time making developers' jobs easier. That's not what I'm here for. We have text, and HTML, both proven and well-supported formats that we know how to work with and for which we know there is a demand. I'll stick to those until we can see a way clear through to making successful XML.
I really no longer give any headroom at all to the approach "Post XML Now Because That Is The One True Way And We'll Figure Out How To Read It Later." If for no other reason, then because the most important part of the WW job is to check the texts before posting, and if we can't read it, we can't find the errors, and if we can't find the errors, we can't fix 'em.
A TEI text is basically a text file. So you can read it in any editor. If you use emacs you can also validate the TEI file against the DTD without leaving the editor.
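To illustrate -- this is only a minimal sketch, with TEI Lite element names and the header cut down to a stub -- a TEI file is nothing more exotic than:

    <TEI.2>
      <teiHeader>
        <fileDesc>...</fileDesc>
      </teiHeader>
      <text>
        <body>
          <div type="chapter">
            <head>Chapter I</head>
            <p>Alice was beginning to get very tired of sitting by
               her sister on the bank...</p>
          </div>
        </body>
      </text>
    </TEI.2>

Anybody can open that in a plain editor and skim it.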
A perfectly valid TEI file with no spelling errors should be good enough to post.
Correct spelling is necessary but not sufficient. I don't know about other people, but I most commonly find errors by skimming the text. I can't do that with XML. Also, the validity of the XML gives me no comfort at all that, say, paragraphs are sensibly separated. I can do that with text or HTML to a high degree of accuracy, because I can read them naturally in a viewer program. There are many such types of problems that I can detect by eye quite quickly -- provided I am seeing the text laid out in a natural way.
What you expect from us TEI developers is that we produce the 150% perfect solution before you even consider starting to post files. That is not the way software development works.
Not 150%, surely! :-) And it may not be the way software development works, but then we're not a software development project. HTML already works. TeX already works. I've spent enough of my hours trying to get XML to work; I now leave that to others.
And this attitude is, in my opinion, the main reason why we have gotten nowhere with TEI in the last 3 years.
Let's start now with a version 0.0.1 of the TEI process. Of course, at some later time we'll have to do all the posted files over again. Probably more than once. But it's better than sitting here and playing with bowerbird
. . . or vice-versa? :-) . . .
because we are bored.
Anyway, I disagree with your substantive point above. I say that until we have (or SOMEBODY has) a . . . . OK, a 90% solution, we should not post.
Next, we need a process, using open-source, cross-platform tools -- the more standard the better -- to convert that XML into, at a minimum, plain text and HTML. Other formats are welcome but optional. That process must work for _all_ teixlite files, not just ones that are specially cooked, using constraints not specified within the chosen DTD. Here's where we hit the rocks today.
TEI defines a standard way to extend the DTD. I used this standard way to extend the TEI DTD into what I called PGTEI. This still is a perfectly valid TEI DTD according to the TEI specs.
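Roughly -- and this is from memory, with placeholder file names -- the standard mechanism is to point the TEI extension parameter entities at your own files from the document's DOCTYPE:

    <!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
      <!ENTITY % TEI.prose "INCLUDE">
      <!ENTITY % TEI.extensions.ent SYSTEM "pgtei.ent">
      <!ENTITY % TEI.extensions.dtd SYSTEM "pgtei.dtd">
    ]>

The added tags and attributes live in those two extension files; the base TEI DTD itself is never touched.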
I don't want to imply specific means from which this process is to be constructed. Obviously XSLT is one possible approach, but I certainly do not want to imply limitations on what that process should use. The only things we must have -- both for our own internal practical purposes and for the use of future readers -- are that it should work reliably on _all_ texts that conform to the chosen XML DTD, be open source, and be cross-platform. A reader needs to be able to tweak the transform and re-run it on her own desktop.
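By way of illustration only -- this is a toy, not a proposal, and it assumes plain TEI Lite element names -- the sort of thing a reader should be able to open, tweak, and re-run is no more than:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>

      <!-- wrap the TEI body in a bare HTML page -->
      <xsl:template match="/">
        <html><body><xsl:apply-templates select="//body"/></body></html>
      </xsl:template>

      <!-- a TEI paragraph becomes an HTML paragraph -->
      <xsl:template match="p">
        <p><xsl:apply-templates/></p>
      </xsl:template>

      <!-- a heading becomes an h2; change it here and re-run -->
      <xsl:template match="head">
        <h2><xsl:apply-templates/></h2>
      </xsl:template>
    </xsl:stylesheet>

The real stylesheets will be much bigger, but they have to stay just as open to that kind of tinkering.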
You misunderstand what a DTD is. It just gives you syntactical correctness. I can cook up a perfectly valid XHTML file which is semantically bogus:
<div><h6>1</h6> <div><h5>1.1</h5> <div><h4>1.1.1</h4> ... </div> </div> </div>
This is valid HTML (I didn't bother to check), but it will not render well.
You cannot build a conversion tool that will produce good results on all syntactically valid TEI files, just as you cannot build a browser that will make sense of semantically bogus HTML files.
I think one of us is not understanding the other, or perhaps both of us are. I'm pretty sure I did not misunderstand what a DTD is. I do understand that a valid XML file is merely syntactically correct. This is actually the same point I made above: the fact that the XML is valid does not mean that paragraph breaks are in the right place -- which is one of the reasons why I must be able to convert it into something I can read in order to check it.

I certainly do not require a conversion tool that will correct misplaced paragraph marks (though it would be nice! :-) -- I just require that the process for, say, teixlite will work reliably on all teixlite files; that it will produce syntactically valid HTML, and, I suppose you might reasonably say, "syntactically valid" text. Actually, now that I say that, I recall a case where syntactically valid XML made invalid HTML through a bug. Anyway, that's not the problem.

If the process we agree on for teixlite is, say, "run it through Saxon", then I expect to be able to run all teixlite files through Saxon, and not have a submitter say "oh, no, you must use Xalan for this file, and not just any Xalan, but one with my patch in it." I have no objection to requiring, say, a patched version of Saxon, but if so I expect that patched version to be stable, to work for all submitted teixlite files, to be open source, and to be cross-platform.
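To make that concrete: what I expect is one repeatable invocation, something along the lines of

    java -jar saxon.jar -o alice30.html alice30.xml tei2html.xsl

where the file and stylesheet names above are only placeholders. The point is that the same command, with the same publicly available tools, works for every submitted file on every platform.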
Furthermore, TEI is geared towards marking up existing texts, so scholars can study the text without having to get the physical book. It is not so good as a master format for print processing. That's why I had to add some more tags and attributes to my DTD. (Which doesn't make any text that uses my DTD less standard, because TEI is expressly designed to be extensible. But I'm repeating myself.)
And just re-reading that last, when I say "must work reliably on ALL texts" I do not mean to imply that the same XSLT must be used for all texts, though obviously that would be of benefit, if we can manage it.
So why not start posting texts marked up in PGTEI, which will by definition work well in my conversion chain?
I think we were very close to that a year and a half ago. I had a request in to you to fix the "blockquote" thing, and Greg had laid down the requirements for the license. If anyone has followed up on any of that, they didn't copy me on it.

Does anyone apart from you favor using PGTEI? In principle, of course, it doesn't matter, but in practice we really couldn't cope with multiple XSLT conversion methods all happening at the same time. Your chain was, at least, rather difficult to implement; I haven't checked whether it still is. Can it be implemented on a Mac? On Win32? Is there a stable tarball somewhere?

You see, we appear to differ very fundamentally on one point. It's my lock-and-key analogy again. I do not want to start down the road of producing posted files from XML if the transform will not, for any reason, be repeatable in a year's time, or five, or ten. I do not want to start down that road if an end-user who wants to -- on whatever platform -- cannot replicate the process. I think you don't care about this, or at least it's not a priority for you, but it is one for me.
And at the same time start posting Jeroen's texts, which will convert fine in his chain?
What we said last year still holds: we need somebody -- who is not me, not any of us WWs -- to create the process. The one that I defined in my earlier posting today. When we've got that, stable and documented, or at least understood, I really think we can proceed. But _I_, at least, have not got the time to spend experimenting, and I _know_ that David Widger doesn't.
This way we could both start putting up an automatic online conversion chain. (The guy who did this already in Java has somehow vanished, so I think we have to start over again.)
To start with, I will act as interim Post-Processor for people wanting to post PGTEI and pass on to you only the perfectly good ones. You'll just have to stick in the etext number where I put 5 asterisks.
No; I, at least, don't want to work with an experimental process in which each text is an exception. I want a process in which the text comes in, I add the header, I run the conversion process and I check the resulting files. If we can't get to that point, I don't, as I said before, want to spend time on it. If _you_ can do this, then there is no reason, given a stable process, why _I_ can't. When somebody gets to this point, please let me know.
I claim the .pgtei file extension, and Jeroen can claim whatever extension he sees fit for his files. So we can have both an alice30.pgtei and an alice30.jtei.
Why can't we just name them .xml? I see no reason to invent extensions. _Is_ there one? Not that it matters much, just curious why you would think this a good idea.

jim