Re: [gutvol-d] ANNOUNCEMENT: XML has hit the PG archives! (gutvol-d Digest, Vol 13, Issue 22)

24 Aug 2005

      Joshua Hutchinson <joshua@hutchinson.net> wrote:
...
There is a concept that Marcello and I have discussed of markup 
"levels".  When it comes to something like TEI, there are so many ways 
you can add meta data it is completely daunting at times.  In this 
example, yes, a more specific markup could have been used.  But, in 
the final render, it works just fine as <p> blocks.
But, you see, it doesn't. On my PocketPC, and in all dedicated e-book 
programs on my desktop computer, <p> elements (in HTML of course) have 
the first line indented, and there is no blank space between paragraphs. 
In this case your glossary just looks odd: there is an indented word or 
phrase, then another indented word or phrase, then another indented word 
or phrase, most of which are not complete sentences (although every once 
in a while there is a complete sentence thrown in just to confuse me). 
There is no typographical convention that indicates which is the term 
and which is the gloss. Of course I can figure it out relatively easily, 
but to do so I have to exit "immersive reading" mode and go into "copy 
editing" mode, which is disruptive to my reading experience.

Had you created the list as a "<list 
type="gloss"><label>word</label><item>definition</item>", your XSL 
script could have transformed it into "<ul><li><strong>word</strong>: 
definition</li>", and preserved all the typographical conventions I 
have come to expect. As it is, these fragments are identified as 
paragraphs, and XSL scripts and CSS style sheets have to treat them in 
the same way they treat real paragraphs.
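
The transformation described above can be sketched roughly as follows. This is only an illustration, written in Python with the standard xml.etree.ElementTree module rather than XSL. The function name gloss_list_to_html and the sample entries are made up for the example, and it assumes each <label> is immediately followed by one <item> containing plain text only.

```python
import xml.etree.ElementTree as ET


def gloss_list_to_html(tei_list):
    """Turn a <list type="gloss"> element into an HTML <ul> element.

    Hypothetical helper: assumes alternating <label>/<item> children
    whose content is plain text (no nested markup).
    """
    ul = ET.Element("ul")
    li = None
    for child in tei_list:
        if child.tag == "label":
            # Each <label> starts a new entry; the term goes in <strong>.
            li = ET.SubElement(ul, "li")
            strong = ET.SubElement(li, "strong")
            strong.text = (child.text or "").strip()
        elif child.tag == "item" and li is not None:
            # The gloss follows the term, separated by a colon.
            li[0].tail = ": " + (child.text or "").strip()
            li = None
    return ul


# Made-up sample data, not taken from any PG text.
sample = (
    '<list type="gloss">'
    "<label>word</label><item>definition</item>"
    "<label>score</label><item>twenty</item>"
    "</list>"
)
html = ET.tostring(gloss_list_to_html(ET.fromstring(sample)), encoding="unicode")
print(html)
```

The point is that the term and the gloss can be told apart only because the source says which is which; no script can recover that from a run of <p> blocks.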

I have never understood the resistance to the notion that text blocks 
with indeterminable semantics should be identified as <div> rather than 
<p>. It's a simple change in mindset. When in doubt, use <div>. Look up 
the definition of the word 'paragraph' in a dictionary. If you don't 
think you could convince your English (or any other language for that 
matter) teacher that the block of text satisfies the definition, use 
<div> (or some other more appropriate element), not <p>.

This, I think, is a good example of the distinction between correctness 
and validity. I could mark up a phrase as "Four <term 
id="score">score</term> and <gloss target="score">seven</gloss> years 
ago ..." and it would be valid, although it would not be correct, as the 
word 'seven' is not really a gloss for the word 'score'.

There are times when it is valuable to know that a certain block of text 
is, in fact, a paragraph. Suppose, for example, that someone might want 
to create an annotation to accompany <title>The Kitáb-i-Aqdas</title>. 
He or she might want to preface some text with "if you look at the 
second paragraph following header 77 ..." If the user has a dt edition, 
finding this passage is fairly easy: flip through the book for something 
that looks like a header and is numbered 77, and count the paragraphs 
that follow. If there are only a few paragraphs after header 77 and 
before header 78 this is quite easy. If you're looking for the 935th 
paragraph following header 77 it can quickly become tedious. Luckily for 
us, tedious is something that computers do very well. Unluckily for us, 
Bowerbird has not yet released his algorithm for determining whether a 
block of unstructured text is a paragraph. So for today, we must rely on 
the coders to correctly identify which blocks are paragraphs, and, just 
as importantly, which blocks are not paragraphs. If every indeterminate 
block of text is marked as a paragraph, then the value of the <p> tag is 
lost; it has just become a synonym for <div>, and is redundant. As the 
pointed man in the pointless forest said to Oblio, "A point in every 
direction is as good as no point at all."
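
As a rough sketch of the "tedious" part, here is how a program might find the Nth paragraph after a numbered header. It is Python using the standard xml.etree.ElementTree module. The flat layout (header and blocks as siblings under one parent), the n attribute on the header, and the function name nth_paragraph_after are all assumptions made for the illustration, not the structure of any actual PG file.

```python
import xml.etree.ElementTree as ET


def nth_paragraph_after(body, header_n, nth):
    """Return the nth <p> following <head n=header_n>, or None.

    Counting stops at the next <head>. Blocks tagged <div> are skipped,
    so the count is right only if non-paragraphs were not tagged <p>.
    """
    counting = False
    seen = 0
    for block in body:
        if block.tag == "head":
            if counting:
                return None  # reached the next header first
            counting = block.get("n") == header_n
        elif counting and block.tag == "p":
            seen += 1
            if seen == nth:
                return block
    return None


# Made-up sample in the assumed flat layout.
sample = """<body>
<head n="77">77</head>
<div>A fragment that is not a paragraph</div>
<p>First real paragraph.</p>
<p>Second real paragraph.</p>
<head n="78">78</head>
<p>Belongs to the next header.</p>
</body>"""
body = ET.fromstring(sample)
found = nth_paragraph_after(body, "77", 2)
print(found.text)
```

Had the fragment after header 77 been tagged <p>, the same call would have returned the first real paragraph instead of the second, and the annotation would point at the wrong text.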

So, if being conscientious about only identifying as a paragraph that 
text which really _is_ a paragraph adds value to the file (perhaps not 
to you, but to someone, and if you didn't want this file to be useful to 
someone else you wouldn't be doing it in the first place), and if it is 
just as easy to be discriminating about paragraphs as it is _not_ to be, 
why not do it?
...
Another example is a text with foreign words interspersed throughout.  
Often, those words would be printed in italics in the original book.  
Now, the simplest markup in TEI would be to put <hi 
rend="italics">around</hi> the word.  But you could also mark the word 
with a <foreign lang="en">foreign</foreign> tag.  In the final render, 
it would look exactly the same, but the second option provides more 
specific metadata.  You could even go further by providing a translation 
of the foreign word inside the attribute (the markup escapes me at the 
moment).
The markup that would cover what PG currently has would be what I 
would call a "level one markup" and that is the minimum, obviously, 
that a TEI could be marked to.  Level two would be given a little more 
metadata, but nothing drastic.  Maybe marking certain words as foreign 
instead of italics.  Marking a letter as such instead of just a block 
of indented paragraphs.  etc. etc.
Level three would be going the extra, extra mile.  It's the kind of 
markup I don't expect to see, but is possible in TEI.
I can completely agree with this notion of markup levels, but it seems 
to me that the thing that should distinguish the levels is completeness, 
not correctness. Documents at every level should be correct, even if not 
complete. In your example, the use of the <hi> tag tells the user (or 
more accurately, his or her software agent) "this text was italicized in 
the original text, but I am unable or unwilling to tell you why." The 
markup is incomplete, but it is not incorrect. If you were to mark up a 
block quotation with the <div> tag you are telling the user agent "this 
text was set aside as a block in the original text, but I am unable or 
unwilling to tell you why." I can live with that. But if you mark up a 
block of text with the <p> tag you are telling the user agent "this 
block of text contains one or more complete sentences, and deals with a 
single thought or topic or quotes one speaker's continuous words." 
Marking up a definition term as a paragraph is as incorrect as marking 
up the word 'seven' as a gloss for the word 'score.'

Please don't let the reasonable need to tolerate incompleteness become 
an excuse for incorrectness.
...
I expect most TEI documents we post will fall in level one or level two.
Josh

Lee Passey