
Joshua Hutchinson <joshua@hutchinson.net> wrote:
There is a concept that Marcello and I have discussed of markup "levels". When it comes to something like TEI, there are so many ways you can add metadata that it is completely daunting at times. In this example, yes, a more specific markup could have been used. But, in the final render, it works just fine as <p> blocks.
But, you see, it doesn't. On my PocketPC, and in all dedicated e-book programs on my desktop computer, <p> elements (in HTML of course) have the first line indented, and there is no blank space between paragraphs. In this case your glossary just looks odd: there is an indented word or phrase, then another indented word or phrase, then another indented word or phrase, most of which are not complete sentences (although every once in a while there is a complete sentence thrown in just to confuse me). There is no typographical convention that indicates which is the term and which is the gloss. Of course I can figure it out relatively easily, but to do so I have to exit "immersive reading" mode and go into "copy editing" mode, which is disruptive to my reading experience.
Had you created the list as "<list type="gloss"><label>word</label><item>definition</item></list>", your XSL script could have transformed it into "<ul><li><strong>word</strong>: definition</li></ul>", and preserved all the typographical conventions I have come to expect. As it is, these fragments are identified as paragraphs, and XSL scripts and CSS style sheets have to treat them in the same way they treat real paragraphs. I have never understood the resistance to the notion that text blocks with indeterminable semantics should be identified as <div> rather than <p>. It's a simple change in mindset. When in doubt, use <div>. Look up the definition of the word 'paragraph' in a dictionary. If you don't think you could convince your English (or any other language for that matter) teacher that the block of text satisfies the definition, use <div> (or some other more appropriate element), not <p>.
This, I think, is a good example of the distinction between correctness and validity. I could mark up a phrase as "Four <term id="score">score</term> and <gloss target="score">seven</gloss> years ago ..." and it would be valid, although it would not be correct, as the word 'seven' is not really a gloss for the word 'score'.
There are times when it is valuable to know that a certain block of text is, in fact, a paragraph. Suppose, for example, that someone wants to create an annotation to accompany <title>The Kitáb-i-Aqdas</title>. He or she might want to preface some text with "if you look at the second paragraph following header 77 ..." If the user has a dt edition, finding this passage is fairly easy: flip through the book for something that looks like a header and is numbered 77, and count the paragraphs that follow. If there are only a few paragraphs after header 77 and before header 78 this is quite easy. If you're looking for the 935th paragraph following header 77 it can quickly become tedious. Luckily for us, tedious is something that computers do very well. Unluckily for us, Bowerbird has not yet released his algorithm for determining whether a block of unstructured text is a paragraph. So for today, we must rely on the coders to correctly identify which blocks are paragraphs, and, just as importantly, which blocks are not paragraphs.
If every indeterminate block of text is marked as a paragraph, then the value of the <p> tag is lost; it has just become a synonym for <div>, and is redundant. As the pointed man in the pointless forest said to Oblio, "A point in every direction is as good as no point at all." So, if being conscientious about identifying as a paragraph only that text which really _is_ a paragraph adds value to the file (perhaps not to you, but to someone, and if you didn't want this file to be useful to someone else you wouldn't be doing it in the first place), and if it is just as easy to be discriminating about paragraphs as it is _not_ to be, why not do it?
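To make the "second paragraph following header 77" lookup concrete, here is a minimal sketch in Python. The sample document, the use of <h2> for numbered headers, and the function name nth_paragraph_after are all assumptions made for illustration; they are not taken from any actual PG file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the header numbering and element names are assumptions.
SAMPLE = """
<body>
  <h2>77</h2>
  <div>A gloss fragment, not a paragraph.</div>
  <p>First real paragraph after header 77.</p>
  <p>Second real paragraph after header 77.</p>
  <h2>78</h2>
  <p>First paragraph after header 78.</p>
</body>
"""

def nth_paragraph_after(root, header_text, n, header_tag="h2"):
    """Return the text of the n-th <p> (1-based) that follows the header whose
    text equals header_text, stopping at the next header. None if not found."""
    in_section = False
    count = 0
    for child in root:  # direct children, in document order
        if child.tag == header_tag:
            if in_section:
                break  # reached the next header: the section is over
            in_section = (child.text or "").strip() == header_text
        elif in_section and child.tag == "p":
            count += 1
            if count == n:
                return "".join(child.itertext()).strip()
    return None

root = ET.fromstring(SAMPLE)
print(nth_paragraph_after(root, "77", 2))
```

The count is only as good as the markup. Had the gloss fragment in this sample been tagged <p> instead of <div>, the same call would have returned the first real paragraph rather than the second.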
Another example is a text with foreign words interspersed throughout. Often, those words would be printed in italics in the original book. Now, the simplest markup in TEI would be to put <hi rend="italics">around</hi> the word. But you could also mark the word with a <foreign lang="en">foreign</foreign> tag. In the final render, it would look exactly the same, but the second option provides more specific metadata. You could even go further by providing a translation of the foreign word inside the attribute (the markup escapes me at the moment).
The markup that would cover what PG currently has would be what I would call a "level one markup", and that is the minimum, obviously, to which a TEI document could be marked up. Level two would add a little more metadata, but nothing drastic: maybe marking certain words as foreign instead of italics, or marking a letter as such instead of just a block of indented paragraphs, etc.
Level three would be going the extra, extra mile. It's the kind of markup I don't expect to see, but is possible in TEI.
I can completely agree with this notion of markup levels, but it seems to me that the thing that should distinguish the levels is completeness, not correctness. Documents at every level should be correct, even if not complete. In your example, the use of the <hi> tag tells the user (or more accurately, his or her software agent) "this text was italicized in the original text, but I am unable or unwilling to tell you why." The markup is incomplete, but it is not incorrect. If you were to mark up a block quotation with the <div> tag you are telling the user agent "this text was set aside as a block in the original text, but I am unable or unwilling to tell you why." I can live with that. But if you mark up a block of text with the <p> tag you are telling the user agent "this block of text contains one or more complete sentences, and deals with a single thought or topic or quotes one speaker's continuous words." Marking up a definition term as a paragraph is as incorrect as marking up the word 'seven' as a gloss for the word 'score.' Please don't let the reasonable need to tolerate incompleteness become an excuse for incorrectness.
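For the glossary case, the gloss-list-to-HTML transformation mentioned earlier can be sketched as follows. This is a Python stand-in for the XSL script, not the script itself; the sample terms, their definitions, and the function name gloss_list_to_html are hypothetical, and any inline markup inside a label or item is simply flattened to text.

```python
import xml.etree.ElementTree as ET

# Hypothetical glossary fragment in the TEI shape discussed above.
TEI_GLOSS = """
<list type="gloss">
  <label>score</label><item>a group of twenty</item>
  <label>gloss</label><item>a brief explanation of a term</item>
</list>
"""

def gloss_list_to_html(tei_list):
    """Turn <list type="gloss"> into <ul><li><strong>term</strong>: def</li></ul>."""
    ul = ET.Element("ul")
    children = list(tei_list)
    # Pair each <label> with the <item> that immediately follows it.
    for label, item in zip(children, children[1:]):
        if label.tag == "label" and item.tag == "item":
            li = ET.SubElement(ul, "li")
            strong = ET.SubElement(li, "strong")
            strong.text = "".join(label.itertext()).strip()
            # The definition follows the closing </strong> inside the <li>.
            strong.tail = ": " + "".join(item.itertext()).strip()
    return ul

html = gloss_list_to_html(ET.fromstring(TEI_GLOSS))
print(ET.tostring(html, encoding="unicode"))
```

Because the term and the definition are identified separately in the source, the output can set the term in bold and keep the definition on the same line. A run of <p> blocks gives the transformation nothing to distinguish them by.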
I expect most TEI documents we post will fall in level one or level two.
Josh