Re: Re: [gutvol-d] ANNOUNCEMENT: XML has hit the PG archives! (gutvol-d Digest, Vol 13, Issue 19)

Joshua Hutchinson <joshua@hutchinson.net> wrote:
Lee Passey wrote:
[snip]
As Mr. Noring is always quick to point out, XML files can be viewed natively in both Firefox and IE6 when accompanied by appropriate style sheets, so I attempted to open this file directly in both of these browsers.
While this is true, our TEI files are specifically meant as a master document and NOT as a viewing document. They will NOT parse in any browser "out of the box". As you've seen, you can jury-rig things to the point where it is usable, but that is not our intention. We provide the HTML files directly for people that want to browse the file in IE or Firefox.
I understand that creating a file format which could be viewed without further processing was not your intention, but now that we have some evidence that suggests that it is a real possibility is there any reason _not_ to pursue that possibility, especially if it only requires adding three lines to the source (and making sure that all the DTDs are accessible)? [snip]
I have no solution to this problem, except to suggest that named entities simply be avoided in favor of numeric entities, at least in the short term (I do note that the etext 16523-x.xml does not contain any named entities).
I personally prefer numeric entities, as well, but for the more common ones, the conversion process will support named entities in the .tei file. Most of them appear as unicode in the HTML, so it typically isn't an issue in the final product.
You are correct; so long as you are relying on conversion to HTML (or some other file format) before the file is used, there should be no problem (so long as the conversion utility can get to the correct .ent files). Use of named entities is only a problem if you are attempting to display the TEI-XML directly. [snip]
It appears that the file is latin-1 encoded, despite the fact that the XML declaration claims that it is utf-8 encoded. This caused Firefox some grief as it tried to utf-8-decode some latin-1 accented vowels.
I may be wrong here (Marcello is my unicode guru), but I thought UTF-8 was a superset of Latin1? Anyway, I know in this particular file there are quite a few UTF-8 encoded characters (and a couple more that should be that we found yesterday backchannel).
UTF-8 and Latin-1 (aka ISO-8859-1) are both encoding methods. They share the same codepoints (the value of an acute 'e' is 233 in both encodings) but they turn those values into bytes differently. Neither is a superset or subset of the other.
Values from 0 to 127 are the same in both encodings, but values from 128 to 255 are encoded in a single byte in Latin-1 whereas those same values are encoded in two bytes in UTF-8. Values above 255 are represented in two or more bytes in UTF-8 (up to 4 under the current definition in RFC 3629; older definitions allowed up to 6), whereas those same values cannot be represented at all in Latin-1.
From an efficiency standpoint (which is not always the best way to look at things), if you have an English text which contains some few characters having values above 127, and which has as many above 255 as below, or if you have a text which contains a large number of characters with values above 255, UTF-8 is probably the most efficient encoding (size-wise). If you have a western European text with a large number of characters above 127, but very few above 255 (French is a good example), Latin-1, with values above 255 expressed as entities (numeric or named), is probably the most efficient encoding. If you have a text where most of the characters have values above 2047 (the point at which UTF-8 needs three bytes per character), UTF-16 is probably the most efficient encoding (now we're really straying from the point).
In any case, it doesn't matter which encoding is used, so long as it is not misrepresented in the <?xml ...> declaration.
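The byte counts above can be checked with a minimal Python sketch. The sample characters and the helper name encoded_size are illustrative choices, not anything taken from the PG files; the last part shows, under the same assumptions, why a browser stumbles when Latin-1 bytes are declared as UTF-8.

```python
# Compare how many bytes a single character needs in each encoding.
# The sample characters are arbitrary illustrations.
SAMPLES = {
    "A": "ASCII letter, value 65",
    "\u00e9": "e with acute accent, value 233",
    "\u0101": "a with macron, value 257",
    "\u07ff": "value 2047",
    "\u0800": "value 2048",
}


def encoded_size(ch, encoding):
    """Return the byte length of ch in encoding, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


for ch, note in SAMPLES.items():
    sizes = {enc: encoded_size(ch, enc) for enc in ("latin-1", "utf-8", "utf-16-be")}
    print(f"U+{ord(ch):04X} ({note}): {sizes}")

# A misdeclared encoding: Latin-1 bytes handed to a UTF-8 decoder.
latin1_bytes = "caf\u00e9".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    mislabel_detected = False
except UnicodeDecodeError:
    mislabel_detected = True
print("Latin-1 bytes rejected by a UTF-8 decoder:", mislabel_detected)
```

A strict decoder raises an error here; a browser is more forgiving and typically shows replacement characters instead, which matches the grief described above.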
If you're interested, I'll start putting together a generic CSS file for TEI.
We aren't too interested in CSS directly for the TEI file (the css file sitting beside the TEI file right now is a mistake ... that should be changed later today). However, once I have a few more documents posted and people seem fairly satisfied with the results, I want to get alternate CSS files submitted by other people for the HTML documents.
Well, I might do it anyway for my own edification and enjoyment (and because I think you _will_ be interested at some point in the future ;-).) Some months ago I put together a couple of tables showing how HTML could be mapped to TEI-lite, and vice-versa. The goal was to create a mapping that could be used for round-tripping via XSLT; that is, a TEI-lite document could be used to create an HTML document which could then be transformed back into TEI without loss of markup. I will probably start from those tables in creating a tei.css file. They may also be useful to you in creating XSLT scripts (aka XSL style sheets). If you're interested they can be found at www.passkeysoft.com/~lee/xhtml2tei.html and www.passkeysoft.com/~lee/tei2xhtml.html.
Also, if any industrious programmers out there know TEI conversions and would like to tackle the job of preparing a conversion process for other end formats (such as Palm files, Plucker, MS Reader, etc) please let me and/or Marcello know. The conversion must run on Linux (our server OS) and be open source (for future compatibility).
You probably don't need anything more than someone with basic shell scripting capabilities, as all the software to do this exists currently. When you say Palm files, I am assuming you mean PalmDOC files, which are nothing more than text files converted into the Palm Database format. This conversion can be performed by the command line program "Makedoc". Source code is available at http://linuxmafia.com/pub/palmos/other-os/makedoc9.tar.gz. The shell script would be:
    PGTEI -> (via XSLT) -> .txt -> (via makedoc9) -> .pdb
Plucker is a program which encapsulates a bundle of HTML files into a single file which can be rendered on the PalmOS. The script for a plucker transformation should be very similar to the PalmDOC transformation (I'm certain Mr. Desrosiers could help you with the precise syntax):
    PGTEI -> (via XSLT) -> HTML -> (via plucker distiller) -> .pdb
To my knowledge there are no known lit compilers that run on Linux (thus making them ineligible by your requirements). This is not really a big deal because most MSReader users who are familiar with Project Gutenberg are comfortable making .lit files from HTML themselves, so if you can serve good HTML they will be happy.
What I would really like to see is an XSL script that could do a PGTEI -> RTF transformation. It probably wouldn't be very useful, but it would sure be interesting.
Now on a separate note: As part of my CSS experimentation, I set the display setting for the <teiHeader> element to "none", because while I think the data is important, I'm not particularly interested in seeing it when I'm reading. When I did this, I thought I lost the title of the book because it only appears in the <teiHeader> element. I discovered later the title was repeated in the <front> element, identified as a <head>er.
As I read the TEI spec, (and I am by no means well-versed) I believe that there should also exist a <titlePage> element which should be part of the <front>, and which should contain all the information traditionally found on the title page of a book. The main title should be marked as <titlePart type="main">, subtitles should be marked as <titlePart type="sub">, and the byline should be marked as <byline>. This would be in addition to the information included in the <teiHeader> element, which may be formatted differently (e.g. the author's name may be presented last name first for automated catalog processing).
I also had some question about the difference between the <titlePart> element and the <title> element. Looking at the spec it seems that the <title> element is not to be used to indicate the title of the work, as would appear on a title page, but the title of _another_ work referenced in the main work (these are the titles we were taught to underline back in the days of single font typewriters). For example, if _The Kitáb-i-Aqdas_ made reference to the _Baghad-Vita_, it would be marked as <title>The Baghad-Vita</title>, and should probably be rendered with an italicised font.
I also note that you encoded the glossary at the end of the work with <p> tags (naughty, naughty). Based on what I saw in the TEI docs I would have encoded it as follows:
    <div type="glossary">
      <head>Glossary</head>
      <list type="gloss">
        <label>'Abdu'l-Bahá</label>
        <gloss>The "Servant of Bahá", Abbás Effendi (1844-1921), the eldest son and appointed Successor of Bahá'u'lláh, and the Centre of His Covenant.</gloss>
        <label>Abjad</label>
        <gloss>The ancient Arabic system of allocating a numerical value to letters of the alphabet, so that numbers may be represented by letters and vice versa. Thus every word has both a literal meaning and a numerical value.</gloss>
        etc.
      </list>
    </div>
I hope you find this useful.

Lee Passey wrote:
I understand that creating a file format which could be viewed without further processing was not your intention, but now that we have some evidence that suggests that it is a real possibility is there any reason _not_ to pursue that possibility, especially if it only requires adding three lines to the source (and making sure that all the DTDs are accessible)?
Supporting CSS styling will add another complexity layer to an already overly complex thing. A software architect has to leave things out to make the design implementable. Also, things like footnotes are impossible with CSS. So why bother?
In any case, it doesn't matter which encoding is used, so long as it is not misrepresented in the <?xml ...> declaration.
Both the TEI and the XHTML file are correct. I don't know why it doesn't work for you.
As part of my CSS experimentation, I set the display setting for the <teiHeader> element to "none", because while I think the data is important, I'm not particularly interested in seeing it when I'm reading. When I did this, I thought I lost the title of the book because it only appears in the <teiHeader> element. I discovered later the title was repeated in the <front> element, identified as a <head>er. As I read the TEI spec, (and I am by no means well-versed) I believe that there should also exist a <titlePage> element which should be part of the <front>, and which should contain all the information traditionally found on the title page of a book.
That is for the encoder to decide. If the title page is interesting enough to warrant a separate encoding, she will use <titlePage> etc. to mark it up. If the title page is just plain boring you can generate a standard title page with <divGen type="titlepage">. This will pull all data out of the <teiHeader> and save you the trouble. There are a lot of such shortcuts implemented like <divGen type="toc"> and <divGen type="footnotes">.
I also note that you encoded the glossary at the end of the work with <p> tags (naughty, naughty). Based on what I saw in the TEI docs I would have encoded it as follows:
<div type="glossary"> <head>Glossary</head> <list type="gloss"> <label>'Abdu'l-Bahá</label> <gloss>The "Servant of Bahá", Abbás Effendi (1844-1921), the eldest son and appointed Successor of Bahá'u'lláh, and the Centre of His Covenant.</gloss>
And it wouldn't have validated because gloss has no business inside list. -- Marcello Perathoner webmaster@gutenberg.org

Marcello wrote:
Lee Passey wrote:
I also note that you encoded the glossary at the end of the work with <p> tags (naughty, naughty). Based on what I saw in the TEI docs I would have encoded it as follows:
<div type="glossary"> <head>Glossary</head> <list type="gloss"> <label>'Abdu'l-Bahá</label> <gloss>The "Servant of Bahá", Abbás Effendi (1844-1921), the eldest son and appointed Successor of Bahá'u'lláh, and the Centre of His Covenant.</gloss>
And it wouldn't have validated because gloss has no business inside list.
TEI P4 shows how to do it (I think): http://www.tei-c.org/P4X/DS.html#TDX-280
Then from there it links to: http://www.tei-c.org/P4X/CO.html#COLI
Where it gives the following example:
    <list type="gloss">
      <head>Report of the conduct and progress of Ernest Pontifex. Upper Vth form — half term ending Midsummer 1851</head>
      <label>Classics</label> <item>Idle listless and unimproving</item>
      <label>Mathematics</label> <item>ditto</item>
      <label>Divinity</label> <item>ditto</item>
      <label>Conduct in house</label> <item>Orderly</item>
      <label>General conduct</label> <item>Not satisfactory, on account of his great unpunctuality and inattention to duties</item>
    </list>
Also refer to: http://www.tei-c.org/P4X/CO.html#COHQU
Which talks about the <gloss> element. This particular markup problem has evidently come up before, since TEI-P4 discusses it (see the prior links.)
Definitely Lee is right in that <p> is not the best for this purpose, and Marcello is right in that how Lee used it is incorrect.
In fact, the closer I look at the above example, the more it looks like XHTML definition lists with almost an exact mapping between the two except that XHTML <dl> (analogous to TEI <list type="gloss">) cannot contain anything but <dd> <dt> pairs, while the TEI version can also contain a <head>er. In fact, as I look at it, getting the example above to work in XHTML is problematic because of the <head> line. In fact, XHTML has pretty poor list support for internal headers and the like (all the lists: ol, ul, and dl, only support li, and dd/dt for dl), so this looks like item #6 in my "problems with TEI+CSS2 rendering" list.
Jon
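The near one-to-one mapping between <list type="gloss"> and an XHTML <dl> can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name gloss_list_to_xhtml, the wrapping <div class="gloss">, and the choice to emit the <head> as a <p class="head"> outside the <dl> are all hypothetical, and the sketch copies plain text only (no nested markup, no namespaces). It is not the PG conversion process, which uses XSLT.

```python
import xml.etree.ElementTree as ET

# Shortened version of the TEI P4 example quoted above.
TEI_GLOSS = """<list type="gloss">
  <head>Report of the conduct and progress of Ernest Pontifex.</head>
  <label>Classics</label>
  <item>Idle listless and unimproving</item>
  <label>Mathematics</label>
  <item>ditto</item>
</list>"""


def gloss_list_to_xhtml(tei_list):
    """Map a TEI gloss list to a <div> holding an optional heading and a <dl>."""
    wrapper = ET.Element("div", {"class": "gloss"})
    head = tei_list.find("head")
    if head is not None:
        # <dl> has no slot for a heading, so it is placed before the list.
        heading = ET.SubElement(wrapper, "p", {"class": "head"})
        heading.text = head.text
    dl = ET.SubElement(wrapper, "dl")
    for child in tei_list:
        if child.tag == "label":
            ET.SubElement(dl, "dt").text = child.text
        elif child.tag == "item":
            ET.SubElement(dl, "dd").text = child.text
    return wrapper


xhtml = gloss_list_to_xhtml(ET.fromstring(TEI_GLOSS))
print(ET.tostring(xhtml, encoding="unicode"))
```

Moving the heading outside the list is one way around the <head> problem when transforming; it does not help with direct CSS2 rendering, where the source tree cannot be rearranged.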

Jon Noring wrote:
Marcello wrote:
Lee Passey wrote:
I also note that you encoded the glossary at the end of the work with <p> tags (naughty, naughty). Based on what I saw in the TEI docs I would have encoded it as follows:
<div type="glossary"> <head>Glossary</head> <list type="gloss"> <label>'Abdu'l-Bahá</label> <gloss>The "Servant of Bahá", Abbás Effendi (1844-1921), the eldest son and appointed Successor of Bahá'u'lláh, and the Centre of His Covenant.</gloss>
And it wouldn't have validated because gloss has no business inside list.
<snip good info on glossary markup>
There is a concept that Marcello and I have discussed of markup "levels". When it comes to something like TEI, there are so many ways you can add metadata that it is completely daunting at times. In this example, yes, a more specific markup could have been used. But, in the final render, it works just fine as <p> blocks.
Another example is a text with foreign words interspersed throughout. Often, those words would be printed in italics in the original book. Now, the simplest markup in TEI would be to put <hi rend="italics">around</hi> the word. But you could also mark the word with a <foreign lang="en">foreign</foreign> tag. In the final render, it would look exactly the same, but the second option provides more specific metadata. You could even go further by providing a translation of the foreign word inside the attribute (the markup escapes me at the moment).
The markup that would cover what PG currently has would be what I would call a "level one markup" and that is the minimum, obviously, that a TEI could be marked to.
Level two would be giving a little more metadata, but nothing drastic. Maybe marking certain words as foreign instead of italics. Marking a letter as such instead of just a block of indented paragraphs. etc. etc.
Level three would be going the extra, extra mile. It's the kind of markup I don't expect to see, but is possible in TEI.
I expect most TEI documents we post will fall in level one or level two.
Josh

Marcello Perathoner <marcello@perathoner.de> writes:
Also, things like footnotes are impossible with CSS. So why bother?
Display footnotes as sidenotes.
-- http://www.gnu.franken.de/ke/ Key fingerprint = F138 B28F B7ED E0AC 1AB4 AA7F C90A 35C3 E9D0 5D1C

Lee Passey wrote:
Joshua Hutchinson wrote:
While this is true, our TEI files are specifically meant as a master document and NOT as a viewing document. They will NOT parse in any browser "out of the box". As you've seen, you can jury-rig things to the point where it is usable, but that is not our intention. We provide the HTML files directly for people that want to browse the file in IE or Firefox.
I understand that creating a file format which could be viewed without further processing was not your intention, but now that we have some evidence that suggests that it is a real possibility is there any reason _not_ to pursue that possibility, especially if it only requires adding three lines to the source (and making sure that all the DTDs are accessible)?
Well, my investigation into PG-TEI and TEI-P4X (thank heavens for TEI Pizza Chef to flatten the otherwise unreadable TEI-P4 DTD!) shows it is also a real possibility. But I believe, subject to change as I learn more from the experts here and the TEI-L folk, that in order to make PGTEI+CSS2 render in web standards browsers (limited now to Firefox and maybe Opera 8) we also have to appropriately constrain/subset the PG-TEI vocabulary (allowed elements/attributes/attr-values) and content models (what results may be somewhat like TEI-Lite, but not exactly the same -- we can certainly add our own tags as needs require.) We may also have to give up a couple things.
[Note: Even if CSS2 rendering is not of interest, I think PG-TEI, when released as version 1.0, needs to be appropriately constrained to make life a whole lot easier for everyone using it -- subject of a future message if this topic comes up.]
Assuming appropriate constraints, here are the five items needing further investigation to see how to get them to render properly using CSS2 (there may be other TEI constructs which don't fit well into the XHTML model):
1) The TEI <note> tag. If placed directly inline (not indirectly referenced), it is possible in CSS2 to declare it block and move it outside of the main flow, which is a reasonable way to present it (even if not the best.) I've actually experimented with this, but my test files are inexplicably long-lost <fuming class="mad"/>. This won't work in IE6, but then IE6 sucks when it comes to web standards support. (I assume with XSLT that more advanced moving around of the content within notes is possible to do, such as dumping it into another document or placing it in a notes section.)
2) Hypertext links. CSS2 'display' provides no mapping for anchors. XLink will work, but then that's outside of TEI. (XLink for hypertext linking is recognized in Mozilla/Firefox, but not in Opera 7 -- don't know about Opera 8 yet.) Try the following test: http://www.windspun.com/demoxml/demolink.xml
3) Tables. I think the basic TEI table model will map to the XHTML model (there are quite a few table-related CSS2 'display' values.) However, if PG-TEI will optionally allow other table models to be used, such as CALS, all bets are off. I'm not sure that even XSLT will be able to properly map any CALS table to XHTML (may require something outside of XSLT to do the transformation.)
4) Lists. I think that TEI Lists can be made to render properly with CSS2 'display', but not sure. It needs experimentation.
5) Images. CSS2 'display' has no mapping for images and objects. XLink provides the ability to embed objects, but no web browser appears to support this functionality of XLink yet, and anyway XLink will not be used to specify images in PG-TEI documents. (Hmmm, I think here it may be possible with CSS2 to pull out the name of the image and then use that name as a string to embed the image back in -- CSS2 is capable of image embedding. Need to experiment with it. It might work in IE6, too.)
I personally prefer numeric entities, as well, but for the more common ones, the conversion process will support named entities in the .tei file. Most of them appear as unicode in the HTML, so it typically isn't an issue in the final product.
You are correct; so long as you are relying on conversion to HTML (or some other file format) before the file is used, there should be no problem (so long as the conversion utility can get to the correct .ent files). Use of named entities is only a problem if you are attempting to display the TEI-XML directly.
Yes, definitely! Of course, those named character entities which are defined in HTML/XHTML will be renderable in web standards browsers. But I think it best, in whatever DP exports as PG-TEI, to use numeric character entities. For primarily "ASCII" documents, a manifest of non-ASCII characters used in the document can be placed in a comment somewhere in the header. This allows someone to know what &#x1234; found in the text is (here it is an Ethiopic character), without having to refer to the Unicode docs. I build a non-ASCII character manifest for many of the XHTML documents I author.
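Such a manifest could be generated rather than written by hand. The following is a rough Python sketch, assuming the manifest is simply an XML comment listing each code point with its Unicode name; the function names and the comment layout are hypothetical, not an existing PG or DP convention. It collects both numeric character references and any literal non-ASCII characters.

```python
import re
import unicodedata

# Matches decimal (&#225;) and hexadecimal (&#x1234;) character references.
NUMERIC_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def character_manifest(text):
    """Return sorted (code point, Unicode name) pairs for non-ASCII characters."""
    code_points = set()
    for hex_digits, dec_digits in NUMERIC_REF.findall(text):
        code_points.add(int(hex_digits, 16) if hex_digits else int(dec_digits))
    # Literal non-ASCII characters count as well.
    code_points.update(ord(ch) for ch in text if ord(ch) > 127)
    return [
        (cp, unicodedata.name(chr(cp), "<unnamed>"))
        for cp in sorted(code_points)
        if cp > 127
    ]


def manifest_comment(text):
    """Format the manifest as an XML comment suitable for the file header."""
    lines = [f"  U+{cp:04X}  {name}" for cp, name in character_manifest(text)]
    return "<!-- Non-ASCII characters used:\n" + "\n".join(lines) + "\n-->"


sample = "<p>Kit&#225;b, &#x1234; and caf\u00e9</p>"
print(manifest_comment(sample))
```

The names come from Python's bundled Unicode database, so a character newer than that database would show up as "<unnamed>".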
In any case, it doesn't matter which encoding is used, so long as it is not misrepresented in the <?xml ...> declaration.
Yes. To reply to Marcello's comment in another message, the PG-TEI documentation should make it clear, and provide an example, of using either ISO-8859-1 or UTF-8 in the XML declaration. If it was my druthers, only UTF-8 should be used, but a compromise where ISO-8859-1 can also be used is acceptable. But no others for all mostly Latin documents! And I'd work at a future time to re-encode documents in ISO-8859-1 into UTF-8.
We aren't too interested in CSS directly for the TEI file (the css file sitting beside the TEI file right now is a mistake ... that should be changed later today). However, once I have a few more documents posted and people seem fairly satisfied with the results, I want to get alternate CSS files submitted by other people for the HTML documents.
Well, I might do it anyway for my own edification and enjoyment (and because I think you _will_ be interested at some point in the future ;-).)
<laugh> Careful Lee, you almost sound like Bowerbird on that one (but not quite.) I think it is an excellent exercise to explore how to properly render XML-conforming TEI documents using only CSS2 in web standards browsers. It may indicate how to constrain TEI so it is renderable, which may be useful for the set of criteria to build the constrained PG-TEI subset of TEI. It is also useful for the proposed TEI support in OpenReader.
Some months ago I put together a couple of tables showing how HTML could be mapped to TEI-lite, and vice-versa. The goal was to create a mapping that could be used for round-tripping via XSLT; that is, a TEI-lite document could be used to create an HTML document which could then be transformed back into TEI without loss of markup. I will probably start from those tables in creating a tei.css file. They may also be useful to you in creating XSLT scripts (aka XSL style sheets). If you're interested they can be found at www.passkeysoft.com/~lee/xhtml2tei.html and www.passkeysoft.com/~lee/tei2xhtml.html.
Well, round-tripping using XSLT and direct rendering of TEI using CSS2 are two different things. I believe XSLT has more power, but CSS2 is not bad, and CSS3 adds some new stuff (but mostly not supported in Firefox and Opera.)
Also, if any industrious programmers out there know TEI conversions and would like to tackle the job of preparing a conversion process for other end formats (such as Palm files, Plucker, MS Reader, etc) please let me and/or Marcello know. The conversion must run on Linux (our server OS) and be open source (for future compatibility).
To my knowledge there are no known lit compilers that run on Linux (thus making them ineligible by your requirements). This is not really a big deal because most MSReader users who are familiar with Project Gutenberg are comfortable making .lit files from HTML themselves, so if you can serve good HTML they will be happy.
My view in LIT production is to go from PG-TEI to well-structured XHTML 1.1 (which is probably what Lee means by "HTML".) Then from there build OEBPS 1.0.1 (LIT optimized) and OEBPS 1.2. Then let end-users convert the OEBPS 1.0.1 to LIT using the simple litconvertdemo in MS Reader's SDK (I have a "non-demo" version of the same). This approach takes full advantage of what LIT provides, while ReaderWorks does not (RW is buggy plus does not support a couple of the Reader/LIT features.) That is, to produce the highest quality LIT having available the full range of Reader/LIT features, it is much better to start with OEBPS 1.0.1 than to use ReaderWorks which assembles HTML fragments. Jon

Jon Noring wrote:
Well, my investigation into PG-TEI and TEI-P4X (thank heavens for TEI Pizza Chef to flatten the otherwise unreadable TEI-P4 DTD!) shows it is also a real possibility. But I believe, subject to change as I learn more from the experts here and the TEI-L folk, that in order to make PGTEI+CSS2 render in web standards browsers (limited now to Firefox and maybe Opera 8) we also have to appropriately constrain/subset the PG-TEI vocabulary (allowed elements/attributes/attr-values) and content models (what results may be somewhat like TEI-Lite, but not exactly the same -- we can certainly add our own tags as needs require.) We may also have to give up a couple things.
You can render XML, using XSLT + CSS in Firefox and IE, for a small demo, look at http://www.gutenberg.org/files/11335/11335-x/11335-x.xml. This sample still has a few rough edges, but can be made more beautiful. The XSLT is simply pulled in by the browser.
For any TEI file to work in an actual environment, you need to have a set of working instructions and conventions, such as what to put in rend attributes, and how to interpret certain things. TEI is mainly concerned with semantics, but to render it you also need, even in a minimal way, to concern yourself with looks.
Just some examples: I consider the foreign tag to imply no rendering information, only a language change. I will use <hi> with a lang (and rend) attribute to indicate a rendering change as well as a language change. If somebody applies italics to all foreign tags, it won't be as I intended it. Similarly, I consider quotation marks part of the text, and will leave them, even when I use <q> tags, and never emit quotation marks when rendering TEI. Another user may choose differently.
Some have argued (with valid reasons) that the entire idea of TEI markup is broken, and have proposed systems in which the mark-up is separated from the text (stream of characters), in such a way that multiple, parallel systems of mark-up can exist. Think of a separate (part of a) file, saying characters 21 to 34 are italics, and so on. This may sound odd, but it is the way the old Macintosh wordprocessor MacWrite worked.
Jeroen.
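The separated mark-up idea (a plain character stream plus independent records saying which characters carry which property) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the Span record, the "italic" and "lang=it" labels, and the render_italics helper are hypothetical and do not reproduce MacWrite's actual file format or any proposed standard. The sketch also assumes the italic spans do not overlap one another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first character covered
    end: int    # index one past the last character covered
    kind: str   # property carried by the span, e.g. "italic" or "lang=it"


def render_italics(text, spans):
    """Wrap every 'italic' span in <i>...</i>; other mark-up layers are ignored."""
    italics = sorted((s for s in spans if s.kind == "italic"), key=lambda s: s.start)
    pieces, cursor = [], 0
    for span in italics:
        pieces.append(text[cursor:span.start])
        pieces.append("<i>" + text[span.start:span.end] + "</i>")
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "He muttered sotto voce and left."
phrase = "sotto voce"
start = text.index(phrase)
end = start + len(phrase)

# Two parallel layers describe the same characters: one for looks, one for language.
spans = [Span(start, end, "italic"), Span(start, end, "lang=it")]
print(render_italics(text, spans))
```

Because the language layer and the rendering layer are stored separately, a renderer can honour one and ignore the other, which is the distinction drawn above between <foreign> and <hi>.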

You can render XML, using XSLT + CSS in Firefox and IE, for a small demo, look at http://www.gutenberg.org/files/11335/11335-x/11335-x.xml. This sample still has a few rough edges, but can be made more beautiful. The XSLT is simply pulled in by the browser.
I've been doing XML styling for years... There's nothing really magical about it. You can see that here: http://plkr.org/rss.pl
There are plenty of Gutenberg XML examples here as well: http://gutenberg.hwg.org/checkdoc2.html
David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com
participants (7): David A. Desrosiers, Jeroen Hellingman (Mailing List Account), Jon Noring, Joshua Hutchinson, Karl Eichwalder, Lee Passey, Marcello Perathoner