
Joshua Hutchinson <joshua@hutchinson.net> wrote:
Lee Passey wrote:
[snip]
As Mr. Noring is always quick to point out, XML files can be viewed natively in both Firefox and IE6 when accompanied by appropriate style sheets, so I attempted to open this file directly in both of these browsers.
While this is true, our TEI files are specifically meant as a master document and NOT as a viewing document. They will NOT parse in any browser "out of the box". As you've seen, you can jury-rig things to the point where it is usable, but that is not our intention. We provide the HTML files directly for people who want to browse the file in IE or Firefox.
I understand that creating a file format which could be viewed without further processing was not your intention, but now that we have some evidence that suggests that it is a real possibility, is there any reason _not_ to pursue that possibility, especially if it only requires adding three lines to the source (and making sure that all the DTDs are accessible)? [snip]
I have no solution to this problem, except to suggest that named entities simply be avoided in favor of numeric entities, at least in the short term (I do note that the etext 16523-x.xml does not contain any named entities).
I personally prefer numeric entities as well, but for the more common ones, the conversion process will support named entities in the .tei file. Most of them appear as Unicode in the HTML, so it typically isn't an issue in the final product.
You are correct; so long as you are relying on conversion to HTML (or some other file format) before the file is used, there should be no problem (so long as the conversion utility can get to the correct .ent files). Use of named entities is only a problem if you are attempting to display the TEI-XML directly. [snip]
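For anyone who wants to try the "numeric entities only" approach on an existing file, here is a minimal sketch in Python. It assumes the named entities in the file are among those in Python's standard HTML 4 table (html.entities.name2codepoint), which overlaps with, but is not identical to, the ISO entity sets a TEI DTD pulls in from its .ent files. The function name named_to_numeric is my own, not part of any PG tool. The five entities predefined by XML, and any name not in the table, are left untouched.

```python
import re
from html.entities import name2codepoint

# Entities predefined by XML itself; these never need a DTD, so leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches references such as &eacute; but not numeric ones such as &#233;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace known named entities with decimal numeric references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or predefined: keep as written
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(replace, text)
```

Running named_to_numeric over the text of a file before opening it in a browser should remove the dependency on the .ent files for every entity the table knows about; anything it does not recognise stays as it was, so nothing is silently lost.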
It appears that the file is latin-1 encoded, despite the fact that the XML declaration claims that it is utf-8 encoded. This caused Firefox some grief as it tried to utf-8-decode some latin-1 accented vowels.
I may be wrong here (Marcello is my Unicode guru), but I thought UTF-8 was a superset of Latin-1? Anyway, I know in this particular file there are quite a few UTF-8 encoded characters (and a couple more that should be, which we found yesterday backchannel).
UTF-8 and Latin-1 (aka ISO-8859-1) are both encoding methods. They share the same codepoints for the values both can represent (the value of an acute 'e' is 233 in both encodings), but they turn those values into bytes differently, so as byte encodings neither is a superset or subset of the other. Values from 0 to 127 are the same in both encodings, but values from 128 to 255 are encoded in a single byte in Latin-1 whereas those same values are encoded in two bytes in UTF-8. Values above 255 are represented in two or more bytes in UTF-8 (up to 4 under the current definition in RFC 3629; the original design allowed up to 6), whereas those same values cannot be represented at all in Latin-1.
From an efficiency standpoint (which is not always the best way to look at things), if you have an English text in which only a few characters have values above 127, and about as many of those are above 255 as below, or if you have a text which contains a large number of characters with values above 255, UTF-8 is probably the most efficient encoding (size-wise). If you have a Western European text with a large number of characters above 127, but very few above 255 (French is a good example), Latin-1, with values above 255 expressed as entities (numeric or named), is probably the most efficient encoding. If you have a text where most of the characters have values above 2047 (the point at which UTF-8 needs three bytes per character), UTF-16 is probably the most efficient encoding (now we're really straying from the point).
In any case, it doesn't matter which encoding is used, so long as it is not misrepresented in the <?xml ...> declaration.
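To make the byte counts concrete, here is a small Python sketch that uses only the standard codecs. The helper names byte_lengths and decodes_as are mine, chosen for illustration. The first reports how many bytes a single character takes in each encoding (None where Latin-1 cannot represent it); the second checks whether a run of bytes can actually be decoded with the encoding a file declares, which is the test the file discussed here fails.

```python
def byte_lengths(ch):
    """Return the encoded size of one character in three encodings."""
    sizes = {
        "utf-8": len(ch.encode("utf-8")),
        # utf-16-le avoids counting the 2-byte byte-order mark
        "utf-16": len(ch.encode("utf-16-le")),
    }
    try:
        sizes["latin-1"] = len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        sizes["latin-1"] = None  # codepoint above 255: not representable
    return sizes


def decodes_as(data, declared):
    """True if the raw bytes are valid in the declared encoding."""
    try:
        data.decode(declared)
        return True
    except UnicodeDecodeError:
        return False


# "café" saved as Latin-1: the lone byte 0xE9 is not a valid UTF-8 sequence,
# so a parser that trusts a utf-8 declaration will choke on it.
latin1_bytes = "caf\u00e9".encode("latin-1")
utf8_bytes = "caf\u00e9".encode("utf-8")
```

With these definitions, decodes_as(latin1_bytes, "utf-8") is False while decodes_as(utf8_bytes, "utf-8") is True. Note that the reverse check is not informative: every byte sequence decodes as Latin-1 without error, so a UTF-8 file mislabeled as Latin-1 shows up as garbled characters rather than as a failure.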
If you're interested, I'll start putting together a generic CSS file for TEI.
We aren't too interested in CSS directly for the TEI file (the css file sitting beside the TEI file right now is a mistake ... that should be changed later today). However, once I have a few more documents posted and people seem fairly satisfied with the results, I want to get alternate CSS files submitted by other people for the HTML documents.
Well, I might do it anyway for my own edification and enjoyment (and because I think you _will_ be interested at some point in the future ;-).) Some months ago I put together a couple of tables showing how HTML could be mapped to TEI-lite, and vice-versa. The goal was to create a mapping that could be used for round-tripping via XSLT; that is, a TEI-lite document could be used to create an HTML document which could then be transformed back into TEI without loss of markup. I will probably start from those tables in creating a tei.css file. They may also be useful to you in creating XSLT scripts (aka XSL style sheets). If you're interested they can be found at www.passkeysoft.com/~lee/xhtml2tei.html and www.passkeysoft.com/~lee/tei2xhtml.html.
Also, if any industrious programmers out there know TEI conversions and would like to tackle the job of preparing a conversion process for other end formats (such as Palm files, Plucker, MS Reader, etc.), please let me and/or Marcello know. The conversion must run on Linux (our server OS) and be open source (for future compatibility).
You probably don't need anything more than someone with basic shell scripting capabilities, as all the software to do this exists currently. When you say Palm files, I am assuming you mean PalmDOC files, which are nothing more than text files converted into the Palm Database format. This conversion can be performed by the command line program "Makedoc". Source code is available at http://linuxmafia.com/pub/palmos/other-os/makedoc9.tar.gz. The shell script would be:
    PGTEI -> (via XSLT) -> .txt -> (via makedoc9) -> .pdb
Plucker is a program which encapsulates a bundle of HTML files into a single file which can be rendered on the PalmOS. The script for a Plucker transformation should be very similar to the PalmDOC transformation (I'm certain Mr. Desrosiers could help you with the precise syntax):
    PGTEI -> (via XSLT) -> HTML -> (via plucker distiller) -> .pdb
To my knowledge there are no .lit compilers that run on Linux (thus making them ineligible by your requirements). This is not really a big deal because most MSReader users who are familiar with Project Gutenberg are comfortable making .lit files from HTML themselves, so if you can serve good HTML they will be happy.
What I would really like to see is an XSL script that could do a PGTEI -> RTF transformation. It probably wouldn't be very useful, but it would sure be interesting.
Now on a separate note: As part of my CSS experimentation, I set the display setting for the <tei-header> element to "none", because while I think the data is important, I'm not particularly interested in seeing it when I'm reading. When I did this, I thought I lost the title of the book because it only appears in the <tei-header> element. I discovered later the title was repeated in the <front> element, identified as a <head>er.
As I read the TEI spec (and I am by no means well-versed), I believe that there should also exist a <titlePage> element which should be part of the <front>, and which should contain all the information traditionally found on the title page of a book. The main title should be marked as <titlePart type="main">, subtitles should be marked as <titlePart type="sub">, and the byline should be marked as <byline>. This would be in addition to the information included in the <tei-header> element, which may be formatted differently (e.g. the author's name may be presented last name first for automated catalog processing).
I also had some question about the difference between the <titlePart> element and the <title> element. Looking at the spec it seems that the <title> element is not to be used to indicate the title of the work, as would appear on a title page, but the title of _another_ work referenced in the main work (these are the titles we were taught to underline back in the days of single font typewriters). For example, if _The Kitáb-i-Aqdas_ made reference to the _Baghad-Vita_, it would be marked as <title>The Baghad-Vita</title>, and should probably be rendered with an italicised font.
I also note that you encoded the glossary at the end of the work with <p> tags (naughty, naughty). Based on what I saw in the TEI docs I would have encoded it as follows:
    <div type="glossary">
      <head>Glossary</head>
      <list type="gloss">
        <label>'Abdu'l-Bahá</label>
        <gloss>The "Servant of Bahá", Abbás Effendi (1844-1921), the eldest son and appointed Successor of Bahá'u'lláh, and the Centre of His Covenant.</gloss>
        <label>Abjad</label>
        <gloss>The ancient Arabic system of allocating a numerical value to letters of the alphabet, so that numbers may be represented by letters and vice versa. Thus every word has both a literal meaning and a numerical value.</gloss>
        etc.
      </list>
    </div>
I hope you find this useful.