Re: Re: [gutvol-d] ANNOUNCEMENT: XML has hit the PG archives! (gutvol-d Digest, Vol 13, Issue 19)

23 Aug 2005

Joshua Hutchinson <joshua@hutchinson.net> wrote:
> Lee Passey wrote:
> [snip]
>> As Mr. Noring is always quick to point out, XML files can be viewed
>> natively in both Firefox and IE6 when accompanied by appropriate
>> style sheets, so I attempted to open this file directly in both of
>> these browsers.
>
> While this is true, our tei files are specifically meant as a master
> document and NOT as a viewing document.  They will NOT parse in any
> browser "out of the box".  As you've seen, you can jury-rig things to
> the point where it is usable, but that is not our intention.  We
> provide the HTML files directly for people that want to browse the
> file in IE or Firefox.

I understand that creating a file format which could be viewed without 
further processing was not your intention, but now that we have some 
evidence that suggests that it is a real possibility, is there any reason 
_not_ to pursue that possibility, especially if it only requires adding 
three lines to the source (and making sure that all the DTDs are 
accessible)?

> [snip]
>> I have no solution to this problem, except to suggest that named
>> entities simply be avoided in favor of numeric entities, at least in
>> the short term (I do note that the etext 16523-x.xml does not contain
>> any named entities).
>
> I personally prefer numeric entities, as well, but for the more common
> ones, the conversion process will support named entities in the .tei
> file.  Most of them appear as unicode in the HTML, so it typically
> isn't an issue in the final product.

You are correct; so long as you are relying on conversion to HTML (or 
some other file format) before the file is used, there should be no 
problem (so long as the conversion utility can get to the correct .ent 
files). Use of named entities is only a problem if you are attempting to 
display the TEI-XML directly.
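
For anyone who wants to avoid named entities in the short term, the following is a minimal sketch in Python of rewriting them as numeric character references. It assumes the entity names match the HTML entity table shipped in Python's standard library (`html.entities.name2codepoint`); the names defined in the .ent files PGTEI actually relies on may differ. The function name `named_to_numeric` is hypothetical.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself never need a DTD, so keep them.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def named_to_numeric(text):
    """Replace known named entities with decimal character references."""
    def replace(match):
        name = match.group(1)
        # ASSUMPTION: the HTML entity table stands in for the real .ent files.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or predefined: leave untouched
        return "&#%d;" % name2codepoint[name]
    return _ENTITY.sub(replace, text)

print(named_to_numeric("caf&eacute; &amp; cr&egrave;me &unknown;"))
```

Names the table does not know are left alone rather than guessed at, so they remain visible for manual review.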

> [snip]
>> It appears that the file is latin-1 encoded, despite the fact that
>> the DTD claims that it is utf-8 encoded. This caused Firefox some
>> grief as it tried to utf-8-decode some latin-1 accented vowels.
>
> I may be wrong here (Marcello is my unicode guru), but I thought UTF-8
> was a superset of Latin1?  Anyway, I know in this particular file
> there are quite a few UTF-8 encoded characters (and a couple more that
> should be that we found yesterday backchannel).

UTF-8 and Latin-1 (aka ISO-8859-1) are both character encodings. They 
agree on the code points from 0 to 255 (the value of an acute 'e' is 233 
in both), but they turn those code points into bytes differently, so 
neither is a superset or subset of the other. Values from 0 to 127 are 
encoded identically, but values from 128 to 255 are encoded in a single 
byte in Latin-1 whereas those same values are encoded in two bytes in 
UTF-8. Values above 255 are represented in two or more bytes in UTF-8 
(up to 4 under the current definition, RFC 3629; the original design 
allowed up to 6), whereas those same values cannot be represented at all 
in Latin-1.

From an efficiency standpoint (which is not always the best way to look 
at things), if you have an English text which contains only a few 
characters having values above 127, with about as many of those above 
255 as below, or if you have a text which contains a large number of 
characters with values above 255, UTF-8 is probably the most efficient 
encoding (size-wise). If you have a Western European text with a large 
number of characters above 127, but very few above 255 (French is a good 
example), Latin-1, with values above 255 expressed as entities (numeric 
or named), is probably the most efficient encoding. If you have a text 
where most of the characters have values above 2047, UTF-16 is probably 
the most efficient encoding, since UTF-8 needs three bytes for each of 
those characters where UTF-16 needs two (now we're really straying from 
the point).
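
The byte counts are easy to check. The following is a small Python sketch that encodes three sample characters (my own picks, one from each value range) in Latin-1, UTF-8, and UTF-16 and reports how many bytes each takes. The helper name `encoded_size` is hypothetical, and `utf-16-le` is used so that no byte-order mark is counted.

```python
# Three sample characters, chosen for illustration: one in 128-255,
# one in 256-2047, and one above 2047.
samples = {
    "e-acute (233)": "\u00e9",
    "Greek alpha (945)": "\u03b1",
    "Devanagari ka (2325)": "\u0915",
}

def encoded_size(ch, encoding):
    """Return the byte count, or None if the encoding cannot represent ch."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

for label, ch in samples.items():
    sizes = {enc: encoded_size(ch, enc)
             for enc in ("latin-1", "utf-8", "utf-16-le")}
    print(label, sizes)
```

The acute 'e' takes one byte in Latin-1 and two in UTF-8; the other two characters cannot be encoded in Latin-1 at all, and the last one takes three bytes in UTF-8 against two in UTF-16.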

In any case, it doesn't matter which encoding is used, so long as it is 
not misrepresented in the <?xml ...> declaration.
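
A mismatch of this kind can be caught before a browser trips over it. The following is a rough Python sketch, not a full XML parser: it reads the encoding named in the declaration with a regular expression and then checks whether the bytes really decode that way. Both function names are hypothetical.

```python
import re

def declared_encoding(raw):
    """Read the encoding named in the XML declaration (XML defaults to UTF-8)."""
    match = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return match.group(1).decode("ascii") if match else "utf-8"

def declaration_is_honest(raw):
    """True if the bytes actually decode under the declared encoding."""
    try:
        raw.decode(declared_encoding(raw))
        return True
    except (UnicodeDecodeError, LookupError):
        return False

body = "<p>caf\u00e9</p>"
mislabeled = b'<?xml version="1.0" encoding="utf-8"?>' + body.encode("latin-1")
honest = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body.encode("latin-1")

print(declaration_is_honest(mislabeled))
print(declaration_is_honest(honest))
```

The check only works in one direction: Latin-1 bytes labelled as UTF-8 usually fail to decode, but UTF-8 bytes labelled as Latin-1 always decode (into the wrong characters), because every byte value is legal in Latin-1.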

>> If you're interested, I'll start putting together a generic CSS file
>> for TEI.
>
> We aren't too interested in CSS directly for the TEI file (the css
> file sitting beside the TEI file right now is a mistake ... that
> should be changed later today).  However, once I have a few more
> documents posted and people seem fairly satisfied with the results, I
> want to get alternate CSS files submitted by other people for the HTML
> documents.

Well, I might do it anyway for my own edification and enjoyment (and 
because I think you _will_ be interested at some point in the future ;-).)

Some months ago I put together a couple of tables showing how HTML could 
be mapped to TEI-lite, and vice-versa. The goal was to create a mapping 
that could be used for round-tripping via XSLT; that is, a TEI-lite 
document could be used to create an HTML document which could then be 
transformed back into TEI without loss of markup. I will probably start 
from those tables in creating a tei.css file. They may also be useful to 
you in creating XSLT scripts (aka XSL style sheets). If you're 
interested they can be found at www.passkeysoft.com/~lee/xhtml2tei.html 
and www.passkeysoft.com/~lee/tei2xhtml.html.
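
The property that makes round-tripping possible is that the mapping is one-to-one in both directions. The following is a minimal Python sketch of checking that property. The mapping in it is HYPOTHETICAL and invented purely for illustration; it is not taken from the tables at the URLs above. Each TEI element maps to an (HTML tag, class) pair so that several TEI elements can share one HTML tag without colliding.

```python
# HYPOTHETICAL mapping for illustration only; not the contents of the
# linked tables. Each TEI tag maps to an (HTML tag, class) pair.
TEI_TO_HTML = {
    "p": ("p", None),
    "head": ("h2", None),
    "hi": ("em", None),
    "title": ("cite", None),
    "lg": ("div", "lg"),
    "l": ("div", "l"),
}

def invert(mapping):
    """Build the reverse mapping, refusing to proceed if two keys collide."""
    inverse = {}
    for tei, html in mapping.items():
        if html in inverse:
            raise ValueError("%s and %s both map to %r"
                             % (inverse[html], tei, html))
        inverse[html] = tei
    return inverse

HTML_TO_TEI = invert(TEI_TO_HTML)

def round_trips(tei_tags):
    """True if every tag survives TEI -> HTML -> TEI unchanged."""
    return all(HTML_TO_TEI[TEI_TO_HTML[t]] == t for t in tei_tags)

print(round_trips(["head", "p", "hi", "lg", "l"]))
```

If two TEI elements were mapped to the same HTML tag and class, `invert` would raise an error, which is exactly the case where markup would be lost on the way back.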

> Also, if any industrious programmers out there know TEI conversions
> and would like to tackle the job of preparing a conversion process for
> other end formats (such as Palm files, Plucker, MS Reader, etc) please
> let me and/or Marcello know.  The conversion must run on Linux (our
> server OS) and be open source (for future compatibility).

You probably don't need anything more than someone with basic shell 
scripting capabilities, as all the software to do this exists currently. 
When you say Palm files, I am assuming you mean PalmDOC files, which are 
nothing more than text files converted into the Palm Database format. 
This conversion can be performed by the command line program "Makedoc". 
Source code is available at 
http://linuxmafia.com/pub/palmos/other-os/makedoc9.tar.gz. The shell 
script would be:

PGTEI -> (via XSLT) -> .txt -> (via makedoc9) -> .pdb
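
To make the two steps concrete, here is a sketch in Python (a shell script would do equally well) that builds the command lines for the PalmDOC chain. Several things in it are assumptions: `tei2txt.xsl` is a hypothetical stylesheet that does not exist yet, `xsltproc` is just one XSLT processor that could be used, and the makedoc argument order (text file, output file, title) and binary name should be checked against the program's own usage message before relying on them.

```python
import subprocess

def palmdoc_commands(tei_path, xsl_path, title):
    """Build the command lines for PGTEI -> .txt -> .pdb (nothing is run here)."""
    stem = tei_path.rsplit(".", 1)[0]
    txt_path, pdb_path = stem + ".txt", stem + ".pdb"
    return [
        # xsltproc (from libxslt): -o names the output file.
        ["xsltproc", "-o", txt_path, xsl_path, tei_path],
        # ASSUMPTION: makedoc takes <text-file> <pdb-file> <title>; verify locally.
        ["makedoc", txt_path, pdb_path, title],
    ]

def run_pipeline(commands):
    """Run each step in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)

# Print the commands rather than running them.
for argv in palmdoc_commands("16523-x.xml", "tei2txt.xsl", "Example Title"):
    print(" ".join(argv))
```

Calling `run_pipeline` on the returned list would execute the steps; it is kept separate so the commands can be inspected first.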

Plucker is a program which encapsulates a bundle of HTML files into a 
single file which can be rendered on the PalmOS. The script for a 
Plucker transformation should be very similar to the PalmDOC 
transformation (I'm certain Mr. Desrosiers could help you with the 
precise syntax):

PGTEI -> (via XSLT) -> HTML -> (via plucker distiller) -> .pdb

To my knowledge there are no .lit compilers that run on Linux (thus 
making them ineligible by your requirements). This is not really a big 
deal because most MSReader users who are familiar with Project Gutenberg 
are comfortable making .lit files from HTML themselves, so if you can 
serve good HTML they will be happy.

What I would really like to see is an XSL script that could do a PGTEI 
-> RTF transformation. It probably wouldn't be very useful, but it would 
sure be interesting.

Now on a separate note:

As part of my CSS experimentation, I set the display setting for the 
<teiHeader> element to "none", because while I think the data is 
important, I'm not particularly interested in seeing it when I'm 
reading. When I did this, I thought I had lost the title of the book 
because I assumed it only appeared in the <teiHeader> element. I 
discovered later that the title was repeated in the <front> element, 
identified as a <head>er.  As I read the TEI spec (and I am by no means 
well-versed), I believe that there should also exist a <titlePage> 
element which should be part of the <front>, and which should contain 
all the information traditionally found on the title page of a book. The 
main title should be marked as <titlePart type="main">, subtitles should 
be marked as <titlePart type="sub">, and the byline should be marked as 
<byline>. This would be in addition to the information included in the 
<teiHeader> element, which may be formatted differently (e.g. the 
author's name may be presented last name first for automated catalog 
processing).
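
To show the shape of the markup I have in mind, here is a short Python sketch that builds such a <titlePage> with the standard library's ElementTree and prints it. The title, subtitle, and author strings are placeholders, the function name is hypothetical, and placing <titlePart> directly inside <titlePage> is my reading of the spec rather than something I have validated against the DTD.

```python
import xml.etree.ElementTree as ET

def build_title_page(main, sub, author):
    """Return a <front> element holding a TEI-style <titlePage>."""
    front = ET.Element("front")
    title_page = ET.SubElement(front, "titlePage")
    ET.SubElement(title_page, "titlePart", type="main").text = main
    if sub:
        ET.SubElement(title_page, "titlePart", type="sub").text = sub
    ET.SubElement(title_page, "byline").text = "by " + author
    return front

# Placeholder values, not taken from any real etext.
front = build_title_page("Example Title", "An Example Subtitle", "A. N. Author")
print(ET.tostring(front, encoding="unicode"))
```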

I also had some question about the difference between the <titlePart> 
element and the <title> element. Looking at the spec it seems that the 
<title> element is not to be used to indicate the title of the work, as 
would appear on a title page, but the title of _another_ work referenced 
in the main work (these are the titles we were taught to underline back 
in the days of single font typewriters). For example, if _The 
Kitáb-i-Aqdas_ made reference to the _Baghad-Vita_, it would be marked 
as <title>The Baghad-Vita</title>, and should probably be rendered with 
an italicised font.

I also note that you encoded the glossary at the end of the work with 
<p> tags (naughty, naughty). Based on what I saw in the TEI docs I would 
have encoded it as follows:

<div type="glossary">
<head>Glossary</head>
<list type="gloss">
<label>'Abdu'l-Bahá</label>
<item>The "Servant of Bahá", Abbás Effendi (1844-1921), the eldest son 
and appointed Successor of Bahá'u'lláh, and the Centre of His 
Covenant.</item>
<label>Abjad</label>
<item>The ancient Arabic system of allocating a numerical value to 
letters of the alphabet, so that numbers may be represented by letters 
and vice versa. Thus every word has both a literal meaning and a 
numerical value.</item>

etc.

</list></div>
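
One practical benefit of encoding the glossary as a list is that software can get at the terms. The following is a minimal Python sketch that pulls (term, definition) pairs out of such a list. The sample uses <label>/<item> pairs, which is what I understand the TEI content model for <list> to expect, but the function simply pairs each <label> with whatever element follows it. The second entry and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

GLOSSARY = """
<div type="glossary">
<head>Glossary</head>
<list type="gloss">
<label>Abjad</label>
<item>The ancient Arabic system of allocating a numerical value to
letters of the alphabet.</item>
<label>Example term</label>
<item>A made-up second entry.</item>
</list>
</div>
"""

def glossary_pairs(xml_text):
    """Pair each <label> with the element that immediately follows it."""
    root = ET.fromstring(xml_text)
    pairs = []
    for lst in root.iter("list"):
        children = list(lst)
        for label, body in zip(children[0::2], children[1::2]):
            if label.tag != "label":
                raise ValueError("expected <label>, found <%s>" % label.tag)
            # Collapse the line breaks inside the definition text.
            pairs.append((label.text.strip(), " ".join(body.text.split())))
    return pairs

for term, definition in glossary_pairs(GLOSSARY):
    print(term, "=", definition)
```

Because the text is parsed as XML first, an unclosed tag anywhere in the glossary shows up as a parse error instead of silently producing wrong pairs.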

I hope you find this useful.