Re: [gutvol-d] ANNOUNCEMENT: XML has hit the PG archives!

22 Aug 2005

      Joshua Hutchinson <joshua@hutchinson.net> wrote:
...
Thanks to some back and forth with David Widger, we have posted a text 
to the PG archives that is basically the XML together with the txt, html 
and pdf files produced straight from conversion.
http://www.gutenberg.org/1/6/5/2/16523
For those interested: This book (Kitab-i-Aqdas) is a religious book 
from the Baha'i Faith. The text is freely available from the Baha'i 
website with a usage license that allows us to post the text to our 
archive as long as we don't make any content changes. I've basically 
converted it from the Microsoft Word format they posted in to a PGTEI 
based master and used that to create text in UTF-8, Latin-1 and 7-bit 
ASCII, html and pdf.
Regarding the XML. The XML file can be found in the 16523-x 
subdirectory. These files are not designed to be read directly in a 
web browser like IE or Firefox. They are plain text files and open 
just fine in Notepad or vi or any other text editor of choice. For 
those wishing to play with the XML, our online validator and 
conversion tools can be found here:
http://www.gutenberg.org/tei
Besides wanting to celebrate the first XML posting ;) ... I'm also 
looking for constructive criticism. What doesn't look right? What 
problems do you see with the results?
Congratulations on a worthwhile accomplishment.

I would like to point out, however, that this is _not_ Gutenberg's first 
XML posting; I believe there are hundreds of XHTML files currently 
available. You probably intended to say that this is Gutenberg's first 
TEI-XML posting. I know that this seems like picking at some pretty 
minor nits, but there are some people who believe that there is actually 
a text markup language called XML. XML is actually a syntax for creating 
markup languages, and there are many markup languages available which 
conform to the XML syntax, e.g. XHTML, TEI, and DocBook. For clarity's 
sake it is probably desirable to always refer to a specific XML 
vocabulary, except when discussing the XML syntax which applies to all 
XML vocabularies equally.

Some specific, and very preliminary observations:

As Mr. Noring is always quick to point out, XML files can be viewed 
natively in both Firefox and IE6 when accompanied by appropriate style 
sheets, so I attempted to open this file directly in both of these browsers.

In IE6, I get the error "The system cannot locate the object specified. 
Error processing resource 
'http://www.tei-c.org/P4X/DTD/pgtei-extensions.ent'." Apparently, your 
dtd, http://www.gutenberg.org/tei/marcello/0.3/dtd/pgtei.dtd, contains 
the line:

<!ENTITY % TEI SYSTEM "http://www.tei-c.org/P4X/DTD/tei2.dtd"> %TEI;

It looks like IE sees a full url for the TEI SYSTEM entity, so it 
assumes that

<!ENTITY % TEI.extensions.ent SYSTEM "pgtei-extensions.ent" >

refers to a file on the same system as "tei2.dtd." Of course, the TEI 
consortium doesn't maintain a file called "pgtei-extensions.ent", so IE 
fails catastrophically. Now I'm still having a hard time wrapping my 
head around dtd's, so I have no idea if IE's behavior is technically 
correct or not, but it would be nice if the dtd's could be reworked in 
such a way that this failure does not occur, perhaps by hosting the TEI 
dtd's at http://www.gutenberg.org/tei/marcello/0.3/dtd/, and referencing 
them there.
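
For what it's worth, the XML 1.0 specification says that a relative system 
identifier is resolved against the location of the resource in which the 
entity declaration occurs. Here is a minimal Python sketch, using only the 
two URLs quoted above, showing where "pgtei-extensions.ent" lands under 
each candidate base. Which file actually contains the declaration is 
something I have not verified, and the function name is my own invention.

```python
from urllib.parse import urljoin

# The two DTD locations quoted in the message above.
PGTEI_DTD = "http://www.gutenberg.org/tei/marcello/0.3/dtd/pgtei.dtd"
TEI2_DTD = "http://www.tei-c.org/P4X/DTD/tei2.dtd"


def resolve_system_id(base_uri, system_id):
    """Resolve a (possibly relative) system identifier against a base URI."""
    return urljoin(base_uri, system_id)


if __name__ == "__main__":
    # Print the resolved location for each candidate base.
    for base in (PGTEI_DTD, TEI2_DTD):
        print(base, "->", resolve_system_id(base, "pgtei-extensions.ent"))
```

Resolving against the tei2.dtd location reproduces exactly the URL in IE's 
error message, which is at least consistent with IE using that file as the 
base.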

Firefox does not have this problem, but Firefox also breaks when it 
encounters named entities, even when the entities are referenced in .ent 
files included from the dtd's, leading me to believe that Firefox avoids 
the problems associated with "roaming dtd's" by simply not parsing them 
in the first place. Numerical entities _are_ recognized, and rendered 
appropriately, as are named entities when the entity definition is 
contained in the XML file itself. I have no solution to this problem, 
except to suggest that named entities simply be avoided in favor of 
numeric entities, at least in the short term (I do note that the etext 
16523-x.xml does not contain any named entities).
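
If named entities do turn up in later texts, swapping them for numeric 
references could be automated. The following is only a sketch: it assumes 
the entity names in use match the standard HTML names that Python ships in 
html.entities, which may not hold for PGTEI-specific entities, and 
named_to_numeric is a hypothetical helper name.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself; these are safe to leave alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace known named entities with decimal character references."""

    def _swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave predefined and unrecognized names untouched.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY_RE.sub(_swap, text)


if __name__ == "__main__":
    print(named_to_numeric("by Bah&aacute;&rsquo;u&rsquo;ll&aacute;h"))
```

Names the table does not know are passed through unchanged, so anything 
defined only in a PG .ent file would still need to be handled by hand.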

One of my pet peeves is the use of the <p> (paragraph) tag as a generic 
block tag, rather than limiting its use to true paragraphs, and using 
the <div> tag for generic blocks of text. I am happy to say that the 
text is mostly correct in this regard. The byline <p>by Bahá’u’lláh</p> 
should be marked using the <byline> tag instead of <p>; there may be 
other similar problems I simply haven't encountered yet.

It appears that the file is latin-1 encoded, despite the fact that the 
DTD claims that it is utf-8 encoded. This caused Firefox some grief as 
it tried to utf-8-decode some latin-1 accented vowels.
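
A mismatch like this is easy to test for mechanically. The sketch below 
assumes the encoding claim lives in the XML declaration at the top of the 
file and that there is no byte order mark; both function names are 
hypothetical.

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding pseudo-attribute.
_DECL_RE = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(data):
    """Return the encoding named in the XML declaration, defaulting to utf-8."""
    match = _DECL_RE.match(data)
    return match.group(1).decode("ascii").lower() if match else "utf-8"


def decodes_cleanly(data, encoding):
    """True if the bytes decode without error in the given encoding."""
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


if __name__ == "__main__":
    # A Latin-1 accented vowel (0xE1) inside a file that claims utf-8.
    sample = b'<?xml version="1.0" encoding="utf-8"?>\n<p>by Bah\xe1</p>'
    enc = declared_encoding(sample)
    print(enc, decodes_cleanly(sample, enc))
```

Note that the check only proves anything in one direction: Latin-1 accepts 
every byte value, so a file declared as Latin-1 will always pass.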

I grabbed an arbitrary "tei.css" style sheet off the net, and added the 
line:

<?xml-stylesheet href="tei.css" type="text/css"?>

to the beginning of the file. Looking at it in both browsers (after I 
had copied enough .dtd's and .ent's to my local file system that IE 
could cope) the document looked quirky, but readable. When I deleted the 
.css file the document turned into a plain-text file, totally without 
styling, but nothing broke. I think every PGTEI document should probably 
start with the three lines:

<?xml-stylesheet href="tei.css" type="text/css"?>
<?xml-stylesheet href="pgtei.css" type="text/css"?>
<?xml-stylesheet href="usertei.css" type="text/css"?>

and one of the next tasks should be to develop CSS files for generic TEI 
files and PG TEI files (the "usertei.css" file should be reserved for 
sophisticated users who may want to override the standard styles). If 
this were done (and the dtd issues are resolved for IE), the production 
TEI files should be usable directly by a modern web browser without any 
kind of pre-processing.
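
Adding the three lines to existing files could be scripted. One detail to 
watch: if the file has an XML declaration, it has to remain the very first 
thing in the file, so the style sheet lines go immediately after it. A 
rough sketch, assuming plain file names that need no escaping; 
add_stylesheets is a hypothetical helper, not part of any PG tool.

```python
import re

# An XML declaration at the start of the text, plus its trailing newline.
_XML_DECL_RE = re.compile(r"<\?xml\s.*?\?>[ \t]*\r?\n?")

DEFAULT_SHEETS = ("tei.css", "pgtei.css", "usertei.css")


def add_stylesheets(xml_text, hrefs=DEFAULT_SHEETS):
    """Insert xml-stylesheet instructions right after the XML declaration."""
    pis = "".join(
        '<?xml-stylesheet href="%s" type="text/css"?>\n' % href for href in hrefs
    )
    match = _XML_DECL_RE.match(xml_text)
    if match is None:
        # No declaration: the instructions can simply lead the file.
        return pis + xml_text
    head = xml_text[: match.end()]
    if not head.endswith("\n"):
        head += "\n"
    return head + pis + xml_text[match.end():]


if __name__ == "__main__":
    print(add_stylesheets('<?xml version="1.0" encoding="utf-8"?>\n<TEI.2/>\n'))
```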

If you're interested, I'll start putting together a generic CSS file for 
TEI.

Lee Passey