
Joshua Hutchinson <joshua@hutchinson.net> wrote:
Thanks to some back and forth with David Widger, we have posted a text to the PG archives that is basically the XML along with the txt, html and pdf files converted straight from it.
http://www.gutenberg.org/1/6/5/2/16523
For those interested: This book (Kitab-i-Aqdas) is a religious book from the Baha'i Faith. The text is freely available from the Baha'i website with a usage license that allows us to post the text to our archive as long as we don't make any content changes. I've basically converted it from the Microsoft Word format they posted in to a PGTEI based master and used that to create text in UTF-8, Latin-1 and 7-bit ASCII, html and pdf.
Regarding the XML. The XML file can be found in the 16523-x subdirectory. These files are not designed to be read directly in a web browser like IE or Firefox. They are plain text files and open just fine in Notepad or vi or any other text editor of choice. For those wishing to play with the XML, our online validator and conversion tools can be found here:
Besides wanting to celebrate the first XML posting ;) ... I'm also looking for constructive criticism. What doesn't look right? What problems do you see with the results?
Congratulations on a worthwhile accomplishment. I would like to point out, however, that this is _not_ Gutenberg's first XML posting; I believe there are hundreds of XHTML files currently available. You probably intended to say that this is Gutenberg's first TEI-XML posting.

I know that this seems like picking at some pretty minor nits, but there are some people who believe that there is actually a text markup language called XML. XML is actually a syntax for creating markup languages, and there are many markup languages available which conform to the XML syntax, e.g. XHTML, TEI, and DocBook. For clarity's sake it is probably desirable to always refer to a specific XML vocabulary, except when discussing the XML syntax which applies to all XML vocabularies equally.

Some specific, and very preliminary, observations:

As Mr. Noring is always quick to point out, XML files can be viewed natively in both Firefox and IE6 when accompanied by appropriate style sheets, so I attempted to open this file directly in both of these browsers. In IE6, I get the error:

"The system cannot locate the object specified. Error processing resource 'http://www.tei-c.org/P4X/DTD/pgtei-extensions.ent'."

Apparently, your dtd, http://www.gutenberg.org/tei/marcello/0.3/dtd/pgtei.dtd, contains the lines:

<!ENTITY % TEI SYSTEM "http://www.tei-c.org/P4X/DTD/tei2.dtd">
%TEI;

It looks like IE sees a full url for the TEI SYSTEM entity, so it assumes that

<!ENTITY % TEI.extensions.ent SYSTEM "pgtei-extensions.ent" >

refers to a file on the same system as "tei2.dtd." Of course, the TEI consortium doesn't maintain a file called "pgtei-extensions.ent", so IE fails catastrophically.
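To make the two readings concrete, here is a minimal Python sketch (my own illustration, not anything taken from IE or from the DTDs) that resolves the relative system identifier against each candidate base URL. Which base a parser ought to use is exactly the open question; the code only shows what each choice yields. The assumption that "pgtei-extensions.ent" lives next to pgtei.dtd is mine.

```python
from urllib.parse import urljoin

# URLs quoted in this message.
PGTEI_DTD = "http://www.gutenberg.org/tei/marcello/0.3/dtd/pgtei.dtd"
TEI2_DTD = "http://www.tei-c.org/P4X/DTD/tei2.dtd"

# The relative system identifier from the TEI.extensions.ent declaration.
SYSTEM_ID = "pgtei-extensions.ent"


def resolve(base_url: str, system_id: str) -> str:
    """Resolve a (possibly relative) system identifier against a base URL."""
    return urljoin(base_url, system_id)


# Resolved against tei2.dtd: this is the URL IE6 complained about.
ie_guess = resolve(TEI2_DTD, SYSTEM_ID)

# Resolved against pgtei.dtd: where the file presumably lives (an assumption).
pg_guess = resolve(PGTEI_DTD, SYSTEM_ID)

print(ie_guess)
print(pg_guess)
```

If the TEI dtd's were hosted in the same directory as pgtei.dtd, both resolutions would point at the same place and the choice of base would stop mattering.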
Now I'm still having a hard time wrapping my head around dtd's, so I have no idea if IE's behavior is technically correct or not, but it would be nice if the dtd's could be reworked in such a way that this failure does not occur, perhaps by hosting the TEI dtd's at http://www.gutenberg.org/tei/marcello/0.3/dtd/, and referencing them there.

Firefox does not have this problem, but Firefox also breaks when it encounters named entities, even when the entities are referenced in .ent files included from the dtd's, leading me to believe that Firefox avoids the problems associated with "roaming dtd's" by simply not parsing them in the first place. Numerical entities _are_ recognized, and rendered appropriately, as are named entities when the entity definition is contained in the XML file itself. I have no solution to this problem, except to suggest that named entities simply be avoided in favor of numeric entities, at least in the short term (I do note that the etext 16523-x.xml does not contain any named entities).

One of my pet peeves is the use of the <p> (paragraph) tag as a generic block tag, rather than limiting its use to true paragraphs, and using the <div> tag for generic blocks of text. I am happy to say that the text is mostly correct in this regard. The byline <p>by Bahá’u’lláh</p> should be marked using the <byline> tag instead of <p>; there may be other similar problems I simply haven't encountered yet.

It appears that the file is latin-1 encoded, despite the fact that the DTD claims that it is utf-8 encoded. This caused Firefox some grief as it tried to utf-8-decode some latin-1 accented vowels.

I grabbed an arbitrary "tei.css" style sheet off the net, and added the line:

<?xml-stylesheet href="tei.css" type="text/css"?>

to the beginning of the file. Looking at it in both browsers (after I had copied enough .dtd's and .ent's to my local file system that IE could cope) the document looked quirky, but readable.
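For anyone who wants to repeat the stylesheet experiment without hand-editing, here is a rough Python sketch of adding that line. It is only an illustration under my own assumptions: the function name is made up, and I assume the file may begin with an XML declaration, in which case the stylesheet instruction has to go right after it rather than at the very first byte.

```python
import re

# Matches an XML declaration such as <?xml version="1.0" encoding="utf-8"?>
# at the very start of the text, plus any whitespace that follows it.
# The \s after "xml" keeps it from matching <?xml-stylesheet ...?>.
XML_DECL = re.compile(r'^<\?xml\s[^>]*\?>\s*')


def add_stylesheet(xml_text: str, href: str = "tei.css") -> str:
    """Return xml_text with an xml-stylesheet instruction for href added."""
    pi = f'<?xml-stylesheet href="{href}" type="text/css"?>\n'
    match = XML_DECL.match(xml_text)
    if match:
        # The XML declaration, if present, must stay first in the file.
        return xml_text[:match.end()] + pi + xml_text[match.end():]
    return pi + xml_text


# A made-up two-line document standing in for the real file.
sample = '<?xml version="1.0" encoding="utf-8"?>\n<TEI.2><text/></TEI.2>\n'
styled = add_stylesheet(sample)
print(styled)
```

Calling the function once per style sheet, last sheet first, would stack several such lines in order.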
When I deleted the .css file the document turned into a plain-text file, totally without styling, but nothing broke. I think every PGTEI document should probably start with the three lines:

<?xml-stylesheet href="tei.css" type="text/css"?>
<?xml-stylesheet href="pgtei.css" type="text/css"?>
<?xml-stylesheet href="usertei.css" type="text/css"?>

and one of the next tasks should be to develop CSS files for generic TEI files and PG TEI files (the "usertei.css" file should be reserved for sophisticated users who may want to override the standard styles).

If this were done (and the dtd issues were resolved for IE), the production TEI files should be usable directly by a modern web browser without any kind of pre-processing. If you're interested, I'll start putting together a generic CSS file for TEI.
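Going back to the encoding mismatch for a moment: a quick way to check a file is simply to try decoding its bytes. The Python sketch below is my own illustration, with a made-up helper name and a made-up sample string rather than bytes from 16523-x.xml; it shows why a latin-1 accented vowel trips up a utf-8 decoder.

```python
def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw is valid in the given encoding, False otherwise."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The same short text (with one accented vowel) stored two different ways.
text = "by Bah\u00e1"
latin1_bytes = text.encode("latin-1")  # the accented vowel becomes one byte
utf8_bytes = text.encode("utf-8")      # the accented vowel becomes two bytes

print(decodes_as(latin1_bytes, "utf-8"))
print(decodes_as(utf8_bytes, "utf-8"))
```

Note that a pure 7-bit ASCII file passes both checks, so a test like this only says something once the file contains accented characters.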