Re: [gutvol-d] ANNOUNCEMENT: XML has hit the PG archives!

Joshua Hutchinson <joshua@hutchinson.net> wrote:
Thanks to some back and forth with David Widger, we have posted a text to the PG archives that is basically the XML master together with the txt, html and pdf files generated straight from it by conversion.
http://www.gutenberg.org/1/6/5/2/16523
For those interested: This book (Kitab-i-Aqdas) is a religious book from the Baha'i Faith. The text is freely available from the Baha'i website with a usage license that allows us to post the text to our archive as long as we don't make any content changes. I've basically converted it from the Microsoft Word format they posted it in to a PGTEI-based master and used that to create text in UTF-8, Latin-1 and 7-bit ASCII, plus html and pdf.
Regarding the XML. The XML file can be found in the 16523-x subdirectory. These files are not designed to be read directly in a web browser like IE or Firefox. They are plain text files and open just fine in Notepad or vi or any other text editor of choice. For those wishing to play with the XML, our online validator and conversion tools can be found here:
Besides wanting to celebrate the first XML posting ;) ... I'm also looking for constructive criticism. What doesn't look right? What problems do you see with the results?
Congratulations on a worthwhile accomplishment. I would like to point out, however, that this is _not_ Gutenberg's first XML posting; I believe there are hundreds of XHTML files currently available. You probably intended to say that this is Gutenberg's first TEI-XML posting. I know that this seems like picking at some pretty minor nits, but there are some people who believe that there is actually a text markup language called XML. XML is actually a syntax for creating markup languages, and there are many markup languages available which conform to the XML syntax, e.g. XHTML, TEI, and DocBook. For clarity's sake it is probably desirable to always refer to a specific XML vocabulary, except when discussing the XML syntax which applies to all XML vocabularies equally.
Some specific, and very preliminary observations:
As Mr. Noring is always quick to point out, XML files can be viewed natively in both Firefox and IE6 when accompanied by appropriate style sheets, so I attempted to open this file directly in both of these browsers.
In IE6, I get the error "The system cannot locate the object specified. Error processing resource 'http://www.tei-c.org/P4X/DTD/pgtei-extensions.ent'." Apparently, your dtd, http://www.gutenberg.org/tei/marcello/0.3/dtd/pgtei.dtd, contains the line:
<!ENTITY % TEI SYSTEM "http://www.tei-c.org/P4X/DTD/tei2.dtd"> %TEI;
It looks like IE sees a full url for the TEI SYSTEM entity, so it assumes that <!ENTITY % TEI.extensions.ent SYSTEM "pgtei-extensions.ent" > refers to a file on the same system as "tei2.dtd." Of course, the TEI consortium doesn't maintain a file called "pgtei-extensions.ent", so IE fails catastrophically.
Now I'm still having a hard time wrapping my head around dtd's, so I have no idea if IE's behavior is technically correct or not, but it would be nice if the dtd's could be reworked in such a way that this failure does not occur, perhaps by hosting the TEI dtd's at http://www.gutenberg.org/tei/marcello/0.3/dtd/, and referencing them there.
Firefox does not have this problem, but Firefox also breaks when it encounters named entities, even when the entities are referenced in .ent files included from the dtd's, leading me to believe that Firefox avoids the problems associated with "roaming dtd's" by simply not parsing them in the first place. Numerical entities _are_ recognized, and rendered appropriately, as are named entities when the entity definition is contained in the XML file itself. I have no solution to this problem, except to suggest that named entities simply be avoided in favor of numeric entities, at least in the short term (I do note that the etext 16523-x.xml does not contain any named entities).
One of my pet peeves is the use of the <p> (paragraph) tag as a generic block tag, rather than limiting its use to true paragraphs, and using the <div> tag for generic blocks of text. I am happy to say that the text is mostly correct in this regard. The byline <p>by Bahá’u’lláh</p> should be marked using the <byline> tag instead of <p>; there may be other similar problems I simply haven't encountered yet.
It appears that the file is latin-1 encoded, despite the fact that the DTD claims that it is utf-8 encoded. This caused Firefox some grief as it tried to utf-8-decode some latin-1 accented vowels.
I grabbed an arbitrary "tei.css" style sheet off the net, and added the line:
<?xml-stylesheet href="tei.css" type="text/css"?>
to the beginning of the file. Looking at it in both browsers (after I had copied enough .dtd's and .ent's to my local file system that IE could cope) the document looked quirky, but readable.
When I deleted the .css file the document turned into a plain-text file, totally without styling, but nothing broke.
I think every PGTEI document should probably start with the three lines:
<?xml-stylesheet href="tei.css" type="text/css"?>
<?xml-stylesheet href="pgtei.css" type="text/css"?>
<?xml-stylesheet href="usertei.css" type="text/css"?>
and one of the next tasks should be to develop CSS files for generic TEI files and PG TEI files (the "usertei.css" file should be reserved for sophisticated users who may want to override the standard styles). If this were done (and the dtd issues are resolved for IE), the production TEI files should be usable directly by a modern web browser without any kind of pre-processing.
If you're interested, I'll start putting together a generic CSS file for TEI.
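On the IE6 failure described in this message: whether the relative system identifier "pgtei-extensions.ent" ends up at tei-c.org or at gutenberg.org depends only on which base URL the parser resolves it against. The sketch below is an illustration of that resolution step using Python's standard urljoin, not a statement about how either browser is implemented. The two base URLs come from the message; the constant and function names are hypothetical. As far as I understand the XML 1.0 rules, a relative system identifier is resolved against the resource in which the entity declaration occurs, so if the declaration lives in pgtei.dtd the Gutenberg URL would be the expected result.

```python
from urllib.parse import urljoin

# URLs quoted in the message; the constant names are illustrative only.
PGTEI_DTD = "http://www.gutenberg.org/tei/marcello/0.3/dtd/pgtei.dtd"
TEI2_DTD = "http://www.tei-c.org/P4X/DTD/tei2.dtd"
SYSTEM_ID = "pgtei-extensions.ent"


def resolve(base_url, system_id):
    """Resolve a relative system identifier against a base URL."""
    return urljoin(base_url, system_id)


# Base = the DTD assumed to contain the <!ENTITY ...> declaration.
relative_to_declaring_dtd = resolve(PGTEI_DTD, SYSTEM_ID)

# Base = the DTD in which the entity is referenced (what IE6 appears to do).
relative_to_tei2 = resolve(TEI2_DTD, SYSTEM_ID)

print(relative_to_declaring_dtd)
print(relative_to_tei2)
```

The second URL is the one that appears in the IE6 error message. Hosting the TEI DTDs next to pgtei.dtd, as suggested above, would make both resolutions land on the same server.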

Lee Passey wrote:
Congratulations on a worthwhile accomplishment.
Thanks!
I would like to point out, however, that this is _not_ Gutenberg's first XML posting; I believe there are hundreds of XHTML files currently available. You probably intended to say that this is Gutenberg's first TEI-XML posting. I know that this seems like picking at some pretty minor nits, but there are some people who believe that there is actually a text markup language called XML. XML is actually a syntax for creating markup languages, and there are many markup languages available which conform to the XML syntax, e.g. XHTML, TEI, and DocBook. For clarity's sake it is probably desirable to always refer to a specific XML vocabulary, except when discussing the XML syntax which applies to all XML vocabularies equally.
We've had some back channel discussion on just how to name this and we've decided to change the extension to .tei to give a better indication of what the file is.
Some specific, and very preliminary observations:
As Mr. Noring is always quick to point out, XML files can be viewed natively in both Firefox and IE6 when accompanied by appropriate style sheets, so I attempted to open this file directly in both of these browsers.
While this is true, our tei files are specifically meant as a master document and NOT as a viewing document. They will NOT parse in any browser "out of the box". As you've seen, you can jury-rig things to the point where it is usable, but that is not our intention. We provide the HTML files directly for people that want to browse the file in IE or Firefox.
Also, we have had some backchannel discussion about how the web server should serve the .tei files. I think Marcello is going to change the server to tell your browser that a .tei file has a plain-text MIME type, so that it will display like a .txt file would. This will help prevent people from being confused when their browser tries to display the file directly and fails miserably.
Firefox does not have this problem, but Firefox also breaks when it encounters named entities, even when the entities are referenced in .ent files included from the dtd's, leading me to believe that Firefox avoids the problems associated with "roaming dtd's" by simply not parsing them in the first place. Numerical entities _are_ recognized, and rendered appropriately, as are named entities when the entity definition is contained in the XML file itself. I have no solution to this problem, except to suggest that named entities simply be avoided in favor of numeric entities, at least in the short term (I do note that the etext 16523-x.xml does not contain any named entities).
I personally prefer numeric entities, as well, but for the more common ones, the conversion process will support named entities in the .tei file. Most of them appear as unicode in the HTML, so it typically isn't an issue in the final product.
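For anyone who wants to follow the "numeric entities only" suggestion mechanically, here is a minimal sketch of rewriting named entity references as numeric character references. It is not part of the PG conversion process. It assumes the names in Python's html.entities.name2codepoint table (the HTML 4 entity set), which overlaps with but is not identical to the entity sets a TEI DTD declares; the function name named_to_numeric is hypothetical. The five entities predefined by XML and any unrecognized names are left untouched.

```python
import re
from html.entities import name2codepoint

# Entities every XML parser knows without a DTD; leave these alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace known named entity references with decimal character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined and unknown references as-is
        return "&#{};".format(name2codepoint[name])

    return ENTITY_REF.sub(replace, text)


print(named_to_numeric("Bah&aacute;'&iacute; &amp; more"))
```

Because numeric references need no DTD lookup, a file rewritten this way should not depend on whether the browser fetches the external .ent files.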
One of my pet peeves is the use of the <p> (paragraph) tag as a generic block tag, rather than limiting its use to true paragraphs, and using the <div> tag for generic blocks of text. I am happy to say that the text is mostly correct in this regard. The byline <p>by Bahá’u’lláh</p> should be marked using the <byline> tag instead of <p>; there may be other similar problems I simply haven't encountered yet.
You are correct. That'll get fixed today.
It appears that the file is latin-1 encoded, despite the fact that the DTD claims that it is utf-8 encoded. This caused Firefox some grief as it tried to utf-8-decode some latin-1 accented vowels.
I may be wrong here (Marcello is my unicode guru), but I thought UTF-8 was a superset of Latin1? Anyway, I know in this particular file there are quite a few UTF-8 encoded characters (and a couple more that should be that we found yesterday backchannel).
If you're interested, I'll start putting together a generic CSS file for TEI.
We aren't too interested in CSS directly for the TEI file (the css file sitting beside the TEI file right now is a mistake ... that should be changed later today). However, once I have a few more documents posted and people seem fairly satisfied with the results, I want to get alternate CSS files submitted by other people for the HTML documents.
Also, if any industrious programmers out there know TEI conversions and would like to tackle the job of preparing a conversion process for other end formats (such as Palm files, Plucker, MS Reader, etc) please let me and/or Marcello know. The conversion must run on Linux (our server OS) and be open source (for future compatibility).
Josh

Joshua wrote:
Lee Passey wrote:
As Mr. Noring is always quick to point out, XML files can be viewed natively in both Firefox and IE6 when accompanied by appropriate style sheets, so I attempted to open this file directly in both of these browsers.
While this is true, our tei files are specifically meant as a master document and NOT as a viewing document. They will NOT parse in any browser "out of the box". As you've seen, you can jury-rig things to the point where it is usable, but that is not our intention. We provide the HTML files directly for people that want to browse the file in IE or Firefox.
One value in the direct viewing of PG-TEI documents is for checking the markup -- to make sure the content is properly marked up (Lee later brought up a specific example of incorrectly applied markup to the particular PG-TEI document under discussion.) For example, one could put together a "silly.css", using a variety of text colors, font-styles, font-weights, etc., to highlight certain structures and text semantics.
Another knotty issue is that TEI includes structural/semantic markup that current HTML-based browsers don't know how to natively (without CSS) handle or interpret properly (and even with the right CSS some substandard browsers like IE6 can't be forced to handle properly.) This includes the inline note tag -- HTML has never had an inline note tag where it is assumed, even without CSS, the browser will pull the note out of the main flow and present it separately (such as in a popup window.) [HTML *should* have had this feature from the start but that's water under the bridge -- XHTML 2.0 plans to include functionality to allow this, so future browsers will have to be able, without CSS, to extract certain inline stuff and render it outside the main flow, such as in a popup window, to the side, or other means. My kudos to the XHTML working group for implementing this!]
Also, we have had some backchannel discussion about how the web server should serve the .tei files. I think Marcello is going to change the server to tell your browser that a .tei file has a plain-text MIME type, so that it will display like a .txt file would. This will help prevent people from being confused when their browser tries to display the file directly and fails miserably.
Good point! Another way around the issue is to simply zip up the TEI document for download, and include a separate "readthisfirst.txt" file describing what it is and how to directly render it if that is of interest to the end-user.
Firefox does not have this problem, but Firefox also breaks when it encounters named entities, even when the entities are referenced in .ent files included from the dtd's, leading me to believe that Firefox avoids the problems associated with "roaming dtd's" by simply not parsing them in the first place.
This is interesting. Didn't know this. I don't think Firefox has concentrated on general XML rendering. Interestingly FF does support a subset of XLink, thus it is possible, using XLink, to create hypertext links in non-XHTML documents (with the full XLink, it is possible to do other things, such as embed images, to be equivalent to the HTML <img> and <object> tags.) I'll have to repeat this experiment with Opera 8 to see if they've enabled some XLink stuff (Opera 7 did not.)
It appears that the file is latin-1 encoded, despite the fact that the DTD claims that it is utf-8 encoded. This caused Firefox some grief as it tried to utf-8-decode some latin-1 accented vowels.
I may be wrong here (Marcello is my unicode guru), but I thought UTF-8 was a superset of Latin1? Anyway, I know in this particular file there are quite a few UTF-8 encoded characters (and a couple more that should be that we found yesterday backchannel).
If what Lee refers to as "Latin-1" is ISO-8859-1, then Lee is right: it is NOT correct to specify the document encoding as UTF-8, since the two are incompatible for every character outside the ASCII range. It is my personal view that ISO-8859 should never be used for the PG masters -- UTF-8 should be used instead. That "7-bit" ASCII conforms to UTF-8 is a nice bonus. (But the ISO-8859-x family, of which ISO-8859-1 is "Latin-1" and which is sometimes loosely called "8-bit ASCII", does not conform to UTF-8.)
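The incompatibility is easy to demonstrate. The following is a small sketch, with a made-up sample word and a hypothetical helper name decodes_as_utf8, showing that a Latin-1 byte for an accented vowel is not valid UTF-8, that pure ASCII bytes are, and what the reverse mistake (reading UTF-8 as Latin-1) looks like.

```python
def decodes_as_utf8(data):
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "ll\u00e1h"  # "lláh": contains a-acute, which exists in Latin-1

latin1_bytes = word.encode("iso-8859-1")  # a-acute becomes the single byte 0xE1
utf8_bytes = word.encode("utf-8")         # a-acute becomes the two bytes 0xC3 0xA1
ascii_bytes = "llah".encode("ascii")      # no bytes above 0x7F

print(decodes_as_utf8(latin1_bytes))  # Latin-1 data labelled as UTF-8
print(decodes_as_utf8(utf8_bytes))
print(decodes_as_utf8(ascii_bytes))   # ASCII is valid UTF-8 unchanged

# The opposite mislabelling does not raise an error, it silently garbles text.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)
```

So a file can only be "UTF-8 with some Latin-1 in it" by accident; a strict decoder will reject the first stray byte above 0x7F.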
If you're interested, I'll start putting together a generic CSS file for TEI.
We aren't too interested in CSS directly for the TEI file (the css file sitting beside the TEI file right now is a mistake ... that should be changed later today). However, once I have a few more documents posted and people seem fairly satisfied with the results, I want to get alternate CSS files submitted by other people for the HTML documents.
As noted above, I think a generic CSS file for PG-TEI would be a great idea! It allows direct viewing of the master for errors, and the CSS can be tweaked for direct viewing by end-users (probably restricted to Firefox and Opera in order to handle inline notes, where the CSS has to move the inline notes and similar stuff to a box outside of the flow of the text, maybe highlighted in some way -- as noted above, IE6 chokes on this CSS2 stuff.) Another issue of incompatibility, where CSS may break down, is that the table model in TEI is different in some ways from the HTML table model. Not sure if this can be fixed with CSS 'display'. Does PG-TEI include support for TEI tables? (I would assume it does.)
Also, if any industrious programmers out there know TEI conversions and would like to tackle the job of preparing a conversion process for other end formats (such as Palm files, Plucker, MS Reader, etc) please let me and/or Marcello know. The conversion must run on Linux (our server OS) and be open source (for future compatibility).
For MS Reader, unless one wants to build an unapproved and possibly illegal converter (since the LIT format has been cracked it is now possible), one has to use Microsoft's litgen.dll to produce LIT files, thus restricting the converter to MS Windows (litgen.dll requires, in turn, MSXML for XML document parsing and validation.) Litgen takes as input an OEBPS 1.0.1 Publication.
Now I do think it worthwhile to produce OEBPS as one of the output formats. PG/DP can generate both OEBPS 1.0.1 (optimized for conversion into LIT so others may do so automatically), and OEBPS 1.2 (which is the current OEBPS standard and is preferable.) Essentially, the process works as follows:
PGTEI --> XHTML 1.1 (or XHTML 1.0 Strict) --> OEBPS 1.x Document(s)
OEBPS 1.x document(s) + OEBPS Package --> OEBPS 1.x Publication
Inline notes would be handled by inserting an anchor link where the note was, and pulling the note into a separate XHTML/OEBPS document. The notes can either be aggregated into one document, or each be kept in their own document. The OEBPS 1.x framework will easily handle multiple documents that comprise one publication (it's very cool, really, in how it works.)
Jon
(p.s., Lee, did you experiment with Opera 8? They have a full-featured free version -- just have to put up with the ads in the free version.)
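The inline-note handling described in this message (put an anchor link where the note was, pull the note text into a separate document) can be sketched roughly as follows. This is a minimal illustration using Python's standard xml.etree.ElementTree, not an existing PG or OEBPS tool. The function name extract_notes, the target file name notes.html, and the "[n]" link label are all hypothetical, and the sketch assumes un-namespaced <note> elements that are not nested inside each other.

```python
import xml.etree.ElementTree as ET


def extract_notes(root, notes_href="notes.html"):
    """Replace each inline <note> with an <a> link; return [(number, text), ...]."""
    # ElementTree elements have no parent pointer, so build a child->parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    notes = []
    for note in list(root.iter("note")):
        number = len(notes) + 1
        notes.append((number, "".join(note.itertext())))

        link = ET.Element("a", {"href": "{}#n{}".format(notes_href, number)})
        link.text = "[{}]".format(number)
        link.tail = note.tail  # keep the running text that followed the note

        parent = parent_of[note]
        position = list(parent).index(note)
        parent.remove(note)
        parent.insert(position, link)
    return notes


sample = ET.fromstring(
    "<p>Text<note>A note.</note> continues<note>Second</note>.</p>"
)
collected = extract_notes(sample)
print(ET.tostring(sample, encoding="unicode"))
print(collected)
```

The returned list could then be written out either as one aggregated notes document or as one small document per note, matching the two options mentioned above.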

Jon Noring wrote:
As noted above, I think a generic CSS file for PG-TEI would be a great idea!
Every PGTEI producer is free to use as many CSS files as she wants. It just doesn't make sense to post them.
Now I do think it worthwhile to produce OEBPS as one of the output formats. PG/DP can generate both OEBPS 1.0.1 (optimized for conversion into LIT so others may do so automatically), and OEBPS 1.2 (which is the current OEBPS standard and is preferable.) Essentially, the process works as follows:
PGTEI --> XHTML 1.1 (or XHTML 1.0 Strict) --> OEBPS 1.x Document(s)
OEBPS 1.x document(s) + OEBPS Package --> OEBPS 1.x Publication
We already produce XHTML 1.0. So if you want to build a converter XHTML -> OEBPS you may start right now.
P.S. I just don't have the time nor the inclination to read all your words. If you want better answers I suggest getting to the point faster.
-- Marcello Perathoner webmaster@gutenberg.org

On Tue, 23 Aug 2005, Joshua Hutchinson wrote:
I may be wrong here (Marcello is my unicode guru), but I thought UTF-8 was a superset of Latin1? Anyway, I know in this particular file there are quite a few UTF-8 encoded characters (and a couple more that should be that we found yesterday backchannel).
Well, if you look merely at abstract numbered code points, it is correct to say that the initial code points of Unicode are numbered the same as ISO Latin-1. However, you have to realize that, while ISO Latin-1 is a legacy encoding in which each character is encoded using only one byte, the nature of Unicode has led to different methods (Unicode Transformation Formats) of actually encoding each character in a series of bytes.
One way to look at UTF-8 is as a compressed format. (When used to encode texts which consist primarily of the characters found in lower ASCII, UTF-16, which uses at least two bytes for each character, results in noticeably longer files.) ASCII characters are encoded the same in UTF-8 as in common legacy single-byte encodings, but all higher-numbered characters are represented by multi-byte sequences.
Excerpt from: http://en.wikipedia.org/wiki/UTF-8
So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.
I hope that is somewhat clear....
Andrew
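The byte counts in the excerpt can be checked directly. This is a small sketch (the helper name utf8_len and the sample code points are my own choices) that measures how many bytes Python's UTF-8 encoder produces for a code point from each range, and counts how many code points fall in the two-byte range.

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 uses to encode one code point."""
    return len(chr(code_point).encode("utf-8"))


samples = {
    "LATIN CAPITAL LETTER A (U+0041)": 0x41,
    "LATIN SMALL LETTER A WITH ACUTE (U+00E1)": 0xE1,
    "EURO SIGN (U+20AC)": 0x20AC,
    "first code point beyond the BMP (U+10000)": 0x10000,
}

for label, cp in samples.items():
    print(label, "->", utf8_len(cp), "byte(s)")

# Count the code points that need exactly two bytes: U+0080 through U+07FF.
two_byte_count = sum(1 for cp in range(0x800) if utf8_len(cp) == 2)
print(two_byte_count)
```

Note that U+00E1 is a single byte in Latin-1 but two bytes in UTF-8, which is why the same code point numbering does not make the two encodings interchangeable on disk.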

Lee Passey wrote:
It appears that the file is latin-1 encoded, despite the fact that the DTD claims that it is utf-8 encoded. This caused Firefox some grief as it tried to utf-8-decode some latin-1 accented vowels.
That is just what Apache thinks it is because it doesn't look inside the file before serving it. Apache can be made to serve the encoding based on the file extension. Lacking a definite extension it will serve the default which is iso-8859-1.
The same problem exists with all plain text files in the archive. They are all served as iso-8859-1. We cannot fix that unless we rename all files:
12345-8.txt --> 12345-8.txt.8
12345-0.txt --> 12345-0.txt.0
In this case Apache sees the .0 extension, strips it, and serves the file as 12345-0.txt with utf-8 encoding.
And don't look at me. I made this suggestion before the new filesystem went live.
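The renaming scheme above amounts to a lookup from a trailing extension to a charset, with iso-8859-1 as the fallback. The sketch below models only that lookup in Python; it is not Apache configuration and says nothing about how the server implements it. The ".0" to utf-8 mapping and the iso-8859-1 default come from the message, while the ".8" to iso-8859-1 mapping, the table name, and the function name served_as are assumptions made for illustration.

```python
# Hypothetical table: trailing extension -> charset announced to the browser.
CHARSET_BY_SUFFIX = {
    ".0": "utf-8",       # stated in the message
    ".8": "iso-8859-1",  # assumed: the 8-bit files are Latin-1
}
DEFAULT_CHARSET = "iso-8859-1"  # what the server falls back to today


def served_as(stored_name):
    """Return (name shown to the client, charset) for a file stored on disk."""
    for suffix, charset in CHARSET_BY_SUFFIX.items():
        if stored_name.endswith(suffix):
            # Strip the marker extension and announce the matching charset.
            return stored_name[: -len(suffix)], charset
    return stored_name, DEFAULT_CHARSET


for name in ("12345-0.txt.0", "12345-8.txt.8", "12345-0.txt"):
    print(name, "->", served_as(name))
```

The last case shows the current problem: a UTF-8 file without the marker extension is still announced as iso-8859-1.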
I grabbed an arbitrary "tei.css" style sheet off the net, and added the line:
<?xml-stylesheet href="tei.css" type="text/css"?>
You can also include an XSL stylesheet which gives you far more power. But why do you want to look at the TEI file in the browser when there is an HTML file available?
-- Marcello Perathoner webmaster@gutenberg.org
participants (5): Andrew Sly, Jon Noring, Joshua Hutchinson, Lee Passey, Marcello Perathoner