re: [gutvol-d] ANNOUNCEMENT: XML has hit the PG archives! (gutvol-d Digest, Vol 13, Issue 19)

lee said:
Edit the .xml file with a simple text editor (beware Microsoft tools!) to add the line: <?xml-stylesheet href="persistent.css" type="text/css"?> You can experiment by adding new styles to 'persistent.css' (don't forget to save the file and reload your browser after adding rules). For example, add "p { display: block; text-indent: 3em }" and all of a sudden you will get distinct, indented paragraphs (and some non-paragraphs will also become distinct and indented). Add "teiHeader { display: none }" and all the Gutenberg legal cruft, together with the metadata which is typically only of interest to archivists, will disappear (it's still there, it's just not "in your face" anymore).
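Gathering lee's recipe in one place: the processing instruction to add near the top of the .xml file is `<?xml-stylesheet href="persistent.css" type="text/css"?>`, and the stylesheet it points at would contain exactly the two rules quoted in the message:

```css
/* persistent.css -- the two example rules from lee's message */
p { display: block; text-indent: 3em }  /* distinct, indented paragraphs */
teiHeader { display: none }             /* hide the header boilerplate   */
```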
that is, in other words, if i tell it to use a stylesheet, and then go and create that stylesheet, it will work. :+) i knew that anyway, but i guess it's good to be reminded. ;+)

***

jeroen said:
You can render XML using XSLT + CSS in Firefox and IE. For a small demo, look at http://www.gutenberg.org/files/11335/11335-x/11335-x.xml.
yes, i should have mentioned jeroen's files work in firefox... (not in safari. but in firefox.)
Some have argued (with valid reasons) that the entire idea of TEI markup is broken, and have proposed systems in which the mark-up is separated from the text (stream of characters), in such a way that multiple, parallel systems of mark-up can exist. Think of a separate (part of a) file, saying characters 21 to 34 are italics, and so on. This may sound odd, but it is the way the old Macintosh wordprocessor MacWrite worked.
actually, that's the way the underlying _editfield_ of the (classic) mac operating system is structured. -bowerbird
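The standoff idea jeroen describes ("characters 21 to 34 are italics") can be sketched in a few lines. This is my own illustration, not code from MacWrite or any project mentioned in the thread: the text is a bare character stream, and each markup layer is a separate list of (start, end, property) spans over it.

```python
# A sketch of layered (standoff) markup: the text is a plain character
# stream, and each markup layer is a separate list of spans of the form
# (start, end, property), kept entirely outside the text itself.

text = "A Christmas Carol, by Charles Dickens"

# Two independent layers over the same character stream.
italics_layer = [(2, 17, "italic")]          # spans "Christmas Carol"
names_layer   = [(22, 37, "personal-name")]  # spans "Charles Dickens"

def spans_at(layers, pos):
    """Return every property that applies at character position pos."""
    return [prop
            for layer in layers
            for (start, end, prop) in layer
            if start <= pos < end]

# Layers can be consulted independently or together, and new layers can
# be added without ever touching the text.
print(text[22:37])                                 # Charles Dickens
print(spans_at([italics_layer, names_layer], 25))  # ['personal-name']
```

Because layers never modify the character stream, overlapping and non-hierarchical annotations (the Bible-verse problem discussed later in this thread) come for free.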

Bowerbird wrote:
jeroen said:
Some have argued (with valid reasons) that the entire idea of TEI markup is broken, and have proposed systems in which the mark-up is separated from the text (stream of characters), in such a way that multiple, parallel systems of mark-up can exist. Think of a separate (part of a) file, saying characters 21 to 34 are italics, and so on. This may sound odd, but it is the way the old Macintosh wordprocessor MacWrite worked.
actually, that's the way the underlying _editfield_ of the (classic) mac operating system is structured.
I was told by someone (who I think is in the know) that the idea of separating markup from content (by having layers) was first proposed years ago by Ted Nelson of "Project Xanadu" fame. I recall asking Dr. Stephen DeRose at Brown University (one of the world's leading electronic document experts) about Ted Nelson's proposal and how it compares with SGML/XML markup. Dr. DeRose's reply was essentially that layering has some obvious advantages (e.g., it is easier to represent non-hierarchical structures), but that there were a lot of real-world disadvantages as well. In the early days, before SGML, researchers were exploring all kinds of avenues, and nearly all of them moved in the direction of direct markup rather than Ted Nelson's layering.

Of course, one wonders if the dynamics have changed enough that revisiting the issue would yield a different result. I can't answer that, but other than being able to non-hierarchically "mark up" documents with layering, I do not see any compelling advantages -- there'd have to be some whole new killer application which requires such layering to work properly, and I've not seen such an application arise in the last few years.

(It is possible in XML to do some non-hierarchical markup using empty "milemarkers" with ID/IDREF pairs. But one would have to build a special application to read such documents -- that's no different than building an application to process the "layer" approach. An example of a non-hierarchical document is the modern Bible, where verses can cross sentence and even paragraph boundaries. So one has the choice in SGML/XML of marking it up by chapter/paragraph and putting in verse "milemarkers", or the opposite. Most would agree that one applies hierarchical markup to document structure (paragraphs), and then adds milemarkers to locate the start of each new verse.)

Jon
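Jon's Bible example can be sketched in XML. This is a hedged illustration -- the element names here are my invention, though TEI uses empty milestone elements in much the same way:

```xml
<!-- Paragraphs carry the hierarchical markup; empty "milemarker"
     elements flag where each verse starts, so a verse may cross a
     paragraph boundary without breaking the element tree. -->
<chapter n="3">
  <p>
    <verse n="16"/>For God so loved the world ...
    <verse n="17"/>For God sent not his Son ... and the sentence
  </p>
  <p>
    runs on into the next paragraph. <verse n="18"/>He that believeth ...
  </p>
</chapter>
```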

Jon Noring <jon@noring.name> writes:
Bowerbird wrote:
jeroen said:
Some have argued (with valid reasons) that the entire idea of TEI markup is broken, and have proposed systems in which the mark-up is separated from the text (stream of characters), in such a way that multiple, parallel systems of mark-up can exist. Think of a separate (part of a) file, saying characters 21 to 34 are italics, and so on. This may sound odd, but it is the way the old Macintosh wordprocessor MacWrite worked.
About two years ago I was playing around with the same idea. My solution was to take a CSS approach to layering. CSS places an external layer of formatting instructions on top of a text, so why not extend CSS to also be able to add layers of semantic markup to a text? This would make it easy to add semantic markup, including glosses, notes, comments (scholia), etc., to a text, even if the text is located on a server somewhere on the Net. The folks doing the Hyperreal Dictionary of Mathematics are creating a scholia system based on Emacs text properties to add layers of scholia to texts.

b/

--
Brad Collins <brad@chenla.org>, Bangkok, Thailand

Brad Collins wrote:
jeroen said:
Some have argued (with valid reasons) that the entire idea of TEI markup is broken, and have proposed systems in which the mark-up is separated from the text (stream of characters), in such a way that multiple, parallel systems of mark-up can exist. Think of a separate (part of a) file, saying characters 21 to 34 are italics, and so on. This may sound odd, but it is the way the old Macintosh wordprocessor MacWrite worked.
About two years ago I was playing around with the same idea. My solution was to take a CSS approach to layering.
CSS places an external layer of formatting instructions on top of a text, so why not extend CSS to also be able to add layers of semantic markup to a text?
This would make it easy to add semantic markup, including glosses, notes, comments (scholia), etc., to a text, even if the text is located on a server somewhere on the Net.
The folks doing the Hyperreal Dictionary of Mathematics are creating a scholia system based on Emacs text properties to add layers of scholia to texts.
Interesting! I'll not comment directly on Brad's idea, but will talk about a distantly related idea, which sort of intersects with what Brad is talking about when he proposes tweaking CSS.

A couple years ago I floated the idea to both OeBF (as part of OEBPS work) and to the accessibility folk (such as DAISY) that we explore a better way for a document author to assign structural semantics to the tags in arbitrary XML documents. A problem the accessibility people have when encountering an arbitrary XML document (from an unknown vocabulary) is: what do the tags mean from a document-structure viewpoint? A text-to-speech converter needs to know this unambiguously to do an effective job of properly conveying the content to the listener. An attached visual CSS style sheet (standards-conforming, at least) is insufficient to communicate the exact structures in such arbitrary XML documents.

So I proposed something called a "Rosetta Stone", which would be a sort of attached document (probably XML) which describes the semantics of the tags in the content document so the document structure can be identified by machine processing. The RS may syntactically be based upon XSLT, but it is not intended to be a markup transformation -- it is solely a way to assign semantics to elements so the user agent (such as a text-to-speech engine) can figure out what to do with them.

Key to the Rosetta Stone is setting up a universal "metavocabulary" to describe common document structures. Now, I have no illusion this will be easy -- it will not be easy -- it will be damn hard to do right. Then there's the issue of the granularity of the metavocabulary -- how fine-grained with document structure does one go -- and what types of documents will be targeted?

By and large, CSS was not designed for the purpose of assigning structural semantics to tags. CSS does have the 'display' property which assigns, at a very rudimentary level, some critical structural semantics (block, inline, table, list).
But as we know, the allowed 'display' values are quite limited -- they don't, and in a practical sense cannot, assign some critical semantics such as hypertext links and embedded images and objects (XLink is the vocabulary-agnostic solution for these particular things). There is no CSS 'display' value for section headers, for example (in CSS, a header has to be treated as simply a kind of "block-level" tag), yet it is clear for text-to-speech that section headers must be specifically identified as such, and not lumped in with paragraphs.

Then there's the issue that CSS is intended for *styling* during presentation (by and large, visual styling). That is its purpose -- it's not designed to be a "Rosetta Stone" for conveying detailed structural information.

I don't know if the "Rosetta Stone" idea is tractable, or whether it will in the long run solve any real problems. In lieu of that, the accessibility community, and I think anyone else using markup to structure texts, would want all XML documents representing publications to conform to particular, well-defined vocabularies which are marked up in an acceptable structural, presentation-agnostic manner. The more I study it, the more I think properly done TEI is one such acceptable vocabulary. The accessibility folk have proposed their own, the Digital Talking Book, which is essentially XHTML with some interesting TEI-like extensions.

(Just about any markup vocabulary can be abused/misused to make it more difficult to convey the structural/semantic meaning of the content. Even TEI -- this is why I'm interested in subsetting and constraining the TEI vocabulary to assure the marked-up content will be more accessible, which includes presentation agnosticism.)

Jon
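For concreteness, a small "Rosetta Stone" might look something like the sketch below. Every element and attribute name here is my invention -- no such format was ever specified:

```xml
<!-- Maps tags of an unknown XML vocabulary onto a small structural
     metavocabulary that a user agent (e.g. a text-to-speech engine)
     understands, without transforming the markup itself. -->
<rosetta-stone applies-to="http://example.org/some-unknown-vocab">
  <map element="chap-head" means="section-header" level="1"/>
  <map element="para"      means="paragraph"/>
  <map element="pic"       means="embedded-image" source-attr="file"/>
  <map element="xref"      means="hypertext-link" target-attr="to"/>
</rosetta-stone>
```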

Jon Noring <jon@noring.name> writes:
Brad Collins wrote:
Key to the Rosetta Stone is setting up a universal "metavocabulary" to describe common document structures. Now, I have no illusion this will be easy -- it will not be easy -- it will be damn hard to do right. Then there's the issue of the granularity of the metavocabulary -- how fine-grained with document structure does one go -- and what types of documents will be targeted?
But it is possible as long as people can make the distinction between description and meaning. We might agree to call something the same thing, but not agree on what it means. This is a good thing to work towards.

XHTML isn't a bad basic, universal structural language, but it has no way of dealing with semantic markup. This is why DocBook and TEI are becoming increasingly popular: they provide a way of semantically describing a text.

A lot of semantic markup won't be displayed to the end user at all. This is as it should be. Wikipedia is a good example of over-linking: links to articles which often do nothing to help explain the concept being described, or which are merely related terms. Are most of these links generated automatically? The links in the jrank edition of the 1911 Encyclopædia Britannica are another example of pointless automatically generated links.

The main purpose of semantic tagging is for indexing and search. If all texts marked up personal names, place names, event names, and names of works (books, serials, etc.), we could build applications which would provide a far richer user experience. Search services could then provide fine-grained searching of a particular text, rather than just pointing to a document.

I don't think that PG should take on the job of doing this kind of markup.... PG has enough on its plate as it is :)
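The kind of name tagging described above already exists in TEI. A hedged sketch -- TEI really does provide persName and placeName elements for this purpose, though the key value here is invented:

```xml
<p><persName key="Scrooge, Ebenezer">Scrooge</persName> kept to his
counting-house in <placeName>London</placeName>, long after
<persName>Marley</persName>'s death.</p>
```

With tagging like this in place, a search service can index persons and places separately from ordinary words, which is exactly the "far richer user experience" being argued for.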
By and large, CSS was not designed for the purpose of assigning structural semantics to tags. CSS does have the 'display' property which assigns, at a very rudimentary level, some critical structural semantics (block, inline, table, list). But as we know, the allowed 'display' values are quite limited -- they don't, and in a practical sense cannot, assign some critical semantics such as hypertext links and embedded images and objects (XLink is the vocabulary-agnostic solution for these particular things). There is no CSS 'display' value for section headers, for example (in CSS, a header has to be treated as simply a kind of "block-level" tag), yet it is clear for text-to-speech that section headers must be specifically identified as such, and not lumped in with paragraphs.
Then there's the issue that CSS is intended for *styling* during presentation (by and large visual styling). That is its purpose -- it's not designed to be a "Rosetta Stone" for conveying detailed structural information.
I was thinking along the lines of creating a new CSS module which would allow XPath as an alternative to CSS selectors, and then create semantic CSS properties. I suppose you could declare at the beginning of a style sheet whether it is semantic or style, and which selector type you want to use. This is just off the top of my head, but you could then have something like this in the stylesheet (the XPath is probably not correct):

    //p/string("Scrooge") {
        semantic-type : personal-name
        used-for      : Ebenezer Scrooge
        defined-by    : bxid://aut:OSE0-1157
        url           : http://chenla.org/blah/blah/Scrooge.html
        scope-note    : "Ebenezer Scrooge is the miserly old man visited
                         by the ghost of his dead partner Jacob Marley
                         in Charles Dickens's A Christmas Carol."
    }

I think it's an interesting idea, but it would require adoption by each browser, which is a long shot at best. I think it's more practical to define a master XML text and then let people apply markup to a local copy which would act as a layer. Multiple layers could be merged together with something that works like diff and patch.... This would work just like Arch (a version control system like CVS) -- all local copies are branches which you can keep private, merge with the original, or branch into a new edition.

b/

--
Brad Collins <brad@chenla.org>, Bangkok, Thailand
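Brad's diff/patch-style merging of layers can be sketched as follows. This is my own illustration, not anything from Arch or the thread; each layer is a list of (start, end, property) spans over a shared master text, per the "characters 21 to 34 are italics" idea quoted earlier:

```python
# A rough sketch of merging annotation layers patch-style: spans from a
# local layer are merged into the master layer, and spans that collide
# with existing markup are set aside, much as `patch` sets aside hunks
# it cannot apply.

def merge_layers(base, incoming):
    """Merge two annotation layers (lists of (start, end, prop) spans).
    A span conflicts if it overlaps a base span with a different prop."""
    merged, conflicts = list(base), []
    for start, end, prop in incoming:
        clash = any(start < e and s < end and prop != p
                    for (s, e, p) in base)
        if clash:
            conflicts.append((start, end, prop))
        else:
            merged.append((start, end, prop))
    return sorted(merged), conflicts

# A local copy adds two spans; one collides with the master's markup.
master_layer = [(0, 10, "italic")]
local_layer  = [(5, 15, "bold"), (20, 30, "italic")]
merged, conflicts = merge_layers(master_layer, local_layer)
print(merged)     # [(0, 10, 'italic'), (20, 30, 'italic')]
print(conflicts)  # [(5, 15, 'bold')]
```

Conflicting spans would then need manual resolution, exactly as with a rejected patch hunk; non-conflicting layers merge cleanly, which is what makes the private-branch workflow plausible.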

I'm looking at Version 0.3 of PG-TEI. If that is no longer the current version, apologies. The following are highlighting modes (inline 'rend' values) that I've seen employed with some frequency in public domain printed materials, that are not presently supported by PG-TEI:

- roman (when the 'default' typeface is not roman)
- blackletter
- red-ink

Thanks,

--
RS

Robert Shimmin wrote:
I'm looking at Version 0.3 of PG-TEI. If that is no longer the current version, apologies. The following are highlighting modes (inline 'rend' values) that I've seen employed with some frequency in public domain printed materials, that are not presently supported by PG-TEI.
- roman (when the 'default' typeface is not roman)
- blackletter
- red-ink
Version 0.4 is coming soon, and will have some more features:

- better support for Unicode
- support for many CSS2 properties in the rend attribute

Eventually we'll switch to supporting only CSS properties in the rend attribute (we definitely need some standard there!). In the meantime you can use this markup:

Roman as "not italic":

    italic <hi rend="font-style(normal)">roman</hi> italic ...

Blackletter:

    <hi rend="font-family(blackletter)">

Red ink:

    <emph rend="color(red)">
    <q who="Jesus" rend="color(red)">

--
Marcello Perathoner
webmaster@gutenberg.org
participants (5)

- Bowerbird@aol.com
- Brad Collins
- Jon Noring
- Marcello Perathoner
- Robert Shimmin