Re: [gutvol-d] ANNOUNCEMENT: XML has hit the PG archives! (gutvol-d Digest, Vol 13, Issue 25)

Jeroen Hellingman wrote: [snip]
Some have argued (with valid reasons) that the entire idea of TEI markup is broken, and have proposed systems in which the mark-up is separated from the text (a stream of characters), in such a way that multiple, parallel systems of mark-up can exist. Think of a separate file (or part of a file) saying that characters 21 to 34 are italic, and so on. This may sound odd, but it is the way the old Macintosh word processor MacWrite worked.
Jeroen.
This is also the way HTML Tidy works, internally. As an HTML file is parsed (and fixed, if necessary) a DOM tree is built. But when a text node is encountered, rather than malloc'ing a potentially small amount of memory and storing a pointer, the text is copied into a pre-allocated text buffer, and the start and end points of the fragment are saved in the node structure. (The start and end points are actually saved in every node, so you can grab any node in the tree and know that it encompasses "this much" of the actual text.) When 'pretty-printing' the tree, text is grabbed from the buffer as needed.

Having created this structure in memory, there is no reason at all it couldn't be saved out separately, with text nodes simply referring to an offset and length in a separate file which receives the entire text buffer, or in a separate segment of the same file that contains the text.

Likewise, if someone wanted, hypothetically mind you, to write a set of annotations and footnotes to classic literature found at Gutenberg, the same sort of strategy could be used; the annotations would be in a separate file and refer to text at a certain offset in the base file. You'd have to write a small application to merge the two files for presentation, but that sort of thing is trivial, and perfectly suited to Perl, awk or Python.

This type of division between markup and content is also perfectly suited to writing an application to display e-books on a low-memory/low-power device. The DOM tree could quickly be loaded into memory and remain resident, permitting fast navigation and styling, but the actual text could remain in static storage, only being accessed when needed.

One of the downsides to this sort of system is that the base content _must_ remain 1. accessible and 2. inviolate. The Gutenberg edition of _The Adventures of Sherlock Holmes_ was first released in 1999 and has gone through 12 revisions, the most recent being in 2002. Version 10 is still available at gutenberg.org, but I can't find any earlier versions (this is not a criticism; PG is not an archive, after all). So if I were to write an annotation designed to be overlaid on the PG text, I would want some assurance that the base text would always be available, or I would want to be sure that the base text was always physically attached to the annotation file (to the extent that anything digital can be said to be physical).

If I were to write a separate HTML markup file for TAOSH, I would want some assurance that the base text would not be altered in any way that changed the position of any character in the file; otherwise my markup would break. So there are definitely problems with this sort of application, but there are real benefits too, in some circumstances.
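
To make the idea above concrete, here is a minimal sketch in Python of the kind of "small application" that merges a markup file with a base text. It is not how HTML Tidy or any PG tool actually does it. The names `Span`, `fingerprint` and `merge`, and the use of a SHA-256 hash to detect a changed base text, are assumptions introduced only for illustration. The sketch also assumes that spans are non-empty and nest properly; truly overlapping (parallel) markup would need a richer output format than inline tags.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One piece of stand-off markup (hypothetical record layout)."""
    start: int  # offset of the first character covered
    end: int    # offset one past the last character covered
    tag: str    # illustrative tag name, e.g. "i" for italics


def fingerprint(base: str) -> str:
    """Hash of the base text, meant to be stored alongside the markup."""
    return hashlib.sha256(base.encode("utf-8")).hexdigest()


def merge(base: str, spans: list[Span], expected: str) -> str:
    """Weave stand-off spans into the base text as inline tags."""
    # The offsets are only meaningful if the base text is unchanged.
    if fingerprint(base) != expected:
        raise ValueError("base text changed; offsets are no longer valid")

    events = []
    for i, s in enumerate(spans):
        # Sort order: by offset; closing tags before opening tags at the
        # same offset; outer spans open first and close last.
        events.append((s.start, 1, -s.end, i, f"<{s.tag}>"))
        events.append((s.end, 0, -s.start, -i, f"</{s.tag}>"))

    out, pos = [], 0
    for offset, _, _, _, text in sorted(events):
        out.append(base[pos:offset])  # untouched text up to this tag
        out.append(text)
        pos = offset
    out.append(base[pos:])
    return "".join(out)


if __name__ == "__main__":
    base = "To Sherlock Holmes she is always the woman."
    markup = [Span(33, 36, "i")]  # characters 33 to 35 are italic
    print(merge(base, markup, fingerprint(base)))
```

Storing the fingerprint with the markup file addresses only the "inviolate" half of the problem: it detects that the base text has moved under the offsets, but it cannot recover the original text, so the "accessible" requirement still has to be met some other way.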

On Wed, 24 Aug 2005, Lee Passey wrote:
_must_ remain 1. accessible and 2. inviolate. The Gutenberg edition of _The Adventures of Sherlock Holmes_ was first released in 1999 and has gone through 12 revisions, the most recent being in 2002. Version 10 is still available at gutenberg.org, but I can't find any earlier versions (this is not a criticism; PG is not an archive, after all).
A brief explanation here of the historical edition numbering of PG texts. Every text was initially released as version "10" (think of that as 1.0), and subsequent "editions" were numbered 11, 12, etc. If you look hard enough in the pre-10,000 files, you can find a couple of exceptions, but that covers most cases. Also note that the consensus that emerged was that a small number of minor corrections could be made without increasing the edition number.
Andrew
participants (2): Andrew Sly, Lee Passey