New subject: ANNOUNCEMENT: XML has hit the PG archives! (gutvol-d Digest, Vol 13, Issue 25)

24 Aug 2005

      Jeroen Hellingman wrote:

[snip]
...
Some have argued (with valid reasons) that the entire idea of TEI 
markup is broken, and have proposed systems in which the mark-up is 
separated from the text (stream of characters), in such a way that 
multiple, parallel systems of mark-up can exist. Think of a separate 
(part of a) file, saying characters 21 to 34 are italics, and so on. 
This may sound odd, but it is the way the old Macintosh wordprocessor 
MacWrite worked.

Jeroen.

This is also the way HTML Tidy works, internally. As an HTML file is 
parsed (and fixed, if necessary), a DOM tree is built. But when a text 
node is encountered, rather than malloc'ing a potentially small amount 
of memory and storing a pointer, the text is copied into a pre-allocated 
text buffer, and the start and end points of the fragment are saved in 
the node structure (the start and end points are actually saved in every 
node, so you can grab any node in the tree and know that it encompasses 
"this much" of the actual text.) When 'pretty-printing' the tree, text 
is grabbed from the buffer as needed.
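
A minimal sketch of that layout in Python, for illustration only. It is 
not Tidy's actual code (Tidy is written in C), and the names Node, 
TreeBuilder and node_text are hypothetical. All text goes into one 
shared buffer, and every node stores only the start and end offsets of 
the text it covers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node that records which slice of the shared buffer it covers."""
    name: str    # element name, or "#text" for a text node
    start: int   # offset of the first character covered
    end: int     # offset one past the last character covered
    children: List["Node"] = field(default_factory=list)


class TreeBuilder:
    """Appends all text to one buffer; nodes keep offsets, not strings."""

    def __init__(self) -> None:
        self.parts: List[str] = []
        self.length = 0

    def text(self, s: str) -> Node:
        start = self.length
        self.parts.append(s)
        self.length += len(s)
        return Node("#text", start, self.length)

    def element(self, name: str, children: List[Node]) -> Node:
        # An element spans from its first child's start to its last child's end.
        start = children[0].start if children else self.length
        end = children[-1].end if children else self.length
        return Node(name, start, end, children)

    def buffer(self) -> str:
        return "".join(self.parts)


def node_text(node: Node, buf: str) -> str:
    """Grab the text any node encompasses, as a pretty-printer would."""
    return buf[node.start:node.end]


b = TreeBuilder()
t1 = b.text("It was ")
em = b.element("em", [b.text("very")])
t3 = b.text(" odd.")
p = b.element("p", [t1, em, t3])
buf = b.buffer()
```

Because every node carries offsets, node_text(p, buf) and 
node_text(em, buf) work the same way whether the node is an element or 
a text fragment.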

Having created this structure in memory, there is no reason at all it 
couldn't be saved out separately, with text nodes simply referring to an 
offset and length in a separate file which receives the entire text 
buffer, or a separate segment in the same file that contains the text.
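
As a rough illustration of saving the two parts separately, the sketch 
below writes the text buffer to one file and the offset/length records 
to another, then reads them back. The file names and the JSON record 
layout are assumptions made up for this example, not an existing 
format.

```python
import json
import tempfile
from pathlib import Path

text = "It was very odd."

# Hypothetical record layout: element name plus offset and length into the text.
nodes = [
    {"name": "p", "offset": 0, "length": 16},
    {"name": "em", "offset": 7, "length": 4},
]


def save(directory, text, nodes):
    """Write the text buffer and the markup records to two separate files."""
    Path(directory, "book.txt").write_text(text, encoding="utf-8")
    Path(directory, "book.markup.json").write_text(
        json.dumps(nodes), encoding="utf-8"
    )


def load(directory):
    """Read both files back and return (text, nodes)."""
    text = Path(directory, "book.txt").read_text(encoding="utf-8")
    nodes = json.loads(
        Path(directory, "book.markup.json").read_text(encoding="utf-8")
    )
    return text, nodes


with tempfile.TemporaryDirectory() as d:
    save(d, text, nodes)
    loaded_text, loaded_nodes = load(d)

# Resolve each record against the text it points into.
spans = {
    n["name"]: loaded_text[n["offset"]:n["offset"] + n["length"]]
    for n in loaded_nodes
}
```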

Likewise, if someone wanted, hypothetically mind you, to write a set of 
annotations and footnotes to classic literature found at Gutenberg, the 
same sort of strategy could be used; the annotations would be in a 
separate file and refer to text at a certain offset in the base file. 
You'd have to write a small application to merge the two files for 
presentations, but that sort of thing is trivial, perfectly suited for 
perl, awk or python.
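
A small sketch of such a merge in Python. The annotation layout 
(offset, length, note) and the function name merge are assumptions for 
illustration, and the note texts are placeholders. Each annotated span 
gets a numbered marker, and the notes are appended at the end.

```python
from typing import List, Tuple

Annotation = Tuple[int, int, str]  # (offset, length, note text)


def merge(base: str, annotations: List[Annotation]) -> str:
    """Insert a numbered marker after each annotated span and append the notes."""
    out = []
    notes = []
    pos = 0
    for n, (offset, length, note) in enumerate(sorted(annotations), start=1):
        end = offset + length
        if offset < pos or end > len(base):
            # Overlapping spans or offsets past the end of the base text.
            raise ValueError(f"annotation {n} does not fit the base text")
        out.append(base[pos:end])
        out.append(f"[{n}]")
        notes.append(f"[{n}] {note}")
        pos = end
    out.append(base[pos:])
    return "".join(out) + "\n\n" + "\n".join(notes)


base = "To Sherlock Holmes she is always the woman."
annotations = [
    (33, 9, "placeholder note on the phrase"),
    (3, 15, "placeholder note on the name"),
]
merged = merge(base, annotations)
```

The base string is never modified; the merged presentation is built 
fresh each time from the two inputs.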

This type of division between markup and content is also perfectly 
suited to writing an application to display e-books in a low memory/low 
power device. The DOM tree could quickly be loaded into memory and 
remain resident, permitting fast navigation and styling, but the actual 
text could remain in static storage, only being accessed when needed.
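
A sketch of that access pattern, assuming a hypothetical in-memory 
index of (name, byte offset, byte length) entries. Only the index stays 
resident; the text is read from storage one span at a time. Note that 
seeking works in bytes, so the offsets here are byte offsets and must 
fall on character boundaries in the chosen encoding.

```python
import os
import tempfile

# The index stays in memory: (name, byte offset, byte length). Hypothetical layout.
INDEX = [("para-1", 0, 16), ("para-2", 16, 19)]


def read_span(path, offset, length):
    """Read only the requested bytes from storage and decode them."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("utf-8")


# Stand-in for the text held in static storage.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"It was very odd." + b"Nothing else moved.")

first = read_span(path, INDEX[0][1], INDEX[0][2])
second = read_span(path, INDEX[1][1], INDEX[1][2])
os.remove(path)
```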

One of the downsides to this sort of system is that the base content 
_must_ remain 1. accessible and 2. inviolate. The Gutenberg edition of 
_The Adventures of Sherlock Holmes_ was first released in 1999 and has 
gone through 12 revisions, the most recent being in 2002. Version 10 is 
still available at gutenberg.org, but I can't find any earlier versions 
(this is not a criticism; PG is not an archive, after all). So if I were 
to write an annotation designed to be overlaid on the PG text, I would 
want some assurance that the base text would always be available, 
or I would want to be sure that the base text was always physically 
attached to the annotation file (to the extent that anything digital can 
be said to be physical). If I were to write a separate HTML markup file 
for TAOSH, I would want some assurance that the base text would not be 
altered in any way which would change the position of any character in 
the file, otherwise my markup would break.
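
One way a markup file could at least detect such a change is to record 
a digest of the exact base text it was written against and to refuse to 
apply offsets when the digest differs. This is an assumption added for 
illustration, not something PG provides; the function names and the 
"revised" text below are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Digest of the exact bytes the markup was written against."""
    return hashlib.sha256(data).hexdigest()


def apply_span(base: bytes, expected_digest: str, offset: int, length: int) -> bytes:
    """Return the referenced span, but only if the base text is unchanged."""
    if fingerprint(base) != expected_digest:
        raise ValueError("base text differs from the edition the markup expects")
    return base[offset:offset + length]


original = b"To Sherlock Holmes she is always the woman."
revised = b"To Mr. Sherlock Holmes she is always the woman."  # hypothetical revision
digest = fingerprint(original)

ok = apply_span(original, digest, 33, 9)
try:
    apply_span(revised, digest, 33, 9)
    rejected = False
except ValueError:
    rejected = True
```

The check does not repair anything; it only turns silently misplaced 
markup into an explicit error.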

So there are definitely problems with this sort of application, but 
there are real benefits too, in some circumstances.

Participants (2): Lee Passey, Andrew Sly