
Ack! This is a looong post.... and I'd promised myself I wouldn't get dragged into this flame-fest :(

Steve Thomas <stephen.thomas@adelaide.edu.au> writes:

> OK, you've somewhat overstated the case, and I think by now we'd all agree that "8-bit" characters are important. But it is a shame that most of the geeks -- no offence, I count myself as one -- on this list immediately skipped your main point to whine about the need for accents and foreign scripts. You guys can't seem to see the wood for the trees.
You're right, it's not just about accents, and it's not just about consistently converting texts into different formats, though these are both important issues in their own right. That aside, it's you who have it backwards. You keep talking about the end-use of the text, which is opening up a file and reading it. But it's far from being that simple. XML is not meant for humans, it is meant for software. The XML will be converted to plain text, HTML and PDF for humans, but mostly the XML will be used by the applications humans need to find texts and determine whether they are worth reading in the first place.

If you have a small library with 10,000 books in it, and the library is shelved roughly by category, you can easily get to know it just by glancing over the spines. You could even have a rough list that breaks down the books by title, author and category. But if you have 100,000 or 1,000,000 books in your library, the job of finding things becomes a lot more difficult. Keyword searching a la Google will never cut it. Google gives you a means of finding your car keys -- you know what you are looking for, and you ask it to look in places it thinks might have them. Ask Google for a list of the works of Charles Dickens and you will get a list of web pages it thinks contain lists of Charles Dickens' works. Ask the LOC Catalog the same question and it will return a list of items in its catalog which claim to have been written by Charles Dickens. But that list would be huge, because of duplicate editions of individual works -- A Christmas Carol alone turns up a couple hundred items.

But what if you could ask the same question and get back a list of works (not web pages, not individual editions) by Charles Dickens, organized any way you want? And even this is not the best example of what becomes possible. Can you ask for a list of all the characters in Great Expectations? Can you search for all the contemporary obituaries of Charles Dickens?

Building applications which can answer these kinds of questions requires more than a good cataloging system (though the FRBR approach goes a long way in this regard). You need the table of contents of each work (a TOC is a description of the structure of a text), and you need a good index of what is in the text. A back-of-book index is more than just a list of keywords; it is a form of semantic markup. It maps concepts, people, places and events to the text itself.

By combining the catalog metadata, the table of contents and a good-quality index, we have the basic tools for finding a book and determining whether it is worth reading. We do this today in libraries, but it is a slow, laborious task: you go to a catalog looking for possible candidates, then retrieve each candidate and scan its TOC, preface, dust-jacket blurb, introduction or index to decide whether it's worth reading. Traditional libraries are restricted by the physical medium that books are published in. But if you could pull all of these elements together into a consistent framework, you would have a remarkable resource -- one which would transform an archive of books into a repository of knowledge far more valuable and powerful than the sum of its parts. Semantic markup like TEI is needed not only for creating this kind of library, but for creating the services that will be needed as the amount of information on the Net grows beyond what even monster search services like Google can handle.
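To make this concrete, here is a small sketch of the kind of semantic tagging I mean, using standard TEI elements (<name>, <date>, <index>). The sentence, the key values and the index headings are my own invention for illustration -- TEI doesn't prescribe any particular scheme for them:

    <p><name type="person" key="dickens-charles">Charles Dickens</name>
       gave his first public reading of <title>A Christmas Carol</title>
       in <name type="place" key="birmingham">Birmingham</name> in
       <date value="1853-12">December 1853</date>.
       <!-- an index entry mapped to this exact point in the text -->
       <index level1="Dickens, Charles" level2="public readings"/></p>

Software can harvest those key and index attributes to build exactly the kind of back-of-book index and concept map I'm describing -- people, places, dates and events tied to specific points in the text -- without the reader ever seeing a tag.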
You talk about missing the forest for the trees, but you forget that a large part of the forest is a tangled root system deep underground which the end user will never see. Without that root system the forest will die. Structured, semantic markup and rich cataloging are the root system of a library. Anyone who says, "I don't care about the technical stuff, just give me what I want," doesn't understand that it's the technical stuff which enables them to get the stuff they want.

Is this hard work? Hell yes, and it should be. Understanding, evaluating and making sense of the world around us is the most difficult thing humans do. But saying that it's not worth doing because it's hard is simply pathetic. Look at works like the OED. Would it have been created if the attitude of its makers had been, "Oh, it's too hard to build a dictionary based on historical principles, and I don't read the quotes much anyway, so just give me a list of words"? Even if you don't read the quotes, the unabridged OED, the unabridged Webster's and the Century Dictionary were used to create brilliant concise works like Merriam-Webster's Collegiate and the Concise Oxford English Dictionary. The OED, and the massive collection of research and material that was created to write it, is the root system for all the dictionaries Oxford produces.

The more important question we should be asking is: what role should PG and even DP be playing in all of this? It's reasonable to ask that PG produce basic structured markup which shows the basic structure and important elements in each text. This is no more difficult than HTML. I believe a new group needs to be established to take the simple TEI produced by PG and DP and do more complex cataloging, indexing and semantic markup, which would then be sent back to PG to be released as new editions.

The TEI documentation (which is 1,400 pages -- not 14,000 as Bowerbird exaggerated) recommends that markup be done in several passes. Start with simple structural markup (which, as I said, is about as difficult as HTML), then pass the text on to another team which can do a second, more detailed pass, and so on until it's complete. In this way you have a means of creating texts which are gradually woven into the library, while everyone uses a consistent and interoperable format which can be as simple or as complex as anyone requires.

If everything is in basic TEI-Lite, it will be easy for smaller specialized groups to come along and do this additional markup. A group could form around a single author like Mark Twain, or around a category of works like mathematics. It would then be easy for them to donate their work back to PG, making the texts richer, rather than their work becoming a separate branch of texts which isn't interoperable with the PG editions.

Plain-text, HTML and PDF can't do this because they are display formats for human consumption. Each has its uses and its market. TEI is the root system, which needs to be grown, tended and cared for as the forest grows, even if 90% of people aren't aware it's there, or don't understand that the applications they depend on to find any particular tree in the forest, and to see whether it's the tree they need, wouldn't work without it.
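For anyone wondering what that first structural pass actually amounts to, here is a bare-bones sketch of a TEI-Lite file. It is abbreviated and purely illustrative -- my own example, not an official PG or DP template:

    <TEI.2>
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>Great Expectations</title>
            <author>Dickens, Charles</author>
          </titleStmt>
          <!-- publicationStmt and sourceDesc omitted for brevity -->
        </fileDesc>
      </teiHeader>
      <text>
        <body>
          <div type="chapter" n="1">
            <head>Chapter I.</head>
            <p>My father's family name being Pirrip, and my Christian
               name Philip, ...</p>
          </div>
          <!-- remaining chapters follow the same pattern -->
        </body>
      </text>
    </TEI.2>

If you can write HTML, you can write this. A later pass can then enrich the same file with <name>, <date> and <index> tags like those in my earlier sketch, without disturbing the structure at all.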
To understand where I'm coming from on all of this, I should mention (plug plug plug) that I've been working on just such a system (http://www.chenla.org), which is divided into two parts. The first is the Burr Metadata Framework (BMF), which is meant to be a sort of Wiki markup for integration with, and export to, TEI and MARC. The second is the Librarium, which uses BMF to integrate the catalog with the works in a library. We have recently put up our first experimental record (an authority record for Charles Dickens), which has been converted into HTML and plain text; conversion to TEI and MARC is coming. Taken together, the system can be used to integrate library catalogs with books, texts and reference works, all tied to authority data for persons and groups, geographic locations, events and concepts. We don't intend to be a service for the general public, but rather to create a catalog and content for use in other libraries and web sites. The site is hosted at ibiblio.

In the next few weeks we should have enough documentation, and another 30 or more records (which we call Burrs) online, to make a general announcement of the project. I'm still ironing out some bugs in the version control software, and I still need to do a lot of work to complete a general introduction to the design, but it's all getting there. At the moment we can convert BMF to Emacs-Wiki format, which I then use to publish through Blosxom, which delivers basic HTML. BMF was designed with conversion to TEI in mind, though this might seem hard to believe when you look at the BMF source for the first time (there is a link to a pretty-print version of the Dickens source).

So what's in it for PG? The Librarium will be developing detailed authority and bibliographic records for all PG material, and it's hoped that PG can eventually draw on our catalog material for its own authority records and catalog. This should help both with books already in PG's collection and with copyright clearance for new books, freeing up resources for putting out more books, with better metadata.

b/

--
Brad Collins <brad@chenla.org>
Bangkok, Thailand