
Ack! This is a looong post.... and I'd promised myself I wouldn't get dragged into this flame-fest :(

Steve Thomas <stephen.thomas@adelaide.edu.au> writes:

> OK, you've somewhat overstated the case, and I think by now we'd all agree that "8-bit" characters are important. But it is a shame that most of the geeks -- no offence, I count myself as one -- on this list immediately skipped your main point to whine about the need for accents and foreign scripts. You guys can't seem to see the wood for the trees.
You're right, it's not just about accents, and it's not just about consistently converting texts into different formats, though these are both important issues in their own right. That aside, it's you who have it backwards. You keep talking about the end-use of the text, which is opening up a file and reading it. But it's far from being that simple. XML is not meant for humans, it is meant for software. The XML will be converted to plain text, HTML and PDF for humans, but mostly the XML will be used by the applications humans need to find texts and determine whether they are worth reading in the first place.

If you have a small library with 10,000 books in it, and the library is shelved roughly by category, you can easily get to know it just by glancing over the spines. You could even have a rough list that breaks down the books by title, author and category. But if you have 100,000 or 1,000,000 books in your library, the job of finding things becomes a lot more difficult. Keyword searching a la Google will never cut it. Google gives you a means of finding your car keys -- you know what you are looking for, and you ask it to look in places it thinks might have them. Ask Google for a list of the works of Charles Dickens and you will get a list of web pages it thinks contain lists of Charles Dickens' works. Ask the LOC Catalog the same question and it will return a list of items in its catalog which claim to have been written by Charles Dickens. But that list would be huge, because of duplicate editions of individual works -- A Christmas Carol alone turns up a couple hundred items.

But what if you could ask the same question and get back a list of works (not web pages, not individual editions) by Charles Dickens, organized any way you want? And even this is not the best example of what becomes possible. Can you ask for a list of all the characters in Great Expectations? Can you search for all the contemporary obituaries of Charles Dickens?

Building applications which can answer these kinds of questions requires more than a good cataloging system (though the FRBR approach goes a long way in this regard). You need the table of contents of each work (a TOC is a description of the structure of a text), and you need a good index of what is in the text. A back-of-book index is more than just a list of keywords; it is a form of semantic markup. It maps concepts, people, places and events to the text itself.

By combining the catalog metadata, the table of contents and a good-quality index, we have the basic tools for finding a book and determining whether it is worth reading. We do this today in libraries, but it is a slow, laborious task: you go to a catalog looking for possible candidates, then retrieve each candidate and scan its TOC, preface, dust-jacket blurb, introduction or index to decide whether it's worth reading. Traditional libraries are restricted by the physical medium that books are published in. But if you could pull all of these elements together into a consistent framework, you would have a remarkable resource -- one which would transform an archive of books into a repository of knowledge far more valuable and powerful than the sum of its parts. Semantic markup like TEI is needed not only for creating this kind of library, but for creating the services that will be needed as the amount of information on the Net grows beyond what even monster search services like Google can handle.
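To make this concrete, here is a small sketch of the kind of semantic tagging I mean, using standard TEI elements (<name>, <date>, <index>). The sentence, the key values and the index headings are my own invention for illustration -- TEI doesn't prescribe any particular scheme for them:

    <p><name type="person" key="dickens-charles">Charles Dickens</name>
       gave his first public reading of <title>A Christmas Carol</title>
       in <name type="place" key="birmingham">Birmingham</name> in
       <date value="1853-12">December 1853</date>.
       <!-- an index entry mapped to this exact point in the text -->
       <index level1="Dickens, Charles" level2="public readings"/></p>

Software can harvest those key and index attributes to build exactly the kind of back-of-book index and concept map I'm describing -- people, places, dates and events tied to specific points in the text -- without the reader ever seeing a tag.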
You talk about missing the forest for the trees, but you forget that a large part of the forest is a tangled root system deep underground which the end user will never see. Without that root system the forest will die. Structured, semantic markup and rich cataloging are the root system of a library. Anyone who says, "I don't care about the technical stuff, just give me what I want," doesn't understand that it's the technical stuff which enables them to get the stuff they want.

Is this hard work? Hell yes, and it should be. Understanding, evaluating and making sense of the world around us is the most difficult thing humans do. But saying that it's not worth doing because it's hard is simply pathetic. Look at works like the OED. Would it have been created if the attitude of its makers had been, "Oh, it's too hard to build a dictionary based on historical principles, and I don't read the quotes much anyway, so just give me a list of words"? Even if you don't read the quotes, the unabridged OED, the unabridged Webster's and the Century Dictionary were used to create brilliant concise works like Merriam-Webster's Collegiate and the Concise Oxford English Dictionary. The OED, and the massive collection of research and material that was created to write it, is the root system for all the dictionaries Oxford produces.

The more important question we should be asking is: what role should PG and even DP be playing in all of this? It's reasonable to ask that PG produce basic structured markup which shows the basic structure and important elements in each text. This is no more difficult than HTML. I believe a new group needs to be established to take the simple TEI produced by PG and DP and do more complex cataloging, indexing and semantic markup, which would then be sent back to PG to be released as new editions.

The TEI documentation (which is 1,400 pages -- not 14,000 as Bowerbird exaggerated) recommends that markup be done in several passes. Start with simple structural markup (which, as I said, is about as difficult as HTML), then pass the text on to another team which can do a second, more detailed pass, and so on until it's complete. In this way you have a means of creating texts which are gradually woven into the library, while everyone uses a consistent and interoperable format which can be as simple or as complex as anyone requires.

If everything is in basic TEI-Lite, it will be easy for smaller specialized groups to come along and do this additional markup. A group could form around a single author like Mark Twain, or around a category of works like mathematics. It would then be easy for them to donate their work back to PG, making the texts richer, rather than their work becoming a separate branch of texts which isn't interoperable with the PG editions.

Plain-text, HTML and PDF can't do this because they are display formats for human consumption. Each has its uses and its market. TEI is the root system, which needs to be grown, tended and cared for as the forest grows, even if 90% of people aren't aware it's there, or don't understand that the applications they depend on to find any particular tree in the forest, and to see whether it's the tree they need, wouldn't work without it.
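For anyone wondering what that first structural pass actually amounts to, here is a bare-bones sketch of a TEI-Lite file. It is abbreviated and purely illustrative -- my own example, not an official PG or DP template:

    <TEI.2>
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>Great Expectations</title>
            <author>Dickens, Charles</author>
          </titleStmt>
          <!-- publicationStmt and sourceDesc omitted for brevity -->
        </fileDesc>
      </teiHeader>
      <text>
        <body>
          <div type="chapter" n="1">
            <head>Chapter I.</head>
            <p>My father's family name being Pirrip, and my Christian
               name Philip, ...</p>
          </div>
          <!-- remaining chapters follow the same pattern -->
        </body>
      </text>
    </TEI.2>

If you can write HTML, you can write this. A later pass can then enrich the same file with <name>, <date> and <index> tags like those in my earlier sketch, without disturbing the structure at all.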
To understand where I'm coming from on all of this, I should mention (plug plug plug) that I've been working on just such a system (http://www.chenla.org), which is divided into two parts. The first is the Burr Metadata Framework (BMF), which is meant to be a sort of Wiki markup for integration with, and export to, TEI and MARC. The second is the Librarium, which uses BMF to integrate the catalog with the works in a library. We have recently put up our first experimental record (an authority record for Charles Dickens), which has been converted into HTML and plain text; conversion to TEI and MARC is coming. Taken together, the system can be used to integrate library catalogs with books, texts and reference works, all tied to authority data for persons and groups, geographic locations, events and concepts. We don't intend to be a service for the general public, but rather to create a catalog and content for use in other libraries and web sites. The site is hosted at ibiblio.

In the next few weeks we should have enough documentation, and another 30 or more records (which we call Burrs) online, to make a general announcement of the project. I'm still ironing out some bugs in the version control software, and I still need to do a lot of work to complete a general introduction to the design, but it's all getting there. At the moment we can convert BMF to Emacs-Wiki format, which I then use to publish through Blosxom, which delivers basic HTML. BMF was designed with conversion to TEI in mind, though this might seem hard to believe when you look at the BMF source for the first time (there is a link to a pretty-print version of the Dickens source).

So what's in it for PG? The Librarium will be developing detailed authority and bibliographic records for all PG material, and it's hoped that PG can eventually draw on our catalog material for its own authority records and catalog. This should help both with books already in PG's collection and with copyright clearance for new books, freeing up resources for putting out more books, with better metadata.

b/

--
Brad Collins <brad@chenla.org>
Bangkok, Thailand