
Jon Noring wrote:
Actually, I think what we'd like to do is to "categorize" the texts
using one or more categorical systems, and then embed that information right into the book (which is a digital object).
Instead of embedding it into the e-book, I think it would work better as a separate file. If you embed it in the e-books, you will need to put it in all the versions (HTML, text, PDF, TEI, etc.) and keep ALL of them up to date. You would also have to search the entire text of the book to find all the metadata.

As a separate file, it would also be easier to download just the metadata when you want to do "local" searches, without needing to download the full text of every e-book. And if you want to make it "user" editable, however you want to define "user", a separate file means the original e-book files don't constantly get flagged as modified.

Also, make it easy to join the meta files into a single file (cat *.meta > all.meta would be ideal) so that large numbers of books could be munged at once, or catalogues of specific groupings could easily be created (e.g. science-fiction/German). This would just require having a header in each file specifying which book it applies to. The format could be plain text, XML, or even TEI. If you use an XML-based version, a text version could easily be generated from it.
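To make that concrete, here is a sketch of what one per-book meta file might look like. The field names, the delimiters, and the etext number are all invented for illustration; the real format would have to be agreed on:

```
# polish-math-01234.meta -- hypothetical example, not an agreed format
book: etext01234
title: Zarys Geometryi
author: Kowalski, Jan
language: pl
category: mathematics
formats: html, tei
```

Because every record begins with a `book:` header line, `cat *.meta > all.meta` yields a valid multi-record file, and splitting it back apart (or filtering it) stays trivial.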
This is essentially adding metadata, or what the Yahoo folk call "microformats" (which is a terrible name), right into the object. This is done now in many kinds of digital objects, such as audio, video and some ebook formats.
Instead of just a category, you could store all sorts of information in the "meta" file: author's name, copyright date(s), categories (science fiction, horticulture, cookbook), available formats (text, html, tei, pdf, etc.), language(s), links to web sites, a link to an author meta file, and any other information you would like to find in a card catalog.
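A parser for such a concatenated file could be very small. This is only a sketch, assuming a hypothetical "key: value" format in which each record starts with a `book:` line; none of this is an agreed-upon standard:

```python
def parse_meta(text):
    """Split concatenated meta records into a list of dicts."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "book":          # a "book:" line opens a new record
            current = {"book": value}
            records.append(current)
        elif current is not None:  # any other field belongs to the open record
            current[key] = value
    return records

sample = """\
book: etext01234
title: Zarys Geometryi
language: pl

book: etext05678
title: A Cookbook
language: en
"""
```

Here `parse_meta(sample)` returns two dicts, one per book, so an `all.meta` built with plain `cat` needs no extra structure at all.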
This way no external categorization needs to be applied -- it is all recorded internally, meaning each book can become autonomous of the others since it carries its own metadata. Particular "libraries" can build a lookup table of their choosing by simply sniffing through all the texts they hold. It doesn't really matter where the text files are placed or how they are organized in a file structure. Multiple categorization systems can be supported in parallel, provided the texts carry the requisite information.
I think it could become a problem if the metadata embedded in the different formats turned out to differ. Which one has the correct information, the text version or the HTML one?
In XML, there are a number of ways this info could be embedded. In plain text documents, some sort of machine-recognizable "plain text" syntax has to be developed -- it'd be quite simple, actually. I think those who advocate plain text should develop a "plain text" metadata system (such as one based on Dublin Core) to insert somewhere in the file.
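A plain-text convention could be as simple as a delimited block near the top of the file. Purely as an illustration -- the marker lines and the DC-style field names below are made up, not an existing standard -- extracting such an embedded block might look like this:

```python
# Hypothetical convention: metadata sits between two marker lines somewhere
# in the plain-text file, with field names loosely following Dublin Core.
BEGIN = "*** BEGIN METADATA ***"
END = "*** END METADATA ***"

def extract_block(text):
    """Return a dict of the DC-style fields embedded in a plain-text e-book."""
    try:
        start = text.index(BEGIN) + len(BEGIN)
        stop = text.index(END, start)
    except ValueError:
        return {}  # no embedded metadata block found
    fields = {}
    for line in text[start:stop].splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

book = """Some front matter...
*** BEGIN METADATA ***
DC.Title: Flatland
DC.Language: en
*** END METADATA ***
The story begins here.
"""
```

The same extractor works whether the block sits at the top of the file or buried after a title page, since it searches for the markers rather than assuming a position.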
If you wanted to search for all Polish math books, how would you write the query program so that you get all of them, without duplicates caused by the different formats, and without wasting a lot of CPU cycles? Not all texts have a .txt version.
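One way to sidestep the duplicate problem is to key everything by a format-independent book identifier rather than by file name, so a book that exists as both HTML and TEI (but has no .txt) still matches exactly once. A minimal sketch, assuming hypothetical record dicts with invented field names:

```python
# Sketch of a duplicate-free query over metadata records. Each record is
# assumed to carry a format-independent "book" id; the field names and
# etext numbers are invented for illustration.
records = [
    {"book": "etext01234", "language": "pl", "category": "mathematics",
     "formats": ["html", "tei"]},   # a book with no .txt version at all
    {"book": "etext01234", "language": "pl", "category": "mathematics",
     "formats": ["html", "tei"]},   # the same book listed a second time
    {"book": "etext09999", "language": "pl", "category": "poetry",
     "formats": ["txt"]},
]

def query(records, **wanted):
    """Return one record per book id whose fields match all wanted values."""
    seen = set()
    hits = []
    for rec in records:
        if rec["book"] in seen:
            continue  # already matched via another record for this book
        if all(rec.get(k) == v for k, v in wanted.items()):
            seen.add(rec["book"])
            hits.append(rec)
    return hits

polish_math = query(records, language="pl", category="mathematics")
```

Each record is examined once and duplicates are rejected by id, so the cost is a single linear pass regardless of how many formats a book comes in.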