
Jon Noring wrote:
Actually, I think what we'd like to do is to "categorize" the texts
using one or more categorical systems, and then embed that information right into the book (which is a digital object).
Instead of embedding it into the e-book, I think it would work better as a separate file. If you embed it in the e-books, you will need to put it in all the versions (HTML, text, PDF, TEI, etc.) and keep ALL of them up to date. You would also have to search the entire text of the book to find all the metadata.

As a separate file, it would also be easier to download just the metadata when you want to do "local" searches, without needing to download the full text of every e-book. And if you want to make it "user" editable, however you want to define "user", a separate file means the original e-book files don't constantly get flagged as modified.

Also, make it easy to join the meta files into a single file (cat *.meta > all.meta would be ideal) so that large numbers of books could be munged at once, or catalogues of specific groupings could easily be created (e.g. science-fiction/German). This would just require having a header in each file specifying which book it applies to. The format could be plain text, XML, or even TEI. If you use an XML-based version, a text version could easily be generated from it.
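To make that concrete, here is a sketch of what one per-book meta file might look like. The field names, the delimiters, and the etext number are all invented for illustration; the real format would have to be agreed on:

```
# polish-math-01234.meta -- hypothetical example, not an agreed format
book: etext01234
title: Zarys Geometryi
author: Kowalski, Jan
language: pl
category: mathematics
formats: html, tei
```

Because every record begins with a `book:` header line, `cat *.meta > all.meta` yields a valid multi-record file, and splitting it back apart (or filtering it) stays trivial.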
This is essentially adding metadata, or what the Yahoo folk call "microformats" (which is a terrible name), right into the object. This is done now in many kinds of digital objects, such as audio, video and some ebook formats.
Instead of just a category, you could store all sorts of information in the "meta" file: author's name, copyright date(s), categories (science fiction, horticulture, cookbook), available formats (text, html, tei, pdf, etc.), language(s), links to web sites, a link to an author meta file, and any other information you would like to find in a card catalog.
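A parser for such a concatenated file could be very small. This is only a sketch, assuming a hypothetical "key: value" format in which each record starts with a `book:` line; none of this is an agreed-upon standard:

```python
def parse_meta(text):
    """Split concatenated meta records into a list of dicts."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "book":          # a "book:" line opens a new record
            current = {"book": value}
            records.append(current)
        elif current is not None:  # any other field belongs to the open record
            current[key] = value
    return records

sample = """\
book: etext01234
title: Zarys Geometryi
language: pl

book: etext05678
title: A Cookbook
language: en
"""
```

Here `parse_meta(sample)` returns two dicts, one per book, so an `all.meta` built with plain `cat` needs no extra structure at all.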
This way no external categorization needs to be applied -- it is all recorded internally, meaning each book can become autonomous of the others since it carries its own metadata. Particular "libraries" can build a lookup table of their choosing by simply sniffing through all the texts they hold. It doesn't really matter where the text files are placed or how they are organized in a file structure. Multiple categorization systems can be supported in parallel, provided the texts carry the requisite information.
I think it could become a problem if the metadata embedded in the different formats turned out to differ. Which one has the correct information, the text version or the HTML one?
In XML, there are a number of ways this info could be embedded. In plain text documents, some sort of machine-recognizable "plain text" syntax has to be developed -- it'd be quite simple, actually. I think those who advocate plain text should develop a "plain text" metadata system (such as one based on Dublin Core) to insert somewhere in the file.
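A plain-text convention could be as simple as a delimited block near the top of the file. Purely as an illustration -- the marker lines and the DC-style field names below are made up, not an existing standard -- extracting such an embedded block might look like this:

```python
# Hypothetical convention: metadata sits between two marker lines somewhere
# in the plain-text file, with field names loosely following Dublin Core.
BEGIN = "*** BEGIN METADATA ***"
END = "*** END METADATA ***"

def extract_block(text):
    """Return a dict of the DC-style fields embedded in a plain-text e-book."""
    try:
        start = text.index(BEGIN) + len(BEGIN)
        stop = text.index(END, start)
    except ValueError:
        return {}  # no embedded metadata block found
    fields = {}
    for line in text[start:stop].splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

book = """Some front matter...
*** BEGIN METADATA ***
DC.Title: Flatland
DC.Language: en
*** END METADATA ***
The story begins here.
"""
```

The same extractor works whether the block sits at the top of the file or buried after a title page, since it searches for the markers rather than assuming a position.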
If you wanted to search for all Polish math books, how would you write the query program so that you get all of them, without duplicates caused by the different formats, and without wasting a lot of CPU cycles? Not all texts have a .txt version.
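One way to sidestep the duplicate problem is to key everything by a format-independent book identifier rather than by file name, so a book that exists as both HTML and TEI (but has no .txt) still matches exactly once. A minimal sketch, assuming hypothetical record dicts with invented field names:

```python
# Sketch of a duplicate-free query over metadata records. Each record is
# assumed to carry a format-independent "book" id; the field names and
# etext numbers are invented for illustration.
records = [
    {"book": "etext01234", "language": "pl", "category": "mathematics",
     "formats": ["html", "tei"]},   # a book with no .txt version at all
    {"book": "etext01234", "language": "pl", "category": "mathematics",
     "formats": ["html", "tei"]},   # the same book listed a second time
    {"book": "etext09999", "language": "pl", "category": "poetry",
     "formats": ["txt"]},
]

def query(records, **wanted):
    """Return one record per book id whose fields match all wanted values."""
    seen = set()
    hits = []
    for rec in records:
        if rec["book"] in seen:
            continue  # already matched via another record for this book
        if all(rec.get(k) == v for k, v in wanted.items()):
            seen.add(rec["book"])
            hits.append(rec)
    return hits

polish_math = query(records, language="pl", category="mathematics")
```

Each record is examined once and duplicates are rejected by id, so the cost is a single linear pass regardless of how many formats a book comes in.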