Re: [gutvol-d] RST/PGTEI/etc

9 Feb 2012

      On Wed, February 8, 2012 4:27 pm, don kretz wrote:
...
"Metadata" is a loaded word in this context. In particular, DP has its own
legacy set of definitions.
It's worth taking the time to explain what you think the metadata is in
pretty complete and precise terms, lest you be misunderstood and
differing assumptions be made particularly about the difficulty of
collecting it.

Fair enough.

Let me start by simply exposing, not discussing, two basic concepts:

Metadata is data about data. Thus, when you start to talk about "metadata" you
first have to identify the "data" that you're "meta-ing."

The bibliographic people at the International Federation of Library
Associations and Institutions have come up with the WEMI model: Work,
Expression, Manifestation, Item. "Work" is the abstract concept from which all
Expressions are derived (one cannot have a "Work" without at least one
"Expression"). _Huckleberry Finn_ by Mark Twain is a "Work." The 1902
Authorized Edition of _Huck Finn_ with corrections by the author is an
"Expression," as is a translation of _Huck Finn_ into another language. Every
"Expression" has at least one "Manifestation" (the 1902 Authorized Edition
published in New York by Dunlop). An "Item" is a specific book, on a specific
shelf in a specific library. We don't care about "Items".
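
To make the hierarchy concrete, here is a minimal sketch of the W, E and M
levels as nested records. The class names, field names and the is_valid()
check are illustrative assumptions of this sketch, not anything defined by
IFLA; the sample values are the ones from the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manifestation:
    # One concrete publication of an Expression (hypothetical fields).
    publisher: str
    place: str
    year: int


@dataclass
class Expression:
    # One realization of a Work: an edition, a translation, and so on.
    description: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    # The abstract Work from which all Expressions derive.
    title: str
    author: str
    expressions: List[Expression] = field(default_factory=list)

    def is_valid(self) -> bool:
        # A Work needs at least one Expression, and every Expression
        # needs at least one Manifestation.
        return bool(self.expressions) and all(
            e.manifestations for e in self.expressions
        )


huck = Work(
    title="Huckleberry Finn",
    author="Mark Twain",
    expressions=[
        Expression(
            description="1902 Authorized Edition, corrected by the author",
            manifestations=[
                Manifestation(publisher="Dunlop", place="New York", year=1902)
            ],
        )
    ],
)
```

"Items" are left out on purpose, since we don't care about them here.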

So, just brainstorming here, there can be at least three "data" that we might
want to "meta". First, there's WEM data. This is the data usually captured in
the <metadata> section of an ePub's .opf file, and what is traditionally thought
of when someone uses the term "metadata." Most electronic publications use the
Dublin Core model to record this metadata, but some do not. WEM metadata is
the sole focus of Internet Archive's Open Library project (openlibrary.org).

ePub is an interesting use case, because an ePub is fundamentally a collection
of files, which only together can be considered a "publication"; individually,
these files are just, well, individual files. In this case, there needs to be
metadata that describes which files are part of the publication, and how they
go together. This is metadata not about the work or its expression, but
metadata about the ePub publication structure. In the case of ePub,
publication metadata is also stored in the .opf file; indeed, that is the .opf
file's primary function.
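
As a rough illustration of how one .opf file carries both kinds of metadata,
the sketch below reads a deliberately simplified, hand-written package
document with Python's standard library. The sample file names, ids and the
urn:example identifier are made up for the example, and the sample is not
meant to be a complete, valid package; only the two namespace URIs and the
metadata/manifest/spine element names come from the OPF format itself.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, stripped-down package document for illustration only.
SAMPLE_OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="2.0"
         unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">urn:example:huck-finn</dc:identifier>
    <dc:title>Huckleberry Finn</dc:title>
    <dc:creator>Mark Twain</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>
"""

root = ET.fromstring(SAMPLE_OPF.strip())

# WEM metadata: the Dublin Core elements, keyed by local name.
wem = {el.tag.split("}")[1]: el.text for el in root.find("opf:metadata", NS)}

# Publication metadata: which files belong to the publication...
manifest = {
    item.get("id"): item.get("href")
    for item in root.findall("opf:manifest/opf:item", NS)
}

# ...and the order in which the content documents are read.
reading_order = [
    manifest[ref.get("idref")]
    for ref in root.findall("opf:spine/opf:itemref", NS)
]

print(wem["title"], "by", wem["creator"])
print(reading_order)
```

The wem dictionary describes the work and its expression; manifest and
reading_order describe only how this particular ePub is put together.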

Every e-text in PG's corpus came from somewhere. Some individual created the
file, and usually other individuals have modified it over the course of its
lifetime. At some point in time someone decided that PG wouldn't be sued for
publishing its version on the internet. And sometimes some automated process
may have been triggered that changed the nature of the e-text.

The collection of data that describes PG's processes is also metadata. Much of
this data has been lost in the sands of time (download statistics) and other
data that should have been lost (Al Haines' credit line) has been preserved.
But it is what it is, and what it is is metadata.

I've identified here three relevant types of metadata: WEM metadata (the most
important), publication metadata (also important), and PG metadata (some
important, some not).

Going forward I will try to differentiate between these different types of
metadata. When I do not, everyone may assume that I'm talking about WEM
metadata.

I'm particularly interested in hearing from Ms. Lofstrom with suggestions
about what WEM metadata should be collected, and how it might be structured
and retained.

Lee Passey