
On Wed, February 8, 2012 4:27 pm, don kretz wrote:
"Metadata" is a loaded word in this context. In particular, DP has it's own legacy set of definitions.
It's worth taking the time to explain what you think the metadata is in pretty complete and precise terms, lest you be misunderstood and differing assumptions be made particularly about the difficulty of collecting it.
Fair enough. Let me start by simply exposing, not discussing, two basic concepts:
Metadata is data about data. Thus, when you start to talk about "metadata" you first have to identify the "data" that you're "meta-ing."
The bibliographic people at the International Federation of Library Associations and Institutions have come up with the WEMI model: Work, Expression, Manifestation, Item. A "Work" is the abstract concept from which all Expressions are derived (one cannot have a "Work" without at least one "Expression"). _Huckleberry Finn_ by Mark Twain is a "Work." The 1902 Authorized Edition of _Huck Finn_ with corrections by the author is an "Expression," as is a translation of _Huck Finn_ into another language. Every "Expression" has at least one "Manifestation" (the 1902 Authorized Edition published in New York by Dunlop). An "Item" is the specific book, on a specific shelf in a specific library. We don't care about "Items."
So, just brainstorming here, there can be at least three "data" that we might want to "meta".
First, there's WEM data. Metadata about it is what is usually captured in the <metadata> section of an ePub's .opf file, and what is traditionally thought of when someone uses the term "metadata." Most electronic publications use the Dublin Core model to record this metadata, but some do not. WEM metadata is the sole focus of Internet Archive's Open Library project (openlibrary.org).
Second, ePub is an interesting use case, because an ePub is fundamentally a collection of files which only together can be considered a "publication"; individually, these files are just, well, individual files. In this case, there needs to be metadata that describes which files are part of the publication, and how they go together. This is metadata not about the work or its expression, but about the structure of the ePub publication. In the case of ePub, publication metadata is also stored in the .opf file; indeed, that is the .opf file's primary function.
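To make the difference between WEM metadata and publication metadata concrete, here is a rough Python sketch that pulls the two apart from a single .opf document. Everything in it is an assumption for illustration: the sample package, its file names, and its identifier are made up, and the helper names wem_metadata and reading_order are hypothetical. It uses only the standard library and the usual OPF and Dublin Core namespace URIs.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical, heavily trimmed package document (not taken from a real ePub).
SAMPLE_OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Adventures of Huckleberry Finn</dc:title>
    <dc:creator>Mark Twain</dc:creator>
    <dc:language>en</dc:language>
    <dc:identifier id="id">example-identifier</dc:identifier>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.html" media-type="application/xhtml+xml"/>
    <item id="ch2" href="chapter2.html" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>
""".strip()


def wem_metadata(root):
    """Collect the Dublin Core fields: data about the work, not about the files."""
    fields = {}
    for el in root.find(f"{{{OPF_NS}}}metadata"):
        if el.tag.startswith(f"{{{DC_NS}}}"):
            name = el.tag.split("}", 1)[1]  # strip the namespace prefix
            fields.setdefault(name, []).append(el.text)
    return fields


def reading_order(root):
    """Resolve the spine against the manifest: data about how the files fit together."""
    manifest = {
        item.get("id"): item.get("href")
        for item in root.find(f"{{{OPF_NS}}}manifest")
    }
    return [manifest[ref.get("idref")] for ref in root.find(f"{{{OPF_NS}}}spine")]


root = ET.fromstring(SAMPLE_OPF)
print(wem_metadata(root))   # Dublin Core fields (WEM metadata)
print(reading_order(root))  # file list in reading order (publication metadata)
```

Note that style.css appears in the manifest but not in the spine: it is part of the publication without being part of the reading order, which is exactly the kind of fact that publication metadata records and WEM metadata does not.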
Third, every e-text in PG's corpus came from somewhere. Some individual created the file, and usually other individuals have modified it over the course of its lifetime. At some point in time someone decided that PG wouldn't be sued for publishing its version on the internet. And sometimes an automated process may have been triggered that changed the nature of the e-text. The collection of data that describes PG's processes is also metadata. Much of this data has been lost in the sands of time (download statistics), and other data that should have been lost (Al Haines' credit line) has been preserved. But it is what it is, and what it is is metadata.
I've identified here three relevant types of metadata: WEM metadata (the most important), publication metadata (also important), and PG metadata (some important, some not). Going forward I will try to differentiate between these types of metadata. When I do not, everyone may assume that I'm talking about WEM metadata.
I'm particularly interested in hearing from Ms. Lofstrom with suggestions about what WEM metadata should be collected, and how it might be structured and retained.