
Greg>We have partial metadata in the text/html. Are you talking about adding DC metadata in what is visible? Or, encoded but not necessarily visible? Either seems pretty achievable.

Well, consider the "metadata" I can find in one particular PG HTML file. Different tools grab and process this "metadata" in different ways, which is not surprising since the "metadata" is presented in self-conflicting ways:

<title>Emma, by Jane Austen</title>
The Project Gutenberg EBook of Emma, by Jane Austen
Title: Emma
Author: Jane Austen
Release Date: January 21, 2010 Ebook #158
PROJECT GUTENBERG EBOOK EMMA
Produced by An Anonymous Volunteer, and David Widger
<h1>EMMA</h1>
<h2>By Jane Austen</h2>
<h1>VOLUME I</h1>
End of the Project Gutenberg EBook of Emma, by Jane Austen
*** END OF THIS PROJECT GUTENBERG EBOOK EMMA ***
***** This file should be named 158-h.htm or 158-h.zip *****

==========================

Note, for example, that the title information is presented in five different, conflicting ways. I.e., PG has given this book five different "titles."

This HTML "metadata" is somehow supposed to be mapped (for example) onto Dublin Core:

contributor, creator, date, description, format, identifier, language, publisher, rights, source, subject, title

And now compare to the PG RDF catalog (which does this somewhat better):

http://www.gutenberg.org/ebooks/158.rdf

One simple approach (for example) might be to append the Dublin Core info to the end of the file, but one might also hope that PG would decide on a clean way to name the "title" and "author" of their works.
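To make the "append the Dublin Core info" idea concrete, here is a minimal sketch in Python. It assumes the "Label: value" header lines shown above are the source of truth and that the result is written as `<meta name="DC.*">` tags, which is one common convention for Dublin Core in HTML. The label-to-element mapping, the function names, and the hard-coded publisher are my own illustrative assumptions, not anything PG actually does.

```python
import html
import re

# Hypothetical mapping from PG header labels to Dublin Core element names.
HEADER_TO_DC = {
    "Title": "title",
    "Author": "creator",
    "Release Date": "date",
    "Language": "language",
}


def header_to_dc(header_text):
    """Collect 'Label: value' header lines into a dict keyed by DC element."""
    # Assumption: every PG file should name Project Gutenberg as publisher.
    record = {"publisher": "Project Gutenberg"}
    for line in header_text.splitlines():
        match = re.match(r"\s*([A-Za-z ]+?):\s*(.+?)\s*$", line)
        if match and match.group(1) in HEADER_TO_DC:
            record[HEADER_TO_DC[match.group(1)]] = match.group(2)
    return record


def dc_meta_tags(record):
    """Render the record as HTML meta tags, one per DC element."""
    return [
        '<meta name="DC.{}" content="{}">'.format(
            name, html.escape(value, quote=True)
        )
        for name, value in sorted(record.items())
    ]


sample = """Title: Emma
Author: Jane Austen
Release Date: January 21, 2010
"""

for tag in dc_meta_tags(header_to_dc(sample)):
    print(tag)
```

The point is only that one unambiguous "Title:" and "Author:" line is enough to generate the rest mechanically. The hard part is agreeing on which of the five "titles" is the real one.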
For example, one commercial tool I use finds a title for this book of:

The Project Gutenberg EBook of Emma, by Jane Austen

And an author of:

[none]

Whereas I would have hoped for a title of:

Emma

And an author of:

Austen, Jane

And perhaps a publisher of:

Project Gutenberg

For example, if I search for this book locally in my collection, I would hope to find it sorted by title under "E", not "P" or "T" (which would not be useful given how many books I have from PG), and sorted by author I would hope it would show up under "A", not "J".
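As a rough illustration of what a tool would have to do today to get from the `<title>`-style string to the values I would hope for, here is a small Python sketch. The function names are hypothetical, and the heuristics (strip a fixed PG prefix, split on the last ", by ", treat the last word of the name as the surname) are naive assumptions that will fail on plenty of real titles and names.

```python
import re

# Assumed prefix wording, taken from the one example file discussed above.
PG_TITLE_PREFIX = re.compile(r"^The Project Gutenberg EBook of\s+")


def split_pg_title(raw_title):
    """Split 'The Project Gutenberg EBook of T, by A' into (T, A)."""
    text = PG_TITLE_PREFIX.sub("", raw_title.strip())
    title, sep, author = text.rpartition(", by ")
    if not sep:
        # No ', by ' found: keep the whole string as the title.
        return text, None
    return title, author


def file_as(author):
    """Naive 'First Last' -> 'Last, First' conversion for sorting."""
    parts = author.split()
    if len(parts) < 2:
        return author
    return "{}, {}".format(parts[-1], " ".join(parts[:-1]))


title, author = split_pg_title(
    "The Project Gutenberg EBook of Emma, by Jane Austen"
)
print(title)            # Emma
print(file_as(author))  # Austen, Jane
```

Having every tool guess like this is exactly the problem. Clean title and author fields in the file itself would make this kind of guessing unnecessary.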