
On 2/25/2011 11:48 AM, dakretz@gmail.com wrote:
> There are conventions for identifying these artifacts in an html document, but there is nothing incorporated into HTML that declares unambiguously that (for instance) an H2 with a class of "chapter-head" is the way a given document does it.
True. If you choose to restrict yourself to "unclassified" HTML (i.e. HTML without class attributes), there is no way in the W3C definition of HTML to explicitly indicate that any particular point in a document is the start of a chapter, or that any particular point is the end of a chapter. Of course, there are ways to indicate that a span of text is a paragraph as opposed to an anonymous block, that it is a header as opposed to a paragraph, or that it should be emphasized. These structures alone put bare HTML ahead of unstructured text. If you choose not to establish (or to ignore) conventions, neither is there any way in HTML to implicitly detect these kinds of structures. But then, if you choose to ignore conventions there is no way to implicitly detect these kinds of structures in /any/ kind of text, so on this count HTML is at least as good as anything else. Luckily, I do not choose to restrict myself to "unclassified" HTML, and I am willing to adhere to conventions, so I've never been confronted with an e-book that I could not represent in HTML.
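For instance, the sort of convention I have in mind might look like the sketch below. This is only an illustration, not anything the W3C mandates: "chapter-head" is borrowed from your own example, and the enclosing "chapter" class is my invention.

    <div class="chapter">
      <h2 class="chapter-head">Chapter I. Down the Rabbit-Hole</h2>
      <p>Alice was beginning to get very tired of sitting by her
         sister on the bank ...</p>
    </div>

Any agreed-upon pair of names would do just as well; the point is that once the convention is fixed, a program can find every chapter simply by looking for it.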
> Nor is there an implicit way to declare such a structure as metadata.
True again (although see my comments above about implicit structures). Luckily, there is an /explicit/ way to declare metadata in HTML: the <meta> tag.
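To give one hedged example (the Dublin Core names below are a widely used convention layered on top of <meta>, not something HTML itself requires, and the values are of course illustrative), a document head might carry its metadata like this:

    <head>
      <title>Alice's Adventures in Wonderland</title>
      <meta name="DC.creator"  content="Carroll, Lewis" />
      <meta name="DC.title"    content="Alice's Adventures in Wonderland" />
      <meta name="DC.language" content="en" />
    </head>

Anyone harvesting metadata can then read the <meta> elements without ever having to parse the body of the book.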
> All we do is choose one of the ways we saw someone else do it, or make up another way ourselves. But it doesn't inhere to HTML, nor is it particularly easier than any other form of representation. It's equivalent to saying "It's a new chapter when there are four blank lines, and the first paragraph is the chapter title" in a text document.
Yes, this much should be obvious. There is no adherence to conventions (some would say that there /are/ no conventions), and therefore there is nothing that an automated process can rely on. Without adherence to specifications and conventions, any kind of automated processing, and this includes extracting metadata as much as it does file format conversions, is impossible. This is why I constantly return to (2), politics. There are many, many (1) technical solutions to the very problems you have identified here. I happen to prefer XHTML enhanced by classes and conventions -- it has a nice balance between theory and practice. But it's not the only possible solution. TEI is workable, as is reStructuredText. Given time, even z.m.l. could evolve into a workable solution if BowerBird were not quite so dogmatic about it.

But as Mr. Hart has recently quite vehemently stated (with the caveat that I usually have a hard time figuring out what Mr. Hart is actually saying), PG will not enforce a set of standards or conventions, PG will not adopt a set of standards or conventions, PG will not endorse a set of standards or conventions, PG will not recommend a set of standards or conventions, PG will not speak kindly of any set of standards or conventions, and PG will not acknowledge that there are any standards or conventions. Given this position, I find it difficult, if not impossible, to believe that Project Gutenberg can evolve to do anything more than what it is doing at present: store a mess'o'text and make it available for download.

I will admit that Distributed Proofreaders is not quite so dogmatic in this regard, although it too suffers from political gridlock. DP texts are better structured than the majority of Project Gutenberg texts (especially the early PG texts that no one seems to be interested in cleaning up), but it still refuses to adopt a complete enough set of standards and conventions to allow the kind of automated data processing that many here desire. The basic principle seems to be "we can't because we won't."
> And including images by reference with a url is an explicit admission that it stands outside the HTML structure. There's no assurance that the reference is even available or legitimate if that document were copied elsewhere.
I guess that depends on your point of view. As I see it, the 'src' attribute (and its cousin 'href') is an explicit way to /include/ objects in the HTML structure by reference. I'm not so rigid in my mindset that I insist that a single "e-book" necessarily requires a single file. And if I need a single file I have ZIP or its fraternal twin ePub. In any case, including a 'src' or 'href' inside HTML is better than anything unstructured text can offer.
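A sketch of what I mean by inclusion by reference (the file names are invented for illustration):

    <p><img src="images/illus-012.png" alt="The Mad Tea-Party" /></p>
    <p>See also <a href="chapter-07.xhtml">Chapter VII. A Mad Tea-Party</a>.</p>

As long as the references are relative, the whole book (markup, images and all) travels together inside the ZIP or ePub container, and the "copied elsewhere" objection disappears.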
> And there's no way within the bounds of (X)HTML standards to provide such a capability.
Based on the requirements I have gleaned from your message, it seems that only TEI offers the capabilities you desire, and we already know the political firestorm that /that/ suggestion would engender. But I think your assertion, unsupported by evidence, is simply wrong. If you insist on embedding arbitrary binary objects inside an HTML file, it can be done with the <object> element. It will be hard to find user agents that support it, but that's the fault of the implementations, not the specification. In fact, so far I have not seen evidence of /any/ book structure that cannot be accommodated by (X)HTML.
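To make the <object> point concrete, here is a hedged sketch of the sort of thing the specification allows (the file name and MIME type are invented for illustration):

    <object data="audio/reading-ch01.ogg" type="application/ogg"
            width="300" height="40">
      <p>Your user agent cannot play the embedded recording;
         <a href="audio/reading-ch01.ogg">download it here</a> instead.</p>
    </object>

The fallback content inside the element is also part of the specification, so even a user agent that ignores <object> still gives the reader something useful.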
> Certainly we can't publish or even describe an API that would provide unambiguous text and metadata sufficient to construct a properly structured ebook in any other representation than the one in which it is stored.
Why not? It certainly seems feasible to me. Can you provide examples of why it cannot be done?
> And among the storage formats we provide from which such information might be inferred, it seems to be in most cases the plain-text format that is most accessible.
Only to humans, who have highly developed neural networks. Unstructured text is only acceptable if you are willing to accept that all manipulations of the files will be done by humans, and even then some conventions are encouraged if not required. It seems to me that this argument boils down to: "We can't figure out how to structure texts without adopting certain standards and conventions; we refuse to adopt the standards and conventions that would allow these texts to be repurposed; therefore we're just going to give up. Our texts are made by humans, for humans, and nothing else is supported." Again, I have to say that the problems facing automated text conversions are political, not technical, and I don't see any movement among the PG despots that indicates any kind of willingness to help solve those problems. The best advice I can offer is to seek asylum elsewhere.