
On 2/25/2011 11:48 AM, dakretz@gmail.com wrote:
> There are conventions for identifying these artifacts in an html document, but there is nothing incorporated into HTML that declares unambiguously that (for instance) an H2 with a class of "chapter-head" is the way a given document does it.
True. If you choose to restrict yourself to "unclassified" HTML (i.e. HTML without class attributes), there is no way in the W3C definition of HTML to explicitly indicate that any particular point in a document is the start of a chapter, or that any particular point is the end of a chapter. Of course, there are ways to indicate that a span of text is a paragraph as opposed to an anonymous block, that it is a header as opposed to a paragraph, or that it should be emphasized. These structures alone put bare HTML ahead of unstructured text. If you choose not to establish (or to ignore) conventions, neither is there any way in HTML to implicitly detect these kinds of structures. But then, if you choose to ignore conventions there is no way to implicitly detect these kinds of structures in /any/ kind of text, so on this count HTML is at least as good as anything else. Luckily, I do not choose to restrict myself to "unclassified" HTML, and I am willing to adhere to conventions, so I've never been confronted with an e-book that I could not represent in HTML.
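For instance, the sort of convention I have in mind might look like the sketch below. This is only an illustration, not anything the W3C mandates: "chapter-head" is borrowed from your own example, and the enclosing "chapter" class is my invention.

    <div class="chapter">
      <h2 class="chapter-head">Chapter I. Down the Rabbit-Hole</h2>
      <p>Alice was beginning to get very tired of sitting by her
         sister on the bank ...</p>
    </div>

Any agreed-upon pair of names would do just as well; the point is that once the convention is fixed, a program can find every chapter simply by looking for it.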
> Nor is there an implicit way to declare such a structure as metadata.
True again (although see my comments above about implicit structures). Luckily, there is an /explicit/ way to declare metadata in HTML: the <meta> tag.
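To give one hedged example (the Dublin Core names below are a widely used convention layered on top of <meta>, not something HTML itself requires, and the values are of course illustrative), a document head might carry its metadata like this:

    <head>
      <title>Alice's Adventures in Wonderland</title>
      <meta name="DC.creator"  content="Carroll, Lewis" />
      <meta name="DC.title"    content="Alice's Adventures in Wonderland" />
      <meta name="DC.language" content="en" />
    </head>

Anyone harvesting metadata can then read the <meta> elements without ever having to parse the body of the book.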
> All we do is choose one of the ways we saw someone else do it, or make up another way ourselves. But it doesn't inhere to HTML, nor is it particularly easier than any other form of representation. It's equivalent to saying "It's a new chapter when there are four blank lines, and the first paragraph is the chapter title" in a text document.
Yes, this much should be obvious. There is no adherence to conventions (some would say that there /are/ no conventions), and therefore there is nothing that an automated process can rely on. Without adherence to specifications and conventions, any kind of automated processing, and this includes extracting metadata as much as it does file format conversions, is impossible. This is why I constantly return to (2), politics. There are many, many (1) technical solutions to the very problems you have identified here. I happen to prefer XHTML enhanced by classes and conventions -- it has a nice balance between theory and practice. But it's not the only possible solution. TEI is workable, as is reStructuredText. Given time, even z.m.l. could evolve into a workable solution if BowerBird were not quite so dogmatic about it.

But as Mr. Hart has recently quite vehemently stated (with the caveat that I usually have a hard time figuring out what Mr. Hart is actually saying), PG will not enforce a set of standards or conventions, PG will not adopt a set of standards or conventions, PG will not endorse a set of standards or conventions, PG will not recommend a set of standards or conventions, PG will not speak kindly of any set of standards or conventions, and PG will not acknowledge that there are any standards or conventions. Given this position, I find it difficult, if not impossible, to believe that Project Gutenberg can evolve to do anything more than what it is doing at present: store a mess'o'text and make it available for download.

I will admit that Distributed Proofreaders is not quite so dogmatic in this regard, although it too suffers from political gridlock. DP texts are better structured than the majority of Project Gutenberg texts (especially the early PG texts that no one seems to be interested in cleaning up), but it still refuses to adopt a complete enough set of standards and conventions to allow the kind of automated data processing that many here desire. The basic principle seems to be "we can't because we won't."
> And including images by reference with a url is an explicit admission that it stands outside the HTML structure. There's no assurance that the reference is even available or legitimate if that document were copied elsewhere.
I guess that depends on your point of view. As I see it, the 'src' attribute (and its cousin 'href') is an explicit way to /include/ objects in the HTML structure by reference. I'm not so rigid in my mindset that I insist that a single "e-book" necessarily requires a single file. And if I need a single file I have ZIP or its fraternal twin ePub. In any case, including a 'src' or 'href' inside HTML is better than anything unstructured text can offer.
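A sketch of what I mean by inclusion by reference (the file names are invented for illustration):

    <p><img src="images/illus-012.png" alt="The Mad Tea-Party" /></p>
    <p>See also <a href="chapter-07.xhtml">Chapter VII. A Mad Tea-Party</a>.</p>

As long as the references are relative, the whole book (markup, images and all) travels together inside the ZIP or ePub container, and the "copied elsewhere" objection disappears.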
> And there's no way within the bounds of (X)HTML standards to provide such a capability.
Based on the requirements I have gleaned from your message, it seems that only TEI offers the capabilities you desire, and we already know the political firestorm that /that/ suggestion would engender. But I think your assertion, unsupported by evidence, is simply wrong. If you insist on embedding arbitrary binary objects inside an HTML file, it can be done with the <object> element. It will be hard to find user agents that support it, but that's the fault of the implementations, not the specification. In fact, so far I have not seen evidence of /any/ book structure that cannot be accommodated by (X)HTML.
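To make the <object> point concrete, here is a hedged sketch of the sort of thing the specification allows (the file name and MIME type are invented for illustration):

    <object data="audio/reading-ch01.ogg" type="application/ogg"
            width="300" height="40">
      <p>Your user agent cannot play the embedded recording;
         <a href="audio/reading-ch01.ogg">download it here</a> instead.</p>
    </object>

The fallback content inside the element is also part of the specification, so even a user agent that ignores <object> still gives the reader something useful.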
> Certainly we can't publish or even describe an API that would provide unambiguous text and metadata sufficient to construct a properly structured ebook in any other representation than the one in which it is stored.
Why not? It certainly seems feasible to me. Can you provide examples of why it cannot be done?
> And among the storage formats we provide from which such information might be inferred, it seems to be in most cases the plain-text format that is most accessible.
Only to humans, who have highly developed neural networks. Unstructured text is only acceptable if you are willing to accept that all manipulations of the files will be done by humans, and even then some conventions are encouraged if not required. It seems to me that this argument boils down to: "We can't figure out how to structure texts without adopting certain standards and conventions; we refuse to adopt the standards and conventions that would allow these texts to be repurposed; therefore we're just going to give up. Our texts are made by humans, for humans, and nothing else is supported." Again, I have to say that the problems facing automated text conversions are political, not technical, and I don't see any movement among the PG despots that indicates any kind of willingness to help solve those problems. The best advice I can offer is to seek asylum elsewhere.