
don said:
We're currently not collecting most of that, except by inference from the layout;
of course, "by inference from the layout" is _precisely_ how we humans understand the structure of our books. headers are not _labeled_ as headers, to give the most obvious example that comes to mind, yet we still seem to be able to point out a header just _fine_! likewise, thank goodness we humans don't need to have every paragraph _labeled_, because that would be extremely obtrusive and unnecessarily awkward, right?

let's stop treating our computers as if they are dim-witted and need to have everything explicitly spelled out before they are capable of understanding and categorizing it... there are no stupid computers; only inept programmers.

for instance, it's quite easy to write a routine that can recognize the headers in any project gutenberg e-text with a very high degree of both certainty and accuracy. and in the cases where our easy-to-code program does make a mistake, we can easily edit the offending e-text, so that it will _not_ cause a problem for our nifty script.

ditto with everything else you mentioned, don...

-bowerbird
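
p.s. to make the header claim concrete, here is a minimal sketch of that kind of routine. the layout rule it uses is an assumption for illustration, not a documented project gutenberg convention: a header is taken to be a short single line with at least two blank lines above it and a blank line below it. the name find_headers and both thresholds are hypothetical, and you would tune them against real e-texts.

```python
def find_headers(text, min_blank_before=2, max_len=60):
    """Return (line_number, text) pairs for lines that look like headers.

    Assumed layout rule (a heuristic, not an official convention):
    a header is a short line preceded by at least `min_blank_before`
    blank lines and followed by a blank line (or the end of the text).
    """
    lines = text.splitlines()
    headers = []
    # Treat the start of the text as if it were preceded by blank lines,
    # so a header on the very first line can still be detected.
    blank_run = min_blank_before
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            blank_run += 1
            continue
        followed_by_blank = i + 1 >= len(lines) or not lines[i + 1].strip()
        if (blank_run >= min_blank_before
                and followed_by_blank
                and len(stripped) <= max_len):
            headers.append((i + 1, stripped))  # 1-based line numbers
        blank_run = 0  # any non-blank line ends the current run of blanks
    return headers


if __name__ == "__main__":
    sample = (
        "CHAPTER I\n"
        "\n"
        "It was a dark and stormy night, and the rain fell in torrents\n"
        "except at occasional intervals.\n"
        "\n"
        "\n"
        "\n"
        "CHAPTER II\n"
        "\n"
        "The story goes on from here.\n"
    )
    for number, title in find_headers(sample):
        print(number, title)
```

a short one-line paragraph that happens to sit under extra blank lines would be misread as a header by this rule. that is exactly the kind of case where editing the offending e-text is the simpler fix.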