Well, let's keep the name calling off-line and the discussion pure..., and realise that XML is not a format but a way of specifying formats (probably all these formats have in common is that they use angled brackets in some way), and that "semantically tagged" is an ideal that even the most ambitious attempts at a generic DTD for pre-existing texts (and that is what we are mostly dealing with in PG) have not reached. That ideal is either unreachable (since we can't know the original intent behind much of the formatting we encounter) or impractical (since the effort to do all this tagging is just too big, and isn't really needed by 99% of the users).

In my opinion, the best attempt at such a generic beast has been the TEI effort, which is described in a massive 1400-page document and still requires customization for numerous academic projects (both are bad news; both are unavoidable given the complexity of the task) -- but which can cover 95 percent of all texts with just 5 percent of that bulk, in an incarnation called TEI-Lite, and that is basically all I suggest PG adopt as a standard. The nice thing about this monster is that we can add those 5 percent, and if somebody decides to add more, nothing will stop him, and he can easily return the improved version to the collection.

Doing fully automatic conversion to good paged PDFs for printing nice copies (and I mean good, as distinct from merely workable) will probably always remain a dream, as good layout, just like good typographic design, is a skill learned by doing it a lot. Even in a highly programmable environment such as TeX, I've never been able to print something from "semantic" markup without manual intervention once in a while -- even for something as arcane as a two-column dictionary. Similarly, producing good HTML (as distinct from reasonable HTML) will probably also require manual intervention and tweaking once in a while... but neither of these things cancels out the large benefits we could get from having TEI-tagged master copies in our collection, even at a relatively simple level of tagging (just marking headings, divisions, italics, footnotes, and tables -- see the first sketch below).

The task of producing nice HTML / printable versions of XML documents is further complicated by the highly verbose and somewhat unintuitive model of XSLT, which is presented as the most important tool for this task. From the computer-science purist's point of view that might be true, but for many lesser gods, who think five lines of BASIC is already a lot, its functional programming model and verbosity are a real piss-off (see the second sketch below).

Getting 14,000+ texts into XML can be done just as they were produced initially: by starting somewhere with the first one, and not stopping until we've completed them all. A very simple alternative would be to load them into OpenOffice, apply the formatting you like, and save (OpenOffice uses XML files for everything, and collects them in zip archives; if you don't believe that, change the extension of an OpenOffice document to .zip and have a look inside). Of course that formatting would be very much non-"semantic".
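To give an idea of the level of TEI tagging I have in mind -- a sketch I am making up here, not taken from any actual PG text -- a chapter would look something like this:

    <div type="chapter">
      <head>Chapter I</head>
      <p>It was a <hi rend="italic">dark</hi> and stormy
      night.<note place="foot">As opening nights so often
      are.</note></p>
      <!-- tables, where they occur, are marked with
           <table>, <row> and <cell> -->
    </div>

That is the whole trick: headings, divisions, italics, footnotes, and tables, with nothing that requires guessing at the author's deeper intent.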
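And to show what I mean about XSLT's verbosity: just turning the <head> and <hi> elements above into HTML -- a job a couple of lines in any conversion script would also do -- already takes something like this (again a sketch, untested):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- a chapter heading becomes an HTML <h2> -->
      <xsl:template match="head">
        <h2><xsl:apply-templates/></h2>
      </xsl:template>

      <!-- italics become HTML <i> -->
      <xsl:template match="hi[@rend='italic']">
        <i><xsl:apply-templates/></i>
      </xsl:template>

    </xsl:stylesheet>

Every little rule is its own template, fired off by pattern matching rather than called in sequence; for somebody raised on BASIC that takes some getting used to.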
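Inside that zip you will find a content.xml, which holds something along these lines (roughly, from memory) -- and which also shows why I call it non-"semantic":

    <office:body>
      <!-- "P1" is an automatic style: it records how the
           paragraph looks, not what it means -->
      <text:p text:style-name="P1">CHAPTER I</text:p>
      <text:p text:style-name="P2">It was a dark and
      stormy night.</text:p>
    </office:body>

A style name like "P1" tells a program how to paint the text, but nothing about whether it is a heading, a quotation, or a stage direction.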
Jeroen. (Still formatting his ebooks in SGML based TEI)

Marcello Perathoner wrote:
Bowerbird@aol.com wrote:
If you are interested in good HTML or PDF you must start with a semantically tagged file (these days that's mostly an XML file).
can you give and defend your definition of "good" in this case?
ditto with "semantically tagged file"?
and, if you are up to the challenge, what is your recommendation as to the route that should be taken to get a library of 14,000+ e-texts converted to the brand of x.m.l. markup you think is best?
(bonus points if you can convince all the other x.m.l. advocates that the markup version you prefer is better than the ones they prefer.)
finally, greg recently requested that people come forward with working routines to implement an x.m.l.-master methodology. are you able to answer that call? did you? if so, do let us know.
-bowerbird