
jeroen said:
Well, let's keep the name calling off-line, and the discussion pure...
sounds like an excellent idea to me. let's see if marcello will agree. *** i appreciate your analysis, and agree with it in large part, because i think you've faced a good number of the problems. to pull them out into a bullet-point list, they are these:
that semantic tagging is an ideal, that even the most ambitious attempts at a generic DTD for pre-existing texts
(and that is what we are mostly dealing with in PG) have not reached
and is either unreachable (since we can't know the original intent with much of the formatting we encounter)
or impractical (since the effort to do all this tagging is just too big
and isn't really needed by 99% of the users.)
In my opinion, the best attempt at such a generic beast has been the TEI effort
which is described in a massive 1400-page document,
still requires customization for numerous academic projects
(both are bad news; both are unavoidable given the complexity of the task)
but which can cover 95 percent of all text with just 5 percent of that bulk
in an incarnation called TEI-Lite, and that is basically all I suggest PG adopt as a standard.
so if i was to summarize the bulk of what you've said here, concentrating on the negative, but hopefully in a fair way... semantic tagging is an ideal which may be unreachable, and is certainly impractical, since it is a big effort and is just not needed by most readers. one method -- t.e.i. -- runs to 1,400 pages of documentation, yet still requires "customization". however, a less-complex subset -- called t.e.i.-lite -- is available, and that is what i recommend... again, i don't mean to "load" the argument by concentrating on the negative aspects of a heavy-markup approach like this, because i can certainly see benefits of marked-up e-texts too. certainly a minimal form of markup is practically a requirement to move the e-texts to a reasonable e-book and typographic future. and if the library was already marked-up in x.m.l., and working, i would probably have no objections at all to continuing with it... but the reality is that the library is _not_ marked-up already. so it is necessary for us to examine very closely the _costs_ of _doing_ any markup, to make sure the _benefits_ outweigh them. in a phrase, we need to be cognizant of the _cost-benefit_ratio_. in particular, we should also consider _all_forms_of_markup_ that we think could give us a reasonable set of the benefits at a range of costs, to see which gives us the best cost-benefit ratio.
Doing fully automatic conversion to good paged PDFs for printing nice copies (and I mean good, as different from workable) will probably always remain a dream
sometimes dreams come true, you know... :+)
as good layout, just like good typographic design, is a skill, learned through doing it a lot.
i agree. completely. it is also worth noting that we need to be able to deliver not just _one_ "good paged .pdf" of an e-text, but rather an entire _spectrum_ of "good paged .pdfs" -- in order to satisfy the entire spectrum of _readers_ out in the world. we can't just churn out a .pdf in 12-point-type and be done, because some readers will want 18-point-type, or 36-point. most will want a plain white background, but some will want a pale blue one, or a faint yellow one, or who knows what color. to be able to give the user that full range of options and _still_ deliver "a good paged pdf with good typographic design" is hard! i believe it is also true, however, that this skill can be implemented in source-code if we dedicate some effort. (it's difficult. but it's not like sending a man to the moon.) i have taken the first steps in making that effort, and i would encourage you to feel free to give me constructive criticism in examining the progress that i've made, and guiding it along. the beta-test listserve is here: zml_talk-subscribe@yahoogroups.com. or, since you are doing well here in the realm of the theoretical, perhaps you might want to instead specify what "a good pdf" would look like, or what _you_ mean by a "nice" printed copy. i don't think there is a lot of awareness here along these lines, and i think it would move the discussion along _significantly_ if we could come to share some agreement on what we _want_. at some point in time, we are going to have to evaluate the quality of the output we get from various methodologies, to determine if it is "good enough" or not. to do that, we need to develop a standard... i'm not saying i think it will be _difficult_ to create our standard. to the contrary, i think it will be fairly easy, once we get started. rather what i am saying is that that work has not been done here, so we are still operating in the dark to a large degree.
Even in a highly programmable environment such as TeX, I've never been able to print something from "semantic" markup without manual interventions once in a while -- even for something as arcane as a two column dictionary.
i believe you.
Similarly, doing a good HTML (as different from a reasonable HTML) will probably also require manual intervention and tweaking
i believe you here, too. and once again, there is little conscious agreement here about _what_ constitutes a "good" .html version of an e-text (as distinguished from a "reasonable" one, to use your terms). as with the pdf/print standard, i think that it will be fairly simple to come to agreement about what we want .html versions to be like -- the best of the files being done now come fairly close, i'd say -- but we haven't actually gone through the process of forming that agreement.
but both these things do not disqualify the large benefits we could have from having TEI tagged master copies
here you are confounding two arguments. the argument for having a "master" version that will generate all the "ancillary" versions is _overwhelming_. it's just ridiculous to try and maintain multiple versions; the costs of that are far too high for the benefits returned. but the argument that that "master" version should be t.e.i. -- or t.e.i.-lite or any of the other x.m.l.-based formats -- is _far_ less compelling. i think z.m.l. makes a better master.
even if just at a relatively simple level of tagging (just marking headings, divisions, italics, footnotes, and tables).
i wholeheartedly agree that a "simple level of tagging" that "marks" these types of things in an unequivocal manner is a very important minimum-usability hurdle to clear. as you might expect, though, i don't think angle-brackets are necessary at all to create this "simple level of tagging". i do _not_ expect you to take that on faith, however. i'll show you how to do it. the proof is in the pudding.
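(to make the idea of tagging without angle-brackets concrete, here is a minimal sketch in python. it is _not_ z.m.l. itself, whose rules are not spelled out in this message. it only assumes two hypothetical conventions: a blank line separates paragraphs, and underscores around a word or phrase mark italics, with inner underscores standing for spaces, as in the emphasis used throughout this post. the function name to_html and the regular expression are illustrative only.)

```python
import html
import re

# assumed convention (hypothetical, not the z.m.l. spec):
# _word_ or _several_words_ marks italics; the underscores must not
# touch a letter or digit on the outside, so snake_case is left alone.
EMPHASIS = re.compile(r"(?<!\w)_(\S+?)_(?!\w)")


def _italicize(match: re.Match) -> str:
    # inner underscores stand in for spaces inside an emphasized phrase
    return "<i>" + match.group(1).replace("_", " ") + "</i>"


def to_html(text: str) -> str:
    """Convert lightly marked plain text to simple html paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    rendered = []
    for paragraph in paragraphs:
        escaped = html.escape(paragraph)  # protect &, <, > first
        rendered.append("<p>" + EMPHASIS.sub(_italicize, escaped) + "</p>")
    return "\n".join(rendered)


if __name__ == "__main__":
    sample = (
        "the library is _not_ marked-up.\n\n"
        "watch the _cost-benefit_ratio_."
    )
    print(to_html(sample))
```

(headings, footnotes, and tables would each need their own plain-text convention. the sketch covers only paragraphs and italics, to show that the master copy can stay readable as plain text while still converting mechanically.)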
The task of producing nice HTML / Printable versions of XML documents is further complicated by the highly verbose and somewhat unintuitive model of XSLT, which is presented as the most important tool for this task
agreed, and i'm glad you recognize the huge costs in this arena.
from the computer scientist purist point of view that might be true, but for many lesser gods, who think five lines of basic is already a lot, its functional programming model and verbosity are a real piss-off.
i'm glad you said that, so i didn't have to...
Getting 14000+ texts to XML can be done, just as they were produced initially, by starting somewhere with the first one, and not stopping until we've completed them all.
that's the attitude! :+) is that the wisest choice of action, though? i'm not nearly so convinced of that. i think we need to set a better path, and go off on _that_ one...
A very simple alternative way would be to load them in OpenOffice, apply the formatting you like and save it
i am even less convinced of the wisdom _or_ the "simplicity", of _that_ course of action... any manual methodology is likely to be quite inferior, from a cost-benefit perspective, because the costs would be astronomical. even if you're using volunteers, at some point, you have to place value on human labor... if you cannot automate some 95% of the initial markup, you need to take your method back to the drawing board. we need to save the human labor to do the _checking_ of the markup, not waste it doing the initial markup itself...
of course that formatting would be very much non-"semantic".
which, of course, negates a lot of the benefits as well, and thus degrades the cost-benefit ratio even further. (and i should point out that none of your discussion really gets at the essence of what _semantic_ markup would be.)
(Still formatting his ebooks in SGML based TEI)
i respect the work you are putting into the effort, immensely. -bowerbird