jeroen's even-handed analysis

jeroen said:
Well, let's keep the name calling off-line, and the discussion pure...
sounds like an excellent idea to me. let's see if marcello will agree. *** i appreciate your analysis, and agree with it in large part, because i think you've faced a good number of the problems. to pull them out into a bullet-point list, they are these:
that semantically tagged is an ideal, that even the most ambitious attempts at a generic DTD for pre-existing texts
(and that is what we are mostly dealing with in PG) have not reached
and is either unreachable (since we can't know the original intent with much of the formatting we encounter)
or impractical (since the effort to do all this tagging is just too big
and isn't really needed by 99% of the users.)
In my opinion, the best attempt at such a generic beast has been the TEI effort
which is described in a massive 1400 page document,
still requires customization for numerous academic projects
(both are bad news; both are unavoidable given the complexity of the task)
but which can cover 95 percent of all text with just 5 percent of that bulk
in an incarnation called TEI-Lite, and that is basically all I suggest to PG to adopt as a standard.
so if i was to summarize the bulk of what you've said here, concentrating on the negative, but hopefully in a fair way... semantic tagging is an ideal which may be unreachable, and is certainly impractical, since it is a big effort and is just not needed by most readers. one method -- t.e.i. -- runs to 1,400 pages of documentation, yet still requires "customization". however, a less-complex subset -- called t.e.i.-lite -- is available, and that is what i recommend... again, i don't mean to "load" the argument by concentrating on the negative aspects of a heavy-markup approach like this, because i can certainly see benefits of marked-up e-texts too. certainly a minimal form of markup is practically a requirement to move the e-texts to a reasonable e-book and typographic future. and if the library was already marked-up in x.m.l., and working, i would probably have no objections at all to continuing with it... but the reality is that the library is _not_ marked-up already. so it is necessary for us to examine very closely the _costs_ of _doing_ any markup, to make sure the _benefits_ outweigh them. in a phrase, we need to be cognizant of the _cost-benefit_ratio_. in particular, we should also consider _all_forms_of_markup_ that we think could give us a reasonable set of the benefits at a range of costs, to see which gives us the best cost-benefit ratio.
Doing fully automatic conversion to good paged PDFs for printing nice copies (and I mean good, as different from workable) will probably always remain a dream
sometimes dreams come true, you know... :+)
as good layout, just like good typographic design, is a skill learned through doing it a lot.
i agree. completely. it is also worth noting that we need to be able to deliver not just _one_ "good paged .pdf" of an e-text, but rather an entire _spectrum_ of "good paged .pdfs" -- in order to satisfy the entire spectrum of _readers_ out in the world. we can't just churn out a .pdf in 12-point-type and be done, because some readers will want 18-point-type, or 36-point. most will want a plain white background, but some will want a pale blue one, or a faint yellow one, or who knows what color. to be able to give the user that full range of options and _still_ deliver "a good paged pdf with good typographic design" is hard! i believe it is also true, however, that this skill can be implemented in source-code if we dedicate some effort. (it's difficult. but it's not like sending a man to the moon.) i have taken the first steps in making that effort, and i would encourage you to feel free to give me constructive criticism in examining the progress that i've made, and guiding it along. that beta-test listserve: zml_talk-subscribe@yahoogroups.com or, since you are doing well here in the realm of the theoretical, perhaps you might want to instead specify what "a good pdf" would look like, or what _you_ mean by a "nice" printed copy. i don't think there is a lot of awareness here along these lines, and i think it would move the discussion along _significantly_ if we could come to share some agreement on what we _want_. at some point in time, we are going to have to evaluate the quality of the output we get from various methodologies, to determine if it is "good enough" or not. to do that, we need to develop a standard... i'm not saying i think it will be _difficult_ to create our standard. to the contrary, i think it will be fairly easy, once we get started. rather what i am saying is that that work has not been done here, so we are still operating in the dark to a large degree.
Even in a highly programmable environment such as TeX, I've never been able to print something from "semantic" markup without manual interventions once in a while -- even for something as arcane as a two column dictionary.
i believe you.
Similarly, doing a good HTML (as different from a reasonable HTML) will probably also require manual intervention and tweaking
i believe you here, too. and once again here, there is little conscious agreement here about _what_ constitutes a "good" .html version of an e-text (as distinguished from a "reasonable" one, to use your terms). as with the pdf/print standard, i think that it will be fairly simple to come to agreement about what we want .html versions to be like -- the best of the files being done now come fairly close, i'd say -- but we haven't actually done the process of forming that agreement.
but both these things do not disqualify the large benefits we could have from having TEI tagged master copies
here you are confounding two arguments. the argument for having a "master" version that will generate all the "ancillary" versions is _overwhelming_. it's just ridiculous to try and maintain multiple versions; the costs of that are far too high for the benefits returned. but the argument that that "master" version should be t.e.i. -- or t.e.i.-lite or any of the other x.m.l.-based formats -- is _far_ less compelling. i think z.m.l. makes a better master.
even if just at a relatively simple level of tagging (just marking headings, divisions, italics, footnotes, and tables).
i wholeheartedly agree that a "simple level of tagging" that "marks" these type of things in an unequivocal manner is a very important minimum-usability hurdle to clear. as you might expect, though, i don't think angle-brackets are necessary at all to create this "simple level of tagging". i do _not_ expect you to take that on faith, however. i'll show you how to do it. the proof is in the pudding.
The task of producing nice HTML / Printable versions of XML documents is further complicated by the highly verbose and somewhat unintuitive model of XSLT, which is presented as the most important tool for this task
agreed, and i'm glad you recognize the huge costs in this arena.
from the computer scientist purist point of view that might be true, but for many lesser gods, who think five lines of basic is already a lot, its functional programming model and verbosity is a real piss-off.
i'm glad you said that, so i didn't have to...
Getting 14000+ texts to XML can be done, just as they were produced initially, by starting somewhere with the first one, and not stopping until we've completed them all.
that's the attitude! :+) is that the wisest choice of action, though? i'm not nearly so convinced of that. i think we need to set a better path, and go off on _that_ one...
A very simple alternative way would be to load them in OpenOffice, apply the formatting you like and save it
i am even less convinced of the wisdom _or_ the "simplicity", of _that_ course of action... any manual methodology is likely to be quite inferior, from a cost-benefit perspective, because the costs would be astronomical. even if you're using volunteers, at some point, you have to place value on human labor... if you cannot automate some 95% of the initial markup, you need to take your method back to the drawing board. we need to save the human labor to do the _checking_ of the markup, not waste it doing the initial markup itself...
of course that formatting would be very much non-"semantic".
which, of course, negates a lot of the benefits as well, and thus degenerates the cost-benefit ratio even further. (and i should point out that none of your discussion really gets at the essence of what _semantic_ markup would be.)
(Still formatting his ebooks in SGML based TEI)
i respect the work you are putting into the effort, immensely. -bowerbird

N.B. Bowerbird's first post on gutvol-p on 11/06/03 is attached below in its full glory for the readers' convenience.
bowerbird 11/06/03 it reminds me that i firmly believe us users should file a class-action lawsuit against you computer overlords
bowerbird 10/14/04 is this kind of name-calling condoned on this listserve?
Bowerbird started name-calling in his very first post and now he is soooo sensitive about it. In good ol' usenet tradition, he who dishes out freely must also be able to pocket graciously, but, of course, this applies to other people, not to Bowerbird himself. Also he got himself kicked out from at least one newsgroup (ask Jon Noring for details) and was at one point set under moderator supervision on this list. (Why has this changed? Have some bits got lost in the move to pglaf.org?)
bowerbird 11/06/03 this isn't flame-bait, 'cause i ain't even gonna argue with ya. i've concluded it's a waste of my time to even _discuss_ x.m.l.
Since then, Bowerbird has done nothing else than _cuss_ XML, that is: belittle the XML language and the people who are using it. He concluded that he'd rather waste _other_ people's time than his own. If Bowerbird was really interested in establishing an alternative to XML markup, he would have fired up emacs and started coding his reader. In a month or two he would have shown us the first prototype. If the thing really was heaps better than XML, we would have acclaimed him and considered changing the DP formatting rules to his format. Of course, flamewars being his favorite pastime, he got nowhere with his reader. Furthermore he wastes the time of other volunteers and newcomers to this mailing list by luring them into yet another endless discussion of the exact same topic we already had plenty before. (The topic being: the self-celebration of His Royal Highness Bowerbird.)
bowerbird 11/06/03 you're already over-budget and severely behind-schedule [...] i have written such a program, and i'll have a beta version soon.
Bowerbird first announced his reader on 02/14/03.
-- for immediate release -- [...] bowerbird intelligentleman announces an open-source project geared toward creating an o.e.b. "presentation system", i.e., a cross-platform reader-program that will allow users to read o.e.b files. [...] bowerbird further indicated that he is fully confident that the effort would bear fruit quickly, since he has previously programmed a wide variety of electronic-book applications.
http://www.gnutemberg.org/pipermail/libergnu.mbox/libergnu.mbox That was 20 months ago. Since then we have seen a lot of announcements but never a line of source code. (Note: he says "Open Source" in his press release, and he also says OEB, which is an XML application.) If Bowerbird were acting in good faith, he'd have published some source code immediately after his announcement to have people review it and comment on it. As it stands, nobody has ever seen one single line of his alleged mother of all readers. (All we did see were some `screenshots' probably done with Microsoft Paint.)
bowerbird 11/06/03 so you better know i'm prepared to deliver.
I defy Bowerbird to publish the source code of what he has done in 20+ months of development. Hic Rhodus, Bowerbird, hic salta! Prove to us that you can build a better reader. Don't give us any of your lame excuses but deliver now or be silent forever. (Lame excuses we already had include: I won't show you because you are so nasty.)
bowerbird 11/06/03 in a phrase, it's time to put up or shut up.
Of course, this rule again applies to other people, not to Bowerbird himself: he did not put up and never will shut up. Conclusion: Bowerbird is a kook (def: http://www.catb.org/~esr/jargon/html/K/kook.html ) who knowingly wastes the time of volunteers who could otherwise do many useful things for PG. And he has got nothing to show for it. And now, for the enlightenment of the newcomers, and for the entertainment of those of us who know how hard Bowerbird has been working and how much he has achieved in this short year, Bowerbird's posting debut on gutvol-p of nearly a year ago ... unabridged.
i've been writing some apps for project gutenberg, so i subscribed to this listserve this evening, and i went back and read all the posts for a full year, just to get the flavor of what has gone on here...
boy, what a waste of time... :+)
it reminds me that i firmly believe us users should file a class-action lawsuit against you computer overlords for all the time and trouble you have dragged us through in trying to transition us to x.m.l.
you're already over-budget and severely behind-schedule in delivering on the things x.m.l. was supposed to bring us, and we haven't seen even a fraction of the promised benefits.
this isn't flame-bait, 'cause i ain't even gonna argue with ya. i've concluded it's a waste of my time to even _discuss_ x.m.l.
there are 10,000 e-texts in the project gutenberg library -- 4,000 more than there were when you had your last flamewar -- so you've got lots of opportunity to show the value of x.m.l., just get to work, and let us know when you're done doing markup.
heck, don't even bother to contact us then, just go right to work making some x.m.l.-savvy _viewer-programs_ for us end-users, because it doesn't do us one bit of good to have marked-up files if we don't have any viewers that can make use of that mark-up.
in a phrase, it's time to put up or shut up.
and yes, i realize you'll throw that challenge right back at me, so you better know i'm prepared to deliver.
i say we don't need much "markup" -- sometimes _none_ -- to turn project gutenberg's plain-ascii e-text files into a slick electronic-book experience for end-users, if we only put a little bit of intelligence into an e-book viewer-program.
i have written such a program, and i'll have a beta version soon. so i hope i've pissed you off, because i _like_ hostile beta-testers; i trust them to step past the polite praise and tell me what's wrong.
that's enough for now, i gotta get back to work. and so do you...
-bowerbird
-- Marcello Perathoner webmaster@gutenberg.org

Bowerbird@aol.com wrote:
however, a less-complex subset -- called t.e.i.-lite -- is available, and that is what i recommend...
You do have a curious way of avoiding capital letters... :-) After all my objections against XML and TEI, you may wonder why I still recommend using TEI Lite: it forms a very decent base to start some structural tagging with -- you don't need the full 1400 pages of TEI to get started with it, and you also don't need to reinvent the wheel and come up with some alternative, equally simple scheme. Doing this has the added benefit that for those texts that require it, you can easily step up and work with the full set, if so required or desired. If you're just against using angled brackets: they are simple to use and understand by both humans and computers. You can do more fancy tricks to make marked-up texts look more like plain text, but attempts to do so, both in TeX and in SGML, add considerable complexity for the reader -- both machine _and_ human. XML has one thing going for it, and that is its simplicity (and the fact that some people build complicated things on it, such as namespaces, XSLT, etc., that require a course in computer science can be kept quite hidden from most users). You can of course object on principle to something 1400 pages thick, but that is unavoidable, given the complexity and wide diversity of books that have been published in the 500+ years since Gutenberg's invention. Since much of the difficult stuff of XML will eventually be hidden from users. Future versions of layout programs will probably be able to read a thing coded in TEI directly (doing an XSLT transform to some internal format), and format it nicely according to some defaults. You can then apply all the required formatting tweaks to it, export to some nice lay-out format (XSL-FO, maybe, PDF, or who knows), and save all your nice tweaks, linked to your original TEI, so you have the best of both worlds.
I already have numerous benefits from working in XML, in that I can generate nice HTML files (that often need no touch-up at all) and reasonable plain ASCII for PG, but also have spell checking on a per-language basis, extract all fragments in a certain language, create tables of contents, etc. on the fly, extract Dublin Core bibliographic records, and more. Jeroen.
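To make two of the benefits listed above concrete (building a table of contents on the fly and pulling out all fragments in a given language), here is a minimal Python sketch. It is not Jeroen's actual tooling. The sample document, the element names (`div`, `head`, `foreign`), the use of `xml:lang` rather than a plain `lang` attribute, the absence of a TEI namespace, and the helper names `table_of_contents` and `fragments_in_language` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, heavily simplified TEI-Lite-style fragment (assumption:
# no TEI namespace, language marked with xml:lang).
SAMPLE = """<text xml:lang="en">
  <body>
    <div type="chapter">
      <head>Chapter I</head>
      <p>He greeted her with <foreign xml:lang="fr">bonjour</foreign> and left.</p>
    </div>
    <div type="chapter">
      <head>Chapter II</head>
      <p>The motto was <foreign xml:lang="la">carpe diem</foreign>.</p>
    </div>
  </body>
</text>"""


def table_of_contents(root):
    """Return the text of every <head> element, in document order."""
    return ["".join(head.itertext()).strip() for head in root.iter("head")]


def fragments_in_language(root, lang):
    """Return the text of every element explicitly tagged with the given language."""
    return [
        "".join(element.itertext()).strip()
        for element in root.iter()
        if element.get(XML_LANG) == lang
    ]


root = ET.fromstring(SAMPLE)
toc = table_of_contents(root)
french = fragments_in_language(root, "fr")
latin = fragments_in_language(root, "la")
```

Because the headings and foreign-language passages are tagged explicitly, both extractions are plain tree walks; no guessing from layout is needed, which is the point being argued for structural tagging.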

Jeroen Hellingman <jeroen@bohol.ph> writes:
Since much of the difficult stuff of XML will eventually be hidden from users. Future versions of layout programs will probably be able to read a thing coded in TEI directly (doing an XSLT transform to some internal format), and format it nicely according to some defaults.
It already works for simple books using CSS; here is an example (a text by Ludwig Tieck in German): http://www.gnu.franken.de/Tieck/Werke/dichterleben/ Sorry for the strange layout - it is just a test. It works with Mozilla 1.6 and better. Of course, the other features you mentioned are more important. -- | ,__o | _-\_<, http://www.gnu.franken.de/ke/ | (*)/'(*)
participants (4): Bowerbird@aol.com, Jeroen Hellingman, Karl Eichwalder, Marcello Perathoner