[gutvol-d] jeroen's even-handed analysis

19 Oct 2004

      jeroen said:
...
Well, let's keep the name calling off-line, and the discussion pure...
sounds like an excellent idea to me.  let's see if marcello will agree.

***

i appreciate your analysis, and agree with it in large part,
because i think you've faced a good number of the problems.

to pull them out into a bullet-point list, they are these:
...
that semantically tagged is an ideal, that even the most 
  ambitious attempts at a generic DTD for pre-existing texts
...
(and that is what we are mostly dealing with in PG) 
  have not reached
...
and is either unreachable (since we can't know 
  the original intent with much of the formatting we encounter)
...
or impractical
  (since the effort to do all this tagging is just too big
...
and isn't really needed by 99% of the users.)
...
In my opinion, the best attempt at 
  such a generic beast has been the TEI effort
...
which is described in a massive 1400 page document,
...
still requires customization for numerous academic projects
...
(both are bad news; both are unavoidable 
  given the complexity of the task)
...
but which can cover 95 percent of all text 
  with just 5 percent of that bulk
...
in an incarnation called TEI-Lite, 
  and that is basically all I suggest to PG to adopt as a standard.
so if i was to summarize the bulk of what you've said here,
concentrating on the negative, but hopefully in a fair way...

     semantic tagging is an ideal 
     which may be unreachable, 
     and is certainly impractical,
     since it is a big effort and is
     just not needed by most readers.

     one method -- t.e.i. -- runs to
     1,400 pages of documentation,
     yet still requires "customization".

     however, a less-complex subset
     -- called t.e.i.-lite -- is available,
     and that is what i recommend...

again, i don't mean to "load" the argument by concentrating
on the negative aspects of a heavy-markup approach like this,
because i can certainly see benefits of marked-up e-texts too.

certainly a minimal form of markup is practically a requirement
to move the e-texts to a reasonable e-book and typographic future.

and if the library was already marked-up in x.m.l., and working,
i would probably have no objections at all to continuing with it...

but the reality is that the library is _not_ marked-up already.
so it is necessary for us to examine very closely the _costs_ of
_doing_ any markup, to make sure the _benefits_ outweigh them.

in a phrase, we need to be cognizant of the _cost-benefit_ratio_.

in particular, we should also consider _all_forms_of_markup_
that we think could give us a reasonable set of the benefits at a 
range of costs, to see which gives us the best cost-benefit ratio.
...
Doing fully automatic conversion to good paged PDFs for 
  printing nice copies (and I mean good, as different from workable) 
  will probably always remain a dream
sometimes dreams come true, you know...      :+)
...
as good layout, just as good typographic design, 
  is a skill, learned through doing it a lot.
i agree.  completely.

it is also worth noting that we need to be able to deliver
not just _one_ "good paged .pdf" of an e-text, but rather
an entire _spectrum_ of "good paged .pdfs" -- in order to
satisfy the entire spectrum of _readers_ out in the world.
we can't just churn out a .pdf in 12-point-type and be done,
because some readers will want 18-point-type, or 36-point.
most will want a plain white background, but some will want
a pale blue one, or a faint yellow one, or who knows what color.

to be able to give the user that full range of options and _still_
deliver "a good paged pdf with good typographic design" is hard!

i believe it is also true, however, that this skill can be
implemented in source-code if we dedicate some effort.
(it's difficult.  but it's not like sending a man to the moon.)

i have taken the first steps in making that effort, and i would
encourage you to feel free to give me constructive criticism
in examining the progress that i've made, and guiding it along.
that beta-test listserve:  zml_talk-subscribe@yahoogroups.com

or, since you are doing well here in the realm of the theoretical,
perhaps you might want to instead specify what "a good pdf"
would look like, or what _you_ mean by a "nice" printed copy.

i don't think there is a lot of awareness here along these lines,
and i think it would move the discussion along _significantly_
if we could come to share some agreement on what we _want_.

at some point in time, we are going to have to evaluate the quality
of the output we get from various methodologies, to determine if it
is "good enough" or not.  to do that, we need to develop a standard...

i'm not saying i think it will be _difficult_ to create our standard.
to the contrary, i think it will be fairly easy, once we get started.
rather what i am saying is that that work has not been done here,
so we are still operating in the dark to a large degree.
...
Even in a highly programmable environment such as TeX, 
  I've never been able to print something from "semantic" markup 
  without manual interventions once in a while -- 
  even for something as arcane as a two column dictionary.
i believe you.
...
Similarly, doing a good HTML (as different from a reasonable HTML) 
  will probably also require manual intervention and tweaking
i believe you here, too.

and once again, there is little conscious agreement here
about _what_ constitutes a "good" .html version of an e-text
(as distinguished from a "reasonable" one, to use your terms).

as with the pdf/print standard, i think that it will be fairly simple
to come to agreement about what we want .html versions to be like
-- the best of the files being done now come fairly close, i'd say --
but we haven't actually gone through the process of forming that agreement.
...
but both these things do not disqualify the large benefits 
  we could have from having TEI tagged master copies
here you are confounding two arguments.

the argument for having a "master" version that will
generate all the "ancillary" versions is _overwhelming_.
it's just ridiculous to try and maintain multiple versions;
the costs of that are far too high for the benefits returned.

but the argument that that "master" version should be t.e.i. 
-- or t.e.i.-lite or any of the other x.m.l.-based formats --
is _far_ less compelling.  i think z.m.l. makes a better master.
...
even if just at a relatively simple level of tagging 
  (just marking headings, divisions, italics, footnotes, and tables).
i wholeheartedly agree that a "simple level of tagging"
that "marks" these types of things in an unequivocal manner
is a very important minimum-usability hurdle to clear.

as you might expect, though, i don't think angle-brackets
are necessary at all to create this "simple level of tagging".

i do _not_ expect you to take that on faith, however.
i'll show you how to do it.  the proof is in the pudding.
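
as a small taste of what that could look like, here is a minimal
sketch in python.  the conventions in it are assumptions made up
for illustration (they are _not_ the actual z.m.l. rules, and the
function names are hypothetical): underscores around a word or
phrase mark italics, and a one-line block in all-caps is a heading.

```python
import html
import re

# hypothetical plain-text conventions, assumed for illustration only:
#   _word_ or _several_joined_words_  -> italics
#   a one-line block written in ALL CAPS -> heading
EMPHASIS = re.compile(r"(?<!\w)_(\S.*?)_(?!\w)")


def render_inline(text):
    """Escape HTML-special characters, then turn _emphasis_ into <i>...</i>."""
    escaped = html.escape(text, quote=False)
    # underscores inside an emphasized phrase stand for spaces
    return EMPHASIS.sub(
        lambda m: "<i>" + m.group(1).replace("_", " ") + "</i>", escaped
    )


def to_html(source):
    """Convert blank-line-separated plain text into headings and paragraphs."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", source) if b.strip()]
    out = []
    for block in blocks:
        lines = block.splitlines()
        if len(lines) == 1 and block.isupper():
            out.append("<h2>" + render_inline(block) + "</h2>")
        else:
            joined = " ".join(line.strip() for line in lines)
            out.append("<p>" + render_inline(joined) + "</p>")
    return "\n".join(out)


sample = "CHAPTER ONE\n\nit was _not_ a dark\nand _stormy_night_ at all."
print(to_html(sample))
```

the point of the sketch is only that headings and italics can be
recognized without any angle-brackets in the master text; footnotes
and tables would need conventions of their own, not shown here.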
...
The task of producing nice HTML / Printable versions 
  of XML documents is further complicated by the 
  highly verbose and somewhat unintuitive model of XSLT, 
  which is presented as the most important tool for this task
agreed, and i'm glad you recognize the huge costs in this arena.
...
from the computer scientist purist point of view 
  that might be true, but for many lesser gods, 
  who think five lines of basic is already a lot, 
  its functional programming model and verbosity 
  is a real piss-off.
i'm glad you said that, so i didn't have to...
...
Getting 14000+ texts to XML can be done, 
  just as they were produced initially, 
  by starting somewhere with the first one,
  and not stopping until we've completed them all.
that's the attitude!        :+)

is that the wisest choice of action, though?
i'm not nearly so convinced of that.
i think we need to set a better path,
and go off on _that_ one...
...
A very simple alternative way would be to 
  load them in OpenOffice, 
  apply the formatting you like 
  and save it
i am even less convinced of the wisdom, _or_ 
the "simplicity", of _that_ course of action...

any manual methodology is likely to be quite inferior,
from a cost-benefit perspective, because the costs
would be astronomical.  even if you're using volunteers,
at some point, you have to place value on human labor...

if you cannot automate some 95% of the initial markup,
you need to take your method back to the drawing board.
we need to save the human labor to do the _checking_ of
the markup, not waste it doing the initial markup itself...
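
to make that division of labor concrete, here is a rough sketch in
python, under stated assumptions: the heuristics and the names
(guess_block, first_pass) are hypothetical, invented for this
example.  an automated first pass labels each block, and only the
blocks it is unsure about get routed to a human checker.

```python
from dataclasses import dataclass

END_PUNCTUATION = (".", ",", ";", ":", "?", "!")


@dataclass
class Guess:
    text: str
    label: str       # "heading" or "paragraph"
    confident: bool  # False means: send this block to a human checker


def guess_block(text):
    """Label one block with a crude, assumed heuristic."""
    one_line = "\n" not in text
    if one_line and text.isupper():
        return Guess(text, "heading", True)
    if one_line and len(text.split()) <= 6 and not text.endswith(END_PUNCTUATION):
        # short and unpunctuated: probably a heading, but not certain
        return Guess(text, "heading", False)
    return Guess(text, "paragraph", True)


def first_pass(source):
    """Return every guess, plus the subset that needs human review."""
    blocks = [b.strip() for b in source.split("\n\n") if b.strip()]
    guesses = [guess_block(b) for b in blocks]
    review = [g for g in guesses if not g.confident]
    return guesses, review


sample = (
    "CHAPTER I\n\n"
    "The Arrival\n\n"
    "It was late when the coach\nfinally reached the inn.\n\n"
    "She said nothing."
)
guesses, review = first_pass(sample)
print(len(guesses) - len(review), "of", len(guesses), "blocks labeled automatically")
```

in this toy sample only one block out of four ("The Arrival") lands
in the review queue; whether real texts come anywhere near the 95%
mark is exactly what would have to be measured.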
...
of course that formatting would be very much non-"semantic".
which, of course, negates a lot of the benefits as well,
and thus degrades the cost-benefit ratio even further.

(and i should point out that none of your discussion really
gets at the essence of what _semantic_ markup would be.)
...
(Still formatting his ebooks in SGML based TEI)
i respect the work you are putting into the effort, immensely.

-bowerbird

Bowerbird＠aol.com