Well, let's keep the name calling off-line and the discussion pure..., and realise that XML is not a format but a way of specifying formats (probably all these formats have in common is that they use angled brackets in some way), and that "semantically tagged" is an ideal that even the most ambitious attempts at a generic DTD for pre-existing texts (and that is what we are mostly dealing with in PG) have not reached. That ideal is either unreachable (since we can't know the original intent behind much of the formatting we encounter) or impractical (since the effort to do all this tagging is just too big, and isn't really needed by 99% of the users).

In my opinion, the best attempt at such a generic beast has been the TEI effort, which is described in a massive 1400-page document and still requires customization for numerous academic projects (both are bad news; both are unavoidable given the complexity of the task) -- but which can cover 95 percent of all texts with just 5 percent of that bulk, in an incarnation called TEI-Lite, and that is basically all I suggest PG adopt as a standard. The nice thing about this monster is that we can add those 5 percent, and if somebody decides to add more, nothing will stop him, and he can easily return the improved version to the collection.

Doing fully automatic conversion to good paged PDFs for printing nice copies (and I mean good, as distinct from merely workable) will probably always remain a dream, as good layout, just like good typographic design, is a skill learned by doing it a lot. Even in a highly programmable environment such as TeX, I've never been able to print something from "semantic" markup without manual intervention once in a while -- even for something as arcane as a two-column dictionary. Similarly, producing good HTML (as distinct from reasonable HTML) will probably also require manual intervention and tweaking once in a while... but neither of these things cancels out the large benefits we could get from having TEI-tagged master copies in our collection, even at a relatively simple level of tagging (just marking headings, divisions, italics, footnotes, and tables -- see the first sketch below).

The task of producing nice HTML / printable versions of XML documents is further complicated by the highly verbose and somewhat unintuitive model of XSLT, which is presented as the most important tool for this task. From the computer-science purist's point of view that might be true, but for many lesser gods, who think five lines of BASIC is already a lot, its functional programming model and verbosity are a real piss-off (see the second sketch below).

Getting 14,000+ texts into XML can be done just as they were produced initially: by starting somewhere with the first one, and not stopping until we've completed them all. A very simple alternative would be to load them into OpenOffice, apply the formatting you like, and save (OpenOffice uses XML files for everything, and collects them in zip archives; if you don't believe that, change the extension of an OpenOffice document to .zip and have a look inside). Of course that formatting would be very much non-"semantic".
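To give an idea of the level of TEI tagging I have in mind -- a sketch I am making up here, not taken from any actual PG text -- a chapter would look something like this:

    <div type="chapter">
      <head>Chapter I</head>
      <p>It was a <hi rend="italic">dark</hi> and stormy
      night.<note place="foot">As opening nights so often
      are.</note></p>
      <!-- tables, where they occur, are marked with
           <table>, <row> and <cell> -->
    </div>

That is the whole trick: headings, divisions, italics, footnotes, and tables, with nothing that requires guessing at the author's deeper intent.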
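And to show what I mean about XSLT's verbosity: just turning the <head> and <hi> elements above into HTML -- a job a couple of lines in any conversion script would also do -- already takes something like this (again a sketch, untested):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- a chapter heading becomes an HTML <h2> -->
      <xsl:template match="head">
        <h2><xsl:apply-templates/></h2>
      </xsl:template>

      <!-- italics become HTML <i> -->
      <xsl:template match="hi[@rend='italic']">
        <i><xsl:apply-templates/></i>
      </xsl:template>

    </xsl:stylesheet>

Every little rule is its own template, fired off by pattern matching rather than called in sequence; for somebody raised on BASIC that takes some getting used to.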
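Inside that zip you will find a content.xml, which holds something along these lines (roughly, from memory) -- and which also shows why I call it non-"semantic":

    <office:body>
      <!-- "P1" is an automatic style: it records how the
           paragraph looks, not what it means -->
      <text:p text:style-name="P1">CHAPTER I</text:p>
      <text:p text:style-name="P2">It was a dark and
      stormy night.</text:p>
    </office:body>

A style name like "P1" tells a program how to paint the text, but nothing about whether it is a heading, a quotation, or a stage direction.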
Jeroen. (Still formatting his ebooks in SGML based TEI)

Marcello Perathoner wrote:
Bowerbird@aol.com wrote:
If you are interested in good HTML or PDF you must start with a semantically tagged file (these days that's mostly an XML file).
can you give and defend your definition of "good" in this case?
ditto with "semantically tagged file"?
and, if you are up to the challenge, what is your recommendation as to the route that should be taken to get a library of 14,000+ e-texts converted to the brand of x.m.l. markup you think is best?
(bonus points if you can convince all the other x.m.l. advocates that the markup version you prefer is better than the ones they prefer.)
finally, greg recently requested that people come forward with working routines to implement an x.m.l.-master methodology. are you able to answer that call? did you? if so, do let us know.
-bowerbird