re: [gutvol-d] Re: Extra spaces in html files

karl said:
If you are interested in good HTML or PDF you must start with a semantically tagged file (these days that's mostly an XML file).
can you give and defend your definition of "good" in this case?
ditto with "semantically tagged file"?
and, if you are up to the challenge, what is your recommendation as to the route that should be taken to get a library of 14,000+ e-texts converted to the brand of x.m.l. markup you think is best?
(bonus points if you can convince all the other x.m.l. advocates that the markup version you prefer is better than the ones they prefer.)
finally, greg recently requested that people come forward with working routines to implement an x.m.l.-master methodology. are you able to answer that call? did you? if so, do let us know.
-bowerbird

Bowerbird@aol.com wrote:
If you are interested in good HTML or PDF you must start with a semantically tagged file (these days that's mostly an XML file).
can you give and defend your definition of "good" in this case?
ditto with "semantically tagged file"?
and, if you are up to the challenge, what is your recommendation as to the route that should be taken to get a library of 14,000+ e-texts converted to the brand of x.m.l. markup you think is best?
(bonus points if you can convince all the other x.m.l. advocates that the markup version you prefer is better than the ones they prefer.)
finally, greg recently requested that people come forward with working routines to implement an x.m.l.-master methodology. are you able to answer that call? did you? if so, do let us know.
-bowerbird
-- Marcello Perathoner webmaster@gutenberg.org

Well, let's keep the name calling off-line and the discussion pure, and realise that XML is not a format but a way of specifying formats (probably all these formats have in common is that they use angle brackets in some way), and that "semantically tagged" is an ideal that even the most ambitious attempts at a generic DTD for pre-existing texts (and that is what we are mostly dealing with in PG) have not reached. It is either unreachable (since we can't know the original intent behind much of the formatting we encounter) or impractical (since the effort to do all this tagging is just too big, and isn't really needed by 99% of the users).

In my opinion, the best attempt at such a generic beast has been the TEI effort, which is described in a massive 1400-page document and still requires customization for numerous academic projects (both are bad news; both are unavoidable given the complexity of the task), but which can cover 95 percent of all text with just 5 percent of that bulk in an incarnation called TEI-Lite, and that is basically all I suggest PG adopt as a standard. The nice thing about this monster is that we can add those 5 percent, and if somebody decides to add more, nothing will stop him, and he can easily return the improved version to the collection.

Doing fully automatic conversion to good paged PDFs for printing nice copies (and I mean good, as distinct from workable) will probably always remain a dream, as good layout, just like good typographic design, is a skill learned through doing it a lot. Even in a highly programmable environment such as TeX, I've never been able to print something from "semantic" markup without manual intervention once in a while, even for something as arcane as a two-column dictionary. Similarly, producing good HTML (as distinct from reasonable HTML) will probably also require manual intervention and tweaking once in a while...
But neither of these things cancels out the large benefits we could have from TEI-tagged master copies in our collection, even at a relatively simple level of tagging (just marking headings, divisions, italics, footnotes, and tables).

The task of producing nice HTML or printable versions of XML documents is further complicated by the highly verbose and somewhat unintuitive model of XSLT, which is presented as the most important tool for this task. From the computer-science purist's point of view that might be true, but for many lesser gods, who think five lines of BASIC is already a lot, its functional programming model and verbosity are a real piss-off.

Getting 14,000+ texts to XML can be done just as they were produced initially: by starting with the first one and not stopping until we've completed them all.

A very simple alternative would be to load them in OpenOffice, apply the formatting you like, and save. (OpenOffice uses XML files for everything and collects them in zip archives. If you don't believe that, change the extension of an OpenOffice document to .zip and have a look inside.) Of course, that formatting would be very much non-"semantic".

Jeroen. (Still formatting his ebooks in SGML-based TEI)
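Jeroen's remark that an OpenOffice document is a zip archive of XML files can be checked without renaming anything. The following is only a minimal sketch in Python: the helper names (`xml_members`, `build_stand_in_package`) are made up for illustration, and the in-memory archive is a hypothetical stand-in for a real document, so that the example runs without OpenOffice installed.

```python
import io
import zipfile


def xml_members(package):
    """Return the names of the XML files inside a zip-based document package.

    `package` may be a file path or a file-like object opened in binary mode.
    """
    with zipfile.ZipFile(package) as zf:
        return sorted(name for name in zf.namelist() if name.endswith(".xml"))


def build_stand_in_package():
    """Build a tiny in-memory zip that imitates such a document.

    The member names and contents here are assumptions for the demo,
    not a faithful copy of what OpenOffice writes.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/x-stand-in")
        zf.writestr("content.xml", "<doc><p>Hello</p></doc>")
        zf.writestr("styles.xml", "<styles/>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # Replace the stand-in with the path to a real document to inspect it.
    print(xml_members(build_stand_in_package()))
```

Passing the path of an actual OpenOffice file to `xml_members` instead of the stand-in should list whatever XML members that file really contains.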

Jeroen wrote:
The task of producing nice HTML or printable versions of XML documents is further complicated by the highly verbose and somewhat unintuitive model of XSLT, which is presented as the most important tool for this task. From the computer-science purist's point of view that might be true, but for many lesser gods, who think five lines of BASIC is already a lot, its functional programming model and verbosity are a real piss-off.
There is actually a fairly powerful "non-professional" alternative to the XSLT/XSL-FO approach to converting XML into PDF (or a similar page-oriented layout): YesLogic's Prince product (soon to be at version 4.0, with optimized PDF output and embedded fonts; wait until 4.0 is released in the next few days). Prince uses the XML+CSS approach, and of course supports the advanced CSS2 constructs and some of the proposed CSS3 ones. The founder of YesLogic, Michael Day, serves on the CSS Working Group of the W3C, so he is quite aware of the power and limitations of CSS. Of course, there are a few knotty things that the current CSS2 cannot do, but YesLogic has added a few "custom" CSS constructs to fill the gaps, just as both Mozilla and Opera have (little known, btw). (I also want to add, for those few here interested, that the CSS parser in Prince is probably the best out there.)

Now, I do agree that the absolute best print output from XML sources, via either the XSLT/XSL-FO or the Prince approach, requires human intervention ("tweaking"), but the nice thing about a tool like Prince is that it gets one most of the way there, uses the slightly easier-to-use CSS, and allows for manual tweaking until the PDF is just right.

Prince supports SVG, and YesLogic plans to add MathML support as well. They are a major supporter of the OpenReader System, whose development I am leading: http://www.openreader.org .

As an aside, for OpenReader I'm now building a supporters/endorsers page. Any company, organization, or individual willing to add their logo or name to the page should contact me by private email; I'll send you the link to the current draft of the page if you're interested in supporting or endorsing OpenReader. Maybe the PG Foundation is interested? Greg? Michael?
Btw, OpenReader plans eventually to support TEI-Lite natively (or maybe a well-defined subset of TEI or TEI-Lite) without need for conversion, including constructs not supported in HTML web browsers, such as inline notes and the like. Refer to the OpenReader web site for the details. Heck, we may even support ZML if it becomes as popular as Bowerbird believes it will. It'd be trivial to support ZML, actually: we'd internally convert it to XML and then present it using standardized CSS style sheets.

Jon Noring

Bowerbird@aol.com writes:
can you give and defend your definition of "good" in this case?
"Save as HTML" normally is not good enough.
ditto with "semantically tagged file"?
Why do you ask?
and, if you are up to the challenge, what is your recommendation as to the route that should be taken to get a library of 14,000+ e-texts converted to the brand of x.m.l. markup you think is best?
We can keep the old file unchanged for the time being. XML produced by http://www.pgdp.net/ is good enough to work with.
finally, greg recently requested that people come forward with working routines to implement an x.m.l.-master methodology. are you able to answer that call? did you? if so, do let us know.
For converting TEI XML to HTML and PDF you can use Sebastian Rahtz's XSL stylesheets:

http://www.tei-c.org/Stylesheets/teixsl.html

I'm old-fashioned and like playing with DSSSL tools (that's all in German and neither polished nor finished; take it as a proof of concept):

http://www.gnu.franken.de/Tieck/
http://www.gnu.franken.de/Tieck/Dokumente/Koepke/

--
                                                   |      ,__o
                                                   |    _-\_<,
http://www.gnu.franken.de/ke/                      |   (*)/'(*)
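For readers who want a feel for what converting lightly tagged TEI to HTML involves, here is a rough sketch in Python. It is not the XSL stylesheets mentioned above and not XSLT at all. It assumes a simplified, un-namespaced fragment that uses only the TEI element names `div`, `head`, `p`, and `hi rend="italic"`; the function name `tei_to_html` and the sample text are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# A made-up fragment using a handful of TEI-Lite element names.
SAMPLE = """<text><body>
<div><head>Chapter I</head>
<p>It was a <hi rend="italic">dark</hi> night.</p>
</div>
</body></text>"""


def tei_to_html(elem, level=0):
    """Recursively map a small subset of TEI-like elements to HTML.

    `level` counts the enclosing <div> elements, so that a <head>
    directly inside the outermost <div> becomes <h1>.
    """
    child_level = level + 1 if elem.tag == "div" else level
    parts = [escape(elem.text or "")]
    for child in elem:
        parts.append(tei_to_html(child, child_level))
        parts.append(escape(child.tail or ""))
    inner = "".join(parts)

    if elem.tag == "div":
        return f"<div>{inner}</div>"
    if elem.tag == "head":
        h = min(max(level, 1), 6)  # HTML only has h1..h6
        return f"<h{h}>{inner}</h{h}>"
    if elem.tag == "p":
        return f"<p>{inner}</p>"
    if elem.tag == "hi":
        tag = "i" if elem.get("rend") == "italic" else "span"
        return f"<{tag}>{inner}</{tag}>"
    # Unknown or purely structural elements: keep the content, drop the tag.
    return inner


if __name__ == "__main__":
    print(tei_to_html(ET.fromstring(SAMPLE)))
```

Footnotes, tables, and real TEI namespaces are left out on purpose; handling them is where the manual tweaking discussed in this thread tends to start.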
participants (5)
- Bowerbird@aol.com
- Jeroen Hellingman
- Jon Noring
- Karl Eichwalder
- Marcello Perathoner