
I am reassured to discover that I am not the only one turned off by the tedious and reiterative banter of the Usual Suspects. Let me state my views:
1. XML should be at the centre of any text management (doh).
2. There should be one or more markup schemes for simple embellishment of text to trivialise conversion of existing text to XML. One of these MAY be zml, or my dottedfile format--or any other system that others prefer. The only requirement should be the easy conversion to good XML.
3. Given that we share a primary interest in books (text), it makes sense that the XML should conform to the XHTML doctype.
4. From that point on, (X)HTML and HTML/ePub are trivial to generate, while LaTeX and PDF come from XSLT--which is surprisingly easy.
I am not trying to patronise--and I know full well that none of this is new. But I want to emphasise how practicable the approach is. I refer to my site (http://www.limpidsoft.com), which now has 60 books derived from PG source, generated in my spare time over the course of a few weeks. It will be immediately obvious that few of these books are perfect: bad page breaks will leap out from some of the PDFs and BB will just HATE the CSS. As any user of LaTeX will know, it is just a matter of hand-polishing the intermediate LaTeX files before the final conversion to PDFs. And I will eventually do this to the current files. At this point, though, I am satisfied that large-scale styling from PG text source IS practicable. SO, if PG were to establish and maintain canonical XML texts, would it not be future-proofed?
John Redmond

On Thu, Feb 24, 2011 at 6:31 PM, John Redmond <john_redmond@optusnet.com.au> wrote:
3. Given that we share a primary interest in books (text), it makes sense that the XML should conform to the XHTML doctype.
But XHTML sucks for books. There's no sidenote/footnote/endnote markups, there's no titlepage mark-up (which would make title and author automatically readable in most cases), etc.
As any user of LaTeX will know, it is just a matter of hand-polishing the intermediate LaTeX files before the final conversion to PDFs.
It's never "just" a matter of hand-polishing; that's a serious flaw. Especially as it would have to be done for every separate size of PDF.
SO, if PG were to establish and maintain canonical XML texts, would it not be future-proofed?
XML is not magic; many blobs of XML are opaque to any but their generators. HTML will probably be around for forever, but as I said above, it's suboptimal for books. -- Kie ekzistas vivo, ekzistas espero.

David Starner wrote:
But XHTML sucks for books. There's no sidenote/footnote/endnote markups, there's no titlepage mark-up (which would make title and author automatically readable in most cases), etc.
Point well taken: I should have referred to (appropriately enhanced) XHTML. I have been processing notes in some of the books (see Ivanhoe and Ballantrae). I include the inline NOTE element which, in processing, leads to end notes in XHTML and footnotes in PDF. The lack of page breaks is one of the resoundingly big negatives in HTML--particularly in generating hard copy--so we agree on that. I have tried to compensate for the lack of a title page by styling and generating a table of contents. Not up to PDF standards, but really not TOO bad. Have a look at a couple of the HTML examples. Page dimensions, particularly aspect ratios, are presently a stinker for PDF ebooks--and I hope that the area settles down SOON. But, with my scripts, it is only a problem of handling the multiple PDFs. In my setup, I can generate--and DO generate--multiple XSLT stylesheets with combinations of page dimensions and fonts, etc. So producing multiple PDFs is not the problem (it is quick and easy). What you do with 'em is the problem. Thanks for your interest, John Redmond
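The note handling described above can be sketched roughly as follows. This is only an illustration, not John's actual scripts (which use XSLT): the lowercase `note` element name, the `noteref` and `endnotes` class names, and the `notes_to_endnotes` function are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical source with inline <note> elements (element name assumed).
SOURCE = """<body>
<p>Rebecca spoke first.<note>See the author's appendix.</note> The knight replied.</p>
<p>The tournament began.<note>Held at Ashby.</note></p>
</body>"""


def notes_to_endnotes(xml_text):
    """Replace each inline <note> with a numbered link and collect the
    note texts in a trailing <div class="endnotes">."""
    root = ET.fromstring(xml_text)
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    collected = []
    # Materialise the list first, because the tree is modified in the loop.
    for number, note in enumerate(list(root.iter("note")), start=1):
        collected.append("".join(note.itertext()))
        ref = ET.Element("a", {"href": f"#note{number}", "class": "noteref"})
        ref.text = f"[{number}]"
        ref.tail = note.tail  # keep the running text that followed the note
        parent = parent_of[note]
        parent[list(parent).index(note)] = ref
    section = ET.SubElement(root, "div", {"class": "endnotes"})
    for number, text in enumerate(collected, start=1):
        item = ET.SubElement(section, "p", {"id": f"note{number}"})
        item.text = f"[{number}] {text}"
    return ET.tostring(root, encoding="unicode")


print(notes_to_endnotes(SOURCE))
```

A PDF pipeline would map the same `note` elements to footnotes instead; the point is only that one inline element can drive both presentations.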

At DP, we haven't even identified enough HTML structure to provide a standard encoding for the most basic structural elements of a book - i.e. Chapters. Nor a way to embed illustrations in the HTML other than as (non-portable) url links. It seems to me that XHTML is especially badly suited to encoding books. I can't imagine how anyone builds an ebook without identifying and acquiring chapters and illustrations.

Hi Don, Excuse me, what? On 25.02.2011 at 07:01, don kretz wrote:
At DP, we haven't even identified enough HTML structure to provide a standard encoding for the most basic structural elements of a book - i.e. Chapters.
Please define what you consider a "Chapter" to be.
Nor a way to embed illustrations in the HTML other than as (non-portable) url links.
What do you mean by non-portable? You do parse the HTML during conversion, do you not? And you can convert the information in the image tag? Strange, very strange.
It seems to me that XHTML is especially badly suited to encoding books.
I can't imagine how anyone builds an ebook without identifying and acquiring chapters and illustrations.
I cannot follow you properly. In a printed book a chapter is identifiable without any structural elements in the sense of mark-up! Formatting, yes.
For the fun of it: a chapter is marked by a paragraph, a line, or the opening line of a paragraph that is formatted in a certain way, normally consistently throughout a book. Identifying it is not that hard, and marking it up is just as easy. Processing it is just as easy, since you parse and identify the markup. So what is the problem? The semantics? That is how a tool interprets the mark-up! regards Keith.

On 02/25/2011 07:01 AM, don kretz wrote:
At DP, we haven't even identified enough HTML structure to provide a standard encoding for the most basic structural elements of a book - i.e. Chapters.
Nor a way to embed illustrations in the HTML other than as (non-portable) url links.
What is the problem with urls? Please expand. -- Marcello Perathoner webmaster@gutenberg.org

On 2/24/2011 11:01 PM, don kretz wrote:
At DP, we haven't even identified enough HTML structure to provide a standard encoding for the most basic structural elements of a book - i.e. Chapters.
I can believe this. Not that XHTML doesn't provide a way to encode chapters -- I can think of 3 or 4 ways off the top of my head -- but that DP couldn't come together and agree on a solution.
Nor a way to embed illustrations in the HTML other than as (non-portable) url links.
I suppose you could embed them in an <object> element, but it seems to me that a nice, portable, relative URL is a completely acceptable method.
It seems to me that XHTML is especially badly suited to encoding books.
The marketplace disagrees. If commercial companies can make it work, why can't DP?
I can't imagine how anyone builds an ebook without identifying and acquiring chapters and illustrations.
Nor can I, but given the fact that HTML /can/ accommodate those things this statement is a bit of a non sequitur.

There are conventions for identifying these artifacts in an html document, but there is nothing incorporated into HTML that declares unambiguously that (for instance) an H2 with a class of "chapter-head" is the way a given document does it. Nor is there an implicit way to declare such a structure as metadata. All we do is choose one of the ways we saw someone else do it, or make up another way ourselves. But it doesn't inhere to HTML, nor is it particularly easier than any other form of representation. It's equivalent to saying "It's a new chapter when there are four blank lines, and the first paragraph is the chapter title" in a text document. And including images by reference with a url is an explicit admission that it stands outside the HTML structure. There's no assurance that the reference is even available or legitimate if that document were copied elsewhere. And there's no way within the bounds of (X)HTML standards to provide such a capability. Certainly we can't publish or even describe an API that would provide unambiguous text and metadata sufficient to construct a properly structured ebook in any other representation than the one in which it is stored. And among the storage formats we provide from which such information might be inferred, it seems to be in most cases the plain-text format that is most accessible. At least whatever conventions there are seem to be more consistently adhered to.
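To make the plain-text convention quoted above concrete ("four blank lines, and the first paragraph is the chapter title"), here is a minimal sketch. The function name and the sample text are invented for illustration, and real PG texts do not follow this convention consistently, which is rather the point of the complaint.

```python
import re


def split_chapters(text):
    """Split plain text into chapters under an assumed convention: four or
    more blank lines start a new chapter, and the first paragraph of each
    chapter is its title."""
    # Four blank lines means five consecutive line breaks.
    chunks = re.split(r"\n(?:[ \t]*\n){4,}", text.strip("\n"))
    chapters = []
    for chunk in chunks:
        # Within a chapter, a single blank line separates paragraphs.
        paragraphs = [p.strip() for p in re.split(r"\n[ \t]*\n", chunk) if p.strip()]
        if paragraphs:
            chapters.append({"title": paragraphs[0], "paragraphs": paragraphs[1:]})
    return chapters


SAMPLE = (
    "CHAPTER I\n\nIt was a dark night.\n\nThe wind blew."
    "\n\n\n\n\n"
    "CHAPTER II\n\nMorning came.\n"
)

for chapter in split_chapters(SAMPLE):
    print(chapter["title"], len(chapter["paragraphs"]))
```

Nothing in the text itself declares that this rule applies; the parser only works because the convention is known from outside the file.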

On 2/25/2011 11:48 AM, dakretz@gmail.com wrote:
There are conventions for identifying these artifacts in an html document, but there is nothing incorporated into HTML that declares unambiguously that (for instance) an H2 with a class of "chapter-head" is the way a given document does it.
True. If you choose to restrict yourselves to "unclassified" HTML (i.e. HTML without class attributes) there is no way in the W3C definition of HTML to explicitly indicate that any particular point in a document is the start of a chapter, or that any particular point is the end of a chapter. Of course, there are ways to indicate that a span of text is a paragraph as opposed to an anonymous block, that it is a header as opposed to a paragraph, or that it should be emphasized. These structures alone put bare HTML ahead of unstructured text. If you choose not to establish conventions (or choose to ignore them), neither is there any way in HTML to implicitly detect these kinds of structures. But then, if you choose to ignore conventions there is no way to implicitly detect these kinds of structures in /any/ kind of text, so on this count HTML is at least as good as anything else. Luckily, I do not choose to restrict myself to "unclassified" HTML, and I am willing to adhere to conventions, so I've never been confronted with an e-book that I could not represent in HTML.
Nor is there an implicit way to declare such a structure as metadata.
True again (although see my comments above about implicit structures). Luckily, there is an /explicit/ way to declare metadata in HTML: the <meta> tag.
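As a rough illustration of what "classes and conventions" plus the <meta> tag buy an automated process, the sketch below pulls metadata and chapter titles out of a page. The `chapter-head` class comes from the message being answered; the `BookOutline` class, the meta names, and the sample page are assumptions made up for this example, not an agreed PG or DP convention.

```python
from html.parser import HTMLParser


class BookOutline(HTMLParser):
    """Collect <meta name=... content=...> pairs and the text of every
    <h2 class="chapter-head"> heading (an assumed convention)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.chapters = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.metadata[attrs["name"]] = attrs.get("content") or ""
        elif tag == "h2" and "chapter-head" in (attrs.get("class") or "").split():
            self._in_heading = True
            self.chapters.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.chapters[-1] += data


PAGE = """<html><head>
<meta name="author" content="Walter Scott" />
<meta name="title" content="Ivanhoe" />
</head><body>
<h2 class="chapter-head">Chapter I</h2><p>...</p>
<h2>Not a chapter</h2>
<h2 class="chapter-head">Chapter II</h2><p>...</p>
</body></html>"""

outline = BookOutline()
outline.feed(PAGE)
print(outline.metadata)
print(outline.chapters)
```

The extraction is trivial once the convention is fixed; without an agreed class name, the same code has nothing to rely on.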
All we do is choose one of the ways we saw someone else do it, or make up another way ourselves. But it doesn't inhere to HTML, nor is it particularly easier than any other form of representation. It's equivalent to saying "It's a new chapter when there are four blank lines, and the first paragraph is the chapter title" in a text document.
Yes, this much should be obvious. There is no adherence to conventions (some would say that there /are/ no conventions), and therefore there is nothing that an automated process can rely on. Without adherence to specifications and conventions any kind of automated processing, and this includes extracting metadata as much as it does file format conversions, is impossible. This is why I constantly return to (2), politics. There are many, many (1) technical solutions to the very problems you have identified here. I happen to prefer XHTML enhanced by classes and conventions -- it has a nice balance between theory and practice. But it's not the only possible solution. TEI is workable, as is ReStructured Text. Given time, even z.m.l. could evolve into a workable solution if BowerBird were not quite so dogmatic about it. But Mr. Hart has recently quite vehemently stated (with the caveat that I usually have a hard time figuring out what Mr. Hart is actually saying) that PG will not enforce a set of standards or conventions, that PG will not adopt a set of standards or conventions, that PG will not endorse a set of standards or conventions, that PG will not recommend a set of standards or conventions, that PG will not speak kindly of any set of standards or conventions and that PG will not acknowledge that there are any standards or conventions. Given this position, I find it difficult, if not impossible, to believe that Project Gutenberg can evolve to do anything more than what it is doing at present: store a mess'o'text and make it available for download. I will admit that Distributed Proofreaders is not quite so dogmatic in this regard, although it too suffers from political gridlock. DP texts are better structured than the majority of Project Gutenberg texts (especially the early PG texts that no-one seems to be interested in cleaning up) but DP still refuses to adopt a complete enough set of standards and conventions to allow the kind of automated data processing that many here desire. The basic principle seems to be "we can't because we won't."
And including images by reference with a url is an explicit admission that it stands outside the HTML structure. There's no assurance that the reference is even available or legitimate if that document were copied elsewhere.
I guess that depends on your point of view. As I see it, the 'href' attribute is an explicit way to /include/ objects in the HTML structure by reference. I'm not so rigid in my mindset that I insist that a single "e-book" necessarily requires a single file. And if I need a single file I have ZIP or its fraternal twin ePub. In any case, including an 'href' inside HTML is better than anything unstructured text can offer.
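The "single file via ZIP" idea can be sketched like this. It is a plain ZIP archive, not a valid ePub (which requires additional packaging files), and the file names and the `bundle_book` helper are hypothetical.

```python
import io
import zipfile


def bundle_book(html_text, images):
    """Pack an HTML file and the images it references by relative URL into
    one in-memory ZIP archive. `images` maps relative paths to bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("book.html", html_text)
        for relative_path, data in images.items():
            # The same relative path used in the HTML resolves inside the archive.
            archive.writestr(relative_path, data)
    return buffer.getvalue()


archive_bytes = bundle_book(
    '<p><img src="images/frontispiece.png" alt="Frontispiece" /></p>',
    {"images/frontispiece.png": b"placeholder image bytes"},
)
print(zipfile.ZipFile(io.BytesIO(archive_bytes)).namelist())
```

Because the references are relative, the bundle can be copied anywhere and the links still resolve, which addresses the worry about a document being moved away from its illustrations.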
And there's no way within the bounds of (X)HTML standards to provide such a capability.
Based on the requirements I have gleaned from your message, it seems that only TEI offers the capabilities you desire, and we already know the political firestorm that /that/ suggestion would engender. But I think your assertion, unsupported by evidence, is simply wrong. If you insist on embedding arbitrary binary objects inside an HTML file it can be done with the <object> element. It will be hard to find user agents to support them, but that's the fault of the implementation, not the specification. In fact, so far I have not seen evidence of /any/ book structure that cannot be accommodated by (X)HTML.
Certainly we can't publish or even describe an API that would provide unambiguous text and metadata sufficient to construct a properly structured ebook in any other representation than the one in which it is stored.
Why not? It certainly seems feasible to me. Can you provide examples why it cannot be done?
And among the storage formats we provide from which such information might be inferred, it seems to be in most cases the plain-text format that is most accessible.
Only to humans, who have highly developed neural networks. Unstructured text is only acceptable if you are willing to accept that all manipulations of the files will be done by humans, and even then some conventions are encouraged if not required. It seems to me that this argument boils down to, "We can't figure out how to structure texts without adopting certain standards and conventions, we refuse to adopt the standards and conventions that will allow these texts to be repurposed, therefore we're just going to give up. Our texts are made by humans, for humans, and nothing else is supported." Again, I have to say that the problems facing automated text conversions are political, not technical, and I don't see any movement among the PG despots that indicates any kind of willingness to help solve those problems. The best advice I can offer is to seek asylum elsewhere.

Again, I have to say that the problems facing automated text conversions are political, not technical, and I don't see any movement among the PG despots that indicates any kind of willingness to help solve those problems. The best advice I can offer is to seek asylum elsewhere.
I'd say they are not political, they are practical. The technical discussion is entirely moot until the editor problem is solved. Try as they might, and there's a big pot of gold waiting for the first to succeed, no one has been able to develop an HTML editor that normal people can use. Certainly not one with the sophistication to enable them to use the breadth of markup required to even edit the poor meagre subset of syntactical information (not even chapters) incorporated into the elegant products coming from DP. Microsoft has failed. Adobe has failed (and they have the only product that has real traction for highly technical users.) Google keeps trying with Google Docs, but that's clearly unsatisfactory. What is technically possible with HTML, X or otherwise, makes no difference at all unless there's an editor supporting it that is approximately as easy to use as what people write their emails with, and captures syntactic artifacts. The light-weight markup languages are all I've ever seen that try to address this most fundamental requirement. Everything else seems to grind to a halt in the great wysiwyg swamp - an impedance mismatch if there ever was one. Email hosts and blog engines are the world's two anvils for testing the reality of text editor usability. Anthologize is one attempt to take aim at ebooks from this direction. At least a couple of the POD/ebook-self-publishing outfits are asking people to submit using blog engines (including B&N.) I can't think of anyone who appears willing, much less eager, to accept HTML text. The format it eventually gets stored in will inevitably be the format it gets submitted in. Any other assumption just adds overhead and loses textual integrity. You can only take out what's put in.

On 27.02.2011 at 09:34, don kretz wrote:
I'd say they are not political, they are practical. The technical discussion is entirely moot until the editor problem is solved.
Try as they might, and there's a big pot of gold waiting for the first to succeed, no one has been able to develop an HTML editor that normal people can use. Certainly not one with the sophistication to enable them to use the breadth of markup required to even edit the poor meagre subset of syntactical information (not even chapters) incorporated into the elegant products coming from DP.
Microsoft has failed. Adobe has failed (and they have the only product that has real traction for highly technical users.) Google keeps trying with Google Docs, but that's clearly unsatisfactory.
MS code will generally not validate! In other words, it is non-standard. But why? Because they add a lot of fluff which looks nice and makes their editor simple to use. That is not technical, but political! I have not used Adobe for a long time. GoLive used to have a validator, at least! Furthermore, you do not use HTML for making printed matter! It is not designed for that. So what is NEEDED is an EDITOR designed for editing ebook text with the output media and their needs in mind. You do NOT need the full standard. Yet as long as people think they have to use the full standard to create aesthetically pleasing books, the problem is political! regards Keith

MS code will generally not validate! In other words, it is non-standard. But why? Because they add a lot of fluff which looks nice and makes their editor simple to use.
That is not technical, but political! I have not used Adobe for a long time. GoLive used to have a validator, at least!
Furthermore, you do not use HTML for making printed matter! It is not designed for that. So what is NEEDED is an EDITOR designed for editing ebook text with the output media and their needs in mind. You do NOT need the full standard. Yet as long as people think they have to use the full standard to create aesthetically pleasing books, the problem is political!
regards Keith
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d
"A lot of fluff which looks nice and makes their editor simple to use." How inconsiderate. The problem is that people don't write well-formed text. They write prose and poetry. They write catalogs and plays and textbooks. All of which are often not "well-formed" structurally. Shakespeare especially so. Have you ever tried to map a Shakespeare play to XHTML? A "page" is often discontiguous with a Chapter or a Verse or Stanza or Scene. You have footnotes and sidenotes and stage directions willy-nilly. There's no "simple" way for a rigorously hierarchical method to accommodate that. (and X- is distinguished if anything for its rigor). So the problem is not Shakespeare, but those who carelessly mis-transcribed the structure of his work? I'll wait to see your special Editor. At least bb is willing to put it on the line.

On 02/27/2011 06:57 PM, don kretz wrote:
"A lot of fluff which looks nice and makes their editor simple to use." How inconsiderate.
I don't see any difference to DP: they routinely put in a lot of fluff which looks nice but breaks interoperability everywhere (excepting the very one monitor the PPer is using at the time.)
The problem is that people don't write well-formed text. They write prose and poetry. They write catalogs and plays and textbooks. All of which are often not "well-formed" structurally. Shakespeare especially so. Have you ever tried to map a Shakespeare play to XHTML?
No. But TEI has been used to capture all that and much more.
A "page" is often discontiguous with a Chapter or a Verse or Stanza or Scene. You have footnotes and sidenotes and stage directions willy-nilly. There's no "simple" way for a rigorously hierarchical method to accommodate that. (and X- is distinguished if anything for its rigor).
And still TEI, which was formerly expressed in SGML, is now expressed in XML. -- Marcello Perathoner webmaster@gutenberg.org

On 2/27/2011 1:22 PM, Marcello Perathoner wrote:
On 02/27/2011 06:57 PM, don kretz wrote:
"A lot of fluff which looks nice and makes their editor simple to use." How inconsiderate.
I don't see any difference to DP: they routinely put in a lot of fluff which looks nice but breaks interoperability everywhere (excepting the very one monitor the PPer is using at the time.)
The PPers I hang out with routinely test with different browsers, and different browser window sizes, and different font sizes, and anything else we can think of to force problems, Marcello. And we often put our works up for preview by other PPers to see if they see anything wrong on their browsers with their window/monitor sizes. That may, of course, only demonstrate that a book works well in a web browser, but it's far better than breaking interoperability ~"everywhere except the one browser the PPer is using"~. And some of us are trying to learn how to produce work that looks good both in a browser and in epub format, as you're aware. -- Walt

And some of us are trying to learn how to produce work that looks good both in a browser and in epub format, as you're aware.
Again, these two books talk pretty correctly and honestly about what the practical issues are re html, epub, and mobi on ebook readers:
EPUB Straight to the Point: Creating ebooks for the Apple iPad and other ereaders, by Elizabeth Castro
Kindle Formatting: The Complete Guide To Formatting Books For The Amazon Kindle, by Joshua Tallent

Pardon!? On 27.02.2011 at 18:57, don kretz wrote:
"A lot of fluff which looks nice and makes their editor simple to use." How inconsiderate.
The problem is that people don't write well-formed text. They write prose and poetry. They write catalogs and plays and textbooks. All of which are often not "well-formed" structurally. Shakespeare especially so. Have you ever tried to map a Shakespeare play to XHTML?
Never heard of "well formed text", please define! What is so hard about prose and poetry? As I said, I would not use XHTML. But it can be done. Also, you can extend XHTML to get what you need.
A "page" is often discontiguous with a Chapter or a Verse or Stanza or Scene. You have footnotes and sidenotes and stage directions willy-nilly. There's no "simple" way for a rigorously hierarchical method to accommodate that. (and X- is distinguished if anything for its rigor).
So the problem is not Shakespeare, but those who carelessly mis-transcribed the structure of his work?
I would not say that they mis-transcribe! It is a matter of layout! So you need a method for adding sidenotes:
- use marginals
- use a pop-up
- use divs
- use frames
- link to somewhere else
The question is how they can be represented. One of the OLDEST METHODS to do this is to use TABLES. Sure, it looks ugly in code, but it is an adequate workaround. Stage directions are often injected into the flow of the play and formatted so and so.
I'll wait to see your special Editor. At least bb is willing to put it on the line.
The problem is not the editor. It is the internal structure you want to represent a text in. Then one can write an editor that is easy to use. The fact of the matter is that the user or author of an ebook should not need to care how it is done, just what it looks like in the end. regards Keith.

On 02/27/2011 09:34 AM, don kretz wrote:
... no one has been able to develop an HTML editor that normal people can use.
That's BS. There are lots of HTML editors out there that people use. The underlying problem is: That even if your machine for processing horse manure has windows and a mouse and drop-down menus and animated icons, its output is still horse manure.
Certainly not one with the sophistication to enable them to use the breadth of markup required to even edit the poor meagre subset of syntactical information (not even chapters) incorporated into the elegant products coming from DP.
ROTFL. They may be elegant but they are non-functional. They work on desktop-sized screens only, for suitably small values of 'work': Try to narrow your browser window to the typical 5-6 words per line of a mobile phone. Breakage galore! And, yes!, a substantial portion of PG downloads go to mobile phones.
Microsoft has failed. Adobe has failed (and they have the only product that has real traction for highly technical users.) Google keeps trying with Google Docs, but that's clearly unsatisfactory.
The fundamental problem of WYSIWYG is that you can 'see' only the presentation. The semantics you have to infer with your brains. That's hard if your brain has been rotted by a lifetime of WYSIWYG use. HTML is a purely presentational markup and shares all the problems of WYSIWYG and adds some of its own. It is practically impossible to teach good markup to people that have had a prior exposure to HTML: as potential markup editors they are mentally mutilated beyond hope of regeneration. (This doesn't preclude that HTML is a good machine format. It just isn't suited for authoring. You don't write your code in assembler.)
What is technically possible with HTML, X or otherwise, makes no difference at all unless there's an editor supporting it that is approximately as easy to use as what people write their emails with, and captures syntactic artifacts.
Machines cannot capture semantics yet. (And when they do, Google's automatic output will surpass DP's human output not only in quantity but also in quality, thus making DP obsolete.) DP should have educated their processors about semantic markup. They have failed at this in the same way they have soundly slept thru the technological changes of the last 5 years. (At least I wasn't able to find a single FAQ about semantic markup at DP and DP's output doesn't look like they are getting it.) As long as the average person at DP cannot tell a paragraph from a non-paragraph, every discussion about formats and tools is moot. -- Marcello Perathoner webmaster@gutenberg.org

On Sun, Feb 27, 2011 at 9:54 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
ROTFL. They may be elegant but they are non-functional. They work on desktop-sized screens only, for suitably small values of 'work': Try to narrow your browser window to the typical 5-6 words per line of a mobile phone. Breakage galore! And, yes!, a substantial portion of PG downloads go to mobile phones.
Of course, you've advocated breaking it on my 1600x900 screen, and you totally trashed any chance we could have TEI-Lite. -- Kie ekzistas vivo, ekzistas espero.

David Starner <prosfilaes@gmail.com> writes:
On Sun, Feb 27, 2011 at 9:54 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
ROTFL. They may be elegant but they are non-functional. They work on desktop-sized screens only, for suitably small values of 'work': Try to narrow your browser window to the typical 5-6 words per line of a mobile phone. Breakage galore! And, yes!, a substantial portion of PG downloads go to mobile phones.
Of course, you've advocated breaking it on my 1600x900 screen, and you totally trashed any chance we could have TEI-Lite.
You wanted TEI-Lite as a so-called (output-)format... -- Karl Eichwalder

On 02/27/2011 08:07 PM, David Starner wrote:
Of course, you've advocated breaking it on my 1600x900 screen, and you totally trashed any chance we could have TEI-Lite.
Now that you mention it: I never ever read even one entire book on a desktop screen. OTOH my old Palm Treo (320x320) is still loaded with 200+ plucker books, most of which I've actually read, in the park, on the commute or standing in the queue. I don't know why DP must pighead itself into producing for a platform nobody uses for reading. Must be that you can show off to your friends easier on a desktop (after carefully adjusting the browser width to the only width the book actually works at.) Care to amplify on that bit about TEI? I don't quite understand how actually writing a TEI converter that works for PG has trashed the chances? The only possible interpretation is that DP has a serious case of 'not invented here'. -- Marcello Perathoner webmaster@gutenberg.org

On Sun, Feb 27, 2011 at 11:41 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
Now that you mentioned it: I never ever read even one entire book on a desktop screen.
Exactly, Marcello. If you've never done it, then it's
a platform nobody uses for reading.
No more evidence needed!
Care to amplify on that bit about TEI? I don't quite understand how actually writing a TEI converter that works for PG has trashed the chances?
But see, it didn't work. It produced title pages that were distinctly non-standard, and distinctly ugly. And when people complained, you, as you usually do, claimed that it was perfect and refused to change anything. Even if you were right, you were still wrong; you let such issues derail TEI, instead of working with people to achieve something they would be willing to use, even if it isn't an exact fit to your tastes. And again, when I go to http://en.wikipedia.org/wiki/Tom_Sawyer and click on the link at the bottom, it comes up to http://www.gutenberg.org/ebooks/74 ... in German. Is this another "it works right for Marcello"? -- Kie ekzistas vivo, ekzistas espero.

On 02/27/2011 09:38 PM, David Starner wrote:
No more evidence needed!
The top selling item of the top bookseller on the planet is a portable book reading device. I'm sure all those people just buy a costly dedicated device to go and read books on desktop PCs they already owned. But if you still don't get it, then continue doing what you've always done. It's not *my* work that will gather dust and be forgotten in a format nobody wants.
Care to amplify on that bit about TEI? I don't quite understand how actually writing a TEI converter that works for PG has trashed the chances?
But see, it didn't work. It produced title pages that were distinctly non-standard, and distinctly ugly. And when people complained, you, as you usually do, claimed that it was perfect and refused to change anything.
I tell you now, as I told you then, and you won't listen now, as you didn't listen then, that if your aesthetic preferences are at variance with the TEI defaults, you are at liberty to use CSS styling all the way until the output suits you. Fact is: You can style TEI titlepages as minutely as you want. Anybody who cares to take a look here can see that: http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf But, as usual, once David has formed his (erroneous) opinion, he will continue spreading his misinformation completely unbothered by contradicting facts.
Even if you were right, you were still wrong; you let such issues derail TEI, instead of working with people to achieve something they would be willing to use, even if it isn't an exact fit to your tastes.
If I'm right, then I'm right. Period. What devious logic do they use on the planet you come from? I'm not working with you because you are not interested in achieving something, but only in destroying something that is not of your own invention.
And again, when I go to http://en.wikipedia.org/wiki/Tom_Sawyer and click on the link at the bottom, it comes up to http://www.gutenberg.org/ebooks/74 ... in German. Is this another it works right for Marcello?
Yes. If you configure your browser so that none of the languages you accept is available, it will send a random language. (I think it chooses in alphabetical order, which happens to be: de, en, es, fr, it.) If you prefer English over a random language then why don't you configure your browser so? You are like that woman at DP that complained that the site sent her a date in Inuktitut. Even though she had actually configured Inuktitut as her preferred language! -- Marcello Perathoner webmaster@gutenberg.org

On Sun, Feb 27, 2011 at 1:55 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
The top selling item of the top bookseller on the planet is a portable book reading device. I'm sure all those people just buy a costly dedicated device to go and read books on desktop PCs they already owned.
But it's not a cellphone, and we shouldn't be using the same HTML for it and a desktop.
if your aesthetic preferences are at variance with the TEI defaults,
And I told you then, and I'll tell you again, it's not my aesthetic preferences, it's the standard way to do title pages for the last few centuries. And nobody changes the defaults, so they must look decent out of the box.
Even if you were right, you were still wrong; you let such issues derail TEI, instead of working with people to achieve something they would be willing to use, even if it isn't an exact fit to your tastes.
If I'm right, then I'm right. Period. What devious logic do they use on the planet you come from?
On my planet, they invent something called the compromise, to let people who don't agree on what's right continue to work together. People who fail to achieve their goals because they fail to compromise are wrong. Metric and 110 voltage may be the right standards, but if you refuse to convert to Imperial units and 220 voltage for those markets, you're the one who's wrong.
And again, when I go to http://en.wikipedia.org/wiki/Tom_Sawyer and click on the link at the bottom, it comes up to http://www.gutenberg.org/ebooks/74 ... in German. Is this another it works right for Marcello?
Yes. If you configure your browser so that none of the languages you accept is available, it will send a random language. (I think it chooses in alphabetical order, which happens to be: de, en, es, fr, it.)
So, despite the fact that we have 37 times as many books in English as in German, it defaults to German?
If you prefer English over a random language then why don't you configure your browser so?
I did; I always have. Right now I'm sending "en-us,en;q=0.8,eo;q=0.5,de;q=0.3".
You are like that woman at DP that complained that the site sent her a date in Inuktitut. Even though she had actually configured Inuktitut as her preferred language!
I suppose the concept that a webpage should be in one language if possible is beyond you. -- Kie ekzistas vivo, ekzistas espero.
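As a rough illustration of the negotiation being argued about in this exchange, the following Python sketch ranks the entries of an Accept-Language header by their q-values and picks the first one the server can supply. The function names, the fallback from a regional tag such as en-us to its primary language, and the `default` parameter are assumptions made for the example; it is not the code that runs on gutenberg.org.

```python
def parse_accept_language(header):
    """Return (language, q) pairs, most preferred first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        lang = parts[0].strip().lower()
        if not lang:
            continue
        q = 1.0  # an entry without an explicit q-value counts as 1.0
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((lang, q, position))
    # Highest q first; ties keep the order given in the header.
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(lang, q) for lang, q, _ in prefs]


def choose_language(header, available, default="en"):
    """Pick the best available language (hypothetical helper)."""
    for lang, q in parse_accept_language(header):
        if q <= 0:
            continue
        if lang == "*":
            return default
        if lang in available:
            return lang
        primary = lang.split("-")[0]  # assumption: en-us may fall back to en
        if primary in available:
            return primary
    return default


available = ["de", "en", "es", "fr", "it"]
print(choose_language("en-us,en;q=0.8,eo;q=0.5,de;q=0.3", available))  # en
```

Under these assumptions the header quoted above resolves to English, and a header that names no available language falls back to an explicit default rather than to whichever language sorts first.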

I must agree with Marcello here, in that I have met MANY people who have read entire PG eBooks on Palm-powered devices or with Plucker or Mobi or any of the other formats. In fact, I have received messages from people who doubled the number they read in toto per year, just by reading in line, commuting, etc. If you could do PG eBooks on the first iPod after only one week out then the rest should actually be gravy. . . . On Sun, 27 Feb 2011, Marcello Perathoner wrote:
On 02/27/2011 08:07 PM, David Starner wrote:
Of course, you've advocated breaking it on my 1600x900 screen, and you totally trashed any chance we could have TEI-Lite.
Now that you mentioned it: I never ever read even one entire book on a desktop screen.
OTOH my old Palm Treo (320x320) is still loaded with 200+ plucker books, most of which I've actually read, in the park, on the commute or standing in the queue.
I don't know why DP must pighead itself into producing for a platform nobody uses for reading. Must be that you can show off to your friends easier on a desktop (after carefully adjusting the browser width to the only width the book actually works.)
Care to amplify on that bit about TEI? I don't quite understand how actually writing a TEI converter that works for PG has trashed the chances? The only possible interpretation is that DP has a serious case of 'not invented here'.

On 27.02.2011 at 18:54, Marcello Perathoner wrote:
On 02/27/2011 09:34 AM, don kretz wrote:
Certainly not one with the sophistication to enable them to use the breadth of markup required to even edit the poor meagre subset of syntactical information (not even chapters) incorporated into the elegant products coming from DP.
ROTFL. They may be elegant but they are non-functional. They work on desktop-sized screens only, for suitably small values of 'work': Try to narrow your browser window to the typical 5-6 words per line of a mobile phone. Breakage galore! And, yes!, a substantial portion of PG downloads go to mobile phones. Good Point! BUT, that is because they are not developing towards smaller screen sizes then. WHICH they should be doing if targeting the epub format!
Microsoft has failed. Adobe has failed (and they have the only product that has real traction for highly technical users.) Google keeps trying with Google Docs, but that's clearly unsatisfactory.
The fundamental problem of WYSIWYG is that you can 'see' only the presentation. The semantics you have to infer with your brains. That's hard if your brain has been rotted by a lifetime of WYSIWYG use. Another GOOD POINT. BUT, WYSIWYG is not the problem. WYSIWYG is basically your standard light markup! Now, for many it is easier to FORMAT text with mark-up "Tags"/Elements that have a relation to semantic structures.
HTML is a purely presentational markup and shares all the problems of WYSIWYG and adds some of its own.
It is practically impossible to teach good markup to people that have had a prior exposure to HTML: as potential markup editors they are mentally mutilated beyond hope of regeneration. WRONG! You have to tell them to forget everything they have learned so far and teach them what good mark-up practice is!
(This doesn't preclude that HTML is a good machine format. It just isn't suited for authoring. You don't write your code in assembler.) I do some assembler once in a while, but then for a very specialized purpose! But, you bring up another interesting point. Most do not need assembler anymore because the programming tools are so good that assembler optimization is hardly needed any more. So what is needed are GOOD TOOLS.
But, most do not mark up or program from the bottom up! They insist their tools are the best and know best. Though there are ways of doing things better. Even if you can demonstrate it to them.
What is technically possible with HTML, X or otherwise, makes no difference at all unless there's an editor supporting it that is approximately as easy to use as what people write their emails with, and captures syntactic artifacts.
Machines cannot capture semantic yet. (And when they do, Google's automatic output will surpass DP's human output not only in quantity but also in quality, thus making DP obsolete.)
DP should have educated their processors about semantic markup. They have failed this in the same way they have soundly slept through the technological changes of the last 5 years. (At least I wasn't able to find a single FAQ about semantic markup at DP and DP's output doesn't look like they are getting it.)
Until the average person at DP can tell a paragraph from not a paragraph, every discussion about formats and tools is moot.
The real question is: is true semantic markup needed? You yourself have mentioned that the semantics is in the mind. Besides, what is a Chapter title? It, too is a paragraph, in most cases! regards Keith.

It's what goes into the Table of Contents in an ebook. And ebooks for the most part don't let you pass "Go" without distinct chapters in hand. So you won't get far with only undifferentiated chapter headings.
The real question is: is true semantic markup needed? You yourself have mentioned that the semantics is in the mind. Besides, what is a Chapter title? It, too is a paragraph, in most cases!
regards Keith.

Excuse me. Since when is it a prerequisite that chapter headings be put into the TOC!! I do admit that Books have a TOC. The TOC is just there to help navigate a Book, even a printed one! The epub requires that there is a TOC-like structure for navigating the ebooks. They do not have to correspond to the chapters of a book!! There are many books without TOCs. Especially, older books. It is only a newer convention. A Chapter, is defined semantically differently. It marks a logical division in a book. A TOC on the other hand has a logical relation to the chapters of a book. NOT the other way around. Of course, it is nice to have the functionality during processing to know that here is a chapter and to make a TOC entry! Yet, to infer the start of a chapter one does not need an explicit chapter marker. I have not tried it, but it should be possible to make an ebook that does not have any chapter navigation nodes except one to the beginning of the file containing content. At least there is not anything forbidding it! It is a convention. The epub has no way of identifying a chapter unless you tell it that a chapter is supposed to start at point X. regards Keith. On 28.02.2011 at 09:18, don kretz wrote:
It's what goes into the Table of Contents in an ebook. And ebooks for the most part don't let you pass "Go" without distinct chapters in hand. So you won't get far with only undifferentiated chapter headings.
The real question is: is true semantic markup needed? You yourself have mentioned that the semantics is in the mind. Besides, what is a Chapter title? It, too is a paragraph, in most cases!
regards Keith.

On Mon, February 28, 2011 11:14 am, Jim Adcock wrote:
Excuse me. Since when is it a prerequisite that chapter headings be put into the TOC!!
In my experience over at least the last couple years the whitewashers insist that there be a TOC linking to chapter headings.
But now you're talking about politics, not technology. What the PG white washers require has little or no relevance to the structures of books, and only matters if you're willing to play their games. What the white washers want has no relevance to the definition of the meaning of a chapter title. Personally, I simply don't care what the white washers think.

Personally, I simply don't care what the white washers think.
I care, because I would rather not make their lives unnecessarily harder, nor mine, when I am trying to submit a book. We're in this together, even when we agree to disagree, which we frequently do, vehemently.

On Mon, February 28, 2011 1:18 am, don kretz wrote:
Besides, what is a Chapter title? It, too is a paragraph, in most cases!
It's what goes into the Table of Contents in an ebook. And ebooks for the most part don't let you pass "Go" without distinct chapters in hand. So you won't get far with only undifferentiated chapter headings.
No, a Chapter title is "a descriptive name, caption, or heading of a division of a written work, especially a narrative." Sometimes, these chapter titles are collected into a Table of Contents (which, technically, is part of a book's metadata, and not part of the book itself) and sometimes they are not, but inclusion in a Table of Contents is not that which defines the essence of a Chapter title; it is only a manifestation thereof. So, it seems to me that the way to mark up a chapter title is to first identify "a division of a written work," and then to find the "name, caption or heading" within that division and mark it as a "title". Semantics, not presentation. What It Is, not What It Looks Like.
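The two-step procedure described above (first identify the division, then mark the heading inside it) can be sketched in Python. The element names `div` and `head` are loosely modelled on TEI, and the sample text and the function name `chapter_titles` are assumptions made for the example. A Table of Contents then becomes something derived from the markup rather than something that defines it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: each chapter is a division; its title, if it has
# one, is marked as a <head> inside that division.
SAMPLE = """<body>
  <div type="chapter">
    <head>CHAPTER I.</head>
    <p>First paragraph of the chapter.</p>
  </div>
  <div type="chapter">
    <p>A chapter with no title at all.</p>
  </div>
</body>"""


def chapter_titles(xml_text):
    """Return one entry per chapter division: its title, or None."""
    root = ET.fromstring(xml_text)
    titles = []
    for div in root.iter("div"):
        if div.get("type") != "chapter":
            continue
        head = div.find("head")  # only a direct child counts as the title
        if head is not None and head.text:
            titles.append(head.text.strip())
        else:
            titles.append(None)
    return titles


print(chapter_titles(SAMPLE))
```

The second division still shows up in the result even though it has no title, which is the point of marking the division first and the title second.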

On 28.02.2011 at 18:01, Lee Passey wrote:
On Mon, February 28, 2011 1:18 am, don kretz wrote:
Besides, what is a Chapter title? It, too is a paragraph, in most cases!
It's what goes into the Table of Contents in an ebook. And ebooks for the most part don't let you pass "Go" without distinct chapters in hand. So you won't get far with only undifferentiated chapter headings.
No, a Chapter title is "a descriptive name, caption, or heading of a division of a written work, especially a narrative." Like I said in my other post: "very good". Function!
Sometimes, these chapter titles are collected into a Table of Contents (which, technically, is part of a book's metadata, and not part of the book itself) and sometimes they are not, but inclusion in a Table of Contents is not that which defines the essence of a Chapter title; it is only a manifestation thereof.
So, it seems to me that the way to mark up a chapter title is to first identify "a division of a written work," and then to find the "name, caption or heading" within that division and mark it as a "title".
Semantics, not presentation. What It Is, not What It Looks Like. Actually, it is semio-syntactic!
regards Keith.

On 02/28/2011 08:46 AM, Keith J. Schultz wrote:
Besides, what is a Chapter title? It, too is a paragraph, in most cases!
This pretty much illustrates my point about semantic illiteracy. -- Marcello Perathoner webmaster@gutenberg.org

Hi Marcello, On 28.02.2011 at 11:28, Marcello Perathoner wrote:
On 02/28/2011 08:46 AM, Keith J. Schultz wrote:
Besides, what is a Chapter title? It, too is a paragraph, in most cases!
This pretty much illustrates my point about semantic illiteracy. Really! Who is illiterate here? What does the semantic definition of a paragraph have that a Chapter title does not?
regards Keith

On 02/28/2011 12:07 PM, Keith J. Schultz wrote:
Really! Who is illiterate here? What does the semantic definition of a paragraph have that a Chapter title does not?
RTFW: "A paragraph [...] is a self-contained unit of a discourse in writing dealing with a particular point or idea." -- http://en.wikipedia.org/wiki/Paragraph -- Marcello Perathoner webmaster@gutenberg.org

On 28.02.2011 at 12:34, Marcello Perathoner wrote:
On 02/28/2011 12:07 PM, Keith J. Schultz wrote:
Really! Who is illiterate here? What does the semantic definition of a paragraph have that a Chapter title does not?
RTFW:
"A paragraph [...] is a self-contained unit of a discourse in writing dealing with a particular point or idea."
-- http://en.wikipedia.org/wiki/Paragraph So what does a CHAPTER TITLE NOT have that does not have? The particular point or idea is the introduction of a logical unit. It is just as "self contained" as any paragraph. Oh, before you start off. Many paragraphs do the same, too. So you can see it does meet the criteria.
Then again you would probably say that a sentence cannot be a paragraph, either! regards

On 02/28/2011 12:46 PM, Keith J. Schultz wrote:
So what does a CHAPTER TITLE NOT have that does not have?
You should type slower. Or think faster. Whatever. A title is not a paragraph in the same way that a sign that says "106 miles to Chicago" is not the same as the 106 miles of road to get there. -- Marcello Perathoner webmaster@gutenberg.org

Hi Marcello, On 28.02.2011 at 12:54, Marcello Perathoner wrote:
On 02/28/2011 12:46 PM, Keith J. Schultz wrote:
So what does a CHAPTER TITLE NOT have that does not have?
You should type slower. Or think faster. Whatever. Sorry.
A title is not a paragraph in the same way that a sign that says "106 miles to Chicago" is not the same as the 106 miles of road to get there. Yet, the semantics, are the same!!
You gave as a SEMANTIC definition:
RTFW:
"A paragraph [...] is a self-contained unit of a discourse in writing dealing with a particular point or idea."
Now, this definition as stated can be used to define an entire chapter, section or even a book. The problem is the definition, not the semantics. Furthermore, there is some confusion here. Semantics is not meaning! Another point of confusion is the difference between semantics and pragmatics. Many linguists tend to throw in pragmatics with semantics and that causes a lot of problems. Meaning manifests itself in the combination of syntax, semantics and pragmatics. The so-called intent belongs mainly in the realm of pragmatics. Semantically speaking the function of a chapter title is the same as a paragraph. Yet, the pragmatic usage is different. To explain my original point. Syntactically, one can represent a chapter title as a paragraph. Take the <p> tag in HTML and add a couple of attributes and you have a header! There are even enough attributes that one could mark it up as a chapter! True, a normal HTML browser will not recognize it as such. But, a tool could recognize it and act accordingly! In other words, just like you have mentioned, semantics happens in humans, here, the semantics happens in the tools. regards Keith.
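A small Python sketch of the kind of tool described above: it reads HTML in which a chapter title is an ordinary <p> carrying an agreed attribute, and collects those paragraphs as chapter titles. The class value "chapter-title" and the names `ChapterTitleFinder` and `find_chapter_titles` are assumptions made for the example; a browser would attach no special meaning to the attribute.

```python
from html.parser import HTMLParser


class ChapterTitleFinder(HTMLParser):
    """Collect the text of <p> elements flagged as chapter titles."""

    def __init__(self, marker="chapter-title"):
        super().__init__()
        self.marker = marker    # hypothetical class name agreed with the tool
        self.titles = []
        self._buffer = None     # holds text while inside a flagged <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            classes = (dict(attrs).get("class") or "").split()
            if self.marker in classes:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


def find_chapter_titles(html):
    finder = ChapterTitleFinder()
    finder.feed(html)
    finder.close()
    return finder.titles


page = '<p class="chapter-title">CHAPTER I.</p><p>An ordinary paragraph.</p>'
print(find_chapter_titles(page))
```

The distinction lives entirely in the attribute and in the tool that reads it, which also means that every producer and every tool has to agree on the same convention.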

"Marcello" == Marcello Perathoner <marcello@perathoner.de> writes:
Marcello> On 02/28/2011 12:07 PM, Keith J. Schultz wrote:
>> Really! Who is illiterate here? What does the semantic definition of a paragraph have that a Chapter title does not?
Marcello> RTFW:
Marcello> "A paragraph [...] is a self-contained unit of a discourse in writing dealing with a particular point or idea."
This seems to me perfectly coherent with most chapter titles.
The point for me is rather that we have two different concepts: the chapter header, that can be empty, (displayed as vertical space) or contain several items, including a title (often a short paragraph, but may also consist of more than one paragraph), an identifier (usually a number), a summary, an epigraph, etc. The TOC should point to the chapter header, not the chapter title, that often is missing.
A chapter body (that follows the header, and in extreme cases may be empty) is usually composed of paragraphs, but may contain other things, for example a chapter in a book of poetry does not contain paragraphs. And I think that the wikipedia definition is too narrow, if not just wrong. I have seen paragraphs (in dialogue) that consist just of
--...
The distinctive character of a paragraph is IMHO the display, you can run two paragraphs together and obtain one paragraph, only with changes in whitespace. Hence a paragraph is a typographical unit, not necessarily a logical unit (even if good typography usually makes paragraph breaks coincide with logical units). The fact that usually the paragraph breaks are decided by the author does not contradict that it is a typographical concept.
Carlo

Hi Carlo, I do agree with you as far as definition is concerned. (Though not entirely) The argument was not about meaning, but semantics. They are two very distinct animals and not very well known to the lay person and hard for them to separate. There is also the question of register. regards Keith. On 28.02.2011 at 13:45, Carlo Traverso wrote:
"Marcello" == Marcello Perathoner <marcello@perathoner.de> writes:
Marcello> On 02/28/2011 12:07 PM, Keith J. Schultz wrote:
Really! Who is illiterate here? What does the semantic definition of a paragraph have that a Chapter title does not?
Marcello> RTFW:
Marcello> "A paragraph [...] is a self-contained unit of a discourse in writing dealing with a particular point or idea."
This seems to me perfectly coherent with most chapter titles.
The point for me is rather that we have two different concepts: the chapter header, that can be empty, (displayed as vertical space) or contain several items, including a title (often a short paragraph, but may also consist of more than one paragraph), an identifier (usually a number), a summary, an epigraph, etc. The TOC should point to the chapter header, not the chapter title, that often is missing.
A chapter body (that follows the header, and in extreme cases may be empty) is usually composed of paragraphs, but may contain other things, for example a chapter in a book of poetry does not contain paragraphs. And I think that the wikipedia definition is too narrow, if not just wrong. I have seen paragraphs (in dialogue) that consist just of
--...
The distinctive character of a paragraph is IMHO the display, you can run two paragraphs together and obtain one paragraph, only with changes in whitespace. Hence a paragraph is a typographical unit, not necessarily a logical unit (even if good typography usually makes paragraph breaks coincide with logical units). The fact that usually the paragraph breaks are decided by the author does not contradict that it is a typographical concept.

On Mon, February 28, 2011 5:45 am, Carlo Traverso wrote:
The point for me is rather that we have two different concepts: the chapter header, that can be empty, (displayed as vertical space) or contain several items, including a title (often a short paragraph, but may also consist of more than one paragraph), an identifier (usually a number), a summary, an epigraph, etc. The TOC should point to the chapter header, not the chapter title, that often is missing.
I /really/ like this concept, and would support it 110%.
A chapter body (that follows the header, and in extreme cases may be empty) is usually composed of paragraphs, but may contain other things, for example a chapter in a book of poetry does not contain paragraphs.
Yes, a chapter need not exclusively consist of paragraphs. Other structures may also be included in chapters.
And I think that the wikipedia definition is too narrow, if not just wrong.
And yet, this definition is repeated in virtually every dictionary I have looked at, sometimes drawn even more narrowly.
I have seen paragraphs (in dialogue) that consist just of
--...
It is usual in French texts to preface quoted text with a horizontal bar or quotation dash (―). In this case, I suspect that what you have identified as a paragraph is shorthand for "Noun stood silently, pensive." I'm willing to accept text marked as paragraphs which do not meet the formal requirements if they obviously serve the same purpose, and could be replaced with a true paragraph without changing the meaning of the prose. In other cases, I think you have probably seen blocks of text which have many of the same presentational qualities as paragraphs, but which are not. (I generally refer to these things as "anonymous blocks", from the TEI specification). As you point out, chapters need not be composed /exclusively/ of paragraphs, so placing some of these other constructs in a chapter is not objectionable. To me, the key is to identify these constructs as what they are, not how they look on a page.
The distinctive character of a paragraph is IMHO the display, you can run two paragraphs together and obtain one paragraph, only with changes in whitespace. Hence a paragraph is a typographical unit, not necessarily a logical unit (even if good typography usually makes paragraph breaks coincide with logical units). The fact that usually the paragraph breaks are decided by the author does not contradict that it is a typographical concept.
Here we must simply agree to disagree (and you are also disagreeing with every English teacher I have ever had, who were constantly berating "run-on paragraphs"). I simply cannot accept the notion that a paragraph is merely a typographical convention. Cheers, Lee

There are two skills that we all learned (or should have learned :) that required our ability to distinguish between semantics and presentation. 1. Diagramming sentences. 2. Writing an outline. I think this is common at least among speakers of western languages.

On Mon, February 28, 2011 11:08 am, don kretz wrote:
There are two skills that we all learned (or should have learned :) that required our ability to distinguish between semantics and presentation.
1. Diagramming sentences.
2. Writing an outline.
I think this is common at least among speakers of western languages.
This is an astute and cogent observation. I probably started learning about how to diagram sentences in the 3rd or 4th grade, but I think that diagramming sentences has now fallen from favor. I'm not sure if my children could explain to me what diagramming a sentence is (next time I talk to them, I'm going to find out). Any division of text that does not have a subject (at least implied) and a verb is not a paragraph.

On Monday, 28th February 2011 at 10:49:40 (GMT -0700 MST), Lee Passey wrote:
I simply cannot accept the notion that a paragraph is merely a typographical convention.
That notion is nonsense. Paragraphs are units of *meaning*, and very important ones. Anyone who has ever studied stylistics knows that. There's a big difference between:
1. She wanted to kill him, but didn't.
2. She wanted to kill him. But she didn't.
3. She wanted to kill him.
   But she didn't.
I believe it was a favourite stylistic means of Dashiell Hammett's to end many of his hard-boiled private-eye stories with the terse 3-word paragraph: "They hanged him." That's a big difference from simply attaching those 3 words to the preceding paragraph! The impact of such a terse, isolated paragraph is a lot stronger on the reader, and makes for a much better "punchline" than other endings the writer could have come up with. -- Yours, Alex. www.aboq.org [processed by "The Bat!", Version 4.2.10.12]

On 02/28/2011 01:45 PM, Carlo Traverso wrote:
"Marcello" == Marcello Perathoner<marcello@perathoner.de> writes:
Marcello> On 02/28/2011 12:07 PM, Keith J. Schultz wrote:
>> Really! Who is illiterate here? What does the semantic definition of a paragraph have that a Chapter title does not?
Marcello> RTFW:
Marcello> "A paragraph [...] is a self-contained unit of a discourse in writing dealing with a particular point or idea."
This seems to me perfectly coherent with most chapter titles.
CHAPTER I. Please explain to me which discourse, point or idea the above mentioned most common of all chapter titles deals with.
The distinctive character of a paragraph is IMHO the display, you can run two paragraphs together and obtain one paragraph, only with changes in whitespace. Hence a paragraph is a typographical unit, not necessarily a logical unit (even if good typography usually makes paragraph breaks coincide with logical units). The fact that usually the paragraph breaks are decided by the author does not contradict that it is a typographical concept.
You can run a whole book into one line just with changes of whitespace. Long before paragraphs were typographically represented as blocks of text marked by whitespace, authors used the pilcrow sign to start a new train of thought. And before that the Greeks used a dash in the margin ('paragraphos') to signal a new paragraph. Thus the concept of paragraph is much older than typography. -- Marcello Perathoner webmaster@gutenberg.org

On Mon, Feb 28, 2011 at 10:00 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
CHAPTER I.
Please explain to me which discourse, point or idea the above mentioned most common of all chapter titles deals with.
"We are starting the main text of the book, the part labeled 1." -- Kie ekzistas vivo, ekzistas espero.

On 02/28/2011 07:17 PM, David Starner wrote:
On Mon, Feb 28, 2011 at 10:00 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
CHAPTER I.
Please explain to me which discourse, point or idea the above mentioned most common of all chapter titles deals with.
"We are starting the main text of the book, the part labeled 1."
That is called an "incipit". Care to look it up? -- Marcello Perathoner webmaster@gutenberg.org

On Mon, February 28, 2011 11:17 am, David Starner wrote:
On Mon, Feb 28, 2011 at 10:00 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
 CHAPTER I.
Please explain to me which discourse, point or idea the above mentioned most common of all chapter titles deals with.
"We are starting the main text of the book, the part labeled 1."
But note that the text "CHAPTER I" is still not a paragraph (which should not imply that a chapter title cannot be a paragraph, just that this one is not). It is a name or identifier. It is certainly possible to imply a paragraph that refers to this identifier, such as "The section of text that follows can be identified by the name 'CHAPTER I'", but the mere mention of the identifier does not necessarily imply the paragraph. Furthermore, any such implied paragraph is metadata; it is not part of the narrative. It is someone on the outside, looking in, saying "Look! I recognize an identifier." Cheers, Lee

On Mon, February 28, 2011 12:46 am, Keith J. Schultz wrote:
On 27.02.2011 at 18:54, Marcello Perathoner wrote:
On 02/27/2011 09:34 AM, don kretz wrote:
Certainly not one with the sophistication to enable them to use the breadth of markup required to even edit the poor meagre subset of syntactical information (not even chapters) incorporated into the elegant products coming from DP.
ROTFL. They may be elegant but they are non-functional. They work on desktop-sized screens only, for suitably small values of 'work': Try to narrow your browser window to the typical 5-6 words per line of a mobile phone. Breakage galore! And, yes!, a substantial portion of PG downloads go to mobile phones.
Good Point! BUT, that is because they are not developing towards smaller screen sizes then. WHICH they should be doing if targeting the epub format!
I disagree. Production of e-books should be a two-step process. First, the book should be marked up in a semantic way which preserves, to the greatest extent possible, the structure and metadata of the book and which does so in a machine-readable format (markup should be unique, explicit and unambiguous). Then, a computer process should be invoked which can transform the semantic markup into whatever presentation is required. If the person who is doing the initial markup is thinking about how it will look on a mobile phone, he or she is already being confused. The initial markup should focus on document structure and best encoding practices, and let the second, automated step worry about how to convert that markup to a format for a specific device. If Project Gutenberg were to adopt this "single source" strategy not only would its texts be "future-proofed" (compatible with software and devices which have not yet been invented) but it could save a single file for all devices and generate specific output for each device on-the-fly. Errors found and corrected in the master file would immediately be corrected in all subsequent outputs. This won't happen, of course, but if it did it would be highly beneficial.
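A minimal Python sketch of the "single source" idea described above. The dictionary layout of `BOOK` and the two renderers `to_html` and `to_plain_text` are assumptions made for the example, standing in for a real semantic master file and real device-specific converters.

```python
from html import escape
import textwrap

# Hypothetical master record: structure only, nothing about appearance.
BOOK = {
    "title": "An Example Book",
    "chapters": [
        {
            "title": "CHAPTER I.",
            "paragraphs": ["First paragraph.", "Second paragraph."],
        },
    ],
}


def to_html(book):
    """Render the master record as simple HTML."""
    parts = [f"<h1>{escape(book['title'])}</h1>"]
    for chapter in book["chapters"]:
        parts.append(f"<h2>{escape(chapter['title'])}</h2>")
        parts.extend(f"<p>{escape(p)}</p>" for p in chapter["paragraphs"])
    return "\n".join(parts)


def to_plain_text(book, width=40):
    """Render the same record as text wrapped for a narrow screen."""
    parts = [book["title"].upper(), ""]
    for chapter in book["chapters"]:
        parts.extend([chapter["title"], ""])
        for p in chapter["paragraphs"]:
            parts.extend([textwrap.fill(p, width=width), ""])
    return "\n".join(parts).rstrip() + "\n"


print(to_html(BOOK))
print(to_plain_text(BOOK, width=20))
```

Because both outputs are generated from the same record, a correction made once in `BOOK` shows up in every format the next time the renderers run.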
HTML is a purely presentational markup and shares all the problems of WYSIWYG and adds some of its own.
I disagree with Mr. Perathoner here. I think HTML started life as /mostly/, but not purely, presentational and has been evolving towards semantic ever since. HTML 4.01/XHTML 1.0 is now mostly /semantic/ and in HTML5 the proposed specification calls for /all/ presentational elements to be isolated into Cascading Style Sheets. When used carefully, it is possible to use HTML4 in a purely semantic way.
It is practically impossible to teach good markup to people that have had a prior exposure to HTML: as potential markup editors they are mentally mutilated beyond hope of regeneration.
WRONG! You have to tell them to forget everything they have learned so far and teach them what good mark-up practice is!
I am of two minds on this subject. I am nowhere near as pessimistic as Mr. Perathoner about the ability of humans to learn new techniques and paradigms. And yet, there is something about this presentation/semantic dichotomy that seems to go much deeper than just training. I am trying (really I am) not to get too deeply drawn in to arguments about the superiority of semantic markup. My experience suggests that some people recognize the distinction between semantic markup ("what it is") and presentational markup ("what it looks like") almost immediately, and that the others will almost /never/ get the difference. My current approach (to the degree that I can follow it) is to try and lay out the differences, but not try to convince someone with rational arguments that semantic markup is more useful. Either they get it or they don't, and even cordial and rational discussion doesn't seem to help. [snip]
What is technically possible with HTML, X or otherwise, makes no difference at all unless there's an editor supporting it that is approximately as easy to use as what people write their emails with, and captures syntactic artifacts.
Some good editing tools might be a step in the right direction. Note, however, that these tools should /not/ be WYSIWYG, but WYSIWI (What You See Is What Is). These tools should be like braces, forcing the mind into a specific mindset and making the structure of a document explicit and visible. Pressing "Enter" should not implicitly start a new paragraph, but the tool should require that paragraphs be explicitly identified. Whenever a span of text is marked as italic, the tool should bring up a dialog asking /why/ the text should be italicized. Is it emphasized/stressed? Is it a foreign word or phrase? Is it a title? Is it simply intended to be an alternate font face? A good tool would make the user confront these issues at every step of the way, until the understanding is automatic.
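The "why is it italic?" dialog could be backed by something as small as the following Python sketch. The reason names, the mapping to elements, and the function mark_italic are all hypothetical; the choice of em, cite and i is one plausible HTML-style mapping, not a recommendation from this thread.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical answers the dialog might offer, each mapped to an element
# and default attributes. The mapping is an assumption for illustration.
ITALIC_REASONS = {
    "emphasis": ("em", {}),
    "foreign": ("i", {"class": "foreign"}),
    "title": ("cite", {}),
    "font": ("i", {}),
}


def mark_italic(text, reason, lang=None):
    """Wrap text in markup chosen by WHY it is italic, not by how it looks."""
    if reason not in ITALIC_REASONS:
        # Refusing to guess is the whole point: the user must decide.
        raise ValueError("unknown reason for italics: %r" % reason)
    tag, defaults = ITALIC_REASONS[reason]
    attrs = dict(defaults)
    if reason == "foreign" and lang:
        attrs["xml:lang"] = lang
    attr_text = "".join(
        " %s=%s" % (name, quoteattr(value)) for name, value in attrs.items()
    )
    return "<%s%s>%s</%s>" % (tag, attr_text, escape(text), tag)


title_markup = mark_italic("Moby-Dick", "title")
stress_markup = mark_italic("never", "emphasis")
foreign_markup = mark_italic("Zeitgeist", "foreign", lang="de")
```

All four reasons would typically render the same way on screen, yet only the recorded reason lets a later transformation treat a title differently from a stressed word.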
Machines cannot capture semantic yet. (And when they do, Google's automatic output will surpass DP's human output not only in quantity but also in quality, thus making DP obsolete.)
DP should have educated their processors about semantic markup. They have failed this in the same way they have soundly slept through the technological changes of the last 5 years. (At least I wasn't able to find a single FAQ about semantic markup at DP and DP's output doesn't look like they are getting it.)
I fear that at DP the people who understand the concept of semantic markup are vastly outnumbered by those who do not and cannot, and I fear that those who do not understand the distinction cannot be made to understand it through rational discourse. An alternative might be to put together a group of people who /do/ understand the distinction. It ought to be possible at this point to extract files from DP before they have been degraded for PG use (just before Post-Processing?) and store them in a new repository where they can be Post-Processed by the semantic volunteers. Project Gutenberg may not want these new files, but I'm sure Internet Archive would store them for us. And Mr. Newby might be willing to provide some online storage and a web interface to access them.
Until the average person at DP can tell a paragraph from a non-paragraph, every discussion about formats and tools is moot.
Besides, what is a chapter title? It, too, is a paragraph, in most cases!
No it is not. You are thinking presentationally, not semantically. A paragraph is "one or more complete sentences, usually devoted to one idea and usually marked by the beginning of a new line, indentation, or increased interlinear space." A chapter is "a division of a written work, especially a narrative, usually titled or numbered." A title is "a descriptive name, caption, or heading of a section of a book." A chapter title is no more a paragraph than the phrase "Keith J. Schultz" is a paragraph, and anyone who hopes to engage in semantic markup must understand this. Cheers, Lee

On 28.02.2011 at 17:50, Lee Passey wrote:
On Mon, February 28, 2011 12:46 am, Keith J. Schultz wrote:
On 27.02.2011 at 18:54, Marcello Perathoner wrote:
On 02/27/2011 09:34 AM, don kretz wrote:
Certainly not one with the sophistication to enable them to use the breadth of markup required to even edit the poor meagre subset of syntactical information (not even chapters) incorporated into the elegant products coming from DP.
ROTFL. They may be elegant but they are non-functional. They work on desktop-sized screens only, for suitably small values of 'work': Try to narrow your browser window to the typical 5-6 words per line of a mobile phone. Breakage galore! And, yes!, a substantial portion of PG downloads go to mobile phones.
Good point! BUT that is because they are not developing towards smaller screen sizes. WHICH they should be doing if targeting the epub format!
I disagree. Production of e-books should be a two-step process. First, the book should be marked up in a semantic way which preserves, to the greatest extent possible, the structure and metadata of the book and which does so in a machine-readable format (markup should be unique, explicit and unambiguous). Then, a computer process should be invoked which can transform the semantic markup into whatever presentation is required.
First, evidently, not all understand what semantic markup is. All talk about semantic markup, which is text semantics and is actually syntactic, or semio-syntactic. Tags such as chapter, verse, etc. are syntactic by nature. Or, for the less involved: they describe the semantics through the syntax. [snip, snip]
I am of two minds on this subject. I am nowhere near as pessimistic as Mr. Perathoner about the ability of humans to learn new techniques and paradigms. And yet, there is something about this presentation/semantic dichotomy that seems to go much deeper than just training.
You do not need to train them in a new paradigm. Give them a tool that does not let them go outside of the paradigm. Please, do not confuse actually authoring a new text/book in TeX, TEI or XHTML with grasping the structure of a text or book. The latter is simpler than most would think. [snip, snip]
Besides, what is a Chapter title? It, too is a paragraph, in most cases!
No it is not. You are thinking presentationally, not semantically. A paragraph is "one or more complete sentences, usually devoted to one idea and usually marked by the beginning of a new line, indentation, or increased interlinear space."
That is exactly what a chapter title/header is: one sentence, though the content is highly compressed!
A chapter is "a division of a written work, especially a narrative, usually titled or numbered." A title is "a descriptive name, caption, or heading of a section of a book."
O.K. This IS a bit harder than Marcello's argument! It describes very well the nature of a chapter title. Yet, semantically, it is on the same level as a paragraph. You have given the pragmatics; I cannot refute you here. See my other post for clarification.
A chapter title is no more a paragraph than the phrase "Keith J. Schultz" is a paragraph, and anyone who hopes to engage in semantic markup must understand this.
You are right in that Keith J. Schultz is not a paragraph. Not even an open book! But do you know how to semantically mark up "Keith J. Schultz"?
regards Keith

The marketplace disagrees. If commercial companies can make it work, why can't DP?
From what I've seen purchasing commercial e-books, the commercial companies are not doing much better [any better] on these issues than PG/DP. Many "commercial" formatters have a weaker grasp on the issues than the PG/DP volunteers, and coming from a paper background, and using paper-oriented development tools, come to the table with their own sets of prejudices and misunderstandings of the issues. What is particularly sad is to see commercial firms spend great time, money and effort to "make it look just like the paper edition" -- with predictably catastrophic results.

On 25.02.2011 at 04:09, David Starner wrote:
On Thu, Feb 24, 2011 at 6:31 PM, John Redmond <john_redmond@optusnet.com.au> wrote:
3. Given that we share a primary interest in books (text), it makes sense that the XML should conform to the XHTML doctype.
But XHTML sucks for books. There's no sidenote/footnote/endnote markups, there's no titlepage mark-up (which would make title and author automatically readable in most cases), etc.
It depends how you look at XHTML. If you take XHTML proper, it can be extended to accommodate these features. But most see XHTML as just a wrapper around HTML, which it is not!
As any user of LaTeX will know, it is just a matter of hand-polishing the intermediate LaTeX files before the final conversion to PDFs.
It's never "just" a matter of hand-polishing; that's a serious flaw. Especially as it would have to be done for every separate size of PDF.
Using LaTeX (or better, XeLaTeX) as an intermediate for the PDF is nice, but it introduces some assumptions that cause the hand-polishing. Most of the "polishing" could be avoided by crafting a class that is aimed at the XML -> LaTeX -> PDF process. A better approach would be a tool specifically designed to create the PDF from the XML itself, using XSLTs for added control.
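For readers who have not seen an XML -> LaTeX step, here is a rough Python sketch. It deliberately swaps in a plain tree walk with the standard library instead of XSLT, the element names are assumptions, and \documentclass{book} merely stands in for the purpose-built class suggested above.

```python
import xml.etree.ElementTree as ET

# Characters that LaTeX treats specially, mapped to safe replacements.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def tex_escape(text):
    """Escape one character at a time so nothing is escaped twice."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def to_latex(book):
    """Walk a hypothetical <book>/<chapter>/<title>/<p> tree and emit LaTeX."""
    out = [r"\documentclass{book}", r"\begin{document}"]
    for chapter in book.findall("chapter"):
        out.append(r"\chapter{%s}" % tex_escape(chapter.findtext("title", "")))
        for p in chapter.findall("p"):
            out.append("")  # a blank line separates paragraphs in LaTeX
            out.append(tex_escape(p.text or ""))
    out.append(r"\end{document}")
    return "\n".join(out)


SAMPLE = (
    "<book><chapter><title>Profit &amp; Loss</title>"
    "<p>Gains of 5% were typical.</p></chapter></book>"
)
latex = to_latex(ET.fromstring(SAMPLE))
```

Page-break and layout decisions are left entirely to the document class here, which is where the hand-polishing argued about above would either be avoided or creep back in.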
SO, if PG were to establish and maintain canonical XML texts, would it not be future-proofed?
XML is not magic; many blobs of XML are opaque to any but their generators. HTML will probably be around for forever, but as I said above, it's suboptimal for books.
Which version of HTML are you talking about? XML is used for a lot of things. I do not think it will die as soon as you think. Also, XML is text and structured, so it has the potential to be used for just as long as HTML!
regards Keith.

Hi John,
Excuse me, but do you really know what XML is? I can only assume you do not. I do agree that XML can be used as the basis for a master format.
On 25.02.2011 at 03:31, John Redmond wrote:
I am reassured to discover that I am not the only one turned off by the tedious and reiterative banter of the Usual Suspects. Let me state my views:
1. XML should be at the centre of any text management (doh).
XML is good for storage in a structured form. But for text management? Then again, what do you mean by text management?
2. There should be one or more markup schemes for simple embellishment of text to trivialise conversion of existing text to XML. One of these MAY be zml, or my dottedfile format--or any other system that others prefer. The only requirement should be the easy conversion to good XML.
Easy! Express your markup scheme in terms of XML! XML is not a particular markup scheme. You write the "scheme" in XML. Though the term scheme is used differently in XML.
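As a concrete illustration of "easy conversion to good XML", here is a small Python sketch. The light markup it reads is a toy invented for this example (blank-line-separated blocks, with "# " marking a chapter title); it is not zml or the dottedfile format, and the output vocabulary is likewise an assumption.

```python
import re
import xml.etree.ElementTree as ET


def light_to_xml(text):
    """Convert a toy light markup into an XML tree.

    Assumed conventions (hypothetical): blocks are separated by blank
    lines; a block starting with '# ' is a chapter title; every other
    block is a paragraph of the current chapter.
    """
    book = ET.Element("book")
    chapter = None
    for raw in re.split(r"\n\s*\n", text.strip()):
        block = " ".join(raw.split())  # join hard-wrapped lines
        if not block:
            continue
        if block.startswith("# "):
            chapter = ET.SubElement(book, "chapter")
            ET.SubElement(chapter, "title").text = block[2:]
        else:
            if chapter is None:  # text before the first title
                chapter = ET.SubElement(book, "chapter")
            ET.SubElement(chapter, "p").text = block
    return book


SOURCE = "# Chapter One\n\nIt was a dark\nand stormy night.\n\nThe rain fell."
xml_out = ET.tostring(light_to_xml(SOURCE), encoding="unicode")
```

Note that the title and the paragraphs end up as different elements even though both were plain blocks of text in the source; the converter has to be told which is which.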
3. Given that we share a primary interest in books (text), it makes sense that the XML should conform to the XHTML doctype.
WHY? For example, epub does not require any adherence to XHTML, just to XML version 1.0 !!!
4. From that point on, (X)HTML and HTML/ePub are trivial to generate, while LaTeX and PDF come from XSLT--which is surprisingly easy.
For that matter, with any other markup an XSLT can be used to get to XHTML.
regards Keith.

Reply to Keith Schultz:
Starting near the end of your response, Keith:
3. Given that we share a primary interest in books (text), it makes sense that the XML should conform to the XHTML doctype.
WHY? For example, epub does not require any adherence to XHTML, just to XML version 1.0 !!!
Wrong. XHTML IS specified! Besides HTML is a backwater and the W3C would like it to disappear.
2. Easy! Express your markup scheme in terms of XML! XML is not a particular markup scheme. You write the "scheme" in XML. Though the term scheme is used differently in XML.
But XHTML _IS_ XML, specialised for text, so why not use it.
4. For that matter, with any other markup an XSLT can be used to get to XHTML.
I think that you are agreeing with me. And note that one option is to back-convert to prettified text, or even zml. And the XML can be used to generate other content, like a TOC and endnotes.
Regards, John
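Generating a TOC from the XML, as mentioned above, might look roughly like this in Python. The chapter/title vocabulary and the id attributes are assumptions for the sake of the example, and build_toc is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical master text; element names and ids are illustrative only.
SAMPLE = """<book>
  <chapter id="c1"><title>Loomings</title><p>Call me Ishmael.</p></chapter>
  <chapter id="c2"><title>The Carpet-Bag</title><p>I stuffed a shirt or two.</p></chapter>
</book>"""


def build_toc(book):
    """Derive an ordered list of links, one per chapter, from the tree."""
    toc = ET.Element("ol")
    for chapter in book.findall("chapter"):
        item = ET.SubElement(toc, "li")
        link = ET.SubElement(item, "a", href="#" + chapter.get("id", ""))
        link.text = chapter.findtext("title", "")
    return toc


toc_xml = ET.tostring(build_toc(ET.fromstring(SAMPLE)), encoding="unicode")
```

Because the TOC is derived rather than typed, renaming or reordering chapters in the master file cannot leave it stale.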

That's not the way I read it. http://www.w3.org/2009/06/xhtml-faq.html One of the problems is that XHTML requires "well-formed XML", and many books simply are not structured in a rigorously well-formed way. But in any case it seems to be the common view that it simply hasn't gained useful traction and now it's just overhead. Don
Wrong. XHTML IS specified! Besides HTML is a backwater and
the W3C would like it to disappear.
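Since "well-formed XML" is the requirement under discussion, a quick way to test it with the Python standard library is sketched below. The helper name is_well_formed and the two sample strings are made up for illustration; the check applies to the markup file, whatever one thinks of the structure of the book itself.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Properly nested: <em> opens and closes inside <p>.
good = "<p>Some <em>emphasised</em> text.</p>"
# Overlapping elements: <em> is still open when </p> arrives.
bad = "<p>Some <em>emphasised</p> text.</em>"

good_ok = is_well_formed(good)
bad_ok = is_well_formed(bad)
```

Well-formedness says nothing about whether the markup is valid XHTML or semantically sensible; it only means every element is closed and properly nested.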

On 26.02.2011 at 04:53, dakretz@gmail.com wrote:
That's not the way I read it.
http://www.w3.org/2009/06/xhtml-faq.html
One of the problems is that XHTML requires "well-formed XML", and many books simply are not structured in a rigorously well-formed way.
That is why many things do not work: because authors/users refuse to adhere to the rules.

Very witty. It's the damn authors. Someone should have tried harder to keep e.g. Shakespeare on the tracks.
On Sat, Feb 26, 2011 at 12:17 PM, Keith J. Schultz <schultzk@uni-trier.de> wrote:
On 26.02.2011 at 04:53, dakretz@gmail.com wrote:
That's not the way I read it.
http://www.w3.org/2009/06/xhtml-faq.html
One of the problems is that XHTML requires "well-formed XML", and many books simply are not structured in a rigorously well-formed way.
That is why many things do not work: because authors/users refuse to adhere to the rules.
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Excuse me! What does Shakespeare have to do with it? It is the authors of the HTML versions who do not adhere to the STANDARD! Please do not tell me that they do. Just because a browser displays a page well does not mean that the page adheres to the standard! Just so that you know, Shakespeare has nothing to do with it! Shakespeare did NOT write the folios! So we cannot blame him!
regards Keith.
On 27.02.2011 at 10:04, don kretz wrote:
Very witty. It's the damn authors. Someone should have tried harder to keep e.g. Shakespeare on the tracks.
On Sat, Feb 26, 2011 at 12:17 PM, Keith J. Schultz <schultzk@uni-trier.de> wrote:
On 26.02.2011 at 04:53, dakretz@gmail.com wrote:
That's not the way I read it.
http://www.w3.org/2009/06/xhtml-faq.html
One of the problems is that XHTML requires "well-formed XML", and many books simply are not structured in a rigorously well-formed way.
That is why many things do not work: because authors/users refuse to adhere to the rules.

Very witty. It's the damn authors. Someone should have tried harder to keep e.g. Shakespeare on the tracks.
FWIW most of us volunteering to help create PG books are not "authors". What we are is some sort of transcriber who is trying on some level to understand the intent of the original author/publisher, so that we can transcribe to a "modern" format while respecting the intent of the original author/publisher -- where it is *important* to respect the intent of the original author/publisher, and equally importantly so that we can *ignore* details of the original paper layout where those details of the original layout were *not* important to the intent of the original author/publisher. Of course, beyond title, chapter title, and paragraph, there's probably never going to be 100% agreement between two transcribers about what the "original intent" *was* of the original author/publisher. The real problem lies when one scratches one's head trying to figure out the "original intent" of the author/publisher. In that case one ends up more-or-less transcribing the paper page literally -- because what else can one do? -- and modern display devices seldom are very friendly when it comes to displaying a page transcribed "literally."

Shudder, Shudder, Shudder! I have finally come to the opinion that all of you have NO IDEA what you are doing! You do not need the actual intent of the original publisher. You only need to emulate the outer visual representation. How many readers of poetry read it out loud to an audience? Do most really recognize a pentameter, etc.? As a PROGRAMMER I realize there is an outer representation and an inner structure. More often than not, the inner structure has little to do with the outer representation. But I can manipulate and calculate what is desired. In other words, forget about intent when encoding the text; it is only important that the READER can construe it themselves. After all, a printed book is a piece of paper with pictures on it! Letters and words are pictures, you know!
regards Keith.
On 28.02.2011 at 00:56, Jim Adcock wrote:
Very witty. It's the damn authors. Someone should have tried harder to keep e.g. Shakespeare on the tracks.
FWIW most of us volunteering to help create PG books are not "authors". What we are is some sort of transcriber who is trying on some level to understand the intent of the original author/publisher, so that we can transcribe to a "modern" format while respecting the intent of the original author/publisher -- where it is *important* to respect the intent of the original author/publisher, and equally importantly so that we can *ignore* details of the original paper layout where those details of the original layout were *not* important to the intent of the original author/publisher. Of course, beyond title, chapter title, and paragraph, there's probably never going to be 100% agreement between two transcribers about what the "original intent" *was* of the original author/publisher.
The real problem lies when one scratches one's head trying to figure out the "original intent" of the author/publisher. In that case one ends up more-or-less transcribing the paper page literally -- because what else can one do? -- and modern display devices seldom are very friendly when it comes to displaying a page transcribed "literally."

Hi John,
On 26.02.2011 at 00:22, John Redmond wrote:
Reply to Keith Schultz:
Starting near the end of your response, Keith:
3. Given that we share a primary interest in books (text), it makes sense that the XML should conform to the XHTML doctype.
WHY? For example, epub does not require any adherence to XHTML, just to XML version 1.0 !!!
Wrong. XHTML IS specified! Besides HTML is a backwater and the W3C would like it to disappear.
XHTML is mentioned (specified), but it is not a requirement. Read the specification!
2. Easy! Express your markup scheme in terms of XML! XML is not a particular markup scheme. You write the "scheme" in XML. Though the term scheme is used differently in XML.
But XHTML _IS_ XML, specialised for text, so why not use it.
You are not getting my point. Does not matter.
4. For that matter, with any other markup an XSLT can be used to get to XHTML.
I think that you are agreeing with me. And note that one option is to back-convert to prettified text, or even zml. And the XML can be used to generate other content, like a TOC and endnotes.
I do not see the point in using a master format and then converting to another format that is not one for an intended reader device!
regards Keith.

On 2/26/2011 1:13 PM, Keith J. Schultz wrote:
Hi John,
On 26.02.2011 at 00:22, John Redmond wrote:
[snip]
I think that you are agreeing with me. And note that one option is to back-convert to prettified text, or even zml. And the XML can be used to generate other content, like a TOC and endnotes.
I do not see the point in using a master format and then converting to another format that is not one for an intended reader device!
I agree, but in this case the proposed conversion /is/ for an intended reader device. BowerBird has promised us reader software (technically called a 'User Agent') which will use z.m.l. as a native format Real Soon Now, and prettified text is the native format for those using the DEC VT-100 "smart terminal."
I think that everyone who is in favor of a master format would agree that transformations to Yet Another Master Format are wasted effort -- which is perhaps one of the reasons that there is so much acrimonious debate over what that master format should be.

Hi Lee,
True, an encoding for a master format does not need to be created! JUST a set of conventions that must be adhered to in order to produce the desired output. ZML and RST are that restrictive by nature, as they do not contain that much extra information! Yet they do cover many aspects which are requested. Heavy markup, on the other hand, is so broad that it is easy to use something that does not convert well to different devices. That is why I propose using a heavy-markup-like language to facilitate the task.
regards Keith
On 28.02.2011 at 01:06, Lee Passey wrote:
On 2/26/2011 1:13 PM, Keith J. Schultz wrote:
Hi John,
On 26.02.2011 at 00:22, John Redmond wrote:
[snip]
I think that you are agreeing with me. And note that one option is to back-convert to prettified text, or even zml. And the XML can be used to generate other content, like a TOC and endnotes.
I do not see the point in using a master format and then converting to another format that is not one for an intended reader device!
I agree, but in this case the proposed conversion /is/ for an intended reader device. BowerBird has promised us reader software (technically called a 'User Agent') which will use z.m.l. as a native format Real Soon Now, and prettified text is the native format for those using the DEC VT-100 "smart terminal."
I think that everyone who is in favor of a master format would agree that transformations to Yet Another Master Format are wasted effort -- which is perhaps one of the reasons that there is so much acrimonious debate over what that master format should be.

On 2/25/2011 12:08 AM, Keith J. Schultz wrote:
3. Given that we share a primary interest in books (text), it makes sense that the XML should conform to the XHTML doctype.
WHY? For example, epub does not require any adherence to XHTML, just to XML version 1.0 !!!
See: http://idpf.org/epub/20/spec/OPS_2.0.1_draft.htm#TOC2.0
"This specification no longer creates its own subset of XHTML 1.1, but instead references entire XHTML modules, as described in XHTML Modularization 1.1"
The spec goes on to also accept DTBook (which is also based on XML 1.0) as a "Preferred Vocabulary," but I think that was included primarily for political reasons; to my knowledge no-one is producing ePubs using DTBook, so for all practical purposes ePub /is/ XHTML.
The ePub 3.0 draft document indicates the desire to adopt HTML5, /in toto/ and specifies that HTML5 must be expressed in XML syntax (http://idpf.org/epub/30/spec/epub30-contentdocs.html#sec-contentdocs).
If you want to do ePub, you must do XML.

Hi Lee,
Just read a little further! XML Islands! So it is in the standard!
regards Keith.
On 26.02.2011 at 22:43, Lee Passey wrote:
On 2/25/2011 12:08 AM, Keith J. Schultz wrote:
3. Given that we share a primary interest in books (text), it makes sense that the XML should conform to the XHTML doctype.
WHY? For example, epub does not require any adherence to XHTML, just to XML version 1.0 !!!
See: http://idpf.org/epub/20/spec/OPS_2.0.1_draft.htm#TOC2.0
"This specification no longer creates its own subset of XHTML 1.1, but instead references entire XHTML modules, as described in XHTML Modularization 1.1"
The spec goes on to also accept DTBook (which is also based on XML 1.0) as a "Preferred Vocabulary," but I think that was included primarily for political reasons; to my knowledge no-one is producing ePubs using DTBook, so for all practical purposes ePub /is/ XHTML.
The ePub 3.0 draft document indicates the desire to adopt HTML5, /in toto/ and specifies that HTML5 must be expressed in XML syntax (http://idpf.org/epub/30/spec/epub30-contentdocs.html#sec-contentdocs).
If you want to do ePub, you must do XML.

I am not trying to patronise--and I know full well that none of this is new. But I want to emphasise how practicable the approach is. I refer to my site (http://www.limpidsoft.com), which now has 60 books derived from PG source, generated in my spare time over the course of a few weeks.
Suggest you try a greater variety of book formatting challenges, including books with poetry, and/or images, and/or initial letters.
participants (13)
- a@aboq.org
- dakretz@gmail.com
- David Starner
- don kretz
- Jim Adcock
- John Redmond
- Karl Eichwalder
- Keith J. Schultz
- Lee Passey
- Marcello Perathoner
- Michael S. Hart
- traverso@posso.dm.unipi.it
- Walt Farrell