Re: why the plain-text format is the most useful format for eliciting beauty (and more)

jim said:
Leading and following underscores are not plain text.
sure they are. indeed, the underscore even falls in the 7-bit range. so it's as plain-text as plain-text can be, and it has a long and glorious tradition of indicating emphasis.
It is an encoding to signal to the reader that something is missing -- namely italics.
actually, i think of it as an indicator to the "rendering agent" -- a.k.a. the viewer-program -- that the surrounded text is to be displayed with emphasis. (which generally means italics.)
One could have just as well -- or as badly -- used [i]and[/i] as the signals to indicate to the reader that italics is missing.
i used square-brackets rather than angle-brackets in the quote, but i could have used angle-brackets just like you did, jim... and yes, sir, any of those will work. indeed, .html uses the angle-brackets, and many bulletin-board systems use the square-brackets. and this is fine, because they use those brackets as _markup_, with no intention that the brackets will actually be _seen_ by any human beings. and likewise, i don't intend my underscores to be seen by human beings. just like .html, or forum markup, i expect that a viewer-app will intercede and display the emphasis just as i had intended. however -- and this is a very big _however_ -- in the case that those underscores _are_ being seen by actual human beings, it's not really all that much of a problem, because underscores are relatively non-intrusive, and they seem to provide emphasis, which is why they developed -- spontaneously -- for that purpose. the brackets, on the other hand, are terribly intrusive, and only obliterate the text to be emphasized, rather than emphasize it. likewise, the other bracket commands all serve as _obstacles_ to a human being who happens to be reading the text, and even to those human beings who have to work with the text in other capacities, such as editing it. z.m.l., on the other hand, is zen. that's why light-markup systems are taking over the world now.
I don't doubt that eventually the reader can get used to what they're missing -- but why should they have to?
they shouldn't. that's why i have programmed the viewer-apps that ensure that people don't have to read z.m.l. in its raw form.
If it were really that hard to much more closely follow author's intent then I could understand the trade-offs. But with today's technology it really wouldn't be hard to do much better.
i agree. we can do much better than what we have been handed. -bowerbird
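A minimal sketch of what such a "rendering agent" might do with underscore emphasis. This is not the actual z.m.l. viewer code; the function name `render_emphasis` and the word-boundary rule are assumptions made for illustration only.

```python
import html
import re

# Assumed rule (not taken from any z.m.l. spec): an underscore pair counts as
# emphasis only when it opens at a word start and closes at a word end, so
# identifiers such as snake_case_name are left alone.
EMPHASIS = re.compile(r"(?<!\w)_(?=\S)(.+?)(?<=\S)_(?!\w)")


def render_emphasis(text: str) -> str:
    """Escape the text for HTML, then turn _word_ spans into <i>word</i>."""
    escaped = html.escape(text, quote=False)
    return EMPHASIS.sub(r"<i>\1</i>", escaped)


if __name__ == "__main__":
    print(render_emphasis("and this is a very big _however_ indeed"))
```

If no viewer intercedes, the reader still sees the underscores in place, which is the fallback behaviour argued for above.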

jim said:
Leading and following underscores are not plain text.
sure they are. indeed, the underscore even falls in the 7-bit range. so it's as plain-text as plain-text can be, and it has a long and glorious tradition of indicating emphasis.
There are many long and inglorious methods of indicating emphasis in plain-text including *asterix* and SHOUTING and _underscore_ and <i></i> and [i][/i] and they all suffer from the same problem: They are all not what the author wrote, at least not as implemented by the typically concurrently existing publisher.
Now say 100 years later PG says ignore those previous efforts we as the publisher of this day knows better than the original intent so we will substitute something else for what was actually printed.
Now if someone really only has a 7-bit teletype to print their PG on, then I can understand this. I can also understand PG's desire to continue to support such teletypists [[I tried using one when I was in college which tells you how old I am but it kept overheating and burning out based on my demands]]
What I don't understand is why PG continues to be wedded to plain-text as an *input* encoding format demanded of people submitting texts to PG. Plain-text is too constrained to do the job well. HTML is too ambiguous, and too ill-matched to books to do well. We need something else, something that CAN be correctly and automagically converted "correctly" to one or another formats including plain-text, and Unicode, and HTML, and mobi, etc. And something that allows the simple every day tasks of the encoder, including italics and m-dash and poetry, titles and chapters and subchapters, publisher info, dates, etc to be handled correctly and easily.
PS: Bit curious which blind reader handles _the underscore "convention"_ correctly - I've not seen _that_ one!

James Adcock wrote:
There are many long and inglorious methods of indicating emphasis in plain-text including **asterix** and SHOUTING and _/underscore/_ and <i></i> and [i][/i] and they all suffer from the same problem: They are all not what the author wrote, at least not as implemented by the typically concurrently existing publisher.
No author wrote italics before word processors became available to the end user. They _underlined_ the passages that they wanted the publisher to highlight. The publisher then chose an appropriate way of highlighting: /italics/ or s p a c e o u t or SMALLCAPS. Mediaeval copyists usually rubricated passages they wished to highlight.
Now say 100 years later PG says ignore those previous efforts we as the publisher of this day knows better than the original intent so we will substitute something else for what was actually printed.
So what? The brick-and-mortar publishers of yore ignored the previous efforts of the monastic scribes because it was too expensive to print twice with different inks. They also ignored the underlining of the author and substituted an artifact of their choosing. Also that artifact was largely a function of the cultural environment: italics or spaceout.
What I don’t understand is why PG continues to be wedded to plain-text as an **input** encoding format demanded of people submitting texts to PG.
Nobody understands that. It is a waste of resources pure and simple.
Consider that:
* The bottleneck at DP is the post-processing stage.
* The post-processor is burdened with the creation of one surplus txt file.
* The whitewasher is burdened with one or more surplus txt files.
* Every error needs to be fixed in more than one place (in html and up to three txt files, plus as many zips)
* We could easily produce a (good enough) txt version from html on the fly with lynx in any encoding the user may want.
-- Marcello Perathoner webmaster@gutenberg.org

The final output from DP is a text. This is processed through Guiguts. Most of the Post Processors in DP use Guiguts for post processing. The html is generated from this text file. So no additional work is involved in producing a text file. Again there is no additional work in White Washing because of the text file.
On Sat, Sep 12, 2009 at 3:17 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
James Adcock wrote:
There are many long and inglorious methods of indicating emphasis in
plain-text including **asterix** and SHOUTING and _/underscore/_ and <i></i> and [i][/i] and they all suffer from the same problem: They are all not what the author wrote, at least not as implemented by the typically concurrently existing publisher.
No author wrote italics before word processors became available to the end user. They _underlined_ the passages that they wanted the publisher to highlight. The publisher then chose an appropriate way of highlighting: /italics/ or s p a c e o u t or SMALLCAPS.
Mediaeval copyists usually rubricated passages they wished to highlight.
Now say 100 years later PG says ignore those previous efforts we as the
publisher of this day knows better than the original intent so we will substitute something else for what was actually printed.
So what? The brick-and-mortar publishers of yore ignored the previous efforts of the monastic scribes because it was too expensive to print twice with different inks.
They also ignored the underlining of the author and substituted an artifact of their choosing. Also that artifact was largely a function of the cultural environment: italics or spaceout.
What I don’t understand is why PG continues to be wedded to plain-text as
an **input** encoding format demanded of people submitting texts to PG.
Nobody understands that. It is a waste of resources pure and simple.
Consider that:
* The bottleneck at DP is the post-processing stage.
* The post-processor is burdened with the creation of one surplus txt file.
* The whitewasher is burdened with one or more surplus txt files.
* Every error needs to be fixed in more than one place (in html and up to three txt files, plus as many zips)
* We could easily produce a (good enough) txt version from html on the fly with lynx in any encoding the user may want.
-- Marcello Perathoner webmaster@gutenberg.org _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d
-- Sankar Service to Humanity is Service to God

Sankar Viswanathan wrote:
The final output from DP is a text. This is processed through Guiguts. Most of the Post Processors in DP use Guiguts for post processing. The html is generated from this text file.
If this is true it's all the more waste. If you output a text file from the OCR and later use a human to re-create HTML this is more work than letting the OCR output the HTML directly. And all this crooked workflow is needed because PG requires a txt file for hysterical reasons. No wonder Google is eating our lunch ... they know how to put software to work instead of people.
So no additional work is involved in producing a text file.
Nice sophism. Additional work is required to produce the HTML file. So what?
Again there is no additional work in White Washing because of the text file.
I don't believe you. Working 2 files (3, maybe 4) IS more work than working one file. Even if you just open the file to see if it is the right one, it's work.
-- Marcello Perathoner webmaster@gutenberg.org

Most of the post processors in DP depend on Guiguts for post processing. More than 80% of the texts have been produced by using Guiguts. But for the availability of the Guiguts program, many of the post processors would never have ventured to post process. The Guiguts program was written for the specific purpose of post processing DP books. It is well supported with additional programs like Gutcheck and Jeebies. Guiguts generates the html from the text automatically. Guiguts has been written taking into account the DP process. Most post processors in DP are not technical people.
Again the question is what do the users want? I am talking about people who download books from PG and not producers of other formats. Most of the users download text files. Just to quote an example, the text-only format of Alice in Wonderland is downloaded more often than the illustrated html version. The text version is the LCM. Do we have statistics about downloading of html and text versions? I am sure most users download the text version. So even if we have put in additional effort to produce a text version it is justified.
Do we have any feedback from the actual users? Letters from users who submit detailed Errata show that the text files are being used for teaching school children in the remote areas of the U.S. These are the people who make the effort worthwhile. Maybe it also benefits people who are still on Dial Up.
Plain text can be read in any computer. HTML? With all the quirks of IE6 and other browsers it is not easy to produce html which will render perfectly in all the browsers.
The earlier discussion was about whether an ASCII text is necessary. DP does produce TEI text. But there are very few post processors who can do TEI format. The main reason is the absence of software like Guiguts.
On Sat, Sep 12, 2009 at 5:34 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
Sankar Viswanathan wrote:
The final output from DP is a text. This is processed through Guiguts.
Most of the Post Processors in DP use Guiguts for post processing. The html is generated from this text file.
If this is true it's all the more waste.
If you output a text file from the OCR and later use a human to re-create HTML this is more work than letting the OCR output the HTML directly.
And all this crooked workflow is needed because PG requires a txt file for hysterical reasons.
No wonder Google is eating our lunch ... they know how to put software to work instead of people.
So no additional work is involved in producing a text file.
Nice sophism. Additional work is required to produce the HTML file. So what?
Again there is no additional work in White Washing because of the text
file.
I don't believe you.
Working 2 files (3, maybe 4) IS more work than working one file. Even if you just open the file to see if it is the right one, it's work.
-- Marcello Perathoner webmaster@gutenberg.org
-- Sankar Service to Humanity is Service to God

Sankar Viswanathan wrote:
Most of the post processors in D.P depend on Guiguts for post processing. More than 80% of the texts have been produced by using Guiguts. But for the availability of the Guiguts program many of the post processors would have never ventured to post process.
That's the more water to my mill. You need a custom program to proof the txt file while any old editor can proof html.
The Guiguts program has been written for the specific purpose of post processing of DP books. It is well supported with additional programs like Gutcheck and Jeebies.
Bad enough that a special program had to be written while many free editors excel at doing html.
Guiguts generates the html from the text automatically. Guiguts has been written taking into account the DP process.
Yeah, for suitably small values of `HTML´. I installed guiguts and downloaded Hamlet #1524. Then I pushed the 'Autogenerate HTML' button in guiguts. This is part of what I got:
<p>Ham. To be, or not to be,—that is the question:— Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune Or to take arms against a sea of troubles, And by opposing end them?—To die,—to sleep,— No more; and by a sleep to say we end The heartache, and the thousand natural shocks That flesh is heir to,—'tis a consummation Devoutly to be wish'd. To die,—to sleep;— To sleep! perchance to dream:—ay, there's the rub; For in that sleep of death what dreams may come, When we have shuffled off this mortal coil, Must give us pause: there's the respect That makes calamity of so long life; For who would bear the whips and scorns of time, The oppressor's wrong, the proud man's contumely, The pangs of despis'd love, the law's delay, The insolence of office, and the spurns That patient merit of the unworthy takes, When he himself might his quietus make With a bare bodkin? who would these fardels bear, To grunt and sweat under a weary life, But that the dread of something after death,— The undiscover'd country, from whose bourn No traveller returns,—puzzles the will, And makes us rather bear those ills we have Than fly to others that we know not of? Thus conscience does make cowards of us all; And thus the native hue of resolution Is sicklied o'er with the pale cast of thought; And enterprises of great pith and moment, With this regard, their currents turn awry, And lose the name of action.—Soft you now! The fair Ophelia!—Nymph, in thy orisons Be all my sins remember'd.</p>
guiguts takes its place in the long file of products and services that tried to make something of PG plain text and failed. Mind you, I'm not saying that guiguts is a bad program, I'm saying that it is impossible to recover the formatting once a text has been dumbed down to PG plain text.
Again the question is what do the users want?
Users want as many formats as possible to choose from.
So even if we have put in additional effort to produce a text version it is justified.
Not so. We can do that automatically with lynx --dump. lynx is free, so anybody can do that. If you produce a `smart´ version you can dumb it down with software. If you produce a `dumb´ version, it is impossible to smart it up again with software.
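To illustrate the "dumb it down with software" direction, here is a minimal sketch that flattens simple HTML to plain text with Python's standard `html.parser`. It is a stand-in for illustration, not what lynx does internally; the class name `PlainTextDump` and the choice to render `<i>`/`<em>` as underscores are assumptions.

```python
from html.parser import HTMLParser


class PlainTextDump(HTMLParser):
    """Collect text, marking <i>/<em> with underscores and <p>/<br> with newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("i", "em"):
            self.parts.append("_")
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("i", "em"):
            self.parts.append("_")
        elif tag == "p":
            self.parts.append("\n\n")  # blank line between paragraphs

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Return a plain-text rendering of a small HTML fragment."""
    parser = PlainTextDump()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    print(html_to_text("<p>To be, or <i>not</i> to be,<br>that is the question.</p>"))
```

The reverse trip is the hard part: once the `<br>` and `<i>` information is gone from the text, no parser can tell a verse line from a wrapped prose line.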
Do we have any feedback from the actual users? Letters from users who submit detailed Errata show that the text files are being used for teaching school children in the remote areas of the U.S. These are the people who make the effort worthwhile. Maybe it also benefits people who are still on Dial Up.
Why do *those* people make the effort worthwhile? Are you a bit prejudiced against better-off people? "War and Peace" is 1.18M in HTML and 1.16M in TXT. How can that benefit people on dial-up?
Plain text can be read in any computer. HTML? With all the quirks of IE6 and other browsers it is not easy to produce html which will render perfectly in all the browsers.
It is very easy indeed. Stick to the basic tags and even plucker on a cell phone will render perfectly. -- Marcello Perathoner webmaster@gutenberg.org

On Sat, Sep 12, 2009 at 11:30 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
Sankar Viswanathan wrote:
Most of the post processors in D.P depend on Guiguts for post processing.
More than 80% of the texts have been produced by using Guiguts. But for the availability of the Guiguts program many of the post processors would have never ventured to post process.
That's the more water to my mill. You need a custom program to proof the txt file while any old editor can proof html.
Guiguts processes the output of the DP proofing process. That output is neither raw text nor raw HTML. It's a mix of different markups that struggles to find a balance between unambiguous output and ease of the actual proofing process. The format is one that's relatively easy to pick up, as unobtrusive as possible to the proofing process, and one that can be fairly automatically converted to both text and html by the tools that have been designed.
I installed guiguts and downloaded Hamlet #1524. Then I pushed the 'Autogenerate HTML' button in guiguts. This is part of what I got:
<p>Ham. To be, or not to be,—that is the question:— Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune Or to take arms against a sea of troubles, And by opposing end them?—To die,—to sleep,— No more; and by a sleep to say we end The heartache, and the thousand natural shocks That flesh is heir to,—'tis a consummation Devoutly to be wish'd. To die,—to sleep;— To sleep! perchance to dream:—ay, there's the rub; For in that sleep of death what dreams may come, When we have shuffled off this mortal coil, Must give us pause: there's the respect That makes calamity of so long life; For who would bear the whips and scorns of time, The oppressor's wrong, the proud man's contumely, The pangs of despis'd love, the law's delay, The insolence of office, and the spurns That patient merit of the unworthy takes, When he himself might his quietus make With a bare bodkin? who would these fardels bear, To grunt and sweat under a weary life, But that the dread of something after death,— The undiscover'd country, from whose bourn No traveller returns,—puzzles the will, And makes us rather bear those ills we have Than fly to others that we know not of? Thus conscience does make cowards of us all; And thus the native hue of resolution Is sicklied o'er with the pale cast of thought; And enterprises of great pith and moment, With this regard, their currents turn awry, And lose the name of action.—Soft you now! The fair Ophelia!—Nymph, in thy orisons Be all my sins remember'd.</p>
Guiguts wasn't designed to convert existing texts. Its purpose is to help a DP PPer turn the output of the DP rounds into the final product seen on PG. In this case, the DP text for a piece of poetry would have had the poetry wrapped in poetry markers, signifying to Guiguts that it had to treat the block of text as non-wrappable poetry, and not just a straight paragraph of prose.
-- Marcello Perathoner webmaster@gutenberg.org
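A rough sketch of why the poetry markers matter. This is not Guiguts code: the marker strings `/P` and `P/`, the `poem` class name, and the function `blocks_to_html` are assumptions chosen for illustration. Lines inside the markers keep their breaks; everything else is rewrapped into ordinary paragraphs.

```python
import html


def blocks_to_html(text: str) -> str:
    """Convert marked-up plain text to HTML, preserving line breaks in poetry."""
    out, para, in_poem = [], [], False

    def flush():
        # Emit any pending prose lines as one rewrapped paragraph.
        if para:
            out.append("<p>" + " ".join(para) + "</p>")
            para.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if line == "/P":            # assumed "start of poetry" marker
            flush()
            in_poem = True
            out.append('<div class="poem">')
        elif line == "P/":          # assumed "end of poetry" marker
            in_poem = False
            out.append("</div>")
        elif in_poem:
            if line:
                out.append(html.escape(line, quote=False) + "<br>")
        elif line:
            para.append(html.escape(line, quote=False))
        else:
            flush()                 # blank line ends a prose paragraph
    flush()
    return "\n".join(out)


if __name__ == "__main__":
    print(blocks_to_html("/P\nTo be, or not to be,\nthat is the question:\nP/"))
```

Fed the same two lines without the markers, the sketch produces a single run-together paragraph, which is the behaviour seen in the Hamlet experiment quoted above.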

Scott Olson wrote:
Guiguts wasn't designed to convert existing texts. Its purpose is to help a DP PPer turn the output of the DP rounds into the final product seen on PG. In this case, the DP text for a piece of poetry would have had the poetry wrapped in poetry markers, signifying to Guiguts that it had to treat the block of text as non-wrappable poetry, and not just a straight paragraph of prose.
I see. I was told the output of DP was text and the html generated from it. Now I gather DP uses some sort of proprietary internal markup and can produce HTML without having to produce TXT? Am I right? -- Marcello Perathoner webmaster@gutenberg.org

On Sat, Sep 12, 2009 at 08:13:24PM +0200, Marcello Perathoner wrote:
Scott Olson wrote:
Guiguts wasn't designed to convert existing texts. Its purpose is to help a DP PPer turn the output of the DP rounds into the final product seen on PG. In this case, the DP text for a piece of poetry would have had the poetry wrapped in poetry markers, signifying to Guiguts that it had to treat the block of text as non-wrappable poetry, and not just a straight paragraph of prose.
I see. I was told the output of DP was text and the html generated from it.
Now I gather DP uses some sort of proprietary internal markup and can produce HTML without having to produce TXT? Am I right?
I'm also very interested in the answer to this question - and assuming the answer is "Yes, DP has an internal format that is used before the final .txt is rendered to PG", my follow-up question becomes "How can I get access to the ebooks in this 'internal markup' format?" I'm hoping dearly that the answer isn't "We throw it away when we're done..."

Probably best to email dphelp_AT_pgdp.net with your question.
Al
----- Original Message -----
From: "Joey Smith" <joey@joeysmith.com>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org>
Sent: Wednesday, December 16, 2009 6:46 PM
Subject: [gutvol-d] DP's "internal markup" (was Re: why the plain-text format is the most useful)
On Sat, Sep 12, 2009 at 08:13:24PM +0200, Marcello Perathoner wrote:
Scott Olson wrote:
Guiguts wasn't designed to convert existing texts. Its purpose is to help a DP PPer turn the output of the DP rounds into the final product seen on PG. In this case, the DP text for a piece of poetry would have had the poetry wrapped in poetry markers, signifying to Guiguts that it had to treat the block of text as non-wrappable poetry, and not just a straight paragraph of prose.
I see. I was told the output of DP was text and the html generated from it.
Now I gather DP uses some sort of proprietary internal markup and can produce HTML without having to produce TXT? Am I right?
I'm also very interested in the answer to this question - and assuming the answer is "Yes, DP has an internal format that is used before the final .txt is rendered to PG", my follow-up question becomes "How can I get access to the ebooks in this 'internal markup' format?" I'm hoping dearly that the answer isn't "We throw it away when we're done..."

On Wed, Dec 16, 2009 at 9:46 PM, Joey Smith <joey@joeysmith.com> wrote:
On Sat, Sep 12, 2009 at 08:13:24PM +0200, Marcello Perathoner wrote:
Scott Olson wrote:
Guiguts wasn't designed to convert existing texts. Its purpose is to help a DP PPer turn the output of the DP rounds into the final product seen on PG. In this case, the DP text for a piece of poetry would have had the poetry wrapped in poetry markers, signifying to Guiguts that it had to treat the block of text as non-wrappable poetry, and not just a straight paragraph of prose.
I see. I was told the output of DP was text and the html generated from it.
Now I gather DP uses some sort of proprietary internal markup and can produce HTML without having to produce TXT? Am I right?
I'm also very interested in the answer to this question - and assuming the answer is "Yes, DP has an internal format that is used before the final .txt is rendered to PG", my follow-up question becomes "How can I get access to the ebooks in this 'internal markup' format?" I'm hoping dearly that the answer isn't "We throw it away when we're done..."
It's archived when the project is posted to PG. It won't have the final corrections and adjustments from the PP stage. You can download a concatenated text file at any point in the process, until the project is archived. R C

Well, be careful you are not laboring under misunderstandings. The output of DP (that is to say, what comes out of round F2) is _not_ a finished text.
Yes, there is some proprietary markup used at DP. However calling it a "format" is going too far. I would call it "suggestive markup". That is, its purpose is to record some information about layout and format for the post-processor to use when they produce html and/or text files.
The output of the rounds at DP usually contains plenty of ambiguities (such as the proprietary <tb>, which you may want to ignore or render in different ways depending on the text), proofer's notes (such as "is this a typo?"), and other inconsistencies. It is the job of the post-processor to take all this, ask for help if needed with any of the issues with the particular text, and produce the final texts which are submitted for posting.
--Andrew
On Wed, 16 Dec 2009, Joey Smith wrote:
I see. I was told the output of DP was text and the html generated from it.
Now I gather DP uses some sort of proprietary internal markup and can produce HTML without having to produce TXT? Am I right?
I'm also very interested in the answer to this question - and assuming the answer is "Yes, DP has an internal format that is used before the final .txt is rendered to PG", my follow-up question becomes "How can I get access to the ebooks in this 'internal markup' format?" I'm hoping dearly that the answer isn't "We throw it away when we're done..."

And though each project's final phase involves a great deal of manual work resulting in a polished text that is the basis for both the versions released to PG, it's interesting to note that there is no facility provided for actually preserving this foundation text version. Oversimplifying a bit, the text version removes a bunch of information, and the html version adds a bunch of stuff. Arguably the most valuable text (for content and metadata) is, at best, on someone's PC somewhere. Or more likely discarded.
On Wed, Dec 16, 2009 at 8:19 PM, Andrew Sly <sly@victoria.tc.ca> wrote:
Well, be careful you are not laboring under misunderstandings. The output of DP (that is to say, what comes out of round F2) is _not_ a finished text.
Yes, there is some proprietary markup used at DP. However calling it a "format" is going too far. I would call it "suggestive markup". That is, its purpose is to record some information about layout and format for the post-processor to use when they produce html and/or text files.
The output of the rounds at DP usually contains plenty of ambiguities (such as the proprietary <tb>, which you may want to ignore or render in different ways depending on the text), proofer's notes (such as "is this a typo?"), and other inconsistencies. It is the job of the post-processor to take all this, ask for help if needed with any of the issues with the particular text, and produce the final texts which are submitted for posting.
--Andrew

Arguably the most valuable text (for content and metadata) is, at best, on someone's PC somewhere. Or more likely discarded.
Just to state the obvious, it sure would be cool if the work could be saved somewhere at that point in time when the work is still "reversible" -- i.e. at a stage where it could be (in theory) resubmitted to DP or similar process for "another pass" or as a foundation for future work 10 years from now when we might have some better and agreed-upon file format for representing books than HTML.
Once line break information and page break information has been thrown away, then it is very difficult to go back and make another pass on a book -- although one ought to be able to write a tool that would re-insert line breaks and page breaks based on OCR alignment with the PG or DP text.
Here's a simple "real world example" of why one might care:
I submit a work to PG including careful HTML representation of how the real author and/or publisher represented their work, including "correct" block text quotes and poetry representation.
A day later it shows up on a different site, [which is fine] now represented in MOBI file format, but with the "correct" block text quotes and poetry representation now trashed. Why? Because presumably the person doing the file format translations at this other site is using a tool that doesn't know how to "correctly" deal with the HTML representation of block quotes and poetry. And WHY does that tool not know how to deal correctly with block quotes and poetry? -- Because there IS no format in HTML which says "This is a block quote" or "This is poetry" which in turn means it's a crap shoot whether a given translation tool will handle these issues "correctly" or not.
Why would a reader then choose this "inferior" MOBI version from another site? Because that site correctly fills in "Spine" information that the PG version is missing. But if the user chooses the version with the correct "Spine" information, then the block quote and poetry formatting is trashed...
...Of course, presumably there are people at PG who consider issues of block quotes and poetry "just formatting"....
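The re-insertion tool mentioned above could start from something like the following sketch, which aligns OCR words against the corrected text with Python's standard `difflib` and copies the OCR line breaks across. The function name `reinsert_line_breaks` is hypothetical, and real OCR output (hyphenation, page headers, heavy noise) would need far more care than this.

```python
from difflib import SequenceMatcher


def reinsert_line_breaks(ocr_text: str, clean_text: str) -> str:
    """Copy line breaks from OCR text onto a corrected, unbroken text."""
    ocr_words, line_ends = [], set()
    for line in ocr_text.splitlines():
        words = line.split()
        if words:
            ocr_words.extend(words)
            line_ends.add(len(ocr_words) - 1)  # index of last word on this line

    clean_words = clean_text.split()
    matcher = SequenceMatcher(a=ocr_words, b=clean_words, autojunk=False)

    # Map OCR word index -> clean word index for every aligned word.
    ocr_to_clean = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            ocr_to_clean[block.a + k] = block.b + k

    # A break is carried over only if the line-final OCR word was aligned.
    break_after = {ocr_to_clean[i] for i in line_ends if i in ocr_to_clean}

    pieces = []
    for j, word in enumerate(clean_words):
        pieces.append(word)
        pieces.append("\n" if j in break_after else " ")
    return "".join(pieces).strip()


if __name__ == "__main__":
    ocr = "To be, or n0t to be,\nthat is the question:"
    print(reinsert_line_breaks(ocr, "To be, or not to be, that is the question:"))
```

A limitation of this approach: when the last word of an OCR line is itself misread, it has no aligned partner and that break is silently dropped, so a real tool would want to fall back to the nearest aligned neighbour.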

Arguably the most valuable text (for content and metadata) is, at best, on someone's PC somewhere. Or more likely discarded.
Just to state the obvious, it sure would be cool if the work could be saved somewhere at that point in time when the work is still "reversible" -- i.e. at a stage where it could be (in theory) resubmitted to DP or similar process for "another pass" or as a foundation for future work 10 years from now when we might have some better and agreed-upon file format for representing books than HTML.
Once line break information and page break information has been thrown away, then it is very difficult to go back and make another pass on a book -- although one ought to be able to write a tool that would re-insert line breaks and page breaks based on OCR alignment with the PG or DP text.
Here's a simple "real world example" of why one might care:
I submit a work to PG including careful HTML representation of how the real author and/or publisher represented their work, including "correct" block text quotes and poetry representation.
A day later it shows up on a different site, [which is fine] now represented in MOBI file format, but with the "correct" block text quotes and poetry representation now trashed. Why? Because presumably the person doing the file format translations at this other site is using a tool that doesn't know how to "correctly" deal with the HTML representation of block quotes and poetry. And WHY does that tool not know how to deal correctly with block quotes and poetry? -- Because there IS no format in HTML which says "This is a block quote" or "This is poetry" which in turn means it's a crap shoot whether a given translation tool will handle these issues "correctly" or not.
Why would a reader then choose this "inferior" MOBI version from another site? Because that site correctly fills in "Spine" information that the PG version is missing. But if the user chooses the version with the correct "Spine" information, then the block quote and poetry formatting is trashed... -- Paul Maas
Maybe the changes were made to make the book viewable on a small screen, such as an iPhone.
On Thu, 17 Dec 2009 13:52:23 -0800, "Jim Adcock" <jimad@msn.com> said:
paulmaas@airpost.net

Jim Adcock wrote:
Arguably the most valuable text (for content and metadata) is, at best, on someone's PC somewhere. Or more likely discarded.
Just to state the obvious, it sure would be cool if the work could be saved somewhere at that point in time when the work is still "reversible" -- i.e. at a stage where it could be (in theory) resubmitted to DP or similar process for "another pass" or as a foundation for future work 10 years from now when we might have some better and agreed-upon file format for representing books than HTML.
My friend, what you say is obviously and incontrovertibly true. But PG is uninterested, and DP is unwilling, so what can you do?

don kretz <dakretz@gmail.com> writes:
And though each project's final phase involves a great deal of manual work resulting in a polished text that is the basis for both the versions released to PG, it's interesting to note that there is no facility provided for actually preserving this foundation text version. Oversimplifying a bit, the text version removes a bunch of information, and the html version adds a bunch of stuff. Arguably the most valuable text (for content and metadata) is, at best, on someone's PC somewhere.
Probably, but not necessarily. I produced an XML master file (TEI) where, e.g., I kept corrections as follows (German example, Handbuch der deutschen Kunstdenkmäler, Bd.1, Mitteldeutschland, 1914 by Georg Dehio, http://www.gutenberg.org/etext/19460):
Die <corr sic="Mormarstatuen">Marmorstatuen</corr> und Vasen des Gartens (einst 150) in Rom und Venedig bestellt,
Or dubious words or phrases with simple sic! statements:
<sic>wagerechte</sic>
I even kept page-breaks in between words and references to the original scans (Jon Noring once recommended something along these lines for every hyphenated word!):
Besonders <reg orig='merk-|würdig'>merkwürdig</reg>'> <pb n='105' id='i114.png'/> <!-- [P: 105] --> ein bogenförmiger rom. <hi rend="g">Altaraufsatz</hi> aus Stuck,
======================================================================
You won't notice all this in the offered TXT and HTML, but if wanted you can customize the XSL style-sheet and produce special purpose output files in HTML or PDF. My output files are by no means perfect, but good enough for demonstration.
-- Karl Eichwalder
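To illustrate why a master file like this keeps the work "reversible", here is a rough Python sketch that reads such markup and produces either the corrected text or the text as printed, plus the list of page breaks. It is not Karl's XSL style-sheet: the `<p>` wrapper, the `SAMPLE` string (stitched together from the two fragments quoted above, with the stray `'>` and the comment left out) and the function names are all assumptions made for the demonstration.

```python
import xml.etree.ElementTree as ET

# Assumed sample: fragments from the message above, wrapped in a <p> element.
SAMPLE = (
    '<p>Die <corr sic="Mormarstatuen">Marmorstatuen</corr> und Vasen '
    "des Gartens (einst 150) in Rom und Venedig bestellt, "
    "<pb n='105' id='i114.png'/>ein bogenförmiger rom. "
    '<hi rend="g">Altaraufsatz</hi> aus Stuck,</p>'
)


def render(elem, as_printed=False):
    """Flatten an element to plain text.

    With as_printed=True, each <corr> yields its sic attribute (the
    original misprint) instead of the corrected reading.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "corr" and as_printed:
            parts.append(child.get("sic", ""))
        else:
            parts.append(render(child, as_printed))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


def page_breaks(elem):
    """Return (page number, scan id) for every <pb/> in document order."""
    return [(pb.get("n"), pb.get("id")) for pb in elem.iter("pb")]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(render(root))                   # corrected reading
    print(render(root, as_printed=True))  # reading as printed
    print(page_breaks(root))
```

The point of the design is that nothing is thrown away: the misprint, the correction and the scan reference all live in one file, and each output format picks what it needs.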

I'm not sure what you mean by this. Most screen readers will read underlines or periods as underlines or periods, so there is no emphasis on bold or italics. If you mean something else, please accept my apologies and elaborate further. I personally turn off all punctuation when reading because I don't want to hear the periods and such. In Windows and MS Word, it will tell you if something is formatted differently. In English Braille which is also 7-bit, there is usually an accent mark (the equivalent to "`" to the sighted) to show any accented letter and a similar underline (or underscore if you prefer) convention for other emphasis.
On 9/11/2009 9:26 PM, James Adcock wrote:
PS: Bit curious which blind reader handles _/the underscore “convention”/_ correctly – I’ve not seen _/that/_ one!

Yes, you understand me the way I understand the blind readers I know of, namely that they will read _an ironic reference_ as "underscore an ironic reference underscore" rather than reading "an ironic reference" with prosodic emphasis. Thus when you turn off punctuation you also lose any representation of the prosodic emphasis that the author originally encoded in the printed text. This is not a small deal, IMHO. There are some books, such as "The Wings of the Dove" by Henry James, which are virtually impossible to even scan without maintaining the author's original proper representation of prosodic emphasis.
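As a rough sketch of what "handling the underscore convention" could mean for speech output, the Python below rewrites `_emphasized text_` as an SSML `<emphasis>` element instead of leaving the underscores to be read aloud or silently dropped. This is an illustration only: the function name `to_ssml` is hypothetical, the regular expression is a simple guess at the convention, and whether any particular screen reader or speech engine honors SSML emphasis is an assumption, not something established in this thread.

```python
import re
from xml.sax.saxutils import escape

# An underscore pair counts as emphasis only when it hugs the text it wraps
# and is not inside a word (so snake_case_names are left alone).
EMPHASIS = re.compile(r"(?<!\w)_(?!\s)(.+?)(?<!\s)_(?!\w)")


def to_ssml(text):
    """Convert plain text with _underscore emphasis_ into an SSML string."""
    escaped = escape(text)  # protect &, < and > before adding markup
    marked = EMPHASIS.sub(r"<emphasis>\1</emphasis>", escaped)
    return "<speak>" + marked + "</speak>"


if __name__ == "__main__":
    print(to_ssml("they will read _an ironic reference_ aloud"))
```

With something like this in front of the speech engine, turning off spoken punctuation would no longer discard the author's emphasis.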

James Adcock wrote: [snip]
What I don't understand is why PG continues to be wedded to plain-text as an *input* encoding format demanded of people submitting texts to PG. Plain-text is too constrained to do the job well.
I find that you are generally correct in everything you have said to date. But the reality is that PG <em>does</em> continue to be wedded to plain (impoverished) text. This topic has come up regularly over the years, and in every case has ended without any improvement to PG. While I hesitate to say that your advocacy is futile, your advocacy is futile.
HTML is too ambiguous, and too ill-matched to books to do well. We need something else, something that CAN be correctly and automagically converted "correctly" to one or another formats including plain-text, and Unicode, and HTML, and mobi, etc.
HTML, <i class="foreign">per se</i>, is indeed too ambiguous, although I have successfully developed a fairly complete set of standard usages and class definitions (encapsulated in a CSS file) that allow me to do lossless translation back and forth between HTML and TEI. For PG to adopt such a scheme, however, would require that PG adopt a set of standards, and Mr. Hart has been adamant that PG will <em>never</em> adopt <em>any</em> standard, fearing that it may alienate or intimidate some speculative volunteer that would otherwise contribute an as-yet-unarchived impoverished text file. (Obviously, the implicit standards developed and enforced by the Council Of Whitewashers cannot be considered as <i class="socalled">true</i> standards.)
I have concluded that Project Gutenberg is impervious to improvement. While Bowerbird rejects the notion, I am not afraid to say that for what you are attempting to do Project Gutenberg may not be the correct archive. I would suggest, rather, perfecting your HTML file, uploading it to the Internet Archive (http://www.archive.org/create/) and then posting a message here indicating where it can be found if any other volunteer wants to create a degraded version of your master copy.
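For readers wondering what a class-based convention of this kind might look like in practice, here is a small Python sketch of a round trip between HTML and TEI element names. The mapping table is an assumption invented for the example (the actual set of usages and classes is not given in this thread); `emph`, `foreign` and `soCalled` are real TEI element names, but pairing them with these HTML tags and classes is a guess. The input must be well-formed XML for `ElementTree` to parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical convention: (HTML tag, class attribute) <-> TEI element name.
HTML_TO_TEI = {
    ("em", None): "emph",
    ("i", "foreign"): "foreign",
    ("i", "socalled"): "soCalled",
}
TEI_TO_HTML = {tei: html for html, tei in HTML_TO_TEI.items()}


def html_to_tei(root):
    """Rename HTML elements to TEI elements in place, dropping the class."""
    for elem in root.iter():
        key = (elem.tag, elem.get("class"))
        if key in HTML_TO_TEI:
            elem.tag = HTML_TO_TEI[key]
            elem.attrib.pop("class", None)
    return root


def tei_to_html(root):
    """Rename TEI elements back to HTML elements, restoring the class."""
    for elem in root.iter():
        if elem.tag in TEI_TO_HTML:
            tag, cls = TEI_TO_HTML[elem.tag]
            elem.tag = tag
            if cls is not None:
                elem.set("class", cls)
    return root


if __name__ == "__main__":
    html = ('<p>HTML, <i class="foreign">per se</i>, <em>does</em> need '
            '<i class="socalled">true</i> standards.</p>')
    tei = ET.tostring(html_to_tei(ET.fromstring(html)), encoding="unicode")
    back = ET.tostring(tei_to_html(ET.fromstring(tei)), encoding="unicode")
    print(tei)
    print(back == html)
```

The round trip is only lossless because every HTML usage has exactly one TEI counterpart; a bare `<i>` with no class has no entry in the table and would pass through unchanged, which is exactly the ambiguity an agreed set of standard usages is meant to remove.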

What I don't understand is why PG continues to be wedded to plain-text as an *input* encoding format demanded of people submitting texts to PG. Plain-text is too constrained to do the job well.
I find that you are generally correct in everything you have said to date. But the reality is that PG <em>does</em> continue to be wedded to plain (impoverished) text.
I have heard a reasonable rationale (whether one agrees or not) for why PG remains wedded to PG TXT format as an OUTPUT file format. I have not heard a reasonable rationale for why PG REQUIRES me to submit BOTH an HTML AND a PG TXT file if what I as a volunteer really want to submit is just an HTML file. If I were allowed to just submit an HTML file then I could reasonably encode MOST of what I as a transcriber would like to transcribe, and I could avoid the abuse that I currently receive from Bowerbird when I don't put in the extraneous marks and spaces and smiley faces not found in the author's work but which Bowerbird would like to see in the PG TXT in order to support his pet theories about how the input file format and the rendered file format need to be one and the same thing. In turn Bowerbird could use his time and energies in a positive manner transcribing my HTML input format file into any particular flavor of PG TXT output file format that Bowerbird likes and can in turn pat himself on the back for, rather than abusing me over efforts that I didn't want to have to make in the first place.
For PG to adopt such a scheme, however, would require that PG adopt a set of Standards...
How about a VOLUNTARY set of "suggested" standards for HTML, such that when a volunteer voluntarily codes to those HTML standards the results can be translated and displayed on a larger class of machines successfully? Certainly PG in practice already enforces a number of standards on submitted input files which if you don't follow your files don't get accepted -- even though those standards aren't really written down so one ends up having to rework one's submissions not infrequently in order to get them accepted -- surprise!
I have concluded that Project Gutenberg is impervious to improvement.
I don't think it's impervious to improvement, it's just that changes are very slow to come and very hard won. Certainly from my point of view the recent decision to support, or at least partially support, EPUB and MOBI has made my life much more enjoyable.
I would suggest, rather, perfecting your HTML file, uploading it to the Internet Archive (http://www.archive.org/create/) and then posting a message here indicating where it can be found if any other volunteer wants to create a degraded version of your master copy.
Sigh -- I would hate to think that I have to "route around damage" -- again.

On Wed, Sep 23, 2009 at 4:24 PM, James Adcock <jimad@msn.com> wrote:
I could avoid the abuse that I currently receive from Bowerbird
That's like an abused woman saying that if she just had a better dishwasher her husband would stop hitting her. Bowerbird will abuse you no matter what. At some point, it's your fault for not putting him in a killfile. -- Kie ekzistas vivo, ekzistas espero.

LOL ok you win your point! I will attempt to filter him out.
participants (16)
- Al Haines (shaw)
- Andrew Sly
- Bowerbird@aol.com
- David Starner
- don kretz
- James Adcock
- Jim Adcock
- Joey Smith
- Karl Eichwalder
- Lee Passey
- Marcello Perathoner
- Paul Maas
- Robert Cicconetti
- Sankar Viswanathan
- Scott Olson
- Tony Baechler