What is the intended use of TXT format-- why line breaks?

Hi. I have looked at PG's books in both HTML and TXT formats, on several different devices, from large-screen laptops to netbooks to an iPod Touch. In just about every possible scenario, I had the line breaks creating an irregular right margin down the screen that made for unpleasant reading. I also tried taking one of the raw TXT files to make "my own" HTML file, and was tripped up by the line breaks. In order to prevent me from making the suggestion of changing the whole collection, can someone tell me why that number of characters on the screen was chosen?
-- Greg M. Johnson http://pterandon.blogspot.com

2009/9/8 Greg M. Johnson <pterandon@gmail.com>
Hi.
Hi Greg M.
I have looked at PG's books in both HTML and TXT formats, on several different devices, from large-screen laptops to netbooks to an iPod Touch.
In just about every possible scenario, I had the line breaks creating an irregular right margin down the screen that made for unpleasant reading. I also tried taking one of the raw TXT files to make "my own" HTML file, and was tripped up by the line breaks.
You can visit: http://www.gutenberg.org/wiki/Gutenberg:Readers%27_FAQ#R.30._When_I_print_ou.... Here "HR" means a hard return:
1. First, all paragraphs and separate lines should be separated by two HRs, so that you can see one blank line between them. Where they aren't, as in the case of a table of contents or lines of verse, add the extra HRs to make them so.
2. Replace all occurrences of two HRs with some nonsense character or string that doesn't exist in the text, like ~$~.
3. Replace all remaining HRs with a space.
4. Replace your inserted string ~$~ with one HR.
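The FAQ procedure quoted above can be sketched in a few lines of Python. This is only an illustration, not a tool Project Gutenberg provides: the function name `unwrap_faq_style` is made up for the example, and it assumes the file uses Unix newlines ("\n") and that paragraphs are already separated by exactly one blank line.

```python
PLACEHOLDER = "~$~"  # the nonsense string suggested in the FAQ text


def unwrap_faq_style(text: str, placeholder: str = PLACEHOLDER) -> str:
    """Hypothetical helper: apply the FAQ's replace steps to one string."""
    if placeholder in text:
        # The procedure only works if the placeholder is absent from the text.
        raise ValueError("placeholder already occurs in the text")
    # Protect paragraph breaks (two hard returns in a row).
    text = text.replace("\n\n", placeholder)
    # Every remaining hard return is a mid-paragraph line break: make it a space.
    text = text.replace("\n", " ")
    # Put back one hard return per paragraph break.
    return text.replace(placeholder, "\n")


sample = "It was the best of times,\nit was the worst of times.\n\nSecond paragraph\ncontinues here."
print(unwrap_faq_style(sample))
```

Each paragraph ends up on a single long line, which a reader program or browser can then wrap to the width of the screen.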
In order to prevent me from making the suggestion of changing the whole collection, can someone tell me why that number of characters on the screen was chosen?
You can visit: http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#About_the_formatti... The idea is to use a pure text format, with a number of lines per page that can be read on most computers and preserved for the future.
Ricardo F. Diogo

(Of course, in my last message I meant "characters per line", not "lines per page". Ricardo F. Diogo)

Greg M. Johnson wrote:
In order to prevent me from making the suggestion of changing the whole collection, can someone tell me why that number of characters on the screen was chosen?
In 1985 virtually all interactions with computers were performed via "smart terminals," predominately the DEC VT-52 and VT-100, which presented only text in an 80 x 25 array, that is, 25 lines each having at most 80 characters. The characters could be highlighted by reversing the electron output on the CRT (i.e. using 'on' instead of 'off', and vice-versa) but any other manipulation of the font, such as italics, bolding, or even a different font, was simply not possible. Even most personal computers of that day used VT-100 emulation.
At that same time, we were being taught in typing class that left margins should be 66 characters; the bell would be set at 60, at which point the typist needed to decide whether the current word would fit in the 66 character limit, or whether it needed to be hyphenated.
In 1985 the principals at Project Gutenberg did not want to deal with hyphenation, so no words were hyphenated. The current line length of Project Gutenberg files was designed so no word in unhyphenated form would ever cause a line to exceed 80 characters and wrap to a new line on a typical 1985-era smart terminal.
In most ways, Project Gutenberg has not progressed beyond 1985.

Too many search engines fail when words are hyphenated. There are all sorts of ways to remove hard returns in one second. It takes more time to complain than to actually find one of these and use it. . . . In many ways complainers have not evolved past Medieval Times. On Tue, 8 Sep 2009, Lee Passey wrote:
In most ways, Project Gutenberg has not progressed beyond 1985. _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Michael S. Hart wrote:
There are all sorts of ways to remove hard returns in one second.
But no way to decide which ones to drop and which one to keep.
It takes more time to complain than to actually find one of these and use it. . . .
Actually nobody has yet come up with a satisfactory solution to this problem.
In many ways complainers have not evolved past Medieval Times.
Here we go again. Blaming your customers is still cheaper than fixing your bugs. I guess that's your only way out if you can't find a single argument to uphold the boneheaded plain text format that PG is still producing. And you can't find one, because there ain't one. -- Marcello Perathoner webmaster@gutenberg.org

I actually agree with the complaints, if you don't mind my input (no flaming please). I try to do some sort of correction for my ebook reader, but it's very primitive (and breakable): if the first alphabetic character in the new line is uppercase, keep the line break; otherwise join the lines. First I tried this: if the last character of the previous line is punctuation, keep the line break; otherwise join. But hey, more false positives. The one I use at least corrects the normal errors (proper nouns notwithstanding) while keeping things like chapter headings mostly intact (except lowercase ones, of course). I don't think both can be applied together. If someone has a better algorithm, please share, hey? This is one of the reasons I prefer HTML formats. A space is a space, not dozens of spaces, \n is nothing at all, and <p> is king.
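A rough sketch of the uppercase heuristic described in this message, written in Python. It is not the poster's actual code: the function name `join_by_case` is invented for the example, and keeping a break after an empty line is an added assumption.

```python
def join_by_case(lines):
    """Join a line onto the previous one unless its first letter is uppercase."""
    out = []
    for line in lines:
        stripped = line.strip()
        # First alphabetic character of the line, or None if there is none.
        first_alpha = next((ch for ch in stripped if ch.isalpha()), None)
        starts_lower = first_alpha is not None and first_alpha.islower()
        if starts_lower and out and out[-1]:
            # Lowercase start: treat the break as a soft wrap and join.
            out[-1] += " " + stripped
        else:
            # Uppercase start, no letters at all, or nothing to join onto.
            out.append(stripped)
    return out


wrapped = [
    "CHAPTER I",
    "The quick brown fox jumped over",
    "the lazy dog and met",
    "Alice by the river.",
]
for line in join_by_case(wrapped):
    print(line)
```

The last line of the sample shows the weakness the message mentions: a wrapped line that happens to begin with a proper noun ("Alice") keeps its break even though it belongs to the same sentence.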

On Tue, Sep 8, 2009 at 4:27 PM, Paulo Levi <i30817@gmail.com> wrote:
If someone has a better algorithm, please share, hey?
If the line begins with whitespace, don't re-wrap it unless you have to (usually poetry, or something else where the linebreaks matter). If there are blank lines (two or more \n's), don't wrap the stuff together (paragraphs, section/chapter breaks). Otherwise \n = space.
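The three rules in this reply can be sketched as follows. This is a minimal Python illustration under stated assumptions, not code from the thread: `rewrap` is a made-up name, blank lines are passed through unchanged, and indented lines are copied as they are.

```python
def rewrap(text: str) -> str:
    """Unwrap plain paragraphs; leave blank lines and indented lines alone."""
    out = []      # finished output lines
    current = []  # pieces of the paragraph currently being joined

    def flush():
        # Emit the paragraph collected so far as one long line.
        if current:
            out.append(" ".join(current))
            current.clear()

    for line in text.split("\n"):
        if not line.strip():
            # Blank line: paragraph or section boundary, keep it.
            flush()
            out.append("")
        elif line[0].isspace():
            # Leading whitespace: likely verse or a layout that matters.
            flush()
            out.append(line.rstrip())
        else:
            # Ordinary line: the hard return becomes a space.
            current.append(line.strip())
    flush()
    return "\n".join(out)


sample = (
    "A first paragraph that\nruns over two lines.\n\n"
    "    Indented verse line one\n    Indented verse line two\n\n"
    "Next paragraph\nhere."
)
print(rewrap(sample))
```

The rules rely on the text marking its non-prose parts with indentation or blank lines; a list set flush left with no blank lines between entries would still be joined into one line.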

On Tue, Sep 8, 2009 at 6:07 PM, Marcello Perathoner<marcello@perathoner.de> wrote:
But no way to decide which ones to drop and which one to keep.
Given the previous example, negative lookahead assertions seem to fit well here:
s/(?<!\n)\n(?!\n)/.../g or perhaps ^\n(?!\n) if you need to anchor it or /\n(?!\n)/ for zero width and /\n[^\n]/ for width=1 and so on.
Plenty of ways to skin that cat in most regex-capable languages.
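For readers who do not use Perl-style substitutions, the first pattern translates directly to Python's `re` module. The sketch below assumes the elided replacement ("...") is a single space, which the message does not state; the name `unwrap_single_newlines` is hypothetical. Note that the pattern combines a negative lookbehind with a negative lookahead, so it matches only a newline that has no newline on either side.

```python
import re

# A newline that is neither preceded nor followed by another newline.
SINGLE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")


def unwrap_single_newlines(text: str) -> str:
    """Replace isolated newlines with a space; leave blank-line breaks intact."""
    return SINGLE_NEWLINE.sub(" ", text)


print(unwrap_single_newlines("one\ntwo\n\nthree\nfour"))
```

Because both assertions are zero width, the characters around the newline are not consumed. The `\n[^\n]` variant does consume the following character, so it would need a capture group to put that character back.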

David A. Desrosiers wrote:
On Tue, Sep 8, 2009 at 6:07 PM, Marcello Perathoner<marcello@perathoner.de> wrote:
But no way to decide which ones to drop and which one to keep.
Given the previous example, negative lookahead assertions seem to fit well here:
s/(?<!\n)\n(?!\n)/.../g or perhaps ^\n(?!\n) if you need to anchor it or /\n(?!\n)/ for zero width and /\n[^\n]/ for width=1 and so on.
Plenty of ways to skin that cat in most regex-capable languages.
ROTFL! Apply that algorithm to Hamlet and see.
See if you can come up with an algorithm that doesn't make mincemeat of the following small excerpt. The algorithm should at least:
1. Recognizes that "HAMLET, PRINCE OF DENMARK by William Shakespeare" is the title statement of the work. This should be marked up like:
<h1>Hamlet, Prince of Denmark<br/><br/> by William Shakespeare</h1>
and NOT:
<h1>Hamlet, Prince of Denmark</h1> <h2>by William Shakespeare</h2>
2. Not wrap the list of persons proper, BUT wrap <p>Lords, Ladies, Officers, Soldiers, Sailors, Messengers, and other Attendants.</p>
3. Recognize that <p>SCENE. Elsinore</p> is a stage direction, not the start of scene 1.
4. Recognize <h2>ACT I.</h2>
5. Recognize <h3>Scene I. Elsinore. A platform before the Castle.</h3> (Even if it lacks spacing.)
--- start excerpt from #1524 ----
HAMLET, PRINCE OF DENMARK
by William Shakespeare
PERSONS REPRESENTED.
Claudius, King of Denmark.
Hamlet, Son to the former, and Nephew to the present King.
Polonius, Lord Chamberlain.
Horatio, Friend to Hamlet.
Laertes, Son to Polonius.
Voltimand, Courtier.
Cornelius, Courtier.
Rosencrantz, Courtier.
Guildenstern, Courtier.
Osric, Courtier.
A Gentleman, Courtier.
A Priest.
Marcellus, Officer.
Bernardo, Officer.
Francisco, a Soldier
Reynaldo, Servant to Polonius.
Players.
Two Clowns, Grave-diggers.
Fortinbras, Prince of Norway.
A Captain.
English Ambassadors.
Ghost of Hamlet's Father.
Gertrude, Queen of Denmark, and Mother of Hamlet.
Ophelia, Daughter to Polonius.
Lords, Ladies, Officers, Soldiers, Sailors, Messengers, and other Attendants.
SCENE. Elsinore.
ACT I.
Scene I. Elsinore. A platform before the Castle.
[Francisco at his post. Enter to him Bernardo.]
Ber. Who's there?
Fran. Nay, answer me: stand, and unfold yourself.
Ber. Long live the king!
Fran. Bernardo?
Ber. He.
Fran. You come most carefully upon your hour.
--- end excerpt #1524 ----
-- Marcello Perathoner webmaster@gutenberg.org
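The objection is easy to reproduce. The Python sketch below applies the single-newline substitution from earlier in the thread to three entries from the list of persons. It assumes the entries sit one per line with no blank lines between them, which requirement 2 implies but the archived excerpt does not show, and that the replacement is a single space.

```python
import re

# Three entries from the list of persons, assumed to be one per line.
cast = (
    "Claudius, King of Denmark.\n"
    "Hamlet, Son to the former, and Nephew to the present King.\n"
    "Polonius, Lord Chamberlain."
)

# The substitution proposed earlier in the thread.
joined = re.sub(r"(?<!\n)\n(?!\n)", " ", cast)
print(joined)
```

The list collapses into one run-on line, because nothing in the characters themselves distinguishes these breaks from soft wraps inside a paragraph.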

On Wed, Sep 9, 2009 at 8:12 AM, Marcello Perathoner<marcello@perathoner.de> wrote:
ROTFL! Apply that algorithm to Hamlet and see.
See if you can come up with an algorithm that doesn't make mincemeat of the following small excerpt. The algorithm should at least:
As you already know, parsing HTML is a much easier matter than parsing semi-freeflow text (which was the original poster's request). Also remember, I do this all the time for spiders we write for Plucker. I slice, I dice, and I make beautiful, automated works of art from the worst, most semantically-incorrect HTML out there. See some examples here: http://projects.plkr.org/

Want the simple way? Try unzipping to the Apple text format. . . . On Wed, 9 Sep 2009, David A. Desrosiers wrote:
Also remember, I do this all the time for spiders we write for Plucker. I slice, I dice, and I make beautiful, automated works of art from the worst, most semantically-incorrect HTML out there. See some examples here:
http://projects.plkr.org/

David A. Desrosiers wrote:
On Wed, Sep 9, 2009 at 8:12 AM, Marcello Perathoner<marcello@perathoner.de> wrote:
ROTFL! Apply that algorithm to Hamlet and see.
See if you can come up with an algorithm that doesn't make mincemeat of the following small excerpt. The algorithm should at least:
As you already know, parsing HTML is a much easier matter than parsing semi-freeflow text (which was the original poster's request).
Do you read a post before replying? That's exactly what I requested you to do: To parse a plain text version of Hamlet into wrapped and non-wrapped paragraphs. -- Marcello Perathoner webmaster@gutenberg.org

On Wed, Sep 9, 2009 at 9:45 AM, Marcello Perathoner<marcello@perathoner.de> wrote:
Do you read a post before replying?
Of course... do you?
That's exactly what I requested you to do: To parse a plain text version of Hamlet into wrapped and non-wrapped paragraphs.
You did? The following looks pretty much like HTML to me, not plain ASCII text that wraps at 70 columns (like the original poster who started this thread requested).
See if you can come up with an algorithm that doesn't make mincemeat of the following small excerpt. The algorithm should at least:
1. Recognizes that "HAMLET, PRINCE OF DENMARK by William Shakespeare" is the title statement of the work. This should be marked up like:
<h1>Hamlet, Prince of Denmark<br/><br/> by William Shakespeare</h1>
and NOT:
<h1>Hamlet, Prince of Denmark</h1> <h2>by William Shakespeare</h2>
2. Not wrap the list of persons proper, BUT wrap <p>Lords, Ladies, Officers, Soldiers, Sailors, Messengers, and other Attendants.</p>
3. Recognize that <p>SCENE. Elsinore</p> is a stage direction, not the start of scene 1.
4. Recognize <h2>ACT I.</h2>
5. Recognize <h3>Scene I. Elsinore. A platform before the Castle.</h3> (Even if it lacks spacing.)

On Wed, 9 Sep 2009, David A. Desrosiers wrote: ... Now THAT is the plainest text message I've ever seen!

On Wed, Sep 09, 2009 at 12:07:06AM +0200, Marcello Perathoner wrote:
Michael S. Hart wrote:
In many ways complainers have not evolved past Medieval Times.
Here we go again. Blaming your customers is still cheaper than fixing your bugs.
Between David Widger & Al Haines, thousands of older text only titles were updated and HTML added. Nearly all new titles come with text and HTML. Many, many bugs were fixed. Some people might characterize the lack of HTML as a bug. The discussion thread had turned to different ways of identifying paragraph breaks in plain text. As Marcello rightly points out, it's tough to do automatically with complete accuracy (though many methods can produce less than complete accuracy). Nobody has argued that text is the master format, or should be. Or that it is somehow richer, or preserves more of the original formatting. There are some advantages...there are many limitations. PG's reasons for an emphasis on including plain text, if feasible, are well documented. One of the first responses in this thread pointed this out. I've adjusted the Subject line in my response, for people who want to talk about why text sux, among themselves. If there's enough interest, I can set up a separate mailing list for you. I won't be joining, however. -- Greg
participants (9)
- David A. Desrosiers
- Greg M. Johnson
- Greg Newby
- Lee Passey
- Marcello Perathoner
- Michael S. Hart
- Paulo Levi
- Ricardo F Diogo
- Scott Olson