
Here's an attempt to get at an objective truth behind the format wars, to make it easier for folks to make informed, if subjective, decisions of their own. Library science peeps and/or computer anthropologists might already know something I don't, which would make me give up an axe I keep grinding.

Suppose you were to take all of the reading devices that STILL exist today or are likely ever to exist in the future. This would include Kindles, really old computer terminals in behind-the-times university CS departments, reasonable speculations about 200 years from now, AND the software that everyone in the world has on their computer right NOW. Now take all the formats that ever existed, and give each one a big veto for every time someone i) truly cannot see it, or ii) has an inconvenient time on the screen reading the thing (from eye nuisance to being unable to enlarge the text for those with poor vision).

Microsoft Word 1995 format (if there were such a thing) is probably out. WordPerfect for 3.1 Windows format is probably out. And so on. Maybe you can see where I'm going with this. I truly believe that "text has gotta stay". Someone said this years ago and it is the right decision, because some TXT formats will win the elimination contest I just described above.

But I am wondering who wins the elimination round between: i) PG's current format for "TXT" with 80 chars ii) a format that is still all ASCII but removes all the end-of-column CR's, and only has double returns at the end of paragraphs.

IMO, this ii) wins 100% hands down for "convenient time on the screen for reading the thing", based on software that everyone has right now. Yeah, some peeps might be able to write a program that takes out these CR's, but I know the one time I tried it in something like MS Word, I messed it up worse. And if someone has to write a program in order to read, that fails the criterion of "convenient time right now."
IMO, I'm willing to toss out of the lifeboat the University CS historian on his old terminal in favor of the millions of peeps with HTML browsers. I'm not advocating for HTML, but for a TXT version without the 80 char line breaks. Just an idea. -- Greg M. Johnson http://pterandon.blogspot.com
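For readers wondering what "a program that takes out these CR's" might look like, here is a minimal Python sketch. It is not PG tooling; the function name `unwrap_paragraphs` is made up for illustration, and it assumes that every blank line marks a paragraph break and that every other line break is a soft wrap. That assumption is exactly what goes wrong on verse, tables, and indented material, so treat it as a sketch only.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines, keeping one blank line between paragraphs.

    Assumption (hypothetical, for illustration): blank lines separate
    paragraphs and every other line break is a soft wrap. Poetry, tables
    and indented blocks would be mangled by this.
    """
    # Normalize CRLF and lone CR line endings to LF first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split wherever one or more blank (or whitespace-only) lines occur.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside each paragraph, collapse all whitespace runs, including the
    # hard line breaks, into single spaces.
    joined = [" ".join(p.split()) for p in paragraphs if p.strip()]
    return "\n\n".join(joined)
```

Running it on a wrapped file yields option ii) above: one long line per paragraph, with a double return between paragraphs, leaving the wrapping to whatever software displays it.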

On 10/29/2011 02:18 PM, Greg M. Johnson wrote:
... reasonable speculations about 200 years from now, ...
Hilarious. Could a person from 1800 reasonably have foreseen portable devices that could communicate with every person on earth and ebooks delivered over cellular networks? (In 1800 they used gas for illumination, and electricity, though known to scientists, had no practical application. Radio telegraphy came a hundred years later.)
Microsoft Word 1995 format (if there were such a thing) is probably out. WordPerfect for 3.1 Windows format is probably out.
The works worth preserving -- those that age slower than the substrate they are written on -- were preserved by copying them again and again to new substrates. Every file format will age and become unreadable eventually. The only hope for preserving the file's contents will be timely conversion into a newer format.

Plain text lacks some important information about the text structure and does not convert gracefully to any other format. Many have tried, all have failed. By the time OCR software recognizes more textual features than plain text can record, plain text will be dead. Plain text sucks as an archival format because it forgets too much.

Plain text sucks as a distribution format too. It displays well on an 80-column teletype (some 20 people still have one in the basement) and nowhere else (some 6 billion people use something else).

"Reasonable speculation": in 100 years, machine OCR will be so accurate that it will find and correct typos in the printed book, it will infer semantics and mark them up, and, while PG plain text files can still be found in the Google cache, nobody will use them, as it will take about 2ms to OCR any scanned book from the IA.
But I am wondering who wins the elimination round between: i) PG's current format for "TXT" with 80 chars ii) a format that is still all ASCII but removes all the end-of-column CR's, and only has double returns at the end of paragraphs.
So you think that TXT *without* hard CRs is better than text with hard CRs. After all that pomp about "objective truth" and "200 years from now" I expected something more thoroughly enlightening. -- Marcello Perathoner webmaster@gutenberg.org

First, note that the PG format isn't really "TXT80" but rather "TXT70", since the PG line-length rules require something like 70 chars per line. Second, note that what PG does isn't even "ascii", since PG requires interpretation of the code points in ways that differ in meaning from the ASCII standard. Third, the war between newline meaning "newline means line break within sentence" vs. "newline means break between paragraphs" was fought in the early 1970's with the near-universal adoption of the rule "newline means break between paragraphs" being adopted "everywhere" in the personal computer world. Somehow PG didn't hear the message.

Simple bottom line: if a real world customer opens a PG TXT70 file on their device, will it display "correctly"? Simple answer: on close to 100% of real world customers' devices, no, it will not display "correctly."

PS: But the problem is that the "old timers" at PG are so used to seeing PG generated files being displayed *incorrectly* on this, that, or the other device that they don't even recognize these problems as *being* problems! I.e., they think the rest of the world should simply learn to adapt itself to PG's self-imposed problems.

PPS: I used to see lots of complaints from E-Book users: "Hey I opened this simple txt file from PG because I thought if anything should work on my E-Book reader a simple txt file should, but PG can't even get that right!" But I don't see those kinds of complaints anymore. I'm not sure if that is because E-Book users have given up on PG entirely, or if they have learned enough about E-Book file formats to understand they should be downloading an EPUB or a Mobi formatted book instead. Certainly I don't understand why E-Book users continue to tout Feedbooks, for example, when the Feedbooks files tend to be even more damaged than PG's.

On Sat, Oct 29, 2011 at 1:48 PM, Jim Adcock <jimad@msn.com> wrote:
Third, the war between newline meaning "newline means line break within sentence" vs. "newline means break between paragraphs" was fought in the early 1970's with the near-universal adoption of the rule "newline means break between paragraphs" being adopted "everywhere" in the personal computer world.
I don't know what planet you're from, but it's not mine. The best-selling single personal computer model of all time, the Commodore 64, started selling in 1982, and I'm pretty sure it didn't do word wrapping. Most programs on MS-DOS didn't do word wrapping. Still today, Notepad on Windows makes word-wrap optional, so it's clearly not universal. On my Unix system today, most programs don't do word wrapping; even in Firefox I sometimes open a text over the web and find that I can't read it because the lines aren't word-wrapped. Seriously, early 1970s? Wikipedia lists the Commodore PET as the first successfully mass-marketed computer, and that was in 1977. -- Kie ekzistas vivo, ekzistas espero.

On 10/29/2011 6:18 AM, Greg M. Johnson wrote:
But I am wondering who wins the elimination round between: i) PG's current format for "TXT" with 80 chars ii) a format that is still all ASCII but removes all the end-of-column CR's, and only has double returns at the end of paragraphs.
This is a hard question to answer, because you haven't told us about your assumptions regarding rendering software (and all rendering requires /some/ sort of software). In true ASCII, for example, the way to make an "eñe" (e.g. the ñ in mañana) is 110 (n) 8 (backspace) 126 (~), or as interpreted, "strike the paper with the print head in position 110, move the print head back one character, then strike the paper with the print head in position 126." To start a new line, you would use 13 (carriage return) 10 (line feed) or "move the print head to the far left position, then roll the platen up one line."

I'm going to make a possibly unjustified assumption that these ASCII characters designed to control the printing hardware (form feed to advance one page, bell to activate the bell, XOff to stop the RS-232 feed while I'm busy, XOn to start the feed up again, etc.) are out of bounds for you, as you're not trying to control a mechanical device; I'm assuming when you say ASCII, you mean /printable/ ASCII, or ASCII values between 32 and 126 (inclusive).

But on the other hand, you've mentioned carriage returns, so at least /one/ control code is acceptable. Of course, simply moving the cursor (the CRT equivalent of the print head) to the left side of the screen would simply result in a line of text being printed over and over again in the same spot, so I'm assuming you want to reinterpret CR to mean "move cursor to left hand side and move it down one line." Or does your system require both the carriage return and line feed to move to "the beginning of the next line on the screen?" (Note that in the *nix world, this function has been assigned to the line feed ASCII value, such that it is common to see it (10) referred to as "new line" instead of "line feed".)
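To make the line-ending ambiguity concrete, here is a small Python sketch of one possible reading. The assumption is labeled in the code: CRLF, a lone CR, and a lone LF each count as exactly one line break. The names `LINE_BREAK`, `count_line_breaks`, and `normalize_line_breaks` are made up for illustration and do not come from any PG tool.

```python
import re

# Assumption (one possible convention, not a standard quoted in this
# thread): CRLF, a lone CR, and a lone LF each mean one line break.
# CRLF is listed first so the pair is consumed as a single break.
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def count_line_breaks(data: str) -> int:
    """Count logical line breaks under the convention above."""
    return len(LINE_BREAK.findall(data))


def normalize_line_breaks(data: str) -> str:
    """Rewrite every logical line break as a single LF."""
    return LINE_BREAK.sub("\n", data)
```

Under this convention CRLF counts as one break and CR CR LF counts as two, because the first CR stands alone and the second pairs with the LF. A different convention would give a different count, which is the point: the file alone does not say which one was meant.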
Should I assume that "move the cursor to a new line on the display, scrolling the display when the cursor is already at the bottom of the screen" can be represented by a carriage return (13) /or/ a line feed (10) but that the two of them together must be merged into a single command? That is, CRLF (two characters) is a single new line, but CRCRLF (three characters) would be two new lines? Are there any other control codes that you would exempt from the "printable ASCII" rule? I'm starting to feel very uncomfortable making all these assumptions. Now if I'm assuming that the software you have in mind can't handle strike-backspace-overstrike, then I have to assume it can't handle much more sophisticated functions either, like word-wrapping; we're going to need something like CR/LF/CRLF to tell the hardware where to put the cursor, otherwise we'll end up with a screen full of text perfectly right-justified, but with most of the words split at odd places when they occur at the end of the line. So we must insert this vaguely-defined "new line" at the beginning of any word that would be divided at the edge of the display. But to know where to insert the new line characters, we have to know the width of the screen, in fixed-size characters. PG made the /assumption/ that the width of all screens was 80 characters; the notion of TXT70 was that once the line length exceeded 70 characters, you should insert a new line character at the next occurring whitespace, so you should never see a line that ends funnily. But you're talking about /really/ old computers, which includes the Commodore, the Atari, and, most importantly, the Apple II. All of those computers had screens that were only /40/ characters wide, so to satisfy your requirements, what we really wanted is TXT40, where no line exceeds 40 characters in length. 
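The dependence on screen width can be sketched in Python with the standard library's `textwrap` module. Note the swapped-in technique: `textwrap.fill` breaks a line *before* it would exceed the width, which is not quite the "break at the next whitespace after 70 characters" rule described above. The helper name `hard_wrap` and the widths tried are illustrative assumptions only.

```python
import textwrap


def hard_wrap(paragraph: str, width: int) -> str:
    """Insert hard line breaks so that no line is longer than `width`.

    Words are never split: with break_long_words=False, a single word
    longer than `width` is left to overflow on a line of its own.
    """
    return textwrap.fill(
        paragraph,
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )


def longest_line(text: str) -> int:
    """Length of the longest line, to check a file against a screen width."""
    return max((len(line) for line in text.splitlines()), default=0)
```

Wrapping the same paragraph with `hard_wrap(p, 70)` and `hard_wrap(p, 40)` gives two different files, and a file wrapped for 70 columns will still have lines that end in odd places on a 40-column screen. Whatever width gets baked into the file is a guess about the display.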
Of course, if you're willing to allow software smart enough to do word wrap, all bets are off, because with software I can display HTML just as easily, and with much better results, than pure printable ASCII.

By now you should be saying, "Man, this is absurd." And you'd be right. You have posited a question without exposing all relevant assumptions or providing all relevant requirements. We're not starting on the same page at all.

Everyone (well, almost everyone) wants simple answers, magic bullets. "We could easily fix today's economy if we would only just [insert favorite simplistic solution here]." But the simple answer is that there are no simple answers. Life, and text preservation, is more complex and nuanced than that. This is not to say that the answers do not exist; it just means that if you want a solution you have to closely examine the parameters of your problem, and deal with the required complexity of the solution.

On Sat, 29 Oct 2011, Greg M. Johnson wrote:
But I am wondering who wins the elimination round between: i) PG's current format for "TXT" with 80 chars ii) a format that is still all ASCII but removes all the end-of-column CR's, and only has double returns at the end of paragraphs.
IMO, this ii) wins 100% hands down for "convenient time on the screen for reading the thing", based on software that everyone has right now.
I recall reading ages ago that there was heated discussion about this in the earlier days of PG (probably the 1990s). I wonder what reasoning was used to settle on the txt-less-than-80 format... --Andrew
participants (6): Andrew Sly, David Starner, Greg M. Johnson, Jim Adcock, Lee Passey, Marcello Perathoner