Re: In search of a more-vanilla vanilla TXT

jim said:
In case anyone really wants to do it right, what PG needs is to have each book (and other documents) marked up semantically.
Of all of the existing SGML/XML applications, TEI seems best for what PG is doing.
jim, first of all, you're wrong. and second of all, you're about 5-8 years late for this conversation. i think the archives are still available, and will give you a good idea of this long-raging debate. but thanks for giving us all a blast from the past. -bowerbird

Sigh. I don't know what the solution is, but for me as a content provider it is heart-breaking to do my best to "do the job right" and then see the hard-won knowledge and effort I have put into "doing it right" thrown away BOTH by the txt and the html as implemented by PG. I'd love to see an input format that preserves the hard-won effort I put into content creation, AND which is NOT a "write once" format, so that future content producers can easily build on the work I have already done and NOT have to redo it because BOTH txt and html as implemented by PG throw away effort I have already expended. Yes, it is possible for future content producers to go over the text front to back another three or four passes, as I have already done, in order to "catch" again the errors that txt and html have re-introduced -- but why should anyone have to do that?

What I would like to see as an input-submission format is something that:

1) Preserves the hard-won effort I have already put into content creation, such that a future volunteer can build on my work without having to "reverse engineer" those gratuitous errors currently being introduced by the current PG use of txt and html.

2) Works well enough even with commonly available "bottom feeder" tools. [[Personally I get tired of claims of "magic bullet" tools; then I spend a day trying to get them to work on my computer and they don't even install and run correctly.]]

3) Does simple common tasks in a simple, transparent way.

4) Isn't ugly or ungainly for simple common everyday tasks.

5) Can be -- and is in practice -- transformed from input format to a variety of end reader formats in an attractive manner which does not contain common uglinesses for common book situations.

Jim Adcock wrote:
1) Preserves the hard-won effort I have already put into content creation, such that a future volunteer can build on my work without having to "reverse engineer" those gratuitous errors currently being introduced by the current PG use of txt and html.
Please give some real-world examples. -- Marcello Perathoner webmaster@gutenberg.org

1) Preserves the hard-won effort I have already put into content creation, such that a future volunteer can build on my work without having to "reverse engineer" those gratuitous errors currently being introduced by the current PG use of txt and html.
Please give some real-world examples.
OK. My point being that IF PG were to accept a "proper" book INPUT encoding format that preserves the hard-won knowledge of the original encoding volunteer, then there would be no need for a future volunteer to completely re-check that encoding against the original book scans in order to make another pass looking for errors, etc. So what all is "wrong" with TXT and HTML in this regard as stored in the PG databases?

Both formats throw away the original volunteers' knowledge about the common parts of books: TOC, author info, pub info, copyright pages, index, chapters, etc. Yes, one can code this information in HTML, but there is no unambiguous way to do so, which means that PG HTML encodings all take different paths, as one rapidly discovers if one tries to automagically convert PG HTML into other reflow file formats. You could follow common h1, h2, h3 settings by convention -- if PG were to establish and require such -- but then you end up with really ugly rendered HTML on common displays. You can overcome this with style sheets -- but then you are defeating many tools which automagically convert HTML into a variety of other reflow file formats for the various e-readers.

Both formats as stored by PG gratuitously throw away hard-won line-by-line alignments between scan text and hand-scanno-corrected text. These alignments are needed if a future volunteer wants to make another pass at "fixing" errors in the text, for example by running the text through DP again, or running it against a future automagic tool that compares a new scan to the PG text. I submit my HTML to PG WITH the original line-by-line alignments -- because it doesn't in any way hurt the HTML and allows a future volunteer to make another pass on my work -- but then PG insists on throwing this information away anyway before posting their HTML files.

Both formats throw away page numbers and page breaks, which again are necessary to make another volunteer pass against the original scans, and also to make future passes against broken link info, etc. They would also be useful for some college courses, where you need page number refs even if reading on a reflow reader device. I'm NOT suggesting that page numbers should typically be displayed in an OUTPUT reflow rendering, rather that this represents hard-won information that ought to be retained in a well-designed INPUT file format encoding.

TXT files seem to me to almost always have some glyphs outside of the 8-bit char set. Unicode text files would at least overcome this limitation. HTML in theory doesn't have this limitation, but in practice, when submitting "acceptable" HTML to PG and running it through their battery of acceptance tools, I find some glyphs I can't get through, so I end up punting and throwing away "correct" glyph information, dumbing down the representation of some glyphs.

PG and DP *in practice* have a dumbed-down concept of punctuation, such that it's impossible to maintain and retain "original author's intent" as expressed in the printed work. For example, the M-dash is commonly found in three contexts: lead-in, lead-out, and connecting, similar to how ellipses are used in at least three different ways: ...lead in, lead out, and ... connecting. But in practice all one can get through PG and DP is the connecting M-dash. Also consider all the [correct] variety of Unicode quotation marks which needlessly get reduced in PG and DP to only U+0022 or U+0027.
In general PG has a dumbed-down concept of punctuation, that near is near enough, and is actively hostile to accurately encoding the punctuation as rendered in the original print document. Again, it is EASY to dumb down an INPUT file format, for example if you need to output to a 7-bit or even a 5-bit teletypewriter, if that is what you want. So why insist that the input file encoder get it wrong in the first place? It is easy to throw away information when going from an INPUT file encoding to an OUTPUT file rendering. It is VERY DIFFICULT to correctly fix introduced errors when going back from a reduced OUTPUT file rendering to a correctly encoded INPUT file encoding.

What I am imagining is some simple-to-use file encoding format where a volunteer can correctly and unambiguously code the common things and conventions one finds in everyday books, such that another volunteer can pick up and make another pass on the book some years hence -- without having to reinvent or rediscover work that the previous volunteer has already put into understanding and coding the book. Such an INPUT file encoding would have little or nothing to do with how the output will be displayed in an eventual OUTPUT file rendering.

DP already has much of this distinction in their workflow. Unfortunately, their page-by-page conventions and simplifications -- "dumbing down" for the sake of the multiple levels of volunteers -- guarantee loss of information. Not to mention that they also throw away the correctly encoded INPUT file hard-won knowledge in favor of more ambiguous OUTPUT file renderings in HTML and TXT. The end result is that both PG and DP end up being "write once" efforts that are hostile to future improvements by future volunteers -- instead of encouraging ongoing efforts to improve what we've got. Which is also indicative of a general culture of quantity, not quality.

PG pretends that part of why we do what we do is to protect and preserve books in perpetuity. This implies in exchange that information which is gratuitously thrown away during input file encoding [or directly in an output file rendering] is potentially lost for eternity. Why insist via policy that volunteer input file encoders must throw away this information?
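Just to make that concrete, here is a minimal sketch of the kind of input encoding being asked for, using standard TEI element names (the chapter text, page number and image file name are invented purely for illustration) -- one file that carries the semantic structure, the page breaks AND the line-by-line alignment, while saying nothing about how any output should look:

    <div type="chapter" n="1">
      <head>The Period</head>
      <pb n="3" facs="page-003.png"/>
      <p>It was the best of times,<lb/>
         it was the worst of times,<lb/>
         it was the age of wisdom, ...</p>
    </div>

The <pb/> and <lb/> milestones keep the page- and line-level alignment with the original scans, and a converter that only targets reflowable output can simply ignore them.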

Jim Adcock wrote:
OK. My point being that IF PG were to accept a "proper" book INPUT encoding format that preserves the hard-won knowledge of the original encoding volunteer, then there would be no need for a future volunteer to have to completely scan that encoding against the original book scans in order to make another pass looking for errors, etc.
There's a misconception here. PG *does* allow you to post additional file formats *along* with TXT and HTML. TEI comes to mind as a format perfectly suitable to preserve a lot that HTML cannot. The reason that there isn't a TEI file posted along with *every* ebook is that most PPers at DP don't care to produce one.
Both formats throw away the original volunteers' knowledge about the common parts of books: TOC, author info, pub info, copyright pages, index, chapters, etc.
TEI has elements for all these cases.
TXT files seem to me to almost always have some glyphs outside of the 8-bit char set. Unicode text files would at least overcome this limitation.
I don't see any problem here: Produce utf-8 files. The whitewashers will create some work for themselves by converting the utf-8 to all sorts of embarrassing encodings and then waste more time at the helpdesk to explain to incredulous users what `encodings´ are, but that need not be your problem. -- Marcello Perathoner webmaster@gutenberg.org

On Mon, Sep 14, 2009 at 1:47 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
PG *does* allow you to post additional file formats *along* with TXT and HTML. TEI comes to mind as format perfectly suitable to preserve a lot that HTML cannot.
The reason that there isn't a TEI file posted along with *every* ebook is that most PPers at DP don't care to produce one.
And the reason for that is not only is it a lot more work than an HTML edition, unsupported by any sort of tools, it's worthless to the end user, as apparently no one at PG can get decent output from it. -- Kie ekzistas vivo, ekzistas espero.

David Starner wrote:
The reason that there isn't a TEI file posted along with *every* ebook is that most PPers at DP don't care to produce one.
And the reason for that is not only is it a lot more work than an HTML edition, unsupported by any sort of tools, it's worthless to the end user, as apparently no one at PG can get decent output from it.
That's a lot of misinformation in such a short paragraph.

1. More work ... Of course you have to learn TEI, as you had to learn HTML. No difference there. Once you have mastered it, it is actually a lot less work, because TEI was designed for text preservation, while HTML was designed to bring scientific papers online. It is also less work because from one master you get the HTML, the TXT and the PDF. It is also less work fixing errata, because you fix the master instead of having to fix 2 or 3 different files.

2. Unsupported by tools ... PG has an implementation of TEI. I know you don't like it because you haven't figured out how to produce pretty title pages. But you don't have to use that one; there are plenty of others. TEI is being used by many projects: http://www.tei-c.org/Activities/Projects/ and has a full suite of tools: http://wiki.tei-c.org/index.php/Category:Tools

3. Worthless to the end user ... TEI is a master format. Its use is in producing formats suitable for end-user consumption. And if we don't equate end user == reader but try end user == librarian or end user == linguistic researcher, we find that TEI is many times as useful as HTML.

4. No decent output ... `Decentness´ is a matter of debate. At DP some PPers think it is essential to use every CSS feature at least once in every text, having pictures float right and left and text flowing around them, and having illuminated dropcaps and printers' ornaments and page numbers all over the place. PGTEI cannot (yet) do that. I very much prefer a simple layout, with only essential pictures smack in the middle of the text flow at the point they logically belong. A formatting that is easily ported to all existing devices. PGTEI excels at this. Ironically, `decent´ DP output is already falling to pieces on ePub devices (not even to mention Mobipocket) because ePub does not support CSS position: absolute. -- Marcello Perathoner webmaster@gutenberg.org

PG has an implementation of TEI.
How does one learn more about and/or access the "PG implementation of TEI"? I have seen PG TEI, which looks to me to add some tags to the base TEI? Again, TEI P5 is 1350 pages, which is a lot more inaccessible to volunteers than anything describing HTML tags that I have seen! I'd say the DP tagging documentation is already painful enough for most of us. I am about 100 pages into the TEI documentation, so in maybe two weeks I can tell you more about what I think about it....

http://pgtei.pglaf.org/

----- Original Message ----- From: "Jim Adcock" <jimad@msn.com> To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d@lists.pglaf.org> Sent: Tuesday, September 15, 2009 7:49 PM Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
PG has an implementation of TEI.
How does one learn more about and/or access the "PG implementation of TEI"? I have seen PG TEI, which looks to me to add some tags to the base TEI?
Again, TEI P5 is 1350 pages, which is a lot more inaccessible to volunteers than anything describing HTML tags that I have seen! I'd say the DP tagging documentation is already painful enough for most of us. I am about 100 pages into the TEI documentation, so in maybe two weeks I can tell you more about what I think about it....

Jim Adcock wrote:
How does one learn more about and/or access the "PG implementation of TEI"? I have seen PG TEI, which looks to me to add some tags to the base TEI?
It adds a very few, restricts some others, and specifies the usage of the rend attribute.
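For a flavor of what that looks like in practice, here is a small illustrative snippet using ordinary TEI Lite elements (the rend values shown are only examples; the exact rend vocabulary PGTEI accepts is defined in its own manual):

    <div type="chapter">
      <head rend="center">Chapter I</head>
      <p>He was reading <hi rend="italic">Candide</hi> when the bell rang.
         <q>Who can that be?</q> he wondered.</p>
    </div>

Structure (div, head, p, q) is tagged for what it is, and rend only carries a presentation hint that each output converter is free to honor or ignore.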
Again, TEI P5 is 1350 pages, which is a lot more inaccessible to volunteers than anything describing HTML tags that I have seen! I'd say the DP tagging documentation is already painful enough for most of us. I am about 100 pages into the TEI documentation, so in maybe two weeks I can tell you more about what I think about it....
Don't read the full TEI Guidelines. Read about TEI-Lite: http://www.tei-c.org/release/doc/tei-p5-exemplars/html/teilite.doc.html which is a lot shorter than the HTML4 specs. You don't even have to use all of TEI Lite. -- Marcello Perathoner webmaster@gutenberg.org

On Tue, Sep 15, 2009 at 5:10 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
Of course you have to learn TEI, as you had to learn HTML. No difference there.
But we know HTML, for one. We also have tools that help us with HTML, for two. For three and the strike-out, I have a host of tools that will help me edit, verify and view HTML, but there are no Debian packages for PGTEI. Yes, yes, if I want to spend my hours mucking around with stuff, I can in theory get it all installed.
PG has an implementation of TEI. I know you don't like it because you haven't figured out how to produce pretty title pages.
Note: by "pretty title pages" Marcello means a title page that looks like any title page in an actual book. Once again, I grabbed the nearest books; I have ten books, by ten different publishers, including two in Esperanto and one in a mixture of Esperanto and Chinese, and with the exception of one of the English books which right-justifies its title page, they all follow the basic format of centered pages, title (new line) author (bottom of page) publisher. None of them look a darn thing like the title pages PGTEI prints out.
and has a full suite of tools:
I see "To install the filter(s), start Open Office and and follow the Tools / XML Filter Settings menu. Choose Open Package and locate the .jar file(s)." Again, no difference at all from stuff that comes preinstalled.
3. Worthless to the end user ...
TEI is a master format. Its use is in producing formats suitable for end-user consumption.
Then prove it. If I saw a single document produced from PGTEI that was suitable for end-user consumption, I might support it. Look damnit, I was a fan of TEI until I realized that the people who were going to bring it to PG didn't give a damn about making the output something we wanted people to see.
And if we don't equate end user == reader but try end user == librarian or end user == linguistic researcher, we find that TEI is many times as useful as HTML.
The librarian is never the end-user. The librarian is the person who makes it available to the end-user. Nobody around here cares about the linguistic researcher as the end user, and we will never produce files that are marked up with the type of information--like distinguishing sentence ending punctuation from the same punctuation used other ways--that they need. The end user we're targeting is the reader.
4. No decent output ...
`Decentness´ is a matter of debate.
Which is why you blow at selling this. Until you accept that PGTEI needs to produce output that meets the standards of the people you're trying to sell it to, nobody cares.
At DP some PPers think it is essential to use every CSS feature at least once in every text, having pictures float right and left and text flowing around them and having illuminated dropcaps and printers ornaments and page numbers all over the place. PGTEI cannot (yet) do that.
I very much prefer a simple layout, with only essential pictures smack in the middle of the text flow at the point they logically belong. A formatting that is easily ported to all existing devices. PGTEI excels at this.
Yes, in fact, some PPers do want to produce an etext that replicates the original, includes the important illustrated dropcaps (that are frequently as much a part of the illustration of the book as any other illustration) and page numbers (that are crucial for much of the non-fiction that we reproduce, especially if you want to follow the web of references from one PG era book to another.)
Ironically `decent´ DP output is already falling to pieces on ePub devices (not even to mention Mobipocket) because ePub does not support CSS position: absolute.
And if you had produced TEI output that could do what people wanted to do, it's possible that we would have better output on the ePub devices. Right now, I would be surprised to find that PGTEI can output at all to ePub, and I wouldn't be surprised if the people who produced the DP output were happier with the results of their HTML translated to ePub than your HTML translated to ePub. -- Kie ekzistas vivo, ekzistas espero.

David Starner wrote:
But we know HTML, for one. We also have tools that help us with HTML, for two. For three and the strike-out, I have a host of tools that will help me edit, verify and view HTML, but there are no Debian packages for PGTEI.
Where's the debian package for guiguts? I had to actually edit the code to make it run on my debian/unstable. nxml-mode in emacs is all you'll ever need to edit and validate xml. Or use the TEI stylesheets in OpenOffice, if you must needs have WYSIWYG. Sheesh!
PG has an implementation of TEI. I know you don't like it because you haven't figured out how to produce pretty title pages.
Note: by "pretty title pages" Marcello means a title page that looks like any title page in an actual book. Once again, I grabbed the nearest books; I have ten books, by ten different publishers, including two in Esperanto and one in a mixture of Esperanto and Chinese, and with the exception of one of the English books which right-justifies its title page, they all follow the basic format of centered pages, title (new line) author (bottom of page) publisher. None of them look a darn thing like the title pages PGTEI prints out.
Ohh. Pleeeeease! Go here: http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf and tell me what you don't like about the title page. And then go here: http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-h.html to verify that it looks the same in HTML. And then go here: http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-0.txt to see how it looks in TXT. All from ONE and the same TEI master.
If I saw a single document produced from PGTEI that was suitable for end-user consumption, I might support it.
http://www.gnutenberg.de/pgtei/0.5/examples/
The librarian is never the end-user. The librarian is the person who makes it available to the end-user. Nobody around here cares about the linguistic researcher as the end user, and we will never produce files that are marked up with the type of information--like distinguishing sentence ending punctuation from the same punctuation used other ways--that they need. The end user we're targeting is the reader.
(Distinguishing punctuation is very important for typesetters.) YOU are targeting the reader that reads on a desktop browser. I am targeting everybody on every platform of every size and every software that might want to use or convert our books in any way imaginable or not yet imaginable.
Yes, in fact, some PPers do want to produce an etext that replicates the original, includes the important illustrated dropcaps (that are frequently as much a part of the illustration of the book as any other illustration) and page numbers (that are crucial for much of the non-fiction that we reproduce, especially if you want to follow the web of references from one PG era book to another.)
And while they are busy `replicating the original´ they miss all the opportunities of electronic text. E.g. the index entries are still linked to the *page* they reference, while it has been technically possible for decades now to go directly to the word. So if the reader clicks on an indexed term, she must read the whole page until she finds the reference, instead of going directly to the reference (and maybe having it highlighted like on Wikipedia). This opportunity to make the books more accessible has been missed because DP is still producing electronic facsimiles instead of electronic books. E.g. speaker tagging. In a few years, when everybody will have speech synthesis on their cell phones and ebook readers, people may want to listen to their books while driving. If you have quotes marked up you can assign different voices to different speakers. E.g. geographic tagging. While visiting someplace you may want to find all book references that refer to the place you are in. DP misses out again and again. But they make pretty facsimiles ...
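All three of those examples map onto standard TEI P5 elements. A rough sketch (the xml:id and who values, and the sample sentences, are invented for illustration):

    <!-- an index entry that targets the exact occurrence, not just the page -->
    <item><ref target="#waterloo1">Waterloo, battle of</ref></item>
    ...
    <p>The army reached <placeName xml:id="waterloo1">Waterloo</placeName> on the 17th.</p>

    <!-- dialogue tagged by speaker, usable later for speech synthesis -->
    <p><said who="#pangloss">All is for the best in the best of all possible
       worlds,</said> said Pangloss.</p>

The same placeName element doubles as the hook for geographic tagging, once coordinates or a gazetteer reference are attached to it.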
And if you had produced TEI output that could do what people wanted to do, it's possible that we would have better output on the ePub devices.
If people had started using TEI instead of griping endlessly about minor shortcomings, we might have now a complete TEI workflow in place.
Right now, I would be surprised to find that PGTEI can output at all to ePub, and I wouldn't be surprised if the people who produced the DP output were happier with the results of their HTML translated to ePub than your HTML translated to ePub.
PGTEI outputs just fine to ePub. Just take the HTML output and convert it in Calibre or whatever you are using. Look here (this is PDF, not ePub): http://www.gnutenberg.de/pgtei/0.5/examples/pgtei-pdf-sony-reader.jpg -- Marcello Perathoner webmaster@gutenberg.org

Go here:
http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf
I'd like to support what Marcello wrote. I've long believed that having TEI be the native output from Distributed Proofreaders is desirable. My understanding is they just don't have the available person-power to implement this.

When a TEI eBook is submitted to the whitewashers, we have a very nice processing stream (with the pieces Marcello mentions) to very easily produce .txt, .htm and anything else we might want. If we had enough eBooks with TEI as the native format, we could add transformation options to www.gutenberg.org's catalog pages, to truly provide "your book, your way."

There's no lack of ability to produce, transform or otherwise work with TEI files. As someone pointed out, the DP proofreading is essentially agnostic about the back-end encoding format. The postprocessors might see some variation in the workflow, but would not necessarily need to work directly with TEI markup. I think the existing software and examples are compelling. If there were an easier way of getting TEI embedded into the DP workflow, it would have happened by now. -- Greg

On Wed, Sep 16, 2009 at 1:56 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
Where's the debian package for guiguts?
There's not one, but that's why it's called in-house code and we can talk to the programmers if we need help. But there are several Debian packages for programs that can check and display HTML.
PG has an implementation of TEI. I know you don't like it because you haven't figured out how to produce pretty title pages.
Ohh. Pleeeeease!
So you attack people for having made complaints that were perfectly valid when they were made? Unless you're going for the martyr award, I hardly see how that's productive.
Eg. the index entries are still linked to the *page* they reference, while it was technically possible for decades now to go directly to the word.
If they are still linked to the page instead of the word, it's because the PPer looked at a 50 page index and decided that there was no way they were going to wade through there and try and find where on the page the link was intended to go to for 20,000 references. HTML and TEI are no different here.
Eg. geografic tagging. While visiting someplace you may want to find all book references that refer to the place you are in.
Maybe. There's a very real question whether it's worth the man-power to mark this up, and it's really a bit of a gratuitous feature.
If people had started using TEI instead of griping endlessly about minor shortcomings, we might have now a complete TEI workflow in place.
If you had listened to the needs of the people who you wanted to start using TEI instead of bitching about them and their requirements, maybe they would have started using TEI. -- Kie ekzistas vivo, ekzistas espero.

Hi There, I have looked at TEI, also at the way things SHOULD be encoded, and said NO WAY!! Far too complicated. As I have mentioned here time and time again, an output format should not be presupposed. The layout of a page is not that hard to mark up. regards Keith.

On 17.09.2009 at 07:47, David Starner wrote:
On Wed, Sep 16, 2009 at 1:56 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
Maybe. There's a very real question whether it's worth the man-power to mark this up, and it's really a bit of a gratuitous feature.
If people had started using TEI instead of griping endlessly about minor shortcomings, we might have now a complete TEI workflow in place.
If you had listened to the needs of the people who you wanted to start using TEI instead of bitching about them and their requirements, maybe they would have started using TEI.

David Starner wrote:
Eg. geografic tagging. While visiting someplace you may want to find all book references that refer to the place you are in.
Maybe. There's a very real question whether it's worth the man-power to mark this up, and it's really a bit of a gratuitous feature.
Gratuitous to people who have no vision. People who think they are `preserving´, while they are only consigning to rot on a different medium. You take a book from a dusty bookshelf, digitize it, and put it on a file server. You have taken content expressed in the technology of 500 years ago and `updated´ it to the technology of 20 years ago. Today it's all mobile devices. Ebooks have to come along in your shirt pocket or die. Wikipedia is doing it: http://en.wikipedia.org/wiki/File:Wikitude3.jpg There are many travel books in PG that could be marked up like that. -- Marcello Perathoner webmaster@gutenberg.org

I personally like Marcello's efforts pretty well, but let me accept his challenge and use his examples as examples of the problems that I *personally* find as a reader of PG texts -- that I *in reality* find with PG's current efforts -- as well as examples of the need for better input markup languages than we currently are using:
Go here:
http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf
and tell me what you don't like about the title page.
What I don't like about the title page is that it doesn't show up correctly on my choice of machine, because my choice of machine assumes the existence of spine information. Thus the "Title" shows up on my machine as "4650-pdf" and the "Author" shows up as "4650-pdf". So when I come back to my machine two weeks from now and search for this book by title, I cannot find it. And when I search for it by author, I still cannot find it.

Other than that, this PDF text, to my surprise, shows up beautifully on my machine. I would, in practice, be willing to read this text. The choice of sans-serif font looks weird, and I would like to be able to change this choice of font, but of course I can't because this is PDF. Other than that, I would be happy to read this as a book representing a good effort from PG. Further, I would be able to download this file over the airwaves while stuck waiting at an airport, for example, and read this book there. In my opinion these results well represent PG as an electronic publishing house.
And then go here:
http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-h.html
to verify that it looks the same in HTML.
I can verify that it neither looks the same nor even shows up on my choice of machine at all, because my machine doesn't support HTML as a native file format. I can, if I am lucky, access this file over the airwaves using the machine's built-in web browser while stuck waiting at the airport, but I cannot store the results as a file, because my machine doesn't support HTML as a built-in file type. So I can read it on the ground, but I probably won't be able to read it in the air, and if I use my browser to access some other web site then I will probably lose this book. [Well, I take that back -- when I actually TRY to read this file over the airwaves as described above, it crashes my machine, requiring a hard reboot.]

Assuming I am not at an airport, but rather at home with my desktop computers, I can spend about 5 minutes of my time running output-file-format to output-file-format cross-rendering software to change this HTML to MOBI format, which IS a native file format of my reader machine. The results then show up on my machine pretty beautifully. Except, since HTML lacks spine information, the Title now shows up as "4650-h" and the Author now shows up as "4650-h". Which means again, if I come back to my machine in two weeks, I will not be able to find this book.

However, other than that, I like these results -- now that I have cross-rendered HTML to MOBI. The results are attractive, I CAN change font size, the font displayed is an attractive and appropriate serif font, the pages reflow correctly, the links work for navigation, and I can switch the machine to landscape mode and everything reflows correctly, supporting the capabilities of my machine. This file format would in practice be my favorite choice of file formats for my machine -- even though I can only access it initially from my house via a desktop machine and I have to waste five minutes of my time translating output file formats. In my opinion these results well represent PG as an electronic publishing house.
And then go here:
http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-0.txt
to see how it looks in TXT.
To my surprise, I CAN take this UTF-8 TXT formatted file, transfer it to my favorite machine, and it DOES open up, correctly interpreting the UTF-8 encoding. [You learn something new every day!] This file also lacks spine information, so now the Author information shows up as "4650-0" and the Title shows up as "4650-0", which means once again, if I come back to this machine in two weeks, I will not be able to find this book.

Since this file was rendered char72 under the assumption of a fixed-pitch font, and since my machine doesn't use fixed-pitch fonts, the end result looks silly and amateurish. The "Printer's Ornament" renders as laughable junk. The fixed char72 line breaks make the text in practice unreadable unless I choose an impossibly tiny font -- which then still makes the text in practice unreadable. Gratuitous underscores are sprinkled liberally "everywhere" in the text, making the text an unreadable hash. I would not read this text if paid $100 to do so. If I paid good money for this text I would ask for double my money back. This is my least favorite file format. Further, it also lacks spine information, meaning that again the Author displays as "4650" and the Title displays as "4650", which means, again, that if I came back to this machine in two weeks I would not be able to find this book -- which in this case would be a *blessing*!

In my opinion, if I were a first-time "customer" of PG who made the mistake of choosing this file format to download to read on my brand of machine, I would conclude that PG consists of a bunch of clueless clowns and I would never return to the PG site again.

My Opinions Only -- but I would hope this illustrates how IN PRACTICE a real-world customer's opinion of PG will be filtered through the perception of their choice of reading machine -- and in turn how well whichever PG file format they happen to download matches the capabilities of their machine. And without the spine information, none of this really works well with my machine.

Just a little input about text files and charsets (encodings), since I had to deal with them for my program. Most browsers and applications open these files correctly simply because someone (mostly Mozilla) did the hard work of making a fast charset-guessing engine. I wouldn't be amazed if it failed on some books.

Also, I used the catalog information to get the title. Metadata in the file name only is not a good way to encode this information, and metadata inside the file would require a special parser everywhere.

PG catalog may be a reasonable way to get Title information. As presently implemented the PG catalog is not a reasonable source of Author Firstname, Lastname information -- for multiple reasons!

:) I had to make this loop for it (but then again I am indexing it, and not separating the names like you are; that takes a little bit more work). The code tries to find a date in the last part of the string, and then reorders all "last name, first name" authors (I wanted normal order). Note the "possibles" and "hopefully" there, but I think it works for all books I encountered. In fact the names I reported a while ago as defective were found through errors in this method. It isn't that there is no method, it's just extremely ... non-normalized.

    private final StringBuilder normalizeString = new StringBuilder();

    protected String normalizeName(String authorString) {
        int separator = authorString.lastIndexOf(','); // normal date separator
        if (separator != -1) {
            String possibleDate = authorString.substring(separator + 1);
            for (int i = 0; i < possibleDate.length(); i++) {
                if (Character.isDigit(possibleDate.charAt(i))) {
                    // a date, hopefully...
                    return exchangeNames(authorString.substring(0, separator));
                }
            }
            // no date, but change the name anyway.
            return exchangeNames(authorString);
        }
        return authorString;
    }

    protected String exchangeNames(String authorString) {
        normalizeString.setLength(0);
        exchangeNamesAux(authorString);
        return normalizeString.toString();
    }

    private void exchangeNamesAux(String authorString) {
        int seperator = authorString.indexOf(',');
        if (seperator == -1) {
            normalizeString.append(authorString);
            return;
        }
        exchangeNamesAux(authorString.substring(seperator + 2));
        normalizeString.append(' ').append(authorString.substring(0, seperator));
    }

On Thu, Sep 17, 2009 at 10:02 PM, Jim Adcock <jimad@msn.com> wrote:
PG catalog may be a reasonable way to get Title information. As presently implemented the PG catalog is not a reasonable source of Author Firstname, Lastname information -- for multiple reasons!

Correction, that is not multiple authors, but one per string + date.

BTW I found this in my catalog post-processor / indexer:

    // can have \n stupidly...
    String titleString = title.stringValue().replaceAll("\n", " ");

:)

Normally a name tuple is like this: "Last names, First name, Date". It can also be like this: "Last name, First name", or like this: "Name" (for Plato etc.). I strip out the optional date (I should probably change the deciding algorithm to 2 non-consecutive digits; basically dates always seem to have a digit there), then exchange the first and last names if needed and join them again. If you want to keep them separate you can make a domain object or a list for that. BTW I just realized I don't need the recursion at all. I might change it. It takes a good 1.5 m to index the Gutenberg index even with a lot of hacks.

Actually, a little bit of misinformation there: I do need the recursion. Titles (for instance) have an additional ",". I forgot. On Fri, Sep 18, 2009 at 1:29 AM, Paulo Levi <i30817@gmail.com> wrote:
Normally a name tuple is like this: "Last names, First name, Date". It can also be like this: "Last name, First name", or like this: "Name" (for Plato etc.)
I strip out the optional date (I should change the deciding algorithm to 2 non consecutive digits probably. Basically dates always seem to have a digit there) then exchange the first and last names if needed and join them again.
If you want to keep them separate you can make a domain object or a list for that. BTW I just realized I don't need the recursion at all. I might change it. It takes a good 1.5 m to index the Gutenberg index even with a lot of hacks.

Oh, I see the obvious error now (finally). How about a slightly different algorithm: strip out the date, then alternately take the suffix, prefix, suffix, prefix around the commas until empty.

Possibly this? BTW thanks for spotting that.

    private String normalizeName(String authorString) {
        int separator = authorString.lastIndexOf(','); // normal date separator
        if (separator != -1) {
            String possibleDate = authorString.substring(separator + 1);
            for (int i = 0; i < possibleDate.length(); i++) {
                if (Character.isDigit(possibleDate.charAt(i))) {
                    // a date, hopefully...
                    return exchangeNames(authorString.substring(0, separator));
                }
            }
            // no date, but change the name anyway.
            return exchangeNames(authorString);
        }
        return authorString;
    }

    private String exchangeNames(String authorString) {
        normalizeString.setLength(0);
        exchangeNamesAuxSuffix(authorString);
        return normalizeString.toString();
    }

    private void exchangeNamesAuxSuffix(String authorString) {
        int seperator = authorString.lastIndexOf(',');
        if (seperator == -1) {
            normalizeString.append(authorString);
            return;
        }
        normalizeString.append(authorString.substring(seperator + 2)).append(' ');
        exchangeNamesAuxPrefix(authorString.substring(0, seperator));
    }

    private void exchangeNamesAuxPrefix(String authorString) {
        int seperator = authorString.indexOf(',');
        if (seperator == -1) {
            normalizeString.append(authorString);
            return;
        }
        normalizeString.append(authorString.substring(0, seperator)).append(' ');
        exchangeNamesAuxSuffix(authorString.substring(seperator + 2));
    }

Sorry Paulo, I'm not sure what you are up to, but again, what do your algorithms actually find when applied to the author name examples I presented earlier?

Sun Tzu
Miguel de Cervantes
Marquis de Sade

Sun Tzu apparently doesn't exist (it's probably under the original name; searching for Art of War gives Sunzi as one of the names).
Miguel de Cervantes -> Miguel de Cervantes Saavedra
Marquis de Sade -> marquis de Sade (marquis is lowercase for some reason in the index).

This is not applied to the names you gave themselves, but as they appear on the index. Marquis de Sade for instance appears on the index as : "Sade, marquis de, 1740-1814".

Duh, still wrong. Wait a second, I will sort it out.

You're right, it is inconsistent for some (apparently only titled) authors. For example, one name, title broken up:

La Rochejaquelein, Marie-Louise-Victoire, marquise de, 1772-1857

versus two names, title intact (correct apparently, since it is consistent with most of the rest of the names):

Disraeli, Benjamin, Earl of Beaconsfield, 1804-1881

No way to recognize whether it should be plain LIFO order or something else.

There is a separate cataloger's mailing list if you are interested in further discussion with the people who are editing the catalog. However, it might help if I tell you that most of the author headings follow the form used at the Library of Congress. And _they_ follow rules and vagaries that have built up over many decades. I can tell you with certainty that you will not be able to prepare a process which will give you 100% good results. --Andrew On Fri, 18 Sep 2009, Paulo Levi wrote:
You're right, it is inconsistent in some (apparently only titled) authors. For example: 1 name, title broken up. La Rochejaquelein, Marie-Louise-Victoire, marquise de, 1772-1857

I hope you have figured out my point by now: Namely, IF one wants to make "correct" e-book files in a number of formats, including EPUB and MOBI, it is not possible algorithmically to determine the "correct" encoding of Author Lastname, Firstname from data currently found in either the PG HTML encodings or the PG TXT encodings. It is also not possible to make "correct" encodings of Author Lastname, Firstname from the information currently recorded in the PG catalog. One would like to have "correct" encodings of Author Lastname, Firstname so that if a customer adds a PG text in, say, EPUB or MOBI to their existing collection of e-book titles in their e-book library, the Author Lastname, Firstname sorts and displays correctly next to any other e-books they might already possess from other sources.

Sun Tzu: Sun is the author's family name, or what is represented as an author's "Lastname" in western cultures. Tzu is a romanization of an honorific such as "Sir" or "Mr": Sun Tzu 孫子; Sūn Zǐ. This is listed in a westernized, corrupted form in the PG catalog as "Sunzi", which shows a lack of cultural respect -- combining the family name with the honorific in a way that artificially forms an apparent feminine. However, I believe the transcriber needs to transcribe the book as written, including the spelling or representation of the author name found there, which means that the book transcription in HTML or PG TXT cannot be used as a reliable source of the author name -- nor should the spelling given in the transcription necessarily be how the author is listed in the PG catalog. Nor is it algorithmically possible to figure out which part therein is the "last name [family name]". So, in addition to the coding in the HTML or the PG TXT, there also needs to be a "spine" representation that gives a correct canonical identification of the author, "Lastname: Sun Firstname: Tzu", where again Tzu isn't really the first name, but by tradition this slot gets used for the part of the canonical author name representation which isn't the last name. "Art of War" is also known simply as "The Sun Tzu."

Miguel de Cervantes: The last name of the author is actually most often canonically represented as "Cervantes Saavedra", with the "firstname" part typically represented as "Miguel de". Saavedra being the mother's last name, in a culture where children bear their mother's name, but when the book is sold in other cultures that are uncomfortable with this convention the Saavedra tends to get dropped -- but it shouldn't be, because it IS the author's last name.

Marquis de Sade: Last name of author = Sade. The first name part is "Donatien Alphonse François". But by tradition customers are probably expecting the firstname part to be represented as "Marquis de" -- they almost certainly will not recognize "Donatien Alphonse François". So it's not really clear how the firstname part ought to be coded, but if the lastname part is coded as Sade then at least the book will show up at about the right place in the possessor's library listing.

Again, the point being that neither the PG catalog nor the literal transcription can be used as a reliable source of the author Lastname, Firstname information -- which DOES need to be reliably included in the e-book file so that the e-book will show up at the correct location in the customer's e-book library sort.
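For what it's worth, this is exactly the distinction the EPUB package metadata already draws: the display form of the name and the sort ("file-as") form travel separately. A rough sketch of the relevant OPF fragment (the file-as values follow the canonical forms argued for above; they are illustrative, not anything PG currently emits):

    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
              xmlns:opf="http://www.idpf.org/2007/opf">
      <dc:title>The Art of War</dc:title>
      <dc:creator opf:role="aut" opf:file-as="Sun, Tzu">Sun Tzu</dc:creator>
    </metadata>

A reader device sorts its library on the file-as value while still displaying the name as printed, which is the "spine" information the current TXT and HTML submissions have nowhere to put.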

I think that with the few examples you have given, you have shown that it is not possible to do so with _any_ library catalog, because the usage of names has so many variations and exceptions. --Andrew On Thu, 17 Sep 2009, James Adcock wrote:
I hope you have figured out my point by now:
Namely, IF one wants to make "correct" e-book files in a number of formats, including EPUB and MOBI, it is not possible algorithmically to determine the "correct" encoding of Author Lastname, Firstname from data currently found in either the PG HTML encodings nor the PG TXT encodings.

Are you talking about reading the files directly from gutenberg.org? Files are served up with the encoding specified in the http header. I don't know the technical details--Marcello set it all up. --Andrew On Thu, 17 Sep 2009, Paulo Levi wrote:
Just a little input about text files and charsets (encodings), since I had to deal with them for my program. Most browsers and applications open these files correctly simply because someone (mostly Mozilla) did the hard work of making a fast charset-guessing engine. I wouldn't be amazed if it failed on some books.

TEI comes to mind as format perfectly suitable to preserve a lot that HTML cannot.
Um, this standard is 1350 pages long. Tell me again why I should be reading it? I want to code books -- not the Sistine Chapel.
I don't see any problem here: Produce utf-8 files.
But that would still leave all the other problems with txt files. And the reason we are required to produce txt is to support those with teletypewriters. Rhetorically speaking, why not just produce txt files as bad as one can and still get away with it, and hope that someday soon both Gut readers and Gut content producers will see the light and give txt up as long gone dead?

On Mon, Sep 14, 2009 at 5:34 PM, Jim Adcock <jimad@msn.com> wrote:
I don't see any problem here: Produce utf-8 files.
But that would still leave all the other problems with txt files. And the reason we are required to produce txt is to support those with teletypewriters.
So? UTF-8 works just fine when viewed in a UTF-8 xterm, and can be translated on the fly by many programs. -- Kie ekzistas vivo, ekzistas espero.

----- Original Message ----- From: "Jim Adcock" <jimad@msn.com> To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d@lists.pglaf.org> Sent: Monday, September 14, 2009 2:34 PM Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
TEI comes to mind as format perfectly suitable to preserve a lot that HTML cannot.
Um, this standard is 1350 pages long. Tell me again why I should be reading it? I want to code books -- not the Sistine Chapel.
Check out http://pgtei.pglaf.org/. Marcello's PG-TEI manual is <200 pages. There's also TEI-Lite at http://www.tei-c.org/Guidelines/Customization/Lite/.
I don't see any problem here: Produce utf-8 files.
But that would still leave all the other problems with txt files. And the reason we are required to produce txt is to support those with teletypewriters. Rhetorically speaking, why not just produce txt files as bad as one can and still get away with it, and hope that someday soon both Gut readers and Gut content producers will see the light and give txt up as long gone dead?
Text will never be dead. It's portable to all platforms, doesn't need a browser or a PDF-like reader, only the most basic editor. In modern terms, it's the stem cell of ebook files--all else can be generated from it. Maybe with greater or lesser prettiness, but as long as you get the words, who cares what the quote marks look like?

....but as long as you get the words, who cares what the quote marks look like?
There are a lot of texts where you cannot "get" the words from just the words. There are also texts with quotes within quotes, where if you don't care what the quote marks look like _you cannot read it!_

Certainly a text like Tristram Shandy demonstrates there are books which are NOT just about the words -- where, rather, the artistry of representing words on paper -- including careful choice of fonts, puncs, etc. -- is a central part of the artistry, as one can easily see by comparing a bad publication of this work to a good one! The good publications represent the work of the artist; the bad ones clearly do not. And a txt representation would be just so many chicken scratchings in the mud.

I'm sure there are many here who would say "but I don't like Tristram Shandy" -- and that would be my point. By bringing a prejudice to the table that only texts worth representing in txt are worth representing, you prejudice what books PG is allowed to preserve, and you censor the choice of artists that others are permitted to preserve. You represent some artists, and consign the others to oblivion.

On Mon, Sep 14, 2009 at 11:31 PM, Jim Adcock <jimad@msn.com> wrote:
Certainly a text like Tristram Shandy demonstrates there are books which are NOT just about the words -- where rather, the artistry of representing word on paper -- including careful choice of fonts, puncs, etc. is a central part of the artistry -- as one can easily see by comparing a bad publication of this work to a good one!
Those sculptors who choose to work in ice are rarely remembered well by later ages. Sculptors who work in iron and bronze can easily be remembered for several millennia. The choice is the artist's. -- Kie ekzistas vivo, ekzistas espero.

Those sculptors who choose to work in ice are rarely remembered well by later ages. Sculptors who work in iron and bronze can easily be remembered for several millennia. The choice is the artist's.
Except when what we are talking about is transcribers scratching other artists' works into mud tablets with (at best) a pointy stick.

On Tue, Sep 15, 2009 at 6:05 PM, Jim Adcock <jimad@msn.com> wrote:
Those sculptors who choose to work in ice are rarely remembered well by later ages. Sculptors who work in iron and bronze can easily be remembered for several millennia. The choice is the artist's.
Except when what we are talking about is transcribers scratching other artists' works into mud tablets with (at best) a pointy stick.
Which is exactly what happened to Gilgamesh. I suppose the author should have thrown a temper tantrum and demanded it be written only on the finest silk, in which case we wouldn't have a copy. -- Kie ekzistas vivo, ekzistas espero.

----- Original Message ----- From: "Jim Adcock" <jimad@msn.com> To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d@lists.pglaf.org> Sent: Monday, September 14, 2009 8:31 PM Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
....but as long as you get the words, who cares what the quote marks look like?
There are a lot of texts where you cannot "get" the words from just the words. There are also texts with quotes within quotes, where if you don't care what the quote marks look like _you cannot read it!_
I think I, and any other followers of this thread, will need an example of "not getting the words from the words". I've seen any number of instances of nested quotes, mostly nested doublequotes, lots of triple-nested double-single-double quotes, and some triple-nested single-double-single quotes (mostly in British-published books) and I have yet to encounter any that I couldn't read, either in the original source or when they've been etexted.
Certainly a text like Tristram Shandy demonstrates there are books which are NOT just about the words -- where, rather, the artistry of representing words on paper -- including careful choice of fonts, puncs, etc. -- is a central part of the artistry -- as one can easily see by comparing a bad publication of this work to a good one! The good publications represent the work of the artist; the bad ones clearly do not. And a txt representation would be just so many chicken scratchings in the mud.
I've looked at PG's text and HTML version of Shandy, and several PDFed scansets in Internet Archive. Unless I'm missing something, they all look like standard prose to me. If you've got an edition as difficult to transcribe as you seem to indicate, and it's not in Internet Archive, you should scan it, and if you have no interest in producing it yourself, upload the zipped scanset via FTP to PG (I can give exact instructions to you privately). As long as it's clearable, it may be possible to arrange for it to go into PG's Preprints page where it'll be available as a project for someone.
I'm sure there are many here who would say "but I don't like Tristram Shandy" -- and that would be my point. By bringing to the table the prejudice that the only texts worth representing are those that txt can represent, you prejudice which books PG is allowed to preserve, and you censor which artists others are permitted to preserve. You represent some artists, and consign the others to oblivion.
Personally, I'm book-agnostic--as long as it's in English, a book is a book is a book. I would assume that those who produce books for PG in other languages feel the same way about books in those languages. Distributed Proofreaders, at least once, has produced a book in a language none of its proofers understood (#27120).

Hi Everybody, I will step in here for a moment. As Bowerbird has mentioned, this discussion is as old as PG itself. The problems are:
1) Plain Vanilla Text cannot reproduce books (it is not meant to).
2) PG does NOT have a comprehensive format for reproducing books.
3) PG has not evolved with modern computer technology.
4) Everybody wants their pet format for reading.
5) PG does not have a consolidated following willing to build the resources needed to solve the above.
There are many and various reasons for the above problems. Yes, there ARE and have been efforts to solve the above. Yet none of these have borne much fruit or have been able to satisfy the needs of all contributors and users. So what is needed:
1) A single modular and extensible format for encoding the books
   a) the structures in the book (text) need to be represented
   b) it does not presume a particular output format
   c) it does not care about the size of files
   d) it does not need to be easily readable
2) a parser for creating output formats
   a) uses all information to create the best possible output for a particular format
3) an editor
   a) displays the book
   b) allows for changes in the representation of the book
   c) must be modular and extensible
4) a parser for creating the representation of the book in the format from scans
   a) must be modular and extensible
   b) must be multi-pass
   c) flags possible conflicts with the format
   d) intelligent enough to do most markup by itself
   e) intelligent enough to correct common errors by itself
5) parsers for converting older formats
   a) all of 4)
   b) does not expect particular information
   c) allows for presets in order to save time and get the desired representation
6) a proofing workflow
So what do we have? We need a format that is not based on an existing format, and is modular and extensible. Either we start from scratch or use a generic format. SGML or XML come to mind. We can then put in what we want and need, have a well-structured format, can extend it easily, and it is modular. Plus, XML can handle all kinds of information and data. Yes, we have to reinvent the wheel for markup, but we want a representation that contains as much information as possible. The question would be how much is needed. At the least, the markup will be a layout format. It should only take about a month to create such a format. The other parts will take a little longer. The important thing is that everything has to be centered around the representation format and not the output. The output is handled by parsers. Whether a particular output format can handle or represent a particular feature need not be a concern of the PG internal representation. The developers of the output format can convert it to whatever they deem fittest. regards Keith.
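For concreteness, here is a minimal Python sketch of items 1) and 2) above -- an XML internal representation and one trivial parser that emits plain text from it. The element names (pgbook, front, chapter, verse, line, pb) are invented purely for illustration; they are not an agreed PG or DP vocabulary, and a real format would need far more:

    # Sketch only: the element names are invented for illustration,
    # not an agreed Project Gutenberg vocabulary.
    import xml.etree.ElementTree as ET

    SAMPLE = """\
    <pgbook>
      <front><title>Hamlet</title><author>William Shakespeare</author></front>
      <body>
        <chapter n="1">
          <pb n="17"/>
          <verse>
            <line>To be, or not to be,--that is the question:--</line>
            <line>Whether 'tis nobler in the mind to suffer</line>
          </verse>
        </chapter>
      </body>
    </pgbook>
    """

    def to_plain_text(xml_string):
        """One possible output pass: emit plain text, keeping verse line
        breaks and simply dropping the page-break markers."""
        root = ET.fromstring(xml_string)
        out = []
        for chapter in root.iter("chapter"):
            out.append("CHAPTER " + chapter.get("n", ""))
            for line in chapter.iter("line"):
                out.append(line.text or "")
            out.append("")
        return "\n".join(out)

    print(to_plain_text(SAMPLE))

Other output passes (HTML, MOBI, and so on) would walk the same tree but make different rendering choices; the internal representation itself stays untouched.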

Keith J. Schultz wrote:
We need a format that is not based on an existing format, ...
Why not?
... but we want a representation that contains as much information as possible. It should only take about a month to create such a format.
ROTFL -- Marcello Perathoner webmaster@gutenberg.org

As an example, I just tried auto-magically unwrapping some PG txt because I don't like the char count per line choices forced by PG and the assumed size of the txt display that PG assumes -- which is NOT the size of MY txt display. This is what then ended up being displayed on MY choice of txt display, once I applied the txt unwrapping algorithm: Ham. To be, or not to be,--that is the question:-- Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune Or to take arms against a sea of troubles, And by opposing end them? --To die,--to sleep,--No more; and by a sleep to say we end The heartache, and the thousand natural shocks That flesh is heir to,--'tis a consummation Devoutly to be wish'd. To die,--to sleep;-- To sleep! perchance to dream:--ay, there's the rub; For in that sleep of death what dreams may come, When we have shuffled off this mortal coil, Must give us pause: there's the respect That makes calamity of so long life; For who would bear the whips and scorns of time, The oppressor's wrong, the proud man's contumely, The pangs of despis'd love, the law's delay, The insolence of office, and the spurns That patient merit of the unworthy takes, When he himself might his quietus make With a bare bodkin? who would these fardels bear, To grunt and sweat under a weary life, But that the dread of something after death,-- The undiscover'd country, from whose bourn No traveller returns, --puzzles the will, And makes us rather bear those ills we have Than fly to others that we know not of? Thus conscience does make cowards of us all; And thus the native hue of resolution Is sicklied o'er with the pale cast of thought; And enterprises of great pith and moment, With this regard, their currents turn awry, And lose the name of action.--Soft you now! The fair Ophelia! --Nymph, in thy orisons Be all my sins remember'd. Now maybe to some of you -- you consider this result to be a good thing, an acceptable thing, a thing that well-represents the considerable efforts of the PG volunteers. But personally, I do not think so.
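For reference, the sort of unwrapping that produces the result above takes only a few lines of code. Below is a sketch of the naive algorithm (treat a blank line as a paragraph break, join everything else with spaces); it is an assumption about the general approach, not the particular tool used here:

    # Naive unwrap: a blank line is a paragraph break; everything else is
    # joined with single spaces. Applied to verse, this flattens the line
    # structure -- which is exactly the failure shown above.
    def naive_unwrap(text):
        paragraphs, current = [], []
        for line in text.splitlines():
            if line.strip():
                current.append(line.strip())
            elif current:
                paragraphs.append(" ".join(current))
                current = []
        if current:
            paragraphs.append(" ".join(current))
        return "\n\n".join(paragraphs)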

It's clearly stated in PG Volunteers' FAQ V.89 (http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_an...) that if you want to prevent unwanted wrapping, lines that should not be wrapped should be indented a space or two. In PG's older etexts, which predate this standard, the technique was used only sporadically. However, whenever an older text is cleaned up and reposted, it *is* applied where necessary. ----- Original Message ----- From: "Jim Adcock" <jimad@msn.com> To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d@lists.pglaf.org> Sent: Tuesday, September 15, 2009 7:56 PM Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
As an example, I just tried auto-magically unwrapping some PG txt because I don't like the char count per line choices forced by PG and the assumed size of the txt display that PG assumes -- which is NOT the size of MY txt display. This is what then ended up being displayed on MY choice of txt display, once I applied the txt unwrapping algorithm:
Ham. To be, or not to be,--that is the question:-- Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune Or to take arms against a sea of troubles, And by opposing end them? --To die,--to sleep,--No more; and by a sleep to say we end The heartache, and the thousand natural shocks That flesh is heir to,--'tis a consummation Devoutly to be wish'd. To die,--to sleep;-- To sleep! perchance to dream:--ay, there's the rub; For in that sleep of death what dreams may come, When we have shuffled off this mortal coil, Must give us pause: there's the respect That makes calamity of so long life; For who would bear the whips and scorns of time, The oppressor's wrong, the proud man's contumely, The pangs of despis'd love, the law's delay, The insolence of office, and the spurns That patient merit of the unworthy takes, When he himself might his quietus make With a bare bodkin? who would these fardels bear, To grunt and sweat under a weary life, But that the dread of something after death,-- The undiscover'd country, from whose bourn No traveller returns, --puzzles the will, And makes us rather bear those ills we have Than fly to others that we know not of? Thus conscience does make cowards of us all; And thus the native hue of resolution Is sicklied o'er with the pale cast of thought; And enterprises of great pith and moment, With this regard, their currents turn awry, And lose the name of action.--Soft you now! The fair Ophelia! --Nymph, in thy orisons Be all my sins remember'd.
Now maybe to some of you -- you consider this result to be a good thing, an acceptable thing, a thing that well-represents the considerable efforts of the PG volunteers.
But personally, I do not think so.

Al Haines (shaw) wrote:
It's clearly stated in PG Volunteers' FAQ V.89 (http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_an...)
that if you want to prevent unwanted wrapping, lines that should not be wrapped should be indented a space or two.
This "markup" does not distinguish between poetry and a block quote. A block quote should be indented *and* rewrapped. And the Rewrap Blues is only part of the problem ... Another formidable challenge is to recover the chapter headings and other headings to make them stand out and to build a TOC. -- Marcello Perathoner webmaster@gutenberg.org
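To make the limitation concrete, here is a sketch of the FAQ V.89 convention under the simplest possible reading: lines indented by a space or two are left alone, everything else is unwrapped and refilled. The function name and details are invented for illustration; the point is that an indented block quote and an indented poem look identical to such code, so only one of them can come out right:

    import textwrap

    def rewrap(text, width=72):
        # Lines indented a space or two are preserved verbatim; all other
        # paragraphs are unwrapped and refilled to the target width. The
        # rule is purely typographic, so poetry and block quotes cannot be
        # told apart.
        out = []
        for para in text.split("\n\n"):
            lines = para.splitlines()
            if lines and all(l.startswith(" ") for l in lines if l):
                out.append(para)
            else:
                out.append(textwrap.fill(" ".join(l.strip() for l in lines), width))
        return "\n\n".join(out)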

DP produces TEI text. But very few post processors take to the TEI route. Why? The software Guiguts automatically converts the formatted text to html. You need not know much about HTML. The html output only needs to be tweaked at times. Even that is not necessary in all cases. Even with this scenario there has been a reluctance on the part of many post processors to do an html version. DP does not insist on an html version. But most of the Project Managers do insist on an html version. Even then there are DP projects which are posted only in the text format. For TEI to become popular we need software which would automatically convert the TEI text to a final TEI version. Is it possible? I saw some software here. How good is it? http://www.tei-c.org/Talks/Forli/2006/conversion.xml On Wed, Sep 16, 2009 at 2:49 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
Al Haines (shaw) wrote:
It's clearly stated in PG Volunteers' FAQ V.89 (
http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_an...)
that if you want to prevent unwanted wrapping, lines that should not be wrapped should be indented a space or two.
This "markup" does not distinguish between poetry and a block quote. A block quote should be indented *and* rewrapped.
And the Rewrap Blues is only part of the problem ...
Another formidable challenge is to recover the chapter headings and other headings to make them stand out and to build a TOC.
-- Marcello Perathoner webmaster@gutenberg.org
-- Sankar Service to Humanity is Service to God

People use guiguts because it produces HTML, but mainly because it includes gutcheck, aspell, wordcount routines, and integrates display of the text and of the image corresponding to the text cursor position. The only route to have more TEI submissions is to have a version of guiguts producing TEI instead of HTML. And of course improve the automatic conversion from TEI to HTML. Carlo

Carlo Traverso wrote:
People use guiguts because it produces HTML, but mainly because it includes gutcheck, aspell, wordcount routines, and integrates display of the text and of the image corresponding to the text cursor position. The only route to have more TEI submissions is to have a version of guiguts producing TEI instead of HTML.
That should be trivial.
And of course improve the automatic conversion from TEI to HTML
In my copious free time ... -- Marcello Perathoner webmaster@gutenberg.org

On Wed, Sep 16, 2009 at 8:40 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
Carlo Traverso wrote:
And of course improve the automatic conversion from TEI to HTML
In my copious free time ...
Stop ranting about what others are doing in their free time, then. -- Kie ekzistas vivo, ekzistas espero.

On 16.09.2009 at 11:19, Marcello Perathoner wrote:
Al Haines (shaw) wrote:
It's clearly stated in PG Volunteers' FAQ V.89 (http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_an... ) that if you want to prevent unwanted wrapping, lines that should not be wrapped should be indented a space or two.
This "markup" does not distinguish between poetry and a block quote. A block quote should be indented *and* rewrapped.
It depends on what is considered desirable.
And the Rewrap Blues is only part of the problem ...
Another formidable challenge is to recover the chapter headings and other headings to make them stand out and to build a TOC.
I have to disagree here. Any fourth grader can do it. There are certain rules which one can follow. It will not handle all possible cases, yet it will handle most. But then again, that is what proofers can handle easily. regards Keith.

Hi There, On 15.09.2009 at 11:20, Marcello Perathoner wrote:
Keith J. Schultz wrote:
We need a format that is not based on an existing format, ...
Why not?
Very simply: most formats have a particular output in mind! Furthermore, they are far too complex. The idea is to mark up the book text in a way that lets us extract its structure and features. Then, depending on the target, the output format is created.
... but we want a representation that contains as much information as possible. It should only take about a month to create such a format.
ROTFL
I said to create such a format. I did not say create the tools for creating output formats. Which is the actual crux, if you have been trying to follow this thread. Also, you need tools for getting the text from the scans into this format, which should be done mostly by a computer in order to save time. regards Keith.

...So what is needed... Yes, except I don't think it's as bad as you make it out to be. TEI and/or PG-TEI could be a good intermediate formal file format. DP markup [and conventions] could be a good preliminary editing markup format. Editing doesn't necessarily need to be WYSIWYG. Input formatted files don't have to be perfect since they are living documents, as opposed to current "write once" output formatted files. Conversion from an input format file to output rendering formats such as txt or html or the various other reflow formats doesn't have to be perfect -- as long as the input-format-to-output-format rendering software does more work than the current tools for the job -- which is basically none. You probably have to store CSS or some other representation of style choices to help reconstruct how the original volunteers chose to render the input file format to the output rendering file format. [Where I am assuming here that html is simply being used as an output rendering file format, so that we don't have to argue anymore about the "correct" semantic use of html -- we would say that the semantics are being represented in the input file format, not in the html.] Again, this is all trying to address at least three problems:
1) How do you represent the author's intention without deliberately throwing away information?
2) How do you make the files submitted by volunteers be "living documents" rather than "write once" documents -- which other volunteers can pick up and improve on in the future without having to go back to original scans and rework the work "from scratch"?
3) How do you support as best as possible the various output rendering file formats most appropriate for various reader devices? -- of which PG *already* "officially" recognizes literally about 80 different output file formats of differing complexities!

As an example of how much author semantic information is lost going from an author's writing to PG txt format, I went and compared differences between a recent HTML and PG TXT I did -- where after doing the TXT encoding I went back and did three more passes over the images to add back in semantic differences to the HTML that the PG TXT didn't represent. Now the reality would be that it would take say TEI not HTML to represent all of the author's intent. But measuring the loss going from HTML back to TXT gives an order of magnitude estimate of how much author information we are throwing away by representing a work in PG TXT. In the case of this book, the answer was more than 1000 "losses" -- or an average of about 3 losses per page. And this is NOT counting about an additional 1000 losses in representation of emphasis. Now, let's say we have a PG TXT and some volunteer in the future wants to go back from that txt and say as correctly as possible represent that text using PDF. How many "errors" does that volunteer need to correctly find where the TXT file loses author's semantic information by carefully comparing the page images to the PG TXT file, reintroducing information known to the original volunteer transcribers, but discarded as not being representable in PG TXT? The answer is that this volunteer has to find and fix the txt in literally about 2000 places. Want to place a bet on how many of those 2000 places the volunteer trying to create an accurate PDF file is actually going to "catch" ??? I can tell you in my efforts going from PG TXT to HTML in the first place it's a good part of a week's work -- not to imply *I* caught them all either!
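A rough, mechanical way to reproduce a number of this order is to count the markup features in the HTML that a plain-text version has nowhere to put. The sketch below is an assumption about how such a count could be made; the tag list is a guess at the sort of features involved, not the actual methodology used above:

    import re

    # Assumed, not exhaustive: emphasis, small-caps spans, anchors/page refs.
    LOSSY_TAGS = ["i", "em", "b", "strong", "span", "a"]

    def count_losses(html):
        """Count opening tags carrying information plain text cannot hold."""
        return sum(len(re.findall(r"<%s\b" % tag, html, re.IGNORECASE))
                   for tag in LOSSY_TAGS)

    # e.g. count_losses(open("book.html", encoding="utf-8").read())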

The PG texts are produced by Volunteers -- individual producers and the post-processors of DP. A text file is only the minimum requirement stipulated by PG. It is up to the independent producers and the post-processors of DP to decide in what formats the book should be submitted. PG has no control over the format submitted. The White Washers check the files and post them. TEI is not popular either with the independent producers or the post processors. We could discuss this till the cows come home. But the solution is in the hands of the independent producers and post processors of DP. On Wed, Sep 16, 2009 at 11:42 AM, James Adcock <jimad@msn.com> wrote:
As an example of how much author semantic information is lost going from an author's writing to PG txt format, I went and compared differences between a recent HTML and PG TXT I did -- where after doing the TXT encoding I went back and did three more passes over the images to add back in semantic differences to the HTML that the PG TXT didn't represent.
Now the reality would be that it would take say TEI not HTML to represent all of the author's intent. But measuring the loss going from HTML back to TXT gives an order of magnitude estimate of how much author information we are throwing away by representing a work in PG TXT. In the case of this book, the answer was more than 1000 "losses" -- or an average of about 3 losses per page. And this is NOT counting about an additional 1000 losses in representation of emphasis.
Now, let's say we have a PG TXT and some volunteer in the future wants to go back from that txt and say as correctly as possible represent that text using PDF. How many "errors" does that volunteer need to correctly find where the TXT file loses author's semantic information by carefully comparing the page images to the PG TXT file, reintroducing information known to the original volunteer transcribers, but discarded as not being representable in PG TXT? The answer is that this volunteer has to find and fix the txt in literally about 2000 places. Want to place a bet on how many of those 2000 places the volunteer trying to create an accurate PDF file is actually going to "catch" ??? I can tell you in my efforts going from PG TXT to HTML in the first place it's a good part of a week's work -- not to imply *I* caught them all either!
-- Sankar Service to Humanity is Service to God

PG has no control over the format submitted.
Nonsense. I have tried submitting “other things” and have been told repeatedly that the “minimum requirements in practice of PG” is that a TXT and an HTML file be submitted, and that these two files pass through a large number of fitness tests required by PG, which in practice includes restrictions on the choice of char sets used in the internal rep of the TXT and of the HTML file. So, in fact PG DOES have control over the format submitted, and the way PG asserts that control is by refusing to accept submission of formats and details of those formats that they choose not to support. As a simple counter-example of the above claim “PG has no control over the format submitted” note that personally I would much rather be submitting TXT files which do not correspond to the PG requirements of including a gratuitous line wrap every 72 chars. Or if I am required to submit TXT files with line wraps I would much prefer to retain the line wraps of the original text, because it is a royal pain for some future volunteer to have to “fix” the position of line wraps back to the original text in order to do additional processing of the text file in the future, for example because they want to find and include additional semantic information that can be found in the original page scans, but not in the TXT. And in practice it is impossible to do this visual analysis unless one matches line breaks to the original page scans – as DP well knows. Another example from a couple years ago is I asked PG how I could submit MOBI formatted texts of books they already had in other formats. I was told that I was not allowed to do so. So I set up an independent website to distribute PG books in MOBI format to my friends in the MOBI community -- retaining the PG licenses and legalese conditions. Now, as hoped for, some years later PG has decided to support MOBI after all – at least to some extent. But: what a pain! Why is this important to me? Well, I happen to like classes of reader machines that the internal mechanizers of PG do not like. PG likes big teletype like display machines, capable of displaying more than 72 chars per line. [Your standard PC or Mac still remains fundamentally a teletype emulator] And PG likes tiny machines with extremely limited displays, also known as cell phones. I personally do not like either of those classes of machines, but rather machines that are middle sized – small enough that I can pick them up and easily read them while lying in bed late at night for example. But large enough that I can understand in context the ebb-and-flow of what the author wrote in some surrounding context. With these middle-sized machines issues of text reflow become a central issue in the pleasure (or lack thereof) of being able to use the machine. And yes there are quite a number of tools one can use to help “fix” at least partially “broken” texts re these machines, including Calibre and say Mobipocket Creator. But I’d rather not have to “fix” a text each time before I can read it. And I’d rather it not be “broken” in the first place. So, in summary, as a “volunteer” am I free to do what I want? Yes, certainly – but not if I want any of my efforts to ever show up on any PG website! As Bowerbird is only too happy to point out: “Please feel free to go somewhere else!”

Hi There, On 16.09.2009 at 08:12, James Adcock wrote:
As an example of how much author semantic information is lost going from an author's writing to PG txt format, I went and compared differences between a recent HTML and PG TXT I did -- where after doing the TXT encoding I went back and did three more passes over the images to add back in semantic differences to the HTML that the PG TXT didn't represent.
The problem is that there are very few systems that truly represent semantic content. In order to truly represent such information you have to know about it. This requires one to have additional information, which is known as "world knowledge". This information is provided by the reader of books.
Now the reality would be that it would take say TEI not HTML to represent all of the author's intent. But measuring the loss going from HTML back to TXT gives an order of magnitude estimate of how much author information we are throwing away by representing a work in PG TXT. In the case of this book, the answer was more than 1000 "losses" -- or an average of about 3 losses per page. And this is NOT counting about an additional 1000 losses in representation of emphasis.
This problem is a matter of complexity. That is, even in pure Vanilla Text one can represent these intentions, but one loses readability. Furthermore, one has to make assumptions about the true intent of the author!! regards Keith

....Furthermore, one has to make assumptions about the true intent of the author!!
I'm not sure what the problem is if one has an <i> tag to indicate the author's intent was rendered in the original book in italic</i> and a <sc> tag to indicate the author's intent in the original book was rendered in small-caps</sc> etc.? On the contrary, the assumptions have to be made when the input markup language and the output rendering file formats are required to be one-and-the-same AND the rendering file format's power is less than that used by real-world printers already 400 years ago. Then the markup transcriber is forced to interpret the author's intent and decide how to compromise that intent in order to make it fit within the constraints of the rendering language -- which is being artificially constrained to be identical to the input markup language. If one had an input markup language that closely follows the author's intent as rendered by the original printer, then the problem becomes how you reduce the strength of this markup to match the weaknesses of the output rendering file format, and that in general is an issue of style that can be represented in CSS, for example. Or hacked up by hand if and when absolutely necessary. But it still means that the previous round of volunteers' efforts is correctly and completely maintained in the input markup language text, so that the next round of volunteers can take another shot at the text some time in the future.
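Here is a minimal sketch of that "reduce the strength of the markup" step. The tag names and fallbacks are invented for illustration (an input <i> or <sc> mapped onto a CSS-classed span for HTML output, or onto a plain-text convention when the output format is weaker):

    # Illustrative only: map a rich input tag onto whatever the output
    # format can manage -- a CSS class for HTML, a typographic convention
    # (underscores, ALL CAPS) for plain text.
    FALLBACKS = {
        "i":  {"html": '<span class="it">{0}</span>', "txt": "_{0}_"},
        "sc": {"html": '<span class="sc">{0}</span>', "txt": "{1}"},
    }

    def downgrade(tag, text, target):
        # {0} is the text as-is, {1} is the upper-cased fallback.
        return FALLBACKS[tag][target].format(text, text.upper())

    print(downgrade("sc", "Ophelia", "txt"))   # OPHELIA
    print(downgrade("i", "really", "html"))    # <span class="it">really</span>

The point is that the downgrading lives in one replaceable table, while the input markup keeps the full information for the next volunteer.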

I think I, and any other followers of this thread, will need an example of "not getting the words from the words".
Okay, let's go over a number of simple examples: Consider Michael's thesis of the "goodness" of viewing PG texts on cellphones. Which is a "good" submission format for submitting a transcription of Shakespeare to be read on a cellphone, PG txt format, or HTML? Answer: Neither file format works worth a dang for specifying Shakespeare to be read on cellphones. Yet both file formats contain the lists of the words. -- Even 400 years ago authors understood the importance of formatting and printing decisions to represent the meanings of words -- artistic writings ARE NOT just lists of words -- even when those words are clearly intended to be spoken out loud. Here's a brief excerpt from Dove: "'Go?'" he wondered. "Go when, go where?" And another one: She particularly likes you. Yes, you can read these words and you will assign meanings to these words but you will not get the author's intent because the author understood that he needed to put additional information in the printing so that you can understand his intent. This is particularly important in the Henry James because what he is writing is deliberately ambiguous and confusing in the first place, so much so that he has to disambiguate in order to reduce the degree of ambiguity in what he is writing -- while still deliberately leaving the reader dazed and confused -- but not so confused as to think (incorrectly) that they understand what is going on. I guess I can put some txt representation of Tristram Shandy here, but what would be the point? He's gone! said my uncle Toby Where? Who? cried my father My nephew, said my uncle Toby What, without leave, without money, without governor? cried my father in amazement No he is dead, my dear brother, quoth my uncle Toby Without being ill? cried my father again I dare say not, said my uncle Toby, in a low voice, and fetching a deep sigh from the bottom of his heart; he has been ill enough, poor lad! I'll answer for him, for he is dead. Yes, once again, you can read the words and you will assign meaning to them -- but not the meaning intended by the author, because the txt is missing information that the author found important to include so that you can understand his meaning -- to the extent that he wanted you to understand his meaning which again was partial in the first place. I'm not saying that there is no place in the world for txt -- as archy demonstrated clearly back in 1916: expression is the need of my soul And you can read this entire email and still come back and complain that you don't understand what I am talking about and in making this complaint you once again make my point for me: The reason that you don't understand what I am talking about is that I am writing this email using txt and the authors given as examples above were writing in a style requiring representation richer than mere PG txt. Go and find the author's original representations and read them there because PG txt simply doesn't cut it to represent their work. Read what they wrote and ask yourself what it takes to actually implement the author's intent, either automagically, or even semiautomagically, on a variety of differing reader devices -- including, but not limited to -- teletypes and their software emulators [which is essentially what txt devices are, including this email system and notepad, etc PS: If you can read this email at all please note that it's because I *didn't* write it following PG txt conventions.

I have to agree with large parts of what James Adcock says. A lot of it depends on the medium (media in fact), the message, and so on. When I write (without any interest in whether I should be writing or not, or whether anyone cares) there is considerable rumination, not to mention bellyaching, about punctuation, font, typeface, formatting and so on, in fact practically anything that could be done in more than one way. The fundamental thing is information. Alternative ways of representing the information require information to convey them, and offer opportunities for conveying the information. Well-conveyed information is in that respect at least, beautiful. The reason that most of my presentations are fairly spare is that most of what I have to say is fairly directly factual. Conspicuous headings, distinct tables of contents, and clear meanings are usually enough for my purposes because I am no artist. The reason I struggle with punctuation is that I have my own rules, and bugger the grammarians. My rules are: If the punctuation doesn't matter, leave it out. If it changes the meaning, it does matter. Put it in. If it does not really change the meaning, but the reader needs to read something twice to make sense of it, adapt the punctuation, the sentence structure, or even the wording, to provide unconscious, one-pass parsing. If omitting (or inserting) logically unnecessary punctuation is likely to distract or confuse the reader, then don't or do, as the case might be. Know something about common conventions and their significances so that you have some idea of what to flaunt and what to flout. That about somes it up, sum of it anyway. If that is how simple it is, why is it so complicated? Because I am lousy at noticing when I violate those rules. Many people are not that that is more of an excuse than an explanation. Recently I helped I think a friend with a book that he had written in German and translated into English. The book was a straightforward work of philosophy, so it should have been easy. Unfortunately, though he is literate and intelligent, he had absent-mindedly retained a lot of the German commas. It rendered reading of the book such hard work that I could not read it in bulk. I was doing double-takes every few sentences, which was more than was needed to ruin my concentration and wreck my attempts to remain coherently aware of the thread of significance. A sign of mine being a lesser intellect according to Whitehead? Definitely, but remember not only that the average intellect is less than lesser, but what is worse, it is less lesser than half the population. The lesser is who you are writing for more or lesser always. Consider "It is a long tail, certainly,' said Alice, looking down with wonder at the Mouse's tail; 'but why do you call it sad?' And she kept on puzzling about it while the Mouse was speaking, so that her idea of the tale was something like this: 'Fury said to a mouse, That he met in the house, "Let us both go to law: I will prosecute you. --Come, I'll take no denial; We must have a trial: For really this morning I've nothing to do." Said the mouse to the cur, "Such a trial, dear Sir, With no jury or judge, would be wasting our breath." "I'll be judge, I'll be jury," Said cunning old Fury: "I'll try the whole cause, and condemn you to death."' 'You are not attending!' said the Mouse to Alice severely. 'What are you thinking of?' 'I beg your pardon,' said Alice very humbly: 'you had got to the fifth bend, I think?' 'I had NOT!' 
cried the Mouse, sharply and very angrily." Then again, pace archy, how about something like: "Wenn hinter fliegen fliegen fliegen fliegen fliegen fliegen hinternach" or "smith who when jones had had had had had had had had had had had the judgement of the examiners in his favour" Or which would fit the writer's intention better: "You would be the lad for that." or "You would be the lad for that." or "You would be the lad for that." How many ways with more or less distinct meanings could one place the emphasis in "Two twenty-buck tickets for her show I should buy"? If anyone does not believe that punctuation matters, try reading "Eats shoots and leaves" by Lynne Truss. (If you haven't read it anyway, do yourself a favour and read it anyway.) Now all that is great fun, compared to waiting for Godot with a hangover in a hot public lavatory at the terminus of a diesel trucking company in Houston, but if you actually wish to write (or convey someone else's writing) with efficiency and with respect for the information, the author, and the reader, then you will use all the channels of information that the medium (media, funiculi, funicula ) that assist without increasing the noise to signal ratio. The fact that some authors don't need or want it is irrelevant. The right amount is what works best, and if he wants nothing, that is the right amount. It does not follow that it is the right amount elsewhere. The fact that your reader can get no end of fun out of Joyce without punctuation, does not mean that the same must apply for figurate verse or calligraphic works. The medium is rarely the message unless the message is about the medium, unless you are in one of the bottom-feeding niches, or a great artist, but to gird at more powerful notations because less can be made to do, for some people, mostly, with some exertion, is poorly persuasive, let alone cogent. I never did like Gertrude Stein. I don't know when where or how anyone will come up with the generally universally and perfect notation. (I know when I think they will, but that is another story.) All I ask is that they please make it something that can be read with a vanilla text reader, no instruction manual, and some patience, even if the proper markup interpreter on a great audiovisual system or tiny cellphone can give a mind-blasting performance. Personally I would like it to start with the vanilla text and punctuation and have the markups follow as an appendix, to be ignored when unwanted or not understood. Patience upon a rock Smiling at grief Because she is wearing ear plugs. (Of course?) In my case I am privileged because I am not dependent on pure txt. If necessary I can convert PDF, though I have never learned its internal format. Cheers, Jon

Al Haines (shaw) wrote:
Text will never be dead. It's portable to all platforms, doesn't need a browser or a PDF-like reader, only the most basic editor.
It's not portable to cellphones. While every modern cellphone comes with a browser I have never seen one with an editor. -- Marcello Perathoner webmaster@gutenberg.org

Hi, On 15.09.2009 at 10:01, Marcello Perathoner wrote:
Al Haines (shaw) wrote:
Text will never be dead. It's portable to all platforms, doesn't need a browser or a PDF-like reader, only the most basic editor.
True text will never go away! Yet the way it is represented will change. It also depends on how you define TEXT.
It's not portable to cellphones.
Strange? I get text messages all the time ;-))
While every modern cellphone comes with a browser I have never seen one with an editor.
I have an editor on mine. iPhones and Blackberries have them. Of course they are not modern. ;-)) regards Keith.
participants (13)
- Al Haines (shaw)
- Andrew Sly
- Bowerbird@aol.com
- David Starner
- Greg Newby
- James Adcock
- Jim Adcock
- Jon Richfield
- Keith J. Schultz
- Marcello Perathoner
- Paulo Levi
- Sankar Viswanathan
- traverso@posso.dm.unipi.it