
I read Stevan White's recent appends about improving PG epubs with a certain amount of wry amusement. Not at Stevan's append, which I thoroughly agree with, but at the fact that I wrote a fairly similar one about 18 months ago, which was more or less completely ignored by the PG decision makers. (And I am sure I wasn't the first person to do this, as BB kindly pointed out to me at the time.)

Since that time I've come to realise that PG basically isn't interested in technology, but in books, and preserving them into the digital future. The key PG technology policy, which seems to have originated from the late and much-lamented Michael S. Hart, is that the best way to do this is to require that all books should wherever possible be made available in a 'lowest common denominator' plain text format so that the books would be as universally accessible as possible.

Remembering back to the 1960's, when I started programming, I can understand why that was a pretty sensible decision at the time, although as someone who started out writing in EBCDIC, I probably have less faith in US ASCII as a lingua franca than many others do. What I find a lot harder to understand is why the rule still stands today, when there is a much more universally known lingua franca called (X)HTML in which, with a certain amount of standardisation, it is possible to encode important parts of the book that can't easily be done with plain text.

Basically PG has failed to provide leadership in setting standards for HTML encoding of books, and the result is that though, as Stevan said, the books are great, in reality anyone wanting to create an epub for their reading device cannot rely on either the PG HTML or the generated epubs. It seems to me that this is a major strategic failure of PG, one which, if it were a commercial organisation, would by now have put it out of business.

How did this happen? Did PG really DECIDE to limit itself to being largely an ASCII text mine/museum, or did it fail to change?

Bob Gibbins

On Mon, Sep 17, 2012 at 08:06:39PM +0100, Robert Gibbins wrote:
I read Stevan White's recent appends about improving PG epubs with a certain amount of wry amusement.
Not at Stevan's append, which I thoroughly agree with, but at the fact that I wrote a fairly similar one about 18 months ago, which was more or less completely ignored by the PG decision makers. (And I am sure I wasn't the first person to do this, as BB kindly pointed out to me at the time.)
Your starting premise is that PG mainly focuses on ASCII, which is incorrect.

I don't remember your proposal specifically, but most proposals for improvements actually involve one or two things:

1) Telling people what they can no longer do (i.e., limiting choices), or
2) Finding "someone" to do something -- write software, create standards, develop policy, etc.

The usual answer from me, and from Michael, is that we will fully support and encourage your effort. It's never clear to me why this isn't satisfying.

BTW, I am in the midst of planning a "Gutenberg Summit" for next spring, and anyone with an interest in PG will be invited. Stay tuned for details... maybe there will be some like-minded ideas that achieve critical mass, instead of feeling like a lone voice in the digital wilderness.

-- Greg
Since that time I've come to realise that PG basically isn't interested in technology, but in books, and preserving them into the digital future.
The key PG technology policy, which seems to have originated from the late and much-lamented Michael S. Hart, is that the best way to do this is to require that all books should wherever possible be made available in a 'lowest common denominator' plain text format so that the books would be as universally accessible as possible.
Remembering back to the 1960's, when I started programming, I can understand why that was a pretty sensible decision at the time, although as someone who started out writing in EBCDIC, I probably have less faith in US ASCII as a lingua franca than many others do.
What I find a lot harder to understand is why the rule still stands today, when there is a much more universally known lingua franca called (X)HTML in which, with a certain amount of standardisation, it is possible to encode important parts of the book that can't easily be done with plain text.
Basically PG has failed to provide leadership in setting standards for HTML encoding of books, and the result is that though, as Stevan said, the books are great, in reality anyone wanting to create an epub for their reading device cannot rely on either the PG HTML or the generated epubs.
It seems to me that this is a major strategic failure of PG, one which, if it were a commercial organisation, would by now have put it out of business.
How did this happen? Did PG really DECIDE to limit itself to being largely an ASCII text mine/museum, or did it fail to change?
Bob Gibbins
Dr. Gregory B. Newby
Chief Executive and Director
Project Gutenberg Literary Archive Foundation
www.gutenberg.org
A 501(c)(3) not-for-profit organization with EIN 64-6221541
gbnewby@pglaf.org

I really don't understand the problem, Greg.

1) Let's say someone finds a scanno bug in one of PG's books. What happens? It gets fixed.

2) Let's say someone else finds a formatting bug in one of PG's books. What happens? A: It is virtually impossible to get it fixed, and if it does get fixed it is only because one of a small chosen few decide that they as "High Priesthood" want to fix it -- and often their fixes are worse than the problem.

So, what's the difference between these two problems? 1) gets fixed, 2) does not. 1) happens rarely in PG books, 2) HAPPENS ALL THE TIME!

On 9/18/2012 10:37 AM, Greg Newby wrote:
I don't remember your proposal specifically, but most proposals for improvements actually involve one or two things:
1) Telling people what they can no longer do (i.e., limiting choices), or
2) Finding "someone" to do something -- write software, create standards, develop policy, etc.
The usual answer from me, and from Michael, is that we will fully support and encourage your effort. It's never clear to me why this isn't satisfying.
A little over 40 years ago, ABC aired what was apparently its first animated Movie of the Week: The Point. This story is of a kingdom where everyone has a pointed head, except for a young boy named Oblio. The evil count declares that because Oblio does not have a pointed head he is an outlaw, and he is banished to the Pointless Forest, where nothing has a point. During his adventures in the Pointless Forest, Oblio meets a man who is completely covered in arrows, pointing every which way. When asked how this Pointed Man could exist in the supposedly Pointless Forest, he replies, "A point in every direction is the same as no point at all."

Project Gutenberg is pointless.

The Gutenberg newsletters have consistently advertised that Project Gutenberg e-texts are "Readable By Both Humans and Computers." This assertion is, in fact, untrue. Computer programs require predictable data in order to process them. Because they conform to no known standard, Project Gutenberg e-texts are not useable by computers, only by humans; despite being converted to ASCII, PG e-texts are just one small step removed from their paper antecedents, and given today's computer processing power they have no advantage over displaying raw page scans, à la Internet Archive or Google Books.

Setting standards is not about limiting choices, nor is it about limiting opportunity: it is simply about transparency and truth in labeling. If a text is submitted that meets a certain standard, and it is labeled as meeting that standard, that does not mean that texts that do /not/ meet the standard cannot be stored and served; it just means that they should not be labeled as meeting the standard.

When I say that Project Gutenberg e-texts adhere to no known standard, I do not mean to say that /all/ e-texts are devoid of standards adherence; I am simply saying that there is no mechanism to determine whether a text adheres to a standard, or what that standard is. Thus, there is no effective way to build a tool chain to automate the conversion of Project Gutenberg texts to other formats, because there is no way to predict the starting point for any given text.

It is never clear to /me/ why some people think that establishing standards implies limiting choice or opportunity. That is a straw man argument, designed, I can only imagine, to obscure the fact that the majority of the Project Gutenberg corpus is simply hopelessly outdated and of very limited value.
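To make the tool-chain point concrete, here is a minimal sketch in Python of how an automated converter could dispatch on a declared format label. Everything in it is hypothetical: the "X-PG-Profile" header, the profile names, and the converter table are illustrative assumptions, not anything PG files actually contain.

    # Hypothetical sketch -- not part of any actual PG tooling.
    # If every e-text declared which profile it conforms to, a converter
    # could dispatch on that label; without a label, automation stops.
    import re
    from pathlib import Path

    # Assumed label format, purely for illustration.
    LABEL_RE = re.compile(r"^X-PG-Profile:\s*(?P<profile>[\w.-]+)", re.MULTILINE)

    # Hypothetical converters keyed by profile name.
    CONVERTERS = {
        "pg-xhtml-1.0": lambda text: "<converted from well-formed XHTML to EPUB>",
        "pg-plain-1.0": lambda text: "<plain text wrapped in minimal XHTML>",
    }

    def convert(path: Path) -> str:
        text = path.read_text(encoding="utf-8", errors="replace")
        match = LABEL_RE.search(text[:2000])  # look for a declared profile near the top
        if match is None:
            # Lee's point: with no declared standard, the pipeline cannot
            # know what it is looking at and must fall back to a human.
            raise ValueError(f"{path}: no declared profile; manual inspection needed")
        profile = match.group("profile")
        if profile not in CONVERTERS:
            raise ValueError(f"{path}: unknown profile {profile!r}")
        return CONVERTERS[profile](text)

Nothing here is difficult; the point is only that the dispatch table has to key on something, and today there is nothing reliable to key on.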

On 9/18/2012 10:37 AM, Greg Newby wrote:
I don't remember your proposal specifically, but most proposals for improvements actually involve one or two things:
1) Telling people what they can no longer do (i.e., limiting choices), or
This is complete stuff and nonsense. Many of us have offered to help fix the existing problems in the existing corpus and have been told by the WWers and the Webmaster that fixing problems in the existing corpus is somehow going to make PG go to hell in a handbasket.

So, contrary to what you are saying, Greg, it is *PG* who is telling people what they are not allowed to do. *PG* is telling volunteers they are not allowed to fix problems in the existing corpus. That right is being reserved exclusively by the WWers and the Webmaster -- except that they are in reality simply being the "dog guarding the straw", and the problems in the existing corpus ARE NOT being fixed.

Hi All, Lee,

On 25.09.2012, at 20:02, Lee Passey <lee@novomail.net> wrote:
Computer programs require predictable data in order to process them. Because they conform to no known standard, Project Gutenberg e-texts are not useable by computers, only by humans; [...]
This is not quite true. Furthermore, it is an old paradigm. Programs can be written to handle "unpredictable data" (unexpected is a better term). They can also be written to learn how to handle such unexpected input.

I do admit that there are very few involved with PG who can do such programming, and the programs used are not designed that way.

regards
Keith.
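For what it is worth, here is a minimal sketch of the kind of heuristic handling Keith describes: guessing chapter headings in an unlabeled plain-text file. The patterns are illustrative assumptions about common layouts, not rules that PG texts are known to follow, which is exactly why the results are guesses rather than guarantees.

    # Illustrative heuristics only -- not a description of how PG texts are laid out.
    import re

    HEADING_PATTERNS = [
        re.compile(r"^\s*CHAPTER\s+[IVXLC\d]+\.?\s*$", re.IGNORECASE),
        re.compile(r"^\s*[IVXLC]+\.\s+\S"),  # e.g. "IV. The Storm"
    ]

    def guess_headings(text: str) -> list[tuple[int, str]]:
        """Return (line number, line) pairs for lines that look like chapter headings."""
        hits = []
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(p.match(line) for p in HEADING_PATTERNS):
                hits.append((lineno, line.strip()))
        return hits

    sample = "PREFACE\n\nCHAPTER I.\n\nIt was a dark night.\n\nCHAPTER II.\n"
    print(guess_headings(sample))  # -> [(3, 'CHAPTER I.'), (7, 'CHAPTER II.')]

A tool chain could run heuristics like these first and fall back to hand correction where they fail; heuristic handling and declared standards are complementary rather than opposed.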
participants (5)

- Greg Newby
- James Adcock
- Keith J. Schultz
- Lee Passey
- Robert Gibbins