procedures for contributions, forum for questions and ideas

Hi, I have several questions.
I got a nice e-book reader, and became interested in the Gutenberg collection. The collection itself is great! Some of the e-books are almost unusable on my reader, mostly due to unfortunate automatic formatting. So... I took the e-books apart and re-formatted them. Then I tested them in several e-reader environments. I think they're very nice now. I would like to share these improvements, preferably through the Gutenberg project.
** Does the project accept these kinds of improvements to text?
I see some talk of old and new editions in the books.
** Would such changes constitute new editions?
** What are the preferred procedures?
** Is this a good forum for general technical and policy questions regarding Gutenberg e-books?
Thanks!

Please note that I agree 100% or more with your suggestions and with what you want to do.
** Does the project accept these kinds of improvements to text?
No, unfortunately.
** I see some talk of old and new editions in the books. ** Would such changes constitute new editions?
Yes and no. You are not allowed to make repairs to damaged formatting decisions in existing editions unless you are one of the few "chosen" PG insiders, in which case you get to party as you like -- and most of that partying just makes the situation worse. It is not permitted, however, for the great unwashed masses to perform the same kind of partying, because, god forbid, you might make the situation better and that is not permitted. However, many of the old editions come from unfortunate choices of original edition of "donor" text, or they come from unspecified, and typically quite corrupt, unknown sources. In this case you can make your own, better, version of these texts and submit them, such that Project Gutenberg will have a clean, uncorrupted version with known provenance, and with high-quality formatting which actually works with most ebook readers. This will take you a lot of time and energy if you don't have specialized tools to do these "versioning" efforts, and will still take you a fair amount of time and effort even if you do have the necessary "versioning" tools.
In exchange, when you try to submit your efforts you will find that the PG "whitewashers" will complain about how you are wasting their time making them double-check a submission for a book "they already have" [albeit in a highly corrupted form without provenance, but they like that better], and if and when you get the whitewashers to accept your submission, the webmaster has a system of "negatively advertising" your new improved "correct" version -- with the formatting that actually works and with real provenance -- such that 99%+ of the real world users will unknowingly choose the old corrupt version of the book with the formatting which doesn't work.
** What are the preferred procedures?
I can't answer that question because whatever I do is by definition "wrong" ;-)
** Is this a good forum for general technical and policy questions regarding Gutenberg e-books?
This is it. Unfortunately, long history has shown that PG [and DP] are more hide-bound than Andersen's "Old House", such that any such suggestions generate much heat and noise and no action.

Hi James, On Thu, Sep 13, 2012 at 8:38 PM, James Adcock <jimad@msn.com> wrote:
Please note that I agree 100% or more with your suggestions and with what you want to do.
** Does the project accept these kinds of improvements to text?
No, unfortunately.
** I see some talk of old and new editions in the books. ** Would such changes constitute new editions?
Yes and no. You are not allowed to make repairs to damaged formatting decisions in existing editions unless you are one of the few "chosen" PG insiders, in which case you get to party as you like -- and most of that partying just makes the situation worse. It is not permitted, however, for the great unwashed masses to perform the same kind of partying, because, god forbid, you might make the situation better and that is not permitted. However, many of the old editions come from unfortunate choices of original edition of "donor" text, or they come from unspecified, and typically quite corrupt, unknown sources. In this case you can make your own, better, version of these texts and submit them, such that Project Gutenberg will have a clean, uncorrupted version with known provenance, and with high-quality formatting which actually works with most ebook readers. This will take you a lot of time and energy if you don't have specialized tools to do these "versioning" efforts, and will still take you a fair amount of time and effort even if you do have the necessary "versioning" tools. In exchange, when you try to submit your efforts you will find that the PG "whitewashers" will complain about how you are wasting their time making them double-check a submission for a book "they already have" [albeit in a highly corrupted form without provenance, but they like that better], and if and when you get the whitewashers to accept your submission, the webmaster has a system of "negatively advertising" your new improved "correct" version -- with the formatting that actually works and with real provenance -- such that 99%+ of the real world users will unknowingly choose the old corrupt version of the book with the formatting which doesn't work.
** What are the preferred procedures?
I can't answer that question because whatever I do is by definition "wrong" ;-)
** Is this a good forum for general technical and policy questions regarding Gutenberg e-books?
This is it. Unfortunately long history has shown the PG [and DP] are more hide-bound than Andersen's "Old House" such that any such suggestions generate much heat and noise and no action.
Well, I hear you. I hope your experience is at one end of the spectrum (and not the other).
But as I've observed in so many other projects and jobs, the problem ain't the hardware, and it ain't the software, and it ain't the firmware and it ain't the middleware. It's some other ware.
I have mostly looked at some of the older books in the collection (for no particular reason). There, the most egregious problem is that the book was transcribed into ASCII, which inevitably results in some degree of data loss -- sometimes very unfortunate losses. Some human intervention would be necessary to repair that, but there are ways to reduce the amount of intervention required.
I'm cognizant of the necessity to keep a master copy, and build other formats from that. It's not clear to me what the preferred format of the master copy is, or should be.
I am quite sure that very legible and attractive books could be automatically generated with a suitable base format. There are all sorts of technical possibilities -- again, this is not the problem.
Besides that, I think some direction and clarity could be helpful regarding modern formats. Example: it seems that many e-books are formatted to retain the ASCII look-and-feel, at least regarding the Gutenberg project text. This is always pointlessly ugly, and is often totally illegible. A little thought, and some examples, clearly documented somewhere, would go a long way.
Cheers!

The core "trouble" of Project Gutenberg is its massive size. With over 40,000 ebooks, produced over several decades by countless volunteers, holding to any standard will be hard. Even a large and well-funded organization would have trouble setting consistent standards for such a large and diverse collection of texts.
You can basically propose any master format, any quality standard, or whatever else, for the Project Gutenberg collection, but unless you are willing and able to contribute significantly to your own proposals, it won't get you anywhere.
Over a period of about 15 years, I've contributed over 500 ebooks to Project Gutenberg, and even as a single individual, I have trouble maintaining that bulk. Even though I have the luxury of working with a decent master format (TEI, which from a strictly technical point of view is the best choice, but suffers from an extremely steep learning curve, which makes it a bridge too far for most volunteers) and a tool-set tuned to my requirements (my tei2html scripts), I am reaching a point where just maintaining that sub-collection (fixing errors, improving tagging of early texts, etc.) is becoming a considerable task. (And that is while I can regenerate HTML and ePub files with a few keypresses, and have everything in a revision control system.)
What I would currently like to see most to move Project Gutenberg forward are (in order of priority):
1. A decent issue tracking system that can cope with the number of books we have, so readers can report possible issues with ebooks.
2. A revision control system (preferably distributed) that can handle the massive size of the PG collection, so we can keep track of what is going on.
3. An integrated production environment, a kind of PGDP 2.0, to help streamline the production of new "text-based" (as opposed to scans only) ebooks.
Only after those things are in place can we work towards a suitable master format, and a publishing pipeline that can produce desired output formats, such as HTML, ePub, etc. from that.
I've been looking around for suitable tools to make this possible, but probably nothing that is currently available can handle this, so it will require a considerable effort. Take, for example, the size of the collection. My own 500+ books are stored in a Bazaar repository of about 4 gigabytes. Scaling that up a hundred times to hold PG's 40,000 books will result in half a terabyte of controlled data; several orders of magnitude larger than the largest open source code projects I know of.
Jeroen Hellingman
On 2012-09-14 22:37, Steve White wrote:
But as I've observed in so many other projects and jobs, the problem ain't the hardware, and it ain't the software, and it ain't the firmware and it ain't the middleware. It's some other ware.
I have mostly looked at some of the older books in the collection (for no particular reason). There, the most egregious problem is that the book was transcribed into ASCII, which inevitably results in some degree of data loss -- sometimes very unfortunate losses. Some human intervention would be necessary to repair that, but there are ways to reduce the amount of intervention required.
I'm cognizant of the necessity to keep a master copy, and build other formats from that. It's not clear to me what the preferred format of the master copy is, or should be.
I am quite sure that very legible and attractive books could be automatically generated with a suitable base format. There are all sorts of technical possibilities -- again, this is not the problem.
Besides that, I think some direction and clarity could be helpful regarding modern formats. Example: it seems that many e-books are formatted to retain the ASCII look-and-feel, at least regarding the Gutenberg project text. This is always pointlessly ugly, and is often totally illegible. A little thought, and some examples, clearly documented somewhere, would go a long way.
Cheers! _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d
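The size estimate in the message above (about 4 gigabytes for 500+ books, scaled to 40,000 books) can be checked with a few lines of arithmetic. This is only a rough sketch: the function name is made up for illustration, and it assumes repository size grows linearly with the number of books, which the message does not state.

```python
# Back-of-the-envelope check of the repository-size estimate.
# The figures come from the message; linear scaling is an assumption.

def projected_size_gb(sample_books: int, sample_size_gb: float, total_books: int) -> float:
    """Linearly extrapolate repository size from a sample to the whole collection."""
    per_book_gb = sample_size_gb / sample_books
    return per_book_gb * total_books

# 40,000 / 500 is a factor of 80; the message rounds this up to "a hundred times".
exact_gb = projected_size_gb(500, 4.0, 40_000)
rounded_gb = 4.0 * 100

print(f"{exact_gb:.0f} GB (factor 80) to {rounded_gb:.0f} GB (factor 100)")
```

Either way the result is a few hundred gigabytes, the same order as the half a terabyte quoted.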

Hi Jeroen,
I have looked at TEI, and believe me, the learning curve is not the problem. It is simply too ambiguous. Then there are, as you mentioned, all the rules about how to do things.
The problems you mention also show its technical inability. Though that may be my honest opinion.
regards Keith.
On 14.09.2012 at 23:28, Jeroen Hellingman <jeroen@bohol.ph> wrote:
Over a period of about 15 years, I've contributed over 500 ebooks to Project Gutenberg, and even as a single individual, I have trouble maintaining that bulk -- and even while I have the luxury of working with a decent master format, (TEI, which from a strictly technical point of view is the best choice, but suffers from an extremely steep learning curve, which makes it a bridge-too-far for most volunteers) and tuned-to-my-requirement tool-set (my tei2html scripts), I am reaching a point that just maintaining that sub-collection (fixing errors, improving tagging of early texts, etc.) gets a considerable task. (And then I can regenerate HTML and ePub files with a few keypresses, and have everything in a revision control system)

Hi Keith,
TEI is not just a bunch of tags, similar to HTML, which is fairly limited; TEI has evolved into a kind of meta-standard, describing how you can encode a wide range of text. Finding your way in that, tweaking it to your own set of requirements, getting tools to produce something neat (ePub, anybody?) out of it, etc. will take you considerable time. I think once you've coded a few types of books in TEI. Everybody who does TEI does a tiny subset, including me...
The technical inability is not something we can blame TEI for; it is the difficulty of the underlying problem -- finding a common denominator to describe any type of book produced since the first clay tablets were produced 5000 years ago. That problem, by its very nature, is extremely hard.
Jeroen.
On 2012-09-16 11:02, Keith J. Schultz wrote:
Hi Jeroen,
I have looked at TEI, and believe me, the learning curve is not the problem. It is simply too ambiguous. Then there are, as you mentioned, all the rules about how to do things.
The problems you mention, also show its technical inability. Though that may be my honest opinion.
regards Keith.

...finding a common denominator to describe any type of book produced since the first clay tablets were produced 5000 years ago. That problem, by its very nature, is extremely hard.
Most formatting of most books on PG is "dirt simple" and yet more often than not is done terribly wrong.

On 2012-09-20 06:40, James Adcock wrote:
...finding a common denominator to describe any type of book produced since the first clay tablets were produced 5000 years ago. That problem, by its very nature, is extremely hard.
Most formatting of most books on PG is "dirt simple" and yet more often than not is done terribly wrong.
I know this, but I have two observations:
1. Most books can be done for 95% with a mere 5% of what is in the TEI standard, but the remaining 5% of each book needs another 5% of that standard, and for each book, the exact 5% required is something different. For one book, I may need the verse tags, for another tables, for another drama, and so on... After 500 books, you'll have covered big parts of TEI -- and will have required tooling that can deal with all of it.
2. Many contributors are volunteers, who even have trouble getting HTML and plain typography right (given the number of HTML books with "straight" quotes, etc.)
Now I am not claiming my books are good (that would require a lot more individual tweaking, and I do rely on what my scripts produce). However, I think "terribly wrong" is over the top.
Also, nothing prevents you from picking up a terribly wrong book and improving it, then resubmitting it to PG with some explanation of what you've done. Normally, when I am made aware of an issue in one of the books I've produced, I pick it up myself, fix the issue, regenerate the book with the current version of my scripts, and resubmit it.
Jeroen.

On Thu, Sep 20, 2012 at 12:53 AM, Jeroen Hellingman <jeroen@bohol.ph> wrote:
2. Many contributors are volunteers, who even have trouble getting HTML and plain typography right (given the number of HTML books with "straight" quotes, etc.)
There's a lot you can complain about in our output. But straight quotes? Seriously? For historical reasons, adding curved quotes is a pain in the ass that, done properly, involves someone looking at every single quote in a book -- and if you don't want it done properly, a program can do it for you instantaneously. There's a very real question of whether this tedious procedure adds more value to the output than other things editors could do.
I found * on Perseus pretty quickly, so apparently even non-volunteers have trouble getting "plain typography" correct.
* http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.02.0001%3...
-- Kie ekzistas vivo, ekzistas espero.

On 9/20/2012 5:15 PM, David Starner wrote:
There's a lot you can complain about our output. But straight quotes? Seriously?
Yes. Seriously. In the vast majority of PG works, probably between 98 and 100 percent of quotation marks and apostrophes could be reliably corrected without the need for human eyeballing. The lack of proper typography in this area is painful.

In the vast majority of PG works, probably between 98 and 100 percent of quotation marks and apostrophes could be reliably corrected without the need for human eyeballing.
www.freekindlebooks.org has a large number of books that were "98% to 100%" "corrected" from straights to curlys "automatically." You can judge for yourself whether this is a good thing or not -- when "automatic" algorithms go bad -- and they will go bad -- the results can be very ugly. [freekindlebooks.org predates that point in time when PG offered real ebooks -- in fact the whole point of freekindlebooks.org was to try to shame PG into offering real ebooks.]

I don't doubt that things may have gone wrong in some instances. But I didn't notice any such errors in the few works on freekindlebooks.org that I quickly looked through, so I can't evaluate what exactly might have caused the problem or how it might, if possible, be corrected. Are there particular instances you might be able to point me to? Regards, Mark
In the vast majority of PG works, probably between 98 and 100 percent of quotation marks and apostrophes could be reliably corrected without the need for human eyeballing.
www.freekindlebooks.org has a large number of books that were "98% to 100%" "corrected" from straights to curlys "automatically."
You can judge for yourself whether this is a good thing or not -- when "automatic" algorithms go bad -- and they will go bad -- the results can be very ugly.
[freekindlebooks.org predates that point in time when PG offered real ebooks -- in fact the whole point of freekindlebooks.org was to try to shame PG into offering real ebooks.]

I don't doubt that things may have gone wrong in some instances. But I didn't notice any such errors in the few works on freekindlebooks.org that I quickly looked through, so I can't evaluate what exactly might have caused the problem or how it might, if possible, be corrected. Are there particular instances you might be able to point me to?
Sorry, it's been too long, I don't remember where. But when it goes bad, it goes very bad. I think I remember, among other things, that this algorithm (embedded in GutenMark) goes horribly wrong on running long quotes (as do most algorithms). Also, unrelated, it screws the pooch needlessly on left single quotes -- it appears to try to implement them using "ASCII" U+0060. At the time I was using GutenMark to "quick and dirty" get to HTML, when PG was still offering some texts txt-only.

On Mon, Sep 24, 2012 at 8:02 AM, Mark Swofford <mark@romanization.com> wrote:
Yes. Seriously.
In the vast majority of PG works, probably between 98 and 100 percent of quotation marks and apostrophes could be reliably corrected without the need for human eyeballing.
The problem is, if they can be automatically corrected, then they aren't carrying an information load. It's exactly those cases where we have nested quotes and "'an't 'ee, ya' 'ee?" that they carry information load, and that's exactly where the program will get it wrong, and that's exactly where readers will slam into a brick wall and have to stop and wonder whether the curved quotes or their reading is wrong.
The lack of proper typography in this area is painful.
Shrug. This is the 21st century; a huge amount of the text you read, whether from email or texts or Wikipedia and most other webpages, is not going to use curved quotes. If it's really painful, then you have a problem.
-- Kie ekzistas vivo, ekzistas espero.

Hi David,
On 25.09.2012 at 00:37, David Starner <prosfilaes@gmail.com> wrote:
On Mon, Sep 24, 2012 at 8:02 AM, Mark Swofford <mark@romanization.com> wrote:
Yes. Seriously.
In the vast majority of PG works, probably between 98 and 100 percent of quotation marks and apostrophes could be reliably corrected without the need for human eyeballing.
The problem is, if they can be automatically corrected, then they aren't carrying an information load. It's exactly those case where we have nested quotes and "'an't 'ee, ya' 'ee?" that they carry information load, and that's exactly where they'll get it wrong, and that's exactly where readers will slam into a brick wall and have to stop and wonder whether the curved quotes or their reading is wrong.
Oh, fiddle sticks. Got something harder to solve and get done properly. Naturally, reg-ex is the wrong way to go! Anyone who has just started parsing can handle this easily.
regards Keith

It was stated:
In the vast majority of PG works, probably between 98 and 100 percent of quotation marks and apostrophes could be reliably corrected without the need for human eyeballing.
I disagree. I have posted hundreds of books where straight quotes have been converted. This is my process:
First, run the text through a program, written in Perl. This does most of the hard work but it can't do it all. This will flag everything that it's not sure about. That will flag continued quoted paragraphs where the closing double-quote is missing. It will flag situations where, after the conversion, one of a set of rules has been broken, such as a closing double quote immediately followed by a letter, typically caused by triple-nested quotes. It has a built-in dictionary of rules for single quotes (i.e. a word ending in 's is going to be ’s). It also has a built-in dictionary of known words, such as fo’c’sle, which don't fall under the rules. But that's as far as it can go. It will leave behind unconverted single quotes, typically at the start of a word. There are a handful of these in every book.
Verifying the paragraphs missing the closing double quote and resolving the opening single quotes on a word is a manual process. It takes only a few minutes and almost always shows other problems with the text as an added benefit. But it can never be totally and reliably automated, based on my experience.
--Roger
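The process described above might be sketched roughly as follows. This is not Roger's Perl program, which is not shown in the thread; it is a minimal Python illustration of the same idea (convert what the rules cover, flag everything else), and the names `KNOWN_WORDS` and `convert_paragraph` as well as the exact rules are assumptions made for the example.

```python
import re

# Hypothetical exception dictionary; a real tool would carry a far longer list.
KNOWN_WORDS = {"fo'c'sle": "fo\u2019c\u2019sle"}

def convert_paragraph(par):
    """Return (converted_text, flags); flags name what still needs a human look."""
    flags = []
    # 1. Known words that do not fall under the general rules.
    for plain, curly in KNOWN_WORDS.items():
        par = par.replace(plain, curly)
    # 2. Rule for single quotes: a word ending in 's gets a curly apostrophe.
    par = re.sub(r"(?<=[A-Za-z])'s\b", "\u2019s", par)
    # 3. Double quotes: alternate opening and closing within the paragraph.
    out = []
    opening = True
    for ch in par:
        if ch == '"':
            out.append("\u201c" if opening else "\u201d")
            opening = not opening
        else:
            out.append(ch)
    text = "".join(out)
    # 4. Flag anything suspicious instead of guessing.
    if not opening:
        flags.append("closing double quote missing (continued quotation?)")
    if re.search("\u201d[A-Za-z]", text):
        flags.append("closing double quote immediately followed by a letter")
    if "'" in text:
        flags.append("unconverted single quote left for manual review")
    return text, flags

text, flags = convert_paragraph("\"'Tis the dog's bone,\" he said.")
print(text)
print(flags)
```

In this sketch the leading single quote of 'Tis is deliberately left straight and reported, which mirrors the handful of word-initial single quotes that the message says remain in every book.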

Perhaps my post was misunderstood, because I'm not sure where the point of disagreement is. I never said that the process could be *totally* automated. Is your program that "does most of the work but ... can't do it all" reliably correcting less than 98 percent of the quotation marks and apostrophes in book-length texts? If so, perhaps additional fine-tuning is possible. Regards, Mark
It was stated:
In the vast majority of PG works, probably between 98 and 100 percent of quotation marks and apostrophes could be reliably corrected without the need for human eyeballing.
I disagree. I have posted hundreds of books where straight quotes have been converted. This is my process:
First, run the text through a program, written in Perl. This does most of the hard work but it can't do it all. This will flag everything that it's not sure about. That will flag continued quoted paragraphs where the closing double-quote is missing. It will flag situations where, after the conversion, a set of rules have been broken, such as a closing double quote immediately followed by a letter, typically caused by triple-nested quotes. It has a built-in dictionary of rules for single quotes (i.e. a word ending in 's is going to be ’s). It also has a built-in dictionary of known words, such as fo’c’sle, which don't fall under the rules. But that's as far as it can go. It will leave behind unconverted single quotes typically at the start of a word. There are a handful of these in every book.
Verifying the paragraphs missing the closing double quote and resolving the opening single quotes on a word is a manual process. It takes only a few minutes and almost always shows other problems with the text as an added benefit. But it can never be totally and reliably automated based on my experience.
--Roger

"Mark" == Mark Swofford <mark@romanization.com> writes:
Mark> Perhaps my post was misunderstood, because I'm not sure Mark> where the point of disagreement is.
Mark> I never said that the process could be *totally* automated.
Mark> Is your program that "does most of the work but ... can't do Mark> it all" reliably correcting less than 98 percent of the Mark> quotation marks and apostrophes in book-length texts? If so, Mark> perhaps additional fine-tuning is possible.
I rather think that instead of trying to push 98% to 99% one should try to increase reliability. One should be sure that the program never silently introduces an error, and this is really hard. For example, one cannot assume that a ' between letters is an apostrophe, hence rendered with a right single quotation mark (do you know the standard exception in English?), nor assume English, since even in English books non-English words may be included. And increasing the reliability to 100% might require checking each one.
Carlo Traverso

Hi All,
in order to increase reliability one has to have proper analysis of the "co-text", not context, using look-forward and look-backward, and proper heuristics. Most algorithms do not use such methods, as many do not know how to do this and assume that such algorithms are slow.
regards Keith.
On 26.09.2012 at 09:28, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
"Mark" == Mark Swofford <mark@romanization.com> writes:
Mark> Perhaps my post was misunderstood, because I'm not sure Mark> where the point of disagreement is.
Mark> I never said that the process could be *totally* automated.
Mark> Is your program that "does most of the work but ... can't do Mark> it all" reliably correcting less than 98 percent of the Mark> quotation marks and apostrophes in book-length texts? If so, Mark> perhaps additional fine-tuning is possible.
I rather think that instead of trying to push 98% to 99% one should try to increase reliability. One should be sure that the program never silently introduces an error, and this is really hard. For example, one cannot assume that a ' between letters is an apostrophe, hence rendered with a right single quotation mark (do you know the standard exception in English?), nor assume English, since even in English books non-English words may be included. And increasing the reliability to 100% might require checking each one.
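The two ideas in this exchange, looking backward and forward around each mark and never silently guessing, can be combined in a small sketch. Everything in it is an assumption made for illustration: the allowlist `SAFE_SUFFIXES`, the function name, and the choice to report character offsets. It assumes English contractions only and does not address the non-English words mentioned above.

```python
import re

# Hypothetical allowlist of English contraction endings; anything outside it
# is reported for review rather than converted.
SAFE_SUFFIXES = ("'s", "'t", "'ll", "'re", "'ve", "'d", "'m")
QUOTE_PLUS_LETTERS = re.compile(r"'[A-Za-z]+")

def convert_apostrophes(text):
    """Convert only allowlisted apostrophes; return (new_text, unsure_offsets)."""
    out = []
    unsure = []
    for i, ch in enumerate(text):
        if ch != "'":
            out.append(ch)
            continue
        look_back = text[i - 1:i]                  # empty string at position 0
        ahead = QUOTE_PLUS_LETTERS.match(text, i)  # the quote plus following letters
        tail = ahead.group(0) if ahead else ""
        if look_back.isalpha() and tail in SAFE_SUFFIXES:
            out.append("\u2019")
        else:
            out.append(ch)    # leave it straight ...
            unsure.append(i)  # ... and say so, never guess silently
    return "".join(out), unsure

converted, unsure = convert_apostrophes("I don't know, 'tis ya' o'clock")
print(converted)
print(unsure)
```

The design choice here is to prefer an unconverted, reported mark over a wrong curly one, so the conversion rate is lower but every change the function makes is one the allowlist explicitly covers.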

Also, nothing prevents you from picking up a terribly wrong book, and improving it, then resubmit it to PG, with some explanation what you've done.
This is simply not true. Tons of people have pointed out formatting problems with specific books over the years, asked PG how they can submit fixes, and have been told that they are not allowed to do so. The "work around" has been to find a version of that book under a different edition, and then submit a "new book" based on that edition, but then under PG's webmaster's policy of continuing to promote the oldie-moldie versions of books, the customers continue to download the old defective versions.

Let's experiment. You find a book with a mistake, fix that mistake, and I will ask the PG white-washers to update it.
Note that any correction to mistakes should be either obvious or attested by a printed version, and that you need a clearance for any version you're working with for more than trivial information.
Formatting issues can be matters of taste, and in particular with copyrighted submissions, you will find pretty bad formatting, which the white-washers might be more reluctant to upgrade.
Don't expect people to start fixing a book just because you point out the formatting is horrible; we know there are horrible books in there, but we are also busy.
Updated books are posted almost every day. Just follow the posted list.
On 2012-09-20 15:31, James Adcock wrote:
Also, nothing prevents you from picking up a terribly wrong book, and improving it, then resubmit it to PG, with some explanation what you've done.
This is simply not true. Tons of people have pointed out formatting problems with specific books over the years, asked PG how they can submit fixes, and have been told that they are not allowed to do so. The "work around" has been to find a version of that book under a different edition, and then submit a "new book" based on that edition, but then under PG's webmaster's policy of continuing to promote the oldie-moldie versions of books, the customers continue to download the old defective versions.

Don't expect people to start fixing a book just because you point out the formatting is horrible; we know there are horrible books in there, but we are also busy.
You are talking both ways at once. I just said that people point out where the formatting is horrible and ask to be permitted to fix this horribleness, and are told "no." We've been through all this before, just a few months ago, when the whitewashers and the webmaster pushed back strongly against the suggestion that people who care should be allowed to fix formatting errors. We're not asking that YOU be made to fix formatting errors; we are asking that WE be allowed to fix formatting errors. The fact that maybe YOU are busy is a non-starter argument.

Comments from a Whitewasher: Do not use the WWers as experimental subjects. We're busy, too. If genuine, textual, errors are found, send a report to PG's errata system, and it'll be dealt with. If there are lots of textual errors, contact the WWers for advice first--it's possible you might be asked to submit a complete corrected fileset, which we'll use to replace the existing files, which would be archived. Al
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Jeroen Hellingman Sent: Thursday, September 20, 2012 8:07 AM To: James Adcock Cc: 'Project Gutenberg Volunteer Discussion' Subject: Re: [gutvol-d] procedures for contributions, forum for questions and ideas
Let's experiment. You find a book with a mistake, fix that mistake, and I will ask the PG white-washers to update it.
Note that any correction to mistakes should be either obvious or attested by a printed version, and that you need a clearance for any version you're working with for more than trivial information.
Formatting issues can be matters of taste, and in particular with copyrighted submissions, you will find pretty bad formatting, which the white-washers might be more reluctant to upgrade.
Don't expect people to start fixing a book just because you point out the formatting is horrible; we know there are horrible books in there, but we are also busy.
Updated books are posted almost every day. Just follow the posted list.
On 2012-09-20 15:31, James Adcock wrote:
Also, nothing prevents you from picking up a terribly wrong book, and improving it, then resubmit it to PG, with some explanation what you've done.
This is simply not true. Tons of people have pointed out formatting problems with specific books over the years, asked PG how they can submit fixes, and have been told that they are not allowed to do so. The "work around" has been to find a version of that book under a different edition, and then submit a "new book" based on that edition, but then under PG's webmaster's policy of continuing to promote the oldie-moldie versions of books, the customers continue to download the old defective versions.

On Thu, Sep 20, 2012 at 2:05 PM, Al Haines <ajhaines@shaw.ca> wrote:
If genuine, textual, errors are found, send a report to PG's errata system, and it'll be dealt with. If there are lots of textual errors, contact the WWers for advice first--it's possible you might be asked to submit a complete corrected fileset, which we'll use to replace the existing files, which would be archived.
Which avoids the issues we're seeing. Very few PG books have lots of textual errors that make readers not want to read the books. Many books have formatting errors that make readers not want to read the books. When I open a drama in my Kindle and it's unreadable, that's a more serious error than the few typos that most PG books have. -- Kie ekzistas vivo, ekzistas espero.

Just as a quick "sanity check" of these complaints that I and others have been making, I chose "at random" to download and look on my ebook reader at ten books starting at catalog# 10000 to see if they are plausibly well-formatted or not. I.e., if I bought this e-book commercially would I be satisfied with this effort, or would I say "this is a load of junk"? In the list below, "yes" means "reasonably competent formatting" whereas "no" means the opposite:
10000 no
10001 no
10002 yes
10003 no
10004 no
10005 no
10006 no
10007 no
10008 no
10009 no

On 9/21/2012 3:05 PM, James Adcock wrote:
In the below list "yes" means "reasonably competent formatting" whereas "no" means the opposite: 10000 no 10001 no 10002 yes 10003 no 10004 no 10005 no 10006 no 10007 no 10008 no 10009 no
Those are books from around 2003, James; still quite early (imho) as far as PG books go. Some of them didn't even have HTML created by the contributor, but were produced as plain text.
How about trying your experiment with more recent books? The current crop are up in the 40000+ range. It might also help to mention which ebook reader you're using, and which ebook format you downloaded. -- Walt

IME, people have different opinions about acceptable formatting and layout. What one person sees as an acceptable use of screen space, another will see as overly busy or cluttered, a third will see space underutilized, and a fourth will complain about the shade of white used for the background. Meanwhile, the designer is pleading for a fifth of Dalwhinnie because the other three have four opinions on how to fix it.
Things like strange line breaks can be blamed on the operating screen width of the original editor. Blank lines between paragraphs might make sense to one person, but to me, they're like the commercial breaks every ten minutes in the last hour of a theatrical movie shown on late-night TV.
The most irritating digitization I've read so far is a copy of Alice in Wonderland: _Adventures_ had mid-paragraph notes about differences in the editions, and the poems recited as dialogue ("The Walrus and the Carpenter," "Jabberwocky," "The Battle of Tweedledee and Tweedledum") were missing from both books. I've yet to look through my download history to see where I grabbed this travesty from.
I also learned in my earliest English classes that writing "this stinks" 150 times does not count as a 300-word essay. Posting the specifics of the errors ad nauseam might get you banned from the forums, but it might get someone to understand wtf you're (justifiably) railing on about.
-- Mjit RaindancerStahl answerwitch@gmail.com

How about trying your experiment with more recent books? The current crop are up in the 40000+ range.
Well, the problem I have been arguing for a "fix" for some years is that PG doesn't fix formatting errors. If PG is still today accepting books which have formatting errors, even though PG tells submitters what they want in order to get ebooks which actually work, then why is PG still accepting things which have broken formatting?
It might also help to mention which ebook reader you're using, and which ebook format you downloaded.
Any or all of the mobi format readers, after downloading mobi format, since that is one of the most commonly read ebook formats, if not the most common. PG does somewhat better, but not terribly so, on epub format.

On Fri, Sep 14, 2012 at 11:28:20PM +0200, Jeroen Hellingman wrote:
The core "trouble" of Project Gutenberg is its massive size. With over 40,000 ebooks, produced over several decades by countless volunteers, keeping up to any standards will be hard. Even a large and well-funded organization will have trouble setting consistent standards for such a large and diverse collection of texts.
Jeroen,
I looked into TRAC, and actually got it to ingest the whole collection (it took a few days).
http://trac.readingroo.ms/gutenberg/
It does revision control, issue tracking, etc. Unfortunately I have not had time (and don't have sufficient expertise) to take it much further than that. If anyone is interested, I'd be happy to provide access.
Ultimately, part of my goal (which I expressed here earlier in the year) is exactly what you wrote about: better ability to crowdsource production and errata handling, and to more easily allow variations & derivative works. I wrote a fair amount about it then, so won't go into detail in this thread.
-- Greg
You can basically propose any master format, any quality standard, or whatsoever, to do with the Project Gutenberg collection, but unless you are willing and able to contribute significantly to your own proposals, it won't get you anywhere.
Over a period of about 15 years, I've contributed over 500 ebooks to Project Gutenberg, and even as a single individual, I have trouble maintaining that bulk. Even while I have the luxury of working with a decent master format (TEI, which from a strictly technical point of view is the best choice, but suffers from an extremely steep learning curve, which makes it a bridge too far for most volunteers) and a tuned-to-my-requirements tool-set (my tei2html scripts), I am reaching a point where just maintaining that sub-collection (fixing errors, improving tagging of early texts, etc.) is becoming a considerable task. (And that is while I can regenerate HTML and ePub files with a few keypresses, and have everything in a revision control system.)
What I would currently like to see most to move Project Gutenberg forward are (in order of priority):
1. A decent issue tracking system that can cope with the number of books we have, so readers can report possible issues with ebooks.
2. A revision control system (preferably distributed) that can handle the massive size of the PG collection, so we can keep track of what is going on.
3. An integrated production environment, a kind of PGDP 2.0, to help streamline the production of new "text-based" (as opposed to scans only) ebooks.
Only after those things are in place can we work towards a suitable master format, and a publishing pipeline that can produce desired output formats, such as HTML, ePub, etc. from that.
I've been looking around for suitable tools to make this possible, but probably nothing that is currently available can handle this, so it will require a considerable effort.
Take, for example, the size of the collection. My own 500+ books are stored in a Bazaar repository of about 4 gigabytes. Scaling that up a hundred times to hold PG's 40,000 books will result in about half a terabyte of controlled data, several orders of magnitude larger than the largest open source code projects I know of.
Jeroen Hellingman
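The size estimate in the message above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the figures (about 500 books, about 4 gigabytes, 40,000 books) come from the message, while the function name and the assumption that repository size grows linearly with the number of books are hypothetical and introduced here for illustration.

```python
def estimate_repo_size_gb(sample_books: int, sample_size_gb: float, target_books: int) -> float:
    """Extrapolate repository size from a sample.

    Assumes (hypothetically) that size grows linearly with the number of
    books; real repositories with history and images may behave differently.
    """
    return sample_size_gb * target_books / sample_books


if __name__ == "__main__":
    # Figures quoted in the thread: ~500 books in ~4 GB, ~40,000 books in PG.
    by_book_count = estimate_repo_size_gb(500, 4.0, 40_000)
    # The rounder "scale up a hundred times" factor used in the message.
    by_round_factor = 4.0 * 100
    print(f"Scaled by book count: {by_book_count:.0f} GB")
    print(f"Scaled by a factor of 100: {by_round_factor:.0f} GB")
```

Under the linear assumption, scaling by book count gives roughly 320 GB and the rounder factor of one hundred gives 400 GB, so both land in the same range as the "half a terabyte" figure.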
On 2012-09-14 22:37, Steve White wrote:
But as I've observed in so many other projects and jobs, the problem ain't the hardware, and it ain't the software, and it ain't the firmware and it ain't the middleware. It's some other ware.
I have mostly looked at some of the older books in the collection (for no particular reason). There, the most egregious problem is that the book was transcribed into ASCII, which inevitably results in some degree of data loss--sometimes very unfortunate losses. Some human intervention would be necessary to repair that, but there are ways to reduce the amount of intervention required.
I'm cognizant of the necessity to keep a master copy, and build other formats from that. It's not clear to me what the preferred format of the master copy is, or should be.
I am quite sure that very legible and attractive books could be automatically generated with a suitable base format. There are all sorts of technical possibilities -- again, this is not the problem.
Besides that, I think some direction and clarity could be helpful regarding modern formats. Example: it seems that many e-books are formatted to retain the ASCII look-and-feel, at least regarding the Gutenberg project text. This is always pointlessly ugly, and is often totally illegible. A little thought, and some examples, clearly documented somewhere, would go a long way.
Cheers!
Dr. Gregory B. Newby Chief Executive and Director Project Gutenberg Literary Archive Foundation www.gutenberg.org A 501(c)(3) not-for-profit organization with EIN 64-6221541 gbnewby@pglaf.org

This is _very_ cool, and would allow people to report problems with things or check in new revisions... What's the VCS behind it? Subversion? Alex

On Tue, Sep 18, 2012 at 02:22:07PM -0400, Alex Buie wrote:
This is _very_ cool, and would allow people to report problems with things or check in new revisions...
Yes, it seems to have potential, but is just an experiment - it is currently just a static copy of the collection.
What's the VCS behind it? Subversion?
TRAC lets you choose. It is Subversion currently. -- Greg

Greg,
The TRAC database is interesting. I will have to look a little into trac to see what it can do. A quick look at the trac project page shows that it contains a wrapper around Subversion (or alternative version control systems). Basically it will take some time working with a system to evaluate its usefulness. It seems you've never done anything with it after the initial import.
One of the first steps I would take is to weed out all files that can be dynamically regenerated, that is, all the zip archives that are also present as uncompressed file-sets. A zip is easily created on the fly. (I have maintained a local copy of PG for years, working the other way round, only grabbing the zips; since I've run out of disk space, I am no longer doing this.)
I still have about 80 (hard) books to Post-Process for PGDP; after that I will dive more into this type of thing.
One thing Project Gutenberg can do to lower the barrier to the requested "can-do" attitude is to remove the "copyright" restrictions on the collection as a whole, and revise the somewhat arcane PG license restrictions. I think it would be most appropriate to stamp something like CC-BY or CC-BY-SA on the entire collection (exempting only the copyrighted contributions), and drop the claim that the collection itself warrants some kind of "compilation copyright" (something I think makes no sense and will not hold, as there has been no creative effort in making the compilation; just accepting everything that meets a certain threshold of technical quality and copyright clearance does not count here). Of course, having the RDF catalog under GNU GPL is great.
Jeroen.
On 2012-09-18 18:41, Greg Newby wrote:
Jeroen,
I looked into TRAC, and actually got it to ingest the whole collection (it took a few days).
http://trac.readingroo.ms/gutenberg/
It does revision control, issue tracking, etc. Unfortunately I have not had time (and don't have sufficient expertise) to take it much further than that. If anyone is interested, I'd be happy to provide access.
Ultimately, part of my goal (which I expressed here earlier in the year) is exactly what you wrote about: better ability to crowdsource production and errata handling, and to more easily allow variations & derivative works. I wrote a fair amount about it then, so won't go into detail in this thread.
-- Greg
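The weeding step suggested above (dropping zip archives whose contents also exist as uncompressed files, and recreating a zip on demand) could be sketched roughly as follows. This is a minimal illustration, not a description of how the PG collection or the TRAC import is actually organized: it assumes, hypothetically, that each archive's members sit next to the archive under the same relative names, and the function names are made up for this example.

```python
import zipfile
import zlib
from pathlib import Path


def redundant_zips(root: Path) -> list[Path]:
    """Return zip archives under root whose every member also exists,
    uncompressed and with a matching CRC-32, beside the archive.

    Assumption: members are stored relative to the archive's own directory.
    """
    redundant = []
    for zip_path in sorted(root.rglob("*.zip")):
        try:
            with zipfile.ZipFile(zip_path) as archive:
                members = [m for m in archive.infolist() if not m.is_dir()]
        except zipfile.BadZipFile:
            continue  # unreadable archive: leave it alone
        if not members:
            continue
        if all(_same_content(zip_path.parent / m.filename, m.CRC) for m in members):
            redundant.append(zip_path)
    return redundant


def _same_content(path: Path, expected_crc: int) -> bool:
    """True if path is a regular file whose CRC-32 equals expected_crc."""
    return path.is_file() and zlib.crc32(path.read_bytes()) == expected_crc


def zip_on_the_fly(files: list[Path], base: Path, out_path: Path) -> None:
    """Create a zip of the given files, named relative to base."""
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file_path in files:
            archive.write(file_path, arcname=file_path.relative_to(base).as_posix())
```

Comparing CRC-32 values rather than just file names guards against deleting an archive whose uncompressed counterpart has since been edited; anything the function reports should still be reviewed before removal.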

I hope your experience is at one end of the spectrum (and not the other).
Sorry for the "dump" but this forum has been through these issues in literally exhaustive detail over the years past, with the "outsiders" pushing PG to change, and the "insiders" pushing back -- by making the formatting problems even worse. This forum then went quiet, presumably because everyone who had been pushing for change has given up on PG. I have a number of books in the works, but I really don't know what to do with them -- working with PG is simply too painful, too insulting, and one can submit a beautifully, and simply, formatted book to PG and then what comes out of their "sausage maker" software for you to read on your EPUB or MOBI device ends up being a formatting disappointment.
Last time I checked there were still some people interested in tackling the formatting problems inside the DP hierarchy [but DP has its own set of "High Priest/High Priestess" problems too] and DP has more leverage with PG than individuals [since so many of the new books come from DP], so maybe check in with what the DP people interested in the formatting problems are doing. I would love someone to tackle figuring out a "work around" that would allow us to make and distribute beautifully formatted books to ebook readers, but too much has been going on in my life to tackle that project right now.

Hi Jim, Lots of places to go and have your texts/books published. It is up to you! regards Keith. On 15.09.2012 at 17:31, James Adcock <jimad@msn.com> wrote:
because everyone who had been pushing for change has given up on PG. I have a number of books in the works, but I really don't know what to do with them -- working with PG is simply too painful, too insulting, and one can submit

Hi!
It does seem that everybody has their complaints!
It's impressive: one person voices concerns, and proposes solutions, and we get to hear everybody's concerns, and how those are far more important. And then... well...
Each of these problems may ultimately be a valid problem that needs to be addressed, but each of them is do-able, technically. In each case, it's a matter of looking at the thing a different way.
Technically, no one problem is particularly difficult here. Technically.
Oh well.
Hey! Here's something completely different! I just made an e-book with an imbedded Devanagari font that shows some hymns from the Rig Veda in Sanskrit!
Ah well. It looks very very cool.
C ya!
On Thu, Sep 13, 2012 at 10:41 AM, Steve White <stevan.white@gmail.com> wrote:
Hi,
I have several questions.
I got a nice e-book reader, and became interested in the Gutenberg collection. The collection itself is great! Some of the e-books are almost unusable on my reader, mostly due to unfortunate automatic formatting.
So...I took the e-books apart, and re-formatted them. Then I tested them in several e-reader environments. I think they're very nice now. And, I would like to share these improvements, preferably through the Gutenberg project.
** Does the project accept these kinds of improvements to text?
I see some talk of old and new editions in the books. ** Would such changes constitute new editions?
** What are the preferred procedures?
** Is this a good forum for general technical and policy questions regarding Gutenberg e-books?
Thanks!

Steve,
I'm interested in checking out your Rg Veda. I myself am responsible for this:
http://www.gutenberg.org/ebooks/39442
It was quite a job to format, mostly because of family tree tables. I was not at all happy with the formatting of the EPUB and Kindle versions, so I reworked the EPUB, generated a Kindle from it, and published it on Amazon:
http://www.amazon.com/Bhagavata-Purana-Corrected-Introduction-ebook/dp/B0086Q7UTE/ref=tmm_kin_title_0?ie=UTF8&qid=1347892023&sr=8-1
If you look on Amazon you'll see that others have taken my inadequately formatted version, put a quick and dirty cover on it, and are selling it for more than I am asking.
Another thing I've done with reformatted PG submissions (my own submissions) is put them on the Internet Archive:
http://archive.org/details/VidyapatiBangiyaPadabali
http://archive.org/details/ChaitanyasLifeAndTeachings
For my Bhagavata Purana I created an RST file from which the other formats were generated. For most non-illustrated books this would be an excellent solution. For illustrated books you have the problem that PG wants all illustrations to be viewable vertically on the web page, and if the printed book requires you to rotate the book to see the illustration the right way up you need to rotate the image so the web page viewer can see it the right way up. Good for web pages, not good for e-readers, which should have the same orientation the original book had.
Doing poetry on an e-reader has its own challenges. There has been a lot of discussion on this list of how to deal with this.
My own suggestion would be to give the submitter the option to include a properly formatted EPUB with his submission. PG would use this EPUB instead of the auto-generated one and would generate the Kindle version from the EPUB as well.
James Simmons
On Fri, Sep 14, 2012 at 5:43 PM, Steve White <stevan.white@gmail.com> wrote:
Hi!
It does seem that everybody has their complaints!
It's impressive: one person voices concerns, and proposes solutions, and we get to hear everybody's concerns, and how those are far more important. And then... well...
Each of these problems may ultimately be a valid problem that needs to be addressed, but each of them is do-able, technically. In each case, it's a matter of looking at the thing a different way.
Technically, no one problem is particularly difficult here. Technically.
Oh well.
Hey! Here's something completely different! I just made an e-book with an imbedded Devanagari font that shows some hymns from the Rig Veda in Sanskrit!
Ah well. It looks very very cool.
C ya!

http://www.gutenberg.org/ebooks/39442.kindle.images shows a variety of formatting problems. Granted, issues like family trees and the reproduction of large maps in ebook format are difficult issues. However, simple small images should reproduce in a usable, undistorted form. This is, however, to my mind, not the central "formatting problem" at PG. The central problem is that "dirt simple" PG books have formatting problems on ebook readers which make them miserable to read, including problems with the simplest of issues, such as "how does one correctly format a paragraph?"
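To make the paragraph question concrete, here is one possible sketch of turning hard-wrapped plain text into reflowable HTML paragraphs. It is not PG's conversion software or any tool named in this thread; the function name is hypothetical, and it assumes that paragraphs are separated by blank lines and that every other line break is only an artifact of the original editor's screen width. That assumption fails for poetry and drama, where line breaks carry meaning and would need separate handling.

```python
import html
import re


def text_to_html_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into <p> elements.

    Assumption: a blank line marks a paragraph boundary; all other
    line breaks are incidental and can be replaced by a single space.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Collapse the block's physical lines into one logical line.
        joined = " ".join(line.strip() for line in block.splitlines() if line.strip())
        if joined:
            paragraphs.append(f"<p>{html.escape(joined)}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "First line\nsecond line.\n\nNext & last."
    print(text_to_html_paragraphs(sample))
```

Leaving spacing and indentation to the reader's stylesheet, rather than baking blank lines or fixed-width line breaks into the text, is what lets the same paragraph reflow on screens of different widths.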
participants (14)
- Al Haines
- Alex Buie
- David Starner
- Greg Newby
- James Adcock
- James Simmons
- Jeroen Hellingman
- Keith J. Schultz
- Mark Swofford
- Mjit RaindancerStahl
- Roger Frank
- Steve White
- traverso@posso.dm.unipi.it
- Walt Farrell