
The mission of PG is accessibility: getting books to people, so they can actually read them. This is not exactly the same as preservation, but it touches on it in some ways. Yes, you can preserve a book in a sense by scanning it to very high standards, keeping those scans on a few servers for patrons to use, and then locking away the original book in an ideally conditioned environment, taking wear-and-tear out of the picture for all but the most essential physical manipulations. But I think true preservation is more than keeping the artifacts for eternity; it is about keeping the expressions available, accessible, and part of our culture for as long as they deserve it ("deserve" being decided by the members of the public, not by some artificial gatekeepers). For this, works need to be accessible.

Curiously, the biggest barrier is not our (lack of) standards, nor even our limited capacity, but the laws of copyright, cynically advertised as "promoting science and arts": they needlessly prevent even the slightest steps required to keep works accessible, until the works have become completely culturally irrelevant. Of course changing that will take some time, and we do have a huge backlog to work on until that change has materialized.

Turning then to the question of a master format. It didn't take me long to be converted when I discovered TEI back in 1996: this is exactly what we needed to get going with a master format. Yes, it takes time to learn, and yes, we do not yet have a fool-proof toolchain for it, but it will survive in the long run because it is reasonably well defined, and still flexible enough to deal with the large number of details we come across in real books. Still, TEI is not a stand-still target; since its start it has gone through several incarnations, and the whole customization machinery included with it makes it hard as well (but compare that with the Unicode standard: few people will ever require the use of Cherokee or Deseret letters, but they are still in there, just in case). And yes, most tool-chains will require conventions to work well, but that can be solved.

I don't see the point in defining yet another master format to get to something simpler than TEI. TEI has done the groundwork and the heavy lifting, and it works well; and while I agree that you need to make things simple, you can't really make them simpler than needed. Converting a simple work (a run-of-the-mill novel) into TEI takes an hour or two when starting from reasonably clean HTML; starting from Word doubles that, and plain text requires a walk-through of all pages (or scans) to add the missing layout information.

When I work on a text for PG for which an on-line source already exists, I often normalize both the OCR results and the alternative source towards TEI, and then run a compare to find issues in both, as I have done with the "Old Frisian" text in the Oera Linda book recently posted. (A rough sketch of that compare step is in the P.S. below.)

What I would understand, though, is a more WYSIWYG interface on Distributed Proofreaders, that is, have people use a Wikipedia-like mark-up, and then render that page as HTML internally, which will help to find more issues in the content.

Jeroen.
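P.S. For anyone who wants to try the compare step I describe above, here is a very rough Python sketch of the idea. It is untested and only meant to show the principle; the file names are made up, and the normalization here is far cruder than what I actually do (I normalize both sources towards TEI and tune the clean-up to the particular text). The principle is the same, though: reduce both sources to just the words, then diff them.

import difflib
import re
import unicodedata

def normalize(path):
    # Crude normalization: strip tags, normalize Unicode, straighten curly
    # quotes, and split into one word per entry so the diff lines up on
    # content only, not on line breaks or markup.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>", " ", text)             # drop TEI/HTML tags
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    return text.split()

# The file names are only illustrative; use whatever your two sources are.
ocr = normalize("ocr-transcription.xml")
alt = normalize("online-source.html")
for line in difflib.unified_diff(ocr, alt, "OCR", "on-line source", lineterm=""):
    print(line)

Every real difference between the two transcriptions then shows up as a small cluster of +/- words, which you can check against the page scans.

Quoting Lee Passey <lee@novomail.net>: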
On Fri, October 12, 2012 4:47 pm, don kretz wrote:
One wonders how many historically significant books are being irretrievably lost in the destruction and violence in Syria and Egypt and Libya, literarily [sic] among the most historically active areas in the world.
I do not believe that it is now, or ever has been, the mission of Project Gutenberg to preserve world literature. According to the web site, the mission of Project Gutenberg (written in Michael Hart's inimitable style) is:
"To encourage the creation and distribution of eBooks."
There is nothing in the explanatory text surrounding this mission statement that suggests that preservation plays any part in PG's mission, although to be fair there is much in Mr. Hart's statement that is at odds with the current practices of Project Gutenberg.
It seems to me that the mission of Project Gutenberg has nothing to do with the /preservation/ of literary works and everything to do with the /popularization/ and /accessibility/ of those works. And while Mr. Hart never said, "we encourage our volunteers to furnish us with as many rare texts as they can," he did say, "[W]e are happy to bring eBooks to our readers in as many formats as our volunteers wish to make.... [P]eople are still encouraged to send us eBooks in any format and at any accuracy level and we will ask for volunteers to convert them to other formats, and to incrementally correct errors as times goes on."
I have come to believe that when Mr. Hart started Project Gutenberg on donated mainframe time he understood the potential of storing mass amounts of text on computers, but he did not understand the transformative power of computing. He understood the power of the hard drive, but not the power of the CPU. Thus, when he first started placing text into storage, instead of using a rich format that could be transformed into the Format Of Any Day, he chose to carefully, manually transform each text directly into the Format Of His Day, which in that day was 80-character lines of ASCII-only text, suitable for use on the VT52 terminal.
Over time, the Format Of The Day has changed, but given the difficulty of up-converting VT52 format to more modern formats, and the fact that most modern operating systems can still display VT52 text files, however badly, most PG texts have remained in their original, sorry state. This state of affairs has persisted for so long that most of the PG old-timers see it as being not only normal, but desirable.
Most of the complaints now leveled at PG are not that the archive is too incomplete, but that the contents of the archive are so visually unappealing as to be unusable. Thus, the true mission of Project Gutenberg, "[t]o encourage the creation and distribution of eBooks," is now no longer being satisfied.
So, the first advice /I/ would give to someone wanting to volunteer at Project Gutenberg is to start by learning how to create an electronic book from an existing file (a tutorial to this effect should be created). Then s/he should practice what s/he has learned by taking an existing PG file that s/he is interested in, and which sucks, and making it suck less. The text can then be returned to PG for the kind of incremental update that Mr. Hart envisioned.
This kind of approach provides a gentle introduction to the creation of e-texts. You can take an existing text, see how someone else has done it, see where the mistakes are, fix a few simple mistakes, check it in to PG, see what kind of feedback you get, fix more mistakes, take on a more challenging text, and so on until you're comfortable with markup. Then, go get some OCR'ed text from IA or Google, fix that up and check that in. When you finally get around to doing OCR yourself, you have all the underlying knowledge to fix up your own OCR.
Stay in the shallow end until you learn how to tread water, and do not use the high dive until you are an expert swimmer.