
The mission of PG is accessibility: getting books to people, so they can actually read them. This is not exactly the same as preservation, but it touches on it in some ways. Yes, you can preserve a book in a sense by scanning it to very high standards, keeping those scans on a few servers for patrons to use, and then locking away the original book in an ideally conditioned environment, taking wear-and-tear out of the picture for all but the most essential physical manipulations. But I think true preservation is more than keeping the artifacts for eternity; it is about keeping the expressions available, accessible, and part of our culture for as long as they deserve it ("deserve" being decided by the members of the public, not by some artificial gatekeepers). For this, works need to be accessible.

Curiously, the biggest barrier is not our (lack of) standards, nor even our limited capacity, but the laws of copyright, cynically advertised as "promoting science and arts": they needlessly prevent even the slightest steps required to keep works accessible, until the works have become completely culturally irrelevant. Of course changing that will take some time, and we do have a huge backlog to work on until that change has materialized.

Turning then to the question of a master format. It didn't take me long to be converted when I discovered TEI back in 1996: this is exactly what we needed to get going with a master format. Yes, it takes time to learn, and yes, we do not yet have a fool-proof toolchain for it, but it will survive in the long run because it is reasonably well defined, and still flexible enough to deal with the large number of details we come across in real books. Still, TEI is not a stand-still target; since its start it has gone through several incarnations, and the whole customization machinery included with it makes it hard as well (but compare that with the Unicode standard: few people will ever require the use of Cherokee or Deseret letters, but they are still in there, just in case). And yes, most tool-chains will require conventions to work well, but that can be solved.

I don't see the point in defining yet another master format to get to something simpler than TEI. TEI has done the groundwork and the heavy lifting, and it works well; and while I agree that you need to make things simple, you can't really make them simpler than needed. Converting a simple work (a run-of-the-mill novel) into TEI takes an hour or two when starting from reasonably clean HTML; starting from Word doubles that, and plain text requires a walk-through of all pages (or scans) to add the missing layout information.

When I work on a text for PG for which an on-line source already exists, I often normalize both the OCR results and the alternative source towards TEI, and then run a compare to find issues in both, as I have done with the "Old Frisian" text in the Oera Linda book recently posted. (A rough sketch of that compare step is in the P.S. below.)

What I would understand, though, is a more WYSIWYG interface on Distributed Proofreaders, that is, have people use a Wikipedia-like mark-up, and then render that page as HTML internally, which will help to find more issues in the content.

Jeroen.
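P.S. For anyone who wants to try the compare step I describe above, here is a very rough Python sketch of the idea. It is untested and only meant to show the principle; the file names are made up, and the normalization here is far cruder than what I actually do (I normalize both sources towards TEI and tune the clean-up to the particular text). The principle is the same, though: reduce both sources to just the words, then diff them.

import difflib
import re
import unicodedata

def normalize(path):
    # Crude normalization: strip tags, normalize Unicode, straighten curly
    # quotes, and split into one word per entry so the diff lines up on
    # content only, not on line breaks or markup.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>", " ", text)             # drop TEI/HTML tags
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    return text.split()

# The file names are only illustrative; use whatever your two sources are.
ocr = normalize("ocr-transcription.xml")
alt = normalize("online-source.html")
for line in difflib.unified_diff(ocr, alt, "OCR", "on-line source", lineterm=""):
    print(line)

Every real difference between the two transcriptions then shows up as a small cluster of +/- words, which you can check against the page scans.

Quoting Lee Passey <lee@novomail.net>: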
On Fri, October 12, 2012 4:47 pm, don kretz wrote:
One wonders how many historically significant books are being irretrievably lost in the destruction and violence in Syria and Egypt and Libya, literarily [sic] among the most historically active areas in the world.
I do not believe that it is now, or ever has been, the mission of Project Gutenberg to preserve world literature. According to the web site, the mission of Project Gutenberg (written in Michael Hart's inimitable style) is:
"To encourage the creation and distribution of eBooks."
There is nothing in the explanatory text surrounding this mission statement that suggests that preservation plays any part in PG's mission, although to be fair there is much in Mr. Hart's statement that is at odds with the current practices of Project Gutenberg.
It seems to me that the mission of Project Gutenberg has nothing to do with the /preservation/ of literary works and everything to do with the /popularization/ and /accessibility/ of those works. And while Mr. Hart never said, "we encourage our volunteers to furnish us with as many rare texts as they can," he did say, "[W]e are happy to bring eBooks to our readers in as many formats as our volunteers wish to make.... [P]eople are still encouraged to send us eBooks in any format and at any accuracy level and we will ask for volunteers to convert them to other formats, and to incrementally correct errors as times goes on."
I have come to believe that when Mr. Hart started Project Gutenberg on donated mainframe time he understood the potential of storing mass amounts of text on computers, but he did not understand the transformative power of computing. He understood the power of the hard drive, but not the power of the CPU. Thus, when he first started placing text into storage, instead of using a rich format that could be transformed into the Format Of Any Day, he chose to carefully, manually transform each text directly into the Format Of His Day, which in that day was 80-character lines of ASCII-only text, suitable for use on the VT52 terminal.
Over time, the Format Of The Day has changed, but given the difficulty of up-converting VT52 format to more modern formats, and the fact that most modern operating systems can still display VT52 text files, however badly, most PG texts have remained in their original, sorry state. This state of affairs has persisted for so long that most of the PG old-timers see it as being not only normal, but desirable.
Most of the complaints now leveled at PG are not that the archive is too incomplete, but that the contents of the archive are so visually unappealing as to be unusable. Thus, the true mission of Project Gutenberg, "[t]o encourage the creation and distribution of eBooks," is now no longer being satisfied.
So, the first advice /I/ would give to someone wanting to volunteer at Project Gutenberg is to start by learning how to create an electronic book from an existing file (a tutorial to this effect should be created). Then s/he should practice what s/he has learned by taking an existing PG file that s/he is interested in, and which sucks, and making it suck less. The text can then be returned to PG for the kind of incremental update that Mr. Hart envisioned.
This kind of approach provides a gentle introduction to the creation of e-texts. You can take an existing text, see how someone else has done it, see where the mistakes are, fix a few simple mistakes, check it in to PG, see what kind of feedback you get, fix more mistakes, take on a more challenging text, and so on until you're comfortable with markup. Then, go get some OCR'ed text from IA or Google, fix that up and check that in. When you finally get around to doing OCR yourself, you have all the underlying knowledge to fix up your own OCR.
Stay in the shallow end until you learn how to tread water, and do not use the high dive until you are an expert swimmer.