
As a credentialed conflict avoider, I've been loath to stick my head into this fray. Indeed, this battle about meeting the needs of academia appears to be waged at times with an ideological fervor to rival that of the recent US election. It seems to me that the fervency with which people approach this issue has made it difficult in some cases for the arguments to follow a path towards resolution. It is perhaps also complicated by the wide assortment of changes being proposed to remedy the perceived problems. Some arguments for change suggest that PG should direct its energies towards making its library suitable for scholars by including more information in the files, particularly pagination and provenance, presumably packaged with XML. I have no problem with including such information. However, I don't think it should be required of all texts, nor do I believe that it really solves the scholarship issue. Including page scans _would_, to the degree that a solution is possible, and would require approximately 0% extra work for most of our valiant volunteers. And PG has made it clear that this is acceptable, and has already done so for some projects. I feel that Marcello gave the most persuasive and concise summary of the situation, and I didn't notice any overt disagreement. Marcello Perathoner wrote:
The best value for Academia (and the least work for us) would be just to include the page scans. Any transcription you make will fall short of the requirements of some scholar. I think we should use our time for producing more books for a general audience instead than producing Academia-certified editions of them.
HSH's comments justify such an approach. Her Serene Highness wrote:
I need to know EXACTLY when the original was published, who published it, and where, since there are variant texts out there. Even a single word change that might have occurred in the copying process could change the meaning of a vital sentence.
Of course, there is a simple, if unsatisfactory, answer to all these questions for PG texts: they were published by PG, on the PG website, and each file states when it was published. Each work we publish is the "PG variant" of that text. As an academic, I find it dishonest and unhelpful for a scholar to cite a physical volume when the volume they consulted is an electronic edition. It is virtually impossible to guarantee that "even a single word change" was not introduced in the transcription process. Even with DP's careful processes, I would not wager that most of our books enter PG completely error free (or correction free, for that matter.) Page scans allow for an additional layer of safety for any scholar concerned about the adherence to a given print edition, though a certain level of trust in the provider is still required. Thus, while I hope that PG's holdings are as accurate as possible, it would also be my hope that scholars using PG would cite PG. Evidently this is not always the case. Michael Hart wrote:
I've also heard that many of those who complain, actually use our eBooks in secret, and ONLY want the provenance so they can steal them without giving credit where credit is due.
This suggests to me two things. 1) We can include page scans and information about provenance, _when available_, with the files so that academics can feel confident in the reliability of those PG holdings. Not so that the original sources can be dishonestly cited, but to provide the necessary data for certain scholars to confidently cite PG's edition. We can point to this in our documentation to enhance our scholarly credibility. 2) We can prominently suggest an appropriate style of citation of works in PG's holdings. (I've seen this done with other digital collections.) Perhaps if the citation style also takes into account the original source, some otherwise reluctant scholars would be appeased. Is this something we can all agree on? -- David Newman www.davidnewman.info

David's solution is perfectly OK for me. It is sufficient that PG does not discourage keeping the extra information (it did until recently). The volunteers will do the rest. An important improvement would be to be able to go easily from the text to the corresponding page scan. Just having the two separately is fine, but having them linked is better; going from image to text is easy (search), but the converse is often hard. There are of course different solutions. All require, in some form, preserving the page information; including page numbers in the source is just one method. Another remark, on page scans obtained from other sources: one of these sources, the one that I mostly use, and that has originated hundreds and probably thousands of PG books, is the French national library, http://gallica.bnf.fr. I have received (by email) a rather broad permission to use everything on the site to produce ebooks for DP and PG, and related sites (I have used the permission for LiberLiber and DP-EU). It might be possible to renegotiate the permission, but that might result in a restriction of the terms. But I believe that the original permission could cover giving the user the possibility of checking an individual page for comparison, though not mirroring their files, once the transcription is completed; these files can very well be obtained from the origin. The French national library is not expected to die or to become unavailable: and in that case we have the image files. Carlo
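One way the text-to-scan lookup Carlo describes might work is sketched below. This is only an illustration under stated assumptions: the inline page marker format "[p. 12]" and the scan file names like "p0012.png" are hypothetical, not a PG or DP convention, and the function names are made up for the example.

```python
import bisect
import re

# Hypothetical inline page marker, e.g. "[p. 12]"; not an actual PG/DP convention.
PAGE_MARKER = re.compile(r"\[p\. (\d+)\]")


def build_page_index(text):
    """Return two parallel lists: the offset where each page's text starts,
    and the page number recorded in the marker at that point."""
    offsets, pages = [], []
    for match in PAGE_MARKER.finditer(text):
        offsets.append(match.end())
        pages.append(int(match.group(1)))
    return offsets, pages


def page_for_offset(offsets, pages, pos):
    """Return the page number containing character offset pos,
    or None if pos falls before the first marker."""
    i = bisect.bisect_right(offsets, pos) - 1
    return pages[i] if i >= 0 else None


def scan_name(page):
    """Hypothetical scan file name for a page number."""
    return f"p{page:04d}.png"


sample = "[p. 1]First page of text. [p. 2]Second page of text."
offsets, pages = build_page_index(sample)
page = page_for_offset(offsets, pages, sample.index("Second"))
print(page, scan_name(page))
```

The point of the sketch is only that once page boundaries survive in the transcription in some form, going from a position in the text to the matching image is a simple lookup.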

-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of David Newman
Sent: Saturday, November 13, 2004 4:31 AM
To: gutvol-d@lists.pglaf.org
Subject: [gutvol-d] Scholarly use of PG
As a credentialed conflict avoider, I've been loath to stick my head into this fray. Indeed, this battle about meeting the needs of academia appears to be waged at times with an ideological fervor to rival that of the recent US election. It seems to me that the fervency with which people approach this issue has made it difficult in some cases for the arguments to follow a path towards resolution. It is perhaps also complicated by the wide assortment of changes being proposed to remedy the perceived problems. Some arguments for change suggest that PG should direct its energies towards making its library suitable for scholars by including more information in the files, particularly pagination and provenance, presumably packaged with XML. I have no problem with including such information. However, I don't think it should be required of all texts, nor do I believe that it really solves the scholarship issue. Including page scans _would_, to the degree that a solution is possible, and would require approximately 0% extra work for most of our valiant volunteers. And PG has made it clear that this is acceptable, and has already done so for some projects. I feel that Marcello gave the most persuasive and concise summary of the situation, and I didn't notice any overt disagreement. Marcello Perathoner wrote:
The best value for Academia (and the least work for us) would be just to include the page scans. Any transcription you make will fall short of the requirements of some scholar. I think we should use our time for producing more books for a general audience instead than producing Academia-certified editions of them.
HSH's comments justify such an approach. Her Serene Highness wrote:
I need to know EXACTLY when the original was published, who published it, and where, since there are variant texts out there. Even a single word change that might have occurred in the copying process could change the meaning of a vital sentence.
Of course, there is a simple, if unsatisfactory, answer to all these questions for PG texts: they were published by PG, on the PG website, and each file states when it was published. Each work we publish is the "PG variant" of that text. As an academic, I find it dishonest and unhelpful for a scholar to cite a physical volume when the volume they consulted is an electronic edition. It is virtually impossible to guarantee that "even a single word change" was not introduced in the transcription process. Even with DP's careful processes, I would not wager that most of our books enter PG completely error free (or correction free, for that matter.) ** I would find it dishonest also. I think it is very important for people to give correct citations. However- and this is a big however- PG is not 'publishing' books. It's copying them. There is no PG publishing house that is making decisions on whether something is worth publishing or not. PG acts as a repository- a library. Paper publishers cannot guarantee that each word on the written page is exactly as written by the author. However, with books that are well known or historically important, scholars can often compare published texts with author's notes in order to see the variants. Many of the books on PG are obscure. We are given the name of a book and an author, but there is no book to be looked at. If these texts are important- and I would argue that many obscure texts are, if only for historical reasons- it is important to have copies of the scans. In some cases, PG may be the only place where someone can find particular texts. Textual clues do not live only in words. A book comes alive in typeface, and in word placement on a page. James Joyce didn't just write words to be read- he placed them on pages in ways that told the reader how to interpret them.
Taking a book out of context- the context of the page- when that book was written prior to the computer revolution is like ignoring how many paintings were paired with their frames by the painters themselves. Saving a book while divorcing it from its index, illustrations, typeface, and so on is not 'saving' it. It's a decontextualization. A perfect example would be movie remakes. There are many different versions of 'A Christmas Carol', several of them in modern dress. Many of them use pretty much the same exact script. Does that make them the same? Why do people prefer even an old, scratched-up and faded copy with Alastair Sim to a nice shiny new version, even if the new film is a shot for shot remake? A film is more than actors spouting lines. Film is every aspect that goes into it, even beyond what Dickens thought up. There are times when we want nothing more than the words of Dickens, and there are times when we want the thrill of seeing characters come to life before us in front of our physical eyes. A book may be perfectly good reading material- but an ebook printed in Courier (which is very hard to read), perhaps missing its original illustrations, without an index that shows the manner in which the author's or editor's mind worked- is no longer the original book. As a scholar I like working from original materials. An original material may be on a computer screen- that's fine by me. An original material might be enhanced by being online- many versions of The Bible are, for instance, and I received great joy recently while reading what was essentially a book that gave a key to Silverlock- it worked better online than it ever could have on paper. But PG is not publishing or storing original texts. It's working with old ones. I recall the cry that vinyl was going the way of the dinosaur- yet it has not. In fact, the MP3 player is the new vinyl- for the first time in years, there are cost effective '45s', courtesy of Napster and other companies.
I can hear snippets of a song before buying, just as my mother once did in record shops. However, Napster technology is not better in the long run than a record- CDs and computer memory degrade at an alarming rate. Books aren't dead either, and people who think books are about finding passages in less than 25 seconds are missing the point of why people read- in the same way that people who drink coffee to get revved up often don't understand why tea drinkers make elaborate ceremonies around a caffeinated beverage. People read because they want a total experience- computers don't feel like paper. They don't smell. The text is usually flat and more difficult to read. Some of this will change over time- but not all of it, thank the Lord. I want books to be available to the public in ways that they have never been before, and so I support PG. But it doesn't have the credibility of a real library or publishing house, because it doesn't publish (copying things and leaving out some of the vitals doesn't constitute publishing in most people's minds, or at least not in a good way, no matter what info techies might want to think) and it doesn't store (libraries don't cut the covers and publishing info off their books to make more room on the shelves, they include books of criticism, and they have technologies for cross referencing- they also have people called librarians who can help people refine their interests and find books that might be of use to them. So do bookstores. Even Barnes and Noble, to some extent). I think some people here want to store books. That's nice, as far as it goes. Whether they understand how people use books or why- well, I seriously doubt that some people here have thought about that. It's like MS Word, which ignores that people write more complex things than business letters. Its vocabulary and understanding of grammar are seriously stunted, and it's hellish for anyone who wants to edit anything longer than two pages.
Does it process words efficiently? Yes. But it's a fucking bad word processor and has none of the grace of WordPerfect. That most people are fine with it shows how few people actually write or edit for the joy of doing so, which is fine- but its incompatibility with WP and vice versa makes life tough on those who do.*** Page scans allow for an additional layer of safety for any scholar concerned about the adherence to a given print edition, though a certain level of trust in the provider is still required. Thus, while I hope that PG's holdings are as accurate as possible, it would also be my hope that scholars using PG would cite PG. Evidently this is not always the case. Michael Hart wrote:
I've also heard that many of those who complain, actually use our eBooks in secret, and ONLY want the provenance so they can steal them without giving credit where credit is due.
This suggests to me two things. 1) We can include page scans and information about provenance, _when available_, with the files so that academics can feel confident in the reliability of those PG holdings. Not so that the original sources can be dishonestly cited, but to provide the necessary data for certain scholars to confidently cite PG's edition. We can point to this in our documentation to enhance our scholarly credibility. 2) We can prominently suggest an appropriate style of citation of works in PG's holdings. (I've seen this done with other digital collections.) Perhaps if the citation style also takes into account the original source, some otherwise reluctant scholars would be appeased. Is this something we can all agree on? -- David Newman www.davidnewman.info _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d

On Sat, Nov 13, 2004 at 01:30:44AM -0800, David Newman wrote:
Marcello Perathoner wrote:
The best value for Academia (and the least work for us) would be just to include the page scans. Any transcription you make will fall short of the requirements of some scholar. I think we should use our time for producing more books for a general audience instead than producing Academia-certified editions of them.
It occurred to me that some people might think that page scans are forbidden or not welcome. While it's true that we don't have many (any?) eBooks with full page scans, we *are* willing & able & ready to take them. Jim Tinsley did a 'howto' on the page scan naming convention (that is the hard part - so people know what they're called and where to find them). The post-10K directory structure, created over a year ago, includes the notion of a subdir for scans. DP has been invited to submit scans along with their texts. Maybe this word has not gone out sufficiently. Like with the XML markup discussion, the question is not "if" but "how." The first folks to submit scans with their submitted eBooks will need to do some extra work to help figure out the best way to do it. The posting team will need to keep track of the large files involved. If someone has the scans for a completed eBook, now would be a good time to work on getting them online. My estimate from early 2004 was that this would have grown the PG collection by an extra terabyte or so if we did it all through 2004. We haven't, and so this growth hasn't happened. But other than needing to deal with the extra space (which is trivial for a small number of eBooks, but could be challenging for our mirrors and main distribution servers when done en masse), there's no impediment I know of to moving forward. -- Greg

Greg Newby wrote:
DP has been invited to submit scans along with their texts.
http://gutenberg.net/faq/S-21 says: "Page images submitted to Distributed Proofreaders are automatically saved, and, while not publicly available today, will probably become so in the future." I took this to mean that there is no point in submitting page scans of DP projects to PG. Is that right? Mike

On Sun, Nov 14, 2004 at 06:14:42AM +1100, Michael Ciesielski wrote:
Greg Newby wrote:
DP has been invited to submit scans along with their texts.
http://gutenberg.net/faq/S-21 says:
"Page images submitted to Distributed Proofreaders are automatically saved, and, while not publicly available today, will probably become so in the future."
I took this to mean that there is no point in submitting page scans of DP projects to PG. Is that right?
Mike
Jim Tinsley had sent a note in this thread about this too, that I hadn't yet seen when I wrote my reply. No, it's not right. Yes, we are ready to accept page scans as part of completed eBooks from DP or other sources. Jim said he's already done this for 3 eBooks. I hope Jim can find the little 'howto' he wrote about the file names & formats, otherwise I can dive into my email archive to seek it out. Getting the process tuned will take a few tries, and require some patience from everyone involved, but the intent for quite some time (over a year at least) has been to move forward with scans. -- Greg

Greg wrote:
It occurred to me that some people might think that page scans are forbidden or not welcome. While it's true that we don't have many (any?) eBooks with full page scans, we *are* willing & able & ready to take them.
This is excellent news! Yes, I think people were uncertain about how welcome page scans were by PG. (Whether PG should require page scans be submitted along with texts, with certain exceptions given, is a different issue.) Obviously, if the page scans existed for all the 10,000+ PG texts, the collection of scans would occupy a lot of space, but surprisingly not as much as one might think, at least by today's hardware standards. Assuming we have 15,000 texts, each of which has an average of 300 source pages (which may be a high estimate -- anyone?), and each page scan occupies about 60k (using an efficient lossless compression scheme -- this may also be a high estimate -- anyone?), this works out to a little under 300 gigabytes. (My son recently bought two 200G hard drives for $100 each. There are 300G drives available, and it seems like year after year hard disk capacities continue to increase, while $/gig continues to drop.) I know Brewster Kahle at the Internet Archive will also be happy to receive file copies of these page scans and tuck them away into his archive (which is redundantly mirrored) for preservation and open online access. Of course, with one million scanned books, we are now talking about significant space, approximately 20 terabytes (using the assumptions above). But this is 1/5 of Brewster's "rack" (where 10 racks make a petabyte) and again I know he'll be thrilled to store these away for safekeeping and open access. (PG should also store these scans itself and find others throughout the world willing to store them on hard disk, tape, etc., to assure redundant storage and preservation.) It would not surprise me to see in a few years high quality, durable, random access, compact, and very cheap storage in the ten to twenty terabyte range per unit -- enough to hold the original page scans for one million books. We then can start thinking about one billion books.
So storage and access should NOT be an issue with regard to acquiring the original page scans for the PG Library. Jon Noring
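The storage estimate in Jon's message can be checked with a few lines of Python. This is only the arithmetic from the message, assuming decimal units (1k = 1000 bytes) and reading "10 racks make a petabyte" as 100 terabytes per rack; the function and constant names are made up for illustration.

```python
KB = 1000
GB = 1000 ** 3
TB = 1000 ** 4

# "10 racks make a petabyte" is read here as 100 TB per rack (an assumption).
RACK_BYTES = 100 * TB


def scan_storage_bytes(books, pages_per_book=300, bytes_per_page=60 * KB):
    """Total bytes needed for page scans, using the message's averages
    of 300 pages per book and about 60k per compressed page scan."""
    return books * pages_per_book * bytes_per_page


pg_collection = scan_storage_bytes(15_000)
million_books = scan_storage_bytes(1_000_000)

print(f"15,000 texts:    {pg_collection / GB:.0f} GB")
print(f"1,000,000 books: {million_books / TB:.0f} TB")
print(f"Share of one rack: {million_books / RACK_BYTES:.2f}")
```

Under these assumptions the figures come to 270 GB for 15,000 texts and 18 TB for one million books, which matches the "a little under 300 gigabytes" and "approximately 20 terabytes" quoted in the message.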

I thought we had a plan to save all page scans nearly a year ago. Greg told me he thought that Charles Franks had them, but both are on the road/vacation right now, so not sure how to check. We'll see. Michael
participants (7): Carlo Traverso, David Newman, Greg Newby, Her Serene Highness, Jon Noring, Michael Ciesielski, Michael Hart