
don said:
If PG is to have a hope of citizen-provided proofing, crowd or otherwise, there are two non-negotiable design requirements.
The page images must be available as the basis for resolving all questions about content (understanding they aren't always unambiguous, but no images is hopeless.)
The text must have the page numbers embedded as data so the error submission process includes the ability to easily confirm the text with the image.
well, don, yes, your general point is certainly spot-on. if i wanted to quibble, i could mention that it would certainly be possible to build capabilities that could handle much of the task via a variety of other means, using other ways, maybe being even more effective. (such as auto-comparison of multiple digitizations.) nonetheless, i would still have to admit that, at many points in time, along the way, it is simply _necessary_ to be able to refer to the original, to quell some issue. it's an unavoidable need, for a digitizing system.
***
and yes, i understand quite well that you are merely probing to see if the powers-that-be will "concede" that they will indeed have to install such capability. since if they don't agree to _that_, we just go home, knowing that the job before us would be impossible.
***
still, don, perhaps you're getting ahead of yourself. because even if they _do_ concede, would it matter? is there anyone to build such a correction system? (and by "build", i mean "program". it's obvious we have no shortage of "architects" here who're willing to draft up their favorite fancy blueprints. but without anyone to actually build the thing...)
so, would anyone code a p.g. correction capability? as someone who has volunteered -- many times -- to program a system for project gutenberg that did exactly that, i do believe i can answer that question. today, 2012, i would never build that system for p.g. and not just because p.g. lost all its credit with me... because i'd advise anyone else _not_ to do it either...
the reason is simple: not enough bang for your buck.
first of all, the requirements are hard for p.g. to meet. even just these two don listed are hard, because p.g. never paid any attention -- at all -- to pagenumbers. and even though they give a little lip-service to scans, it's not part of their d.n.a. either. so the infrastructure is wholly and completely unprepared for such demands.
indeed, d.p. has the scans for most of the books they've digitized over the years -- online as we speak, i believe. (if a provider asks other sites not to repost their scans, d.p. will respect that request, which is a bad response, to my mind, because nobody should put that restriction on scans from a public-domain book, and we shouldn't honor such a restriction if anyone does try to impose it; but that's the d.p. policy.) at any rate, the fact remains that d.p. _has_ the scans. but p.g. doesn't mount them.
and yes, for all the p.g. books, you can probably go find a scan-set somewhere, either google or internet archive, or somewhere else. but who is it that'll do that legwork? and how do we decide which version, if there are many? and then there are the practical issues, like file-naming. and the p.g. linebreaks which do not match the scan-set. and scan-sets not atypically will manifest some glitches, which -- at d.p. -- requires a "hospital". who mans that?
none of these are trivial issues. they can be surmounted, at least individually, but it's work. en masse? more work. and if you think _programmers_ are in short supply here, take a look around and see how many gophers you count.
but even if we ignore all of that, for now, we can't ignore that p.g. _could_ be mounting scans now. but it doesn't. which tells me the p.g. infrastructure can't handle scans, not with any volume, which means this thing won't scale.
so that's a difficulty facing you. but it's not the hardest one. no sir. not by a long shot.
the hardest part would be the obstacle of the politics. the navigation of marcello, all by itself, would impose such a huge cost on you that it wouldn't be worthwhile. no way to go anywhere without marcello blocking you. then add greg, and the whitewashers, plus all the flack you'd have to take from d.p. for invading "their" space, not to mention all the "cooks" in the gutvol-d "kitchen". all with the best of intentions, mind you. (well, maybe except for marcello.) but a huge obstacle nonetheless. heck, it'd be a _nightmare_ -- and that's putting it mildly.
***
but even if, with some miracle, you were prepared to deal with such immense costs, the benefits are feeble.
the most you're gonna get by fixing the p.g. library is 40,000 books. and that is a total fantasy number. at least 10,000 of the "books" are repeat fragments, audio files, and various things which are not "books". so the _real_ number is closer to 30,000, if that many. but even if it _was_ 40,000, or 80,000, that number is _dwarfed_ by the collections now available elsewhere.
moreover, a very high percentage of the p.g. e-books are _already_ very clean, especially in a relative sense, which lessens further the bang you'd get for your buck. nobody cares much if you remove the last 100 errors from a book. and if they knew you spent _100 hours_ fixing those last 100 errors, they'd say you were crazy. they'd rather you spent that time taking 10 new books from "unacceptably dirty o.c.r." to "ok, this will work".
and where can you do that? well, over at internet archive, of course.
so, if any programmers want to build such a system, my advice to them is to "build it for internet archive". their infrastructure _does_ support page-scans, and it _does_ keep track of page-numbers, and their stuff is exposed on the web, in a somewhat-documented way, so you can grab it from them without talking to them. (which is a good thing, because they're just as deaf as the p.g. powers-that-be are, so count your blessings.)
so the costs of building their system will be much lower, and with millions of books the benefits will keep comin'. plus correcting their text would be a real contribution, rather than simply gilding the lily on a few p.g. books.
dollar-for-dollar, pound-for-pound, bang for the buck is gonna be _far_greater_ at internet archive than at p.g. much lower costs. much greater benefits. no-brainer.
People will self-verify if they are given the tools. If they can't, the whitewashers are screwed again.
the whitewashers screwed themselves by not installing such a verification-system in the first place, and that's something i have been saying since december of 2003, when p.g. had 10,000 e-texts, as it was apparent then that we needed a way to go back and repair the books.
a significant mark of the uselessness of this endeavor is the fact that -- more than 8 years after that date -- we are _still_ having a "dialog" about "how" we "could" perhaps "build" a "system" that "might" (or might not) "accomplish" the "goal" of "correcting" those "e-books".
it is testament to michael hart's vision and achievement in building from the grass-roots an actual cyberlibrary, that so many good people have shown that we're willing to waste valuable years of our lives in service to that goal.
but maybe, just maybe, it's time to stop wasting our time. michael is gone, folks. there is no vision here any more...
i, for one, am not gonna discuss this until the year 2020 before i see the p.g. powers-that-be are deaf _and_ blind.
-bowerbird

On Wed, Feb 08, 2012 at 05:10:01PM -0500, Bowerbird@aol.com wrote:
but even if we ignore all of that, for now, we can't ignore that p.g. _could_ be mounting scans now. but it doesn't.
which tells me the p.g. infrastructure can't handle scans, not with any volume, which means this thing won't scale.
Those are false statements. Earlier in the past few weeks of this dialog, I posted a message explaining the page scan system, and that I have been disappointed that DP has not provided scans with every title. It is part of the established WWer workflow to upload such images as part of a new book publication, and we also have a workflow to add page images to existing titles.
(BB was, in fact, at the event in San Francisco in 2004 when Charles & Juliet & I talked about wanting to get images, and the file naming scheme & other details. Though he wasn't part of that discussion, that I recall.)
The file naming scheme is well known, and page image sets for over 7000 eBooks are online at www.gutenberg.org right now.
-- Greg

On Feb 10, 2012, at 11:00, Greg Newby wrote:
Earlier in the past few weeks of this dialog, I posted a message explaining the page scan system, and that I have been disappointed that DP has not provided scans with every title. It is part of the established WWer workflow to upload such images as part of a new book publication, and we also have a workflow to add page images to existing titles.
Well… We (i.e. a handful of DP volunteers) did it for a while, and at some point something at PG broke and the WWers stopped posting our page images. ): (So we stopped providing them because it was rather frustrating to spend hours on checking and packaging them correctly and then not have them appear in the catalogue anyway.)
I would be delighted to provide them for all the books I do. I would be delighted to provide them for all or most of the books I've done; should still have them lying around.
Is there documentation of how to do it? Back then we had to follow a specific naming scheme, then put them in some arcane folder hierarchy, zip that, and upload it by FTP. It could only be done after the book was posted, because before we wouldn't know the e-book number that was required for the arcane folder hierarchy. We also had to look up which WWer had posted each book, because they only posted page images for the books they had posted themselves. If it's still done that way, I would need an FTP account as well, please.
I would also be delighted if they were more accessible. The way I remember it, they are hidden somewhere behind the “more files” link, which most users would probably not think to click.
Thanks, Jana

So … can I get a reply on this from someone? Greg? One of the WWers?
Greg says he's “disappointed” we aren't providing page scans for each title. He says it's “part of the established WWer workflow” to post them, either as part of a new book, or separately.
Can someone please explain that “established workflow”? How exactly does one, as the contributor of a book, either new or already posted, go about submitting page scans for it?
Thanks, Jana
Begin forwarded message:
From: Jana Srna <jana.srna@gmail.com> Subject: Re: [gutvol-d] 2020 vision Date: February 10, 2012 13:46:34 GMT+01:00 To: Project Gutenberg Volunteer Discussion <gutvol-d@lists.pglaf.org>
On Feb 10, 2012, at 11:00, Greg Newby wrote:
Earlier in the past few weeks of this dialog, I posted a message explaining the page scan system, and that I have been disappointed that DP has not provided scans with every title. It is part of the established WWer workflow to upload such images as part of a new book publication, and we also have a workflow to add page images to existing titles.
Well… We (i.e. a handful of DP volunteers) did it for a while, and at some point something at PG broke and the WWers stopped posting our page images. ): (So we stopped providing them because it was rather frustrating to spend hours on checking and packaging them correctly and then not have them appear in the catalogue anyway.)
I would be delighted to provide them for all the books I do. I would be delighted to provide them for all or most of the books I've done; should still have them lying around.
Is there documentation of how to do it? Back then we had to follow a specific naming scheme, then put them in some arcane folder hierarchy, zip that, and upload it by FTP. It could only be done after the book was posted, because before we wouldn't know the e-book number that was required for the arcane folder hierarchy. We also had to look up which WWer had posted each book, because they only posted page images for the books they had posted themselves. If it's still done that way, I would need an FTP account as well, please.
I would also be delighted if they were more accessible. The way I remember it, they are hidden somewhere behind the “more files” link, which most users would probably not think to click.
Thanks, Jana

On Mon, Feb 13, 2012 at 09:43:36AM +0100, Jana Srna wrote:
So … can I get a reply on this from someone? Greg? One of the WWers?
Greg says he's “disappointed” we aren't providing page scans for each title. He says it's “part of the established WWer workflow” to post them, either as part of a new book, or separately.
Can someone please explain that “established workflow”? How exactly does one, as the contributor of a book, either new or already posted, go about submitting page scans for it?
Sorry for not sending this earlier, but I checked with the WWers first, and then neglected to follow up on -d. Details below.
I don't know whether these guidelines are available elsewhere, but I did confirm with the WWers they are able & ready to accept page images. One volunteer was very active at getting these from DP, but that ended a few years ago.
Note that we do NOT need page images only as a zip. They are OK unzipped. (The PIZ extensions are because our old mailing list manager refused messages with the string 'zip')
From: "Jim Tinsley" <jtinsley@pobox.com> To: "Posted Etexts for Project Gutenberg" <posted@listserv.unc.edu> Subject: [posted] Posted (#12973, Butler) ! Date: Tue, 20 Jul 2004 20:24:32 -0700 (PDT)
Personal Recollections of Pardee Butler, by Pardee Butler 12973 [Editor: Mrs. Rosetta B. Hastings] [Contributor: Mrs. Rosetta B. Hastings] [Contributor: Elder John Boggs] [Contributor: Elder J. B. McCleery] [Link: http://www.gutenberg.net/1/2/9/7/12973 ] [Files: 12973.txt; 12973-h.htm; 12973-page-images]
Thanks to Roger for finding and scanning this book.
This is the first PG book to be posted with page images. We are now beginning to accept page images along with the regular postings. Of course, DP has always preserved its page images, and those will eventually be uploaded in a big batch, or series of batches, but non-DP contributions may now begin adding page images.
For now, we're setting the following guidelines for page image postings:
1. PG is now accepting page images of books posted. Page images will be posted _only_ as an addition to an etext posted in the normal way -- we will not post page images without plain text.
2. Page images are an option; they are not and will not be required for the posting of a text.
3. All page images should be good enough to work reasonably well with OCR packages, up to 600 dpi, and should be stored as black-and-white TIFFs with CCITT-4 (aka ITU-G4 or Fax Group 4) compression. This is important, so that we keep the overall file size down to a sustainable level. With this compression, a typical 600dpi page can be stored for about 40KB. Our ability to post these images depends on the file sizes staying fairly reasonable. Pages such as color pictures or greyscale photos that cannot reasonably be stored as black-and-white only should be stored as TIFF or JPEG with the best compression you can get for that image.
(Note: Irfanview for Windows does this nicely individually or in batch. ImageMagick v 6.x: convert myimage.png -compress group4 myimage.tif )
4. Each page image should be a separate file and named with the page number within the set; e.g. 001.tif, 002.tif, etc. Separate, non-page images, such as covers or color images scanned separately from the pages, should have suitable names, such as "cover.jpg" or "072-image.tif" All page images for the book will be zipped into one file, to be called FILENUMBER-page-images, e.g. 12345-page-images.piz (reverse the extension) for etext #12345, and stored in the main directory for that etext. It will unzip to a subdirectory ./page-images, but we will not post separate page images in that directory, since that would double the space used, and we believe that people who want to consult the images will probably want them all. So, for now at least, if you want the images, you download the PIZ (backwards again) file.
jim
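As a rough illustration of the layout in Jim's posting, here is a minimal Python sketch. It assumes the directory rule can be generalised from the single example link (12973 stored under 1/2/9/7/12973), and that single-digit numbers sit under a "0" directory; both are assumptions, and the function names are made up for illustration rather than taken from any PG tooling.

```python
def etext_directory(etext_number: int) -> str:
    """Return the relative directory for an etext, e.g. 12973 -> '1/2/9/7/12973'."""
    digits = str(etext_number)
    # Every digit except the last becomes one directory level.
    # Assumption: a single-digit number has no leading digits, so use "0".
    parents = list(digits[:-1]) or ["0"]
    return "/".join(parents + [digits])


def page_image_archive_name(etext_number: int) -> str:
    """Return the archive name from guideline 4 (the 'piz' extension reversed)."""
    return f"{etext_number}-page-images.zip"


if __name__ == "__main__":
    print(etext_directory(12973))          # 1/2/9/7/12973
    print(page_image_archive_name(12345))  # 12345-page-images.zip
```

Because the path is derived from the etext number, it cannot be computed before the book is posted, which matches Jana's earlier remark about the folder hierarchy.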

Thanks, Greg (and, by extension, Jim).
So, to clarify:
One of the ways is to upload the page images with the new e-book. I'm assuming I just include another folder, next to the images folder and the text and HTML files I submit, called “page-images”, including all the page scans. I can obviously not call it “FILENUMBER-page-images”, since at that point I don't know the e-book number yet. Is that correct? I'll try with my next submission.
What if I haven't submitted the page images with a book and want to go about adding them later? How do I send them to the WWers? Do I upload them somewhere? If so, where? Shall I use the ordinary e-book submission form again and just upload the page images without the book, with a note which e-book number they are for? That seems rather inconvenient. (Not inconvenient for _me_, but for the _WWers_. _I_ wouldn't mind.) Or is it still done by FTP? If so, how do I (or anyone else) get an account? And exact instructions on where to put the images and how to notify the WWers?
What do I do if there are lots of pages and/or lots of illustrations and the file gets a little larger? The e-book submission web form has an upload limit. I don't know how large it is, but I know some of us have run up against it even for ordinary illustrated e-book submissions without page images. Is there an alternative way in that case?
Jana, really happy that she'll hopefully get to store her images somewhere other than on her own machine soon! (I have some that are _not_ in the Internet Archive or Google Books, or at least not as complete sets, so at least those would be valuable to have)
On Feb 13, 2012, at 11:04, Greg Newby wrote:
On Mon, Feb 13, 2012 at 09:43:36AM +0100, Jana Srna wrote:
So … can I get a reply on this from someone? Greg? One of the WWers?
Greg says he's “disappointed” we aren't providing page scans for each title. He says it's “part of the established WWer workflow” to post them, either as part of a new book, or separately.
Can someone please explain that “established workflow”? How exactly does one, as the contributor of a book, either new or already posted, go about submitting page scans for it?
Sorry for not sending this earlier, but I checked with the WWers first, and then neglected to follow up on -d. Details below.
I don't know whether these guidelines are available elsewhere, but I did confirm with the WWers they are able & ready to accept page images. One volunteer was very active at getting these from DP, but that ended a few years ago.
Note that we do NOT need page images only as a zip. They are OK unzipped. (The PIZ extensions are because our old mailing list manager refused messages with the string 'zip')
From: "Jim Tinsley" <jtinsley@pobox.com> To: "Posted Etexts for Project Gutenberg" <posted@listserv.unc.edu> Subject: [posted] Posted (#12973, Butler) ! Date: Tue, 20 Jul 2004 20:24:32 -0700 (PDT)
Personal Recollections of Pardee Butler, by Pardee Butler 12973 [Editor: Mrs. Rosetta B. Hastings] [Contributor: Mrs. Rosetta B. Hastings] [Contributor: Elder John Boggs] [Contributor: Elder J. B. McCleery] [Link: http://www.gutenberg.net/1/2/9/7/12973 ] [Files: 12973.txt; 12973-h.htm; 12973-page-images]
Thanks to Roger for finding and scanning this book.
This is the first PG book to be posted with page images. We are now beginning to accept page images along with the regular postings. Of course, DP has always preserved its page images, and those will eventually be uploaded in a big batch, or series of batches, but non-DP contributions may now begin adding page images.
For now, we're setting the following guidelines for page image postings:
1. PG is now accepting page images of books posted. Page images will be posted _only_ as an addition to an etext posted in the normal way -- we will not post page images without plain text.
2. Page images are an option; they are not and will not be required for the posting of a text.
3. All page images should be good enough to work reasonably well with OCR packages, up to 600 dpi, and should be stored as black-and-white TIFFs with CCITT-4 (aka ITU-G4 or Fax Group 4) compression. This is important, so that we keep the overall file size down to a sustainable level. With this compression, a typical 600dpi page can be stored for about 40KB. Our ability to post these images depends on the file sizes staying fairly reasonable. Pages such as color pictures or greyscale photos that cannot reasonably be stored as black-and-white only should be stored as TIFF or JPEG with the best compression you can get for that image.
(Note: Irfanview for Windows does this nicely individually or in batch. ImageMagick v 6.x: convert myimage.png -compress group4 myimage.tif )
4. Each page image should be a separate file and named with the page number within the set; e.g. 001.tif, 002.tif, etc. Separate, non-page images, such as covers or color images scanned separately from the pages, should have suitable names, such as "cover.jpg" or "072-image.tif" All page images for the book will be zipped into one file, to be called FILENUMBER-page-images, e.g. 12345-page-images.piz (reverse the extension) for etext #12345, and stored in the main directory for that etext. It will unzip to a subdirectory ./page-images, but we will not post separate page images in that directory, since that would double the space used, and we believe that people who want to consult the images will probably want them all. So, for now at least, if you want the images, you download the PIZ (backwards again) file.
jim

I remember that there was some suggested naming scheme, but I don't remember it since it was complicated. 12973 was a happy exception, since numbering from 001 the titlepage and going on, the pages get the same number as the page number.
I remember that the number in the filenames should match the page number, but there was a series of rules for inserted pages, for frontmatter, for initial pages numbered in roman, and for prefixes to keep them in order.
I have two books that are really weird: one in which the page numbers are out of order in the original, and another in which a signature repeats the numbers of a previous one (after 1...208 we have 197a....208a, then 209...285).
Carlo

I remember, because I use this scheme to name the files that are used in the Encyclopædia Britannica project on DP. Divided pages have a character appended to maintain order and distinction: 001a, 001b, 002a, 002b, etc.
This also gives me the happy distinction of having the only image files that can be submitted more or less directly to PG without renaming the files. And these files are consequently inherently cross-indexed with the texts.
Other DP projects, otoh, can at best give PG thousands of stacks of images uncorrelated with the texts, whether they have page numbers or not.
Now there's a crowd-source opportunity - renaming the image files.
On Mon, Feb 13, 2012 at 5:49 AM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
I remember that there was some suggested naming scheme, but I don't remember it since it was complicated. 12973 was a happy exception, since numbering from 001 the titlepage and going on, the pages get the same number as the page number.
I remember that the number in the filenames should match the page number, but there was a series of rules for inserted pages, for frontmatter, for initial pages numbered in roman, and for prefixes to keep them in order.
I have two books that are really weird: one in which the page numbers are out of order in the original, and another in which a signature repeats the numbers of a previous one (after 1...208 we have 197a....208a, then 209...285).
Carlo
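The appended-letter convention mentioned in this exchange (001a, 001b, 002a, ...) can be sketched as a sort key. This is only an illustration under stated assumptions: the pattern below covers plain page numbers with an optional single lowercase letter and a .tif extension, and it does not attempt the rules for frontmatter, roman numerals, or prefixes that Carlo mentions, since those are not spelled out in the thread. The names `PAGE_NAME` and `page_sort_key` are hypothetical.

```python
import re

# Assumed pattern: digits, an optional single lowercase letter, then ".tif".
PAGE_NAME = re.compile(r"^(\d+)([a-z]?)\.tif$")


def page_sort_key(filename: str) -> tuple[int, str]:
    """Split '001b.tif' into (1, 'b') so pages sort by number, then by letter."""
    match = PAGE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a page-image name: {filename!r}")
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    names = ["002a.tif", "001b.tif", "010.tif", "001a.tif", "002b.tif"]
    print(sorted(names, key=page_sort_key))
    # ['001a.tif', '001b.tif', '002a.tif', '002b.tif', '010.tif']
```

Note that a key like this would place 197a directly after 197, so it would not reproduce the physical order of Carlo's book with the repeated signature (where 197a follows 208); a case like that would still need some extra prefix or manual ordering.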
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

My suggestions, Greg to clarify as needed:
For new submissions:
Name the folder "page-images", and include it in the new project's zip file. The WWers can prefix the name with the assigned etext number. The folder should not be inside the normal /images folder.
The usual procedure for past large submissions is that the entire submission be FTPed to PG's /incoming folder, then a small placeholder file be uploaded the normal way, which will notify the WWers. (FTPing a file doesn't notify the WWers.) Mention in the Note to Whitewashers field the name of the uploaded file.
Do *NOT* create a folder for your FTPed file. There's no need if you're uploading a zip file with a unique name, and the WWers can't delete the folder anyway (the files in the folder, yes; the folder itself, no).
For an existing etext:
Name the zip file "nnnnn-page-images.zip", where nnnnn is the etext number, then upload via FTP, as above. The zip file should contain a folder, also named "nnnnn-page-images", not just the page scans themselves. Notify the WWers by emailing "pgww@...", mentioning the name of the zip file and that it's for an existing etext.
Al
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Jana Srna Sent: Monday, February 13, 2012 4:24 AM To: Project Gutenberg Volunteer Discussion Subject: [gutvol-d] Posting page images at PG [was: 2020 vision]
Thanks, Greg (and, by extension, Jim).
So, to clarify:
One of the ways is to upload the page images with the new e-book. I'm assuming I just include another folder, next to the images folder and the text and HTML files I submit, called "page-images", including all the page scans. I can obviously not call it "FILENUMBER-page-images", since at that point I don't know the e-book number yet. Is that correct? I'll try with my next submission.
What if I haven't submitted the page images with a book and want to go about adding them later? How do I send them to the WWers? Do I upload them somewhere? If so, where? Shall I use the ordinary e-book submission form again and just upload the page images without the book, with a note which e-book number they are for? That seems rather inconvenient. (Not inconvenient for _me_, but for the _WWers_. _I_ wouldn't mind.) Or is it still done by FTP? If so, how do I (or anyone else) get an account? And exact instructions on where to put the images and how to notify the WWers?
What do I do if there are lots of pages and/or lots of illustrations and the file gets a little larger? The e-book submission web form has an upload limit. I don't know how large it is, but I know some of us have run up against it even for ordinary illustrated e-book submissions without page images. Is there an alternative way in that case?
Jana, really happy that she'll hopefully get to store her images somewhere other than on her own machine soon! (I have some that are _not_ in the Internet Archive or Google Books, or at least not as complete sets, so at least those would be valuable to have)

Thanks, Al, that's very helpful!
Now there's just one little detail missing that I can think of off-hand: How do I access PG's /incoming folder? What's the URL or IP for the FTP server? Does anonymous access work?
Thanks for bearing with all my questions!
Jana
On Feb 13, 2012, at 20:04, Al Haines wrote:
My suggestions, Greg to clarify as needed:
For new submissions:
Name the folder "page-images", and include it in the new project's zip file. The WWers can prefix the name with the assigned etext number. The folder should not be inside the normal /images folder.
The usual procedure for past large submissions is that the entire submission be FTPed to PG's /incoming folder, then a small placeholder file be uploaded the normal way, which will notify the WWers. (FTPing a file doesn't notify the WWers.) Mention in the Note to Whitewashers field the name of the uploaded file.
Do *NOT* create a folder for your FTPed file. There's no need if you're uploading a zip file with a unique name, and the WWers can't delete the folder anyway (the files in the folder, yes; the folder itself, no).
For an existing etext:
Name the zip file "nnnnn-page-images.zip", where nnnnn is the etext number, then upload via FTP, as above. The zip file should contain a folder, also named "nnnnn-page-images", not just the page scans themselves. Notify the WWers by emailing "pgww@...", mentioning the name of the zip file and that it's for an existing etext.
Al
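The "existing etext" packaging rule above could be scripted roughly as follows. This is only a minimal sketch, not an official PG tool: the function name build_page_image_zip and its parameters are made up for illustration, and it assumes the scans already sit in one local directory under their final names. It covers only the zip layout (a zip named nnnnn-page-images.zip containing a folder of the same base name); the FTP upload and the email to the WWers are left out, since the server details are not given in this thread.

```python
import zipfile
from pathlib import Path


def build_page_image_zip(etext_number: int, scan_dir: Path, out_dir: Path) -> Path:
    """Pack every file in scan_dir into out_dir/nnnnn-page-images.zip.

    Hypothetical helper: each scan is stored under a top-level folder
    named nnnnn-page-images, so the archive unpacks to a folder rather
    than to loose page scans.
    """
    base = f"{etext_number}-page-images"
    zip_path = out_dir / f"{base}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for scan in sorted(scan_dir.iterdir()):
            # Skip subdirectories, and skip the archive itself in case
            # out_dir and scan_dir are the same directory.
            if not scan.is_file() or scan.resolve() == zip_path.resolve():
                continue
            zf.write(scan, arcname=f"{base}/{scan.name}")
    return zip_path
```

Calling build_page_image_zip(12345, Path("scans"), Path(".")) would write 12345-page-images.zip to the current directory, with entries such as 12345-page-images/001.tif inside it.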

I have questions.
I have a project I just prepared for DP - another EB page set. It is now just beginning proofreading. I have low expectations that, if I leave things to follow their normal course, the images will ever arrive at PG.
How do I upload the images to PG? The project won't complete the DP project cycle for several more years, over which time the image set can only decay ... and I may not be around.
Can I upload images for pages already in the cycle? Can I upload fixes for images already uploaded? Can I find out which projects have uploaded images? Or more importantly, which ones don't?
What is being done to transfer images from DP to PG? Can they just be uploaded based on project ids, or do they need to be checked over, and possibly renamed (and by whom, and where, and how)?
Some of these are DP questions, I know. Who is responsible? What are the plans?
On Mon, Feb 13, 2012 at 11:13 AM, Jana Srna <jana.srna@gmail.com> wrote:
Thanks, Al, that's very helpful!
Now there's just one little detail missing that I can think of off-hand: How do I access PG's /incoming folder? What's the URL or IP for the FTP server? Does anonymous access work?
Thanks for bearing with all my questions!
Jana

There is a different problem with images that one obtained by signing a license for non-commercial use. I consider that I don't violate such an agreement by giving these images to DP to transcribe, or even to PG, provided that PG does not redistribute these images publicly but keeps them for internal use, as a reference for the errata team.
Does PG prefer to have the images and keep them for internal use, or not have them at all?
Another possibility for these image providers would be to point to where the images can be retrieved. Many of these sites are as stable as PG, possibly more so. Some even have stable URLs.
Would PG accept metadata indicating the image source as precisely as possible?
Carlo

For that matter, given a table associating URLs with page numbers, PG could harvest the images and organize them any way they like. They could even harvest them directly from DP, since each image has a unique URL. Guiguts could also easily be enhanced to produce the image zip file.
On Mon, Feb 13, 2012 at 12:25 PM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
There is a different problem with images that one obtained by signing a license for non-commercial use. I consider that I don't violate such an agreement by giving these images to DP to transcribe, or even to PG, provided that PG does not redistribute these images publicly but keeps them for internal use, as a reference for the errata team.
Does PG prefer to have the images and keep them for internal use, or not have them at all?
Another possibility for these image providers would be to point to where the images can be retrieved. Many of these sites are as stable as PG, possibly more so. Some even have stable URLs.
Would PG accept metadata indicating the image source as precisely as possible?
Carlo
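The idea of harvesting from a table that associates URLs with page numbers can be sketched in a few lines. What follows is an illustration under stated assumptions, not anything PG or DP actually runs: the function name plan_harvest, the dict-shaped table, and the example.org URLs are all hypothetical. It only works out which file name each URL would be saved under, using the zero-padded page-number naming (001.tif, 002.tif) described earlier in the thread; the downloading itself is deliberately left out.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def plan_harvest(page_urls: dict[int, str], width: int = 3) -> list[tuple[str, str]]:
    """Return (url, target_filename) pairs in page order.

    Hypothetical helper: page_urls maps a page number to the URL of its
    scan. The target name is the zero-padded page number plus whatever
    file extension the URL path carries (lower-cased).
    """
    plan = []
    for page in sorted(page_urls):
        url = page_urls[page]
        # Take the extension from the URL path only, ignoring any query string.
        ext = PurePosixPath(urlparse(url).path).suffix.lower()
        plan.append((url, f"{page:0{width}d}{ext}"))
    return plan


# Made-up table for illustration; real URLs would come from the image provider.
table = {
    2: "http://example.org/scans/p-b.tif",
    1: "http://example.org/scans/p-a.tif",
}
for url, name in plan_harvest(table):
    print(url, "->", name)
```

Keeping the plan separate from the download step means the mapping can be checked over (and corrected) by a person before any images are fetched or renamed.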

On Mon, February 13, 2012 12:04 pm, Al Haines wrote:
My suggestions, Greg to clarify as needed:
For new submissions:
Name the folder "page-images", and include it in the new project's zip file. The WWers can prefix the name with the assigned etext number. The folder should not be inside the normal /images folder.
[large snip]
I haven't been able to find these instructions anywhere on the gutenberg.org wiki. Can you point me to where these instructions are documented for the HTTP:// world to find?

Jana> (I have some that are _not_ in the Internet Archive or Google Books, or at least not as complete sets, so at least those would be valuable to have)
I would hope that you (also) post them to Internet Archive, since this seems an exact match to what they do.

Greg>The file naming scheme is well known, and page image sets for over 7000 eBooks are online at www.gutenberg.org right now.
Sorry, they may be, but when I go to www.gutenberg.org and search that page for "page images" I find nothing, and if I use the "search site" feature of www.gutenberg.org and enter "page images" I also find nothing. So the file naming scheme may be "well known" to somebody, but not to "www.gutenberg.org".
participants (8)
- Al Haines
- Bowerbird@aol.com
- don kretz
- Greg Newby
- Jana Srna
- Jim Adcock
- Lee Passey
- traverso@posso.dm.unipi.it