
Hi again,

First of all, let me make a suggestion. Greg said there are about 7000 books, if I remember correctly, for which PG has page images. Unfortunately, no one actually knows about this, and the page images are hard to find even if one does. It's a hassle to find out which books have them and which don't.

Could we perhaps get a line added to the list of available file types in the catalogue for books with page images? Something like "Page Images ZIP" that lets you download the page images zip file directly from the book's overview page, instead of having to go through the “More Files…” link and the directory listing there?

Secondly, I have noticed an inconsistency in the page images for different books. For the books that got them added back when we (some DP volunteers) uploaded them in bulk, there is only the xxxxx-page-images directory, but no zip file. Here's an example: http://www.gutenberg.org/files/21989/

For the books that Al posted for me a few days ago (thanks, Al!), there is both that directory and a zip of it. Here's an example: http://www.gutenberg.org/files/38402/

Why the difference? Personally, I find the zip file more valuable than the directory, so I would hope that zips could easily be created for the books that don't have them. I do, however, understand that this would take up more disk space, so it might not be possible.

It would also, perhaps, be possible to combine my first suggestion with this one: add a line to the book overview page as stated above, but have it point to a script that creates the zip file on the fly (just like the ones for the epubs and mobis and such) and caches it for, say, a day, instead of having all those zip files lie around on disk in addition to the directories.

Jana
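The on-the-fly idea in the last paragraph could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name, the cache directory, and the one-day lifetime are hypothetical, and nothing here reflects how PG's actual scripts for epubs and mobis work.

```python
import time
import zipfile
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 24 * 60 * 60  # "say, a day" (assumed cache lifetime)


def cached_page_images_zip(images_dir: Path, cache_dir: Path,
                           now: Optional[float] = None) -> Path:
    """Return a zip of images_dir, rebuilding it only if the cached copy
    is missing or older than CACHE_TTL_SECONDS. Hypothetical helper."""
    now = time.time() if now is None else now
    cache_dir.mkdir(parents=True, exist_ok=True)
    zip_path = cache_dir / f"{images_dir.name}.zip"

    # Cache hit: the zip exists and is still fresh.
    if zip_path.exists() and now - zip_path.stat().st_mtime < CACHE_TTL_SECONDS:
        return zip_path

    # Cache miss: write to a temporary name, then rename, so a reader
    # never sees a half-written zip.
    tmp_path = zip_path.with_name(zip_path.name + ".tmp")
    with zipfile.ZipFile(tmp_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(images_dir.rglob("*")):
            if path.is_file():
                rel = path.relative_to(images_dir).as_posix()
                zf.write(path, arcname=f"{images_dir.name}/{rel}")
    tmp_path.replace(zip_path)
    return zip_path
```

ZIP_STORED is used because scanned images are usually already compressed; that choice is an assumption too, not something stated in the thread.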

"Jana" == Jana Srna <jana.srna@gmail.com> writes:
Jana> Secondly, I have noticed an inconsistency in the page images
Jana> for different books. For the books that got them added back
Jana> when we (some DP volunteers) uploaded them in bulk, there is
Jana> only the xxxxx-page-images directory, but no zip file.
Jana> Here's an example: http://www.gutenberg.org/files/21989/

Jana> For the books that Al has posted for me a few days ago
Jana> (thanks, Al!), there is both that directory, and a zip of
Jana> it. Here's an example:
Jana> http://www.gutenberg.org/files/38402/

Jana> Why the difference? Personally, I find the zip file more
Jana> valuable than the directory, so would hope that those could
Jana> easily be created for the books that don't have them. I do,
Jana> however, understand that that would take up more disk space,
Jana> so might not be possible.

The individual files are more valuable if one wants to check a possible error, since one does not need to download the full zip file just to look at one page.

Your images are 20 MB; assuming that this is representative of the collection, all the 40000 books could fit on one 1 TB disk. The last time I bought a pocket one, it cost me 90 Euro, so PG can probably afford two, to keep the whole collection both zipped and unzipped. But a few years ago, when they started to keep page images, it cost much more.

Carlo
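The disk-space estimate above can be checked with a few lines of arithmetic. This sketch uses only the figures from the message (20 MB per book, 40000 books, 1 TB disks) and assumes decimal units and that a zipped copy is about as large as the unzipped one, since image files rarely shrink much when zipped.

```python
import math

MB_PER_BOOK = 20      # size of Jana's page images, taken as typical
BOOKS = 40_000        # collection size used in the message
DISK_TB = 1           # capacity of one disk

total_mb = MB_PER_BOOK * BOOKS
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB

# Assumption: the zipped copy is roughly the same size as the unzipped one.
disks_for_both = math.ceil(2 * total_tb / DISK_TB)

print(f"one copy: {total_mb} MB = {total_tb} TB")
print(f"fits on one disk: {total_tb <= DISK_TB}")
print(f"disks for zipped + unzipped: {disks_for_both}")
```

One copy comes to 800,000 MB, or 0.8 TB, so two 1 TB disks would hold both forms.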

On 02/15/2012 05:37 PM, Carlo Traverso wrote:
"Jana" == Jana Srna<jana.srna@gmail.com> writes:
Jana> Secondly, I have noticed an inconsistency in the page images
Jana> for different books. For the books that got them added back
Jana> when we (some DP volunteers) uploaded them in bulk, there is
Jana> only the xxxxx-page-images directory, but no zip file.
Jana> Here's an example: http://www.gutenberg.org/files/21989/

Jana> For the books that Al has posted for me a few days ago
Jana> (thanks, Al!), there is both that directory, and a zip of
Jana> it. Here's an example:
Jana> http://www.gutenberg.org/files/38402/

Jana> Why the difference? Personally, I find the zip file more
Jana> valuable than the directory, so would hope that those could
Jana> easily be created for the books that don't have them. I do,
Jana> however, understand that that would take up more disk space,
Jana> so might not be possible.
The individual files are more valuable if one wants to check a possible error, since one does not need to download the full zip file, just to look at one page.
You can get the individual files even if you post the zip only. The web server does all the unpacking for you. In fact, for every file you get from gutenberg.org, you get the bits out of the zip file and not out of the uncompressed file that is stored alongside the zip file. (Otherwise the server would have to compress those bits for every request, while inside the zip file the compressed bits are up for grabs.)

All you need is an index of images to post along with the zip file, like we do for audio files, and you save half the disk space.

--
Marcello Perathoner
webmaster@gutenberg.org
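To illustrate the general point that a zip archive allows access to one member without unpacking the rest, here is a minimal sketch using Python's standard zipfile module. It is not how the gutenberg.org server is implemented; the function names and the archive contents are made up for the example, and unlike the server described above it decompresses the member instead of sending the compressed bits as they are.

```python
import io
import zipfile


def build_index(archive):
    """Return (name, compressed size, original size) for each file in the zip."""
    with zipfile.ZipFile(archive) as zf:
        return [
            (info.filename, info.compress_size, info.file_size)
            for info in zf.infolist()
            if not info.is_dir()
        ]


def read_one_page(archive, name):
    """Return the bytes of a single member; other members are not decompressed."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(name)


# Demo with a small in-memory archive (made-up names and contents).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("12345-page-images/p001.png", b"first page " * 100)
    zf.writestr("12345-page-images/p002.png", b"second page " * 100)

for name, stored, original in build_index(demo):
    print(f"{name}: {stored} bytes stored, {original} bytes original")

page = read_one_page(demo, "12345-page-images/p002.png")
print(len(page), "bytes read for one page")
```

The list returned by build_index is the kind of information an index posted next to the zip would need: the member names, so that a single page can be requested by name.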

This brings up another issue: DP requires down-scaling of bit-densities on the page images that they use to such a degree as to make them hardly usable for any purposes, even those of DP, namely hand-eye manual turking of the OCR output. DP does this because they say some of their volunteers are working on such low-bandwidth machines that they have to do this to everyone.

Hopefully PG isn't just storing DP down-scaled page images! That would be almost worse than nothing.

At least 300 dpi please!

DP doesn't require it, although they prefer images under 100KB per page. I don't downscale my proofing images unless I'm consistently over 100KB, and only if the image is still legible afterwards. I've run projects with 4 to 8 color grayscale images[0], and have seen a couple of projects with color PNGs (some sort of children's story, IIRC). The squirrels are fine with it as long as you have a good reason and include a warning about the large project images in the project comments. I have heard of projects being pulled because a new PM unnecessarily uploaded multi-MB proofing images, though.

Now, the proofing interface does downscale in software further than that, depending upon user settings, and many browsers don't properly promote the downscaled images from bitonal to grayscale. IIRC, Opera does, FF doesn't, and IE7-8 only will if you display the image with some MS-only CSS property (ms-interpolation-mode or some such) that doesn't work in IE9...

-R C

[0] Mostly for my microfilm scanning experiments, which started with marginally useful page scans, but also for the occasional project with minuscule footnotes, or really old wood-block printed books...

On Wed, Feb 15, 2012 at 3:20 PM, Jim Adcock <jimad@msn.com> wrote:
This brings up another issue: DP requires down-scaling of bit-densities on the page images that they use to such a degree as to make them hardly usable for any purposes, even those of DP, namely hand-eye manual turking of the OCR output. DP does this because they say some of their volunteers are working on such low-bandwidth machines that they have to do this to everyone.
Hopefully PG isn't just storing DP down-scaled page images! That would be almost worse than nothing.
At least 300 dpi please!
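For anyone who wants to check a set of proofing images against the 100KB-per-page preference mentioned in the reply, a small sketch follows. The function name is hypothetical, and reading "100KB" as 100 × 1024 bytes is an assumption; this is not a DP tool.

```python
from pathlib import Path

LIMIT_BYTES = 100 * 1024  # assumption: "100KB" read as 100 KiB


def oversized_pages(images_dir, limit_bytes=LIMIT_BYTES):
    """Return (file name, size in bytes) for each file larger than the limit."""
    return [
        (path.name, path.stat().st_size)
        for path in sorted(Path(images_dir).iterdir())
        if path.is_file() and path.stat().st_size > limit_bytes
    ]
```

Calling oversized_pages on a directory of page images returns an empty list when every page is within the limit.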
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d
participants (5)
- Jana Srna
- Jim Adcock
- Marcello Perathoner
- Robert Cicconetti
- traverso@posso.dm.unipi.it