re: [gutvol-d] Scans and Texts (Re: Copyright Verification?)

greg said:
An automation process, to pull all images from DP at a time an eBook is posted, is very much non-trivial.
copying images from one place to another seems "trivial" enough to me. the non-trivial part will be setting up the ground-rules for the page-scans, and then making the current set of scans conform to those ground-rules...

although i would imagine (or hope, anyway) that people are smarter now, as recently as last year, i encountered _extremely_ stiff resistance at d.p. for suggesting even such _basics_ as file-naming conventions (e.g., that the scan for page 87 should be named "rootname087.png", not something like "rootname094.png", as was typical).

in a nutshell, if page-scans are maintained in the chaotic way that many other files are being handled in the library, you'll have a nightmare on your hands. one e-text can have literally hundreds of page-scans associated with it; thus, with 17,000 e-texts, we're talking now about 2-4 _million_ files. if you approach a task like that haphazardly, it will bite you back _bad_.

simply put, the scans weren't created and haven't been maintained with an eye toward making them publicly available. that's not the _fault_ of the d.p. people, because that was not their concern, but it is a reality that needs to be faced if we want to make them public. it's going to be a _lot_ of work to mold them into something useable. juliet recognizes that -- that's precisely what she was telling people.
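The naming rule in the example above (the scan of page 87 is named "rootname087.png") could be sketched roughly as follows. This is only an illustration: the function names, the three-digit width, and the idea that someone has already worked out which printed page each existing file shows are all assumptions, not anything d.p. or p.g. actually uses.

```python
def canonical_name(rootname: str, page: int, width: int = 3, ext: str = "png") -> str:
    """Build a scan filename whose number matches the printed page number."""
    return f"{rootname}{page:0{width}d}.{ext}"


def plan_renames(rootname: str, page_of_file: dict) -> dict:
    """Map each misnamed file to its canonical name.

    page_of_file: current filename -> printed page number. The mapping is
    assumed to come from a hand-made index; nothing here reads the image.
    Files that already carry the right name are left out of the plan.
    """
    targets = {
        current: canonical_name(rootname, page)
        for current, page in page_of_file.items()
    }
    # Refuse the plan if two scans would end up with the same name.
    if len(set(targets.values())) != len(targets):
        raise ValueError("two scans map to the same page number")
    return {cur: new for cur, new in sorted(targets.items()) if cur != new}


if __name__ == "__main__":
    plan = plan_renames("rootname", {"rootname094.png": 87, "rootname095.png": 88})
    for old, new in plan.items():
        print(old, "->", new)
```

The sketch only produces a plan and renames nothing. Applying it safely would need a two-step rename through temporary names, since a target name may still be held by another file that is waiting to be renamed.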
But just doing one or two titles as a sample would help.
well, that would depend on whether wise conventions are adopted first, or develop out of those samples. if these samples instead merely boost the "do it however you want" idea that permeates the rest of the library, they will do more harm than good.

it's also very important to understand that the stimulus here is all wrong. rather than being driven by some vague notion that scans "should" be available, a "needs analysis" has to be done to determine _who_ will use the scans, and _how_, so that the policies that are put into place are _wise_ ones.

for instance, i might be wrong about this, but i think the current policy is to wrap all the page-scans up into one zip file. there are merits to that, but it's also the case that a zip file is not the best thing for many useful applications, not the least of which is checking the scan of a single page (e.g., to see if an error-report is supported or negated by the page-scan). (in regard to this specific point, i think scans should be stored both ways -- as individual files and as a single zip-file -- perhaps on different servers.)

as there is, at present, no user clamor for the scans, we _do_not_know_ how the end-users might want to use the scans, so we're _in_the_dark_ about the factors that we should apply in the development of any policies. so if i were a decision-maker, i would wait until a clamor actually developed before i moved forward on this. perhaps the absence of many volunteers who are willing to actually expend their energy on this project will serve as the brake necessary to keep it from lurching ahead prematurely.
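To make the zip-file point concrete, here is a minimal sketch using Python's standard zipfile module. Once the whole zip is on hand, pulling out one page is easy; the catch is that a user checking a single page still has to fetch the entire archive first, which is the argument for keeping individual files as well. The function names and the in-memory demo archive are made up for illustration.

```python
import io
import zipfile


def make_demo_zip(pages: dict) -> io.BytesIO:
    """Build an in-memory zip that stands in for a real scan set."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in pages.items():
            archive.writestr(name, data)
    return buffer


def read_page_from_zip(zip_source, member_name: str) -> bytes:
    """Return the bytes of one page-scan without unpacking the others."""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(member_name)


def split_zip(zip_source) -> dict:
    """Return {member name: bytes}, ready to be stored as individual files."""
    with zipfile.ZipFile(zip_source) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


if __name__ == "__main__":
    demo = make_demo_zip({"rootname087.png": b"page 87", "rootname088.png": b"page 88"})
    print(len(read_page_from_zip(demo, "rootname087.png")), "bytes for page 87")
    print(sorted(split_zip(demo)))
```

Storing the output of something like `split_zip` alongside the original zip is one way to keep the scans "both ways".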
As Juliet & others mentioned, the *archiving* is already being done. The next step is distribution.
well, i would imagine (or hope anyway) that a formally-trained librarian would use the term "archiving" with a little more sensitivity. the scans are being _saved_, but there is a huge gulf to cross before they can be considered to have been "archived". the page-scans as they are now are a _very_ long way from being ready for the "distribution" step...

maybe jon can mobilize some volunteers to do all the work necessary. but otherwise, i don't see any people stepping forward at this time...

-bowerbird

Bowerbird wrote:
greg said:
An automation process, to pull all images from DP at a time an eBook is posted, is very much non-trivial.
copying images from one place to another seems "trivial" enough to me.
the non-trivial part will be setting up the ground-rules for the page-scans, and then making the current set of scans conform to those ground-rules...
[snip]

All the points brought up by Bowerbird are excellent and cut to the heart of the various issues involved in both archiving and making available to the public the scans that are submitted to PG/DP for conversion to SDT. My prior message this morning, providing a few of my initial observations on the scan repository project, shows that it will be quite laborious to build a *publicly-useful* page scan archive from PG/DP activities because of the lack of standardization and other related factors.

One suggestion is likely to be controversial, but I offer it anyway for discussion purposes: DP and PG should set up minimal scan submission requirements. These could include page image naming requirements, metadata requirements, etc. They would also standardize the space to which scan sets are submitted, so it will be easier to move the scans over to their final resting place. This way, at least all new submissions will be easier to integrate into a publicly-useful repository.

In the meanwhile, the backlog of older non-standardized stuff can be sifted through and fixed (such as renaming page scan images, which both Bowerbird and I agree is important to do right). How fast the older stuff gets fixed depends upon the extent of the work required to normalize the old scan sets (normalized to whatever standards are established), and the number of volunteers available to help with both the machine- and human-processing required for normalization.

At least this way we make sure the problem won't continue to grow while we take more time to study what to do with the present set of scans.

Thoughts?

Jon
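As a rough illustration of what "minimal scan submission requirements" might look like if they were checked automatically, here is a small sketch. Everything specific in it is an assumption made up for the example: the three-digit naming pattern, the rule that page numbers run without gaps, and the metadata field names. Real requirements would have to be agreed on by DP and PG first.

```python
import re

# Hypothetical required fields; the actual list would need to be decided.
REQUIRED_METADATA = ("title", "author", "source")


def check_submission(rootname: str, filenames: list, metadata: dict) -> list:
    """Return a list of problems found; an empty list means the set passes."""
    problems = []
    pattern = re.compile(re.escape(rootname) + r"(\d{3})\.png")
    pages = []
    for name in sorted(filenames):
        match = pattern.fullmatch(name)
        if match is None:
            problems.append(f"bad filename: {name}")
        else:
            pages.append(int(match.group(1)))
    # Report gaps between the lowest and highest page number present.
    if pages:
        missing = sorted(set(range(min(pages), max(pages) + 1)) - set(pages))
        problems.extend(f"missing page: {page:03d}" for page in missing)
    # Report metadata fields that are absent or empty.
    for field in REQUIRED_METADATA:
        if not metadata.get(field):
            problems.append(f"missing metadata: {field}")
    return problems


if __name__ == "__main__":
    report = check_submission(
        "rootname",
        ["rootname001.png", "rootname003.png", "scan94.PNG"],
        {"title": "T"},
    )
    for line in report:
        print(line)
```

A check like this would only be run on new submissions. The older scan sets would still need the sifting and renaming described above before they could pass it.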
participants (2)
- Bowerbird@aol.com
- Jon Noring