
On 14 Jul 2005, at 9:39, Jon Noring wrote:
Collin wrote:
Jon:
And yet you seem completely unable to do this. I wonder why? In the time you took to write this lengthy e-mail, you could have set up a page scan archive at TIA. If this is so important to you, why haven't you done this already?
This is an interesting question.
My answer is another question: Shouldn't the decision to archive and make the scans publicly available *alongside* the digital text versions be a collective decision among the PG/DP folk?
The scans are public domain. The etexts are public domain. The metadata is for the large part non-ownable (if not completely non-ownable). You don't have to ask anybody anything to do something with these items. It sounds to me you have stumbled upon a wonderful way to enhance that which Project Gutenberg produces, as have many publishers before you. Please, run with it!
After all, making the scans available alongside the texts requires those maintaining the PG index to provide links to the scans. It also requires metadata and other types of coordination with DP and those independent of DP who use the scan/OCR process for transcribing books.
_If_ PG would like to offer page scans, and I can give you several reasons why it would, I am sure it would also like to link between the scans and the etexts. As it is, Distributed Proofreaders already produces HTML versions of most of its etexts, and most of these HTML versions retain the page numbers one way or the other. Brewster Kahle suggested in the forums of the Canadian Libraries project that we make use of a feature in DjVu that even allows one to link between words in page scans and text pages.
Someone stepping forth to provide a home for the scans will not change how PG/DP do their thing so long as there is no collective decision that it is a good thing to do, and no willingness to at least help make the preserving process go more smoothly.
I have no idea what you are talking about. DP and PG do not work using collective decisions. Consensus, yes, and the force of stubborn volunteers who push through their way until others come around. A good example of this is the proofing guidelines at DP. These have evolved to where they are now because stubborn Project Managers kept demanding proofers did it their way or not at all, until either the proofers caved in (new guideline accepted) or the project managers caved in (new guideline rejected). You may wish to ask Jon Ingram how painless this process was. :-)
I'm trying to provide some rationale on the pro side for preserving and making the scans publicly available alongside the text versions -- whether the rationale will be accepted or not in a collective sense is another matter.
Exactly! That's how it works.
I'll be very happy to take the time and effort to solicit/collect/make available the scans when there is a majority consensus by both DP and PG folk that this is important to them, and that PG will provide links to the original page scans (wherever the scans will reside.)
I would say, just try and make a small collection. The people you should talk to, IMHO, are the White Washers, as they have (or should have) a very direct interest in having page scans available. So much easier to deal with that error report you receive in the year 2011 if you have a page scan to compare with. Also Post-Processors may find such a database useful. Not to mention people outside PG: scholars, publishers, archivists, etc.
Jon, at the moment you come across like the nth of the Vapourware Kings that are regularly trolling this board. "Why don't you do X? Any idiot could do X in two working days!" Now I know you are not a Vapourware King, so what's with the act? The most likely reason why we have no page scan archive is because no-one has taken the time to set it up.
Well, maybe I do come forth this way. But then you are saying no one should share thoughts and ideas for reasoned discussion *before* any sort of collective decision?
I am saying that you should walk the walk, not just talk the talk. I am saying that the chance of a collective decision ever arriving here is minimal. Consensus, yes. A decision by somebody who has the power to push buttons, yes. The rest you can forget about. At the most you will get some sort of consensus about the meaning of poll results.
That no one should bring up discussion regarding the basic goals and approach of PG and DP?
No, that is not what I am saying either. If you have trouble understanding what I am saying, please try and outline what exactly you are having trouble with, rather than offering me ready-made opinions, which I will most likely disagree with.
So what says you (to everyone reading this)?
1) Should the original page scans be made publicly available alongside the structured digital texts?
I would love the page scans to be available in a more accessible manner than currently is the case.
2) If not, should they at least be preserved with limited or separate access (such as donate them to IA for IA to do as they wish)?
They are preserved.
3) Or should the scans be erased when the SDT is proofed and out the door?
I don't believe this, and even if I did, it's not going to happen.
DP's page scans are accessible to anyone with an account. (Probably even to those without an account.) The only hard bit is knowing which PG posted text goes with which DP text ID, so that you can recombine them when necessary. I believe we even save bibliographical data with our texts, so that you could extract all kinds of metadata to go with the pagescans.
I have accessed the page scans at DP (in order to submit some page scans before the recent DP revamp). Now, if DP had a policy/system where page scans were more carefully indexed as you mention, then it would certainly be easier for someone to collect them.
Currently the scans are available on a per-project basis as separate files or ZIP archives. (YMMV depending on where a project is in the DP workflow.) Once projects have run through DP and been posted at PG, I believe the site administrators archive them off-site to preserve disk space, but I am afraid I am totally not clear on this.
However, I think the issue is more with PG, which actually makes the texts available online. (DP focuses on producing the texts, but since the scans are an important part of the DP workflow, this involves DP, too.) Will PG provide links in its catalog to the original page scans alongside the SDT versions?
I am PG, but I cannot speak for it.
To better understand some things myself, I have to ask a fundamental workflow question of the PG-side of the house. If a finished text is donated to PG and original page scans are submitted alongside the text, what will happen to the scans? Will they be made publicly available alongside the SDT, or will they be preserved but not linked to or made public, or will they be rejected and essentially erased?
I believe the site administrators store them somewhere, but others are way more qualified to answer this.
And another question for PG: In a philosophical sense (ignoring the technical/administrative realities for the moment), would PG, in its catalog, provide links to the original page scans used as the source for the cataloged digital texts? Or is there a philosophical reason why PG would not do this? I've yet to hear an answer as to whether PG will philosophically consider providing links to the original page scans used to produce the texts in its catalog.
Again, I am PG but cannot speak for it. You'd have to ask this of the people who maintain the catalog, for instance.

-- 
branko collin
collin@xs4all.nl