Re: [gutvol-d] Copyright Verification?

newer
Re: Scans and Texts

older
re: [gutvol-d] RFC: Posting Page...

Joshua Hutchinson

13 Jul 2005 13 Jul '05

2:44 p.m.

----- Original Message ----- From: "Jon Noring" <jon@noring.name>

...

However, the scans themselves have value, including for direct reading. Scanning at higher resolution gives more to play with respect to image restoration for direct reading uses.

But for our purposes, getting text into PG, they don't have any other purpose. I realize that scans are important to you, but they simply are NOT to most of us, beyond their use as OCR sources. I don't have a problem with archiving the images but I will not change the work flow I do (or recommend the change in DP) to get "archival quality" scans. It simply is not something I deem important to the work we do here. Josh

Show replies by date

Karl Eichwalder

14 Jul 14 Jul

6:10 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

"Joshua Hutchinson" <joshua@hutchinson.net> writes:

...

I realize that scans are important to you, but they simply are NOT to most of us, beyond their use as OCR sources.

The "us" seems to be a very limited crowd ;) Scans are very important to all of "us" who rely on texts they can verify. Without scans the scientific world (universities and similar intitutes) will simply ignore the Gutenberg texts.

...

I don't have a problem with archiving the images but I will not change the work flow I do (or recommend the change in DP) to get "archival quality" scans. It simply is not something I deem important to the work we do here.

Texts and scans, both are important. Combining them makes a valuable e-book. Because it isn't that difficult to combine them and offer them side by side for reading we should work on this issue to make it happen. -- http://www.gnu.franken.de/ke/ | ,__o | _-\_<, | (*)/'(*) Key fingerprint = F138 B28F B7ED E0AC 1AB4 AA7F C90A 35C3 E9D0 5D1C

David Starner

6:37 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

On 7/14/05, Karl Eichwalder <ke@gnu.franken.de> wrote:

...

The "us" seems to be a very limited crowd ;) Scans are very important to all of "us" who rely on texts they can verify. Without scans the scientific world (universities and similar intitutes) will simply ignore the Gutenberg texts.

I'd say the universities and similar institutes are a very limited crowd, compared to the wide world. Most people don't find verification a hugely important thing. Honestly, I don't see that academics find it terribly important; the Oxford Text Archive doesn't have scans. Neither do all the print editions lining the shelves of the library; if I want to compare the EETS edition to the original, I've got to go to the one library in England that has a copy of the original manuscript. If you want to verify the Gutenberg texts, you can ask for ILL to get you a copy of the original. That'll let you get the edition you want and mean you don't have to trust our scans.

Jon Noring

8:03 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

David Starner wrote:

...

Karl Eichwalder wrote:

...

...
The "us" seems to be a very limited crowd ;) Scans are very important to all of "us" who rely on texts they can verify. Without scans the scientific world (universities and similar intitutes) will simply ignore the Gutenberg texts.

...

I'd say the universities and similar institutes are a very limited crowd, compared to the wide world. Most people don't find verification a hugely important thing. Honestly, I don't see that academics find it terribly important; the Oxford Text Archive doesn't have scans. Neither do all the print editions lining the shelves of the library; if I want to compare the EETS edition to the original, I've got to go to the one library in England that has a copy of the original manuscript.

If you want to verify the Gutenberg texts, you can ask for ILL to get you a copy of the original. That'll let you get the edition you want and mean you don't have to trust our scans.

Since scanning/OCR is now the dominant system to transcribe printed works, why not save the scans as part of the final product? With disk space dirt cheap, the availability of high-speed Internet, and willing archivers (e.g., IA), there's no longer a valid reason not to make the small effort to preserve the scans. The view that verification/integrity/authentication of digital texts is not important is, to be blunt, simply short-sighted, as I've explained in many prior messages. These are important factors. Why is there such a resistance by some here for preserving the scans along-side the structured digital text versions? It boggles my mind, frankly, especially, as I just noted, it is now *possible* to do relatively easy. If disk space is still an issue, IA will gladly take the scans and allow linking to them so the scans can be virtually "next to" the SDT versions. It's almost as if some here view scans as somehow "dirty" and not worthy of preserving -- or that they are even dangerous in some manner. That some say they are "unneeded" is equally illogical. How does one, for example, readily correct transcription errors in digital texts without the original scans to compare to? The argument that someone can go to the library to dig up a paper copy for error correction or verification of authenticity is ludicrous when you already *had* the paper copy in a convenient virtual form -- right in your hands so to say -- that the whole world can readily use without having to contribute to global warming by traveling to the library (assuming the local library even has a copy)! It makes absolutely no sense whatsoever. It is illogical, as Spock would say. Any trained engineer will say (and I am a mechanical engineer in a prior life), that one never fully knows future needs and requirements of any product built today. This leads to the principle that it is better to include/save more information than one thinks they need at the time if the effort to preserve such information is minimal. The issue of "should we preserve and archive scans?" falls within the scope of this principle. Preserve the scans and make them available along-side the SDT versions. Jon

Jon Noring

7:27 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Karl Eichwalder wrote:

...

Joshua Hutchinson wrote:

...

...
I realize that scans are important to you, but they simply are NOT to most of us, beyond their use as OCR sources.

...

The "us" seems to be a very limited crowd ;) Scans are very important to all of "us" who rely on texts they can verify. Without scans the scientific world (universities and similar intitutes) will simply ignore the Gutenberg texts.

...

...
I don't have a problem with archiving the images but I will not change the work flow I do (or recommend the change in DP) to get "archival quality" scans. It simply is not something I deem important to the work we do here.

...

Texts and scans, both are important. Combining them makes a valuable e-book. Because it isn't that difficult to combine them and offer them side by side for reading we should work on this issue to make it happen.

It is interesting in that many who volunteer for PG and DP don't view scans as important in and of themselves, while the Internet Archive (and maybe this has changed) doesn't view structured/proofed digital text (which DP produces) as being vital -- IA's focus has been on acquiring, archiving and making available page scans (which is a GOOD thing, I'm not knocking it.) I do agree with Karl that both are equally important, and technology improvements the last few years has now made it feasible to archive and deliver both side-by-side. As a result of the experimental/demo "My Antonia", a few of us are now exploring redoing the top 500 or 1000 classic English-language Works. We would mobilize the scholars and enthusiasts (including those expert with bibliography), and on a case-by-case basis determine the particular public domain edition(s) for each Work from which we'd like to somehow secure high-resolution, high-quality page scans *done right* (we'll probably have to work with participating libraries for source material -- a very good reason to establish close ties to the library world which PG has not yet been able to establish.) Then the scans would be submitted to DP for conversion into structured digital text (SDT -- of course, PG would then get the work product as they want it.) We would then take the SDT and mark it up using a selected subset of TEI (probably PGTEI or something close to it). Meanwhile, the scans will be cleaned up and provided at various resolutions by another team of volunteers, as well as encapsulated within DjVu and PDF. The SDT-TEI would be converted using XSLT to XHTML 1.1 for web display (allowing end-users to pick their own preferred CSS stylesheets), and to other formats as there is a demand for. We'd put together a "Distributed Catalogers", mostly librarian volunteers, to assemble high-quality and authoritative/uniform metadata for each edition, probably (but not necessarily -- have to decide on this) encoded in MARC and/or MARC-XML. Archiving/organizing the editions will be done using the WEMI principle/system (Work-Edition-Manifestation-Item). Of course, the full database of SDT will be searchable with at least a Google-level search engine. We will also work hard to make it much easier to immediately correct the texts if any errors are found (having the original page scans available makes it much easier to check for errors -- hopefully the DP process will keep the error rate really low. Bowerbird is thanked for suggesting the need for a robust system for continuous, "post-publication" error correction -- having the scans available and online is *critical* for this functionality.) Since the markup will include paragraph level unique identifiers, it will now be possible to allow the world to link to each edition down to the paragraph level (to produce uniform links), and when XLink becomes more common, to link down to individual words. It will now be possible to build a community around each Work (at least where there's enough interested people to organize a community for a particular Work or author) who will then be able to annotate, interlink, blog and discuss each edition/Work -- for this it is *necessary* we use XML markup *done right*. (This is an answer to those who believe plain text is sufficient -- it is NOT sufficient if we want to integrate the texts at a high level with various human endeavors, to build a more robust knowledge management system, etc. -- those who view the only purpose of digital texts is for casual, private reading are taking a very limited view of the many possibilities. Anyway, XML is self-describing text, so we fulfill the requirement for longevity which Michael Hart has preached from the very beginning.) There's more, but that gives a rough flavor. If anyone here is interested in becoming a part of this project, let me know in private email. We're now working on funding this as a non-profit so there will be adequate funds to *hire* full-time developers to build the necessary infrastructure. It will be independent of academia to assure the work product will be completely and totally open and free to the world (it disturbs me to no end that so many academic-sponsored digitization projects over the years, including those at publicly-funded universities, keep their work product under wraps -- I won't name names -- this is the beauty and the power of the PG vision.) This project is not intended to compete or replace PG/DP -- it will be limited in scope -- but rather will work synergistically with PG/DP to properly redo the most popular English-language works. Of course, others can copy the process for doing/redoing the most popular works in other languages. We probably will also welcome DP to submit to us the scans of works they've already done and for which they can release the scans. For a primitive, largely-unfinished view of what could be done, refer to the "My Antonia" project at http://www.openreader.org/myantonia/ . Jon Noring

collin＠xs4all.nl

7:54 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Karl:

...

"Joshua Hutchinson" <joshua@hutchinson.net> writes:

...
I realize that scans are important to you, but they simply are NOT to most of us, beyond their use as OCR sources.

The "us" seems to be a very limited crowd ;) Scans are very important to all of "us" who rely on texts they can verify. Without scans the scientific world (universities and similar intitutes) will simply ignore the Gutenberg texts.

Although it would be nice if we could also cater for the academic world, that is by no means necessary. The academic world has access to our texts; if that for some reason is unsufficient, we need to determine whether that is through some fault of ours. I submit it isn't. Nevertheless, if we could somehow also make the page scans accessible, that could be handy for several reasons. I believe Charles Franks was working on such a system?

Karen Lofstrom

8:29 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

On Thu, 14 Jul 2005 collin@xs4all.nl wrote:

...

Although it would be nice if we could also cater for the academic world, that is by no means necessary. The academic world has access to our texts; if that for some reason is unsufficient, we need to determine whether that is through some fault of ours. I submit it isn't.

I'm not an academic -- I'm just a scholar. I want the original page numbers so I cite things. I want all sorts of things I don't get with most PG texts. Providing them would take only a little extra bit of time, and pay big dividends. Even money dividends -- we could start getting foundation money. -- Karen Lofstrom Zora on DP

David Starner

8:40 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

On 7/14/05, Karen Lofstrom <lofstrom@lava.net> wrote:

...

I'm not an academic -- I'm just a scholar. I want the original page numbers so I cite things.

Sure, but you don't want the page images for that; if you don't have any page numbers in the text version already, it'll be a pain to find your place in the book. What you want is the page numbers in the text version, and more and more people are doing that.

Hugh MacDougall

2:38 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

I rarely enter gutvol-d discussions, though I follow them. I did place a number of texts on Gutenberg some years ago (James Fenimore Cooper and Susan Fenimore Cooper), but in more recent years have placed Cooper texts on my own James Fenimore Cooper Society website where I can easily correct them. On this discussion, I would not that it is not just a question of transcribed text and/or images, but also frequently of editions. For a writer like James Fenimore Cooper it is not just a question of accuracy of transcription (and imposition of publishers' editorial styles) in the numerous editions made of his works, but also that he frequently made significant changes in his novels in new editions published during his lifetime. This is presumably true of many other writers. So even the "first edition" is not always the "best edition." On page numbering, it is my custom to include page numbers from the edition I am transcribing by placing them in {curly brackets} -- a typographical form I don't otherwise use. This makes it easy for the user not only to determine what page he is "on" in the electronic version, but also to look for specific page numbers, or provide citations to them, easily. To further complicate the edition issue, the Modern Language Association has developed a fairly rigid style form for editions bearing its seal of approval, which results in a synthetic (and hence copyrighted) version which (to make a long matter short) combines the latest text on which the author is known to have worked, with the earliest form (ideally the manuscript) for his spelling, punctuation, and other stylistic matters that usually get changed by publishers to suit their own style manuals. In the case of JF Cooper these editions have been issued in the so-called "Cooper Edition" -- first by the SUNY Albany Press and more recently by AMS, but have been licensed to other publishers such as Library of America, Oxford, and Penguin. Problems of this sort are going to plague the Gutenberg editions of almost any author of the 19th century or earlier, but I have not seen them raised in this discussion. Hugh MacDougall, Secretary/Treasurer James Fenimore Cooper Society 8 Lake Street, Cooperstown, NY 13326-1016 jfcooper@stny.rr.com http://www.oneonta.edu/external/cooper

Jon Noring

4:34 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Hugh wrote:

...

On this discussion, I would not that it is not just a question of transcribed text and/or images, but also frequently of editions. For a writer like James Fenimore Cooper it is not just a question of accuracy of transcription (and imposition of publishers' editorial styles) in the numerous editions made of his works, but also that he frequently made significant changes in his novels in new editions published during his lifetime. This is presumably true of many other writers. So even the "first edition" is not always the "best edition." On page numbering, it is my custom to include page numbers from the edition I am transcribing by placing them in {curly brackets} -- a typographical form I don't otherwise use. This makes it easy for the user not only to determine what page he is "on" in the electronic version, but also to look for specific page numbers, or provide citations to them, easily.

In the demonstration "My Antonia" project, one form of presentation includes showing the page markers which provide links to images of the page scans: http://www.openreader.org/myantonia/basic-design/myantonia.html Another form exposes the paragraph id's so there can be direct automated linking to any particular paragraph (must use Firefox or Opera, won't work in IE): http://www.openreader.org/myantonia/basic-design-nopagenum-paranum/myantonia... (to see paragraph linking in action, see example later on regarding error correction.) It is also possible to link to a certain original source page, for example: http://www.openreader.org/myantonia/basic-design-nopagenum/myantonia.html#pa... (this brings up the approximate spot in the text where the original paper Page 11 started.) Or to do the same for the document where page numbers and links are exposed: http://www.openreader.org/myantonia/basic-design/myantonia.html#page011

...

To further complicate the edition issue, the Modern Language Association has developed a fairly rigid style form for editions bearing its seal of approval, which results in a synthetic (and hence copyrighted) version which (to make a long matter short) combines the latest text on which the author is known to have worked, with the earliest form (ideally the manuscript) for his spelling, punctuation, and other stylistic matters that usually get changed by publishers to suit their own style manuals. In the case of JF Cooper these editions have been issued in the so-called "Cooper Edition" -- first by the SUNY Albany Press and more recently by AMS, but have been licensed to other publishers such as Library of America, Oxford, and Penguin.

Interesting. With respect to Willa Cather's "My Antonia", the first edition is now Public Domain, but the second and subsequent editions (which had some corrections) are still under copyright. So what we did was to put together a faithful textual reproduction of the first edition (still undergoing some proofing -- should have submitted it to DP in the first place but that's another discussion thread I'd rather not discuss.) Then, using known scholarly information on that Work, marked up corrections made in subsequent Cather-approved editions. To see this in action in "My Antonia", first go to: http://www.openreader.org/myantonia/basic-design-nopagenum/myantonia.html#p0... Notice in that paragraph the word "Austrians" is highlighted in gray. If you put your pointer over the word, in most browsers a little popup window will appear saying "UNL Cather Edition: Prussians". UNL is the University of Nebraska at Lincoln (who we have been in contact with regarding "My Antonia"), and they have reprinted this Work in a scholarly edition which notes what corrections were made to the first edition -- in this example the grayed word should be "Prussians" instead of "Austrians".

...

Problems of this sort are going to plague the Gutenberg editions of almost any author of the 19th century or earlier, but I have not seen them raised in this discussion.

Definitely. I think what PG and similar projects should strive to do: 1) accurate transcriptions of specific editions with editing only done in limited and well-defined situations (and keep track of such changes right within the document -- esssentially the source book will be textually preserved as it was printed, errors and all. With XML it is relatively easy to produce a "corrected" edition which would be noted as such.) 2) for more popular Works (the "classics") which have varying multiple editions, query with scholars and enthusiasts as to which Public Domain edition(s) should be transcribed. Note the plural of editions since there may be more than one edition worthy of transcription. With DP now in existence, this is no longer an issue. For example, Mary Shelley's "Frankenstein" exists in essentially two different editions. The second edition was significantly changed from the first edition by Shelley, including some differences in the ending. It is one Work, but definitely two unique Expressions as defined in the WEMI system. 3) Certainly accept "modern" edited editions of a Work so long as the licensing is acceptable (Creative Commons) *and* it is identified as a modern edited edition (so the consumer knows what they are getting), *and* at least one edition, faithful to some acceptable Public Domain printing, is already in the archive. Having the modern editions follow MLA guidelines (if allowed) sounds like a good idea. These are my thoughts which I think touch upon your thoughts. Jon

Greg Newby

15 Jul 15 Jul

10:29 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

...

On 7/14/05, Karen Lofstrom <lofstrom@lava.net> wrote:

...
I'm not an academic -- I'm just a scholar. I want the original page numbers so I cite things.

All major citation formats (APA, Chicago, IEEE, MLA....) have citation methods for online/electronic materials. Citing page numbers in a PG eBook is not consistent with these citation methods. I think you should cite what you use, not its alleged source. -- Greg

Marcello Perathoner

14 Jul 14 Jul

4:19 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Karen Lofstrom wrote:

...

I'm not an academic -- I'm just a scholar. I want the original page numbers so I cite things.

Either you cite the PG electronic edition, then page numbers are irrelevant because its faster to just search the cited phrase. Or you cite the paper edition. Then the rules of citation require you to actually get a physical copy of what you are citing and accurately verify your citation. -- Marcello Perathoner webmaster@gutenberg.org

Jon Noring

4:44 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Marcello wrote:

...

Karen Lofstrom wrote:

...

...
I'm not an academic -- I'm just a scholar. I want the original page numbers so I cite things.

...

Either you cite the PG electronic edition, then page numbers are irrelevant because its faster to just search the cited phrase.

Or you cite the paper edition. Then the rules of citation require you to actually get a physical copy of what you are citing and accurately verify your citation.

To add to the list of possibilities, one can have an XHTML version of the PG text, with internal markup and id's indicating original page information. (If the markup is TEI, then it is possible to use XLink, and the new xml:id now provides a standard way to add id tags to arbitrary XML documents.) In my previous message, I give a few URLs of how this was done in the XHTML demo "My Antonia" project. Jon

Jon Noring

8:45 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Collin wrote:

...

Although it would be nice if we could also cater for the academic world, that is by no means necessary. The academic world has access to our texts; if that for some reason is unsufficient, we need to determine whether that is through some fault of ours. I submit it isn't.

I believe that preserving and making the scans accessible is not only for the academic world, but for other users of the texts. For example, a modern publisher may wish to make a carbon particle on pressed dead tree version, and having the original scans assists with typography and layout and may resolves markup ambiguities (trust me, I've done this and needed to consult the original scans.) Someone may see what looks like an error, and by quick consulting with the scans, can verify if it was an error before submitting an error report (of course, those at the PG end making the corrections would love to have the scans readily available to verify the error report.) Some greedy IP attorney contemplating a lawsuit for his/her client claiming copyright infringement (e.g., the PG text is claimed to be derived from a modern edited edition) will take one look at the original scans and realize the text came from a true Public Domain source -- lawsuit does not proceed. And some casual readers will take an interest in the history of the PG Work they are reading and want to experience the look/feel of the original book, which they can easily do if the scans are available online. There's no doubt other "non-academic" uses of the page scans. There seems to be a limited, dichotomous view of the uses and users of structured digital texts: either casual reading by the average Joe, or hard-core academic use. Reality is more complex than this, and there's a whole rainbow of users and uses. Why PG wants to constrain itself to the most casual reader who will simply read the book from start to finish in his/her spare time, *intentionally* ignoring other types of users and uses, is mysterious -- especially when the effort to meet a greater range of users and uses is pretty minimal. It borders on the surreal.

...

Nevertheless, if we could somehow also make the page scans accessible, that could be handy for several reasons. I believe Charles Franks was working on such a system?

In the meanwhile, here's one system that could be tried until the more permanent system is developed: 1) Dial 415-561-6767 during working hours. 2) When Beatrice or Astrid answers, ask for Brewster Kahle. 3) Identify yourself as a DP person, and ask Brewster if he will archive and make available DP's page scans via a stable URL. 4) Await his answer, which may include an alternate suggestion. But I suspect it will be a positive reply. Brewster *loves* any and all high-quality public domain content. This may take a half hour. That doesn't seem too difficult to me. This is not rocket science. All you need is a directory/folder somewhere on the Internet to dump the scans (hopefully mirrored), and add a link in the right place(s) in the PG directory pointing to the scans. Of course have a couple people do periodic backups on DVD-ROM. More elaborate systems can be developed later and the scans moved to the new system. But get the darn scans archived and made available online, even if it is primitive to start out with. Jon Noring

collin＠xs4all.nl

11:53 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Jon:

...

There seems to be a limited, dichotomous view of the uses and users of structured digital texts: either casual reading by the average Joe, or hard-core academic use.

Undoubtedly that view exists, perhaps even at PG. Er, so what?

...

In the meanwhile, here's one system that could be tried until the more permanent system is developed:

1) Dial 415-561-6767 during working hours.

2) When Beatrice or Astrid answers, ask for Brewster Kahle.

3) Identify yourself as a DP person, and ask Brewster if he will archive and make available DP's page scans via a stable URL.

4) Await his answer, which may include an alternate suggestion. But I suspect it will be a positive reply. Brewster *loves* any and all high-quality public domain content.

This may take a half hour.

That doesn't seem too difficult to me.

And yet you seem completely unable to do this. I wonder why? In the time you took to write this lengthy e-mail, you could have set up a page scan archive at TIA. If this is so important to you, why haven't you done this already? Jon, at the moment you come across like the nth of the Vapourware Kings that are regularly trolling this board. "Why don't you do X? Any idiot could do X in two working days!" Now I know you are not a Vapourware King, so what's with the act? The most likely reason why we have no page scan archive is because no-one has taken the time to set it up. DP's page scans are accessible to anyone with an account. (Probably even to those without an account.) The only hard bit is knowing which PG posted text goes with which DP text ID, so that you can recombine them when necessary. I believe we even save bibliographical data with our texts, so that you could extract all kinds of metadata to go with the pagescans. I'd do it, but I have other things to do.

Jon Noring

3:39 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Collin wrote:

...

Jon:

...

And yet you seem completely unable to do this. I wonder why? In the time you took to write this lengthy e-mail, you could have set up a page scan archive at TIA. If this is so important to you, why haven't you done this already?

This is an interesting question. My answer is another question: Shouldn't the decision to archive and make the scans publicly available *alongside* the digital text versions be a collective decision among the PG/DP folk? After all, making the scans available alongside the texts requires those maintaining the PG index to provide links to the scans. It also requires metadata and other types of coordination with DP and those independent of DP who use the scan/OCR process for transcribing books. Someone stepping forth to provide a home for the scans will not change how PG/DP does their thing so long as there is no collective decision that it is a good thing to do, and willing to at least help make the preserving process go smoother. I'm trying to provide some rationale on the pro side for preserving and making the scans publicly available alongside the text versions -- whether the rationale will be accepted or not in a collective sense is another matter. I'll be very happy to take the time and effort to solicit/collect/make available the scans when there is a majority consensus by both DP and PG folk that this is important to them, and that PG will provide links to the original page scans (wherever the scans will reside.) (Well, I'll even offer to go ahead with this if two or three others who are higher-level volunteers in the DP system, who are familiar with it, believe it is a good idea and step forward, offering to actively help in the effort to collect/solicit/catalog/archive the scans from the DP work flow, as well as those done outside of DP. If so, then I'll contact IA, probably Molly first, and see if we can setup something official. It could be called the "PG/DP Scan Archive" or some similar name.) To restate what is being discussed, we each will take one of two basic positions: 1) The scans should be made available to the public alongside the structured digital texts. 2) The scans should not be made available to the public alongside the structured digital texts. (Archiving the scans is a related issue, but not the same, since some may take the position that making the scans publicly available should not be done -- e.g., it is considered a waste of effort and disk space by those maintaining the PG catalog -- but that the scans should be preserved somewhere for internal future access by PG/DP volunteers.) If the majority of the volunteers who have lead roles in PG/DP embrace #2, then it is a lot more difficult for any single person to be proactive and do it on their own since the effective preservation requires the systems (work flows) to be more friendly to preservation and availability (e.g., procedural requirements to aid in collecting the scans with sufficient metadata for identification/correlation to the structured digital text work product.) They are not at present. (Btw, just to note. In a conversation with Juliet a few months ago, she noted that some scans cannot be released to the public because of agreements with the scan providers or those who hold the original paper documents. Although this is unfortunate, at least the scans are available to make SDTs, which is better than nothing -- it is always possible to secure scans from another copy of the Edition at a future time provided the full catalog information is preserved, which it is at DP. Now, let's add to this fact the need for the scan preservation activity to acquire metadata conformant with DP's internal metadata tracking (otherwise it is more difficult to correlate a scan set with its associated SDT, both at DP and at PG.) These two facts alone require that the scan preservation effort should have fairly high level cooperation and blessing from the DP people -- it cannot be done without their consent and without some minimal help. Not to mention that downloading the scans from DP will increase the stress on DP's servers, so from that consideration DP also has to 'bless' the effort and provide procedural requirements. So no matter how one looks at it, the DP leadership has to bless the activity and provide some help to make it work. And they will only take the effort to "bless" such an activity if they believe it to be worth doing. Thus I am discussing the "why it should be done" first, which in my book is the proper order in decision-making.)

...

Jon, at the moment you come across like the nth of the Vapourware Kings that are regularly trolling this board. "Why don't you do X? Any idiot could do X in two working days!" Now I know you are not a Vapourware King, so what's with the act? The most likely reason why we have no page scan archive is because no-one has taken the time to set it up.

Well, maybe I do come forth this way. But then you are saying no one should share thoughts and ideas for reasoned discussion *before* any sort of collective decision? That no one should bring up discussion regarding the basic goals and approach of PG and DP? So what says you (to everyone reading this)? 1) Should the original page scans be made publicly available alongside the structured digital texts? 2) If not, should they at least be preserved with limited or separate access (such as donate them to IA for IA to do as they wish)? 3) Or should the scans be erased when the SDT is proofed and out the door?

...

DP's page scans are accessible to anyone with an account. (Probably even to those without an account.) The only hard bit is knowing which PG posted text goes with which DP text ID, so that you can recombine them when necessary. I believe we even save bibliographical data with our texts, so that you could extract all kinds of metadata to go with the pagescans.

I have accessed the page scans at DP (in order to submit some page scans before the recent DP revamp). Now, if DP had a policy/system where page scans were more carefully indexed as you mention, then it would certainly be easier for someone to collect them. However, I think the issue is more with PG which actually makes the texts available online (DP focuses on producing the texts -- that the scans are an important part of the DP work flow involves them, too.) Will PG provide links in its catalog to the original page scans alongside the SDT versions? To better understand some things myself, I have to ask a fundamental workflow question of the PG-side of the house. If a finished text is donated to PG and original page scans are submitted alongside the text, what will happen to the scans? Will they be made publicly available alongside the SDT, or will they be preserved but not linked to or made public, or will they be rejected and essentially erased? And another question for PG: In a philosophical sense (ignoring the technical/administrative realities for the moment), would PG, in its catalog, provide links to the original page scans used as the source for the cataloged digital texts? Or is there a philosophical reason why PG would not do this? I've yet to hear an answer as to whether PG will philosophically consider providing links to the original page scans used to produce the texts in its catalog. Jon Noring

Jon Niehof

4:41 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

--- Jon Noring <jon@noring.name> wrote:

...

My answer is another question: Shouldn't the decision to archive and make the scans publicly available *alongside* the digital text versions be a collective decision among the PG/DP folk? *snip* Someone stepping forth to provide a home for the scans will not change how PG/DP does their thing so long as there is no collective decision that it is a good thing to do

DP existed for, what was it? two years? before it became an "official" PG project by demonstrating its effectiveness. Even then, nobody's required to buy into it--you can always submit texts straight to PG. If you build a *reality* that's even a quarter of your vision, it's often far more effective than just talking about the final result. Gradually, people will come to accept it...or not. That's the risk. I'd say get some infrastructure in place to support those PM's, CP's, and PP's that agree with, or at least are willing to provide material for, your vision. Keep those scans. Link 'em to the PG texts. Maybe take the guiguts .bin file (for page break information) and use that to set up a user-submitted corrections form like CCEL. Talk to Harry Plantinga--he's both smart and a nice guy. Shucks, there might be CS students at Calvin who'd be looking for a project and could help. I'm not willing to say "yes, do it this way!" and support changing existing workflow (which was designed to produce PG e-texts) to facilitate your project. But I'm willing to send up the work I've done if it could be useful--a *lot* of CP'ers would love to donate their page scans. Why not start from that and see how it grows? __________________________________ Do you Yahoo!? Yahoo! Mail - Find what you need with new enhanced search. http://info.mail.yahoo.com/mail_250

Jon Noring

5:37 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Jon Niehof wrote:

...

DP existed for, what was it? two years? before it became an "official" PG project by demonstrating its effectiveness. Even then, nobody's required to buy into it--you can always submit texts straight to PG. If you build a *reality* that's even a quarter of your vision, it's often far more effective than just talking about the final result. Gradually, people will come to accept it...or not. That's the risk.

Yes, well put. There is an activity going on, with a full-fledged business plan in an advanced stage of development, to implement a next-generation digital library system, called LibraryCity (it is planned to be a non-profit.) In fact, on 28 July, there will be an online conference where we will first publicly present our vision: http://www.planetlibrary.info/lgle200508.htm Our Executive Director, Lori Watrous-deVersterre, will give the presentation. David Rothman and I will also talk about OpenReader, which is only tangentially-related to the interests of the PG/DP crowd. See http://www.openreader.org/ . The "My Antonia" project was simply an "internal demo", although I've brought it up a lot here, to explore some of the issues associated with making public domain texts more accessible in the library, education, research and general "social" and "personal" contexts. It is also intended to aid in better integration of public domain texts with all other kinds of content. This adds a set of requirements, but these requirements are not onerous. When I have given comments on the various changes in both process and philosophy I'd love to see occur in the PG/DP system, these come from having studied, in the last few years, how to make digital texts much more useful for everyone and for all kinds of purposes -- many types of uses and many types of users. Since the PG philosophy has focused on a smaller subset of uses and users for many years, there is, understandably, a general inertia about discussing an expansion of focus, and the additional work product requirements such an expansion of focus requires. Now, the public domain text aspect of LibraryCity is not the main thrust of LibraryCity -- it is just one component and certainly DP and PG prominently figure into our equation -- we want to work with PG and DP for mutual benefit. So should LC not secure even seed funding to finish our demonstration prototype and move to the next level, then the public domain text aspect can certainly be broken out and developed separately -- I'm pretty certain we can package it and get substantial funding interest since we will be able to demonstrate substantial ROI, a well-defined scope with a clear end game, and that it will be of benefit to many. I gave a little more detail on this last night in this forum (gutvol-d) where I described what I had in mind for redoing the top 500 or 1000 English-language classics (the most popular pre-1923 books for English language speakers), in cooperation of course with PG/DP. But it's more than just redoing them ala DP -- it is to make them available in more advanced ways to encourage community to be built around the texts ala LibraryCity principles. Again, doing this sets requirements which are more stringent and specific than what is normally done in PG and DP, but not way outside the scope of what everyone is talking about now (e.g., PGTEI.) Of course the final product will still follow the general philosophy of PG of a free and open Public Domain and will not be incompatible with PG's own activities.

...

I'd say get some infrastructure in place to support those PM's, CP's, and PP's that agree with, or at least are willing to provide material for, your vision. Keep those scans. Link 'em to the PG texts. Maybe take the guiguts .bin file (for page break information) and use that to set up a user-submitted corrections form like CCEL. Talk to Harry Plantinga--he's both smart and a nice guy. Shucks, there might be CS students at Calvin who'd be looking for a project and could help.

I'm not willing to say "yes, do it this way!" and support changing existing workflow (which was designed to produce PG e-texts) to facilitate your project. But I'm willing to send up the work I've done if it could be useful--a *lot* of CP'ers would love to donate their page scans. Why not start from that and see how it grows?

Again, thanks for your insights and clarification. An important aspect of any project which relies upon volunteers is to find a critical mass of like-minded people who want to see something happen. So one of the reasons I have for addressing philosophical issues in the open here ("what should be done") is to find like-minded folk to join a team effort. It is not easy to do this because of the 80-20 rule taken twice: only 20% of those who hear about something will take some interest, and only 20% of those will actually move from interest to action. I can only do so much alone based on my particular skill and lack thereof -- the rest requires a team effort leveraging the unique talents of each team member. Thanks again, Jon. The other Jon N.

Robert Cicconetti

8:31 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

On 7/14/05, Jon Niehof <jon_niehof@yahoo.com> wrote:

...

--- Jon Noring <jon@noring.name> wrote: I'm not willing to say "yes, do it this way!" and support changing existing workflow (which was designed to produce PG e-texts) to facilitate your project. But I'm willing to send up the work I've done if it could be useful--a *lot* of CP'ers would love to donate their page scans. Why not start from that and see how it grows?

I don't usually retain full grayscale or color page images because of space / workflow issues. I generally do a full pass in grayscale directly into Finereader, which is set to convert it to black and white; Finereader's thresholding is usually pretty good (and doesn't dither, unlike my scanner driver.) I will then go back and scan any illustrations at 600 DPI, more if needed (very seldom). I did scan a few of the Beatrix Potter books completely in color and someone else has posted them to an album (unfortunately on a site with several sex related books; I'm sure censorship programs are or will block it when it becomes noticed.) Most of the page images used and archived at DP are a compromise between image quality and size; many proofers (and many CPs) do not use any form of broadband. But they are all archived somewhere, and were good enough to do the original proofing, if not as nice as My Antonia. However, the squirrels are even busier than usual; this is not a good time to ask for a dump of the project archives. R C

Andrew Sly

7:15 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

On Thu, 14 Jul 2005, Jon Noring wrote:

...

And another question for PG: In a philosophical sense (ignoring the technical/administrative realities for the moment), would PG, in its catalog, provide links to the original page scans used as the source for the cataloged digital texts? Or is there a philosophical reason why PG would not do this? I've yet to hear an answer as to whether PG will philosophically consider providing links to the original page scans used to produce the texts in its catalog.

Please be careful not to perpetuate the misunderstanding of PG as having a cabal who lurk in the shadows and dictate their "rules" for what can and cannot be done. Rather there is a group of loosely organized volunteers who look after the nuts and bolts of what actually needs to be done to keep processing copyright clearances, posting texts, keep the website running, etc. So in answering here, I'm not giving you "the official PG response". Rather, it's my understanding as a volunteer who happens to be rather involved with the catalog. Personally, I don't have a problem with linking to page scans as you mention, and I don't belive that it would be too hard to implement. Although the actual time to physically enter links could be a consideration. One of the concerns I have for the PG catalog is that it be valid over the longer term (decades) without requiring a great deal of upkeep of external links. I could easily imagine including links of many different types (as some would like) only to find in a few years that some material has been moved, some has disappeared, etc.

Jon Noring

9:02 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Andrew wrote:

...

Personally, I don't have a problem with linking to page scans as you mention, and I don't belive that it would be too hard to implement. Although the actual time to physically enter links could be a consideration.

One of the concerns I have for the PG catalog is that it be valid over the longer term (decades) without requiring a great deal of upkeep of external links.

I could easily imagine including links of many different types (as some would like) only to find in a few years that some material has been moved, some has disappeared, etc.

I appreciate this feedback. Your reply strongly indicates that any online archive of page scans which is linked to the associated digital texts needs to carefully consider the long-term stability of links in both directions (we are assuming the scans are kept separate from the digital text archives), and to make it easier to change the link addresses should URL changes become unavoidable. Any ideas, anyone? In reply to Robert Cicconetti, who wrote: "However, the squirrels are even busier than usual; this is not a good time to ask for a dump of the [DP] project archives." That is my sense, too, and my involvement with DP has not been that deep other than the submission of the Kama Sutra scans a couple months ago, just before the change in the DP system occured. This change of the DP system has certainly kept a lot of the DP regulars busier than usual. Hopefully normalcy will return soon, if it hasn't already. Jon

Branko Collin

9:52 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

On 14 Jul 2005, at 9:39, Jon Noring wrote:

...

Collin wrote:

...
Jon:

...
And yet you seem completely unable to do this. I wonder why? In the time you took to write this lengthy e-mail, you could have set up a page scan archive at TIA. If this is so important to you, why haven't you done this already?

This is an interesting question.

My answer is another question: Shouldn't the decision to archive and make the scans publicly available *alongside* the digital text versions be a collective decision among the PG/DP folk?

The scans are public domain. The etexts are public domain. The meta- data is for the large part non-ownable (if not completely non- ownable). You don't have to ask anybody anything to do something with these items. It sounds to me you have stumbled upon a wonderful way to enhance that which Project Gutenberg produces, as have many publishers before you. Please, run with it!

...

After all, making the scans available alongside the texts requires those maintaining the PG index to provide links to the scans. It also requires metadata and other types of coordination with DP and those independent of DP who use the scan/OCR process for transcribing books.

_If_ PG would like to offer page scans, and I can give you several reasons why it would, I am sure it would also like to link between the scans and the etexts. As it is, Distributed Proofreaders already produces HTML versions of most of its etexts, and most of these HTML versions retain the page numbers one way or the other. Brewster Kahle suggested in the forums of the Canadian Libraries project that we make use of a feature in DJVU that even allows one to link between words in pagescans and textpages.

...

Someone stepping forth to provide a home for the scans will not change how PG/DP does their thing so long as there is no collective decision that it is a good thing to do, and willing to at least help make the preserving process go smoother.

I have no idea what you are talking about. DP and PG do not work using collective decisions. Consensus, yes, and the force of stubborn volunteers who push through their way until others come around. A good example of this are the proofing guidelines at DP. These have evolved to where they are now, because stubborn Project Managers kept demanding proofers did it their way or not at all, until either the proofers caved in (new guideline accepted) or the project managers caved in (new guideline rejected). You may wish to ask Jon Ingram how painless this process was. :-)

...

I'm trying to provide some rationale on the pro side for preserving and making the scans publicly available alongside the text versions -- whether the rationale will be accepted or not in a collective sense is another matter.

Exactly! That's how it works.

...

I'll be very happy to take the time and effort to solicit/collect/make available the scans when there is a majority consensus by both DP and PG folk that this is important to them, and that PG will provide links to the original page scans (wherever the scans will reside.)

I would say, just try and make a small collection. The people you should talk to, IMHO, are the White Washers, as they have (or should have) a very direct interest in having page scans available. So much easier to deal with that error report you receive in the year 2011 if you have a page scan to compare with. Also Post-Processors may find such a database useful. Not to mention people outside PG: scholars, publishers, archivists, etc.

...

...
Jon, at the moment you come across like the nth of the Vapourware Kings that are regularly trolling this board. "Why don't you do X? Any idiot could do X in two working days!" Now I know you are not a Vapourware King, so what's with the act? The most likely reason why we have no page scan archive is because no-one has taken the time to set it up.

Well, maybe I do come forth this way. But then you are saying no one should share thoughts and ideas for reasoned discussion *before* any sort of collective decision?

I am saying that you should walk the walk, not just talk the talk. I am saying that the chance of collective decisions ever arriving here are minimal. Consensus, yes. A decision by somebody who has the power to push buttons, yes. The rest you can forget about. At the most you will get some sort of consensus about the meaning of poll results.

...

That no one should bring up discussion regarding the basic goals and approach of PG and DP?

No, that is not what I am saying either. If you have trouble understanding what I am saying, please try and outline what exactly you are having trouble with, rather than offering me ready-made opinions, which I will most likely disagree with.

...

So what says you (to everyone reading this)?

1) Should the original page scans be made publicly available alongside the structured digital texts?

I would love the page scans to be available in a more accessible manner than currently is the case.

...

2) If not, should they at least be preserved with limited or separate access (such as donate them to IA for IA to do as they wish)?

They are preserved.

...

3) Or should the scans be erased when the SDT is proofed and out the door?

I don't believe this, and even if I did, it's not going to happen.

...

...
DP's page scans are accessible to anyone with an account. (Probably even to those without an account.) The only hard bit is knowing which PG posted text goes with which DP text ID, so that you can recombine them when necessary. I believe we even save bibliographical data with our texts, so that you could extract all kinds of metadata to go with the pagescans.

I have accessed the page scans at DP (in order to submit some page scans before the recent DP revamp). Now, if DP had a policy/system where page scans were more carefully indexed as you mention, then it would certainly be easier for someone to collect them.

Currently the scans are available on a per-project basis as separate files or ZIP archives. (YMMV depending on where a project is in the DP workflow.) Once projects have run through DP and been posted at PG, I believe the site administrators archive them off-site to preserve disk space, but I am afraid I am totally not clear on this.

...

However, I think the issue is more with PG which actually makes the texts available online (DP focuses on producing the texts -- that the scans are an important part of the DP work flow involves them, too.) Will PG provide links in its catalog to the original page scans alongside the SDT versions?

I am PG, but I cannot speak for it.

...

To better understand some things myself, I have to ask a fundamental workflow question of the PG-side of the house. If a finished text is donated to PG and original page scans are submitted alongside the text, what will happen to the scans? Will they be made publicly available alongside the SDT, or will they be preserved but not linked to or made public, or will they be rejected and essentially erased?

I believe the site administrators store them somewhere, but others are way more qualified to answer this.

...

And another question for PG: In a philosophical sense (ignoring the technical/administrative realities for the moment), would PG, in its catalog, provide links to the original page scans used as the source for the cataloged digital texts? Or is there a philosophical reason why PG would not do this? I've yet to hear an answer as to whether PG will philosophically consider providing links to the original page scans used to produce the texts in its catalog.

Again, I am PG but cannot speak for it. You'd have to ask this of the people who maintain the catalog, for instance. -- branko collin collin@xs4all.nl

Juliet Sutherland

10:07 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Branko Collin wrote:

...

Currently the scans are available on a per-project basis as separate files or ZIP archives. (YMMV depending on where a project is in the DP workflow.) Once projects have run through DP and been posted at PG, I believe the site administrators archive them off-site to preserve disk space, but I am afraid I am totally not clear on this.

Page scans for any project that is in progress are available on the DP site. Whether or not you can access them depends on the job that you are doing at DP. We make them available where they are needed in the work flow. Once a project has posted to PG, the scans are archived onto a different server to make space on our production server. The archived scans include what was used for proofing as well as anything else that the content provider uploaded (illustration files, etc). We do intend to make these scans available, but our development priority is getting the DP workflow process finished first. We have made a huge step forward recently, but there are several more large pieces that still have to be implemented. Aside from not having the development resources to set up some kind of system for accessing and using the scans, we also have not yet found someone who will wade through all the archived material to sort it out so that it can actually be used. There are lots of seemingly minor details that have to be taken care of, ranging from figuring out which DP projects go with which PG texts (often not a 1-1 relationship there) to what happened to missing pages from various projects. Sometimes the pages are there, but with odd names, often they are just missing. And there is no standard for how illustration files are named. Some external sources of images have requested that we refer people back to them for the original page scans. We have to sort out and mark those projects. All in all, although it sounds like it should be easy, it isn't. It will take a volunteer with motivation and presistance to make this happen. Making page images available both to the general public and for checking on reported errata will happen eventually. We are not erasing or losing any of the basic material. But I'm sure it won't happen nearly as quickly as Jon Noring would like. ;-) JulietS DP site admin

Jon Noring

10:40 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Juliet wrote:

...

Making page images available both to the general public and for checking on reported errata will happen eventually. We are not erasing or losing any of the basic material. But I'm sure it won't happen nearly as quickly as Jon Noring would like. ;-)

Thank you for the authoritative clarification of the current status of the page scans used in DP's work flow. And I'm glad all the scans have been saved and that there is an intention to eventually make them available to the public (except those scans which cannot be made public.) Regarding the volunteer(s) to help out with the page scan sorting process. Well, one never knows. If I wasn't so overwhelmed with a few proposals and projects which are coming to a boil, I'd volunteer right away (but I will volunteer to help with organizing the sorting process, and to help with some of the actual sorting, if a couple others step forward to also help with the organization.) This almost seems like a job best tackled by a small group of people considering there must be several thousand book projects and it is important it proceed at a pace much faster than the number of new projects being added. It could be doable in a distributed fashion from a dedicated server (separate from the DP servers.) It would require one or two people who are familiar at a high-level with the DP system, its metadata collection procedure, etc. That puts me at a slight disadvantage to lead that project. Regarding my impatience. Yes, I'm impatient. Sorry. <smile/> Jon

Branko Collin

15 Jul 15 Jul

4:02 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

On 14 Jul 2005, at 18:07, Juliet Sutherland wrote:

...

Making page images available both to the general public and for checking on reported errata will happen eventually. We are not erasing or losing any of the basic material. But I'm sure it won't happen nearly as quickly as Jon Noring would like. ;-)

I think even an incomplete archive would already be an asset. Of a lot of projects page scans simply exist. We know that, because we use them for proofing. :-) Also, if somebody set up such an incomplete archive, they could document the problems they ran into. So I think Jon should go ahead and try and set something up. -- branko collin collin@xs4all.nl

Jon Noring

5:58 p.m.

New subject: Initial thoughts on a PG/DP scan repository

Banko wrote:

...

Juliet Sutherland wrote:

...

...
Making page images available both to the general public and for checking on reported errata will happen eventually. We are not erasing or losing any of the basic material. But I'm sure it won't happen nearly as quickly as Jon Noring would like. ;-)

...

I think even an incomplete archive would already be an asset. Of a lot of projects page scans simply exist. We know that, because we use them for proofing. :-)

Also, if somebody set up such an incomplete archive, they could document the problems they ran into. So I think Jon should go ahead and try and set something up.

After reading the various replies, it is clear there is more interest in preserving and (sooner if not later) making the scans available to the public online than I originally perceived based on the statements of a few here. It is welcome to see the majority, including Greg and Juliet, come out in support of scan preservation and availability. Now, with about 15 minutes of thought (so we're at a very early stage of conceptualization), here's some of the issues and unknowns as I see them -- in no particular order -- a sort of stream of consciousness. Feel free to comment, counter-point, and to add more items. 1) There are scan sets all over the place, in various resolutions, color depths, quality, and completeness. (The variation in file size between sets will be significant.) 2) Some to many scan sets comprise various derivative subsets (e.g., the original scans, a cleaned up set, etc.) Question: If a scan set contains derivatives of some sort, do we restrict what will be saved? 3) Many if not most scan sets have no external metadata. Thus their identification and source of metadata will be internal (title page info, etc.) 4) Some scan sets we acquire may have license encumberances not allowing them to be accessible to the general public. (From a conversation with Juliet a while back.) This will impact upon the design of the scan repository(ies). 5) There is likely to be interest by some outside of PG/DP to contribute scans of public domain texts to the raw archive. Should we allow this? If so, keeping this straight is necessary. More metadata. (I note this thinking specifically of David Reed here in Utah who is scanning large numbers of books and doing most of the conversion himself -- I'm not sure if all the texts he is doing are being submitted to PG, but since he associates with PG do we disallow him from adding his scans to the mash?) 6) We have to worry about scan sets of works still under copyright. (And there's the related difference in life+ countries versus the fixed 1923 date in the U.S., ignoring the renewal aspects. Will the repository be used for PG-affiliated projects world-wide?) 7) When the scans are stored, do we zip each set/subset up, or keep each page scan separate? 8) What are our space requirements, for now and in the future? (This probably requires a survey, plus we have to decide whether we save all derivative subsets, save the original scans or a cleaned up set, or whatever.) 9) Do we consider conversion of all scans to DjVu? (If we ask Brewster for space, he may suggest this in order to cut down the size of the scans.) 10) Those possessing scan sets will have their own preferences for how they are submitted. Some will want to upload them, others will want to send them via CD/DVD-ROM. If some scan sets comprise a few gigs, it may make more sense to burn them on optical disk and send them. This now requires us to possibly save what is sent. It also requires someone to accept the disks and transfer the data to the repository. Who will do this depends upon various factors. 11) Access to the repository(ies). Only PG/DP volunteers, or the general public? The answer depends upon the structure of the repository(ies), who is hosting it, legal/technical issues, etc. I could go on, but the above items are a good start at some of the issues/decisions that need to be considered/resolved before bulling ahead with establishing the repository(ies). Obviously a lot of the design of the system has to integrate with how things are now done in DP with respect to handling/processing scans, which I don't have a good handle on at the present. And should we consider a "two repository" model? The first repository, which will be the first one started, will simply be a "centralized dumping ground" for the scan sets (and derivatives) which are produced as part of PG/DP activities. Access to it will be limited to the PG/DP volunteers. It has yet to be determined whether it should only include scans for finished projects, or also for ongoing projects and scans donated by those outside of PG? The second repository could come later, where a group of volunteers sort through the pile in the first repository, check for copyright/license aspects, completeness, sort the pages properly, then maybe convert the preferred derivative set to DjVu, PDF, and/or other more compact formats. Anyone who wants any of the stuff in the first repository can request it. And finally, since DP is the major player, should we move this design discussion to the relevant forum at DP? Or keep it here on gutvol-d? Jon

Jon Niehof

9:18 p.m.

New subject: Initial thoughts on a PG/DP scan repository

--- Jon Noring <jon@noring.name> wrote:

...

1) There are scan sets all over the place, in various resolutions, color depths, quality, and completeness. (The variation in file size between sets will be significant.)

Take them as they are, or request certain standards for submitters. I'd lean towards the former for now, even though it might mean work later on.

...

2) Some to many scan sets comprise various derivative subsets (e.g., the original scans, a cleaned up set, etc.)

Every processing operation loses information. So, I'd suggest keeping original scans for archival and fully cleaned scans for viewing (I usually keep both around, although if the cleaning is entirely expressed in a series of shell scripts, with no human intervention, I'll just keep the originals and regenerate the cleans when necessary).

...

3) Many if not most scan sets have no external metadata.

I don't see this as a huge issue. I try to have png numbers match page numbers, but sometimes this causes problems (DP is fine with it; guiguts not so much so in some cases). You can leave it as a big blob for people to rifle through (like paper) or set up a simple workflow for capturing it. I'd say KISS for now, though. (Bonus: some of the metadata capturing code might be useful to dp in the future if handled carefully and with luck ;) )

...

4) Some scan sets we acquire may have license encumberances not allowing them to be accessible to the general public.

I'd say leave 'em out.

...

5) There is likely to be interest by some outside of PG/DP to contribute scans of public domain texts to the raw archive.

Sure, why not? As long as we're allowed to use them later and produce PG texts as well ^^ Another way to attract content.

...

7) When the scans are stored, do we zip each set/subset up, or keep each page scan separate?

A simple zip of all scans would be useful, easy, and eliminate a lot of the metadata requirement. One can always add code for online viewing later.

...

8) What are our space requirements, for now and in the future?

You know how to figure the amount of firewood you need for the night, right? Think of the fire burning. Think of the maximum amount of wood you could possibly need. Then gather triple that amount. (As a guess, my old job figured 100 dpi, legal, B&W TIFF's were about 100k a page. My 300dpi scans of History of a Lie are 4.7MB for two pages in an 8-bit, compression 9 PNG. Somewhere in between those two :) )

...

9) Do we consider conversion of all scans to DjVu?

No proprietary formats, please. It makes little sense to have a long-term repository dependent on the whim of a particular company.

...

10) Those possessing scan sets will have their own preferences for how they are submitted.

Get one way working first, and people can work around as necessary until there are other ways. I'd be happy to spend some upload bandwidth if necessary, for example (for some people in Europe shipping a DVD may be cheaper than paying for the connect time, right?)

...

11) Access to the repository(ies). Only PG/DP volunteers, or the general public?

Ultimately I think it should be general public. How you want to start probably depends on who's providing the resources and support. Good luck, Jon; I'd like to see this happen. ____________________________________________________ Start your day with Yahoo! - make it your home page http://www.yahoo.com/r/hs

Branko Collin

10:35 p.m.

New subject: Initial thoughts on a PG/DP scan repository

On 15 Jul 2005, at 11:58, Jon Noring wrote:

...

After reading the various replies, it is clear there is more interest in preserving and (sooner if not later) making the scans available to the public online than I originally perceived based on the statements of a few here. It is welcome to see the majority, including Greg and Juliet, come out in support of scan preservation and availability.

Now, with about 15 minutes of thought (so we're at a very early stage of conceptualization),

Very early stage of conceptualization? Yesterday you had a plan for implementation. What is wrong with the half-hour-plan you had yesterday? Do not sink the momentum you have got in the morass of overplanning. Having scans available is handy for: - The PG errata folks who do not want to accidentally "fix" the wrong thing. - Publishers who want to make richer versions of our texts based on the original lay-out. - Gatekeepers who claim references cannot be made without looking at a page number (they will--of course--invent a new claim as to why PG etexts are unusable as soon as we provide them with the means to name page numbers). For these purposes, a simple system that links a PG etext with its page scans suffice.

...

here's some of the issues and unknowns as I see them -- in no particular order -- a sort of stream of consciousness. Feel free to comment, counter-point, and to add more items.

1) There are scan sets all over the place, in various resolutions, color depths, quality, and completeness. (The variation in file size between sets will be significant.)

Yes and no. For the purposes outlined above, the scan set that was used to create DP's OCR from will suffice.

...

2) Some to many scan sets comprise various derivative subsets (e.g., the original scans, a cleaned up set, etc.)

Question: If a scan set contains derivatives of some sort, do we restrict what will be saved?

No. Why?

...

3) Many if not most scan sets have no external metadata. Thus their identification and source of metadata will be internal (title page info, etc.)

4) Some scan sets we acquire may have license encumberances not allowing them to be accessible to the general public. (From a conversation with Juliet a while back.) This will impact upon the design of the scan repository(ies).

Not quite. The problem with some scan sets is that their sources claim some kind of ownership. I believe that PG believes that mere scans of public domain material are themselves in the public domain. However, there are many ways in which these sources could make things difficult for us (for instance, putting scans behind a login, so that we cannot get at them), so DP plays nice.

...

5) There is likely to be interest by some outside of PG/DP to contribute scans of public domain texts to the raw archive. Should we allow this?

I am not sure if PG wants to run a scan archive. Assume that this is going to take place outside of (but with the support of) PG. If this were a PG project, PG might require copyright clearances. But assuming it is not, you will probably have to work something out with the hosting folks so that they cannot be sued. Purely on practical reasons I think outside scans should be banned; if somebody wants to contribute to DP, after which the scans will automatically trickle down to the scan archive. Bonus: we will have an accessible etext.

...

If so, keeping this straight is necessary. More metadata. (I note this thinking specifically of David Reed here in Utah who is scanning large numbers of books and doing most of the conversion himself -- I'm not sure if all the texts he is doing are being submitted to PG, but since he associates with PG do we disallow him from adding his scans to the mash?)

The great thing about metadata is that it can be added afterwards by those who actually care about that sort of thing.

...

6) We have to worry about scan sets of works still under copyright. (And there's the related difference in life+ countries versus the fixed 1923 date in the U.S., ignoring the renewal aspects. Will the repository be used for PG-affiliated projects world-wide?)

I like PG's philosophy, which is that PG itself does not worry about the rest of the world; let the rest of the world worry. There are no copyright concerns. A minor issue might be copyrighted texts in PG; but do we typically have the scans of those?

...

7) When the scans are stored, do we zip each set/subset up, or keep each page scan separate?

Separate at first. Software that will zip it up is trivial to write and add.

...

8) What are our space requirements, for now and in the future? (This probably requires a survey, plus we have to decide whether we save all derivative subsets, save the original scans or a cleaned up set, or whatever.)

I would first talk to Brewster Kahle. Mentioning space requirements to him is probably as useless as discussing snow balls with planets.

...

9) Do we consider conversion of all scans to DjVu? (If we ask Brewster for space, he may suggest this in order to cut down the size of the scans.)

First build the repository, then see how you can add value. Reminds me of something I keep hearing: apparently a lot of people won't use PG etexts, because these texts have errors in them. Funny how none of these people has bothered to report these errors to PG. The thing is, people who have a real need for extras, will probably come to us, and may even contribute what is necessary to let these extras become a reality. People who merely have a perceived need will just whine that we do not fullfil that need.

...

10) Those possessing scan sets will have their own preferences for how they are submitted. Some will want to upload them, others will want to send them via CD/DVD-ROM. If some scan sets comprise a few gigs, it may make more sense to burn them on optical disk and send them. This now requires us to possibly save what is sent. It also requires someone to accept the disks and transfer the data to the repository. Who will do this depends upon various factors.

Again, I would outright ban external scan sets.

...

11) Access to the repository(ies). Only PG/DP volunteers, or the general public? The answer depends upon the structure of the repository(ies), who is hosting it, legal/technical issues, etc.

General public.

...

I could go on, but the above items are a good start at some of the issues/decisions that need to be considered/resolved before bulling ahead with establishing the repository(ies). Obviously a lot of the design of the system has to integrate with how things are now done in DP with respect to handling/processing scans, which I don't have a good handle on at the present. And should we consider a "two repository" model? The first repository, which will be the first one started, will simply be a "centralized dumping ground" for the scan sets (and derivatives) which are produced as part of PG/DP activities. Access to it will be limited to the PG/DP volunteers.

Why? DP already has a scan repository accessible to the volunteers; the only snag being that only scans of running projects are available. -- branko collin collin@xs4all.nl

Marcello Perathoner

16 Jul 16 Jul

12:14 a.m.

New subject: Initial thoughts on a PG/DP scan repository

Branko Collin wrote:

...

For these purposes, a simple system that links a PG etext with its page scans suffice.

A djvu multi-page archive posted into the ebook directory would suffice. There are GPLed plugins for Linux browsers and commercial (but free as in beer) plugins for Windows and Mac. The plugin works just the same as the pdf plugin. You can browse the djvu just as you would a multi-page pdf document. I just compressed the #12973 pagescans into djvu (using GPLed tools) and they came out at 4.3 MB. (From a 12.3 MB as zipped collection of tiffs.) http://www.gutenberg.org/internal/12973.djvu user: internal pass: books You can get browser plugins starting here: http://djvulibre.djvuzone.org/ Or, if you are on a real computer, become root and say: apt-get install djvulibre-plugin -- Marcello Perathoner webmaster@gutenberg.org

Joshua Hutchinson

3:47 a.m.

New subject: Initial thoughts on a PG/DP scan repository

Marcello Perathoner wrote:

...

Branko Collin wrote:

...
For these purposes, a simple system that links a PG etext with its page scans suffice.

A djvu multi-page archive posted into the ebook directory would suffice.

The problem with DjVu is that there are no Windows-based encoders. I know you like to disparage Windows users, Marcello! ;) But the fact of the matter is that the vast majority of our volunteers are using Windows. Without tools on their platform, we lose a huge section of our "workforce" and something this big will need that workforce. The only alternative that I can think of would be to have a DjVu encoder running as a webpage script on pglaf.org or somewhere similar. The volunteer could specify a single .zip with all the images in it, the server would break it apart, run the DjVu encoder on the bitmaps, and then spit back out the DjVu file. JHutch

Jon Niehof

4:44 a.m.

New subject: Initial thoughts on a PG/DP scan repository

--- Joshua Hutchinson <joshua@hutchinson.net> wrote:

...

The problem with DjVu is that there are no Windows-based encoders.

It would probably be possible to make the DjVuLibre encoders work under Cygwin. However: "...the best encoders (as of today) are owned by LizardTech Inc and kept proprietary. The smarts in the encoder can make a big difference in terms of file size and image quality...certain types of document compressed with LizardTech's commercial compressors or with the on-line conversion services (such as Any2DjVu) will end up smaller (and in some cases higher-quality) than the ones compressed with the DjVuLibre encoders." The format is also patent-encumbered. The terms are pretty liberal (although locked in to the GPL) and I wouldn't expect a Unisys-style bait-and-switch, but it really, really rubs me the wrong way. And finally, why encourage 600dpi scans and then muck 'em up with lossy compression? ____________________________________________________ Start your day with Yahoo! - make it your home page http://www.yahoo.com/r/hs

David Starner

5:43 a.m.

New subject: Initial thoughts on a PG/DP scan repository

On 7/15/05, Jon Niehof <jon_niehof@yahoo.com> wrote:

...

And finally, why encourage 600dpi scans and then muck 'em up with lossy compression?

DJVU is not necessarily lossy compression. In lossless mode, it still compress B&W scans better than TIFF or PNG. I don't think we're getting 600 dpi scans from most DP scanners anytime soon. It takes at least twice as long for me to scan at 600 dpi as it does at 300 dpi, and I'm more worried about scanning the long list of books I've checked out than producing very high resolution scans.

Marcello Perathoner

12:09 p.m.

New subject: Initial thoughts on a PG/DP scan repository

Jon Niehof wrote:

...

--- Joshua Hutchinson <joshua@hutchinson.net> wrote:

...
The problem with DjVu is that there are no Windows-based encoders.

It would probably be possible to make the DjVuLibre encoders work under Cygwin.

There is a binary package for Cygwin at: http://djvulibre.djvuzone.org/ So we are not letting Window$ users down.

...

However: "...the best encoders (as of today) are owned by LizardTech Inc and kept proprietary. The smarts in the encoder can make a big difference in terms of file size and image quality...certain types of document compressed with LizardTech's commercial compressors or with the on-line conversion services (such as Any2DjVu) will end up smaller (and in some cases higher-quality) than the ones compressed with the DjVuLibre encoders."

But the open source decoder can always decode what the commercial encoder generates. A djvu is really a multi-layered image in which each layer can be compressed using a different method. For our purpose (mostly b/w text scans) the jb2 compressor would be used. This is what "man djvu" says about the open source jb2 compressor: cjb2(1) A DjVuBitonal command line encoder. This soft-pattern-matching compressor produces DjVuBitonal images from PBM images. It can encode images without loss, or introduce small changes in order to improve the compression ratio. The lossless encoding mode is competitive with that of the Lizardtech commercial encoders.

...

The format is also patent-encumbered. The terms are pretty liberal (although locked in to the GPL) and I wouldn't expect a Unisys-style bait-and-switch, but it really, really rubs me the wrong way.

Its the only current format that gives the user basic comfort. - You just need to download one file. - You don't need to decompress it. - You can view it inside your browser just like you would a pdf. - It compresses better than anything else at present. I don't want to unload hundreds of image files on the user. Can you imagine reading Ulysses from 1.000 tiff files? Is there even any picture gallery software that can handle that gracefully? The only viable alternative would be pdf. Are you sure pdf is not patent-encumbered? Adobe never released any GPLed tools to produce pdfs. Sadly, current legislation encorages "information highway robbers" to waylay the citizen by springing patents on them.

...

And finally, why encourage 600dpi scans and then muck 'em up with lossy compression?

Even the lossy compression will retain enough detail, as you can see if you download my example file. Its not about preserving the Mona Lisa or the last complete copy of the Gutenberg Bible, but about scans of printed books. We don't care if the 3rd letter of the 25th line on page 234 has a tiny smear which the compression will lose. If you still can read and ocr the text, its good enough for us. -- Marcello Perathoner webmaster@gutenberg.org

Marcello Perathoner

10:30 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

PREFACE PG allows the posting of page images along an existing ebook. The only currently accepted format is tiff files collected in a zip archive. This format is cumbersome for the reader, wasteful on storage space and doesn't allow live deep linking to the page images. SCOPE This RFC specifies an alternative format that is less cumbersome for the reader, less wasteful of storage space and allows live deep linking to the page images. This new format will not replace the old format. Page images for any book can be posted in whichever of the 2 formats the poster chooses. The format specified here will permit online viewing of the page images (with djvu plugin). It is not required to download the file and unpack it as with the old format, although it is still possible to do so. The new format also allows for linking from an html document to an arbitrary page image, so that a click on a link will open the right page image in the djvu browser plugin. The new format compresses text pages 2-3 times better than the old format with no loss of readability or ocr-ability. The new format has GPLed compressors and decompressors. Even if the format should happen to become encumbered by patents or other licensing issues, PG will be able to easily and automatically convert the format into a legally unencumbered format. FORMAT The page images for a book will be posted into the main ebook directory in one multi-page djvu file. The file will be named #####.djvu (replace ##### with the ebook number). The multi-page djvu file contains a collection of single-page djvu files. Each single-page djvu file will contain one single-sided page of the book (cover, back and spine also count as single-sided pages) or an illustration scanned in a different resolution or color depth. Numbering / Naming of page files A book usually contains 2 page number sequences, a roman one followed by an arabic one. We considered the cover pages as yet another sequence. A filename for a single-page djvu file MUST follow this pattern: <prefix><page number>.djvu The prefix for the cover pages is: "c". The prefix for the roman pages is: "f". The prefix for the arabic pages is: "p". If there are more page number sequences in the book, they MUST be handled in a similar fashion, using an arbitrary free letter. The <page number> is the true page number as seen on the physical page (or inferred from the previous / next pages) expressed in arabic numerals and left-padded with zeroes to a length of 4 digits. For blank pages there should be no file and the page number should be skipped. Optionally an image saying: "This page is blank in the original." may be inserted. Missing pages MUST be replaced by an image saying: "This page is missing." A filename for a single-page djvu file containing an illustration scanned in a different resolution or color depth MUST follow this pattern: <prefix><page number>-<image position on the page>.djvu The <image position on the page> is "1" for the first image, "2" for the second, etc. If present, front cover, back cover and spine MUST be named as follows: front cover outside: c0001.djvu front cover inside: c0002.djvu back cover inside: c0003.djvu back cover outside: c0004.djvu spine: c0005.djvu Example of file naming: front cover c0001.djvu back cover c0004.djvu spine c0005.djvu i title page f0001.djvu ii title verso f0002.djvu iii dedication f0003.djvu iv is blank v contents f0005.djvu page 1 p0001.djvu page 2 p0002.djvu image on page 2 p0002-1.djvu image on page 2 p0002-2.djvu page 3 p0003.djvu page 4 is blank page 5 p0005.djvu ... ... page 9999 p9999.djvu Compression To produce the single-page djvu files you should use the most appropriate compressor: lossy jb2 for bitonal text, iw44 for continuous-tone images. Assemblage All single-page djvu files MUST be assembled into a multi-page djvu file. Only the multi-page djvu file will be posted. All "roman" and "arabic" pages MUST appear in the same order in the multi-page file as in the book. All cover pages and the book spine should appear at the front. The naming scheme was chosen so that saying: djvm -c 12345.djvu *djvu in a directory containing all single-page djvu files will assemble the multi-page djvu file in the correct sequence. APPENDIX Open Source compressors and browser plugins for Linux and Windows are available here: http://djvulibre.djvuzone.org/ Windows users also have to get and install Cygwin from here: http://www.cygwin.com/ -- Marcello Perathoner webmaster@gutenberg.org

David Starner

17 Jul 17 Jul

6:27 a.m.

New subject: RFC: Posting Page Scans in DJVU Format

On 7/16/05, Marcello Perathoner <marcello@perathoner.de> wrote:

...

A filename for a single-page djvu file MUST follow this pattern: [...]

I think that's overspecifying it. I scan directly to DJVU, splitting the pages only in the conversion to PNGs for ABBYY. You get some advantage taking the unsplit scans, since some number of problems with the scans involves missplit pages. I would have no problem converting the split pages to DJVU. But it takes a lot more effort to renumber all the pages, especially given the fact that illustrations aren't always numbered as part of the overall page numbers and the fact that the older books and reprints of older books can lack page numbers, have multiple page number systems or have inconsistent numbering systems. And really, why is the extra effort worth it? Most of our etexts don't have page numbers to make it make a difference for errata, and for the rare times that someone will be looking for something in the images by page number, it's not hard to search for the right page number manually, just like you would with a book.

Jeroen Hellingman (Mailing List Account)

7:53 a.m.

New subject: RFC: Posting Page Scans in DJVU Format

When I want to check something against my scans, I normally just use the find option in the OCR software, for a phrase close to where the issue is. I normally can locate the correct page very quickly that way. No need to split pages or renumber them... Jeroen. David Starner wrote:

...

And really, why is the extra effort worth it? Most of our etexts don't have page numbers to make it make a difference for errata, and for the rare times that someone will be looking for something in the images by page number, it's not hard to search for the right page number manually, just like you would with a book. _______________________________________________

Marcello Perathoner

10:17 a.m.

New subject: RFC: Posting Page Scans in DJVU Format

David Starner wrote:

...

...
A filename for a single-page djvu file MUST follow this pattern:

And really, why is the extra effort worth it? Most of our etexts don't have page numbers to make it make a difference for errata, and for the rare times that someone will be looking for something in the images by page number, it's not hard to search for the right page number manually, just like you would with a book.

Most PG texts don't have page numbers, but those texts don't have scans either. So no problem there. Most new texts that come thru DP have page numbers and also scans. DP scans already are single-page per file. They just have to implement some small code to rename the files while compressing them to djvu. Your assumptions are that only humans need to find the right page in the scans and that this will be a rare task. I don't agree with either of them. I'm just trying to pre-empt problems that may appear down the line once we have a considerable number of page image files posted. Applications may pop up that we don't imagine yet. I don't want to weep over the lost page numbers as we are weeping today over the lost accents. And yes, there already are a few applications today that need a uniform page numbering style: - links to the page images from (generated) html - user reporting of errata - validation of user reported errata - side-by-side readers - scholarly online citation of PG texts -- Marcello Perathoner webmaster@gutenberg.org

Sebastien Blondeel

10:57 a.m.

New subject: RFC: Posting Page Scans in DJVU Format

On Sun, Jul 17, 2005 at 12:17:25PM +0200, Marcello Perathoner wrote:

...

I'm just trying to pre-empt problems that may appear down the line once we have a considerable number of page image files posted. Applications may pop up that we don't imagine yet. I don't want to weep over the lost page numbers as we are weeping today over the lost accents.

I second that. Losing (meta-)information is always a bad idea. This means markup, non-ASCII characters, non-English texts, ... [*] In many historical books and essays, the author directs the reader to some other such work, page so and so. If/when PG had set up a way to recognize those and link to them, it will be possible to create links to the right part of the other e-text. Phony example: John Smith writes an _Essay on the History of Rome_ This is PG text number 23456 page 123, footnote 4, he says: [...] see James King, _Study of ancient Greece_, page 456. Now suppose that in a few years time 1/ this _Study of ancient Greece_ gets into PG, number 78967 (in the same edition John Smith used) 2/ some text-crawling program detects its fuzzy quotation in the Smith's essay 3/ some robot and/or human reworks Smith's essay to include hyperlinks to PG text 78967, at the correct page number (which would land at the right paragraph in the HTML) and checks them one by one That would be a killer ultimate library! [*] For this reason I am happy about the recent change at PGDP-US and creation of PGDP-EU, even though more work/details remain to be done (mostly, but not only, regarding the existing database of e-texts).

Marcello Perathoner

12:39 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

Sebastien Blondeel wrote:

...

In many historical books and essays, the author directs the reader to some other such work, page so and so.

If/when PG had set up a way to recognize those and link to them, it will be possible to create links to the right part of the other e-text.

This is not exactly what we were talking about. We were talkin about page images. But the problem is similar. Suppose we post a multi-page djvu, where the filenames of the single-page djvu files it contains are not in sync with the real page numbers. Some scholar will then cite a work like this: See "Foo" by Bar, Page 13 http://www.gutenberg.org/files/12345/12345.djvu#19.djvu Note: she links to internal file 19.djvu because this file just happens to contain page 13. This link will break if we ever decide to reorganize the djvu file to reflect the real page numbers. The Right Thing to do is to start so that you wont have to change the filenames even if you later insert cover pages / images scanned with higher resolution / missing pages etc. and even if you have to pull pages (eg. because you learn that they contain recent copyrighted material). Note: knowing DP project managers, some will want to link the page numbers in the html to the page images. So anybody can see how it is done and somebody external will follow. -- Marcello Perathoner webmaster@gutenberg.org

Jon Noring

3:07 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

Sebastien wrote:

...

Phony example:

John Smith writes an _Essay on the History of Rome_ This is PG text number 23456 page 123, footnote 4, he says: [...] see James King, _Study of ancient Greece_, page 456.

Now suppose that in a few years time 1/ this _Study of ancient Greece_ gets into PG, number 78967 (in the same edition John Smith used) 2/ some text-crawling program detects its fuzzy quotation in the Smith's essay 3/ some robot and/or human reworks Smith's essay to include hyperlinks to PG text 78967, at the correct page number (which would land at the right paragraph in the HTML) and checks them one by one

That would be a killer ultimate library!

Definitely! This example illustrates that as we begin considering how the PG collection can become more useful to more users in the future, PG will have to require the digital texts to be more carefully and strictly structured -- to meet a set of requirements. This is unavoidable -- it borders upon being a law of the universe. Jon Noring

Jon Niehof

2:45 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

--- Marcello Perathoner <marcello@perathoner.de> wrote:

...

DP scans already are single-page per file.

Well, not necessarily. I've dealt with projects where a two-page panoramic photograph was a single scan/file. Then there are the joys of illustrations/plates with no page number. And a project where, I kid you not, the actual page numbering ran 19, 20, 20a, 20b, 21. Many of the tools currently assume that PNG numbers are strictly numeric and monotonically increasing. Perhaps this should be changed (I would argue yes :) ), but that does create complications with how page renumbering should work, for example. And I'm not sure how much we can change FineReader's behaviour; some of us have working with gocr on the list of things to do in our copious free time, but I expect it will be a long while before it's practical (years, not months). ____________________________________________________ Start your day with Yahoo! - make it your home page http://www.yahoo.com/r/hs

Marcello Perathoner

3:42 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

Jon Niehof wrote:

...

Well, not necessarily. I've dealt with projects where a two-page panoramic photograph was a single scan/file.

Then insert the photograf as p0042-1.djvu and skip p0043.djvu

...

Then there are the joys of illustrations/plates with no page number.

Same as above. If the illustration comes after page 42 name it p0042-1.djvu.

...

And a project where, I kid you not, the actual page numbering ran 19, 20, 20a, 20b, 21.

You should be able to get a dispense for that (from Pope Noring I) and unstandardly use: p0020.djvu p0020a.dvju p0020b.djvu p0021.djvu

...

Many of the tools currently assume that PNG numbers are strictly numeric and monotonically increasing.

The only thing that has to be grafted on for now is a function to export the images into renamed files. Write a tool that asks: Starting Page No. in Database: 12 Ending Page No. in Database: 241 Prefix: p Starting Page No. in Files: 1 It will then generate files p0001.png thru p0230.png or whatever extension is appropriate for the internal image format you are using. If the book uses creative page numbering you'll have to call this repeatedly for every chunk. I don't know how frequent such projects are. For most books you'll need two calls of this tools and do the covers etc. manually. -- Marcello Perathoner webmaster@gutenberg.org

Jon Noring

8:07 p.m.

New subject: Page naming exceptions?

Marcello wrote:

...

Jon Niehof wrote:

...

You should be able to get a dispense for that (from Pope Noring I) and unstandardly use:

<laugh/> And I'm not even Roman Catholic!

...

...
Many of the tools currently assume that PNG numbers are strictly numeric and monotonically increasing.

...

The only thing that has to be grafted on for now is a function to export the images into renamed files. Write a tool that asks:

What would be done for books where the endmatter is not paginated the same way as the body of the book? For example, appendices which may have their own page number system (such as A-1, B-3, etc.) or where there are pages at the back which are unnumbered (e.g., the Kama Sutra of Vatsyayana book I recently submitted to DP)? And what about unnumbered blank pages? And what about two-sided foldouts? And what about certain tracts which use no page numbering at all? It's handling exceptions which always complicates things such as page naming conventions for page scans. And this I am definitely interested in for the projects I've been studying. It would seem the page naming system has to be self-descriptive enough so that it can be machine-processed to provide an unambiguous picture of how the pages were numbered/named (if at all) in every document, as well as describing other things such as page ordering, blank pages, foldouts, etc. Maybe the page numbering should be a two part system where the first part is a number describing the order of the scan as found in the book (how to handle foldouts has to be defined by some convention), and the second which describes what the publisher named or numbered that page scan? At least with a machine-readable filename, it would be possible to transform the filenames for particular needs, such as DjVu. Anyway, what other unusual page numbering systems do we find in the old books? I'm sure the PG and DP veterans can share some unusual things they've encountered. Jon

Geoff Horton

8:12 p.m.

New subject: Page naming exceptions?

...

Anyway, what other unusual page numbering systems do we find in the old books? I'm sure the PG and DP veterans can share some unusual things they've encountered.

Denzinger's _Echiridion Symbolorum et Definitionum_ has: Preferatory material with Roman numerals (fortunately starting with the title page as page I) Text (arabic numerals) An appendix with a separate arabic numeration scheme (page numbers followed by asterisks) An index, with a separate arabic numeration scheme (page numbers in square brackets) As an additional wrinkle, each paragraph is individually numbered, and the work is cited by paragraph number, not by page. Geoff

Jon Noring

8:40 p.m.

New subject: Page naming exceptions?

Geoff wrote:

...

Jon asked:

...

...
Anyway, what other unusual page numbering systems do we find in the old books? I'm sure the PG and DP veterans can share some unusual things they've encountered.

...

Denzinger's _Echiridion Symbolorum et Definitionum_ has:

Preferatory material with Roman numerals (fortunately starting with the title page as page I)

Text (arabic numerals)

An appendix with a separate arabic numeration scheme (page numbers followed by asterisks)

An index, with a separate arabic numeration scheme (page numbers in square brackets)

As an additional wrinkle, each paragraph is individually numbered, and the work is cited by paragraph number, not by page.

Wow! It seems to me that a page scan naming system has to include the following metadata (or that it can be inferred, calculated, etc., by machine processing): 1) The sequence number of the page in the scanning project. This way we know the exact order each page scan appears among all the pages scanned. This would include blank pages that do not contribute to the publisher-supplied page numbering. Of course, as usual, some foldouts (especially if exotic) throw us a few curve balls. 2) How the publisher named/numbered a particular page. So maybe we might do something like (strawman example): 00035-28 Where 00035 is the 35th page in the page sequence of the book (including blank pages), and "28" is what the original publisher used for the page number ("blank" could be used for a totally blank page.) If there are really oddball publisher/author page naming (like the examples Geoff gives), that whole string could be incorporated into the second part of the filename (and if needed, properly escaped.) If there are applications that require a different page scan naming convention (such as DjVu), a script can be run to autochange the filenames for the particular use. As noted above, if there are foldouts and other similar oddities, then that might cause some difficulties with this numbering system. So, will this system work in general, or will it cause problems? Handling of references to paragraphs, verses, etc., can be done within XML markup (and publisher/author page numbers can also be put within markup.) Jon

Marcello Perathoner

18 Jul 18 Jul

12:44 a.m.

New subject: Page naming exceptions?

Geoff Horton wrote:

...

Denzinger's _Echiridion Symbolorum et Definitionum_ has:

Preferatory material with Roman numerals (fortunately starting with the title page as page I)

Use f0001 - f9999

...

Text (arabic numerals)

Use p0001 - p9999

...

An appendix with a separate arabic numeration scheme (page numbers followed by asterisks)

Use q0001 - q9999

...

An index, with a separate arabic numeration scheme (page numbers in square brackets)

Use r0001 - r9999

...

As an additional wrinkle, each paragraph is individually numbered, and the work is cited by paragraph number, not by page.

Citing by paragraph is outside the scope of the RFC. -- Marcello Perathoner webmaster@gutenberg.org

Hugh MacDougall

17 Jul 17 Jul

8:20 p.m.

New subject: Page naming exceptions?

Where there are unnumbered pages that are part of the overall pagination scheme (as frequently in preliminary pages) I give them the numbers they would have had they been numbered, but in square brackets, and then enclose them in curly brackets. Thus {[2]} The same would be true for unnumbered pages at the end if they seem to fit into the overall numbering scheme, or with unnumbered pamphlets, etc. Where there is a separate numbering system, as for appendices, I'd be inclined to follow your suggestion and label them as A1, B1 etc. -- i.e. {A3} {C5} etc. Inserted maps, pictures, etc. which are in addition to the numbering system might be identified as {37a}, etc. A further problem is raised when, as is frequent with early 19th century books, the book was issued in two separately numbered volumes, but most reprinted editions have combined them with a single numbering system. I don't suppose there's ever going to be a wholly satisfactory universal system, but necessary divergences from the usual can always be explained in preferatory note. Hugh C. MacDougall 8 Lake Street Cooperstown, NY 13326-1016 hmacdougall@stny.rr.com http://www.oneonta.edu/external/cooper http://www.oneonta.edu/external/ccal ----- Original Message ----- From: "Jon Noring" <jon@noring.name> To: <gutvol-d@lists.pglaf.org> Sent: Sunday, July 17, 2005 4:07 PM Subject: [gutvol-d] Page naming exceptions?

...

Marcello wrote:

...
Jon Niehof wrote:

...
You should be able to get a dispense for that (from Pope Noring I) and unstandardly use:

<laugh/> And I'm not even Roman Catholic!

...
...
Many of the tools currently assume that PNG numbers are strictly numeric and monotonically increasing.

...
The only thing that has to be grafted on for now is a function to export the images into renamed files. Write a tool that asks:

What would be done for books where the endmatter is not paginated the same way as the body of the book? For example, appendices which may have their own page number system (such as A-1, B-3, etc.) or where there are pages at the back which are unnumbered (e.g., the Kama Sutra of Vatsyayana book I recently submitted to DP)? And what about unnumbered blank pages? And what about two-sided foldouts? And what about certain tracts which use no page numbering at all?

It's handling exceptions which always complicates things such as page naming conventions for page scans. And this I am definitely interested in for the projects I've been studying.

It would seem the page naming system has to be self-descriptive enough so that it can be machine-processed to provide an unambiguous picture of how the pages were numbered/named (if at all) in every document, as well as describing other things such as page ordering, blank pages, foldouts, etc. Maybe the page numbering should be a two part system where the first part is a number describing the order of the scan as found in the book (how to handle foldouts has to be defined by some convention), and the second which describes what the publisher named or numbered that page scan? At least with a machine-readable filename, it would be possible to transform the filenames for particular needs, such as DjVu.

Anyway, what other unusual page numbering systems do we find in the old books? I'm sure the PG and DP veterans can share some unusual things they've encountered.

Jon

_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d

David Starner

8:21 p.m.

New subject: Page naming exceptions?

On 7/17/05, Jon Noring <jon@noring.name> wrote:

...

Anyway, what other unusual page numbering systems do we find in the old books? I'm sure the PG and DP veterans can share some unusual things they've encountered.

In the newer facsimile reprints, I find many cases where there's a modern (uncopyrighted) introduction, and several pamphlets/books/whatever, with their own paging systems, with or without a overall new paging system. I also have a book proofing right now that has page numbers 95, 96, 95, 96, 97 ... And no, they didn't reprint the page, just reused the page numbers.

Robert Shimmin

18 Jul 18 Jul

3:37 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

...

Many of the tools currently assume that PNG numbers are strictly numeric and monotonically increasing. Perhaps this should be changed (I would argue yes :) ), but that does create complications with how page renumbering should work, for example.

I know of one fairly prominent commercial digital library (Eighteenth Century Online) that made the decision that the issues with using page numbers as unique identifiers are sufficiently hairy that they went with sequential *image* numbers (which are unique), and the database maintains as metadata the page number that goes with each image number (which may not be unique, sequential, numerical, or even present). This seems to be the route DP is headed with the proposed "metadata collection" round. -- RS

Andrew Sly

6:50 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

On Mon, 18 Jul 2005, Robert Shimmin wrote:

...

I know of one fairly prominent commercial digital library (Eighteenth Century Online) that made the decision that the issues with using page numbers as unique identifiers are sufficiently hairy that they went with sequential *image* numbers (which are unique), and the database maintains as metadata the page number that goes with each image number (which may not be unique, sequential, numerical, or even present).

I believe something similar is done for the page images at Early Canadiana Online. The images are just numbered sequentially starting from something like 001 and then in the interface presented to the user, actual page numbers are shown. (front matter is handled a little differently) Andrew

Marcello Perathoner

11:10 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

Robert Shimmin wrote:

...

I know of one fairly prominent commercial digital library (Eighteenth Century Online) that made the decision that the issues with using page numbers as unique identifiers are sufficiently hairy that they went with sequential *image* numbers (which are unique), and the database maintains as metadata the page number that goes with each image number (which may not be unique, sequential, numerical, or even present).

They may have a completely different usage case than we have. PG is an archive and its primary scope is preserving material. We also make this material accessible to outside people. One of the principal points in online archiving is the permanence of the url. If you keep moving things around people will not be able to reference your material. Putting all pages of a book into a container and numbering them in one sequence may work well if you don't have external references to the sequence number. It may also work well if you are sure, *really* sure that you won't ever go back and add or remove some pages from the collection. But if you want people to link to the page images *and* be able to insert or delete pages from the collection after the first publication you cannot use the single sequence numbering technique. Suppose we have published page images for "Alice in Wonderland". Suppose the first edition just contains the pages and does not contain separate images for the illustrations. Later somebody goes back and scans the illustrations in a better resolution and naturally wants them to be inserted in the right position. Using sequential numbering you are dead in the water. Suppose you have collected a lot of page images but thrown away the images of the cover pages if there was no text on them. Later you decide to add those images too. Again, using single-sequence numbering you are dead in the water. But if you used the true page numbers none of these changes poses the slightest problem because you never have to change the urls of the pages you published before. -- Marcello Perathoner webmaster@gutenberg.org

Jon Noring

19 Jul 19 Jul

2:40 p.m.

New subject: A general scan image filename syntax (conceptual)

RS wrote:

...

...
Many of the tools currently assume that PNG numbers are strictly numeric and monotonically increasing. Perhaps this should be changed (I would argue yes :) ), but that does create complications with how page renumbering should work, for example.

...

I know of one fairly prominent commercial digital library (Eighteenth Century Online) that made the decision that the issues with using page numbers as unique identifiers are sufficiently hairy that they went with sequential *image* numbers (which are unique), and the database maintains as metadata the page number that goes with each image number (which may not be unique, sequential, numerical, or even present).

This seems to be the route DP is headed with the proposed "metadata collection" round.

This returns to my idea that one could embed some of the core metadata information into the scan image filename itself, so when it comes to pagination issues the metadata database need not be queried. (It is often an advantage to not have to query the database all the time to find out certain basic information -- it depends upon who will use the data and for what purpose(s), and if there's a chance the user will not have access at the time to the database.) To reiterate, we might consider the following syntax for scan image file naming (this is not necessarily what one would use when the images are embedded within a DjVu for deeplinking purposes, as Marcello is proposing in his RFC -- we have to differentiate between general page image naming and image naming within DjVu encapsulation.) ScanFileName == BookID : SeqPage# : [PrintedPage#] : [OtherInfo] : FormatSuffix The BookID could itself comprise multiple parts, depending upon the structure of the source BookID (we have the issue of multivolume sets, for example -- tbd.) It would be an identifier that will point to metadata elsewhere. The SeqPage# is an integer giving the position of the scan in the full scan set -- it is not the same as the page numbers printed in the work. We'd include all blank pages (both sides) from cover to cover as part of the numbering. PrintedPage# is a string comprising what the publisher used (or implied using) for the page number in that scan. It could be boring like '135' (for page 135) or something more bizarre as Geoff gave some examples. OtherInfo will state various oddities, such as implied page numbering, illustrations, foldout, etc., whatever is deemed necessary. This is where the system can be expanded when we run into some really odd and unforeseen stuff. Example (using '-' as a field delimeter, there's probably a better delimeter that could be used): DP0003579-00313-296-IMP.png 0003579 is the BookID -- an identifier used in the system. I have this as a 7-digit decimal (which is more human friendly) meaning that up to 10 million IDs (less one) are possible. But as noted above, this field might itself comprise multiple parts depending upon how everything is setup. 00313 is the sequential page number of the scan set for the book. Five decimal digits is more than enough for any single book volume under the sun that I can think of, 99998 pages, or 49,999 leaves! the insides of both covers can be considered pages as well since sometimes they contain interesting stuff (like my copy of the Kama Sutra.) '296' is the string the publisher used to identify the page (here it is a straight page number: "page 296".) 'IMP' means it is implied. the publisher actually did not print '296' on the page, but from looking at the page numbering sequence, it is clearly implied. If the '296' was printed, this field would be left blank,, viz. DP0003579-00313-296.png . Hmmm, as a final point, we may even want to put another field into the filename syntax which describes something about the scan image generation, so if the scan is resampled, cleaned-up, etc., that would differentiate it from the others which represent the same page image but at a different resolution or stage of cleanup -- or that it is the original image that came off the scanner. I recall when I was handling the Kama Sutra scan set, I'd do some bulk image processing on the lot, and generate a new scan set. I had to somehow keep track of filenaming issues so I didn't mix up the scan sets. It got to be an interesting exercise to keep everything straight since I had not yet come up with a filenaming system that made sense (I was just ad-hoc'ing' it as I went along, and didn't devote much thought to it.) So, yes, it does appear the filename should include a field about generation -- maybe here we have a system such as "00", "01", etc., where "00" is the raw scan right from the scanner (and the metadata would give the details about the "00" scan set, such as resolution and color depth, type of scanner, etc.), and the other numbers denote various conversions and cleanups also specified in detail in the metadata. Since image processing can be quite complicated and vary from job to job, it's seems impossible to come up with a simple system that could be embedded right in the filename, thus the proposal to stick to a non-descript system to differentiate various converted scan sets from the original scan set. I think if someone got a hold of the various derivative scan sets, and they don't have access to the database, by careful inspection of the images they could infer, in a general sense, most of what was done to produce each scan set. (Of course, most image formats allow one to write metadata within the header, but such information can be brittle. It is something worth looking into, however.) I know what I recommend above is not the input Marcello wants on his proposed RFC, but before we tackle the specific application Marcello is discussing (linking to particular pages within a DjVu), I think it wise we first look at the general issue of scan file image naming for a variety of purposes -- as part of the work flow, and for the end-use side. As noted before, the above syntax should be trivially remapped by machine processing to Marcello's proposed RFC syntax -- at least it appears the mapping (in one direction) is easily doable. Jon

David Starner

17 Jul 17 Jul

8:37 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

On 7/17/05, Marcello Perathoner <marcello@perathoner.de> wrote:

...

DP scans already are single-page per file.

But my originals aren't. That's not a big deal, but missplit pages can be a pain. So can pages the OCR badly despeckled.

...

They just have to implement some small code to rename the files while compressing them to djvu.

It's not a matter of code; it's a matter of human effort for this on every project. Most of my projects can't be described as page number + offset; if nothing else, illustrations aren't numbered in with the rest of the book.

...

Your assumptions are that only humans need to find the right page in the scans and that this will be a rare task. I don't agree with either of them.

Computers will have a hard time using page numbers considering how inconsistent they are.

...

I don't want to weep over the lost page numbers as we are weeping today over the lost accents.

We aren't losing anything; we just aren't recording it today.

...

- user reporting of errata - validation of user reported errata

Flip through the pages.

...

- scholarly online citation of PG texts

Read the page number off the bottom of the screen or off the HTML file.

...

Suppose we post a multi-page djvu, where the filenames of the single-page djvu files it contains are not in sync with the real page numbers. Some scholar will then cite a work like this:

See "Foo" by Bar, Page 13 http://www.gutenberg.org/files/12345/12345.djvu#19.djvu

Note: she links to internal file 19.djvu because this file just happens to contain page 13.

And the autoredirect code will break the link. Recently, I linked to http://www.gutenberg.org/files/12345/images/gothic.png in an email message, and it refused to load for some of my recipients. In any case, page 13 is enough to find the section. I don't know that we can guarentee that the page numbers will be stable, given that we may need to rename files to insert a missing page or illustration, and in practice if they insist on using the internal djvu files, they will usually stay the same. At least as likely as any URL you hand around.

Jon Noring

8:56 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

David Starner wrote:

...

Marcello wrote:

...

...
DP scans already are single-page per file.

...

But my originals aren't. That's not a big deal, but missplit pages can be a pain. So can pages the OCR badly despeckled.

[snip]

David brought up the issue of illustrations, which Juliet also mentioned yesterday. It is common in a DP book scan job to scan the pages at one resolution sufficient for text, then return and redo all the illustrations at a higher resolution. (There can be multiple illustrations per page, and an iluustration can be embedded within text.) So the page scan filename system has to include this possibility. Another thing I forgot to mention is if the page scan filename system should include an identifier pointing to the source book? So we might preface each filename with an ID associated with the source book. This way a page scan can be associated with a particular book (and its metadata) should it be copied somewhere else and stands alone. Or, we could dump millions of page scans from thousands of books together, and trivially identify those belonging to a particular book. So in the prior example I gave of "00035-28", we might now have: 0003857-00035-28 Where '0003857' is the decimal identifier for the source book which was scanned -- 7 digits gives us 10,000,000 books (hexadecimal would be slightly more compact but not as human friendly) -- '00035' is the sequential page as appears in the source book (independent of any page numbering scheme which includes unnumbered blank pages), and "28" is the page number (or 'string') the publisher/author actually printed on the page to identify it. No doubt there's problems with this system (as noted, how to deal with oddities such as foldouts and the like), but am proposing it as a sort of strawman. Jon

Marcello Perathoner

18 Jul 18 Jul

12:10 a.m.

New subject: RFC: Posting Page Scans in DJVU Format

Jon Noring wrote:

...

It is common in a DP book scan job to scan the pages at one resolution sufficient for text, then return and redo all the illustrations at a higher resolution. (There can be multiple illustrations per page, and an iluustration can be embedded within text.)

So the page scan filename system has to include this possibility.

Did you actually *read* my RFC before commenting on it? I ask, because if you had read it, you would have noticed this section: ------ A filename for a single-page djvu file containing an illustration scanned in a different resolution or color depth MUST follow this pattern: <prefix><page number>-<image position on the page>.djvu The <image position on the page> is "1" for the first image, "2" for the second, etc. --------

...

0003857-00035-28

Where '0003857' is the decimal identifier for the source book which was scanned -- 7 digits gives us 10,000,000 books (hexadecimal would be slightly more compact but not as human friendly) -- '00035' is the sequential page as appears in the source book (independent of any page numbering scheme which includes unnumbered blank pages), and "28" is the page number (or 'string') the publisher/author actually printed on the page to identify it.

This is more complicated and less robust than my proposal. 1. You don't need the ebook number because the ebook number will be in the filename of the multi-page djvu. 2. You don't want the ebook number because at the time of scanning, the ebook number is unknown. 3. You don't want the sequence number in the filename because it increases the probability that links to the page image break. If you have to insert a page all subsequent files will have to be renamed, and all links to them will break. In my proposal no link will break if you insert or remove pages (except a link to the removed page). 4. Who wants to know about "unnumbered blank pages"? You are not going to cite a blank page, are you? -- Marcello Perathoner webmaster@gutenberg.org

Jon Noring

1:09 a.m.

New subject: Oops Re: RFC: Posting Page Scans in DJVU Format

Marcello wrote:

...

Jon Noring wrote:

...

...
It is common in a DP book scan job to scan the pages at one resolution sufficient for text, then return and redo all the illustrations at a higher resolution. (There can be multiple illustrations per page, and an iluustration can be embedded within text.)

So the page scan filename system has to include this possibility.

...

Did you actually *read* my RFC before commenting on it? I ask, because if you had read it, you would have noticed this section:

Oops, I apologize for not making it clear, but I was focusing not on your particular RFC (and its purpose), but on page scan filenaming in general (not embedded within DjVu or whatever). I renamed the Subject: header line on most of my messages to reflect this, but not all of them. I have a book, I am scanning it, and I want to apply an appropriate filename to each separate image so I (and others) can keep everything straight. I also want the filename to be machine processible so important page-related info can be machine read at a future time. I am not thinking of any particular application of using the page scans, such as DjVu. Your system for image naming within DjVu is interesting, and certainly there can be a mapping between a system like I propose, and the one to be used strictly within DjVu.

...

...
0003857-00035-28

Where '0003857' is the decimal identifier for the source book which was scanned -- 7 digits gives us 10,000,000 books (hexadecimal would be slightly more compact but not as human friendly) -- '00035' is the sequential page as appears in the source book (independent of any page numbering scheme which includes unnumbered blank pages), and "28" is the page number (or 'string') the publisher/author actually printed on the page to identify it.

...

This is more complicated and less robust than my proposal.

Well, we are sort of comparing apples and oranges. Sorry for not making that clear.

...

1. You don't need the ebook number because the ebook number will be in the filename of the multi-page djvu.

Of course, once the images are embedded within a DjVu, the source book ID need not be, and probably should not be, part of the page image naming. So you are right here.

...

2. You don't want the ebook number because at the time of scanning, the ebook number is unknown.

True. However, for my different situation, when a scan set is submitted to some repository, along with the metadata associated with the scan set, the repository may append the source book id that they assign to the front of the filename. Now, if they produce a single DjVu file from the scan set, then they can transform/remap the filenames to something appropriate for that specific purpose.

...

3. You don't want the sequence number in the filename because it increases the probability that links to the page image break. If you have to insert a page all subsequent files will have to be renamed, and all links to them will break. In my proposal no link will break if you insert or remove pages (except a link to the removed page).

Within DjVu, certainly! You bring up an interesting point, though in that if someone scans a book, and misses a page, then the sequential page scan numbering (not the same as the publisher page numbering) gets messed up. So once it is discovered a page is missing and is scanned, the sequential, integer numbering has to be fixed to "insert" that page. I am thinking, though, that any book scanning project will go through some kind of quality control checking, as well as generating a metadata/catalog record. During this process the scan file name will be finalized. If the page scans are later incorporated into a DjVu, then the filenames before embedding can be mapped into your proposed system.

...

4. Who wants to know about "unnumbered blank pages"? You are not going to cite a blank page, are you?

Again, I am not thinking specifically of deeplinking into DjVu and trying to maintain stable links, but rather the filenaming scheme for a bunch of page scans. With respect to understanding how a source book is laid out, it is a good idea to know where the blank pages were -- this also aids in knowing if the book scan set is complete. (There are reasons why many official documents add the statement "this page intentionally left blank" on blank pages. Also, it would not surprise me that in rare cases a page which should have been printed, turned out to be unprinted. But this is a different problem.) For deeplinking into a finalized DjVu file, the blank pages can be left out. You are right -- it is unlikely one will ever encounter a reference in one book to a blank page in another book, except maybe as some elaborate joke. Jon

Marcello Perathoner

12:38 a.m.

New subject: RFC: Posting Page Scans in DJVU Format

David Starner wrote:

...

It's not a matter of code; it's a matter of human effort for this on every project. Most of my projects can't be described as page number + offset; if nothing else, illustrations aren't numbered in with the rest of the book.

Of course, if illustrations are stored out-of-sequence, they will have to be put into sequence manually. But this happens to be the case no matter which numbering scheme you use.

...

Computers will have a hard time using page numbers considering how inconsistent they are.

I guess that 95 % of all books fit my proposed scheme. There will always be the one book that can't be handled automatically. But then, with a little extra effort you can fix it up manually.

...

...
- user reporting of errata - validation of user reported errata

Flip through the pages.

So you are sitting all day before this machine that can do a lot of work for you and you still prefer to do the work by hand? If there is an error reported on page 634 I don't want to "flip through the pages", I want the right page to open in the first time.

...

And the autoredirect code will break the link. Recently, I linked to http://www.gutenberg.org/files/12345/images/gothic.png in an email message, and it refused to load for some of my recipients.

And rightfully so. Deep linking to images is not permitted. This is documented at: http://www.gutenberg.org/howto-link -- Marcello Perathoner webmaster@gutenberg.org

David Starner

19 Jul 19 Jul

6:14 a.m.

New subject: RFC: Posting Page Scans in DJVU Format

First, I subscribe to the mailing list; please don't CC me. On 7/17/05, Marcello Perathoner <marcello@perathoner.de> wrote:

...

So you are sitting all day before this machine that can do a lot of work for you and you still prefer to do the work by hand?

You're missing the point; I don't want to do the work by hand. You're the one that wants me to do the work.

...

If there is an error reported on page 634 I don't want to "flip through the pages", I want the right page to open in the first time.

Ask them to cite the image page number.

...

...
And the autoredirect code will break the link. Recently, I linked to http://www.gutenberg.org/files/12345/images/gothic.png in an email message, and it refused to load for some of my recipients.

And rightfully so. Deep linking to images is not permitted. This is documented at:

So what's the point of worrying about deep linking to images if we're not permitting deep linking to images? And frankly, it's not rightly so. It's a pain that I couldn't toss a link to an image and have people not be able to see it depending on their email program. Maybe it's a nessicity, but it makes it hard to refer to specific images online.

...

http://www.gutenberg.org/howto-link

-- Marcello Perathoner webmaster@gutenberg.org

Marcello Perathoner

12:01 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

David Starner wrote:

...

First, I subscribe to the mailing list; please don't CC me.

Then you should set a reply-to header so my mailer knows you don't want to get the answer to your mail. Reply-To: Project Gutenberg Volunteer Discussion <gutvol-d@lists.pglaf.org> This is a perfect example for burdening your work onto other people's shoulders, like we will presently see another one. Why should I have to remember your wishes and manually delete your address every time I answer to one of your posts when there is a perfectly standard way for you in which you can configure your mailer so it happens automatically.

...

You're missing the point; I don't want to do the work by hand. You're the one that wants me to do the work.

I want you (the publisher) to do some more work, so thousands of people (your readers) will have less work to do.

...

...
If there is an error reported on page 634 I don't want to "flip through the pages", I want the right page to open in the first time.

Ask them to cite the image page number.

So everybody who wants to report an error has to install the djvu plugin and browse thru the file until they match the real page number with the imaginary one and then report the imaginary one? This is absurd, as the real page number is right there in the html file for everybody to see. This is not the way PG should treat people who want to help PG by reporting errors.

...

So what's the point of worrying about deep linking to images if we're not permitting deep linking to images?

The point is that we may want to deep link to the page images from our site (which works, otherwise you wouldn't be able to read illustrated books online). Furthermore the probability that deep links to the djvu file will crop up in weblogs by the thousands (like it used to happen for deep links to our images) is negligible, because not so many people will have the djvu plugin installed to make this technique feasible for bloggers. The danger of being /.ed by bloggers is minimal for djvu files so there is no reason to restrict access.

...

And frankly, it's not rightly so. It's a pain that I couldn't toss a link to an image and have people not be able to see it depending on their email program. Maybe it's a nessicity, but it makes it hard to refer to specific images online.

Why don't you just download the image and attach it to the mail? -- Marcello Perathoner webmaster@gutenberg.org

Branko Collin

1:07 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

On 19 Jul 2005, at 14:01, Marcello Perathoner wrote:

...

David Starner wrote:

...
First, I subscribe to the mailing list; please don't CC me.

Then you should set a reply-to header so my mailer knows you don't want to get the answer to your mail.

Er, no, this is the mailing list's job.

...

This is a perfect example for burdening your work onto other people's shoulders, like we will presently see another one.

Indeed. Who is the administrator for this mailing list? Perhaps they could fix this. (If indeed it is broken: I seem to receive these messages with the Reply-to field set to the list's address.) -- branko collin collin@xs4all.nl

Marcello Perathoner

2:32 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

Branko Collin wrote:

...

...
Then you should set a reply-to header so my mailer knows you don't want to get the answer to your mail.

Er, no, this is the mailing list's job.

If I hit "reply all" on your message, it gets addressed to the mailing list only. If I hit "reply all" on Davids mail, it gets addressed to him and to the mailing list. Lets figure out ... looking at the source of Davids message I see: Reply-To: David Starner <prosfilaes@gmail.com>, Project Gutenberg Volunteer Discussion <gutvol-d@lists.pglaf.org> So my mailer is just acting according to Davids request. -- Marcello Perathoner webmaster@gutenberg.org

Jon Niehof

3:21 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

--- Branko Collin <collin@xs4all.nl> wrote:

...

(If indeed it is broken: I seem to receive these messages with the Reply-to field set to the list's address.)

So do I. I also would hesitate to declare it "broken" otherwise; it tends to not fit my preferences but: http://www.unicom.com/pw/reply-to-harmful.html __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com

David Starner

7:36 p.m.

New subject: RFC: Posting Page Scans in DJVU Format

On 7/19/05, Marcello Perathoner <marcello@perathoner.de> wrote:

...

I want you (the publisher) to do some more work, so thousands of people (your readers) will have less work to do.

Not thousands of people. Possibly not even one.

...

This is not the way PG should treat people who want to help PG by reporting errors.

Fine, then you flip through the book at the time. If it's so important and trivial, fix the page numbers at the same time. You'll have to look at a grand total of at most five pages, provided you're willing to do linear interpolation in your head.

...

Why don't you just download the image and attach it to the mail?

Because some mailing lists don't permit attachments; because some mailing list archives don't store attachments; and because it's a waste to email an image all over the place instead of just providing a link to it

Marcello Perathoner

14 Jul 14 Jul

10:08 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Branko Collin wrote:

...

_If_ PG would like to offer page scans, and I can give you several reasons why it would, I am sure it would also like to link between the scans and the etexts.

Are these scans online and accessible at DP ? If so, linking them from the PG catalog would be a matter of a few hours work, assuming I can get a list of etext-no => page-scan-url -- Marcello Perathoner webmaster@gutenberg.org

Greg Newby

15 Jul 15 Jul

10:24 a.m.

New subject: Scans and Texts (Re: Copyright Verification?)

I can volunteer one of my development servers, or call Brewster myself, if it would help someone ready to do the work. As Juliet said, there are easy parts but also some non-trivial parts. An automation process, to pull all images from DP at a time an eBook is posted, is very much non-trivial. But just doing one or two titles as a sample would help. The post-10K file structure (described in GUTINDEX.ALL and elsewhere) allows specifically to include page scans. These don't need to exist separately from their eBook, though for first efforts it might make sense to store them elsewhere. As Juliet & others mentioned, the *archiving* is already being done. The next step is distribution. -- Greg On Thu, Jul 14, 2005 at 01:53:57PM +0200, collin@xs4all.nl wrote:

...

Jon:

...
There seems to be a limited, dichotomous view of the uses and users of structured digital texts: either casual reading by the average Joe, or hard-core academic use.

Undoubtedly that view exists, perhaps even at PG. Er, so what?

...
In the meanwhile, here's one system that could be tried until the more permanent system is developed:

1) Dial 415-561-6767 during working hours.

2) When Beatrice or Astrid answers, ask for Brewster Kahle.

3) Identify yourself as a DP person, and ask Brewster if he will archive and make available DP's page scans via a stable URL.

4) Await his answer, which may include an alternate suggestion. But I suspect it will be a positive reply. Brewster *loves* any and all high-quality public domain content.

This may take a half hour.

That doesn't seem too difficult to me.

And yet you seem completely unable to do this. I wonder why? In the time you took to write this lengthy e-mail, you could have set up a page scan archive at TIA. If this is so important to you, why haven't you done this already?

Jon, at the moment you come across like the nth of the Vapourware Kings that are regularly trolling this board. "Why don't you do X? Any idiot could do X in two working days!" Now I know you are not a Vapourware King, so what's with the act? The most likely reason why we have no page scan archive is because no-one has taken the time to set it up.

DP's page scans are accessible to anyone with an account. (Probably even to those without an account.) The only hard bit is knowing which PG posted text goes with which DP text ID, so that you can recombine them when necessary. I believe we even save bibliographical data with our texts, so that you could extract all kinds of metadata to go with the pagescans.

I'd do it, but I have other things to do.

_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d

Robert Shimmin

14 Jul 14 Jul

12:39 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

...

Although it would be nice if we could also cater for the academic world, that is by no means necessary. The academic world has access to our texts; if that for some reason is unsufficient, we need to determine whether that is through some fault of ours. I submit it isn't.

Nevertheless, if we could somehow also make the page scans accessible, that could be handy for several reasons. I believe Charles Franks was working on such a system?

The Christian Classics Etherial Library, which was doing distributed proofreading even before DP, has an interesting approach to this. Rather than wait until a project is "done" before posting it do their site, they post the scans and raw OCR the day they get it, and it's this live version that undergoes continuous proofreading and markup a page at a time. My memory says they do have a way of indicating to the end user how thoroughly "done" any particular text is, but cannot recall any details at this juncture. -- RS

Jon Noring

3:50 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Robert wrote:

...

The Christian Classics Etherial Library, which was doing distributed proofreading even before DP, has an interesting approach to this. Rather than wait until a project is "done" before posting it do their site, they post the scans and raw OCR the day they get it, and it's this live version that undergoes continuous proofreading and markup a page at a time.

...

My memory says they do have a way of indicating to the end user how thoroughly "done" any particular text is, but cannot recall any details at this juncture.

Interesting! I know from long-ago firsthand experience that Christians (particularly Fundamentalists) take seriously the *textual integrity* of the books and texts they use. Thus, it is not surprising that CCEL has setup procedures to assure fidelity, with community oversight to further assure nothing gets corrupted in any way, either accidentally or intentionally. The community aspect adds trustworthiness to the project. It would not surprise me if the CCEL texts have quite low error rates and those errors which still exist are pretty innocuous as errors go. Jon

Branko Collin

9:55 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

On 14 Jul 2005, at 9:50, Jon Noring wrote:

...

I know from long-ago firsthand experience that Christians (particularly Fundamentalists) take seriously the *textual integrity* of the books and texts they use.

...

From what I understand Muslems and Jews take this a step further, although I forget the specifics. Something about exactly copying the sacred texts, down to using a specific handwriting.

-- branko collin collin@xs4all.nl

Andrew Sly

6:20 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

Anyone interested in this idea of posting page images first, and then gradually tracking the proofreading, may want to check out Project Runeberg. They have been using a process similar to this for a while now. Andrew On Thu, 14 Jul 2005, Robert Shimmin wrote:

...

The Christian Classics Etherial Library, which was doing distributed proofreading even before DP, has an interesting approach to this. Rather than wait until a project is "done" before posting it do their site, they post the scans and raw OCR the day they get it, and it's this live version that undergoes continuous proofreading and markup a page at a time.

My memory says they do have a way of indicating to the end user how thoroughly "done" any particular text is, but cannot recall any details at this juncture.

David Starner

8:13 p.m.

New subject: Scans and Texts (Re: Copyright Verification?)

On 7/14/05, Robert Shimmin <shimmin@uiuc.edu> wrote:

...

The Christian Classics Etherial Library, which was doing distributed proofreading even before DP, has an interesting approach to this. Rather than wait until a project is "done" before posting it do their site, they post the scans and raw OCR the day they get it, and it's this live version that undergoes continuous proofreading and markup a page at a time.

I've looked at a similar system on Project Runeberg, but it doesn't look very successful at outputing a number of accurate and complete etexts. DP does a very good job at keeping attention focused on a few texts and keeping them moving forward page by page in the system, where as those system seem to disperse the effort over a lot of books and a lot of pages that are corrected more or less at random.

7291

Age (days ago)

7297

Last active (days ago)

List overview

Download

69 comments

18 participants

participants (18)

Andrew Sly
Branko Collin
collin＠xs4all.nl
David Starner
Geoff Horton
Greg Newby
Hugh MacDougall
Jeroen Hellingman (Mailing List Account)
Jon Niehof
Jon Noring
Joshua Hutchinson
Juliet Sutherland
Karen Lofstrom
Karl Eichwalder
Marcello Perathoner
Robert Cicconetti
Robert Shimmin
Sebastien Blondeel