
Marcello Perathoner <marcello@perathoner.de> wrote:
Robert Shimmin wrote:
I know of one fairly prominent commercial digital library (Eighteenth Century Online) that made the decision that the issues with using page numbers as unique identifiers are sufficiently hairy that they went with sequential *image* numbers (which are unique), and the database maintains as metadata the page number that goes with each image number (which may not be unique, sequential, numerical, or even present).
They may have a completely different usage case than we have.
Putting all pages of a book into a container and numbering them in one sequence may work well if you don't have external references to the sequence number. It may also work well if you are sure, *really* sure that you won't ever go back and add or remove some pages from the collection.
But if you want people to link to the page images *and* be able to insert or delete pages from the collection after the first publication you cannot use the single sequence numbering technique.
Suppose we have published page images for "Alice in Wonderland". Suppose the first edition just contains the pages and does not contain separate images for the illustrations. Later somebody goes back and scans the illustrations in a better resolution and naturally wants them to be inserted in the right position. Using sequential numbering you are dead in the water.
Suppose you have collected a lot of page images but thrown away the images of the cover pages if there was no text on them. Later you decide to add those images too. Again, using single-sequence numbering you are dead in the water.
But if you use the true page numbers, none of these changes poses the slightest problem, because you never have to change the URLs of the pages you published before.
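The instability described above can be sketched concretely. This is a minimal illustration, not any project's actual scheme; all filenames and page labels here are hypothetical:

```python
# Sketch: sequential numbering vs. page-number-based naming when a
# previously skipped image (an illustration plate facing p. 42, here
# labeled "042a") is added after first publication.

def sequential_names(pages):
    """Name images by their position in the collection."""
    return {page: f"img{i:04d}.png" for i, page in enumerate(pages, 1)}

def page_number_names(pages):
    """Name images by the page label printed in the book."""
    return {page: f"p{page}.png" for page in pages}

before = ["041", "042", "043"]
after  = ["041", "042", "042a", "043"]   # plate inserted later

seq_before, seq_after = sequential_names(before), sequential_names(after)
pag_before, pag_after = page_number_names(before), page_number_names(after)

# Sequential naming: every image after the insertion point is renamed,
# so any URL published for p. 43 now points at the wrong image.
broken = [p for p in before if seq_before[p] != seq_after[p]]
print(broken)    # ['043']

# Page-number naming: every name published before is unchanged.
stable = [p for p in before if pag_before[p] == pag_after[p]]
print(stable)    # ['041', '042', '043']
```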
This is an extremely important point that I feel has been woefully neglected up to this point. Just what are the use case scenarios this solution is intended to solve? You cannot effectively answer the question of 'How?' until you have answered the question of 'Why?' If you can come up with a number of reasonable scenarios as to how these page scans could be used, the naming mechanism almost defines itself.

I'm very skeptical that more than an extremely small number of people will be interested in downloading a full set of page scans. As I see it, the most likely use case is that someone is reading a Gutentext offline and encounters a passage that doesn't seem right. This person would say to himself, "Hmm, that doesn't seem right. I wonder if that's really what was in the book." He (or she) would then go online, look up the image of the page in question and conclude, "I guess that's right", or "wow, what a glaring mistake; I'd better tell someone", or "ohh, the emphasis is missing, _now_ it makes sense."

Of course, it is impossible to search for a word in an image, so in this scenario it is important that the name of the page scan be embedded in the resulting text. The actual name used is not important, so long as it is immutable and unique across the namespace of all literature. Also, in this scenario it is important that the page scans be available individually on the 'net, and not solely as part of an encapsulation. Thus, there must exist some mechanism to identify which scans belong to which edition of which work.

Another possible use case is when an individual or company says, "The production values of Gutentexts are extremely poor; I would like to create a high-quality version of a public domain work that I could then sell for 5 bucks."
(http://www.hidden-knowledge.com/) In this case, it is probably important that the page scans be available in a format that can be easily used by an OCR program (because in most cases it is easier to re-OCR to retain formatting than it is to try and clean up a Gutentext), and it is probably important that the page scans be named in some way that preserves their ordering.

Of course there is the use case Mr. Perathoner postulated, where a scan is simply incomplete and needs to be augmented in the future. It may be that the scanner was simply so short-sighted as to discard otherwise important data, thinking that it wasn't really important; it may be that the scanner was forced to excise material that was still covered by copyright at the time of scanning; or it may be that it is a scan of the "Kama Sutra", and the only available copy had been razor-bladed by sophomores from the local high school to get at "the good parts." In this case it is important that the scans be maintained as individual pages, and that the naming convention permit new images to be "inserted" between two already existing scans.

While I can't come up with a good use case scenario off the top of my head, it seems to me that the naming convention will also need to support a work flow which is 100% machine based; that is, once the scans and OCRs are completed, any number of other processes can occur which do not require _any_ human intervention.

At first blush, Mr. Perathoner's proposal for a naming convention seems to satisfy the requirements of all these use cases (although there are some ambiguities and uncertainties that may need to be tightened up). But it seems to me that there should be a bit more discussion about the use cases before discussing implementation.

Now I need to digress for a minute. David Starner <prosfilaes@gmail.com> wrote:
On 7/17/05, Marcello Perathoner <marcello@perathoner.de> wrote:
And rightfully so. Deep linking to images is not permitted. This is documented at:
So what's the point of worrying about deep linking to images if we're not permitting deep linking to images?
I think there may be some misunderstanding of what Jon Noring is talking about when he uses the term "deep-linking". Consider the case of reported United States legal decisions. A typical legal decision is cited as, e.g., "MATTHEW BENDER & CO. v. WEST PUBLISHING CO., 158 F.3d 674 (2nd Cir. 1998)." This citation contains some metadata (the names of the parties, the appeals court which rendered the decision [the Second Circuit Court of Appeals], and the year of the decision). It also contains a reference to how this case may be found: look in the 158th volume of the Federal Reporter, 3rd series, at page number 674 (the official Federal Reporter is published by West Publishing Company).

Now suppose I wanted to point a person to that point in the case where Judge Jacobs first cites the U.S. Supreme Court case of "FEIST PUBLICATIONS, INC. v. RURAL TELEPHONE SERV. CO., 499 U.S. 340, 111 S.Ct. 1282 (1991)". I would do this by citing '158 F.3d 674 at 679'. As you can see from this last example, the citation actually contains two parts: the shallow link, '158 F.3d 674', and the deep link, 'at 679.' The shallow link and the deep link are two orthogonal concepts; related but independent. The shallow link tells you how to find the document; the deep link tells you how to find the point you are looking for inside the document once you have found it.

Now in the online version of _Bender v. West_ (http://www.law.cornell.edu/copyright/cases/158_F3d_674.htm), deep linking is achieved through "star pagination": scattered throughout the text you will find strings of the form "[p*679]" which tell you where the page number was in the official West publication. (If you want to know more about "star pagination", read the case; it was an attempt by West Publishing to get a ruling that star pagination, or deep linking via page numbers, was a violation of their copyright. West lost.)

Deep linking in XHTML can be achieved in a couple of different ways.
One way is to include id attributes on tags throughout the file. Thus, the aforementioned star pagination could be encoded as '<a id="p679">[p*679]</a>', which would allow a person to cite the location as '158_F3d_674.htm#p679'. On the other hand, using XPath that location could be cited as '158_F3d_674.htm//p[13]', which identifies the 13th <p> element in the document. The XPath mechanism gives you greater granularity (the citation to _Feist_ is actually in paragraph 13, which is the third full paragraph after the page marker) and doesn't require _any_ fragment identifiers to be embedded in the document, but it is not yet well supported in user agents, and it _does_ require the document to be in a canonical form.

The point of this little digression is that the functional requirements for shallow linking and deep linking are very different, as are their solutions. Shallow linking has to solve the problem of "How do I find a copy of a particular document in the great digital library that is the web?", and deep linking has to solve the problem of "How do I indicate a specific point in a document that I have already found?"

This brings me to the last use case I will present (I'm sure there are many others as well). Many, if not most, lower court cases are not yet digitized, yet the cases citing them are. I want to be able to present the digitized text with deep links (but perhaps not shallow links) that will allow me to get to an as-yet-undigitized cited case _when it appears in the future_. Page scans which are named according to existing page numbers will allow me to do that, as will embedding page number links into structured digital texts as they are produced.

Now the inevitable counter to my desire to be forward-looking is the argument "we don't need it right now, so why should we do it now?" My response is much more philosophical than it is empirical: if we don't have the time to do it right, how will we find the time to do it over?
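The two deep-linking mechanisms described above can be sketched with Python's standard-library XML support. The document fragment below is invented for illustration (a real agent would fetch and parse 158_F3d_674.htm itself); it is a sketch of the idea, not of any existing resolver:

```python
import xml.etree.ElementTree as ET

# Toy XHTML fragment: twelve paragraphs, a star-pagination anchor,
# then the paragraph containing the Feist citation.
doc = ET.fromstring(
    "<body>"
    + "".join(f"<p>paragraph {n}</p>" for n in range(1, 13))
    + '<a id="p679">[p*679]</a>'
    + "<p>paragraph 13, citing Feist</p>"
    + "</body>"
)

# Mechanism 1: resolve the fragment identifier "#p679" by finding the
# element carrying that id attribute, as a browser does.
anchor = doc.find(".//*[@id='p679']")
print(anchor.text)          # [p*679]

# Mechanism 2: XPath-style addressing. "p[13]" selects the 13th <p>
# child of <body>, with no fragment identifiers embedded in the
# document at all. ElementTree supports this positional-predicate
# subset of XPath.
thirteenth = doc.find("p[13]")
print(thirteenth.text)      # paragraph 13, citing Feist
```

Note that the id-based mechanism works in ordinary user agents today, while the XPath address only resolves identically everywhere if the document stays in one canonical form, which matches the trade-off described above.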