
Marcello Perathoner <marcello@perathoner.de> wrote:
Robert Shimmin wrote:
I know of one fairly prominent commercial digital library (Eighteenth Century Online) that made the decision that the issues with using page numbers as unique identifiers are sufficiently hairy that they went with sequential *image* numbers (which are unique), and the database maintains as metadata the page number that goes with each image number (which may not be unique, sequential, numerical, or even present).
They may have a completely different usage case than we have.
Putting all pages of a book into a container and numbering them in one sequence may work well if you don't have external references to the sequence number. It may also work well if you are sure, *really* sure that you won't ever go back and add or remove some pages from the collection.
But if you want people to link to the page images *and* be able to insert or delete pages from the collection after the first publication you cannot use the single sequence numbering technique.
Suppose we have published page images for "Alice in Wonderland". Suppose the first edition just contains the pages and does not contain separate images for the illustrations. Later somebody goes back and scans the illustrations in a better resolution and naturally wants them to be inserted in the right position. Using sequential numbering you are dead in the water.
Suppose you have collected a lot of page images but thrown away the images of the cover pages if there was no text on them. Later you decide to add those images too. Again, using single-sequence numbering you are dead in the water.
But if you use the true page numbers, none of these changes poses the slightest problem, because you never have to change the URLs of the pages you published before.
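The instability described above can be sketched concretely. This is a minimal illustration, not any project's actual scheme; all filenames and page labels here are hypothetical:

```python
# Sketch: sequential numbering vs. page-number-based naming when a
# previously skipped image (an illustration plate facing p. 42, here
# labeled "042a") is added after first publication.

def sequential_names(pages):
    """Name images by their position in the collection."""
    return {page: f"img{i:04d}.png" for i, page in enumerate(pages, 1)}

def page_number_names(pages):
    """Name images by the page label printed in the book."""
    return {page: f"p{page}.png" for page in pages}

before = ["041", "042", "043"]
after  = ["041", "042", "042a", "043"]   # plate inserted later

seq_before, seq_after = sequential_names(before), sequential_names(after)
pag_before, pag_after = page_number_names(before), page_number_names(after)

# Sequential naming: every image after the insertion point is renamed,
# so any URL published for p. 43 now points at the wrong image.
broken = [p for p in before if seq_before[p] != seq_after[p]]
print(broken)    # ['043']

# Page-number naming: every name published before is unchanged.
stable = [p for p in before if pag_before[p] == pag_after[p]]
print(stable)    # ['041', '042', '043']
```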
This is an extremely important point that I feel has been woefully neglected up to this point. Just what are the use case scenarios this solution is intended to solve? You cannot effectively answer the question of 'How?' until you have answered the question of 'Why?' If you can come up with a number of reasonable scenarios as to how these page scans could be used, the naming mechanism almost defines itself.

I'm very skeptical that more than an extremely small number of people will be interested in downloading a full set of page scans. As I see it, the most likely use case is that someone is reading a Gutentext offline and encounters a passage that doesn't seem right. This person would say to himself, "Hmm, that doesn't seem right. I wonder if that's really what was in the book." He (or she) would then go online, look up the image of the page in question and conclude, "I guess that's right", or "wow, what a glaring mistake; I'd better tell someone", or "ohh, the emphasis is missing, _now_ it makes sense."

Of course, it is impossible to search for a word in an image, so in this scenario it is important that the name of the page scan be embedded in the resulting text. The actual name used is not important, so long as it is immutable and unique across the namespace of all literature. Also, in this scenario it is important that the page scans be available individually on the 'net, and not solely as part of an encapsulation. Thus, there must exist some mechanism to identify which scans belong to which edition of which work.

Another possible use case is when an individual or company says, "The production values of Gutentexts are extremely poor; I would like to create a high-quality version of a public domain work that I could then sell for 5 bucks."
(http://www.hidden-knowledge.com/) In this case, it is probably important that the page scans be available in a format that can be easily used by an OCR program (because in most cases it is easier to re-OCR to retain formatting than it is to try and clean up a Gutentext), and it is probably important that the page scans be named in some way that preserves their ordering.

Of course there is the use case Mr. Perathoner postulated, where a scan is simply incomplete and needs to be augmented in the future. It may be that the scanner was simply so short-sighted as to discard otherwise important data, thinking that it wasn't really important; it may be that the scanner was forced to excise material that was still covered by copyright at the time of scanning; or it may be that it is a scan of the "Kama Sutra", and the only available copy had been razor-bladed by sophomores from the local high school to get at "the good parts." In this case it is important that the scans be maintained as individual pages, and that the naming convention permit new images to be "inserted" between two already existing scans.

While I can't come up with a good use case scenario off the top of my head, it seems to me that the naming convention will also need to support a work flow which is 100% machine based; that is, once the scans and OCRs are completed, any number of other processes can occur which do not require _any_ human intervention.

At first blush, Mr. Perathoner's proposal for a naming convention seems to satisfy the requirements of all these use cases (although there are some ambiguities and uncertainties that may need to be tightened up). But it seems to me that there should be a bit more discussion about the use cases before discussing implementation.

Now I need to digress for a minute. David Starner <prosfilaes@gmail.com> wrote:
On 7/17/05, Marcello Perathoner <marcello@perathoner.de> wrote:
And rightfully so. Deep linking to images is not permitted. This is documented at:
So what's the point of worrying about deep linking to images if we're not permitting deep linking to images?
I think there may be some misunderstanding of what Jon Noring is talking about when he uses the term "deep-linking". Consider the case of reported United States legal decisions. A typical legal decision is cited as, e.g., "MATTHEW BENDER & CO. v. WEST PUBLISHING CO., 158 F.3d 674 (2nd Cir. 1998)." This citation contains some metadata (the names of the parties, the appeals court which rendered the decision [the Second Circuit Court of Appeals], and the year of the decision). It also contains a reference to how this case may be found: look in the 158th volume of the Federal Reporter, 3rd series, at page number 674 (the official Federal Reporter is published by West Publishing Company).

Now suppose I wanted to point a person to that point in the case where Judge Jacobs first cites the U.S. Supreme Court case of "FEIST PUBLICATIONS, INC. v. RURAL TELEPHONE SERV. CO., 499 U.S. 340, 111 S.Ct. 1282 (1991)". I would do this by citing '158 F.3d 674 at 679'. As you can see from this last example, the citation actually contains two parts: the shallow link, '158 F.3d 674', and the deep link, 'at 679.' The shallow link and the deep link are two orthogonal concepts; related but independent. The shallow link tells you how to find the document; the deep link tells you how to find the point you are looking for inside the document once you have found it.

Now in the online version of _Bender v. West_ (http://www.law.cornell.edu/copyright/cases/158_F3d_674.htm), deep linking is achieved through "star pagination": scattered throughout the text you will find strings of the form "[p*679]" which tell you where the page number was in the official West publication. (If you want to know more about "star pagination", read the case; it was an attempt by West Publishing to get a ruling that star pagination, or deep linking via page numbers, was a violation of their copyright. West lost.)

Deep linking in XHTML can be achieved in a couple of different ways.
One way is to include id attributes on tags throughout the file. Thus, the aforementioned star pagination could be encoded as '<a id="p679">[p*679]</a>', which would allow a person to cite the location as '158_F3d_674.htm#p679'. On the other hand, using XPath that location could be cited as '158_F3d_674.htm//p[13]', which identifies the 13th <p> element in the document. The XPath mechanism gives you greater granularity (the citation to _Feist_ is actually in paragraph 13, which is the third full paragraph after the page marker) and doesn't require _any_ fragment identifiers to be embedded in the document, but it is not yet well supported in user agents, and it _does_ require the document to be in a canonical form.

The point of this little digression is that the functional requirements for shallow linking and deep linking are very different, as are their solutions. Shallow linking has to solve the problem of "How do I find a copy of a particular document in the great digital library that is the web?", and deep linking has to solve the problem of "How do I indicate a specific point in a document that I have already found?"

This brings me to the last use case I will present (I'm sure there are many others as well). Many, if not most, lower court cases are not yet digitized, yet the cases citing them are. I want to be able to present the digitized text with deep links (but perhaps not shallow links) that will allow me to get to an as-yet-undigitized cited case _when it appears in the future_. Page scans which are named according to existing page numbers will allow me to do that, as will embedding page number links into structured digital texts as they are produced.

Now the inevitable counter to my desire to be forward-looking is the argument "we don't need it right now, so why should we do it now?" My response is much more philosophical than it is empirical: if we don't have the time to do it right, how will we find the time to do it over?
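The two deep-linking mechanisms described above can be sketched with Python's standard-library XML support. The document fragment below is invented for illustration (a real agent would fetch and parse 158_F3d_674.htm itself); it is a sketch of the idea, not of any existing resolver:

```python
import xml.etree.ElementTree as ET

# Toy XHTML fragment: twelve paragraphs, a star-pagination anchor,
# then the paragraph containing the Feist citation.
doc = ET.fromstring(
    "<body>"
    + "".join(f"<p>paragraph {n}</p>" for n in range(1, 13))
    + '<a id="p679">[p*679]</a>'
    + "<p>paragraph 13, citing Feist</p>"
    + "</body>"
)

# Mechanism 1: resolve the fragment identifier "#p679" by finding the
# element carrying that id attribute, as a browser does.
anchor = doc.find(".//*[@id='p679']")
print(anchor.text)          # [p*679]

# Mechanism 2: XPath-style addressing. "p[13]" selects the 13th <p>
# child of <body>, with no fragment identifiers embedded in the
# document at all. ElementTree supports this positional-predicate
# subset of XPath.
thirteenth = doc.find("p[13]")
print(thirteenth.text)      # paragraph 13, citing Feist
```

Note that the id-based mechanism works in ordinary user agents today, while the XPath address only resolves identically everywhere if the document stays in one canonical form, which matches the trade-off described above.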