re: [gutvol-d] RFC: Posting Page Scans in DJVU Format

robert said:
I know of one fairly prominent commercial digital library (Eighteenth Century Online) that made the decision that the issues with using page numbers as unique identifiers are sufficiently hairy that they went with sequential *image* numbers (which are unique), and the database maintains as metadata the page number that goes with each image number (which may not be unique, sequential, numerical, or even present).
and that's a perfectly reasonable fall-back position that will work in any situation where the normal system won't. (actually, most normal cases are a subset of this, where the page-number and the image-number happen to match.)

after all, these are books, composed of sheets of paper that were bound together; a natural consequence of _being_ bound is that a specific linearity is enforced on those sheets.

as long as the filenames -- when sorted by our various tool apps -- result in that _same_ specific linearity, there will be no uncertainty. (and, since we have the power to name them, we can ensure that.)

the focus on exceptions here is making much ado about nothing. it means you're talking about the wrong things -- there are questions which _do_ need to be discussed, but this is frankly not one of them. it also means that you're making things unnecessarily complicated, which is not surprising, since that's the stock-in-trade of jon noring. the right approach is _a_simple_flexibility_ that gives the right outcome.

-bowerbird
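The scheme Robert describes can be sketched briefly. This is only an illustrative sketch, not anyone's actual system: the filenames, page labels, and the `label_for` helper are all hypothetical, chosen to show how zero-padded sequential image numbers stay unique and sort in scan order, while page "numbers" live in metadata and need not be unique, sequential, numerical, or even present.

```python
# Zero-padded sequential image numbers: lexical sort of the filenames
# reproduces the physical linearity of the bound sheets.
filenames = [f"{n:04d}.png" for n in range(1, 8)]
assert filenames == sorted(filenames)  # sort order == scan order

# Page labels keyed by image number (all values here are invented
# examples of the exceptions Robert lists).
page_label = {
    1: None,   # cover: no page number printed at all
    2: "i",    # front matter: roman numerals, not arabic
    3: "ii",
    4: "1",    # body text: numbering restarts at 1
    5: "2",
    6: "2",    # misprint: a duplicate page number
    7: "4",    # gap: no page 3 was ever printed
}

def label_for(image_number):
    """Return the page label for a scan image, or None if absent."""
    return page_label.get(image_number)
```

The point of the design is that the image number is the identifier the software relies on; the page label is only descriptive metadata, so none of the oddities above can break lookup or ordering.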

Bowerbird wrote (quoted in full above):
It is the exceptions which drive the system one designs, even if it is a flexible system. Understand most of the exceptions, and the system that makes sense will become readily apparent. In just about every technical and social endeavor I know of and have participated in (e.g., as the lead project manager for a very big, 30-person engineering project at Yucca Mountain), handling the exceptions is critical to success, since reality never perfectly fits our simplistic models. There's a lot of truth in the saying "the devil is in the details." One can come up with all kinds of cool and simple systems to do a job, and nine times out of ten none of them will work in the real world, at least not without some specific exception-handling capability. This does not mean one should not seek simplicity, but as Einstein once said: "Everything should be as simple as possible, but no simpler." (There are variations on this quote.)

I've been working on a discographical database for 78 rpm sound recordings, and after a couple of years of discourse with the experts, the record collectors, and others who will *use* the database, it is clear the system *has to* take into account the various exceptions, oddities, and deviations from the simple 99.9% "norm". If not, the system will be useless and broken from the users' point of view. During this design work, I'd identify one or two exceptions that I knew of, then come up with the simplest, most straightforward design that made sense. Then another exception would get identified, and the design would become insufficient, even though I had tried to design flexibility into it. I could jury-rig something with the current design, but then the result makes database developers' lives more difficult.
You can talk all you want about flexibility in design, but flexibility in design is an elusive animal -- and the problem is compounded once a design is implemented and embraced, applications are built around it, and huge amounts of data are organized by it: having to change the design because of an unforeseen exception is a real pain in the butt. Look at the difficulties faced in fixing the various problems the PG collection acquired because of decisions made years ago; because the collection now has 15,000 texts, retroactive fixing is a daunting task (especially for the non-DP portion).

Another point brought up, the "right outcome", is equally ambiguous. The "right outcome" depends upon what *we* believe should be the "right outcome", and this can only be determined by discussion of the purpose and specific long-term goals of PG. It is not your "right outcome", and it is not my "right outcome". Some of my comments yesterday proposing the kinds of future uses and the supported user groups PG should target were just such an attempt to keep some public discussion alive on "what should we support in the future?" If this is not discussed, then whatever transpires will be hit-or-miss -- will it luckily get us to some desired future point, or will we look back upon what could have been with regret?

Anyway, the people here who work with public domain texts all the time know about the exceptions, and there are many. (In my close work with over a dozen PD texts, I've encountered exceptions in *every* one of them when compared to the "simple model" we all seem to understand, and in some cases they are significant drivers.) So it does pay to identify and discuss the exceptions before deciding on any design to be employed in a system that will handle thousands and eventually millions of texts.
If one "does it right" at the start, it makes things easier down the road, even if mid-course corrections are later required due to things that just could not be foreseen. The more exceptions that are handled today, the fewer that will need to be handled tomorrow -- and that sounds good when we're talking about one million etexts.

Jon
participants (2)
- Bowerbird@aol.com
- Jon Noring