Scan file naming -- another comment

Jon Noring

22 Jul 2005

5:45 p.m.

Bowerbird's thoughts on scanning are a good summary of some of the issues. And despite his view that we've gotten off-track on the discussion, his points about filenaming, image processing (deskewing), etc., align pretty well with the ongoing discussion.

Regarding scan filenaming, he rightly notes that a source book (or Work) identifier should be prepended to the filename. This is what I have also proposed.

Where we differ in filename convention is that I believe the source ID should be followed immediately by a sequential number which describes where the page side is in the linear order of all the page sides in the book (totally independent of how the publisher may have paginated the book). This way one will unambiguously and immediately know the position of every page scan in the bound book (starting with the inside of the front cover, which can be "side 1", and ending with the inside of the back cover -- alternatively we can start with the front cover as "side 1", which has some advantages with respect to the dominant recto/verso page numbering convention). All blank pages will be included.

Now, this sequential number will not correlate at all with whatever pagination the publisher uses to 'id' the pages. So, after the sequential number we have a third field in the filename which gives the actual publisher-supplied page number (if any; can be implied).

This way we decouple the publisher pagination from the page sequence in the book, thereby simplifying the system and making it more flexible. It will be able to handle *any* bizarre pagination system the publisher/author dreamed up (the publisher could number the pages backwards for all we care, and this system will handle it without any complications -- yet we preserve the publisher-supplied page "number" in the filename, which is important for referencing/citation).

Example: DP0000239-00125-106.png

"DP0000239" is the source book identifier, here a DP identifier. If the scan project is independent of DP, it could be 'PG0014239' to associate the scan set with PG text number 14239.

"00125" says this is the 125th "side" in the full sequence of sides in the book, starting from the front cover or wherever else is considered the starting point.

"106" is the string (which can be more complicated, like "A2", "5-4", "ix", "ABCD", whatever) which the publisher printed on that page to identify it (that's really what a publisher-supplied page "number" is: a page identifier).

(My proposed system has a couple more fields after these three, dealing with exceptions and generation of the scan set from the original, which aid in keeping track of multiple derivative scan sets and a few other oddities. The details are described in previous messages.)

Jon Noring

[Note: In the "My Antonia" project, it is interesting that there is no "Page 1" and "Page 2". The book starts (after the Roman numbered foreword section) with Page 3! Now imagine getting a scan set of "My Antonia" where we have defined page scan sequencing using the page numbering the publisher used (as in the systems proposed by Marcello and Bowerbird). The first question I will have is "where are pages 1 and 2? Are they missing from the set?" However, if the scans are sequentially numbered based on their position in the book (with the knowledge the book passed QC checking), then I would know that at least the project saw this, too, and likely there were no missing pages, thus it is likely Page 1 and 2 never existed.

And for those who will ask, before scanning the book I took it apart to determine if there was a page which got ripped out there, but there definitely was no ripped out page. There might have been an inserted/glued plate which "fell out", but checking with the "My Antonia" experts there definitely was no such insert in the First Edition.

How many other books start pagination of the body with something other than 1?]
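The three-field filename described above can be sketched as a small parser and builder. This is only an illustration under stated assumptions: the regular expression, the `ScanName` type and both function names are hypothetical, the source ID is assumed to be letters followed by digits, and the extra fields mentioned in the parenthetical (exceptions, scan-set generation) are not modelled.

```python
import re
from typing import NamedTuple


class ScanName(NamedTuple):
    source_id: str   # e.g. "DP0000239"
    sequence: int    # position of the page side in the bound book
    page_label: str  # string the publisher printed on the page ("" if none)
    extension: str   # e.g. "png"


# Assumed layout: <source id>-<sequence digits>-<page label>.<extension>
# The page label may itself contain hyphens (e.g. "5-4").
_PATTERN = re.compile(r"^([A-Za-z]+\d+)-(\d+)-(.*)\.([A-Za-z0-9]+)$")


def parse_scan_name(filename: str) -> ScanName:
    """Split a three-field scan filename into its parts."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a three-field scan name: {filename!r}")
    source_id, sequence, page_label, extension = match.groups()
    return ScanName(source_id, int(sequence), page_label, extension)


def build_scan_name(source_id: str, sequence: int, page_label: str,
                    extension: str = "png") -> str:
    """Assemble a filename, zero-padding the sequence number to five digits."""
    return f"{source_id}-{sequence:05d}-{page_label}.{extension}"


if __name__ == "__main__":
    print(parse_scan_name("DP0000239-00125-106.png"))
    print(build_scan_name("PG0014239", 3, "ix"))
```

Because the sequence field is zero-padded, a plain alphabetical listing of one book's files also lists them in physical order, whatever the page labels contain.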

Marcello Perathoner

22 Jul

10:37 p.m.

Jon Noring wrote:

...

Example: DP0000239-00125-106.png

"DP0000239" is the source book identifier, here a DP identifier. If the scan project is independent of DP, it could be 'PG0014239' to associate the scan set with PG text number 14239.

"00125" says this is the 125th "side" in the full sequence of sides in the book, starting from the front cover or wherever else is considered the starting point.

"106" is the string (which can be more complicated like "A2", "5-4", "ix", "ABCD" whatever), which the publisher printed on that page to identify it (that's really what a publisher-supplied page "number" is: a page identifier.)

Incredibly awkward and broken in several ways:

1. While scanning you have no feedback on the correctitude of your scanning. You are scanning page "42" and saving to file "58.tif". There is no immediate relation between the page you are putting on the scanner and the filename you are saving it under.

2. To add the real page number to the filename you need a second run over all files. Errors galore!

   Proof: your example filename DP0000239-00125-106.png is bogus: page 125 "starting from the front cover" must be a right-hand side, but page 106 is sure a left-hand one. You got confused even with one file alone. What about handling hundreds of them at once?

3. Being composed of 2 keys, the probability that a link to this file breaks is much higher than using whichever one key.

4.

...

It will be able to handle *any* bizarre pagination system the publisher/author dreamed up (the publisher could number the pages backwards for all we care, and this system will handle it without any complications -- yet we preserve the publisher-supplied page "number" in the filename which is important for referencing/citation.)

Bogus claim. The publisher might put something in the page "number" that doesn't work as filename or url. What about page "4/2"? Makes a good filename, huh? -- Marcello Perathoner webmaster@gutenberg.org

Jon Noring

11:38 p.m.

Marcello wrote:

...

Jon Noring wrote:

...

...
Example: DP0000239-00125-106.png

"DP0000239" is the source book identifier, here a DP identifier. If the scan project is independent of DP, it could be 'PG0014239' to associate the scan set with PG text number 14239.

"00125" says this is the 125th "side" in the full sequence of sides in the book, starting from the front cover or wherever else is considered the starting point.

"106" is the string (which can be more complicated like "A2", "5-4", "ix", "ABCD" whatever), which the publisher printed on that page to identify it (that's really what a publisher-supplied page "number" is: a page identifier.)

...

Incredibly awkward and broken in several ways:

1.

While scanning you have no feedback on the correctitude of your scanning. You are scanning page "42" and saving to file "58.tif". There is no immediate relation between the page you are putting on the scanner and the filename you are saving it under.

But your system requires that each image, when it is first saved, needs a human being to eyeball the page, determine the publisher supplied page number (if any; may be implied), and then manually save the page using the publisher number.

...

2.

To add the real page number to the filename you need a second run over all files. Errors galore!

Which is important to do anyway. Just think through how you would construct a volunteer, multi-people effort to scan lots of books. There is the need for people to scan, for people to look over the work and determine if there are problems (QC), people to deskew and crop the images, people to regularize the filenaming (to whatever system), the making of derivative scan sets, etc. It will look more and more like DP.

...

Proof: your example filename DP0000239-00125-106.png is bogus: page 125 "starting from the front cover" must be a right-hand side, but page 106 is sure a left-hand one. You got confused even with one file alone. What about handling hundreds of them at once?

I did not get confused, but your observation is correct, and I am revising where seq# starts. I first started my seq# scheme from the inside of the front cover, which is better to be an even number. The front of the front cover should be given Seq#= 1, the inside of the front cover=2, etc. One can think of the cover like a very thick leaf of the book, and any scan project should scan the covers anyway. (And maybe set Seq#=0 for the spine, which should also be scanned. It is important for bibliographic purposes to get scans of the outside covers and spine. I learned this in trying to date my particular copy of the Kama Sutra, where the spacing and color of the lettering on the spine is critical to determination of both the edition and printing, and whether it is original or a pirate copy.)
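The parity objection and the revised starting point can be checked mechanically. The sketch below assumes the revised convention (Seq#=1 for the outside of the front cover, so odd sequence numbers are right-hand sides) and the dominant publisher convention that odd printed page numbers fall on right-hand pages; the function names are hypothetical.

```python
from typing import Optional


def is_recto(sequence: int) -> bool:
    """Under the revised scheme, odd sequence numbers are right-hand sides."""
    return sequence % 2 == 1


def parity_consistent(sequence: int, page_label: str) -> Optional[bool]:
    """Compare the parity of a sequence number with an arabic page label.

    Returns None when the label is not a plain arabic number ("ix", "A2",
    "" for an unnumbered page), because no parity check is possible then.
    """
    if not (page_label.isascii() and page_label.isdigit()):
        return None
    return is_recto(sequence) == (int(page_label) % 2 == 1)


if __name__ == "__main__":
    # The example from the thread: side 125 carrying printed page 106.
    print(parity_consistent(125, "106"))  # the mismatch Marcello pointed out
    print(parity_consistent(126, "106"))
```

A mismatch does not prove an error, since a publisher may break the odd-on-recto convention, but it flags filenames worth a second look during QC.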

...

Being composed of 2 keys, the probability that a link to this file breaks is much higher than using whichever one key.

As I noted before, for certain end-uses, one could do a conversion. I'm thinking of the work flow and scan set archiving stages.

...

...
It will be able to handle *any* bizarre pagination system the publisher/author dreamed up (the publisher could number the pages backwards for all we care, and this system will handle it without any complications -- yet we preserve the publisher-supplied page "number" in the filename which is important for referencing/citation.)

...

Bogus claim.

The publisher might put something in the page "number" that doesn't work as filename or url. What about page "4/2"? Makes a good filename, huh?

For these rare circumstances where the string contains a disallowed character for filenaming (of course, we have to think internationally, too), it would be possible to use an escape character, as is done for URLs when they contain disallowed characters.

Page referencing as found in written literature will likely use whatever system was used in the target book, such as "see 'Lust in the Dust' by John Rust, page 4/2, ...". It is wise that the exact string be preserved for linking purposes, rather than renaming it to something else and losing that information. Not recording the exact character string the publisher used to number (or id) a page is not good. If it is occasionally necessary to escape characters, then so be it. This is done all the time in URLs.

Jon
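One way the URL-style escaping mentioned above might look, using percent-encoding from the Python standard library. The choice of percent-encoding and the two function names are assumptions made for illustration; the thread does not settle on a particular escape scheme.

```python
from urllib.parse import quote, unquote


def escape_page_label(label: str) -> str:
    """Percent-encode a publisher page label for use inside a filename.

    safe="" means "/" is escaped as well; letters, digits and "_.-~"
    are left untouched by quote().
    """
    return quote(label, safe="")


def unescape_page_label(escaped: str) -> str:
    """Recover the exact string the publisher printed."""
    return unquote(escaped)


if __name__ == "__main__":
    escaped = escape_page_label("4/2")
    print(escaped)                       # 4%2F2
    print(unescape_page_label(escaped))  # 4/2
```

The round trip preserves the publisher's string exactly, which is the property argued for above. Two caveats apply to this sketch: a "%" that is part of a filename has to be encoded again (as "%25") when that filename is placed in a URL, and the length and character limits of older filesystems are not addressed.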

Robert Cicconetti

23 Jul

12:44 a.m.

On 7/22/05, Jon Noring <jon@noring.name> wrote:

...

Marcello wrote:

...
Jon Noring wrote:

...
Example: DP0000239-00125-106.png

"DP0000239" is the source book identifier, here a DP identifier. If the scan project is independent of DP, it could be 'PG0014239' to associate the scan set with PG text number 14239. Incredibly awkward and broken in several ways:

1.

While scanning you have no feedback on the correctitude of your scanning. You are scanning page "42" and saving to file "58.tif". There is no immediate relation between the page you are putting on the scanner and the filename you are saving it under.

But your system requires that each image, when it is first saved, needs a human being to eyeball the page, determine the publisher supplied page number (if any; may be implied), and then manually save the page using the publisher number.

I think you are both going rather far afield. The majority of the books scanned at DP are scanned directly into Abbyy Finereader, which has the following limitations/features:

* Numbers only.
* Will split/deskew automatically, and assign consecutive numbers. If the page is upside down, both halves will be upside down and assigned reversed numbers.
* Has batch renumbering capability that will move a range of pages to another range, but not affect the order.
* Will threshold greyscale/color to b/w if told (it doesn't handle pages left in grey well) and will despeckle if told; the despeckle is fairly aggressive, and has been known to eat punctuation.
* Is not suited to making archival scans of illustrations; greyscale images are quantized to about half the normal color space, and it deskews using the shear method.

Generally I start the beginning material at 1, run until I hit real page numbers, then push 'real' numbered pages to 101. I believe most other PMs use a variant of this. Pages without a number often, but not always, do not have OCRable text and get scanned separately. If they do have text I make a note where they fit and put them up in the 900 range.

After scanning, everything gets run through Guiprep (Excellent tool!) which renumbers all of the illustrations to fit in with the DB restrictions at DP (0001.png, 0002.png, etc.) consecutively. DP image numbers have very little to do with actual page numbers, although in most cases it is a fixed offset. As Juliet said before, there are few conventions regarding illustration numbering; I generally use pic[xxxx].png, where xxxx is the true page #, or a variant thereof.

No room for fancy metadata in the filename, although I believe DP now accepts an XML file with each project; you may be able to store 'real' page # information there. Although I believe the long term plan is to leave it to a metadata round..

R C

Jon Noring

2:30 a.m.

Robert wrote:

...

Jon Noring wrote:

...

...
But your system requires that each image, when it is first saved, needs a human being to eyeball the page, determine the publisher supplied page number (if any; may be implied), and then manually save the page using the publisher number.

...

I think you are both going rather far afield. The majority of the books scanned at DP are scanned directly into Abbyy Finereader, which has the following limitations/features: [snip of summary of practice.]

Yes, agreed. As I was sitting outside thinking of the problem (and sipping on a beer), it became clear that an important driver for the file naming of the *existing scan sets* is how they are presently named at DP, and how much effort will be required (read: volunteer effort) to rename them if needed. (I think they will require a human being to look them over and make needed changes if the scans will be used in a linking environment that Marcello is thinking of doing for PG -- and which I'd like to see. We might consider a sort of DP-like environment to assist with scan set QCing and filename changes.)

There is the issue of what a project to create high-quality book scans would embrace for its naming system, and I think my system is the better (but not necessarily the best) candidate for that. There are certainly other possible systems which have not yet been proposed. But the focus at present for PG and DP is what currently exists, and how to best fit it in with both DP's and PG's needs and restrictions.

One question of Robert and the other DPers: how were blank pages handled?

...

After scanning, everything gets run through Guiprep (Excellent tool!) which renumbers all of the illustrations to fit in with the DB restrictions at DP (0001.png, 0002.png, etc.) consecutively. DP image numbers have very little to do with actual page numbers, although in most cases it is a fixed offset. As Juliet said before, there are few conventions regarding illustration numbering; I generally use pic[xxxx].png, where xxxx is the true page #, or a variant thereof.

No room for fancy metadata in the filename, although I believe DP now accepts an XML file with each project; you may be able to store 'real' page # information there. Although I believe the long term plan is to leave it to a metadata round..

As an aside, I've always thought the best system would be to separate the DP proofing system from the scanning portion. In essence, to set up a separate (autonomous) "Distributed Scanners" which will encourage the scanning of older books, set minimum quality requirements, QC, standardized cataloging (possibly MARC-XML), clean up the scans to form working sets (deskewing, cropping, color depth reduction, etc.), and do so in a semi-distributed environment akin to DP. Then the work product would be archived at IA (with public access to some of the derivative scan sets if not the masters). And of course DS would generate a derivative scanset optimized for DP's process. If the system works well, DP could encourage submitters to go through the DS system for submitting scans.

Of course, DS would require a few dedicated and knowledgeable people (in various areas of expertise) to get together and hammer out the specifics of the system and do the necessary development work. I do believe it will be possible to get equipment donations (such as sheet feed scanners, Plustek OptiBook scanners or similar for scanning bound books with gentle handling and no page distortion, heavy duty choppers, etc.). I also believe it possible to get tax-deductible donations of old books in poor condition which could be chopped (I'm looking into this now and am encouraged -- the tax deductibility is a big issue to bookstores and others). And though I may be naive on this, I think it possible to find some willing librarians at academic libraries who will let us come in and scan some of their older books. Of course, we'd try to reach out to the library community to find volunteers to assist with the cataloging/metadata aspects, and maybe some will help with the scan QC (and filenaming) as well.

Anyway, just sort of dreaming/musing here.

Thanks, Robert, for clarifying the current status of the DP scans and filenaming system.

Jon

Robert Cicconetti

7:04 a.m.

On 7/22/05, Jon Noring <jon@noring.name> wrote:

...

One question of Robert and the other DPers: how were blank pages handled?

If it has a page number (or fits into the page sequence) I scan it. If it is the verso or obverse of an unnumbered illustration, I skip it. The goal is to capture the content; grabbing an unnumbered blank page doesn't accomplish much.

...

As an aside, I've always thought the best system would be to separate the DP proofing system from the scanning portion. In essence, to set up a separate (autonomous) "Distributed Scanners" which will encourage the scanning of older books, set minimum quality requirements, QC, standardized cataloging (possibly MARC-XML), clean up the scans to form working sets (deskewing, cropping, color depth reduction, etc.), and do so in a semi-distributed environment akin to DP. Then the work product would be archived at IA (with public access to some of the derivative scan sets if not the masters). And of course DS would generate a derivative scanset optimized for DP's process. If the system works well, DP could encourage submitters to go through the DS system for submitting scans.

Frankly, I think I lean more towards the other DPers.. the scan images are a means to an end; the end being a correct, well-formatted eBook. HTML, especially, is very flexible in output formats.. I can view it anywhere from a graphing calculator to a Sun Workstation, or run it through a TTS engine and listen to it. Page images are a useful reference to compare the etext against, but are not very flexible.

Now, in general, I _would_ like to see an improvement in the average image quality of included illustrations. I personally preprocessed all of the Potter illustrations (main reason I haven't finished is I don't have the time right now to do it right) before uploading them.. but it is a lot of work, and a skill that takes practice to get right. I still consider myself only an intermediate Photoshop user.

R C

Marcello Perathoner

8:15 a.m.

Robert Cicconetti wrote:

...

Generally I start the beginning material at 1, run until I hit real page numbers, then push 'real' numbered pages to 101. I believe most other PMs use a variant of this.

Then all you need is a tool to batch-rename the files from 1.tif to f0001.tif and 101.tif to p0001.tif. -- Marcello Perathoner webmaster@gutenberg.org
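A minimal sketch of such a batch-rename tool. It assumes the convention Robert describes (front matter saved from 1 upward, printed page 1 saved as 101) and the f/p prefixes from the line above; the offset constant and the function names are hypothetical, and the 900 range Robert uses for out-of-sequence pages is not handled.

```python
from pathlib import Path

BODY_OFFSET = 100  # assumption: printed page 1 was saved as 101.tif


def new_name(old_name: str, body_offset: int = BODY_OFFSET) -> str:
    """Map a numeric scan filename to a prefixed one (1.tif -> f0001.tif)."""
    path = Path(old_name)
    number = int(path.stem)
    if number > body_offset:
        return f"p{number - body_offset:04d}{path.suffix}"
    return f"f{number:04d}{path.suffix}"


def plan_renames(names):
    """Return {old: new} for every purely numeric filename; skip the rest."""
    plan = {}
    for name in sorted(names):
        if Path(name).stem.isdigit():
            plan[name] = new_name(name)
    return plan


def apply_renames(directory, plan):
    """Carry out a rename plan inside one directory."""
    for old, new in plan.items():
        (Path(directory) / old).rename(Path(directory) / new)


if __name__ == "__main__":
    print(plan_renames(["1.tif", "2.tif", "101.tif", "pic0042.png"]))
```

Building the plan first and applying it afterwards lets a volunteer inspect the mapping before any file is touched.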

Marcello Perathoner

8:50 a.m.

Jon Noring wrote:

...

But your system requires that each image, when it is first saved, needs a human being to eyeball the page, determine the publisher supplied page number (if any; may be implied), and then manually save the page using the publisher number.

Not at all. If you have a sheet feeder just extract inserted illustrations and such out-of-sequence stuff (making a note of the page number on the back). Then feed the whole pile starting with page arabic "1" and go drink a coffee. If you are lucky and the feeder doesn't jam you just need to compare the filename of the last file with the last page number. If they jibe you are done. Manually scan the inserted illustration sheets and name the files according to the noted page number. Repeat with roman pages.

...

If it is occasionally necessary to escape characters, then so be it. This is done all the time in URLs.

Then show me how you escape these filenames:

DP12345-00420-II/2.png
DP12346-00017-第十三.png

(I hope those Chinese characters came thru.)

The escaping should work on all known OS including DOS, Windows, Linux, Mac Classic, Mac OS X, Palm etc., should also work as url and should not need renaming when travelling from one OS to another. If you cannot accomplish that, you basically are proposing an archiving system where files will have to be renamed when the OS changes.

-- Marcello Perathoner webmaster@gutenberg.org

D Garcia

7:30 p.m.

On Saturday 23 July 2005 04:50 am, Marcello Perathoner wrote:

...

Jon Noring wrote:

...
But your system requires that each image, when it is first saved, needs a human being to eyeball the page, determine the publisher supplied page number (if any; may be implied), and then manually save the page using the publisher number.

Gee, much like the upcoming DP metadata rounds, which will perform this review, and store the info in a database.

...

Not at all. If you have a sheet feeder just extract inserted illustrations and such out-of-sequence stuff (making a note of the page number on the back). Then feed the whole pile starting with page arabic "1" and go drink a coffee. If you are lucky and the feeder doesn't jam you just need to compare the filename of the last file with the last page number. If they jibe you are done.

Unless you care about the actual physical sequence, which you have just ignored.

...

Manually scan the inserted illustration sheets and name the files according to the noted page number.

Tip-in illustrations generally do not have page numbers.

This discussion has long since become ridiculous.

Scan the pages in physical order, starting from 001.png (.tif, whatever). Create a metadata file 001.pag for each image which contains the image file name or number and the "extra" information, which in this case is really only the printed page number; use "none" if the page is unnumbered. In XML this would be trivial, but if you hate XML (which can easily be read as, or loaded into, a database) then just write a 001.pag file with (i,1,none) as appropriate.

If you're really freaked out by unnumbered pages which have a 'logical' page number (such as the last page of a chapter of fiction, which is typically in the sequence but has no printed number) then have a field for printed page number and one for logical page number. You'll also frequently see front matter done this way, starting at something higher than 'i'.

After preserving the images themselves, preserving the physical sequence is the most important requirement of an image archive. The metadata will always require human attention, but once it's done (and it can be partially automated) later tools to assemble versions in FORMAT_OF_YOUR_CHOICE can take that and do with it what they wish.

Trying to store anything but the physical sequence of the pages in the filename is an unnecessary complication, and probably short-sighted in the long run.
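The sidecar-file idea can be sketched in a few lines. The post leaves the exact layout of a .pag file open, so the one-line, comma-separated layout used here (image name, printed page number, logical page number) is an assumption, as are the function names.

```python
import csv
from pathlib import Path


def write_pag(directory, sequence, image_ext="png",
              printed="none", logical="none"):
    """Write NNN.pag next to the image NNN.<ext> and return its path.

    Assumed layout: one CSV line holding image name, printed page number,
    logical page number; "none" marks a missing value.
    """
    stem = f"{sequence:03d}"
    pag_path = Path(directory) / f"{stem}.pag"
    with pag_path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow([f"{stem}.{image_ext}", printed, logical])
    return pag_path


def read_pag(pag_path):
    """Read a sidecar file back into a small dictionary."""
    with Path(pag_path).open(newline="", encoding="utf-8") as handle:
        image, printed, logical = next(csv.reader(handle))
    return {"image": image, "printed": printed, "logical": logical}


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as scans:
        write_pag(scans, 1)                            # unnumbered cover
        path = write_pag(scans, 7, printed="4/2", logical="4/2")
        print(read_pag(path))
```

Because the printed page string lives inside the file rather than in its name, labels such as "4/2" need no escaping at the filesystem level.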

Jonathan Ingram

8:53 p.m.

--- D Garcia <donovan@abs.net> wrote:

...

Trying to store anything but the physical sequence of the pages in the filename is an unnecessary complication, and probably short-sighted in the long run.

I agree. Keep the filenames simple, and store the information in a separate information file. This will allow us to cope with all the strange page naming schemes which have been used (I have 16th century books which indicate signatures with Greek letters and superscripts, for example).

-- Jon Ingram

Marcello Perathoner

9:02 p.m.

D Garcia wrote:

...

...
Not at all. If you have a sheet feeder just extract inserted illustrations and such out-of-sequence stuff (making a note of the page number on the back). Then feed the whole pile starting with page arabic "1" and go drink a coffee. If you are lucky and the feeder doesn't jam you just need to compare the filename of the last file with the last page number. If they jibe you are done.

Unless you care about the actual physical sequence, which you have just ignored.

Why? I have seen sheet feeders jam, I have seen feeders slurp in two pages at a time but I have never seen a sheet feeder reorder the sequence of the pages I fed into it. Ergo: if I feed the pages in order starting with page 1 and file 1 and end up with page N and file N the files are in the correct physical sequence.

...

...
Manually scan the inserted illustration sheets and name the files according to the noted page number.

Tip-in illustrations generally do not have page numbers.

That is the reason why they get the page number of the preceding "true" page plus a number as suffix.

...

This discussion has long since become ridiculous.

Then push the ignore thread button on your reader instead of adding to it.

...

Trying to store anything but the physical sequence of the pages in the filename is an unnecessary complication, and probably short-sighted in the long run.

Why? You don't give any reasons besides personal preference. -- Marcello Perathoner webmaster@gutenberg.org

David Starner

9:19 p.m.

On 7/23/05, Marcello Perathoner <marcello@perathoner.de> wrote:

...

Why? I have seen sheet feeders jam, I have seen feeders slurp in two pages at a time but I have never seen a sheet feeder reorder the sequence of the pages I fed into it.

Ergo: if I feed the pages in order starting with page 1 and file 1 and end up with page N and file N the files are in the correct physical sequence.

What if it double-fed once and you missed an illustration, or missed the fact that the page numbers weren't entirely regular? Or, you know, aren't using a sheet feeder? I've been bitten more times than I want to count by not checking every page, and sometimes even when checking every page. It's not that trivial.

D Garcia

24 Jul

8 p.m.

On Saturday 23 July 2005 05:02 pm, Marcello Perathoner wrote:

...

...
Unless you care about the actual physical sequence, which you have just ignored.

Why? I have seen sheet feeders jam, I have seen feeders slurp in two pages at a time but I have never seen a sheet feeder reorder the sequence of the pages I fed into it.

Ergo: if I feed the pages in order starting with page 1 and file 1 and end up with page N and file N the files are in the correct physical sequence.

Books almost never start with page 1, even in the arabic numbered sections. Where's your front matter? Back matter? A book which starts at 1 and ends at (1+n) and has no other numbering in it is likely the exception rather than the rule, so it is unrealistic to design a scheme to address only this special case.

...

...
...
Manually scan the inserted illustration sheets and name the files according to the noted page number.

Tip-in illustrations generally do not have page numbers.

That is the reason why they get the page number of the preceding "true" page plus a number as suffix.

Why create this artificial distinction in the first place?

...

...
Trying to store anything but the physical sequence of the pages in the filename is an unnecessary complication, and probably short-sighted in the long run.

Why? You don't give any reasons besides personal preference.

I have, but let's review anyway.

1) Store the physical sequence of the pages, front cover to back cover by simply naming the scan files ascending from 1. Advantages: Sort order guaranteed identical across platforms, no parsing of file name segments required to determine information about the file.

2) Create a corresponding metadata file named with a 1:1 correspondence to hold the other (in this case) numbering information about the scanned image. Advantages: Any additional information about the image file is trivially associated, modifiable and extendible. It could be loaded in a database or converted to XML or other formats trivially, making implementation of meaningful searching that much easier.

Marcello, in your role at PG, you should realize more than many others that storing data in a file name makes it less accessible to programs which could automate much of the common work of maintaining a dataset. It's not impossible to manipulate, but it is cumbersome and has sort order and case issues across platforms.
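To illustrate point 2, here is a sketch of loading per-image numbering information into a database and looking up the image for a printed page number. The table layout, the sample rows and the function names are all hypothetical; an in-memory SQLite database stands in for whatever store a project would actually use.

```python
import sqlite3


def build_index(records):
    """Create an in-memory table of (sequence, image, printed page) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages ("
        "sequence INTEGER PRIMARY KEY, image TEXT NOT NULL, printed TEXT)"
    )
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def find_image(conn, printed):
    """Return the image file holding a given printed page number, if any."""
    row = conn.execute(
        "SELECT image FROM pages WHERE printed = ? ORDER BY sequence",
        (printed,),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    # Made-up rows: an unnumbered cover, one roman page, one arabic page.
    index = build_index([
        (1, "001.png", None),
        (2, "002.png", "i"),
        (3, "003.png", "3"),
    ])
    print(find_image(index, "3"))
    print(find_image(index, "42"))
```

With an index like this, a reader asking for a printed page number is served by a lookup rather than by guessing a filename.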

David Starner

9:29 p.m.

There's another problem with Marcello's scheme: you don't know how the letters will sort in an arbitrary locale. Y, for example, sorts after J in Lithuanian.

Marcello Perathoner

11:48 p.m.

David Starner wrote:

...

There's another problem with Marcello's scheme: you don't know how the letters will sort in an arbitrary locale. Y, for example, sorts after J in Lithuanian.

So what? A Lithuanian will probably use a Lithuanian locale. He will name his files according to Lithuanian custom:

f0001.djvu y0001.djvu p0001.djvu

Then, if he says:

djvm -c 12345.djvu *djvu

the shell will sort the * according to the locale and everybody will be happy.

Remember: the filenames do not determine the view sequence. The view sequence is determined by the order of the files on the commandline when you pack the djvu file. (In fact the djvu file builder assigns each page an internal page number.)

-- Marcello Perathoner webmaster@gutenberg.org

David Starner

25 Jul

10:40 p.m.

On 7/24/05, Marcello Perathoner <marcello@perathoner.de> wrote:

...

So what? A Lithuanian will probably use a Lithuanian locale. He will name his files according to Lithuanian custom:

f0001.djvu y0001.djvu p0001.djvu

Then, if he says:

djvm -c 12345.djvu *djvu

the shell will sort the * according to the locale and everybody will be happy.

And what happens when you add or change images and repack it on your system? It's going to be in the wrong order.

Marcello Perathoner

10:57 p.m.

David Starner wrote:

...

...
So what? A Lithuanian will probably use a Lithuanian locale. He will name his files according to Lithuanian custom:

f0001.djvu y0001.djvu p0001.djvu

Then, if he says:

djvm -c 12345.djvu *djvu

the shell will sort the * according to the locale and everybody will be happy.

And what happens when you add or change images and repack it on your system? It's going to be in the wrong order.

It happens that you will say:

djvm -c 12345.djvu f*djvu y*djvu p*djvu

and everything will be jake.

-- Marcello Perathoner webmaster@gutenberg.org
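The explicit ordering in that command line can also be produced by a script, so that neither the shell's glob expansion nor the locale decides the page order. This sketch takes the stream order f, y, p from the command above; the filename pattern and function names are assumptions, and only the `djvm -c` invocation quoted in the thread is used.

```python
import re

STREAM_ORDER = ["f", "y", "p"]  # order taken from the command line above

_PAGE_FILE = re.compile(r"^([a-z])(\d+)\.djvu$")


def ordered_pages(filenames, stream_order=STREAM_ORDER):
    """Sort page files by stream prefix first, then by page number."""
    rank = {prefix: index for index, prefix in enumerate(stream_order)}
    keyed = []
    for name in filenames:
        match = _PAGE_FILE.match(name)
        if match is None or match.group(1) not in rank:
            raise ValueError(f"unexpected page file: {name!r}")
        keyed.append((rank[match.group(1)], int(match.group(2)), name))
    return [name for _, _, name in sorted(keyed)]


def djvm_command(output, filenames):
    """Build the argument list for packing a multi-page DjVu file."""
    return ["djvm", "-c", output, *ordered_pages(filenames)]


if __name__ == "__main__":
    files = ["p0001.djvu", "y0001.djvu", "f0002.djvu", "f0001.djvu"]
    print(" ".join(djvm_command("12345.djvu", files)))
```

The resulting list could be handed to `subprocess.run` on a machine with djvm installed. Raising an error on an unknown prefix means a file that does not fit the scheme stops the build instead of silently landing in the wrong place.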

David Starner

11:18 p.m.

On 7/25/05, Marcello Perathoner <marcello@perathoner.de> wrote:

...

It happens that you will say:

djvm -c 12345.djvu f*djvu y*djvu p*djvu

and everything will be jake.

If you notice it. More likely, you'll end up with a misordered book.

Marcello Perathoner

24 Jul

11:17 p.m.

D Garcia wrote:

...

Books almost never start with page 1, even in the arabic numbered sections. Where's your front matter? Back matter? A book which starts at 1 and ends at (1+n) and has no other numbering in it is likely the exception rather than the rule, so it is unrealistic to design a scheme to address only this special case.

I guess you should really go and read my RFC before assuming it doesn't accommodate something. It has front, body and cover streams and accommodates up to a total of 26 page numbering streams. Unless you have a book with more than 26 number sequences, you have nothing to complain of.

...

1) Store the physical sequence of the pages, front cover to back cover by simply naming the scan files ascending from 1. Advantages: Sort order guaranteed identical across platforms, no parsing of file name segments required to determine information about the file.

Do you know of some platform that sorts the alphabet in a different way? No? Then sort order can be no problem in my scheme. (I intentionally avoided mixed case prefixes.)

The only information you should get from the filename is what page the file contains. My scheme does not use the filename to determine the order of the pages. The multi-page djvu file keeps track of the order of the pages. If the pages are numbered backwards, fine -- just insert them backwards into the multi-page file. (And they will stay that way on every platform too.)

Disadvantages of your format:

- introduces artificial numbering sequence with completely arbitrary relation to the printed page numbers.
- breaks the law of least astonishment: any user in their right mind who wants to look at page 42 will instinctively go and open 42.tiff. Bumm!
- does not accommodate all the scans without covers and blank pages we already have gathered.
- is brittle and thus not adapted for archiving. The sequence changes if somebody goes back and adds the cover page and blank page scans to an existing set of scans.

How do you handle hardcovers and paperbacks? The paperback editions usually have fewer pages around the covers than the hardcover but are otherwise quite exact copies. Which edition will you follow?

And don't get me started about collections where portions are still copyrighted and scanning those portions would be illegal. Every other year some portion will drop into the public domain (assuming life + N) and will have to be added to the scan set. A nightmare in your scheme and no problem at all in mine.

Your format is very simple, but too simple in many points. As Einstein said: make it as simple as possible but not simpler.
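The brittleness argument can be illustrated with a short Python sketch. The page labels (`f0001`, `p0001`, ...) and the `.tif` names are hypothetical examples modelled on the file names used in this thread. The sketch compares how many existing scans change name under each convention when front-matter scans are added later.

```python
def sequential_names(pages):
    """Name each scan by its position in the current scan set (1-based)."""
    return {page: f"{i:04d}.tif" for i, page in enumerate(pages, start=1)}


def label_names(pages):
    """Name each scan after its own page label."""
    return {page: f"{page}.tif" for page in pages}


def renamed(naming, before, after):
    """List pages from `before` whose file name changes once `after` is the scan set."""
    old, new = naming(before), naming(after)
    return [page for page in before if old[page] != new[page]]


body = ["p0001", "p0002", "p0003"]
with_front = ["f0001", "f0002"] + body  # cover and blank page scanned later

print(renamed(sequential_names, body, with_front))  # every body scan shifts by two
print(renamed(label_names, body, with_front))       # nothing changes
```

Under sequential naming all three existing scans are renamed; under label-based naming none are.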

...

2) Create a corresponding metadata file named with a 1:1 correspondence to hold the other (in this case) numbering information about the scanned image. Advantages: Any additional information about the image file is trivially associated, modifiable and extendible. It could be loaded in a database or converted to XML or other formats trivially, making implementation of meaningful searching that much easier.

My RFC is about people who want to click on a link and have the right page open in their browsers. They don't want to fiddle with XML or databases just to look at some scans. And the browser surely will not look into any XML file to find out which url it should request. Also it's far simpler to keep the real page number in the filename and store the information about the sequence in metadata than the other way round. Djvu files support metadata. You can put all sorts of metadata into the djvu file and nobody will complain. You can build any amount of metadata processing software around my proposed format.

...

Marcello, in your role at PG, you should realize more than many others that storing data in a file name makes it less accessible to programs which could automate much of the common work of maintaining a dataset. It's not impossible to manipulate, but it is cumbersome and has sort order and case issues across platforms.

LOL. You did fire all your heavy guns at the pumpkin patch again. My filenames are just designed to be:

- mnemonic,
- unique per book and
- permanent.

No ordering is derived from the filenames at all. The djvu file keeps the ordering. And my RFC doesn't even prescribe what you should do with empty pages. You may drop them or keep them at will (replaced by a notice, that is).

I'd have been very glad if PG had stored the etext no. in the filename for the books before #10k, instead of making up a completely arbitrary and unintelligible string, so I had to go thru GUTINDEX to find out what was what. That would have saved me weeks of programming. And now you come along and propose the same error again: to use a completely arbitrary number as filename instead of the real page number and to have to grovel thru an XML file to find out which file to open for page 42.

-- Marcello Perathoner webmaster@gutenberg.org

David Starner

11:47 p.m.

On 7/24/05, Marcello Perathoner <marcello@perathoner.de> wrote:

...

Do you know of some platform that sorts the alphabet in a different way?

Yes; Linux (and probably many other systems), under a Lithuanian locale.

Marcello Perathoner

11:58 p.m.

David Starner wrote:

...

...
Do you know of some platform that sorts the alphabet in a different way?

Yes; Linux (and probably many other systems), under a Lithuanian locale.

Yes. You are right. My bad. I didn't verify this because it is immaterial to my proposed scheme. As I wrote in the next paragraph: "My scheme does not use the filename to determine the order of the pages. The multi-page djvu file keeps track of the order of the pages."

-- Marcello Perathoner webmaster@gutenberg.org

D Garcia

25 Jul

12:55 a.m.

On Sunday 24 July 2005 07:17 pm, Marcello Perathoner wrote:

...

I guess you should really go and read my RFC before assuming it doesn't accommodate something. It has front, body and cover streams and accommodates up to a total of 26 page numbering streams. Unless you have a book with more than 26 number sequences, you have nothing to complain of.

Do you know of some platform that sorts the alphabet in a different way? No? Then sort order can be no problem in my scheme. (I intentionally avoided mixed case prefixes.)

Yes, I do, and clearly so do others here. Thanks to those for saving me the trouble of looking up examples.

...

The only information you should get from the filename is what page the file contains. My scheme does not use the filename to determine the order of the pages. The multi-page djvu file keeps track of the order of the pages. If the pages are numbered backwards, fine -- just insert them backwards into the multi-page file. (And they will stay that way on every platform too.)

Yes, but we're not talking about presentation formats, we're talking about storing data in archival formats to generate presentation formats. And consistency simplifies many of those tasks.

...

Disadvantages of your format:

- introduces artificial numbering sequence with completely arbitrary relation to the printed page numbers.

This is only an issue if you feel that the printed page numbers must be in the file name or be the file name. It's almost as if you've never heard of or completely ignored the programmatic advantages of things like hashes or linked lists.
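A hash-based lookup of the kind alluded to here could look like the following Python sketch. The record layout, the stream names, and the file names are all hypothetical; the thread does not define a metadata format. The point is only that a printed page label can be resolved to a sequentially named scan without the label appearing in the file name.

```python
# Hypothetical per-scan metadata: sequential file name plus printed page label.
records = [
    {"file": "0001.tif", "stream": "cover", "printed": None},
    {"file": "0002.tif", "stream": "front", "printed": "i"},
    {"file": "0003.tif", "stream": "front", "printed": "ii"},
    {"file": "0004.tif", "stream": "body", "printed": "1"},
    {"file": "0005.tif", "stream": "body", "printed": "2"},
]

# Hash from (stream, printed label) to file name; unnumbered pages are skipped.
index = {
    (rec["stream"], rec["printed"]): rec["file"]
    for rec in records
    if rec["printed"] is not None
}


def find_scan(stream, printed):
    """Return the scan file for a printed page label, or None if it is not present."""
    return index.get((stream, printed))


print(find_scan("body", "2"))
print(find_scan("front", "ii"))
```

Keying on the stream as well as the label keeps front-matter page "1" distinct from body page "1" if a book numbers both.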

...

- breaks the law of least astonishment: any user in their right mind who wants to look at page 42 will instinctively go and open 42.tiff. Bumm!

But in your scheme, the end user won't see 42.tif in the raw data, they will see a cryptic melange of alphanumerics which contains 42 somewhere in it.

...

- does not accommodate all the scans without covers and blank pages we already have gathered.

See below.

...

- is brittle and thus not adapted for archiving. The sequence changes if somebody goes back and adds the cover page and blank page scans to an existing set of scans.

This only applies to legacy data, and while true that adjustments would need to be made, they can be handled by tools which also update the metadata. It's not as if data inserts and record renumbering were virgin territory. Your "store it in the filename" convention is as-or-more brittle in this same respect, without the advantages of external metadata.
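A tool of the kind described, one that inserts a scan and updates the metadata in the same step, might be sketched in Python as follows. The record fields and the four-digit `.tif` naming are assumptions for illustration. The function works on in-memory records and returns a rename plan rather than touching the file system.

```python
def insert_scan(records, position, new_record):
    """Insert new_record at a 1-based position and renumber.

    Returns (new_records, renames). `renames` lists (old_file, new_file)
    pairs from the highest number down, so applying them in that order
    never overwrites a file that has not been moved yet.
    """
    ordered = records[:position - 1] + [new_record] + records[position - 1:]
    new_records, renames = [], []
    for seq, rec in enumerate(ordered, start=1):
        new_file = f"{seq:04d}.tif"
        old_file = rec.get("file")
        if old_file is not None and old_file != new_file:
            renames.append((old_file, new_file))
        new_records.append({**rec, "file": new_file})
    renames.reverse()
    return new_records, renames


scans = [
    {"file": "0001.tif", "printed": "1"},
    {"file": "0002.tif", "printed": "2"},
]
cover = {"file": None, "printed": None}  # newly scanned cover, not yet named

updated, plan = insert_scan(scans, 1, cover)
print([rec["file"] for rec in updated])
print(plan)
```

The printed page labels travel with their records, so only the sequential file names change.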

...

How do you handle hardcovers and paperbacks? The paperback editions usually have fewer pages around the covers than the hardcover but are otherwise quite exact copies. Which edition will you follow?

Seems to me that the one which was scanned (and we mostly DO keep publisher/edition information at DP) would be the logical one to follow. That's a straw man issue, Marcello. Most of us know how to handle multiple editions of books.

...

And don't get me started about collections where portions are still copyrighted and scanning those portions would be illegal. Every other year some portion will drop into the public domain (assuming life + N) and will have to be added to the scan set. A nightmare in your scheme and no problem at all in mine.

Hardly a nightmare at all, the very reason in fact to store the metadata OUTSIDE of the filename, because it is volatile, although you claim in several places that the filename under your scheme is permanent. In my scheme, you would simply have a field which indicated the date the information on the page went into public domain (including future) and the tools can automatically insert either the page, or a placeholder image explaining that the page isn't in PD until X. Since PG is legally a library, it is fine for them to store (but not distribute) copyrighted material. (IANAL) My suggestion would allow automatic inclusion of such material as it became PD without any intervention. Yours doesn't.
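The date-field idea can be sketched in a few lines of Python. The field name `public_domain_from`, the file names, and the dates are hypothetical; the sketch only shows how an assembly step could choose between a page and a placeholder based on the date it is run.

```python
from datetime import date

# Hypothetical metadata: each scan records when its content enters the public domain.
records = [
    {"file": "0001.tif", "public_domain_from": date(1990, 1, 1)},
    {"file": "0002.tif", "public_domain_from": date(2030, 1, 1)},
    {"file": "0003.tif", "public_domain_from": date(1990, 1, 1)},
]


def assemble(records, today):
    """Return the page list for a presentation copy built on `today`."""
    pages = []
    for rec in records:
        pd_date = rec["public_domain_from"]
        if pd_date <= today:
            pages.append(rec["file"])
        else:
            # Stand-in for a placeholder image explaining the omission.
            pages.append(f"placeholder: not in the public domain until {pd_date.isoformat()}")
    return pages


print(assemble(records, date(2005, 7, 25)))
print(assemble(records, date(2030, 1, 1)))
```

Re-running the assembly after the recorded date includes the page without any change to the archive or its file names.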

...

Your format is very simple, but too simple in many points. As Einstein said: make it as simple as possible but not simpler.

Einstein also said "We can't solve problems by using the same kind of thinking we used when we created them." PG used to store file version information in the cryptic 8.3 filenames. You wish to store page numbering information in filenames. I can't see what makes your scheme any less susceptible to the problems that were (eventually) realized with the former.

...

...
2) Create a corresponding metadata file named with a 1:1 correspondence to hold the other (in this case) numbering information about the scanned image. Advantages: Any additional information about the image file is trivially associated, modifiable and extendible. It could be loaded in a database or converted to XML or other formats trivially, making implementation of meaningful searching that much easier.

My RFC is about people who want to click on a link and have the right page open in their browsers. They don't want to fiddle with XML or databases just to look at some scans. And the browser surely will not look into any XML file to find out which url it should request.

You appear to be talking at cross purposes with yourself. The archive format and the presentation format are completely separate things, though in a sense they do drive certain requirements of each other. The end user wouldn't have to "fiddle with XML" under my scheme. The data representation I suggested is for the archive of image scans and data, not the presentation of the data.

Your perception error is the assumption that the raw data would be what is presented to the user. The scans and metadata are stored simply in my scheme to reduce the programming effort required to deliver files in whatever format is desired, given a routine coded to assemble the data in that format. No reader ever need see the metadata.

You're attacking a strawman of your own creation, presumably because I said "XML", though I never said that it would/should ever be in that particular format. In fact, were it not for the fact that PG's database doesn't seem to support very many simultaneous connections, I'd recommend that the metadata be stored there (i.e., in a database).

...

Also it's far simpler to keep the real page number in the filename and store the information about the sequence in metadata than the other way round. Djvu files support metadata. You can put all sorts of metadata into the djvu file and nobody will complain. You can build any amount of metadata processing software around my proposed format.

Far simpler how? Filenames should not be data repositories, though they can be abused as such. Data which you wish to manipulate in arbitrary fashion is better stored in a format or schema which supports random access natively. See above.

...

LOL. You did fire all your heavy guns at the pumpkin patch again. My filenames are just designed to be:

- mnemonic,
- unique per book and
- permanent.

To which I counter: arbitrary and unintelligible, irrelevant if they are in the directory structure of the ebook (say scans/), and inflexible. These are not "heavy guns," though it is a different viewpoint from yours. I have no idea what you mean about a "pumpkin patch," and have never exhibited a desire to engage in armed combat with gourds, decorative, edible, or any other arbitrary vegetable and/or fruit. :)

...

No ordering is derived from the filenames at all. The djvu file keeps the ordering. And my RFC doesn't even prescribe what you should do with empty pages. You may drop them or keep them at will (replaced by a notice, that is).

And so could the software which assembles the presentation format that the end user sees, just as I described above concerning interspersed non-PD material.

...

I'd have been very glad if PG had stored the etext no. in the filename for the books before #10k, instead of making up a completely arbitrary and unintelligible string, so I had to go thru GUTINDEX to find out what was what. That would have saved me weeks of programming. And now you come along and propose the same error again: to use a completely arbitrary number as filename instead of the real page number and to have to grovel thru an XML file to find out which file to open for page 42.

Already addressed above, though you still disregard that you're making the exact same mistake by storing everything in the filename. As my Canadian friends might say, I'm sorry I don't have anything to apologize for today. I'm just trying to point out alternatives, and the strengths and weaknesses of various approaches, as I see them. It's not personal.

Robert Cicconetti

2:06 a.m.

On 7/24/05, D Garcia <donovan@abs.net> wrote:

...

These are not "heavy guns," though it is a different viewpoint from yours. I have no idea what you mean about a "pumpkin patch," and have never exhibited a desire to engage in armed combat with gourds, decorative, edible, or any other arbitrary vegetable and/or fruit. :)

/me blinks. Donovan.. never exploded a pumpkin? Fired at a watermelon? Played Gallagher? Operated a pumpkin chucker? Or even wanted to? I think I'm disappointed. R C

D Garcia

10:28 p.m.

On Sunday 24 July 2005 10:06 pm, Robert Cicconetti wrote:

...

Donovan.. never exploded a pumpkin? Fired at a watermelon? Played Gallagher? Operated a pumpkin chucker? Or even wanted to? I think I'm disappointed.

Used them as projectiles, yes. As targets, no. (which is what Marcello was trying to express, I believe). Now if they were being pumpkin-chucked AND available as targets (think "pumpkin skeet") then that might well be a different story. :)

Carlo Traverso

3:16 a.m.

I come in late, having been absent the last 10 days, in which the discussion has grown to 200 posts that I have only skimmed. So I apologize if I repeat something that has already been said.

1. Have you looked at the structure of the gallica images, at http://gallica.bnf.fr? IMHO their organization is now quite nice. They are preserved as B/W compressed TIFF files, and served in different ways. After identifying the book (their catalogue search is not yet at the level of the rest) you get a book page that allows different consultation methods:

a) Notice: contains the book's catalogue card
b) table des matieres: a TOC, with links to the starting pages of the chapters
c) pagination: contains a list of all the pages for direct access
d) chemin de fer: contains thumbnails of a selection of consecutive pages
e) texte seul, plein ecran: allows random access without the extra overhead of b), c), d)
f) telecharger: allows downloading of a range of pages, either as multipage tiff, or as pdf (encapsulating the same tiff pages)
g) reproduire: allows ordering a paper copy.

I believe that their experience, built up over the years, might be significant (we should of course add links from and to the text, which they don't have). They might well be willing to cooperate with us if approached suitably (that IMHO includes addressing them in French ...), allowing us to learn something of the internals of their work, which we might copy. Many gallica books have been used for DP - at least 1000 I believe, I alone have provided about 400 - and it would be nice to do something that can be integrated in their structure.

2. It has been stated that the standard of academic quotations is to look at a reference edition, and that you have to physically consult the reference edition to check the page numbers. This is not true: you can also consult a different reprint that includes the page numbers of a reference edition; and this can just as well be done with a PG etext, provided that it includes these page numbers. Some academic books even contain different "original" page numbers to allow different reference schemes, when there is more than one reference edition (or reference manuscript).

Carlo Traverso

David Starner

23 Jul

2:18 a.m.

On 7/22/05, Marcello Perathoner <marcello@perathoner.de> wrote:

...

While scanning you have no feedback on the correctness of your scanning. You are scanning page "42" and saving to file "58.tif". There is no immediate relation between the page you are putting on the scanner and the filename you are saving it under.

That is true. That will continue to be true whatever standard is agreed upon, since people aren't going to suddenly change their scanning patterns based on what the final product, a long way down the line, looks like.
