re: [gutvol-d] Scan file naming -- another comment

jon said:
The system I propose for scan file naming is *simpler* than yours and more flexible
no, it isn't. not on either count. the mere act of keeping track of a single name when it contains two -- or more -- variables in it becomes much more difficult than it needs to be. put together a half-dozen books each containing hundreds of scan-files using your verbose names and you'll find yourself drowning in the confusion. your pace would fall to a crawl. try it. you'll see. d.p. scanners can't afford to work at such a pace.
my filenaming convention has grown out of my experience through the entire digitizing process. if it would have needed to be more complicated, i would have learned that by now and made it so.
you can always make a system more complex. the smart thing -- which experience teaches -- is to realize when the increment in _cost_ will return to you a sufficient increment in _benefit_. and when it will not.
you have taken the simple-and-useful principle that it is good to know the contents of a file from its name, and blown it past the point where it is cost-beneficial. there are all kinds of information that you _could_ put into the filename, which _might_ be useful at some point. but if it makes the process of dealing with the filename too unwieldy, it ain't worth it. the trick is to know when to stop.
It also integrates better into the QC system.
you don't even have a good idea what a quality-control system might look like, let alone knowledge of problems that might crop up with each particular type of system. you might _think_ you do. but -- as is typical with you -- you don't know what you don't know. your knowledge has not been tempered by the big face-slap of the real-world. nor have you programmed the _apps_ that could implement such a quality-control system. when you get to that stage, then come back and we can have this discussion again, jon.
2) During the next stage where a human being is looking at each scan, they append the *actual* publisher supplied page number (or string) to the filename from (1). No need to add any letter prefixes or anything -- they use the *actual* string "as it is".
and here's a good illustration of your lack of knowledge because of an absence of experience, combined with an ignorance of the kinds of tasks that the machine can do. (your willingness to _discuss_ an issue is quite admirable, jon, really. but that alone can only take a person so far.)
in a properly designed system, the human being should _not_ have to mess with filenames at all, or only on the rare occasions of mondo weirdness. that's why i told david and all the other scanners to keep on doing whatever they are now doing, because i can deal with their stuff after-the-fact. (well, there are _some_ things i wish they would do; but it has nothing to do with the type of useless tripe you want them to have to deal with, that's for sure.)
specifically, it's easy to write a routine that looks in the o.c.r. results to find the page-number of the page. (in general, i don't say something is "easy" unless i've already done it myself. because i have learned that many things that seem like they should be easy are not. i've already written this routine. it was easy to write.)
if you're writing a tool to clean up a scan-set, as i am, it's pretty much _required_ that you write this routine, because you need to delete that number from the text. but before you delete it (as that will be among the last things that your tool does), you can use it to _rename_ both the o.c.r. file and the scan-file. indeed, i basically recommend that this file-renaming be one of the _first_ things you do during clean-up, because it will usually be _so_ much easier to deal with the other clean-up tasks when the filenames and the page-numbers match up.
of course, what you should have done in the first place is ensured that the scans were _auto-named_ correctly, setting the auto-name counter at 1 when scanning page 1, and then next scanning each numbered page in sequence, only going back afterwards to scan out-of-sequence pages.
but accidents do happen, so i programmed this routine that renames your o.c.r. files and scan-files if needed. assuming you've got a clean set of scans and the page-numbering doesn't have many anomalies, it won't take you more than a minute or two to review the new names and approve a mass change.
so, once again, jon, you've made a mountain out of a molehill, and then put together a baroque "plan" on how to scale it...
-bowerbird
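
The renaming routine described in the message above is not shown anywhere in the thread. What follows is only a minimal sketch of the idea in Python. It assumes that each OCR text file sits next to a scan with the same stem, that the printed page number stands alone on the first or last non-empty line of the OCR text, and that pages are renamed to a pattern like p012.png. The function names, the naming pattern, and the file extensions are all hypothetical and are not bowerbird's actual code. The rename plan is built first so that it can be reviewed before any file is touched.

```python
import re
from pathlib import Path

# Assumption: a page number stands alone on a line, optionally wrapped in
# dashes or brackets, e.g. "42", "- 42 -" or "[42]".
PAGE_RE = re.compile(r"^[\[\(\-\s]*(\d{1,4})[\]\)\-\s]*$")


def find_page_number(ocr_text):
    """Return the printed page number found in the OCR text, or None."""
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    if not lines:
        return None
    # Only the first and last non-empty lines are checked (header or footer).
    for candidate in (lines[0], lines[-1]):
        match = PAGE_RE.match(candidate)
        if match:
            return int(match.group(1))
    return None


def plan_renames(folder, scan_ext=".png", text_ext=".txt"):
    """Build a list of (old_path, new_path) pairs without touching any file."""
    plan = []
    seen = set()
    for text_path in sorted(Path(folder).glob("*" + text_ext)):
        text = text_path.read_text(encoding="utf-8", errors="replace")
        number = find_page_number(text)
        if number is None or number in seen:
            continue  # unnumbered or duplicate number: leave for a human
        seen.add(number)
        new_stem = f"p{number:03d}"  # hypothetical naming pattern
        if new_stem == text_path.stem:
            continue  # already named correctly
        for ext in (text_ext, scan_ext):
            old = text_path.with_suffix(ext)
            if old.exists():
                plan.append((old, old.with_name(new_stem + ext)))
    return plan


def apply_renames(plan):
    """Apply an approved plan in two passes so that swaps cannot collide."""
    sources = {old for old, _ in plan}
    for _, new in plan:
        # Refuse to overwrite a file that is not itself being renamed.
        if new.exists() and new not in sources:
            raise FileExistsError(str(new))
    temps = []
    for old, new in plan:
        tmp = old.with_name(old.name + ".renaming")
        old.rename(tmp)
        temps.append((tmp, new))
    for tmp, new in temps:
        tmp.rename(new)
```

A typical run would call plan_renames on a scan folder, print the pairs for review, and call apply_renames only once the list looks right. Roman-numeral front matter and unnumbered plates are deliberately skipped in this sketch and left for manual handling.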

Bowerbird wrote:
jon said:
The system I propose for scan file naming is *simpler* than yours and more flexible
so, once again, jon, you've made a mountain out of a molehill, and then put together a baroque "plan" on how to scale it...
Well, we've both had our say, so let the few others who are following this topic decide for themselves. There are, of course, other possible filenaming systems. Jon

On Fri, 22 Jul 2005, Jon Noring wrote:
Well, we've both had our say, so let the few others who are following this topic decide for themselves. There are, of course, other possible filenaming systems.
For example, I've recently been learning the Cyrillic alphabet. Why don't we develop a system where we use the Old Church Slavonic mapping of numeric values to Cyrillic characters and then... Oh. Wait. That might be too complicated to be practical.
Andrew
(If you look closely, you may find a slight amount of sarcasm in this message.)

On Fri, Jul 22, 2005 at 04:32:04PM -0600, Jon Noring wrote:
Bowerbird wrote:
jon said:
The system I propose for scan file naming is *simpler* than yours and more flexible
so, once again, jon, you've made a mountain out of a molehill, and then put together a baroque "plan" on how to scale it...
Well, we've both had our say, so let the few others who are following this topic decide for themselves. There are, of course, other possible filenaming systems.
Jon
Hmmm, did someone form an executive decision committee while I was skimming through this thread? The "say" is that we're all anxiously awaiting any completed eBooks demonstrating any or all ways of doing things. There will not be any decision about the official way of doing things (if there ever is) until some people create demonstrations. It's clear that light programming at gutenberg.org to make a given demo work is available. So is server space there or elsewhere. There are plenty of images available, too -- for example, the gallica.fr and Canadiana and Internet Archive sources often used as input to DP, as well as "actual" DP scans. I will be happy to insert samples into the PG collection, even if they're not compliant with whatever turns out to be the "official" method. If there are some sample project pages mentioned earlier in the discussion, I apologize for missing them. What I've seen is some thoughtful and reasonable proposals, and a few that don't seem quite so viable... what's next is *not* executive decision based on the proposals. Instead, actual eBooks implementing the proposals are needed. -- Greg

Greg wrote:
If there are some sample project pages mentioned earlier in the discussion, I apologize for missing them. What I've seen is some thoughtful and reasonable proposals, and a few that don't seem quite so viable... what's next is *not* executive decision based on the proposals. Instead, actual eBooks implementing the proposals are needed.
The current bottleneck is with DP, since they hold the lion's share of the page scans, but from what I understand cannot release them at this time (for a couple of reasons, most notably that they are up to their eyeballs with other more pressing issues.) I think it would be unwise to implement a final system until DP's scans can be released for linking to PG texts. Of course, in the meanwhile some experimentation can be done -- maybe DP will release a few random scan sets to better understand the issues (such as needed volunteer help) that need to be resolved in adapting DP's scans for linkage to PG texts (and for linking to page scans, whether standalone or within an encapsulation format such as DjVu.) Jon

On Sat, 23 Jul 2005 23:24:47 -0600, Jon Noring <jon@noring.name> wrote:
| Greg wrote:
|
| > If there are some sample project pages mentioned earlier in the
| > discussion, I apologize for missing them. What I've seen is some
| > thoughtful and reasonable proposals, and a few that don't seem quite so
| > viable... what's next is *not* executive decision based on the
| > proposals. Instead, actual eBooks implementing the proposals are
| > needed.
|
| The current bottleneck is with DP, since they hold the lion's share of
| the page scans, but from what I understand cannot release them at this
| time (for a couple of reasons, most notably that they are up to their
| eyeballs with other more pressing issues.)
|
| I think it would be unwise to implement a final system until DP's scans
| can be released for linking to PG texts. Of course, in the meanwhile
| some experimentation can be done -- maybe DP will release a few random
| scan sets to better understand the issues (such as needed volunteer
| help) that need to be resolved in adapting DP's scans for linkage to
| PG texts (and for linking to page scans, whether standalone or within
| an encapsulation format such as DjVu.)
My scans are sitting somewhere in backup, as no doubt are many people's scans. When a system is working, just ask.
-- Dave Fawthrop <dave hyphenologist co uk>
In Case of Emergency Store the word "ICE" in your mobile phone address book, and against it enter the number of the person you would want to be contacted "In Case of Emergency". http://tinyurl.com/79lz9

On Sat, 23 Jul 2005, Jon Noring wrote:
can be released for linking to PG texts. Of course, in the meanwhile some experimentation can be done -- maybe DP will release a few random scan sets to better understand the issues (such as needed volunteer help) that need to be resolved in adapting DP's scans for linkage to PG texts (and for linking to page scans, whether standalone or within an encapsulation format such as DjVu.)
DP doesn't have to release the scans. I'm not jumping into the argument, but I have scans for all of the books that I have sent through DP if someone else wants to play with them. I even have scans for two that didn't go through DP.
http://durendal.org:8080/books.html
If anyone wants to play with them, let me know which books and I'll tar the set up and give you the URL.
-- Greg Weeks http://durendal.org:8080/greg/

Greg Weeks wrote:
Jon Noring wrote:
can be released for linking to PG texts. Of course, in the meanwhile some experimentation can be done -- maybe DP will release a few random scan sets to better understand the issues (such as needed volunteer help) that need to be resolved in adapting DP's scans for linkage to PG texts (and for linking to page scans, whether standalone or within an encapsulation format such as DjVu.)
DP doesn't have to release the scans. I'm not jumping into the argument, but I have scans for all of the books that I have sent through DP if someone else wants to play with them. I even have scans for two that didn't go through DP.
http://durendal.org:8080/books.html
If anyone wants to play with them, let me know which books and I'll tar the set up and give you the URL.
Great! Of course, a lot of the scans submitted to DP will have to come from DP. There is also the DP database, and the desire that the scans not be "unlinked" from the database and its metadata. I urge that any scans donated to PG be used only for prototype development and testing, rather than the real thing, until DP is able to devote resources to assist with making its scans public. It is important that whatever system is put in place on the PG end readily adapts to DP's system, otherwise more unnecessary work may result for both PG and DP. Jon

I have a fair number of scans that I can send to anyone who wants them also. Geoff
participants (7)
- Andrew Sly
- Bowerbird@aol.com
- Dave Fawthrop
- Geoff Horton
- Greg Newby
- Greg Weeks
- Jon Noring