here are the suggestions that should be accepted

jon said:
Well, we've both had our say, so let the few others who are following this topic decide for themselves.
how about instead we have people forget about what you said so they can pay attention to what i said? :+) the system you want to lay on people is so convoluted from the standpoint of their _actual_ workflow right now that they reject what you say out of hand, and end up unwilling to discuss the topic. and that makes it all the more difficult to get them to accept reasonable ideas. as david put it:
That is true. That will continue to be true what ever standard is agreed upon, since people aren't going to suddenly change their scanning patterns based on what the final product, a long ways down the line, looks like.
now, the reality of the situation is that a change in their "scanning patterns" might well _improve_ their workflow, make them more efficient, and so on. but since you don't understand that workflow -- and since they know that you don't -- you can't really hope to talk at that level. i'm sorry to have to keep on harping that you don't know what you're talking about. i'm aware that some people hear that and think i'm some mean person dumping on you. the truth of the matter is i don't like saying it. i don't like that you make it necessary to do so. but _somebody_ has to level with you, jon. take another example, from your latest post:
But your system requires that each image, when it is first saved, needs a human being to eyeball the page, determine the publisher supplied page number (if any; may be implied), and then manually save the page using the publisher number.
so i tell you again that you need to look at the auto-naming feature in your scanning program. it saves the page automatically, then increments the page-counter in anticipation of the next file. (if you don't have such a capability, you really need to get yourself a more modern program. but since it was common even a decade ago, i assume that your program has this feature.)

auto-naming is what most all d.p. scanners use. they don't enter a name for each individual file, as that would waste energy and slow them down. they get into a dream routine where their body is flying in a ritual of page-turning and positioning. some of 'em can scan hundreds of pages an hour. if you had scanned in more than a few books, and scanned 'em in the manner that d.p. scanners do, you would know this, and you wouldn't say things so inconsistent with the reality of their processes.

***

here's what _really_ needs to be done to take page-image scan-sets public...

1. scanning should be done such that the auto-naming of the scanned images fits the outcome we want them to have. however, if the scanner for some reason cannot or did not do this, it's no problem, as our tools can rename the files _easily_. i developed such a renamer using an e-text that had unnumbered plates and blank pages (some of which were scanned and some not), plus a whole raft of missing pages (part of an english/hawaiian facing-translations section), and so on. and the renaming tool worked fine.

2. quality-control needs to be done _during_ the process of scanning, so errors (like missing pages or duplicates) are caught when they can be fixed easily. i believe d.p. scanners have learned this. you do a visual inspection of each page, to make sure it is acceptable quality, and then a check where you step through all of the images checking their page-numbers to confirm you got every single one of 'em. again, regular d.p. scanners do this already, because they've learned that it's worthwhile. finding bad pages after the fact = pain in the ass.

3. all of the scans need to be _cleaned_. they need to be deskewed and regularized, both in terms of size, and placement of text. since this job can be done fairly automatically, without much need for human intervention, and since it will help the o.c.r. accuracy, it should be done _before_ the o.c.r. the original files can be discarded; the foibles of scanning are unimportant. i have not lobbied on behalf of such cleanup before because i figured d.p. would learn it sooner or later. (now if only the large scanning projects would too!) but if you take scans public, they must be cleaned. but again, if this wasn't done originally, it can easily be done later, since it is generally rather automatic.

4. the o.c.r. process should output individual files, so as to retain the linkage between text and its scan. at this point, any weirdnesses in the page-numbering can be addressed, and blank scans inserted if needed. i won't get into the reasons why now, but these files should have all styling (margins, italics, etc.) saved! folks, do _not_ strip your o.c.r. files to plain-ascii!

5. the posted e-text-files -- ascii, html, all of 'em -- need to have the page-breaks indicated in some way, so that end-users and their tools can easily correlate the text with the exact page-scan from which it came. so pagebreaks must be retained every step of the way! stop discarding this useful information from the files! if you need input on how to represent them, ask me!

6. all the way through the cleaning-up of the text, a person should be able to deal with one page of text _or_ with the entire book, whichever is convenient for the particular clean-up task at hand at the time. this is, of course, a function of the _tools_ they use, rather than the scan-set per se, but i mention it now since the implication is you'll need to be receptive to feedback and requests you get from your toolmakers; an example of that is my input on naming of the files.

7. every page-scan should be accessible separately. bundles -- such as djvu and .zip -- can also be made, but the world must be able to grab each page-scan _individually_ when one is all that they want to see; it's unreasonable to expect 'em to download 'em all.

8. a system of "continuous proofreading" should be initiated to make the best use of these public scans, as it will leverage the immense power of user eyeballs to march the e-texts toward an error-free perfection.

9. any all-image-format should be easily ported to the current and future platforms that can utilize it, like the sony playstation portable and the nokia 770. it would be great if you offered such porting online. i'll be offering it as a feature in my viewer-program, with the text-file as input, but it would be nice if you could offer it using the original page-scans as well...

10. whenever possible, the converted e-text should accompany the scans, to overcome the weaknesses of an image-format (e.g., the lack of searchability, the inability to copy-and-paste text, and so forth...).

-bowerbird
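
a note on point 2: the "step through all of the images checking their page-numbers" check can be roughed out in a few lines of python. this is only a sketch, not the renaming tool mentioned in point 1 or anybody's actual workflow. it assumes a made-up naming convention where the auto-namer writes files like "p001.png" (the letter p, a page number, then .png); the pattern `PAGE_RE` and the function `check_scan_set` are hypothetical names picked for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: auto-named scans look like "p001.png". adjust the pattern
# to whatever your scanning program actually writes.
PAGE_RE = re.compile(r"^p(\d+)\.png$")


def check_scan_set(filenames):
    """Return (missing, duplicates, unrecognized) for a list of scan names.

    missing      -- page numbers absent between the lowest and highest found
    duplicates   -- page numbers that more than one filename maps to
    unrecognized -- filenames that do not fit the assumed naming pattern
    """
    counts = Counter()
    unrecognized = []
    for name in filenames:
        match = PAGE_RE.match(name)
        if match:
            counts[int(match.group(1))] += 1
        else:
            unrecognized.append(name)

    if not counts:
        return [], [], unrecognized

    first, last = min(counts), max(counts)
    missing = [n for n in range(first, last + 1) if n not in counts]
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    return missing, duplicates, unrecognized


if __name__ == "__main__":
    names = ["p001.png", "p002.png", "p004.png", "p04.png", "cover.png"]
    missing, duplicates, unrecognized = check_scan_set(names)
    print("missing pages:   ", missing)
    print("duplicate pages: ", duplicates)
    print("unrecognized:    ", unrecognized)
```

run against a directory listing while the book is still on the scanner, a check like this flags gaps and doubles when they are still cheap to fix. it does not replace the visual inspection of each page, and it knows nothing about unnumbered plates or blank pages; those would need their own naming rule.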
participants (1)
-
Bowerbird@aol.com