July 2005 - gutvol-d - lists.pglaf.org

re: [gutvol-d] Scan file naming -- another comment
by Bowerbird＠aol.com 24 Jul '05

24 Jul '05

jon said: > The system I propose for scan file naming > is *simpler* than yours and more flexible no, it isn't. not on either count. the mere act of keeping track of a single name when it contains two -- or more -- variables in it becomes much more difficult than it needs to be. put together a half-dozen books each containing hundreds of scan-files using your verbose names and you'll find yourself drowning in the confusion. your pace would fall to a crawl. try it. you'll see. d.p. scanners can't afford to work at such a pace. my filenaming convention has grown out of my experience through the entire digitizing process. if it would have needed to be more complicated, i would have learned that by now and made it so. you can always make a system more complex. the smart thing -- which experience teaches -- is to realize when the increment in _cost_ will return to you a sufficient increment in _benefit_. and when it will not. you have taken the simple-and-useful principle that it is good to know the contents of a file from its name, and blown it past the point where it is cost-beneficial. there are all kinds of information that you _could_ put into the filename, which _might_ be useful at some point. but if it makes the process of dealing with the filename too unwieldy, it ain't worth it. the trick is to know when to stop. > It also integrates better into the QC system. you don't even have a good idea what a quality-control system might look like, let alone knowledge of problems that might crop up with each particular type of system. you might _think_ you do. but -- as is typical with you -- you don't know what you don't know. your knowledge has not been tempered by the big face-slap of the real-world. nor have you programmed the _apps_ that could implement such a quality-control system. when you get to that stage, then come back and we can have this discussion again, jon. > 2) During the next stage where a human being is > looking at each scan, they append the *actual* > publisher supplied page number (or string) > to the filename from (1). No need to add any > letter prefixes or anything -- they use the *actual* > string "as it is". and here's a good illustration of your lack of knowledge because of an absence of experience, combined with an ignorance of the kinds of tasks that the machine can do. (your willingness to _discuss_ an issue is quite admirable, jon, really. but that alone can only take a person so far.) in a properly designed system, the human being should _not_ have to mess with filenames at all, or only on the rare occasions of mondo weirdness. that's why i told david and all the other scanners to keep on doing whatever they are now doing, because i can deal with their stuff after-the-fact. (well, there are _some_ things i wish they would do; but it has nothing to do with the type of useless tripe you want them to have to deal with, that's for sure.) specifically, it's easy to write a routine that looks in the o.c.r. results to find the page-number of the page. (in general, i don't say something is "easy" unless i've already done it myself. because i have learned that many things that seem like they should be easy are not. i've already written this routine. it was easy to write.) if you're writing a tool to clean up a scan=set, as i am, it's pretty much _required_ that you write this routine, because you need to delete that number from the text. but before you delete it (as that will be among the last things that your tool does), you can use it to _rename_ both the o.c.r. file and the scan-file. indeed, i basically recommend that this file-renaming be one of the _first_ things you do during clean-up, because it will usually be _so_ much easier to deal with the other clean-up tasks when the filenames and the page-numbers match up. of course, what you should have done in the first place is ensured that the scans were _auto-named_ correctly, setting the auto-name counter at 1 when scanning page 1, and then next scanning each numbered page in sequence, only going back afterwards to scan out-of-sequence pages. but accidents do happen, so i programmed this routine that renames your o.c.r. files and scan-files if needed. assuming you've got a clean set of scans and the page-numbering doesn't have many anomalies, it won't take you more than a minute of two to review the new names and approve a mass change. so, once again, jon, you've made a mountain out of a molehill, and then put together a baroque "plan" on how to scale it... -bowerbird

7 8

re: [gutvol-d] Scan file naming -- another comment
by Bowerbird＠aol.com 23 Jul '05

23 Jul '05

marcello said: > If you have a sheet feeder > just extract inserted illustrations > and such out-of-sequence stuff > (making a note of the page number > on the back). Then feed the whole pile > starting with page arabic "1" > and go drink a coffee. i'm sure juliet will get a chuckle out of that... :+) -bowerbird

1 0

re: [gutvol-d] Scan file naming -- another comment
by Bowerbird＠aol.com 22 Jul '05

22 Jul '05

jon, go back and read my post in 5 years. you will understand it much better then... the overarching rule is to make the system as complex as it needs to be, and no more. -bowerbird

3 2

re: [gutvol-d] what's wrong with these pictures that we call a scan-set?
by Bowerbird＠aol.com 22 Jul '05

22 Jul '05

marcello said: > Every OS in use today supports folders, > so its best to completely ignore this feature. that makes it sound like you want to use folders. > Lets throw all books into one folder > and use the filename to sort them out. that makes it sound like you don't want to use folders. i didn't discuss whether to use folders or not because it really doesn't matter much if you do or you don't. but if you don't have unique filenames, then you _have_to_ use folders. but that's not the only reason why unique filenames are a good idea. the benefits of unique filenames apply either way, primarily that you want to be able to unequivocally know the contents of a file by looking at its filename, and that all the files of a book get sorted with each other (if they are among other books), and that they get sorted in their correct order, as defined by the paper-book binding. > Please realize that the etext no. gets assigned > immediately before the book is posted. > I tried to have this changed, because IMO > it is a bad process, but WWers wanted to keep it like it is. > In consequence, if you want the etext no. as part of your filename, > you'll have to graft it on after the book is posted. it would be easy enough to twiddle that part of the workflow. > Basically your proposal has following advantages over mine: wrong on all counts. but everyone else can see that for themselves, so i won't bother with the redundancy of setting the record straight... > That file is just a "remake" of the zipped collection of tiffs. > I just wanted to test compression, djvu plugin, loading speed, > and deep linking features. This file predates my RFC > and so cannot be compliant with it. the point still stands; whoever made this scan-set did not include a blank-page scan for the verso of the illustration, which means the recto/verso alternating pattern for the set was thrown off... > having 2 different streams for front covers and back covers is overkill we've got plenty of letters in the alphabet we can use if we need them to get the sequence to run correctly... > nearer to the paper book experience: first thing you see is the front cover if you've got a cover-scan, that file should be sorted so it rises to the top, without question. > its easier to print: assuming you want to print the covers at all > you'll have to print them on the color printer it's no easier to do it your way than mine. and either way, people can figure that out. > if you split the cover page stream where do you put the spine? at the end, of course. its name should sort to the very bottom. because it exists outside the linear sequence of the bound pages. > in the *looooooooooooooong* meantime ... more disinformation i will not bother to counter. except to say that beta testers are still welcome! -bowerbird

2 1

what's wrong with these pictures that we call a scan-set?
by Bowerbird＠aol.com 22 Jul '05

22 Jul '05

let's talk about the issue of putting scans online. the discussions on scanning file-naming conventions and the resolution at which we "should" be scanning have sidetracked the greater thread into a tar-pit... in an attempt to pull it out, i offer this post, which i wrote after having examined a couple of the e-texts which currently offer pagescans. perhaps this re-entry into the world of reality will remind us that all of this _does_ bring up issues, even if they are not the ones we're agonizing over. since i've been writing this over the course of several days, some of it is focused on basics that you might think we have already covered, but try to read it with an open mind if you can... the fact of the matter is, unlike the early days, we now have scans for a very high percentage of the e-texts that are put into the p.g. library. they might not meet every one of the demands that some of us would like to impose on them, but nonetheless, we _do_ have the pagescans... and, with diskspace getting cheaper all the time, and fat pipes becoming more and more common, we can start entertaining the idea of making these pagescans available on a public basis. and indeed, the "official" policy is that scans can be included... especially from the standpoint of _error-correction_, having online scans could make a _huge_ contribution. but even if we decide to hold off on this for a while, it's still a good idea to know what we need to know when we _do_ go ahead and start putting scans up. so let's examine this question from a _practical_ standpoint, rather than some "idealistic" one... and the best way to do that is to get real... *** i looked at four english e-texts from the library where the pagescans are already included, and will first make some overall comments about them. as i said a few days back, these are case-studies in how _not_ to do this. they really are that bad. here are some of the problems. some of the scan files are bad; they will not display. for instance, 2 of the 4 images in #14116 are bad. that's not a very good percentage, i would consider. 3 of the 4 e-texts contained at least one bad scan. the naming convention is also flawed, badly flawed. (i guess we cannot get away from this completely.) a naming convention that uses filenames like "001.tif", "002.tif", etc. is completely useless. when you have 17,000 such e-texts, when some go astray, as they inevitably will at some time, you do _not_ want to have to open up each "001.tif" in order to see what it _really_ is. if you've ever experienced that, you know that it drives you nuts, and you find a way to fix it. laying a "system" like this on the incompetents in the general public is a recipe for huge snafu. the name _has_to_ reflect the book itself, so there needs to be a _rootname_ that is unique to each book prefixing the filename. that allows different e-texts to live in the same folder, and that can come in handy. since the "rootname" that p.g. has come to decide upon is the e-text-number, use that. so the name for the first scan in #14116 should be something like "14116-001.tif" (there is one school of thought that says that you should only zero-pad the numbers as far out as you need for the particular book. thus, since there are only 4 pages in this e-text -- it's one issue of a serial -- that would mean _no_ zero-padding, so the name of the first file would be "14116-1.tif" if you follow this school. your tools must know how to resolve the issue to find the right file no matter which way you do it.) again, with 4 scans in this e-text, and 2 of them bad, this e-text does not serve as an auspicious beginning. *** ok, after #14116, let's look at #14040. at the outset, it is worth noting here that the quality of these scans is not very good. so i am uncertain _why_ this project was chosen as one where the raw images are included. these are not the type of scans you will hold up proudly. you might be willing to show them if someone insisted that they had a reason that they needed to see them. but they're not good enough to parade 'em as examples. aside from the useless "001.tif" file-naming problem again, there are some other file-naming problems with this e-text. first of all, the filenames do not reflect the pages that are contained within them. as above, you don't want to have to _open_ a file to know the number of the page pictured inside. you want that number reflected in the name of the file itself, to the degree that it is possible. (and it almost always is.) as it is, a few example files from #14040 are as follows: > 003.tif = page 7 > 004.tif = page 9 > 005.tif = page 11 > 006.tif = page 12 so, we _could_ rename the files as follows: > rename 003.tif as 14040-007.tif = page 7 > rename 004.tif as 14040-009.tif = page 9 > rename 005.tif as 14040-011.tif = page 11 > rename 006.tif as 14040-012.tif = page 12 looking at these scans shows they are front matter, so we can probably safely assume pages 8 and 10 were blank left-hand pages. but we don't actually _know_ that now, do we? and we want to _know_; we don't want to have to _assume_ any such things, especially if we need to open up and examine the scans in order to be able to be confident about our assumptions. so this solution only makes us wonder about 008.tif, and 010.tif. what was up with those pages/files? are they blank pages? or missing pages? or what? the upshot is that blank pages have to be included, because we need to have them as _placeholders_ which tell us that we do indeed have all of the pages. we want our checker-apps to be able to run through all of the filenames in a folder and _confirm_that "yes, all pages seem to be present and accounted for", and to do that, we need to have files for blank pages. this also addresses a less subtle problem, which is that -- for some particular usages -- apps will need to know whether a page is a _left-hand_page_ in the p-book or a _right-hand_page_. when the collection of filenames is fully in-sequence with the pages as they were bound, with a separate file for each and every recto and verso, we can know -- from the odd/even nature of the file's position in that enforced-linearity sequence-number -- whether that page was a left-hand or a right-hand page. in the bad example above, the linkage has been corrupted. 003.tif is a right-hand page, but so are 004.tif and 005.tif. our assumption would be that 004.tif was a left-hand page. if we want to print the scans to "recreate" the paper-book -- which will be one of the most-common uses scans will serve -- an absence of synchronicity like this one simply will not do. just to be thorough in my treatment, i'll remark that this recto/verso distinction also has implications when you have an image that's inserted between two pages. you have to include a blank-page scan after that, too, for its _verso_, to keep the sequence of files consistent. you might remember that i pointed this out to jon noring when it came up in his "my antonia" project a while back. marcello's illustration-file in his djvu has this problem too. he inserted a graphic named "72-image.tif" within the files, but didn't have the corresponding blank-verso file as well. now whether a person will actually _scan_ these blank pages is an open question. i found that trying to skip the blank pages broke my rhythmic scanning pattern, so it was faster and easier for me to just scan them. but other people might be different. if they are, we can deal with their output later, with no trouble; it's really not a big deal to rename the files appropriately and to insert the blank-page scans wherever they happen to be needed. (you just have to have a tool that's smart enough to do the job.) one more thing about this "recreating the paper-book" thought. if the files are named wisely, sorting their names will give you the order in which they should be printed to mimic the p-book. so there's no need for an "index file" or that kind of nonsense. sort-order is the very type of easy-to-follow no-ambiguity rule that program-designers like. it also means that people can use a slide-show tool to cruise through the pagescans if they want. note that a sort-order rule conflicts with marcello's suggestion that scans of the covers should be named as follows: > c0000.tif = front cover > c0002.tif = inside front cover > c0003.tif = inside back cover > c0004.tif = outside back cover > c0005.tif = spine according to the binding-creates-a-specific-linearity model which we then emulate through judicious naming of the files, names for the back-cover scans should sort to the _bottom_. *** um, we're not done with 14040 yet; there are some other serious problems with this scan-set. pages 30 and 31 were scanned twice, whereas pages 32 and 33 are missing. if you're looking, this sequence is of files 024.tif through 027.tif. (and here you can see how confusing it becomes when the filenames are not the page-numbers.) this type of error is a not-uncommon one when scanning, to be sure. duplicate spreads, or missing spreads, or -- as in this case -- a combo, are easy mistakes to make in the course of scanning hundreds of pages. after all, it is humans doing this scanning. and to err is human, right? (but sheet-fed scanners _also_ make these mistakes, when they jam, or misfeed, or whatever. so it's not just humans; machines err too.) but even (especially!) if a mistake is common, it _should_ be the case that quality-control checks would have located and fixed this mistake, _long_ before this scan-set was ever released out to the general public. indeed, this type of mistake should _always_ be caught -- and fixed! -- before the person doing the scanning even returns the book to the shelf. it takes 20 times longer to fix this problem if you don't do it right away. the fact that the error is still sitting there, easy for me and anyone else to check and see, indicates that quality-control needs to be improved... finally, the "read-me" file for these scans says: > The page images of thie book > are shown in the TIFF files. it might be fairly petty to note this, but a typo on a word like "this" is pretty embarrassing for a "literary foundation" like p.g., is it not? *** ok, on to marcello's djvu example, for text #12973. (note the slight expansion of the analysis, from just the scans themselves to now include their consolidation within the djvu file, an additional step.) first, as i had noted in a earlier post, the page-numbers on the _djvu_file_ are _out-of-sync_ with the page-numbers on the actual pages in that file. but unlike the examples above, that's _not_ because the pagescan files are mis-named. in this case, those pagescan files are named _correctly_. (well, with just 2 exceptions.) what caused the problem, however, is that one of the pagescan files is named "000.tif", and djvu does not have the concept of a page zero. the first page it encounters is page 1, and then each page after that is incremented by 1. to put it another way, djvu does its own numbering. (this flaw is not unprecedented with viewer-tools; acrobat is the same. you'd think that any program that purports to deal with electronic-books would have -- at the outset -- taken into account that forward matter is often numbered with roman numerals, and that "page 1" of a book will require resetting the counter. but i guess that's expecting too much.) so that 000.tif -- a picture of the person who is the subject of the book, which the .html-version graphic-name enlightens us is the frontispiece -- throws off the numbering from the very beginning of the djvu file. oops. ironically, there is a "blank page" on page 4/5, which that frontispiece could replace nicely, and that would leave the page-numbering correct. if it provides us with accurate pagenumbers throughout the e-book, it is worth it to shuffle some pages of the front-matter around to achieve it. (even if it upsets the "replicate the p-book" objective mentioned above.) and in this case, that shuffle would indeed give us correct pagenumbers. at least until we hit the picture on page 73 (or so, it's 072-image.jpg), which -- as a separate file -- threw off marcello's djvu page-numbering by an _additional_ increment of 1, for a combined offset now of _2_, which means the offset is not even continuous throughout this djvu file. it's an offset of 1 up until page 73 (or so), then an offset of 2 after that. how can an ordinary person keep stuff like this straight? while it is not altogether clear _where_ this picture was in the p-book, to fix this, this picture could have been placed at the bottom of page 65, a half-page at the end of the previous chapter. and thus, the relationship between the djvu computed page-number and the p-book page-numbers in this scan-set _could_ be made to be totally consistent, if we wanted, by incorporating both of these fixes. it is often the cause that you can achieve this harmony. if you just do some work, you can make it work. but there are also some cases where you just cannot make it work out. if a p-book has 20 pages of front matter, and the first page of the book is numbered "3", there's no way you can do the required squeezing, so -- for those tools that do their own page-numbering starting at zero -- you're just not going to be able to make it work right. i recommend you lobby the programmers of those apps to make 'em more sophisticated, or else you can just throw them out. (the tools, not the programmers.) :+) in the meantime, _my_ tools will be programmed to do things correctly... (because having correct page-numbers for the body-text is so important, one way to make the best of a bad situation is to move the "front matter" to the end of the regular text, with a note in its original location that tells users where they can find it, and why it was moved for their convenience. this throws the numbering off for the front-matter, but that's less serious. again, it would be much nicer if the tools just accommodated this situation, since it is extremely common. but sometimes you do what you have to do.) *** and now back to the analysis of the scan-set and the djvu of #12973... concentrating now on the pagescans themselves, rather than the djvu, many scans are badly skewed. in particular, the left-hand pages are tilted one way, and the right-hand pages the other way, which will be a familiar sight to people who have experience with looking at scans, since it reflects a physical reality of putting a bound book on a scanner. sadly, as you page through the scans, this back-and-forth tilting gives the impression of being on a ship. it can even make you a bit seasick... :+) so this introduces the general topic of the _clean-up_ of the scans... and the first way that scans need to be cleaned-up is to be _straightened_. when you're scanning hundreds of pages, almost all of them will be crooked, to one degree or another, no matter how meticulous you are trying to be. this skew of the pages can make them difficult to read if it is bad enough. and even if it's only a very subtle skew, it will bother readers' subconscious. it's also worth noting that skewed pages give particularly poor o.c.r. results. so if you are going to deskew a page sooner or later (and -- if you want to make 'em public -- you really have to), then you might as well deskew them _before_ you o.c.r. them, and save yourself some time correcting scannos. (some o.c.r. programs even deskew the image before they do the recognition. i don't know if they save the deskewed image, though, which is what we want.) the people over at d.p. have finally learned this, at least when it comes to a really bad skew, of the type that is the case in almost all of the scans in this book. so we won't see pagescans that are this badly skewed, not from books that were done recently. however, i'm not sure that the d.p. people have learned that even a _little_ skew can adversely impact their o.c.r. results, so i don't think they routinely deskew all scans. but maybe they've changed. if not, they should. next, the size of the images in this scan-set has not been standardized. in particular, the partial-pages that are found at the end of each chapter are smaller, and thus -- when using the "fit-to-page" view option in djvu (which is the only realistic choice in most cases, providing the page can actually be read at that scaling, a problem with any non-reflowing format) -- it gets blown up unnecessarily large relative to the surrounding pages. except in very rare cases, all pagescans of a book should be the same size. in addition, positioning of the text on the scans should be _regularized_. that is, the upper-left point of the text should fall at the exact same point on every scan. even those end-chapter pages that are half-full should have their _top_left_ at the same point, even though their bottom will of course fall at a different place. (and, as noted above, they should still be the exact same size -- in width and height -- as all of the other pages.) although you might not see why at first, regularization can also aid o.c.r. that's because regularization allows you to set "zones" on the pagescan, one time for all of the pages, and zoning keeps the o.c.r. results straight. for instance, the zone that contains the running-head will be recognized as being independent of the body-text itself, so their text is not joined together. another example is that zoning can rule out stray marks in the margin from being recognized, and thus keep junk from intruding into the o.c.r. results. *** the last of the four e-texts i looked at was #14100. the problems with this set of scans were the same ones that have been discussed above, including a few very skewed scans, and a lack of regularized positioning. but otherwise, though, the quality of the scans was relatively good, and none of the 27 .tiff images were corrupted -- they all displayed just fine. *** so, there's a look at some actual e-text pagescans in the library. (there are a few more, but they were not in english, so i passed.) as it shows, the current examples of pagescans in the library are not very good examples at all, except as _bad_ examples. and if an analysis of _4_ e-texts results in a post this long, imagine the huge mess that will result from _400_ e-texts, if the practices are not buttoned down before we even start. (we won't even think about _4,000_, or _4-times-4,000_.) some of the flaws -- skewed scans, missing pages, etc. -- are readily apparent. others, like the filenaming glitches, may be less obvious to the casual observer, but they still stick out like sore thumbs to those of us who are trying to build tools to extend people's power in using the library... it is also of importance to note that some of these steps -- most notably deskewing and regularized text positioning -- are ones that can substantially improve the o.c.r. accuracy, which is a bonus that we probably cannot afford to pass up. if you're gonna do scan-cleanup sooner or later, do it sooner! other of my suggestions here, such as filenaming conventions, would also help the d.p. workflow, and i do believe this strongly. but if they want to be stubborn and do it some other way, fine. i'll just have to make corrections to what they do after-the-fact. *** anyway, there's my analysis for now. hopefully, instead of dealing with red herrings, these real issues will be addressed when project gutenberg makes the decision to move forward in earnest on the task of making scans available... that's all for now... -bowerbird

4 3

600 dpi vs. 300 dpi for text (a quickie visual experiment)
by Jon Noring 22 Jul '05

22 Jul '05

Everyone, I've placed online several bitonal test images (lossless PNG format) derived from scans of 5 point, 6 point and 11 point text. There are both 600 dpi and 300 dpi versions for each point size. http://www.openreader.org/600vs300/ (I won't go into detail of how I generated these, except to say the original scans were done at 600 dpi optical full color, then a 300 dpi full color version was generated by high-quality resampling. Then both the 600 dpi and derived 300 dpi full color were converted to bitonal by thresholding, where the threshold values were adjusted by eye to give consistent results (my eyes, for better and for worse.) Nowhere in the process was a lossy format like JPEG used. For each point size, the best way to visually compare the 600 dpi with the 300 dpi is to print them out using Photoshop or Paint Shop Pro (or similar higher-end graphics program), adjusting the scaling so the same block of text appears identical in size on paper. From this quickie experiment, which certainly could be improved, I make the following preliminary conclusions: 1) If the scans are to be used for OCR only (which is DP's focus at present), then from the visual test alone 300 dpi bitonal appears sufficient for 5 point and larger Latin character set text -- this applies to just about all text documents DP will ever encounter. Greyscale certainly improve things, but bitonal appears to be minimally sufficient for OCR. Of course, this observation, based solely on eye, jives with DP's OCR experience. So I'm not concluding anything revolutionary here. 2) Likewise, 300 dpi bitonal is *readable* by human beings for 5 points and larger Latin character set text. 3) However, for smooth, comfortable readability, 600 dpi is definitely better, even for the 11 point text. 300 dpi clearly looks ragged (especially the 5 point text). Of course, anti-aliasing during presentation will overcome some of this raggedness, but such anti-aliasing is strictly artificial and won't fix letters which are mangled in some manner due to the reduced resolution. If the purpose of scans is for multiple use cases (and not only OCR), then it appears wise to scan text at 600 dpi, preferably 24-bit full color (which aids with image cleanup, and of course necessary when we are dealing with colored text and color illustrations.) These master scans can be resized and/or reduced in color depth using batch image processing for whatever purpose is required (e.g., direct online reading, OCR, etc.) Of course, the huge downside to higher-rez, higher-color-depth images are much greater file sizes. This causes difficulty with online archival storage and transport. For the short term, these master images probably need to be stored offline on some sort of storage media (such as DVD-ROM, tape or removable high-capacity harddrives.) [see note below] At this point, since DP is not concerned with archivability and multiple use cases, and has limited bandwidth and disk storage, then there is no reason for them to require 600 dpi for scanning of text. (Illustrations are another matter, as Juliet has noted.) But I do believe that those who are submitting scans to DP should seriously consider doing all scans at 600 dpi full color (especially if the scans can be done without page distortion such as if the book is chopped and run through a sheet-feed scanner), and then resample them to 300 dpi (bitonal or greyscale) for submission to DP. Backup the original master scans in lossless form until they can be donated to a future page scan archive. Just some thoughts.. Comments? Jon Noring [Note: Obviously, high-resolution, high-color depth scans can be subtantially compressed using a lossy algorithm, such as JPEG and a few others. However, lossy compression adds artifacts to the images, so I believe lossy algorithms should be avoided for all steps in the process of producing the master scan images. Rather, use PNG or other high-quality lossless compression algorithm for all steps in producing the master scan images.]

5 9

page scans for sale
by Stephanie Maschek 22 Jul '05

22 Jul '05

Speaking of all the page scan stuff, Octavo.com is selling PDF's of various old books on CDs. The prices vary, for example, Hooke's Micrographia is "only" $30, while Shakespeare quartos are $75-$100. (I guess the people coming to this store don't know Gutenberg already has these books for free, or they need page numbers or one of the other excuses.) But except for the selling thing, is this kind of what the archived page scan discussion is leading to, or will it be much more complicated? -Stefanie

3 3

a couple more things on page scan filenaming
by Jon Noring 21 Jul '05

21 Jul '05

In regards to my prior messages today bringing up some preliminary ideas about a convention for naming page scan images, a couple more situations came to mind: 1) Some Works comprise multiple volumes, such as Burton's "The Book of the Thousand Nights and a Night", which comprises 10, 16 or 17 volumes, depending on the edition and what one defines to be the "Work". So that has to be considered with respect to book/volume/ work identifiers, and depends upon what the processing system (such as DP) assigns IDs. 2) Some page numbers are implied, but not specifically printed with the page number. For example, in the "My Antonia" project, the last page in most every chapter was kept unnumbered, although each has an obvious logical page number based on page number sequencing -- the publisher simply chose not to print the page number on these pages. It seems to me the filenaming system has to include implied publisher page numbering. Thoughts on the page scan filenaming conventions I've been considering so far? Jon

10 12

Re: [gutvol-d] 600 dpi vs. 300 dpi for text (a quickie visual experiment)
by Jon Noring 21 Jul '05

21 Jul '05

I previously wrote: > Another acceptable solution is not to permanently save the first > generation raw scans, but to preserve the second generation cleaned up > versions which have been deskewed (using the right algorithm!), > cropped, and for black and white source converted to greyscale (but > kept at 600 dpi.) One reason for full-color scanning of even black and > white source material is that due to the color characteristics of the > background paper, some color channels may be cleaner (essentially > "whiter") and could be used for conversion to greyscale. I learned > this trick from the DP forum. </smile> Although quite a few people at DP and PG know about the color channel "trick" to get cleaner greyscale scans of black and white texts, I thought I'd put up an example for those here who have not seen this (warning: need high-speed broadband or a lot of patience): http://www.openreader.org/color-separation/ The file "374.png" is a full-color, 600 dpi page scan (already deskewed and cropped) of page 374 from "My Antonia". The other images are the red, green and blue channels of this image. The paper was really this brown (Willa Cather chose this general color, plus subsequent yellowing/aging has made it worse.) Note how the red channel is significantly better than the others (better contrast between the ink and the paper which should yield a better result when converted to bitonal) with the blue channel the worst. This jives with the general view that the red channel is usually the best. In scanning black and white documents, it is possible to select one of the color channels for greyscale scanning so one need not scan at full-color. Some pretesting with full-color scans might identify the best channel to pick. Or simply scan at full color and in later processing find the best channel to use and batch convert. Anyway, it's amazing how color channel separation can give better results. Jon

1 0

Re: [gutvol-d] 600 dpi vs. 300 dpi for text (a quickie visual experiment)
by Jon Noring 21 Jul '05

21 Jul '05

Jonathan Ingram wrote: > Obviously 600DPI full colour looks better than 300DPI bitonal, but this extra > quality comes at a high price. Personally, moving from one to the other would > not happen until someone buys me a faster scanner, a faster computer, a new > large hard disk, and pays me for the extra time it will take me to scan and > process the material even with this upgraded equipment. Even then, the high > quality scans will never get off my computer unless you buy me a faster > internet connection. > > Even if some people do decide to make high resolution masters, there's little > need to make colour scans of black-and-white originals. Grayscale, maybe, but > all a colour scan will show you is how yellowed the paper is. > > You also blithely say 'backup the original master scans', perhaps ignoring just > how large lossless full-colour 600DPI scans are. Because I've been scanning for > OCR, I can store the over 900 items I've scanned for DP on my hard disk (the > folder takes up a little over 33 GB). A single large and long book scanned at > 600DPI full-colour could end up taking up that much space by itself... and I > don't particularly want to have to burn multiple DVDs every single time I scan > something (ah yes, you'll need to pay for the DVD recorder and media). I know how large those scans are. :^( Another acceptable solution is not to permanently save the first generation raw scans, but to preserve the second generation cleaned up versions which have been deskewed (using the right algorithm!), cropped, and for black and white source converted to greyscale (but kept at 600 dpi.) One reason for full-color scanning of even black and white source material is that due to the color characteristics of the background paper, some color channels may be cleaner (essentially "whiter") and could be used for conversion to greyscale. I learned this trick from the DP forum. </smile> In the "My Antonia" scan project (the source of which is entirely black and white), the entire scanset of deskewed, cropped, and still full color scans at 600 dpi (all 430 pages plus separate illustration scans) amounts to 5.8 gigs (losslessly compressed as PNG, which achieved about a 50% compression.) If I were to convert them to grey scale, which now seems like a good idea once I determine the best color channel to use, the whole file set should reduce down to 1.9 gigs as PNG -- I think one could fit 2 or 3 projects like this on a single DVD-ROM. I also have a derived scan set of 600 dpi bitonal -- that amounts to a paltry 49 Mbytes -- now 10 or 12 of these could fit on a CD-ROM. Similarly, for the "Kama Sutra" project, with 190 pages or so, the deskewed, cropped, 600 dpi greyscale PNGs total to 660 Mbytes, almost fittable on a CD-ROM. So, yes, your point is well-taken about the real raw full color 600 dpi scans that come off the scanner. I think it is acceptable, for archiving, to chuck these once the deskewed (again using the right algorithm), cropped, 600 dpi greyscale versions (generated from the best color channel) are generated. So I revise somewhat my suggestion. Jon

1 0