
let's talk about the issue of putting scans online.

the discussions on scanning file-naming conventions and the resolution at which we "should" be scanning have sidetracked the greater thread into a tar-pit... in an attempt to pull it out, i offer this post, which i wrote after having examined a few of the e-texts which currently offer pagescans. perhaps this re-entry into the world of reality will remind us that all of this _does_ bring up issues, even if they are not the ones we're agonizing over. since i've been writing this over the course of several days, some of it is focused on basics that you might think we have already covered, but try to read it with an open mind if you can...

the fact of the matter is, unlike the early days, we now have scans for a very high percentage of the e-texts that are put into the p.g. library. they might not meet every one of the demands that some of us would like to impose on them, but nonetheless, we _do_ have the pagescans... and, with diskspace getting cheaper all the time, and fat pipes becoming more and more common, we can start entertaining the idea of making these pagescans available on a public basis. and indeed, the "official" policy is that scans can be included...

especially from the standpoint of _error-correction_, having online scans could make a _huge_ contribution. but even if we decide to hold off on this for a while, it's still a good idea to know what we need to know when we _do_ go ahead and start putting scans up. so let's examine this question from a _practical_ standpoint, rather than some "idealistic" one... and the best way to do that is to get real...

***

i looked at four english e-texts from the library where the pagescans are already included, and will first make some overall comments about them. as i said a few days back, these are case-studies in how _not_ to do this. they really are that bad. here are some of the problems.

some of the scan files are bad; they will not display. for instance, 2 of the 4 images in #14116 are bad. that's not a very good percentage, if you ask me. 3 of the 4 e-texts contained at least one bad scan.

the naming convention is also flawed, badly flawed. (i guess we cannot get away from this topic completely.) a naming convention that uses filenames like "001.tif", "002.tif", etc. is completely useless. when you have 17,000 such e-texts, and some files go astray, as they inevitably will at some point, you do _not_ want to have to open up each "001.tif" in order to see what it _really_ is. if you've ever experienced that, you know that it drives you nuts, and you find a way to fix it. laying a "system" like this on the incompetents in the general public is a recipe for a huge snafu.

the name _has_to_ reflect the book itself, so there needs to be a _rootname_ that is unique to each book prefixing the filename. that allows different e-texts to live in the same folder, and that can come in handy. since the "rootname" that p.g. has settled on is the e-text-number, use that. so the name for the first scan in #14116 should be something like "14116-001.tif".

(there is one school of thought that says that you should only zero-pad the numbers as far out as you need for the particular book. thus, since there are only 4 pages in this e-text -- it's one issue of a serial -- that would mean _no_ zero-padding, so the name of the first file would be "14116-1.tif" if you follow this school. your tools must know how to resolve the issue to find the right file no matter which way you do it.)
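just to show how trivial that resolution is, here's a little python sketch -- the folder and rootname are hypothetical, and the 5-digit cap on the padding is an arbitrary assumption of mine:

    import os

    def find_scan(folder, rootname, number, ext=".tif"):
        # return the path of the scan for the given sequence-number,
        # whether it was saved as 14116-1.tif, 14116-001.tif, and so on.
        for width in range(1, 6):          # try 1 to 5 digits of zero-padding
            name = "%s-%0*d%s" % (rootname, width, number, ext)
            path = os.path.join(folder, name)
            if os.path.exists(path):
                return path
        return None                        # no file under any padding width

    # e.g., find_scan("scans", "14116", 1) matches "14116-1.tif" or "14116-001.tif"

point being, the padding question is a non-issue for any tool written with half a brain.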
again, with 4 scans in this e-text, and 2 of them bad, it does not serve as an auspicious beginning.

***

ok, after #14116, let's look at #14040.

at the outset, it is worth noting that the quality of these scans is not very good, so i am uncertain _why_ this project was chosen as one where the raw images are included. these are not the type of scans you will hold up proudly. you might be willing to show them if someone insisted they had a reason to see them, but they're not good enough to parade 'em as examples.

aside from the useless "001.tif" file-naming problem again, there are some other file-naming problems with this e-text. first of all, the filenames do not reflect the pages that are contained within them. as above, you don't want to have to _open_ a file to know the number of the page pictured inside. you want that number reflected in the name of the file itself, to the degree that it is possible. (and it almost always is.) as it is, a few example files from #14040 are as follows:
003.tif = page  7
004.tif = page  9
005.tif = page 11
006.tif = page 12
so, we _could_ rename the files as follows:
rename 003.tif as 14040-007.tif = page  7
rename 004.tif as 14040-009.tif = page  9
rename 005.tif as 14040-011.tif = page 11
rename 006.tif as 14040-012.tif = page 12
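a tool to do that renaming is a dozen lines of python. here's a rough sketch, with the old-name-to-page mapping typed in by hand just for illustration (a real tool would read it from a list you jot down while eyeballing the scans):

    import os

    rootname = "14040"
    # old filename -> the page-number actually pictured inside it
    page_map = {"003.tif": 7, "004.tif": 9, "005.tif": 11, "006.tif": 12}

    for old_name, page in sorted(page_map.items()):
        new_name = "%s-%03d.tif" % (rootname, page)
        print("renaming %s to %s" % (old_name, new_name))
        os.rename(old_name, new_name)      # run this inside the scan folder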
looking at these scans shows they are front matter, so we can probably safely assume pages 8 and 10 were blank left-hand pages. but we don't actually _know_ that now, do we? and we want to _know_; we don't want to have to _assume_ any such things, especially if we'd need to open up and examine the scans to be confident about our assumptions. so this solution only makes us wonder about 008.tif and 010.tif. what was up with those pages/files? are they blank pages? or missing pages? or what?

the upshot is that blank pages have to be included, because we need to have them as _placeholders_ which tell us that we do indeed have all of the pages. we want our checker-apps to be able to run through all of the filenames in a folder and _confirm_ that "yes, all pages seem to be present and accounted for", and to do that, we need to have files for blank pages.

this also addresses a more subtle problem, which is that -- for some particular usages -- apps will need to know whether a page is a _left-hand_page_ in the p-book or a _right-hand_page_. when the collection of filenames is fully in-sequence with the pages as they were bound, with a separate file for each and every recto and verso, we can know -- from the odd/even nature of the file's position in that enforced-linearity sequence -- whether that page was a left-hand or a right-hand page.

in the bad example above, the linkage has been corrupted. 003.tif is a right-hand page, but so are 004.tif and 005.tif. going by position alone, our assumption would be that 004.tif was a left-hand page -- and we'd be wrong. if we want to print the scans to "recreate" the paper-book -- which will be one of the most-common uses scans will serve -- an absence of synchronicity like this one simply will not do.

just to be thorough in my treatment, i'll remark that this recto/verso distinction also has implications when you have an image that's inserted between two pages. you have to include a blank-page scan after that, too, for its _verso_, to keep the sequence of files consistent. you might remember that i pointed this out to jon noring when it came up in his "my antonia" project a while back. marcello's illustration-file in his djvu has this problem too. he inserted a graphic named "72-image.tif" within the files, but didn't have the corresponding blank-verso file as well.

now whether a person will actually _scan_ these blank pages is an open question. i found that trying to skip the blank pages broke my rhythmic scanning pattern, so it was faster and easier for me to just scan them. but other people might be different. if they are, we can deal with their output later, with no trouble; it's really not a big deal to rename the files appropriately and to insert the blank-page scans wherever they happen to be needed. (you just have to have a tool that's smart enough to do the job.)

one more thing about this "recreating the paper-book" thought. if the files are named wisely, sorting their names will give you the order in which they should be printed to mimic the p-book, so there's no need for an "index file" or that kind of nonsense. sort-order is the very type of easy-to-follow, no-ambiguity rule that program-designers like. it also means that people can use a slide-show tool to cruise through the pagescans if they want.
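here's the sort of checker-app i have in mind, sketched in python. (the naming scheme it expects -- rootname-number.tif, with blank-page placeholders included -- is the one i'm advocating, not anything official, and it assumes the sequence starts on a right-hand page, the way it usually does.)

    import os, re

    def check_scans(folder, rootname):
        # confirm the scans form an unbroken 1..n sequence, then report
        # which position is a recto (right-hand) and which a verso (left-hand).
        pattern = re.compile(r"^%s-(\d+)\.tif$" % re.escape(rootname))
        matches = (pattern.match(f) for f in os.listdir(folder))
        numbers = sorted(int(m.group(1)) for m in matches if m)
        if not numbers:
            print("no scans found for", rootname)
            return
        missing = sorted(set(range(1, numbers[-1] + 1)) - set(numbers))
        if missing:
            print("missing scans (or placeholders) at positions:", missing)
        else:
            print("all %d pages seem to be present and accounted for." % len(numbers))
        for n in numbers:
            side = "right-hand (recto)" if n % 2 else "left-hand (verso)"
            print("%s-%03d.tif should be a %s page" % (rootname, n, side))

    # e.g., check_scans("scans", "14040")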
note that a sort-order rule conflicts with marcello's suggestion that scans of the covers should be named as follows:

c0000.tif = front cover
c0002.tif = inside front cover
c0003.tif = inside back cover
c0004.tif = outside back cover
c0005.tif = spine
according to the binding-creates-a-specific-linearity model, which we then emulate through judicious naming of the files, names for the back-cover scans should sort to the _bottom_.

***

um, we're not done with #14040 yet; there are some other serious problems with this scan-set.

pages 30 and 31 were scanned twice, whereas pages 32 and 33 are missing. if you want to look, this sequence spans files 024.tif through 027.tif. (and here you can see how confusing it becomes when the filenames are not the page-numbers.)

this type of error is not an uncommon one when scanning, to be sure. duplicate spreads, or missing spreads, or -- as in this case -- a combo, are easy mistakes to make in the course of scanning hundreds of pages. after all, it is humans doing this scanning, and to err is human, right? (but sheet-fed scanners _also_ make these mistakes, when they jam, or misfeed, or whatever. so it's not just humans; machines err too.)

but even (especially!) if a mistake is common, it _should_ be the case that quality-control checks would have located and fixed it _long_ before this scan-set was ever released to the general public. indeed, this type of mistake should _always_ be caught -- and fixed! -- before the person doing the scanning even returns the book to the shelf. it takes 20 times longer to fix this problem if you don't do it right away. the fact that the error is still sitting there, easy for me and anyone else to see, indicates that quality-control needs to be improved...

finally, the "read-me" file for these scans says:
The page images of thie book are shown in the TIFF files.
it might be fairly petty to note this, but a typo on a word like "this" is pretty embarrassing for a "literary foundation" like p.g., is it not?

***

ok, on to marcello's djvu example, for text #12973. (note the slight expansion of the analysis, from just the scans themselves to now include their consolidation within the djvu file, an additional step.)

first, as i had noted in an earlier post, the page-numbers on the _djvu_file_ are _out-of-sync_ with the page-numbers on the actual pages in that file. but unlike the examples above, that's _not_ because the pagescan files are mis-named. in this case, those pagescan files are named _correctly_. (well, with just 2 exceptions.)

what caused the problem, however, is that one of the pagescan files is named "000.tif", and djvu does not have the concept of a page zero. the first page it encounters is page 1, and each page after that is incremented by 1. to put it another way, djvu does its own numbering. (this flaw is not unprecedented with viewer-tools; acrobat is the same. you'd think that any program that purports to deal with electronic-books would have -- at the outset -- taken into account that front matter is often numbered with roman numerals, and that "page 1" of a book will require resetting the counter. but i guess that's expecting too much.)

so that 000.tif -- a picture of the person who is the subject of the book, which the graphic-name in the .html version tells us is the frontispiece -- throws off the numbering from the very beginning of the djvu file. oops. ironically, there is a "blank page" on page 4/5, which that frontispiece could replace nicely, and that would leave the page-numbering correct. if shuffling some pages of the front-matter around gives us accurate page-numbers throughout the e-book, it is worth doing. (even if it upsets the "replicate the p-book" objective mentioned above.)

and in this case, that shuffle would indeed give us correct page-numbers -- at least until we hit the picture on page 73 (or so; the file is 072-image.jpg), which -- as a separate file -- threw off marcello's djvu page-numbering by an _additional_ increment of 1, for a combined offset of _2_. that means the offset is not even constant throughout this djvu file: it's an offset of 1 up until page 73 (or so), then an offset of 2 after that. how can an ordinary person keep stuff like this straight? while it is not altogether clear _where_ this picture was in the p-book, to fix this it could have been placed at the bottom of page 65, a half-page at the end of the previous chapter. and thus, the relationship between the djvu computed page-number and the p-book page-numbers in this scan-set _could_ be made totally consistent, if we wanted, by incorporating both of these fixes.

it is often the case that you can achieve this harmony. if you just do some work, you can make it work. but there are also some cases where you just cannot make it work out. if a p-book has 20 pages of front matter, and the first page of the body is numbered "3", there's no way you can do the required squeezing, so -- for those tools that insist on doing their own page-numbering -- you're just not going to be able to make it work right. i recommend you lobby the programmers of those apps to make 'em more sophisticated, or else you can just throw them out. (the tools, not the programmers.) :+) in the meantime, _my_ tools will be programmed to do things correctly...
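the bookkeeping a smarter tool needs to do here is simple arithmetic. a python sketch, with the two unnumbered inserts from #12973 typed in by hand (and the page-73 location being approximate, as noted):

    # unnumbered scans (frontispiece, inserted plates, their blank versos)
    # listed by the printed page they were bound in front of -- approximate for #12973
    unnumbered_inserts = [1, 73]       # one before page 1, one before page 73 (or so)

    def viewer_page(printed_page):
        # the page-number a dumb viewer (djvu, acrobat) will report
        # for a given printed page-number
        offset = sum(1 for p in unnumbered_inserts if p <= printed_page)
        return printed_page + offset

    for p in (5, 73, 100):
        print("printed page %d shows up as viewer page %d" % (p, viewer_page(p)))

a mapping like this is exactly the kind of thing a tool should carry around internally, so the reader never has to.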
(because having correct page-numbers for the body-text is so important, one way to make the best of a bad situation is to move the "front matter" to the end of the regular text, with a note in its original location that tells users where they can find it, and why it was moved for their convenience. this throws the numbering off for the front-matter, but that's less serious. again, it would be much nicer if the tools just accommodated this situation, since it is extremely common. but sometimes you do what you have to do.)

***

and now back to the analysis of the scan-set and the djvu of #12973...

concentrating now on the pagescans themselves, rather than the djvu, many scans are badly skewed. in particular, the left-hand pages are tilted one way, and the right-hand pages the other way, which will be a familiar sight to people who have experience looking at scans, since it reflects the physical reality of putting a bound book on a scanner. sadly, as you page through the scans, this back-and-forth tilting gives the impression of being on a ship. it can even make you a bit seasick... :+)

so this introduces the general topic of the _clean-up_ of the scans...

the first way that scans need to be cleaned up is to be _straightened_. when you're scanning hundreds of pages, almost all of them will be crooked, to one degree or another, no matter how meticulous you are trying to be. this skew can make the pages difficult to read if it is bad enough, and even if it's only a very subtle skew, it will still bother readers at a subconscious level. it's also worth noting that skewed pages give particularly poor o.c.r. results. so if you are going to deskew a page sooner or later (and -- if you want to make 'em public -- you really have to), then you might as well deskew them _before_ you o.c.r. them, and save yourself some time correcting scannos. (some o.c.r. programs even deskew the image before they do the recognition. i don't know if they save the deskewed image, though, which is what we want.)

the people over at d.p. have finally learned this, at least when it comes to a really bad skew, of the type seen in almost all of the scans in this book. so we won't see pagescans that are this badly skewed, not from books that were done recently. however, i'm not sure that the d.p. people have learned that even a _little_ skew can adversely impact their o.c.r. results, so i don't think they routinely deskew all scans. but maybe they've changed. if not, they should.

next, the size of the images in this scan-set has not been standardized. in particular, the partial-pages that are found at the end of each chapter are smaller, and thus -- when using the "fit-to-page" view option in djvu (which is the only realistic choice in most cases, provided the page can actually be read at that scaling, a problem with any non-reflowing format) -- they get blown up unnecessarily large relative to the surrounding pages. except in very rare cases, all pagescans of a book should be the same size.

in addition, positioning of the text on the scans should be _regularized_. that is, the upper-left point of the text should fall at the exact same point on every scan. even those end-chapter pages that are half-full should have their _top_left_ at the same point, even though their bottom will of course fall at a different place. (and, as noted above, they should still be the exact same size -- in width and height -- as all of the other pages.)
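for the record, none of this clean-up is exotic. here's a bare-bones python sketch using the pillow imaging library; it assumes the skew angle has already been measured (by eye or by some other routine), and the canvas-size and text-origin numbers are just made-up values for illustration:

    from PIL import Image, ImageOps    # the pillow imaging library

    def clean_page(path, skew_angle, canvas_size=(2500, 3500), text_origin=(200, 300)):
        # straighten one pagescan and drop it onto a uniform white canvas,
        # so every page comes out the same size with the top-left of its
        # text block at the same spot.
        page = Image.open(path).convert("L")
        page = page.rotate(skew_angle, expand=True, fillcolor=255)   # deskew, fill with white
        ink_box = ImageOps.invert(page).getbbox()                    # tight box around the ink
        text_block = page.crop(ink_box) if ink_box else page
        canvas = Image.new("L", canvas_size, 255)                    # blank white page
        canvas.paste(text_block, text_origin)
        return canvas

    # e.g., clean_page("14040-007.tif", -1.3).save("14040-007-clean.tif")

(a stray speck in the margin will fool that bounding-box trick, of course -- a real tool would be a little smarter about finding the text block -- but the point is that deskewing and regularizing are mechanical chores, not rocket science.)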
although you might not see why at first, regularization can also aid o.c.r. that's because regularization allows you to set "zones" on the pagescan, one time for all of the pages, and zoning keeps the o.c.r. results straight. for instance, the zone that contains the running-head will be recognized as being independent of the body-text itself, so the two are not joined together. another example is that zoning can keep stray marks in the margin from being recognized, and thus keep junk from intruding into the o.c.r. results.

***

the last of the four e-texts i looked at was #14100. the problems with this set of scans were the same ones that have been discussed above, including a few very skewed scans and a lack of regularized positioning. but otherwise, the quality of the scans was relatively good, and none of the 27 .tif images were corrupted -- they all displayed just fine.

***

so, there's a look at some actual e-text pagescans in the library. (there are a few more, but they were not in english, so i passed.)

as it shows, the current examples of pagescans in the library are not very good examples at all, except as _bad_ examples. and if an analysis of _4_ e-texts results in a post this long, imagine the huge mess that will result from _400_ e-texts, if the practices are not buttoned down before we even start. (we won't even think about _4,000_, or _4-times-4,000_.)

some of the flaws -- skewed scans, missing pages, etc. -- are readily apparent. others, like the filenaming glitches, may be less obvious to the casual observer, but they still stick out like sore thumbs to those of us who are trying to build tools to extend people's power in using the library...

it is also important to note that some of these steps -- most notably deskewing and regularized text positioning -- can substantially improve o.c.r. accuracy, which is a bonus we probably cannot afford to pass up. if you're gonna do scan-cleanup sooner or later, do it sooner! other suggestions of mine here, such as the filenaming conventions, would also help the d.p. workflow -- i believe this strongly. but if they want to be stubborn and do it some other way, fine. i'll just have to make corrections to what they do after-the-fact.

***

anyway, there's my analysis for now. hopefully, instead of dealing with red herrings, these real issues will be addressed when project gutenberg makes the decision to move forward in earnest on the task of making scans available...

that's all for now...

-bowerbird