
let's talk about the issue of putting scans online.

the discussions on scanning file-naming conventions and the resolution at which we "should" be scanning have sidetracked the greater thread into a tar-pit... in an attempt to pull it out, i offer this post, which i wrote after having examined a few of the e-texts which currently offer pagescans. perhaps this re-entry into the world of reality will remind us that all of this _does_ bring up issues, even if they are not the ones we're agonizing over. since i've been writing this over the course of several days, some of it is focused on basics that you might think we have already covered, but try to read it with an open mind if you can...

the fact of the matter is, unlike the early days, we now have scans for a very high percentage of the e-texts that are put into the p.g. library. they might not meet every one of the demands that some of us would like to impose on them, but nonetheless, we _do_ have the pagescans... and, with diskspace getting cheaper all the time, and fat pipes becoming more and more common, we can start entertaining the idea of making these pagescans available on a public basis. and indeed, the "official" policy is that scans can be included...

especially from the standpoint of _error-correction_, having online scans could make a _huge_ contribution. but even if we decide to hold off on this for a while, it's still a good idea to know what we need to know when we _do_ go ahead and start putting scans up. so let's examine this question from a _practical_ standpoint, rather than some "idealistic" one... and the best way to do that is to get real...

***

i looked at four english e-texts from the library where the pagescans are already included, and will first make some overall comments about them. as i said a few days back, these are case-studies in how _not_ to do this. they really are that bad. here are some of the problems.

some of the scan files are bad; they will not display. for instance, 2 of the 4 images in #14116 are bad. that's not a very good percentage, if you ask me. 3 of the 4 e-texts contained at least one bad scan.

the naming convention is also flawed, badly flawed. (i guess we cannot get away from this topic completely.) a naming convention that uses filenames like "001.tif", "002.tif", etc. is completely useless. when you have 17,000 such e-texts, and some files go astray, as they inevitably will at some point, you do _not_ want to have to open up each "001.tif" in order to see what it _really_ is. if you've ever experienced that, you know that it drives you nuts, and you find a way to fix it. laying a "system" like this on the incompetents in the general public is a recipe for a huge snafu.

the name _has_to_ reflect the book itself, so there needs to be a _rootname_ that is unique to each book prefixing the filename. that allows different e-texts to live in the same folder, and that can come in handy. since the "rootname" that p.g. has settled on is the e-text-number, use that. so the name for the first scan in #14116 should be something like "14116-001.tif".

(there is one school of thought that says that you should only zero-pad the numbers as far out as you need for the particular book. thus, since there are only 4 pages in this e-text -- it's one issue of a serial -- that would mean _no_ zero-padding, so the name of the first file would be "14116-1.tif" if you follow this school. your tools must know how to resolve the issue to find the right file no matter which way you do it.)
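just to show how trivial that resolution is, here's a little python sketch -- the folder and rootname are hypothetical, and the 5-digit cap on the padding is an arbitrary assumption of mine:

    import os

    def find_scan(folder, rootname, number, ext=".tif"):
        # return the path of the scan for the given sequence-number,
        # whether it was saved as 14116-1.tif, 14116-001.tif, and so on.
        for width in range(1, 6):          # try 1 to 5 digits of zero-padding
            name = "%s-%0*d%s" % (rootname, width, number, ext)
            path = os.path.join(folder, name)
            if os.path.exists(path):
                return path
        return None                        # no file under any padding width

    # e.g., find_scan("scans", "14116", 1) matches "14116-1.tif" or "14116-001.tif"

point being, the padding question is a non-issue for any tool written with half a brain.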
again, with 4 scans in this e-text, and 2 of them bad, it does not serve as an auspicious beginning.

***

ok, after #14116, let's look at #14040.

at the outset, it is worth noting that the quality of these scans is not very good, so i am uncertain _why_ this project was chosen as one where the raw images are included. these are not the type of scans you will hold up proudly. you might be willing to show them if someone insisted they had a reason to see them, but they're not good enough to parade 'em as examples.

aside from the useless "001.tif" file-naming problem again, there are some other file-naming problems with this e-text. first of all, the filenames do not reflect the pages that are contained within them. as above, you don't want to have to _open_ a file to know the number of the page pictured inside. you want that number reflected in the name of the file itself, to the degree that it is possible. (and it almost always is.) as it is, a few example files from #14040 are as follows:
003.tif = page  7
004.tif = page  9
005.tif = page 11
006.tif = page 12
so, we _could_ rename the files as follows:
rename 003.tif as 14040-007.tif = page  7
rename 004.tif as 14040-009.tif = page  9
rename 005.tif as 14040-011.tif = page 11
rename 006.tif as 14040-012.tif = page 12
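a tool to do that renaming is a dozen lines of python. here's a rough sketch, with the old-name-to-page mapping typed in by hand just for illustration (a real tool would read it from a list you jot down while eyeballing the scans):

    import os

    rootname = "14040"
    # old filename -> the page-number actually pictured inside it
    page_map = {"003.tif": 7, "004.tif": 9, "005.tif": 11, "006.tif": 12}

    for old_name, page in sorted(page_map.items()):
        new_name = "%s-%03d.tif" % (rootname, page)
        print("renaming %s to %s" % (old_name, new_name))
        os.rename(old_name, new_name)      # run this inside the scan folder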
looking at these scans shows they are front matter, so we can probably safely assume pages 8 and 10 were blank left-hand pages. but we don't actually _know_ that now, do we? and we want to _know_; we don't want to have to _assume_ any such things, especially if we'd need to open up and examine the scans to be confident about our assumptions. so this solution only makes us wonder about 008.tif and 010.tif. what was up with those pages/files? are they blank pages? or missing pages? or what?

the upshot is that blank pages have to be included, because we need to have them as _placeholders_ which tell us that we do indeed have all of the pages. we want our checker-apps to be able to run through all of the filenames in a folder and _confirm_ that "yes, all pages seem to be present and accounted for", and to do that, we need to have files for blank pages.

this also addresses a more subtle problem, which is that -- for some particular usages -- apps will need to know whether a page is a _left-hand_page_ in the p-book or a _right-hand_page_. when the collection of filenames is fully in-sequence with the pages as they were bound, with a separate file for each and every recto and verso, we can know -- from the odd/even nature of the file's position in that enforced-linearity sequence -- whether that page was a left-hand or a right-hand page.

in the bad example above, the linkage has been corrupted. 003.tif is a right-hand page, but so are 004.tif and 005.tif. going by position alone, our assumption would be that 004.tif was a left-hand page -- and we'd be wrong. if we want to print the scans to "recreate" the paper-book -- which will be one of the most-common uses scans will serve -- an absence of synchronicity like this one simply will not do.

just to be thorough in my treatment, i'll remark that this recto/verso distinction also has implications when you have an image that's inserted between two pages. you have to include a blank-page scan after that, too, for its _verso_, to keep the sequence of files consistent. you might remember that i pointed this out to jon noring when it came up in his "my antonia" project a while back. marcello's illustration-file in his djvu has this problem too. he inserted a graphic named "72-image.tif" within the files, but didn't have the corresponding blank-verso file as well.

now whether a person will actually _scan_ these blank pages is an open question. i found that trying to skip the blank pages broke my rhythmic scanning pattern, so it was faster and easier for me to just scan them. but other people might be different. if they are, we can deal with their output later, with no trouble; it's really not a big deal to rename the files appropriately and to insert the blank-page scans wherever they happen to be needed. (you just have to have a tool that's smart enough to do the job.)

one more thing about this "recreating the paper-book" thought. if the files are named wisely, sorting their names will give you the order in which they should be printed to mimic the p-book, so there's no need for an "index file" or that kind of nonsense. sort-order is the very type of easy-to-follow, no-ambiguity rule that program-designers like. it also means that people can use a slide-show tool to cruise through the pagescans if they want.
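here's the sort of checker-app i have in mind, sketched in python. (the naming scheme it expects -- rootname-number.tif, with blank-page placeholders included -- is the one i'm advocating, not anything official, and it assumes the sequence starts on a right-hand page, the way it usually does.)

    import os, re

    def check_scans(folder, rootname):
        # confirm the scans form an unbroken 1..n sequence, then report
        # which position is a recto (right-hand) and which a verso (left-hand).
        pattern = re.compile(r"^%s-(\d+)\.tif$" % re.escape(rootname))
        matches = (pattern.match(f) for f in os.listdir(folder))
        numbers = sorted(int(m.group(1)) for m in matches if m)
        if not numbers:
            print("no scans found for", rootname)
            return
        missing = sorted(set(range(1, numbers[-1] + 1)) - set(numbers))
        if missing:
            print("missing scans (or placeholders) at positions:", missing)
        else:
            print("all %d pages seem to be present and accounted for." % len(numbers))
        for n in numbers:
            side = "right-hand (recto)" if n % 2 else "left-hand (verso)"
            print("%s-%03d.tif should be a %s page" % (rootname, n, side))

    # e.g., check_scans("scans", "14040")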
note that a sort-order rule conflicts with marcello's suggestion that scans of the covers should be named as follows:

c0000.tif = front cover
c0002.tif = inside front cover
c0003.tif = inside back cover
c0004.tif = outside back cover
c0005.tif = spine
according to the binding-creates-a-specific-linearity model, which we then emulate through judicious naming of the files, names for the back-cover scans should sort to the _bottom_.

***

um, we're not done with #14040 yet; there are some other serious problems with this scan-set.

pages 30 and 31 were scanned twice, whereas pages 32 and 33 are missing. if you want to look, this sequence spans files 024.tif through 027.tif. (and here you can see how confusing it becomes when the filenames are not the page-numbers.)

this type of error is not an uncommon one when scanning, to be sure. duplicate spreads, or missing spreads, or -- as in this case -- a combo, are easy mistakes to make in the course of scanning hundreds of pages. after all, it is humans doing this scanning, and to err is human, right? (but sheet-fed scanners _also_ make these mistakes, when they jam, or misfeed, or whatever. so it's not just humans; machines err too.)

but even (especially!) if a mistake is common, it _should_ be the case that quality-control checks would have located and fixed it _long_ before this scan-set was ever released to the general public. indeed, this type of mistake should _always_ be caught -- and fixed! -- before the person doing the scanning even returns the book to the shelf. it takes 20 times longer to fix this problem if you don't do it right away. the fact that the error is still sitting there, easy for me and anyone else to see, indicates that quality-control needs to be improved...

finally, the "read-me" file for these scans says:
The page images of thie book are shown in the TIFF files.
it might be fairly petty to note this, but a typo on a word like "this" is pretty embarrassing for a "literary foundation" like p.g., is it not?

***

ok, on to marcello's djvu example, for text #12973. (note the slight expansion of the analysis, from just the scans themselves to now include their consolidation within the djvu file, an additional step.)

first, as i had noted in an earlier post, the page-numbers on the _djvu_file_ are _out-of-sync_ with the page-numbers on the actual pages in that file. but unlike the examples above, that's _not_ because the pagescan files are mis-named. in this case, those pagescan files are named _correctly_. (well, with just 2 exceptions.)

what caused the problem, however, is that one of the pagescan files is named "000.tif", and djvu does not have the concept of a page zero. the first page it encounters is page 1, and each page after that is incremented by 1. to put it another way, djvu does its own numbering. (this flaw is not unprecedented with viewer-tools; acrobat is the same. you'd think that any program that purports to deal with electronic-books would have -- at the outset -- taken into account that front matter is often numbered with roman numerals, and that "page 1" of a book will require resetting the counter. but i guess that's expecting too much.)

so that 000.tif -- a picture of the person who is the subject of the book, which the graphic-name in the .html version tells us is the frontispiece -- throws off the numbering from the very beginning of the djvu file. oops. ironically, there is a "blank page" on page 4/5, which that frontispiece could replace nicely, and that would leave the page-numbering correct. if shuffling some pages of the front-matter around gives us accurate page-numbers throughout the e-book, it is worth doing. (even if it upsets the "replicate the p-book" objective mentioned above.)

and in this case, that shuffle would indeed give us correct page-numbers -- at least until we hit the picture on page 73 (or so; the file is 072-image.jpg), which -- as a separate file -- threw off marcello's djvu page-numbering by an _additional_ increment of 1, for a combined offset of _2_. that means the offset is not even constant throughout this djvu file: it's an offset of 1 up until page 73 (or so), then an offset of 2 after that. how can an ordinary person keep stuff like this straight? while it is not altogether clear _where_ this picture was in the p-book, to fix this it could have been placed at the bottom of page 65, a half-page at the end of the previous chapter. and thus, the relationship between the djvu computed page-number and the p-book page-numbers in this scan-set _could_ be made totally consistent, if we wanted, by incorporating both of these fixes.

it is often the case that you can achieve this harmony. if you just do some work, you can make it work. but there are also some cases where you just cannot make it work out. if a p-book has 20 pages of front matter, and the first page of the body is numbered "3", there's no way you can do the required squeezing, so -- for those tools that insist on doing their own page-numbering -- you're just not going to be able to make it work right. i recommend you lobby the programmers of those apps to make 'em more sophisticated, or else you can just throw them out. (the tools, not the programmers.) :+) in the meantime, _my_ tools will be programmed to do things correctly...
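the bookkeeping a smarter tool needs to do here is simple arithmetic. a python sketch, with the two unnumbered inserts from #12973 typed in by hand (and the page-73 location being approximate, as noted):

    # unnumbered scans (frontispiece, inserted plates, their blank versos)
    # listed by the printed page they were bound in front of -- approximate for #12973
    unnumbered_inserts = [1, 73]       # one before page 1, one before page 73 (or so)

    def viewer_page(printed_page):
        # the page-number a dumb viewer (djvu, acrobat) will report
        # for a given printed page-number
        offset = sum(1 for p in unnumbered_inserts if p <= printed_page)
        return printed_page + offset

    for p in (5, 73, 100):
        print("printed page %d shows up as viewer page %d" % (p, viewer_page(p)))

a mapping like this is exactly the kind of thing a tool should carry around internally, so the reader never has to.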
(because having correct page-numbers for the body-text is so important, one way to make the best of a bad situation is to move the "front matter" to the end of the regular text, with a note in its original location that tells users where they can find it, and why it was moved for their convenience. this throws the numbering off for the front-matter, but that's less serious. again, it would be much nicer if the tools just accommodated this situation, since it is extremely common. but sometimes you do what you have to do.)

***

and now back to the analysis of the scan-set and the djvu of #12973...

concentrating now on the pagescans themselves, rather than the djvu, many scans are badly skewed. in particular, the left-hand pages are tilted one way, and the right-hand pages the other way, which will be a familiar sight to people who have experience looking at scans, since it reflects the physical reality of putting a bound book on a scanner. sadly, as you page through the scans, this back-and-forth tilting gives the impression of being on a ship. it can even make you a bit seasick... :+)

so this introduces the general topic of the _clean-up_ of the scans...

the first way that scans need to be cleaned up is to be _straightened_. when you're scanning hundreds of pages, almost all of them will be crooked, to one degree or another, no matter how meticulous you are trying to be. this skew can make the pages difficult to read if it is bad enough, and even if it's only a very subtle skew, it will still bother readers at a subconscious level. it's also worth noting that skewed pages give particularly poor o.c.r. results. so if you are going to deskew a page sooner or later (and -- if you want to make 'em public -- you really have to), then you might as well deskew them _before_ you o.c.r. them, and save yourself some time correcting scannos. (some o.c.r. programs even deskew the image before they do the recognition. i don't know if they save the deskewed image, though, which is what we want.)

the people over at d.p. have finally learned this, at least when it comes to a really bad skew, of the type seen in almost all of the scans in this book. so we won't see pagescans that are this badly skewed, not from books that were done recently. however, i'm not sure that the d.p. people have learned that even a _little_ skew can adversely impact their o.c.r. results, so i don't think they routinely deskew all scans. but maybe they've changed. if not, they should.

next, the size of the images in this scan-set has not been standardized. in particular, the partial-pages that are found at the end of each chapter are smaller, and thus -- when using the "fit-to-page" view option in djvu (which is the only realistic choice in most cases, provided the page can actually be read at that scaling, a problem with any non-reflowing format) -- they get blown up unnecessarily large relative to the surrounding pages. except in very rare cases, all pagescans of a book should be the same size.

in addition, positioning of the text on the scans should be _regularized_. that is, the upper-left point of the text should fall at the exact same point on every scan. even those end-chapter pages that are half-full should have their _top_left_ at the same point, even though their bottom will of course fall at a different place. (and, as noted above, they should still be the exact same size -- in width and height -- as all of the other pages.)
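for the record, none of this clean-up is exotic. here's a bare-bones python sketch using the pillow imaging library; it assumes the skew angle has already been measured (by eye or by some other routine), and the canvas-size and text-origin numbers are just made-up values for illustration:

    from PIL import Image, ImageOps    # the pillow imaging library

    def clean_page(path, skew_angle, canvas_size=(2500, 3500), text_origin=(200, 300)):
        # straighten one pagescan and drop it onto a uniform white canvas,
        # so every page comes out the same size with the top-left of its
        # text block at the same spot.
        page = Image.open(path).convert("L")
        page = page.rotate(skew_angle, expand=True, fillcolor=255)   # deskew, fill with white
        ink_box = ImageOps.invert(page).getbbox()                    # tight box around the ink
        text_block = page.crop(ink_box) if ink_box else page
        canvas = Image.new("L", canvas_size, 255)                    # blank white page
        canvas.paste(text_block, text_origin)
        return canvas

    # e.g., clean_page("14040-007.tif", -1.3).save("14040-007-clean.tif")

(a stray speck in the margin will fool that bounding-box trick, of course -- a real tool would be a little smarter about finding the text block -- but the point is that deskewing and regularizing are mechanical chores, not rocket science.)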
although you might not see why at first, regularization can also aid o.c.r. that's because regularization allows you to set "zones" on the pagescan, one time for all of the pages, and zoning keeps the o.c.r. results straight. for instance, the zone that contains the running-head will be recognized as being independent of the body-text itself, so the two are not joined together. another example is that zoning can keep stray marks in the margin from being recognized, and thus keep junk from intruding into the o.c.r. results.

***

the last of the four e-texts i looked at was #14100. the problems with this set of scans were the same ones that have been discussed above, including a few very skewed scans and a lack of regularized positioning. but otherwise, the quality of the scans was relatively good, and none of the 27 .tif images were corrupted -- they all displayed just fine.

***

so, there's a look at some actual e-text pagescans in the library. (there are a few more, but they were not in english, so i passed.)

as it shows, the current examples of pagescans in the library are not very good examples at all, except as _bad_ examples. and if an analysis of _4_ e-texts results in a post this long, imagine the huge mess that will result from _400_ e-texts, if the practices are not buttoned down before we even start. (we won't even think about _4,000_, or _4-times-4,000_.)

some of the flaws -- skewed scans, missing pages, etc. -- are readily apparent. others, like the filenaming glitches, may be less obvious to the casual observer, but they still stick out like sore thumbs to those of us who are trying to build tools to extend people's power in using the library...

it is also important to note that some of these steps -- most notably deskewing and regularized text positioning -- can substantially improve o.c.r. accuracy, which is a bonus we probably cannot afford to pass up. if you're gonna do scan-cleanup sooner or later, do it sooner! other suggestions of mine here, such as the filenaming conventions, would also help the d.p. workflow -- i believe this strongly. but if they want to be stubborn and do it some other way, fine. i'll just have to make corrections to what they do after-the-fact.

***

anyway, there's my analysis for now. hopefully, instead of dealing with red herrings, these real issues will be addressed when project gutenberg makes the decision to move forward in earnest on the task of making scans available...

that's all for now...

-bowerbird