this thread explains how to digitize a book quickly and easily.
search archive.org for "booksculture00mabiuoft" for the book.
***
the major advances attained in our last lesson were that we
1) are now able to chunk the text in terms of its _pages_, and
2) link to a particular scan, or even pull it in, if we so desire...
a question you might have is "where are the scans mounted?"
one answer is that you can mount them anywhere you like...
in the case of my code here, i linked to scans from this book
which have been located on one of my websites for ages now.
(indeed, this set of scans is _the_original_, from google, who
released this book as their very first public-domain example.
back then, the project was named "google print", which is why
each scan bears a "watermark" in its margin which says that.)
one big reason i linked to my scans is that they were already
mounted, but another was that they are named intelligently;
the number in the filename of each scan is its pagenumber.
but of course, this approach won't serve _you_, at least not
if you do another book for which i haven't mounted scans...
there is another easy answer, though... recall that we are
cleaning the o.c.r. text for a book scanned by archive.org.
so the scans are located at archive.org! you might object
that the scans in that scan-set are combined in a .zip file.
that is correct. but they can still be accessed individually.
to see how, take another look at the archive.org viewer-app:
> http://www.archive.org/stream/booksculture00mabiuoft#page/123
that displays the scan for page 123 (and perhaps 122 too,
depending on whether you're using the 1-up or 2-up mode.)
if you right-click on the scan for page 123, and then choose
the selection to "copy image address", you should get this:
> http://ia600300.us.archive.org/BookReader/BookReaderImages.php?
> zip=/1/items/booksculture00mabiuoft/booksculture00mabiuoft_jp2.zip
> &file=booksculture00mabiuoft_jp2/booksculture00mabiuoft_0129.jp2
> &scale=3.02920443101712&rotate=0
that's a u.r.l., a very long one. (i broke it up into pieces.)
and it turns out that you can ignore the last part of it --
the part that gives the "scale" and "rotate" components.
this u.r.l. gives us the address for the scan for page 123.
you could be forgiven if you're a bit confused, since there
isn't _any_ occurrence of "123" inside of that u.r.l., at least
until i remind you that there's an offset of "6" in this book.
sure enough, 123+6=129, and you _do_ see a "129" there.
so that u.r.l. tells us how to address the other pagescans
in this book. if we wanted to show the scan for page 234,
for instance, we'd simply change that 129 to 240 (234+6).
thus, for another book, you'll need to figure out the offset.
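if you'd rather see that arithmetic as code, here is a minimal
sketch in python. the function name and its parameters are my
own invention for this illustration; the host and the "/1/items"
piece are copied straight from the address above, and i'd assume
they can differ for other books, so check a copied address first.

```python
# values taken from the copied image address for this one book
HOST = "http://ia600300.us.archive.org"
BOOK = "booksculture00mabiuoft"
OFFSET = 6  # printed page 123 lives in scan 0129


def scan_url(page, book=BOOK, offset=OFFSET, host=HOST):
    """build the address of the scan for a printed pagenumber.

    the "scale" and "rotate" parts are left off, since they can be ignored.
    """
    scan = page + offset
    zip_path = f"/1/items/{book}/{book}_jp2.zip"
    file_path = f"{book}_jp2/{book}_{scan:04d}.jp2"
    return f"{host}/BookReader/BookReaderImages.php?zip={zip_path}&file={file_path}"


print(scan_url(234))  # the filename in this address ends in _0240.jp2
```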
one complication is that some books have an offset that
_changes_ in the course of the book, due to pages which
were unnumbered. (this happens with illustration plates.)
in such cases, you'll need to work with a "variable offset".
it's not difficult to figure it out; it's just a pain in the ass.
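one way to handle a "variable offset" is a little lookup table
that says "starting at this printed page, use this offset".
this is only a sketch; the breakpoints below are made up for
the illustration (the mabie book has a constant offset of 6),
and the names are hypothetical.

```python
from bisect import bisect_right

# (first printed page where this offset applies, offset)
# made-up numbers: pretend unnumbered plates were inserted
# before page 50 and again before page 120.
BREAKS = [(1, 6), (50, 8), (120, 10)]


def offset_for(page, breaks=BREAKS):
    """return the offset in force at the given printed page."""
    starts = [start for start, _ in breaks]
    i = bisect_right(starts, page) - 1  # last breakpoint at or before page
    if i < 0:
        raise ValueError(f"page {page} comes before the first breakpoint")
    return breaks[i][1]


def scan_number(page, breaks=BREAKS):
    """printed pagenumber -> scan number, using the variable offset."""
    return page + offset_for(page, breaks)
```

you build the table once per book, by spot-checking the scans
around each run of unnumbered pages.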
of course, another option is just to download the scans
in the .zip file, rename the files more intelligently, and
mount them in a location over which you have control...
this will give you a chance to downscale the images too.
because you should be aware that archive.org scans are
bloated... for instance, their scan for page 123 is 221k,
whereas the .jpg mounted on my site for page 123 is 66k.
so if you summon a page many times, it'll be downloaded
repeatedly, which could eat into any bandwidth allowance
you might be limited to... so if you think you might
work a lot on the pages, it's best to download the scans...
why waste those precious mobile minutes 3 times as fast?
perhaps you think that's a first-world problem? re-think.
the people on the poor side of the digital divide will suffer
even more from the archive.org obsession with pagescans
instead of digital text, which is lighter _and_ more flexible.
and i don't think we should sweep it under the rug, either...
having said all that, i will also report that the viewer-app
at archive.org (javascript-based) works like a charm on an ipad.
so -- if you are lucky enough to afford it -- appreciate it.
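as for that option of downloading the .zip and renaming the
files more intelligently, here is a rough sketch of the renaming
part. the naming scheme ("p123.jp2" for numbered pages, "f003.jp2"
for scans that come before page 1) is just an assumption i made
for the example, as are the function names. downscaling is left
out, since it needs an image library that can read jpeg-2000.

```python
import re
import zipfile
from pathlib import Path

OFFSET = 6  # scan 0129 is printed page 123 in this book


def page_name(scan_name, offset=OFFSET):
    """turn '..._0129.jp2' into 'p123.jp2'; None if there's no scan number."""
    m = re.search(r"_(\d{4})\.jp2$", scan_name)
    if m is None:
        return None
    scan = int(m.group(1))
    page = scan - offset
    if page < 1:
        return f"f{scan:03d}.jp2"  # front-matter keeps its scan number
    return f"p{page:03d}.jp2"


def extract_renamed(zip_path, out_dir, offset=OFFSET):
    """write every scan in the .zip to out_dir under its page-based name."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as z:
        for info in z.infolist():
            new = page_name(info.filename, offset)
            if new is None:
                continue  # skip folders and anything that isn't a scan
            (out / new).write_bytes(z.read(info))
            written.append(new)
    return written
```

for a book with a changing offset, you'd swap the subtraction
for whatever lookup you worked out for that book.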
getting back to that issue of where the scans are stored,
it's also possible to grab google's scans programmatically...
just use the same tactic of getting the u.r.l. for the scan,
and plug it into your program. you might also find that
google's a bit better about naming its files intelligently.
***
so, enough about that. let's get back to our mabie book...
in our last lesson, i viewed scans to answer some questions
about a handful of edits, so i made those fixes to the text...
i also decided to do the manual editing of the front-matter.
this book is typical of most, in that the recognition of the
front-matter is very spotty. most of the words are correct
-- which is a big improvement over the way it used to be --
but the formatting is bad, with far too many blank lines...
you can see the original o.c.r. front-matter here:
> http://zenmarkuplanguage.com/grapes000.txt
up until this last edit, i substituted in some dummy-text:
> http://zenmarkuplanguage.com/grapes002.txt
and you can view my hand-edited front-matter here:
> http://zenmarkuplanguage.com/grapes003.txt
so that's now the newest version of our text: grapes003.txt.
***
this book was most decidedly quite friendly to us, because
the first page of its body-text has the pagenumber of "7",
meaning that we had 6 pages of front-matter to play with.
sometimes we need to shift the front-matter to the back
-- just while we're cleaning the text -- so we can _start_
on "page 1" and have that text be the first text in the file.
but we didn't have to play that little trick, thank goodness.
you will notice that i did "rearrange" some of the pages,
and i dropped the text from the cover (we didn't need it).
plus in an e-book, you don't have "recto" and "verso",
and blank pages are unnecessary, so i eliminate 'em.
in general, just play around with it, and make it work...
***
ok, now for some code...
our page-oriented advances from the last lesson allow us
to do something kinda cool, for the sake of a little demo.
here we go:
> http://zenmarkuplanguage.com/grapes201.py
i keep hoping marcello will barge in here to make fun of
my python skills. i have even tried to bait him with code
that's _not_ how a "real" python programmer would do it.
but this code is _really_ not up to snuff, so i ain't even
gonna give you the source today. nonetheless -- this
little demo program just might get your juices flowing.
it shows one page of text, with its pagescan alongside,
and lets you click buttons to go backward or forward...
(or just enter a pagenumber in the box to jump there.)
that's pretty simple, to be sure, but it can also serve as
an _engine_ for features which are much more powerful.
if you'd like to see some examples of that, just ask me.
or if you want to know the nature of the problems with
the code in this program (and the last), ask about that...
or if you're wondering if all this can be accomplished
in a more effective manner with an offline application
-- the answer is "yes", as i've been saying all along --
then we can talk about that... or not... as you prefer...
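in the meantime, to give a feel for the idea, here is a minimal
sketch. to be clear, this is _not_ the code from grapes201.py;
it just assumes you already have a dictionary that maps each
pagenumber to its text, and that your scans are mounted with
the pagenumber in the filename. the scan address, the script
name, and the function name are all placeholders.

```python
from html import escape

# placeholder address; point this at wherever your scans are mounted
SCAN_TEMPLATE = "http://example.com/scans/{page:03d}.jpg"


def render_page(pages, page, script="viewer.py"):
    """return html for one page of text with its pagescan alongside."""
    if page not in pages:
        raise KeyError(f"no text for page {page}")
    numbers = sorted(pages)
    i = numbers.index(page)
    prev_page = numbers[max(i - 1, 0)]                 # stay put at the start
    next_page = numbers[min(i + 1, len(numbers) - 1)]  # stay put at the end
    return (
        "<table><tr>"
        f"<td><pre>{escape(pages[page])}</pre></td>"
        f'<td><img src="{SCAN_TEMPLATE.format(page=page)}"></td>'
        "</tr></table>"
        f'<a href="{script}?page={prev_page}">back</a> '
        f'<form action="{script}"><input name="page" value="{page}"></form> '
        f'<a href="{script}?page={next_page}">forward</a>'
    )
```

the box is a plain form, so typing a pagenumber and hitting
return asks the same script for that page.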
***
one thing that this page-based method tells us is that
there are various ways you can approach a digitization.
you can set your focus at the level of _lines_ of text,
which proves to be the _best_ level most of the time,
in my opinion. you can work on the book as a whole.
you can make the _page_ the primary unit of analysis.
you can digitize at the level of _individual_words_, and
you can concentrate on the _paragraph_ as your entity.
now, of course, there's no requirement that you work
_exclusively_ at any particular level. some checks can
be accomplished more effectively at a certain level, so
you should choose the focus that's appropriate for the task.
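to make those levels concrete, here is a small sketch of how
one text can be cut three different ways. it assumes that
paragraphs are separated by blank lines, and its notion of a
"word" is crude (letters, with apostrophes or hyphens inside);
the function names are just for this illustration.

```python
import re


def as_lines(text):
    """the text as a list of lines, blank ones included."""
    return text.splitlines()


def as_paragraphs(text):
    """the text split on blank lines (assumed paragraph separator)."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def as_words(text):
    """runs of letters, allowing an apostrophe or hyphen inside a word."""
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)


sample = "first line\nsecond line\n\nnew para, isn't it?\n"
print(len(as_lines(sample)), len(as_paragraphs(sample)), len(as_words(sample)))
```

each check then gets to pick whichever list suits it best.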
so in our next lesson, we're gonna drill down to words,
and code up an actual spell-check routine for our text.
big fun for the weekend! :+)
-bowerbird