[gutvol-d] a review of some digitization tools -- 001

16 Nov 2011

      so let's review some digitization tools, shall we?

the task of digitizing a book is now fairly simple,
thanks to cheap scanners and good o.c.r. software...

scanners have been incorporated into printers
for a while now, and even the cheaper ones are
able to do a satisfactory job on most p-books...

also, the best-of-breed o.c.r. app -- abbyy --
is now cross-platform _and_ fairly affordable...

but the best news is that you don't even need
to do the scanning or o.c.r. yourself nowadays
-- if you want to do a public-domain book --
because the chances are very good that google
or internet archive has already done it for you.

so search around to see if scans already exist.

(if you want a book in a non-english language,
you might find it at google or internet archive,
but if not, there are other sites you can search,
and you likely know more about 'em than i do.)

you might even discover that a book has been
scanned multiple times.   that's a good thing...
what you should do is look at all of the copies,
and find the one with the best-looking scans
and (more important) the most-accurate o.c.r.

download the scans and the o.c.r., as well as
a djvu copy (if available) and/or the .pdf copy.

the first thing to do is to spellcheck the o.c.r.
when doing this, you must look at the scans!
keep in mind that you aren't "correcting" the
spelling, per se, but rather matching the book.

now, do not take this to a ridiculous extreme.

if there was clearly an _error_ in the p-book,
you will want to fix it.   you might even wanna
update some anachronisms, such as changing
"to-day" to the version we all use now, "today".
(i typically delete periods from chapter-heads.)

and there might be some changes you'll make
in order to make the text work as an _e-book_.
(i make sure that the table-of-contents entries
_exactly_match_ the chapter-heads themselves,
so my automatic link-creator works correctly.)
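
(a check like that is easy to script.   here's a
small sketch in python; the function name and the
two lists are hypothetical, and how you pull the
entries out of your own text is up to you.)

```python
def toc_mismatches(toc_entries, chapter_heads):
    """return the toc entries that have no exactly-matching chapter-head."""
    heads = set(chapter_heads)
    # exact string comparison on purpose: a stray period or a
    # difference in capitalization counts as a mismatch
    return [entry for entry in toc_entries if entry not in heads]


if __name__ == "__main__":
    toc = ["chapter 1", "chapter 2.", "chapter 3"]
    heads = ["chapter 1", "chapter 2", "chapter 3"]
    for entry in toc_mismatches(toc, heads):
        print("no matching chapter-head for:", repr(entry))
```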

but be conservative in making these changes,
because you might change the book's flavor...

for instance, in the early days, p.g. would
strip out the pagenumbers from an index,
"because an e-book doesn't have pages"...

well, that's stupid, because those numbers
didn't _hurt_ anything, and they might have
proven to be quite useful down the line, so
they _should've_ been left in, "just in case".

don't make that kind of mistake.

while you are doing your spellcheck, it is
very important that you _add_new_words_
to your spellcheck dictionary -- such as
character names, place names, and so on.

not only does this mean you only have to
deal with them _once_, it also means that 
when a similar-looking variant comes up,
it's probable that it's gonna be a "scanno".

adding words also means you can re-do
the spellcheck after any/all future edits,
with a result that's both quick and clean.
that helps catch any errors that you have
inadvertently introduced during editing,
something which happens to everybody...
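
if you'd rather script this step than rely on your
wordprocessor, here is a minimal sketch in python.
it assumes you have a base word-list plus your own
list of words added for this book.   the names
check_words, base_words, and book_words are all
hypothetical, and the 0.75 similarity cutoff is
just a starting guess that you would need to tune.

```python
import difflib
import re


def check_words(text, base_words, book_words):
    """return (unknown, likely_scannos) for the words found in text.

    base_words: the ordinary dictionary words
    book_words: names and places you have added for this book
    """
    book_lookup = {w.lower(): w for w in book_words}
    known = {w.lower() for w in base_words} | set(book_lookup)
    unknown = []
    likely_scannos = {}
    for word in re.findall(r"[A-Za-z']+", text):
        low = word.lower()
        if low in known or word in unknown or word in likely_scannos:
            continue
        # a near-miss of a word we added ourselves is probably a scanno
        close = difflib.get_close_matches(
            low, list(book_lookup), n=1, cutoff=0.75
        )
        if close:
            likely_scannos[word] = book_lookup[close[0]]
        else:
            unknown.append(word)
    return unknown, likely_scannos


if __name__ == "__main__":
    base = {"the", "met", "at", "and"}
    book = {"Darcy", "Pemberley"}
    sample = "Darcy met Dancy at Pemberley and Pernberley"
    unknown, scannos = check_words(sample, base, book)
    print("unknown:", unknown)
    print("likely scannos:", scannos)
```

since the added words live in their own list, you
can run the same check again after every round of
edits, and only the new problems will show up.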

_after_ spellcheck, depending on the text
you have downloaded, you might have to
_restore_any_text_styling_, such as italics.

if o.c.r. output gets saved as "plain-text",
it loses its formatting, including _italics_.
so you need to restore it.   in most books,
there aren't many italicized words, so this
is a manageable task, albeit very mundane.

if your book _does_ have lots of italics --
and you should check this at the outset --
you might want to re-do the o.c.r. for it...
specify that the output be saved as .rtf...
and then use your wordprocessor to place
_markers_ around all the italicized words.
you can do this by using a global change.
you can use underscores as markers, or
the bracket-i/bracket-slash-i html tags,
or another convention you might invent.
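
whichever markers you pick, it's easy to switch
between them later.   as a rough sketch in python,
assuming the italics have already been marked with
<i> and </i> tags, this counts the italic runs and
converts them to underscores.   the function names
are hypothetical, and this does _not_ read .rtf;
it only handles text that is already marked up.

```python
import re

# matches one italic run; DOTALL lets a run span a line-break
ITALIC_TAG = re.compile(r"<i>(.*?)</i>", re.IGNORECASE | re.DOTALL)


def count_italic_runs(text):
    """count the tagged italic runs, to judge how much work there is."""
    return len(ITALIC_TAG.findall(text))


def tags_to_underscores(text):
    """replace each <i>...</i> pair with _..._ markers."""
    return ITALIC_TAG.sub(lambda m: "_" + m.group(1) + "_", text)


if __name__ == "__main__":
    sample = "she said <i>no</i> twice, and <I>meant</I> it."
    print(count_italic_runs(sample))
    print(tags_to_underscores(sample))
```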

you'll want to mark any _bold_ too,
but most books contain very little bold,
except in their headers, which we will be
marking separately in a later process, so
you don't need to worry about them now.

in this vein, note that what you're marking
right now are _individual_ italicized words
or phrases.   if a whole paragraph is in italics,
or in bold, make a note about it for the future.

the other thing you'll have to do, again in
your wordprocessor, is to "close up" any
"spacey punctuation".   for the most part,
these are double-quotemarks (and periods
and commas) with a space on _both_ sides,
so it's rather easy to do a search for them.
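
if you want to do that search outside of your
wordprocessor, here's a minimal sketch in python.
the function names are hypothetical.   note the
assumption built into it:  a stranded period or
comma gets closed up automatically, but a quotemark
with a space on both sides is only _reported_, since
the code can't tell whether it opens or closes the
quote.   you decide that by looking at the scan.

```python
import re

# a straight double-quote with whitespace on both sides
SPACEY_QUOTE = re.compile(r'(?<=\s)"(?=\s)')
# spaces between a word and a period or comma that ends the clause;
# an ellipsis like "word ..." is left alone
SPACE_BEFORE_STOP = re.compile(r"(?<=\w) +([.,])(?=\s|$)")


def find_spacey_quotes(text):
    """return the character offsets of quotemarks needing a manual look."""
    return [m.start() for m in SPACEY_QUOTE.finditer(text)]


def close_up_stops(text):
    """remove the stray space before a period or comma."""
    return SPACE_BEFORE_STOP.sub(r"\1", text)


if __name__ == "__main__":
    sample = 'she said " hello there " and left , then returned .'
    print(find_spacey_quotes(sample))
    print(close_up_stops(sample))
```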

so the first tool you'll use for digitization is
your old friend, your trusty wordprocessor...

-bowerbird

p.s.   if you have a book you'd like me to use
as an example during this series, suggest it.

Bowerbird＠aol.com