the thrust of this thread, once again, is how to digitize a book.
i went through all the steps, and will regurgitate them for you.
i used the book "books and culture", by mabie, which you can
locate by searching archive.org for "booksculture00mabiuoft".
as i said in my last post, the o.c.r. was clean, which is typical,
with around 97% of the body-text lines recognized correctly,
at least _after_ i'd run my automatic cleaning-routines on it...
people who've been here for a long time will recognize that
this is a finding that i have obtained over and over and over.
most of the time, you'll get errors in roughly 3% of the lines.
that depends, of course, on the quality of the printed book,
how much care you take in the scanning process, and so on.
but another big variable is how much _cleaning_ you can do.
***
the first type of cleaning is something i call "auto-cleaning".
when i say "auto-cleaning", i'm talking about global changes
that can be made to the text _without_any_ manual approval.
("manual approval" is where you accept/reject each change.)
an example of such auto-cleaning would be a global change
of all instances of a space-comma (" ,") to just a comma (",").
in old books, o.c.r. often puts a space in front of punctuation.
the punctuation characters were actually _printed_ with some
extra space in front of 'em, at least according to the standard
of modern-day typesetting, so it isn't really an o.c.r. "error"...
but it's something we'll want to replace in our digitized text,
so it's a global change we can make without manual approval.
if you can do a lot of auto-cleaning, you can drop the errors
down to the 1% range... if you can't, it might float up to 5%...
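here's a minimal sketch of that space-comma change in python,
just to make it concrete. the function name is made up for this
example, and collapsing a _run_ of spaces before the comma (not
just a single space) is my assumption, not a hard requirement.

```python
import re

def close_up_space_comma(text):
    """Remove spaces that sit directly in front of a comma.

    Hypothetical helper for illustration. Returns the cleaned text
    and the number of places that were changed, so the count can be
    logged even though no change is approved by hand.
    """
    # one or more spaces followed by a comma becomes just the comma
    return re.subn(r" +,", ",", text)

cleaned, count = close_up_space_comma("thrift of time , which brings")
print(cleaned, count)
```

the count is there so you still know how much the routine did,
even though nobody looked at the individual changes.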
***
one of the risks that you take with auto-cleaning is that you
aren't actually looking at the changes that you're making, so
you _might_ change something that's right to something wrong.
is it possible that auto-changes might _introduce_ an error?
sure it's _possible_. but with some changes, it's so unlikely
we don't need to worry about it. i certainly don't feel i must
"manually" approve every change of space-comma to comma.
if such an auto-change does introduce an error, i'll just trust
that a reader from the public will find the error and report it.
now, obviously, if the auto-cleaning fixes 998 errors but also
_introduces_ 2 new errors, that's probably "worth it" anyway,
as it's easier to clean the 2 new errors than the 998 old ones.
but it helps to know what kind of errors you're introducing,
if/when you are, so i always like to review the results of any
new routine that i might be considering for auto-cleaning...
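one way to do that kind of review, sketched in python: run the
candidate routine, then list every line it touched, old next to
new. the function name is hypothetical, and the sketch assumes
the routine never adds or removes lines (it raises if it does).

```python
def changed_lines(before, after):
    """List (line_number, old_line, new_line) for every changed line.

    Hypothetical review helper: assumes the cleaning routine edits
    lines in place and never changes the number of lines.
    """
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    if len(old_lines) != len(new_lines):
        raise ValueError("routine changed the number of lines")
    return [
        (number, old, new)
        for number, (old, new) in enumerate(zip(old_lines, new_lines), start=1)
        if old != new
    ]

before = "time , which\nno change here\n"
after = before.replace(" ,", ",")
for number, old, new in changed_lines(before, after):
    print(number, repr(old), "->", repr(new))
```

skimming a report like this once, for a new routine, is a lot
cheaper than approving every change by hand forever after.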
***
another example of an auto-clean routine is to correct the
"floating-doublequote" when it happens at the start of a line.
o.c.r. often has trouble with doublequotes in older books,
putting an extra space in front of them or behind them...
(in real life, with modern-day conventions, a doublequote
has a space _either_ in front of it _or_ behind, or _neither_
if parentheses or brackets are involved, but _never_ both.)
a floating-doublequote in the middle of a line is ambiguous,
since the extra space might've been introduced in front of it,
or might've been introduced behind it. but at the beginning
of a line, the extra space was obviously introduced behind it,
so can be removed with a no-look global search-and-replace.
some line-starting floating-doublequotes in the mabie o.c.r.:
> " thrift of time," which brings ripe-
> " Chronicles " and North's trans-
> " Still studying Dante ? " said the
> " Divine Comedy " or in Goethe's
> " Faust " for the first time discovers
> " Master and Man," is one of those
> " Odyssey " are of more importance
> " In any museum," says Mr. La
> " Anna Karenina " leaves no reader
likewise, floating-doublequotes at the _end_ of a line are also
candidates for a no-look auto-cleaning search-and-replace...
again, the 3 examples from the current book:
> guidance ? "
> they make. The student of" Faust "
> No man can read "In Memoriam "
as that first example shows us, question-mark characters are
also often "floating", and are good candidates for auto-cleaning.
(likewise with exclamation-points, semicolons, and colons...)
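the line-start, line-end, and floating-punctuation fixes could
be sketched in python like so. the names are invented for this
example, and it assumes plain straight doublequotes and no
trailing whitespace on the lines. mid-line floating-doublequotes
are left alone on purpose, since they're the ambiguous ones.

```python
import re

# a doublequote at the start of a line, followed by space(s)
LINE_START_QUOTE = re.compile(r'^" +', re.MULTILINE)
# space(s) followed by a doublequote at the very end of a line
LINE_END_QUOTE = re.compile(r' +"$', re.MULTILINE)
# space(s) in front of a question-mark, exclamation-point,
# semicolon, or colon
FLOATING_PUNCT = re.compile(r' +([?!;:])')

def fix_floaters(text):
    """Hypothetical no-look cleanup for the unambiguous floaters."""
    text = LINE_START_QUOTE.sub('"', text)
    text = LINE_END_QUOTE.sub('"', text)
    return FLOATING_PUNCT.sub(r'\1', text)

sample = (
    '" Still studying Dante ? " said the\n'
    'guidance ? "\n'
    'No man can read "In Memoriam "'
)
print(fix_floaters(sample))
```

on that sample, the doublequote before "said" keeps its extra
space, because the sketch only touches the start and end of lines.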
***
auto-cleaning also brings up some "philosophical" questions
about the digitization process... for instance, i close up any
spacey-ellipses (". . .") so they are tight ("..."), _plus_ i globally
change all 4-dot ellipses to 3-dotters (wanna fight about it?);
i also merge each ellipsis with the word that it follows (usually)
or that it precedes (when the ellipsis is at the start of the line).
i always change "to-day", "to-night", and "to-morrow" so that
they lose their unnecessary hyphen. these are "in-house styles",
so i don't expect you to agree, and don't care if you disagree,
so it really won't be productive to discuss this kind of issue...
you don't have to do all my auto-changes if you don't want to.
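for anyone who does want them, here's a rough python sketch of
those house-style changes. the function name is made up, the
order of the steps is my own choice, and it only handles the
lowercase and capitalized spellings ("to-day", "To-day"), not
all-caps. treat it as an illustration, not the actual routine.

```python
import re

def apply_house_style(text):
    """Hypothetical sketch of the in-house ellipsis and to-day rules."""
    # close up spacey ellipses: ". . ." -> "...", ". . . ." -> "...."
    text = re.sub(r"\.(?: \.){2,}",
                  lambda m: m.group(0).replace(" ", ""), text)
    # change exactly-4-dot ellipses to 3-dotters
    text = re.sub(r"(?<!\.)\.{4}(?!\.)", "...", text)
    # at the start of a line, merge the ellipsis with the next word
    text = re.sub(r"^\.\.\. +", "...", text, flags=re.MULTILINE)
    # elsewhere, merge the ellipsis with the word it follows
    text = re.sub(r"(?<=\S) +\.\.\.", "...", text)
    # drop the hyphen from to-day, to-night, to-morrow
    text = re.sub(r"\b([Tt]o)-(day|night|morrow)\b", r"\1\2", text)
    return text

print(apply_house_style("wait . . . and see"))
print(apply_house_style(". . . and then to-morrow"))
```

if you disagree with any one of these, just delete that line.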
***
so that takes care of auto-cleaning...
there are other types of cleaning, and i'll talk about them in
our next lesson...
-bowerbird