[gutvol-d] http://edwardbetts.com/correct

28 Nov 2011

Some of my impressions of Edward's tool. First, it is great. That
said, here are some shortcomings that I noticed.

Editing word by word is often insufficient: there is no way to rejoin
words into which a space has been erroneously inserted (this is frequent
e.g. when apostrophes are involved, but not only then), or to remove
spaces between words and punctuation (e.g. spaces before a comma, which
depend on the line justification).
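
To make the kind of fix concrete, here is a minimal sketch of how such
spacing repairs could be proposed to the corrector. It is only an
illustration: the function name tidy_spacing and the two rules are my
own assumptions, not something Edward's tool provides, and the rules
are certainly language dependent (French puts a space before ; : ! ?,
and an English plural possessive such as "dogs' tails" must keep its
space), so the result should be offered as a suggestion, not applied
blindly.

```python
import re

def tidy_spacing(line: str) -> str:
    """Suggest a version of `line` without stray OCR spaces.

    Hypothetical helper: handles only the two cases named in the text.
    """
    # Case 1: space(s) before punctuation, e.g. "word ," -> "word,"
    line = re.sub(r"\s+([,;:.!?])", r"\1", line)
    # Case 2: space(s) after an apostrophe that follows a letter,
    # e.g. "l' uomo" -> "l'uomo"
    line = re.sub(r"(\w')\s+(\w)", r"\1\2", line)
    return line

print(tidy_spacing("l' uomo , che"))
```

Since both rules only delete spaces, the suggestion is easy to check
by eye against the image of the line.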

Sometimes, especially in books set in a smaller font, the text display
font is too large, and the text part is readable only with difficulty.
See e.g. http://edwardbetts.com/correct/leaf/artofbook00holm/15. For
example, see the lines ABCD....: in one line the J is a word
separated from the others; in another the whole alphabet is one word.
This depends on the kerning of the different fonts.

The use of sans-serif proportional fonts gravely degrades the
visibility of some kinds of recognition errors (I vs. l, i.e. uppercase
i vs. lowercase L; ri vs. n; etc.), especially when the font is too
large and the letters overlap.

I would suggest displaying and editing line by line, with a fixed-width
font. Moreover, one should show the difference between a soft and a
hard hyphen (a distinction on which the OCR is often hopeless, as is
a corrector who sees only one line or one page: is it to-day or today
once the lines are rejoined?).
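
The hyphen question can be sketched as follows. This is only an
illustration under my own assumptions: the function rejoin and the
word list known_words are hypothetical, and a real word list would
have to come from the book itself or from a dictionary of the right
language and period. The point is that when both spellings (or
neither) are known, the case cannot be decided automatically and
should be flagged for a human who sees both lines.

```python
from typing import Set, Tuple

def rejoin(first: str, second: str, known_words: Set[str]) -> Tuple[str, bool]:
    """Join a word split by a hyphen at the end of a line.

    `first` is the fragment ending the line (with its hyphen),
    `second` the fragment starting the next line.
    Returns (joined word, certain); certain is False when the word
    list cannot tell a soft hyphen from a hard one.
    """
    head = first.rstrip("-")
    soft = head + second          # hyphen was only a line break
    hard = head + "-" + second    # hyphen belongs to the word
    soft_ok = soft.lower() in known_words
    hard_ok = hard.lower() in known_words
    if soft_ok and not hard_ok:
        return soft, True
    if hard_ok and not soft_ok:
        return hard, True
    # Both or neither known: keep the hyphen visible and flag it.
    return hard, False

print(rejoin("to-", "day", {"today", "to-day"}))
```

With a word list containing both "today" and "to-day" the result is
flagged as uncertain, which is exactly the case where the corrector
needs to see more than one line.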

A problem might arise when the OCR has given up on part of a page:
one finds relatively often lines missing altogether or, for example,
a missing O (uppercase oh) word (this happens in Italian, where "O"
means "Or"). A missing word might be easy to fix with line editing,
but a missing line is harder. Since the image is sliced, and the slices
do not cover the whole original page, it may even happen that a part is
missed completely. This is frequent enough with the page headers or the
page signature.
See http://edwardbetts.com/correct/leaf/ilcavalieredello00vero/8 vs 9.

Reading a line of text in a page, I tend to associate it with the image
immediately below, which of course doesn't match. Even when I correctly
focus on the pair of matching lines, I mostly read the first of
the two (the image) and find it hard to focus on the text (the second
of the matching lines). I wonder how it would be to have the line of
text first, then the matching line of image.

Carlo

traverso＠posso.dm.unipi.it