Re: [gutvol-d] http://edwardbetts.com/correct

30 Dec 2011

      On 2011-11-28 04:15, Carlo Traverso wrote:
...
Some of my impressions on Edward's tool. First, it is great. That
said, here are some of the shortcomings that I noticed.
Editing word by word is often insufficient; there is no way to rejoin
words in which a space has been erroneously inserted (this is frequent
e.g. when apostrophes are involved, but not only then) or to remove
spaces between words and punctuation (e.g. spaces before a comma, which
depend on the line justification).
Agreed. It needs a join word feature.
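
A minimal sketch of what a join feature could look like, assuming a
line is held as a plain list of word strings. The function names and
the data model are hypothetical and are not taken from the tool itself.

```python
import re


def join_words(words, index):
    """Merge words[index] with the word after it; return a new list."""
    if not 0 <= index < len(words) - 1:
        raise IndexError("no following word to join with")
    merged = words[index] + words[index + 1]
    return words[:index] + [merged] + words[index + 2:]


def strip_space_before_punctuation(line):
    """Remove whitespace that precedes , ; : . ! or ? in a line of text."""
    # Assumption: the language does not put a space before punctuation
    # (French typography, for instance, would need a different rule).
    return re.sub(r"\s+([,;:.!?])", r"\1", line)
```

For example, join_words(["l'", "arte"], 0) gives ["l'arte"], and
strip_space_before_punctuation("word , next") gives "word, next".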
...
Sometimes, especially in books set in a smaller font, the text display
font is too large, and the text part is readable only with difficulty.
See e.g. http://edwardbetts.com/correct/leaf/artofbook00holm/15 and
look at the lines ABCD....; in one line the J is a word,
separated from the others; in another the whole alphabet is one word:
this depends on the kerning of the different fonts.
The placing of the characters is naive; I can think of ways to improve it.
...
The use of sans-serif proportional fonts gravely degrades the
visibility of some kinds of recognition errors (I and l, i.e. uppercase
i vs. lowercase L; ri vs. n; etc.), especially when the font is too
large and the letters overlap.
Good point, I can switch to serif.
...
I would suggest displaying and editing line by line, with a fixed-width
font. Moreover, one should show the difference between a soft and a
hard hyphen (this is a distinction on which the OCR is often hopeless,
as is a corrector who sees only one line or one page: is it to-day or
today once the lines are rejoined?)
I'm not sure about your argument for a fixed-width font. You're right 
about hyphens.
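
To illustrate the hyphen problem, here is a rough sketch of how a
line-end hyphen might be classified once the two halves of a word are
known. It assumes a word list (known_words) is available; that list and
the function name are hypothetical and not part of the tool. Cases like
to-day vs. today, where both spellings exist, are reported as ambiguous
and left for a human to decide.

```python
def resolve_line_end_hyphen(left, right, known_words):
    """Classify the hyphen between a line-final and a line-initial fragment.

    Returns (text, kind), where kind is "soft", "hard" or "ambiguous".
    known_words is assumed to be a set of lowercase words.
    """
    joined = left + right
    hyphenated = left + "-" + right
    soft_ok = joined.lower() in known_words
    hard_ok = hyphenated.lower() in known_words
    if soft_ok and not hard_ok:
        return joined, "soft"        # hyphen only marked the line break
    if hard_ok and not soft_ok:
        return hyphenated, "hard"    # hyphen belongs to the word
    # Both or neither spelling is known: keep the hyphen and flag it.
    return hyphenated, "ambiguous"


known = {"today", "to-day", "correction", "well-known"}
print(resolve_line_end_hyphen("correc", "tion", known))
print(resolve_line_end_hyphen("well", "known", known))
print(resolve_line_end_hyphen("to", "day", known))
```

Keeping the hyphen in the ambiguous case is a deliberate choice in this
sketch: it preserves what is printed on the page until someone checks it.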
...
A problem might arise when the OCR has given up on a part of a page:
one finds relatively often lines missing altogether, or, for example,
an O (uppercase oh) word missing (this happens in Italian, where "O"
means "or"). This might be easy to fix with line editing, but a missing
line is harder. Since the image is sliced, and the slices do not cover
the original page, it may even happen that a part is missed completely.
This is frequent enough with the page headers or the page signature.
See http://edwardbetts.com/correct/leaf/ilcavalieredello00vero/8 vs 9.
Agreed. This is a problem with my model.
...
Reading a line of text in a page, I tend to associate it with the image
immediately below, which of course doesn't match. When I correctly
focus on the pair of matching lines, I mostly read the first of the
two and find it hard to focus on the text (the second of the matching
lines). I wonder how it would be to have the line of text first, then
the matching line of image.
I could switch the order of text and images.

Thanks for the feedback.

-- 
Edward.

Edward Betts