
On 2011-11-28 04:15, Carlo Traverso wrote:
Some of my impressions on Edward's tool. First, it is great. That said, here are some of the shortcomings that I noticed.
Editing word by word is often insufficient; there is no way to rejoin words in which a space has been erroneously inserted (this is frequent e.g. when apostrophes are involved, but not only then), or to remove spaces between words and punctuation (e.g. spaces before a comma, which depend on the line justification).
Agreed. It needs a join word feature.
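A minimal sketch of what a join-word operation might look like, assuming each OCR word is stored as its text plus a bounding box on the page image. The `Word` class and `join_words` function are hypothetical names used for illustration; they are not taken from the tool itself.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical record: OCR text plus its bounding box in image pixels.
    text: str
    left: int
    top: int
    right: int
    bottom: int


def join_words(words, i):
    """Return a new list in which words[i] and words[i + 1] are merged."""
    if not 0 <= i < len(words) - 1:
        raise IndexError("no following word to join with")
    a, b = words[i], words[i + 1]
    merged = Word(
        text=a.text + b.text,          # drop the erroneous space
        left=min(a.left, b.left),      # the new box is the union of both boxes
        top=min(a.top, b.top),
        right=max(a.right, b.right),
        bottom=max(a.bottom, b.bottom),
    )
    return words[:i] + [merged] + words[i + 2:]


line = [Word("l'", 10, 5, 20, 15), Word("amico", 24, 6, 60, 16), Word(",", 63, 12, 65, 17)]
line = join_words(line, 0)   # rejoin a word split after an apostrophe
line = join_words(line, 0)   # remove the space before the comma
print(line[0].text)
```

The same operation covers both cases raised above: a word split at an apostrophe and a stray space before punctuation.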
Sometimes, especially in books with a smaller font, the text display font is too large, and the text part is readable only with difficulty (see e.g. http://edwardbetts.com/correct/leaf/artofbook00holm/15). For example, look at the lines ABCD....: in one line the J is a word, separated from the others; in another the whole alphabet is one word. This depends on the kerning of the different fonts.
The placing of the characters is naive; I can think of ways to improve it.
The use of sans-serif proportional fonts gravely degrades the visibility of some kinds of recognition errors (uppercase i vs. lowercase L, ri vs. n, etc.), especially when the font is too large and the letters fall one on top of the other.
Good point, I can switch to serif.
I would suggest displaying and editing line by line, with a fixed-width font. Moreover, one should show the difference between a soft and a hard hyphen (this is a distinction on which the OCR is often hopeless, as is a corrector who sees only one line or one page: is it to-day or today once the lines are rejoined?)
I'm not sure about your argument for a fixed-width font. You're right about hyphens.
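One possible way to guess whether a line-end hyphen is soft or hard is to check which form of the word appears elsewhere in the same book. The sketch below is only a heuristic under that assumption; `word_counts` and `classify_hyphen` are hypothetical helpers, and the counts are assumed to come from text that does not itself straddle a line break.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word forms, keeping internal hyphens (e.g. 'to-day')."""
    return Counter(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))


def classify_hyphen(head, tail, counts):
    """Classify the hyphen in 'head-' + line break + 'tail'.

    Returns 'soft' if the joined form is more common in the book,
    'hard' if the hyphenated form is more common, else 'unknown'.
    """
    joined = (head + tail).lower()
    hyphenated = (head + "-" + tail).lower()
    if counts[joined] > counts[hyphenated]:
        return "soft"
    if counts[hyphenated] > counts[joined]:
        return "hard"
    return "unknown"


book = "We met to-day and again to-day. The correction was made."
counts = word_counts(book)
print(classify_hyphen("to", "day", counts))
print(classify_hyphen("correc", "tion", counts))
```

Cases that come back as "unknown" would still need a human decision, so the display would have to mark them rather than silently pick one form.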
A problem might arise when the OCR has given up on a part of a page: one relatively often finds lines missing altogether, or, for example, an O (uppercase oh) word missing (this happens in Italian, where "O" means "Or"). A missing word might be easy to fix with line editing, but a missing line is harder. Since the image is sliced, and the slices do not cover the original page, it may even happen that a part is missed completely. This is frequent enough with the page headers or the page signature. See http://edwardbetts.com/correct/leaf/ilcavalieredello00vero/8 vs 9.
Agreed. This is a problem with my model.
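As an illustration of how uncovered parts of a page might be detected, the sketch below assumes each image slice is known as a (top, bottom) pair of pixel rows. The function name `uncovered_bands` and the `min_gap` threshold are hypothetical, and a reported band is only a candidate: it may be an empty margin rather than missed text.

```python
def uncovered_bands(page_height, slices, min_gap=1):
    """Return (top, bottom) bands of the page not covered by any slice.

    slices is a list of (top, bottom) pixel rows, one pair per line image.
    Bands shorter than min_gap rows are ignored.
    """
    gaps = []
    cursor = 0  # lowest row covered so far
    for top, bottom in sorted(slices):
        if top - cursor >= min_gap:
            gaps.append((cursor, top))
        cursor = max(cursor, bottom)
    if page_height - cursor >= min_gap:
        gaps.append((cursor, page_height))
    return gaps


# A 1000-row page with three line slices; bands under 20 rows are ignored.
print(uncovered_bands(1000, [(120, 160), (160, 200), (260, 300)], min_gap=20))
```

Showing these bands as extra image strips would at least let a corrector see a skipped header, signature or line and add the missing text.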
Reading a line of text in a page, I tend to associate it with the image immediately below, which of course doesn't match. When I do focus on the correct pair of matching lines, I mostly read the first of the two and find it hard to focus on the text (the second of the matching lines). I wonder how it would be to have the line of text first, then the matching line of image.
I could switch the order of text and images. Thanks for the feedback. -- Edward.