
Gardner Buchanan wrote:
On 18-Feb-2010 21:21, Bowerbird@aol.com wrote:
it's usually the case that making those checks and fixes useful in the general case, against any random book, is a more difficult matter.
The tragic bb in a nutshell. He gets one easy text, then builds a "program" that finds the bugs in that one easy text and proclaims it the ultimate fixing tool. ... Everybody laughs. ... BB waits one year. ... Repeat. To build a useful tool you have to:

1. Get two random samples of scans, say two sets of 100 complete book scans, using different scan techniques and different OCR on books of different ages and provenance. You could get those out of Google or IA.
2. Build a bug list of those OCRed texts against proven good copies.
3. Build a program using the texts and error lists of the first group. You are not allowed to look at the second group's texts.
4. Run the program against your blind group and record the percentage of positives and negatives it finds.
5. Run any known tools against the blind group and see whether yours performs significantly better.
6. If better, then brag; else shut up.
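Step 4 is just measuring error counts before and after the fixer runs, on texts the fixer's author never saw. Below is a minimal sketch of that scoring in Python, under some simplifying assumptions: the fix list, the example sentences, and the word-level diff are placeholders of my own, not a real benchmark or anyone's actual tool.

    import re
    from difflib import SequenceMatcher

    # Hypothetical fix list -- in a real run this comes from step 3,
    # built only on the training group, never on the blind group.
    FIXES = [
        (re.compile(r'\b1(?=[a-z])'), 'l'),          # leading digit 1 misread for letter l
        (re.compile(r'(?<=[a-z])0(?=[a-z])'), 'o'),  # digit 0 sandwiched between letters
    ]

    def apply_fixes(text):
        for pattern, repl in FIXES:
            text = pattern.sub(repl, text)
        return text

    def word_errors(text, good):
        """Count words in `text` that disagree with the vetted copy `good`."""
        a, b = text.split(), good.split()
        ops = SequenceMatcher(None, a, b).get_opcodes()
        return sum(i2 - i1 for tag, i1, i2, _, _ in ops if tag != 'equal')

    def score(ocr_text, good_text):
        before = word_errors(ocr_text, good_text)
        after = word_errors(apply_fixes(ocr_text), good_text)
        return before, after   # a useful tool drives 'after' well below 'before'

    if __name__ == '__main__':
        ocr  = "The 1ittle b0at drifted d0wn the river."
        good = "The little boat drifted down the river."
        print(score(ocr, good))   # -> (3, 0) on this toy pair

Run that over the whole blind group and you have the numbers step 5 asks you to compare against existing tools; a fixer that only lowers "after" on its own training text is exactly the failure mode being described.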
That's kind of my experience, I guess. Several fixes will suggest themselves in the context of a given specific text, and the next text might need a different set. But that doesn't mean a long list of fixups can't be tried anyway, since there's no cost to just adding tests/fixes to the list.
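That "no cost" point is easy to demonstrate: keep one growing list of checks, where each entry is either a safe automatic substitution or a pattern that only flags text for a human, and entries that never match a given book simply do nothing. A rough Python sketch, with made-up patterns purely for illustration:

    import re

    CHECKS = [
        # (pattern, replacement or None, description)
        (re.compile(r'\btbe\b'), 'the', 'common "the" misread'),
        (re.compile(r' ,'), ',', 'space before comma'),
        (re.compile(r'\b\w*[0-9]\w*[a-z]\w*\b'), None, 'digit mixed into a word'),
    ]

    def run_checks(text):
        flagged = []
        for pattern, repl, why in CHECKS:
            if repl is not None:
                text = pattern.sub(repl, text)          # safe automatic fix
            else:                                       # flag-only check for a human
                flagged += [(m.group(0), why) for m in pattern.finditer(text)]
        return text, flagged

    fixed, flagged = run_checks("He bought tbe b0at , cash down.")
    print(fixed)    # -> He bought the b0at, cash down.
    print(flagged)  # -> [('b0at', 'digit mixed into a word')]

Checks run in list order, so safe fixes placed earlier clean the text before the flag-only patterns see it; an entry that never fires costs nothing, which is the point above.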
If you have to enter the regexes manually, you should use any editor that supports them.

-- 
Marcello Perathoner
webmaster@gutenberg.org