
Gardner Buchanan wrote:
On 18-Feb-2010 21:21, Bowerbird@aol.com wrote:
it's usually the case that making those checks and fixes useful in the general case, against any random book, is a more difficult matter.
The tragic bb in a nutshell. He gets one easy text, then builds a "program" that finds the bugs in that one easy text and proclaims it the ultimate fixing tool. ... Everybody laughs. ... BB waits one year. ... Repeat. To build a useful tool you have to:

1. Get two random samples of scans, say two sets of 100 complete book scans, using different scan techniques and different OCR on books of different ages and provenance. You could get those out of Google or IA.
2. Build a bug list of those OCRed texts against proven good copies.
3. Build a program using the texts and error lists of the first group. You are not allowed to look at the second group's texts.
4. Run the program against your blind group and record the percentage of positives and negatives it finds.
5. Run any known tools against the blind group and see whether yours performs significantly better.
6. If better, then brag; else shut up.
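Step 4 is just measuring error counts before and after the fixer runs, on texts the fixer's author never saw. Below is a minimal sketch of that scoring in Python, under some simplifying assumptions: the fix list, the example sentences, and the word-level diff are placeholders of my own, not a real benchmark or anyone's actual tool.

    import re
    from difflib import SequenceMatcher

    # Hypothetical fix list -- in a real run this comes from step 3,
    # built only on the training group, never on the blind group.
    FIXES = [
        (re.compile(r'\b1(?=[a-z])'), 'l'),          # leading digit 1 misread for letter l
        (re.compile(r'(?<=[a-z])0(?=[a-z])'), 'o'),  # digit 0 sandwiched between letters
    ]

    def apply_fixes(text):
        for pattern, repl in FIXES:
            text = pattern.sub(repl, text)
        return text

    def word_errors(text, good):
        """Count words in `text` that disagree with the vetted copy `good`."""
        a, b = text.split(), good.split()
        ops = SequenceMatcher(None, a, b).get_opcodes()
        return sum(i2 - i1 for tag, i1, i2, _, _ in ops if tag != 'equal')

    def score(ocr_text, good_text):
        before = word_errors(ocr_text, good_text)
        after = word_errors(apply_fixes(ocr_text), good_text)
        return before, after   # a useful tool drives 'after' well below 'before'

    if __name__ == '__main__':
        ocr  = "The 1ittle b0at drifted d0wn the river."
        good = "The little boat drifted down the river."
        print(score(ocr, good))   # -> (3, 0) on this toy pair

Run that over the whole blind group and you have the numbers step 5 asks you to compare against existing tools; a fixer that only lowers "after" on its own training text is exactly the failure mode being described.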
That's kind of my experience, I guess. Several fixes will suggest themselves in the context of a given specific text, and the next text might need a different set. But that doesn't mean a long list of fixups can't be tried anyway, since there's no cost to just adding tests/fixes to the list.
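That "no cost" point is easy to demonstrate: keep one growing list of checks, where each entry is either a safe automatic substitution or a pattern that only flags text for a human, and entries that never match a given book simply do nothing. A rough Python sketch, with made-up patterns purely for illustration:

    import re

    CHECKS = [
        # (pattern, replacement or None, description)
        (re.compile(r'\btbe\b'), 'the', 'common "the" misread'),
        (re.compile(r' ,'), ',', 'space before comma'),
        (re.compile(r'\b\w*[0-9]\w*[a-z]\w*\b'), None, 'digit mixed into a word'),
    ]

    def run_checks(text):
        flagged = []
        for pattern, repl, why in CHECKS:
            if repl is not None:
                text = pattern.sub(repl, text)          # safe automatic fix
            else:                                       # flag-only check for a human
                flagged += [(m.group(0), why) for m in pattern.finditer(text)]
        return text, flagged

    fixed, flagged = run_checks("He bought tbe b0at , cash down.")
    print(fixed)    # -> He bought the b0at, cash down.
    print(flagged)  # -> [('b0at', 'digit mixed into a word')]

Checks run in list order, so safe fixes placed earlier clean the text before the flag-only patterns see it; an entry that never fires costs nothing, which is the point above.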
If you have to enter the regexes manually, you should use any editor that supports them.

-- 
Marcello Perathoner
webmaster@gutenberg.org