
On Fri, Sep 28, 2012 at 8:33 AM, James Adcock <jimad@msn.com> wrote:
These appends seem to me to imply that diffing parallel/independently produced texts is a faster and more efficient way to correct the OCR process than sequential manual checking.
Independent scans and OCR with independent programs are neither fast nor efficient. Nor can you simply count hours; the whole point of DP and similar projects is that it's easier to get many hours from many people than to get one volunteer to put in fewer hours.
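For concreteness, the diffing approach under discussion amounts to something like the following minimal sketch. It assumes two plain-text OCR outputs of the same page and uses Python's standard `difflib`; the function name `disagreements` and the sample strings are made up for illustration and are not anyone's actual workflow.

```python
import difflib

def disagreements(text_a, text_b):
    """Return (words_from_a, words_from_b) pairs where two transcriptions differ."""
    words_a = text_a.split()
    words_b = text_b.split()
    # Compare word by word; autojunk=False keeps frequent words from being ignored.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            flagged.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return flagged

# Hypothetical outputs from two independent OCR runs of the same line.
ocr_a = "The quick brown fox jumps over the lazy dog"
ocr_b = "The quick brovvn fox jumps over the 1azy dog"

for a, b in disagreements(ocr_a, ocr_b):
    print(f"{a!r} vs {b!r}")
```

Note that a diff like this only flags places where the two outputs disagree. An error that both runs make in the same way passes through unflagged, and every flagged spot still needs a human to check it against the page image.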
And the results are more accurate. The DP approach does not lead to particularly accurate texts.
Right, whatever. Instead of sneering, how about some actual evidence and numbers? Show me texts, and let's see what you consider "not particularly accurate".
PPPS: And neither Unicode nor HTML gives us good tools to transcribe what one actually finds in historical books in the first place.
Unicode doesn't? Whatever. Unless you're talking manuscripts and EETS books, it's pretty solid.
-- 
Where there is life, there is hope.