
26 Oct 2011, 6:27 p.m.
On 24 October 2011 23:53, <Bowerbird@aol.com> wrote:
http://www.archive.org/stream/artofbook00holm#page/n12/mode/1up
i think we can agree that that's a whole lot of mud that we need to scrape off a page that has 2 words.
Sure - if you're only interested in the text content, it's quite useless.

It is useful for OCR research to have that data, so I'm glad they provide it - not as useful as corrected text, granted, but I think the clearest example of the value of such data is reCAPTCHA, which (in part of its operation) compares the output of two OCR systems, and extracts images from the coordinates where they disagree.

-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
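To make the reCAPTCHA point concrete, here is a rough sketch of comparing two OCR outputs by coordinates. It is only an illustration, not how reCAPTCHA is actually implemented: the `Word` record, the `iou` and `disagreements` helpers, the 0.5 overlap threshold and the sample boxes are all made up for the example, and it assumes each engine reports a word plus a (left, top, right, bottom) box in the same pixel space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognised word and its bounding box (hypothetical record)."""
    text: str
    box: tuple  # (left, top, right, bottom) in pixels


def iou(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def disagreements(words_a, words_b, min_iou=0.5):
    """Boxes from engine A where engine B has no matching word or a different text."""
    out = []
    for wa in words_a:
        # Pick the word from engine B that overlaps this box the most.
        best = max(words_b, key=lambda wb: iou(wa.box, wb.box), default=None)
        if best is None or iou(wa.box, best.box) < min_iou or best.text != wa.text:
            out.append(wa.box)
    return out


# Made-up sample data: the engines agree on "Art" but not on the second word.
engine_a = [Word("Art", (10, 10, 60, 30)), Word("of", (70, 10, 90, 30))]
engine_b = [Word("Art", (11, 10, 60, 30)), Word("ot", (70, 11, 90, 30))]
disputed = disagreements(engine_a, engine_b)
print(disputed)
```

Each box in `disputed` could then be cut out of the page scan, for example with Pillow's `Image.crop`, which takes a (left, upper, right, lower) tuple. Without the coordinate data there would be nothing to crop by, which is why the extra output is worth having.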