
ok, we've got a couple of different topics running around, so let us take a minute to make sure we are not confused...

first of all, let's talk about my campaign for preprocessing... i have demonstrated, over and over and over again, that d.p. (and rfrank) should be doing _much_ better preprocessing... i've shown how they can use _very_simple_means_ to do that, and how -- if they did -- they could reduce the error-counts in their books to a ridiculously small amount, even _before_ their text went in front of proofers.

i have talked about how it is a huge _waste_ of the generous donations of volunteers (in both time and energy) not to do aggressive preprocessing, which automatically locates errors to make them easy to fix...

again, the crux of my argument -- and i have proven it to be absolutely true, again and again -- is that it's _easy_ to do this. indeed, when i have shown the steps taken to locate the errors, it becomes painfully obvious how ridiculously simple they are... they include obvious checks, like a number embedded in a word, or a lowercase letter followed by a capital letter, or two commas in a row, or a period at the beginning of a line. _obvious_ stuff! this isn't rocket science. it's not even _hard_... it's dirt-simple! and yet neither d.p. nor rfrank has instituted such preprocessing.

***

let's contrast this with gardner's request, which was to compile a list of reg-ex tests that will locate all possible errors in any random book. this request -- as worthy as it might seem -- is _much_ more difficult to realize. in fact, it's almost impossible.

a friend of mine over in england, nick hodson, is a very prolific digitizer. all by himself, he has done some 500 books or more. nick collected an extensive set of checks over the years. i can't remember exactly how many there were, but roughly about 200. however, once nick upgraded his o.c.r. program, he found that about half of his checks were no longer required. they had been necessary essentially as an artifact of an outdated o.c.r. program.

the type of books nick was digitizing hadn't changed, and neither had the quality of the scans, or the resolution of the scans, or the digital retouching that he performed on the scans -- none of that. he was the same person, using the same computer and scanner, and he was doing the same things exactly as he had done before. the only thing that changed was the version of his o.c.r. program. yet he found many checks he formerly needed became unnecessary.

so, for an operation like d.p., who intakes all kinds of scans and uses a wide variety of o.c.r. programs, operated by users with a huge range of expertise, their results will be all over the board. they're _never_ gonna get a definitive list of checks to be made. it would be _immensely_ difficult, to the point of being impossible.

but that's totally beside our other point, about preprocessing... because the fact of the matter is that a few dozen _simple_ tests are all that d.p. needs in order to reduce the number of errors to a level where they can be handled easily by their human proofers. they're never gonna get 100%. but they could find 90% so easily that it's criminal negligence that they aren't doing that already... heck, spell-check by itself will locate 50% of the errors for you...

-bowerbird
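A minimal sketch, in Python, of the sort of dirt-simple checks described above; the pattern list, the report format, and the script itself are illustrative assumptions, not d.p.'s or rfrank's actual tooling:

    # illustrative preprocessing checks of the kind described above;
    # patterns and report format are assumptions, not anyone's real tooling.
    import re
    import sys

    CHECKS = [
        ("digit embedded in a word",      re.compile(r"[A-Za-z]\d|\d[A-Za-z]")),
        ("lowercase followed by capital", re.compile(r"[a-z][A-Z]")),
        ("two commas in a row",           re.compile(r",\s*,")),
        ("period at start of line",       re.compile(r"^\.")),
    ]

    def report(path):
        """Print every line that trips one of the simple checks."""
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, 1):
                for name, pattern in CHECKS:
                    if pattern.search(line):
                        print(f"{path}:{lineno}: {name}: {line.rstrip()}")

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            report(path)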

however, once nick upgraded his o.c.r. program, he found that about half of his checks were no longer required. they had been necessary essentially as an artifact of an outdated o.c.r. program.
Begs the question why DP doesn't just institute a quality hosted OCR and let people just submit the page images. Ask people to test run a couple pages by the hosted OCR before settling on their digitization settings in order to make sure they know what they are doing.
but they could find 90% so easily that ...
Not that I totally disagree, but when you take out the easy stuff, the stuff that's left is harder to find. Especially at the P1 level, too much cruft makes for painful proofing -- but so does too little cruft. Either way you scare off your newbies, who you need to keep around and happy and convince to "progress" to the more difficult and less rewarding levels, such as P3 and F2. It's not just the cruftiness, but that the current interface doesn't make fixing common cruftiness easy -- neither on the fingers nor on the eyes. Working "solo", I find there are lots of clever schemes I can use to reduce the amount of "P1" cruft I need to fix -- but "P1" really only takes about 20% of the time and effort I find necessary to make a book.

On 23 February 2010 15:47, Jim Adcock <jimad@msn.com> wrote:
Begs the question why DP doesn't just institute a quality hosted OCR and let people just submit the page images. Ask people to test run a couple pages by the hosted OCR before settling on their digitization settings in order to make sure they know what they are doing.
This was discussed on DP back in 2003. If you have a DP login, see here: http://www.pgdp.net/phpBB2/viewtopic.php?t=5840

And the flaw? I quote from a post about finereader in that thread:

------------------------------
I asked the price of Linux development kit. It is 9000 Euro, plus some more money to get a licence for a fixed number of page/month (500 euro for 25k pages/month)
------------------------------

(Tesseract might be the way to go, but there's still the chronic shortage of programmers to implement new DP features.)

Malcolm

I asked the price of Linux development kit. It is 9000 Euro, plus some more money to get a licence for a fixed number of page/month (500 euro for 25k pages/month)
As opposed to $400 a pop per volunteer buying the recommended ABBYY Finereader in order to do a good job of OCR, or $0 to do a bad job of OCR using whatever came free with one's scanner -- leaving the P1's to be the ones who are actually doing the OCR!

Maybe we should do a straw poll of the volunteers about who is willing to donate to a hosted OCR. If each DP volunteer were willing to donate $3 we would be there.
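A rough back-of-the-envelope on those figures; the exchange rate and the volunteer counts below are assumptions for illustration, not numbers from the thread:

    # rough sanity check on the quoted prices; exchange rate and volunteer
    # counts are assumptions, not figures from the thread.
    EUR_TO_USD  = 1.35     # approximate 2010 rate (assumption)
    DEV_KIT_EUR = 9000     # one-time Linux development kit (from Malcolm's quote)
    LICENCE_EUR = 500      # per month, 25k pages/month tier (from Malcolm's quote)

    first_year_usd = (DEV_KIT_EUR + 12 * LICENCE_EUR) * EUR_TO_USD
    print(f"first-year cost: about ${first_year_usd:,.0f}")

    for volunteers in (2000, 5000, 10000):   # assumed active-volunteer counts
        print(f"{volunteers:>6} volunteers -> about ${first_year_usd / volunteers:.2f} each")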

Who pays $400 for FR? Older versions that work quite well can be had on ebay for under $50, last time I looked. And there's always the OCR pool.

-Bob

On Thu, Feb 25, 2010 at 2:22 PM, Jim Adcock <jimad@msn.com> wrote:
I asked the price of Linux development kit. It is 9000 Euro, plus some more money to get a licence for a fixed number of page/month (500 euro for 25k pages/month)
As opposed to $400 a pop per volunteer buying the recommended ABBYY Finereader in order to do a good job of OCR, or $0 to do a bad job of OCR using whatever came free with one's scanner -- leaving the P1's to be the ones who are actually doing the OCR!
Maybe we should do a straw poll of the volunteers about who is willing to donate to a hosted OCR. If each DP volunteer were willing to donate $3 we would be there.

On 2/23/2010 10:47 AM, Jim Adcock wrote:
Begs the question why DP doesn't just institute a quality hosted OCR and let people just submit the page images. Ask people to test run a couple pages by the hosted OCR before settling on their digitization settings in order to make sure they know what they are doing.
Having done OCR on several thousand books, I can safely say that even with the most advanced OCR programs currently available, this is NOT a good idea for books of any complexity at all. It might be OK for straight fiction.

The big stumbling block is that ABBYY often segments the page incorrectly or orders the segments incorrectly. A classic example often comes up in the Table of Contents, where it may group all of the chapter titles into one block and then the page numbers into another block. When this is saved as plain text for proofing, that will make lines of chapter titles appear, followed by lines of the page numbers, where what we really want is for the page number to appear on the same line as the chapter title. Not much fun for the proofers to clean up. ABBYY 10 does much better than the previous version I was using, but still sometimes gets things wrong. Getting blocks of text in the wrong order, as can sometimes happen when there are multiple illustrations on a page dividing the text up into separate blocks, is equally bad. Another common OCR error is missing the last word of a paragraph when it appears by itself on a line.

When I scan a book, I keep an eye out for any pages having anything other than a single solid block of text. If the book has any, I'll then go through page by page to make sure that the OCR got the text block segmentation and order correct. I often end up redrawing the text blocks, sometimes re-ordering them, and then running the OCR a second time on that page. I would not trust a "batch" or "remote" OCR program to do this correctly. Despite assertions to the contrary, the content providers at DP do go to some considerable lengths to make things easier for the other volunteers.

There are other problems with providing a central OCR service, which include expense, processing load, etc. But to my mind, the definitive problem is what I outlined above. Without an interactive capability, OCR results often are not good enough for books of any complexity. And before someone says "so make the OCR engine on the server be interactive", let me say that communication and processing costs would be prohibitively expensive, and further, the OCR engines that are sold for that kind of multiple-user, production-environment use don't (as far as I know) make that kind of interaction easy to accomplish.

JulietS
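To make the table-of-contents failure mode concrete, here is a hypothetical sketch of the re-pairing a post-processing pass would have to do when the titles and page numbers land in separate blocks; the function name and sample lines are illustrative, not part of DP's workflow:

    # hypothetical sketch of re-pairing a table of contents whose titles and
    # page numbers were recognized as two separate blocks; names and sample
    # lines are illustrative, not part of DP's workflow.
    def repair_toc(lines):
        """Pair a run of title lines with the run of page-number lines that
        follows it; assumes the two runs are the same length."""
        titles  = [l.strip() for l in lines if l.strip() and not l.strip().isdigit()]
        numbers = [l.strip() for l in lines if l.strip().isdigit()]
        if len(titles) != len(numbers):
            raise ValueError("title/page-number counts differ; fix by hand")
        return [f"{title}  {page}" for title, page in zip(titles, numbers)]

    # the split blocks as the OCR might save them to plain text
    split = ["I. The Old House", "II. A Letter Arrives", "III. The Journey North",
             "", "1", "27", "54"]
    print("\n".join(repair_toc(split)))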

Making an OCR batch service at DP would be pointless, since the Internet Archive already does it: their OCR is as good as what you can do yourself if you don't act manually during the recognition (training, drawing blocks, etc.). And for content not taken from TIA, you can just donate the scans to TIA and let them do the OCR for you.

Carlo