
On Dec 24, 2011, at 10:46 AM, Bowerbird@aol.com wrote:
for now, i'll conditionally agree i missed 20 errors. but that number is kind of meaningless without also stating the number of errors that i _found_. if i found 2, while missing 20, that's extremely bad. if i found 2000, only missing 20, it's extraordinary. i would estimate that i found 80-200, meaning that my accuracy level was at the 80%-90% range, which is in the ballpark for a typical p1 proofer, _except_ i only worked for an hour, far less than a p1 round.
I would guess that you found more than 200. There are 256 pages and what you worked on was straight from Abbyy.
that's because i thought roger's preprocessing had probably _found_and_fixed_ all of those problems.
Uh, no. I didn't put the project up for you to scrape and try to draw conclusions. I put it up so people could play with the editor. The more errors the better. You chose to scrape it and open up the discussion. There was no preprocessing, on purpose.
the other info that roger needs to tell us about is how many errors that _he_ found and missed, and reveal how many "uniques" we each found...
I think you are asking for my equivalent count to your 20 or so. I don't have a number, but it was certainly more than 20. The thrill of looking at the diffs was that in many cases, I could see where what you found could have been found with a regex (RE). I made a list (and checked it twice) as I went along and then baked all those into the Python back-end that generates the analysis window. By the way, if anyone ever wants to understand the analysis code, it starts in HTML, which onclicks to JavaScript, which ajaxs to PHP, which popens to Python, which is my language of choice where RE's are involved. That made it easy to reverse-engineer and bake-in the RE's that you probably used. (Don K, stop frowning!)
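[Editorial aside: for readers curious what "baking the REs into the back-end" might look like in practice, here is a minimal Python sketch. The patterns and their descriptions below are illustrative guesses at the kind of OCR checks involved, not the actual list from Roger's analysis code.]

```python
import re

# Illustrative checks only -- NOT the real list from the analysis back-end.
CHECKS = [
    (re.compile(r'\brn\b'), 'lone "rn" (often a broken "m")'),
    (re.compile(r'[a-z]\.[a-z]'), 'period with no following space'),
    (re.compile(r'\btbe\b'), 'scanno "tbe" for "the"'),
    (re.compile(r'"\s*"'), 'adjacent double quotes'),
]

def scan_line(line):
    """Return the description of every check that fires on this line."""
    return [desc for rx, desc in CHECKS if rx.search(line)]

print(scan_line('He saw tbe mist.'))  # -> ['scanno "tbe" for "the"']
```

Each diff that a regex could have caught becomes one more `(pattern, description)` pair appended to the list, which is why reverse-engineering the fixes was straightforward.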
when we have those numbers for both proofings, we can then compute how many errors _remain_, which is the thing that we _really_ want to know, so we can decide if another proofing is "worth it".
I don't have those numbers and I'm not going to try to derive them from the data. I think it would be interesting to compare the output after smoothreading to the output of either the book-level work you did or the page-level work I did. The only reason, for me, would be to see how I could improve the program (and likely, for you, your RE set). My program is freely available, but it takes a server. You did a series that highlighted some checks you made when doing it all at once, which was excellent. Right now, I don't care which way is better, technically. I believe the choice will be made on which approach is most comfortable to the user. I've heard from some people who solo-process that they actually like to go through the book a page at a time because they enjoy following the story as they go, which doesn't happen when someone is in production mode at the book level.
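[Editorial aside: the thread never settles on how to "compute how many errors remain," but the standard way to turn two independent proofings plus their overlap into such an estimate is the Lincoln-Petersen capture-recapture calculation. A sketch, with a function name and interface of my own invention:]

```python
def estimate_remaining(found_a, found_b, found_both):
    """Lincoln-Petersen capture-recapture estimate (a standard method;
    not something either correspondent actually ran on this text).

    found_a    -- errors found by proofer A
    found_b    -- errors found by proofer B
    found_both -- errors found by BOTH proofers (the overlap)
    Returns (estimated_total_errors, estimated_errors_remaining).
    """
    if found_both == 0:
        raise ValueError("need at least one shared find to estimate")
    total = found_a * found_b / found_both
    found_either = found_a + found_b - found_both  # union of both finds
    return total, total - found_either

# If A found 100, B found 80, and 40 were found by both, the estimated
# total is 100 * 80 / 40 = 200, so about 60 errors would remain.
```

This is exactly why the count of "uniques" each proofer found matters: the smaller the overlap relative to each proofer's total, the more errors the estimate says are still hiding.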
one of the things you should absolutely remember is that if you have found an error of a certain type, you _must_ check for more which might be lurking.
That's something that's currently missing from my program. For example, let's say I'm on page 24 and I find a scanno "Eamon" for "Ramon." I would like to be able to easily get out of the page-at-a-time mode and have it give me ready access to any page that has "Eamon" so I can fix it right there. That's trivial in a book-level edit but something I would have to add to my page-level tool.
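[Editorial aside: the cross-page lookup being described is small to sketch. The dict-of-pages data model below is an assumption for illustration; the real tool surely stores pages differently.]

```python
import re

def pages_with_word(pages, word):
    """Return the page numbers whose text contains `word` as a whole word.

    `pages` is assumed here to be a dict of page number -> page text;
    this only demonstrates the lookup, not the tool's actual storage.
    """
    rx = re.compile(r'\b%s\b' % re.escape(word))
    return sorted(n for n, text in pages.items() if rx.search(text))

pages = {24: 'Eamon said hello.', 25: 'Ramon left.', 57: 'Then Eamon ran.'}
print(pages_with_word(pages, 'Eamon'))  # -> [24, 57]
```

Wired into a page-at-a-time editor, the returned page numbers become jump targets, so the same scanno can be fixed everywhere before returning to page 24.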
but the number of doublequote glitches is astounding.
i have to verify them and write 'em up, but i'd guess that i found at least a dozen more, and maybe two...
I actually use a tool I've written (ppsmq) to turn straight quotes in a text into curly quotes. The program flags anything that it suspects is wrong. It's surprising how much it finds, not only on raw OCR like this but even on published texts. I had not used it on this text, of course, because the version you scraped, as I said, was intentionally replete with errors. Just so people are clear, my intention in putting up the editor was only to get people's take on whether it worked in their computing environment and to offer suggestions. Many responded and I've made several improvements to the program. Thanks to those who did. My next step is to use it on several more books. I am not promoting it for general use and certainly not comparing it to other methods. It's an interesting experiment to me and may be useful to page-oriented sites like DP at some time in the future. Even if nobody ever wants to use it, I will. --Roger
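[Editorial aside: ppsmq's actual rules aren't shown in the thread, but the basic idea of "convert what you're sure of, flag what you aren't" can be sketched in a few lines. This toy version knows only two rules -- an opening quote follows start-of-text or whitespace, a closing quote follows a non-space character -- and flags everything else:]

```python
def curl_quotes(text):
    """Tiny sketch of straight->curly conversion with flagging.

    NOT ppsmq itself: a real converter needs many more rules
    (nested quotes, dialogue spanning paragraphs, apostrophes, etc.).
    Returns (converted_text, list_of_suspect_positions).
    """
    out, flags = [], []
    open_q = False
    for i, ch in enumerate(text):
        if ch != '"':
            out.append(ch)
            continue
        if not open_q and (i == 0 or text[i - 1].isspace()):
            out.append('\u201c')          # opening curly quote
            open_q = True
        elif open_q and not text[i - 1].isspace():
            out.append('\u201d')          # closing curly quote
            open_q = False
        else:
            out.append(ch)                # can't decide: leave straight...
            flags.append(i)               # ...and flag the position
    if open_q:
        flags.append(len(text))           # unclosed quote at end of text
    return ''.join(out), flags

print(curl_quotes('"Hello," she said.'))  # converts cleanly, no flags
```

Even a rule set this naive flags the unbalanced-quote glitches discussed above; the surprise Roger describes is how often real texts trip even much better rules.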