
On Dec 24, 2011, at 10:46 AM, Bowerbird@aol.com wrote:
for now, i'll conditionally agree i missed 20 errors. but that number is kind of meaningless without also stating the number of errors that i _found_. if i found 2, while missing 20, that's extremely bad. if i found 2000, only missing 20, it's extraordinary. i would estimate that i found 80-200, meaning that my accuracy level was at the 80%-90% range, which is in the ballpark for a typical p1 proofer, _except_ i only worked for an hour, far less than a p1 round.
I would guess that you found more than 200. There are 256 pages and what you worked on was straight from Abbyy.
that's because i thought roger's preprocessing had probably _found_and_fixed_ all of those problems.
Uh, no. I didn't put the project up for you to scrape and try to draw conclusions. I put it up so people could play with the editor. The more errors the better. You chose to scrape it and open up the discussion. There was no preprocessing, on purpose.
the other info that roger needs to tell us about is how many errors that _he_ found and missed, and reveal how many "uniques" we each found...
I think you are asking for my equivalent count to your 20 or so. I don't have a number, but it was certainly more than 20. The thrill of looking at the diffs was that in many cases, I could see where what you found could have been found with a regex (RE). I made a list (and checked it twice) as I went along and then baked all those into the Python back-end that generates the analysis window. By the way, if anyone ever wants to understand the analysis code, it starts in HTML, which onclicks to JavaScript, which ajaxs to PHP, which popens to Python, which is my language of choice where RE's are involved. That made it easy to reverse-engineer and bake-in the RE's that you probably used. (Don K, stop frowning!)
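[Editorial aside: for readers curious what "baking the REs into the back-end" might look like in practice, here is a minimal Python sketch. The patterns and their descriptions below are illustrative guesses at the kind of OCR checks involved, not the actual list from Roger's analysis code.]

```python
import re

# Illustrative checks only -- NOT the real list from the analysis back-end.
CHECKS = [
    (re.compile(r'\brn\b'), 'lone "rn" (often a broken "m")'),
    (re.compile(r'[a-z]\.[a-z]'), 'period with no following space'),
    (re.compile(r'\btbe\b'), 'scanno "tbe" for "the"'),
    (re.compile(r'"\s*"'), 'adjacent double quotes'),
]

def scan_line(line):
    """Return the description of every check that fires on this line."""
    return [desc for rx, desc in CHECKS if rx.search(line)]

print(scan_line('He saw tbe mist.'))  # -> ['scanno "tbe" for "the"']
```

Each diff that a regex could have caught becomes one more `(pattern, description)` pair appended to the list, which is why reverse-engineering the fixes was straightforward.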
when we have those numbers for both proofings, we can then compute how many errors _remain_, which is the thing that we _really_ want to know, so we can decide if another proofing is "worth it".
I don't have those numbers and I'm not going to try to derive them from the data. I think it would be interesting to compare the output after smoothreading to the output of either the book-level work you did or the page-level work I did. The only reason, for me, would be to see how I could improve the program (and likely, for you, your RE set). My program is freely available, but it takes a server. You did a series that highlighted some checks you made when doing it all at once, which was excellent. Right now, I don't care which way is better, technically. I believe the choice will be made on which approach is most comfortable to the user. I've heard from some people who solo-process that they actually like to go through the book a page at a time because they enjoy following the story as they go, which doesn't happen when someone is in production mode at the book level.
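[Editorial aside: the thread never settles on how to "compute how many errors remain," but the standard way to turn two independent proofings plus their overlap into such an estimate is the Lincoln-Petersen capture-recapture calculation. A sketch, with a function name and interface of my own invention:]

```python
def estimate_remaining(found_a, found_b, found_both):
    """Lincoln-Petersen capture-recapture estimate (a standard method;
    not something either correspondent actually ran on this text).

    found_a    -- errors found by proofer A
    found_b    -- errors found by proofer B
    found_both -- errors found by BOTH proofers (the overlap)
    Returns (estimated_total_errors, estimated_errors_remaining).
    """
    if found_both == 0:
        raise ValueError("need at least one shared find to estimate")
    total = found_a * found_b / found_both
    found_either = found_a + found_b - found_both  # union of both finds
    return total, total - found_either

# If A found 100, B found 80, and 40 were found by both, the estimated
# total is 100 * 80 / 40 = 200, so about 60 errors would remain.
```

This is exactly why the count of "uniques" each proofer found matters: the smaller the overlap relative to each proofer's total, the more errors the estimate says are still hiding.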
one of the things you should absolutely remember is that if you have found an error of a certain type, you _must_ check for more which might be lurking.
That's something that's currently missing from my program. For example, let's say I'm on page 24 and I find a scanno "Eamon" for "Ramon." I would like to be able to easily get out of the page-at-a-time mode and have it give me ready access to any page that has "Eamon" so I can fix it right there. That's trivial in a book-level edit but something I would have to add to my page-level tool.
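[Editorial aside: the cross-page lookup being described is small to sketch. The dict-of-pages data model below is an assumption for illustration; the real tool surely stores pages differently.]

```python
import re

def pages_with_word(pages, word):
    """Return the page numbers whose text contains `word` as a whole word.

    `pages` is assumed here to be a dict of page number -> page text;
    this only demonstrates the lookup, not the tool's actual storage.
    """
    rx = re.compile(r'\b%s\b' % re.escape(word))
    return sorted(n for n, text in pages.items() if rx.search(text))

pages = {24: 'Eamon said hello.', 25: 'Ramon left.', 57: 'Then Eamon ran.'}
print(pages_with_word(pages, 'Eamon'))  # -> [24, 57]
```

Wired into a page-at-a-time editor, the returned page numbers become jump targets, so the same scanno can be fixed everywhere before returning to page 24.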
but the number of doublequote glitches is astounding.
i have to verify them and write 'em up, but i'd guess that i found at least a dozen more, and maybe two...
I actually use a tool I've written (ppsmq) to turn straight quotes in a text into curly quotes. The program flags anything that it suspects is wrong. It's surprising how much it finds, not only on raw OCR like this but even on published texts. I had not used it on this text, of course, because the version you scraped, as I said, was intentionally replete with errors. Just so people are clear, my intention in putting up the editor was only to get people's take on whether it worked in their computing environment and to offer suggestions. Many responded and I've made several improvements to the program. Thanks to those who did. My next step is to use it on several more books. I am not promoting it for general use and certainly not comparing it to other methods. It's an interesting experiment to me and may be useful to page-oriented sites like DP at some time in the future. Even if nobody ever wants to use it, I will. --Roger
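[Editorial aside: ppsmq's actual rules aren't shown in the thread, but the basic idea of "convert what you're sure of, flag what you aren't" can be sketched in a few lines. This toy version knows only two rules -- an opening quote follows start-of-text or whitespace, a closing quote follows a non-space character -- and flags everything else:]

```python
def curl_quotes(text):
    """Tiny sketch of straight->curly conversion with flagging.

    NOT ppsmq itself: a real converter needs many more rules
    (nested quotes, dialogue spanning paragraphs, apostrophes, etc.).
    Returns (converted_text, list_of_suspect_positions).
    """
    out, flags = [], []
    open_q = False
    for i, ch in enumerate(text):
        if ch != '"':
            out.append(ch)
            continue
        if not open_q and (i == 0 or text[i - 1].isspace()):
            out.append('\u201c')          # opening curly quote
            open_q = True
        elif open_q and not text[i - 1].isspace():
            out.append('\u201d')          # closing curly quote
            open_q = False
        else:
            out.append(ch)                # can't decide: leave straight...
            flags.append(i)               # ...and flag the position
    if open_q:
        flags.append(len(text))           # unclosed quote at end of text
    return ''.join(out), flags

print(curl_quotes('"Hello," she said.'))  # converts cleanly, no flags
```

Even a rule set this naive flags the unbalanced-quote glitches discussed above; the surprise Roger describes is how often real texts trip even much better rules.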