Re: [gutvol-d] updated PP editor

ho ho ho! happy christmas eve! :+) ok, busy day for me, and i'm sure for all of you too, so i'd guess you'll appreciate it if i cut to the chase and leave the extended discussion for next week.

***

for now, i'll conditionally agree i missed 20 errors. but that number is kind of meaningless without also stating the number of errors that i _found_. if i found 2, while missing 20, that's extremely bad. if i found 2000, only missing 20, it's extraordinary. i would estimate that i found 80-200, meaning that my accuracy level was in the 80%-90% range, which is in the ballpark for a typical p1 proofer, _except_ i only worked for an hour, far less than a p1 round. moreover, i didn't perform the full range of checks. most significantly, i did not check any quotemarks. that's because i thought roger's preprocessing had probably _found_and_fixed_ all of those problems. however, since _7_ of them showed up as "missed", i decided that i should go back and do that check... more on the results of that check coming up later...

***

the other info that roger needs to tell us is how many errors _he_ found and missed, and how many "uniques" we each found... when we have those numbers for both proofings, we can then compute how many errors _remain_, which is the thing that we _really_ want to know, so we can decide if another proofing is "worth it". this is what don was talking about, except that the situation isn't nearly as vague as he believes. indeed, the specificity you can obtain from just a mere _handful_ of data can be quite startling, if you ensure that you select your data carefully. (this often means creating a test to produce it.)

***

one of the things you should absolutely remember is that if you have found an error of a certain type, you _must_ check for more which might be lurking. i did that with the errors of mine that roger listed. he found that i'd missed some "lie for he" scannos, and sure enough, a search turned up another one:
chance that lie may get work the first of the
roger listed a paragraph-termination error i missed. and a search revealed another two or three more:
The World Syndicate Publishing Co,
be in it, Finny, I mean,
but were giving the Christmas hymn beginning,
and, most notably for this book, there were also the double-quote errors roger discovered i'd missed... the systematic search for those was quite interesting.

***

the typographer for this p-book was badly hungover. the rate of errors in this p-book is surprisingly high, including an atypical number of outright misspellings. but the number of doublequote glitches is astounding. i have to verify them and write 'em up, but i'd guess that i found at least a dozen more, and maybe two... so -- at least in one sense -- you could say i missed even more errors than roger computed. but of course, if i had done this doublequote check in the first place, and done it right, i wouldn't have missed any of these, as doublequotes _are_ a type of error that's detectable. so that check must definitely be a part of our routine... likewise with paragraph-terminators, also detectable... stealth scannos are a mixed situation. there are tests that can be done which _do_ help find stealth scannos, but these tests often pull up "false alarms" as well, and thus the cost-benefit ratio of doing them is unclear. that's why my general orientation is _not_ to do them, but rather to leave them for the smoothreaders to find, since detecting stealth scannos is often _easy_ for them. and that's what it boils down to -- making this _easy_.

***

more later, maybe today or tomorrow, and next week... watch out for flying reindeer! :+)

-bowerbird
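p.s. for anyone who wants the arithmetic behind "compute how many errors _remain_": it's the standard capture-recapture (lincoln-petersen) estimate from two independent proofings. a minimal sketch in python, with _made-up_ counts:

    # capture-recapture (lincoln-petersen) estimate of total errors
    # from two independent proofings. the counts here are made up
    # for illustration; they are not the real numbers for this book.
    def estimate_total(found_a, found_b, found_both):
        # total ~= (n_a * n_b) / n_both; remaining = total - found by either
        total = (found_a * found_b) / found_both
        found_either = found_a + found_b - found_both
        return total, total - found_either

    total, remaining = estimate_total(found_a=120, found_b=140, found_both=100)
    print("estimated total: %.0f, still unfound: %.0f" % (total, remaining))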

On Dec 24, 2011, at 10:46 AM, Bowerbird@aol.com wrote:
for now, i'll conditionally agree i missed 20 errors. but that number is kind of meaningless without also stating the number of errors that i _found_. if i found 2, while missing 20, that's extremely bad. if i found 2000, only missing 20, it's extraordinary. i would estimate that i found 80-200, meaning that my accuracy level was in the 80%-90% range, which is in the ballpark for a typical p1 proofer, _except_ i only worked for an hour, far less than a p1 round.
I would guess that you found more than 200. There are 256 pages and what you worked on was straight from Abbyy.
that's because i thought roger's preprocessing had probably _found_and_fixed_ all of those problems.
Uh, no. I didn't put the project up for you to scrape and try to draw conclusions. I put it up so people could play with the editor. The more errors the better. You chose to scrape it and open up the discussion. There was no preprocessing, on purpose.
the other info that roger needs to tell us is how many errors _he_ found and missed, and how many "uniques" we each found...
I think you are asking for my equivalent count to your 20 or so. I don't have a number, but it was certainly more than 20. The thrill of looking at the diffs was that in many cases, I could see where what you found could have been found with a regex (RE). I made a list (and checked it twice) as I went along and then baked all those into the Python back-end that generates the analysis window. By the way, if anyone ever wants to understand the analysis code, it starts in HTML, which onclicks to JavaScript, which ajaxs to PHP, which popens to Python, which is my language of choice where RE's are involved. That made it easy to reverse-engineer and bake-in the RE's that you probably used. (Don K, stop frowning!)
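Since I mentioned the chain: here is a minimal sketch of what the Python end does, the part that PHP popens and pipes a page of text to. The checks shown are illustrative stand-ins, not the actual baked-in list:

    import re
    import sys

    # Minimal sketch of a back-end check pass: PHP popen()s this script,
    # pipes one page of text to stdin, and it prints flagged lines for
    # the analysis window. These rules are examples, not the real list.
    CHECKS = [
        (re.compile(r'\blie\b'), 'possible "lie for he" scanno'),
        (re.compile(r',$'),      'line ends with a comma (check paragraph end)'),
        (re.compile(r'\s[,.]'),  'space before comma or period'),
    ]

    def analyze(text):
        for lineno, line in enumerate(text.splitlines(), 1):
            for rx, message in CHECKS:
                if rx.search(line):
                    print("%d: %s: %s" % (lineno, message, line.strip()))

    if __name__ == "__main__":
        analyze(sys.stdin.read())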
when we have those numbers for both proofings, we can then compute how many errors _remain_, which is the thing that we _really_ want to know, so we can decide if another proofing is "worth it".
I don't have those numbers and I'm not going to try to derive them from the data. I think that it would be interesting to compare the output after smoothreading to the output of either the book-level work you did or the page-level work I did. The only reason, for me, would be to see how I could improve the program (and likely, for you, your RE set). My program is freely available, but it takes a server. You did a series that highlighted some checks you made when doing it all at once, which was excellent. Right now, I don't care which way is better, technically. I believe the choice will be made on which approach is most comfortable to the user. I've heard from some people who solo-process that they actually like to go through the book a page at a time because they enjoy following the story as they go, which doesn't happen when someone is in production mode at the book level.
one of the things you should absolutely remember is that if you have found an error of a certain type, you _must_ check for more which might be lurking.
That's something that's currently missing from my program. For example, let's say I'm on page 24 and I find a scanno "Eamon" for "Ramon." I would like to be able to easily get out of the page-at-a-time mode and have it give me ready access to any page that has "Eamon" so I can fix it right there. That's trivial in a book-level edit but something I would have to add to my page-level tool.
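The lookup itself is simple. A minimal sketch, assuming one text file per page (the pages/*.txt layout is an assumption, not how the editor actually stores pages):

    import glob
    import os

    # Sketch of the missing feature: given a scanno just found on one
    # page, list every page file that also contains it, so each can
    # be visited and fixed in turn.
    def pages_with(term, pagedir="pages"):
        hits = []
        for path in sorted(glob.glob(os.path.join(pagedir, "*.txt"))):
            with open(path, encoding="utf-8") as f:
                if term in f.read():
                    hits.append(os.path.basename(path))
        return hits

    print(pages_with("Eamon"))  # e.g. ['024.txt', '131.txt']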
but the number of doublequote glitches is astounding.
i have to verify them and write 'em up, but i'd guess that i found at least a dozen more, and maybe two...
I actually use a tool I've written (ppsmq) to turn straight quotes in a text into curly quotes. The program flags anything that it suspects is wrong. It's surprising how much it finds, not only on a raw OCR like this but even on published texts. I had not used it on this text, of course, because the version you scraped, as I said, was intentionally replete with errors.

Just so people are clear, my intention in putting up the editor was only to get people's take on whether it worked in their computing environment and to offer suggestions. Many responded and I've made several improvements to the program. Thanks to those who did. My next step is to use it on several more books. I am not promoting it for general use and certainly not comparing it to other methods. It's an interesting experiment to me and may be useful to page-oriented sites like DP at some time in the future. Even if nobody ever wants to use it, I will.

--Roger
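P.S. For anyone curious how a pass like that works in principle, here is a minimal sketch; this is not ppsmq's actual code, and a real tool also has to handle continuation quotes (an open quote carried across paragraphs), nested single quotes, and apostrophes:

    import sys

    # Sketch: curl straight doublequotes by alternation within each
    # paragraph, and flag paragraphs where an opening quote is left
    # unclosed (an odd quote count is a likely error).
    def curl(paragraph):
        out, opened = [], False
        for ch in paragraph:
            if ch == '"':
                out.append('\u201c' if not opened else '\u201d')
                opened = not opened
            else:
                out.append(ch)
        return ''.join(out), opened  # opened means an unclosed quote remains

    for num, para in enumerate(sys.stdin.read().split('\n\n'), 1):
        curled, unbalanced = curl(para)
        if unbalanced:
            print('SUSPECT: paragraph %d has an odd number of doublequotes' % num)
        print(curled)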

On Sat, Dec 24, 2011 at 7:43 PM, Roger Frank <rfrank@rfrank.net> wrote:
By the way, if anyone ever wants to understand the analysis code, it starts in HTML, which onclicks to JavaScript, which ajaxs to PHP, which popens to Python, which is my language of choice where RE's are involved. That made it easy to reverse-engineer and bake-in the RE's that you probably used. (Don K, stop frowning!)
It sounds legit to me. But the bird has in the past expressed disdain for regexes for his own use, so it's probably more string-search-and-replace. Whatever he's comfortable with.

In the Twister program I use for preparing new projects, I have a display with the text and page synchronized, and a set of external files with lists of regexes: the columns are search pattern, replacement pattern, flag-set, and descriptive text. I run one pass to get counts for all of them, and then work through what is found. Some are safe for global replacement ("ist" is almost always "1st" unless it's quoting German, "nth" is "11th" unless there is math, etc.). I figure if it improves the text 90% of the time to execute the default replacement, it's probably worth doing. I think I can correct over half the initial errors before the first proofer sees it, but not much better than that.

Double-quotes are an interesting case. The WordPress editor, interestingly, automatically converts them to curly quotes without even asking, and with great accuracy. It even does a great job on single quotes. I need to look at the algorithms when I get some time.

Here's one I'll solicit community improvement suggestions for. It's the regex I use to find problematic quotes. It does a pretty reliable job finding errors, and the logic to repair them is also reliable. But its ability to detect paragraph and other boundaries, and leading continuing quotes, could be better.

/([^\s\(-]?)"(\s*)([^\\]*?(\\.[^\\]*)*)(\s*)("|^\r)([^\s\)\.\,\?;:-]?)/gim

This is in (my one experiment with) ActionScript, which is a pretty strict superset of JavaScript, including regex functionality.
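For the curious, here's a minimal sketch in Python of that counts pass; the tab-separated rule-file format shown is illustrative, not Twister's actual format:

    import re
    import sys

    # Sketch of the "counts pass" over an external regex list.
    # Assumed rule-file format (illustrative): tab-separated lines of
    #   search-pattern <TAB> replacement <TAB> flags <TAB> description
    def count_pass(text, rulefile):
        with open(rulefile, encoding="utf-8") as f:
            for line in f:
                if not line.strip() or line.startswith("#"):
                    continue
                pattern, repl, flagset, desc = line.rstrip("\n").split("\t")
                flags = re.IGNORECASE if "i" in flagset else 0
                hits = len(re.findall(pattern, text, flags))
                if hits:
                    print("%5d  %s  (%r -> %r)" % (hits, desc, pattern, repl))

    if __name__ == "__main__":
        # usage: python countpass.py book.txt rules.tsv
        with open(sys.argv[1], encoding="utf-8") as f:
            count_pass(f.read(), sys.argv[2])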

This is interesting, at least to me. I've just written a program to pull the italics from the RTF files created by Abbyy. In the book I've been experimenting with, there are 158 lines that have at least one italic word. As I edited that book with the ppe editor, I visually found and marked 65 of them. That's not even close to okay.

To fit the editor on a laptop screen, I shrank the image window. A click on it zooms it up, but that is tiresome. Trying to spot italic markup in that too-small and greyscale window is clearly too difficult. One user emailed me that she found the small png window "useless." So I've made a version that has the png window the same size as the edit and analysis panes. That's much better, but now it needs a 1920x1080 display to fit. How many potential users did I just lose? Probably quite a few. I did another version that was two-paned: the image window and the text/analysis window. That fits on everything and allows the user to choose which of the two right panes--edit or analysis--is displayed. Either this or the three-paned version is a big improvement over what I used (and what's at http://etext.ws). The three-paned version is up at http://etext.co. Neither of these is sticky, because I have another project for DPC that I'm overdue to be working on that will consume at least one of those sites for development/testing activities.

But back to the size of the png pane. How it's presented doesn't matter for the purpose of getting the italic markup right. I do not believe any process should rely on visual discovery in a markup-free text. Unfortunately, unless it has changed, that's the standard practice at DP. Since proofers are supposed to proof the words, the markup from Abbyy is routinely removed before putting it into the rounds. Then, in the later formatting rounds, the foofers are supposed to catch all of them and put them back in. Well, that doesn't work for me, since I've just demonstrated that I'm not good at spotting italics on a small screen. I'm guessing I'd still miss some even with the three-paned version. So, bottom line, I'm not going to trust myself and I'll trust Abbyy instead, which is actually very good at spotting and marking italics.

The program I wrote is a short edit distance from what guiprep does. Mine is in Python and therefore easy for me to extend to all I would want from guiprep plus some things beyond what guiprep can do. That program, including the part that recognizes and tags italics, is what I'll use to prepare the next book project.

--Roger
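P.S. For anyone curious, the core of the italics-tagging idea reduces to something like this minimal sketch. It is not my actual program: it assumes italic runs appear as {\i ...} groups, while real Abbyy RTF also uses \i ... \i0 toggles and nested groups, which need a proper parser:

    import re
    import sys

    # Sketch: tag italic runs in an RTF export as <i>...</i>.
    # Simplifying assumption: italics appear as {\i ...} groups.
    ITALIC_GROUP = re.compile(r'\{\\i\s+([^{}]*)\}')

    def tag_italics(rtf):
        return ITALIC_GROUP.sub(lambda m: '<i>%s</i>' % m.group(1).strip(), rtf)

    if __name__ == "__main__":
        sys.stdout.write(tag_italics(sys.stdin.read()))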

Roger,

Are you trying to proof and format at the same time? I did that experiment several years ago with the same effect.

Don

On Dec 25, 2011, at 10:55 AM, don kretz wrote:
Roger,
Are you trying to proof and format at the same time?
I did that experiment several years ago with the same effect.
Don
Yes, because removing formatting only to have to put it back in seems counterproductive, especially since Abbyy's output has a higher catch percentage than I believe a foofer does. And having two foofers go through each page of a book like this seems to be a waste of resources.

--Roger

It will be interesting to see what your testing reveals. I remember that, at the time, Abbyy's accuracy with formatting location and boundaries was not good, and that one could spend more time fixing what's there than putting it there in the first place. Perhaps the benefit is that the failure mode becomes neglecting to check some pages for formatting at all, rather than scanning the page for formatting and missing it. And we have no crosschecks for the kinds of errors only a human reader will notice, so no equivalent to the evidence you are using to detect your missed punctuation.

Suggested test: add a checkbox saying "Check formatting" and prohibit proceeding to another page until the box is checked. Also consider how the Abbyy formatting data could be used by software after your proofing, to consider each location where Abbyy found formatting and whether you had put formatting there, and resolve the differences efficiently. (A sketch of that crosscheck is below the quoted message.)

Don

On Sun, Dec 25, 2011 at 11:18 AM, Roger Frank <rfrank@rfrank.net> wrote:
On Dec 25, 2011, at 10:55 AM, don kretz wrote:
Roger,
Are you trying to proof and format at the same time?
I did that experiment several years ago with the same effect.
Don
Yes, because removing formatting only to have to put it back in seems counterproductive, especially since Abbyy's output has a higher catch percentage than I believe a foofer does. And having two foofers go through each page of a book like this seems to be a waste of resources.
--Roger
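Here is the promised minimal sketch of that after-proofing crosscheck; the page-keyed sets of phrases are my assumption about how the data would be laid out:

    # Sketch of the suggested crosscheck: compare where Abbyy found
    # italics against where the proofed text has markup, page by page,
    # and report the differences for a human to resolve.
    def compare_italics(abbyy, proofed):
        for page in sorted(set(abbyy) | set(proofed)):
            a = abbyy.get(page, set())
            p = proofed.get(page, set())
            for phrase in sorted(a - p):
                print("page %s: Abbyy marked, proofer didn't: %r" % (page, phrase))
            for phrase in sorted(p - a):
                print("page %s: proofer marked, Abbyy didn't: %r" % (page, phrase))

    # made-up example data
    compare_italics({24: {"Ramona"}}, {24: set(), 25: {"ibid."}})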