
i said:

even then, i just don't think "confidence in proofer" will actually work.
or, to be more accurate, i think it'll work just well enough that rfrank will put lots of energy into it before he finds out it doesn't work well enough.
or, worst case scenario, he'll convince himself that it really _is_ working, and other people will believe him, and we all end up with non-perfect pages.

good thing i posted that yesterday, because today rfrank posted his first "informal analysis", and it looks like i was right...

rfrank did his analysis on 32 pages that were marked as "done" but then subsequently proofed again, as is done for a random sample... he admits this is a small number of pages, and that there are also "many factors at play", but then goes on to draw conclusions anyway.

of the 32 pages, two had added proofer notes, and 1 error was fixed. he doesn't tell us if either (or both) of the proofer notes were good, in the sense that they pointed out something of value, so we'll have to assume that they were meaningless and just added noise to the text.

but even then, we have 1 error missed in 32 pages. on the face of it, that means that 3% of the "done" pages had an error. so, for a 200-page book, that would add up to a total of 6 mistakes. again, by my 1-error-every-10-pages criterion, that's fully acceptable. but by the (unrealistic) standards of _most_ of the volunteers, it's not.

rfrank concludes that "this seems to say that making sure every page is seen by two proofers is not warranted"... so that's his take on this.

***

partly the decision rests on the abundance of proofers. if you have lots and lots of proofers, like d.p., then you can afford to send a page through them 2 times or 3 times, even 4 or 5 times. but if your proofers are scarce, like they are over at fadedpage.com, then you might be reluctant to have them view a page even twice...

i think i'm pretty good about making sure proofers are used _wisely_. i don't think i abuse their contribution, or take 'em for granted; neither am i afraid to use their resources if it is responsible to do so. and i think having 2 people verify a page as clean is responsible use.

***

the other thing, though, in evaluating all these experiments, is that you need to know how many errors there _really_ were on each page. only _then_ can you accurately assess the accuracy of the proofers...

remember that there are lots of pages in these books that have _no_ errors on them, none at all. is it any surprise, then, that they were _actually_ "done" when they were _marked_ as "done"? not hardly...

likewise, it isn't really a surprise when a page with _one_ error on it has that error fixed, is then marked as "done", and is _really_ done.

what you have to pay attention to, in such cases, are the pages where an error is _not_ found by the first person, who marks it "done", but is then found by the second person.

rfrank isn't making nearly enough information available for us to analyze the results in a reasonable way. so i guess we just have to "trust" him. i just wish i had more faith in his reasoning.

-bowerbird

I have created a new command line tool "pgdiff" along the lines of what BB has been talking about, which compares two independently OCR'ed texts on a word-by-word basis, so as to find and flag errors. In this regard it is similar to "worddiff", as opposed to "diff", the approach BB has been talking about, which compares on a per-line basis. But my new tool has several tricks that haven't been seen before:

It can be used with two different versions or editions of the text, as long as there are not really long differences between them. I.e., the two texts do not have to have their linebreaks at the same locations.

It tries to retain the linebreak locations of the first input text in preference to the second. I.e., the first input text should represent the target text you are trying to create. This means it can also be used for "versioning" -- for example, using a copy of a PG text from one version or edition of a text to help fix and create a text from a different version or edition.

It can also be used to recover linebreak information where that information has been lost -- for example, to take an older PG text and recover its linebreaks in order to allow the resubmission of that PG text back to DP for a clean-up pass.

In normal mode, when it finds a mismatch it outputs it like this, { it'll | it'11 }, within the body of the text, so that with a regex-compatible editor it is very quick to search for and fix the errors found. As BB says, having tried this approach, the manual approach of trying to visually spot errors seems pretty painful and silly. I find that finding differences on a word basis rather than a line basis makes it quicker and easier to fix the errors in general. You do want to do some regex punctuation normalization on the two OCRs prior to running the tool, to remove the trivial differences and cut down the number of trivial errors it finds that you have to fix.

Source and a compiled Windows version at http://www.freekindlebooks.org/Dev/StringMatch

It is based on traditional Levenshtein distances, where the token is taken to be the non-white part of a "word", as opposed to measuring distances between lines of text or between individual characters.
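For anyone who wants the flavor of that without reading the source, here is a rough sketch in Python -- an illustration only, not the pgdiff code itself, and all names in it are invented. It aligns two word streams with a Levenshtein-style dynamic program and flags mismatches inline as { left | right }. Unlike the real tool, the sketch just joins the merged words with single spaces rather than preserving the first file's linebreaks.

=====

# Sketch of word-token alignment by edit distance, in the spirit of pgdiff.
# Not the actual pgdiff source; names and details are invented for illustration.

def align_words(a, b):
    """Return a list of (word_from_a_or_None, word_from_b_or_None) pairs."""
    n, m = len(a), len(b)
    # cost[i][j] = edit distance between a[:i] and b[:j], with whole words
    # as the tokens rather than characters or lines.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j] + 1,        # drop a word from a
                             cost[i][j - 1] + 1,        # take a word from b
                             cost[i - 1][j - 1] + sub)  # match / substitute
    pairs, i, j = [], n, m                              # trace back
    while i > 0 or j > 0:
        sub = 1 if i == 0 or j == 0 or a[i - 1] != b[j - 1] else 0
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    return list(reversed(pairs))

def flag_mismatches(text_a, text_b):
    """Merge two texts, flagging word-level differences as { a | b }."""
    out = []
    for wa, wb in align_words(text_a.split(), text_b.split()):
        out.append(wa if wa == wb else "{ %s | %s }" % (wa or "", wb or ""))
    return " ".join(out)

print(flag_mismatches("I reckon I had hunted the place over",
                      "I reckon I had himted the place over"))
# -> I reckon I had { hunted | himted } the place over

=====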

PS: To help clarify what I am talking about, I enclose below an excerpt of the output of this tool (being used for versioning, error-flagging and linebreak recovery):

=====

got to going away so much, too, and locking me in. Once he locked me in and was gone three days. It was dreadful lonesome. I judged he had got { drowned, | drownded, } and I wasn't ever going to get out any more. I was scared. I made up my mind I would fix up some way to leave there. I had tried to get out of that cabin many a time, but I couldn't find no way. There { warn't | wam't } a window to it big enough for a dog to get through. I couldn't get up the chimbly; it was too narrow. The door was thick, solid oak slabs. Pap was pretty careful not to leave a knife or anything in the cabin when he was away; I reckon I had { hunted | himted } the place over as much as a { hundred | himdred } times; well, I was most all the time at it, because it was about the only way to put in the time. But this time I found something at { last ; | last; } I found an old rusty wood-saw { without | v/ithout } any handle; it was laid in between a rafter and the clapboards of the roof. I greased it up and went to work. There was an old horse-blanket nailed against the logs at the far end of the cabin behind the table, to keep the wind from blowing through the chinks and putting the candle out. I got under the table and raised the blanket, and went to work to { saw | sav/ } a section of the big bottom log { out - big | out--big } enough to let me through. Well, it was a good long job, but I was getting { towards | toward } the end of it when I heard pap's gun in the woods. I got rid of the signs of my work, and dropped the blanket and hid my saw, and pretty soon pap come in.

=====

One input file has line breaks that look like this:

.it. I was all over welts. He got to going away so much, too, and locking me in. Once he locked me in and was gone three days. It was dreadful lonesome. I judged he had got drowned, and I wasn't ever going to get.

=====

The other input file has line breaks that look like this:

.got to going away so much, too, and locking me in. Once he locked me in and was gone three days. It was dreadful lonesome. I judged he had got.

But it doesn't matter: the algorithm will still find the word differences.
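A practical note on the regex clean-up and fix-up steps mentioned earlier: the exact patterns are up to you (the ones below are my own illustrative guesses, not anything shipped with the tool), but something like normalize() knocks out trivial spacing differences such as { last ; | last; } and { out - big | out--big } before the run, and MARKER is the kind of expression you would use afterwards, in a script or any regex-capable editor, to step through the flagged differences.

=====

import re

def normalize(text):
    # Illustrative punctuation normalization before the comparison; what you
    # actually want depends on your two OCRs.
    text = re.sub(r"\s+-\s+", "--", text)         # "out - big" -> "out--big"
    text = re.sub(r"\s+([;:,.!?])", r"\1", text)  # "last ;"    -> "last;"
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces,
    return text                                   # but keep the linebreaks

# Pattern for stepping through the "{ left | right }" flags afterwards.
MARKER = re.compile(r"\{ (?P<left>[^|}]*) \| (?P<right>[^}]*) \}")

flagged = "I had { hunted | himted } the place over as much as a { hundred | himdred } times"
for m in MARKER.finditer(flagged):
    print(m.group("left"), "<->", m.group("right"))
# -> hunted <-> himted
# -> hundred <-> himdred

=====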

On 11-Mar-2010 19:09, James Adcock wrote:
PS: To help clarify what I am talking about, I enclose below an excerpt of the output of this tool
This suits me. I have a project on the go that I will try this on pretty promptly. I will let you know what I come up with. Thank you!

============================================================
Gardner Buchanan <gbuchana@teksavvy.com>          Ottawa, ON
FreeBSD: Where you want to go. Today.

"James" == James Adcock <jimad@msn.com> writes:
James> PS: To help clarify what I am talking about I enclose below
James> an excerpt of the output of this tool
James> (being used for versioning, error-flagging and linebreak
James> recovery)

It seems very similar to wdiff output; could you please show where your tool gives something basically different from wdiff?

Carlo Traverso

On 3/11/2010 4:24 PM, James Adcock wrote:
I have created a new command line tool “pgdiff” along the lines of what BB has been talking about, which compares two independently OCR’ed texts on a word-by-word basis, so as to find and flag errors.
[snip]

I think this will be a very useful tool moving forward, at least to me. I particularly like the fact that the code is not derived from the GNU diff program. wdiff, of which Mr. Traverso is so fond, is actually just a front end to diff; it takes the input files and rewrites them so that each word is on a separate line, and then passes the rewritten lines to diff. Once you have the diff output it somehow figures out how to merge the results back with the originals, but I actually lost interest in figuring out the code when I realized it required the GNU diff program to work.

One of the reasons I wanted to avoid GNU diff and wdiff is because of the restrictive, viral GPL. I have no problem /using/ GPLed programs, but I have no interest in extending or improving them -- which leads me to wonder about your own claims to intellectual property in this code. Here in the United States I don't think any author can avoid a copyright even if he or she doesn't want one. Copyright is created and attached by operation of law, and there is no actual legal entity called "the public domain" that you can assign your copyright to.

I think it would be nice to have a non-profit organization whose mission is solely to hold copyrights and refuse to enforce them. In the meantime, here is the verbiage I use on my code; I'm not completely convinced it will actually work, but you might want to adopt it as well:

/* Copyright-Only Dedication (based on United States law)

The person or persons who have associated their work with this document (the "Dedicators") hereby dedicate whatever copyright they may have in the work of authorship herein (the "Work") to the public domain.

Dedicators make this dedication for the benefit of the public at large and to the detriment of Dedicators' heirs and successors. Dedicators intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights under copyright law, whether vested or contingent, in the Work. Dedicators understand that such relinquishment of all rights includes the relinquishment of all rights to enforce (by lawsuit or otherwise) those copyrights in the Work.

Dedicators recognize that, once placed in the public domain, the Work may be freely reproduced, distributed, transmitted, used, modified, built upon, or otherwise exploited by anyone for any purpose, commercial or non-commercial, and in any way, including by methods that have not yet been invented or conceived. */

I suspect that your own code may need to be "hardened" against particularly ill-formed files, and might possibly be enhanced to satisfy other needs, or could even become the back end for a visual tool for those users who need it. I'd be happy to route enhancements or bug fixes back to you if I have permission to use the code in other ways.
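As a rough illustration of that word-per-line-then-diff pipeline -- a sketch of the idea only, not wdiff's actual code -- here it is in Python, with difflib standing in for GNU diff and the bracket markers mimicking wdiff's default output style. The merge-back step described above is exactly what this sketch sidesteps by joining everything with single spaces.

=====

import difflib

def wdiff_like(text_a, text_b):
    # Stage 1: reduce each input to a stream of words (wdiff literally
    # rewrites the files one word per line and hands them to GNU diff;
    # difflib plays that role here).
    words_a, words_b = text_a.split(), text_b.split()
    out = []
    # Stage 2: walk the diff opcodes and fold the changes back into running
    # text, deletions as [-...-] and insertions as {+...+}.
    matcher = difflib.SequenceMatcher(None, words_a, words_b)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(words_a[i1:i2])
        else:
            if i1 < i2:
                out.append("[-" + " ".join(words_a[i1:i2]) + "-]")
            if j1 < j2:
                out.append("{+" + " ".join(words_b[j1:j2]) + "+}")
    return " ".join(out)

print(wdiff_like("I reckon I had hunted the place over",
                 "I reckon I had himted the place over"))
# -> I reckon I had [-hunted-] {+himted+} the place over

=====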

On 13 March 2010 19:34, Lee Passey <lee@novomail.net> wrote:
<snip>
In the meantime, here is the verbiage I use on my code; I'm not completely convinced it will actually work, but you might want to adopt it as well:
/* Copyright-Only Dedication (based on United States law)
The person or persons who have associated their work with this document (the "Dedicators") hereby dedicate whatever copyright they may have in the work of authorship herein (the "Work") to the public domain.
Dedicators make this dedication for the benefit of the public at large and to the detriment of Dedicators' heirs and successors. Dedicators intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights under copyright law, whether vested or contingent, in the Work. Dedicators understand that such relinquishment of all rights includes the relinquishment of all rights to enforce (by lawsuit or otherwise) those copyrights in the Work.
Dedicators recognize that, once placed in the public domain, the Work may be freely reproduced, distributed, transmitted, used, modified, built upon, or otherwise exploited by anyone for any purpose, commercial or non-commercial, and in any way, including by methods that have not yet been invented or conceived. */
This sounds quite similar to the 'Creative Commons Zero' licence: http://creativecommons.org/publicdomain/zero/1.0/

--
Jon Ingram

I decline to attach any verbiage at all. I tell you I wrote it, and you can use it any way you like -- at your own risk and amusement, obviously. If you need to get more serious than that, contact me by email and we can talk about it. If you find bugs in it, or difficulties porting it to other platforms, I would like to know about it. I recommend the code not be used by NASA. I have written code that others potentially depended on for life and limb, and I would rather not have to go there again.
participants (7)

- Bowerbird@aol.com
- Gardner Buchanan
- James Adcock
- Jon Ingram
- Karl Eichwalder
- Lee Passey
- traverso@posso.dm.unipi.it