
On 3/11/2010 4:24 PM, James Adcock wrote:
I have created a new command line tool “pgdiff” along the lines of what BB has been talking about, which compares two independently OCR’ed texts on a word-by-word basis, so as to find and flag errors.
[snip]

I think this will be a very useful tool moving forward, at least to me. I particularly like the fact that the code is not derived from the GNU diff program. wdiff, of which Mr. Traverso is so fond, is actually just a front end to diff: it takes the input files and rewrites them so that each word is on a separate line, and then passes the rewritten lines to diff. Once it has the diff output, it somehow figures out how to merge the results back with the originals, but I lost interest in figuring out the code when I realized it required the GNU diff program to work.

One of the reasons I wanted to avoid GNU diff and wdiff is the restrictive, viral GPL. I have no problem /using/ GPLed programs, but I have no interest in extending or improving them -- which leads me to wonder about your own claims to intellectual property in this code.

Here in the United States I don't think any author can avoid a copyright even if he or she doesn't want one. Copyright is created and attached by operation of law, and there is no actual legal entity called "the public domain" that you can assign your copyright to. I think it would be nice to have a non-profit organization whose mission is solely to hold copyrights and refuse to enforce them.

In the meantime, here is the verbiage I use on my code; I'm not completely convinced it will actually work, but you might want to adopt it as well:

/*
  Copyright-Only Dedication (based on United States law)

  The person or persons who have associated their work with this
  document (the "Dedicators") hereby dedicate whatever copyright they
  may have in the work of authorship herein (the "Work") to the
  public domain.

  Dedicators make this dedication for the benefit of the public at
  large and to the detriment of Dedicators' heirs and successors.
  Dedicators intend this dedication to be an overt act of
  relinquishment in perpetuity of all present and future rights
  under copyright law, whether vested or contingent, in the Work.
  Dedicators understand that such relinquishment of all rights
  includes the relinquishment of all rights to enforce (by lawsuit
  or otherwise) those copyrights in the Work.

  Dedicators recognize that, once placed in the public domain, the
  Work may be freely reproduced, distributed, transmitted, used,
  modified, built upon, or otherwise exploited by anyone for any
  purpose, commercial or non-commercial, and in any way, including
  by methods that have not yet been invented or conceived.
*/

I suspect that your own code may need to be "hardened" against particularly ill-formed files, and might possibly be enhanced to satisfy other needs, or could even become the back end for a visual tool for those users who need it. I'd be happy to route enhancements or bug fixes back to you if I have permission to use the code in other ways.
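
As a rough illustration of the word-by-word comparison discussed above, here is a minimal Python sketch. It is not the pgdiff code, and it is not how wdiff works internally: it assumes words are simply whitespace-separated, and it uses the standard library's difflib.SequenceMatcher rather than calling GNU diff. The function name flag_differences and the {first|second} flagging format are made up for this example.

```python
import difflib


def flag_differences(text_a, text_b):
    """Compare two texts word by word and flag the places where they differ.

    Matching words are copied through unchanged. Each differing run is
    written as {words from A|words from B}; one side is empty when a word
    appears in only one of the two texts.
    """
    # Split on whitespace: the same idea as putting each word on its own line.
    words_a = text_a.split()
    words_b = text_b.split()

    # autojunk=False turns off the heuristic that ignores very common items,
    # which could otherwise skip frequent words in long texts.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)

    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(words_a[i1:i2])
        else:
            # tag is "replace", "delete", or "insert"
            left = " ".join(words_a[i1:i2])
            right = " ".join(words_b[j1:j2])
            out.append("{" + left + "|" + right + "}")
    return " ".join(out)


if __name__ == "__main__":
    ocr_a = "the quick brown fox"
    ocr_b = "the quiek brown fox"
    print(flag_differences(ocr_a, ocr_b))
```

Because the comparison runs on word lists, line breaks that differ between the two OCR runs do not show up as differences; only the words themselves are compared. A real tool would also have to deal with punctuation, hyphenation across lines, and badly formed input, none of which this sketch attempts.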