
"Jim" == Jim Adcock <jimad@msn.com> writes:
Don> What I find remarkable is that after 2 decades anyone would
Don> expect the
>> Project Gutenberg old guard to do anything other than the same
>> thing they've been doing.

Greg> ....Which is to leave such decisions to the eBook's
Greg> submitter(s).

Jim> Again, I have written software that would allow one to
Jim> back-align PG works to "the original text" even when they are
Jim> not "identical" texts, and can reintroduce page numbers and
Jim> "original" line breaks. It's in a crude state right now,
Jim> because no one has actually expressed an interest. My intent
Jim> was that it could be used by DP to reprocess old crufty PG
Jim> files back through their system (which it could be used for)
Jim> if they wanted to [so that no one at DP really has ANY excuse
Jim> to complain about independently produced books] or it could
Jim> be used by someone wanting to back-submit to archive.org Or
Jim> it could be used to pursue "more scholarly" versions.

Jim> The software "works" by taking one "polished" PG text and one
Jim> "unpolished" say raw OCR, Levenshtein matching them on word
Jim> tokens and then clones the formatting whitespace from the one
Jim> to the other. It can also clone over the page numbers.

Jim> In general, obviously, if you want to say produce a
Jim> "scholarly" edition from a PG text you're going to have to
Jim> re-proof your book after performing such back matching. My
Jim> software can help with that too.

Much simpler, one can use wdiff (or dwdiff) and preserve whitespace
from one and non-whitespace from the other through regexp.

I am interested in comparing your tool with my wdiff approach.

Carlo
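
To make the comparison concrete, here is a minimal Python sketch of the idea discussed above: align the two texts on word tokens, then keep the words of the polished text and the whitespace (line breaks) of the raw OCR. It is neither Jim's tool nor the wdiff pipeline. It swaps in the standard library's difflib.SequenceMatcher for the word alignment, which is not a Levenshtein matcher, and the names clone_whitespace and spread_gaps are made up for this illustration. Page numbers are not handled.

```python
import difflib
import re


def spread_gaps(n_words, raw_gaps):
    """Pick n_words whitespace strings for a run of polished words.

    Gaps are copied position by position from the raw run; the last
    polished word always takes the last raw gap, so a line break at the
    end of the run survives. With no raw gaps, single spaces are used.
    """
    if n_words == 0:
        return []
    gaps = [" "] * n_words
    if raw_gaps:
        for k in range(min(n_words, len(raw_gaps))):
            gaps[k] = raw_gaps[k]
        gaps[-1] = raw_gaps[-1]
    return gaps


def clone_whitespace(polished, raw):
    """Return the words of `polished` laid out with the whitespace of `raw`."""
    p_words = polished.split()
    # Each raw word paired with the whitespace that follows it.
    pairs = re.findall(r"(\S+)(\s*)", raw)
    r_words = [word for word, _ in pairs]
    r_gaps = [gap for _, gap in pairs]

    # Word-level alignment (assumption: difflib instead of Levenshtein).
    matcher = difflib.SequenceMatcher(a=p_words, b=r_words, autojunk=False)
    out = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        words = p_words[i1:i2]
        gaps = spread_gaps(len(words), r_gaps[j1:j2])
        out.extend(word + gap for word, gap in zip(words, gaps))
    return "".join(out)


if __name__ == "__main__":
    polished = "The quick brown fox jumps over the lazy dog."
    raw_ocr = "The qu1ck brown\nfox jumps ovcr the\nlazy dog.\n"
    print(clone_whitespace(polished, raw_ocr), end="")
```

One known limitation of this sketch: when the raw OCR has extra words that match nothing in the polished text, the whitespace after them is dropped, so a line break can be lost there. Re-proofing after the back-matching is still needed, as noted above.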