
"Lee" == Lee Passey <lee@novomail.net> writes:
Lee> On Thu, February 16, 2012 3:23 am, Carlo Traverso wrote:
>>>>>>> "Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
>> Bowerbird> still waiting for carlo to demonstrate his output...
>> Bowerbird> or document his procedure. or _anything_, really.
>> I assume that you know wdiff format if you want to understand
>> the details. If you don't, either read the manual or skip the
>> details.

Lee> So do any of these methods work when one of the files
Lee> contains markup and the other doesn't?

Yes. In the test that I made with PNP, the PG text had _italic
markup_, which is not really different from any other markup. It also
had the PG header and footer (which the TIA file did not have). Of
course, the heavier the markup, the more problems might arise.

<technical>
With markup, I would use dwdiff -P instead of wdiff. The difference is
that wdiffing "<i>italic markup</i>" and "italic markup" gives one big
difference

    [-<i>italic markup</i>-] {+italic markup+}

i.e. total replacement, while with dwdiff -P one has

    [-<i>-]italic markup[-</i>-]

i.e. it recognizes that the second version is the same as the first
with the markup removed.
</technical>

Lee> Are all of these methods automated (meaning that no human
Lee> intervention is required to produce the new file)? Perfection
Lee> is not required; good enough is good enough.

Yes, pipe the output of the wdiff command through a short sed script.
The worst that can happen is that there are a few extra line ends and
a few are missed. The wdiff of the two complete PNP files, including
headers and footers, took about 0.2 seconds, and the sed part would
probably take less (I have not yet written the script; I used emacs
interactively).
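Since the sed script was not written, here is a minimal Python sketch of
the marker-stripping step only. It assumes the default wdiff markers shown
above ([-...-] for text only in the first file, {+...+} for text only in
the second). The function names keep_first and keep_second are
hypothetical, and the handling of line ends is deliberately left out.

```python
import re

# Default wdiff markers: [-...-] = only in first file, {+...+} = only in second.
DELETED = re.compile(r"\[-(.*?)-\]", re.DOTALL)
INSERTED = re.compile(r"\{\+(.*?)\+\}", re.DOTALL)


def _tidy(text):
    """Collapse runs of spaces left behind by removed markers (keeps newlines)."""
    return re.sub(r"[ \t]{2,}", " ", text)


def keep_first(wdiff_text):
    """Recover the wording of the first file (e.g. PG) from wdiff output."""
    text = INSERTED.sub("", wdiff_text)   # drop what only the second file has
    text = DELETED.sub(r"\1", text)       # unwrap what only the first file has
    return _tidy(text)


def keep_second(wdiff_text):
    """Recover the wording of the second file (e.g. TIA) from wdiff output."""
    text = DELETED.sub("", wdiff_text)
    text = INSERTED.sub(r"\1", text)
    return _tidy(text)


if __name__ == "__main__":
    sample = "[-<i>-]italic markup[-</i>-]"
    print(keep_first(sample))
    print(keep_second(sample))
```

Non-greedy matching is used so that a marker containing hyphens, such as
[-myself--for-], is still unwrapped as one unit.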
"Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
Bowerbird> focus, kids, focus.
Bowerbird> think about the objective here.
Bowerbird> so... why would we want to rewrap a p.g. text to an
Bowerbird> existing scan-set?
Bowerbird> well, for two main reasons: 1. to re-proof and correct
Bowerbird> the text with the scan-set. 2. to use the scan-set as
Bowerbird> the provenance for the text.

To reproof, unless the images are exactly the same edition and are
good enough, it would be much, much better to proofread the OCR and
then look at the wdiff output. It shows exactly where the two versions
differ. Otherwise you'll have to find a few hundred (or thousand)
differences, mainly punctuation.

The wdiff output is excellent for checking whether a txt file
corresponds with the images, even if the images are bad and the OCR
awful, like the 1813 Google images. Look at this fragment of p. 21;
[-...-] is PG version {+....+} is TIA

---------------
[-"I-]{+" I+} had once [-had-] some [-thought-]{+thoughts+} of fixing
in town [-myself--for-]{+myself, for+} I am fond of superior society;
but I did not feel quite certain that the air of London would agree
with Lady Lucas." He paused in hopes of an [-answer;-]{+answer:+} but
his companion was not disposed to make any; and Elizabeth at that
instant moving towards them, he was struck with the
[-action-]{+notion+} of doing a very gallant thing, and called out to
[-her: "My dear-]{+her, " My "dear+} Miss Eliza, why are [-you-] not
{+you+} dancing? Mr. Darcy, you must allow me to present this young
lady to you as a very desirable partner. You cannot refuse to dance, I
am [-sure-]{+sure,+} when so much beauty is before you." And, taking
her hand, he would have given it to Mr. [-Darcy-]{+Darcy,+} who,
though extremely surprised, was not unwilling to receive it, when she
instantly drew back, and said with some discomposure to Sir
[-William: "Indeed,-]{+William, " Indeed,+} sir, I have not the least
intention of dancing. I entreat you not to suppose that I moved this
way in order to beg for a partner."
---------------

There are 10 substantial differences in 20 lines. Going from the OCR
to the image, there are 3 spacey quotes that are easy to fix in
pre-processing, a stray quote and maybe a couple of corrections. It is
much easier to proofread from the OCR than from a reconstructed text
with a lot of differences.

I hope that checking whether a text corresponds to an OCR might be
pretty much automated, with an analysis of the types of wdiff
differences. Some kinds of differences are possible OCR
misrecognitions (e.g. [-action-]{+notion+}) while others aren't, like
word inversions: "why are [-you-] not {+you+} dancing?" means that PG
has "why are you not dancing?" and TIA has "why are not you dancing?".
That is impossible as an OCR error; it is a clear sign of different
editions.

Bowerbird> the good news is that we only have to do this rewrap
Bowerbird> _once_ for a book, and we can assume that we'll have
Bowerbird> volunteers with a reasonable level of skill for the
Bowerbird> task, so it doesn't have to be idiot-proof or fully
Bowerbird> automatic.

Bowerbird> the bad news is that we need to do it for 20,000 books,
Bowerbird> so the job _can't_ require _too_ much time or energy...

If one can detect automatically whether a PG text corresponds to a set
of images, it might be done. And it is a much different kind of work
than proofreading, so a different set of volunteers might be involved.
And PG has the clearance images, hence one does not have to guess,
just find the original and check.

Carlo
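
The analysis of wdiff difference types mentioned above could be sketched
in Python roughly as follows. This is only an illustration under stated
assumptions: the labels, the 0.6 similarity threshold and the 3-word
window for inversions are arbitrary choices of the sketch, not tested
values, and the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Split wdiff output into deleted [-...-], inserted {+...+} and unchanged runs.
TOKEN = re.compile(
    r"\[-(?P<old>.*?)-\]|\{\+(?P<new>.*?)\+\}|(?P<same>[^\[{]+|[\[{])",
    re.DOTALL,
)


def label_pair(old, new, threshold=0.6):
    """Label a replacement; the threshold is an assumption of this sketch."""
    letters = lambda s: re.sub(r"[\W_]", "", s)
    if letters(old) == letters(new):
        return "punctuation/spacing"
    if SequenceMatcher(None, old, new).ratio() >= threshold:
        return "possible OCR error"
    return "rewording"


def classify(wdiff_text, window=3):
    """Return (old, new, label) for each difference found in wdiff output."""
    tokens = [(m.lastgroup, m.group(m.lastgroup).strip())
              for m in TOKEN.finditer(wdiff_text)]
    tokens = [t for t in tokens if t[1]]          # drop whitespace-only runs
    results, used = [], set()
    for i, (kind, text) in enumerate(tokens):
        if kind == "same" or i in used:
            continue
        if kind == "new":
            results.append(("", text, "insertion"))
        elif i + 1 < len(tokens) and tokens[i + 1][0] == "new":
            # Deletion immediately followed by an insertion: a replacement.
            used.add(i + 1)
            new = tokens[i + 1][1]
            results.append((text, new, label_pair(text, new)))
        elif (i + 2 < len(tokens) and tokens[i + 1][0] == "same"
              and tokens[i + 2] == ("new", text)
              and len(tokens[i + 1][1].split()) <= window):
            # Same words removed and re-inserted a few words later.
            used.add(i + 2)
            results.append((text, text, "inversion"))
        else:
            results.append((text, "", "deletion"))
    return results


if __name__ == "__main__":
    for row in classify("why are [-you-] not {+you+} dancing?"):
        print(row)
```

A text with many "inversion" or "rewording" entries would then be a
candidate for a different edition, while one with mostly
"punctuation/spacing" and "possible OCR error" entries would be a
candidate match for the images.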