Re: [gutvol-d] rewrapping p.g. to an existing scan-set

16 Feb 2012

"Lee" == Lee Passey <lee@novomail.net> writes:
Lee> On Thu, February 16, 2012 3:23 am, Carlo Traverso wrote:

    >>>>>>> "Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
    >>
    Bowerbird> still waiting for carlo to demonstrate his output...
    >>
    Bowerbird> or document his procedure.  or _anything_, really.
    >>  I assume that you know wdiff format if you want to understand
    >> the details. If you don't, either read the manual or skip the
    >> details.

    Lee> So do any of these methods work when one of the files
    Lee> contains markup and the other doesn't?

Yes, in the test that I made with PNP the PG text had _italic markup_,
which is not really different from any other markup. It also had the PG
header and footer (which the TIA file did not have). Of course, the
heavier the markup, the more problems might arise.

<technical>

With markup, I would use dwdiff -P instead of wdiff. The difference is
that when wdiffing "_italic markup_" against "italic markup", wdiff
reports one big difference
 [-_italic markup_-] {+italic markup+}
i.e. total replacement, while with dwdiff -P one has
 [-_-]italic markup[-_-]
i.e. it recognizes that the second version is the same
as the first with the markup removed.

</technical>
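
To make the difference concrete, here is a rough Python sketch that
emulates the two tokenizations with difflib. It is not wdiff or dwdiff,
only an illustration of why treating punctuation as separate tokens
helps; the function names are invented for this example.

```python
import difflib
import re

def word_tokens(text):
    # wdiff-like: a word is a maximal run of non-whitespace characters
    return text.split()

def punct_tokens(text):
    # dwdiff -P-like (assumed behaviour): every punctuation character,
    # including "_", becomes a token of its own
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)

def changed(old, new, tokenize):
    """Return (deleted, inserted) token lists between old and new."""
    a, b = tokenize(old), tokenize(new)
    deleted, inserted = [], []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(a[i1:i2])
        if tag in ("insert", "replace"):
            inserted.extend(b[j1:j2])
    return deleted, inserted

pg = "_italic markup_"
tia = "italic markup"
print(changed(pg, tia, word_tokens))   # (['_italic', 'markup_'], ['italic', 'markup'])
print(changed(pg, tia, punct_tokens))  # (['_', '_'], [])
```

With whole words as tokens both words count as replaced; with
punctuation split off, only the two underscores are reported as deleted.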

    Lee> Are all of these methods automated (meaning that no human
    Lee> intervention is required to produce the new file)? Perfection
    Lee> is not required; good enough is good enough.

Yes, pipe the wdiff output through a short sed script. The worst that
can happen is that there are a few extra line ends and a few are
missed. The wdiff of the two complete PNP texts, including headers and
footers, took about 0.2 seconds; the sed part would probably take less
(I have not yet written the script; I used emacs interactively).
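
Since that script is not written yet, here is only a sketch of the idea
in Python, using difflib in place of wdiff plus sed: keep the PG words,
take the line ends from the scan text. The function name rewrap and the
way line ends are placed inside changed regions are assumptions of this
sketch, and paragraph breaks (blank lines) are ignored for brevity.

```python
import difflib

def rewrap(pg_text, scan_text):
    """Re-break pg_text so that its line ends follow those of scan_text."""
    pg_words = pg_text.split()
    scan_lines = [ln.split() for ln in scan_text.splitlines() if ln.strip()]
    scan_words = [w for ln in scan_lines for w in ln]

    # scan_to_pg[j] = how many PG words are consumed after j scan words
    scan_to_pg = [0] * (len(scan_words) + 1)
    matcher = difflib.SequenceMatcher(None, pg_words, scan_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for j in range(j1, j2):
            if tag == "equal":
                scan_to_pg[j + 1] = i1 + (j - j1) + 1
            else:
                # inside a changed region the line end is only approximate
                scan_to_pg[j + 1] = i2

    lines, start, consumed = [], 0, 0
    for ln in scan_lines:
        consumed += len(ln)
        end = scan_to_pg[consumed]
        lines.append(" ".join(pg_words[start:end]))
        start = end
    if start < len(pg_words):
        # PG words with no counterpart at the end of the scan
        lines.append(" ".join(pg_words[start:]))
    return "\n".join(lines)

pg = ("He paused in hopes of an answer; but his companion "
      "was not disposed to make any;")
scan = ("He paused in hopes of an answer: but his companion\n"
        "was not disposed to make any;")
print(rewrap(pg, scan))
```

The output keeps the PG reading "answer;" but breaks the line after
"companion", as in the scan text.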
"Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
Bowerbird> focus, kids, focus.

    Bowerbird> think about the objective here.

    Bowerbird> so...  why would we want to rewrap a p.g. text to an
    Bowerbird> existing scan-set?

    Bowerbird> well, for two main reasons: 1.  to re-proof and correct
    Bowerbird> the text with the scan-set. 2.  to use the scan-set as
    Bowerbird> the provenance for the text.

To reproof: unless the images are of exactly the same edition, and are
good enough, it would be much better to proofread the OCR and then
look at the wdiff output, which shows exactly where the two versions
differ. Otherwise you'll have to find a few hundred (or thousand)
differences, mainly punctuation.

The wdiff output is excellent to check if a txt file corresponds with
the images, even if the images are bad and the OCR awful, like the
1813 google images. 

Look at this fragment of p. 21; [-...-] is the PG version, {+...+} is TIA:

---------------

[-"I-]{+" I+} had once [-had-] some [-thought-]{+thoughts+} of fixing in town [-myself--for-]{+myself,
for+} I am fond of superior society; but I did not feel
quite certain that the air of London would agree with Lady
Lucas."

He paused in hopes of an [-answer;-]{+answer:+} but his companion
was not disposed to make any; and Elizabeth at that instant
moving towards them, he was struck with the [-action-]{+notion+}
of doing a very gallant thing, and called out to
[-her:

"My dear-]{+her,

" My "dear+} Miss Eliza, why are [-you-] not {+you+} dancing? Mr.
Darcy, you must allow me to present this young lady to
you as a very desirable partner. You cannot refuse to
dance, I am [-sure-]{+sure,+} when so much beauty is before you."
And, taking her hand, he would have given it to Mr. [-Darcy-]{+Darcy,+}
who, though extremely surprised, was not unwilling to
receive it, when she instantly drew back, and said with
some discomposure to Sir [-William:

"Indeed,-]{+William,

" Indeed,+} sir, I have not the least intention of dancing.
I entreat you not to suppose that I moved this way in
order to beg for a partner."

---------------

There are 10 substantial differences in 20 lines. Going from the OCR to
the image, there are 3 spacey quotes that are easy to fix in
pre-processing, a stray quote, and maybe a couple of corrections. It is
much easier to proofread from the OCR than from a reconstructed text
with a lot of differences.

I hope that checking whether a text corresponds to an OCR might be
pretty much automated, with an analysis of the types of wdiff. Some
kinds of wdiffs are possible OCR misrecognitions (e.g.
[-action-]{+notion+}) while others cannot be, like word inversions:
"why are [-you-] not {+you+} dancing?" means that PG has "why are you
not dancing?" and TIA has "why are not you dancing?". That is
impossible as an OCR error; it is a clear sign of different editions.
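
A first cut at such an analysis could look like the Python sketch
below. It parses wdiff-style markup and puts a label on each change.
The label names, the window parameter and the rule used to spot an
inversion (a deleted word that comes back as an insertion a few words
later) are all assumptions made for illustration, not a tested
classifier.

```python
import re

TOKEN = re.compile(r"\[-(.*?)-\]|\{\+(.*?)\+\}", re.DOTALL)

def parse_wdiff(text):
    """Split wdiff output into ('same' | 'del' | 'ins', text) pieces."""
    pieces, pos = [], 0
    for m in TOKEN.finditer(text):
        if text[pos:m.start()].strip():      # skip whitespace-only gaps
            pieces.append(("same", text[pos:m.start()]))
        if m.group(1) is not None:
            pieces.append(("del", m.group(1)))
        else:
            pieces.append(("ins", m.group(2)))
        pos = m.end()
    if text[pos:].strip():
        pieces.append(("same", text[pos:]))
    return pieces

def letters(s):
    # compare words ignoring case, punctuation and spacing
    return re.sub(r"[^A-Za-z0-9]", "", s).lower()

def classify(wdiff_text, window=3):
    """Label each change; window = max unchanged words inside an inversion."""
    pieces = parse_wdiff(wdiff_text)
    labels, used = [], set()
    for k, (kind, body) in enumerate(pieces):
        if kind == "same" or k in used:
            continue
        if kind == "ins":
            labels.append(("insertion", "", body))
            continue
        nxt = pieces[k + 1] if k + 1 < len(pieces) else None
        nxt2 = pieces[k + 2] if k + 2 < len(pieces) else None
        if nxt and nxt[0] == "ins":
            used.add(k + 1)
            same = letters(body) == letters(nxt[1])
            labels.append(("punctuation" if same else "substitution", body, nxt[1]))
        elif (nxt and nxt2 and nxt[0] == "same" and nxt2[0] == "ins"
              and letters(nxt2[1]) == letters(body)
              and len(nxt[1].split()) <= window):
            used.add(k + 2)
            labels.append(("transposition", body, nxt2[1]))
        else:
            labels.append(("deletion", body, ""))
    return labels

sample = ("struck with the [-action-]{+notion+} of doing, why are "
          "[-you-] not {+you+} dancing? I am [-sure-]{+sure,+} when")
for label in classify(sample):
    print(label)
# ('substitution', 'action', 'notion')
# ('transposition', 'you', 'you')
# ('punctuation', 'sure', 'sure,')
```

Substitutions are the candidates for OCR misrecognition; a
transposition would be counted as evidence of a different edition.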

    Bowerbird> the good news is that we only have to do this rewrap
    Bowerbird> _once_ for a book, and we can assume that we'll have
    Bowerbird> volunteers with a reasonable level of skill for the
    Bowerbird> task, so it doesn't have to be idiot-proof or fully
    Bowerbird> automatic.

    Bowerbird> the bad news is that we need to do it for 20,000 books,
    Bowerbird> so the job _can't_ require _too_ much time or energy...

If one can detect automatically whether a PG text corresponds to a set
of images, it might be done. And since it is a much different kind of
work from proofreading, a different set of volunteers might be involved.
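
As a crude first filter for that detection, one could compute how many
words of the two texts align once case and punctuation are ignored.
This is only a sketch in Python with difflib: the function names and
the 0.95 threshold are guesses for illustration, not measured values,
and on whole books it would probably have to be run page by page to
stay fast.

```python
import difflib
import re

def normalize(text):
    # lowercase and drop punctuation, so spacey quotes and commas do not count
    return re.findall(r"[a-z0-9]+", text.lower())

def match_ratio(pg_text, ocr_text):
    """Similarity between 0 and 1 of the two word sequences."""
    a, b = normalize(pg_text), normalize(ocr_text)
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

def same_edition(pg_text, ocr_text, threshold=0.95):
    # threshold is hypothetical; it would need tuning on real pairs of texts
    return match_ratio(pg_text, ocr_text) >= threshold

pg = '"My dear Miss Eliza, why are you not dancing?"'
tia = '" My dear Miss Eliza, why are not you dancing?"'
print(match_ratio(pg, tia))
```

A ratio alone cannot tell an inversion from a run of OCR errors, so it
would only select the candidates worth a closer look at the wdiff.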

And PG has the clearance images, so one does not have to guess, only
to find the original and check.

Carlo

traverso＠posso.dm.unipi.it