Assume PG has the images. (Do we know yet for what % of projects that is true?)

Then, from diffing or wherever, assume we can generate a list of text locations to
compare with the images. How are the text locations identified: by page number
and offset, with some adjacent context?

Let me know if I'm not following ...

Now we need to either 1. rename the image files with page numbers as part of the filename,
or 2. construct a cross-reference of pages and filenames. Option 2 is more extensible and
doesn't involve screwing around with primary sources; but Option 1 bb can understand.
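
For what it's worth, Option 2 could be as small as the sketch below. It assumes
(and this is only an assumption) that the scan files sort into page order by name
and that pages run consecutively from a known first page. The file names and the
helper build_page_index are made up for illustration.

```python
import json


def build_page_index(filenames, first_page=1):
    """Map page numbers to image filenames without renaming anything.

    Assumption: sorting the names gives page order, and pages are
    numbered consecutively starting at first_page.
    """
    return {first_page + i: name for i, name in enumerate(sorted(filenames))}


# Hypothetical scan names, not taken from any real project.
scans = ["scan_0003.png", "scan_0001.png", "scan_0002.png"]
index = build_page_index(scans, first_page=19)

# The cross-reference can live in a side file; the images stay untouched.
print(json.dumps(index, indent=2))
```

A book with unnumbered plates or skipped pages would need a hand-edited table
instead of the consecutive numbering assumed here.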

Next what's needed is a means to a) rename the image files, b) interject page markers into
the text to be compared, c) provide a means to display them side-by-side and make
necessary edits to the text, and d) spit out the text with/without embedded page numbers
and corrections in one ready-to-use file.

All of which Twister provides.



On Thu, Feb 16, 2012 at 2:06 PM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
>>>>> "Lee" == Lee Passey <lee@novomail.net> writes:

   Lee> On Thu, February 16, 2012 3:23 am, Carlo Traverso wrote:

   >>>>>>> "Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
   >>
   Bowerbird> still waiting for carlo to demonstrate his output...
   >>
   Bowerbird> or document his procedure.  or _anything_, really.
   >>  I assume that you know wdiff format if you want to understand
   >> the details. If you don't, either read the manual or skip the
   >> details.

   Lee> So do any of these methods work when one of the files
   Lee> contains markup and the other doesn't?

Yes, in the test that I made with PNP the PG text had _italic markup_,
which is not really different from any other markup. It also had the PG
header and footer (which the TIA file did not have). Of course, the
heavier the markup, the more problems might arise.

<technical>

With markup, I would use dwdiff -P instead of wdiff. The difference is
that when wdiffing "<i>italic markup</i>" against "italic markup",
wdiff reports one big difference
  [-<i>italic markup</i>-] {+italic markup+}
i.e. total replacement, while with dwdiff -P one has
  [-<i>-]italic markup[-</i>-]
i.e. it recognizes that the second version is the same
as the first with the markup removed.
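
The effect of treating punctuation as separate tokens can be imitated
in a few lines of Python. This is only a rough sketch built on the
standard difflib module, not a reimplementation of wdiff or dwdiff;
the function word_diff and its split_punct switch are names invented
for the illustration.

```python
import difflib
import re


def word_diff(old, new, split_punct=False):
    """Word-level diff printed in a wdiff-like notation.

    split_punct=False splits on whitespace only (roughly like wdiff).
    split_punct=True also makes every punctuation character its own
    token (roughly like dwdiff -P). Both are approximations.
    """
    pattern = r"\w+|[^\w\s]" if split_punct else r"\S+"
    a, b = re.findall(pattern, old), re.findall(pattern, new)
    out = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(" ".join(a[i1:i2]))
        else:
            if i2 > i1:  # tokens only in the first text
                out.append("[-" + " ".join(a[i1:i2]) + "-]")
            if j2 > j1:  # tokens only in the second text
                out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


# Whitespace tokens: the whole phrase is replaced.
print(word_diff("<i>italic markup</i>", "italic markup"))
# Punctuation tokens: only the tag characters are reported as removed.
print(word_diff("<i>italic markup</i>", "italic markup", split_punct=True))
```

Because the sketch rejoins tokens with spaces, the second line comes
out as "[-< i >-] italic markup [-< / i >-]" rather than in dwdiff's
exact spacing, but the point is the same: the words match and only the
markup is flagged.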

</technical>

   Lee> Are all of these methods automated (meaning that no human
   Lee> intervention is required to produce the new file)? Perfection
   Lee> is not required; good enough is good enough.

Yes, pipe the wdiff output through a short sed script. The worst that
can happen is that there are a few extra line ends and a few are
missed. The wdiff of the two complete PNP texts, including headers and
footers, took about 0.2 seconds, the sed part probably less (I have not
yet written the script; I used emacs interactively).
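
Since the sed script does not exist yet, here is a guess at what the
post-processing step might look like, written in Python instead of
sed. It keeps the PG words and takes the line breaks from the TIA
side. It assumes that line breaks in the unchanged text already follow
the TIA file, which is how the output looks in my test but is not
something I have checked in general. The function name rewrap is
invented.

```python
import re

# One wdiff change: either [-deleted-] (first file) or {+inserted+} (second file).
CHANGE = re.compile(r"\[-(.*?)-\]|\{\+(.*?)\+\}", re.DOTALL)


def rewrap(wdiff_text):
    """Keep the words of the first file and the line breaks of the second."""
    out = []
    pos = 0
    for m in CHANGE.finditer(wdiff_text):
        out.append(wdiff_text[pos:m.start()])  # unchanged text, copied as is
        deleted, inserted = m.group(1), m.group(2)
        if deleted is not None:
            # Words of the first file: keep them, drop their own line breaks.
            out.append(re.sub(r"\s*\n\s*", " ", deleted))
        else:
            # Words of the second file: drop them, keep the longest break.
            breaks = re.findall(r"\n+", inserted)
            if breaks:
                out.append(max(breaks, key=len))
        pos = m.end()
    out.append(wdiff_text[pos:])
    text = re.sub(r"[ \t]+", " ", "".join(out))  # collapse doubled spaces
    return re.sub(r" ?\n ?", "\n", text)  # no stray spaces at line ends


sample = (
    "fixing in town [-myself--for-]{+myself,\nfor+} I am fond\n"
    "why are [-you-] not {+you+} dancing?"
)
print(rewrap(sample))
```

When a changed stretch spans a line break, the break is placed after
the whole PG stretch rather than at the exact word, which is one way
a few line ends land a word or two off.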


>>>>> "Bowerbird" == Bowerbird  <Bowerbird@aol.com> writes:

   Bowerbird> focus, kids, focus.

   Bowerbird> think about the objective here.

   Bowerbird> so...  why would we want to rewrap a p.g. text to an
   Bowerbird> existing scan-set?

   Bowerbird> well, for two main reasons: 1.  to re-proof and correct
   Bowerbird> the text with the scan-set. 2.  to use the scan-set as
   Bowerbird> the provenance for the text.

To reproof, unless the images are exactly the same edition, and are
good enough, it would be much much better to proofread the OCR and then
look at the wdiff output. It shows exactly where the two versions
differ. Otherwise you'll have to find a few hundred (or thousand)
differences, mainly punctuation.

The wdiff output is excellent to check if a txt file corresponds with
the images, even if the images are bad and the OCR awful, like the
1813 google images.

Look at this fragment of p. 21; [-...-] is the PG version, {+...+} is TIA:

---------------

[-"I-]{+" I+} had once [-had-] some [-thought-]{+thoughts+} of fixing in town [-myself--for-]{+myself,
for+} I am fond of superior society; but I did not feel
quite certain that the air of London would agree with Lady
Lucas."

He paused in hopes of an [-answer;-]{+answer:+} but his companion
was not disposed to make any; and Elizabeth at that instant
moving towards them, he was struck with the [-action-]{+notion+}
of doing a very gallant thing, and called out to
[-her:

"My dear-]{+her,

" My "dear+} Miss Eliza, why are [-you-] not {+you+} dancing? Mr.
Darcy, you must allow me to present this young lady to
you as a very desirable partner. You cannot refuse to
dance, I am [-sure-]{+sure,+} when so much beauty is before you."
And, taking her hand, he would have given it to Mr. [-Darcy-]{+Darcy,+}
who, though extremely surprised, was not unwilling to
receive it, when she instantly drew back, and said with
some discomposure to Sir [-William:

"Indeed,-]{+William,

" Indeed,+} sir, I have not the least intention of dancing.
I entreat you not to suppose that I moved this way in
order to beg for a partner."

---------------


There are 10 substantial differences in 20 lines. From OCR to the
image, there are 3 spacey quotes that are easy to fix in pre-processing,
a stray quote, and maybe a couple of corrections. It is much easier to
proofread from OCR than from a reconstructed text with a lot of
differences.

I hope that checking whether a text corresponds to an OCR might be pretty
much automated, with an analysis of the types of wdiff. Some kinds of
wdiffs are possible OCR misrecognitions (e.g. [-action-]{+notion+})
while others aren't, like word inversions: "why are
[-you-] not {+you+} dancing?" means that PG has "why are you not
dancing?" and TIA has "why are not you dancing?". That cannot be an OCR
error; it is a clear sign of different editions.
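
As a first step towards such an analysis, word inversions are easy to
spot mechanically: the same words are deleted at one place and
inserted a few words away. The sketch below is only an illustration of
that one test, with an invented function name and an arbitrary limit
of three words between the two places.

```python
import re

# One wdiff change: either [-deleted-] (first file) or {+inserted+} (second file).
CHANGE = re.compile(r"\[-(.*?)-\]|\{\+(.*?)\+\}", re.DOTALL)


def find_transpositions(wdiff_text, max_gap_words=3):
    """List (moved words, words in between) for nearby delete/insert pairs."""
    changes = []
    for m in CHANGE.finditer(wdiff_text):
        is_deletion = m.group(1) is not None
        words = (m.group(1) if is_deletion else m.group(2)).split()
        changes.append((m.start(), m.end(), is_deletion, words))

    found = []
    for (_, e1, del1, words1), (s2, _, del2, words2) in zip(changes, changes[1:]):
        gap = wdiff_text[e1:s2].split()  # unchanged words between the two changes
        if del1 != del2 and words1 == words2 and 0 < len(gap) <= max_gap_words:
            found.append((" ".join(words1), " ".join(gap)))
    return found


line = "why are [-you-] not {+you+} dancing? with the [-action-]{+notion+} of"
print(find_transpositions(line))
```

On that line it reports the moved "you" around "not" and ignores
[-action-]{+notion+}, which would have to be judged separately as a
plausible OCR confusion.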

   Bowerbird> the good news is that we only have to do this rewrap
   Bowerbird> _once_ for a book, and we can assume that we'll have
   Bowerbird> volunteers with a reasonable level of skill for the
   Bowerbird> task, so it doesn't have to be idiot-proof or fully
   Bowerbird> automatic.

   Bowerbird> the bad news is that we need to do it for 20,000 books,
   Bowerbird> so the job _can't_ require _too_ much time or energy...

If one can detect automatically whether a PG text corresponds to a set of
images, it might be done. And it is a much different kind of work than
proofreading, so a different set of volunteers might be involved.

And PG has the clearance images, hence one does not have to guess,
just to find the original and check.

Carlo
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d