Re: [gutvol-d] rewrapping p.g. to an existing scan-set

17 Feb 2012

"don" == don kretz <dakretz@gmail.com> writes:
    don> Assume PG has the images. (Do we know yet for what % of
    don> projects that is true?)

    don> Then from diffing or wherever assume we can generate a list
    don> of text locations to compare with the images. The text
    don> locations are identified by - what - page number and offset?
    don> with some adjacent context?

This is my vision of a system to handle errata and correct PG books.

When we have images, we can easily produce OCR, preferably of good
quality, though Google-quality OCR should be adequate too. We store
the original scans and the OCR, with the OCR linked to the scans (from
any spot in the OCR we can go at least to the page, and also to the
exact position on the page).

A user has noticed something to correct. He opens an errata page at
PG and enters the corrected text with some context: he copies a
snippet of text, corrects it, and sends the corrected version. (This
is the worst-case scenario; one might also send the original text and
additional information, such as the title of the book, but it works
even with just the corrected text.)

The system performs a fuzzy (Google-like) search in the text database
and identifies one or more books, and the positions in them that
almost match the snippet sent.
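
To make the fuzzy search step concrete, here is a minimal Python
sketch. It is only an illustration, not the 2004 prototype: the
function name find_candidates, the in-memory dict of books, the
threshold and the sliding-window approach are all assumptions, and a
real text database would need a proper index rather than this brute
force scan.

```python
from difflib import SequenceMatcher


def find_candidates(snippet, books, threshold=0.8, step=1):
    """Return (score, book_id, offset) for each book whose best-matching
    window is at least `threshold` similar to `snippet`.

    `books` is a hypothetical mapping of book id -> full text.
    """
    hits = []
    width = len(snippet)
    for book_id, text in books.items():
        best_score, best_start = 0.0, -1
        # Slide a window of the snippet's length over the text.
        for start in range(0, max(1, len(text) - width + 1), step):
            window = text[start:start + width]
            score = SequenceMatcher(None, snippet, window,
                                    autojunk=False).ratio()
            if score > best_score:
                best_score, best_start = score, start
        if best_score >= threshold:
            hits.append((best_score, book_id, best_start))
    # Best matches first.
    return sorted(hits, reverse=True)
```

Called with the corrected snippet and the stored texts, it returns the
candidate books and offsets to show to the user, best match first.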

The user is shown the image, the current text, and the corrections
already proposed (whether not yet accepted or already rejected), and
can modify his proposal, then save or cancel. (This can reuse some DP
code.)

Later, a WW-er will examine all the errata reports with the same
interface, and accept, delay or reject each one. The accepted errata
are applied to all the formats automatically and the archive is
updated.

A student of mine prepared a prototype in 2004 that did all this
(except multiple formats; PG HTML was not yet established) for one
text (disks were a lot smaller 8 years ago). I still have some
material, but I fear that I no longer have the installation, which
would be obsolete anyway. It worked even with moved parts: footnotes
were collected at the end, and corrections to the footnotes showed the
page on which the footnote was printed.

The key feature was that the database had two texts, the PG text and
the OCR; the OCR was linked to the images; OCR, PG txt and errata were
associated through fuzzy searches.
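
As a rough sketch of how a position in the PG text could be mapped to
a scan page through the OCR, consider the Python below. It assumes the
OCR is available as one string per page and aligns the two texts with
difflib; the names build_page_index and pg_offset_to_page are made up
for the example, and aligning a whole book in one call would be far
too slow, so a real system would align chunk by chunk.

```python
from bisect import bisect_right
from difflib import SequenceMatcher


def build_page_index(ocr_pages):
    """Concatenate the OCR pages; return (full_text, page_start_offsets)."""
    starts, pos = [], 0
    for page in ocr_pages:
        starts.append(pos)
        pos += len(page)
    return "".join(ocr_pages), starts


def pg_offset_to_page(pg_text, ocr_pages, pg_offset):
    """Map an offset in the PG text to (page_number, offset_in_page)."""
    ocr_text, starts = build_page_index(ocr_pages)
    matcher = SequenceMatcher(None, pg_text, ocr_text, autojunk=False)
    ocr_offset = None
    for a, b, size in matcher.get_matching_blocks():
        if a <= pg_offset < a + size:
            # The offset lies inside an aligned run of identical text.
            ocr_offset = b + (pg_offset - a)
            break
        if a > pg_offset:
            # The offset lies in a gap (an OCR error, say): fall back
            # to the start of the next aligned run.
            ocr_offset = b
            break
    if ocr_offset is None:
        return None
    page = bisect_right(starts, ocr_offset) - 1
    return page, ocr_offset - starts[page]
```

Page numbers here count from 0, and the offset within the page would
still have to be translated to image coordinates using whatever
position data the OCR engine provides.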

The system also worked without images, but the WW-er's decision would
be much more difficult.

I might have a student who could work for three months on such a
project. Not enough to have it in place, but enough for a start.

Carlo

PS: to answer Don's original question: I am not interested in
identifying a position; I am interested in identifying a correction
proposal, and this can be represented as a patch (in the sense of the
Unix command patch):

PATCH(1)

NAME
       patch - apply a diff file to an original
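
For illustration, a correction proposal in this sense could be
generated as a unified diff, which is a format patch(1) accepts. The
Python sketch below uses the standard difflib module; the function
name make_proposal and the file name book.txt are placeholders, not
anything from the prototype.

```python
import difflib


def make_proposal(current_lines, corrected_lines, name="book.txt"):
    """Return a unified diff turning current_lines into corrected_lines.

    Both arguments are lists of lines, each ending in a newline.
    """
    diff = difflib.unified_diff(
        current_lines,
        corrected_lines,
        fromfile="a/" + name,
        tofile="b/" + name,
        n=2,  # lines of context kept around each change
    )
    return "".join(diff)


current = [
    "It was teh best of times,\n",
    "it was the worst of times,\n",
]
corrected = [
    "It was the best of times,\n",
    "it was the worst of times,\n",
]
proposal = make_proposal(current, corrected)
```

The context lines carried in the diff are what let the correction be
located again even if the surrounding text has shifted a little.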

traverso＠posso.dm.unipi.it