
"don" == don kretz <dakretz@gmail.com> writes:
don> Assume PG has the images. (Do we know yet for what % of
don> projects that is true?)

don> Then from diffing or wherever assume we can generate a list
don> of text locations to compare with the images. The text
don> locations are identified by - what - page number and offset?
don> with some adjacent context?

This is my vision of a system to handle errata and correct PG books.

When we have images, we can easily provide OCR, hopefully of good quality, though Google quality should be OK too. We store the original, and the OCR with an association to the scans (from a spot in the OCR we can get at least to the page, but also to the exact position on the page).

A user has noticed something to correct. He accesses an errata page at PG and enters the corrected text with some context: he copies a snippet of text, corrects it, and sends the corrected text. (This is the worst-case scenario; one might send both the original and the corrected text, plus additional info such as the title of the book, but it works even with just the corrected text.)

The system performs a fuzzy (Google-like) search in the text database and identifies one or more books, and the positions in them that almost match the snippet sent. The user is shown the image, the current text, and the corrections already proposed (whether not yet accepted or already rejected), and can modify his proposal, save or cancel (this can reuse some DP code).

Later, a WW-er will examine all the errata reports, with the same interface, and accept, delay or reject them. The accepted errata are applied to all the formats automatically and the archive is updated.

A student of mine in 2004 prepared a prototype that did all this (except multiple formats; PG HTML was not yet established) for one text (disks were a lot smaller 8 years ago). I still have some material, but I fear that I no longer have the installation; it would be obsolete anyway.
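
To make the fuzzy lookup step concrete, here is a minimal sketch in Python. It is only an illustration and not the code of the 2004 prototype: the names `Hit`, `best_window` and `search_books`, the word-window approach, the 0.8 threshold and the two sample "books" (with a made-up typo) are all assumptions. A real text database would need an index rather than a linear scan.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Hit:
    title: str    # which book matched
    start: int    # index of the first word of the matching window
    end: int      # index one past the last word of the window
    ratio: float  # similarity in [0, 1]; 1.0 means identical


def best_window(snippet: str, text: str) -> tuple[int, int, float]:
    """Slide a window as long as the snippet (in words) over the text
    and return (start, end, ratio) for the most similar window."""
    words = text.split()
    size = max(1, len(snippet.split()))
    target = snippet.lower()
    best = (0, 0, 0.0)
    for i in range(max(1, len(words) - size + 1)):
        window = " ".join(words[i:i + size]).lower()
        ratio = SequenceMatcher(None, window, target).ratio()
        if ratio > best[2]:
            best = (i, min(i + size, len(words)), ratio)
    return best


def search_books(snippet: str, books: dict[str, str],
                 min_ratio: float = 0.8) -> list[Hit]:
    """Return the books whose best window almost matches the corrected
    snippet, most similar first. min_ratio is an arbitrary cut-off."""
    hits = []
    for title, text in books.items():
        start, end, ratio = best_window(snippet, text)
        if ratio >= min_ratio:
            hits.append(Hit(title, start, end, ratio))
    return sorted(hits, key=lambda h: h.ratio, reverse=True)


if __name__ == "__main__":
    # Sample data for illustration only; "tmies" is an invented typo.
    books = {
        "A Tale of Two Cities": "It was the best of times, it was the "
                                "worst of tmies, it was the age of wisdom",
        "Moby Dick": "Call me Ishmael. Some years ago, never mind how "
                     "long precisely",
    }
    for hit in search_books("it was the worst of times", books):
        print(hit)
```

The user sends only the corrected snippet; the search returns the book and the word range that almost match it, which is the spot whose page image would then be shown next to the current text.
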
It worked even with moved parts: footnotes were collected at the end, and corrections to the footnotes showed the page on which the footnote was printed.

The key feature was that the database held two texts, the PG text and the OCR; the OCR was linked to the images; and the OCR, the PG text and the errata were associated through fuzzy searches. The system also worked without images, but then the WW-er's decision would be much more difficult.

I might have a student who could work for three months on such a project. That is not enough to have it in place, but enough for a start.

Carlo

PS: to answer Don's original question: I am not interested in identifying a position; I am interested in identifying a correction proposal, and this can be identified as a patch (in the sense of the Unix command patch):

PATCH(1)

NAME
       patch - apply a diff file to an original