Re: [gutvol-d] rewrapping p.g. to an existing scan-set

jim said:
ok, that looks pretty good, at first glance. :+)

i'm curious if you can also restore the end-line hyphenates. it's alright if you can't, but if you can, i'd like to see that too. of course, depending on what you'll wanna use this for, you might not even want the end-line hyphenates to be restored. for instance, to re-do this book at distributed proofreaders -- which, face it, greg, will be your "crowdsourcing" solution, because why should anyone re-invent that particular wheel? -- you don't need the end-line hyphenates; d.p. doesn't use 'em. but for some purposes, they will be important. for instance, my intention would be to mount the text with the original scan-set, for easy comparison by the public, so for that kind of project, the end-line hyphenates are needed.

***

still waiting for carlo to demonstrate his output... or document his procedure. or _anything_, really.

***

i've written programs to do this job a half-dozen times, each one with a slightly different approach, so i believe i have enough strategies now to do the task thoroughly, i just need to assemble the pieces in the correct order... it's also the case that i'll compare the archive.org text with the p.g. text, to create a superior end-result, so i'm willing to take some time to pre-clean them both -- most especially the archive.org version -- meaning that it will be simpler to write the synthesizing code... still, i'd like as much of it as possible to be automatic.

so i'll post my output as soon as carlo posts his, but i'm also going to continue to work on this objective...
Try: http://www.gutenberg.org/cache/epub/28948/pg28948.txt vs. http://www.archive.org/details/therainbowlawren00lawrrich
great. i'll tackle that next. unless you know that it's a significantly different version. i'm only interested in scan-sets that are a match for the particular text-file. if they're too dissimilar, a comparison isn't worthwhile. -bowerbird

"Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
Bowerbird> still waiting for carlo to demonstrate his output...
Bowerbird> or document his procedure. or _anything_, really.

I assume that you know wdiff format if you want to understand the details. If you don't, either read the manual or skip the details. Basically, the result of wdiff is the common text, in which the whitespace of the common parts is taken from the second argument, and the variable parts are enclosed in [-...-] for the first file and {+...+} for the second file.

The procedure is the following. Assume that we have two files, file-pg.txt and file-tia.txt, the second having page separators. To simplify, assume that neither file contains the strings [- -] {+ +} that are used in wdiff format (if they do, wdiff has options to use different separators). Execute the command

    wdiff file-pg.txt file-tia.txt > file-mix.txt

Then take file-mix.txt and do the following replacements:

1 - replace SEPARATOR with +}SEPARATOR{+
2 - remove any string composed of (whitespace){+(anything except +})+}
3 - remove [- and -]

The result is not perfect if file-pg.txt and file-tia.txt differ around the separator (e.g. if a word is split at a page boundary): in that case SEPARATOR is introduced at the beginning of the difference zone, and a few words may fall in the wrong page. Since I never needed to do this systematically, I never cared to formalize that step, but a procedure might be easy to describe, using diff at the character level to split one wdiff difference into two differences. To do this, dwdiff instead of wdiff might be useful.

Similarly, if a difference region contains newlines, part of the PG newlines may survive. This is easy to recognize (a [-...-] or {+...+} region contains newlines) and could be handled in the same way. Reintroducing end-of-line hyphenation might be possible too; of course this requires handling newlines as outlined above.

Overall, my advice would be to use wdiff handling as a pre-processing step for a more sophisticated tool operating on small regions. wdiff handles large difference regions, like the PG licence, without problems.

In the specific case of PNP (Pride and Prejudice), the TIA version has a lot of small differences from PG. Comparing with the 1813 first edition, the PG version is apparently more faithful to that edition than to the 1833 edition. TIA only has the second volume of the 1813 edition.

Carlo
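In Python rather than sed, the three replacements might look like the following sketch. The file names come from the command above; the separator pattern is a hypothetical placeholder, and the regexes use DOTALL because difference regions can span newlines:

    import re
    import subprocess

    # Hypothetical placeholder: adjust to whatever marks page boundaries
    # in file-tia.txt, e.g. a DP-style "-----File: 0021.png-----" line.
    SEP = r'-----File: [^\n]*-----'

    # Run wdiff: common text keeps the whitespace of the second file,
    # PG-only text appears as [-...-], TIA-only text as {+...+}.
    mix = subprocess.run(['wdiff', 'file-pg.txt', 'file-tia.txt'],
                         capture_output=True, text=True).stdout

    # Step 1: separators exist only in the TIA file, so they always sit
    # inside {+...+} regions; close the region around each separator so
    # step 2 does not delete it.
    mix = re.sub('(' + SEP + ')', r'+}\1{+', mix)

    # Step 2: remove TIA-only insertions, i.e. whitespace followed by a
    # complete {+...+} group (DOTALL so multi-line regions match too).
    mix = re.sub(r'\s*\{\+.*?\+\}', '', mix, flags=re.DOTALL)

    # Step 3: keep the PG text by stripping the [- and -] markers.
    mix = mix.replace('[-', '').replace('-]', '')

    with open('file-rewrapped.txt', 'w') as out:
        out.write(mix)

As the message notes, the result is imperfect wherever the two files differ around a page boundary.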

On Thu, February 16, 2012 3:23 am, Carlo Traverso wrote:
"Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
Bowerbird> still waiting for carlo to demonstrate his output...
Bowerbird> or document his procedure. or _anything_, really.
I assume that you know wdiff format if you want to understand the details. If you don't, either read the manual or skip the details.
So do any of these methods work when one of the files contains markup and the other doesn't? Are all of these methods automated (meaning that no human intervention is required to produce the new file)? Perfection is not required; good enough is good enough.

"Lee" == Lee Passey <lee@novomail.net> writes:
Lee> On Thu, February 16, 2012 3:23 am, Carlo Traverso wrote:
>> I assume that you know wdiff format if you want to understand
>> the details. If you don't, either read the manual or skip the
>> details.

Lee> So do any of these methods work when one of the files
Lee> contains markup and the other doesn't?

Yes. In the test that I made with PNP, the PG text had _italic markup_, and it is not really different from any other markup. It also had the PG header and footer (which the TIA file did not have). Of course, the heavier the markup, the more problems may arise.

<technical>
With markup, I would use dwdiff -P instead of wdiff. The difference is that wdiffing "<i>italic markup</i>" and "italic markup" gives one big difference,

    [-<i>italic markup</i>-] {+italic markup+}

i.e. total replacement, while with dwdiff -P one has

    [-<i>-]italic markup[-</i>-]

i.e. it recognizes that the second version is the same as the first with the markup removed.
</technical>

Lee> Are all of these methods automated (meaning that no human
Lee> intervention is required to produce the new file)? Perfection
Lee> is not required; good enough is good enough.

Yes: pipe the wdiff command through a short sed script. The worst that can happen is a few extra line ends, and a few missed ones. The wdiff of the two complete PNP files, including headers and footers, took about 0.2 seconds; the sed part probably less (I have not yet written the script, I used emacs interactively).
"Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
Bowerbird> focus, kids, focus.
Bowerbird> think about the objective here.
Bowerbird> so... why would we want to rewrap a p.g. text to an
Bowerbird> existing scan-set?
Bowerbird> well, for two main reasons: 1. to re-proof and correct
Bowerbird> the text with the scan-set. 2. to use the scan-set as
Bowerbird> the provenance for the text.

To re-proof, unless the images are exactly the same edition, and are good enough, it would be much, much better to proofread the OCR and then look at the wdiff output: it shows exactly where the two versions differ. Otherwise you'll have to find a few hundred (or thousand) differences, mainly punctuation.

The wdiff output is excellent for checking whether a txt file corresponds to the images, even if the images are bad and the OCR awful, like the 1813 google images.

Look at this fragment of p. 21; [-...-] is the PG version, {+...+} is TIA:

---------------
[-"I-]{+" I+} had once [-had-] some [-thought-]{+thoughts+} of fixing in town [-myself--for-]{+myself, for+} I am fond of superior society; but I did not feel quite certain that the air of London would agree with Lady Lucas."

He paused in hopes of an [-answer;-]{+answer:+} but his companion was not disposed to make any; and Elizabeth at that instant moving towards them, he was struck with the [-action-]{+notion+} of doing a very gallant thing, and called out to [-her:
"My dear-]{+her,
" My "dear+} Miss Eliza, why are [-you-] not {+you+} dancing? Mr. Darcy, you must allow me to present this young lady to you as a very desirable partner. You cannot refuse to dance, I am [-sure-]{+sure,+} when so much beauty is before you." And, taking her hand, he would have given it to Mr. [-Darcy-]{+Darcy,+} who, though extremely surprised, was not unwilling to receive it, when she instantly drew back, and said with some discomposure to Sir [-William:
"Indeed,-]{+William,
" Indeed,+} sir, I have not the least intention of dancing. I entreat you not to suppose that I moved this way in order to beg for a partner."
---------------

There are 10 substantial differences in 20 lines. From the OCR to the image: 3 spacey quotes, easy to fix in pre-processing, a straw quote, and maybe a couple of corrections. It is much easier to proofread from the OCR than from a reconstructed text with a lot of differences.

I expect that checking whether a text corresponds to an OCR can be pretty much automated, with an analysis of the types of wdiff differences. Some kinds are possible OCR misrecognitions (e.g. [-action-]{+notion+}), while others are not possible, like word inversions: "why are [-you-] not {+you+} dancing?" means that PG has "why are you not dancing?" and TIA has "why are not you dancing?". That is impossible as an OCR error; it is a clear sign of different editions.

Bowerbird> the good news is that we only have to do this rewrap
Bowerbird> _once_ for a book, and we can assume that we'll have
Bowerbird> volunteers with a reasonable level of skill for the
Bowerbird> task, so it doesn't have to be idiot-proof or fully
Bowerbird> automatic.
Bowerbird> the bad news is that we need to do it for 20,000 books,
Bowerbird> so the job _can't_ require _too_ much time or energy...

If one can automatically detect whether a PG text corresponds to a set of images, it can be done. And it is a very different kind of work from proofreading; a different set of volunteers might be involved. And PG has the clearance images, so one does not have to guess: just find the original and check.

Carlo
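A rough sketch of how such an analysis could be automated (the similarity threshold and the inversion window are made-up heuristics; file-mix.txt is assumed to hold wdiff output as above):

    import re
    from difflib import SequenceMatcher

    # Paired difference: [-pg text-] immediately followed by {+tia text+}.
    PAIR = re.compile(r'\[-(.*?)-\]\s*\{\+(.*?)\+\}', re.DOTALL)
    # A deleted word that reappears inserted within the next few words,
    # e.g. "[-you-] not {+you+}": a word inversion, which OCR cannot
    # produce, hence a sign of a different edition.
    INVERSION = re.compile(r'\[-([^\]]+?)-\](?:\s+\S+){1,3}\s+\{\+\1\+\}')

    def classify(pg, tia):
        # Character-level similarity; 0.5 is an arbitrary cutoff.
        ratio = SequenceMatcher(None, pg, tia).ratio()
        return 'plausible OCR error' if ratio >= 0.5 else 'substantive change'

    with open('file-mix.txt') as f:
        mix = f.read()

    for m in INVERSION.finditer(mix):
        print('word inversion (%r moved): different edition?' % m.group(1))
    for pg, tia in PAIR.findall(mix):
        print('[-%s-] / {+%s+}: %s' % (pg, tia, classify(pg, tia)))

On the fragment above this flags [-action-]{+notion+} as a plausible OCR error and the moved "you" as an edition difference.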

Assume PG has the images. (Do we know yet for what % of projects that is true?) Then, from diffing or wherever, assume we can generate a list of text locations to compare with the images. The text locations are identified by - what - page number and offset? with some adjacent context? Let me know if I'm not following ...

Now we need to either

1. rename the image files with page numbers as part of the filename, or
2. construct a cross-reference of pages and filenames.

Option 2 is more extensible and doesn't involve screwing around with primary sources; but Option 1 bb can understand.

Next, what's needed is a means to a) rename the image files, b) interject page markers into the text to be compared, c) provide a means to display them side by side and make necessary edits to the text, and d) spit out the text with/without embedded page numbers and corrections in one ready-to-use file. All of which Twister provides.
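Option 2 could be as small as the following sketch. Everything in it is an assumption: the directory name, the .png pattern, and a fixed front-matter offset; in practice a human has to anchor at least one printed page to one scan:

    import json
    import os

    SCAN_DIR = 'scans'            # hypothetical directory of page images
    FRONT_MATTER_SCANS = 8        # hypothetical: scans before printed page 1

    # Build a {printed page number: image filename} cross-reference
    # without renaming any primary-source files.
    files = sorted(f for f in os.listdir(SCAN_DIR) if f.endswith('.png'))
    xref = {}
    for i, name in enumerate(files):
        page = i - FRONT_MATTER_SCANS + 1
        if page >= 1:
            xref[page] = name

    with open('page-xref.json', 'w') as out:
        json.dump(xref, out, indent=1)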
"Lee" == Lee Passey <lee@novomail.net> writes:
Lee> On Thu, February 16, 2012 3:23 am, Carlo Traverso wrote:
>> "Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
Bowerbird> still waiting for carlo to demonstrate his output...
Bowerbird> or document his procedure. or _anything_, really.
I assume that you know wdiff format if you want to understand the details. If you don't, either read the manual or skip the details.
Lee> So do any of these methods work when one of the files Lee> contains markup and the other doesn't?
Yes, in the test that I made with PNP the PG text had _italic markup_ and it is not really different from any other markup. And had PG header and footer (that the TIA file did not have). Of course, the more heavy is the markup, the more problems might arise.
<technical>
With markup, I would use dwdiff -P instead of wdiff. The difference is that wdiffing "<i>italic markup</i>" and "italic markup" there is one big difference [-<i>italic markup</i>-] {+italic markup+} i.e. total replacement, while with dwdiff -P one has [-<i>-]italic markup[-</i>-] i.e. it recognizes that the second version is the same as the first with the markup removed.
</technical>
Lee> Are all of these methods automated (meaning that no human Lee> intervention is required to produce the new file)? Perfection Lee> is not required; good enough is good enough.
Yes, pipe the wdiff command through a short sed script. The worse that can happen is that there are a few line ends more and a few are missed. The wdiff of the two complete PNP including headers and footers took about 0.2 seconds, the sed part probably less (I have not yet written the script, I used emacs interactively).
"Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
Bowerbird> focus, kids, focus.
Bowerbird> think about the objective here.
Bowerbird> so... why would we want to rewrap a p.g. text to an Bowerbird> existing scan-set?
Bowerbird> well, for two main reasons: 1. to re-proof and correct Bowerbird> the text with the scan-set. 2. to use the scan-set as Bowerbird> the provenance for the text.
To reproof, unless the images are exactly the same edition, and are good enough, it would be much much better to proofread the OCR and then look at the wdiff output. It shows exactly where the two versions differ. Otherwise you'll have to find a few hundred (or thousand) differences, mainly punctuation.
The wdiff output is excellent to check if a txt file corresponds with the images, even if the images are bad and the OCR awful, like the 1813 google images.
Look at this fragment of p. 21; [-...-] is PG version {+....+} is TIA
---------------
[-"I-]{+" I+} had once [-had-] some [-thought-]{+thoughts+} of fixing in town [-myself--for-]{+myself, for+} I am fond of superior society; but I did not feel quite certain that the air of London would agree with Lady Lucas."
He paused in hopes of an [-answer;-]{+answer:+} but his companion was not disposed to make any; and Elizabeth at that instant moving towards them, he was struck with the [-action-]{+notion+} of doing a very gallant thing, and called out to [-her:
"My dear-]{+her,
" My "dear+} Miss Eliza, why are [-you-] not {+you+} dancing? Mr. Darcy, you must allow me to present this young lady to you as a very desirable partner. You cannot refuse to dance, I am [-sure-]{+sure,+} when so much beauty is before you." And, taking her hand, he would have given it to Mr. [-Darcy-]{+Darcy,+} who, though extremely surprised, was not unwilling to receive it, when she instantly drew back, and said with some discomposure to Sir [-William:
"Indeed,-]{+William,
" Indeed,+} sir, I have not the least intention of dancing. I entreat you not to suppose that I moved this way in order to beg for a partner."
---------------
There are 10 substantial differences in 20 lines. From OCR to the image, 3 spacey quotes easy to fix in pre-processing, a straw quote and maybe a couple of corrections. Much easier to proofread from OCR than proofread from a reconstructed text with a lot of differences.
I hope that checking if a text corresponds to an OCR might be pretty much automated, with an analysis of the types of wdiff. Some kinds of wdiffs are possible OCR misrecognitions (e.g. [-action-]{+notion+}) while some other aren't possible: like word inversions; "why are [-you-] not {+you+} dancing?" means that PG has "why are you not dancing?" and TIA has "why are not you dancing?". Impossible for an OCR error, it is a clear sign of different editions.
Bowerbird> the good news is that we only have to do this rewrap Bowerbird> _once_ for a book, and we can assume that we'll have Bowerbird> volunteers with a reasonable level of skill for the Bowerbird> task, so it doesn't have to be idiot-proof or fully Bowerbird> automatic.
Bowerbird> the bad news is that we need to do it for 20,000 books, Bowerbird> so the job _can't_ require _too_ much time or energy...
If one can detect automatically if a PG text corresponds to a set of images it might be done. And is a much different kind of work than proofreading, a different set of volunteers might be involved.
And PG has the clearance images, hence one does not have to guess, just to find the original and check.
Carlo _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

"don" == don kretz <dakretz@gmail.com> writes:
don> Assume PG has the images. (Do we know yet for what % of
don> projects that is true?)

don> Then from diffing or wherever assume we can generate a list
don> of text locations to compare with the images. The text
don> locations are identified by - what - page number and offset?
don> with some adjacent context?

This is my vision of a system to handle errata and correct PG books.

When we have images, we can easily provide OCR, hopefully of good quality, but google quality should be OK too. We store the original and the OCR with an association to the scans (from a spot in the OCR we can go at least to the page, but also to the exact position in the page).

A user has noticed something to correct. He accesses an errata page at PG and enters the corrected text with some context: he copies a snippet of text, corrects it, and sends the corrected text. (This is the worst-case scenario; one might send the original and corrected text, plus additional info like the title of the book, but it works even with just the corrected text.) The system performs a fuzzy search (google-like) in the text database and identifies one or more books, and positions, that almost match the snippet sent. The user is shown the image, the current text, and the corrections already proposed but not yet accepted or already rejected, and can modify his proposal, save, or cancel (this can reuse some DP code).

Later, a WW-er will examine all the errata reports, with the same interface, and accept, delay, or reject them. The accepted errata are applied to all the formats automatically and the archive is updated.

A student of mine in 2004 prepared a prototype that did all this (except multiple formats; PG HTML was not yet established) for one text (disks were a lot smaller 8 years ago). I still have some material, but I fear that I no longer have the installation; it would be obsolete anyway. It worked even with moved parts: footnotes were collected at the end, and corrections to the footnotes showed the page on which the footnote was printed.

The key feature was that the database had two texts, the PG text and the OCR; the OCR was linked to the images; OCR, PG txt and errata were associated through fuzzy searches. The system worked also without images, but the decision of the WW-er would be much more difficult.

I might have a student work three months on such a project. Not enough to have it in place, but enough for a start.

Carlo

PS: to answer the original question of Don: I am not interested in identifying a position, I am interested in identifying a correction proposal, and this can be identified as a patch (in the sense of the Unix command patch, whose man page begins: "patch - apply a diff file to an original").
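The fuzzy-locate step could be prototyped with nothing more than the standard library. A brute-force sketch, fine for a single book (the snippet and the file name are made up; a real system would use an index):

    from difflib import SequenceMatcher

    def locate(snippet, text):
        # Slide a snippet-sized window over the text at a coarse step and
        # keep the position with the highest similarity ratio.
        window, step = len(snippet), max(1, len(snippet) // 4)
        best_pos, best_ratio = 0, 0.0
        for pos in range(0, max(1, len(text) - window), step):
            r = SequenceMatcher(None, snippet, text[pos:pos + window]).ratio()
            if r > best_ratio:
                best_pos, best_ratio = pos, r
        return best_pos, best_ratio

    with open('book-pg.txt') as f:          # placeholder file name
        book = f.read()

    corrected = 'he was struck with the notion of doing a very gallant thing'
    pos, score = locate(corrected, book)
    print('best match at offset %d (similarity %.2f):' % (pos, score))
    print(book[pos:pos + len(corrected)])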

BB> i'm curious if you can also restore the end-line hyphenates.

Haven't thought about it. Again, what I did pretty much "fell out for free" when I was doing other work, namely working on cross-versioning software which doesn't crash and burn when there are large version differences, not just a few words.

In general the DP positions and actions re hyphenation really don't make sense to me. Volunteers are really bad at doing EOL dehyphenation, so this really ought to be handled by smart software, not volunteers. For example, given an EOL hyphen:

... cockle-
shell ...

volunteers will frequently turn this into "cockle shell" -- which is NOT an option! So, DP ought to be dehyphenating as much as possible using smart software before presenting this stuff to volunteers.
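A first cut at that smart software could be dictionary-driven. A sketch, assuming a plain word list in a hypothetical words.txt; the fallback uses DP's -* notation to punt to a human rather than ever producing "cockle shell":

    # Resolve an end-of-line hyphen: "cockle-" at line end, "shell" next.
    with open('words.txt') as f:                # hypothetical word list
        WORDS = {w.strip().lower() for w in f if w.strip()}

    def dehyphenate(first, second):
        closed = first + second                 # "cockleshell"
        hyphenated = first + '-' + second       # "cockle-shell"
        if closed.lower() in WORDS:
            return closed
        if hyphenated.lower() in WORDS:
            return hyphenated
        return first + '-*' + second            # flag for a human

    print(dehyphenate('cockle', 'shell'))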

On Fri, Feb 17, 2012 at 8:59 AM, Jim Adcock <jimad@msn.com> wrote:
In general the DP positions and actions re hyphenation really doesn't make sense to me. Volunteers are really bad at doing EOL dehyphenation, so this really ought to be handled by smart software, not volunteers. For example given an EOL hyphen:
... cockle- shell ...
volunteers will frequently turn this into "cockle shell" -- which is NOT an option! So, DP ought to be dehyphenating as much as possible using smart software before presenting this stuff to volunteers.
It is. Guiprep has always had a dehyphenation pass available. Only the problems that can't be solved by software are presented to the users.

-- Kie ekzistas vivo, ekzistas espero.

Jim> volunteers will frequently turn this into "cockle shell" -- which is
Jim> NOT an option! So, DP ought to be dehyphenating as much as possible
Jim> using smart software before presenting this stuff to volunteers.

David> It is. Guiprep has always had a dehyphenation pass available. Only
David> the problems that can't be solved by software are presented to the
David> users.

Sorry? I thought Guiprep is used at the end of the DP process, during PP. By then the human-powered dehyphenation process has already done its damage.

guiguts is used during PP; guiprep is used when preparing a project. guiprep will dehyphenate based on the dictionary used by the OCR software (because that's what determines the contents of the textwo files). Proofers will then re-insert many of the EOL hyphens using the -* notation, putting the problem squarely back into the human domain.

Badger> guiprep will dehyphenate based on the dictionary used by the OCR
Badger> software (because that's what determines the contents of the
Badger> textwo files). Proofers will then re-insert many of the EOL
Badger> hyphens using the -* notation, putting the problem squarely back
Badger> into the human domain.

Guiprep/Guiguts -- my bad.

Back when I was making submissions to DP, they insisted that the lines *not* be dehyphenated because they thought it confused P1'ers. And at least back then one saw lots of hyphenation/dehyphenation wars, ultimately being resolved by some P3'er getting it totally wrong. The "-*" notation (and its similar options at DP) is my all-time favorite, because it allows subsequent rounds of proofers to effortlessly *decrease* the quality of the work done previously, all while patting themselves on the back saying "what a good boy am I."

PS: "the dictionary used by the OCR software" is obviously a totally wrong dictionary to use in the first place!

On Sun, Feb 26, 2012 at 3:27 PM, Jim Adcock <jimad@msn.com> wrote:
PS: "the dictionary used by the OCR software" obviously being a totally wrong dictionary to use in the first place!
There is no right dictionary. You can't do this automatically as long as we have to deal with books that spell the same compound, in the middle of a line, with a space, with a hyphen, and with no space. You can't do this automatically as long as there are nonce words found only hyphenated at the end of a line.

If you want to work alone on making a book, then fine. But backseat, uninformed complaining doesn't help those who are trying to do it cooperatively.

-- Kie ekzistas vivo, ekzistas espero.

But backseat, uninformed complaining doesn't help those who are trying to do it cooperatively.
And when people who work cooperatively insist on uninformed group-think, that doesn't help either. It should be obvious that the dictionary that comes with the OCR is the wrong dictionary to use, that better choices are readily available, and that those better choices should be used. I can try to make such a tool, but then the group-think will continue to work against that tool. The DP project-specific dictionary is one weak example of trying to use more appropriately targeted dictionaries.

On Mon, Feb 27, 2012 at 9:15 AM, Jim Adcock <jimad@msn.com> wrote:
But backseat, uninformed complaining doesn't help those who are trying to do it cooperatively.
And when people who work cooperatively insist on uninformed group-think that doesn't help either.
http://xkcd.com/610/ You think that complaining that other people are involved in group-think helps your case?
I can try to make such a tool, but then the group-think will continue to work against that tool.
Yeah, it's so much easier to complain that nobody will use your tool than to make it.

In any case, none of this matters. The state of the art uses only internal evidence and a little knowledge of English to dehyphenate. You could actually take a look at the state of guiprep as it exists today, then implement your tool, and _then_ _test it_; because the reality is, you can say it's obviously better as often as you want, but we won't really know that until the tools have been implemented and tested.

-- Kie ekzistas vivo, ekzistas espero.
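Internal evidence here means the book itself: let each end-of-line hyphen be decided by how the same pair is spelled when it occurs mid-line, with no external dictionary at all. A sketch (book.txt is a placeholder; the -* fallback again punts to a human):

    import re
    from collections import Counter

    with open('book.txt') as f:                 # placeholder file name
        text = f.read()

    # Tokenize: a mid-line "cockle-shell" survives as one token, while an
    # end-of-line "cockle-" keeps its trailing hyphen and so pollutes
    # neither the closed nor the hyphenated count.
    counts = Counter(w.lower()
                     for w in re.findall(r"[A-Za-z][A-Za-z'-]*", text))

    def resolve(first, second):
        closed = counts[(first + second).lower()]
        hyphenated = counts[(first + '-' + second).lower()]
        if closed > hyphenated:
            return first + second
        if hyphenated > closed:
            return first + '-' + second
        return first + '-*' + second            # no internal evidence

    print(resolve('cockle', 'shell'))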
participants (7)

- Bowerbird@aol.com
- Da Badger
- David Starner
- don kretz
- Jim Adcock
- Lee Passey
- traverso@posso.dm.unipi.it