
Don>If PG is to have a hope of citizen-provided proofing, crowd or otherwise, there are two non-negotiable design requirements.
I would hope that no one on this forum is in a position to issue non-negotiable demands. Rather, I would hope we could have a polite conversation about ideas on the basis of their merits.
I would hope PG would be willing to accept *any* text which they can reasonably assure is out of copyright, and which the public, PG included, therefore has a "non-negotiable" right to copy wherever and whenever it wants. If PG is not willing to accept all such texts, then PG is effectively working *with* those organizations which engage in copyfraud.
But PG and would-be submitters face a realistic problem: some other organizations assert "copyright" over some or all parts of photographic page images from works which are clearly out of copyright. In my opinion those assertions are fraudulent, aka "copyfraud", but as a pragmatic response to this problem PG and submitters may wish to avoid copying, storing, and re-transmitting these "copyfraud" photocopies. Even these "copyfraud" organizations agree that a textual representation of these out-of-copyright texts in no way infringes on their blessed photocopies. So a simple solution in these cases is that PG posts the textual representation but does not host a copy of the "copyfraud" images.
Asking submitters to tackle these "copyfraud" organizations, or to put themselves in harm's way of them, would not be a reasonable requirement. What the submitter often does instead is tell the whitewashers where these pages are hosted, in case the whitewashers need to verify that the submitted text *is* in fact the text of a book which *is* out of copyright, and which therefore the submitter and PG have every legal right in the world to store, transfer and copy as they see fit.
Besides, we are not really talking about "citizen-provided proofing" -- that is what DP does -- rather we are talking about human-powered formatting to overcome the limitations of PG's tool-driven formatting. Agreed, still, that such formatting is done better while viewing page images. Otherwise one is doing "blind formatting" -- which in fact is what the PG tools *are* doing, except perhaps when they are only slightly fixing problems with submitter HTML code.
Don>The text must have the page numbers embedded as data so the error submission process includes the ability to easily confirm the text with the image.
Submitters, including yours truly, often do not record page numbers because to do so is often a pain -- page numbers being one of the things which OCR does worst. People who insist on page numbers do so because page numbers fit well with *their* choice of work flow. For those of us who do not use that work flow, putting page numbers back in doesn't work well.
Checking submitted text against page images without page numbers is *in practice* no big deal, and something that we who do not use page numbers do "all the time" -- we simply use textual searches to match what we are working on against a portion of the page image. For example, when I OCR I also generate a PDF page image file which contains the OCR text. Usually I am "in sync" with the PDF, so *I* know where I am. If I put the work away and come back from vacation two weeks later, I can simply "re-sync" by doing a textual search on the PDF.
Again, if I have to provide page images then I might as well just provide the page images to DP and let them "mechanical turk" the work, because making a set of page images that would keep DP or PG happy is a ton of work -- and I would rather use the time and energy to do something fun and constructive. And some of us believe that page numbers in e-books are a very, very bad idea. If you want page numbers, you do the work. Don't make us do it.
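For anyone who wants to see what "re-syncing by textual search" might look like when automated, here is a minimal sketch in Python. It is only an illustration, not a PG or DP tool: it assumes the OCR text has already been extracted from the PDF into one string per page, and the names `normalize`, `find_page` and the sample `pages` list are made up for the example.

```python
import re


def normalize(text):
    """Lowercase the text and reduce it to letter/digit runs joined by single spaces,
    so that line breaks, punctuation and spacing differences do not block a match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def find_page(snippet, pages):
    """Return the 1-based position of the first page whose OCR text contains
    the snippet, or None if no page matches."""
    needle = normalize(snippet)
    if not needle:
        return None
    for number, page_text in enumerate(pages, start=1):
        if needle in normalize(page_text):
            return number
    return None


# Hypothetical per-page OCR text, standing in for text pulled out of a PDF.
pages = [
    "CHAPTER I.\nIt was a dark and\nstormy night;",
    "the rain fell in tor-\nrents, except at occasional intervals,",
]

print(find_page("The rain  fell in", pages))
print(find_page("dark and stormy", pages))
```

The number returned is just the position of the page in the scan set, not a printed page number, so nothing has to be embedded in the text. A phrase that straddles a page break, or a word hyphenated across a line break, will not match in this simple version; picking a short phrase from the middle of a paragraph avoids both.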