Re: [gutvol-d] RST/PGTEI/etc

8 Feb 2012

Don>If PG is to have a hope of citizen-provided proofing, crowd or
otherwise, there are two non-negotiable design requirements.

I would hope that no one on this forum is in a position to issue
non-negotiable demands.  Rather I would hope we could have a polite
conversation of ideas on the basis of their merits.

I would hope PG would be willing to accept *any* text which they can
reasonably assure is out of copyright, and which the public, PG included,
therefore has a "non-negotiable" right to copy wherever and whenever they
want.  If PG is not willing to accept all such texts, then PG is effectively
working *with* those organizations which engage in copyfraud.  But PG and
would-be submitters face a real problem, which is that
some other organizations assert "copyright" over some or all parts of
photographic page images from works which are clearly out of copyright.  In
my opinion those assertions are fraudulent, aka "copyfraud", but as a
pragmatic response to this problem PG and submitters may wish to avoid
copying, storing, and re-transmitting these "copyfraud" photocopies.  Even
these "copyfraud" organizations agree that textual representation of these
out-of-copyright texts in no way infringes on their blessed photocopies.  So
a simple solution in these cases is that PG posts the textual
representation, but does not host a copy of the "copyfraud" images.  Asking
submitters to tackle these "copyfraud" organizations, or to put themselves
in harm's way of them, would not be a reasonable requirement.  What the
submitter often does instead is tell the whitewashers where these pages are
hosted, in case the whitewashers need to verify that the submitted text *is*
in fact the text of a book which *is* out of copyright, and which therefore
the submitter, and PG, have every legal right in the world to store,
transfer and copy as they see fit.

Besides, we are not really talking about "citizen-provided proofing" -- that
is what DP does -- rather we are talking about human-powered formatting to
overcome the limitations of PG tool-driven formatting.  Agreed, still, that
such formatting is done better while viewing page images, otherwise one is
doing "blind formatting" -- which in fact is what the PG tools *are* doing,
except perhaps when they are only slightly fixing problems with submitter
HTML code.
...
Don>The text must have the page numbers embedded as data so the error
submission process includes the ability to easily confirm the text with the
image.

Submitters, including yours truly, often do not record page numbers because
to do so is often a pain -- page numbers being one of the things which OCR
does worst.  People who insist on page numbers do so because page numbers
fit well with *their* choice of work flow.  For those of us who do not use
that work flow, putting page numbers back in doesn't work well.  Checking
submitted text against page images without page numbers is *in practice* no
big deal and something that we who do not use page numbers do "all the time"
-- we simply use textual searches to match what we are working on against a
portion of the page image.  For example when I OCR I also generate a PDF
page image file which contains the OCR text.  Usually I am "in sync" with
the PDF so *I* know where I am.  If I put the work away and come back from
vacation two weeks later, I can simply "re-sync" by doing a textual search
on the PDF.
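
A rough sketch of that kind of textual "re-sync", in Python.  It assumes the
OCR text has already been pulled out of the PDF as one string per page; the
function names and the sample pages are hypothetical, made up purely for
illustration, and this is not any PG or DP tool.

```python
import re

def normalize(text):
    """Lowercase the text and keep only runs of letters/digits, joined by
    single spaces, so line breaks and punctuation do not block a match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def find_pages(snippet, page_texts):
    """Return the 1-based positions of the pages whose OCR text contains
    the snippet (after normalizing both sides)."""
    needle = normalize(snippet)
    if not needle:
        return []
    return [i for i, page in enumerate(page_texts, start=1)
            if needle in normalize(page)]

# Hypothetical OCR output, one string per page image.
pages = [
    "CHAPTER I.\nIt was a dark and\nstormy night;",
    "the rain fell in tor-\nrents, except at occasional intervals,",
]

# Take a few words from the text being worked on and look them up.
print(find_pages("dark and stormy night", pages))
```

A snippet that straddles a page break, or that contains a word hyphenated
across a line end in the OCR, will not match this way; picking a different
few words nearby is usually enough.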

Again, if I have to provide page images then I might as well just provide
the page images to DP and let them "mechanical turk" the work, because
making a set of page images that would keep DP or PG happy is a ton of work
-- and I would rather use the time and energy to do something fun and
constructive.  And some of us believe that page numbers in e-books are a
very, very bad idea.  If you want page numbers, you do the work.  Don't make
us do it.


Jim Adcock