Re: [gutvol-d] Improving the PG library

22 Sep 2012

Thanks for this, Jon.  I think a good way to start would be
with just a few books (say, up to 10).  They might not be
the most popular, because those are often older, so
it's harder to know the correct print edition.
  -- Greg

On Sat, Sep 22, 2012 at 10:11:26AM +0100, jon.a@hursts.eclipse.co.uk wrote:
...
Hi All,

I've had a project for PG library improvement niggling away at the
back of my mind for a while now, and would be grateful if those on
this list would shoot it down in flames so that I can abandon it and
avoid doing a whole heap of work. I will add that Greg has badly let
me down by being enthusiastic about it.

The project is to target 40% of PG downloads for improvement, equating
to the most popular 1000 titles. There are just two simple steps for
each title:

Step 1 is to either source or create a master scan (MS). The MS must
be the same edition as the extant PG text, and it must be of high
enough quality to support accurate OCR. It will be stored with the
extant PG text and will become the unequivocal master document for
that text, i.e. if any document that purports to be a version of that
text disagrees with the master scan, then that document will be
considered to be in error (typos in the original can be corrected
directly in the MS if required).

Step 2 is creation of a reference text transformation (RTT) from the
master scan. The RTT is a line-by-line, pixel-to-UTF-8 transformation
without wrapping or additional markup. It is produced by diffing a new
proofreading of the MS with the extant PG text so that only errors
common to both will remain. The new proofreading will use DP's P1 and
P2 rounds, possibly repeated, with a single project-level directive to
not clothe end-of-line hyphens and dashes -- essentially a standard DP
LOTE-style project skipping P3, F1, F2 and PP. The RTT will be stored by PG
alongside the MS.

The point of all this?

The immediate benefit is that the RTT can be used to produce a
comprehensive list of errata for its extant text, thus allowing the WW
to eliminate the vast majority of errors in one hit.
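
As a rough illustration of how such an errata list might be generated,
here is a minimal Python sketch. It assumes the RTT and the extant text
are both available as plain strings; the function name `list_errata` and
the sample sentences are hypothetical, and the comparison uses the
standard-library `difflib` module. Because the extant text is re-wrapped
while the RTT keeps the scan's line breaks, the sketch compares word
streams rather than lines. It does not rejoin words hyphenated across a
line end, which a real tool would need to handle.

```python
import difflib


def list_errata(rtt_text, extant_text):
    """Return (extant_words, rtt_words) pairs wherever the two texts disagree.

    Both texts are split on whitespace first, so differences in line
    wrapping alone are not reported.
    """
    rtt_words = rtt_text.split()
    extant_words = extant_text.split()
    matcher = difflib.SequenceMatcher(None, extant_words, rtt_words, autojunk=False)
    errata = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errata.append((" ".join(extant_words[i1:i2]), " ".join(rtt_words[j1:j2])))
    return errata


# Hypothetical sample data: the extant text has one typo and different wrapping.
rtt = "It was the best of times,\nit was the worst of times,"
extant = "It was the best of tines, it was the\nworst of times,"

for found, expected in list_errata(rtt, extant):
    print(f"extant has {found!r}; scan-based text has {expected!r}")
```

Each reported pair is only a candidate erratum: the disagreement still
has to be checked against the MS to see which side is right.
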
The future benefit is that the MS is a perfect universal master
format: it does not include any introduced errors and it codes every
formatting nuance. All transformations from MS to either intermediate
master formats (TEI, RST, ZML, LaTeX etc.) or final formats (HTML,
PDF, epub, mobi etc.) initially follow much the same path, and the RTT
represents a fairly late divergence point. Personally, I intend to use
them to make a raft of Kindle-sized PDFs via LaTeX. Hopefully someone
else might take care of epub and mobi. Perhaps Amazon will use them to
sort their Kindle Store versions out. The point is you end up with the
foundations to do interesting work without having to first do a whole
lot of boring OCRing and proofreading, and you can easily track
changes in the RTT to keep your own version up to date.

Additional benefits: both the MS and RTT are usable ebook formats in
and of themselves, and the combination will allow a pretty nifty
errata system to be written, whereby a reader types in a suspect
phrase, gets taken to the line in the MS where that phrase is found,
and can deliver the errata by simply clicking on the line and clicking
a "Please Check" button.
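
The lookup half of such an errata system could be sketched as follows.
This is only an illustration in Python, assuming the RTT is held as a
list of lines in the same order as the lines of the MS, so that an RTT
line number identifies the matching line of the scan. The name
`find_phrase` and the sample lines are hypothetical. A phrase that
straddles a line break is handled by searching each line joined to the
one after it.

```python
def find_phrase(rtt_lines, phrase):
    """Return 1-based line numbers in the RTT where the phrase starts.

    Matching is case-insensitive and ignores runs of whitespace. A phrase
    that runs over a line break is found by searching each line together
    with the line that follows it.
    """
    needle = " ".join(phrase.lower().split())
    hits = []
    for index, line in enumerate(rtt_lines):
        current = " ".join(line.lower().split())
        following = ""
        if index + 1 < len(rtt_lines):
            following = " ".join(rtt_lines[index + 1].lower().split())
        window = (current + " " + following).strip()
        position = window.find(needle)
        # Count the hit only if the phrase starts on the current line.
        if needle and position != -1 and position < len(current):
            hits.append(index + 1)
    return hits


# Hypothetical RTT fragment, one entry per line of the scan.
rtt_lines = [
    "It was the best of times, it was the",
    "worst of times, it was the age of",
    "wisdom, it was the age of foolishness,",
]

print(find_phrase(rtt_lines, "the worst of"))
```

The returned line numbers are what a "Please Check" button would need
in order to point a checker at the right place in the MS.
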
Weak points that I can see:
1. I know little about copyright law and less about the editions used
for current texts, so I'm not going to be able to find scans suitable
for the MS without help.
2. Whilst 1000 books is only 2.5% of the PG library, it is still
rather a lot of books. I reckon I might get through 1 a month by
myself. To complete the project within 2 years, it would need 42
people working at the same rate. That's rather a lot of people.
3. The elephant in the room... most of the heavy lifting relies on DP,
and that means convincing Louise to let us use DP's P1 and P2
rounds. Even though standard DP workflow is used and it will help
balance the rounds, I have my doubts she will agree.
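
The arithmetic behind weak point 2 can be checked with a few lines of
Python. The figures (1000 titles, 1 title per person per month, a 2 year
target) come from the text above; the function name `people_needed` is
just for illustration.

```python
import math


def people_needed(titles, titles_per_person_per_month, months):
    """Smallest whole number of people who can finish all titles in time."""
    return math.ceil(titles / (titles_per_person_per_month * months))


# 1000 titles at 1 title per person per month, over 24 months.
print(people_needed(1000, 1, 24))
```

1000 / 24 is about 41.7, which rounds up to the 42 people quoted.
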
So... get out your best negativity and fire away!

Cheers
Jon
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d

Greg Newby