
Thanks for this, Jon. I think a good way to start would be with just a few books (i.e., up to 10). They need not be the most popular ones, because those are often older, which makes it harder to identify the correct print edition.

-- Greg

On Sat, Sep 22, 2012 at 10:11:26AM +0100, jon.a@hursts.eclipse.co.uk wrote:
Hi All,
I've had a project for PG library improvement niggling away at the back of my mind for a while now, and would be grateful if those on this list would shoot it down in flames so that I can abandon it and avoid doing a whole heap of work. I will add that Greg has badly let me down by being enthusiastic about it.
The project is to target 40% of PG downloads for improvement, equating to the most popular 1000 titles. There are just two simple steps for each title:
Step 1 is to either source or create a master scan (MS). The MS must be the same edition as the extant PG text, and it must be of high enough quality to support accurate OCR. It will be stored with the extant PG text and will become the unequivocal master document for that text, i.e. if any document that purports to be a version of that text disagrees with the master scan, then that document will be considered to be in error (typos in the original can be corrected directly in the MS if required).
Step 2 is the creation of a reference text transformation (RTT) from the master scan. The RTT is a line-by-line, pixel-to-UTF-8 transformation without wrapping or additional markup. It is produced by diffing a new proofreading of the MS against the extant PG text, so that only errors common to both will remain. The new proofreading will use DP's P1 and P2 rounds, possibly repeated, with a single project-level directive not to clothe end-of-line hyphens and dashes -- essentially a standard DP LOTE-style project skipping P3, F1, F2 and PP. The RTT will be stored by PG alongside the MS.
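As a rough illustration of the diffing step, here is a minimal Python sketch. It assumes both texts are available as plain strings and compares them word by word, so that the different line wrapping of the two texts does not register as a difference. The function name word_disagreements and the sample strings are made up for the example; this is not an existing PG or DP tool.

```python
import difflib


def word_disagreements(proofread: str, extant: str):
    """Return (proofread_words, extant_words) pairs where the texts differ.

    Hypothetical helper: comparing whitespace-separated words means line
    breaks are ignored. Unclothed end-of-line hyphens in the proofread text
    would still show up as differences and would need handling first.
    """
    a = proofread.split()
    b = extant.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


# Made-up sample: the new proofreading keeps the scan's line breaks,
# the extant text is wrapped differently and contains one typo.
proofread = "It was the best of times,\nit was the worst of times,"
extant = "It was the best of times, it was the wurst of times,"

for ours, theirs in word_disagreements(proofread, extant):
    print(f"proofread: {ours!r}  extant: {theirs!r}")
```

Each reported pair would then be checked against the master scan to decide which side is wrong. Anything that survives is an error present in both texts.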
The point of all this?
The immediate benefit is that the RTT can be used to produce a comprehensive list of errata for its extant text, thus allowing the WW to eliminate the vast majority of errors in one hit.
The future benefit is that the MS is a perfect universal master format: it does not include any introduced errors and it codes every formatting nuance. All transformations from MS to either intermediate master formats (TEI, RST, ZML, LaTeX etc.) or final formats (HTML, PDF, epub, mobi etc.) initially follow much the same path, and the RTT represents a fairly late divergence point. Personally, I intend to use them to make a raft of Kindle sized PDFs via LaTeX. Hopefully someone else might take care of epub and mobi. Maybe Amazon might use them to sort their Kindle Store versions out. The point is you end up with the foundations to do interesting work without having to first do a whole lot of boring OCRing and proofreading, and you can easily track changes in the RTT to keep your own version up to date.
Additional benefits: both the MS and RTT are usable ebook formats in and of themselves, and the combination will allow a pretty nifty errata system to be written, whereby a reader types in a suspect phrase, gets taken to the line in the MS where that phrase is found, and can deliver the errata by simply clicking on the line and clicking a "Please Check" button.
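The lookup half of such an errata system could be sketched as below. This assumes the RTT is held as a list of lines in the same order as the lines of the MS, so a line number in the RTT identifies a line in the scan. The name find_phrase and the sample lines are hypothetical.

```python
def find_phrase(rtt_lines, phrase):
    """Return 1-based numbers of the RTT lines that contain the phrase.

    Hypothetical helper: matching ignores case and runs of whitespace.
    A phrase that spans a line break is not found by this simple version.
    """
    needle = " ".join(phrase.lower().split())
    hits = []
    for number, line in enumerate(rtt_lines, start=1):
        haystack = " ".join(line.lower().split())
        if needle and needle in haystack:
            hits.append(number)
    return hits


# Made-up RTT fragment, one entry per line of the scan.
rtt = [
    "It was the best of times,",
    "it was the wurst of times,",
]

print(find_phrase(rtt, "the wurst"))
```

A front end would show the matching scan line and record the reader's "Please Check" click against that line number.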
Weak points that I can see:
1. I know little about copyright law and less about the editions used for current texts, so I'm not going to be able to find scans suitable for the MS without help.
2. Whilst 1000 books is only 2.5% of the PG library, it is still rather a lot of books. I reckon I might get through 1 a month by myself. To complete the project within 2 years, it would need 42 people working at the same rate. That's rather a lot of people.
3. The elephant in the room... most of the heavy lifting relies on DP, and that means convincing Louise to let us use DP's P1 and P2 rounds. Even though standard DP workflow is used and it will help balance the rounds, I have my doubts she will agree.
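The head count in point 2 can be checked with a few lines of Python. The function name people_needed is made up for the example, and the rate of one title per person per month is the estimate given above.

```python
import math


def people_needed(titles, months, titles_per_person_per_month=1.0):
    """Smallest whole number of people who can finish `titles` in `months`."""
    return math.ceil(titles / (months * titles_per_person_per_month))


# 1000 titles in 2 years at 1 title per person per month.
print(people_needed(1000, 24))
```

At that rate 1000 titles take 1000 person-months, and spreading that over 24 months gives 41.7, which rounds up to 42 people.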
So... get out your best negativity and fire away!
Cheers
Jon