
Hi All,

I've had a project for PG library improvement niggling away at the back of my mind for a while now, and would be grateful if those on this list would shoot it down in flames so that I can abandon it and avoid doing a whole heap of work. I will add that Greg has badly let me down by being enthusiastic about it.

The project is to target 40% of PG downloads for improvement, equating to the most popular 1000 titles. There are just two simple steps for each title:

Step 1 is to either source or create a master scan (MS). The MS must be the same edition as the extant PG text, and it must be of high enough quality to support accurate OCR. It will be stored with the extant PG text and will become the unequivocal master document for that text, i.e. if any document that purports to be a version of that text disagrees with the master scan, then that document will be considered to be in error (typos in the original can be corrected directly in the MS if required).

Step 2 is creation of a reference text transformation (RTT) from the master scan. The RTT is a line-by-line, pixel-to-UTF-8 transformation without wrapping or additional markup. It is produced by diffing a new proofreading of the MS against the extant PG text so that only errors common to both will remain. The new proofreading will use DP's P1 and P2 rounds, possibly repeated, with a single project-level directive to not clothe eol hyphens and dashes -- essentially a standard DP LOTE-style project skipping P3, F1, F2 and PP. The RTT will be stored by PG alongside the MS.

The point of all this?

The immediate benefit is that the RTT can be used to produce a comprehensive list of errata for its extant text, thus allowing the WW to eliminate the vast majority of errors in one hit.

The future benefit is that the MS is a perfect universal master format: it does not include any introduced errors and it codes every formatting nuance. All transformations from MS to either intermediate master formats (TEI, RST, ZML, LaTeX etc.) or final formats (HTML, PDF, epub, mobi etc.) initially follow much the same path, and the RTT represents a fairly late divergence point. Personally, I intend to use them to make a raft of Kindle-sized PDFs via LaTeX. Hopefully someone else might take care of epub and mobi. Maybe Amazon might use them to sort their Kindle Store versions out. The point is you end up with the foundations to do interesting work without having to first do a whole lot of boring OCRing and proofreading, and you can easily track changes in the RTT to keep your own version up to date.

Additional benefits: both the MS and RTT are usable ebook formats in and of themselves, and the combination will allow a pretty nifty errata system to be written, whereby a reader types in a suspect phrase, gets taken to the line in the MS where that phrase is found, and can deliver the errata by simply clicking on the line and clicking a "Please Check" button.

Weak points that I can see:

1. I know little about copyright law and less about the editions used for current texts, so I'm not going to be able to find scans suitable for the MS without help.

2. Whilst 1000 books is only 2.5% of the PG library, it is still rather a lot of books. I reckon I might get through 1 a month by myself. To complete the project within 2 years, it would need 42 people working at the same rate. That's rather a lot of people.

3. The elephant in the room... most of the heavy lifting relies on DP, and that means convincing Louise to let us use DP's P1 and P2 rounds. Even though standard DP workflow is used and it will help balance the rounds, I have my doubts she will agree.

So... get out your best negativity and fire away!

Cheers

Jon
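
As an illustration of the diff at the heart of Step 2 -- comparing a fresh P1/P2 transcription of the master scan against the extant PG text to surface candidate errata -- here is a minimal Python sketch. It assumes the extant text has already been unwrapped back to one line per printed line; the file names are placeholders, not part of any existing PG or DP tooling.

    import difflib

    def candidate_errata(proofread_path, extant_path):
        """Return a unified diff between the fresh transcription and the
        extant text; every hunk is a candidate erratum for the WW to review.
        Lines on which the two agree are either correct or share an error
        common to both, which this diff cannot catch."""
        with open(proofread_path, encoding="utf-8") as f:
            new_lines = f.read().splitlines(keepends=True)
        with open(extant_path, encoding="utf-8") as f:
            old_lines = f.read().splitlines(keepends=True)
        return difflib.unified_diff(old_lines, new_lines,
                                    fromfile="extant PG text",
                                    tofile="fresh P1/P2 proofread",
                                    n=0)

    if __name__ == "__main__":
        for line in candidate_errata("proofread.txt", "extant.txt"):
            print(line, end="")

With n=0 no context lines are emitted, so the output stays close to a bare errata list, one hunk per disagreement to be checked against the MS.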

Jon, on your PG library improvement project, as I understand it you want to create a match the scan version of each text in UTF-8. What UTF-8 cannot represent is variable whitespace unless the RTT is monospaced. Is that your intent?

In the bigger picture, I believed that PG's mission was to host texts that would be readable forever, hence the insistence on plain text no matter what. The RTT you propose meets this goal even better than plain text, seems to me.

Other thoughts: having a RTT implies that someone goes from that to a finished product. It's a variation on post-processing that still includes most of the steps. Those people are right now working mostly at DP. A lot of text might make it to a RTT edition and just sit there. That's not a bad thing, though, since the prime mission of capturing the text will have been completed. Final rendering can take place later, even to formats that haven't been invented, if the reference text is right.

Guess I can't "shoot it down in flames" as you hoped. Let's hear what others say. Perhaps you only need 41 people now.

-- Roger Frank

On 2012-09-22, Roger wrote:
Jon, on your PG library improvement project, as I understand it you want to create a match the scan version of each text in UTF-8. What UTF-8 cannot represent is variable whitespace unless the RTT is monospaced. Is that your intent?
It probably wouldn't be possible to capture variable whitespace in the RTT -- a central premise of the plan is that DP procedures are completely locked down and we get out of their current workflow what we can. However, the MS _does_ capture the variable whitespace, so the information is not lost, just less easy to access. An interesting featurette: if you view the RTT in a proportional font with full justification, the original spacing _will_ more or less be retained, along with the original line lengths and hyphenation.
A lot of text might make it to a RTT edition and just sit there. That's not a bad thing, though, since the prime mission of capturing the text will have been completed. Final rendering can take place later, even to formats that haven't been invented, if the reference text is right.
If a text that is currently in the library makes it to an RTT edition, the WWs can use the RTT to remove any discovered errors from the current renderings, so that still counts as a win even if no final rendering is ever done. I hadn't really considered RTT editions of a new text, but they may be usable as an ebook themselves if the text is simple, and will certainly allow proper searching of the MS (Jeroen's concept). My impression (and it is only an impression) from lurking around various forums and mailing lists is that there is a lot of frustration at current renderings and a lot of appetite for doing the final renderings. Hopefully the MS and RTT will be the catalyst required.
Guess I can't "shoot it down in flames" as you hoped. Let's hear what others say. Perhaps you only need 41 people now.
Et tu, Brute? It looks like I may have to rely on Louise. :-) Cheers Jon

Thanks for this, Jon. I think a good way to start would be with just a few books (i.e., up to 10). They might not be the most popular, because those are often older and therefore it's harder to know the correct print edition.

-- Greg

On 2012-09-22, Greg Newby wrote:
Thanks for this, Jon. I think a good way to start would be with just a few books (i.e., up to 10). They might not be the most popular, because those are often older and therefore it's harder to know the correct print edition.
The top 10 (around 1 in 20 downloads) currently seems to be:

1. Beowulf
2. The Kama Sutra
3. How to Analyze People on Sight
4. Grimm's Fairy Tales
5. Pride and Prejudice
6. Metamorphosis
7. Adventures of Huckleberry Finn
8. The Republic
9. The Adventures of Sherlock Holmes
10. Siddhartha

Pride and Prejudice jumps out at me as a starting point -- my wife is currently reading the Kindle Store version, and it is pretty poor. So, does anyone here have any suggestions for an MS for Pride and Prejudice?

For Pride and Prejudice, the definitive text is generally considered to be that of R. W. Chapman, who used all the editions published during Austen's lifetime, as well as her letters (which noted some mistakes she'd seen), to create it. That was first published in 1923, so it may still be in copyright. I'm not sure how it works for things like that, which are basically old works, but with a lot of effort put in to polish them.

If we can't use that text, the first edition is probably ideal. I can't find Volume 1 on Google Books, but it does have scans of Volumes 2 and 3, at http://books.google.com/books?id=PHIJAAAAQAAJ and http://books.google.com/books?id=n0gJAAAAQAAJ.

-- Peter Hatch

7. Adventures of Huckleberry Finn
I did a "redo" already on Huck Finn, resubmitting a slightly different edition. You might want to word diff those to get a feel for the issues you might find in the other "Top 10" books. PS: Not at all sure DP would be the way to go on these "redo" projects. There are much less time consuming and error prone ways to make progress.

Jon,

The longer I think about the job we're trying to do here, and the collection of books we're assembling, the more I conclude that the most important piece of the foundation upon which it's built needs to be your Step #1 - collect the page scans. Once the text has been contributed, it's just about unassailable in its original form unless the images of the source text are preserved and are accessible. Without the images, I don't see anything useful coming from further work (and I also think even the existing procedures for text refinement are problematic.)

So, paring your proposal (for which I will add myself to the line of prospects) back to its most limited validation of concept: can we identify and acquire page scans for the top 10 ebooks?

(Also, the definition of "top 10" to whatever exponent apparently requires some examination. Direct downloads from PG may not well reflect actual demand. For instance, the top 10 from Feedbooks, presumably but not necessarily attributable or potentially attributable to PG, includes:

1. The Art of War - Sun Tzu
2. Alice's Adventures in Wonderland - Lewis Carroll
3. The Adventures of Sherlock Holmes - Arthur Conan Doyle
4. Pride and Prejudice - Jane Austen
5. The Curious Case of Benjamin Button (part of PG's ebook "Tales of the Jazz Age") - F. Scott Fitzgerald
6. The Count of Monte Cristo - Alexandre Dumas
7. Grimm's Fairy Tales - Jacob Ludwig Karl Grimm & Wilhelm Karl Grimm
8. The Picture of Dorian Gray - Oscar Wilde
9. War and Peace - Lev Nikolayevich Tolstoy
10. The Divine Comedy - Dante Alighieri)

On Sat, Sep 22, 2012 at 03:02:45PM -0700, don kretz wrote:
Jon,
The longer I think about the job we're trying to do here, and the collection of books we're assembling, the more I conclude that the most important piece of the foundation upon which it's built needs to be your Step #1 - collect the page scans. Once the text has been contributed, it's just about unassailable in its original form unless the images of the source text are preserved and are accessible.
For those who are not aware: the scans from all eBooks produced by Distributed Proofreaders are saved at pgdp.net or backup sites, but very few scan sets have been included with the eBooks. We have had a procedure/convention for including original scans with our eBooks since around 2004. Look for "page-images" subdirectories in our eBooks. I see around 7000, though not all are necessarily complete scan sets, and there is variation in image quality. But at least they match. Frequent producers like Al Haines (well, there's really nobody LIKE him!) save their scans, and could likely be convinced to share them. When we get errata reports, it is very frequently a first step to try to find scans from Google Books or another source (Internet Archive, Gallica, etc.). -- Greg

On Sat, Sep 22, 2012 at 2:11 AM, <jon.a@hursts.eclipse.co.uk> wrote:
(typos in the original can be corrected directly in the MS if required).
This is practically a breaking rule for me. "Typos" have been fixed in books uploaded to PG that weren't typos before. Vandalizing the master scans, too? It's easy to make mistakes in correcting the original; you certainly don't want to engrave your mistakes into the original images.
Step 2 is creation of a reference text transformation (RTT) from the master scan. The RTT is a line by line pixel to UTF-8 transformation without wrapping or additional markup.
I don't know why Roger Frank claims UTF-8 can't represent multiple spaces; as a superset of ASCII, it does so in the exact same way ASCII plain text does, as well as offering a variety of smaller and larger spaces if you want to go that way. (I think anything here is problematic, as the original typesetters did not have a fixed-width space character, instead having the concepts of aligning A to B and adding more space (not more spaces) here.)
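
For what it's worth, a small Python snippet illustrating David's point: runs of ASCII spaces pass through UTF-8 unchanged, and Unicode adds width-specific space characters on top. The sample strings are arbitrary.

    # Runs of ASCII spaces survive UTF-8 as-is; Unicode also offers
    # width-specific spaces if finer control is ever wanted.
    samples = {
        "three ASCII spaces": "a   b",
        "no-break space":     "a\u00a0b",
        "en space":           "a\u2002b",
        "em space":           "a\u2003b",
        "thin space":         "a\u2009b",
        "hair space":         "a\u200ab",
    }
    for name, s in samples.items():
        print(f"{name:20} {s!r} -> {s.encode('utf-8')!r}")
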
The immediate benefit is that the RTT can be used to produce a comprehensive list of errata for its extant text, thus allowing the WW to eliminate the vast majority of errors in one hit.
If there is such an issue: I redid an old work for its illustrations, and found there was but one error in the original -- one that DP missed on the second round through, too.
The future benefit is that the MS is a perfect universal master format:
I don't think that word means what you think it means. In any case, in the post-Google Books world, scans are easy to come across for many of the works we do.
it does not include any introduced errors and it codes every formatting nuance. All transformations from MS to either intermediate master formats (TEI, RST, ZML, LaTeX etc.) or final formats (HTML, PDF, epub, mobi etc.) initially follow much the same path, and the RTT represents a fairly late divergence point.
I don't understand; the RTT misses out on a huge amount of important information. Italics can have a large impact on meaning at times, for one.
The point is you end up with the foundations to do interesting work without having to first do a whole lot of boring OCRing and proofreading
No; to turn the RTT into anything, you'll have to reproofread the book, to catch italics and the rest of the formatting. Toss a few sidenotes in there, and besides just catching stuff, you'd spend a lot of time separating them out from the surrounding text. Two column material? Trees?
Additional benefits: both the MS and RTT are usable ebook formats in and of themselves,
The RTT is not good; you've thrown away important information. The MS adds nothing to what IA or Google Books offers.
and the combination will allow a pretty nifty errata system to be written, whereby a reader types in a suspect phrase, gets taken to the line in the MS where that phrase is found, and can deliver the errata by simply clicking on the line and clicking a "Please Check" button.
You can't errata italics or many other important parts of the book. It's not unuseful, but it's hardly complete. -- Kie ekzistas vivo, ekzistas espero.

Hi David,

Thanks very much for your response. I'll try to clarify/explain my thinking on a few things, but please keep picking holes: the absolute last thing I want to do is start doing a whole lot of work and discover a fatal flaw down the road. I would much rather not start at all.

On 2012-09-22, David Starner wrote:
(typos in the original can be corrected directly in the MS if required).
This is practically a breaking rule for me. "Typos" have been fixed in books uploaded to PG that weren't typos before. Vandalizing the master scans, too? It's easy to make mistakes in correcting the original; you certainly don't want to engrave your mistakes into the original images.
The MS is not supposed to be the original; it is supposed to be PG's definitive version. I agree that changes to the MS would have to be very carefully researched and controlled, and there may well be some benefit in storing the actual original alongside the MS if any changes of this nature are made.
The future benefit is that the MS is a perfect universal master format:
I don't think that word means what you think it means.
I define "universal master format" as a format capable of encoding any and all characters and any and all formatting nuances. There are lots of "master formats" available, each capable of encoding a subset of these things, but only the image itself is "universal".
I don't understand; the RTT misses out on a huge amount of important information. Italics can have a large impact on meaning at times, for one.
...
No; to turn the RTT into anything, you'll have to reproofread the book, to catch italics and the rest of the formatting. Toss a few sidenotes in there, and besides just catching stuff, you'd spend a lot of time separating them out from the surrounding text. Two column material? Trees?
This is true, although I wouldn't use the term "reproofread" as that implies checking the words and punctuation again (the time-consuming bit), which is exactly what I want to avoid. The RTT is intended as a simple foundation. It must be used in conjunction with the MS to turn it into anything else, and this may indeed involve a lot of work for a complex book.

The point is that if you started with a scan of that complex book with the intention of producing an ebook using your markup language of choice, you would likely at some point in the process have something very similar to the RTT. If someone else then decided that the book would be better in their markup language of choice they would also at some stage have something very similar to the RTT, but they would have had to repeat all the work that you did because they wouldn't have had access to your RTT. There is absolutely nothing stopping other layers of information building upon the MS and RTT: the MS and RTT are simply the core requirement.

Personally, in the production of Kindle PDFs via LaTeX I will be producing metadata for position of lines in images, positions of micro-formatting (formatting that needs more than a brief glance to discern) and end-of-line hyphen mappings. This metadata can be added to the PG archives in case it is useful to others, but it is not something that should be mandated, any more than intermediate master formats are something that should be mandated.
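
To make the per-line metadata Jon mentions concrete, here is a purely hypothetical sketch of what one record might hold; the field names and structure are invented for illustration and do not come from the proposal.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class LineMetadata:
        """Hypothetical record tying one RTT line back to the master scan.
        It covers the three kinds of metadata mentioned: the line's position
        in the page image, micro-formatting spans, and the eol-hyphen mapping."""
        page_image: str                  # e.g. "pride-v1-084.png" (placeholder name)
        bbox: Tuple[int, int, int, int]  # pixel box of the line within the scan
        rtt_line: int                    # 1-based line number in the RTT
        formatting: List[Tuple[int, int, str]] = field(default_factory=list)
                                         # (start, end, "italic" / "smallcaps" / ...)
        eol_hyphen: str = "none"         # "none", "keep" (to-night) or "join" (tonight)
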
Additional benefits: both the MS and RTT are usable ebook formats in and of themselves,
The RTT is not good; you've thrown away important information. The MS adds nothing to what IA or Google Books offers.
Conceded. The RTT and MS are not intended to be final ebook formats, so the fact that they can be pressed into service as inferior formats is not relevant. I shouldn't have brought it up.
and the combination will allow a pretty nifty errata system to be written, whereby a reader types in a suspect phrase, gets taken to the line in the MS where that phrase is found, and can deliver the errata by simply clicking on the line and clicking a "Please Check" button.
You can't errata italics or many other important parts of the book. It's not unuseful, but it's hardly complete.
Errata of formatting could be done with such a system: you would just need to capture which final format contained the error. That's a really long way down the road though. Cheers Jon
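
As a sketch of the look-up half of the errata system described above -- a reader pastes a suspect phrase and is taken to the matching line -- the following assumes a plain UTF-8 RTT file with one line per printed line; the file name and the whitespace normalisation are illustrative only.

    import re

    def normalise(s):
        """Collapse whitespace and lowercase so minor differences between the
        reader's copy and the RTT do not defeat the search."""
        return re.sub(r"\s+", " ", s).strip().lower()

    def find_phrase(rtt_path, phrase):
        """Return 1-based RTT line numbers containing the phrase. Because the
        RTT is line-for-line with the scan, a hit points straight at the
        corresponding line in the MS."""
        needle = normalise(phrase)
        hits = []
        with open(rtt_path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                if needle in normalise(line):
                    hits.append(lineno)
        return hits

    print(find_phrase("pride-and-prejudice.rtt", "a truth universally acknowledged"))

A real tool would also have to handle phrases that span a line break, since the RTT keeps the printed line breaks; that case is left out of this sketch.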

On Sun, Sep 23, 2012 at 2:55 AM, Jon Hurst <jon.a@hursts.eclipse.co.uk> wrote:
The MS is not supposed to be the original; it is supposed to be PG's definitive version.
I don't see any value in that. Scans are a pain, and the only saving grace is that they accurately represent a physical printed edition.
I agree that changes to the MS would have to be very carefully researched and controlled, and there may well be some benefit in storing the actual original alongside the MS if any changes of this nature are made.
What's gained by mangling the scans instead of recording typos externally to them?
I define "universal master format" as a format capable of encoding any and all characters and any and all formatting nuances. There are lots of "master formats" available, each capable of encoding a subset of these things, but only the image itself is "universal".
I would define "master format" as one that can be effectively used to derive other formats from. As for universal, the image is not necessarily capable of encoding all formatting nuances. For properties of physical books, scans can't handle transparencies (which I've had in a book I was thinking about scanning), paper changes, metallic inks (one illustration had an EETS book printed for it, but the metallic ink didn't reproduce well in my scans), fur (plenty of 20th/21st century examples for babies), mirrors (ditto), holes (again, ditto), or pop-ups (there's a beautiful 19th century edition of Euclid with them, for example). For properties of the text, it does not show line breaks at tops of pages (ambiguous in poetry) and it obscures spellings at end of lines (is it spelled to-night or tonight?). It's powerful, but not unlimited.
This is true, although I wouldn't use the term "reproofread" as that implies checking the words and punctuation again (the time-consuming bit), which is exactly what I want to avoid.
Anything that requires you to look at every word on every page, which italics and bold do, is time-consuming. And perhaps as important as italics is superscript; 10<sup>30</sup> changes quite a bit when it comes back as 1030.
The point is that if you started with a scan of that complex book with the intention of producing an ebook using your markup language of choice, you would likely at some point in the process have something very similar to the RTT.
Not necessarily. If I were working on a book, I would format as I went along. Yes, DP has found for their purposes it works better to separate them, but I don't think that's what most people working on a book alone would do. Nor do most people make a line for line copy; without external systems like DP, it's easier to input the text as paragraphs ended by new lines.
If someone else then decided that the book would be better in their markup language of choice they would also at some stage have something very similar to the RTT, but they would have had to repeat all the work that you did because they wouldn't have had access to your RTT.
I don't see why the RTT is the ideal level for that, though. For any large book, I'd rather have the TEI version than the RTT -- even if you lock me in a cave without Internet and I have to figure out the TEI format by guesswork and write my own XML converters. RTT makes me figure out what I need to rewrap or not and fix all the microformatting, stuff that could be automatically pulled out of even the most stupid HTML. (Okay, so it would be a royal PIA to pull it out of "smart" HTML. Still, for a sufficiently large work, it'd be worth it.) If I have an RTT version of the text, and an HTML version or TEI version or sufficiently smart structured text version, and I wanted to make a version in TeX or whatever, I'd start with the smart version instead of the RTT. Anything short of Postscript or PDF is going to be better than RTT. -- Kie ekzistas vivo, ekzistas espero.

I'm not seeing the value of this RTT thing (but I'll admit I'm not sure what it is - maybe an example would be helpful.) As best I can tell, it's a linear representation of the graphical image, provided either by OCR software or possibly by previous work by proofers preparing texts, however completely, accurately, and unambiguously, for PG projects. Then the formatting is removed. Yes? No?

My experience is that OCR does a pretty poor job of properly sequencing a text, and that much of a text isn't linear. Page headings and footings are not well isolated from page text. Footnotes and sidenotes are problematic. Illustrations (possibly with captions, attributions, explanatory keys with subcolumns, etc.) are scattered around. Syntactic distinctions are mainly interpreted by humans by inference from layout and formatting, but not detected by OCR software. Examples are poetry, correspondence, and mathematics. Even explicitly identified elements like quotations are frequently enough ambiguous to OCR.

If there is to be a source text which is the canonical starting point for further work, it seems to me it needs to have been treated so as much implicit syntactical identification as possible has been explicated and disambiguated with some documented form of markup - which form doesn't matter much, because if it is sufficiently complete and unambiguous it can be converted into any other form. If there is a preparatory process, especially one involving people's time and attention, shouldn't it be spent disambiguating rather than removing many of the implicit clues needed to detect structure?

On 2012-09-24, don kretz wrote:
I'm not seeing the value of this RTT thing (but I'll admit I'm not sure what it is - maybe an example would be helpful.)
The RTT is more or less DP's P3 output without the clothing of eol hyphens and dashes, so that a line in the RTT will always correspond directly to a line on the page. Rather than:

[OCR]--p1-->[P1]--p2-->[P2]--p3-->[P3]

the proposal is to do:

[OCR]--p1-->[P1]--p2-->[P2]--Diff against extant PG text-->[RTT]

possibly doing p1 and p2 repeats to improve accuracy.

If you follow a workflow where you do formatting separately, you will likely at some stage in the process have something akin to the RTT. The problem that the RTT seeks to be the solution to is: given that we all have our own ideas about workflows and we will defend those ideas to the death, what is the latest usable snapshot common to the vast majority of these workflows? What is the latest point that I can pick off from your workflow that will allow me to continue with my workflow, and vice versa?

For a subset of the workflows that can be based on the RTT, there will be other usable snapshots further on. If someone is carrying out such a workflow there is nothing stopping these being captured as well; other compatible workflows can start from this snapshot instead. There is also nothing precluding deriving from a derivative if that works for you. The RTT is just a low-level foundation for all these things that is there if you want to use it.

In general, the default answer to how any given thing is encoded into the RTT is "like DP does it", since that is what people know, and DP or a DP-like organisation will be needed to produce them. The exception is the eol clothing, which actively destroys data and line correspondence, and that exception is only possible because LOTE already uses the exception. There are all sorts of things that can also be done in a "post-process to RTT" sort of way, such as converting to curly quotes and translating from "--" to "—", but, as you correctly point out, there is no point getting to that level of detail until we at least sort out how to select the right master scans.
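
To illustrate the kind of "post-process to RTT" step mentioned above, here is a minimal sketch of the two examples given (curly quotes and dash translation). The thread does not specify the actual rules, so these regexes are an assumption and deliberately naive about edge cases.

    import re

    def post_process_line(line):
        """Illustrative post-RTT tidy-up: "--" to an em dash, straight quotes
        to curly quotes. A real pass would need care with quotes at line
        boundaries, nested quotes, and abbreviations; this is only a sketch."""
        line = line.replace("--", "\u2014")                # -- to em dash
        line = re.sub(r'"(?=\w)', "\u201c", line)          # opening double quotes
        line = line.replace('"', "\u201d")                 # remaining are closers
        line = re.sub(r"(?<=\w)'(?=\w)", "\u2019", line)   # in-word apostrophes
        line = re.sub(r"'(?=\w)", "\u2018", line)          # opening single quotes
        line = line.replace("'", "\u2019")                 # remaining are closers
        return line

    print(post_process_line('"It is a truth universally acknowledged"--or so they say.'))
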
If there is to be a source text which is the canonical starting point for further work, it seems to me it needs to have been treated so as much implicit syntactical identification as possible has been explicated and disambiguated with some documented form of markup - which form doesn't matter much, because if it is sufficiently complete and unambiguous it can be converted into any other form.
This canonical starting point (CSP -- I love TLAs) is something like F2 output, although I'm sure you would come up with something a lot less random. I suspect Bowerbird wouldn't like it: he would suggest you might as well just use ZML. Marcello would probably point out that RST would be the solution. Jeroen would point to TEI's superior gamut. If you can all agree on a CSP format then that's brilliant: we can have an MS, an RTT and a CSP. But I don't want to get involved in the flame war (*cough* LaTeX).

Cheers

Jon

Quoting David Starner <prosfilaes@gmail.com>:
I don't see any value in that. Scans are a pain, and the only saving grace is that they accurately represent a physical printed edition.
Which is not true; many scans (or the compression techniques used to compress them to manageable size) lose relevant information. Infamous in PGDP circles are "despeckled periods." But I have seen compression techniques based on OCR techniques mix up e and c, just as OCR software often does. However, having them is often the best we have (besides having the paper original), and I would love to have them.
I would define "master format" as one that can be effectively used to derive other formats from.
As for universal, the image is not necessarily capable of encoding all formatting nuances. For properties of physical books, scans can't handle transparencies (which I've had in a book I was thinking about scanning)
I did that: just scan with a neutral background behind the page, and repeat for each layer of transparency; then you can use tricks in HTML to reproduce the effect. (Using the CSS hover pseudo-selector to show the transparent sheet superimposed over the original illustration did the trick for me.)
, paper changes, metallic inks (one illustration had an EETS book printed for it, but the metallic ink didn't reproduce well in my scans), fur (plenty of 20th/21st century examples for babies),
Often a problem with nice gold-embossed covers as well...
mirrors (ditto), holes (again, ditto), or pop-ups (there's a beautiful 19th century edition of Euclid with them, for example).
I once did a book on the female body, with a very nice pop-up (http://www.gutenberg.org/files/22868/22868-h/22868-h.htm#d0e1934). Just scan it in plenty of states, but it would require some 3D modelling language to describe in the general case.
For properties of the text, it does not show line breaks at tops of pages (ambiguous in poetry) and it obscures spellings at end of lines (is it spelled to-night or tonight?). It's powerful, but not unlimited.
Common issue: look for other occurrences of to-?night, and pick one. Hardly troublesome in most cases. If you really desire, you can capture these using entity encodings like ‐ &softhyphen; &dubioushyphen; in your version of TEI or whatever you are using.
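
A minimal sketch of the "look for other occurrences" heuristic Jeroen describes for end-of-line hyphens; the counting rule and function name are assumptions for illustration.

    import re

    def resolve_eol_hyphen(text, first, second):
        """Decide whether a word split across a line break (e.g. "to-" / "night")
        should be kept hyphenated or joined, by counting how each form appears
        elsewhere in the text."""
        hyphenated = len(re.findall(rf"\b{re.escape(first)}-{re.escape(second)}\b", text))
        joined = len(re.findall(rf"\b{re.escape(first)}{re.escape(second)}\b", text))
        if hyphenated > joined:
            return f"{first}-{second}"
        if joined > hyphenated:
            return f"{first}{second}"
        return None  # no evidence either way; leave it for a human to pick

    sample = "To-night we dine; last night we dined too. To-night it rains."
    print(resolve_eol_hyphen(sample, "To", "night"))  # -> "To-night"
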
Anything that requires you to look at every word on every page, which italics and bold do, is time-consuming. And perhaps as important as italics is superscript; 10<sup>30</sup> changes quite a bit when it comes back as 1030.
That is why it is important to get those things right the first time, when we are directing PGDP volunteer eyeballs on it. Jeroen.

On 2012-09-23, David Starner wrote:
I don't see any value in that. Scans are a pain, and the only saving grace is that they accurately represent a physical printed edition.
The value in the MS is in its unequivocal nature. It says: "After much discussion amongst knowledgeable people, this, and only this, is what we at PG consider to be a version of this book. Everything else we publish is a derivation of this. If they don't match this, they are in error."

Example 1: As a PG customer I find something in an ebook that doesn't look right. I want to contribute. I want to check it, and if it is wrong, report it. Off I go to PG and I find... nothing. If I'm really keen to help, I go to TIA and I find 50 versions, and I have no idea what is the correct version. Oh well...

Example 2: We propose redoing "Pride and Prejudice". Without an MS we immediately hit the buffers. What edition is the extant text? What scan was used? What edition should we use? Is there a scan we can use _anywhere_? What is the copyright situation of the 1923 version? Should we go for a first edition or the latest one we can get copyright clearance for? With the MS: throw the MS at the DP OCR pool and off we go.
What's gained by mangling the scans instead of recording typos externally to them?
In example 1, I am in luck. PG has been adding definitive scans to the PG archives and Marcello has done a really cool interface. I note a difference with my version and the scan, so I report it... and the WW writes back saying that it is not actually a difference: it is just a typo that was corrected in the text and the definitive scan is actually the thing that is incorrect. So much for the definitive scan...
As for universal, the image is not necessarily capable of encoding all formatting nuances. ... ... fur (plenty of 20th/21st century examples for babies)...
Ah... so that's what the Kindle touch is about -- fur capability. :-)
Anything that requires you to look at every word on every page, which italics and bold do, is time-consuming. And perhaps as important as italics is superscript; 10<sup>30</sup> changes quite a bit when it comes back as 1030.
There are things that can help here, but I want to keep my sights for a core format set as low as possible. I would re-iterate that my aim is only to produce a foundation. You can build whatever you like on that foundation, and some sort of optional formatting methodology would likely be one of those things. For the books that I turn into Kindle PDFs, you will have a micro-formatting highlighting overlay available. If Bowerbird does some books you will have ZML. You may have Don's Canonical Starting Point. You may have TEI, RST, Docbook or LaTeX. If you don't like anything that is available, you will still always have the option to go back to the MS and RTT and do your own thing, even if that takes a little longer.

The main point of RTT is that the phrase "let's make (X)HTML/TEI/RST/LaTeX/ZML (delete as appropriate) the PG master format" _will_, quite rightly, start an unproductive flame war. RTT is inferior to all these things by design, so that any of these things can be built upon it. Your choice once you have, say, a ZML version available, might be to base your work on a transformation from ZML, but the RTT will hopefully have saved the person doing the ZML version a heap of work.

Cheers

Jon

On Mon, Sep 24, 2012 at 02:13:35PM +0100, Jon Hurst wrote:
... The value in the MS is in its unequivocal nature. It says: "After much discussion amongst knowledgeable people, this, and only this, is what we at PG consider to be a version of this book. Everything else we publish is a derivation of this. If they don't match this, they are in error."
This is exactly opposite the PG policy. We specifically do NOT adhere to any print edition. (That is part of why you will find it really hard to find a matching print source for many PG eBooks.) This can make it difficult to determine what to do with some error reports. Often, the original producer is involved, if he/she is still available. http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ

Part of the challenge is the desire to turn an ambiguous situation into a dogmatic one. As pointed out, there are often many variations in print editions. Determining which is "right" is not always so easy. The PG approach is to emphasize readability, not adherence to a particular printed edition. That still leaves room for ambiguity, but it also gets out of being dogmatic about matching a particular print edition. Luckily, ambiguous error reports are not the norm. Usually it's pretty obvious when an error exists.

The topic of adherence to a particular print edition was a perpetual battle that Michael Hart fought. Many scholars, in particular, wanted PG to only match particular printed editions, in order for them to be useful for scholarship-related purposes. This was always very firmly resisted, in favor of readability.

-- Greg

Hi Greg,

On 2012-09-24, Greg Newby wrote:
On Mon, Sep 24, 2012 at 02:13:35PM +0100, Jon Hurst wrote:
... The value in the MS is in its unequivocal nature. It says: "After much discussion amongst knowledgeable people, this, and only this, is what we at PG consider to be a version of this book. Everything else we publish is a derivation of this. If they don't match this, they are in error."
This is exactly opposite the PG policy. We specifically do NOT adhere to any print edition. (That is part of why you will find it really hard to find a matching print source for many PG eBooks.)
Ah.... that does rather torpedo the scheme. The MS is de facto what I described. The RTT would be based on it, and the derivatives would be based on the RTT, one way or another. Therefore the final ebooks _would_ all adhere to the print edition that the MS was sourced from, or be incorrect and need fixing. The whole scheme, unless I am misunderstanding something, would therefore appear to be the exact opposite of PG policy. Is this a correct interpretation? Cheers Jon

I don't think it matters which way the policy sits, in terms of this discussion. The fact in any case is that the text needs to accord with some printed (and therefore photographed and therefore OCRed) version, and it's equally legitimate to expect that the canonical text needs to agree with the canonical images. And we still have the same requirement for a set of images to prevent text paralysis.

So the foundation is that set of images, and the first layer is an iteratively improvable text, which need to be easily comparable with each other. The definition of a "necessary and sufficient" image set is one where all the pages are present and it's readable. The definition of a "necessary and sufficient" canonical text is to be determined.

Don

If PG has images correlated with text based on embedded page numbers (whether visible or not), it's possible to envision a distribution mechanism for ebooks which includes both the (invisible) images and the book text together for, say, the Kindle, and any reader can selectively turn on the image, compare, record, and submit corrections while they read.

On Mon, Sep 24, 2012 at 05:28:13PM +0100, Jon Hurst wrote:
Ah.... that does rather torpedo the scheme. The MS is de facto what I described. The RTT would be based on it, and the derivatives would be based on the RTT, one way or another. Therefore the final ebooks _would_ all adhere to the print edition that the MS was sourced from, or be incorrect and need fixing. The whole scheme, unless I am misunderstanding something, would therefore appear to be the exact opposite of PG policy. Is this a correct interpretation?
No. Anyone who submits an eBook is welcome to have it adhere to a particular printed copy, and in fact most do. The mistake was in thinking that PG policy is that it *must* adhere, or that any corrections must adhere to the original printed text. I don't think anything in PG practice or policy reduces the value of having a set of reference images. The only inconsistency would be in thinking that the eBook that PG distributes will (must) be completely consistent with those images. It doesn't need to be, and in practice never is (i.e., minimally there are rewrapped paragraphs and de-hyphenation, and a different font). -- Greg

On Mon, Sep 24, 2012 at 6:13 AM, Jon Hurst <jon.a@hursts.eclipse.co.uk> wrote:
On 2012-09-23, David Starner wrote:
I don't see any value in that. Scans are a pain, and the only saving grace is that they accurately represent a physical printed edition.
The value in the MS is in its unequivocal nature. It says: "After much discussion amongst knowledgeable people, this, and only this, is what we at PG consider to be a version of this book. Everything else we publish is a derivation of this. If they don't match this, they are in error."
The value of scans is in their unequivocal nature. Once we've started editing the scans, then it's just another modern volunteer edition. I see no reason not to list our changes in an external text file. If it's small, then it's not a big deal. If it's large, then we aren't providing an authoritative version of the work, and if we're really using the best version of the work, there may be no way for us to provide an authoritative version of the work. I'm not arguing about having backing scans; merely that they should be the original unedited scans.
In example 1, I am in luck. PG has been adding definitive scans to the PG archives and Marcello has done a really cool interface. I note a difference with my version and the scan, so I report it... and the WW writes back saying that it is not actually a difference: it is just a typo that was corrected in the text and the definitive scan is actually the thing that is incorrect. So much for the definitive scan...
So much for the corrected typo! If there's disagreement on whether the book or the text is the typo, it's better to stick with the book. It's not a big deal to integrate the corrections file with this interface, and cases of this are likely to be vanishingly rare; most people are only going to be checking the file if they think they found something wrong in the text, in which case we should stick with the original.
The main point of RTT is that the phrase "let's make (X)HTML/TEI/RST/LaTeX/ZML (delete as appropriate) the PG master format" _will_, quite rightly, start an unproductive flame war. RTT is inferior to all these things by design,
I fail to see why an inferior format solves anything. And if I'm going to start my work from the (X)HTML/TEI/RST/LaTeX/ZML edition instead of RTT, it's useless.
Your choice once you have, say, a ZML version available, might be to base your work on a transformation from ZML, but the RTT will hopefully have saved the person doing the ZML version a heap of work.
No, it won't have. The person doing the RTT will almost certainly be doing an ebook, and thus they will be the person making the ZML. Moreover, if nobody is working from the RTT, you've gained little. I don't see anyone rushing to send errata back to a version that's not even used by ebook creators. I'm not even sure how the errata is supposed to get to the ebooks in this model. As long as you get things done, the flame war doesn't matter. You can punt on the hard things, but any text dump that doesn't preserve easy necessary things like italics and superscript doesn't help me much at all. As a solitary producer, I might well be able to get better straight out of the OCR program. -- Kie ekzistas vivo, ekzistas espero.

On 2012-09-24, David Starner wrote:
The value of scans is in their unequivocal nature. Once we've started editing the scans, then it's just another modern volunteer edition. I see no reason not to list our changes in an external text file.
I'm not particularly wedded to this detail. If everyone prefers a companion file approach then that is fine by me.
I fail to see why an inferior format solves anything. And if I'm going to be start my work from the (X)HTML/TEI/RST/LaTeX/ZML edition instead of RTT, it's useless.
I can see that I'm not going to convince you of the value of an RTT. If you want to do a version of a book that I have done a derivative for, you will have an RTT available to you. You will also have a LaTeX version and even an MS if you really do want to redo the OCR. Feel free to use whatever works best for you. If I want to do a derivative of a book you've done, depending on what you've done I may be able to strip it back to an RTT and go from there, in which case an RTT will become available for that book as well. 97.5% of PG books will, of course, never have an RTT, so it is not like it will be mandatory. We seem to all agree about the value of an MS, with or without the RTT, so let's focus on that. Cheers Jon

David brought up a good counterexample: italics. I like the idea of capturing the text electronically beyond the scans themselves in an exact UTF-8 edition. But thinking it further, there are things the typesetter might have done that are not representable in UTF-8, italics being one example.

I thought further about the proofing rounds at DP. The original proposal was to ask that proofers not clothe eol hyphens and dashes. I believe there would be other exceptions, such as start of chapter capitalization that proofers are supposed to downcase. I'm not sure DP management or users will embrace these changes, especially when the proofer's work is not leading directly to a final product but only to a RTT.

Marcello's position has been that PG produces new editions. He eschews facsimile editions that DP historically has tried to produce, matching the original as closely as possible, which is what the RTT is. Instead, he proposes RST. It's the only input format that is completely usable by epubmaker. RST cannot match the scan. It feels like Jon's RTT and Marcello's RST are both trying to be PG's master format.

Finally, Bowerbird has warned against proceeding and I respect his historical perspective. Can we avoid the mistakes of the past attempts?

-- Roger Frank

Hi Roger,

On 2012-09-23, Roger wrote:
I thought further about the proofing rounds at DP. The original proposal was to ask that proofers not clothe eol hyphens and dashes. I believe there would be other exceptions, such as start of chapter capitalization that proofers are supposed to downcase. I'm not sure DP management or users will embrace these changes, especially when the proofer's work is not leading directly to a final product but only to a RTT.
Lack of DP engagement is definitely top of the list of things that will kill this idea stone dead. DP doesn't do change; we must accept that as an axiom. Therefore anything that we ask them to do must essentially be what they would do anyway -- the non-clothing of eol hyphens and dashes, I believe, is often done for LOTE, so we may just get away with that. Anything the RTT requires beyond that we will have to do ourselves -- hopefully there won't be too much.
David brought up a good counterexample: italics. I like the idea of capturing the text electronically beyond the scans themselves in an exact UTF-8 edition. But thinking it further, there are things the typesetter might have done that are not representable in UTF-8, italics being one example.
Marcello's position has been that PG produces new editions. He eschews facsimile editions that DP historically has tried to produce, matching the original as closely as possible, which is what the RTT is. Instead, he proposes RST. It's the only input format that is completely usable by epubmaker. RST cannot match the scan. It feels like Jon's RTT and Marcello's RST are both trying to be PG's master format.
I've shuffled these paragraphs together as they are related. I am absolutely not proposing RTT as a master format. The master format I am proposing is actually the MS, which, of course, encodes italics etc. just fine. The RTT is just an intermediate stage in the process of transforming the MS into "something else"; the RTT is only really useful in conjunction with the MS unless you are content to dispose of all formatting, such as in the case of diffing against the extant text to find scannos. What the "something else" should be is intentionally beyond the scope of the project because that argument will rage until doomsday. Lots of things work, and what works best is dependent on the nature of the text. Marcello's RST would likely be a common candidate, but that should not preclude Bowerbird from using the MS and RTT to do a ZML version. If politics necessitate, Marcello can host the RST on PG and Bowerbird can host the ZML on his site. Hopefully both Marcello and Bowerbird would agree that the MS and RTT make a good starting point for a transformation to their chosen formats. If not I need to work out how to make it so, or give it up.
Finally, Bowerbird has warned against proceeding and I respect his historical perspective. Can we avoid the mistakes of the past attempts?
I too respect Bowerbird's historical perspective, and what's more I don't know what the mistakes of the past were, so I can't avoid them. I can only try to keep things simple and uncontroversial. I have asked Bowerbird for specifics, and will happily abandon this idea if he comes up with something that cannot be mitigated against. Cheers Jon

Pride and Prejudice might be an interesting first project. At least it should be easy to find the online resources.

I explored the internet a bit to see what I could find for page images and for free text versions. I didn't quickly find very many page images to work from. But there are quite a few free text versions, as we would expect. What's interesting is that they appear to almost all have the same provenance through PG.

There's an interesting lesson about the consequences of cleaning up old texts (or not.) When the text was transcribed, the convention appears to have been to record small-cap text as all upper-case. It happens that the printing convention of the source text printed the salutation line for correspondence as small-caps; and there's a fair amount of correspondence in P&P. You can find an example of the printed image here: https://play.google.com/books/reader?id=2_S8xAws2G4C&printsec=frontcover&output=reader&authuser=0&hl=en&pg=GBS.PA284

Almost every ebook text I could find, in whatever ebook format, has salutations in upper case. (PG's versions appear to want them centered on the page as well, for some reason.) The only version I could find that doesn't have uppercase is the one found at the link in bowerbird's post.

On 9/22/2012 9:56 PM, don kretz wrote:
Pride and Prejudice might be an interesting first project. At least it should be easy to find the online resources.
I explored the internet a bit to see what I could find for page images and for free text versions.
I didn't quickly find very many page images to work from.
You must not have looked very hard. There are at least 5 scan sets at the Internet Archive, and at least 5 scan sets at Google Books (the Google scans seem to be cleaner and more useable). There is some overlap between the sets, but not as much as you might think.

_Pride and Prejudice_ is also included in Volume 3 of Charles Eliot's "Harvard Classics Shelf of Fiction." Given the ubiquity of that set, if you wanted a version that everyone has access to, that is the one I would choose.

_Pride and Prejudice_ is among the most e-re-worked works of fiction in the world, probably second only to _The Adventures of Sherlock Holmes_. I have an HTML version at my own web site: http://www.passkeysoft.com/Harvard/FICTION/AUSTEN,%20Jane%20-%20Pride%20And%..., based on the Harvard Classics edition. I doubt very much that anything could be added to the P&P corpus at this time, but the very fact that there is so much out there could make its use interesting as a learning experiment.

participants (11)

- David Starner
- don kretz
- Greg Newby
- James Adcock
- jeroen@bohol.ph
- Jon Hurst
- jon.a@hursts.eclipse.co.uk
- Lee Passey
- Lee Passey
- Peter Hatch
- Roger Frank