
On 2/20/2010 11:52 AM, Gardner Buchanan wrote:
My question to the pagination-preservers is: what is the difference? Both hyphenation/line-endings and pagination are mainly artefacts of the physical medium -- one of width and the other of height. Bowerbird wants to keep both; I see no need to keep either. But what is the reasoning behind keeping one (pagination) and not the other?
As with most things, your position depends on your perspective.

As a reader (consumer) of e-books, I want to get /all/ the production artifacts out of my way: a line should wrap wherever I would expect it to given the size of my viewport (screen), hyphenation should occur only between syllables at the right edge of the viewport, and a page should end at the bottom of my viewport--no sooner, no later. Page numbers, if any, should reflect the number of /virtual/ pages in the book I'm reading, i.e. the number of viewports needed to complete the book. These page numbers should not be embedded in the text, but should be displayed somewhere else in the User Agent where I can refer to them if I want to but where they are otherwise inconspicuous. Of course, if I change fonts or the viewport size I would expect the page numbers to be updated to reflect that change.

BUT ... as a producer of e-books, it is my self-appointed task to create an e-book whose reading experience matches, as nearly as possible, a specific instance of a historical paper book. Clearly this doesn't mean that in the final product the page- or line-endings have to match the source, as that would in many cases lead to an awkward "ouija board" reading experience, but it does mean that I want to maintain markers throughout the e-document that can 1.) /create/ a view where the page- and line-endings match, so I can do a side-by-side comparison of a page image with my electronic version, and 2.) lead me efficiently back to a particular page scan if there is any question about the correctness of the electronic edition.

This apparent conflict between the two perspectives leads to two follow-on questions: 1.) where, and how broad, is the line between production and consumption? and 2.) is it possible to create a single electronic document that can satisfy both needs?

In the case of the PG/DP co-dependency, I think the line is clear and narrow: Distributed Proofreaders is /only/ a producer of electronic documents, and its only consumer is Project Gutenberg. Project Gutenberg is /only/ a consumer of electronic texts, and while DP is its primary producer it is not the only one.

According to Al Haines, one of PG's whitewashers, the PG 'errata' mechanism "is informal, at best, and there's no list of old submissions that would benefit from being re-done." Errata resolution at PG is handled via e-mail messages to a very small handful of whitewashers. According to Mr. Haines, "My PG priorities are my own productions first, followed by WWing, then Errata and Reposts." In yet another post, after detailing multiple problems with an old DP contribution, he states:
Is it worth it? Personally speaking, no. It's going to take hours to fix this text, time that I'd far rather spend on my own productions, but there's currently no mechanism except for the Whitewashers, a.k.a. Errata Team, to fix this kind of thing. (Probably simpler to just re-do this text from scratch, which is something *I'm* not about to do.)
This is precisely the reason that DP puts such an emphasis on having a /completed/ text. Once an electronic document passes from DP to PG there is almost /no/ chance that it will ever be improved, revised, or corrected. This is not to cast aspersions on the hard and dedicated work of the whitewashers; it is simply an acknowledgment that such work is not a high priority for them and that there is no formal mechanism to help it get done.

Because Project Gutenberg is the /only/ consumer of Distributed Proofreaders' production, preservation of line- and page-breaks should be of little importance in the current DP->PG work flow. /If/, on the other hand, what you're doing lies outside of the DP->PG work flow (as it appears yours does), then the calculation changes. For example, what happens to the page scans from which your text is derived? If those scans are not, and never will be, publicly available, then encoding markers in the text that refer back to the page scans and the original text layout may not be necessary or important. Likewise, if PG is your only distribution point, then you will probably be the only one who will ever make changes, corrections, or improvements to the text. If you expect that once you have completed a task and transferred responsibility to Project Gutenberg you are finished, perhaps even deleting the original scans and your intermediate work files (please don't do this; I'm sure the Internet Archive would be willing to take them off your hands), then preserving markers that refer back to the original text is probably not necessary.

By contrast, if you are preparing files for broader distribution than simply via Project Gutenberg, or if you anticipate that someday a work flow may develop, either inside or outside the DP->PG chain, that will support continuous improvement of your original work, then I would think that creating and preserving text markers--including original page-breaks, line-breaks, and page numbers referencing the original scan set--would be advisable. This is particularly true because it is always easier to preserve data, even data of dubious value, than it is to try to recover data that has been lost or discarded.

This leads us to the second question: is it possible to create a single electronic document which can satisfy the needs of both readers and producers? I believe that it is, but it requires the use of a markup language having at least the capability of marking some text as invisible, and a user agent that is capable of recognizing that markup and /not/ rendering it. I'm sure there are a number of markup languages that could satisfy this requirement, but I have chosen to use XHTML (with one small cheat).

When ABBYY FineReader saves its OCR output in HTML format it has the option of placing a break (<br>) at the end of each line and a horizontal rule (<hr>) between each page (an alternative is to save each scanned page as a separate file, but I find that less convenient). I then wrote a short program (it could probably be done just as easily with a perl script, or even sed) that replaced each <hr> with an anchor tag indicating the page number (<a name="page##" />) and each <br> with <lb />. Now <lb> is not a valid HTML element (hence the cheat), but I know of no user agent that will fail to render an HTML file just because it contains an invalid element.
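For illustration, here is a minimal perl sketch of that substitution pass -- the file handling and the exact form of the page numbers are assumptions for the example, not my actual program:

  #!/usr/bin/perl
  # Sketch only: number the pages at each <hr>, and record each
  # original line ending with the (invalid, but harmless) <lb />.
  use strict;
  use warnings;

  my $page = 0;
  while (my $line = <>) {
      # Each <hr> in FineReader's output marks a page boundary;
      # replace it with a named anchor recording the page number.
      $line =~ s{<hr\s*/?>}{'<a name="page' . ++$page . '" />'}ge;
      # Each <br> marks an original line ending.
      $line =~ s{<br\s*/?>}{<lb />}g;
      print $line;
  }

Run it as, say, "perl paginate.pl fr_output.html > marked.html"; the anchors then give you targets like #page217 for jumping straight back to a particular scan.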
FineReader is quite good at recognizing whether line-ending hyphenation is due to the splitting of a long word or is a required part of a compound word. In the first case, when it saves using line breaks, it records the line-ending hyphenation either as a hard hyphen or as a soft hyphen. When each soft hyphen is replaced by the &shy; entity (and the white space that follows is removed), you have recorded a line-ending hyphen which will not be displayed (although different user agents sometimes do things differently); a sketch of this step appears below.

Originally I was in the "collapse page- and line-break" camp, but because I never submit my texts to PG, because I hope that someday some sort of continuous-improvement process may evolve, and because maintaining the data is cheap and simple, I'm moving into the "preserve everything" camp.
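A minimal sketch of that soft-hyphen collapse, assuming the OCR output carries the raw Unicode soft hyphen (U+00AD) at line ends and has already been through the <lb /> substitution above:

  #!/usr/bin/perl
  # Sketch only: rejoin words split across original line endings.
  use strict;
  use warnings;
  use open qw(:std :encoding(UTF-8));   # assume UTF-8 input and output

  local $/;             # slurp the whole file
  my $text = <>;
  # A raw soft hyphen (U+00AD) before an <lb /> marker: convert it to
  # the &shy; entity and delete the whitespace after the marker so the
  # two halves of the word rejoin. The <lb /> still records the
  # original line ending; the hyphen stays invisible unless the user
  # agent happens to wrap the line at exactly that point.
  $text =~ s/\x{AD}(<lb \/>)\s+/&shy;$1/g;
  print $text;

Hard hyphens -- genuine compound-word hyphens -- are a different case and are deliberately left alone here.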