
On 2/20/2010 11:52 AM, Gardner Buchanan wrote:
My question to the pagination-preservers is: what is the difference? Both hyphenation/line-endings and pagination are mainly artefacts of the physical medium -- one of width and the other of height. Bowerbird wants to keep both; I see no need to keep either. But what is the reasoning behind keeping one (pagination) and not the other?
As with most things, your position depends on your perspective.

As a reader (consumer) of e-books, I want to get /all/ the production artifacts out of my way: a line should wrap wherever I would expect it to given the size of my viewport (screen), hyphenation should occur only between syllables at the right edge of the viewport, and a page should end at the bottom of my viewport--no sooner, no later. Page numbers, if any, should reflect the number of /virtual/ pages in the book I'm reading, i.e. the number of viewports needed to complete the book. These page numbers should not be embedded in the text, but should be displayed somewhere else in the User Agent where I can refer to them if I want to but where they are otherwise inconspicuous. Of course, if I change fonts or the viewport size I would expect the page numbers to be updated to reflect that change.

BUT ... as a producer of e-books, it is my self-appointed task to create an e-book whose reading experience matches, as nearly as possible, a specific instance of a historical paper book. Clearly this doesn't mean that in the final product the page- or line-endings have to match the source, as that would in many cases lead to an awkward "ouija board" reading experience, but it does mean that I want to maintain markers throughout the e-document that can 1.) /create/ a view where the page- and line-endings match, so I can do a side-by-side comparison of a page image with my electronic version, and 2.) lead me efficiently back to a particular page scan if there is any question about the correctness of the electronic edition.

This apparent conflict between the two perspectives leads to two follow-on questions: 1.) where, and how broad, is the line between production and consumption? and 2.) is it possible to create a single electronic document that can satisfy both needs?

In the case of the PG/DP co-dependency, I think the line is clear and narrow: Distributed Proofreaders is /only/ a producer of electronic documents, and its only consumer is Project Gutenberg. Project Gutenberg is /only/ a consumer of electronic texts, and while DP is its primary producer it is not the only one.

According to Al Haines, one of PG's whitewashers, the PG 'errata' mechanism "is informal, at best, and there's no list of old submissions that would benefit from being re-done." Errata resolution at PG is handled via e-mail messages to a very small handful of whitewashers. According to Mr. Haines, "My PG priorities are my own productions first, followed by WWing, then Errata and Reposts." In yet another post, after detailing multiple problems with an old DP contribution, he states:
Is it worth it? Personally speaking, no. It's going to take hours to fix this text, time that I'd far rather spend on my own productions, but there's currently no mechanism except for the Whitewashers, a.k.a. Errata Team, to fix this kind of thing. (Probably simpler to just re-do this text from scratch, which is something *I'm* not about to do.)
This is precisely the reason that DP puts such an emphasis on having a /completed/ text. Once an electronic document passes from DP to PG there is almost /no/ chance that it will ever be improved, revised, or corrected. This is not to cast aspersions on the hard and dedicated work of the whitewashers; it is simply an acknowledgment that such work is not a high priority for them and that there is no formal mechanism to help it get done.

Because Project Gutenberg is the /only/ consumer of Distributed Proofreaders' production, preservation of line- and page-breaks should be of little importance in the current DP->PG work flow. /If/, on the other hand, what you're doing lies outside of the DP->PG work flow (as it appears yours does), then the calculation changes. For example, what happens to the page scans from which your text is derived? If those scans are not, and never will be, publicly available, then encoding markers in the text that refer back to the page scans and the original text layout may not be necessary or important. Likewise, if PG is your only distribution point, then you will probably be the only one who will ever make changes, corrections, or improvements to the text. If you expect that once you have completed a task and transferred responsibility to Project Gutenberg you are finished, perhaps even deleting the original scans and your intermediate work files (please don't do this; I'm sure the Internet Archive would be willing to take them off your hands), then preserving markers that refer back to the original text is probably not necessary.

By contrast, if you are preparing files for broader distribution than simply via Project Gutenberg, or if you anticipate that someday a work flow may develop, either inside or outside the DP->PG chain, that will support continuous improvement of your original work, then I would think that creating and preserving text markers--including original page-breaks, line-breaks, and page numbers referencing the original scan set--would be advisable. This is particularly true because it is always easier to preserve data, even data of dubious value, than it is to try to recover data that has been lost or discarded.

This leads us to the second question: is it possible to create a single electronic document which can satisfy the needs of both readers and producers? I believe that it is, but it requires the use of a markup language having at least the capability of marking some text as invisible, and a user agent that is capable of recognizing that markup and /not/ rendering it. I'm sure there are a number of markup languages that could satisfy this requirement, but I have chosen to use XHTML (with one small cheat).

When ABBYY FineReader saves its OCR output in HTML format it has the option of placing a break (<br>) at the end of each line and a horizontal rule (<hr>) between each page (an alternative is to save each scanned page as a separate file, but I find that less convenient). I then wrote a short program (it could probably be done just as easily with a perl script, or even sed) that replaced each <hr> with an anchor tag indicating the page number (<a name="page##" />) and each <br> with <lb />. Now <lb> is not a valid HTML element (hence the cheat), but I know of no user agent that will fail to render an HTML file just because it contains an invalid element.
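For illustration, here is a minimal perl sketch of that substitution pass -- the file handling and the exact form of the page numbers are assumptions for the example, not my actual program:

  #!/usr/bin/perl
  # Sketch only: number the pages at each <hr>, and record each
  # original line ending with the (invalid, but harmless) <lb />.
  use strict;
  use warnings;

  my $page = 0;
  while (my $line = <>) {
      # Each <hr> in FineReader's output marks a page boundary;
      # replace it with a named anchor recording the page number.
      $line =~ s{<hr\s*/?>}{'<a name="page' . ++$page . '" />'}ge;
      # Each <br> marks an original line ending.
      $line =~ s{<br\s*/?>}{<lb />}g;
      print $line;
  }

Run it as, say, "perl paginate.pl fr_output.html > marked.html"; the anchors then give you targets like #page217 for jumping straight back to a particular scan.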
FineReader is quite good at recognizing whether line-ending hyphenation is due to the splitting of a long word or is a required part of a compound word. In the first case, when it saves using line breaks, it records the line-ending hyphenation either as a hard hyphen or as a soft hyphen. When each soft hyphen is replaced by the &shy; entity (and the white space that follows is removed), you have recorded a line-ending hyphen which will not be displayed (although different user agents sometimes do things differently); a sketch of this step appears below.

Originally I was in the "collapse page- and line-break" camp, but because I never submit my texts to PG, because I hope that someday some sort of continuous-improvement process may evolve, and because maintaining the data is cheap and simple, I'm moving into the "preserve everything" camp.
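A minimal sketch of that soft-hyphen collapse, assuming the OCR output carries the raw Unicode soft hyphen (U+00AD) at line ends and has already been through the <lb /> substitution above:

  #!/usr/bin/perl
  # Sketch only: rejoin words split across original line endings.
  use strict;
  use warnings;
  use open qw(:std :encoding(UTF-8));   # assume UTF-8 input and output

  local $/;             # slurp the whole file
  my $text = <>;
  # A raw soft hyphen (U+00AD) before an <lb /> marker: convert it to
  # the &shy; entity and delete the whitespace after the marker so the
  # two halves of the word rejoin. The <lb /> still records the
  # original line ending; the hyphen stays invisible unless the user
  # agent happens to wrap the line at exactly that point.
  $text =~ s/\x{AD}(<lb \/>)\s+/&shy;$1/g;
  print $text;

Hard hyphens -- genuine compound-word hyphens -- are a different case and are deliberately left alone here.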