[gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about pagination?

23 Feb 2010

The real issue at play here is, I think, one of manuscript preservation versus
text preservation.  Manuscript preservation is only really relevant to
academia, but text preservation is relevant to anyone interested in content. 

What Bowerbird has been talking about preserving (pagination, hyphenation,
etc.) are manuscript details. Preserving pagination for each edition seems
extreme: how many bargain basement editions have there been of Austen? Or
Twain? Few of these have any critical significance. Generic machine-readable
formats like HTML and plain text are terrible for manuscript preservation.
They make wonderful text preservation and reading formats, but there is no
way to accurately reproduce the nuances of a manuscript in them.

For the discerning reader who needs this information, the sad fact is that
they would be much better off examining page scans than they ever would be
using a text or HTML version of the book. Images are, and always will be, the
only way to put all of the print information in the user's hands.

Again, let us not fool ourselves. The only people who will be interested in
the manuscripts are academics and the oddball home enthusiast. Readers do
not care about the original pages. There have been many editions of Twain or
Shakespeare. If I read an etext, I cannot give a page number. But if I read a
paper copy, the page number does not do much good in ordinary conversation
either. It is only in an academic atmosphere, where the ability to check
sources is crucial, that this matters. In any sort of informal discussion, the
answer would be to use textual landmarks: "Chapter III of Huck Finn", "Book
I, line 120 of Paradise Lost", etc.

The ultimate form of manuscript preservation is neither digitization nor
transcription, but rounding up all the manuscripts to be preserved and
sinking them in a huge concrete bunker (as the US National Archives does with
the Declaration of Independence and the Constitution).

So far, PG has been about text preservation. To the larger world, this is of
far more importance. Few of us care about the manuscripts of the Iliad, but
many of us (including me!) would love the missing texts of the epic cycle.

It comes down, then, to what you are interested in. There is a place, I think,
for both, but the intersection is a little peculiar. Scans without any further
human intervention are the most time-efficient way to digitize a volume, but
they also serve the narrowest range of uses. Transcription (whether through
human-proofed OCR or manual typing is irrelevant) serves the widest range of
uses, but demands the greatest time investment.

So, Bowerbird, are you interested in text preservation or in manuscript
preservation? If the former, then artifacts of print are immaterial. Not only
will they change with digital, but they have changed and will change with
print as well. If the latter, then why not create DjVu scans of all pages,
replacing the OCR with highly proofed text (drawn, perhaps, from the PG
archive)? This would, after all, get you the best of both worlds: a
searchable, print-faithful, computer-readable file.

Excerpts from Bowerbird's message of Mon Feb 22 18:23:44 -0600 2010:
...
michael said:
...
I had to have read the whole thing 
   to get to the part I quoted. . .duh!
you can quote something without reading it.
but i should have said i wish you would have _understood_
what i was saying, because then you would have understood
that your response didn't address the point that i was making.
...
When you ask people to pay attention, it helps to PAY ATTENTION.
i am paying attention, michael.   even though i've heard
what you're saying lots and lots and lots of times before.
and it makes sense when you are addressing the people
who are making _other_ points.   but they left long ago...
...
It addresses EXACTLY the point you made that I quoted. . . .
no, it doesn't.
...
Then SAY that!!!  Right up front in plain language!!!
i _have_ said it, every time i've talked about this issue...
including once when marcello brought up this same point
you brought up (think about that, michael), about editions.
(it was back in september of 2007, if you're curious.)
marcello pointed out that the various different editions of
"pride and prejudice" had different pagination, and he asked
which of the editions should be used to do the pagination...
here's the reply i made to marcello's point:
...
http://z-m-l.com/go/pap/pride%20and%20prejudice(4).txt
if you don't care to read all of it, here are the guts of that reply:
...
the answer to the question as to which set of linebreaks and
   pagebreaks to use is this: the ones in the edition you digitize.
plain old common sense.   if you didn't already know the answer,
   perhaps you might want to exercise your brain a little bit more...
if you're digitizing the 1844, use its linebreaks and pagebreaks.
   if you're digitizing the 1853, use its linebreaks and pagebreaks.
   if you're digitizing the 1870, use its linebreaks and pagebreaks.
   if you're digitizing the 1892, use its linebreaks and pagebreaks.
and here is a web-page showing the first page of those 4 editions:
...
http://z-m-l.com/go/pap/pride_and_prejudice(4).html
and yes, in case you're wondering, if a p-book was important enough
to go through different editions, we should digitize _every_ edition...
i'm not going to tell you which of those 4 editions you _should_ use,
which one is the "right" one.   whichever one you want to use is "right".
and you should be able to determine if any specific e-book _does_ or
_does_ not match the edition you want to use, or some other edition.
***
by the way, there's another web-page of interest in this directory:
...
http://z-m-l.com/go/pap/jose(4).html
this fascinating page shows some work done by jose menendez.
jose adopted my suggestion that the e-book be able to mimic the
p-book, and he created a series of .pdf books that did just that...
shown on this web-page are some screenshots of his .pdf-books,
compared to the page-scans from those pages.   he did a great job.
of course, since a .pdf-book is unable to reflow its text, jose's work
doesn't fit the more-important criterion of reflowability, but it does
show that the ability to mimic the original can be extremely valuable.
...
However, that still relegates us to being a Xerox machine, no?
no.   because a xerox machine can't do reflow.   or fix typos.
or pull in spacey contractions.   or change the font, or size.
look, i understand the appeal of digital text _extremely_ well.
i've made all of the arguments myself, so there's no need for
you to repeat 'em back at me, you're just wasting your breath.
but there's a problem looming here, a problem that the future
will have to face, and solve, and i'm telling you what you need
to do, so that you can _help_ the future _solve_ that problem,
such that your e-texts will continue to be used, and not tossed.
i'm on _your_ side, michael...   i've got your back, good buddy...
so you need to get that through your skull and start listening...
...
I'm never going to get into any of these semantic arguments!!!!!!!
   Mimic means to copy as closely as possible. . . .  Synonym:  copy.
it's not a semantic argument, michael.   it's protective coloration.
if your copy isn't capable of _assuming_the_look_and_feel_ of
the thing that it _purports_ to be copying, nobody will trust it.
you seem to be forgetting that you are claiming to _be_ a copy.
perhaps you are an "improved" copy, but you are _still_ a copy.
certainly if you came out and said "we rewrote parts of the middle,
because the original was too boring", you would expect that people
would throw you away.
but what if someone points out a few errors in your work, and says
"see, you can't trust this work, it hasn't been faithfully transcribed,"
then what is your defense?   you can say that "it was just a few errors",
but what if they then point out a few more, and a few more after that?
at what point can you no longer expect the end-user to believe you?
...
As I have said before, if you would listen, 
   I am not AGAINST keeping a copy with such pagination 
   for such purposes
well, good, and bully for you, and all that, but the fact of the matter is
that project gutenberg is not, at this point in time, actually doing that.
...
but I draw the lines, pun intended, at keeping every character 
   in the same page position when there is no need for pages, 
   in all available PG editions.
and, as i have said before, if you would listen,
i'm not suggesting, in any way, shape, or form,
that pagination and linebreaks need to be kept
"in all available p.g. editions".   that'd be stupid,
absolutely and totally and ridiculously _stupid_,
and i don't usually feel a need to rule out stuff
that is absolutely, totally, ridiculously stupid...
...
I want our eBooks to be optimally readable:
   Minimal end of line hyphenation.
   No page headers or footers.
   Just plain reading.
i'm 100% in favor of that.   and i have demonstrated before, and
will be happy to demonstrate once again, any time that you like,
how p.g. could save its texts in a format that allows verification
of the type that i am talking about, _and_ allows the end-user to
have the text exactly like you specify.
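
A minimal sketch of that idea in Python: one master text keeps the print
edition's line breaks and page boundaries, and a plain reading copy is derived
from it on demand. The `[[page N]]` marker and the name `reflow` are
illustrative assumptions only; they are not z.m.l. syntax or anything Project
Gutenberg actually uses, and the de-hyphenation rule is deliberately naive.

```python
import re

# Hypothetical convention: a line like "[[page 12]]" marks the start of a printed page.
PAGE_MARK = re.compile(r"^\[\[page (\d+)\]\]$")


def reflow(master: str) -> str:
    """Derive reading text from a master that keeps print line and page breaks."""
    paragraphs = []
    current = ""
    for raw in master.splitlines():
        line = raw.strip()
        if PAGE_MARK.match(line):
            continue  # page boundaries are dropped in the reading copy
        if not line:
            # A blank line ends the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-") and not current.endswith("--"):
            # Naive assumption: a single trailing hyphen is an end-of-line word
            # split. This also joins genuinely hyphenated words like "well-known".
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)
```

The master stays untouched, so it can still be checked line by line against
the page scans, while the reader only ever sees the output of `reflow`.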
...
Once again, I have no stance AGAINST people who want pagination,
   I just don't want to force any such arbitrary formats on anyone
   and neither should you or anyone else.
   STOP TRYING TO FORCE YOUR OPINIONS ON OTHERS, 
   MAKE THEM OPTIONS!
see, that's precisely why i said you're not listening to me, because
there's no way in the world i would try to "force" this on end-users.
i've never, ever, said _anything_ even _remotely_ like that, in all the
years i've been on this listserve, or the decades i have done e-books,
so i don't know who you're having this conversation with, michael,
but it's obviously not me.
...
I  CAN  tell you that most of the paper editions' page numbers 
   will fade along with the hyphenation.
no, they won't.   because our cultural heritage is full of references
to page-numbers, and it'll be several orders of magnitude cheaper
and more efficient to keep track of those page-numbers than to
attempt to re-do all those references using hyperlinks or whatever.
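
As a rough illustration of how cheap it can be to keep track of page numbers,
here is a Python sketch that resolves a page citation against a text which
still carries page-boundary markers. The `[[page N]]` marker and the function
name `split_pages` are hypothetical, chosen for this example rather than taken
from any existing format.

```python
import re

# Hypothetical convention: a line like "[[page 12]]" marks the start of a printed page.
PAGE_MARK = re.compile(r"^\[\[page (\d+)\]\]$")


def split_pages(master: str) -> dict:
    """Map each printed page number to the lines that appear on that page."""
    pages = {}
    current = None
    for line in master.splitlines():
        match = PAGE_MARK.match(line.strip())
        if match:
            current = int(match.group(1))
            pages[current] = []
        elif current is not None:
            pages[current].append(line)
        # Lines before the first marker belong to no page and are skipped.
    return pages
```

With such an index, a reference like "p. 2" in an older book or article can be
looked up directly with `split_pages(master)[2]`, without rewriting the
reference itself.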
...
Last time I looked there were still pretty ubiquitous programs to
   lay out all such differences.
IFF you have such deep interests, you can simply put up two editions
   side by side when you look at them. . .I do. . . .
If not, then you aren't really that interested. . .it's all smoke.
this is very amusing.   i do this kind of work, michael.   regularly.
and i can tell you that it's not nearly as simple as you make it sound.
...
"_RIGHT_" copy???
   Now you've contradicted yourself back into the ivory tower. . . .
   "_RIGHT_" copy, indeeeeed. . . .
you just can't _wait_ to jump to the wrong conclusion, can you?
by "right" copy, i mean the one that the person _wants_ to see.
...
This will  ONLY  do you any good if you manage to find 
   that edition, out of all the other paper editions in the world.
again, "that edition" is whatever edition the person wants to use.
i have a paper copy of "catcher in the rye" on my bookshelf now.
let's say, 10 years on, i can find a dozen digital versions online.
let's also say that some analysis shows differences between them.
i haven't compared all, not in full, but i know there's some diffs.
i don't want a dozen different versions.   i want the one that matches
the paper copy that has been sitting on my bookshelf for 4 decades.
how do i determine which one -- _if_any_ -- is the same version
as the one that is sitting on my shelf?   that's the difficult question.
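
One possible way to approach that question, sketched in Python under stated
assumptions: transcribe a short passage from the paper copy, pull the
corresponding passage out of each digital version, and score them with the
standard library's `difflib`. The function names and the idea of comparing a
single sample passage are assumptions made for this example; a real comparison
would need more passages and some normalization of punctuation and spelling.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Word-level similarity between two passages, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def closest_version(sample: str, candidates: dict) -> tuple:
    """Return (name, score) of the candidate passage most similar to the sample.

    `candidates` maps a version name to the matching passage from that version.
    """
    scores = {name: similarity(sample, text) for name, text in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A score of 1.0 says only that the sampled passage is word-for-word identical.
It does not prove that the whole e-text is the same edition as the paper copy,
which is why the sketch returns the score along with the name.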
...
Sorry, but I anticipated ALL of these questions when I first started,
   and have answered, and will continue to answer, at length.
no, you didn't answer the question.   so i just asked it to you again.
...
Why can't you just propose your ideas as OPTIONS, 
   not CARVED IN STONE?
stop making the thread ridiculous.
no one can carve anything in stone any more.
-bowerbird
-- 
Michael McDermott
www.mad-computer-scientist.com