
The real issue at play here is, I think, one of manuscript preservation versus text preservation. Manuscript preservation is only really relevant to academia, but text preservation is relevant to anyone interested in content. What Bowerbird has been talking about preserving (pagination, hyphenation, etc.) are manuscript details. Preserving pagination for each edition seems extreme: how many bargain-basement editions have there been of Austen? Or Twain? Few of these have any critical significance.
Generic machine-readable formats like HTML and plain text are terrible for manuscript preservation. They make wonderful text preservation and reading formats, but there is no way to accurately reproduce the nuances of a manuscript in them. The sad fact is that the discerning reader who needs this information would be much better off examining page scans than using a text or HTML version of the book. Images are, and always will be, the only way to put all of the print information in the user's hands.
Again, let us not fool ourselves. The only people who will be interested in the manuscripts are academics and the odd-ball home enthusiast. Readers do not care about the original pages. There have been many editions of Twain or Shakespeare. If I read an etext, I cannot give a page number. But if I read a paper copy, the page number does not do much good in ordinary conversation. It is only in an academic atmosphere, where the ability to check sources is crucial, that this matters. In any sort of informal discussion, the answer would be to use textual landmarks: "Chapter III of Huck Finn", "Book I, line 120 of Paradise Lost", etc.
The ultimate form of manuscript preservation is neither digitization nor transcription, but rounding up all the manuscripts to be preserved and sinking them in a huge concrete bunker (as the US Archives does with the Declaration of Independence and the Constitution).
So far, PG has been about text preservation. To the larger world, this is of far more importance. Few of us care about the manuscripts of the Iliad, but many of us (including me!) would love to have the missing texts of the epic cycle.
It comes down, then, to what you are interested in. There is a place, I think, for both, but the intersection is a little peculiar. Scans without any further human intervention are the most time-effective way to digitize a volume, but they also have the narrowest use case. Transcription (whether through human-proofed OCR or manual typing is irrelevant) has the widest use case, but requires the greatest time investment.
So, Bowerbird, are you interested in text preservation or in manuscript preservation? If the former, then artifacts of print are immaterial. Not only will they change with digital, but they have changed and will change with print as well. If the latter, then why not create DjVu scans of all pages, replacing the OCR with highly proofed text (drawn, perhaps, from the PG archive)? This would, after all, get you the best of both worlds: a searchable, print-faithful, computer-readable file.
Excerpts from Bowerbird's message of Mon Feb 22 18:23:44 -0600 2010:
michael said:
I had to have read the whole thing to get to the part I quoted. . .duh!
you can quote something without reading it.
but i should have said i wish you would have _understood_ what i was saying, because then you would have understood that your response didn't address the point that i was making.
When you ask people to pay attention, it helps to PAY ATTENTION.
i am paying attention, michael. even though i've heard what you're saying lots and lots and lots of times before.
and it makes sense when you are addressing the people who are making _other_ points. but they left long ago...
It addresses EXACTLY the point you made that I quoted. . . .
no, it doesn't.
Then SAY that!!! Right up front in plain language!!!
i _have_ said it, every time i've talked about this issue...
including once when marcello brought up this same point you brought up (think about that, michael), about editions. (it was back in september of 2007, if you're curious.)
marcello pointed out that the various different editions of "pride and prejudice" had different pagination, and he asked which of the editions should be used to do the pagination...
here's the reply i made to marcello's point:
if you don't care to read all of it, here are the guts of that reply:
the answer to the question as to which set of linebreaks and pagebreaks to use is this: the ones in the edition you digitize.
plain old common sense. if you didn't already know the answer, perhaps you might want to exercise your brain a little bit more...
if you're digitizing the 1844, use its linebreaks and pagebreaks. if you're digitizing the 1853, use its linebreaks and pagebreaks. if you're digitizing the 1870, use its linebreaks and pagebreaks. if you're digitizing the 1892, use its linebreaks and pagebreaks.
and here is a web-page showing the first page of those 4 editions:
and yes, in case you're wondering, if a p-book was important enough to go through different editions, we should digitize _every_ edition...
i'm not going to tell you which of those 4 editions you _should_ use, which one is the "right" one. whichever one you want to use is "right". and you should be able to determine if any specific e-book _does_ or _does_ not match the edition you want to use, or some other edition.
***
by the way, there's another web-page of interest in this directory:
this fascinating page shows some work done by jose menendez.
jose adopted my suggestion that the e-book be able to mimic the p-book, and he created a series of .pdf books that did just that...
shown on this web-page are some screenshots of his .pdf-books, compared to the page-scans from those pages. he did a great job.
of course, since a .pdf-book is unable to reflow its text, jose's work doesn't fit the more-important criterion of reflowability, but it does show that the ability to mimic the original can be extremely valuable.
However, that still relegates us to being a Xerox machine, no?
no. because a xerox machine can't do reflow. or fix typos. or pull in spacey contractions. or change the font, or size.
look, i understand the appeal of digital text _extremely_ well. i've made all of the arguments myself, so there's no need for you to repeat 'em back at me, you're just wasting your breath.
but there's a problem looming here, a problem that the future will have to face, and solve, and i'm telling you what you need to do, so that you can _help_ the future _solve_ that problem, such that your e-texts will continue to be used, and not tossed.
i'm on _your_ side, michael... i've got your back, good buddy...
so you need to get that through your skull and start listening...
I'm never going to get into any of these semantic arguments!!!!!!! Mimic means to copy as closely as possible. . . . Synonym: copy.
it's not a semantic argument, michael. it's protective coloration.
if your copy isn't capable of _assuming_the_look_and_feel_ of the thing that it _purports_ to be copying, nobody will trust it.
you seem to be forgetting that you are claiming to _be_ a copy. perhaps you are an "improved" copy, but you are _still_ a copy.
certainly if you came out and said "we rewrote parts of the middle, because the original was too boring", you would expect that people would throw you away.
but what if someone points out a few errors in your work, and says "see, you can't trust this work, it hasn't been faithfully transcribed," then what is your defense? you can say that "it was just a few errors", but what if they then point out a few more, and a few more after that? at what point can you no longer expect the end-user to believe you?
As I have said before, if you would listen, I am not AGAINST keeping a copy with such pagination for such purposes
well, good, and bully for you, and all that, but the fact of the matter is that project gutenberg is not, at this point in time, actually doing that.
but I draw the lines, pun intended, at keeping every character in the same page position when there is no need for pages, in all available PG editions.
and, as i have said before, if you would listen, i'm not suggesting, in any way, shape, or form, that pagination and linebreaks need to be kept "in all available p.g. editions". that'd be stupid, absolutely and totally and ridiculously _stupid_, and i don't usually feel a need to rule out stuff that is absolutely, totally, ridiculously stupid...
I want our eBooks to be optimally readable: Minimal end of line hyphenation. No page headers or footers. Just plain reading.
i'm 100% in favor of that. and i have demonstrated before, and will be happy to demonstrate once again, any time that you like, how p.g. could save its texts in a format that allows verification of the type that i am talking about, _and_ allows the end-user to have the text exactly like you specify.
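
To make the idea concrete, here is a minimal sketch of one way a single stored file could serve both purposes. It is not the format bowerbird demonstrated, which is not shown in this message. The `[page N]` marker syntax, the function names, and the line breaks in the sample are all assumptions made for illustration. The sketch also assumes that a trailing hyphen on a stored line always marks a word split by the printer, which a real format would need to distinguish from genuine hyphens.

```python
import re

# Hypothetical marker: a line of the form "[page 12]" starts print page 12.
PAGE_MARKER = re.compile(r"^\[page (\d+)\]$")


def reflow(stored: str) -> str:
    """Return plain reading text: no page markers, no print line breaks."""
    paragraphs = []
    current = ""
    for line in stored.splitlines():
        line = line.strip()
        if PAGE_MARKER.match(line):
            continue  # page boundaries do not matter for plain reading
        if not line:
            # A blank line ends the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):
            # Assumption: a trailing hyphen means a word split in print.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)


def split_pages(stored: str) -> dict[int, list[str]]:
    """Return the print lines of each page, for checking against a scan."""
    pages: dict[int, list[str]] = {}
    page = None
    for line in stored.splitlines():
        match = PAGE_MARKER.match(line.strip())
        if match:
            page = int(match.group(1))
            pages[page] = []
        elif page is not None:
            pages[page].append(line)
    return pages


# Sample with invented line and page breaks, for illustration only.
SAMPLE = (
    "[page 1]\n"
    "It is a truth univer-\n"
    "sally acknowledged, that a\n"
    "[page 2]\n"
    "single man in possession\n"
    "\n"
    "However little known\n"
)

if __name__ == "__main__":
    print(reflow(SAMPLE))
    print(split_pages(SAMPLE)[1])
```

The same stored text yields the reader's view through `reflow` and the print-faithful view through `split_pages`, so neither view has to be forced on anyone.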
Once again, I have no stance AGAINST people who want pagination, I just don't want to force any such arbitrary formats on anyone and neither should you or anyone else. STOP TRYING TO FORCE YOUR OPINIONS ON OTHERS, MAKE THEM OPTIONS!
see, that's precisely why i said you're not listening to me, because there's no way in the world i would try to "force" this on end-users.
i've never, ever, said _anything_ even _remotely_ like that, in all the years i've been on this listserve, or the decades i have done e-books, so i don't know who you're having this conversation with, michael, but it's obviously not me.
I CAN tell you that most of the paper editions' page numbers will fade along with the hyphenation.
no, they won't. because our cultural heritage is full of references to page-numbers, and it'll be several orders of magnitude cheaper and more efficient to keep track of those page-numbers than to attempt to re-do all those references using hyperlinks or whatever.
Last time I looked there were still pretty ubiquitous programs to lay out all such differences.
IFF you have such deep interests, you can simply put up two editions side by side when you look at them. . .I do. . . .
If not, then you aren't really that interested. . .it's all smoke.
this is very amusing. i do this kind of work, michael. regularly.
and i can tell you that it's not nearly as simple as you make it sound.
"_RIGHT_" copy??? Now you've contradicted yourself back into the ivory tower. . . . "_RIGHT_" copy, indeeeeed. . . .
you just can't _wait_ to jump to the wrong conclusion, can you?
by "right" copy, i mean the one that the person _wants_ to see.
This will ONLY do you any good if you manage to find that edition, out of all the other paper editions in the world.
again, "that edition" is whatever edition the person wants to use.
i have a paper copy of "catcher in the rye" on my bookshelf now.
let's say, 10 years on, i can find a dozen digital versions online. let's also say that some analysis shows differences between them. i haven't compared all, not in full, but i know there's some diffs.
i don't want a dozen different versions. i want the one that matches the paper copy that has been sitting on my bookshelf for 4 decades.
how do i determine which one -- _if_any_ -- is the same version as the one that is sitting on my shelf? that's the difficult question.
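
As a rough illustration of why this is hard without print information, here is a sketch of the crude check available today: type in a few passages from the paper copy and see which digital versions contain them verbatim. The function names and the sample strings are invented for illustration. A full score only shows that the sampled passages agree, not that the whole text matches the edition.

```python
def normalize(text: str) -> str:
    """Collapse all runs of whitespace so line breaks do not affect matching."""
    return " ".join(text.split())


def match_scores(samples: list[str], candidates: dict[str, str]) -> dict[str, float]:
    """For each candidate text, return the fraction of samples found verbatim."""
    if not samples:
        raise ValueError("need at least one sample passage")
    wanted = [normalize(s) for s in samples]
    scores = {}
    for name, text in candidates.items():
        body = normalize(text)
        hits = sum(1 for passage in wanted if passage in body)
        scores[name] = hits / len(wanted)
    return scores


if __name__ == "__main__":
    # Invented passages and candidate texts, not taken from any real book.
    samples = ["the colour of the sky", "he said, quietly"]
    candidates = {
        "version_a": "They noticed the color of the sky. Then he said, quietly, no.",
        "version_b": "They noticed the colour of\nthe sky. Then he said, quietly, no.",
    }
    print(match_scores(samples, candidates))
```

With page and line breaks recorded in each e-text, the same question could be answered by comparing a single page against the shelf copy rather than by sampling passages and hoping they are representative.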
Sorry, but I anticipated ALL of these questions when I first started, and have answered, and will continue to answer, at length.
no, you didn't answer the question. so i just asked it to you again.
Why can't you just propose your ideas as OPTIONS, not CARVED IN STONE?
stop making the thread ridiculous.
no one can carve anything in stone any more.
-bowerbird
--
Michael McDermott
www.mad-computer-scientist.com