as i watch all these p.g. e-texts float across my screen,
i just can't help but have some thoughts recur to me...

in the old days, when -- for some very good reasons --
a p.g. e-text was considered to be an _amalgamation_ of
different versions of a book (even when it really was not,
a fiction advised by p.g. legal counsel at that early time),
that gave a good reason to remove end-line hyphenation
and reflow text (without hyphenation) to p.g. margination.
after all, hyphenation mostly causes problems in e-books.

in the current era, however, where most p.g. e-texts are
pegged to a specific version of a book (and where, for the
most part, the scans are now retained to cement this direct
correspondence), it no longer makes sense to discard the
line-breaks, or even the end-line hyphenation, to be frank.

yes, end-line hyphenation should be _marked_ in some way,
so it can be automatically eliminated, but the _default_action_
should be to retain it.  it would defeat the purpose of saving
the line-breaks if you didn't also retain end-line hyphenation,
because the goal here would be to duplicate the print version.
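
here's a minimal sketch of what "marked" could mean in practice.  the
marker character is my own assumption, not an existing p.g. convention:
a hyphen that exists only because of the line-break is stored as "¬",
while a real hyphen that happens to fall at a line-end stays as "-".
the function names are hypothetical too.  the point is just that once
the hyphens are marked, both outputs are mechanical.

```python
SOFT = "\u00ac"  # hypothetical marker: hyphen caused only by the line-break


def as_printed(marked: str) -> str:
    """default action: keep the line-breaks, show the hyphens as printed."""
    return marked.replace(SOFT, "-")


def reflowed(marked: str) -> str:
    """optional action: rejoin split words, unwrap each paragraph to one line."""
    paragraphs = []
    for para in marked.split("\n\n"):
        out = ""
        for line in para.splitlines():
            line = line.strip()
            if not line:
                continue
            if out.endswith(SOFT):
                # marked hyphen: drop the marker and join the two halves
                out = out[:-1] + line
            elif out.endswith("-") and not out.endswith("--"):
                # real hyphen at line-end (e.g. "well-" / "known"): keep it
                out = out + line
            elif out:
                out = out + " " + line
            else:
                out = line
        paragraphs.append(out)
    return "\n\n".join(paragraphs)


sample = "the end-line hyphen\u00ac\nation is marked, so a well-\nknown word survives."
print(as_printed(sample))
print(reflowed(sample))
```

the stored file keeps everything; the reader (or the viewer-program)
decides which of the two views to generate.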

(don't bother arguing that there would never be such a desire;
maybe you'd never have any need for it, but _someone_ might.
i can think of half-a-dozen such reasons -- want to hear them?)

if you want to see what the future of electronic-books looks like,
see the "digital reprints" that jose menendez has been producing.
>   http://www.ibiblio.org/ebooks/Mabie/
>   http://www.ibiblio.org/ebooks/Cather/
>   http://www.ibiblio.org/ebooks/Einstein/

the deep links to the actual .pdf "digital reprints" are these:
>   http://www.ibiblio.org/ebooks/Mabie/Books_Culture.pdf
>   http://www.ibiblio.org/ebooks/Cather/Antonia/Antonia.pdf
>   http://www.ibiblio.org/ebooks/Einstein/Einstein_Relativity.pdf

aside from the unfortunate fact that jose is using the .pdf format
(a format which makes it far too difficult to repurpose the content),
these "digital reprints" carve out an awesome model for e-books.

they replicate the original paper-book to a high degree of fidelity,
and do so using a small percentage of the disk-space of the scans.
yet because each one is an e-book, it gives all the benefits e-books give.
(at least it _would_, if it weren't a .pdf.  but that part can be fixed.)

and the secret of these "digital reprints" is extremely simple, folks;
all that jose has done is merely to retain the original line-breaks...

so, once again, i recommend and request that you start retaining
this valuable information, instead of intentionally tossing it away.

(it is very ironic, to me, that distributed proofreaders _retains_
the line-breaks during their proofing -- because it makes that
process so much easier -- but then they discard the line-breaks!
hey, there might be some end-users out there who need 'em too!)

honestly, folks, when i look at your p.g. e-texts, what i see is that
they're gonna be thrown on the trashpile one day -- maybe soon.

in a world that is awash in scans, and where o.c.r. is a commodity,
it'll be trivial to convert those scans to text.  so if someone needs
to have the ability to duplicate the print version -- i.e., they _need_
to have the line-break information you are routinely discarding --
they'll simply o.c.r. the scans again.  they will be required to do that,
because your e-texts simply won't do the job that they want done...

that's not to say that your p.g. e-texts will be _completely_ worthless.
as an independent digitization, they'll go a long way toward helping
to move any new o.c.r. effort up to an absurdly high level of accuracy.
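
to make that concrete, here's a rough sketch of comparing two independent
digitizations of the same page.  it uses python's standard difflib module;
the function name and the sample strings are made up for illustration, and
i'm not claiming this is how any existing tool does it.  the idea is just
that wherever the two texts disagree, at least one of them is wrong, so
those are the spots a human needs to check against the scan.

```python
import difflib


def disagreements(text_a: str, text_b: str):
    """return (a_words, b_words) pairs wherever the two texts differ."""
    a, b = text_a.split(), text_b.split()
    # compare word-by-word; autojunk is off so common words are not skipped
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


pg_text = "the quick brown fox jumps"   # made-up sample, existing e-text
new_ocr = "the quick hrown fox jumps"   # made-up sample, fresh o.c.r.
print(disagreements(pg_text, new_ocr))
```

note that splitting on whitespace ignores the line-breaks, so this check
works whether or not either text has retained them.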

but since the absurd level of accuracy can be applied to either e-text,
and since the new effort will have retained the line-break information,
that will be the one that's retained.  the p.g. e-text will be thrown away.
and it would break my heart to see all your hard work just thrown away.

on the other hand, if y'all started retaining that line-break information,
then it'd be _your_ version which would be kept (because of its primacy),
and the new o.c.r. effort would just be seen as a tool to increase accuracy.

if project gutenberg wants to remain the premier library in cyberspace,
you're going to have to fix this glitch, and do it quickly.  mark my words...

-bowerbird

p.s.  as some of you aren't good at reading between the lines, i'll tell you
that i intend to mount such a massive o.c.r. effort, so the question about
which version, p.g. or not, receives the higher accuracy is a very real one.
i don't want to challenge the p.g. library, _unless_ you've made it deficient.
i'm trying to help you by giving you this advice before it becomes crucial...