
Bowerbird wrote:
jon said:
But I believe it is also essential to preserve all accented Latin and non-Latin characters found in *all* books.
once again, the minutiae are being brought to the surface.
The devil is in the details.
as usual, you look only at the _benefits_, without factoring _costs_ into the equation.
On the other hand, there are certain minimum requirements for every project. As a corollary of an adage I've given earlier: "If a job is to be done, it is to be done right."
the _cost_ of including high-bit characters is the e-text then _breaks_ for some users, ones who are using viewer-programs that are not encoding-savvy, or who don't have all of the correct fonts on their computer.
All web browsers today, and most modern formats such as PDF, support the full Unicode character set. That's the future. Embrace it, don't fight it. There's a saying: "I focus on the future since that's where I'm going to spend the rest of my life."
if the unicode people had done their job right, and made unicode follow the mac philosophy -- "it just works" -- i would be up there on the unicode bandwagon with you and your friends.
This is a specious argument. The Unicode working group is doing its job right: before Unicode, things were a *real* mess and were NOT working. There is a clear need to unify the world's character sets and to create universal text encodings (e.g., UTF-8). There is still some controversy regarding some Han scripts, but by and large Unicode has been successful at its stated goals.
wanna do something useful? _make_it_work_! not just on the new machines, with certain browsers and not any other viewer-programs -- on _every_ machine, with _every_ program.
Throwing out important accented characters is unacceptable. Period. The author/publisher considered these characters important enough to spend the $$$ to include them (in the 19th century it took extra effort to print books with accented and foreign characters). They add richness to the text, and it is hard to argue that they are not an integral part of it. Anyway, it is trivial, as *you said yourself*, to autoconvert text with accented characters to 7-bit ASCII, so you *can* make your system work for the folk using legacy systems. It is far better to do the job right for the long-term future than to compromise it for the short term (legacy hardware and software that are rapidly becoming obsolete).
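A minimal sketch of that kind of autoconversion, in Python, assuming only the standard unicodedata module (the function name and sample title are illustrative, not part of any existing PG tooling):

    import unicodedata

    def to_ascii(text):
        # Decompose each character (NFKD), then drop the combining accent
        # marks, leaving a plain 7-bit ASCII approximation of the text.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

    print(to_ascii("My Ántonia"))   # prints: My Antonia

Characters with no ASCII decomposition are simply dropped, so a production converter would also want a transliteration table for ligatures and non-Latin scripts. But the point stands: an accented Master can always be downgraded automatically for legacy users; the reverse conversion is not possible.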
but until then, just stop bugging all of us about it. we've heard it, too often, and we are unconvinced.
Who's "we"? It would not surprise me if the majority of PG and DP volunteers consider it important (or at least a very good idea) to reproduce the full character set in all Public Domain texts, especially now that it is easy to do (both by UTF-8/16 encoding, and using character entities in XML/XHTML/TEI.) Hopefully a few of the PGers and DPers will give their thoughts on this particular topic.
and buddy, you are _not_ going to convince us by repeating the same old argument _again_, or by asserting your beliefs again and again...
Who's "us"?
with all the time i've wasted discussing this stupid topic for the 829th time, i could have cleaned up the rest of that "my antonia" text.
If it weren't important *to you*, you would not have replied. I can only interpret your vociferous replies to mean that you consider permanently dumping accented characters to be an *important* requirement of your system. That's why I have used the word "inconvenient", since that's the only reason I can think of. But if you have another reason why you believe it is o.k. to dump accented characters from most English-language PG texts, let us know. You've not given a good reason why they should not be reproduced. (The argument of meeting legacy needs is not compelling since, as you said and as I repeated above, one can autoconvert a Master document with accented characters to 7-bit ASCII for legacy users. Thus you can meet the needs of those people *and* the needs and preferences of future generations by preserving the non-ASCII characters. Instead, you inexplicably want to permanently remove accented characters from the digital *Master* versions of most public domain English-language texts.)

There are many aspects of Public Domain texts that are "inconvenient" and prevent easy digitizing. We figure out how to overcome these "inconveniences" and produce a high-quality product, rather than take short-term shortcuts to avoid dealing with them. Distributed Proofreaders is one example of not giving in to the "convenient", but instead figuring out how to do the job right in a reasonably efficient way.

Anyway, why the rush to digitize page scans (that is, turn them into structured digital texts), to the point that you are willing to sacrifice textual accuracy and quality? So long as the page scans are available for posterity, they can be transcribed at any time, and done more carefully and thoughtfully. To me, the most critical thing is to make archival-quality scans of public domain texts and get them online via IA and similar organizations. In the meanwhile, the most popular of these texts can be carefully and methodically converted to Structured Digital Texts (SDT). There are about 1000 classic Public Domain works (part of the pre-DP PG collection) that should be redone to at least the quality of the "My Antonia" demo project. (For those who have not seen it, it is at http://www.openreader.org/myantonia/; it is still an early "beta", but it's been a real learning experience for several of us working on it.)

Jon