
Bowerbird wrote:
jon said:
But I believe it is also essential to preserve all accented Latin and non-Latin characters found in *all* books.
once again, the minutiae are being brought to the surface.
The devil is in the details.
as usual, you look only at the _benefits_, without factoring _costs_ into the equation.
On the other hand, there are certain minimum requirements for every project. As a corollary of an adage I've given earlier: "If a job is to be done, it is to be done right."
the _cost_ of including high-bit characters is the e-text then _breaks_ for some users, ones who are using viewer-programs that are not encoding-savvy, or who don't have all of the correct fonts on their computer.
All web browsers today, and most modern formats such as PDF, support the full Unicode character set. That's the future. Embrace it, don't fight it. There's a saying: "I focus on the future since that's where I'm going to spend the rest of my life."
if the unicode people had done their job right, and made unicode follow the mac philosophy -- "it just works" -- i would be up there on the unicode bandwagon with you and your friends.
This is a specious argument. The Unicode working group is doing its job right: before Unicode, things were a *real* mess and were NOT working. There is a clear need to unify the world's character sets and to create universal text encodings (e.g., UTF-8). There is still some controversy regarding some Han scripts, but by and large Unicode has been successful at its stated goals.
wanna do something useful? _make_it_work_! not just on the new machines, with certain browsers and not any other viewer-programs -- on _every_ machine, with _every_ program.
Throwing out important accented characters is unacceptable. Period. The author/publisher considered these characters important enough to spend the $$$ to include them (in the 19th century it took extra effort to print books with accented and foreign characters). They add richness to the text, and it is hard to argue that they are not an integral part of it. Anyway, it is trivial, as *you said yourself*, to autoconvert text with accented characters to 7-bit ASCII, so you *can* make your system work for the folk using legacy systems. It is far better to do the job right for the long-term future than to compromise it for the short term (legacy hardware and software that are rapidly becoming obsolete).
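A minimal sketch of that kind of autoconversion, in Python, assuming only the standard unicodedata module (the function name and sample title are illustrative, not part of any existing PG tooling):

    import unicodedata

    def to_ascii(text):
        # Decompose each character (NFKD), then drop the combining accent
        # marks, leaving a plain 7-bit ASCII approximation of the text.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

    print(to_ascii("My Ántonia"))   # prints: My Antonia

Characters with no ASCII decomposition are simply dropped, so a production converter would also want a transliteration table for ligatures and non-Latin scripts. But the point stands: an accented Master can always be downgraded automatically for legacy users; the reverse conversion is not possible.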
but until then, just stop bugging all of us about it. we've heard it, too often, and we are unconvinced.
Who's "we"? It would not surprise me if the majority of PG and DP volunteers consider it important (or at least a very good idea) to reproduce the full character set in all Public Domain texts, especially now that it is easy to do (both by UTF-8/16 encoding, and using character entities in XML/XHTML/TEI.) Hopefully a few of the PGers and DPers will give their thoughts on this particular topic.
and buddy, you are _not_ going to convince us by repeating the same old argument _again_, or by asserting your beliefs again and again...
Who's "us"?
with all the time i've wasted discussing this stupid topic for the 829th time, i could have cleaned up the rest of that "my antonia" text.
If it weren't important *to you*, you would not have replied. I can only interpret your vociferous replies to mean that you consider permanently dumping accented characters to be an *important* requirement of your system. That's why I have used the word "inconvenient", since that's the only reason I can think of. But if you have another reason why you believe it is o.k. to dump accented characters from most English-language PG texts, let us know. You've not given a good reason why they should not be reproduced. (The argument of meeting legacy needs is not compelling since, as you said and as I repeated above, one can autoconvert a Master document with accented characters to 7-bit ASCII for legacy users. Thus you can meet the needs of those people *and* the needs and preferences of future generations by preserving the non-ASCII characters. Instead, you inexplicably want to permanently remove accented characters from the digital *Master* versions of most public domain English-language texts.)

There are many aspects of Public Domain texts that are "inconvenient" and prevent easy digitizing. We figure out how to overcome these "inconveniences" and produce a high-quality product, rather than take short-term shortcuts to avoid dealing with them. Distributed Proofreaders is one example of not giving in to the "convenient", but instead figuring out how to do the job right in a reasonably efficient way.

Anyway, why the rush to digitize page scans (that is, turn them into structured digital texts), to the point that you are willing to sacrifice textual accuracy and quality? So long as the page scans are available for posterity, they can be transcribed at any time, and done more carefully and thoughtfully. To me, the most critical thing is to make archival-quality scans of public domain texts and get them online via IA and similar organizations. In the meanwhile, the most popular of these texts can be carefully and methodically converted to Structured Digital Texts (SDT). There are about 1000 classic Public Domain works (part of the pre-DP PG collection) that should be redone to at least the quality of the "My Antonia" demo project. (For those who have not seen it, it is at http://www.openreader.org/myantonia/; it is still an early "beta", but it's been a real learning experience for several of us working on it.)

Jon