
On Jan 25, 2012, at 9:52 AM, Carlo Traverso wrote:
I believe that the problem in handling UTF-8 submissions is unitame, the tool that the WWers use to recode UTF-8 to ISO Latin-1. It cannot handle simple things like bullets and Greek, but it would be very easy to extend it to handle these characters and more (I did).
Unitame is part of it. But PG wants to include a plain ASCII file. Look at the first paragraph of "The Black Star" (etext 35833).

It started in UTF-8 as 35833-0.txt:

    They poured through the man-made cañons

then went through unitame into Latin-1 as 35833-8.txt:

    They poured through the man-made cañons

but the ASCII version, 35833.txt, has

    They poured through the man-made canons

I would write that in the ASCII file as

    They poured through the man-made canyons

There are similar special cases for "oo" in a word, and others. Is there a tool to catch these special cases that the WWers could use if they are given a Latin-1 file?

This isn't a discussion of whether a UTF-8 file alone should be sufficient; that's something the WWers, Marcello, and Greg should agree upon. Right now, the ASCII version is in the mix, and unitame alone isn't enough to get it done.

Anybody up for writing a latin1tame program?

--Roger
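
For what it's worth, here is a rough sketch of what a latin1tame pass might look like. Nothing here is an existing PG tool: `SPECIAL_CASES`, `naive_ascii`, `match_case`, and `review_line` are all made-up names, and the only table entry taken from the example above is "cañons". The idea is to replace known special cases from a table, strip accents from everything else, and hand back the accent-stripped words so a WWer can eyeball them.

```python
import re
import unicodedata

# Hypothetical exception table: words whose ASCII form should NOT be a plain
# accent-stripped copy. "cañons" is the example from this thread; a real
# table would have to be built up by the WWers over time.
SPECIAL_CASES = {
    "cañons": "canyons",
}

# Runs of letters, including accented ones (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def naive_ascii(word):
    """Strip accents by decomposing and dropping the combining marks.

    Letters with no decomposition (for example ß or æ) are simply dropped,
    which is one more reason to review every flagged word by hand.
    """
    decomposed = unicodedata.normalize("NFKD", word)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def match_case(original, replacement):
    """Give the replacement the same capitalization as the original word."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement


def review_line(line):
    """Return (line with its words folded to ASCII, words needing review).

    Only words are handled. Non-ASCII punctuation and symbols are left
    alone, so this is not a complete Latin-1 to ASCII converter.
    """
    flagged = []

    def fix(match):
        word = match.group(0)
        if word.isascii():
            return word
        special = SPECIAL_CASES.get(word.lower())
        if special is not None:
            return match_case(word, special)
        flagged.append(word)  # no rule known: strip accents, ask a human
        return naive_ascii(word)

    return WORD_RE.sub(fix, line), flagged
```

To run it over the Latin-1 file, open 35833-8.txt with `encoding="latin-1"` and call `review_line` on each line. With the table above, the "cañons" sentence comes out with "canyons" and nothing flagged, while a word with no table entry is accent-stripped and reported for a human to check.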