
On Tue, 23 Aug 2005, Joshua Hutchinson wrote:
I may be wrong here (Marcello is my Unicode guru), but I thought UTF-8 was a superset of Latin1? Anyway, I know in this particular file there are quite a few UTF-8 encoded characters (and a couple more that should be that we found yesterday backchannel).
Well, if you look merely at abstract numbered code points, it is correct to say that the initial code points of Unicode are numbered the same as ISO Latin-1. However, you have to realize that, while ISO Latin-1 is a legacy encoding in which each character is encoded using only one byte, the nature of Unicode has led to different methods (Unicode Transformation Formats) of actually encoding each character as a series of bytes.

One way to look at UTF-8 is as a compressed format. (When used to encode texts which consist primarily of the characters found in lower ASCII, UTF-16, which uses at least two bytes for each character, results in noticeably longer files.) ASCII characters are encoded the same in UTF-8 as in common legacy single-byte encodings, but all higher-numbered characters are represented by multi-byte sequences. That includes the upper half of Latin-1, so a Latin-1 file containing accented characters is not byte-for-byte the same as its UTF-8 equivalent.

Excerpt from: http://en.wikipedia.org/wiki/UTF-8

So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.

I hope that is somewhat clear....

Andrew
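
P.S. If you want to check the byte counts yourself, here is a minimal sketch, assuming a Python 3 interpreter is at hand. The helper name utf8_len and the sample characters are my own picks for illustration, not anything taken from the file under discussion.

```python
# Rough illustration only; assumes Python 3, where str holds Unicode code points.

def utf8_len(ch):
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# One sample character from each length class (hypothetical picks).
samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, e with acute accent, in the upper half of Latin-1
    "\u20ac",      # U+20AC, euro sign, elsewhere in the BMP
    "\U00010348",  # U+10348, Gothic letter hwair, outside the BMP
]

for ch in samples:
    print("U+%04X takes %d byte(s) in UTF-8" % (ord(ch), utf8_len(ch)))

# Same code point, different bytes: Latin-1 stores U+00E9 as the single
# byte E9, while UTF-8 stores it as the two bytes C3 A9.
latin1_bytes = "\u00e9".encode("latin-1")
utf8_bytes = "\u00e9".encode("utf-8")
print(latin1_bytes.hex(), "versus", utf8_bytes.hex())
```

The last two lines are the point of the exercise: the code point number is shared between Latin-1 and Unicode, but the bytes on disk differ for anything above 127.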