Re: [gutvol-d] ANNOUNCEMENT: XML has hit the PG archives!

23 Aug 2005

      On Tue, 23 Aug 2005, Joshua Hutchinson wrote:
...
I may be wrong here (Marcello is my Unicode guru), but I thought UTF-8
was a superset of Latin1?  Anyway, I know in this particular file there
are quite a few UTF-8 encoded characters (and a couple more that should
be that we found yesterday backchannel).

Well, if you look merely at abstract numbered code points,
it is correct to say that the first 256 code points of Unicode
are numbered the same as ISO Latin-1.

However, you have to realize that, while ISO Latin-1 is a
legacy encoding in which each character is encoded using
only one byte, the nature of Unicode has led to several
different methods (Unicode Transformation Formats) of
actually encoding each character in a series of bytes.
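
To make the difference between code points and bytes concrete, here is
a minimal Python sketch. The helper names (encoded_hex, is_valid_utf8)
are made up for illustration, and e-acute is just one convenient
character from the upper half of Latin-1.

```python
# One character, one code point, three different byte sequences.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, code point 0xE9

def encoded_hex(text, encoding):
    """Return the bytes of `text` in `encoding` as a hex string."""
    return text.encode(encoding).hex()

def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(hex(ord(ch)))                  # 0xe9 (same number as in Latin-1)
print(encoded_hex(ch, "latin-1"))    # e9   (one byte)
print(encoded_hex(ch, "utf-8"))      # c3a9 (two bytes)
print(encoded_hex(ch, "utf-16-le"))  # e900 (two bytes, no byte order mark)

# The Latin-1 byte on its own is not a valid UTF-8 sequence.
print(is_valid_utf8(ch.encode("latin-1")))  # False
```

The last line is the practical point: a Latin-1 file containing
characters above 127 cannot simply be relabelled as UTF-8.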

One way to look at UTF-8 is as a compressed format.
(When used to encode texts which consist primarily of
the characters found in lower ASCII, UTF-16, which uses
at least two bytes for each character, results in noticeably
longer files.) ASCII characters are encoded the same
in UTF-8 as in common legacy single-byte encodings,
but all higher-numbered characters are represented by
multi-byte sequences. So at the byte level UTF-8 is a
superset of ASCII, but not of Latin-1.
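
The size comparison can be checked with a short Python sketch. The
sample string and the helper name encoded_size are only illustrative,
and "utf-16-le" is used so that no byte order mark is counted.

```python
def encoded_size(text, encoding):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

sample = "Project Gutenberg"  # 17 characters, all plain ASCII

print(encoded_size(sample, "ascii"))      # 17
print(encoded_size(sample, "latin-1"))    # 17
print(encoded_size(sample, "utf-8"))      # 17
print(encoded_size(sample, "utf-16-le"))  # 34

# For pure ASCII text the UTF-8 bytes are identical to the legacy bytes.
same_bytes = sample.encode("utf-8") == sample.encode("latin-1")
print(same_bytes)  # True
```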

Excerpt from: http://en.wikipedia.org/wiki/UTF-8
   So the first 128 characters need one byte. The next 1920 characters
   need two bytes to encode. This includes Latin alphabet characters
   with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and
   Arabic characters. The rest of the BMP characters use three bytes,
   and additional characters are encoded in four bytes.
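
The byte counts in that excerpt can be confirmed by encoding the first
and last code point of each range. This is only a sketch; utf8_length
and RANGES are names invented here, not part of any library.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))

# First and last code point of each length class in UTF-8.
RANGES = [
    (0x0000, 0x007F),     # ASCII
    (0x0080, 0x07FF),     # Latin with diacritics, Greek, Cyrillic, ...
    (0x0800, 0xFFFF),     # rest of the BMP
    (0x10000, 0x10FFFF),  # supplementary planes
]

for first, last in RANGES:
    print(f"U+{first:04X}..U+{last:04X}: "
          f"{utf8_length(first)} to {utf8_length(last)} byte(s)")

# Size of the two-byte range, matching the "next 1920 characters".
two_byte_count = 0x07FF - 0x0080 + 1
print(two_byte_count)  # 1920
```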

I hope that is somewhat clear....

Andrew

Andrew Sly