ietf-822

Re: UTF-8 versions (was: Re: RFC 2047 and gatewaying)

2003-01-09 21:45:48

Andrew Gierth wrote:
"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:


 Bruce> Clearly RFC 2279 provides for 5- and 6-byte utf-8 sequences,
 Bruce> which are not provided for by Unicode through 3.2.

that is because Unicode does not have any codepoints above U+10FFFF,
and therefore such sequences could not be generated. I understand that
there are also no assigned ISO 10646 codepoints above 10FFFF, and also
that there will be no future assignments of such.

From the point of view of parsing a stream of octets, according to one
"utf-8" specification a certain sequence *is* a utf-8 sequence, and according
to other "utf-8" specifications it is *not* a utf-8 sequence.  I.e. one
cannot design a parser to recognize "utf-8" in a sequence of octets
unless one specifies *which* of the mutually-incompatible "utf-8"
specifications is to be used, viz. whether or not the 5- and 6-byte
sequences are "utf-8".
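
To make the point concrete, here is a rough sketch of such a recognizer in
Python. It is an illustration only, not code taken from either specification.
The function name is_utf8 and the flag allow_long_forms are my own inventions:
True stands for the RFC 2279 reading (sequences of up to 6 octets, values up
to 0x7FFFFFFF), False for the Unicode 3.2 reading (at most 4 octets, nothing
above U+10FFFF). Checks for overlong forms and surrogates are deliberately
left out, since they are not what is in dispute here.

```python
def is_utf8(octets: bytes, allow_long_forms: bool) -> bool:
    """Rough well-formedness check under one of two "utf-8" definitions.

    allow_long_forms=True  : RFC 2279 reading, up to 6 octets per sequence.
    allow_long_forms=False : Unicode 3.2 reading, up to 4 octets, <= U+10FFFF.
    Overlong forms and surrogates are NOT checked (simplification).
    """
    max_len = 6 if allow_long_forms else 4
    max_value = 0x7FFFFFFF if allow_long_forms else 0x10FFFF

    i = 0
    while i < len(octets):
        lead = octets[i]
        # Classify the lead octet: sequence length and its payload bits.
        if lead < 0x80:
            length, value = 1, lead
        elif 0xC0 <= lead < 0xE0:
            length, value = 2, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            length, value = 3, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:
            length, value = 4, lead & 0x07
        elif 0xF8 <= lead < 0xFC:
            length, value = 5, lead & 0x03
        elif 0xFC <= lead < 0xFE:
            length, value = 6, lead & 0x01
        else:
            return False  # stray continuation octet, or 0xFE / 0xFF

        # This is where the two definitions part ways.
        if length > max_len or i + length > len(octets):
            return False

        # Each trailing octet must look like 10xxxxxx.
        for b in octets[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                return False
            value = (value << 6) | (b & 0x3F)

        if value > max_value:
            return False
        i += length
    return True


# A 5-octet sequence that encodes the value 0x200000.
sample = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
print(is_utf8(sample, allow_long_forms=True))
print(is_utf8(sample, allow_long_forms=False))
```

The same five octets are accepted under one setting of the flag and rejected
under the other, so the recognizer has no single answer until the caller
says which definition applies.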

I stand by my assertion that (from the aforementioned point of view) there
are multiple incompatible "utf-8" specifications.