UTF-8 versions (was: Re: RFC 2047 and gatewaying)

"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:


I repeatedly asked you to be specific on the USEFOR list when you
started making claims about differing UTF-8 versions there; since
you declined to respond, I will deal with it here instead.

 Bruce> "Most notable among the corrigenda to the Standard is a
 Bruce> further tightening of the definition of UTF-8, to eliminate
 Bruce> irregular UTF-8 and to bring the Unicode specification of
 Bruce> UTF-8 more completely into line with other specifications of
 Bruce> UTF-8. "

 Bruce> Obviously if the Unicode consortium states unequivocally that
 Bruce> there are multiple utf-8 specifications which differ, there
 Bruce> cannot be "precisely one" utf-8 specification.

The difference is solely between "illegal" and "irregular" sequences;
earlier versions of the UTF-8 definition in the Unicode spec made a
distinction between the two, though neither could be legally generated;
later versions remove the distinction.

 Bruce> Clearly RFC 2279 provides for 5- and 6-byte utf-8 sequences,
 Bruce> which are not provided for by Unicode through 3.2.

That is because Unicode does not have any codepoints above U+10FFFF,
and therefore such sequences could not be generated. I understand that
there are also no assigned ISO 10646 codepoints above U+10FFFF, and
that there will be no future assignments of such.
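
To make the byte counts concrete, here is a minimal sketch in Python
of the length table from RFC 2279. The function name rfc2279_length
is my own, purely for illustration. It shows that nothing at or below
U+10FFFF ever needs more than 4 bytes, so the 5- and 6-byte forms are
only reachable from values that Unicode does not have.

```python
def rfc2279_length(cp: int) -> int:
    """Return how many bytes RFC 2279 UTF-8 uses for the UCS-4 value cp."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("outside the UCS-4 range covered by RFC 2279")
    # Upper bound of each length class, for 1 through 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for nbytes, limit in enumerate(limits, start=1):
        if cp <= limit:
            return nbytes
    raise AssertionError("unreachable: range was checked above")


highest_unicode = rfc2279_length(0x10FFFF)   # last Unicode codepoint
first_five_byte = rfc2279_length(0x200000)   # beyond Unicode's range
first_six_byte = rfc2279_length(0x4000000)   # beyond Unicode's range
```

Here highest_unicode comes out as 4, while the 5- and 6-byte classes
begin at 0x200000 and 0x4000000 respectively, both well above U+10FFFF.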

 Bruce> And some 4-byte sequences differ in different Unicode versions
 Bruce> (particularly those corresponding to surrogate pairs).

No, there are no 4-byte sequences that differ in any way between the
different versions.

The only difference has been in the wording describing isolated
constituents of a surrogate pair (i.e. values D800-DFFF, which would
encode into 3 bytes each if that were allowed) that have been
illegally encoded. It has never been legal in any version of the
Unicode spec to encode the constituents of a surrogate pair into UTF-8
separately; they must be encoded together as a single character (i.e.
decoded into UCS-4 and then encoded as a 4-byte sequence). The only
difference is in whether such sequences are described as "illegal",
"irregular", "ill-formed", or some such term.

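As a rough illustration of the surrogate case, the following Python
sketch combines a surrogate pair into its UCS-4 value and encodes that
as one 4-byte sequence. The helper name combine_surrogates is
hypothetical; the codec calls and the "surrogatepass" error handler
are standard Python 3. Python's strict UTF-8 codec is used here only
as one example of a conforming implementation.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Decode a UTF-16 surrogate pair into a single UCS-4 value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Correct: decode the pair to UCS-4, then encode as ONE 4-byte sequence.
cp = combine_surrogates(0xD800, 0xDC00)
legal = chr(cp).encode("utf-8")

# Not allowed: encoding a lone surrogate. The strict codec refuses.
try:
    "\ud800".encode("utf-8")
    strict_rejects_encode = False
except UnicodeEncodeError:
    strict_rejects_encode = True

# The 3-byte form a lone surrogate would take if it were allowed,
# produced only by explicitly bypassing the check.
three_byte_form = "\ud800".encode("utf-8", "surrogatepass")

# A strict decoder rejects that 3-byte form as well.
try:
    three_byte_form.decode("utf-8")
    strict_rejects_decode = False
except UnicodeDecodeError:
    strict_rejects_decode = True
```

The pair D800 DC00 becomes U+10000 and encodes as F0 90 80 80, whereas
the lone D800 would come out as ED A0 80, which the strict codec
rejects in both directions.
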
 Bruce> Whether or not Unicode 4.0 and/or the draft mentioned above
 Bruce> will introduce additional variants is another matter.

Since there have been no substantive changes made to UTF-8 _ever_
since its adoption as any sort of standard, why would you expect
future changes to introduce incompatibilities?

-- 
Andrew.