"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:
I repeatedly asked you to be specific on the USEFOR list when you
started making claims about differing UTF-8 versions there; since
you declined to respond, I will deal with it here instead.
Bruce> "Most notable among the corrigenda to the Standard is a
Bruce> further tightening of the definition of UTF-8, to eliminate
Bruce> irregular UTF-8 and to bring the Unicode specification of
Bruce> UTF-8 more completely into line with other specifications of
Bruce> UTF-8. "
Bruce> Obviously if the Unicode consortium states unequivocally that
Bruce> there are multiple utf-8 specifications which differ, there
Bruce> cannot be "precisely one" utf-8 specification.
The difference is solely between "illegal" and "irregular" sequences;
earlier versions of the UTF-8 definition in the Unicode spec made a
distinction between the two, though neither could be legally generated;
later versions remove the distinction.
Bruce> Clearly RFC 2279 provides for 5- and 6-byte utf-8 sequences,
Bruce> which are not provided for by Unicode through 3.2.
That is because Unicode does not have any codepoints above U+10FFFF,
and therefore such sequences could not be generated. I understand that
there are also no assigned ISO 10646 codepoints above U+10FFFF, and
that none will be assigned in future.
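To put numbers on that, here is a rough sketch in Python. The table
follows the sequence-length ranges laid out in RFC 2279; the names
RFC2279_LIMITS and utf8_length are made up for this illustration and
come from no spec or library.

```python
# Largest value representable by each sequence length in the
# RFC 2279 layout (1 to 6 bytes). Names here are illustrative only.
RFC2279_LIMITS = [
    (0x7F, 1),
    (0x7FF, 2),
    (0xFFFF, 3),
    (0x1FFFFF, 4),
    (0x3FFFFFF, 5),
    (0x7FFFFFFF, 6),
]


def utf8_length(code_point):
    """Return how many bytes the RFC 2279 layout needs for code_point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for limit, length in RFC2279_LIMITS:
        if code_point <= limit:
            return length
    raise ValueError("value does not fit in 31 bits")


# The highest Unicode codepoint still fits in 4 bytes.
top_of_unicode = utf8_length(0x10FFFF)
# The 5-byte form is first needed well above the Unicode range.
first_five_byte = utf8_length(0x200000)
```

So the 5- and 6-byte forms are only reachable for values from 0x200000
upwards, which neither standard assigns.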
Bruce> And some 4-byte sequences differ in different Unicode versions
Bruce> (particularly those corresponding to surrogate pairs).
No, there are no 4-byte sequences that differ in any way between the
different versions.
The only difference has been in the wording describing isolated
constituents of a surrogate pair (i.e. values D800-DFFF, which would
encode into 3 bytes if that were allowed) which have been illegally
encoded. It has never been legal in any version of the Unicode spec to
encode the constituents of a surrogate pair into UTF-8 separately;
they must be encoded together as a single character (i.e. decoded
into UCS-4 and then encoded as a 4-byte sequence). The only difference
is in the description of whether such sequences are "illegal" or
"irregular" or "ill-formed" or some such term.
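A minimal sketch of the "decode into UCS-4, then encode" step, in
Python. The helper name combine_surrogates is mine, chosen for the
example; the arithmetic is the standard UTF-16 surrogate formula. The
last part relies on the behaviour of Python 3's built-in utf-8 codec,
which refuses to encode a lone surrogate.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair D800 DC00 is the single character U+10000 ...
code_point = combine_surrogates(0xD800, 0xDC00)
# ... and that one character is encoded as one 4-byte sequence.
encoded = chr(code_point).encode("utf-8")

# Encoding one half of the pair on its own is rejected by the codec.
try:
    chr(0xD800).encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
```

Here encoded comes out as the four bytes F0 90 80 80, and the attempt
to encode D800 by itself fails.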
Bruce> Whether or not Unicode 4.0 and/or the draft mentioned above
Bruce> will introduce additional variants is another matter.
Since there have been no substantive changes made to UTF-8 _ever_
since its adoption as any sort of standard, why would you expect
future changes to introduce incompatibilities?
--
Andrew.