Re: UTF-8 versions (was: Re: RFC 2047 and gatewaying)

"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:


 Bruce> U+DBC0 U+DC01 U+DBC0 U+DC65 U+DBC0 U+DC6E U+0066 U+006F U+006F
 Bruce> U+DBC0 U+DC01 U+DBC0 U+DC7F
 
 Bruce> That has 5 surrogate pairs which were not in the original
 Bruce> Unicode 3.1 text.

But those surrogate pairs are precisely the ones that represent the
original codepoints; i.e., if you merely converted the original
character sequence into UTF-16, you would get the same result.
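
To make that concrete, here is a minimal Python sketch of the standard
UTF-16 surrogate arithmetic. The function names and the sample
codepoint (U+1D11E, one of the musical symbols) are my own choices for
illustration and are not taken from the quoted sequence:

```python
def to_surrogates(cp):
    """Split a codepoint above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary codepoint")
    v = cp - 0x10000                      # 20-bit offset into planes 1..16
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(hi, lo):
    """Inverse mapping: recombine a surrogate pair into its codepoint."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


cp = 0x1D11E                              # example value only (assumption)
hi, lo = to_surrogates(cp)
print(f"U+{cp:04X} -> {hi:04X} {lo:04X}")

# The mapping is reversible, so no information is added or lost.
assert from_surrogates(hi, lo) == cp

# Python's own UTF-16 encoder produces the same two 16-bit values.
assert chr(cp).encode("utf-16-be") == hi.to_bytes(2, "big") + lo.to_bytes(2, "big")
```

The pair is fully determined by the codepoint and vice versa, which is
all I mean by "precisely the ones that represent the original
codepoints".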

The fact that those codepoints may not be known to the receiving
application is due to the difference between versions of the Unicode
specification itself, and has nothing to do with UTF-8.


 Bruce> It has to do with the difference in the two utf-8
 Bruce> specifications, specifically for 4-byte sequences, which in
 Bruce> one case (but not the other) involve surrogate pairs.  The
 Bruce> original had 8 "characters" (actually at most 5 could really
 Bruce> be called characters in any meaningful sense; the others are
 Bruce> syntactic glue), whereas the result has 13 16-bit
 Bruce> "characters", 10 of which are "garbage".

The original, if expressed in 16-bit form (i.e. in UTF-16), would also
have had the exact same sequence of 16-bit values.

The difference in the number of characters and whether or not they are
"garbage" is solely due to the difference in Unicode versions and not to
the transport format used; there is no way to encode a Unicode 3.1 text
that contains codepoints not used in Unicode 2.0 in such a way that this
will not happen.

 Bruce> One "utf-8" encoding which transforms between uniform-width
 Bruce> 16-bit "characters" and a sequence of bytes cannot be said to
 Bruce> be "the same as" another "utf-8" specification which
 Bruce> transforms between variable-width "characters" and a series of
 Bruce> bytes; they are different because the domain of one side of
 Bruce> the transformation differs.

But this is a difference of _description_ not a genuine difference in
the underlying definition. If I can transform A<->C by one definition
and B<->C by another, where the A<->B transformation is known and
well-defined, then the two definitions are logically equivalent even
though they have different domains and describe different functions.
Either definition can be obtained from the other merely by composing
it with the A<->B transformation or its inverse.

You can describe UTF-8 either in terms of its relationship to UTF-16
or in terms of its relationship to UCS-4. (The older definition you
quoted from the Unicode specs does the former; other definitions,
including the later Unicode specs, do the latter.) These descriptions
look different, but only until you also factor in the UCS-4 to
UTF-16 transformation, at which point they become equivalent.
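
As a rough illustration of that equivalence, here is a Python sketch
with two encoders: one defined on codepoints (the UCS-4 view) and one
defined on 16-bit units that builds the 4-byte form directly from the
bits of a surrogate pair (the UTF-16 view). All function names are
hypothetical, the bit layout in pair_to_utf8 is my own restatement
rather than a quotation from any spec, and lone surrogates are not
validated:

```python
def utf8_from_scalar(cp):
    """UCS-4 view: one codepoint -> 1 to 4 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def pair_to_utf8(hi, lo):
    """UTF-16 view: surrogate pair -> 4 bytes, never forming the codepoint."""
    w = (hi >> 6) & 0xF                          # plane number minus one
    z = (hi >> 2) & 0xF
    y = ((hi & 0x3) << 4) | ((lo >> 6) & 0xF)
    x = lo & 0x3F
    u = w + 1                                    # 5-bit plane number
    return bytes([0xF0 | (u >> 2),
                  0x80 | ((u & 0x3) << 4) | z,
                  0x80 | y,
                  0x80 | x])


def utf8_from_utf16(units):
    """Encode a list of 16-bit units; lone surrogates are not checked."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out += pair_to_utf8(u, units[i + 1])
            i += 2
        else:
            out += utf8_from_scalar(u)
            i += 1
    return bytes(out)


def utf16_from_scalars(cps):
    """The known, well-defined UCS-4 -> UTF-16 transformation."""
    units = []
    for cp in cps:
        if cp >= 0x10000:
            v = cp - 0x10000
            units += [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
        else:
            units.append(cp)
    return units


text = [0x1D11E, 0x66, 0x6F, 0x6F, 0x10FFFF]     # sample data (assumption)
via_ucs4 = b"".join(utf8_from_scalar(cp) for cp in text)
via_utf16 = utf8_from_utf16(utf16_from_scalars(text))
assert via_ucs4 == via_utf16 == "".join(map(chr, text)).encode("utf-8")
```

Composing the UTF-16 based encoder with the UCS-4 to UTF-16 step gives
byte-for-byte the same output as the UCS-4 based encoder, which is the
sense in which the two descriptions define the same thing.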

Which transformation function you actually _use_ then of course
depends on what your desired domain is.

[snip language-tagging stuff irrelevant to this issue]


 Bruce> The language tagging is relevant for a number of reasons:

It's irrelevant to my argument and to the question of whether there
is more than one version of UTF-8 (independently of the question of
whether there is more than one version of Unicode, which is obviously
true).

I am not trying to argue for use of utf-8 language tags (or even for
raw utf-8 in headers shared with email - I'll leave that position for
others to take).

It encodes the exact same sequence of codepoints regardless of
specification. The fact that some of those codepoints are not defined
in earlier Unicode versions has nothing to do with the use of UTF-8.


 Bruce> It is related to the different utf-8 specifications inasmuch
 Bruce> as the different specifications in conjunction with the change
 Bruce> from a 16-bit pre-3.1 character width and the newly-introduced
 Bruce> codepoints (including language tags) together result in
 Bruce> obfuscated content with a different number of bits.  It's
 Bruce> difficult to cleanly separate those issues, since if musical
 Bruce> notes, baroque encoding of language tags etc. hadn't been
 Bruce> added (deviating from the original Unicode Design Principles),
 Bruce> the character width wouldn't have had to be increased beyond
 Bruce> 16 bits

sure, and the addition of about 43 thousand more Han ideographs had
nothing to do with the increase in number of bits. Right.

-- 
Andrew.