"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:
Bruce> U+DBC0 U+DC01 U+DBC0 U+DC65 U+DBC0 U+DC6E U+0066 U+006F U+006F
Bruce> U+DBC0 U+DC01 U+DBC0 U+DC7F
Bruce> That has 5 surrogate pairs which were not in the original
Bruce> Unicode 3.1 text.
but those surrogate pairs are precisely the ones that represent the
original codepoints; i.e., if you merely converted the original
character sequence into UTF-16, you would get the same result.
The fact that those codepoints may not be known to the receiving
application is due to the difference between versions of the Unicode
specification itself, and has nothing to do with UTF-8.
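
A minimal Python sketch of that claim, using the 16-bit values quoted
above (the variable names are illustrative only). It decodes the 13
units as UTF-16, lists the resulting codepoints, and checks that
converting those codepoints back into UTF-16, directly or after a trip
through UTF-8, gives the same 16-bit values:

```python
import struct

# The 13 sixteen-bit values from the quoted example.
units = [0xDBC0, 0xDC01, 0xDBC0, 0xDC65, 0xDBC0, 0xDC6E,
         0x0066, 0x006F, 0x006F,
         0xDBC0, 0xDC01, 0xDBC0, 0xDC7F]

# Pack them as big-endian 16-bit words and decode as UTF-16.
raw = struct.pack(">%dH" % len(units), *units)
text = raw.decode("utf-16-be")

# Each surrogate pair collapses into one codepoint above U+FFFF.
code_points = [ord(ch) for ch in text]
print(len(units), "16-bit units ->", len(code_points), "codepoints")
print(" ".join("U+%04X" % cp for cp in code_points))

# Converting the codepoints straight back to UTF-16 reproduces the units.
assert text.encode("utf-16-be") == raw

# Going through UTF-8 in between changes nothing.
utf8 = text.encode("utf-8")
assert utf8.decode("utf-8").encode("utf-16-be") == raw
```

The 5 pairs plus 3 ordinary units give 8 codepoints, whichever
transport format carried them.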
Bruce> It has to do with the difference in the two utf-8
Bruce> specifications, specifically for 4-byte sequences, which in
Bruce> one case (but not the other) involve surrogate pairs. The
Bruce> original had 8 "characters" (actually at most 5 could really
Bruce> be called characters in any meaningful sense; the others are
Bruce> syntactic glue), whereas the result has 13 16-bit
Bruce> "characters", 10 of which are "garbage".
The original, if expressed in 16-bit form (i.e. in UTF-16), would also
have had the exact same sequence of 16-bit values.
The difference in the number of characters and whether or not they are
"garbage" is solely due to the difference in Unicode versions and not to
the transport format used; there is no way to encode a Unicode 3.1 text
that contains codepoints not used in Unicode 2.0 in such a way that this
will not happen.
Bruce> One "utf-8" encoding which transforms between uniform-width
Bruce> 16-bit "characters" and a sequence of bytes cannot be said to
Bruce> be "the same as" another "utf-8" specification which
Bruce> transforms between variable-width "characters" and a series of
Bruce> bytes; they are different because the domain of one side of
Bruce> the transformation differs.
But this is a difference of _description_ not a genuine difference in
the underlying definition. If I can transform A<->C by one definition
and B<->C by another, where the A<->B transformation is known and
well-defined, then the two definitions are logically equivalent even
though they have different domains and describe different functions.
Either definition can be obtained from the other merely by composing
it with the A<->B transformation or its inverse.
You can describe UTF-8 either in terms of its relationship to UTF-16
or in terms of its relationship to UCS-4. (The older definition you
quoted from the Unicode specs does the former; other definitions,
including the later Unicode specs, do the latter.) These descriptions
will look different, but only until you also factor in the UCS-4 to
UTF-16 transformation, at which point they become equivalent.
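
The equivalence can be sketched in Python. The function names below are
hypothetical, and the bit layout in utf8_from_utf16 is my reading of the
surrogate-pair form of the 4-byte case (uuuuu = wwww + 1), so treat it
as an illustration under those assumptions rather than a quotation of
either spec. One function goes from a codepoint to bytes, the other from
16-bit units to bytes, and composing the latter with the UCS-4 to UTF-16
step gives the same bytes as the former:

```python
def ucs4_to_utf16(cp):
    """The A<->B step: one codepoint to one or two 16-bit units."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]

def short_form(v):
    """1- to 3-byte forms, identical in both descriptions (v < 0x10000)."""
    if v < 0x80:
        return bytes([v])
    if v < 0x800:
        return bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])
    return bytes([0xE0 | (v >> 12), 0x80 | ((v >> 6) & 0x3F), 0x80 | (v & 0x3F)])

def utf8_from_ucs4(cp):
    """UTF-8 described in terms of codepoints."""
    if cp < 0x10000:
        return short_form(cp)
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

def utf8_from_utf16(units):
    """UTF-8 described in terms of 16-bit units; assumes well-formed input."""
    out = bytearray()
    i = 0
    while i < len(units):
        hi = units[i]
        if 0xD800 <= hi <= 0xDBFF:
            lo = units[i + 1]
            # hi = 110110wwwwzzzzyy, lo = 110111yyyyxxxxxx
            u = ((hi >> 6) & 0xF) + 1          # uuuuu = wwww + 1
            zzzz = (hi >> 2) & 0xF
            yy = hi & 0x3
            yyyy = (lo >> 6) & 0xF
            out += bytes([0xF0 | (u >> 2),
                          0x80 | ((u & 0x3) << 4) | zzzz,
                          0x80 | (yy << 4) | yyyy,
                          0x80 | (lo & 0x3F)])
            i += 2
        else:
            out += short_form(hi)
            i += 1
    return bytes(out)

# Composing the 16-bit description with the UCS-4 -> UTF-16 step yields
# the same bytes as the codepoint description.
for cp in (0x66, 0xE9, 0x4E2D, 0x10000, 0xE0001, 0x100001, 0x10FFFF):
    assert utf8_from_utf16(ucs4_to_utf16(cp)) == utf8_from_ucs4(cp)
```

The two functions have different domains, but neither produces bytes
that the other would not.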
Which transformation function you actually _use_ then of course
depends on what your desired domain is.
[snip language-tagging stuff irrelevant to this issue]
Bruce> The language tagging is relevant for a number of reasons:
it's irrelevant to my argument and to the question of whether there
is more than one version of UTF-8 (independently of the question of
whether there is more than one version of Unicode, which is obviously
true).
I am not trying to argue for use of utf-8 language tags (or even for
raw utf-8 in headers shared with email - I'll leave that position for
others to take).
it encodes the exact same sequence of codepoints regardless of
specification. The fact that some of those codepoints are not defined
in earlier Unicode versions has nothing to do with the use of UTF-8.
Bruce> It is related to the different utf-8 specifications inasmuch
Bruce> as the different specifications in conjunction with the change
Bruce> from a 16-bit pre-3.1 character width and the newly-introduced
Bruce> codepoints (including language tags) together result in
Bruce> obfuscated content with a different number of bits. It's
Bruce> difficult to cleanly separate those issues, since if musical
Bruce> notes, baroque encoding of language tags etc. hadn't been
Bruce> added (deviating from the original Unicode Design Principles),
Bruce> the character width wouldn't have had to be increased beyond
Bruce> 16 bits
sure, and the addition of about 43 thousand more Han ideographs had
nothing to do with the increase in number of bits. Right.
--
Andrew.