ietf-822

Re: UTF-8 versions (was: Re: RFC 2047 and gatewaying)

2003-01-10 13:52:22

"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:

 Bruce> I did not decline; indeed in
 Bruce> <3D8DF17D(_dot_)4040505(_at_)alex(_dot_)blilly(_dot_)com> posted to the
 Bruce> Usefor list Sun, 22 Sep 2002 12:36:13 -0400 I wrote:

that was hardly an answer to my question of Sep 30th which was part
of a different branch of the discussion.

 [snip]

 Bruce> U+E0001 U+E0065 U+E006E U+0066 U+006F U+006F U+E0001 U+E007F

 Bruce> Transformed to Unicode 3.1 UTF-8 (hex values for 8-bit codes):

 Bruce> F3 A0 80 81 F3 A0 81 A5 F3 A0 81 AE foo F3 A0 80 81 F3 A0 81 BF

 Bruce> The Unicode 2.0 UTF-8 reverse transformation yields:

 Bruce> U+DB40 U+DC01 U+DB40 U+DC65 U+DB40 U+DC6E U+0066 U+006F U+006F
 Bruce> U+DB40 U+DC01 U+DB40 U+DC7F

 Bruce> That has 5 surrogate pairs which were not in the original
 Bruce> Unicode 3.1 text.

but those surrogate pairs are precisely the ones that represent the
original codepoints. i.e. if you merely converted the original
character sequence into UTF-16, you would get the same result.

The fact that those codepoints may not be known to the receiving
application is down to differences between versions of the Unicode
specification itself, and has nothing to do with UTF-8.
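
The equivalence is easy to check mechanically. What follows is only a
minimal Python sketch, not anything from the specifications; the helper
name utf16_units is made up for illustration. It encodes the example
codepoints to UTF-16 directly, and again after a round trip through
UTF-8, and compares the two results.

```python
import struct

# Codepoints from the example: tag characters spelling "en", then "foo",
# then a second U+E0001 followed by U+E007F.
code_points = [0xE0001, 0xE0065, 0xE006E, 0x66, 0x6F, 0x6F, 0xE0001, 0xE007F]
text = "".join(chr(cp) for cp in code_points)

utf8 = text.encode("utf-8")


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers.

    Hypothetical helper for this sketch only.
    """
    raw = s.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


# Path 1: convert the original character sequence straight to UTF-16.
direct = utf16_units(text)
# Path 2: serialise as UTF-8, decode again, then convert to UTF-16.
via_utf8 = utf16_units(utf8.decode("utf-8"))

print(" ".join("%02X" % b for b in utf8))
print(" ".join("U+%04X" % u for u in direct))
print("same result:", direct == via_utf8)
```

Both paths give the same code units. Each tag character becomes a pair
whose high surrogate is U+DB40, so U+E0001 comes out as U+DB40 U+DC01.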

 [snip language-tagging stuff irrelevant to this issue]

 Bruce> That of course is not a parsing issue, but one of semantic
 Bruce> interpretation of the "utf-8".  According to one of the many
 Bruce> "utf-8" specifications, the "utf-8" stream encodes a
 Bruce> language-tagged string, while in at least one other "utf-8"
 Bruce> specification it encodes something quite different.

it encodes the exact same sequence of codepoints regardless of
specification. The fact that some of those codepoints are not defined
in earlier Unicode versions has nothing to do with the use of UTF-8.
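
To illustrate: decoding UTF-8 is pure bit arithmetic on the octets and
never consults a table of assigned characters. The sketch below is a
deliberately simplified decoder written for this message (decode_utf8 is
a hypothetical name). It assumes well-formed input and does none of the
validation a real decoder needs, such as rejecting overlong forms or
truncated sequences.

```python
def decode_utf8(data):
    """Decode well-formed UTF-8 bytes into a list of codepoint integers.

    Sketch only: no checks for overlong forms, encoded surrogates,
    stray continuation octets, or truncated input.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: one octet
            cp, extra = b, 0
        elif b >> 5 == 0b110:        # 110xxxxx: two octets
            cp, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:       # 1110xxxx: three octets
            cp, extra = b & 0x0F, 2
        elif b >> 3 == 0b11110:      # 11110xxx: four octets
            cp, extra = b & 0x07, 3
        else:
            raise ValueError("unexpected lead octet 0x%02X" % b)
        # Each continuation octet contributes its low six bits.
        for cont in data[i + 1 : i + 1 + extra]:
            cp = (cp << 6) | (cont & 0x3F)
        out.append(cp)
        i += 1 + extra
    return out


stream = bytes.fromhex("F3A08081 F3A081A5 F3A081AE 666F6F F3A08081 F3A081BF")
decoded = decode_utf8(stream)
print(" ".join("U+%04X" % cp for cp in decoded))
```

Nothing in the loop depends on which Unicode version the receiver
implements. Whether U+E0001 means anything to the application is a
separate lookup that happens after decoding.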

-- 
Andrew.
