"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:
Bruce> I did not decline; indeed in
Bruce> <3D8DF17D(_dot_)4040505(_at_)alex(_dot_)blilly(_dot_)com> posted to the
Bruce> Usefor list
Bruce> Sun, 22 Sep 2002 12:36:13 -0400 I wrote:
that was hardly an answer to my question of Sep 30th, which was part
of a different branch of the discussion.
[snip]
Bruce> U+E0001 U+E0065 U+E006E U+0066 U+006F U+006F U+E0001 U+E007F
Bruce> Transformed to Unicode 3.1 UTF-8 (hex values for 8-bit codes):
Bruce> F3 A0 80 81 F3 A0 81 A5 F3 A0 81 AE foo F3 A0 80 81 F3 A0 81 BF
Bruce> The Unicode 2.0 UTF-8 reverse transformation yields:
Bruce> U+DB40 U+DC01 U+DB40 U+DC65 U+DB40 U+DC6E U+0066 U+006F U+006F
Bruce> U+DB40 U+DC01 U+DB40 U+DC7F
Bruce> That has 5 surrogate pairs which were not in the original
Bruce> Unicode 3.1 text.
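[a quick sketch (Python 3, not part of the original mail) that derives
both forms quoted above directly from the codepoint list, rather than
trusting anyone's hand arithmetic:]

```python
# The plane-14 tag characters plus "foo" from the quoted example.
codepoints = [0xE0001, 0xE0065, 0xE006E, 0x66, 0x6F, 0x6F, 0xE0001, 0xE007F]
text = "".join(chr(cp) for cp in codepoints)

# Unicode 3.1 UTF-8: each plane-14 character becomes a 4-byte sequence.
print(" ".join(f"{b:02X}" for b in text.encode("utf-8")))

# UTF-16BE code units: the same characters appear as surrogate pairs,
# which is exactly what a Unicode 2.0 UTF-8 reader reconstructs.
units = text.encode("utf-16-be")
print(" ".join(f"U+{int.from_bytes(units[i:i+2], 'big'):04X}"
               for i in range(0, len(units), 2)))
```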
but those surrogate pairs are precisely the ones that represent the
original codepoints; i.e., if you merely converted the original
character sequence to UTF-16, you would get the same result.
The fact that those codepoints may not be known to the receiving
application stems from the difference between the Unicode versions
themselves, and has nothing to do with UTF-8.
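[to make that concrete, a minimal sketch (mine, not from the original
exchange): applying the standard UTF-16 surrogate-pair arithmetic in
reverse recovers exactly the original plane-14 codepoints, so the
pairs simply *are* those codepoints in UTF-16 form:]

```python
def combine_surrogates(high: int, low: int) -> int:
    # Standard UTF-16 surrogate-pair arithmetic, run in reverse:
    # high carries the top 10 bits, low the bottom 10, offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The pairs a Unicode 2.0 UTF-8 reader produces for U+E0001 and U+E007F:
assert combine_surrogates(0xDB40, 0xDC01) == 0xE0001
assert combine_surrogates(0xDB40, 0xDC7F) == 0xE007F
```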
[snip language-tagging stuff irrelevant to this issue]
Bruce> That of course is not a parsing issue, but one of semantic
Bruce> interpretation of the "utf-8". According to one of the many
Bruce> "utf-8" specifications, the "utf-8" stream encodes a
Bruce> language-tagged string, while in at least one other "utf-8"
Bruce> specification it encodes something quite different.
it encodes the exact same sequence of codepoints regardless of
specification. The fact that some of those codepoints are not defined
in earlier Unicode versions has nothing to do with the use of UTF-8.
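[one more sketch of mine making the same point from the byte side: the
quoted UTF-8 byte stream decodes to one and the same scalar sequence;
which Unicode version assigns meaning to those scalars is a separate
question from the encoding form:]

```python
# The UTF-8 byte stream from the quoted example.
stream = bytes.fromhex(
    "F3 A0 80 81 F3 A0 81 A5 F3 A0 81 AE 66 6F 6F F3 A0 80 81 F3 A0 81 BF"
)
# Any UTF-8 definition covering 4-byte sequences recovers these scalars.
decoded = [ord(c) for c in stream.decode("utf-8")]
assert decoded == [0xE0001, 0xE0065, 0xE006E, 0x66, 0x6F, 0x6F,
                   0xE0001, 0xE007F]
print(" ".join(f"U+{cp:04X}" for cp in decoded))
```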
--
Andrew.