Charles Lindsey wrote:
> In <3E1731A6(_dot_)5030604(_at_)alex(_dot_)blilly(_dot_)com> Bruce Lilly
> <blilly(_at_)erols(_dot_)com> writes:
>> It's worse than that; there are at least 3 different versions of UTF-8.
>> They differ in the longer multi-byte sequences.
>
> No, there is precisely one, as defined by the relevant Unicode documents.
> See RFC 2044 and draft-yergeau-rfc2279bis-02.txt, with all of which Usefor
> is fully compatible.
Charles, in spite of having been shown the differences in the past, you persist
in claiming that there are none. RFC 2044 is not a Unicode document, and has
long been obsoleted. One clue that you are wrong is the following quotation
from Unicode Technical Report #28:
"Most notable among the corrigenda to the Standard is a further tightening of the
definition of UTF-8, to eliminate irregular UTF-8 and to bring the Unicode specification
of UTF-8 more completely into line with other specifications of UTF-8."
Obviously, if the Unicode Consortium states unequivocally that there are
multiple UTF-8 specifications which differ, there cannot be "precisely
one" UTF-8 specification.
Here, once again:
Unicode 2.0, table A-3 (applies through Unicode 3.0):
Unicode Value         1st Byte  2nd Byte  3rd Byte  4th Byte
000000000xxxxxxx      0xxxxxxx
00000yyyyyxxxxxx      110yyyyy  10xxxxxx
zzzzyyyyyyxxxxxx      1110zzzz  10yyyyyy  10xxxxxx
110110wwwwzzzzyy +
110111yyyyxxxxxx      11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
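For anyone who wants to check the last row by hand, here is a rough Python sketch of how a surrogate pair maps onto the 4-byte form. It assumes the footnote that, as I recall it, accompanies the table (uuuuu = wwww + 1), which is not reproduced above; the function name is made up for illustration.

```python
def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Map a UTF-16 surrogate pair to 4 bytes, following the table's last row.

    Assumes uuuuu = wwww + 1 (from the table's footnote, not shown above).
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    # High surrogate: 110110 wwww zzzz yy
    wwww = (high >> 6) & 0xF
    zzzz = (high >> 2) & 0xF
    yy_high = high & 0x3
    # Low surrogate: 110111 yyyy xxxxxx
    yyyy_low = (low >> 6) & 0xF
    xxxxxx = low & 0x3F
    uuuuu = wwww + 1
    return bytes([
        0xF0 | (uuuuu >> 2),                  # 11110uuu
        0x80 | ((uuuuu & 0x3) << 4) | zzzz,   # 10uuzzzz
        0x80 | (yy_high << 4) | yyyy_low,     # 10yyyyyy
        0x80 | xxxxxx,                        # 10xxxxxx
    ])
```

For example, the pair D83D DE00 (U+1F600) comes out as F0 9F 98 80, which matches what Python's own encoder produces for that character.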
RFC 2044 is obsolete; here's the table from RFC 2279:
UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx
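To make the RFC 2279 table concrete, here is a minimal Python sketch that applies it mechanically to any 31-bit UCS-4 value. The function name is hypothetical, and the sketch deliberately does nothing beyond the table itself (in particular it does not treat surrogate values specially).

```python
def encode_rfc2279(value: int) -> bytes:
    """Encode a UCS-4 value by applying the RFC 2279 table row by row."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    if value <= 0x7F:
        return bytes([value])
    # (upper bound of the row, lead-octet marker, continuation octets)
    rows = [
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    ]
    for upper, lead, count in rows:
        if value <= upper:
            # The lead octet carries the most significant bits.
            octets = [lead | (value >> (6 * count))]
            # Each continuation octet carries the next 6 bits.
            for i in range(count - 1, -1, -1):
                octets.append(0x80 | ((value >> (6 * i)) & 0x3F))
            return bytes(octets)
    raise AssertionError("unreachable: every 31-bit value matches a row")
```

With this, `encode_rfc2279(0x20AC)` gives E2 82 AC, while `encode_rfc2279(0x200000)` gives the 5-octet sequence F8 88 80 80 80 and `encode_rfc2279(0x7FFFFFFF)` a 6-octet one starting with FD.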
Unicode 3.2, Unicode Technical Report #28:
Code Points           1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF    80..BF
U+0800..U+0FFF        E0        A0..BF    80..BF
U+1000..U+CFFF        E1..EC    80..BF    80..BF
U+D000..U+D7FF        ED        80..9F    80..BF
U+D800..U+DFFF        ill-formed
U+E000..U+FFFF        EE..EF    80..BF    80..BF
U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF    F4        80..8F    80..BF    80..BF
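The Unicode 3.2 table translates almost directly into a checker. The following Python sketch transcribes the byte ranges row by row; the names `WELL_FORMED` and `is_well_formed` are my own, and this is meant as an illustration of the table rather than a vetted validator.

```python
# One entry per table row: (first-byte range, [range for each trailing byte]).
WELL_FORMED = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_well_formed(data: bytes) -> bool:
    """Return True if every sequence in data matches some row of the table."""
    i = 0
    while i < len(data):
        for (first_lo, first_hi), trail in WELL_FORMED:
            if first_lo <= data[i] <= first_hi:
                break
        else:
            return False  # first byte appears in no row (C0, C1, F5..FF, 80..BF)
        tail = data[i + 1 : i + 1 + len(trail)]
        if len(tail) != len(trail):
            return False  # sequence is truncated
        if any(not (lo <= b <= hi) for b, (lo, hi) in zip(tail, trail)):
            return False  # trailing byte outside the range the row allows
        i += 1 + len(trail)
    return True
```

Under this table ED A0 80 (a lone surrogate in 3-byte form) and F8 88 80 80 80 (a 5-byte sequence) are both rejected, although each fits the RFC 2279 bit patterns.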
Clearly RFC 2279 provides for 5- and 6-byte UTF-8 sequences, which are
not provided for by Unicode through 3.2. And some 4-byte sequences
differ between Unicode versions (particularly those corresponding
to surrogate pairs). Whether Unicode 4.0 and/or the draft
mentioned above will introduce additional variants is another matter.
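The surrogate-pair difference is easy to see in practice. This Python sketch builds what I understand the "irregular" form to be (each half of a surrogate pair encoded as its own 3-byte sequence) and compares it with the 4-byte form. The helper names are made up; the sketch relies on Python's "surrogatepass" error handler and on the assumption that Python's strict UTF-8 decoder follows the tightened definition.

```python
def irregular_form(ch: str) -> bytes:
    """Encode one supplementary character as two 3-byte surrogate sequences."""
    units = ch.encode("utf-16-be")
    if len(ch) != 1 or len(units) != 4:
        raise ValueError("expected a single character above U+FFFF")
    high = int.from_bytes(units[0:2], "big")
    low = int.from_bytes(units[2:4], "big")
    # "surrogatepass" lets the encoder emit surrogate code points as-is.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (high, low))


def strict_decodes(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For U+1F600 the 4-byte form is F0 9F 98 80 and the irregular form is ED A0 BD ED B8 80; `strict_decodes` accepts the former and rejects the latter, and it likewise rejects the 5-byte sequence F8 88 80 80 80.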