Jean-Marc,
Tuesday, January 7, 2003, 10:15:12 AM, you wrote:
Jean-Marc> And here raw UTF-8 is a clear winner. No complex implementation
rules,
Jean-Marc> no border cases, one string will always have one and only one
Jean-Marc> representation.
Correct me if I am wrong, but I believe the "native" representation for
Unicode is something like 21 bits, since the code space runs through
U+10FFFF. (No, folks, please don't correct me that it is any other number
over 16.)
Hence, UTF-16 and UTF-8 are methods of encoding a larger bit space into a
smaller representation space, producing variable-length strings. One crams
the larger space into a 16-bit world. The other crams it into an 8-bit
world.
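As a minimal sketch of that point (mine, not Dave's, and using Python
only for convenience): the same code points come out at different byte
lengths under UTF-8 and UTF-16, because both are variable-length
encodings of the larger code space.

```python
# Illustrative only: compare the byte cost of a few code points
# under UTF-8 and UTF-16 (big-endian, no BOM).
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    print(f"U+{ord(ch):06X}: UTF-8 = {len(utf8)} bytes, "
          f"UTF-16 = {len(utf16)} bytes")
```

ASCII wins under UTF-8 (1 byte vs 2), while a character outside the
Basic Multilingual Plane costs 4 bytes either way.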
Depending on the situation, there can be processing or space
efficiencies gained by one encoding over another.
But there is no theoretical or aesthetic superiority that can be claimed by
one over the other.
Cramming those bits into a 7-bit environment is just one more cramming
effort. It stands equal to the others as an alternative that has benefits
and detriments.
The confusion on this issue probably stems from the fact that you can use
existing data viewers -- such as text editors -- to view the result of a
7-bit encoding and cannot use such "legacy" services for viewing UTF-8 or
UTF-16.
If you do not have UTF-8 or UTF-16 tools, you cannot view the data at all.
If you have a text editor, you can view the 7-bit encoding. The fact that
it is "ugly" is therefore actually a feature, not a bug.
Unless you think that "invisible or broken" is better than "ugly".
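To make the trade-off concrete (again my sketch, not part of the
original mail): base64 is one example of cramming UTF-8 bytes into a
pure 7-bit repertoire. The result is ugly, but every byte is printable
ASCII, so any legacy text editor can display it.

```python
import base64

s = "caf\u00e9"                      # one non-ASCII character
raw = s.encode("utf-8")              # 8-bit bytes; a legacy viewer may
                                     # show mojibake or nothing at all
seven_bit = base64.b64encode(raw).decode("ascii")
print(seven_bit)                     # ugly, but viewable anywhere
assert all(ord(c) < 128 for c in seven_bit)
```

The cost is roughly a 4/3 size expansion and an extra decoding step;
the benefit is that "invisible or broken" becomes merely "ugly".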
d/
--
Dave <mailto:dcrocker@brandenburg.com>
Brandenburg InternetWorking <http://www.brandenburg.com>
t +1.408.246.8253; f +1.408.850.1850