D J Bernstein <djb(_at_)cr(_dot_)yp(_dot_)to> writes:
> Russ Allbery writes:

>> However, UTF-8 penalizes non-ASCII characters spacewise, and is
>> somewhat more complex to parse and reason about than a pure multibyte
>> character set.

> Have you ever written a program to handle Unicode characters correctly?
> Do you realize that UTF-16 is not a ``pure multibyte'' encoding outside
> the Basic Multilingual Plane?

I'm sorry, I should have been clearer. The intended comparison was not to
UTF-16, which combines the worst of both worlds due to surrogate pairs,
but to UTF-32, which is (so far as I know) a pure multibyte encoding in
the sense I meant: every code point takes exactly four bytes.

> Do you realize that Unicode has zero-width accents, so any ``byte count
> equals width'' rule can't possibly work?

Yes. I know about combining marks and other similar characters, and I'm
not saying that even UTF-32 is simple, just that UTF-32 is somewhat
simpler to parse and reason about than UTF-8.
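
To make the size and width points concrete, here is a minimal Python
sketch. It is only an illustration: the function name encoded_sizes and
the sample strings are my own assumptions, and counting code points with
a zero combining class is a rough stand-in for display width, not a real
width calculation.

```python
import unicodedata

def encoded_sizes(text):
    """Return code point count, rough base-character count, and byte sizes.

    The "-le" codecs are used so that no byte order mark is counted.
    """
    return {
        "code_points": len(text),
        # Code points with combining class 0; combining marks are skipped.
        "base_chars": sum(1 for c in text if unicodedata.combining(c) == 0),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Hypothetical samples: ASCII, precomposed e-acute, e plus a combining
# acute accent (U+0301), and a character outside the BMP (U+1D11E).
samples = ["abc", "\u00e9", "e\u0301", "\U0001d11e"]

for s in samples:
    print(ascii(s), encoded_sizes(s))
```

In UTF-32 the byte count is always four times the number of code points,
which is the property I had in mind. The third sample shows why that
still isn't a width rule: two code points, one visible character. The
fourth shows the UTF-16 surrogate pair: one code point, four bytes.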
--
Russ Allbery (rra(_at_)stanford(_dot_)edu)
<http://www.eyrie.org/~eagle/>