Russ Allbery writes:
However, UTF-8 penalizes non-ASCII characters spacewise, and is
somewhat more complex to parse and reason about than a pure multibyte
character set.
Have you ever written a program to handle Unicode characters correctly?
Do you realize that UTF-16 is not a ``pure multibyte'' encoding outside
the Basic Multilingual Plane? Do you realize that Unicode has zero-width
accents, so any ``byte count equals width'' rule can't possibly work?
The notion that UTF-16 is simpler than UTF-8 seems to come from broken
programs that (1) don't handle zero-width characters, (2) don't handle
double-width characters, and (3) are limited to the BMP.
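
A minimal Python sketch (not from the post, examples assumed) illustrating all three points: a non-BMP character forces UTF-16 into a surrogate pair, a combining accent makes code-unit count diverge from display width, and CJK characters occupy two columns:

```python
import unicodedata

# (3) Outside the BMP: U+1D11E MUSICAL SYMBOL G CLEF needs a
# surrogate pair in UTF-16 -- two 16-bit code units, 4 bytes --
# so UTF-16 is not a fixed-width ("pure multibyte") encoding there.
clef = "\U0001D11E"
assert len(clef.encode("utf-16-be")) == 4

# (1) Zero-width accents: 'e' + U+0301 COMBINING ACUTE ACCENT is two
# code points (4 bytes in UTF-16) but renders as a single column,
# so "byte count equals width" fails.
e_acute = "e\u0301"
assert len(e_acute) == 2
assert len(e_acute.encode("utf-16-be")) == 4

# (2) Double-width characters: east_asian_width reports 'W' (wide)
# for CJK ideographs, which occupy two terminal columns.
assert unicodedata.east_asian_width("\u6f22") == "W"
```

Any of the three checks breaks a program that equates code units with displayed columns, which is the point of the rebuttal.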
---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago