ietf-822

Re: RFC 2047 and gatewaying

2003-01-04 12:31:50

D J Bernstein <djb(_at_)cr(_dot_)yp(_dot_)to> writes:
> Russ Allbery writes:
>
>> However, UTF-8 penalizes non-ASCII characters spacewise, and is
>> somewhat more complex to parse and reason about than a pure multibyte
>> character set.
>
> Have you ever written a program to handle Unicode characters correctly?
> Do you realize that UTF-16 is not a ``pure multibyte'' encoding outside
> the Basic Multilingual Plane?

I'm sorry, I should have been clearer.  The intended comparison was not to
UTF-16, which combines the worst of both worlds due to surrogate pairs,
but to UTF-32, which is (so far as I know) a pure multibyte encoding.
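
To make the size comparison concrete, here is a minimal sketch in Python that counts the bytes each encoding uses for a single code point. The helper name encoded_sizes and the sample characters are illustrative assumptions, not anything from the thread. The big-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(ch):
    """Return the encoded length in bytes of one code point.

    Hypothetical helper for illustration only. The "-be" codecs are
    used so that no byte order mark is added to the count.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


# Sample code points, chosen for illustration: ASCII, Latin-1 range,
# a CJK ideograph inside the BMP, and a musical symbol outside the BMP.
samples = {
    "U+0041": "\u0041",
    "U+00E9": "\u00e9",
    "U+4E2D": "\u4e2d",
    "U+1D11E": "\U0001D11E",
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

Under these assumptions, UTF-8 ranges from one to four bytes, UTF-16 needs four bytes (a surrogate pair) only for the code point outside the Basic Multilingual Plane, and UTF-32 is four bytes in every case.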

> Do you realize that Unicode has zero-width accents, so any ``byte count
> equals width'' rule can't possibly work?

Yes.  I know about combining marks and other similar characters, and I'm
not saying that even UTF-32 is simple, just that UTF-32 is somewhat
simpler to parse and reason about than UTF-8.
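
As a rough illustration of the combining-mark point, the sketch below (Python, with a hypothetical helper named describe) shows that a fixed four bytes per code point in UTF-32 still does not give a fixed number of bytes per displayed character. It assumes the standard unicodedata module and uses "e" followed by U+0301 COMBINING ACUTE ACCENT as the example.

```python
import unicodedata


def describe(text):
    """Summarize a string: code points, UTF-32 bytes, combining marks.

    Hypothetical helper for illustration only.
    """
    return {
        "code_points": len(text),
        # Big-endian variant so that no byte order mark is counted.
        "utf32_bytes": len(text.encode("utf-32-be")),
        # unicodedata.combining() returns a nonzero canonical combining
        # class for combining marks and 0 for other characters.
        "combining_marks": sum(
            1 for ch in text if unicodedata.combining(ch) != 0
        ),
    }


decomposed = "e\u0301"  # "e" plus a zero-width combining acute accent
composed = unicodedata.normalize("NFC", decomposed)  # single U+00E9

print(describe(decomposed))
print(describe(composed))
```

Both strings are normally rendered as one accented letter, yet the decomposed form is two code points (eight bytes in UTF-32) and the composed form is one (four bytes).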

-- 
Russ Allbery (rra(_at_)stanford(_dot_)edu)             
<http://www.eyrie.org/~eagle/>