On Feb 24, 2005, at 2:53 PM, Bruce Lilly wrote:
> o 16-bit Unicode matched well with 16-bit wchar_t
wchar_t is 32 bits on all the computers near me. This is one reason
why UTF-16 is irritating for the C programmer.
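For concreteness, a quick check along these lines (a sketch only,
assuming a hosted C99 compiler; wchar_t's width is
implementation-defined, and the surrogate arithmetic is standard
UTF-16):

    /* Print the local wchar_t width and show that a supplementary-plane
     * character needs a surrogate pair in UTF-16 but one 32-bit wchar_t. */
    #include <stdio.h>
    #include <stdint.h>
    #include <wchar.h>

    int main(void)
    {
        uint32_t cp = 0x1D11E;  /* MUSICAL SYMBOL G CLEF, outside the BMP */

        printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));

        /* UTF-16 cannot hold this in one unit; it becomes a surrogate pair. */
        uint32_t v  = cp - 0x10000;
        uint16_t hi = (uint16_t)(0xD800 + (v >> 10));
        uint16_t lo = (uint16_t)(0xDC00 + (v & 0x3FF));
        printf("U+%04X as UTF-16 code units: 0x%04X 0x%04X\n",
               (unsigned)cp, (unsigned)hi, (unsigned)lo);
        return 0;
    }

On a typical Linux or Mac box that prints a 4-byte wchar_t plus the
pair 0xD834 0xDD1E, which is exactly the bookkeeping a C programmer
would rather not do.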
> o while the raw data size doubles in going from 16 bits per character
>   to 32 bits, the size of tables (normalization, etc.) indexed by
>   character increases by more than 4 orders of magnitude. [yes,
>   table compression can be used -- provided the locations and sizes
>   of "holes" are guaranteed -- but that requires additional
>   computational power]
Unicode data is usually persisted as either UTF-8 or UTF-16, so the
fact that 21 bits are potentially available is irrelevant to the space
actually occupied. For everyday in-memory character processing,
consensus is building around UTF-8 (in C/C++ land) and UTF-16 (in
Java/C# land). I expect wchar_t to become increasingly irrelevant.
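To put numbers on the space point, here is a rough sketch that counts
bytes for a few sample code points under UTF-8, UTF-16 and UTF-32; the
length rules are hand-rolled for illustration, and real code would lean
on a library such as ICU rather than doing this by hand:

    #include <stdio.h>
    #include <stdint.h>

    static size_t utf8_len(uint32_t cp)   /* bytes per code point, UTF-8  */
    {
        if (cp < 0x80)    return 1;
        if (cp < 0x800)   return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    static size_t utf16_len(uint32_t cp)  /* bytes per code point, UTF-16 */
    {
        return (cp < 0x10000) ? 2 : 4;    /* BMP unit vs surrogate pair   */
    }

    int main(void)
    {
        /* 'T', 'é', '中', musical G clef: ASCII, Latin-1, BMP CJK,
         * and a supplementary-plane character. */
        uint32_t text[] = { 0x54, 0xE9, 0x4E2D, 0x1D11E };
        size_t n = sizeof text / sizeof text[0];
        size_t u8 = 0, u16 = 0, u32 = 4 * n;

        for (size_t i = 0; i < n; i++) {
            u8  += utf8_len(text[i]);
            u16 += utf16_len(text[i]);
        }
        printf("%zu code points: UTF-8 %zu bytes, UTF-16 %zu bytes, "
               "UTF-32 %zu bytes\n", n, u8, u16, u32);
        return 0;
    }

For this little sample UTF-8 and UTF-16 both come to 10 bytes against
16 for UTF-32; what you actually pay for is the encoding you choose,
not the 21-bit code space.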
It would be convenient if you could use 64k bitmaps to define character
classes, but you can't, since the code space runs well past U+FFFF, so
naive flat-table implementations are simply not suitable.
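The usual workaround is a two-stage (trie-style) lookup rather than a
flat bitmap. The sketch below is illustrative only: the table contents
are placeholders, not real Unicode data, which would normally be
generated from the UCD files.

    /* Two-stage lookup for a character-class test over U+0000..U+10FFFF:
     * stage 1 maps each 256-code-point block to a shared 32-byte bitmap,
     * so empty blocks ("holes") all share one zero bitmap and cost
     * almost nothing. */
    #include <stdio.h>
    #include <stdint.h>

    #define NBLOCKS ((0x10FFFF + 1) / 256)   /* 4352 first-stage entries */

    static uint16_t stage1[NBLOCKS];          /* block -> bitmap index    */
    static uint8_t  stage2[][32] = {          /* 256-bit bitmaps          */
        { 0 },                                /* bitmap 0: empty block    */
        /* further bitmaps would be generated from the Unicode data files */
    };

    static int in_class(uint32_t cp)
    {
        if (cp > 0x10FFFF)
            return 0;
        const uint8_t *bits = stage2[stage1[cp >> 8]];
        return (bits[(cp & 0xFF) >> 3] >> (cp & 7)) & 1;
    }

    int main(void)
    {
        /* Always 0 with the placeholder tables above. */
        printf("U+0041 in class? %d\n", in_class(0x41));
        return 0;
    }

That keeps the lookup at two array indexings per character while the
total table size stays far below the 4+ MB a flat 32-bit-indexed table
would need.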
My experience since 1996 is that the Unicode people are neither
capricious nor malicious. I gather yours is different. -Tim