On Feb 24, 2005, at 2:53 PM, Bruce Lilly wrote:
> o 16-bit Unicode matched well with 16-bit wchar_t
wchar_t is 32 bits on all the computers near me. This is one reason
why UTF-16 is irritating for the C programmer.
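For concreteness, a quick check along these lines (a sketch only,
assuming a hosted C99 compiler; wchar_t's width is
implementation-defined, and the surrogate arithmetic is standard
UTF-16):

    /* Print the local wchar_t width and show that a supplementary-plane
     * character needs a surrogate pair in UTF-16 but one 32-bit wchar_t. */
    #include <stdio.h>
    #include <stdint.h>
    #include <wchar.h>

    int main(void)
    {
        uint32_t cp = 0x1D11E;  /* MUSICAL SYMBOL G CLEF, outside the BMP */

        printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));

        /* UTF-16 cannot hold this in one unit; it becomes a surrogate pair. */
        uint32_t v  = cp - 0x10000;
        uint16_t hi = (uint16_t)(0xD800 + (v >> 10));
        uint16_t lo = (uint16_t)(0xDC00 + (v & 0x3FF));
        printf("U+%04X as UTF-16 code units: 0x%04X 0x%04X\n",
               (unsigned)cp, (unsigned)hi, (unsigned)lo);
        return 0;
    }

On a typical Linux or Mac box that prints a 4-byte wchar_t plus the
pair 0xD834 0xDD1E, which is exactly the bookkeeping a C programmer
would rather not do.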
> o while the raw data size doubles in going from 16 bits per character
>   to 32 bits, the size of tables (normalization, etc.) indexed by
>   character increases by more than 4 orders of magnitude. [yes,
>   table compression can be used -- provided the locations and sizes
>   of "holes" are guaranteed -- but that requires additional
>   computational power]
Unicode data is usually persisted as either UTF-8 or UTF-16, so the
fact that 21 bits are potentially available is irrelevant to the space
actually occupied. For everyday in-memory character processing,
consensus is building around UTF-8 (in C/C++ land) and UTF-16 (in
Java/C# land). I expect wchar_t to become increasingly irrelevant.
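To put numbers on the space point, here is a rough sketch that counts
bytes for a few sample code points under UTF-8, UTF-16 and UTF-32; the
length rules are hand-rolled for illustration, and real code would lean
on a library such as ICU rather than doing this by hand:

    #include <stdio.h>
    #include <stdint.h>

    static size_t utf8_len(uint32_t cp)   /* bytes per code point, UTF-8  */
    {
        if (cp < 0x80)    return 1;
        if (cp < 0x800)   return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    static size_t utf16_len(uint32_t cp)  /* bytes per code point, UTF-16 */
    {
        return (cp < 0x10000) ? 2 : 4;    /* BMP unit vs surrogate pair   */
    }

    int main(void)
    {
        /* 'T', 'é', '中', musical G clef: ASCII, Latin-1, BMP CJK,
         * and a supplementary-plane character. */
        uint32_t text[] = { 0x54, 0xE9, 0x4E2D, 0x1D11E };
        size_t n = sizeof text / sizeof text[0];
        size_t u8 = 0, u16 = 0, u32 = 4 * n;

        for (size_t i = 0; i < n; i++) {
            u8  += utf8_len(text[i]);
            u16 += utf16_len(text[i]);
        }
        printf("%zu code points: UTF-8 %zu bytes, UTF-16 %zu bytes, "
               "UTF-32 %zu bytes\n", n, u8, u16, u32);
        return 0;
    }

For this little sample UTF-8 and UTF-16 both come to 10 bytes against
16 for UTF-32; what you actually pay for is the encoding you choose,
not the 21-bit code space.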
It would be convenient if you could use 64k bitmaps to define character
classes, but you can't, since the code space runs well past U+FFFF, so
naive flat-table implementations are simply not suitable.
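The usual workaround is a two-stage (trie-style) lookup rather than a
flat bitmap. The sketch below is illustrative only: the table contents
are placeholders, not real Unicode data, which would normally be
generated from the UCD files.

    /* Two-stage lookup for a character-class test over U+0000..U+10FFFF:
     * stage 1 maps each 256-code-point block to a shared 32-byte bitmap,
     * so empty blocks ("holes") all share one zero bitmap and cost
     * almost nothing. */
    #include <stdio.h>
    #include <stdint.h>

    #define NBLOCKS ((0x10FFFF + 1) / 256)   /* 4352 first-stage entries */

    static uint16_t stage1[NBLOCKS];          /* block -> bitmap index    */
    static uint8_t  stage2[][32] = {          /* 256-bit bitmaps          */
        { 0 },                                /* bitmap 0: empty block    */
        /* further bitmaps would be generated from the Unicode data files */
    };

    static int in_class(uint32_t cp)
    {
        if (cp > 0x10FFFF)
            return 0;
        const uint8_t *bits = stage2[stage1[cp >> 8]];
        return (bits[(cp & 0xFF) >> 3] >> (cp & 7)) & 1;
    }

    int main(void)
    {
        /* Always 0 with the placeholder tables above. */
        printf("U+0041 in class? %d\n", in_class(0x41));
        return 0;
    }

That keeps the lookup at two array indexings per character while the
total table size stays far below the 4+ MB a flat 32-bit-indexed table
would need.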
My experience since 1996 is that the Unicode people are neither
capricious nor malicious. I gather yours is different. -Tim