perl-unicode

Re: C<use utf8> dynamic scope?

1999-06-08 15:30:53
Chaim Frenkel writes:
: >>>>> "LW" == Larry Wall <larry(_at_)wall(_dot_)org> writes:
: 
: LW> You can already say
: 
: LW>     use utf8 'Big5';
: 
: LW> for that sort of thing--it just defaults to Unicode.  Two caveats:
: 
: Then to handle different encodings one would:
: 
:       {use utf8 'Unicode';    .... }
:       {use utf8 'Big5'; .... }
: 
: Why not split it out?

Because almost everyone who wants utf8 will also want Unicode by default.

: LW> If you want to get a big headache, think about
: LW>     use utf16 'Big5';
: LW> You'll note that utf16 is not so very 'u'.
: 
: You went over my head here. What do you mean?

The 'u' supposedly stands for "universal".  I'm afraid utf16 (unlike
utf8) is far from universal, in that it has many Unicode assumptions
built in.  In particular, surrogate pairs stink.  There are also
big-endian/little-endian issues that are only partly solved by use of
the byte-order mark U+FEFF, which shows up as U+FFFE when the bytes
are swapped.  You can't use cmp on it, because a surrogate pair sorts
ahead of the characters from U+E000 through U+FFFF.  But the surrogates
are the killer.  Basically, you can't use utf16 to map arbitrary-sized
integers, because there are holes.  You can't have a character with the
value D800, or D801, or D802, or... anything up through DFFF.  So you
can't stuff any arbitrary Asian 16-bit character set into utf16 as you
can into utf8.
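
The holes and the cmp problem can be sketched in a few lines.  This is
only an illustration in Python, not how Perl does it: the function names
are made up for the example, and the utf8 side assumes the original
bit layout (up to 31 bits, no reserved values) rather than any
particular library's encoder.

```python
def utf8_style_encode(n: int) -> bytes:
    """Encode any integer in [0, 2**31) using the original utf8 bit layout.

    Hypothetical helper: no value is treated specially, so every 16-bit
    code from any character set has an encoding.
    """
    if n < 0 or n >= 1 << 31:
        raise ValueError("value out of range")
    if n < 0x80:
        return bytes([n])
    # (sequence length, exclusive upper limit, lead-byte marker)
    for length, limit, lead in (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    ):
        if n < limit:
            break
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (n & 0x3F))  # continuation byte, low 6 bits
        n >>= 6
    out.append(lead | n)               # lead byte carries the top bits
    return bytes(reversed(out))


def utf8_style_decode(data: bytes) -> int:
    """Decode one sequence produced by utf8_style_encode."""
    first = data[0]
    if first < 0x80:
        return first
    length = 0
    while first & (0x80 >> length):    # count leading 1 bits
        length += 1
    n = first & (0x7F >> length)
    for b in data[1:length]:
        n = (n << 6) | (b & 0x3F)
    return n


def utf16_encode(n: int) -> list:
    """Encode one value as a list of 16-bit units; D800..DFFF are holes."""
    if 0xD800 <= n <= 0xDFFF:
        raise ValueError(f"{n:04X} is reserved for surrogates")
    if n < 0x10000:
        return [n]
    if n > 0x10FFFF:
        raise ValueError("beyond the reach of surrogate pairs")
    n -= 0x10000
    return [0xD800 + (n >> 10), 0xDC00 + (n & 0x3FF)]


def utf16_decode_units(units: list) -> list:
    """Decode 16-bit units, pairing up surrogates the way utf16 requires."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired lead surrogate {u:04X}")
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError(f"unpaired trail surrogate {u:04X}")
        else:
            out.append(u)
            i += 1
    return out


# The value D800 round-trips through the utf8-style layout...
assert utf8_style_decode(utf8_style_encode(0xD800)) == 0xD800
# ...but two 16-bit codes D800, DC00 collapse into one value in utf16.
assert utf16_decode_units([0xD800, 0xDC00]) == [0x10000]
# Unit-by-unit comparison of utf16 disagrees with numeric order;
# byte-by-byte comparison of the utf8-style encoding agrees with it.
assert utf16_encode(0x10000) < utf16_encode(0xFF5E)
assert utf8_style_encode(0x10000) > utf8_style_encode(0xFF5E)
```

So a 16-bit character set that happens to use D800 for something has
nowhere to put it in utf16, while the utf8-style layout just treats it
as another number.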

Larry
