On Saturday, April 6, 2002, at 01:16, Jarkko Hietaniemi wrote:
> > Yes. I know that. My question is whether we support CONVERSION.
> > Internals have nothing to do with that. When we say UCS-2,
> > \x{10000}-\x{10ffff} must be discarded or croak for error.
>
> I suggest croak.
>
> > When we say UTF-16, however, we have to convert them into surrogate
> > pairs when we encode and decode back to \x{10000}-\x{10ffff} when we
> > decode.
>
> Well, there seems to be
>
>     Perl_utf16_to_utf8(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)
>
> in utf8.c that seems to be doing surrogate arithmetic, but I think
> that's not much used (if at all), and I cannot see utf8_to_utf16.
> (There's also
>
>     Perl_utf16_to_utf8_reversed(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)
>
> .)

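For reference, the surrogate arithmetic in question boils down to roughly the following. This is only a minimal sketch in C, not the code in utf8.c; the function names cp_to_surrogates and surrogates_to_cp are made up for illustration and are not part of any Perl API.

```c
#include <stdint.h>

/* Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair.
 * Returns 1 on success, 0 if cp is outside that range.
 * (Hypothetical helper, for illustration only.) */
static int cp_to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                            /* leaves a 20-bit offset */
    *hi = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits    */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
    return 1;
}

/* Combine a high/low surrogate pair back into one code point.
 * Returns 1 on success, 0 if the pair is malformed.
 * (Hypothetical helper, for illustration only.) */
static int surrogates_to_cp(uint16_t hi, uint16_t lo, uint32_t *cp)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    *cp = 0x10000
        + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 1;
}
```

The two functions are inverses over U+10000..U+10FFFF, which is all a UTF-16 codec needs on top of plain 16-bit reads and writes.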
Okay, here is my strategy.
         decode("\x{D800}-\x{DFFF}")       encode("\x{10000}-\x{10FFFF}")
------------------------------------------------------------------------
UCS-2    croak                             croak
UTF-16   convert via chr()                 convert via ord()
UTF-32   croak (not supposed to be here)   simply let out
------------------------------------------------------------------------
So no matter what, the utf8 string inside remains surrogate-free. Keep the
camel away from such poison!
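To spell the table out as logic, here is a rough sketch in C. It is not how Encode is implemented; the enum names and the two functions decode_surrogate_policy and encode_supplementary_policy are invented for this illustration only.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum encoding { ENC_UCS2, ENC_UTF16, ENC_UTF32 };
enum action   { ACT_PASS, ACT_CONVERT, ACT_CROAK };

/* Decode side: what to do with a value read from the input stream. */
static enum action decode_surrogate_policy(enum encoding enc, uint32_t unit)
{
    if (unit < 0xD800 || unit > 0xDFFF)
        return ACT_PASS;                 /* not a surrogate: outside the table */
    switch (enc) {
    case ENC_UTF16: return ACT_CONVERT;  /* join the pair into one character */
    case ENC_UCS2:                       /* UCS-2 has no surrogates */
    case ENC_UTF32:                      /* not supposed to be here */
    default:        return ACT_CROAK;
    }
}

/* Encode side: what to do with a character taken from the utf8 string. */
static enum action encode_supplementary_policy(enum encoding enc, uint32_t cp)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return ACT_PASS;                 /* not supplementary: outside the table */
    switch (enc) {
    case ENC_UTF16: return ACT_CONVERT;  /* split into a surrogate pair */
    case ENC_UTF32: return ACT_PASS;     /* simply let out */
    case ENC_UCS2:
    default:        return ACT_CROAK;    /* cannot be represented in UCS-2 */
    }
}
```

Either way, only the UTF-16 codec ever produces or consumes surrogates, and it does so at the boundary, so nothing in D800-DFFF ends up in the decoded string.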
And as for the conversions, I will stay away from the Perl API for
compatibility. I believe (un)*pack will behave the same for binary data
in the future.
Dan the Encode Maintainer