perl-unicode

Re: [Encode] UCS/UTF mess and Surrogate Handlings

2002-04-05 09:29:39
On Saturday, April 6, 2002, at 01:16 , Jarkko Hietaniemi wrote:
Yes.  I know that.  My question is whether we support CONVERSION.
Internals have nothing to do with that.  When we say UCS-2,
\x{10000}-\x{10ffff} must be discarded or croak for error.  When we say

I suggest croak.

UTF-16, however, we have to convert them into surrogate pairs when we
encode and convert them back to \x{10000}-\x{10ffff} when we decode.
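
For reference, the surrogate arithmetic being discussed can be sketched in C roughly as follows. This is only an illustration of the standard UTF-16 mapping, not code taken from utf8.c or Encode; the function names cp_to_surrogates and surrogates_to_cp are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: split a code point in U+10000..U+10FFFF into a
 * UTF-16 surrogate pair.  Returns 0 on success, -1 if cp is out of range. */
static int cp_to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;
    cp -= 0x10000;                            /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits  -> high surrogate */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low 10 bits  -> low surrogate  */
    return 0;
}

/* Hypothetical helper: combine a high/low surrogate pair back into a code
 * point.  Returns 0 if the pair is not a valid high+low combination. */
static uint32_t surrogates_to_cp(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000
         + ((uint32_t)(hi - 0xD800) << 10)
         + (uint32_t)(lo - 0xDC00);
}
```

The first function corresponds to the encode direction and the second to the decode direction.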

Well, there seems to be

  Perl_utf16_to_utf8(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)

in utf8.c that seems to be doing surrogate arithmetic, but I think
it's not much used (if at all), and I cannot see a utf8_to_utf16.
(There's also

Perl_utf16_to_utf8_reversed(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)

Okay, here is my strategy.

        decode("\x{D800}-\x{DFFF}")        encode("\x{10000}-\x{10FFFF}")
------------------------------------------------------------------------
UCS-2   croak                              croak
UTF-16  convert via chr()                  convert via ord()
UTF-32  croak (not supposed to be here)    simply let out
------------------------------------------------------------------------

So no matter what, utf8 strings remain surrogate-free. Keep the camel away from such poison! As for the conversions, I will abstain from using the Perl API, for compatibility. I believe (un)*pack will behave the same for binary data in the future.
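
The strategy table above could be written down as a small decision function, sketched here in C. This is not Encode's actual code; the enum and function names are made up for illustration, and inputs the table does not mention are assumed to pass through unchanged.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum enc    { ENC_UCS2, ENC_UTF16, ENC_UTF32 };
enum action { ACT_PASS, ACT_CROAK, ACT_CONVERT };

static int is_surrogate(uint32_t v)     { return v >= 0xD800  && v <= 0xDFFF; }
static int is_supplementary(uint32_t v) { return v >= 0x10000 && v <= 0x10FFFF; }

/* Decode column: what to do with a value read from the input stream. */
static enum action decode_action(enum enc e, uint32_t unit)
{
    if (!is_surrogate(unit))
        return ACT_PASS;            /* not covered by the table: assumed fine */
    /* Only UTF-16 may legitimately carry surrogates; the others croak. */
    return e == ENC_UTF16 ? ACT_CONVERT : ACT_CROAK;
}

/* Encode column: what to do with a code point about to be written out. */
static enum action encode_action(enum enc e, uint32_t cp)
{
    if (!is_supplementary(cp))
        return ACT_PASS;            /* not covered by the table: assumed fine */
    switch (e) {
    case ENC_UCS2:  return ACT_CROAK;    /* UCS-2 cannot represent it */
    case ENC_UTF16: return ACT_CONVERT;  /* emit a surrogate pair */
    default:        return ACT_PASS;     /* UTF-32: simply let out */
    }
}
```

Either way, ACT_CONVERT on decode yields a single code point above \x{FFFF}, so no surrogate ever reaches the utf8 string.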

Dan the Encode Maintainer