Re: [Encode] UCS/UTF mess and Surrogate Handlings

On Saturday, April 6, 2002, at 01:16 , Jarkko Hietaniemi wrote:

Yes.  I know that.  My question is whether we support CONVERSION.
Internals have nothing to do with that.  When we say UCS-2,
\x{10000}-\x{10ffff} must be discarded or croak for error.  When we say


I suggest croak.

UTF-16, however, we have to convert them into surrogate pairs when we
encode and decode back to \x{10000}-\x{10ffff} when we decode.
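
For reference, the surrogate arithmetic being discussed can be sketched in C roughly as below. This is only an illustration of the UTF-16 rule, not code from utf8.c or Encode; the helper names cp_to_surrogates and surrogates_to_cp are made up for this sketch, and it assumes the caller has already checked the ranges.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of utf8.c or Encode. */

/* Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair. */
static void cp_to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                            /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 | (cp >> 10));    /* top 10 bits: high surrogate */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low 10 bits: low surrogate */
}

/* Join a high/low surrogate pair back into a single code point. */
static uint32_t surrogates_to_cp(uint16_t hi, uint16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000
         + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

With these, \x{10000} would come out as the pair D800 DC00 and \x{10ffff} as DBFF DFFF, and joining either pair gives back the original code point.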


Well, there seems to be

  Perl_utf16_to_utf8(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)

in utf8.c that seems to be doing surrogate arithmetics, but I think
that's not much used (if at all), and I cannot see utf8_to_utf16.
(There's also

  Perl_utf16_to_utf8_reversed(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)

)


Okay, here is my strategy.

        decode("\x{D800}-\x{DFFF}")      encode("\x{10000}-\x{10FFFF}")
------------------------------------------------------------------------
UCS-2   croak                            croak
UTF-16  convert via chr()                convert via ord()
UTF-32  croak (not supposed to be here)  simply let out
------------------------------------------------------------------------

So no matter what, the internal utf8 string remains surrogate free. Keep the
camel away from such poison!

And as for conversions, I will abstain from the Perl API for
compatibility. I believe (un)*pack will behave the same for binary data
in the future.
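
To make the strategy table concrete, here is a rough C sketch of it as two lookup functions. The enum and function names are invented for this illustration and do not exist in Encode or the Perl API. "croak" stands for raising an error, "convert" for the surrogate pair conversion, and "pass" for writing the code point out unchanged. It assumes the decode column is about surrogate code units (D800..DFFF) showing up in the input.

```c
#include <stdint.h>

/* Hypothetical names for illustration only; not from Encode or utf8.c. */
enum ucs_encoding { ENC_UCS2, ENC_UTF16, ENC_UTF32 };
enum ucs_action   { ACT_CROAK, ACT_CONVERT, ACT_PASS };

/* True if u falls in the surrogate block D800..DFFF. */
static int is_surrogate_unit(uint32_t u)
{
    return u >= 0xD800 && u <= 0xDFFF;
}

/* Decode side: what to do when a surrogate unit appears in the input. */
static enum ucs_action on_decode_surrogate(enum ucs_encoding e)
{
    /* Only UTF-16 may legitimately contain surrogates. */
    return e == ENC_UTF16 ? ACT_CONVERT : ACT_CROAK;
}

/* Encode side: what to do with a code point in 10000..10FFFF. */
static enum ucs_action on_encode_supplementary(enum ucs_encoding e)
{
    switch (e) {
    case ENC_UCS2:  return ACT_CROAK;    /* not representable in UCS-2 */
    case ENC_UTF16: return ACT_CONVERT;  /* emit a surrogate pair */
    case ENC_UTF32: return ACT_PASS;     /* fits in one 32-bit unit */
    }
    return ACT_CROAK;                    /* unknown encoding: be strict */
}
```

Either way the decoded side only ever sees whole code points, which is the point of the table.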


Dan the Encode Maintainer
