Re: [Encode] UCS/UTF mess and Surrogate Handlings

On Saturday, April 6, 2002, at 12:18 , Jarkko Hietaniemi wrote:

P.S.  Does utf8 support surrogates?  Surrogate pair is definitely the


No.  Surrogates are solely for UTF-16.  There's no need for surrogates
in UTF-8 -- if we wanted to encode U+D800 using UTF-8, we *could* --
BUT we should not.  Encoding U+D800 as UTF-8 should not be attempted,
the whole surrogate space is a discontinuity in the Unicode code point
space reserved for the evils of UTF-16.

Yes, I know that. My question is whether we support CONVERSION. Internals have nothing to do with that. When we say UCS-2, \x{10000}-\x{10ffff} must either be discarded or make us croak with an error. When we say UTF-16, however, we have to convert them into surrogate pairs when we encode, and turn the pairs back into \x{10000}-\x{10ffff} when we decode.
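To make the conversion concrete, here is a minimal sketch of the surrogate arithmetic and of a strict UCS-2 encoder. It is written in Python rather than Perl purely for illustration. The function names (to_surrogates, from_surrogates, encode_ucs2be) are hypothetical and are not part of Encode. The UCS-2 encoder rejects only code points above U+FFFF; it makes no attempt to filter \x{D800}-\x{DFFF}.

```python
import struct


def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("U+%04X has no surrogate pair representation" % cp)
    v = cp - 0x10000                      # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(hi, lo):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair: %04X %04X" % (hi, lo))
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def encode_ucs2be(text):
    """Hypothetical strict UCS-2BE encoder: 'croak' on anything above U+FFFF."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("U+%04X cannot be represented in UCS-2" % cp)
        out += struct.pack(">H", cp)      # one big-endian 16-bit unit
    return bytes(out)
```

For example, to_surrogates(0x10000) gives (0xD800, 0xDC00), and from_surrogates reverses it. A UTF-16 encoder would call to_surrogates at the point where the UCS-2 encoder above raises.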

FYI, I have already cleaned up the UCS-2 part. The canonical names are now UCS-2BE and UCS-2LE (the modules are renamed as well to be more canonical: ucs_2(be|le).pm. Yes, underscore first). UTF-32 is trivial because we only have to pack the ord value into 32 bits. It's UTF-16 that is in question.
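The "pack the ord value" remark for UTF-32 can be sketched as follows, again in Python for illustration only. The name encode_utf32be is hypothetical, and the sketch assumes big-endian output with no byte order mark.

```python
import struct


def encode_utf32be(text):
    """Pack each character's code point as one big-endian 32-bit unit (no BOM)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)
```

No surrogate handling is needed here, since every code point up to U+10FFFF fits in a single 32-bit unit.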

If we want perl to be surrogate-free, then ironically we have to support UTF-16, because ucs_2*.pm simply lets \x{D800}-\x{DFFF} in so far.


Dan the Man with Too Many UnicodeS to tackle
