perl-unicode

[Encode] UCS/UTF mess and Surrogate Handlings

2002-04-05 08:11:06
On Friday, April 5, 2002, at 11:10 , Jarkko Hietaniemi wrote:
Change 15745 by jhi(_at_)alpha on 2002/04/05 13:07:21

        Integrate perlio;
        
        Not only did UCS-2 have dodgy name it was buggy.

Affected files ...

... //depot/perl/ext/Encode/lib/Encode/10646_1.pm#4 integrate

Differences ...

I've just ci'd 1.21 before I got this.   Hell.  1.22 that is.

-__PACKAGE__->Define(qw(UCS-2));
+__PACKAGE__->Define(qw(UCS-2BE UCS-2));

This one is already done (with UCS-2 relocated to Alias.pm).

@@ -30,7 +30,7 @@
     {
        my $ch = substr($uni,0,1,'');
        my $x  = ord($ch);
-       unless ($x < 32768)
+       unless ($x <= 0xffff)
        {
            last if ($chk);
            $x = 0;
End of Patch.
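To make the effect of that one-line change concrete, here is a minimal sketch. The two subs are hypothetical stand-ins for the old and new tests, not code from 10646_1.pm. Every value up to 0xffff fits in one 16-bit unit, so the old limit of 32768 rejected the upper half of the BMP for no reason.

```perl
use strict;
use warnings;

# Hypothetical predicates mirroring the two versions of the range test.
sub fits_old { my ($x) = @_; return $x < 32768 ? 1 : 0 }     # before the patch
sub fits_new { my ($x) = @_; return $x <= 0xffff ? 1 : 0 }   # after the patch

for my $x (0x7fff, 0x8000, 0xffff, 0x10000) {
    # pack('n', ...) emits one big-endian 16-bit unit, as UCS-2BE needs.
    my $bytes = fits_new($x) ? unpack('H*', pack('n', $x)) : 'n/a';
    printf "U+%04X old=%d new=%d bytes=%s\n",
        $x, fits_old($x), fits_new($x), $bytes;
}
```

U+8000 and U+FFFF are accepted only by the new test, while U+10000 is rejected by both because it needs more than 16 bits.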

I have reviewed the code following this and found that it is *really* UCS-2BE, not UTF-16, in the sense that it does not handle surrogates (encode() simply croaks for chars above the BMP). Internally perl does support 0x10000 and above, so why not support UTF-16 AND UCS-2 CORRECTLY and DISTINCTLY? I also found that UTF-32 is missing (well, no one uses it yet, but it is well specified by the Unicode Consortium). I'll clean up the UCS/UTF mess. It won't take much time.
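For reference, the difference between the two encodings can be sketched as follows. Both helper names are hypothetical and this is not the Encode implementation, only the standard surrogate arithmetic: subtract 0x10000, then split the remaining 20 bits across a high surrogate (0xD800 base) and a low surrogate (0xDC00 base).

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code point as UTF-16BE bytes.
sub utf16be_bytes {
    my ($cp) = @_;
    die sprintf("U+%04X is a lone surrogate\n", $cp)
        if $cp >= 0xD800 && $cp <= 0xDFFF;
    die sprintf("U+%X is beyond U+10FFFF\n", $cp) if $cp > 0x10FFFF;
    return pack('n', $cp) if $cp <= 0xFFFF;    # BMP: one 16-bit unit
    my $v  = $cp - 0x10000;                    # 20 bits remain
    my $hi = 0xD800 + ($v >> 10);              # high (lead) surrogate
    my $lo = 0xDC00 + ($v & 0x3FF);            # low (trail) surrogate
    return pack('nn', $hi, $lo);
}

# Hypothetical helper: UCS-2BE has no surrogate mechanism, so anything
# above the BMP is unrepresentable. Returns undef in that case and
# leaves it to the caller to croak or substitute.
sub ucs2be_bytes {
    my ($cp) = @_;
    return undef if $cp > 0xFFFF;
    return pack('n', $cp);
}

print unpack('H*', utf16be_bytes(0x1D11E)), "\n";    # d834dd1e
```

For BMP characters the two helpers produce identical bytes; they differ only above U+FFFF, which is the distinction argued for above.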

Oh, the same bug was there in UCS-2LE.

Dan the Encode Maintainer

P.S. Does utf8 support surrogates? The surrogate pair is definitely the ugliest SOB of Unicode, but without it we can't print \x{10000}-\x{10ffff} to the stream....
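On the UTF-8 side, the encoding form itself represents code points above U+FFFF directly as four-byte sequences, with no surrogate pair involved. A quick check, assuming a perl recent enough to provide the core utf8::encode function (5.8 or later):

```perl
use strict;
use warnings;

my $s = "\x{10000}";    # first code point above the BMP
utf8::encode($s);       # convert in place to UTF-8 encoded bytes

print length($s), "\n";           # 4
print unpack('H*', $s), "\n";     # f0908080
```

So surrogates only matter for the 16-bit encodings; whether a decoder should accept surrogate values that turn up inside UTF-8 input is a separate question.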