perl-unicode

Re: Hassles with Unicode::String

2001-01-18 18:20:30
Paul Hoffman <phoffman(_at_)proper(_dot_)com> writes:

Using Unicode-String-2.06, I have the following test program:

=====

#!/usr/bin/perl -w

use Unicode::String qw(utf8 utf16 uchr);
Unicode::String->stringify_as('utf8');

@TestArr = ("0061 0062", "0063 12345");

foreach $TheString (@TestArr) {
     @AllHexIn = split(/\s+/, $TheString);
     $OutString = '';
     foreach $PartString (@AllHexIn)
         { $OutString .= utf8(uchr(hex("0x$PartString"))); }

     $TheLen = utf8($OutString)->length;

     $HexOfInput = '';
     foreach($i=0; $i<utf8($OutString)->length; $i++) {
         $HexOfInput .= utf8($OutString)->substr($i, 1)->hex . ' | ';
     }
     print "$TheString  $TheLen    $HexOfInput\n";
}

=====

The output is:

0061 0062  2    U+0061 | U+0062 |
0063 12345  3    U+0063 | U+d808 | U+df45 |

Why is uchr putting out UTF16 instead of UTF8 for the non-BMP character?

Unicode::String simply stores UTF16 internally.  The ->length and
->substr methods operate directly on the UTF16 representation without
looking for surrogates, so U+12345, stored as the surrogate pair
d808 df45, is counted and sliced as two characters.  This is actually
the wrong thing to do.  If these were fixed to know about surrogates
then I think this example would work as you expected.  The ->hex
function should probably also be made surrogate aware.
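
As a rough illustration of what "surrogate aware" would mean here, the sketch below uses plain Perl with no modules. It splits a code point into UTF16 code units and then counts characters while treating a high/low surrogate pair as one. The helper names to_utf16_units and surrogate_aware_length are made up for this example and are not part of Unicode::String.

```perl
#!/usr/bin/perl -w
use strict;

# Split a code point into UTF16 code units: one unit for the BMP,
# a high/low surrogate pair for anything above U+FFFF.
# (Hypothetical helper, not a Unicode::String function.)
sub to_utf16_units {
    my ($cp) = @_;
    return ($cp) if $cp < 0x10000;
    my $v = $cp - 0x10000;
    return (0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF));
}

# Count characters, treating a high surrogate followed by a low
# surrogate as a single character.  (Hypothetical helper.)
sub surrogate_aware_length {
    my @units = @_;
    my $count = 0;
    for (my $i = 0; $i < @units; $i++) {
        # Skip the low half when a well-formed pair starts here.
        $i++ if $units[$i] >= 0xD800 && $units[$i] <= 0xDBFF
             && $i + 1 < @units
             && $units[$i + 1] >= 0xDC00 && $units[$i + 1] <= 0xDFFF;
        $count++;
    }
    return $count;
}

my @units = map { to_utf16_units(hex $_) } qw(0063 12345);
printf "units: %s\n", join ' ', map { sprintf 'U+%04x', $_ } @units;
printf "code units: %d, characters: %d\n",
    scalar(@units), surrogate_aware_length(@units);
```

For the second test string this gives the same three code units as the output above (U+0063, U+d808, U+df45), but a character count of 2 rather than 3.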

Even if uchr is putting out UTF16, why isn't the utf8() call coercing
the value from UTF16 to UTF8?

utf8() actually converts the other way, from a UTF8 string to the
internal UTF16.  uchr() converts a numeric value to UTF16.

How do I get this to put out UTF8, which is what I need?

The ->utf8 method should do that.
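
A minimal sketch of that, assuming Unicode::String is installed: build the string as an object with uchr() and ->append, and call ->utf8 only when the bytes are needed.  Whether ->utf8 folds the surrogate pair into one four-byte sequence may depend on the module version, so check the output against your installation.  A surrogate-aware conversion of these two characters would print 63 f0 92 8d 85.

```perl
#!/usr/bin/perl -w
use strict;
use Unicode::String qw(uchr);

# Build the string as a Unicode::String object.  Do not pass it back
# through utf8(), which decodes UTF8 rather than producing it.
my $out = uchr(0x0063);
$out->append(uchr(0x12345));

# Ask for the UTF8 encoding explicitly at output time.
my $bytes = $out->utf8;

# Dump the bytes in hex so the encoding can be inspected.
print join(' ', map { sprintf '%02x', ord } split //, $bytes), "\n";
```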

--Gisle