perl-unicode

Re: Hassles with Unicode::String

2001-01-18 21:00:37
At 5:19 PM -0800 1/18/01, Gisle Aas wrote:
>> How do I get this to put out UTF8, which is what I need?
>
> The ->utf8 method should do that.

OK, I now see that you meant for me to do this on output. However, that doesn't fix what I actually need, which is for length and substr to work on whole characters instead of going to surrogates. For that matter, hex goes to surrogates as well!

=====
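#!/usr/bin/perl -w

use Unicode::String qw(utf8 utf16 uchr);
# Stringify Unicode::String objects as UTF-8 by default.
Unicode::String->stringify_as('utf8');

# Code points to test: three in the BMP, two above U+FFFF.
@TestVectors = ("0x0010", "0x0100", "0x1000", "0x10000", "0x100000");

foreach $ThisVector (@TestVectors) {
    # Build the character and get it back as a plain UTF-8 string.
    $SomeUTF8 = uchr(hex($ThisVector))->utf8;
    $TheLen = length($SomeUTF8);
    # Re-wrap the UTF-8 string and ask Unicode::String for its hex form.
    $TheHex = utf8($SomeUTF8)->hex;
    print "$ThisVector   $TheLen   $TheHex   Ords: ";
    # Dump each element of the plain string as a decimal value.
    @TheOctets = split(//, $SomeUTF8);
    foreach $ThisOctet (@TheOctets) { print ord($ThisOctet), " " };
    print "\n";
}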
=====

0x0010   1   U+0010   Ords: 16
0x0100   2   U+0100   Ords: 196 128
0x1000   3   U+1000   Ords: 225 128 128
0x10000   4   U+d800 U+dc00   Ords: 240 144 128 128
0x100000   4   U+dbc0 U+dc00   Ords: 244 128 128 128

Clearly, $SomeUTF8 is in UTF8, as shown by the lengths and by the ords. But for the last two test vectors, hex turns the single character into a surrogate pair before outputting the hex values.
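For what it's worth, the two-value hex results above are exactly the UTF-16 surrogate pairs for those code points. Here is a quick sketch of the arithmetic in plain Perl, without Unicode::String. The sub names to_surrogates and from_surrogates are made up for illustration and are not part of any module.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: map a code point to its UTF-16 code units.
# Code points below 0x10000 are returned unchanged as a single unit.
sub to_surrogates {
    my ($cp) = @_;
    return ($cp) if $cp < 0x10000;
    my $v = $cp - 0x10000;              # 20-bit offset above the BMP
    return (0xD800 + ($v >> 10),        # high surrogate: top 10 bits
            0xDC00 + ($v & 0x3FF));     # low surrogate: bottom 10 bits
}

# Hypothetical helper: recombine a high/low surrogate pair.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

foreach my $cp (0x1000, 0x10000, 0x100000) {
    my @units = to_surrogates($cp);
    printf "U+%x -> %s\n", $cp, join(" ", map { sprintf "U+%04x", $_ } @units);
}
# Prints:
# U+1000 -> U+1000
# U+10000 -> U+d800 U+dc00
# U+100000 -> U+dbc0 U+dc00
```

So the information is all there in the hex output; it just takes the recombination step in from_surrogates to get back to one value per character.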

Is there any way that I can break a UTF8 string into individual characters without resorting to the kludge of converting to UTF16 and checking manually for half-surrogates?
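Failing that, the fallback I can think of is to walk the octets by hand. Below is a rough sketch of what I mean, assuming the input is a plain string of UTF-8 octets. The sub names split_utf8_chars and utf8_to_codepoint are hypothetical, nothing from Unicode::String, and the code does no validation: malformed or overlong sequences are not rejected, and stray octets are silently skipped.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: split a string of UTF-8 octets into one
# substring per encoded character, keyed off the lead octet.
sub split_utf8_chars {
    my ($octets) = @_;
    my @chars = $octets =~ /( [\x00-\x7F]                  # 1 octet
                            | [\xC0-\xDF][\x80-\xBF]       # 2 octets
                            | [\xE0-\xEF][\x80-\xBF]{2}    # 3 octets
                            | [\xF0-\xF7][\x80-\xBF]{3}    # 4 octets
                            )/gx;
    return @chars;
}

# Hypothetical helper: decode one UTF-8 sequence into its code point.
sub utf8_to_codepoint {
    my ($seq) = @_;
    my @o = unpack("C*", $seq);
    my $lead = shift @o;
    # Strip the length marker bits from the lead octet.
    my $cp = @o == 0 ? $lead
           : @o == 1 ? $lead & 0x1F
           : @o == 2 ? $lead & 0x0F
           :           $lead & 0x07;
    # Each continuation octet contributes its low 6 bits.
    $cp = ($cp << 6) | ($_ & 0x3F) for @o;
    return $cp;
}

# The octets for U+0100 followed by U+10000, taken from the ords above.
my $octets = pack("C*", 196, 128, 240, 144, 128, 128);
foreach my $char (split_utf8_chars($octets)) {
    printf "U+%04X (%d octets)\n", utf8_to_codepoint($char), length($char);
}
# Prints:
# U+0100 (2 octets)
# U+10000 (4 octets)
```

That gives one value per character with no surrogates involved, but it is exactly the sort of hand-rolled decoding I was hoping a module would do for me.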

--Paul Hoffman