Re: Hassles with Unicode::String

Paul Hoffman <phoffman(_at_)proper(_dot_)com> writes:

At 5:19 PM -0800 1/18/01, Gisle Aas wrote:

 > How do I get this to put out UTF8, which is what I need?

The ->utf8 method should do that.


OK, I now see that you meant to do this for the output. However, this
doesn't fix what I need, which is for length and substr to work on whole
characters instead of going to surrogates. For that matter, hex goes to
surrogates as well!


I agree this ought to be fixed.  The easiest way is probably to change
Unicode::String so that it uses UTF32 internally.  Then length/substr
can still be as simple (and fast) as they are now.
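
To illustrate why a fixed-width internal form keeps length and substr
simple, here is a minimal sketch. It is not Unicode::String's code: the
helper names utf32_length and utf32_substr are made up for this example,
and the buffer is assumed to hold one big-endian 32-bit value per
character, built with pack('N*').

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Unicode::String.
# The buffer is assumed to hold one big-endian 32-bit value per character.

sub utf32_length {
    my ($buf) = @_;
    return length($buf) / 4;            # always 4 octets per character
}

sub utf32_substr {
    my ($buf, $offset, $count) = @_;
    # Character offsets map to octet offsets by a constant factor.
    return substr($buf, 4 * $offset, 4 * $count);
}

# 'A', U+10000 and U+100000: three characters, no surrogates involved.
my $buf = pack('N*', 0x41, 0x10000, 0x100000);

print utf32_length($buf), "\n";                              # prints 3
printf "U+%04X\n", unpack('N', utf32_substr($buf, 1, 1));    # prints U+10000
```

Because every character takes the same number of octets, neither helper
has to scan the string or check for surrogate halves.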

=====
#!/usr/bin/perl -w

use strict;
use Unicode::String qw(utf8 utf16 uchr);
Unicode::String->stringify_as('utf8');

# Code points to test, as hex strings; the last two lie above U+FFFF.
my @TestVectors = ("0x0010", "0x0100", "0x1000", "0x10000", "0x100000");

foreach my $ThisVector (@TestVectors) {
    # Build a one-character string and take its UTF8 octets.
    my $SomeUTF8 = uchr(hex($ThisVector))->utf8;
    my $TheLen = length($SomeUTF8);
    # Re-parse the octets and dump them in hex; the surrogates show up here.
    my $TheHex = utf8($SomeUTF8)->hex;
    print "$ThisVector   $TheLen   $TheHex   Ords: ";
    my @TheOctets = split(//, $SomeUTF8);
    foreach my $ThisOctet (@TheOctets) { print ord($ThisOctet), " " }
    print "\n";
}
=====

0x0010   1   U+0010   Ords: 16
0x0100   2   U+0100   Ords: 196 128
0x1000   3   U+1000   Ords: 225 128 128
0x10000   4   U+d800 U+dc00   Ords: 240 144 128 128
0x100000   4   U+dbc0 U+dc00   Ords: 244 128 128 128

Clearly, $SomeUTF8 is in UTF8, as exhibited by the lengths and by the
ords. But for the last two vectors, hex turns the character into a
surrogate pair before outputting the hex values.
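
For reference, the U+d800 U+dc00 and U+dbc0 U+dc00 pairs in the output
are the standard UTF-16 encoding of 0x10000 and 0x100000. A small sketch
of that arithmetic in plain Perl follows; the sub names to_surrogates and
from_surrogates are invented for this example and are not part of
Unicode::String.

```perl
use strict;
use warnings;

# Hypothetical helpers illustrating the UTF-16 surrogate arithmetic.

sub to_surrogates {
    my ($cp) = @_;
    return ($cp) if $cp < 0x10000;          # fits in one 16-bit unit
    my $v = $cp - 0x10000;                  # 20 bits remain
    return (0xD800 + ($v >> 10),            # high surrogate: top 10 bits
            0xDC00 + ($v & 0x3FF));         # low surrogate: bottom 10 bits
}

sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

for my $cp (0x10000, 0x100000) {
    my @units = to_surrogates($cp);
    printf "0x%x => %s => 0x%x\n", $cp,
        join(' ', map { sprintf 'U+%04x', $_ } @units),
        from_surrogates(@units);
}
```

Recombining a pair with from_surrogates is the manual half-surrogate
check that the question below is trying to avoid.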

Is there any way that I can break a UTF8 string into individual
characters without doing some kludge of using UTF16 characters and
checking manually for half-surrogates?


You could try to use perl's native UTF8 support for that, e.g.
unpack("U*", ...) to get the code point of each character.

--Gisle
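
A sketch of the unpack approach, assuming a Perl with the core Encode
module (5.8 or later, so newer than the perl discussed in this thread).
The helper name utf8_code_points is invented for the example. It decodes
the octets first and then lets unpack return one code point per
character, so no surrogates appear at any stage.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn a string of UTF-8 octets into code points.
sub utf8_code_points {
    my ($octets) = @_;
    # Decode the octets into a Perl character string.
    # FB_CROAK makes malformed input die instead of being silently replaced.
    my $chars = Encode::decode('UTF-8', $octets, Encode::FB_CROAK());
    # "U*" yields the code point of every character in the string.
    return unpack('U*', $chars);
}

# The four octets printed for 0x10000 in the test output above.
my $octets = pack('C*', 240, 144, 128, 128);
printf "U+%04X\n", $_ for utf8_code_points($octets);    # prints U+10000
```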