Paul Hoffman <phoffman(_at_)proper(_dot_)com> writes:
At 5:19 PM -0800 1/18/01, Gisle Aas wrote:
> How do I get this to put out UTF8, which is what I need?
The ->utf8 method should do that.
OK, I now see that you meant to do this for the output. However, this
doesn't fix what I need, which is for length and substr to not go to
surrogates. For that matter, hex goes to surrogates as well!
I agree this ought to be fixed. The easiest way is probably to change
Unicode::String so that it uses UTF32 internally. Then length/substr
can still be as simple (and fast) as they are now.
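
To illustrate why a UTF32 representation keeps length and substr simple, here is a minimal sketch. It is not Unicode::String's actual code: the u32_* helper names are hypothetical, and the choice of big-endian 32-bit storage via pack("N*") is an assumption made only for the example.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helpers, not part of Unicode::String: each code point is
# stored as one 32-bit big-endian value (UTF-32BE).
sub u32_from_codepoints { return pack("N*", @_) }

# Length in characters is simply the byte length divided by four.
sub u32_length { return length($_[0]) / 4 }

# A character offset maps directly onto a byte offset, so substr stays cheap.
sub u32_substr {
    my ($str, $offset, $count) = @_;
    return substr($str, 4 * $offset, 4 * $count);
}

# Format the code points in the U+XXXX style.
sub u32_hex {
    return join(" ", map { sprintf("U+%04x", $_) } unpack("N*", $_[0]));
}

my $s = u32_from_codepoints(0x0100, 0x10000, 0x100000);
print u32_length($s), "\n";                  # counts code points, not surrogates
print u32_hex(u32_substr($s, 1, 1)), "\n";   # the second code point only
```

Because every character occupies exactly four octets, no surrogate handling is needed anywhere in these helpers.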
=====
#!/usr/bin/perl -w
use Unicode::String qw(utf8 utf16 uchr);
Unicode::String->stringify_as('utf8');
@TestVectors = ("0x0010", "0x0100", "0x1000", "0x10000", "0x100000");
foreach $ThisVector (@TestVectors) {
    $SomeUTF8 = uchr(hex($ThisVector))->utf8;
    $TheLen = length($SomeUTF8);
    $TheHex = utf8($SomeUTF8)->hex;
    print "$ThisVector $TheLen $TheHex Ords: ";
    @TheOctets = split(//, $SomeUTF8);
    foreach $ThisOctet (@TheOctets) { print ord($ThisOctet), " " };
    print "\n";
}
=====
0x0010 1 U+0010 Ords: 16
0x0100 2 U+0100 Ords: 196 128
0x1000 3 U+1000 Ords: 225 128 128
0x10000 4 U+d800 U+dc00 Ords: 240 144 128 128
0x100000 4 U+dbc0 U+dc00 Ords: 244 128 128 128
Clearly, $SomeUTF8 is in UTF8, as exhibited by the lengths and by the
ords. But hex turns the characters above U+FFFF into surrogate pairs
before outputting the hex values.
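
The surrogate values in the output above follow from the standard UTF16 arithmetic. The sketch below reproduces them and shows the reverse mapping; the function names to_surrogates and from_surrogates are hypothetical and are not part of Unicode::String.

```perl
#!/usr/bin/perl -w
use strict;

# Split a code point into UTF16 code units (hypothetical helper).
sub to_surrogates {
    my ($cp) = @_;
    return ($cp) if $cp < 0x10000;     # BMP characters need no surrogates
    my $v = $cp - 0x10000;             # remaining 20-bit value
    return (0xD800 + ($v >> 10),       # high surrogate: top 10 bits
            0xDC00 + ($v & 0x3FF));    # low surrogate: bottom 10 bits
}

# Recombine a high/low surrogate pair into one code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

foreach my $cp (0x1000, 0x10000, 0x100000) {
    my @units = to_surrogates($cp);
    printf("0x%x -> %s\n", $cp, join(" ", map { sprintf("U+%04x", $_) } @units));
}
# 0x10000 gives U+d800 U+dc00 and 0x100000 gives U+dbc0 U+dc00,
# matching the hex column in the test output.
```

Checking for a pair by hand and calling something like from_surrogates is the manual approach the question below hopes to avoid.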
Is there any way that I can break a UTF8 string into individual
characters without doing some kludge of using UTF16 characters and
checking manually for half-surrogates?
You could try using perl's native UTF8 support for that, for example
unpack("U",...)
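
As one possible way to lean on perl's own Unicode support, the sketch below decodes the UTF8 octets into a character string and then splits it. It assumes a later perl (5.8 or newer) where the core Encode module is available, which is a different route from the unpack suggestion; codepoints_from_utf8 is a hypothetical helper name.

```perl
#!/usr/bin/perl -w
use strict;
use Encode qw(decode);   # assumption: perl 5.8+, where Encode is a core module

# Turn a string of UTF8 octets into a list of code points (hypothetical helper).
sub codepoints_from_utf8 {
    my ($octets) = @_;
    my $chars = decode("UTF-8", $octets);   # octets -> perl character string
    return map { ord($_) } split(//, $chars);
}

# The octets for U+10000 followed by U+0100, taken from the test output.
my $octets = join("", map { chr($_) } 240, 144, 128, 128, 196, 128);
my @cps = codepoints_from_utf8($octets);
print scalar(@cps), " characters: ",
      join(" ", map { sprintf("U+%04x", $_) } @cps), "\n";
```

Once the string has been decoded, length and substr count whole characters, so no surrogate checks are involved.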
--Gisle