perl-unicode

Weird interaction of ord, split, and substr with UTF-8?

2000-10-30 20:02:36
Greetings. My apologies if this has been brought up on the list before; I couldn't find a pointer to the archives (if they exist).

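ord, split, and substr appear to misbehave with UTF-8 when a single character is extracted from a string: ord reports 200 rather than 546 for U+0222, even though the extracted character still compares equal to "\x{0222}":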

=====
use utf8;

$a = "\x{0061}\x{0222}\x{0061}";
print "The whole length is ", length($a), "\n";

@b = split(//, $a);
foreach $c (@b) {
    print "The ord in the split is ", ord($c);
    if($c eq "\x{0222}") { print " and is equal to U+0222"}
    print "\n";
}

for($i=0; $i<length($a); $i++) {
    $c = substr($a, $i, 1);
    print "The ord in the index is ", ord($c);
    if($c eq "\x{0222}") { print " and is equal to U+0222"}
    print "\n";
}

$d = "\x{0222}";
print "The ord outside the split is ", ord($d), "\n";
=====
In 5.6.0, this produces:

The whole length is 3
The ord in the split is 97
The ord in the split is 200 and is equal to U+0222
The ord in the split is 97
The ord in the index is 97
The ord in the index is 200 and is equal to U+0222
The ord in the index is 97
The ord outside the split is 546
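
One possible reading of the 200 above, offered as an observation rather than a confirmed diagnosis: the UTF-8 encoding of U+0222 is the two octets 0xC8 0xA2, and 0xC8 is 200 in decimal, so ord may be returning the first octet of the encoded character instead of its code point. The sketch below only checks that arithmetic. It assumes a later perl (5.8 or newer), where the Encode module ships with the core distribution, so it will not run on 5.6.0. The name utf8_octets is a hypothetical helper, not part of the script above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the original post): return the UTF-8
# octets of a single code point as a list of integers.
sub utf8_octets {
    my ($code_point) = @_;
    # encode() turns the one-character string into a UTF-8 byte string.
    my $bytes = encode('UTF-8', chr($code_point));
    # 'C*' unpacks each byte as an unsigned integer.
    return unpack('C*', $bytes);
}

my @octets = utf8_octets(0x0222);
printf "U+0222 as UTF-8: %s\n",
    join(' ', map { sprintf '0x%02X (%d)', $_, $_ } @octets);
```

On such a perl this prints "U+0222 as UTF-8: 0xC8 (200) 0xA2 (162)", which matches the 200 reported inside both loops.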

Has anyone else come across this? Is there a way to get the correct value from ord when it is called in a loop on characters returned by split or substr?
