For convenience, I keep the source code of my test script in UTF-8.
The test also deals with non-breaking spaces, which I prefer
to write as escape sequences, since they are not visible
and a casual onlooker might mistake them for ordinary
spaces. So I write them as "\xa0". Or "\x{a0}", or "\x{00a0}".
Now I find that they seem to be treated as bytes, not as
characters. Consider the following test script:

    use strict;
    use warnings;
    use utf8; # source code in UTF-8 ("Zurück")
    use open OUT => ':encoding(UTF-8)', ':std';

    my $str1 = "<<\xa0Zurück\n"; # byte -> bad
    my $str2 = "<<\x{a0}Zurück\n"; # should be character, but isn't
    my $str3 = "<<\x{00a0}Zurück\n"; # ditto
    my $str4 = "<<\xa0" . "Zurück\n"; # upgrading hack, works

    print $str1, $str2, $str3, $str4;

    $str1 ne $str2 and die "won't die";
    $str1 ne $str3 and die "won't die";
    $str1 ne $str4 and die 'die now, somewhat counter-intuitively';

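
To rule out that I am misreading the terminal output, the escapes can
also be inspected code point by code point. This is only a minimal
sketch: dump_cp is a helper name I made up, and it looks at the bare
escapes in isolation, not at the full strings from the script above,
so it may well not reproduce the problem.

```perl
use strict;
use warnings;

# dump_cp() is a hypothetical helper, not part of any module:
# it renders every character of a string as U+XXXX.
sub dump_cp {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

# The three spellings of the escape, keyed by how they look in source.
my %escape = (
    '\xa0'     => "\xa0",
    '\x{a0}'   => "\x{a0}",
    '\x{00a0}' => "\x{00a0}",
);

for my $name (sort keys %escape) {
    my $s = $escape{$name};
    # utf8::is_utf8() reports the internal flag only; it says nothing
    # about which characters the string holds.
    printf "%-10s %s  utf8 flag: %s\n",
        $name, dump_cp($s), utf8::is_utf8($s) ? 'on' : 'off';
}
```

The same helper can be pointed at $str1 through $str4 to see at which
position they start to differ.
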
The only string that comes out right is $str4, which relies on implicit
upgrading of the byte escape "\xa0" to a Unicode character when the two
parts are concatenated. I've read that upgrading is better avoided, but
here it does the job.

Am I mistaken in my expectation that while "\xa0" should be a byte,
"\x{a0}" and "\x{00a0}" should be characters? Note that perlretut(1)
seems to support this assumption:

    Unicode characters in the range of 128-255 use two hexadecimal
    digits with braces: \x{ab}. Note that this is different than \xab,
    which is just a hexadecimal byte with no Unicode significance.

http://perl.active-venture.com/pod/perlretut-morecharacter.html

But maybe this only applies to these escapes inside regular expressions.
Or maybe the utf8 pragma breaks things here? I don't think so, though.
If I comment it out, I have to recode my script to Latin1 in order for
the strings to be valid.
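
The regular expression hypothesis could be checked with something like
the following sketch. It builds the non-breaking space with chr() so
that no string escape is involved on the subject side, and it uses
ASCII-only source, so it deliberately leaves the utf8 pragma out of
the picture. The variable names are mine.

```perl
use strict;
use warnings;

my $nbsp = chr(0xA0);   # U+00A0, built without any escape sequence

# Does each spelling of the escape, used inside a pattern,
# match exactly that one character?
my $plain_matches  = $nbsp =~ /\A\xa0\z/   ? 1 : 0;
my $braced_matches = $nbsp =~ /\A\x{a0}\z/ ? 1 : 0;

print "pattern \\xa0   matches: $plain_matches\n";
print "pattern \\x{a0} matches: $braced_matches\n";
```

If both patterns behave the same, the braces alone make no difference
for a literal match, whatever perlretut means by "Unicode significance".
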
Note that the reason I use the utf8 pragma is so I can write "Zurück"
in my source code and automatically have Perl informed that these are
characters, not bytes - which is a great convenience.
Yeah, it would also work in Latin1, and our editors handle various
encodings just fine - but we have a good UTF-8 development environment
and there might be characters not representable in Latin1 that I'd like
to add to the script source.
What's your advice for handling this situation more elegantly?
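
Candidates I can think of myself, as a sketch only: naming the
character with \N{...} via the charnames pragma, or building it with
chr(). I have not verified that these behave identically on every Perl
version, and the variable names are just for illustration.

```perl
use strict;
use warnings;
use utf8;                   # source code in UTF-8 ("Zurück")
use charnames ':full';      # enables \N{NO-BREAK SPACE}

binmode STDOUT, ':encoding(UTF-8)';

# Three ways of spelling U+00A0 that stay visible in the source.
my $by_name = "<<\N{NO-BREAK SPACE}Zurück\n";
my $by_num  = "<<\N{U+00A0}Zurück\n";
my $by_chr  = '<<' . chr(0xA0) . "Zurück\n";

print $by_name, $by_num, $by_chr;
print $by_name eq $by_chr && $by_name eq $by_num
    ? "all three are equal\n"
    : "they differ\n";
```

The named form has the side benefit of documenting itself, which was
the reason for not typing a literal non-breaking space in the first
place.
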
--
Michael.Ludwig (#) XING.com