perl-unicode

Re: Character (or byte?) escapes under utf8 pragma

2010-03-07 00:50:11
Hi Michael,

[ perlbug readers, you will find the nut of the issue in the
  section marked BUG ]

* Michael Ludwig <michael(_dot_)ludwig(_at_)xing(_dot_)com> [2010-03-03 14:05]:
For convenience, I have test script source code in UTF-8. The
test also deals with non-breaking spaces, which I prefer to
keep as character references since they are not visible and
might be mistaken by the casual onlooker for ordinary spaces.
So I write them as "\xa0". Or "\x{a0}", or "\x{00a0}".

Now I find that they seem to be byte references, not character
references.

Perl does not distinguish between bytes and characters. It does
distinguish between strings stored as a packed byte buffer and
strings stored as a variable-width integer sequence, but this is
an implementation detail and has no bearing on semantics. Strings
are simply strings in Perl. You cannot tell what kind of data they
contain just by looking at them, and the UTF8 flag doesn’t tell
you either.
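
A minimal sketch of that point, assuming nothing beyond core Perl (the
variable names are made up for illustration): the same string held in
both storage formats compares equal and has the same length, so the
flag tells you about storage only.

```perl
use strict;
use warnings;

# Two copies of the same one-character string (code point 0xA0).
my $packed   = "\xa0";
my $upgraded = "\xa0";

# Switch the second copy to the variable-width internal format.
# This changes how the string is stored, not what it contains.
utf8::upgrade($upgraded);

printf "packed:   flag=%d length=%d\n",
    utf8::is_utf8($packed)   ? 1 : 0, length $packed;
printf "upgraded: flag=%d length=%d\n",
    utf8::is_utf8($upgraded) ? 1 : 0, length $upgraded;
print "equal\n" if $packed eq $upgraded;
```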

Consider the following test script:

use strict;
use warnings;
use utf8; # source code in UTF-8 ("Zurück")
use open OUT => ':encoding(UTF-8)', ':std';

my $str1 = "<<\xa0Zurück\n";      # byte -> bad
my $str2 = "<<\x{a0}Zurück\n";    # should be character, but isn't
my $str3 = "<<\x{00a0}Zurück\n";  # ditto
my $str4 = "<<\xa0" . "Zurück\n"; # upgrading hack, works

print $str1, $str2, $str3, $str4;

$str1 ne $str2 and die "won't die";
$str1 ne $str3 and die "won't die";
$str1 ne $str4 and die 'die now, somewhat counter-intuitively';

    "\x{00a0}" does not map to utf8 at t.pl line 11.
    <<\xA0Zurück
    "\x{00a0}" does not map to utf8 at t.pl line 11.
    <<\xA0Zurück
    "\x{00a0}" does not map to utf8 at t.pl line 11.
    <<\xA0Zurück
    << Zurück
    die now, somewhat counter-intuitively at t.pl line 15.

This is definitely a bug.

The correct version of the string uses implicit upgrading of
the byte escape "\xa0" to a Unicode character. I've read
upgrading should rather be avoided, but here it does the job.

No, upgrading is perfectly fine. Mixing byte and character data
is what should be avoided, because then Perl will assume it’s all
characters, which will result in mangling of one of the two kinds
of data. Usually the byte data is encoded text, in which case the
problem becomes apparent as double-encoded text. But it’s really
a problem both ways.
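
To illustrate the double-encoding case, here is a rough sketch using
the core Encode module. The variable names are hypothetical, and the
"byte data" is simply a ü that has already been encoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "Zur\x{fc}ck";               # character data, 6 characters
my $bytes = encode('UTF-8', "\x{fc}");   # byte data: 0xC3 0xBC

# Mixing: Perl treats the two bytes as the characters U+00C3 and
# U+00BC, so encoding the result encodes them a second time.
my $mixed     = $chars . $bytes;
my $mixed_out = encode('UTF-8', $mixed);

# Decoding the byte data first keeps everything as characters.
my $clean     = $chars . decode('UTF-8', $bytes);
my $clean_out = encode('UTF-8', $clean);

printf "mixed: %d characters, %d bytes when encoded\n",
    length $mixed, length $mixed_out;
printf "clean: %d characters, %d bytes when encoded\n",
    length $clean, length $clean_out;
```

The mixed string comes out two characters longer than intended, which
is the double-encoded ü.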

Am I mistaken in my expectation that while "\xa0" should be
a byte, "\x{a0}" and "\x{00a0}" should be characters? Note that
perlretut(1) seems to support this assumption:

 Unicode characters in the range of 128-255 use two hexadecimal
 digits with braces: \x{ab}. Note that this is different than
 \xab, which is just a hexadecimal byte with no Unicode
 significance.

http://perl.active-venture.com/pod/perlretut-morecharacter.html

But maybe this only refers to these escapes inside regular expressions.

The documentation appears to be wrong. Unfortunately a lot of the
documentation of Perl itself is wrong or confused about Perl’s
string model.

Or maybe the utf8 pragma breaks things here? Don't think so,
though. If I comment it out, I have to recode my script to
Latin1 in order for the strings to be valid.

Yes. This appears to be a utf8 pragma bug or a bug in the parser
that shows up in interaction with the utf8 pragma.

    ====================== BUG ======================

What happens is that the presence of the ü under the utf8 pragma
triggers using the variable-width integer sequence format for the
string, but the 0xA0 byte from the \x escape gets written into
that buffer verbatim, as if it were a packed byte array string.
This is wrong and completely broken.

    ====================== BUG ======================
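
One way to check whether a given perl is affected is to inspect the
literal after it has been compiled. This is only a sketch: it assumes
the file is saved as UTF-8, and the variable names are made up. On a
perl without the bug I would expect the buffer to be reported as
consistent, the length to be 10, and the third character to be U+00A0.

```perl
use strict;
use warnings;
use utf8;   # assumes this file is saved as UTF-8

my $str = "<<\xa0Zurück\n";

# utf8::valid() reports whether the internal buffer is consistent
# with the string's storage format.
my $consistent = utf8::valid($str) ? 1 : 0;

# The character after "<<" should be the no-break space, U+00A0.
my $third = ord substr($str, 2, 1);
my $len   = length $str;

printf "consistent=%d length=%d third=U+%04X\n",
    $consistent, $len, $third;
```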

Note that the reason I use the utf8 pragma is so I can write
"Zurück" in my source code and automatically have Perl informed
that these are characters, not bytes - which is a great
convenience.

Yeah, it would also work in Latin1, and our editors handle
various encodings just fine - but we have a good UTF-8
development environment and there might be characters not
representable in Latin1 that I'd like to add to the script
source.

Writing source in UTF-8 is a perfectly sane practice. No need to
justify it.

What's your advice for handling this situation more elegantly?

Use the \N{U+...} escape, here "\N{U+00A0}", to indicate that you
always mean a Unicode code point. Due to other quirks in how \N is
implemented, it ends up not triggering the bug that \x would.
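
A short sketch of writing the no-break space as an explicit code point
with "\N{U+00A0}", assuming the file is saved as UTF-8; the binmode
call is just one way to set the output encoding:

```perl
use strict;
use warnings;
use utf8;   # assumes this file is saved as UTF-8

binmode STDOUT, ':encoding(UTF-8)';

# \N{U+00A0} always names the code point U+00A0 (NO-BREAK SPACE).
my $str = "<<\N{U+00A0}Zurück\n";

print $str;
printf "length=%d third=U+%04X\n", length $str, ord substr($str, 2, 1);
```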

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
