perl-unicode

Re: Character (or byte?) escapes under utf8 pragma

2010-03-08 08:55:56
Hi Aristotle,

Thanks for your answer, much appreciated! Please see my comments
inline.

On 07.03.2010 at 07:39, Aristotle Pagaltzis wrote:

Perl does not distinguish between bytes and characters. It does
distinguish between scalars that use a packed byte buffer for
storage vs strings that use a variable-width integer sequence for
storage, but this is an implementation detail and does not mean
anything in terms of semantics. Strings are simply strings in
Perl. You cannot tell what kind of data they contain just by
looking at them and the UTF8 flag doesn’t tell you either.

Okay. But unless I'm completely misled, you can tell whether a
string is supposed to contain characters (<- Encode::decode) or
bytes (<- Encode::encode). With the utf8 pragma in scope, it seems
to me that my literal strings are supposed to contain characters,
not bytes.
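
To illustrate what I mean by "supposed to contain", here is a minimal
sketch. The variable names are made up for the example, and the
character/byte distinction it shows is only a convention about where
the string came from, not something Perl records in the string itself:

```perl
use strict;
use warnings;
use utf8;                        # literals in this file are decoded as UTF-8
use Encode qw(decode encode);

my $chars = "Zurück";                  # character string: 6 characters
my $bytes = encode('UTF-8', $chars);   # byte string: the u-umlaut takes 2 bytes
my $again = decode('UTF-8', $bytes);   # decoded back into characters

printf "%d characters, %d bytes\n", length($chars), length($bytes);
print "round trip ok\n" if $again eq $chars;
```

The only visible difference between the two strings is their length
and content; nothing in them says "I am bytes" or "I am characters".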

   "\x{00a0}" does not map to utf8 at t.pl line 11.
   <<\xA0Zurück
   "\x{00a0}" does not map to utf8 at t.pl line 11.
   <<\xA0Zurück
   "\x{00a0}" does not map to utf8 at t.pl line 11.
   <<\xA0Zurück
   << Zurück
   die now, somewhat counter-intuitively at t.pl line 15.

This is definitely a bug.

Good. It looked like one to me. Thanks for logging it with the
Perl maintainers.

However, it might already have been fixed for Perl 5.10.1 - at
least, ActiveState v5.10.1 produces what I think is a correct
result:


michael.ludwig@nb-mludwig: ~/MiLu/dev/perl/Unicode > aperl nbsp.pl
<< Zurück
<< Zurück
<< Zurück
<< Zurück

michael.ludwig@nb-mludwig: ~/MiLu/dev/perl/Unicode > aperl -v

This is perl, v5.10.1 built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)


Am I mistaken in my expectation that while "\xa0" should be
a byte, "\x{a0}" and "\x{00a0}" should be characters? Note that
perlretut(1) seems to support this assumption:

Unicode characters in the range of 128-255 use two hexadecimal
digits with braces: \x{ab}. Note that this is different than
\xab, which is just a hexadecimal byte with no Unicode
significance.

http://perl.active-venture.com/pod/perlretut-morecharacter.html

But maybe this only refers to these escapes inside regular expressions.
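
A quick check of the string-literal case, as a sketch only (the
variable names are invented for the example, and it says nothing about
how these escapes behave inside a regular expression):

```perl
use strict;
use warnings;

my $plain  = "\xa0";       # two hex digits, no braces
my $braced = "\x{a0}";     # same value, with braces
my $padded = "\x{00a0}";   # same value, zero-padded

# In a double-quoted literal all three spellings compare equal.
print "same string\n" if $plain eq $braced and $braced eq $padded;
printf "ord = %d, length = %d\n", ord($plain), length($plain);
```

So at the level of string comparison the braces make no difference;
each literal is a one-element string whose only element has the
ordinal 0xA0.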

The documentation appears to be wrong. Unfortunately a lot of the
documentation of Perl itself is wrong or confused about Perl’s
string model.

The documentation I referred to is outdated. Sorry for that.

What's your advice for handling this situation more elegantly?

Use the \U escape to indicate that you always mean a Unicode code
point. Due to other quirks in how \U is implemented, it ends up
not triggering the bug that \x would.


How would I use that? I only know about the U specifier for pack:

my $smiley = pack 'U', 0x263a;
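
As far as I can tell from perlop, \U inside a double-quoted string
means "uppercase until \E", so it cannot be used literally as a code
point escape. The documented escape for a Unicode code point is
\N{U+XXXX}; that this is what was meant is only my assumption. A
minimal sketch comparing it with pack and chr (variable names are
invented for the example):

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';   # encode characters on output

my $packed = pack 'U', 0x263a;   # the pack approach from above
my $named  = "\N{U+263A}";       # code point escape in a literal
my $chr    = chr 0x263a;         # the same character via chr

print "all equal\n" if $packed eq $named and $named eq $chr;

my $nbsp = "\N{U+00A0}";         # NO-BREAK SPACE by code point
printf "nbsp ord = %d\n", ord($nbsp);
print "$named\n";
```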

-- 
Michael.Ludwig (#) XING.com
