perl-unicode

Re: Don't use the \C escape in regexes - Why not?

2010-05-04 04:42:31
On 04.05.2010 at 11:09, Gisle Aas wrote:

I regret that I let \C sneak into the URI module.  Now we have an interface 
that depends on the internal UTF-8 flag of the strings passed in.

Does it? How so? If it's a byte string, well, it's a byte string, and \C 
doesn't change that. If, on the other hand, it's a text string, \C forces byte 
semantics upon it. Isn't that what you want to do in that function? (Okay, 
there's no spec for that function, so I don't really know what you want to do.) 
But doesn't the function return the same result regardless of the UTF-8 flag 
being set or not? As demonstrated by this test script:

use strict;
use warnings;
use utf8; # source in UTF-8
use Encode;
binmode STDOUT, ':utf8'; # terminal UTF-8
my $text   = 'Käse'; # all characters below 256
my $bytes  = encode_utf8 $text;
my $text2  = 'Jiří'; # some characters above 255
my $bytes2 = encode_utf8 $text2;
printf "%x %s\n", ord $_, $_ for
    $text,
    $text =~ m/(\C)/g,
    $bytes,
    $bytes =~ m/(\C)/g,
    $text2,
    $text2 =~ m/(\C)/g,
    $bytes2,
    $bytes2 =~ m/(\C)/g;


This makes it very hard to explain, makes it not do what you want when 
different types of strings are combined, and makes it hard to fix in ways that 
don't break some code.

Could you provide an example of how this might not do what you want when 
different types of strings are combined?

My plan for fixing this is to introduce URI::IRI with an interface that 
encodes all non-URI characters as percent-encoded UTF-8 and live with the 
inconsistency for URI (until Perl redefines what \C means).
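
For illustration only, here is a rough sketch of what "percent-encoded UTF-8" 
could look like without \C. The helper name percent_encode_utf8 is made up and 
is not part of URI or of the proposed URI::IRI. For brevity it escapes only 
non-ASCII octets; a real implementation would presumably also have to escape 
ASCII characters that are not allowed in a URI. Because the text is explicitly 
encoded first, the result should not depend on the internal UTF-8 flag.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper (an assumption for this sketch, not an existing API):
# turn a text string into UTF-8 octets, then percent-encode every octet
# above 0x7F. ASCII characters are passed through unchanged.
sub percent_encode_utf8 {
    my ($text) = @_;
    my $octets = encode_utf8($text);   # explicit encoding step, no \C needed
    $octets =~ s/([\x80-\xFF])/sprintf '%%%02X', ord $1/ge;
    return $octets;
}

my $plain    = "K\x{e4}se";            # 'Käse', stored without the UTF-8 flag
my $upgraded = "K\x{e4}se";
utf8::upgrade($upgraded);              # same text, UTF-8 flag now set

print percent_encode_utf8($plain),    "\n";   # K%C3%A4se
print percent_encode_utf8($upgraded), "\n";   # K%C3%A4se
print percent_encode_utf8("Ji\x{159}\x{ed}"), "\n";   # Ji%C5%99%C3%AD
```

Both spellings of 'Käse' give the same output here, which is the flag 
independence the quoted plan seems to be after.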


-- 
Michael.Ludwig (#) XING.com