On 04.05.2010 at 11:09, Gisle Aas wrote:
> I regret that I let \C sneak into the URI module. Now we have an interface
> that depends on the internal UTF-8 flag of the strings passed in.
Does it? How so? If it's a byte string, well, it's a byte string, and \C
doesn't change that. If, on the other hand, it's a text string, \C forces byte
semantics upon it. Isn't that what you want to do in that function? (Okay,
there's no spec for that function, so I don't really know what you want to do.)
But doesn't the function return the same result regardless of the UTF-8 flag
being set or not? As demonstrated by this test script:
use strict;
use warnings;
use utf8; # source in UTF-8
use Encode;
binmode STDOUT, ':utf8'; # terminal UTF-8
my $text = 'Käse'; # all characters below 256
my $bytes = encode_utf8 $text;
my $text2 = 'Jiří'; # some characters above 255
my $bytes2 = encode_utf8 $text2;
printf "%x %s\n", ord $_, $_ for
    $text,
    $text =~ m/(\C)/g,
    $bytes,
    $bytes =~ m/(\C)/g,
    $text2,
    $text2 =~ m/(\C)/g,
    $bytes2,
    $bytes2 =~ m/(\C)/g;
> This makes it very hard to explain, makes it not do what you want when
> different types of strings are combined, and makes it hard to fix in ways
> that don't break some code.
Could you provide an example of how this might not do what you want when
different types of strings are combined?
> My plan for fixing this is to introduce URI::IRI with an interface that
> encodes all non-URI characters as percent-encoded UTF-8 and live with the
> inconsistency for URI (until Perl redefines what \C means).
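For illustration only, here is a minimal sketch of what "percent-encoded UTF-8" could look like when done without \C. The helper name escape_as_utf8 is hypothetical and not part of URI or the planned URI::IRI, and treating only the RFC 3986 "unreserved" characters as safe is my assumption about what "non-URI characters" means. Because it always goes through encode_utf8, the result should not depend on the internal UTF-8 flag.

```perl
use strict;
use warnings;
use utf8;                        # source in UTF-8
use Encode qw(encode_utf8);

# Hypothetical helper, not part of the URI distribution.
# Assumption: everything outside the RFC 3986 "unreserved" set is escaped.
sub escape_as_utf8 {
    my ($text) = @_;

    # Encode by character semantics, so the internal representation
    # (UTF-8 flag on or off) does not influence the result.
    my $bytes = encode_utf8($text);

    # Replace every byte outside the unreserved set with %XX.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

print escape_as_utf8('Käse'), "\n";    # K%C3%A4se
print escape_as_utf8('Jiří'), "\n";    # Ji%C5%99%C3%AD
```

Note that the input is expected to be a text string here; passing an already encoded byte string would encode it a second time.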
--
Michael.Ludwig (#) XING.com