On 04.05.2010 at 13:24, Aristotle Pagaltzis wrote:
> * Michael Ludwig <michael(_dot_)ludwig(_at_)xing(_dot_)com> [2010-05-04 13:10]:
>> Is it this (theoretically fragile) implicitness in handling
>> character strings that makes \C a bad idea?
> Yes. It will do different things with semantically identical
> strings whose only difference is whether the UTF8 flag is set,
> i.e. it suffers the same problems that the `bytes` pragma has.
That's not what I meant, but I see your point:
$ cat uri-regex-C.pl
use strict;
use utf8;
use Encode;
use URI;
use Test::More tests => 2;
my $builder = Test::More->builder;
binmode $builder->$_, ':utf8' for qw/output failure_output todo_output/;
my $txt = 'Käse';
my $iso = encode( 'ISO-8859-1', $txt );
is $iso, $txt, "strings are equal: $iso = $txt";
my $uri_txt = URI->new( $txt );
my $uri_iso = URI->new( $iso );
# URI overloads stringification
is $uri_iso, $uri_txt, "URIs are equal: $uri_iso = $uri_txt";
$ perl uri-regex-C.pl
1..2
ok 1 - strings are equal: Käse = Käse
not ok 2 - URIs are equal: K%E4se = K%C3%A4se
# Failed test 'URIs are equal: K%E4se = K%C3%A4se'
# at uri-regex-C.pl line 16.
# got: 'K%E4se'
# expected: 'K%C3%A4se'
# Looks like you failed 1 test of 2.
The strings compare equal, but the URIs derived from them don't.
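The flag dependence Aristotle describes can also be seen without URI. The following is only a minimal sketch using core Perl; the variable names are mine, and it uses bytes::length rather than \C, on the assumption that both expose the internal representation in the same way.

```perl
use strict;
use warnings;
use bytes ();    # load the bytes:: functions without enabling the pragma

my $native   = "K\xE4se";    # 4 characters, UTF8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);    # same 4 characters, UTF8 flag on

print "strings are equal\n" if $native eq $upgraded;

# Character semantics: identical for both strings
print length($native), " ", length($upgraded), "\n";                  # 4 4

# Byte semantics: the internal representation leaks through
print bytes::length($native), " ", bytes::length($upgraded), "\n";    # 4 5
```

Both strings are the same text as far as eq and length are concerned; only code that looks at the internal bytes can tell them apart.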
But wait a second: while URIs are meant to be made of characters, they are also
meant to go over the wire, and there are no characters on the wire, only bytes.
No standard encoding is defined for the wire, although UTF-8 has come to be
seen as the standard encoding for URIs containing non-ASCII characters. Given
that Perl has two internal encodings for text (UTF-8 and ISO-8859-1) and relies
on the internal flag to tell which one applies, shouldn't the URI module accept
either only bytes or only characters? Or rather, provide two different
constructors instead of a single one that tries to be intelligent?
URI->bytes( $bytes ); # byte string
URI->chars( $chars ); # character string
And, in addition, define the character encoding used for serialization.
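To make the idea concrete, here is a rough sketch of what two explicit entry points could do. The helper names escape_bytes and escape_chars are hypothetical and not part of the URI module; the sketch uses only core Perl (Encode) and assumes UTF-8 as the serialization encoding for character strings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-escape a byte string, octet by octet.
sub escape_bytes {
    my ($bytes) = @_;
    die "not a byte string\n" if $bytes =~ /[^\x00-\xFF]/;
    # Escape everything except the RFC 3986 unreserved characters
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

# Hypothetical helper: serialize characters as UTF-8 first, then escape.
sub escape_chars {
    my ($chars) = @_;
    return escape_bytes( encode( 'UTF-8', $chars ) );
}

my $chars    = "K\x{E4}se";
my $upgraded = $chars;
utf8::upgrade($upgraded);    # same text, different internal representation

print escape_chars($chars),    "\n";    # K%C3%A4se
print escape_chars($upgraded), "\n";    # K%C3%A4se
print escape_bytes( encode( 'ISO-8859-1', $chars ) ), "\n";    # K%E4se
```

Because the caller states whether the argument is bytes or characters, the result no longer depends on the UTF8 flag.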
So \C implicitly encodes character strings as UTF-8 (Michael), and implicitly
encodes byte strings as they are, which is ISO-8859-1 (Aristotle).
The input for URI->new is not specified as either a character or a byte string,
and the output of URI->as_string is not specified with regard to a wire
encoding. (But how could it be if the input is not defined?) The perldoc for
URI->new in version 1.54 only says:
The set of characters available for building URI references is
restricted (see URI::Escape). Characters outside this set are
automatically escaped by the URI constructor.
http://search.cpan.org/dist/URI/URI.pm
What does Java do? The java.net.URI constructors only accept character strings,
and the wire encoding has been fixed to UTF-8. To quote:
A character is encoded by replacing it with the sequence
of escaped octets that represent that character in the UTF-8
character set. The Euro currency symbol ('\u20AC'), for example,
is encoded as "%E2%82%AC". (Deviation from RFC 2396, which does
not specify any particular character set.)
http://java.sun.com/javase/6/docs/api/java/net/URI.html
So documentation and behaviour are very clear in Java.
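One consequence of fixing the wire encoding is that the receiving side can decode without guessing. A minimal Perl sketch of that direction, assuming UTF-8 on the wire as in the Java documentation quoted above; unescape_utf8 is a hypothetical helper, not an existing API.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: undo percent-escapes, then read the octets as UTF-8.
sub unescape_utf8 {
    my ($escaped) = @_;
    ( my $octets = $escaped ) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    # FB_CROAK makes decode die on octets that are not valid UTF-8
    return decode( 'UTF-8', $octets, Encode::FB_CROAK );
}

my $euro = unescape_utf8('%E2%82%AC');
printf "U+%04X\n", ord $euro;    # U+20AC

# The ISO-8859-1 form from the failing test above is rejected
my $ok = eval { unescape_utf8('K%E4se'); 1 };
print $ok ? "decoded\n" : "not valid UTF-8\n";    # not valid UTF-8
```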
--
Michael.Ludwig (#) XING.com