perl-unicode

Re: Don't use the \C escape in regexes - Why not?

2010-05-04 07:56:11
On 2010-05-04 at 13:24, Aristotle Pagaltzis wrote:

* Michael Ludwig <michael(_dot_)ludwig(_at_)xing(_dot_)com> [2010-05-04 13:10]:
Is it this (theoretically fragile) implicitness in handling
character strings that makes \C a bad idea?

Yes. It will do different things with semantically identical
strings whose only difference is whether the UTF8 flag is set,
i.e. it suffers the same problems that the `bytes` pragma has.

That's not what I meant, but I see your point:

$ cat uri-regex-C.pl 
use strict;
use utf8;
use Encode;
use URI;
use Test::More tests => 2;

my $builder = Test::More->builder;
binmode $builder->$_, ':utf8' for qw/output failure_output todo_output/;
my $txt = 'Käse';
my $iso = encode( 'ISO-8859-1', $txt );

is $iso, $txt, "strings are equal: $iso = $txt";
my $uri_txt = URI->new( $txt );
my $uri_iso = URI->new( $iso );
# URI overloads stringification
is $uri_iso, $uri_txt, "URIs are equal: $uri_iso = $uri_txt";


$ perl uri-regex-C.pl 
1..2
ok 1 - strings are equal: Käse = Käse
not ok 2 - URIs are equal: K%E4se = K%C3%A4se
#   Failed test 'URIs are equal: K%E4se = K%C3%A4se'
#   at uri-regex-C.pl line 16.
#          got: 'K%E4se'
#     expected: 'K%C3%A4se'
# Looks like you failed 1 test of 2.

The strings compare equal, but the URIs derived from them don't.
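To make the underlying effect visible without URI, here is a minimal sketch using only core Perl. The variable names are mine, and utf8::upgrade merely stands in for whatever code path happens to set the UTF8 flag in a real program. It shows two strings that compare equal as characters but differ in their internal storage, which is the difference that \C and the `bytes` pragma expose:

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

my $down = "K\xE4se";    # four characters, stored internally as Latin-1
my $up   = $down;
utf8::upgrade($up);      # same four characters, stored internally as UTF-8

my $equal  = $down eq $up ? 1 : 0;                         # character comparison
my @flags  = map { utf8::is_utf8($_) ? 1 : 0 } $down, $up; # internal UTF8 flag
my @octets = map { bytes::length($_) } $down, $up;         # internal storage size

print "equal: $equal\n";            # equal: 1
print "UTF8 flag: @flags\n";        # UTF8 flag: 0 1
print "internal octets: @octets\n"; # internal octets: 4 5
```

Anything that looks at the internal octets rather than at the characters will therefore treat these two equal strings differently.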

But wait a second: while URIs are meant to be made of characters, they're also
meant to go over the wire, and there are no characters on the wire, only bytes.
No standard encoding is defined for the wire, although UTF-8 has come to be
seen as the standard encoding for URIs containing non-ASCII characters. Given
that Perl has two internal encodings for text (UTF-8 and ISO-8859-1) and relies
on the internal flag to tell which one applies, shouldn't the URI module accept
either only bytes or only characters? Or rather, provide two different
constructors instead of a single one that tries to be intelligent?

  URI->bytes( $bytes ); # byte string
  URI->chars( $chars ); # character string

And, in addition, define the character encoding used for serialization.
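A rough sketch of what such a split could look like, using only the core Encode module. The function names escape_bytes and escape_chars are hypothetical stand-ins for the proposed URI->bytes and URI->chars, not part of the URI distribution, and the set of unescaped characters is my assumption (the "unreserved" set of RFC 3986). The point is that the result depends only on the characters or octets passed in and on the stated encoding, never on the internal flag:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode every octet outside the RFC 3986 "unreserved" set.
sub _escape_octets {
    my ($octets) = @_;
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $octets;
}

# Hypothetical counterpart of URI->bytes: input is already encoded octets.
sub escape_bytes {
    my ($bytes) = @_;
    die "wide character in byte string\n" if $bytes =~ /[^\x00-\xFF]/;
    return _escape_octets($bytes);
}

# Hypothetical counterpart of URI->chars: input is text, and the wire
# encoding is an explicit parameter (UTF-8 unless stated otherwise).
sub escape_chars {
    my ($chars, $encoding) = @_;
    $encoding = 'UTF-8' unless defined $encoding;
    return _escape_octets( encode($encoding, $chars) );
}

my $txt = "K\xE4se";    # the characters K, a-umlaut, s, e
print escape_chars($txt), "\n";                          # K%C3%A4se
print escape_chars($txt, 'ISO-8859-1'), "\n";            # K%E4se
print escape_bytes( encode('ISO-8859-1', $txt) ), "\n";  # K%E4se
```

With the encoding step made explicit, the caller decides which of the two results from the test above is wanted.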

So, \C implicitly encodes character strings as UTF-8 (Michael), and implicitly
leaves byte strings as they are, which amounts to ISO-8859-1 (Aristotle).

The input for URI->new is not specified as either character or byte string, and 
the output of URI->as_string is not specified with regard to a wire encoding. 
(But how could it be if the input is not defined?) The perldoc for URI->new
(URI 1.54) only says:

  The set of characters available for building URI references is
  restricted (see URI::Escape). Characters outside this set are
  automatically escaped by the URI constructor.

http://search.cpan.org/dist/URI/URI.pm

What does Java do? The java.net.URI constructors only accept character strings, 
and the wire encoding has been fixed to UTF-8. To quote:

  A character is encoded by replacing it with the sequence
  of escaped octets that represent that character in the UTF-8
  character set. The Euro currency symbol ('\u20AC'), for example,
  is encoded as "%E2%82%AC". (Deviation from RFC 2396, which does
  not specify any particular character set.)

http://java.sun.com/javase/6/docs/api/java/net/URI.html

So documentation and behaviour are very clear in Java.
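For comparison, the rule quoted from the Java documentation can be reproduced in a few lines of core Perl. This is only a sketch of that one rule (encode the character as UTF-8, then escape each octet), with variable names of my own choosing, not a replacement for the URI module:

```perl
use strict;
use warnings;

my $euro   = "\x{20AC}";    # EURO SIGN as a one-character string
my $octets = $euro;
utf8::encode($octets);      # in place: characters -> UTF-8 octets

# Escape every octet as %XX.
my $escaped = join '', map { sprintf '%%%02X', $_ } unpack 'C*', $octets;
print "$escaped\n";         # %E2%82%AC
```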

-- 
Michael.Ludwig (#) XING.com