perl-unicode

Re: Don't use the \C escape in regexes - Why not?

2010-05-04 12:46:18
* Michael Ludwig <michael(_dot_)ludwig(_at_)xing(_dot_)com> [2010-05-04 14:55]:
> But wait a second: While URIs are meant to be made of
> characters, they're also meant to go over the wire, and there
> are no characters on the wire, only bytes. There is no standard
> encoding defined for the wire, although UTF-8 has come to be
> seen as the standard encoding for URIs containing non-ASCII
> characters. Perl having two standard encodings (UTF-8 and
> ISO-8859-1) for text and relying on the internal flag to tell
> which one is meant to matter, shouldn't the URI module either
> only accept bytes or only characters? Or rather, provide two
> different constructors instead of only one trying to be
> intelligent?
>
>  URI->bytes( $bytes ); # byte string
>  URI->chars( $chars ); # character string
>
> And, in addition, define the character encoding used for
> serialization.

Yes, exactly. And both methods would use the moral equivalent of
a plain `split //` – no trickery such as with `\C`. The only
difference between them is that the `chars` method would
`encode_utf8` the string first and then encode it blindly,
whereas the `bytes` method would leave it as is but then croak if
it found a codepoint > 0xFF (since the string is supposed to
represent an octet sequence already).

Notably absent in both cases: any dependence on the state of the
UTF8 flag of the string.
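
A minimal sketch of that idea, not code from the URI module: the
function names `uri_escape_chars`, `uri_escape_bytes` and
`_escape_octets` are hypothetical, and reading "encode it blindly" as
percent-encoding everything outside the RFC 3986 unreserved set is an
assumption.

```perl
use strict;
use warnings;
use Carp   qw(croak);
use Encode qw(encode_utf8);

# Hypothetical helper: percent-encode each element of the string,
# one at a time, via a plain split // (no \C involved).
# Assumes every element is already <= 0xFF.
sub _escape_octets {
    my ($octets) = @_;
    return join '', map {
        $_ =~ /[A-Za-z0-9\-._~]/ ? $_ : sprintf( '%%%02X', ord $_ )
    } split //, $octets;
}

# Hypothetical "chars" variant: the input is a character string,
# so serialize it as UTF-8 first, then escape the resulting octets.
sub uri_escape_chars {
    my ($chars) = @_;
    return _escape_octets( encode_utf8($chars) );
}

# Hypothetical "bytes" variant: the input must already be octets,
# so any codepoint above 0xFF is a caller error.
sub uri_escape_bytes {
    my ($bytes) = @_;
    croak "Codepoint > 0xFF in byte string" if $bytes =~ /[^\x00-\xFF]/;
    return _escape_octets($bytes);
}
```

With these definitions, `uri_escape_chars("caf\x{E9}")` gives
`caf%C3%A9` and `uri_escape_bytes("caf\xE9")` gives `caf%E9`, and
neither result changes if the argument is passed through
`utf8::upgrade` or `utf8::downgrade` beforehand.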

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>