perl-unicode

Re: Don't use the \C escape in regexes - Why not?

2010-05-04 09:28:29
I regret that I let \C sneak into the URI module.  Now we have an interface 
that depends on the internal UTF-8 flag of the strings passed in.  This makes it 
very hard to explain, makes it not do what you want when different types of 
strings are combined, and makes it hard to fix in ways that don't break some 
code.  My plan for fixing this is to introduce URI::IRI with an interface that 
encodes all non-URI characters as percent-encoded UTF-8 and live with the 
inconsistency for URI (until Perl redefines what \C means).
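
To illustrate the dependence on the internal UTF-8 flag, here is a minimal 
sketch. It does not use \C itself; it assumes that bytes::length reports the 
same internal bytes that \C walks over. The helper name internal_byte_count 
is hypothetical and not part of URI.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling the bytes pragma

# Hypothetical helper: number of bytes in Perl's internal storage of a string.
sub internal_byte_count {
    my ($str) = @_;
    return bytes::length($str);
}

# U+00E9 can be stored internally as the single byte E9 ...
my $native = "\xE9";

# ... or, after an upgrade, as the two UTF-8 bytes C3 A9.
my $upgraded = "\xE9";
utf8::upgrade($upgraded);

# At the character level the two strings are indistinguishable.
print "equal as strings\n" if $native eq $upgraded;
print "characters: ", length($native), " and ", length($upgraded), "\n";

# The internal byte counts differ, so byte-wise matching sees different data.
print "internal bytes: ", internal_byte_count($native),
      " and ", internal_byte_count($upgraded), "\n";
```

Two strings that compare equal can therefore be escaped differently, 
depending only on how each one happened to be stored.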

--Gisle


On May 3, 2010, at 20:34, Michael Ludwig wrote:

"Don't use the \C escape in regexes" - taken from Juerd's Unicode Advice page:

 http://juerd.nl/site.plp/perluniadvice

Why not?

------ perldoc perlre:
\C  Match a single C char (octet) even under Unicode.
   NOTE: breaks up characters into their UTF-8 bytes,
   so you may end up with malformed pieces of UTF-8.
   Unsupported in lookbehind.

------ URI::Escape
sub escape_char {
   return join '', @URI::Escape::escapes{$_[0] =~ /(\C)/g};
}

The regular expression is used to disassemble an incoming text string into 
individual bytes (and then use the resulting list in a hash slice). It is a 
legitimate use case, and the means seem to do the job. What's the problem 
with the \C escape?
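
For comparison, the same disassembly can be written without \C by encoding 
the string explicitly first. This is only a sketch, assuming the caller wants 
percent-encoded UTF-8; it is not what URI::Escape's escape_char does. The 
name escape_char_utf8 is hypothetical, and sprintf stands in for the 
%URI::Escape::escapes lookup table.

```perl
use strict;
use warnings;

# Hypothetical variant of escape_char: encode to UTF-8 octets explicitly,
# then percent-escape each octet.
sub escape_char_utf8 {
    my ($str) = @_;          # work on a copy; utf8::encode modifies in place
    utf8::encode($str);      # now a byte string holding the UTF-8 encoding
    return join '', map { sprintf '%%%02X', $_ } unpack 'C*', $str;
}

my $native   = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # same characters, different internal storage

# Both calls see the same characters, so both produce the same escapes.
print escape_char_utf8($native),   "\n";
print escape_char_utf8($upgraded), "\n";
```

Because the encoding step is explicit, the result depends only on the 
characters in the string and not on its internal representation.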

-- 
Michael.Ludwig (#) XING.com