uri_escape and UTF-8 flag

Lincoln and Gisle,

I happen to find that your codes that escape URI has a problem withPerl 5.8. First try the script below on Perl 5.8.0.


use strict;
require CGI::Util;
require URI::Escape;
my $uri = "\x{5c0f}\x{98fc} \x{5f3e}.txt"; # KOGAI, Dan, in Kanji
print "CGI::Util =>   ", CGI::Util::escape($uri), "\n";
print "URI::Escape => ", URI::Escape::uri_escape($uri), "\n";
my $uri_bytes = pack("C*", unpack("C*", $uri));
print "CGI::Util =>   ", CGI::Util::escape($uri_bytes), "\n";
print "URI::Escape => ", URI::Escape::uri_escape($uri_bytes), "\n";
__END__

It will print as follows;

CGI::Util =>   %5C0F%98FC%20%5F3E.txt
URI::Escape => %20.txt
CGI::Util =>   %E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt
URI::Escape => %E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt

The reason is simple; In Perl 5.8.0 and later, character class matchescharacter, not single byte. That's why CGI::Util prints %5C0F Thatcorresponds to ord("\x{5c0f}") and URI::Escape prints none (becauseURI::Escape uses lookup hash which is empty for "\x{5c0f}"). Thelatter half is the correct output.


CGI::Util

sub escape {
shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq$CGI::Default
Class));
  my $toencode = shift;
  return undef unless defined($toencode);
    if ($EBCDIC) {
$toencode=~s/([^a-zA-Z0-9_.-])/ucsprintf("%%%02x",$E2A[ord($1)])/eg;
    } else {
      $toencode=~s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",ord($1))/eg;
    }
  return $toencode;
}


URI::Escape

sub uri_escape
{
    my($text, $patn) = @_;
    return undef unless defined $text;
    if (defined $patn){
        unless (exists  $subst{$patn}) {

# Because we can't compile the regex we fake it with acached sub

            (my $tmp = $patn) =~ s,/,\\/,g;

eval "\$subst{\$patn} = sub {\$_[0] =~s/([$tmp])/\$escapes{\$1}/g;

}";
            Carp::croak("uri_escape: $@") if $@;
        }
        &{$subst{$patn}}($text);
    } else {
        # Default unsafe characters.  RFC 2732 ^(uric - reserved)
        $text =~ s/([^A-Za-z0-9\-_.!~*'()])/$escapes{$1}/g;
    }
    $text;
}

The solution is already shown in the example script; Apply pack("C*",unpack("C*", $string)) before feeding it to the regex. This guaranteesboth forward and backward compatibility. No Encode is used forbackward compatibility.

Though this function is rather trivial to reinvent (and may scriptsthat use CGI.pm and LWP do reinvent this), It would be nice thatstandard modules are up to date.


Dan the Encode Maintainer