perl-unicode

uri_escape and UTF-8 flag

2003-04-08 23:30:05
Lincoln and Gisle,

I happen to find that your codes that escape URI has a problem with Perl 5.8. First try the script below on Perl 5.8.0.

use strict;
require CGI::Util;
require URI::Escape;
my $uri = "\x{5c0f}\x{98fc} \x{5f3e}.txt"; # KOGAI, Dan, in Kanji
print "CGI::Util =>   ", CGI::Util::escape($uri), "\n";
print "URI::Escape => ", URI::Escape::uri_escape($uri), "\n";
my $uri_bytes = pack("C*", unpack("C*", $uri));
print "CGI::Util =>   ", CGI::Util::escape($uri_bytes), "\n";
print "URI::Escape => ", URI::Escape::uri_escape($uri_bytes), "\n";
__END__

It will print as follows;
CGI::Util =>   %5C0F%98FC%20%5F3E.txt
URI::Escape => %20.txt
CGI::Util =>   %E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt
URI::Escape => %E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt

The reason is simple; In Perl 5.8.0 and later, character class matches character, not single byte. That's why CGI::Util prints %5C0F That corresponds to ord("\x{5c0f}") and URI::Escape prints none (because URI::Escape uses lookup hash which is empty for "\x{5c0f}"). The latter half is the correct output.

CGI::Util
sub escape {
shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::Default
Class));
  my $toencode = shift;
  return undef unless defined($toencode);
    if ($EBCDIC) {
$toencode=~s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",$E2A[ord($1)])/eg;
    } else {
      $toencode=~s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",ord($1))/eg;
    }
  return $toencode;
}

URI::Escape
sub uri_escape
{
    my($text, $patn) = @_;
    return undef unless defined $text;
    if (defined $patn){
        unless (exists  $subst{$patn}) {
# Because we can't compile the regex we fake it with a cached sub
            (my $tmp = $patn) =~ s,/,\\/,g;
eval "\$subst{\$patn} = sub {\$_[0] =~ s/([$tmp])/\$escapes{\$1}/g;
}";
            Carp::croak("uri_escape: $@") if $@;
        }
        &{$subst{$patn}}($text);
    } else {
        # Default unsafe characters.  RFC 2732 ^(uric - reserved)
        $text =~ s/([^A-Za-z0-9\-_.!~*'()])/$escapes{$1}/g;
    }
    $text;
}

The solution is already shown in the example script; Apply pack("C*", unpack("C*", $string)) before feeding it to the regex. This guarantees both forward and backward compatibility. No Encode is used for backward compatibility.

Though this function is rather trivial to reinvent (and may scripts that use CGI.pm and LWP do reinvent this), It would be nice that standard modules are up to date.

Dan the Encode Maintainer

<Prev in Thread] Current Thread [Next in Thread>