Lincoln and Gisle,
I happen to find that your codes that escape URI has a problem with
Perl 5.8. First try the script below on Perl 5.8.0.
use strict;
require CGI::Util;
require URI::Escape;
my $uri = "\x{5c0f}\x{98fc} \x{5f3e}.txt"; # KOGAI, Dan, in Kanji
print "CGI::Util => ", CGI::Util::escape($uri), "\n";
print "URI::Escape => ", URI::Escape::uri_escape($uri), "\n";
my $uri_bytes = pack("C*", unpack("C*", $uri));
print "CGI::Util => ", CGI::Util::escape($uri_bytes), "\n";
print "URI::Escape => ", URI::Escape::uri_escape($uri_bytes), "\n";
__END__
It will print as follows;
CGI::Util => %5C0F%98FC%20%5F3E.txt
URI::Escape => %20.txt
CGI::Util => %E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt
URI::Escape => %E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt
The reason is simple; In Perl 5.8.0 and later, character class matches
character, not single byte. That's why CGI::Util prints %5C0F That
corresponds to ord("\x{5c0f}") and URI::Escape prints none (because
URI::Escape uses lookup hash which is empty for "\x{5c0f}"). The
latter half is the correct output.
CGI::Util
sub escape {
shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq
$CGI::Default
Class));
my $toencode = shift;
return undef unless defined($toencode);
if ($EBCDIC) {
$toencode=~s/([^a-zA-Z0-9_.-])/uc
sprintf("%%%02x",$E2A[ord($1)])/eg;
} else {
$toencode=~s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",ord($1))/eg;
}
return $toencode;
}
URI::Escape
sub uri_escape
{
my($text, $patn) = @_;
return undef unless defined $text;
if (defined $patn){
unless (exists $subst{$patn}) {
# Because we can't compile the regex we fake it with a
cached sub
(my $tmp = $patn) =~ s,/,\\/,g;
eval "\$subst{\$patn} = sub {\$_[0] =~
s/([$tmp])/\$escapes{\$1}/g;
}";
Carp::croak("uri_escape: $@") if $@;
}
&{$subst{$patn}}($text);
} else {
# Default unsafe characters. RFC 2732 ^(uric - reserved)
$text =~ s/([^A-Za-z0-9\-_.!~*'()])/$escapes{$1}/g;
}
$text;
}
The solution is already shown in the example script; Apply pack("C*",
unpack("C*", $string)) before feeding it to the regex. This guarantees
both forward and backward compatibility. No Encode is used for
backward compatibility.
Though this function is rather trivial to reinvent (and may scripts
that use CGI.pm and LWP do reinvent this), It would be nice that
standard modules are up to date.
Dan the Encode Maintainer