perl-unicode

Re: uri_escape and UTF-8 flag

2003-04-11 02:30:04
Hi Dan,

I don't understand this stuff, but send me a patch file and I will incorporate 
it, since I trust you to get it right.

Lincoln

On Wednesday 09 April 2003 01:43 am, Dan Kogai wrote:
Lincoln and Gisle,

   I happened to find that your code for escaping URIs has a problem with
Perl 5.8.  First, try the script below on Perl 5.8.0.

use strict;
require CGI::Util;
require URI::Escape;
my $uri = "\x{5c0f}\x{98fc} \x{5f3e}.txt"; # KOGAI, Dan, in Kanji
print "CGI::Util =>   ", CGI::Util::escape($uri), "\n";
print "URI::Escape => ", URI::Escape::uri_escape($uri), "\n";
my $uri_bytes = pack("C*", unpack("C*", $uri));
print "CGI::Util =>   ", CGI::Util::escape($uri_bytes), "\n";
print "URI::Escape => ", URI::Escape::uri_escape($uri_bytes), "\n";
__END__

It prints the following:

CGI::Util =>   %5C0F%98FC%20%5F3E.txt
URI::Escape => %20.txt
CGI::Util =>   %E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt
URI::Escape => %E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt

The reason is simple: in Perl 5.8.0 and later, a character class matches
a character, not a single byte.  That is why CGI::Util prints %5C0F, which
corresponds to ord("\x{5c0f}"), and why URI::Escape prints nothing for
that character (URI::Escape uses a lookup hash, which has no entry for
"\x{5c0f}").  The latter half of the output is the correct one.
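
The two failure modes can be reproduced in isolation.  The following is a
minimal sketch, not the modules' verbatim code: the %escapes table is an
illustrative reconstruction of a byte-oriented lookup hash covering 0..255,
and the variable names are made up for this example.

```perl
use strict;
use warnings;

# Illustrative byte-oriented lookup table covering code points 0..255 only.
my %escapes;
$escapes{chr($_)} = sprintf("%%%02X", $_) for 0 .. 255;

my $char = "\x{5c0f}";    # a single character with a code point above 255

# The regex hands the whole character to ord(), which returns 0x5C0F,
# so "%02x" produces four hex digits instead of two.
my $via_ord = uc sprintf("%%%02x", ord($char));
print "sprintf/ord:  $via_ord\n";

# The table has no key for this character, so a lookup yields nothing.
my $in_table = exists $escapes{$char} ? "found" : "missing";
print "table lookup: $in_table\n";
```

The first line shows the over-long escape that CGI::Util emits; the second
shows why a table lookup silently drops the character.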

CGI::Util

sub escape {
  shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
  my $toencode = shift;
  return undef unless defined($toencode);
    if ($EBCDIC) {
      $toencode=~s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",$E2A[ord($1)])/eg;
    } else {
      $toencode=~s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",ord($1))/eg;
    }
  return $toencode;
}

URI::Escape

sub uri_escape
{
    my($text, $patn) = @_;
    return undef unless defined $text;
    if (defined $patn){
        unless (exists  $subst{$patn}) {
            # Because we can't compile the regex we fake it with a cached sub
            (my $tmp = $patn) =~ s,/,\\/,g;
            eval "\$subst{\$patn} = sub {\$_[0] =~ s/([$tmp])/\$escapes{\$1}/g; }";
            Carp::croak("uri_escape: $@") if $@;
        }
        &{$subst{$patn}}($text);
    } else {
        # Default unsafe characters.  RFC 2732 ^(uric - reserved)
        $text =~ s/([^A-Za-z0-9\-_.!~*'()])/$escapes{$1}/g;
    }
    $text;
}

The solution is already shown in the example script: apply pack("C*",
unpack("C*", $string)) to the string before feeding it to the regex, so
that the regex sees bytes rather than characters.  This guarantees both
forward and backward compatibility.  Encode is deliberately not used, so
that older Perls keep working.
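
The same idea, converting the string to bytes first and escaping afterwards,
can be sketched as a small wrapper.  This is only an illustration under
stated assumptions: escape_as_utf8_bytes is a hypothetical helper, not part
of CGI::Util or URI::Escape, and it swaps in the built-in utf8::encode
(assuming Perl 5.8 or later) for the pack/unpack idiom, since the C
template's handling of wide characters may differ between Perl releases.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a character string into its UTF-8 bytes,
# then percent-escape every byte outside the default safe set.
# Assumes the argument is a character string; a string that already
# holds encoded bytes would end up double-encoded.
sub escape_as_utf8_bytes {
    my ($text) = @_;
    return undef unless defined $text;

    my $bytes = $text;       # work on a copy, leave the caller's string alone
    utf8::encode($bytes);    # in place: characters become UTF-8 bytes

    # Every match is now a single byte, so ord() never exceeds 255.
    $bytes =~ s/([^A-Za-z0-9\-_.!~*'()])/sprintf("%%%02X", ord($1))/ge;
    return $bytes;
}

print escape_as_utf8_bytes("\x{5c0f}\x{98fc} \x{5f3e}.txt"), "\n";
```

With the string from the example script, this should give the same escaped
form as the latter half of the output shown earlier.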

Though this function is rather trivial to reinvent (and many scripts
that use CGI.pm and LWP do reinvent it), it would be nice if the
standard modules were up to date.

Dan the Encode Maintainer

-- 
Lincoln Stein
lstein(_at_)cshl(_dot_)org
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
