perl-unicode

Re: CGI::Util unescape() after escape() loses utf8 flag

2005-09-28 02:02:09

khadrin(_at_)columbus(_dot_)rr(_dot_)com said:
CGI::Util has a couple of functions, escape() and unescape(), which
URL-encode and decode strings.  Unfortunately I lose the utf8 flag on my scalar when I
encode then decode using those functions (see below).  Should unescape()
be setting the utf8 flag? Or is there no way for unescape() to know that
it should set the utf8 flag?

Looking at the source for CGI::Util, it appears that disabling the utf8 
flag is intended as a feature, not a bug:

# URL-encode data
sub escape {
  shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
  my $toencode = shift;
  return undef unless defined($toencode);
  # force bytes while preserving backward compatibility -- dankogai
  $toencode = pack("C*", unpack("C*", $toencode));
    if ($EBCDIC) {
      $toencode=~s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",$E2A[ord($1)])/eg;
    } else {
      $toencode=~s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",ord($1))/eg;
    }
  return $toencode;
}

Given how escape() and unescape() are written, I would guess that
unescape() has no way to know whether a given input string should be
decoded as UTF-8.  Only the calling application can know that, and it
should apply the conversion to the output of unescape().  CGI::Util is too
general-purpose to make assumptions about character encodings.
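
To illustrate decoding on the caller's side, here is a minimal sketch. It
assumes the data is UTF-8 and uses the core Encode module. The helpers
url_escape_bytes() and url_unescape_bytes() are hypothetical stand-ins written
for this example, not the CGI::Util functions themselves; they only mimic
the byte-oriented behaviour discussed above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-ins for escape()/unescape(): they work on bytes only
# and know nothing about character encodings.
sub url_escape_bytes {
    my ($bytes) = @_;
    $bytes =~ s/([^a-zA-Z0-9_.-])/sprintf("%%%02X", ord($1))/eg;
    return $bytes;
}

sub url_unescape_bytes {
    my ($escaped) = @_;
    $escaped =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/eg;
    return $escaped;
}

# A character string containing non-ASCII characters.
my $text = "caf\x{e9} \x{263a}";

# The caller encodes to UTF-8 octets before escaping ...
my $escaped = url_escape_bytes(encode('UTF-8', $text));

# ... gets plain octets back from unescaping (no utf8 flag) ...
my $bytes = url_unescape_bytes($escaped);

# ... and decodes them itself, since only the caller knows the encoding.
my $decoded = decode('UTF-8', $bytes);

print "round trip ok\n" if $decoded eq $text;
```

With the real CGI::Util functions the same pattern should apply: treat the
result of unescape() as octets and call decode() on it in the application.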

Since Dan Kogai is a frequent contributor to this list, he might have more
to say on this.

        David Graff