Re: Unicode vs. \s [Was: Re: Encode::Unicode::UTF7]


On Sat, 17 May 2003 23:44:29 +0900
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> wrote:

On Saturday, May 17, 2003, at 03:18  AM, Dan Kogai wrote:

Whole module right after my signature.  It is based upon Unicode::String.
"Some" body please fill in the POD section.  A test suite is even more 
welcome.

Dan the Encode Maintainer


I found that we can't use \s in $re_asis because \s matches U+3000 
(IDEOGRAPHIC SPACE), which needs to be encoded.  A few RFC readings 
later, I concluded that we should use \x00-\x20, all ASCII controls 
plus white space.  Here is the patch.


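\s => qr/[\n\r\t\ ]/x;
It's a bad idea to use \x00-\x20: that range also covers the other ASCII
control characters, which Rule 3 below does not allow to be represented
directly.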

cf. RFC 2152 (UTF-7), which says:

      Rule 3: The space (decimal 32), tab (decimal 9), carriage return
      (decimal 13), and line feed (decimal 10) characters may be
      directly represented by their ASCII equivalents. However, note
      that MIME content transfer encodings have rules concerning the use
      of such characters. Usage that does not conform to the
      restrictions of RFC 822, for example, would have to be encoded
      using MIME content transfer encodings other than 7bit or 8bit,
      such as quoted-printable, binary, or base64.
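The difference between \s and the explicit class can be checked with a short script. This is only a sketch: the variable name $re_ws is made up for illustration and is not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical name: the explicit class suggested in place of \s.
my $re_ws = qr/[\n\r\t\ ]/x;

my $ideographic_space = "\x{3000}";   # U+3000 IDEOGRAPHIC SPACE

# On a character string, \s follows Unicode rules and matches U+3000.
my $s_matches = $ideographic_space =~ /\A\s\z/ ? 1 : 0;

# The explicit class accepts only the four characters named in Rule 3.
my $class_matches = $ideographic_space =~ /\A$re_ws\z/ ? 1 : 0;

# Code points in \x00-\x20 that the explicit class lets through.
my @direct = grep { chr($_) =~ /\A$re_ws\z/ } 0x00 .. 0x20;

print "\\s matches U+3000:    $s_matches\n";
print "class matches U+3000: $class_matches\n";
print "direct code points:   @direct\n";
```

Only tab, line feed, carriage return and space pass the explicit class; every other code point in \x00-\x20 would be left for the base64 branch.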

===================================================================
RCS file: lib/Encode/Unicode/UTF7.pm,v
retrieving revision 0.1
diff -u -r0.1 lib/Encode/Unicode/UTF7.pm
--- lib/Encode/Unicode/UTF7.pm  2003/05/16 18:06:24     0.1
+++ lib/Encode/Unicode/UTF7.pm  2003/05/17 14:21:21
@@ -18,8 +18,10 @@
  my $specials =   quotemeta "\'(),-.:?";
  $OPTIONAL_DIRECT_CHARS and
      $specials .= quotemeta "!\"#$%&*;<=>@[]^_`{|}";


$specials is missing '/'.
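A minimal sketch of the class with '/' included, assuming the direct sets are exactly RFC 2152's Set D and Set O. The names $re_direct, @direct and @encoded are invented for this example, and the optional set is written in single quotes here so that nothing in it gets interpolated.

```perl
use strict;
use warnings;

my $OPTIONAL_DIRECT_CHARS = 1;   # assumed setting for this sketch

# Set D punctuation from RFC 2152, with '/' included.
my $specials = quotemeta "'(),-./:?";

# Set O (optional direct characters).
$OPTIONAL_DIRECT_CHARS and
    $specials .= quotemeta '!"#$%&*;<=>@[]^_`{|}';

# Hypothetical helper regex: one directly representable printable character.
my $re_direct = qr/\A[A-Za-z0-9$specials]\z/;

# Split printable ASCII into directly representable and must-be-encoded.
my @direct  = grep { chr($_) =~ $re_direct } 0x21 .. 0x7E;
my @encoded = map  { chr($_) } grep { chr($_) !~ $re_direct } 0x21 .. 0x7E;

print "direct:  ", scalar(@direct), " characters\n";
print "encoded: @encoded\n";
```

With the optional set enabled, the only printable ASCII characters left out are '+', '\' and '~'. '/' is accepted as a direct character.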

Since this is derived from Unicode::String->utf7(), I am mailing this 
also to Gisle so the corresponding part in Unicode::String can be fixed.

 From Unicode::String

 208:            if (($UTF7_OPTIONAL_DIRECT_CHARS &&
 209:                 $$self =~ /\G((?:\0[A-Za-z0-9\'\(\)\,\-\.\/\:\?\!\"\#\$\%\&\*\;\<\=\>\@\[\]\^\_\`\{\|\}\s])+)/gc)
 210:                || $$self =~ /\G((?:\0[A-Za-z0-9\'\(\)\,\-\.\/\:\?\s])+)/gc)
[snip]
 215:            elsif (($UTF7_OPTIONAL_DIRECT_CHARS &&
 216:                    $$self =~ /\G((?:[^\0].|\0[^A-Za-z0-9\'\(\)\,\-\.\/\:\?\!\"\#\$\%\&\*\;\<\=\>\@\[\]\^\_\`\{\|\}\s])+)/gsc)
 217:                   || $$self =~ /\G((?:[^\0].|\0[^A-Za-z0-9\'\(\)\,\-\.\/\:\?\s])+)/gsc)


In the era of Unicode, beware of \s.

Dan the Encode Maintainer


In the case of Unicode::String,
the string that $self refers to is encoded in UCS-2,
so the regex sees it as a sequence of bytes.
So chr(0x3000) never occurs there, and \s cannot match it.
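This can be illustrated with Encode. The sketch assumes big-endian UCS-2, as the \0-first patterns in Unicode::String imply, and uses a cut-down pattern that keeps only the \s part of the class. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ideographic_space = "\x{3000}";
my $ascii_space       = " ";

# As UCS-2BE, U+3000 becomes the two bytes 0x30 0x00.
my $wide_bytes  = encode('UCS-2BE', $ideographic_space);
my $ascii_bytes = encode('UCS-2BE', $ascii_space);   # 0x00 0x20

# Simplified pattern: a NUL high byte followed by a whitespace low byte.
my $wide_matched  = $wide_bytes  =~ /\G((?:\0[\s])+)/gc ? 1 : 0;
my $ascii_matched = $ascii_bytes =~ /\G((?:\0[\s])+)/gc ? 1 : 0;

print "U+3000 bytes:   ", unpack('H*', $wide_bytes), "\n";
print "U+3000 matched: $wide_matched\n";
print "U+0020 matched: $ascii_matched\n";
```

The high byte of U+3000 is 0x30 rather than NUL, so the pair never reaches the \s test. Only code points below U+0100 can get that far.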

It is already documented that some expressions like \s and \w
have a different meaning for Unicode.

http://www.perldoc.com/perl5.8.0/pod/perlunicode.html#Security-Implications-of-Unicode

But even before that, it was documented that \d, \s, \w, \D, \S, \W may have
a different meaning depending on the locale.

http://www.perldoc.com/perl5.005_03/pod/perlre.html

SADAHIRO Tomoyuki