On Sat, 17 May 2003 23:44:29 +0900
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> wrote:
On Saturday, May 17, 2003, at 03:18 AM, Dan Kogai wrote:
Whole module right after my signature. Based upon Unicode::String
"some" body please fill in the POD section. Test suite is even more
welcome.
Dan the Encode Maintainer
I found that we can't use \s in $re_asis because \s matches U+3000
(IDEOGRAPHIC SPACE), which needs to be encoded. A few RFC readings
later, I concluded that we should use \x00-\x20, all ASCII controls
plus white space. Here is the patch.
\s => qr/[\n\r\t\ ]/x;
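To illustrate the U+3000 problem, here is a minimal sketch comparing \s with the explicit class above on a character string. The variable names are illustrative only and are not taken from the module.

```perl
use strict;
use warnings;

# IDEOGRAPHIC SPACE as a character string (not bytes).
my $ideographic_space = "\x{3000}";

# \s follows Unicode semantics on character strings, so it matches U+3000.
my $matches_s = $ideographic_space =~ /\s/ ? 1 : 0;

# The explicit class lists only LF, CR, TAB and the ASCII space.
my $matches_explicit = $ideographic_space =~ /[\n\r\t\ ]/x ? 1 : 0;

printf "\\s matches U+3000:            %d\n", $matches_s;
printf "[\\n\\r\\t\\ ] matches U+3000:    %d\n", $matches_explicit;
```

With the explicit class, U+3000 falls through to the "needs encoding" branch, which is the behaviour wanted here.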
It's a bad idea to use \x00-\x20.
Cf. RFC 2152 (UTF-7), which says:
Rule 3: The space (decimal 32), tab (decimal 9), carriage return
(decimal 13), and line feed (decimal 10) characters may be
directly represented by their ASCII equivalents. However, note
that MIME content transfer encodings have rules concerning the use
of such characters. Usage that does not conform to the
restrictions of RFC 822, for example, would have to be encoded
using MIME content transfer encodings other than 7bit or 8bit,
such as quoted-printable, binary, or base64.
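As a rough check of how far \x00-\x20 goes beyond Rule 3, the sketch below counts the code points in that range which Rule 3 does not list. The names $rule3_direct and @not_listed are made up for this example.

```perl
use strict;
use warnings;

# The four characters Rule 3 allows to be represented directly:
# line feed (10), carriage return (13), tab (9) and space (32).
my $rule3_direct = qr/[\n\r\t\ ]/;

# Every code point in \x00-\x20 that Rule 3 does NOT mention.
my @not_listed = grep { chr($_) !~ $rule3_direct } 0x00 .. 0x20;

printf "%d of 33 code points in \\x00-\\x20 are not covered by Rule 3\n",
    scalar @not_listed;
printf "first few: %s\n",
    join ' ', map { sprintf '0x%02X', $_ } @not_listed[0 .. 4];
```

Treating the whole range as directly representable would pass NUL and the other control characters through unencoded, which Rule 3 does not sanction.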
===================================================================
RCS file: lib/Encode/Unicode/UTF7.pm,v
retrieving revision 0.1
diff -u -r0.1 lib/Encode/Unicode/UTF7.pm
--- lib/Encode/Unicode/UTF7.pm 2003/05/16 18:06:24 0.1
+++ lib/Encode/Unicode/UTF7.pm 2003/05/17 14:21:21
@@ -18,8 +18,10 @@
my $specials = quotemeta "\'(),-.:?";
$OPTIONAL_DIRECT_CHARS and
$specials .= quotemeta "!\"#$%&*;<=>@[]^_`{|}";
$specials is missing '/'.
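A small sketch of the effect, assuming $specials ends up inside a bracketed character class of directly encoded characters. How the module actually uses it is not shown here, so $direct_patch and $direct_fixed are hypothetical.

```perl
use strict;
use warnings;

# Character list as it appears in the patch (no '/').
my $specials_patch = quotemeta "\'(),-.:?";

# The same list with '/' added, as in Set D of RFC 2152.
my $specials_fixed = quotemeta "\'(),-./:?";

# Hypothetical "directly encoded" classes built from each list.
my $direct_patch = qr/^[A-Za-z0-9${specials_patch}]+$/;
my $direct_fixed = qr/^[A-Za-z0-9${specials_fixed}]+$/;

my $sample   = "text/plain";
my $ok_patch = $sample =~ $direct_patch ? 1 : 0;
my $ok_fixed = $sample =~ $direct_fixed ? 1 : 0;

printf "patch list treats '%s' as direct: %d\n", $sample, $ok_patch;
printf "fixed list treats '%s' as direct: %d\n", $sample, $ok_fixed;
```

Without '/', a string such as "text/plain" would be shifted into base64 even though RFC 2152 allows it to stay as is.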
Since this is derived from Unicode::String->utf7(), I am mailing this
also to Gisle so the corresponding part in Unicode::String can be fixed.
From Unicode::String
if (($UTF7_OPTIONAL_DIRECT_CHARS &&
     $$self =~ /\G((?:\0[A-Za-z0-9\'\(\)\,\-\.\/\:\?\!\"\#\$\%\&\*\;\<\=\>\@\[\]\^\_\`\{\|\}\s])+)/gc)
    || $$self =~ /\G((?:\0[A-Za-z0-9\'\(\)\,\-\.\/\:\?\s])+)/gc)
[snip]
elsif (($UTF7_OPTIONAL_DIRECT_CHARS &&
        $$self =~ /\G((?:[^\0].|\0[^A-Za-z0-9\'\(\)\,\-\.\/\:\?\!\"\#\$\%\&\*\;\<\=\>\@\[\]\^\_\`\{\|\}\s])+)/gsc)
       || $$self =~ /\G((?:[^\0].|\0[^A-Za-z0-9\'\(\)\,\-\.\/\:\?\s])+)/gsc)
In the era of Unicode, beware of \s.
Dan the Encode Maintainer
In the case of Unicode::String, the referent of $self is encoded
in UCS-2, so chr(0x3000) never occurs in it.
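A sketch of why that holds, assuming the internal form is big-endian UCS-2 held as a byte string. It uses Encode's "UCS-2BE" to produce the bytes, and the pattern is a shortened stand-in for the Unicode::String one.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+3000 in big-endian UCS-2 is the two bytes 0x30 0x00.
my $ucs2_ideo  = encode("UCS-2BE", "\x{3000}");
# An ASCII space is the two bytes 0x00 0x20.
my $ucs2_space = encode("UCS-2BE", " ");

my $hex_ideo  = unpack("H*", $ucs2_ideo);
my $hex_space = unpack("H*", $ucs2_space);

# Shortened stand-in for the "direct characters" pattern: a NUL high
# byte followed by one low byte from the class.
my $direct = qr/\G((?:\0[A-Za-z0-9\s])+)/;

my $ideo_direct  = $ucs2_ideo  =~ $direct ? 1 : 0;
my $space_direct = $ucs2_space =~ $direct ? 1 : 0;

printf "U+3000 bytes: %s, treated as direct: %d\n", $hex_ideo,  $ideo_direct;
printf "U+0020 bytes: %s, treated as direct: %d\n", $hex_space, $space_direct;
```

Because \s is only ever tested against a single low byte that follows a NUL high byte, it cannot see U+3000 as one character there.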
It has already been documented that some expressions such as \s and \w
have a different meaning under Unicode:
http://www.perldoc.com/perl5.8.0/pod/perlunicode.html#Security-Implications-of-Unicode
And even before that, it was documented that \d, \s, \w, \D, \S and \W
may have a different meaning depending on the locale:
http://www.perldoc.com/perl5.005_03/pod/perlre.html
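For the Unicode half of this, a minimal sketch of \d and \w matching non-ASCII characters on a character string. The sample characters are chosen only for illustration, and locale behaviour is not shown because it depends on the system.

```perl
use strict;
use warnings;

my $fullwidth_one = "\x{FF11}";   # FULLWIDTH DIGIT ONE
my $hiragana_a    = "\x{3042}";   # HIRAGANA LETTER A

# Shorthand classes follow Unicode properties on character strings.
my $d_shorthand = $fullwidth_one =~ /^\d$/ ? 1 : 0;
my $w_shorthand = $hiragana_a    =~ /^\w$/ ? 1 : 0;

# Explicit ASCII ranges do not.
my $d_explicit = $fullwidth_one =~ /^[0-9]$/        ? 1 : 0;
my $w_explicit = $hiragana_a    =~ /^[A-Za-z0-9_]$/ ? 1 : 0;

printf "\\d: %d  [0-9]: %d\n",          $d_shorthand, $d_explicit;
printf "\\w: %d  [A-Za-z0-9_]: %d\n",   $w_shorthand, $w_explicit;
```

When a specification enumerates ASCII characters, as RFC 2152 does, spelling the class out avoids this kind of surprise.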
SADAHIRO Tomoyuki