perl-unicode

Unicode vs. \s [Was: Re: Encode::Unicode::UTF7]

2003-05-17 08:30:06
On Saturday, May 17, 2003, at 03:18  AM, Dan Kogai wrote:
The whole module follows right after my signature. It is based upon Unicode::String. Somebody, please fill in the POD section. A test suite is even more welcome.

Dan the Encode Maintainer

I found that we can't use \s in $re_asis because \s matches U+3000 (IDEOGRAPHIC SPACE), which needs to be encoded. A few RFC readings later, I concluded that we should use \x00-\x20, all ASCII controls plus white space. Here is the patch.

===================================================================
RCS file: lib/Encode/Unicode/UTF7.pm,v
retrieving revision 0.1
diff -u -r0.1 lib/Encode/Unicode/UTF7.pm
--- lib/Encode/Unicode/UTF7.pm  2003/05/16 18:06:24     0.1
+++ lib/Encode/Unicode/UTF7.pm  2003/05/17 14:21:21
@@ -18,8 +18,10 @@
 my $specials =   quotemeta "\'(),-.:?";
 $OPTIONAL_DIRECT_CHARS and
     $specials .= quotemeta "!\"#$%&*;<=>@[]^_`{|}";
-my $re_asis =     qr/(?:[\sA-Za-z0-9$specials])/;
-my $re_encoded = qr/(?:[^\sA-Za-z0-9$specials])/;
+# \s will not work because it matches U+3000 IDEOGRAPHIC SPACE
+# We use \x00-\x20 instead (controls + space)
+my $re_asis =     qr/(?:[\x00-\x20A-Za-z0-9$specials])/;
+my $re_encoded = qr/(?:[^\x00-\x20A-Za-z0-9$specials])/;
 my $e_utf16 = find_encoding("UTF-16BE");

 sub needs_lines { 1 };

Since this is derived from Unicode::String->utf7(), I am mailing this also to Gisle so the corresponding part in Unicode::String can be fixed.

From Unicode::String
    if (($UTF7_OPTIONAL_DIRECT_CHARS &&
         $$self =~ /\G((?:\0[A-Za-z0-9\'\(\)\,\-\.\/\:\?\!\"\#\$\%\&\*\;\<\=\>\@\[\]\^\_\`\{\|\}\s])+)/gc)
        || $$self =~ /\G((?:\0[A-Za-z0-9\'\(\)\,\-\.\/\:\?\s])+)/gc)
[snip]
    elsif (($UTF7_OPTIONAL_DIRECT_CHARS &&
            $$self =~ /\G((?:[^\0].|\0[^A-Za-z0-9\'\(\)\,\-\.\/\:\?\!\"\#\$\%\&\*\;\<\=\>\@\[\]\^\_\`\{\|\}\s])+)/gsc)
           || $$self =~ /\G((?:[^\0].|\0[^A-Za-z0-9\'\(\)\,\-\.\/\:\?\s])+)/gsc)

In the era of Unicode, beware of \s.
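
For anyone who wants to see the difference for themselves, here is a minimal sketch comparing a class built on \s with one built on \x00-\x20. The names $re_old, $re_new and is_direct are made up for this illustration and are not part of Encode::Unicode::UTF7; the classes are also simplified (no $specials).

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-ins for the classes in the patch
# (the real $re_asis also interpolates $specials).
my $re_old = qr/[\sA-Za-z0-9]/;          # built on \s, as before the patch
my $re_new = qr/[\x00-\x20A-Za-z0-9]/;   # built on \x00-\x20, as after the patch

# Returns 1 if the single character $char matches the class $re, else 0.
sub is_direct {
    my ($re, $char) = @_;
    return $char =~ /\A$re\z/ ? 1 : 0;
}

my $ideographic_space = "\x{3000}";   # U+3000 IDEOGRAPHIC SPACE
my $ascii_space       = " ";          # U+0020 SPACE

printf "U+3000 old=%d new=%d\n",
    is_direct($re_old, $ideographic_space),
    is_direct($re_new, $ideographic_space);
printf "U+0020 old=%d new=%d\n",
    is_direct($re_old, $ascii_space),
    is_direct($re_new, $ascii_space);
```

With the \s-based class, U+3000 is treated as a character that can be passed through as-is; with the \x00-\x20 class it is not, so it falls through to the encoded branch.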

Dan the Encode Maintainer
