perl-unicode

[Encode] HEADS-UP; $Encode::VERSION++ to enhance filter option

2003-01-26 01:30:07
Porters,

In recent discussions on several Japanese-language Perl mailing lists, I discovered that the encoding pragma does not work with multibyte encodings such as Shift_JIS, whose second byte can fall within the 0x00-0x7f range. Though I have not tested it, I am pretty sure Big5 is prone to this as well.

To understand the problem, please have a look at the hexdump below:

% hexdump -C enc-sjis.pl
00000000  23 2f 75 73 72 2f 6c 6f  63 61 6c 2f 62 69 6e 2f  |#/usr/local/bin/|
00000010  70 65 72 6c 20 2d 77 0a  75 73 65 20 73 74 72 69  |perl -w.use stri|
00000020  63 74 3b 0a 75 73 65 20  65 6e 63 6f 64 69 6e 67  |ct;.use encoding|
00000030  20 27 73 68 69 66 74 2d  6a 69 73 27 3b 0a 0a 6d  | 'shift-jis';..m|
00000040  79 20 24 6e 61 6d 65 20  3d 20 22 94 5c 22 3b 0a  |y $name = ".\";.|
00000050  70 72 69 6e 74 20 24 6e  61 6d 65 3b 0a 77 72 69  |print $name;.wri|
00000060  74 65 3b 0a 0a 66 6f 72  6d 61 74 20 53 54 44 4f  |te;..format STDO|
00000070  55 54 20 3d 0a 94 5c 97  cd 3a 40 3c 3c 3c 0a 24  |UT =..\..:@<<<.$|
00000080  6e 61 6d 65 0a 2e 0a                              |name...|

The script is valid Perl in Shift_JIS, but the quoted character (U+80fd, \x94\x5c in Shift_JIS) has \x5c, the ASCII backslash, as its second byte. Read ASCII-wise, that byte escapes the closing quote and mangles the script. For the encoding pragma to work, the script needs to be parsable ASCII-wise. Fortunately, the encoding pragma offers a different approach via Filter=>1. The problem is that the Filter option was incomplete in two ways.
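To see the offending byte without a hexdump, here is a minimal sketch using Encode. It only assumes that Encode's 'shiftjis' encoding is available; the variable names are mine and appear nowhere in encoding.pm.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+80FD is the character quoted in enc-sjis.pl.
my $bytes = encode('shiftjis', "\x{80fd}");

# Hex form of each byte of the Shift_JIS encoding.
my @hex = map { sprintf '%02x', ord } split //, $bytes;
print "bytes: @hex\n";

# The second (trail) byte has the same value as an ASCII backslash,
# which is why an ASCII-wise parser sees an escaped closing quote.
my $trail = substr($bytes, -1);
print "trail byte is a backslash\n" if $trail eq '\\';
```

Any character whose Shift_JIS trail byte is 0x5c triggers the same problem inside a double-quoted literal.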

0. Filter=>1 leaves STD(IN|OUT) untouched. Not only that, it completely ignores the STD*=> hooks that the non-filter version offers.

1. To touch STD(IN|OUT) sensibly, you have to 'use utf8' in the script to make sure the literals therein are utf8-flagged, but that makes the code too counterintuitive.
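The handle side of the two points above can be sketched roughly as follows. This is an illustration, not the code in encoding.pm: set_handle_encoding is a hypothetical helper, and an in-memory handle stands in for STDOUT so that the emitted bytes can be inspected. The idea is to check the encoding name with find_encoding and then push an :encoding layer with binmode, after which a character string is written out as Shift_JIS bytes.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: validate the encoding name, then push an
# :encoding layer onto the handle.
sub set_handle_encoding {
    my ($fh, $name) = @_;
    defined find_encoding($name)
        or die "Unknown encoding '$name'\n";
    binmode($fh, ":encoding($name)")
        or die "binmode failed: $!\n";
    return 1;
}

# An in-memory handle stands in for STDOUT (an assumption of this sketch).
my $buf = '';
open my $out, '>', \$buf or die "open failed: $!\n";
set_handle_encoding($out, 'shiftjis');

# A character string, as a literal would be once it is utf8-flagged.
print {$out} "\x{80fd}";
close $out;

# Show the bytes that reached the handle, in hex.
print unpack('H*', $buf), "\n";
```

The layer can only encode correctly when the string holds characters rather than raw Shift_JIS bytes, which is why the literals need to be utf8-flagged.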

The following patch fixes both problems so that the Filter option is more useful. I am planning to apply it to the next version of Encode, but I still need to fix the POD and write test suites, so I decided to issue a warning before committing a release.

Dan the Encode Maintainer

--- encoding.pm 2003/01/22 03:29:07     1.40
+++ encoding.pm 2003/01/26 07:03:59
@@ -35,33 +35,11 @@
     unless ($arg{Filter}) {
        ${^ENCODING} = $enc unless $] <= 5.008 and $utfs{$name};
        $HAS_PERLIO or return 1;
-       for my $h (qw(STDIN STDOUT)){
-           if ($arg{$h}){
-               unless (defined find_encoding($arg{$h})) {
-                   require Carp;
-                   Carp::croak("Unknown encoding for $h, '$arg{$h}'");
-               }
-               eval { binmode($h, ":encoding($arg{$h})") };
-           }else{
-               unless (exists $arg{$h}){
-                   eval {
-                       no warnings 'uninitialized';
-                       binmode($h, ":encoding($name)");
-                   };
-               }
-           }
-           if ($@){
-               require Carp;
-               Carp::croak($@);
-           }
-       }
     }else{
        defined(${^ENCODING}) and undef ${^ENCODING};
        eval {
            require Filter::Util::Call ;
            Filter::Util::Call->import ;
-           binmode(STDIN);
-           binmode(STDOUT);
            filter_add(sub{
                           my $status;
                            if (($status = filter_read()) > 0){
@@ -71,7 +49,31 @@
                           $status ;
                       });
        };
+       # internally use utf8 to make sure utf8 flags are set
+       # for literals.
+       use utf8 (); # to fetch $utf8::hint_bits;
+       $^H |= $utf8::hint_bits;
        # warn "Filter installed";
+    }
+    for my $h (qw(STDIN STDOUT)){
+       if ($arg{$h}){
+           unless (defined find_encoding($arg{$h})) {
+               require Carp;
+               Carp::croak("Unknown encoding for $h, '$arg{$h}'");
+           }
+           eval { binmode($h, ":encoding($arg{$h})") };
+       }else{
+           unless (exists $arg{$h}){
+               eval {
+                   no warnings 'uninitialized';
+                   binmode($h, ":encoding($name)");
+               };
+           }
+       }
+       if ($@){
+           require Carp;
+           Carp::croak($@);
+       }
     }
     return 1; # I doubt if we need it, though
 }

  • [Encode] HEADS-UP; $Encode::VERSION++ to enhance filter option, Dan Kogai <=