perl-unicode

use Encode; # on Japanese; LONG!

2002-01-09 18:58:13
Hi jhi,

My name is Dan Kogai. I am the author of Jcode.pm, which converts among the various Japanese charsets. With the advent of the Encode module that comes with Perl 5.7.2 and up, I finally thought that Jcode's role was over and Jcode could rest in peace. When I tested the module, however, I found it was far from that. Rather, I believe I can help a great deal with the current implementation.

Problem #1: Where are the rest of the charsets!?

When perl5.7.2 gets installed, it installs a bunch of .enc files under Encoding/, including good old euc-jp. But when you run

perl5.7.2 -MEncode -e 'print join(",", encodings), "\n";'

  You get

koi8-r,dingbats,iso-8859-10,iso-8859-13,cp37,iso-8859-9,iso-8859-6,iso-8859-1,
cp1047,iso-8859-4,Internal,iso-8859-2,symbol,iso-8859-3,US-ascii,iso-8859-8,
iso-8859-14,UCS-2,iso-8859-5,UTF-8,iso-8859-7,iso-8859-15,cp1250,iso-8859-16,
posix-bc

Those are 8-bit charsets only. I was disappointed at first, but I thought it over and found the Encode::Tcl module, which comes with no documentation at all. I read the source over and over and finally found

perl5.7.2 -MEncode -MEncode::Tcl -e 'print join(",", encodings), "\n";'

  That gave me

gb1988,cp857,macUkraine,dingbats,iso2022-jp,iso-8859-10,ksc5601,iso-8859-13,
iso-8859-6,macTurkish,Internal,symbol,macJapan,iso2022,cp1250,posix-bc,cp1251,
koi8-r,7bit-kr,cp437,cp866,iso-8859-3,cp874,iso-8859-8,macCyrillic,UCS-2,
shiftjis,UTF-8,euc-jp,cp862,7bit-kana,cp861,cp860,macCroatian,jis0208,cp1254,
cp37,iso-8859-9,7bit-jis,macGreek,big5,cp852,cp869,macCentEuro,iso-8859-1,
cp1047,cp863,macIceland,macRoman,euc-kr,gsm0338,cp775,cp950,cp1253,cp424,
cp856,cp850,iso-8859-16,cp1256,cp737,cp1252,macDingbats,jis0212,iso2022-kr,
cp1006,euc-cn,cp949,cp855,gb2312,cp1255,iso-8859-4,iso-8859-2,cp1258,jis0201,
cp864,US-ascii,cp936,iso-8859-14,iso-8859-5,iso-8859-7,iso-8859-15,cp865,
macThai,HZ,macRomania,cp1257,gb12345,cp932

  And I smiled, and then wrote test code as follows.

Problem #2: Does it really work?

So here is code #1, which encodes or decodes depending on the option.

#!/usr/local/bin/perl5.7.2

use strict;
use Encode;
use Encode::Tcl;

my ($which, $from, $to) = @ARGV;
my $op = $which =~ /e/ ?  \&encode :
    $which =~ /d/ ?  \&decode : die "$0 [-[e|d]c] from to\n";
my $check = $which =~ /c/;
$check and warn "check set.\n";

open my $in,  '<', $from or die "$from: $!";
open my $out, '>', $to   or die "$to: $!";

while(defined(my $line = <$in>)){
    use bytes;
    # without this, print complains as follows:
    # Wide character in print at ./classic.pl line 15, <$in> line 260.
    print $out $op->('euc-jp', $line, $check);

}
__END__

  It APPEARS to (en|de)code chars -- with lots of problems.
I fed it Jcode/t/table.euc, the file that contains all the characters defined in JIS X 0201 and JIS X 0208. Jcode tests itself by converting that file there and back again. If the (en|de)coder is OK, euc-jp -> utf8 -> euc-jp must give the original characters back. With the code above it did not, though many of the characters did appear converted. Emacs failed to auto-recognize the character encoding, but when I fed the resulting files to JEdit with the character set explicitly specified, the converted characters showed up.
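The round trip he describes can be sketched with the Encode API alone (Encode became a core module with Perl 5.8); the two-character sample string here is an illustrative stand-in for table.euc:

```perl
#!/usr/bin/perl
# Round-trip sketch: euc-jp -> internal (utf8) -> euc-jp must give the
# original octets back if the (en|de)coder is correct.
use strict;
use warnings;
use Encode qw(encode decode);

my $euc  = "\xA4\xA2\xA4\xA4";        # EUC-JP octets for HIRAGANA A, I
my $utf8 = decode('euc-jp', $euc);    # octets -> character string
my $back = encode('euc-jp', $utf8);   # character string -> octets

print $back eq $euc ? "round trip OK\n" : "round trip FAILED\n";
```

Running the same comparison over the whole of table.euc is exactly what Jcode's own test suite does.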

  Then I also tried this one.

#!/usr/local/bin/perl5.7.2

use strict;
use Encode;
use Encode::Tcl;

my ($which, $from, $to) = @ARGV;
my ($op, $icode, $ocode);
if    ($which =~ /e/){
     $icode = "utf8"; $ocode="encoding('euc-jp')";
}elsif($which =~ /d/){
     $icode = "encoding('euc-jp')"; $ocode="utf8";
}else{
    die "$0 -[e|d] from to\n";
}

open my $in,  "<:$icode", $from or die "$from:$!";
open my $out, ">:$ocode", $to   or die "$to:$!";

while(defined(my $line = <$in>)){
    use bytes;
    print $out $line;

}
__END__

A new style. It does convert, but it converts differently from the previous code. Also, this

Cannot find encoding "'euc-jp'" at ./newway.pl line 17.
:Invalid argument.

  appears for some reason.
I can only say that Encode is far from production level, as far as Japanese charsets are concerned.
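For what it's worth, the error quoted above looks like a quoting problem: the layer name in open() takes no quotes of its own, i.e. :encoding(euc-jp) rather than :encoding('euc-jp'). A minimal sketch under that assumption, writing one character through the layer and reading the raw octets back (the temp file is illustrative only):

```perl
#!/usr/bin/perl
# Layer-name sketch: ":encoding(euc-jp)" -- no quotes inside the parens.
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $file) = tempfile();
close $tmp;

# write one character through the euc-jp layer ...
open my $out, '>:encoding(euc-jp)', $file or die "$file: $!";
print $out "\x{3042}";                 # HIRAGANA LETTER A
close $out;

# ... then read the raw octets back
open my $in, '<:raw', $file or die "$file: $!";
my $bytes = do { local $/; <$in> };
close $in;

printf "%vX\n", $bytes;                # A4.A2, the EUC-JP form of U+3042
unlink $file;
```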

Problem #3: How about performance?

It's silly to talk about performance before the code runs right in the first place, but I could not help checking it out. Encode::Tcl implements conversion by filling a lookup table on the fly. That's what Jcode::Unicode::NoXS does too (well, mine uses a lookup hash, though). How's the performance? I naturally benchmarked.

#!/usr/local/bin/perl5.7.2

use Benchmark;
use Encode;
use Encode::Tcl;
use Jcode;

my $count = $ARGV[0] || 1;

my $eucstr;   # file-scoped so the benchmark subs below can see it
sub subread{
    open my $fh, '<', 'table.euc' or die "table.euc: $!";
    read $fh, $eucstr, -s 'table.euc';   # -s gives the size; -f was a bug
    close $fh;
}
subread();

timethese($count,
          {
              "Encode::Tcl" =>
                  sub { my $decoded = decode('euc-jp', $eucstr, 1) },
              "Jcode" =>
                  sub { my $decoded = Jcode::convert($eucstr, 'utf8', 'euc') },
          });
__END__

And here is the result.

Benchmark: timing 1 iterations of Encode::Tcl, Jcode...
Encode::Tcl: 1 wallclock secs ( 0.28 usr + 0.00 sys = 0.28 CPU) @ 3.57/s (n=1)
            (warning: too few iterations for a reliable count)
Jcode: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU) @ 50.00/s (n=1)
            (warning: too few iterations for a reliable count)
Benchmark: timing 100 iterations of Encode::Tcl, Jcode...
Encode::Tcl: 1 wallclock secs ( 0.32 usr + 0.00 sys = 0.32 CPU) @ 312.50/s (n=100)
            (warning: too few iterations for a reliable count)
Jcode: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 3333.33/s (n=100)
            (warning: too few iterations for a reliable count)
Benchmark: timing 1000 iterations of Encode::Tcl, Jcode...
Encode::Tcl: 1 wallclock secs ( 0.38 usr + 0.00 sys = 0.38 CPU) @ 2631.58/s (n=1000)
            (warning: too few iterations for a reliable count)
Jcode: 1 wallclock secs ( 0.11 usr + 0.00 sys = 0.11 CPU) @ 9090.91/s (n=1000)
            (warning: too few iterations for a reliable count)

Just as I guessed. The first invocation of Encode::Tcl is way slow because it has to fill the lookup table; it gets faster after that. The current implementation of Jcode (with XS) also suffers a performance problem on utf8, because it first converts the chars to UCS-2 and then to UTF-8.
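The fill-on-first-use table idea can be sketched with a hash that memoizes per-character conversions: the first pass pays the table-filling cost, later passes hit the cache. This is illustrative only, not Encode::Tcl's actual code -- it delegates to Encode for the mapping, ignores the JIS X 0201 kana and JIS X 0212 planes, and the //= operator needs perl 5.10 or later:

```perl
#!/usr/bin/perl
# Lazy lookup-table sketch: decoded characters are cached in %table the
# first time each EUC-JP character is seen.
use strict;
use warnings;
use Encode qw(decode);

my %table;                      # EUC-JP character => decoded character
sub decode_cached {
    my $euc = shift;
    # split into ASCII bytes and two-byte JIS X 0208 sequences,
    # then decode each one, filling the table on first use
    join '', map { $table{$_} //= decode('euc-jp', $_) }
             $euc =~ /([\x00-\x7F]|[\xA1-\xFE][\xA1-\xFE])/g;
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_cached("\xA4\xA2\xA4\xA4"), "\n";   # HIRAGANA A, I
```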

#4: Conclusion

I think I have grokked both fully enough to implement Encode::Japanese. I know you don't grok Japanese very well (and you don't have to; I don't grok Finnish either :). It takes more than a simple table lookup to handle Japanese well enough to make native grokkers happy: it has to detect automatically which of the many charsets is in use, it has to be robust, and most of all, it must be documented in Japanese :) I can do all that. I believe Jcode must someday cease to exist as the Camel starts to grok Japanese. With the Encode module that day is sooner than I expected, and I want to help you make my day. If I submit Encode::Japanese, will you merge it as a standard module?

Dan the Man with Too Many Charsets to Deal With

--
_____  Dan Kogai
  __/ ____   CEO, DAN co. ltd.
 /__ /-+-/  2-8-14-418 Shiomi Koto-ku Tokyo 135-0052 Japan
   /--/--- mailto:dankogai@dan.co.jp / http://www.dan.co.jp/
---------
__/  /    Tel:+81 3-5665-6131   Fax:+81 3-5665-6132
         PGP Key: http://www.dan.co.jp/~dankogai/dankogai.pgp.asc
