perl-unicode

use Encode; # on Japanese; LONG!

2002-01-09 18:58:13
Hi jhi,

My name is Dan Kogai. I am the author of Jcode.pm, which converts among the various Japanese charsets. With the advent of the Encode module that comes with Perl 5.7.2 and up, I finally thought that Jcode's role was over and Jcode could rest in peace. When I tested the module, however, I found it was far from that. Rather, I believe I can help a great deal with the current implementation.

Problem #1: Where are the rest of the charsets!?

When perl5.7.2 gets installed, it installs a bunch of .enc files under Encoding/, including good old euc-jp. But when you run

perl5.7.2 -MEncode -e 'print join(",", encodings), "\n";'

  You get

koi8-r,dingbats,iso-8859-10,iso-8859-13,cp37,iso-8859-9,iso-8859-6,iso-8859-1,
cp1047,iso-8859-4,Internal,iso-8859-2,symbol,iso-8859-3,US-ascii,iso-8859-8,
iso-8859-14,UCS-2,iso-8859-5,UTF-8,iso-8859-7,iso-8859-15,cp1250,iso-8859-16,
posix-bc

Those are 8-bit charsets only. I was disappointed at first, but I thought it over and found the Encode::Tcl module, which comes with no documentation at all. I read the source over and over and finally found

perl5.7.2 -MEncode -MEncode::Tcl -e 'print join(",", encodings), "\n";'

  That gave me

gb1988,cp857,macUkraine,dingbats,iso2022-jp,iso-8859-10,ksc5601,iso-8859-13,
iso-8859-6,macTurkish,Internal,symbol,macJapan,iso2022,cp1250,posix-bc,cp1251,
koi8-r,7bit-kr,cp437,cp866,iso-8859-3,cp874,iso-8859-8,macCyrillic,UCS-2,
shiftjis,UTF-8,euc-jp,cp862,7bit-kana,cp861,cp860,macCroatian,jis0208,cp1254,
cp37,iso-8859-9,7bit-jis,macGreek,big5,cp852,cp869,macCentEuro,iso-8859-1,
cp1047,cp863,macIceland,macRoman,euc-kr,gsm0338,cp775,cp950,cp1253,cp424,
cp856,cp850,iso-8859-16,cp1256,cp737,cp1252,macDingbats,jis0212,iso2022-kr,
cp1006,euc-cn,cp949,cp855,gb2312,cp1255,iso-8859-4,iso-8859-2,cp1258,jis0201,
cp864,US-ascii,cp936,iso-8859-14,iso-8859-5,iso-8859-7,iso-8859-15,cp865,
macThai,HZ,macRomania,cp1257,gb12345,cp932

  And I smiled, and then wrote test code as follows.

Problem #2: Does it really work?

So here is code #1, which encodes or decodes depending on the option.

#!/usr/local/bin/perl5.7.2

use strict;
use Encode;
use Encode::Tcl;

my ($which, $from, $to) = @ARGV;
my $op = $which =~ /e/ ?  \&encode :
    $which =~ /d/ ?  \&decode : die "$0 [-[e|d]c] from to\n";
my $check = $which =~ /c/;
$check and warn "check set.\n";

open my $in,  '<', $from or die "$from: $!";
open my $out, '>', $to   or die "$to: $!";

while(defined(my $line = <$in>)){
    use bytes;
    # without this, print complains as follows:
    # Wide character in print at ./classic.pl line 15, <$in> line 260.
    print $out $op->('euc-jp', $line, $check);

}
__END__

  It APPEARS to (en|de)code chars -- with lots of problems.
I fed it Jcode/t/table.euc, the file that contains all the characters defined in JIS X 0201 and JIS X 0208. Jcode tests itself by converting that file there and back again. If the (en|de)coder is OK, euc-jp -> utf8 -> euc-jp must give the original characters back. With the code above it did not, though many of the characters did appear converted. Emacs failed to auto-recognize the character encoding, but when I fed the resulting files to JEdit with the character set explicitly specified, the converted characters showed up.
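The round trip he describes can be sketched with the Encode API alone (Encode became a core module with Perl 5.8); the two-character sample string here is an illustrative stand-in for table.euc:

```perl
#!/usr/bin/perl
# Round-trip sketch: euc-jp -> internal (utf8) -> euc-jp must give the
# original octets back if the (en|de)coder is correct.
use strict;
use warnings;
use Encode qw(encode decode);

my $euc  = "\xA4\xA2\xA4\xA4";        # EUC-JP octets for HIRAGANA A, I
my $utf8 = decode('euc-jp', $euc);    # octets -> character string
my $back = encode('euc-jp', $utf8);   # character string -> octets

print $back eq $euc ? "round trip OK\n" : "round trip FAILED\n";
```

Running the same comparison over the whole of table.euc is exactly what Jcode's own test suite does.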

  Then I also tried this one.

#!/usr/local/bin/perl5.7.2

use strict;
use Encode;
use Encode::Tcl;

my ($which, $from, $to) = @ARGV;
my ($op, $icode, $ocode);
if    ($which =~ /e/){
     $icode = "utf8"; $ocode="encoding('euc-jp')";
}elsif($which =~ /d/){
     $icode = "encoding('euc-jp')"; $ocode="utf8";
}else{
    die "$0 -[e|d] from to\n";
}

open my $in,  "<:$icode", $from or die "$from:$!";
open my $out, ">:$ocode", $to   or die "$to:$!";

while(defined(my $line = <$in>)){
    use bytes;
    print $out $line;

}
__END__

A new style. It does convert, but it converts differently from the previous code. Also, this

Cannot find encoding "'euc-jp'" at ./newway.pl line 17.
:Invalid argument.

  appears for some reason.
I can only say that Encode is far from production level, as far as Japanese charsets are concerned.
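For what it's worth, the error quoted above looks like a quoting problem: the layer name in open() takes no quotes of its own, i.e. :encoding(euc-jp) rather than :encoding('euc-jp'). A minimal sketch under that assumption, writing one character through the layer and reading the raw octets back (the temp file is illustrative only):

```perl
#!/usr/bin/perl
# Layer-name sketch: ":encoding(euc-jp)" -- no quotes inside the parens.
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $file) = tempfile();
close $tmp;

# write one character through the euc-jp layer ...
open my $out, '>:encoding(euc-jp)', $file or die "$file: $!";
print $out "\x{3042}";                 # HIRAGANA LETTER A
close $out;

# ... then read the raw octets back
open my $in, '<:raw', $file or die "$file: $!";
my $bytes = do { local $/; <$in> };
close $in;

printf "%vX\n", $bytes;                # A4.A2, the EUC-JP form of U+3042
unlink $file;
```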

Problem #3: How about performance?

It's silly to talk about performance before the code runs right in the first place, but I could not help checking it out. Encode::Tcl implements conversion by filling a lookup table on the fly. That's what Jcode::Unicode::NoXS does too (well, mine uses a lookup hash, though). How's the performance? I naturally benchmarked.

#!/usr/local/bin/perl5.7.2

use Benchmark;
use Encode;
use Encode::Tcl;
use Jcode;

my $count = $ARGV[0] || 1;

my $eucstr;   # file-scoped so the benchmark subs below can see it
sub subread{
    open my $fh, '<', 'table.euc' or die "table.euc: $!";
    read $fh, $eucstr, -s 'table.euc';   # -s gives the size; -f was a bug
    close $fh;
}
subread();

timethese($count,
          {
              "Encode::Tcl" =>
                  sub { my $decoded = decode('euc-jp', $eucstr, 1) },
              "Jcode" =>
                  sub { my $decoded = Jcode::convert($eucstr, 'utf8', 'euc') },
          });
__END__

And here is the result.

Benchmark: timing 1 iterations of Encode::Tcl, Jcode...
Encode::Tcl: 1 wallclock secs ( 0.28 usr + 0.00 sys = 0.28 CPU) @ 3.57/s (n=1)
            (warning: too few iterations for a reliable count)
Jcode: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU) @ 50.00/s (n=1)
            (warning: too few iterations for a reliable count)
Benchmark: timing 100 iterations of Encode::Tcl, Jcode...
Encode::Tcl: 1 wallclock secs ( 0.32 usr + 0.00 sys = 0.32 CPU) @ 312.50/s (n=100)
            (warning: too few iterations for a reliable count)
Jcode: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 3333.33/s (n=100)
            (warning: too few iterations for a reliable count)
Benchmark: timing 1000 iterations of Encode::Tcl, Jcode...
Encode::Tcl: 1 wallclock secs ( 0.38 usr + 0.00 sys = 0.38 CPU) @ 2631.58/s (n=1000)
            (warning: too few iterations for a reliable count)
Jcode: 1 wallclock secs ( 0.11 usr + 0.00 sys = 0.11 CPU) @ 9090.91/s (n=1000)
            (warning: too few iterations for a reliable count)

Just as I guessed. The first invocation of Encode::Tcl is way slow because it has to fill the lookup table; it gets faster after that. The current implementation of Jcode (with XS) also suffers a performance problem on utf8, because it first converts the chars to UCS-2 and then to UTF-8.
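The fill-on-first-use table idea can be sketched with a hash that memoizes per-character conversions: the first pass pays the table-filling cost, later passes hit the cache. This is illustrative only, not Encode::Tcl's actual code -- it delegates to Encode for the mapping, ignores the JIS X 0201 kana and JIS X 0212 planes, and the //= operator needs perl 5.10 or later:

```perl
#!/usr/bin/perl
# Lazy lookup-table sketch: decoded characters are cached in %table the
# first time each EUC-JP character is seen.
use strict;
use warnings;
use Encode qw(decode);

my %table;                      # EUC-JP character => decoded character
sub decode_cached {
    my $euc = shift;
    # split into ASCII bytes and two-byte JIS X 0208 sequences,
    # then decode each one, filling the table on first use
    join '', map { $table{$_} //= decode('euc-jp', $_) }
             $euc =~ /([\x00-\x7F]|[\xA1-\xFE][\xA1-\xFE])/g;
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_cached("\xA4\xA2\xA4\xA4"), "\n";   # HIRAGANA A, I
```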

#4: Conclusion

I think I have grokked both fully enough to implement Encode::Japanese. I know you don't grok Japanese very well (and you don't have to; I don't grok Finnish either :). It takes more than a simple table lookup to handle Japanese well enough to make native grokkers happy: it has to detect automatically which of the many charsets is in use, it has to be robust, and most of all, it must be documented in Japanese :) I can do all that. I believe Jcode must someday cease to exist as the Camel starts to grok Japanese. With the Encode module that day is sooner than I expected, and I want to help you make my day. If I submit Encode::Japanese, will you merge it as a standard module?

Dan the Man with Too Many Charsets to Deal With

--
_____  Dan Kogai
  __/ ____   CEO, DAN co. ltd.
 /__ /-+-/  2-8-14-418 Shiomi Koto-ku Tokyo 135-0052 Japan
   /--/--- mailto:dankogai@dan.co.jp / http://www.dan.co.jp/
---------
__/  /    Tel:+81 3-5665-6131   Fax:+81 3-5665-6132
         PGP Key: http://www.dan.co.jp/~dankogai/dankogai.pgp.asc
