Hi jhi,
My name is Dan Kogai. I am the author of Jcode.pm, which converts among
the various Japanese charsets. With the advent of the Encode module
that comes with Perl 5.7.2 and up, I finally thought that Jcode's role
was over and Jcode could rest in peace. When I tested the module,
however, I found it was far from that. Rather, I believe I can help a
great deal with the current implementation.
Problem #1: Where are the rest of the charsets!?
When perl5.7.2 gets installed, it installs a bunch of .enc files under
Encoding/, including good old euc-jp. But when you run
perl5.7.2 -MEncode -e 'print join(",", encodings), "\n";'
you get
koi8-r,dingbats,iso-8859-10,iso-8859-13,cp37,iso-8859-9,iso-8859-6,iso-8859-1,
cp1047,iso-8859-4,Internal,iso-8859-2,symbol,iso-8859-3,US-ascii,iso-8859-8,
iso-8859-14,UCS-2,iso-8859-5,UTF-8,iso-8859-7,iso-8859-15,cp1250,iso-8859-16,
posix-bc
Those are 8-bit charsets only. I was at first disappointed, but I
thought it over and found the Encode::Tcl module, which comes with no
documentation. I went over the code again and again and finally found
perl5.7.2 -MEncode -MEncode::Tcl -e 'print join(",", encodings), "\n";'
That gave me
gb1988,cp857,macUkraine,dingbats,iso2022-jp,iso-8859-10,ksc5601,iso-8859-13,
iso-8859-6,macTurkish,Internal,symbol,macJapan,iso2022,cp1250,posix-bc,
cp1251,koi8-r,7bit-kr,cp437,cp866,iso-8859-3,cp874,iso-8859-8,macCyrillic,
UCS-2,shiftjis,UTF-8,euc-jp,cp862,7bit-kana,cp861,cp860,macCroatian,jis0208,
cp1254,cp37,iso-8859-9,7bit-jis,macGreek,big5,cp852,cp869,macCentEuro,
iso-8859-1,cp1047,cp863,macIceland,macRoman,euc-kr,gsm0338,cp775,cp950,
cp1253,cp424,cp856,cp850,iso-8859-16,cp1256,cp737,cp1252,macDingbats,jis0212,
iso2022-kr,cp1006,euc-cn,cp949,cp855,gb2312,cp1255,iso-8859-4,iso-8859-2,
cp1258,jis0201,cp864,US-ascii,cp936,iso-8859-14,iso-8859-5,iso-8859-7,
iso-8859-15,cp865,macThai,HZ,macRomania,cp1257,gb12345,cp932
And I smiled, and then wrote test code as follows.
Problem #2: Does it really work?
So here is code #1, which encodes or decodes depending on the option.
#!/usr/local/bin/perl5.7.2
use strict;
use Encode;
use Encode::Tcl;
my ($which, $from, $to) = @ARGV;
my $op = $which =~ /e/ ? \&encode :
         $which =~ /d/ ? \&decode : die "$0 [-[e|d]c] from to\n";
my $check = $which =~ /c/;
$check and warn "check set.\n";
open my $in,  "<$from" or die "$from:$!";
open my $out, ">$to"   or die "$to:$!";
while(defined(my $line = <$in>)){
    use bytes;
    # without this, print complains as follows:
    # Wide character in print at ./classic.pl line 15, <$in> line 260.
    print $out $op->('euc-jp', $line, $check);
}
__END__
It APPEARS to (en|de)code characters -- with lots of problems.
I fed it Jcode/t/table.euc, a file that contains all the characters
defined in JIS X 0201 and JIS X 0208. Jcode tests itself by converting
that file there and back. If the (en|de)coder is OK, euc-jp -> utf8 ->
euc-jp must yield the original characters. With the code above it did
not, though many of the characters did appear to be converted. Emacs
failed to auto-recognize the encoding of the results, but when I fed
the resulting files to JEdit with the character set explicitly
specified, converted characters did appear.
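For reference, the round-trip check described above boils down to
something like this minimal sketch, assuming an Encode that already
knows euc-jp, and using a short sample string instead of the whole
table file:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Round-trip sketch: a short EUC-JP sample ("nihongo",
# U+65E5 U+672C U+8A9E) must survive euc-jp -> utf8 -> euc-jp unchanged.
my $euc  = "\xC6\xFC\xCB\xDC\xB8\xEC";
my $utf8 = decode('euc-jp', $euc);     # bytes -> internal characters
my $back = encode('euc-jp', $utf8);    # characters -> bytes again
print $back eq $euc ? "round trip OK\n" : "round trip BROKEN\n";
```

If the codec is correct, this prints "round trip OK" -- and the same
must hold for every character in table.euc.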
Then I also tried this one.
#!/usr/local/bin/perl5.7.2
use strict;
use Encode;
use Encode::Tcl;
my ($which, $from, $to) = @ARGV;
my ($op, $icode, $ocode);
if ($which =~ /e/){
    $icode = "utf8"; $ocode = "encoding('euc-jp')";
}elsif($which =~ /d/){
    $icode = "encoding('euc-jp')"; $ocode = "utf8";
}else{
    die "$0 -[e|d] from to\n";
}
open my $in,  "<:$icode", $from or die "$from:$!";
open my $out, ">:$ocode", $to   or die "$to:$!";
while(defined(my $line = <$in>)){
    use bytes;
    print $out $line;
}
__END__
A new style. It does convert, but converts differently from the
previous code. Also, this:
Cannot find encoding "'euc-jp'" at ./newway.pl line 17.
:Invalid argument.
appears for some reason.
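If I read that error right, the quoting is the culprit: with
$icode = "encoding('euc-jp')", the inner quotes become part of the
layer argument, so PerlIO goes looking for an encoding literally named
'euc-jp', quote marks included. A sketch of the unquoted form (using an
in-memory filehandle so the snippet is self-contained; this assumes an
Encode/PerlIO that supports the :encoding layer):

```perl
use strict;
use warnings;

# The layer must be "encoding(euc-jp)", NOT "encoding('euc-jp')" --
# the quotes are not shell-style, they would be taken literally.
my $euc = "\xC6\xFC\xCB\xDC\xB8\xEC";   # "nihongo" in EUC-JP
open my $in, '<:encoding(euc-jp)', \$euc or die "open:$!";
my $line = <$in>;
close $in;
printf "read %d characters\n", length $line;   # 3 characters, not 6 bytes
```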
I can only say that Encode is far from production level, as far as
Japanese charsets are concerned.
Problem #3: How about performance?
It's silly to talk about performance before the code runs right in the
first place, but I could not help checking it out.
Encode::Tcl implements conversion by filling a lookup table on the fly.
That's what Jcode::Unicode::NoXS does too (well, mine uses a lookup
hash, though). How's the performance? I naturally benchmarked.
#!/usr/local/bin/perl5.7.2
use Benchmark;
use Encode;
use Encode::Tcl;
use Jcode;
my $count = $ARGV[0] || 1;
my $eucstr;
sub subread{
    open my $fh, 'table.euc' or die "table.euc:$!";
    read $fh, $eucstr, -s 'table.euc';   # -s (size), not -f
    undef $fh;
}
subread();
timethese($count,
    {
        "Encode::Tcl" =>
            sub { my $decoded = decode('euc-jp', $eucstr, 1) },
        "Jcode" =>
            sub { my $decoded = Jcode::convert($eucstr, 'utf8',
                                               'euc') },
    }
);
__END__
And here is the result.
Benchmark: timing 1 iterations of Encode::Tcl, Jcode...
Encode::Tcl:  1 wallclock secs ( 0.28 usr + 0.00 sys = 0.28 CPU) @ 3.57/s (n=1)
            (warning: too few iterations for a reliable count)
      Jcode:  0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU) @ 50.00/s (n=1)
            (warning: too few iterations for a reliable count)
Benchmark: timing 100 iterations of Encode::Tcl, Jcode...
Encode::Tcl:  1 wallclock secs ( 0.32 usr + 0.00 sys = 0.32 CPU) @ 312.50/s (n=100)
            (warning: too few iterations for a reliable count)
      Jcode:  0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 3333.33/s (n=100)
            (warning: too few iterations for a reliable count)
Benchmark: timing 1000 iterations of Encode::Tcl, Jcode...
Encode::Tcl:  1 wallclock secs ( 0.38 usr + 0.00 sys = 0.38 CPU) @ 2631.58/s (n=1000)
            (warning: too few iterations for a reliable count)
      Jcode:  1 wallclock secs ( 0.11 usr + 0.00 sys = 0.11 CPU) @ 9090.91/s (n=1000)
            (warning: too few iterations for a reliable count)
Just as I guessed. The first invocation of Encode::Tcl is way slow
because it has to fill the lookup table; it gets faster as time goes
by. The current implementation of Jcode (with XS) also suffers a
performance problem with utf8, because it first converts the characters
to UCS-2 and only then to UTF-8.
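The fill-on-first-use pattern in question can be sketched like this
(the sub names and the single table entry are invented for
illustration):

```perl
use strict;
use warnings;

# Lazy table filling: the first lookup pays for building the whole
# table; later lookups reuse the cached hash for free.
my (%table, $fills);
sub build_table { $fills++; return ("\xC6\xFC" => "\x{65E5}") }
sub lookup {
    %table = build_table() unless %table;   # expensive part runs once
    return $table{ $_[0] };
}
lookup("\xC6\xFC") for 1 .. 3;
print "table built $fills time(s)\n";   # -> table built 1 time(s)
```

That is why the per-iteration rate climbs so steeply between n=1 and
n=100 above: the one-time build cost gets amortized away.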
#4: Conclusion
I think I have grokked both fully enough to implement
Encode::Japanese. I know you don't grok Japanese very well (which you
don't have to; I don't grok Finnish either :). It takes more than a
simple table lookup to handle Japanese well enough to make native
grokkers happy. It has to automatically detect which of the many
charsets is in use, it has to be robust, and most of all, it must be
documented in Japanese :) I can do all that.
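To make the detection point concrete, here is a toy guesser -- NOT
Jcode's actual algorithm, just a sketch of the kind of byte-pattern
heuristics involved; the function name and the rules are invented:

```perl
use strict;
use warnings;

# Toy Japanese-charset guesser: iso-2022-jp announces itself with
# escape sequences; Shift_JIS uses lead bytes 0x81-0x9F that EUC-JP
# never produces; EUC-JP kanji come as byte pairs in 0xA1-0xFE.
sub guess_jp {
    my $str = shift;
    return 'iso-2022-jp' if $str =~ /\e\$[\@B]/;       # JIS escape
    return 'shiftjis'    if $str =~ /[\x81-\x9F]/;     # SJIS-only lead byte
    return 'euc-jp'      if $str =~ /[\xA1-\xFE][\xA1-\xFE]/;
    return 'ascii';
}
print guess_jp("\xC6\xFC\xCB\xDC\xB8\xEC"), "\n";   # EUC-JP "nihongo"
```

A real detector has to cope with half-width kana, mixed input, and the
ambiguous byte ranges the three encodings share, which is exactly why
a naive table lookup is not enough.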
I believe Jcode must someday cease to exist as the Camel starts to grok
Japanese. With the Encode module that day is sooner than I expected,
and I want to help you make my day.
If I submit Encode::Japanese, will you merge it as a standard
module?
Dan the Man with Too Many Charsets to Deal With
--
_____ Dan Kogai
__/ ____ CEO, DAN co. ltd.
/__ /-+-/ 2-8-14-418 Shiomi Koto-ku Tokyo 135-0052 Japan
/--/--- mailto: dankogai(_at_)dan(_dot_)co(_dot_)jp / http://www.dan.co.jp/
---------
__/ / Tel:+81 3-5665-6131 Fax:+81 3-5665-6132
PGP Key: http://www.dan.co.jp/~dankogai/dankogai.pgp.asc