perl-unicode

Re: use Encode; # on Japanese; LONG!

2002-01-11 11:55:23
Dan Kogai <dankogai@dan.co.jp> writes:
Hi jhi,

  My name is Dan Kogai.  I am the author of Jcode.pm, which converts
between various Japanese charsets.  With the advent of the Encode
module that comes with Perl 5.7.2 and up, I finally thought that
Jcode's role was over, that Jcode could rest in peace.  When I tested
the module, however, I found it was far from that.  Rather, I believe
I can help a great deal with the current implementation.

Excellent ! ;-)


Problem #1: Where are the rest of the charsets!?

  When perl5.7.2 gets installed, it installs a bunch of .enc files
under Encoding/, including good old euc-jp.  But when you run

perl5.7.2 -MEncode -e 'print join(",", encodings), "\n";'

  You get

koi8-r,dingbats,iso-8859-10,iso-8859-13,cp37,iso-8859-9,iso-8859-6,iso-8859-1,
cp1047,iso-8859-4,Internal,iso-8859-2,symbol,iso-8859-3,US-ascii,iso-8859-8,
iso-8859-14,UCS-2,iso-8859-5,UTF-8,iso-8859-7,iso-8859-15,cp1250,iso-8859-16,
posix-bc

  Those are only 8-bit charsets.

That was a deliberate decision on my part. Including "all" the ASCII-oid
8-bit encodings in their "compiled" form does not use much memory
(as they share at least 1/2 the space for the ASCII part).

The compiled forms of the multibyte and two-byte encodings are
larger. So I envisage -MEncode=japanese (say) to load clusters.
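As a sketch, such a cluster loader might look like this (the cluster
table, the "japanese" name and the Encode::JP / Encode::KR module
names are assumptions for illustration, not a shipped API):

package Encode::Clusters;   # hypothetical module name
use strict;

# map a cluster name to the encoding modules it should pull in
my %cluster = (
    japanese => [qw(Encode::JP)],   # assumed module names
    korean   => [qw(Encode::KR)],
);

sub import {
    my (undef, @names) = @_;
    for my $name (@names) {
        for my $mod (@{ $cluster{$name} || [] }) {
            eval "require $mod; 1" or warn "cannot load $mod: $@";
        }
    }
}

1;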


I was at first disappointed, but I thought it over and found the
Encode::Tcl module, which comes with no documentation of its own.  I
read the Encode documentation over and over and finally found

perl5.7.2 -MEncode -MEncode::Tcl -e 'print join(",", encodings), "\n";'

  That gave me

gb1988,cp857,macUkraine,dingbats,iso2022-jp,iso-8859-10,ksc5601,iso-8859-13,
iso-8859-6,macTurkish,Internal,symbol,macJapan,iso2022,cp1250,posix-bc,cp1251,
koi8-r,7bit-kr,cp437,cp866,iso-8859-3,cp874,iso-8859-8,macCyrillic,UCS-2,
shiftjis,UTF-8,euc-jp,cp862,7bit-kana,cp861,cp860,macCroatian,jis0208,cp1254,
cp37,iso-8859-9,7bit-jis,macGreek,big5,cp852,cp869,macCentEuro,iso-8859-1,
cp1047,cp863,macIceland,macRoman,euc-kr,gsm0338,cp775,cp950,cp1253,cp424,
cp856,cp850,iso-8859-16,cp1256,cp737,cp1252,macDingbats,jis0212,iso2022-kr,
cp1006,euc-cn,cp949,cp855,gb2312,cp1255,iso-8859-4,iso-8859-2,cp1258,jis0201,
cp864,US-ascii,cp936,iso-8859-14,iso-8859-5,iso-8859-7,iso-8859-15,cp865,
macThai,HZ,macRomania,cp1257,gb12345,cp932

Encode::Tcl is SADAHIRO Tomoyuki's fixup/enhancement of the pure perl
version we used for a while before I invented the compiled form.
The Tcl-oid version is slow.

The .enc files are lifted straight from Tcl. It is unclear to me where
the mappings come from.

Modern Encode has C code that processes a compiled form and can compile
ICU-like .ucm files as well as .enc. The ICU form can represent fallbacks
and non-reversible stuff as well.
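As a sketch, a .ucm entry looks roughly like this (a minimal example
from memory; the |N flags are ICU's precision marks, |0 for roundtrip
and |1 for fallback mappings):

<code_set_name>  "sketch-jp"
<mb_cur_max>     2
<uconv_class>    "MBCS"
CHARMAP
<U0041> \x41 |0   # |0: roundtrip mapping
<U00A5> \x5C |1   # |1: fallback, Unicode-to-bytes only
END CHARMAP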

At that point in the coding it became unclear whether we could use ICU
stuff - I think we have since concluded that we can.


  And I smiled, and then wrote some test code.

Problem #2: Does it really work?

  So here is code #1, which encodes or decodes depending on the
option.

#!/usr/local/bin/perl5.7.2

use strict;
use Encode;
use Encode::Tcl;

my ($which, $from, $to) = @ARGV;
my $op = $which =~ /e/ ?  \&encode :
    $which =~ /d/ ?  \&decode : die "$0 [-[e|d]c] from to\n";
my $check = $which =~ /c/;
$check and warn "check set.\n";

open my $in,  '<', $from or die "$from:$!";
open my $out, '>', $to   or die "$to:$!";

while(defined(my $line = <$in>)){
    use bytes;

File IO of encoded or UTF-8 data is very, very messy prior to perl5.7.
At best 'use bytes' is a hack.

    # or print bitches as follows;
    # Wide character in print at ./classic.pl line 15, <$in> line 260.
    print $out $op->('euc-jp', $line, $check);

}
__END__
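(For example -- my reading of the options, not spelled out above:
./classic.pl -d table.euc table.out decodes EUC-JP octets,
./classic.pl -e table.out table.euc2 is meant to encode them back, and
adding c, as in -dc, also sets the CHECK argument.)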

  It APPEARS to (en|de)code chars -- with lots of problems.
  I fed it Jcode/t/table.euc, the file that contains all characters
defined in JIS X 0201 and JIS X 0208.  Jcode tests itself by
converting that file there and back.  If the (en|de)coder is OK,
euc-jp -> utf8 -> euc-jp must give the characters back.  In the case
of the code above it did not, although many of the characters did
appear to be converted.  Emacs failed to auto-recognize the character
encoding, but when I fed the resulting files to JEdit with the
character set explicitly specified, the converted characters appeared.
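As a sketch, the roundtrip check itself boils down to this (my
condensed version, assuming the same table.euc file; not Jcode's
actual test script):

#!/usr/local/bin/perl5.7.2

use strict;
use Encode;
use Encode::Tcl;

open my $fh, '<', 'table.euc' or die "table.euc:$!";
my $orig = do { local $/; <$fh> };      # slurp all EUC-JP octets

my $chars = decode('euc-jp', $orig);    # EUC-JP octets -> characters
my $back  = encode('euc-jp', $chars);   # characters -> EUC-JP octets

print $back eq $orig ? "roundtrip ok\n" : "roundtrip BROKEN\n";
__END__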

  Then I also tried this one.

#!/usr/local/bin/perl5.7.2

use strict;
use Encode;
use Encode::Tcl;

my ($which, $from, $to) = @ARGV;
my ($op, $icode, $ocode);
if    ($which =~ /e/){
     $icode = "utf8"; $ocode="encoding('euc-jp')";
}elsif($which =~ /d/){
     $icode = "encoding('euc-jp')"; $ocode="utf8";
}else{
    die "$0 -[e|d] from to\n";
}

open my $in,  "<:$icode", $from or die "$from:$!";
open my $out, ">:$ocode", $to   or die "$to:$!";

while(defined(my $line = <$in>)){
    use bytes;
      ^^^^^^^^^^  Catastrophic, I would guess.
use bytes says "I know exactly what I am doing", and so even though
perl knows better, it believes you and fails to UTF-8-ify things,
etc.

    print $out $line;

}
__END__
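To see what that means in practice, a minimal example (mine, not from
the original thread): under 'use bytes', length() sees the internal
UTF-8 octets instead of characters.

use strict;
use Encode;
use Encode::Tcl;

my $str = decode('euc-jp', "\xA4\xA2");   # HIRAGANA LETTER A (U+3042)
print length($str), "\n";                 # 1 -- one character

{
    use bytes;                            # now lengths count octets
    print length($str), "\n";             # 3 -- its UTF-8 bytes
}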

  A new style.  It does convert, but converts differently from the
previous code.  Also this

Cannot find encoding "'euc-jp'" at ./newway.pl line 17.
:Invalid argument.

  appears for some reason.
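The doubled quotes in that message suggest the quotes themselves reach
find_encoding().  Spelling the layer without them should parse (my
suggestion, not from the thread; this is the documented form, at least
in later perls):

open my $in,  "<:encoding(euc-jp)", $from or die "$from:$!";
open my $out, ">:utf8",             $to   or die "$to:$!";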
  I can only say that Encode is far from production level as far as
Japanese charsets are concerned.

I would agree.
It would be good to have some test data in various encodings.
This is easy for 8-bit encodings: 0..255 is all you need. But for
16-bit encodings (with gaps) and in particular multi-byte encodings
you need a "sensible" starting sample.
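For the 8-bit case, a minimal sketch (the encoding name is only an
example):

use strict;
use Encode;

# all 256 octets exercise an 8-bit encoding completely
my $octets = join '', map { chr } 0x00 .. 0xFF;

my $chars = decode('iso-8859-1', $octets);
my $back  = encode('iso-8859-1', $chars);
print $back eq $octets ? "roundtrip ok\n" : "roundtrip BROKEN\n";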




Problem #3: How about performance?

  It's silly to talk about performance before the code runs right in
the first place, but I could not help checking it out.
  Encode::Tcl implements conversion by filling a lookup table
on-the-fly.  That's what Jcode::Unicode::NoXS does too (well, mine
uses a lookup hash, though).  How's the performance?  I naturally
benchmarked.

#!/usr/local/bin/perl5.7.2

use strict;
use Benchmark;
use Encode;
use Encode::Tcl;
use Jcode;

my $count = $ARGV[0] || 1;

# slurp the whole EUC-JP test file once, before any timing starts
my $eucstr;
sub subread{
    open my $fh, 'table.euc' or die "table.euc:$!";
    read $fh, $eucstr, -s 'table.euc';   # -s gives the file size in bytes
    close $fh;
}
subread();

timethese($count,
          {
              "Encode::Tcl" =>
                  sub { my $decoded = decode('euc-jp', $eucstr, 1) },
              "Jcode" =>
                  sub { my $decoded = Jcode::convert($eucstr, 'utf8', 'euc') },
          }
         );
__END__

And here is the result.

Benchmark: timing 1 iterations of Encode::Tcl, Jcode...
Encode::Tcl:  1 wallclock secs ( 0.28 usr +  0.00 sys =  0.28 CPU) @ 3.57/s (n=1)
            (warning: too few iterations for a reliable count)
     Jcode:  0 wallclock secs ( 0.02 usr +  0.00 sys =  0.02 CPU) @ 50.00/s (n=1)
            (warning: too few iterations for a reliable count)
Benchmark: timing 100 iterations of Encode::Tcl, Jcode...
Encode::Tcl:  1 wallclock secs ( 0.32 usr +  0.00 sys =  0.32 CPU) @ 312.50/s (n=100)
            (warning: too few iterations for a reliable count)
     Jcode:  0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @ 3333.33/s (n=100)
            (warning: too few iterations for a reliable count)
Benchmark: timing 1000 iterations of Encode::Tcl, Jcode...
Encode::Tcl:  1 wallclock secs ( 0.38 usr +  0.00 sys =  0.38 CPU) @ 2631.58/s (n=1000)
            (warning: too few iterations for a reliable count)
     Jcode:  1 wallclock secs ( 0.11 usr +  0.00 sys =  0.11 CPU) @ 9090.91/s (n=1000)
            (warning: too few iterations for a reliable count)

  Just as I guessed.  The first invocation of Encode::Tcl is way slow
because it has to fill the lookup table; it gets faster as time goes
by.  The current implementation of Jcode (with XS) also suffers a
performance problem on utf8, because it first converts the chars to
UCS-2 and then to UTF-8.

#4: Conclusion

  I think I have grokked both fully enough to implement
Encode::Japanese.  I know you don't grok Japanese very well (which you
don't have to; I don't grok Finnish either :).  It takes more than a
simple table lookup to handle Japanese well enough to make native
grokkers happy.  It has to detect automatically which of the many
charsets is used, it has to be robust, and most of all, it must be
documented in Japanese :)  I can do all that.
  I believe Jcode must someday cease to exist as the Camel starts to
grok Japanese.  With the Encode module that day is sooner than I
expected, and I want to help you make my day.
  If I submit Encode::Japanese, are you going to merge it as a
standard module?

I encourage you to look at Encode/encengine.c - it is a state machine
which reads tables to transform octet-sequences.

It is a lot faster than the Encode::Tcl scheme.
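The idea can be caricatured in a few lines of perl (a toy of mine,
nothing like the real compiled tables): each state is a table keyed by
input octet, whose entry is either the output octets or a follow-on
table for multi-byte sequences.

use strict;

# toy tables: 0x41 maps directly; 0xA4 0xA2 (EUC-JP hiragana 'a')
# goes through a second state to reach its UTF-8 octets
my %next  = ( "\xA2" => "\xE3\x81\x82" );
my %start = ( "\x41" => "A", "\xA4" => \%next );

sub transcode {
    my ($table, $octets) = @_;
    my ($out, $state) = ('', $table);
    for my $o (split //, $octets) {
        my $hit = $state->{$o};
        defined $hit or die sprintf "no mapping for octet %02X\n", ord $o;
        if (ref $hit) { $state = $hit }           # descend: need more octets
        else          { $out .= $hit; $state = $table }
    }
    return $out;
}

print transcode(\%start, "\x41\xA4\xA2"), "\n";   # "A" . hiragana 'a'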

I _think_ Encode/compile (which builds the tables) does the right
thing for multi-byte and 16-bit encodings, but as I have no reliable
test data, viewer, or judgement of the end result, I cannot be sure.

What I would like to see is :

A. A review of Encode's APIs and principles to make sure I have not
   done anything really stupid. Both API from perl script's perspective
   and also the API/mechanism that it expects an Encoding "plugin" to
   provide.

B. "Blessing" of the Xxxxx <-> Unicode mappings for various encodings.
    Are Tcl's "good enough" or should we use ICU's or Unicode's or ... ?

C. Point me at "clusters" of related encodings that are often used
   together, and I can have a crack at building a "compiled" XS module
   that provides those encodings.

D. Some discussion as to how to handle escape encodings and/or
   heuristics for guessing the encoding (a toy sketch of such a
   heuristic follows below). I had some quarter-thought-out ideas for
   how to get encengine.c to assist on these too - but I have probably
   forgotten them.
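As a toy illustration of such a guessing heuristic (my sketch; a real
detector such as Jcode's getcode() is considerably more careful, and
EUC-JP vs. Shift_JIS cannot be told apart reliably this way):

use strict;

# guess among iso2022-jp, euc-jp and shiftjis from byte patterns
sub guess_jp {
    my $octets = shift;
    # iso2022-jp announces itself with escape sequences
    return 'iso2022-jp' if $octets =~ /\e\$[\@B]|\e\([BJ]/;
    # otherwise count plausible two-byte sequences for the other two
    my $euc  = () = $octets =~ /[\xA1-\xFE][\xA1-\xFE]/g;
    my $sjis = () = $octets =~ /[\x81-\x9F\xE0-\xEF][\x40-\x7E\x80-\xFC]/g;
    return $euc >= $sjis ? 'euc-jp' : 'shiftjis';
}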


Dan the Man with Too Many Charsets to Deal With
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/