Re: Encode::XS for CJK

Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:

On 2002.01.31, at 17:13, Nick Ing-Simmons wrote:

  Now the problem is escape-based codings such as ISO-2022.


Can you explain the way those work?
I can imagine two ways for decode:
A. Keep going with the current sub-encoding until we get a failure,
   then look at the next few octets for an escape sequence.
B. Scan ahead for the next escape sequence (or end of available input),
   then translate up to that.
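
As a rough illustration of method B (a sketch only, not taken from any of
the tools discussed in this thread), the following Python splits an
ISO-2022-JP octet stream at escape sequences and then translates each run.
The names split_on_escapes and decode_iso2022jp are made up for this
example, only the basic RFC 1468 designations are handled, and JIS X 0201
Roman is treated as plain ASCII for simplicity.

```python
ESC = b"\x1b"

# Three-octet designations used by ISO-2022-JP (RFC 1468).
DESIGNATIONS = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "jisx0201-roman",
    b"\x1b$@": "jisx0208",  # JIS C 6226-1978, treated like the 1983 set here
    b"\x1b$B": "jisx0208",
}


def split_on_escapes(data):
    """Method B: scan ahead to the next escape, emit the run before it."""
    segments = []
    charset = "ascii"  # ISO-2022-JP text starts in ASCII
    pos = 0
    while pos < len(data):
        nxt = data.find(ESC, pos)
        if nxt == -1:
            nxt = len(data)  # no more escapes: the run extends to end of input
        if nxt > pos:
            segments.append((charset, data[pos:nxt]))
        if nxt == len(data):
            break
        seq = data[nxt:nxt + 3]
        if seq not in DESIGNATIONS:
            raise ValueError("unknown or truncated escape at offset %d" % nxt)
        charset = DESIGNATIONS[seq]
        pos = nxt + 3
    return segments


def decode_iso2022jp(data):
    """Translate each run according to the charset that was in force."""
    out = []
    for charset, run in split_on_escapes(data):
        if charset == "jisx0208":
            # Setting the high bit of each octet gives the EUC-JP form.
            out.append(bytes(b | 0x80 for b in run).decode("euc_jp"))
        else:
            # Simplification: JIS X 0201 Roman is decoded as plain ASCII.
            out.append(run.decode("ascii"))
    return "".join(out)
```

Method A would differ mainly in the loop: it would translate octet by octet
and only look for an escape when the current sub-encoding rejects its input.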


  To answer these questions, let's see what the existing utilities do.
Here I will discuss NKF, jcode.pl and my humble Jcode.


jcode.pl  ftp://ftp.iij.ad.jp/pub/IIJ/dist/utashiro/perl/

* 1st appeared in 1992, BEFORE Perl5
* still maintained; still widely used for the same reason cgi-lib.pl is
  used instead of CGI.pm
* Written 100% in perl
* No Unicode support
* Method C?  Just like method B, but it uses a regex to grab the text
  between escape boundaries (see Jcode.pm below)
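
To make "method C" concrete, here is a minimal Python sketch of the regex
approach. It is not jcode.pl's actual code; the pattern and the function
name regex_segments are assumptions for illustration, and only the basic
ISO-2022-JP escapes are recognised.

```python
import re

# Matches the basic ISO-2022-JP designations: ESC $ @, ESC $ B, ESC ( B, ESC ( J.
ESCAPE_RE = re.compile(rb"(\x1b\$[@B]|\x1b\([BJ])")


def regex_segments(data):
    """Return (escape, run) pairs, one per non-empty run between escapes."""
    # With one capturing group, split() alternates text and escape:
    # [text0, esc1, text1, esc2, text2, ...]
    parts = ESCAPE_RE.split(data)
    segments = []
    if parts[0]:
        # Text before the first escape is in the initial (ASCII) state.
        segments.append((b"\x1b(B", parts[0]))
    for esc, run in zip(parts[1::2], parts[2::2]):
        if run:
            segments.append((esc, run))
    return segments
```

The regex does the scanning that method B does by hand, which is why it is
the natural choice when the whole thing is written in a scripting language.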


So would I if doing it in perl.

You could bundle several encodings in one XS (the way Encode itself
bundles ASCII, iso-8859-* and koi8).


  I know.  But it is bulky


As a quick hack I tried bundling:

                'euc-jp.ucm',
                'jis0201.enc',
                'jis0212.enc',
                'jis0208.enc',
                'shiftjis.enc',

The resulting XS's string table was only slightly larger than the one
for euc-jp.ucm on its own. (But time to compile was much longer.)

and another problem is that Tcl has a
different notion of 'Escape' (like euc_jp_0212, which is not exactly an
escape but an extension)


Tcl has both E (escape) encoding and X (eXtension) encoding as type fields.
I don't remember that from tcl/tk ...

which needs to be corrected for practical use.

If any of the bundled encodings have similar sequences of code points
then we will get overall table size reductions too.

In the limit one could have Encode::CJK, but perhaps
Encode::JP / Encode::CN / Encode::KR makes more sense ???


  Right.  From a user's point of view distinct package space for each
(human) language is better.  But again, this can be implemented like

Encode::EUC (does all euc-based conversion)
Encode::JP  (Wrapper module that calls Encode::EUC and Encode::ISO2022)
Encode::KR
Encode::CN

  and so forth.
  Actually even more table reduction can be done between SHIFT_JIS and
EUC.  They are all based upon JISX0208 (and 0201 and 0212), so a simple
calculation converts one to the other.
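
For reference, the "simple calculation" for the JIS X 0208 part can be
sketched as below. This is an illustration in Python, not code from any
existing module; the function name sjis_to_eucjp is made up, and only
ASCII, half-width katakana and the standard JIS X 0208 lead bytes
(0x81-0x9F, 0xE0-0xEF) are handled. Vendor extensions are rejected.

```python
def sjis_to_eucjp(data):
    """Convert Shift_JIS octets to EUC-JP by arithmetic, without tables."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i]
        if c < 0x80:
            out.append(c)                      # ASCII passes through
            i += 1
        elif 0xA1 <= c <= 0xDF:
            out += bytes((0x8E, c))            # half-width katakana: SS2 prefix
            i += 1
        elif 0x81 <= c <= 0x9F or 0xE0 <= c <= 0xEF:
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character")
            s1, s2 = c, data[i + 1]
            if not 0x40 <= s2 <= 0xFC or s2 == 0x7F:
                raise ValueError("invalid trail byte 0x%02X" % s2)
            if s1 >= 0xE0:
                s1 -= 0x40                     # close the gap in lead bytes
            row = (s1 - 0x81) * 2              # each lead byte covers two rows
            if s2 >= 0x9F:
                j1 = row + 0x22                # second (even) row of the pair
                j2 = s2 - 0x7E
            else:
                j1 = row + 0x21                # first (odd) row of the pair
                j2 = s2 - 0x1F
                if s2 >= 0x80:
                    j2 -= 1                    # trail bytes skip 0x7F
            out += bytes((j1 | 0x80, j2 | 0x80))  # JIS -> EUC: set high bits
            i += 2
        else:
            raise ValueError("unsupported lead byte 0x%02X" % c)
    return bytes(out)
```

The reverse direction is the same arithmetic run backwards, so neither
encoding needs its own copy of the JIS X 0208 mapping.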

Ah.

We could ship things pre-compiled (with original .ucm's gzipped, or
provide a way to extract a .ucm from the compiled form).
Also the compile process is all in perl and has not really been tuned.
It spends a lot of time trying to find common "strings" (which gets the
tables down in size, so it is a win).
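
The idea of sharing common "strings" can be illustrated with a small
Python sketch. This is not the actual compile script, just one simple way
to do it: each byte string is stored once in a single blob, and a string
that already occurs inside the blob is represented by an offset into it.
The name pack_strings is hypothetical.

```python
def pack_strings(strings):
    """Pack byte strings into one blob, reusing any existing occurrence."""
    blob = b""
    offsets = {}
    # Longest first, so shorter strings can be found inside longer ones.
    # The secondary sort key only makes the result deterministic.
    for s in sorted(set(strings), key=lambda s: (-len(s), s)):
        pos = blob.find(s)
        if pos == -1:
            pos = len(blob)   # not present yet: append at the end
            blob += s
        offsets[s] = pos
    return blob, offsets
```

The repeated substring searches are what make this kind of pass slow on
large tables, which matches the compile-time cost described above.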


  Right.  How to do that still needs more experiments, but this is what
should be done.

Dan

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/