On 2002.01.31, at 17:13, Nick Ing-Simmons wrote:
> Now the problem is escape-based codings such as ISO-2022.
> Can you explain the way those work?
> I can imagine two ways for decode:
> A. Keep going with the current sub-encoding till we get a fail,
>    then look at the next few octets for an escape sequence.
> B. Scan ahead for the next escape sequence (or end of available input),
>    then translate up to that.
To answer these questions, let's see what the existing utilities do.
Here I will discuss NKF, jcode.pl and my humble Jcode.
NKF (Network Kanji Filter) ftp://ftp.ie.u-ryukyu.ac.jp/pub/software/kono/
* 1st appeared in 1987. Still maintained.
* Handles EUC-JP, JIS (ISO-2022-JP) and SHIFT_JIS.
* No Unicode support to date. This is understandable because the other
"legacy" encodings need no conversion table, since they are all based
upon JISX2xx.
* Stream based. No buffer allocation and such (this changed later when
NKF.pm was added to the distribution, but even in that case NKF.xs just
does buffer handling and nkf(1) does no in-memory conversion).
* Uses method B for ISO-2022 (or my ungetc() !).
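Method B is straightforward to sketch in plain Perl. The following toy decoder is my own illustration, not NKF's code: it handles only the JIS X 0208 and ASCII escapes, and `decode_segment()` is a made-up helper standing in for the real per-charset conversion (here it fakes JIS X 0208 -> EUC-JP by setting the high bit on each octet):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy illustration of method B: scan ahead for the next escape
# sequence (or end of input), then convert everything up to it
# using the sub-encoding currently in effect.
sub decode_iso2022 {
    my $octets  = shift;
    my $out     = '';
    my $charset = 'ASCII';    # ISO-2022-JP streams start in ASCII
    while (length $octets) {
        if ($octets =~ s/\A\e\$[\@B]//) {       # ESC $ @ / ESC $ B
            $charset = 'JISX0208';
        }
        elsif ($octets =~ s/\A\e\([BJ]//) {     # ESC ( B / ESC ( J
            $charset = 'ASCII';
        }
        elsif ($octets =~ s/\A([^\e]+)//) {     # run up to next escape
            $out .= decode_segment($charset, $1);
        }
        else {
            substr($octets, 0, 1, '');  # unknown escape: skip the ESC
        }
    }
    return $out;
}

# Stand-in converter: JIS X 0208 -> EUC-JP is just setting the
# high bit on each octet; ASCII passes through untouched.
sub decode_segment {
    my ($charset, $str) = @_;
    $str =~ tr/\x21-\x7e/\xa1-\xfe/ if $charset eq 'JISX0208';
    return $str;
}
```

For instance, `decode_iso2022("ab\e\$B\x24\x22\e(Bcd")` yields `"ab\xa4\xa2cd"`: the two octets between the escapes come out shifted into the EUC-JP range.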
jcode.pl ftp://ftp.iij.ad.jp/pub/IIJ/dist/utashiro/perl/
* 1st appeared in 1992, BEFORE Perl5
* still maintained; still widely used for the same reason cgi-lib.pl is
used instead of CGI.pm
* Written 100% in perl
* No Unicode support
* Method C? Just like method B, but it uses a regex to grab the text
between escape boundaries (see Jcode.pm below)
Jcode.pm http://www.openlab.gr.jp/Jcode/
* 1st appeared in 1999
* Unicode support added (XS and NoXS both supported)
* object model
* internal routines based upon jcode.pl
* and here is the sub that does the JIS -> EUC conversion:

sub jis_euc {
    my $thingy = shift;
    my $r_str  = ref $thingy ? $thingy : \$thingy;
    $$r_str =~ s(
        ($RE{JIS_0212}|$RE{JIS_0208}|$RE{JIS_ASC}|$RE{JIS_KANA})
        ([^\e]*)
    )
    {
        my ($esc, $str) = ($1, $2);
        if ($esc !~ /$RE{JIS_ASC}/o) {
            # shift each octet into the high (GR) range
            $str =~ tr/\x21-\x7e/\xa1-\xfe/;
            if ($esc =~ /$RE{JIS_KANA}/o) {
                # halfwidth kana: prefix EUC-JP SS2
                $str =~ s/([\xa1-\xdf])/\x8e$1/og;
            }
            elsif ($esc =~ /$RE{JIS_0212}/o) {
                # JIS X 0212: prefix EUC-JP SS3
                $str =~ s/([\xa1-\xfe][\xa1-\xfe])/\x8f$1/og;
            }
        }
        $str;
    }geox;
    $$r_str;
}
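The sub depends on a package-level %RE hash of escape-sequence patterns that the excerpt does not show. Filling it in with the standard ISO-2022-JP escapes (my reconstruction; the exact values Jcode.pm ships may differ), the sub runs end to end as a self-contained script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reconstructed escape-sequence patterns (not from the excerpt):
our %RE = (
    JIS_0208 => '\e\$[\@B]',   # ESC $ @ / ESC $ B : JIS X 0208
    JIS_0212 => '\e\$\(D',     # ESC $ ( D         : JIS X 0212
    JIS_ASC  => '\e\([BJ]',    # ESC ( B / ESC ( J : ASCII / JIS Roman
    JIS_KANA => '\e\(I',       # ESC ( I           : halfwidth katakana
);

sub jis_euc {   # as in the listing above
    my $thingy = shift;
    my $r_str  = ref $thingy ? $thingy : \$thingy;
    $$r_str =~ s(
        ($RE{JIS_0212}|$RE{JIS_0208}|$RE{JIS_ASC}|$RE{JIS_KANA})
        ([^\e]*)
    )
    {
        my ($esc, $str) = ($1, $2);
        if ($esc !~ /$RE{JIS_ASC}/o) {
            $str =~ tr/\x21-\x7e/\xa1-\xfe/;
            if ($esc =~ /$RE{JIS_KANA}/o) {
                $str =~ s/([\xa1-\xdf])/\x8e$1/og;
            }
            elsif ($esc =~ /$RE{JIS_0212}/o) {
                $str =~ s/([\xa1-\xfe][\xa1-\xfe])/\x8f$1/og;
            }
        }
        $str;
    }geox;
    $$r_str;
}

# JIS 0x2422 is HIRAGANA A; in EUC-JP it becomes 0xA4 0xA2
my $str = "ab\e\$B\x24\x22\e(Bcd";
printf "%s\n", unpack "H*", jis_euc($str);   # prints 6162a4a26364
```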
> You could bundle several encodings in one XS (the way Encode itself
> bundles ASCII, iso-8859-* and koi8).
I know. But it is bulky, and another problem is that Tcl has a
different notion of 'escape' (like euc_jp_0212, which is not exactly an
escape but an extension) which needs to be corrected for practical
use.
> If any of the bundled encodings have similar sequences of code points
> then we will get overall table size reductions too.
> In the limit one could have Encode::CJK, but perhaps
> Encode::JP / Encode::CN / Encode::KR makes more sense ???
Right. From a user's point of view a distinct package space for each
(human) language is better. But again, this can be implemented like:

  Encode::EUC (does all EUC-based conversion)
  Encode::JP  (wrapper module that calls Encode::EUC and Encode::ISO2022)
  Encode::KR
  Encode::CN

and so forth.
Actually even more table reduction can be done between SHIFT_JIS and
EUC. They are both based upon JISX0208 (and 0201 and 0212), so a simple
calculation converts one to the other.
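To illustrate, here is that calculation for a two-byte JIS X 0208 character, Shift_JIS to EUC-JP, as a standalone sub (a sketch of the arithmetic only; a real converter must also pass ASCII and halfwidth kana through and validate byte ranges):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shift_JIS packs two JIS X 0208 rows per lead byte; undo that
# folding to recover the 7-bit JIS row/cell numbers, then set the
# high bit on both octets to get EUC-JP. No table required.
sub sjis_to_euc {
    my ($s1, $s2) = @_;
    my $j1 = ($s1 <= 0x9f ? $s1 - 0x71 : $s1 - 0xb1) * 2 + 1;
    $s2-- if $s2 > 0x7f;          # trail bytes skip over 0x7f
    my $j2;
    if ($s2 >= 0x9e) {            # second (even-numbered) JIS row
        $j1++;
        $j2 = $s2 - 0x7d;
    }
    else {                        # first (odd-numbered) JIS row
        $j2 = $s2 - 0x1f;
    }
    return ($j1 | 0x80, $j2 | 0x80);
}

# HIRAGANA A: Shift_JIS 0x82 0xA0 -> EUC-JP 0xA4 0xA2
printf "%02x %02x\n", sjis_to_euc(0x82, 0xa0);   # prints "a4 a2"
```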
> We could ship things pre-compiled (with original .ucm's gzipped, or
> provide a way to extract a .ucm from the compiled form).
> Also the compile process is all in perl and has not really been tuned.
> It spends a lot of time trying to find common "strings" (which gets
> tables down in size, so is a win).
Right. Exactly how we do that still needs more experimentation, but
this is what should be done....
Dan