perl-unicode

Re: use Encode; # on Japanese; LONG!

2002-01-11 17:04:23
Nick,

  Here you are at last.

Excellent! ;-)

I grokked Encode even further, and now I see that I need to set a road map before I move forward....
  Here is a list of what I am not sure about.

Portability: make Encode portable with pre-5.6 perl?
        
That needs a complete rewrite of the current code; Encode today is too CORE::-dependent (see, for example, its use of the utf8:: subs). Still, it's worth it, and with proper #ifdef's I think I can make even the XS portable. My opinion is to make Encode available both as part of the core and as an independent module, like so many popular ones: libnet, DB_File, and Storable, to name a few.
Remember there are still lots of sites without 5.6, with good reason.

Conversion Table: where should we store that?

Encode today saves them as separate files. Saving data and code separately is the normal approach for a modern programmer, but module writers may disagree; they may be happier if they can browse the actual data via 'perldoc -m'. This reminds me that Encode.pm contains multiple packages, such as Encode::Encodings; I was at first lost when I tried 'perldoc Encode::Encodings'.

That was a deliberate decision on my part. Including "all" the ASCII-oid
8-bit encodings in their "compiled" form does not use much memory
(as they share at least 1/2 the space for the ASCII part).

As a programmer I say that's fair. As a native user of a non-roman script I say CJK is once again being discriminated against. It would be nice if Encode showed all currently available character sets without loading them -- or loaded ASCII and nothing else by default.

The compiled forms of the multibyte and two-byte encodings are
larger. So I envisage -MEncode=japanese (say) to load clusters.

Once again, that is programmatically correct and politically incorrect. IMHO Encode should load nothing but ASCII and utf8 by default, to be fair.


Encode::Tcl is SADAHIRO Tomoyuki's fixup/enhancement of the pure perl
version we used for a while before I invented the compiled form.
The Tcl-oid version is slow.

Yes, it is, but it works. Also, the compiled form is so far only available for 8-bit charsets.

The .enc files are lifted straight from Tcl. It is unclear to me where
the mappings come from.

I believe they (I mean the Tclers) just converted the tables at ftp://ftp.unicode.org/Public/MAPPINGS/ to their taste.... Oh shoot! I just checked the URI above and found EASTASIA is missing now!

Modern Encode has C code that processes a compiled form and can compile
ICU-like .ucm files as well as .enc. The ICU form can represent fallbacks
and non-reversible stuff as well.

.ucm is much easier on my eyeballs, though somewhat bulky.
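For those who haven't stared at one, here is a made-up fragment in the .ucm shape (my own toy sample, not from ICU's tables; the |n flag after each mapping is what lets the format express fallbacks: |0 is a round-trip mapping, |1 a one-way fallback from Unicode):

<code_set_name> "sample-sbcs"
<mb_cur_min>    1
<mb_cur_max>    1
CHARMAP
<U0041> \x41 |0
<U00A5> \x5C |1
END CHARMAP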

At that point in the coding it became unclear whether we could use ICU
stuff - I think we have since concluded that we can.

jhi answered that one, but I am not sure whether we should make ICU the standard for perl's encoding exchange....

File IO of encoded or UTF-8 data is very, very messy prior to perl 5.7.
At best 'use bytes' is a hack.

I know. To be honest with you, file IO semantics (and IO handles) are one of my least favorite parts of the beast (but I agree this is one of the oldest guts of perl; I started using perl because awk didn't let me open multiple files at once :).



while(defined(my $line = <$in>)){
    use bytes;
      ^^^^^^^^^^  Catastrophic I would guess.
use bytes says "I know exactly what I am doing" and so even though
perl knows better it believes you and fails to UTF-8-ify things
etc.

Is there a straightforward interface that switches between byte semantics and utf8 semantics at RUN TIME?
I just noticed that a script like the one above needs exactly that.

$toencode and eval {use bytes;};  # too hairy!

  appears for some reason.
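If what we want is run-time byte-level work without toggling semantics at all, the workaround I can think of (my own sketch, nothing official) is to make an octet copy of the string and measure that instead:

use Encode qw(encode_utf8);

my $chars  = "\x{65e5}\x{672c}";       # two characters
my $octets = encode_utf8($chars);      # six octets of UTF-8
printf "chars: %d, octets: %d\n",
    length($chars), length($octets);   # prints "chars: 2, octets: 6"

No pragma needed; length() just counts whatever the string actually holds.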
I can only say Encode is far from production level, as far as Japanese
charsets are concerned.

I would agree.
It would be good to have some test data in various encodings.
This is easy for 8-bit encodings: 0..255 is all you need. But for
16-bit encodings (with gaps) and in particular multi-byte encodings
you need a "sensible" starting sample.

Yes. As the writer of Jcode I know that only too well. Japanese is not hard to learn to speak; Japanese encodings are. There are AT LEAST 4 encodings you have to deal with (euc-jp, shiftjis, iso-2022-jp, and Unicode). Actually the Japanese encoding situation is tougher than in other East Asian languages because Japan started computing before the others did; they didn't have to make the same mistakes we did. Oh well....
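To make that concrete, here is the same word pushed through all of them with Encode's API (assuming Encode knows all three names, which as we discuss below is not yet true for the escape-based one):

use Encode qw(encode);

my $word = "\x{65e5}\x{672c}\x{8a9e}";        # 'nihongo', in kanji
my $euc  = encode('euc-jp',      $word);
my $sjis = encode('shiftjis',    $word);
my $jis  = encode('iso-2022-jp', $word);      # escape-based; see below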

I encourage you to look at Encode/encengine.c - it is a state machine
which reads tables to transform octet-sequences.

  I did.  Would you set your tabstop to 4 :)?
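For those on the list who haven't read it, the gist of the table-driven idea, as a toy in perl (my illustration only; the real encengine.c is C and far more involved):

# Each state is a table mapping an input octet either to an output
# string (a leaf) or to another table (a multi-byte prefix).
sub transcode {
    my ($start, $octets) = @_;
    my ($out, $state) = ('', $start);
    for my $byte (split //, $octets) {
        my $entry = $state->{$byte};
        defined $entry or die sprintf "no mapping for 0x%02X", ord $byte;
        if (ref $entry) {                 # prefix: descend into next table
            $state = $entry;
        } else {                          # leaf: emit, return to start state
            $out .= $entry;
            $state = $start;
        }
    }
    return $out;
}

# e.g. a fake encoding where 0x8E introduces a two-byte sequence:
my %lead = (
    "A"    => "a",
    "\x8E" => { "\xA1" => "[kana]" },
);
print transcode(\%lead, "A\x8E\xA1A");    # prints "a[kana]a"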

It is a lot faster than the Encode::Tcl scheme.

I _think_ Encode/compile (which builds the tables) does the right thing for
multi-byte and 16-bit encodings but as I have no reliable test data,
viewer or judgement of end result I cannot be sure.

  It does, but it still doesn't handle escape-based encodings like iso-2022.
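The reason, for the record: iso-2022 switches charsets with escape sequences, so the decoder has to carry a mode that a flat octet-to-octet table cannot express. A toy of my own (only two of the escapes, no real decoding):

# Split an iso-2022-jp octet stream into (charset, chunk) pairs.
# "\e$B" switches to JIS X 0208, "\e(B" back to ASCII; others omitted.
sub iso2022_segments {
    my $octets  = shift;
    my $charset = 'ascii';                # the state a flat table lacks
    my @seg;
    for my $part (split /(\e\$B|\e\(B)/, $octets) {
        if    ($part eq "\e\$B") { $charset = 'jisx0208' }
        elsif ($part eq "\e(B")  { $charset = 'ascii'    }
        elsif (length $part)     { push @seg, [$charset, $part] }
    }
    return @seg;
}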

What I would like to see is:

A. A review of Encode's APIs and principles to make sure I have not
   done anything really stupid. Both the API from a perl script's
   perspective and the API/mechanism that it expects an Encoding
   "plugin" to provide.

Yes. Thanks to the API, encoders can be written very portably. Here is Encode::Jcode, which I wrote in 3 minutes and which worked.

package Encode::Jcode;
use strict;
use Jcode;                      # Jcode does the actual conversion work
use Encode ();                  # provides Encode::Encoding, which we subclass
use base 'Encode::Encoding';
use Carp;

sub add_encodings{
    for my $canon (qw(euc-jp iso-2022-jp shiftjis)){
        my $obj = bless { Name => $canon }, __PACKAGE__;
        $obj->Define($canon);   # register with Encode under the canonical name
    }
}

sub import{
    add_encodings();
}

my %canon2jcode = (             # Encode canonical name => Jcode method
    'euc-jp'      => 'euc',
    'shiftjis'    => 'sjis',
    'iso-2022-jp' => 'iso_2022_jp',
);

sub encode{
    my ($self, $string, $check) = @_;
    my $name = $canon2jcode{$self->{Name}};
    return jcode($string, 'utf8')->$name;   # method call by name; strict-safe
}

sub decode{
    my ($self, $octet, $check) = @_;
    my $name = $canon2jcode{$self->{Name}};
    return jcode($octet, $name)->utf8;
}

1;
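And a quick check that it plugs in (assuming the module above is saved as Encode/Jcode.pm somewhere in @INC):

use Encode qw(encode decode);
use Encode::Jcode;      # import() registers euc-jp, shiftjis, iso-2022-jp

my $utf8 = "\x{65e5}\x{672c}\x{8a9e}";
my $euc  = encode('euc-jp', $utf8);                  # routed through Jcode
print "round trip ok\n" if decode('euc-jp', $euc) eq $utf8;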

The problem is that Encode itself is not portable enough for this to be an independent module....

B. "Blessing" of the Xxxxx <-> Unicode mappings for various encodings.
    Are Tcl's "good enough" or should we use ICU's or Unicode's or ... ?

  IMHO Tcl's are good enough TO START.  But the implementation....  Hmm....


C. Point me at "clusters" of related encodings that are often used
   together and I can have a crack at building a "compiled" XS module
   that provides those encodings.

Another good question is how much to rely on XS. Even Jcode comes with a NoXS module for those environments where you can't build XS, such as an ISP's server, MacOS, and Windows...

D. Some discussion as to how to handle escape encodings and/or
   heuristics for guessing an encoding. I had some 1/4-thought-out
   ideas for how to get encengine.c to assist on these too - but
   I have probably forgotten them.

Well, encoding guessing appears not to be as necessary for other languages as it is for Japanese. Most others have only an 'OLD' (pre-Unicode) encoding and a 'NEW' (Unicode) one. China is a good example: they have virtually just gb2312 and Unicode, and that's it. As for Japanese, just open Internet Explorer and look at the charset menu; only Japanese has 'Auto Detect'.
  Here is how Jcode 'Auto Detect's the character code, purely in perl.

sub getcode {
    my $thingy = shift;
    my $r_str = ref $thingy ? $thingy : \$thingy;

    # %RE holds precompiled regexes for each encoding's byte patterns;
    # _max() and $DEBUG are Jcode internals.
    my ($code, $nmatch, $sjis, $euc, $utf8) = ("", 0, 0, 0, 0);
    if ($$r_str =~ /$RE{BIN}/o) {       # 'binary'
        my $ucs2;
        $ucs2 += length($1)
            while $$r_str =~ /(\x00$RE{ASCII})+/go;
        if ($ucs2){      # smells like raw unicode
            ($code, $nmatch) = ('ucs2', $ucs2);
        }else{
            ($code, $nmatch) = ('binary', 0);
        }
    }
    elsif ($$r_str !~ /[\e\x80-\xff]/o) {       # not Japanese
        ($code, $nmatch) = ('ascii', 1);
    }
    elsif ($$r_str =~                           # 'jis'
           m[
             $RE{JIS_0208}|$RE{JIS_0212}|$RE{JIS_ASC}|$RE{JIS_KANA}
           ]ox)
    {
        ($code, $nmatch) = ('jis', 1);
    }
    else { # should be euc|sjis|utf8
        # use of (?:) by Hiroki Ohzaki <ohzaki@iod.ricoh.co.jp>
        $sjis += length($1)
            while $$r_str =~ /((?:$RE{SJIS_C})+)/go;
        $euc  += length($1)
            while $$r_str =~ /((?:$RE{EUC_C}|$RE{EUC_KANA}|$RE{EUC_0212})+)/go;
        $utf8 += length($1)
            while $$r_str =~ /((?:$RE{UTF8})+)/go;
        $nmatch = _max($utf8, $sjis, $euc);
        carp ">DEBUG:sjis = $sjis, euc = $euc, utf8 = $utf8" if $DEBUG >= 3;
        $code =
            ($euc  > $sjis and $euc  > $utf8) ? 'euc'  :
            ($sjis > $euc  and $sjis > $utf8) ? 'sjis' :
            ($utf8 > $euc  and $utf8 > $sjis) ? 'utf8' : undef;
    }
    return wantarray ? ($code, $nmatch) : $code;
}
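Calling it is just this (getcode happily takes a reference to the raw octets):

use Jcode;

my $octets = do { local $/; <> };        # slurp raw bytes from a file
my ($code, $nmatch) = Jcode::getcode(\$octets);
print defined $code ? "looks like $code\n" : "no idea\n";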

Well, I need to get some sleep now....

Dan the Man with Too Many Charsets To Deal With