perl-unicode

Re: use Encode; # on Japanese; LONG!

2002-01-12 19:23:13
Dan Kogai <dankogai@dan.co.jp> writes:
Nick,

  Here you are at last.

Excellent! ;-)

  I have grokked Encode even further, and now I realize that I need to
set some road map before I move forward....
  Here is a list of what I am not sure about.

Portability: make Encode portable with pre-5.6 perl?

  That needs a complete rewrite of the current code; Encode today is
too CORE:: dependent (such as the use of utf8:: subs).  Still, it's worth
it, and with proper #ifdef's I think I can make even the XS portable.
  My opinion is to make Encode available both as part of the core and as
an independent module, like so many popular ones -- libnet, DB_File,
Storable, to name a few.
  Remember there are still lots of sites without 5.6, for good reasons.

That _may_ be worthwhile, but Encode is targeted at Unicode, and 5.6
was the first to have that. (IMHO Jarkko has done so much good work on 5.7
Unicode that even 5.6 does not really suffice.)

And a solid Encode would be motivation to upgrade.

I have no objection to a back port - but *PLEASE* can we get
the mainline version really solid before doing that?


Conversion Table: where should we store that?

  Encode today saves them as separate files.  Keeping data and code
separate is a normal approach for a modern programmer, but module writers
may disagree; they may be happier if they can browse the actual data via
'perldoc -m'.

I have no real objection to Encode::Xxxxx having .pm files with pods
and perhaps even the .ucm data.

  This reminds me of the fact that Encode.pm contains multiple packages,
such as Encode::Encodings.  I was at first lost when I tried 'perldoc
Encode::Encodings'.

That was a deliberate decision on my part. Including "all" the ASCII-oid
8-bit encodings in their "compiled" form does not use much memory
(as they share at least 1/2 the space for the ASCII part).

  As a programmer I say that's fair.  As a native user of a non-roman
script I say CJK is once again discriminated against.  It would be nice
if Encode showed all currently available character sets without loading
them -- or loaded ASCII and nothing else by default.

Fair point - I see no reason not to at least list the encodings available.
But until implemented and tested they are not really available :-(
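
Something like this, perhaps - a sketch only, and the encodings() class
method name is an assumption, not current API:

use Encode;
# list everything known, without pulling every table into memory
print "$_\n" for Encode->encodings(":all");
# versus only what is actually loaded right now
my @loaded = Encode->encodings();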


The compiled forms of the multibyte and two-byte encodings are
larger. So I envisage -MEncode=japanese (say) to load clusters.
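
i.e. something like (envisaged only - neither the 'japanese' tag nor the
Japanese tables exist yet):

  perl -MEncode=japanese -e 'print Encode::encode("euc-jp", $string)'

so you only pay the memory cost for the cluster(s) you ask for.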

  Once again it is programmatically correct and politically incorrect.

Both programmatic and political corrections welcome.

  IMHO Encode should load nothing but ASCII and utf8 by default, to be fair.

Ah - but ASCII is also a politically incorrect bundling of all the 
iso-8859-* and some windows code pages...



Encode::Tcl is SADAHIRO Tomoyuki's fixup/enhancement of the pure perl
version we used for a while before I invented the compiled form.
The Tcl-oid version is slow.

  Yes it is, but it works.  Also, the compiled form is so far only
available for 8-bit charsets.

I do not believe that to be the case. I have compiled the Big5 and GB?????
sets (because they were largest).


The .enc files are lifted straight from Tcl. It is unclear to me where
the mappings come from.

  I believe they (I mean the Tclers) just converted the forms at
ftp://ftp.unicode.org/Public/MAPPINGS/ to their taste....  Oh shoot!  I
just checked the URI above and found that EASTASIA is missing now!

Yes it wandered off a while back ...


Modern Encode has C code that processes a compiled form and can compile
ICU-like .ucm files as well as .enc. The ICU form can represent fallbacks
and non-reversible stuff as well.

  .ucm is much easier on my eyeballs, though somewhat bulky.
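
For the record, a .ucm table is plain readable text, something like this
(entries here are illustrative, not from a real table; |0 marks a
round-trip mapping, |1 a one-way fallback from Unicode):

<code_set_name>  "shiftjis"
<mb_cur_min>     1
<mb_cur_max>     2
<subchar>        \x3F
CHARMAP
<U0041> \x41 |0   # 'A', round-trip
<U00A5> \x5C |1   # YEN SIGN, fallback only (non-reversible)
END CHARMAP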

As I expect them to be pre-compiled - either as shipped, or at build time -
we could zip them if we had zlib/Compress::Zlib or a gzip.
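
A rough sketch of how the build step might read a zipped table, assuming
Compress::Zlib is available (the file name is hypothetical):

use Compress::Zlib;
my $gz = gzopen("jis0208.ucm.gz", "rb") or die "gzopen failed: $gzerrno";
while ($gz->gzreadline(my $line) > 0) {
    # hand $line to the .ucm parser as if it came from a plain file
}
$gz->gzclose;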


At that point in the coding it became unclear whether we could use ICU
stuff - I think we have since concluded that we can.

  jhi answered that one, but I am not sure if we should make ICU the
standard for perl encoding exchange....

Not the ICU program(s), just their tables - other "definitive" open sources
welcome (Linux iconv has some merit to my non-expert eyes).


File IO of encoded or UTF-8 data is very very messy prior to perl 5.7.
At best 'use bytes' is a hack.

  I know.  To be honest with you, file IO semantics (and the IO handle) is
one of my least favorite parts of the beast.  (But I agree this is one of
the oldest parts of perl's guts.  I started using perl because awk didn't
let me open multiple files at once :).



while(defined(my $line = <$in>)){
    use bytes;
      ^^^^^^^^^^  Catastrophic I would guess.
use bytes says "I know exactly what I am doing" and so even though
perl knows better it believes you and fails to UTF-8-ify things
etc.

  Is there a straight interface that switches between byte semantics and
utf8 at RUN TIME?

You should not need to if we have UTF-8 semantics right.

You can (in theory) switch encodings on file handles at run time,
and change (I think) assumed "locale-encoding" on a (lexical?) basis.
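
In 5.7-speak the loop above should come out something like this - layer
names as I currently understand them, so treat this as an assumption:

open(my $in, "<:encoding(euc-jp)", "japanese.txt") or die $!;
while (defined(my $line = <$in>)) {
    # $line arrives in perl's internal utf8 - no 'use bytes' games
}
binmode($in, ":encoding(shiftjis)");  # switching encodings at run time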

  I just noticed that a script like the one above needs exactly that.

$toencode and eval {use bytes;};  # too hairy!

  appears for some reason.
  I can only say Encode is far from production level, so far as the
Japanese charset is concerned.

I would agree.
It would be good to have some test data in various encodings.
This is easy for 8-bit encodings: 0..255 is all you need.  But for
16-bit encodings (with gaps), and in particular multi-byte encodings,
you need a "sensible" starting sample.
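
For the 8-bit case the sample really is a one-liner:

my $sample = join '', map { chr } 0 .. 255;   # every octet exactly once

but for the multi-byte encodings something has to generate only *valid*
sequences, which is where I need native eyes.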

  Yes.  As the writer of Jcode I know that only too well.  Japanese is not
hard to learn to speak; Japanese encoding is.  There are AT LEAST 4
encodings you have to deal with (euc-jp, shiftjis, iso-2022-jp, and
Unicode).  Actually the Japanese encoding situation is tougher than in
other East Asian languages, because Japan started computing before the
others did.  The others didn't have to make the same mistakes we did.
Oh well....

I encourage you to look at Encode/encengine.c - it is a state machine
which reads tables to transform octet-sequences.

  I did.  Would you set your tabstop to 4 :)?

Sorry - Jarkko feel free to C<indent> encengine.c if I don't get there
first.


It is a lot faster than Encode::Tcl scheme.

I _think_ Encode/compile (which builds the tables) does the right thing for
multi-byte and 16-bit encodings, but as I have no reliable test data,
viewer, or judgement of the end result, I cannot be sure.

  It does, but it still doesn't cut escape-based encodings like iso-2022.

Need help on those - in principle it can at least spot the coding violations
that the next escape should provoke, and so an Encode::Tcl-like escape
handler should be able to use it to handle the inner encodings????
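
Something like this, perhaps - a hand-rolled sketch, NOT Encode's actual
API, and decode_inner() is hypothetical:

# split an iso-2022-jp stream on its escape sequences and hand each run
# of octets to a single-charset inner decoder
my %charset_for = (
    "\e(B"   => 'ascii',      # ESC ( B : ASCII
    "\e(J"   => 'ascii',      # ESC ( J : JIS X 0201 Roman (near-ASCII)
    "\e\$\@" => 'jis0208',    # ESC $ @ : JIS X 0208-1978
    "\e\$B"  => 'jis0208',    # ESC $ B : JIS X 0208-1983
);

sub decode_iso2022jp {
    my ($octets) = @_;
    my ($unicode, $charset) = ('', 'ascii');
    for my $chunk (split /(\e\([BJ]|\e\$[\@B])/, $octets) {
        next unless length $chunk;
        if (exists $charset_for{$chunk}) {
            $charset = $charset_for{$chunk};   # escape seen: switch state
        } else {
            $unicode .= decode_inner($charset, $chunk);  # hypothetical
        }
    }
    return $unicode;
}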


What I would like to see is :

A. A review of Encode's APIs and principles to make sure I have not
   done anything really stupid. Both API from perl script's perspective
   and also the API/mechanism that it expects an Encoding "plugin" to
   provide.

  Yes.  Thanks to the API, encoders can be written very portably.  Here
is an Encode::Jcode that I wrote in 3 minutes and that worked.

package Encode::Jcode;
use strict;
use Jcode;
use Encode qw(find_encoding);
use base 'Encode::Encoding';
use Carp;

# register the three Japanese encodings with Encode, backed by Jcode
sub add_encodings{
    for my $canon (qw(euc-jp iso-2022-jp shiftjis)){
        my $obj = bless { Name => $canon }, __PACKAGE__;
        $obj->Define($canon);
    }
}

sub import{
    add_encodings();
}

# map Encode's canonical names to Jcode's method/charset names
my %canon2jcode = (
    'euc-jp'      => 'euc',
    'shiftjis'    => 'sjis',
    'iso-2022-jp' => 'iso_2022_jp',
);

sub encode{
    my ($self, $string, $check) = @_;
    my $name = $canon2jcode{$self->{Name}};
    return jcode($string, 'utf8')->$name;
}

sub decode{
    my ($self, $octet, $check) = @_;
    my $name = $canon2jcode{$self->{Name}};
    return jcode($octet, $name)->utf8;
}

1;
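
Using it is then just the normal Encode API (assuming the above is saved
as Encode/Jcode.pm):

use Encode qw(encode decode);
use Encode::Jcode;            # registers euc-jp, shiftjis, iso-2022-jp
my $euc  = encode('euc-jp', $utf8_text);   # utf8 -> euc-jp octets
my $text = decode('euc-jp', $euc);         # euc-jp octets -> utf8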

  The problem is that Encode itself is not portable enough to be an
independent module....

B. "Blessing" of the Xxxxx <-> Unicode mappings for various encodings.
    Are Tcl's "good enough" or should we use ICU's or Unicode's or ... ?

  IMHO Tcl's are good enough TO START.  But the implementation...  Hmm....


C. Point me at "clusters" of related encodings that are often used
   together, and I can have a crack at building a "compiled" XS module
   that provides those encodings.

  Another good question is how much to XS.  Even Jcode comes with a NoXS
module for those environments where you can't build XS, such as an ISP's
server, MacOS, and Windows...

Which is why I want to get "clusters" and XS-ness in "core perl".


D. Some discussion as to how to handle escape encodings and/or
   heuristics for guessing encoding. I had some 1/4 thought out
   ideas for how to get encengine.c to assist on these too - but
   I have probably forgotten them.

  Well, encoding guessing appears not to be as needed for other languages
as it is for Japanese.  Most others just have an 'OLD' (pre-Unicode) and a
'NEW' (Unicode) encoding.  China is a good example; they virtually have
gb2312 and Unicode and that's it.
  As for Japanese, just open Internet Explorer and check the charset
menu.  Only Japanese has 'Auto Detect'.
  Here is how Jcode 'Auto Detect's the character code.  Purely in perl.

sub getcode {
    my $thingy = shift;
    my $r_str = ref $thingy ? $thingy : \$thingy;

    my ($code, $nmatch, $sjis, $euc, $utf8) = ("", 0, 0, 0, 0);
    if ($$r_str =~ /$RE{BIN}/o) {       # 'binary'
        my $ucs2;
        $ucs2 += length($1)
            while $$r_str =~ /(\x00$RE{ASCII})+/go;
        if ($ucs2){      # smells like raw unicode
            ($code, $nmatch) = ('ucs2', $ucs2);
        }else{
            ($code, $nmatch) = ('binary', 0);
        }
    }
    elsif ($$r_str !~ /[\e\x80-\xff]/o) {       # not Japanese
        ($code, $nmatch) = ('ascii', 1);
    }                           # 'jis'
    elsif ($$r_str =~
           m[
             $RE{JIS_0208}|$RE{JIS_0212}|$RE{JIS_ASC}|$RE{JIS_KANA}
           ]ox)
    {
        ($code, $nmatch) = ('jis', 1);
    }
    else { # should be euc|sjis|utf8
        # use of (?:) by Hiroki Ohzaki <ohzaki@iod.ricoh.co.jp>
        $sjis += length($1)
            while $$r_str =~ /((?:$RE{SJIS_C})+)/go;
        $euc  += length($1)
            while $$r_str =~
                /((?:$RE{EUC_C}|$RE{EUC_KANA}|$RE{EUC_0212})+)/go;
        $utf8  += length($1)
            while $$r_str =~ /((?:$RE{UTF8})+)/go;
        $nmatch = _max($utf8, $sjis, $euc);
        carp ">DEBUG:sjis = $sjis, euc = $euc, utf8 = $utf8"
            if $DEBUG >= 3;
        $code =
            ($euc > $sjis and $euc > $utf8) ? 'euc' :
                ($sjis > $euc and $sjis > $utf8) ? 'sjis' :
                    ($utf8 > $euc and $utf8 > $sjis) ? 'utf8' : undef;
    }
    return wantarray ? ($code, $nmatch) : $code;
}
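
Usage, if you have Jcode installed, is just:

use Jcode;   # getcode() comes with the default exports
my ($code, $nmatch) = getcode(\$octets);
print "this looks like $code\n";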

Well, I need to get some sleep now....

Dan the Man with Too Many Charsets To Deal With
-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
