Re: New iso2022jp.pl

From: Earl Hood <earl(_at_)earlhood(_dot_)com>
Subject: Re: New iso2022jp.pl
Date: Tue, 23 Jul 2002 22:46:14 -0500

Note: I did add a MHonArc::UTF8 module that does allow conversion
of message text to UTF-8, but text clipping is a problem as with
iso-2022-jp.


I haven't checked the UTF-8-related code in MHonArc yet, but
I think clipping a UTF-8 string is very easy if we use Perl
5.8 or later.
I have attached some sample code to this mail.

Is using UTF-8 acceptable for Japanese users?


This is a difficult question.

As far as browsing is concerned, most recent browsers such as
  Netscape 4.79 on FreeBSD (with Linux emulation)
  Mozilla 1.0   on FreeBSD
  IE 5.50.x     on Windows ME
support UTF-8, though it seems some browsers such as lynx do
not support it.


However, I think Namazu (and kakasi, which is widely used
with Namazu to process Japanese text) do not support UTF-8
(I'll check this in detail later).
This might be a big problem for us.

A possible interim solution is to add yet another resource that
allows you to specify the "clip" routine used in resource variable
text clipping.  That way, you can specify the iso-2022-jp clip
function for iso-2022-jp message archives.  There could also be
a UTF-8 aware function for UTF-8 strings.


Good.
In fact, this is what I'm thinking.
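
A minimal sketch of how such a selectable clip routine might look,
assuming Perl 5.8 or later with the Encode module.  The names
clip_chars, make_clipper, %CLIP_FUNC and clip_text are hypothetical
and are not part of MHonArc; the regex is the one used in the sample
script attached to this mail:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Clip a decoded (character) string to at most $len units.  An entity
# reference such as "&amp;" counts as one unit, as does any other
# single character.
sub clip_chars {
    my ($text, $len) = @_;
    my @chars = $text =~ /(\&[^;\s]*;|.)/g;
    return $text if @chars <= $len;
    return join('', @chars[0 .. $len - 1]);
}

# Build a clip routine for one charset: decode, clip, re-encode.
sub make_clipper {
    my ($charset) = @_;
    return sub {
        my ($bytes, $len) = @_;
        my $text = decode($charset, $bytes);
        return encode($charset, clip_chars($text, $len));
    };
}

# Hypothetical table: charset name (lowercase) => clip routine.
my %CLIP_FUNC = (
    'utf-8'       => make_clipper('UTF-8'),
    'iso-2022-jp' => make_clipper('iso-2022-jp'),
);

# Clip $bytes using the routine registered for $charset; fall back
# to plain byte clipping when no routine is registered.
sub clip_text {
    my ($charset, $bytes, $len) = @_;
    my $func = $CLIP_FUNC{lc($charset)}
        or return substr($bytes, 0, $len);
    return $func->($bytes, $len);
}
```

For example, clip_text('iso-2022-jp', $subject, 20), where $subject
holds the raw iso-2022-jp bytes, should return at most 20 characters,
with Encode restoring the escape sequences when it re-encodes.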


P.S.
There's no ';' at the end of line 646 in mhrcvars.pl:
   646                      $ret = join('', @chars[0 .. $len-1])

-- 
Takashi P.KATOH

#!/tmp/perl/bin/perl

use strict;
use warnings;
use Encode;    # exports decode_utf8 and encode_utf8 by default

while (<>) {    # encoding of input file must be UTF-8
    chomp;

    # decode the UTF-8 bytes so that perl treats $_ as a character string
    $_ = decode_utf8($_);

    foreach my $len (0 .. 9) {
        my $ret = $_;

        # taken from MHonArc (mhrcvars.pl)
        my @chars = $ret =~ /(\&[^;\s]*;|.)/g;
        if (scalar(@chars) < $len) {
            $ret = join('', @chars);
        } else {
            $ret = join('', @chars[0 .. $len-1]);
        }

        # convert perl's internal representation back to UTF-8
        print encode_utf8("$_:$len\t$ret\n");
    }
}