From: Earl Hood <earl(_at_)earlhood(_dot_)com>
Subject: Re: New iso2022jp.pl
Date: Tue, 23 Jul 2002 22:46:14 -0500
Note: I did add a MHonArc::UTF8 module that does allow conversion
of message text to UTF-8, but text clipping is a problem as with
iso-2022-jp.
I've not checked the UTF-8 related code in MHonArc yet, but
I think clipping a UTF-8 string is very easy if we use Perl
5.8 or later.
I have attached sample code to this mail.
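For illustration only, and separate from the attached script: a minimal sketch of why decoding first makes clipping easy, assuming Perl 5.8 or later with the core Encode module. The sample string and the variable names are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Three Japanese characters as raw UTF-8 bytes (3 characters, 9 bytes)
my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# Clipping the raw bytes cuts in the middle of the second character
my $bad = substr($bytes, 0, 4);

# After decoding, substr() counts characters, so the clip is safe
my $chars = decode_utf8($bytes);
my $good  = encode_utf8(substr($chars, 0, 2));

printf "bytes: %d, chars: %d\n", length($bytes), length($chars);
```

Here $bad ends in a broken byte sequence, while $good holds exactly the first two characters.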
Is using UTF-8 acceptable for Japanese users?
This is a difficult question.
As far as browsing is concerned, most recent browsers, such as
Netscape 4.79 on FreeBSD (with Linux emulation)
Mozilla 1.0 on FreeBSD
IE 5.50.x on Windows ME
support UTF-8, though it seems some browsers, like lynx, do
not support it.
However, I think Namazu (and kakasi, which is widely used
with Namazu to process Japanese text) do not support UTF-8
(I'll check this in detail later).
This might be a big problem for us.
A possible interim solution is to add yet another resource that
allows you to specify the "clip" routine used in resource variable
text clipping. Therefore, you can specify the iso-2022-jp clip
function for iso-2022-jp message archives. There could also be
a UTF-8 aware function for UTF-8 strings.
Good.
In fact, this is what I'm thinking.
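A rough sketch of how such a pluggable clip routine might look. Everything named here (%ClipFunc, clip_default, clip_text) is hypothetical and not an existing MHonArc resource or function; it only assumes Perl 5.8 or later with the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical table mapping a charset name to its clip routine.
# Each routine takes (string, maximum length in characters).
my %ClipFunc = (
    'utf-8' => sub {
        my ($str, $len) = @_;
        my $chars = decode_utf8($str);
        return encode_utf8(substr($chars, 0, $len));
    },
    # an iso-2022-jp aware routine would be registered here
);

# Fallback: plain clipping, fine for single-byte charsets
sub clip_default {
    my ($str, $len) = @_;
    return substr($str, 0, $len);
}

# Pick the clip routine for the archive's charset
sub clip_text {
    my ($charset, $str, $len) = @_;
    my $func = $ClipFunc{lc $charset} || \&clip_default;
    return $func->($str, $len);
}

# Clips a three-character UTF-8 string to its first two characters
print clip_text('UTF-8', "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E", 2), "\n";
```

This sketch ignores entity references such as "&amp;", which the clipping code in mhrcvars.pl treats as single characters; a real routine would need to keep that behavior.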
P.S.
There's no ';' at the end of line 646 in mhrcvars.pl:
646 $ret = join('', @chars[0 .. $len-1])
--
Takashi P.KATOH
#!/tmp/perl/bin/perl
use strict;
use warnings;
use Encode;

while (<>) {    # encoding of input file must be UTF-8
    chomp;
    # tell perl that $_ is a UTF-8-encoded string
    $_ = Encode::decode_utf8($_);
    foreach my $len (0 .. 9) {
        my $ret = $_;
        # taken from MHonArc (mhrcvars.pl)
        my @chars = $ret =~ /(\&[^;\s]*;|.)/g;
        if (scalar(@chars) < $len) {
            $ret = join('', @chars);
        } else {
            $ret = join('', @chars[0 .. $len-1]);
        }
        # convert perl's internal representation back to UTF-8
        print encode_utf8("$_:$len\t$ret\n");
    }
}