Re: New

2002-07-24 01:20:42
From: Earl Hood <earl(_at_)earlhood(_dot_)com>
Subject: Re: New
Date: Tue, 23 Jul 2002 22:46:14 -0500
Note: I did add a MHonArc::UTF8 module that does allow conversion
of message text to UTF-8, but text clipping is a problem as with

I've not checked the UTF-8 related code in MHonArc yet, but
I think clipping a UTF-8 string is very easy if we use perl
5.8 or later.
I have attached sample code to this mail.

Is using UTF-8 acceptable for Japanese users?

This is a difficult question.

As far as browsing is concerned, most recent browsers, such as
  Netscape 4.79 on FreeBSD (with Linux emulation)
  Mozilla 1.0   on FreeBSD
  IE 5.50.x     on Windows ME
support UTF-8, though it seems some browsers, like lynx, do
not support it.

However, I think Namazu (and kakasi, which is widely used
with Namazu to process Japanese text) does not support UTF-8
(I'll check this in detail later).
This might be a big problem for us.

A possible interim solution is to add yet another resource that
allows you to specify the "clip" routine used in resource variable
text clipping.  Therefore, you can specify the iso-2022-jp clip
function for iso-2022-jp message archives.  There could also be
a UTF-8 aware function for UTF-8 strings.

In fact, this is what I'm thinking.
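
To make the idea concrete, here is a rough sketch of what a per-charset
"clip" routine table could look like. This is only an illustration, not
MHonArc code: make_clip, %clip_for and clip_text are names made up for
this example, and the sketch assumes perl 5.8 or later with the Encode
module. It also ignores entity references (&...;), which the real
clipping code treats as single characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: build a clip routine for one charset.
# The routine decodes the bytes into characters, clips by character
# count, and encodes the result back into the same charset, so a
# multibyte sequence is never cut in half.
sub make_clip {
    my ($charset) = @_;
    return sub {
        my ($bytes, $len) = @_;
        my $str     = decode($charset, $bytes);
        my $clipped = substr($str, 0, $len);
        return encode($charset, $clipped);
    };
}

# Hypothetical table that a resource setting could fill in:
# charset name (lowercase) => clip routine.
my %clip_for = (
    'utf-8'       => make_clip('UTF-8'),
    'iso-2022-jp' => make_clip('iso-2022-jp'),
);

# Fallback for charsets without an entry: treat one byte as one character.
sub clip_bytes {
    my ($bytes, $len) = @_;
    return substr($bytes, 0, $len);
}

# Look up the clip routine for the archive's charset and apply it.
sub clip_text {
    my ($charset, $bytes, $len) = @_;
    my $clip = $clip_for{lc $charset} || \&clip_bytes;
    return $clip->($bytes, $len);
}
```

With something like this, an iso-2022-jp archive and a UTF-8 archive
would each get a suitable routine, e.g. clip_text('UTF-8', $subject, 20),
and adding another charset would only mean adding another table entry.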

There's no ';' at the end of line 646 in
   646                      $ret = join('', @chars[0 .. $len-1])

Takashi P.KATOH


use Encode;

while (<>) {    # encoding of input file must be UTF-8

    # tell perl that $_ is a UTF-8-encoded string
    $_ = Encode::decode_utf8($_);
    chomp;    # drop the trailing newline so each result prints on one line

    foreach $len (0 .. 9) {
        $ret = $_;
        # taken from MHonArc (
        my @chars = $ret =~ /(\&[^;\s]*;|.)/g;
        if (scalar(@chars) < $len) {
            $ret = join('', @chars);
        } else {
            $ret = join('', @chars[0 .. $len-1]);
        }

        # convert perl's internal code to UTF-8
        print encode_utf8("$_:$len\t$ret\n");
    }
}