namazu-users-en

Re: mhonarc.pl modifications with MHonArc-2.6.3

2003-04-28 10:15:23
On April 28, 2003 at 14:14, Makoto Fujiwara wrote:

I have started to use MHonArc-2.6.3 very recently.
Some of the defaults for handling 2-byte characters have changed.
But if I have the line
<CharsetConverters>
iso-2022-jp; iso_2022_jp::str2html; iso2022jp.pl
</CharsetConverters>
I get the same behavior as in previous versions (2.5.x or earlier).

The changes to the defaults were made to provide consistency in the
default handling of character sets in v2.6.  The change in
iso-2022-jp default handling is highlighted in the MHonArc
release notes.

MHonArc now understands MIME mail (not that recently, I know), which
is great, thanks Earl, so I no longer need
   /usr/local/bin/nkf -me
pre-processing for the input.

I'm confused by this statement since MHonArc has understood MIME
for a long time.  I'm assuming you are referring to the additional
character encoding support included in v2.6.

(1) Internal multi-byte chars:
One problem, not entirely related to Namazu: with the original
mhonarc code, if I define a multi-byte string for a variable in the
.mhonarc.mrc file, the output is a mixture of ISO-2022-JP and
EUC-JAPAN.
This observation rests on two assumptions:

(a) I will process the articles with Namazu, and Namazu needs
<CharsetConverters> defined with str2html-type processing,
not with "&#x86FB;"-type encoding.

(b) Multi-byte character values in .mhonarc.mrc are processed by
Perl/MHonArc, so they must not be in a shift-locked 7-bit charset.
I used EUC-JAPAN when defining variables in the .mhonarc.mrc file.

To solve this charset mixture problem (1),
I am currently using Jcode::convert(\$_,'euc') in iso-2022-jp.pl
and processing all the text in EUC-JAPAN.

I will probably post this part to the mhonarc-users mailing list later.

If I may try to clarify, you were using one encoding in your mhonarc
resource file but you have mail messages that use a different encoding.
Am I correct?

Mixed encodings will always be a problem.  In MHonArc v2.6, you do
have the ability to normalize different encodings into one encoding.
For example, if you use EUC in your resource page layout, you could
have MHonArc encode all messages into EUC when processed.  See
the TEXTENCODE resource for details.

Now, if you edit mhonarc resource files in one encoding, but want the
data to be mapped to another encoding, then you should filter your
resource file.  For example, you edit your resource file in EUC and
then you post-process it to ISO-2022-JP before passing it to mhonarc.

(2) filter/mhonarc.pl
MHonArc retains the MIME B-encoding of the Subject: and From: values
in the comment lines that the filter matches with:
/<!--X-Subject: ([^-]+) -->/) {
so mhonarc.pl returns encoded text as the field value.

You can avoid the encoding by utilizing the TEXTENCODE resource
in MHonArc.  TEXTENCODE will cause the data to be pre-decoded
and stored in the encoding you specify.  It is best when mapping
everything to UTF-8, but it can be used to map to any encoding.
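
As a rough illustration of the above, a resource-file entry along
these lines would ask MHonArc to store everything as UTF-8.  The
converter function and source file named here are the ones I recall
from the TEXTENCODE documentation; treat them as assumptions and
check them against the docs shipped with your installed version.

```
<TextEncode>
utf-8; MHonArc::UTF8::to_utf8; MHonArc/UTF8.pm
</TextEncode>
```

With this in place, header data should already be decoded and in the
target encoding by the time the namazu filter reads it.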

You may also want to look at the DECODEHEADS resource.

However, you do touch upon a general problem for archives that
do not use TEXTENCODE and that have non-ASCII encoded data in
the Subject.  A potential general solution is to utilize Perl's
Encode module within the namazu filter to decode the text
data to the encoding designated in namazu.rc.

Since Encode is only available in Perl 5.8 and later, multi-module
checks could be made (similar to how MHonArc 2.6 does charset
processing) or the issue could simply be documented as a limitation
for those using older versions of Perl.
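
A minimal sketch of that idea, not a tested patch against the actual
filter: decode_mail_header() is a hypothetical helper name, and the
target charset argument stands in for whatever the filter would
derive from the namazu.rc language setting.  It assumes only that
Encode can be loaded; otherwise the value is returned untouched.

```perl
use strict;
use warnings;

# Check once whether Encode (Perl 5.8 and later) can be loaded.
my $HAVE_ENCODE = eval { require Encode; 1 } ? 1 : 0;

# Hypothetical helper, not an existing Namazu or MHonArc function.
# $raw    : header value as stored, possibly with RFC 2047 encoded-words
# $target : charset to return (assumed to come from the namazu.rc setting)
sub decode_mail_header {
    my ($raw, $target) = @_;

    # Older Perl: leave the value as-is (the documented limitation).
    return $raw unless $HAVE_ENCODE;

    # 'MIME-Header' decodes encoded-words (B and Q) into Perl characters.
    my $text = Encode::decode('MIME-Header', $raw);

    # Re-encode the characters as bytes in the requested charset.
    return Encode::encode($target, $text);
}

# Usage: an ISO-2022-JP, B-encoded Subject value mapped to EUC-JP.
my $subject = '=?ISO-2022-JP?B?GyRCRnxLXDhsGyhC?=';
my $euc     = decode_mail_header($subject, 'euc-jp');
```

The same routine would apply to From: and any other header, since it
does not depend on which field the value came from.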

So I have modified mhonarc.pl so that it returns the string after
MIME::Base64 decoding and EUC conversion.

This mod needs two more external resources, Jcode.pm and MIME::Base64.pm.

(I am not saying this is a good solution, just reporting that I had
this kind of problem and worked around it with this patch.)

Good catch.  You are right in implying that your patch is not
necessarily the best solution.  Its main problem is that it only
solves your particular need and not the general problem.

A proper patch could key off the namazu.rc lang setting and
try to generally map the non-ASCII encoded data to the given locale.

Since this problem is not unique to MHonArc (i.e. Namazu can index
regular mail and news posts), a general decoding routine should be
made available to namazu filters that decodes non-ASCII encoded data
in mail headers.  It is worth noting that other mail headers beyond
the Subject: header can include non-ASCII encoded data.

--ewh
