[bugs #11187] incorrectly parsing UTF-8 encoded messages

2004-12-01 17:01:00
This mail is an automated notification from the bugs tracker
 of the project: MHonArc.

[bugs #11187] Full Item Snapshot:

URL: <>
Project: MHonArc
Submitted by: Egmont Koblinger
On: Thu 12/02/04 at 00:04

Category:  Character Sets
Severity:  5 - Average
Item Group:  Incorrect Behavior
Resolution:  None
Privacy:  Public
Assigned to:  None
Status:  Open
Platform Version:  Linux
Perl Version:  5.8.5
Component Version:  2.6.10
Fixed Release:  

Summary:  incorrectly parsing UTF-8 encoded messages

Original Submission:  I run mhonarc without any configuration file, using just
the command "mhonarc -outdir outdir indir", where "indir"
contains a single file holding one message encoded in
UTF-8. (Both the subject and the body contain UTF-8 encoded
accented letters; the subject uses quoted-printable and the
body's transfer encoding is 8-bit.)
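
For anyone wanting to reproduce this, a test message of the kind described
above could be built roughly as follows. This is only a sketch in Python: the
addresses, the subject text and the helper names (q_encode_word,
build_message) are made up for illustration and are not taken from the
original message.

```python
# Sketch of a reproduction input: one UTF-8 message whose subject is an
# RFC 2047 quoted-printable ("Q") encoded-word and whose body is sent 8bit.
# All names and addresses here are hypothetical.
EURO = "\u20ac"  # Euro sign, UTF-8 bytes E2 82 AC


def q_encode_word(text):
    """Return an RFC 2047 Q encoded-word for text.

    Simplified: every byte is written as =XX, which is valid but verbose.
    """
    encoded = "".join("={:02X}".format(b) for b in text.encode("utf-8"))
    return "=?UTF-8?Q?" + encoded + "?="


def build_message(subject, body):
    """Return the raw bytes of a minimal single-part UTF-8 message."""
    headers = [
        "From: reporter@example.org",
        "To: list@example.org",
        "Subject: " + q_encode_word(subject),
        "MIME-Version: 1.0",
        "Content-Type: text/plain; charset=UTF-8",
        "Content-Transfer-Encoding: 8bit",
    ]
    head = ("\r\n".join(headers) + "\r\n\r\n").encode("ascii")
    return head + body.encode("utf-8") + b"\r\n"


msg = build_message("Price in " + EURO, "The price is 5 " + EURO + ".")
```

Writing msg to a single file inside "indir" should give an input comparable to
the one used in this report.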

The output HTML files are quite strange. For each UTF-8
byte sequence, only the first byte is taken into account,
and it is converted to an HTML escape. For example, the
Euro sign (U+20AC, UTF-8: E2 82 AC) appears in the HTML
output as "&#E2;"; the bytes 82 and AC are skipped, and
processing goes on with the next Unicode character.
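
The difference between the observed and the expected output can be
illustrated with a small Python sketch. It is not MHonArc's code; both
function names are hypothetical, and the "&#x...;" hexadecimal form used for
the escapes is an assumption about notation (the values E2 and 20AC are the
ones from this report).

```python
def escape_first_byte_only(data):
    """Mimic the reported symptom: escape each lead byte, drop the rest."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))               # plain ASCII passes through
        elif b >= 0xC0:
            out.append("&#x{:X};".format(b))  # lead byte escaped on its own
        # bytes 0x80..0xBF are continuation bytes and are silently skipped
    return "".join(out)


def escape_code_points(data):
    """Expected behaviour: decode first, then escape whole code points."""
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        out.append(ch if cp < 0x80 else "&#x{:X};".format(cp))
    return "".join(out)


sample = b"5 \xe2\x82\xac"                   # "5 " followed by the Euro sign
observed = escape_first_byte_only(sample)    # "5 &#xE2;"
expected = escape_code_points(sample)        # "5 &#x20AC;"
```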

In MHonarc/ line 153 there is a check for whether
perl is new enough to support UTF-8. If it isn't,
manual processing of UTF-8 characters takes place.
Forcing the "non-UTF-8-aware perl" branch of the "if"
statement (that is, changing "if ($] >= 5.006)" to
"if (0)") fixes the problem; the output is then
the expected "&#20AC;".
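
Manual processing of a UTF-8 byte sequence amounts to combining the payload
bits of the lead byte with those of the continuation bytes. The following
Python sketch shows the general technique only; it is not the code of the
"non-UTF-8-aware" branch, the name decode_utf8 is hypothetical, and checks
for overlong forms and surrogates are left out.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of Unicode code points by hand."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                  # 0xxxxxxx: one byte (ASCII)
            cp, extra = lead, 0
        elif 0xC0 <= lead < 0xE0:        # 110xxxxx: two-byte sequence
            cp, extra = lead & 0x1F, 1
        elif 0xE0 <= lead < 0xF0:        # 1110xxxx: three-byte sequence
            cp, extra = lead & 0x0F, 2
        elif 0xF0 <= lead < 0xF8:        # 11110xxx: four-byte sequence
            cp, extra = lead & 0x07, 3
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        tail = data[i + 1:i + 1 + extra]
        if len(tail) != extra or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("bad continuation bytes at offset %d" % i)
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)  # append six payload bits per byte
        code_points.append(cp)
        i += 1 + extra
    return code_points


euro = decode_utf8(b"\xe2\x82\xac")      # [0x20AC]
```

For E2 82 AC the lead byte contributes 0x2, and the continuation bytes
contribute 0x02 and 0x2C, giving (0x2 << 12) | (0x02 << 6) | 0x2C = 0x20AC.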

I don't think it matters, but I have LANG=hu_HU (latin2
locale) and no other LC_* variables set. However, UTF-8
locales are also available on my system.

  Message sent via/by Savannah
