[bugs #11187] incorrectly parsing UTF-8 encoded messages

2004-12-03 13:37:44
This mail is an automated notification from the bugs tracker
 of the project: MHonArc.

[bugs #11187] Latest Modifications:

Changes by: 
                Earl Hood <earl(_at_)earlhood(_dot_)com>
                Fri 12/03/2004 at 20:41 (US/Central)

            What     | Removed                   | Added
          Resolution | None                      | Fixed
       Fixed Release |                           | CVS

------------------ Additional Follow-up Comments ----------------------------
Fix checked into CVS.

[bugs #11187] Full Item Snapshot:

URL: <>
Project: MHonArc
Submitted by: Egmont Koblinger
On: Thu 12/02/2004 at 00:04

Category:  Character Sets
Severity:  5 - Average
Item Group:  Incorrect Behavior
Resolution:  Fixed
Privacy:  Public
Assigned to:  None
Status:  Open
Platform Version:  Linux
Perl Version:  5.8.5
Component Version:  2.6.10
Fixed Release:  CVS

Summary:  incorrectly parsing UTF-8 encoded messages

Original Submission:  I run mhonarc without any configuration file, using
just the command "mhonarc -outdir outdir indir", where "indir"
contains only one file with a single message encoded in
UTF-8. (Both the subject and the body contain UTF-8 encoded
accented letters; the subject uses quoted-printable and the
body's transfer encoding is 8-bit.)

The output HTML files are quite strange. For each UTF-8
byte sequence only the first byte is taken into account,
and it is converted to an HTML escape. For example, the
Euro sign (U+20AC, UTF-8: E2 82 AC) appears in the HTML
output as "&#E2;"; 82 and AC are then skipped and processing
goes on with the next Unicode character.

In MHonarc/ line 153 there's a switch to check
whether perl is new enough to support UTF-8. If it isn't,
then manual processing of UTF-8 characters takes place.
Forcing the "non-UTF-8-aware perl" branch of the "if"
statement (that is, changing "if ($] >= 5.006)" to
"if (0)") repairs the problem; in this case the output is
the expected "&#20AC;".
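
For illustration only, the kind of manual UTF-8 decoding described
above can be sketched as follows. This is not MHonArc's code: it is
a Python sketch, the names utf8_code_points and to_entities are
hypothetical, and it emits the standard HTML hexadecimal form
("&#x20AC;") rather than the abbreviated notation used in this
report. It does not reject overlong encodings and does not escape
HTML metacharacters.

```python
def utf8_code_points(data):
    """Decode UTF-8 bytes by hand into a list of Unicode code points."""
    points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                  # 0xxxxxxx: single byte (ASCII)
            length, value = 1, lead
        elif lead >> 5 == 0b110:         # 110xxxxx: two-byte sequence
            length, value = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:        # 1110xxxx: three-byte sequence
            length, value = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:       # 11110xxx: four-byte sequence
            length, value = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        tail = data[i + 1:i + length]
        # Every continuation byte must look like 10xxxxxx.
        if len(tail) != length - 1 or any(b >> 6 != 0b10 for b in tail):
            raise ValueError("bad continuation after offset %d" % i)
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        points.append(value)
        i += length                      # consume the whole sequence
    return points


def to_entities(data):
    """Keep ASCII as-is; turn every other code point into a hex entity."""
    return "".join(
        chr(cp) if cp < 0x80 else "&#x%X;" % cp
        for cp in utf8_code_points(data)
    )


euro = bytes([0xE2, 0x82, 0xAC])
print(to_entities(b"asdf" + euro + b"jkl;"))  # asdf&#x20AC;jkl;
```

The point of the sketch is that all three bytes of the sequence
contribute to one code point; none of them is handled on its own.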

I don't think it matters, but I have LANG=hu_HU (latin2
locale) and no other LC_* variables set. However, UTF-8
locales are also available on my system.

Follow-up Comments

Date: Fri 12/03/2004 at 20:41       By: Earl Hood <ehood>
Fix checked into CVS.

Date: Fri 12/03/2004 at 20:11       By: 0 <None>
Yes, your patch is definitely nicer than mine. I told you
I'm a beginner in perl :-)
Thanks for the fix!

Date: Fri 12/03/2004 at 18:42       By: Earl Hood <ehood>
The sample patch provided is not applicable for 5.6.x since
the Encode module is only available for 5.8.x and later.

After some searching, it appears that adding the "U0"
specifier to unpack makes things work.  I do not fully
understand why unpack requires this to get things to work,
but it appears to fix the problem.

I've attached a patch to this report.  It will be checked
into CVS after I can resolve some connectivity problems
with the CVS server.

Date: Thu 12/02/2004 at 19:28       By: 0 <None>
A sample patch that fixes the problem follows. It's just
a case study to show what the problem is; depending on the
Encode module may not be nice, and I have no idea whether
it's supported in older perls. (Note that I'm an absolute
beginner in perl.)

The problem is that when unpack is executed in line 159
(according to the original 2.6.10 source), its
parameter ($1) is just a sequence of bytes and perl has
no idea that it should be interpreted as UTF-8. Hence I
guess it interprets it according to latin1, and that's why
unpack doesn't do what we need. Before using unpack we
have to tell perl "hey, that's a UTF-8 string".
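
The general difference between treating a byte sequence as latin1
and treating it as UTF-8 can be shown with a short Python sketch.
It is only an analogy for the guess above; it says nothing about
perl's internals or about either patch.

```python
raw = bytes([0xE2, 0x82, 0xAC])    # the Euro sign as UTF-8 bytes

as_latin1 = raw.decode("latin-1")  # every byte becomes its own character
as_utf8 = raw.decode("utf-8")      # the three bytes form one character

print([hex(ord(c)) for c in as_latin1])  # ['0xe2', '0x82', '0xac']
print([hex(ord(c)) for c in as_utf8])    # ['0x20ac']
```

Under the latin1 reading the first character is 0xE2, which matches
the first byte seen in the faulty output.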

Date: Thu 12/02/2004 at 00:58       By: Egmont Koblinger <egmont>
I attach a test case. This doesn't happen for just one
particular message but for every message I write
with mutt using UTF-8 encoding, so it's not a problem to
generate a publicly visible test case.
Both the subject and the body contain the following string:
"asdf" then "e acute" (in both latin1 and latin2) then "e grave"
(only in latin1) then "o doubleacute" (only in latin2) then a
euro sign (in neither latin1 nor latin2) followed by "jkl;".
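
A small Python sketch can rebuild that string and check the charset
claims. It rests on assumptions not stated in the report: that
"latin1" and "latin2" mean ISO-8859-1 and ISO-8859-2, and that the
letters are the lowercase code points U+00E9, U+00E8 and U+0151.
The helper name encodable is hypothetical.

```python
# "asdf", e acute, e grave, o doubleacute, euro sign, "jkl;"
test_string = "asdf\u00e9\u00e8\u0151\u20acjkl;"


def encodable(ch, codec):
    """Return True if the character exists in the given 8-bit charset."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for ch in "\u00e9\u00e8\u0151\u20ac":
    print(
        "U+%04X" % ord(ch),
        "latin1:", encodable(ch, "iso8859_1"),
        "latin2:", encodable(ch, "iso8859_2"),
        "utf-8:", ch.encode("utf-8").hex(),
    )
```

Each of the four non-ASCII characters takes two or three bytes in
UTF-8, so the string exercises multi-byte sequences of both lengths.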

The input directory contains the message. The output-actual
directory was generated with mainstream mhonarc 2.6.10 using
"mhonarc -outdir output-actual input". Similarly,
output-expected was generated with mhonarc patched as
described above. All of this is packed into a single tarball.

Date: Thu 12/02/2004 at 00:36       By: Earl Hood <ehood>
Can the submitter please zip up a sample message and send it to the author's
address for evaluation?  Or you can attach the bundle to this bug report if it
is okay that the email message is readable by the public.

Please also provide samples of the correct and incorrect conversion of the message.

File Attachments

Date: Fri 12/03/2004 at 18:42  Name: mhonarc-utf8-CharEnt.patch  Size: 346B   
By: ehood
UTF-8 to entity ref patch that works for Perl 5.6.x and 5.8.x;item_file_id=1938

Date: Thu 12/02/2004 at 19:28  Name: mhonarc-utf8.patch  Size: 516B   By: None
sample fix;item_file_id=1936

Date: Thu 12/02/2004 at 00:58  Name: mhonarc-utf8.tar.gz  Size: 2.65KB   By: 

  Message sent via/by Savannah