mhonarc-dev

[bugs #11187] incorrectly parsing UTF-8 encoded messages

2004-12-01 17:33:06
This mail is an automated notification from the bugs tracker
 of the project: MHonArc.

/**************************************************************************/
[bugs #11187] Latest Modifications:

Changes by: 
                Earl Hood <earl(_at_)earlhood(_dot_)com>
Date: 
                Thu 12/02/2004 at 00:36 (US/Central)

------------------ Additional Follow-up Comments ----------------------------
Can the submitter please zip up a sample message and send it to the author's address
for evaluation?  Or you can attach the bundle to this bug report if it is okay
that the email message is readable by the public.

Please also provide samples of the correct and incorrect conversion of the message.






/**************************************************************************/
[bugs #11187] Full Item Snapshot:

URL: <http://savannah.nongnu.org/bugs/?func=detailitem&item_id=11187>
Project: MHonArc
Submitted by: Egmont Koblinger
On: Thu 12/02/2004 at 00:04

Category:  Character Sets
Severity:  5 - Average
Item Group:  Incorrect Behavior
Resolution:  None
Privacy:  Public
Assigned to:  None
Status:  Open
Platform Version:  Linux
Perl Version:  5.8.5
Component Version:  2.6.10
Fixed Release:  


Summary:  incorrectly parsing UTF-8 encoded messages

Original Submission:  I use mhonarc without any configuration file, simply running
the command "mhonarc -outdir outdir indir", where "indir"
contains only one file with a single message encoded in
UTF-8. (Both the subject and the body contain UTF-8 encoded
accented letters; the subject uses quoted-printable and the
body's transfer encoding is 8-bit.)

The output HTML files are quite strange. For each UTF-8
byte sequence, only the first byte is taken into account
and it is converted to an HTML escape. For example, the
Euro sign (U+20AC, UTF-8: E2 82 AC) appears in the HTML
output as "&#E2;"; the bytes 82 and AC are skipped, and
processing continues with the next Unicode character.
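
To illustrate the expected conversion, here is a minimal Python sketch of
decoding a UTF-8 byte sequence into a code point and emitting a numeric
character reference. It is not MHonArc's code: the function names
decode_utf8_sequence and to_html_entities are hypothetical, the sketch does
not reject overlong encodings, and it writes the hexadecimal reference in
the "&#x...;" form.

```python
def decode_utf8_sequence(data, pos=0):
    """Decode one UTF-8 sequence at data[pos]; return (code_point, byte_length)."""
    lead = data[pos]
    if lead < 0x80:                 # 0xxxxxxx: plain ASCII
        return lead, 1
    if lead >> 5 == 0b110:          # 110xxxxx: 2-byte sequence
        length, code_point = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:       # 1110xxxx: 3-byte sequence
        length, code_point = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:      # 11110xxx: 4-byte sequence
        length, code_point = 4, lead & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte 0x%02X" % lead)
    tail = data[pos + 1:pos + length]
    if len(tail) != length - 1:
        raise ValueError("truncated UTF-8 sequence")
    for byte in tail:
        if byte >> 6 != 0b10:       # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte 0x%02X" % byte)
        # Each continuation byte contributes its low six bits.
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point, length


def to_html_entities(data):
    """Replace every non-ASCII character in UTF-8 bytes with &#xHHHH;."""
    out, pos = [], 0
    while pos < len(data):
        code_point, length = decode_utf8_sequence(data, pos)
        out.append(chr(code_point) if code_point < 0x80 else "&#x%X;" % code_point)
        pos += length               # skip the whole sequence, not just the lead byte
    return "".join(out)


print(to_html_entities(b"\xe2\x82\xac"))  # prints &#x20AC;
```

The whole three-byte sequence E2 82 AC is consumed as one character here,
whereas the behaviour reported above uses only the lead byte E2.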

In MHonArc/CharEnt.pm, line 153, there is a check for
whether perl is new enough to support UTF-8. If it isn't,
UTF-8 characters are processed manually.
Forcing the "non-UTF-8-aware perl" branch of the "if"
statement (that is, changing "if ($] >= 5.006)" to
"if (0)") repairs the problem: the output is then
the expected "&#20AC;".

I don't think it matters, but I have LANG=hu_HU (latin2
locale) and no other LC_* variables set. However, UTF-8
locales are also available on my system.


Follow-up Comments
------------------


-------------------------------------------------------
Date: Thu 12/02/2004 at 00:36       By: Earl Hood <ehood>
Can the submitter please zip up a sample message and send it to the author's address
for evaluation?  Or you can attach the bundle to this bug report if it is okay
that the email message is readable by the public.

Please also provide samples of the correct and incorrect conversion of the message.












For detailed info, follow this link:
<http://savannah.nongnu.org/bugs/?func=detailitem&item_id=11187>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.nongnu.org/


