Steven wrote:
I routinely use mhfixmsg to clean up incoming messages, using this command
in a shell script invoked through procmail:
mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 \
-reformat -fixcte -fixboundary -noreplacetextplain \
-fixtype application/octet-stream -noverbose -file - \
-outfile $destination < $source
original message:
Veuillez ne pas r=E9
This should decode to the following (represented in UTF-8):
Veuillez ne pas ré
...but mhfixmsg turns that into
Veuillez ne pas rÃ©
(I truncated the examples to focus on the first errant conversion; see below.)
My questions are then:
1) Is this a bug in mhfixmsg, or am I just using it incorrectly?
2) If the former, is there further information I can supply to help track
this down, or further tests I can conduct on the message in question?
3) ...or if the latter, what am I doing wrong, and what should I be doing
instead?
Good questions, and thank you for your detailed report.
The first 8-bit character in the excerpt, E9 in ISO-8859-1, should
have been converted to C3A9 in UTF-8. iconv does that correctly:
$ printf '\xE9' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
00000000 c3 a9 |..|
Instead, it got converted to C383C2A9. I'm not sure why. I expect
that your environment is close enough to:
$ iconv --version
iconv (GNU libc) 2.34
$ locale
LANG=en_CA.utf8
LC_CTYPE="en_CA.utf8"
LC_NUMERIC="en_CA.utf8"
LC_TIME="en_CA.utf8"
LC_COLLATE="en_CA.utf8"
LC_MONETARY="en_CA.utf8"
LC_MESSAGES="en_CA.utf8"
LC_PAPER="en_CA.utf8"
LC_NAME="en_CA.utf8"
LC_ADDRESS="en_CA.utf8"
LC_TELEPHONE="en_CA.utf8"
LC_MEASUREMENT="en_CA.utf8"
LC_IDENTIFICATION="en_CA.utf8"
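One observation about the C383C2A9 bytes mentioned above, offered as a hypothesis rather than a diagnosis: they are what results when the ISO-8859-1 to UTF-8 conversion is applied twice, that is, when text that is already UTF-8 gets treated as ISO-8859-1 and converted again. The sketch below reproduces that with iconv alone. The hex helper is a hypothetical convenience function, not part of nmh, and nothing here shows that mhfixmsg itself converts twice.

```shell
# hex: hypothetical helper that prints stdin as space-separated hex bytes.
hex() { od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'; }

# '\351' is octal for byte E9 (e-acute in ISO-8859-1).
# One conversion gives the expected UTF-8 encoding.
once=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | hex)

# Feeding that result through the same conversion again re-encodes
# each of the two UTF-8 bytes as if it were an ISO-8859-1 character.
twice=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 \
        | iconv -f ISO-8859-1 -t UTF-8 | hex)

echo "once:  $once"    # c3 a9
echo "twice: $twice"   # c3 83 c2 a9
```

If the second result matches what ends up in the output file, it may be worth checking whether something else in the delivery pipeline converts the message before or after mhfixmsg runs.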
With this small example:
$ cat 3
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="mime-boundary"
Content-Transfer-Encoding: 8bit
--mime-boundary
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=iso-8859-1
=E9
--mime-boundary
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=iso-8859-1
é
--mime-boundary--
I see correct conversion of the quoted-printable E9 to UTF-8 C3A9:
$ mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 \
-reformat -fixcte -fixboundary -noreplacetextplain \
-fixtype application/octet-stream -noverbose -file - -out - < 3 | \
hexdump -C | egrep a9
000000c0 65 74 3d 22 55 54 46 2d 38 22 0a 0a c3 a9 0a 0a |et="UTF-8"......|
Does adding -verbose to your mhfixmsg invocation provide any clues?
With the small example above, I get:
mhfixmsg: /tmp/mhfixmsgUgtVK1 part 2, decode text/plain; charset=iso-8859-1
mhfixmsg: /tmp/mhfixmsgUgtVK1 part 1, decode text/html; charset=iso-8859-1
mhfixmsg: /tmp/mhfixmsgUgtVK1 part 2, convert iso-8859-1 to UTF-8
David