nmh-workers

Re: mhfixmsg character set conversion

2022-02-04 07:22:53
Steven wrote:

I routinely use mhfixmsg to clean up incoming messages, using this command
in a shell script invoked through procmail:

   mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 \
            -reformat -fixcte -fixboundary -noreplacetextplain \
            -fixtype application/octet-stream -noverbose -file - \
            -outfile $destination < $source

original message:

   Veuillez ne pas r=E9

This should decode to the following (represented in UTF-8):

   Veuillez ne pas ré

...but mhfixmsg turns that into

   Veuillez ne pas rÃ©

(I truncated the examples to focus on the first errant conversion, see below.)

My questions are then:

1) Is this a bug in mhfixmsg, or am I just using it incorrectly?

2) If the former, is there further information I can supply to help track
   this down, or further tests I can conduct on the message in question?

3) ...or if the latter, what am I doing wrong, and what should I be doing
   instead?

Good questions, and thank you for your detailed report.

Looking at the first 8-bit character in the excerpt, E9 in iso8859-1,
that should have been converted to C3A9 in UTF-8. iconv correctly does
that:

$ printf '\xE9' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
00000000  c3 a9                                             |..|

Instead, it got converted to C383C2A9.  I'm not sure why.  I expect
that your environment is close enough to:

$ iconv --version
iconv (GNU libc) 2.34

$ locale
LANG=en_CA.utf8
LC_CTYPE="en_CA.utf8"
LC_NUMERIC="en_CA.utf8"
LC_TIME="en_CA.utf8"
LC_COLLATE="en_CA.utf8"
LC_MONETARY="en_CA.utf8"
LC_MESSAGES="en_CA.utf8"
LC_PAPER="en_CA.utf8"
LC_NAME="en_CA.utf8"
LC_ADDRESS="en_CA.utf8"
LC_TELEPHONE="en_CA.utf8"
LC_MEASUREMENT="en_CA.utf8"
LC_IDENTIFICATION="en_CA.utf8"
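One possible, unconfirmed explanation for a C383C2A9 sequence is that the text was converted twice, with bytes that were already UTF-8 being treated as ISO-8859-1 again. The sketch below only shows that a second iconv pass produces those bytes; it does not show that mhfixmsg does this. The helper names (to_hex, convert_once, convert_twice) are made up for illustration, and the octal escape \351 stands for the byte E9:

```shell
# Hypothetical helper: print stdin as a lowercase hex string without spaces.
to_hex() {
    od -An -tx1 | tr -d ' \n'
}

# Single conversion: ISO-8859-1 E9 becomes UTF-8 C3 A9.
convert_once() {
    printf '\351' | iconv -f iso-8859-1 -t utf-8 | to_hex
}

# Double conversion: the UTF-8 result is wrongly treated as ISO-8859-1
# and converted to UTF-8 a second time, giving C3 83 C2 A9.
convert_twice() {
    printf '\351' | iconv -f iso-8859-1 -t utf-8 |
        iconv -f iso-8859-1 -t utf-8 | to_hex
}

convert_once; echo     # prints c3a9
convert_twice; echo    # prints c383c2a9
```

If that is what happened, the thing to look for is a step that hands already-converted text back to a charset conversion.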

With this small example:

$ cat 3
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="mime-boundary"
Content-Transfer-Encoding: 8bit

--mime-boundary
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=iso-8859-1

=E9

--mime-boundary
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=iso-8859-1

&#233;

--mime-boundary--

I see correct conversion of the quoted-printable E9 to UTF-8 C3A9:

$ mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 \
    -reformat -fixcte -fixboundary -noreplacetextplain \
    -fixtype application/octet-stream -noverbose -file - -out - < 3 |
    hexdump -C | egrep a9
000000c0  65 74 3d 22 55 54 46 2d  38 22 0a 0a c3 a9 0a 0a  |et="UTF-8"......|
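As a further test on the real message, it might help to check whether a given output file contains the doubly converted byte sequence at all. This is a minimal sketch that assumes a POSIX shell with grep and mktemp; the function name has_double_converted_e9 and the sample file are hypothetical stand-ins, not part of nmh:

```shell
# Hypothetical helper: succeed if the file contains the bytes C3 83 C2 A9,
# which is what a doubly converted ISO-8859-1 E9 looks like.
has_double_converted_e9() {
    LC_ALL=C grep -q "$(printf '\303\203\302\251')" "$1"
}

# Example usage with a throwaway file standing in for mhfixmsg output.
sample=$(mktemp)
printf 'Veuillez ne pas r\303\203\302\251\n' > "$sample"
if has_double_converted_e9 "$sample"; then
    echo "double conversion suspected"
fi
rm -f "$sample"
```

Running the same check on the message before mhfixmsg sees it, and on the output afterwards, could show at which step the extra conversion appears.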

Does adding -verbose to your mhfixmsg invocation provide any clues?
With the small example above, it reports:

mhfixmsg: /tmp/mhfixmsgUgtVK1 part 2, decode text/plain; charset=iso-8859-1
mhfixmsg: /tmp/mhfixmsgUgtVK1 part 1, decode text/html; charset=iso-8859-1
mhfixmsg: /tmp/mhfixmsgUgtVK1 part 2, convert iso-8859-1 to UTF-8

David