nmh-workers

Re: Bug reported regarding Unicode handling in email address

2021-06-12 05:19:29
Hi Ken,

> Probably the best way to do that is using mhbuild directives.
> That is, you can today do stuff like:
>
> #<text/plain; charset=utf-8
> [... utf-8 text here ...]
> #<text/plain; charset=iso-8859-1
> [... iso-8859-1 text here ...]
> #<text/html; charset=utf-8
> [... HTML text here ...]

The input to mhbuild can be that, it's true, though a text editor might
only handle such a mixed-charset file in the C locale.  On top of that,
nmh treats a NUL byte as end of string, so e.g. charset=ucs-2le doesn't
work.  Worse than just truncating the UCS-2LE input, in this experiment
it also corrupts the parts that come before it.

    $ cat build
    #! /bin/bash

    (
        printf '%s\n' \
            'subject: Test.' \
            '' \
            'Disappears.' \
            '#<text/plain; charset=iso-8859-1' \
            $'Fiat: $ \xa3' \
            '#<text/plain; charset=ucs-2le'
        iconv -t ucs-2le <<<'† Footnote.'
    ) >draft
    sed -n l draft
    echo

    cp draft mimed
    mhbuild -list -realsize -headers -verbose mimed
    echo

    sed -n l mimed
    $
    $ ./build
    subject: Test.$
    $
    Disappears.$
    #<text/plain; charset=iso-8859-1$
    Fiat: $ \243$
    #<text/plain; charset=ucs-2le$
 ¹     \000F\000o\000o\000t\000n\000o\000t\000e\000.\000$
    \000$

     msg part  type/subtype              size description
       0       multipart/mixed             99
                 boundary="----- =_aaaaaaaaaa0"
         1     text/plain                  34
                 charset="UTF-8"
         2     text/plain                   3
                 charset="ucs-2le"

    subject: Test.$
    MIME-Version: 1.0$
    Content-Type: multipart/mixed; boundary="----- =_aaaaaaaaaa0"$
    Content-ID: <21398.1623492782.0@orac.inputplus.co.uk>$
    Content-Transfer-Encoding: 8bit$
    $
    ------- =_aaaaaaaaaa0$
    Content-Type: text/plain; charset="UTF-8"$
    Content-ID: <21398.1623492782.1@orac.inputplus.co.uk>$
    Content-Transfer-Encoding: 8bit$
    $
 ²  ain; charset=iso-8859-1$
    Fiat: $ \243$
    $
    ------- =_aaaaaaaaaa0$
    Content-Type: text/plain; charset="ucs-2le"$
    Content-ID: <21398.1623492782.2@orac.inputplus.co.uk>$
    $
 ³     $
    $
    ------- =_aaaaaaaaaa0--$
    $ 

1. sed happily displays the NUL bytes in the draft.

2. The ‘Disappears’ part in the draft has vanished.  The Fiat part
starts with the tail end of the preceding directive.  Altering the
length of the UCS-2LE part changes how far back this part erroneously
starts; I suspect a pointer subtraction gone wrong.

3. All that makes it into the UCS-2LE part is the three spaces which
represent the first three-quarters of the U+2020 dagger and its
following U+0020 space.
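
The byte arithmetic behind note 3 can be checked outside nmh.  What
follows is only a sketch, assuming a POSIX shell and an iconv that
knows ucs-2le; ucs2le_hex and nul_count are names made up here for
illustration, not nmh commands.

```shell
# Hypothetical helper: read UTF-8 on stdin, print the UCS-2LE
# encoding as space-separated hex bytes on one line.
ucs2le_hex() {
    iconv -f utf-8 -t ucs-2le |
        od -An -v -tx1 |
        tr -s ' \n' ' ' |
        sed 's/^ //; s/ $//'
}

# Hypothetical helper: count the NUL bytes arriving on stdin.
nul_count() {
    tr -cd '\000' | wc -c
}

# U+2020 DAGGER then U+0020 SPACE, given as UTF-8 octal escapes.
# UCS-2LE stores each as two bytes, low byte first: 20 20, then 20 00.
printf '\342\200\240 ' | ucs2le_hex
echo

# NUL bytes in the whole UCS-2LE part, trailing newline included.
printf '\342\200\240 Footnote.\n' | iconv -f utf-8 -t ucs-2le | nul_count
```

The first command should print "20 20 20 00": three 0x20 bytes and
then the first NUL, which matches the three spaces that survive in the
built part.  The second count shows how much of the part sits behind
NULs that a C-string reader never gets past cleanly.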

This isn't a complaint; I'm just passing on the observation, having
made the effort.

-- 
Cheers, Ralph.
