nmh-workers

Re: [nmh-workers] nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

2019-02-15 08:20:12
ken wrote:
The «"» around `Blind-Carbon-Copy' should be \(lq and \(rq, or the
equivalent strings for consistency with the style used at start of the
paragraph.

So, in a mostly unrelated note ... I couldn't help noticing that Ralph
used guillemets («») in one of his messages on this thread (way to push
non-US-ASCII characters, Ralph!), and after a series of replies to his note
things devolved into classic mojibake.  And since hopefully most everyone
on this thread is an nmh user, I wanted to understand why, because really
that shouldn't have happened.


Mea Culpa.  I haven't fully worked through the bug or the fix, but
rest assured, the problem isn't with nmh.

My replies and forwarded message drafts are constructed by a script
that predates replyfilter.  It does things like add attribution ("ken
wrote:"), my .sig, and the bulk of the body with the " > " indents.
It includes the original headers if forwarding, but not when replying, 
and also adjusts the current headers based on what folder I'm in, for
things like Reply-to: and Fcc:.

I haven't done full debugging yet, but looking quickly I see that the
body content is created by:
            mhshow -form mhl.null -type text/plain -file $original_text  |
                utf_clean |
                remove_part_markers_and_quote

where $original_text is the path to the message being replied to.

The function remove_part_markers_and_quote() runs sed to get rid of
the "part markers" that mhshow emits:
    remove_part_markers_and_quote()
    {
        # delete part markers entirely if they're the whole line,
        # otherwise just remove that part of the line.
        # and because we're already running sed, add the leading ' > '
        sed -e '/^\[\*@\(\[ part .* \]\)@\*\]$/d' \
            -e 's/\[\*@\(\[ part .* \]\)@\*\]//' \
            -e 's/^/ > /'
    }

But utf_clean() is the culprit, I believe -- it's there to remove a
few really annoying binary characters that my fonts don't display
correctly.  But it does so with a fairly large and indiscriminate
hammer, completely ignoring the current encoding.
    utf_clean() 
    {
        #eliminate utf hard non-printing space:  <U+200B> or \u200B
        #also eliminate A0, which is non-breaking space in iso-8859
        sed -e 's/\xe2\x80\x8e/ /g' \
            -e 's/\xe2\x80\x8b//g' \
            -e 's/\xa0/ /g' \
            -e 's/\xc2/ /g'
    }
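
For illustration, the failure mode is easy to see at the byte level: in
UTF-8 the guillemet « is the two bytes C2 AB, so turning every C2 into a
space leaves a lone AB behind, which is not valid UTF-8.  Below is a
minimal sketch of a gentler filter.  It assumes GNU sed (for the \x
escapes) and input that is already UTF-8; the name utf_clean_safe is made
up for this example and is not part of the script above.

```shell
utf_clean_safe()
{
    # Match whole UTF-8 sequences only, never a lone lead byte or
    # continuation byte.  LC_ALL=C makes sed operate on raw bytes.
    #   e2 80 8e = U+200E (left-to-right mark)  -> space, as before
    #   e2 80 8b = U+200B (zero width space)    -> deleted
    #   c2 a0    = U+00A0 (no-break space)      -> plain space
    LC_ALL=C sed -e 's/\xe2\x80\x8e/ /g' \
                 -e 's/\xe2\x80\x8b//g' \
                 -e 's/\xc2\xa0/ /g'
}
```

Because the no-break space is matched as the pair C2 A0 rather than as
two independent bytes, other characters that happen to start with C2
(such as « and ») pass through untouched.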

I'll work on this, and also take a look at replyfilter to see if
I can't get it to do more of the heavy lifting.

paul




I went back to the raw archives (ftp://lists.gnu.org/nmh-workers/2019-02)
because the mailing list software will sometimes translate stuff into
base64 encoding when it sees non-ASCII characters.  And, well, I hate to
assign blame, but I think it's a bit unavoidable ... please, don't anyone
take this as a personal attack, I am just trying to understand how we
could do better.

Ralph's original note containing the guillemets (Message-Id
<20190214173028.F065921521@orac.inputplus.co.uk>) was text/plain, a
character set of utf-8, and encoded using quoted-printable.  The
characters were encoded properly using quoted-printable, specifically
they were listed as =C2=AB and =C2=BB.

Valdis was the first reply to that (Message-ID
<22277.1550182604@turing-police.cc.vt.edu>), and HIS email was text/plain,
character set iso-8859-1, and encoded using quoted-printable.  He quoted
Ralph's message, and the guillemets were encoded as =AB and =BB.  Which seems
correct to me.

Paul Fox replied to Valdis's note (Message-Id
<20190214221828.EC0805184393@grass.foxharp.boston.ma.us>), and THAT note
was text/plain, character set UTF-8, encoded using quoted-printable ...
but it seems like this was the start of where things went off the rails.
The original line in Valdis's email was (in raw form):

   > The =AB=22=BB around ...

But in Paul's note it ended up as (extra > added in the reply)

   > > The  =AB" =BB around 

This is NOT correct.  First, there is an extra space in front of
the encoded bytes.  Secondly, they're not valid UTF-8; they're the
ISO-8859-1 bytes.  So I am guessing whatever Paul used to quote the reply
didn't translate the ISO-8859-1 characters properly into UTF-8.

However, whatever Mark Bergman uses for email actually made an intelligent
decision.  When he replied to Paul's note, those invalid UTF-8 characters
got converted to the Unicode Replacement Character (U+FFFD), which was
sent out as =EF=BF=BD (utf-8, quoted-printable).

Further muddying the waters ... when Ralph replied to Mark's email,
those Unicode Replacement Characters somehow got converted back to
the correct guillemets (=C2=AB and =C2=BB).  Which means Ralph has
perhaps the most intelligent reply quoting program ever and he should
immediately share it as it would revolutionize AI, or he went back and
manually corrected it when he replied to Mark's note.  I'm 50/50 on
which one of those scenarios is more likely.

If anyone involved with this email thread wants to pipe up with some
more explanation on what exactly they used to compose their email
replies, I would love to hear it.  No judgements; I just want to know
how nmh could help everyone do better.  Like, do we need to include
better tools for composing reply messages?  Well, duh, the answer to
that is "yes", and I think replyfilter does ok here but obviously we
need to do better.  But if we're SENDING something that is not valid
UTF-8, should we be smarter and flag it?  People were upset when we
refused to send out 8-bit characters when your locale was US-ASCII (I
mean, REALLY?  I couldn't believe it), so I don't know what makes sense.
Sending out invalid UTF-8 just seems wrong to me.
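
As a rough illustration of what flagging could look like outside of nmh
itself: iconv exits non-zero when its input is not valid in the declared
source charset, so a UTF-8 to UTF-8 round trip works as a cheap validity
test.  This is only a sketch and not a description of nmh's behavior;
is_valid_utf8 is a hypothetical name.

```shell
is_valid_utf8()
{
    # Succeeds only if everything on stdin decodes cleanly as UTF-8.
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Properly encoded guillemets (C2 AB, C2 BB) pass ...
printf 'The \302\253"\302\273 around\n' | is_valid_utf8 && echo "valid"

# ... while bare ISO-8859-1 bytes (AB, BB) in a UTF-8 message do not.
printf 'The  \253" \273 around\n' | is_valid_utf8 || echo "invalid UTF-8"
```

A draft-checking step could run the message body through such a test
and warn before sending, rather than refusing outright.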

--Ken

-- 
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers


=----------------------
paul fox, pgf@foxharp.boston.ma.us (arlington, ma, where it's 33.6 degrees)

