Re: [Nmh-workers] What should nmh do with busted Subject: lines?

2015-11-04 18:08:50
> So I got an e-mail from an Outlook abuser that had some UTF-8 smart
> quote characters in the Subject: line - sans RFC2047 encoding, just
> bare UTF-8 characters, naked as the day they were typed, plonked in the
> middle of the line.
>
> What *should* nmh do here (given that we don't have a way to tell it
> was UTF-8 versus an ISO8859-N or 2022 or what-have-you)?

Technically ... those are legal nowadays.  See RFC 6532.  That's a
message/global message.

What should we do?  We should deal with it.  I think we might not do so
well right now.  Okay, fine, what does 'deal with it' mean?  Well ...
technically the only valid 'raw' 8-bit characters in headers are UTF-8.
But I am aware that some busticated MUAs still send raw 8-bit data in
other character sets.

I see two possible sets of ways to deal with it better:

1) Assume any unencoded 8-bit characters in email headers are UTF-8.  Treat
   them as UTF-8, which means converting to the local character set if
   necessary.  If it turns out those bytes are not UTF-8, then either they'll
   fail character conversion or end up as mojibake on a user's terminal
   (well, they'll probably end up as the Unicode replacement character,
   U+FFFD).

2) Do 1), except check first to see if all of the 8-bit sequences are
   valid UTF-8 (it's possible for an arbitrary sequence of 8-bit bytes
   in some other character set to happen to be valid UTF-8, but very
   unlikely).  If it's all valid, proceed as in 1).  Otherwise use
   substitution characters for every 8-bit byte.
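
To make option 2) concrete, here is a rough sketch in C of what the check
could look like.  This is not nmh code: the function names (utf8_seq_len,
is_valid_utf8, sanitize_header) are made up for illustration, and using '?'
as the substitution character is just an assumption.  The validity rules
follow the usual well-formed UTF-8 byte ranges (no overlong forms, no
surrogates, nothing above U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the length (1-4) of the well-formed UTF-8
 * sequence starting at s, or 0 if the bytes there do not form one.
 * n is the number of bytes available. */
size_t
utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for 2nd byte */
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                         /* plain ASCII */

    if (s[0] >= 0xC2 && s[0] <= 0xDF) {   /* 0xC0/0xC1 would be overlong */
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;      /* reject overlong forms */
        if (s[0] == 0xED) hi = 0x9F;      /* reject UTF-16 surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;      /* reject overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;      /* reject > U+10FFFF */
    } else {
        return 0;                         /* stray continuation or bad lead */
    }

    if (n < len)
        return 0;                         /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return len;
}

/* Return 1 if the n bytes at s are entirely well-formed UTF-8, else 0. */
int
is_valid_utf8(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *) s;

    while (n > 0) {
        size_t len = utf8_seq_len(p, n);
        if (len == 0)
            return 0;
        p += len;
        n -= len;
    }
    return 1;
}

/* Option 2): if the header value is valid UTF-8, leave it alone and
 * return 1 so the caller can treat it as UTF-8 (option 1).  Otherwise
 * overwrite every 8-bit byte with '?' (assumed substitution character)
 * and return 0. */
int
sanitize_header(char *value)
{
    size_t i, n;

    assert(value != NULL);
    n = strlen(value);
    if (is_valid_utf8(value, n))
        return 1;
    for (i = 0; i < n; i++)
        if ((unsigned char) value[i] & 0x80)
            value[i] = '?';
    return 0;
}
```

With something like this, a Subject: containing raw UTF-8 smart quotes
comes back untouched with a return of 1, and the caller goes on to do the
local character set conversion from 1).  The same subject sent as raw
Windows-1252 fails the check, so its 8-bit bytes are replaced and the
ASCII part of the line stays readable.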

--Ken