So I got an e-mail from an Outlook abuser that had some UTF-8 smart
quote characters in the Subject: line - sans RFC2047 encoding, just
bare UTF-8 characters, naked as the day they were typed, plonked in the
middle of the line.
What *should* nmh do here (given that we don't have a way to tell it
was UTF-8 versus an ISO8859-N or 2022 or what-have-you)?
Technically ... those are legal nowadays. See RFC 6532. That's a
message/global message.
What should we do? We should deal with it. I think we might not do so
well right now. Okay, fine, what does 'deal with it' mean? Well ...
technically the only valid 'raw' 8-bit characters in headers are UTF-8.
But I am aware that some busticated MUAs still send raw 8-bit data in
other character sets.
I see two possible ways to deal with it better:
1) Assume any unencoded 8-bit characters in email headers are UTF-8. Treat
them as UTF-8, which means converting to the local character set if necessary.
If it turns out those bytes are not UTF-8, then either they'll fail
character conversion or end up as mojibake on a user's terminal (well,
they'll probably end up as the Unicode replacement character, U+FFFD).
2) Do 1), but first check whether all of the 8-bit sequences are
valid UTF-8 (it's possible for an arbitrary sequence of 8-bit bytes
in some other character set to happen to be valid UTF-8, but very unlikely).
If it's all valid, proceed as in 1). Otherwise substitute a replacement
character for every 8-bit byte.
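
To make 2) concrete, here is a rough sketch of what the check could look
like. This is not existing nmh code: the function names (utf8_seq_len,
is_valid_utf8, sanitize_header) are made up for illustration, and the
choice of '?' as the substitution character is just an assumption. The
validity test follows the usual well-formed UTF-8 byte ranges (no overlong
forms, no surrogates, nothing above U+10FFFF).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: length (1 to 4) of the well-formed UTF-8 sequence
 * starting at s, or 0 if the bytes there are not well-formed UTF-8.
 */
size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for 2nd byte */
    size_t n;

    if (avail == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                        /* plain ASCII */

    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        n = 2;                           /* 0xC0/0xC1 would be overlong */
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        n = 3;
        if (s[0] == 0xE0) lo = 0xA0;     /* reject overlong 3-byte forms */
        if (s[0] == 0xED) hi = 0x9F;     /* reject UTF-16 surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        n = 4;
        if (s[0] == 0xF0) lo = 0x90;     /* reject overlong 4-byte forms */
        if (s[0] == 0xF4) hi = 0x8F;     /* nothing above U+10FFFF */
    } else {
        return 0;                        /* stray continuation or bad lead */
    }

    if (avail < n)
        return 0;                        /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (size_t i = 2; i < n; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return n;
}

/* True if the whole buffer is well-formed UTF-8 (pure ASCII counts). */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        size_t n = utf8_seq_len(s + i, len - i);
        if (n == 0)
            return false;
        i += n;
    }
    return true;
}

/*
 * Option 2: if the header text is valid UTF-8, leave it alone and return
 * true (the caller then treats it as UTF-8, as in option 1).  Otherwise
 * overwrite every 8-bit byte with '?' and return false.
 */
bool sanitize_header(unsigned char *s, size_t len)
{
    if (is_valid_utf8(s, len))
        return true;
    for (size_t i = 0; i < len; i++)
        if (s[i] & 0x80)
            s[i] = '?';
    return false;
}
```

With this, a Subject: containing UTF-8 smart quotes (E2 80 9C ... E2 80 9D)
passes through untouched, while the same subject with Windows-1252 smart
quotes (bytes 93 and 94) fails the check and has those two bytes replaced.
Conversion to the local character set for the valid case would still have
to happen afterwards and is not shown here.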
--Ken
_______________________________________________
Nmh-workers mailing list
Nmh-workers(_at_)nongnu(_dot_)org
https://lists.nongnu.org/mailman/listinfo/nmh-workers