Re: [ietf-822] utf8 messages

I haven’t been very involved with EAI, so please indulge my attempt to
restate the problem to see if I understand it correctly: a 6532 message is
*implicitly* of type message/global, whereas a non-6532 message is implicitly
of type message/rfc822.  That implicit state is inferred from the
presence or absence of 8-bit characters.


With the proviso that those characters have to be valid utf-8 and have
to appear in specific parts of a header.

It's also the case that the SMTPUTF8 marker provided by EAI is not a reliable
indicator of an RFC 6532 message: It may be set for a message/rfc822 message
that happens to have had EAI addresses in the envelope at some point, and it
may be clear for a random 8bit message that happens to meet RFC 6532 criteria.
Indeed, the SMTPUTF8 transport flag's purpose is to try and keep EAI material
from getting into places where it won't be understood, not to identify RFC 6532
messages.

It's even possible for nonconformant messages containing non-utf-8 8bit to
arrive with the SMTPUTF8 bit set. For example, someone might forward such a
message to someone with an EAI address.

Brandon and others have problems because there are non-conformant messages
with 8-bit characters that are not 6532 messages, most often because they
use Windows-1250 instead of UTF-8.  Is that a correct restatement?

How does Google (or anyone else) tell that a message is cp-1250 instead of
UTF-8?  Can we specify a clear algorithm for detection?


I can't speak to how Google does it, but the way we handle it is to apply a set
of heuristics. Obvious ones include a check for utf-8 syntax validity (the
larger the amount of 8bit text, the less likely it is to meet utf-8 syntax
rules and yet turn out to be something else). Note that isolated 8bit characters
like an accented "e" never qualify as utf-8. Checks for certain escape sequences
will, if the sequences are found, identify various iso-2022-? variants with high
reliability. And so on.
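
As a rough illustration of the kind of heuristics described above (not the
code any particular product runs), here is a minimal Python sketch. The
function name, the partial escape-sequence table, the order of the checks, and
the windows-1250 fallback are all assumptions made for the example; a real
implementation would use many more signals and would need tuning as the
message population shifts.

```python
# Hypothetical sketch: names, check order, and fallback are assumptions.

# A partial table of escape sequences that introduce ISO-2022 character sets.
ISO2022_ESCAPES = {
    b"\x1b$B": "iso-2022-jp",   # JIS X 0208-1983
    b"\x1b$@": "iso-2022-jp",   # JIS X 0208-1978
    b"\x1b(J": "iso-2022-jp",   # JIS X 0201 Roman
    b"\x1b$)C": "iso-2022-kr",  # KS X 1001
}


def guess_header_charset(raw: bytes, fallback: str = "windows-1250") -> str:
    """Guess the charset of raw header bytes using simple heuristics."""
    # ISO-2022 variants are 7-bit, so look for their escape sequences first.
    for escape, charset in ISO2022_ESCAPES.items():
        if escape in raw:
            return charset
    # Pure 7-bit data without such escapes needs no guessing.
    if raw.isascii():
        return "us-ascii"
    # Strict decoding rejects isolated 8bit bytes, overlong forms, and
    # truncated sequences, so long 8bit text rarely passes by accident.
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed default for this sketch; pick whatever legacy charset
        # dominates the local message population.
        return fallback
```

An isolated accented "e" encoded as a single 8bit byte fails the strict utf-8
decode and falls through to the fallback, which matches the observation above
that such characters never qualify as utf-8.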

But I doubt there's a One True Algorithm for any of this. Moreover, the
heuristics require adjustment as the message population shifts. (What we're
seeing is that utf-8 is slowly taking over.)

If so, how about automatic translation to UTF-8?


We offer that, but I understand why people are reluctant: It loses information.

Failing that, how about encapsulating the whole thing in a new content-type
(perhaps “message/windows-still-sucks”)?  I’d rather come up with a
recommended best practice for handling the non-conformant messages than try to
get all the existing conformant implementations to add something new.


That's a variant of what I'm suggesting: Instead of marking the good messages,
mark the bad ones.

On the other hand, I think Brandon was merely suggesting a new header field
that would serve as a way for conformant messages to bypass the heuristics.  I
would be supportive of such a heuristic if it were better defined than
MIME-Version was.  But even so, we’d still need the best practice guidelines
for handling non-UTF-8 8-bit messages, so my inclination would be to start
there. — Nathaniel


Good point. We recently wrote a document that discussed how to handle invalid
addresses using heuristics. A document discussing how to handle invalid 8bit
would certainly be possible, and by getting the various issues documented
it's possible best practices would emerge.

The real question is whether there's the interest and energy to do it. Speaking
personally, having just agreed to cochair DMARC, I'm essentially booked.

                                Ned

_______________________________________________
ietf-822 mailing list
ietf-822(_at_)ietf(_dot_)org
https://www.ietf.org/mailman/listinfo/ietf-822