
Re: [ietf-822] utf8 messages

2014-08-13 23:19:27
Let me try one more time, since something isn't making it through.

I have three messages.  One message has an entirely 7bit header with an
RFC 2047 encoded subject.  Another message is a 6532 message, with the
subject in utf8.  A third message has a cp-1250 8bit subject.  There are two
8bit bytes in the subject in both of the last two messages, and in the cp1250
case, those two bytes happen to also be a valid utf8 character.
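
To make the ambiguity concrete, here is a minimal sketch. The byte pair
0xC5 0xA1 is an assumption chosen for illustration, not taken from the
actual messages: it decodes cleanly under both charsets but yields
different text.

```python
# Hypothetical byte pair, chosen only to illustrate the overlap.
raw = b"\xc5\xa1"

# Read as utf-8, the two bytes form a single character (U+0161).
as_utf8 = raw.decode("utf-8")

# Read as cp1250, the same two bytes are two separate characters
# (U+0139 followed by U+02C7).
as_cp1250 = raw.decode("cp1250")

print(len(as_utf8), [hex(ord(c)) for c in as_utf8])
print(len(as_cp1250), [hex(ord(c)) for c in as_cp1250])
```

Both decodes succeed, so validity checking alone cannot say which charset
the sender meant.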

We want to be able to parse all three of those and do so correctly.  We
know the third type is technically invalid, but we see millions of such
messages every day, and dropping all of those would be a disservice to our
users.  We currently see far more such messages than we do 6532
messages... though in practice, the most common charset now is utf-8, so I
guess those are now the same as 6532 messages that have leaked.

I thought I understood the problem you were attempting to solve, but now I'm
totally confused, because this seems to have nothing to do with additional
labeling of legitimate EAI messages at all.

You say you have to deal with invalid messages with 8bit in headers. You say
that there's a trend towards these using utf-8 rather than some other charset.
You say that EAI messages are in the distinct minority. And finally, you say
there are issues with your heuristics misidentifying the charset.

Given that EAI messages are currently in the minority, your first order of
business clearly needs to be work on those heuristics. Beyond that, it seems to
me that your focus needs to be on calling out the details of the nonstandard
stuff you're doing, rather than creating openings for silly states with the
standard stuff.

More specifically, when you receive an invalid message that needs or has
undergone heuristic processing, why not just label it as such? This way
there's a clear indicator that the message has issues and that there
may be problems interpreting it.

This label is actually orthogonal to marking the message as an EAI message.
If your heuristics say, with high probability, that this is an EAI
message, then you probably want to set the EAI bit so that other things
will treat it as such. But the additional label tells you that the EAI
label came about implicitly rather than explicitly.
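
As a rough sketch of what I mean by two independent labels: the function
name, the `declared_eai` parameter, and the rule "valid utf-8 means probably
EAI" are all assumptions made up for illustration. Your real heuristics are
surely more involved.

```python
def label_header(raw: bytes, declared_eai: bool = False) -> dict:
    """Return two independent labels for one raw header value.

    'eai'       -> treat the header as an EAI (utf-8) header
    'heuristic' -> the interpretation was guessed, not declared

    declared_eai is a hypothetical flag meaning the message arrived
    explicitly marked as EAI.
    """
    if raw.isascii():
        # Pure 7bit header (possibly 2047 encoded): nothing to guess.
        return {"eai": False, "heuristic": False}

    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False

    if not valid_utf8:
        # Invalid 8bit in some other charset: needs heuristic handling.
        return {"eai": False, "heuristic": True}

    # Valid utf-8: EAI either way, but record whether that was
    # declared (explicit) or inferred (implicit).
    return {"eai": True, "heuristic": not declared_eai}


print(label_header(b"plain subject"))
print(label_header(b"\xc5\xa1", declared_eai=True))
print(label_header(b"\xc5\xa1"))
print(label_header(b"\xe9t\xe9"))
```

The point is that the second and third calls both set the EAI bit, and only
the extra label distinguishes the explicit case from the guessed one.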

I also note that the existing stock of messages containing invalid 8bit in
the headers are, by definition, not EAI messages. And you can check this by
looking at the timestamps in the message, message metadata, or both. So the
lack of these labels on old messages is a nonissue.

You can also use the label to write down the heuristics you have applied or
will apply, or whatever other contextual details exist that aren't stored
anywhere else and that may assist in the handling of the message.

What am I missing here?

                                Ned

_______________________________________________
ietf-822 mailing list
ietf-822(_at_)ietf(_dot_)org
https://www.ietf.org/mailman/listinfo/ietf-822
