Re: Message-IDs - Another Fine Mess

3. Now I have just found another feature/bug.


I can't speak to news, but this is an issue that email software has had to
deal with for almost two decades now.

Consider the following three msg-ids, all syntactically correct in RFC 2822:

A.   <Joe_Doe@[127.0.0.1]>
B.   <"Joe_Doe"@[127.0.0.1]>
C.   <"Joe\_Doe"@[127\.0\.0\.1]>

Question. Are those three semantically the same in RFC 2822?


Yes they are.

Read 3.2.5:

   Semantically, neither the optional CFWS outside of the quote
   characters nor the quote characters themselves are part of the
   quoted-string; the quoted-string is what is contained between the two
   quote characters.

And that clearly makes A and B semantically equivalent (well, you
_might_ just argue that the syntax of msg-id does not actually mention
quoted-string, but that is sophistry).

And now read 3.2.2:

   Where any quoted-pair appears, it is to be interpreted as the text
   character alone.  That is to say, the "\" character that appears as
   part of a quoted-pair is semantically "invisible".

And that clearly makes B and C semantically equivalent.


Yes, so A, B, and C are all semantically equivalent. The clear implication,
then, is that normalization is necessary if you want to perform proper
semantic comparisons.
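
As a rough illustration of that normalization step, here is a minimal Python
sketch. It assumes the msg-id has already been extracted and unfolded, handles
only the quoted-string and quoted-pair rules from 3.2.5 and 3.2.2 cited above,
and does no further validation. The name msg_id_key is made up for this
example; it is not taken from any existing mail library.

```python
import re

# A quoted-pair is a backslash followed by one character (RFC 2822, 3.2.2).
_QUOTED_PAIR = re.compile(r"\\(.)", re.DOTALL)


def _unescape(text):
    """Drop the semantically invisible backslash of each quoted-pair."""
    return _QUOTED_PAIR.sub(r"\1", text)


def msg_id_key(msg_id):
    """Return a (left, right) tuple usable as a comparison key.

    Hypothetical helper: only quoting is normalized, nothing else.
    """
    s = msg_id.strip()
    if not (s.startswith("<") and s.endswith(">")):
        raise ValueError("msg-id must be enclosed in angle brackets")
    inner = s[1:-1]

    if inner.startswith('"'):
        # Quoted left side: scan to the closing quote, skipping quoted-pairs.
        i = 1
        while i < len(inner) and inner[i] != '"':
            i += 2 if inner[i] == "\\" else 1
        if i >= len(inner) or inner[i + 1:i + 2] != "@":
            raise ValueError("malformed quoted left side")
        # The quote characters themselves are not part of the value (3.2.5).
        left = _unescape(inner[1:i])
        right = inner[i + 2:]
    else:
        left, sep, right = inner.partition("@")
        if not sep:
            raise ValueError("msg-id has no '@'")

    if right.startswith("[") and right.endswith("]"):
        # Keep the brackets so a literal stays distinct from a domain name,
        # but remove quoted-pair backslashes inside them.
        right = "[" + _unescape(right[1:-1]) + "]"
    return (left, right)


a = msg_id_key('<Joe_Doe@[127.0.0.1]>')
b = msg_id_key('<"Joe_Doe"@[127.0.0.1]>')
c = msg_id_key(r'<"Joe\_Doe"@[127\.0\.0\.1]>')
```

With this, a, b, and c all compare equal, which is the equivalence argued for
above. Whitespace and folding are deliberately left out of the sketch.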

Now I suspect this is a Bad Thing even in Email (though I am not sure
that any of the Email Standards makes any official use of the msg-id).


I fail to see what's Bad about it. Sure, normalization is a pain, but the clear
trend is to do more and more of it, not less. Normalization forms for Unicode
are such a joy...

But in Netnews it would lead to GROSS interoperability problems.

So there is the problem. First of all, could the ietf-822 people please
confirm that the problem is genuine, even in Email (or else explain why
it isn't)?


It is only a problem in the sense that there's an operation you need to perform
before you can compare id strings. Code I've written to do threading and such
as far back as the mid 80's does this. Perhaps I've been completely dense all
this time or something and I've done a bit more work than was necessary, but
this has always struck me as an obvious part of both address and message id
handling.

The only time I've seen problems in this area is with multiple spaces and
folding. The rules for how to handle such foldings aren't obvious, and even if
you follow them there's enough variability in the field that it is impossible
to recover from the actions of other agents. Of course anyone who puts
multiple spaces in an address or an id and expects them to work is playing
with fire, but some people seem to like getting burned.

The rest of this message is concerned with how it might be fixed in
Usefor (RFC 2822 now being cast in concrete). The ietf-822 people may
stop reading now, but are welcome to continue and comment if they wish
:-) .

Note first of all that the two bits of semantics quoted above from
RFC 2822 apply also within Usefor. That would have been true in any
case but, for the removal of all doubt, I have now explicitly written
them in, mainly because I need to rely on them for the semantics of
parameters.

I see two solutions. One is Brute Force (and involves sophistry to
boot). The other is syntactic (it just excludes all quoting that is not
strictly essential). I am not particularly impressed by either solution,
so would welcome suggestions.


Assuming you feel it's necessary to "solve" this "problem", I think a
syntactic solution is preferable.
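
For what it's worth, one possible reading of "excludes all quoting that is not
strictly essential" can be sketched as a check like the following. This is an
assumption about what such a rule might look like, not the Usefor text: quotes
count as inessential when the left side would be a valid dot-atom without
them, and a quoted-pair counts as inessential unless it escapes a character
that must be escaped in that context. The function name is hypothetical and
the input is assumed to be a well-formed, unfolded msg-id.

```python
import re

# atext and dot-atom-text per RFC 2822, 3.2.4.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM_TEXT = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")
QUOTED_PAIR = re.compile(r"\\(.)", re.DOTALL)


def has_inessential_quoting(msg_id):
    """True if the msg-id uses quotes or backslashes it could do without."""
    inner = msg_id.strip()[1:-1]  # drop the angle brackets

    if inner.startswith('"'):
        # Find the closing quote of the left side, skipping quoted-pairs.
        i = 1
        while i < len(inner) and inner[i] != '"':
            i += 2 if inner[i] == "\\" else 1
        body, right = inner[1:i], inner[i + 2:]
        # Inside quotes only '"' and '\' actually need a backslash.
        if any(ch not in '"\\' for ch in QUOTED_PAIR.findall(body)):
            return True
        # The quotes are unnecessary if the content is a plain dot-atom.
        if DOT_ATOM_TEXT.fullmatch(QUOTED_PAIR.sub(r"\1", body)):
            return True
    else:
        right = inner.partition("@")[2]

    if right.startswith("["):
        # Inside a domain literal only '[', ']' and '\' need a backslash.
        escaped = QUOTED_PAIR.findall(right[1:-1])
        return any(ch not in "[]\\" for ch in escaped)
    return False
```

Under this reading, form A of the example above passes, while B and C would be
rejected at the syntax level, so no normalization would be needed afterwards.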

                                Ned