more on Multiple classes of mail

On Wed, Apr 07, 2004 at 09:43:30AM -0700, Harry Katz wrote:
| My purpose here is to highlight that validating the From line ought to
| be our base case because that's the identity the end user principally
| sees.  Yes, list expansion is one of those special cases where the 2822
| From can't be validated.  In this case, as the Caller ID spec proposes,
| the Sender header should be examined.  Most mailing list servers add a
| Sender header identifying the list owner as the sender.  (This list
| we're on for example inserts
| "Sender: owner-ietf-mxcomp(_at_)mail(_dot_)imc(_dot_)org").

These discussions tend to escalate quickly to the bugaboo of "breaking"
things --- whether those things be alias expansion / .forwarding (with
2821 checking) or mailing lists (with 2822 checking).

I suggest that the emphasis on "breakage" is alarmist, and reflects
unnecessary polarization between good and bad.

Instead, perhaps we can borrow another concept from the postal service:
different classes of mail.  This lets us treat each class differently,
with different expectations.  I believe that most of us already do
analyze incoming email according to a complex set of rules, and adjust
expectations accordingly.

For example, if a message
  - claims to be from my mother's hotmail account
  - is addressed to me in the "To" header
  - actually originated from a hotmail server
then I know it's probably really from her to me.  (First class!)

But if a message
  - claims to be from my mother's hotmail account
  - is addressed to someone I don't know
  - did not originate from a hotmail server
then it's probably forged spam.  (Fifth class!)

Unless, of course, the message
  - claims to be from my mother's hotmail account
  - is addressed to a mailing list that we're both on
  - did not originate from a hotmail server
  - but did originate from the mailing list's server
then it's a mailing list message.  (Second class.)

I suggest that we power users are familiar enough with email to analyze
messages using these sometimes complex heuristics.  But the average
end-user is not.  Often, less technically savvy friends will show me a
message, and I'll look at the headers and figure out what's really going
on; then I'll tell them "okay, this was spam pretending to be from ebay"
or "this was not spam and it actually was from ebay".

Once we've analyzed a message, we put it into a "believability" class.
I suggest using these "believability" classes as a basis for formally
distinguishing between different classes of email.  The heuristics may
depend on the return-path, on the 2822 From, on the 2822 Sender, and on
the relationship between all three.
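
As a rough illustration of the raw inputs such heuristics would work
from, the sketch below pulls the three identities out of a message using
Python's standard email package.  It assumes the receiving MTA has
recorded the 2821 return-path in a Return-Path header; the function
names and the sample addresses are made up for illustration.

```python
from email import message_from_string
from email.utils import parseaddr


def identities(raw_message):
    """Return the lower-cased return-path, From and Sender addresses.

    Assumption: the receiving MTA recorded the 2821 return-path in a
    Return-Path header.  A missing header yields an empty string.
    """
    msg = message_from_string(raw_message)
    ids = {}
    for header in ("Return-Path", "From", "Sender"):
        # parseaddr() returns (display name, address); keep the address.
        ids[header] = parseaddr(msg.get(header, ""))[1].lower()
    # Under RFC 2822, a message without a Sender header is taken to
    # have been sent by the From address.
    if not ids["Sender"]:
        ids["Sender"] = ids["From"]
    return ids


def all_three_agree(ids):
    """True when return-path, From and Sender are present and identical."""
    values = set(ids.values())
    return len(values) == 1 and "" not in values


# Hypothetical list message: From differs from Sender and return-path.
SAMPLE = (
    "Return-Path: <owner-list@lists.example.org>\n"
    "From: Mom <mom@mail.example.com>\n"
    "Sender: owner-list@lists.example.org\n"
    "To: list@lists.example.org\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)

ids = identities(SAMPLE)
print(ids)
print("all three agree:", all_three_agree(ids))
```

Comparing addresses says nothing yet about whether any of them is
authentic; that part would come from the sender authentication checks.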

Once we've formally distinguished between classes of mail, this kind of
analysis can be done by machine.  Under today's email paradigm, machine
analysis often runs into difficulty.  Often the difficulty is because the
AI doesn't have enough data to work with.  Adding sender authentication
information to DNS can be seen as one way to give these AIs more
information.  Another way might be for the end-users to tell their MUAs
or mailsystems which mailing lists they're on, which forwarding
addresses send mail to them, and so on.

All this analysis is aimed at grading messages: the most primitive
systems sort messages into "spam" vs "not spam".  But maybe mail systems
could sort messages into "first-class" and "second-class" and "mailing
lists you agree that you're subscribed to" and "mailing lists you're not
subscribed to" and "greeting cards" and so on.

If we created such a framework, I believe most users of today's email
would make the necessary changes to operate under it.  For example, some
forwarders may not wish to rewrite the envelope or add any kind of
Resent-* header.  This new multiple-class system would say to them:
under this scheme, your mail will generally be classified as
third-class, UNLESS the final recipient has explicitly told their
mailsystem to expect mail from your servers, OR UNLESS you rewrite the
return-path, OR UNLESS you do X or Y or Z that the new paradigm
recommends.

Once these choices are made clear, sensible people can make tradeoffs
with full knowledge of the consequences.  And we can get away from
polarizing language like "but proposal X will break some traditional
behaviour Y".  Instead, a multiple-classes system simply casts a certain
kind of behaviour in a certain light.

The important thing is to ensure that if someone is uncomfortable with
being cast in that light, they should be able to do something to put
themselves in a different light.  "Doing something" might include
"buying an accreditation" or "having a good reputation with SpamHaus" or
"being in the .tm domain" --- it should be something easy for good guys
to do and hard for bad guys.

In practice, the only thing most sender domains have to do is add the
kinds of records we've already been talking about.  That adds enough
information for receivers to do the rest.  We just need to specify the
classes.  We can start by saying when a message MAY be rejected at SMTP
time.  Then we can describe how receivers MAY classify messages into
first, second, third, fourth, etc.  That provides a common vocabulary
and a shared understanding.

All end-user to end-user mail should go into the "first-class" bucket.
First-class should also include "quality" communication from ebay.com to
its users.  We should make it easy for good guys and hard for bad guys
to get first-class status.  As long as there is some kind of
accountability, detailed status can be established by other methods,
including accreditation and reputation.  Accountability can be
established by any one of the sender authentication protocols we've
discussed so far, including SPF, Caller-ID, DK, etc.  The first example
I gave above, where my mother sends mail from her hotmail account to me
directly, should be first-class.  So should a message from eBay to me.

Second-class mail might describe mailing list messages.  For instance,
maybe the 2822 "From:" does not match the return-path or "Sender" field,
but the return-path does match the "Sender" field.  Either way, some
aspect of the message has passed some degree of accountability checking,
but it doesn't fulfill all the criteria of a first-class message.  If
the recipient knows they're subscribed to the mailing list, and the
mailing list has a consistent Sender: header, then it's good enough.

Below these classes are messages where the "From:" and "Sender:" and
return-path do not match, AND the recipient does not recognize anything
about the message --- for instance, it's not a mailing list they're
subscribed to.

Regular spam can be described as a mailing list to which the end-user
did not voluntarily subscribe (and where the sender has no preexisting
relationship with the recipient, etc.).

So maybe we can call "regular spam" fifth-class mail, and worms and
viruses become sixth-class mail, and joe-jobs become seventh-class, and
so on.

It seems to me that the operational distinction between a mailing list
message and spam is the recipient's willingness to acknowledge that he's
subscribed to that mailing list, and the mailing list's ability to prove
a preexisting consensual relationship.  This is information that
end-users need to bring to the table.  For instance, whenever I
subscribe to a mailing list, I edit my .procmailrc and tell it to watch
out for a List-ID header.  (One day MUAs might do that on behalf of
end-users.)  Right now, the dominant end-user experience puts the onus
on the mail system, not the user, to figure out the difference between
mailing list messages and spam.  I think it is reasonable to start
putting a bit more of that burden on the end-users --- "what mailing
lists did you subscribe to, anyway?"

As far as the user experience is concerned, the distinction between
second-class and third-class mail is the end-user's acknowledgement that
they subscribed to the list.
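
A minimal sketch of that acknowledgement step, in Python rather than
procmail: the user keeps a set of list identifiers they have subscribed
to, and the mail system checks each message's List-ID header against
it.  The SUBSCRIBED set, the function names and the sample list
identifiers are all hypothetical.

```python
import re
from email import message_from_string

# Hypothetical: list identifiers this user says they subscribed to.
SUBSCRIBED = {"ietf-mxcomp.example.org"}


def list_id(msg):
    """Extract the identifier from a List-ID header, or '' if absent."""
    value = msg.get("List-Id", "")  # header lookup is case-insensitive
    # The identifier normally sits in angle brackets after an optional
    # description; fall back to the bare value if there are none.
    match = re.search(r"<([^<>]+)>", value)
    return (match.group(1) if match else value).strip().lower()


def acknowledged_list(msg, subscribed=SUBSCRIBED):
    """True if the message carries a List-ID the user has acknowledged."""
    lid = list_id(msg)
    return bool(lid) and lid in subscribed


known = message_from_string(
    "List-ID: MXCOMP discussion <ietf-mxcomp.example.org>\n\nbody\n"
)
unknown = message_from_string(
    "List-ID: Great Deals <deals.bulk.example.net>\n\nbody\n"
)

print(acknowledged_list(known))
print(acknowledged_list(unknown))
```

A List-ID header is trivial to forge, so a check like this only makes
sense in combination with the return-path and Sender checks.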

This is just a very rough outline.  I'm not committed to any of the
specifics.  Maybe the heuristics I use for believability are different
from your heuristics.  But maybe we can agree on a rough system for
classification that everybody could be happy with.  The following
algorithms might be one way to start:

IF: 2822 "From" == 2822 "Sender" == 2821 return-path
 && (SPF || Caller-ID) pass
 && (To: || CC:) == recipient address
 && reputation system recognizes the sender (eg. whitelisting)
THEN: it's first class.

IF: (SMIME || PGP) signed
 && 2822 "From" matches the signature
 && reputation system recognizes the sender (eg. whitelisting)
THEN: it's first class.

IF: 2822 "Sender" == 2821 return-path
 && (SPF || Caller-ID) pass
 && List-ID shows up
 && recipient acknowledges they're subscribed to the mailing list
THEN: it's a first-class mailing list.

IF: 2822 "Sender" == 2821 return-path
 && (SPF || Caller-ID) status unknown
 && List-ID shows up
 && recipient acknowledges they're subscribed to the mailing list
THEN: it's a second-class mailing list.

IF: 2822 "From" != 2821 return-path
 && "To:" and "CC:" do not match a known recipient address
 && message appears to be a mailing list
 && recipient does not recognize the mailing list
THEN: it's third class, possibly spam.

IF: the reputation system does not like the sender
THEN: it's fourth class, probably spam.

etc etc.
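
To make the rules above concrete, here is one way they might be
sketched in Python.  This is only an illustration under several
assumptions: the checks (SPF/Caller-ID result, signature verification,
reputation lookup, subscription status) are taken as already computed
and passed in as plain fields, all field names are hypothetical, the
rules are tried in the order written with the first match winning, and
the "unclassified" fallback is not one of the rules above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Facts:
    """Pre-computed facts about one message (field names hypothetical)."""
    from_addr: str
    sender: str
    return_path: str
    auth: Optional[bool] = None        # SPF/Caller-ID: True pass, False fail, None unknown
    addressed_to_recipient: bool = False   # To: or CC: matches the recipient
    signature_matches_from: bool = False   # S/MIME or PGP signature matches From
    has_list_id: bool = False
    looks_like_list: bool = False
    subscribed: bool = False           # recipient acknowledges the list
    reputation: Optional[bool] = None  # True recognized, False disliked, None unknown


def classify(f):
    """Apply the rules in the order written; the first match wins."""
    recognized = f.reputation is True
    if (f.from_addr == f.sender == f.return_path and f.auth is True
            and f.addressed_to_recipient and recognized):
        return "first class"
    if f.signature_matches_from and recognized:
        return "first class"
    list_shape = (f.sender == f.return_path and f.has_list_id
                  and f.subscribed)
    if list_shape and f.auth is True:
        return "first-class mailing list"
    if list_shape and f.auth is None:
        return "second-class mailing list"
    if (f.from_addr != f.return_path and not f.addressed_to_recipient
            and f.looks_like_list and not f.subscribed):
        return "third class, possibly spam"
    if f.reputation is False:
        return "fourth class, probably spam"
    return "unclassified"


mom = Facts("mom@mail.example.com", "mom@mail.example.com",
            "mom@mail.example.com", auth=True,
            addressed_to_recipient=True, reputation=True)
list_msg = Facts("mom@mail.example.com", "owner@lists.example.org",
                 "owner@lists.example.org", has_list_id=True,
                 looks_like_list=True, subscribed=True)
unknown_list = Facts("x@a.example", "bulk@b.example", "bulk@b.example",
                     has_list_id=True, looks_like_list=True)

for facts in (mom, list_msg, unknown_list):
    print(classify(facts))
```

The ordering matters: a sender the reputation system dislikes can still
land in one of the mailing-list classes here, which may or may not be
what we want.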