Re: After a 450, queue or try next MX?

On Wed, 30 Aug 2006, ned+ietf-smtp(_at_)mrochek(_dot_)com wrote:


We eventually settled on an intermediate strategy. Temporary failures
before the MAIL FROM cause us to retry using the next MX; failures at
or after the MAIL FROM do not. We have found that this works pretty well
overall.
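
For illustration only, the rule described above might be sketched roughly
as follows. The Stage names, the function name and the returned strings
are hypothetical, not taken from any particular MTA, and the sketch
covers only 4yz (temporary) replies.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Points in an SMTP session, in the order they occur (hypothetical names)."""
    CONNECT = 1
    GREETING = 2
    EHLO = 3
    MAIL_FROM = 4
    RCPT_TO = 5
    DATA = 6


def on_temporary_failure(stage: Stage, mx_remaining: int) -> str:
    """Decide what to do after a temporary failure at the given stage.

    Before MAIL FROM: move on to the next MX, if there is one left.
    At or after MAIL FROM: leave the message queued for a later retry.
    """
    if stage < Stage.MAIL_FROM and mx_remaining > 0:
        return "try_next_mx"
    return "queue"
```

So a 4yz reply to EHLO with other MX hosts still untried gives
"try_next_mx", while a 450 in response to MAIL FROM or RCPT TO gives
"queue" no matter how many MX hosts remain.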

You are effectively making a distinction between errors related to the
destination host and errors relating to the particular message.


Not really. The distinction is between problems that tend to affect a single
host and problems that tend to affect the ability to deliver regardless of the
host you happen to be talking to. The specific message rarely has anything to
do with it - the factors that tend to block delivery completely are things like
a systemic failure of some infrastructure service the servers all depend on
like DNS or directory or antispam or antivirus, or a problem with the client
being blacklisted or otherwise held in bad odor by the servers, or a network
issue where TCP connections get whacked consistently. (This, BTW, is why
maintaining some amount of cache state about how remote systems are behaving is
effective - if there really were lots of message-specific failure modes this
strategy would be seriously counterproductive.)
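
As a rough sketch of what "cache state about how remote systems are
behaving" could look like, assuming a simple in-memory table keyed by
host name: the class name, its methods and the 300 second default are
all made up for illustration, and a real MTA would likely track more
than this.

```python
import time


class HostStateCache:
    """Remember hosts that failed recently so they can be skipped for a while."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self._ttl = ttl        # seconds a failure is remembered (assumed value)
        self._clock = clock    # injectable so the cache can be tested
        self._failed = {}      # host name -> time of the last recorded failure

    def record_failure(self, host):
        self._failed[host] = self._clock()

    def record_success(self, host):
        # A successful session clears any remembered failure.
        self._failed.pop(host, None)

    def should_skip(self, host):
        failed_at = self._failed.get(host)
        if failed_at is None:
            return False
        if self._clock() - failed_at >= self._ttl:
            # The entry has expired, so give the host another chance.
            del self._failed[host]
            return False
        return True
```

A sender would call should_skip() before opening a connection and
record_failure() or record_success() once the attempt finishes. Note
that the cache is keyed by host, not by message, which only makes sense
if failures are mostly not message-specific.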

What might actually help is attempting to classify network level errors better.
There's little point in trying every one of some regional ISP's 18 gazillion
servers if the problem is someone put a backhoe blade through a critical bit of
fiber. The question you're trying to answer is: when you have a total
connection failure, is it likely to be specific to the host, or is it likely
to affect connections to all hosts?

At least part of the logic here is that when you're talking about problems
that show up after you're connected to the host you're already waaay out
on the tail of the failure case curve.

Perhaps it would be a further improvement to use enhanced status codes to
identify errors at or after MAIL FROM that also relate to the host rather
than the message.


None of the status code sets we have are aligned to be used this way. I've
actually tried this, and also conducted a paper experiment of how this would
play out, knowing what codes are returned by our server. The gains were
marginal at best, and while I didn't try this at sufficient scale to know for
sure what the downsides are, there were some indications that it might make
things worse, not better.

According to the enhanced status code design, you should just be able to
check that the second digit is a 3, but unfortunately the design has not
been followed consistently. For instance, x.3.4 (message too big) is a
per-message error not a per-host error (it should probably be an x.6.z
code not an x.3.z code), and x.4.5 (mail system congested) is a per-host
error not a per-message error (it should be an x.3.z code).
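
To make the inconsistency concrete, here is a small sketch of the naive
"second digit is 3" test with the two exceptions named above patched in
by hand. The function name, the returned labels and the override table
are assumptions made for illustration; the table is certainly not
complete.

```python
import re

# class.subject.detail, e.g. "4.3.2"
_ENHANCED = re.compile(r"^([245])\.(\d{1,3})\.(\d{1,3})$")

# Hand-maintained exceptions to the "subject 3 means per-host" rule.
# Only the two cases mentioned in the text are listed.
_OVERRIDES = {
    (3, 4): "message",  # x.3.4, message too big
    (4, 5): "host",     # x.4.5, mail system congested
}


def classify(code: str) -> str:
    """Return "host", "message" or "not_host" for an enhanced status code."""
    match = _ENHANCED.match(code)
    if match is None:
        raise ValueError(f"not an enhanced status code: {code!r}")
    subject, detail = int(match.group(2)), int(match.group(3))
    if (subject, detail) in _OVERRIDES:
        return _OVERRIDES[(subject, detail)]
    # The naive rule: subject 3 is taken to mean a per-host problem.
    return "host" if subject == 3 else "not_host"
```

Every further code that breaks the pattern needs its own entry in the
override table, at which point the subject digit is no longer doing
much of the work.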


Yes, these are the sorts of problems that make me think this isn't a very
good idea.

                                Ned