Re: After a 450, queue or try next MX?

On Wed, Aug 30, 2006 at 03:49:40PM -0400, Hector Santos wrote:

In our system, we only go to the next record when there is a connection
failure. Otherwise we follow the wishes of the 45x or 550.  A 45x means try
again LATER, not within the same transaction attempt where you have a list of
MXes to try.

Which definitely sounds reasonable, but would cause unnecessary delays
when the MX itself has a problem.


The strategy of trying all MXes to exhaustion can also cause delays. Consider
the (surprisingly common) case of a domain with lots of MXes, each with a bunch
of A records, none of them reachable. In such a case an SMTP client can spend a
ton of time trying and failing to deliver the mail - in the process consuming
resources that could have been spent getting other messages out quicker. This
is such a problem that I've seen lots of setups where traffic to sites that
frequently fail in this way is segregated to a separate outgoing queue or
queues so it cannot impact other traffic.
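
As a rough illustration of that kind of segregation, here is a minimal Python
sketch. Everything in it is an assumption made for the example: the class
name, the queue names "main" and "slow", and the failure threshold are all
hypothetical, and real MTAs do this through their own configuration.

```python
from collections import defaultdict


class QueueRouter:
    """Route domains that keep failing to a separate outgoing queue.

    Hypothetical sketch: the threshold and queue names are illustrative.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        # Consecutive "no MX host reachable" results, keyed by domain.
        self.failures = defaultdict(int)

    def record_unreachable(self, domain):
        """Call when every MX host for the domain failed to connect."""
        self.failures[domain.lower()] += 1

    def record_success(self, domain):
        """Call when a connection to the domain succeeded; resets the count."""
        self.failures.pop(domain.lower(), None)

    def queue_for(self, domain):
        """Return the name of the queue new mail for this domain should use."""
        if self.failures.get(domain.lower(), 0) >= self.threshold:
            return "slow"
        return "main"


router = QueueRouter(threshold=2)
router.record_unreachable("example.com")
router.record_unreachable("Example.COM")
print(router.queue_for("example.com"))  # traffic now goes to the slow queue
print(router.queue_for("example.org"))  # other traffic is unaffected
```

The point of the design is only that the time spent walking a long list of
dead MX hosts is charged to the slow queue rather than to everyone else.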

Consider "450  I am having a problem reaching the DMZ"

Of course that would not be the real error message; I am providing an
example where the next MX may have no such problem, and thus where it
is better to do it differently than you are doing now.


As I said in my previous response, been there, tried that, found it created
a different set of problems.

In short, there are no real rules on how a system molds its retries. But
there are some common strategies.  See RFC 1123. It talks about strategies
that I think are pretty common.

Agreed, but they are still open to differences in interpretation.

two MX hosts:
  0 mail1.example.com
  1 mail2.example.com

I could argue that one single delivery attempt involves trying both hosts.


Indeed you can.

You are assuming one failed connection to mail1.example.com means one
failed delivery attempt.  I see no such evidence.

RFC1123 again repeats the statement about trying all MXes, in order, until
a delivery attempt succeeds.


Note the words here. "Delivery attempt" != "delivery". If you want to be
pedantic about it, the question is how far do you have to get before what you
did constitutes an "attempt". My personal view is that having your connection
blown off isn't quite enough to count as an "attempt", but having your RCPT TO
rejected with a 4yz error is.

It does not say "until a connection to the SMTP daemon succeeds".

two MX hosts:
  0 mail1.example.com
  1 mail2.example.com

Suppose mail1.example.com returns "450 failed for whatever reason".  If
you now decide to wait 5 minutes (or 1, or 10) you still MUST(!) try mail2
the next time.  This is the only way I can interpret RFC1123 section 5.3.4
(reliable mail transmission).


I think the context of that 4yz error is crucial in determining whether or not
you've tried hard enough. I certainly don't read section 5.3.4 as
saying I have to try additional MXes when the failure happens at the RCPT TO.
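
To make the distinction concrete, here is a minimal Python sketch of that
policy. It is one reading, not a statement of what RFC 1123 requires. The
function and enum names are hypothetical, and the handling of a 4yz reply
that arrives before RCPT TO (greeting, EHLO, MAIL FROM) is an assumption
added for the example, since only the connection and RCPT TO cases are
discussed above.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue the transaction"
    TRY_NEXT_MX = "try the next MX host now"
    RETRY_LATER = "queue the message and retry later"
    BOUNCE = "return the message to the sender"


def next_action(stage, reply_code=None):
    """Decide what to do after one step of an SMTP transaction.

    stage:      "connect", "greeting", "ehlo", "mail" or "rcpt"
    reply_code: the three-digit SMTP reply, or None if the connection
                failed or was dropped before a reply arrived.
    """
    if reply_code is None:
        # Connection blown off: not yet an "attempt", so move on.
        return Action.TRY_NEXT_MX
    if 200 <= reply_code < 400:
        return Action.CONTINUE
    if 400 <= reply_code < 500:
        if stage == "rcpt":
            # A 4yz on RCPT TO counts as a delivery attempt: retry later.
            return Action.RETRY_LATER
        # ASSUMPTION: an earlier 4yz says more about this host than
        # about the recipient, so another MX is worth trying.
        return Action.TRY_NEXT_MX
    # 5yz: permanent failure.
    return Action.BOUNCE


print(next_action("connect"))    # Action.TRY_NEXT_MX
print(next_action("rcpt", 450))  # Action.RETRY_LATER
```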

                                Ned