----- Original Message -----
To: "Hector Santos" <winserver(_dot_)support(_at_)winserver(_dot_)com>
Cc: "IETF-SMTP" <ietf-smtp(_at_)imc(_dot_)org>
Sent: Sunday, January 04, 2004 2:09 PM
Subject: Re: RFC 2821 Address Resolution
Remember that round-robin is for the *server cluster*'s
convenience, not the client's. It's so that *overall*, clients
are evenly distributed across the round-robin list and don't all
pile on the first one.
So an inquiring client should "randomize" it (again) to minimize the
possibility that two or more current client snapshots are going to hit
the same servers at the same time?
In fact, a single client can't even tell round-robin is in use
unless you ask several times (I once had the luck of doing a 'dig'
of an address I *knew* was round-robined from the server listed
in the NS (so any caching at my end didn't matter), and I had to
ask *five times* before I got a different list. I just happened to
be the N'th query 4 times in a row, so it had round-robined back
to where I was)...
This was what was happening with gethostbyname(). A direct A record lookup
yielded a round-robin result, but the Windows socket function resolved it
from its cache first, returning the same address over and over again until
the cache entry timed out and it fetched a new set of records.
That is when I realized that the MX query "may" return the A records so this
should be the opportunity to resolve the address if available. No need for
the gethostbyname() call in this case.
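A minimal sketch of that idea, assuming the MX response has already been
parsed into a list of exchanger host names plus a map of whatever A records
arrived alongside them. The names addresses_for_mx, additional and lookup_a
are hypothetical and not part of any resolver API; lookup_a stands in for
whatever separate A query the client would otherwise make.

```python
from typing import Callable, Dict, List, Tuple


def addresses_for_mx(
    mx_hosts: List[str],
    additional: Dict[str, List[str]],
    lookup_a: Callable[[str], List[str]],
) -> List[Tuple[str, List[str]]]:
    """Pair each MX host with its addresses, querying only when needed."""
    resolved = []
    for host in mx_hosts:
        # Normalize the name so "MX1.Example.COM." matches "mx1.example.com".
        key = host.rstrip(".").lower()
        addrs = additional.get(key)
        if not addrs:
            # No A record came back with the MX answer. That alone does not
            # show the host is down; it only means a separate lookup is needed.
            addrs = lookup_a(key)
        resolved.append((host, addrs))
    return resolved
```

Hosts whose addresses arrived with the MX answer never trigger a second
query; only the A-less ones fall through to lookup_a.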
Yes, that's sub-optimal if you cached a list that has the downed
host first in a list of 4, but hopefully only 1/4 of the hosts
end up in that situation.
For aol.com, I'm noticing it is usually 2 of the 4 that are down (no A
record in the MX query) with round-robin results.
As I said, not guaranteed. You need to look these A records up.
You are allowed to optimize and not look them up *if* the A-less
MX's are secondaries and as a result may not actually get
referenced. Wait till you need them. If you're really clever,
you'll fire the lookup when your original list is "getting low",
so the DNS lookups happen in parallel with the last few tries of
A records, balancing "don't make un-needed queries unless you
have to" with "don't stall waiting for queries".
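The "getting low" trick could be sketched roughly as below. This is only an
illustration under assumptions: candidate_addresses, low_water and lookup_a
are hypothetical names, and a thread pool stands in for whatever asynchronous
DNS mechanism the client really has.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def candidate_addresses(
    known: List[str],
    pending_hosts: List[str],
    lookup_a: Callable[[str], List[str]],
    low_water: int = 2,
) -> Iterator[str]:
    """Yield known addresses first, then addresses of the A-less MX hosts.

    The deferred lookups are started in the background once only
    `low_water` known addresses remain untried.
    """
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = None
        for i, addr in enumerate(known):
            remaining = len(known) - i
            if futures is None and remaining <= low_water:
                # List is getting low: fire the lookups now so they run
                # in parallel with the last few connection attempts.
                futures = [pool.submit(lookup_a, h) for h in pending_hosts]
            yield addr
        if futures is None:
            # Nothing was known up front, so the lookups cannot be deferred.
            futures = [pool.submit(lookup_a, h) for h in pending_hosts]
        for fut in futures:
            for addr in fut.result():
                yield addr
```

If the caller stops iterating early because a delivery succeeded, the
deferred lookups are never issued at all.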
Ha! This is the exact "clever" logic I am currently testing out! :-) Seems
to produce a legitimate final list. This worked with the test domains like
AOL.COM that happen to have equal preference.
But what do you do with multi-MX where one or more hosts are down and have
lower preference numbers than the rest?
RFC2821, section 5, says: ..........
So you're required to exhaust your list of mx1/mx2 A records
first, then try the mx4 A records. You're required to randomize
whether you try mx1 or mx2 first. However, the spec does *not*
say what to do if you have multiple A records from multiple MX:
1) It's unclear if you should try the first mx1, the first mx2,
the second mx1, the second mx2, or first try all the mx1 and then
all the mx2.
Group them and then randomize?
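One possible reading of "group them and then randomize", as a sketch only:
pool every address from MX hosts of equal preference, shuffle the pool, and
move on to the next preference value once the pool is exhausted. This picks
one of the orderings the spec leaves open; it is not claimed to be the
required one. The MxEntry tuple layout and ordered_addresses are
hypothetical names.

```python
import random
from itertools import groupby
from typing import List, Optional, Tuple

MxEntry = Tuple[int, str, List[str]]  # (preference, host, addresses)


def ordered_addresses(
    entries: List[MxEntry], rng: Optional[random.Random] = None
) -> List[str]:
    """Order addresses by MX preference, shuffled within each preference."""
    rng = rng or random.Random()
    result: List[str] = []
    # Lower preference numbers are tried first.
    by_pref = sorted(entries, key=lambda e: e[0])
    for _, group in groupby(by_pref, key=lambda e: e[0]):
        # Pool the addresses of all equal-preference hosts, then shuffle.
        pool = [addr for _, _, addrs in group for addr in addrs]
        rng.shuffle(pool)
        result.extend(pool)
    return result
```

With mx1 and mx2 at the same preference and mx4 at a higher number, all
mx1/mx2 addresses come out first in random order, followed by the mx4 ones.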
2) It's unclear if you encounter a duplicate address you've tried
already on ONE mx if you should remove it from the list, or
re-try if it's the other MX.
I'm removing it from the list currently. If the server is not responding,
why try it again during the current send mail?
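The "remove it from the list" choice might look like the sketch below, which
skips any address already attempted during the current send regardless of
which MX name it came from. The function name and the try_connect callback
are hypothetical; whether a retry under the other MX name is ever warranted
is exactly the open question above.

```python
from typing import Callable, Iterable, Optional


def deliver_once_per_address(
    addresses: Iterable[str], try_connect: Callable[[str], bool]
) -> Optional[str]:
    """Try each distinct address at most once; return the one that worked."""
    tried = set()
    for addr in addresses:
        if addr in tried:
            # Same address listed under another MX: already attempted
            # during this send, so skip it.
            continue
        tried.add(addr)
        if try_connect(addr):
            return addr
    return None
```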
If the Protocol Police are going to write a ticket for this one,
we need to clean it up so the Protocol Lawyers and Protocol Judge
don't spend all afternoon arguing. ;)
I hope I didn't open a can of worms or add more work for anyone :-)
Hector Santos, Santronics Software, Inc.