ietf-smtp

Re: RFC 2821 Address Resolution

2004-01-04 14:31:45
On Sun, 04 Jan 2004 15:32:09 EST, Hector Santos said:

So an inquiring client should "randomize" it (again) to minimize the
possibility that two or more clients holding current snapshots are going to hit
the same servers at the same time?

You need to randomize the list of equal-weight MX's, but leave the order
of A records for each MX alone.
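As a rough illustration of that split (a sketch in Python; the record tuples
and names here are mine, not from any particular resolver library): shuffle
only within equal-preference MX groups, and never reorder the A list that came
back for a given host.

    import random
    from collections import defaultdict

    def order_mx_candidates(mx_records):
        """mx_records: list of (preference, host, [a_records]) tuples.

        Returns hosts in the order a client should try them: preference
        groups in ascending order, equal-preference hosts shuffled, and
        each host's A records left exactly as the DNS returned them.
        """
        by_pref = defaultdict(list)
        for pref, host, addrs in mx_records:
            by_pref[pref].append((host, addrs))

        ordered = []
        for pref in sorted(by_pref):
            group = by_pref[pref][:]
            random.shuffle(group)                 # randomize equal-weight MXs...
            for host, addrs in group:
                ordered.append((host, list(addrs)))  # ...but keep A order intact
        return ordered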

In fact, a single client can't even tell round-robin is in use
unless it asks several times. (I once had the luck of doing a 'dig'
of an address I *knew* was round-robined, against the server listed
in the NS record, so any caching at my end didn't matter, and I had to
ask *five times* before I got a different list. I just happened to
be the N'th query 4 times in a row, so it had round-robined back
to where I was)...

This was what was happening with gethostbyname(). A direct A record lookup
yielded a round-robin result, but the Windows socket function resolved it from
its cache first, returning the same address over and over again until the cache
entry timed out and it fetched a new set of records.

No, you're seeing something different.  I in fact queried the NS *directly*,
but they had a list of 5 or so round-robin'ed addresses, I'd do the query, some
(5*N)-1 queries would arrive, my next query would show up, and it would be at
the same place again. Same thing happened a few more times, till I got in at
some query number other than (5*N)-1. ;) Kind of like a 'Wheel of Fortune'
wheel, and hitting "Lose Your Turn" several times in a row... :)

Yes, that's sub-optimal if you cached a list that has the downed
host first in a list of 4, but hopefully only 1/4 of the hosts
end up in that situation.

For aol.com, I'm noticing it is usually 2 of the 4 that are down (no A
record in the MX query) with round-robin results.

"No A record" is different than "down".  I mean the case where you get back
a list of 4 round-robin'ed A's, and the first one is down for maintenance, so 
you keep
trying that one and then re-trying onto the second A record.  And you keep on
doing that till the TTL on the entry expires and you refetch - even though you'd
have gotten out of that situation (with 75% probability) if you had ignored the 
TTL
and re-fetched...

What's happening with AOL is that they have a LARGE farm of servers that won't
fit into a single DNS reply packet, so they give you the complete MX list, and
hopefully enough A records that you'll find a working one without having to do
either a lookup for the A records for the other MX's, or set up a TCP
connection to get the full DNS reply (with the 3-packet overhead at the front,
the FIN/ACK at the end, and all the other ugly overhead-sucking stuff).

Watch AOL for an extended period (the better part of a day) - you'll notice
that they slowly rotate MX and A records into and out of the list, so they can
rely on sites keeping older but still valid addresses in their cache while they
start advertising new ones - this allows them to *effectively* have a larger
set of servers than they could actually advertise in a single-packet UDP DNS
response...

But what do you do with multi-MX where one or more are down and have lower
preference numbers than the rest?

It's pretty clear you need to exhaust all the options for each lower-number MX
before moving on.  So you do all the mx=0, then mx=1, then mx=2 and so on.

1) It's unclear if you should try the first mx1 address, then the first mx2
address, then the second mx1, then the second mx2, or first try all of
mx1 and then all of mx2.

Group them and then randomize?

Certainly *not* - you're required to preserve per-host ordering of the A
records (as you don't KNOW if they're round-robined, or if they are
preference-ordered - the first could have an OC12 and the 3rd or 4th A
record correspond to a DS3).

The question is basically "which is better, the second-best address for the
first MX, or the best address for the second one" - and given that the
"first" and "second" are required to be randomized, the best address of
the second is a better bet, even though the language in the RFC *hints*
at running all the A's for one, then the other.
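A minimal sketch of the ordering argued for here (which is a reading, not what
the RFC literally spells out): walk the already-randomized equal-preference MX
hosts "column-wise", so every host's best A record is tried before any host's
second-best.  The data shapes are assumptions carried over from the sketch
above.

    from itertools import zip_longest

    def try_order(equal_pref_mxs):
        """equal_pref_mxs: already-randomized list of (host, [a_records])
        for one preference level.

        Yields addresses column-wise: the best A of every MX before any
        MX's second-best A.
        """
        address_lists = [addrs for _host, addrs in equal_pref_mxs]
        for round_of_addresses in zip_longest(*address_lists):
            for addr in round_of_addresses:
                if addr is not None:
                    yield addr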

2) It's unclear, if you encounter a duplicate address you've already tried
under ONE MX, whether you should remove it from the list, or re-try it
when it shows up under the other MX.

I'm removing it from the list currently.  If the server is not responding,
why try it again during the current send mail session?

Some further thought indicates it probably requires checking if your retry
timeout for a given host has expired. If you retry a given host every 30
minutes, and allow a 2-minute timeout, if the "duplicate" host is over 15
A-records from the last time you tried, you need to retry...
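Something like the following captures that (a sketch only; the 30-minute
interval and the per-address table are assumptions taken from the numbers
above, not anything a spec mandates): a duplicate is only skippable while its
retry timer is still running.

    import time

    RETRY_INTERVAL = 30 * 60   # assumed: retry a failed host every 30 minutes

    last_attempt = {}          # address -> time of the last connection attempt

    def should_try(addr, now=None):
        """Skip a duplicate only while its retry timer is still running; if
        enough time has passed since the last attempt, it is fair game again."""
        now = now if now is not None else time.time()
        then = last_attempt.get(addr)
        if then is not None and now - then < RETRY_INTERVAL:
            return False       # tried recently, still inside the retry window
        last_attempt[addr] = now
        return True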

A more subtle hole I've seen MTA's fall into - do the MX and A record lookups,
and get a list of 20 or so A records with a 30-min TTL.  Start trying, with a
2-minute timeout on each.  By the time you get to the 16th, the TTL has expired
and the list is stale and needs to be looked up again - but the MTA fails to
notice.

Figuring out what to try out of the NEW list of 20, which may or may not
include the same A records as the first time, or partially overlap, and some that
you've tried at the beginning and are set for retry, and some that are recent,
and some you've expired and haven't tried, and some that are new, is left as an
exercise for the masochist.  A case could be made that if you've been trying so
long that the TTL's have expired, it's time to Just Give Up, queue it, and go
have a beer or something and try later, even though you have untried addresses.
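The minimum defence against that hole is to check the cached list's age before
every connection attempt, along these lines (a sketch; `connect` here stands in
for whatever actually opens the SMTP connection, and the class is just an
assumed shape for the cache entry):

    import time

    class CachedAddressList:
        """A list of A records plus the time it was fetched and its TTL, so
        the delivery loop can notice mid-run that it has gone stale instead
        of blindly trying entry #16 of 20."""

        def __init__(self, addresses, ttl):
            self.addresses = list(addresses)
            self.ttl = ttl
            self.fetched_at = time.time()

        def is_stale(self):
            return time.time() - self.fetched_at > self.ttl

    def deliver(cache, connect):
        for addr in cache.addresses:
            if cache.is_stale():
                # The list outlived its TTL while we were timing out on
                # earlier entries; requeue rather than keep using stale data.
                return "requeue"
            if connect(addr):
                return "delivered"
        return "requeue"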

The usual non-standard-compliant way to work around this is to only allow a
short timeout (5-10 seconds or so) for the *FIRST* try at the TCP 3-packet
handshake, and if that fails, retry after 30 mins or so.  Combining this with an
MTA-wide "recent status" table allows you to fairly quickly detect that once
again, AOL/Yahoo/Hotmail/whoever has gone belly up, and not strangle yourself on
retries (if ONE piece of mail couldn't reach any Hotmail server, the next piece
2.5 minutes later probably can't either, so you're better off not even trying
to deliver the second, and stashing both of them in a queue to retry them both
in 30 mins...)
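A sketch of that combination (the timeout and back-off values are the ones
suggested above; the status table is just an in-process dict standing in for
whatever shared store a real MTA would use):

    import socket
    import time

    FIRST_TRY_TIMEOUT = 8        # assumed: 5-10 s on the very first connect
    BACKOFF = 30 * 60            # assumed: leave the destination alone 30 min

    recent_status = {}           # domain -> (succeeded?, timestamp), MTA-wide

    def try_deliver(domain, addr, port=25):
        """Skip domains that just failed for another message; otherwise do a
        quick first connection attempt and record the outcome."""
        status = recent_status.get(domain)
        if status and not status[0] and time.time() - status[1] < BACKOFF:
            return "queued"      # the whole destination looked dead recently

        try:
            sock = socket.create_connection((addr, port),
                                            timeout=FIRST_TRY_TIMEOUT)
            sock.close()
            recent_status[domain] = (True, time.time())
            return "connected"
        except OSError:
            recent_status[domain] = (False, time.time())
            return "queued"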

If the Protocol Police are going to write a ticket for this one,
we need to clean it up so the Protocol Lawyers and Protocol Judge
don't spend all afternoon arguing. ;)

I hope I didn't open a can of worms or add more work for anyone :-)

If we didn't want this work, we'd not be subscribed  - this is a 100% volunteer
effort, except for the IETF Secretariat who are not on this list anyhow. :)
