
Re: 2821bis consideration - New 2nd attempt Retry Strategy recommendation

2007-11-16 18:23:29

Hi SM,

Please note that I am not disagreeing with your points. I was skeptical, but after a customer "wish list" request that refused to die, I explored greylisting (GL) and found it to be "do-able." With fine tuning to minimize impact, it can work without disrupting operations.

Once GL was part of the picture, it became fairly obvious why operators had previously been reporting strange rejects with no explanation, followed, to their confusion, by eventual delivery. Hence the variable table was added, not just for the GL operators but for those who were hitting GL systems.

As for turning on the GL feature for our own support system, that was an even tougher decision since, as a small company, we can't afford any missed sales or customer support emails. But it was turned on and carefully followed. In fact, since we were unsure about false positives, rather than reject at RCPT TO as I believe many do, we implemented it as part of the DATA filter hook system with a dynamic response at that point.

That allowed us to store a copy of the message for review, to see how effective it was and, more importantly, to see if "good messages" were lost because the "good sender" did not retry.

I can tell ya that the latter was a non-issue, and that is what sold me on this GL concept. If even a small percentage of "good intention" systems had shown broken SMTP retry logic, odds are very high I would have nixed this project and explained to our customers why. This is not to say there were no individual incidents where a "good intention" message system did not retry. But that soon became a funny moral reason for supporters to yell at those: "FIX YOUR SMTP SOFTWARE - YOU ARE ACTING LIKE A SPAMMER." If I recall, this was mostly an issue with old PHP scripts with one-shot mail send or notification logic, and they were failing not because of GL, but because they did not properly handle multiple response lines. So in most cases it wasn't GL itself but some other cause, yet they blamed GL.
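For readers wondering what "handling multiple response lines" involves, here is a minimal sketch of the RFC 2821 reply format: every line starts with the same three-digit code, a hyphen after the code means more lines follow, and a space marks the last line. The function name and the sample reply text are my own illustration, not anything from our code.

```python
def parse_smtp_reply(lines):
    """Collapse a possibly multi-line SMTP reply into (code, text_lines).

    Hypothetical helper for illustration only.
    """
    code = None
    texts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if len(line) < 3 or not line[:3].isdigit():
            raise ValueError(f"malformed reply line: {line!r}")
        this_code = int(line[:3])
        if code is None:
            code = this_code
        elif this_code != code:
            raise ValueError("reply code changed mid-reply")
        separator = line[3:4]   # "-" means another line follows
        texts.append(line[4:])  # text after the separator, may be empty
        if separator != "-":
            return code, texts  # last line reached
    raise ValueError("reply ended without a final line")


reply = ["451-4.7.1 Greylisted\r\n", "451 4.7.1 Please try again later\r\n"]
code, texts = parse_smtp_reply(reply)
```

A one-shot script that reads only the first line of such a reply never reaches the final line, which is the kind of failure that then gets blamed on GL.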

Anyway, the web-based GL tool gave the operator an easy way to view stats and check the content of all current GL 1st-reject messages, to help give them (and us) confidence about whether this obscure idea was working or not. This helped sell it. It's funny, I should note (remember, these are operators), that early on some suggested we add a click button to move the current message into the accepted inbound mail queue for import. But I explained that it would only be a good idea if we saw good systems not retrying. I think today they are convinced of that. Just let it run and forget about it. Don't sit there looking at the web GL stats and reject table listings and begin to doubt whether a particular new mail that looks good will eventually come in again and get delivered. Guaranteed! It will drive you nuts. :)

The 5 mins was carefully decided upon, mainly because I don't particularly like the idea of going against 2821 recommendations. But the market overruled that issue. In the end, our default variable table is:


Note: attempt1 is really the 2nd attempt, since the rescheduling code is based off the current count, "msgQueue->nTotalAttempts"

Finally, on the GL receiver side, our default is a 55 second block and a 2 day grace period to send the retry.
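To make those two receiver-side defaults concrete, here is a rough sketch of the decision, not our actual filter hook. It assumes the usual greylisting key of (client IP, envelope sender, envelope recipient), and it assumes a retry arriving after the grace period is treated as a brand new first attempt. The class and method names are made up for illustration.

```python
BLOCK_SECONDS = 55              # default block, from the post
GRACE_SECONDS = 2 * 24 * 3600   # default 2 day grace period, from the post


class Greylist:
    """Hypothetical in-memory greylist keyed on (ip, sender, recipient)."""

    def __init__(self):
        self._first_seen = {}

    def accept(self, key, now):
        """Return True to accept, False to answer with a 45x temporary reject."""
        first = self._first_seen.get(key)
        if first is None or now - first > GRACE_SECONDS:
            # Unknown key, or the retry came too late: start over.
            self._first_seen[key] = now
            return False
        if now - first < BLOCK_SECONDS:
            # Retried too quickly; still inside the block.
            return False
        # Retry arrived after the block and within the grace period.
        return True


gl = Greylist()
key = ("192.0.2.1", "sender@example.org", "rcpt@example.net")
first_try = gl.accept(key, now=0)      # rejected, key recorded
quick_retry = gl.accept(key, now=30)   # rejected, inside the 55 second block
later_retry = gl.accept(key, now=300)  # accepted, e.g. a 5 minute 2nd attempt
```

A sender using the 5 minute 2nd attempt lands comfortably past the 55 second block, which is the whole point of pairing the two defaults.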

I probably should have used 3 days, since our original (non-variable) defaults were once per hour, 72 attempts, or 3 days. And if you follow the 2821 recommendation, it suggests 4-5 days. With 30 min intervals that yields an awful 192-240 retries.
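The attempt counts are just window length divided by retry interval. A quick sketch of that arithmetic (the function name is mine): once per hour over 3 days gives 72 attempts, and a 30 minute interval means 48 attempts per day, so 4 to 5 days comes to 192 to 240.

```python
def attempts_in_window(days, interval_minutes):
    """Number of fixed-interval retry attempts that fit in a give-up window."""
    return (days * 24 * 60) // interval_minutes


hourly_3_days = attempts_in_window(3, 60)     # original once-per-hour default
half_hour_4_days = attempts_in_window(4, 30)  # 2821-style, 4 day give-up
half_hour_5_days = attempts_in_window(5, 30)  # 2821-style, 5 day give-up
```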

But I can't recall offhand the reason two days was selected for the default GL grace period. Maybe I was thinking that if spammers were using the RFC as a guideline, or the GL specs of 4 days, then all they had to do was wait 3 days to retry.

Finally, for the 451 code itself, yeah, I didn't think it was ideal, but I do think that, given all the choices, the GL author made the right decision. Assuming the author is mostly an operator reading RFC 2821, he sees three examples of 45x with literals:

      450 Requested mail action not taken: mailbox unavailable
         (e.g., mailbox busy)
      451 Requested action aborted: local error in processing
      452 Requested action not taken: insufficient system storage

With the possibly erroneous presumption that the literals are set in stone as the reject reason, then among the three, 451 is arguably preferred over 450 and 452.

But it should not matter from an SMTP technical standpoint because the SMTP sender must use 45z for its retry considerations, regardless of what z is.

I will say that I did consider using 451 as a trigger for the altered, shorter 2nd attempt interval. But our outbound mail code checks for a 45x response in general, and I didn't want to change that, since the code might not be 451 but 450, 452 or some other 45x value.
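As a rough sketch of that sender-side choice (not our actual code): the shorter 2nd attempt interval is keyed on any 45x reply rather than on 451 alone. The 5 minute value is the one mentioned above; the 60 minute fallback is only a placeholder borrowed from the old once-per-hour default, and the function name is hypothetical. The attempt count is the number of attempts already made, in the spirit of nTotalAttempts.

```python
SECOND_ATTEMPT_DELAY_MIN = 5  # shortened 2nd attempt interval, from the post
FALLBACK_DELAY_MIN = 60       # placeholder assumption, not the real table


def retry_delay_minutes(reply_code, n_total_attempts):
    """Return minutes until the next attempt, or None if no retry applies.

    n_total_attempts counts attempts already made, so 1 means the first
    attempt just failed and the next one is really the 2nd attempt.
    """
    if not 400 <= reply_code <= 499:
        return None  # not a temporary failure, so nothing to reschedule
    if 450 <= reply_code <= 459 and n_total_attempts == 1:
        return SECOND_ATTEMPT_DELAY_MIN  # any 45x, not just 451
    return FALLBACK_DELAY_MIN


delay_451 = retry_delay_minutes(451, 1)
delay_450 = retry_delay_minutes(450, 1)
delay_later = retry_delay_minutes(451, 2)
```

Keying on the whole 45x range means a GL system answering 450 or 452 instead of 451 still gets the quick 2nd attempt.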


Hector Santos, CTO

Hector Santos wrote:

SM wrote:

The Greylist spec shows a 4.7.1 extended code:

      451 4.7.1 Please try again later

I believe that the reply code mentioned in that whitepaper is incorrect. The extended code is correct. I recommend using "450 4.7.1 Text" when the temporary failure is due to a policy decision.

Incorrect in what way? Inappropriate perhaps from an "operator/policy" standpoint? Functionally or technically? Compatibility? If it's a compatibility problem, then it needs to be reconsidered.

As a general rule, I would use 30 minutes as receivers reading RFC 2821 will expect that.

Sure, but all receivers need to be ready for anything, including the possibility of "more sophisticated and variable strategies" as it was insightfully stated in 2821. :-)

So I don't think it would be a technical problem.