ietf-smtp

Proposal for Adjusted DATA Timeout

2008-05-23 05:22:58

Hi,

I am wondering whether writing an I-D or BCP is worth the effort here. Your comments are welcome.

Basically, with the advent of larger emails and the trend toward sophisticated mail receivers performing DATA pre-response callouts to process the message before determining what the response code will be, there is a greater potential for client timeouts, duplicate resends, and of course wasted bandwidth and overhead.

Summary of my proposal:

    Clients should consider adjusting their DATA termination
    state timeout based on the size of the message they are
    sending, and not just use a low 5 minutes across the board
    for all mail payload sizes. If the client is not using the
    recommended 10 minute timeout [RFC 2821], it should
    consider that lengthy receiver processing makes resends
    and duplicate messages increasingly likely. Thus the
    client SHOULD adjust the DATA termination timeout as
    follows:

      Use 5 minutes for 5 megabytes or less.
      Use 10 minutes for over 5 megabytes.

    Or use some block transfer rate calculation as it proceeds
    to determine what the timeout will be when complete.

    Overall, using a constant 5 minutes is TOO low for
    large file transfers. It needs to be adjusted.
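
A minimal sketch of the two options above, in Python. The function names, the binary reading of "megabyte", and the `scale` factor in the second helper are assumptions made for illustration; the proposal itself only fixes the 5 MB threshold and the 5 and 10 minute values.

```python
# Assumption: the proposal does not say whether "megabyte" is binary or decimal.
MEGABYTE = 1024 * 1024

FIVE_MINUTES = 5 * 60    # seconds
TEN_MINUTES = 10 * 60    # seconds


def data_termination_timeout(message_size_bytes: int) -> int:
    """Seconds to wait for the reply to the final dot (hypothetical helper).

    5 minutes for 5 megabytes or less, 10 minutes for anything larger.
    """
    if message_size_bytes <= 5 * MEGABYTE:
        return FIVE_MINUTES
    return TEN_MINUTES


def transfer_time_timeout(transfer_seconds: float, scale: float = 2.0) -> float:
    """Alternative: derive the timeout from how long the DATA transfer took.

    `scale` is an illustrative guess, not a value from the proposal. The
    result is clamped to the 5..10 minute range discussed above.
    """
    return max(FIVE_MINUTES, min(TEN_MINUTES, transfer_seconds * scale))
```

A client would compute one of these values once the message size (or the elapsed transfer time) is known and apply it only to the wait after the final dot, leaving its other per-command timeouts alone.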


Background:

The issue is 100% highlighted in the two-page 1988 RFC 1047, "DUPLICATE MESSAGES AND SMTP":

  INTRODUCTION

  ....

  It may be hard to believe that this problem is the cause of many
  duplicate messages.  Intuitively, one might expect that the time
  spent in the state between the final dot and its accepting 250 reply
  is quite small.  In practice, however, this period is often quite
  long; long enough that timeouts by the sending mailer (or possibly
  network failures) are quite common.  Observations by the author
  suggest that this synchronization problem may be the second leading
  cause of duplicate messages on the Internet (second to mail loops).

  ....

  Many mailers delay responding to the final dot because they are doing
  sophisticated processing of the message, in an attempt to confirm
  that they can deliver the message.


RFC 2821(bis) has a 10 minute recommendation:

   DATA Termination: 10 minutes.

   This is while awaiting the "250 OK" reply.  When the receiver gets
   the final period terminating the message data, it typically
   performs processing to deliver the message to a user mailbox.  A
   spurious timeout at this point would be very wasteful and would
   typically result in delivery of multiple copies of the message,
   since it has been successfully sent and the server has accepted
   responsibility for delivery.  See section 6.1 for additional
   discussion.

The reference to 6.1 states:

   To avoid receiving duplicate messages as the result of timeouts, a
   receiver-SMTP MUST seek to minimize the time required to respond to
   the final <CRLF>.<CRLF> end of data indicator.  See RFC 1047 [28] for
   a discussion of this problem.

which points to RFC 1047.

We ran into the exact issue highlighted in RFC 1047: a customer has a local set of rule-based AVS processing policies with lengthy callouts that run after the DATA termination is received and before the response code is sent. In this case, the callouts effectively max out at 5 minutes, because that is when the server detects that the client has dropped the connection.

However, our SMTP server still sends the 250, because it still follows some RFC 821 behavior, as RFC 1047 describes:

   RFC-821 (on page 22) states that unless the receiving mailer is
   completely unable to process a message it should accept the message
   and acknowledge any errors in processing in a separate message or
   messages sent back to the originator of the message.  As a result,
   receiving mailers should be able to acknowledge the final dot as soon
   as the message has been safely put in a non-volatile (e.g., disk)
   queue for further processing.  Fast acceptance of a message does not
   violate RFC-821.

In short, our server issues the 250 and signals the router to process the mail, and logs the event for the operator to see.
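
For illustration only, the "fast acceptance" pattern RFC 1047 describes (commit the message to a non-volatile queue, then acknowledge the final dot) might look like the sketch below. This is not our server's actual code; `spool_then_ack`, the spool directory layout, and the file suffixes are hypothetical.

```python
import os
import tempfile


def spool_then_ack(message: bytes, spool_dir: str) -> str:
    """Write the message durably to disk, then return the SMTP reply line.

    Hypothetical helper: the reply is 250 only after the data has been
    flushed and fsync'ed, so later processing can run without making the
    client wait on the final dot.
    """
    try:
        fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
        with os.fdopen(fd, "wb") as spool_file:
            spool_file.write(message)
            spool_file.flush()
            os.fsync(spool_file.fileno())   # force the data to stable storage
        # Rename so that queue readers only ever see complete files.
        os.replace(tmp_path, tmp_path[:-len(".tmp")] + ".msg")
    except OSError:
        # Could not queue safely: report a temporary failure so the client retries.
        return "451 Requested action aborted: local error in processing"
    return "250 OK"
```

The trade-off is the one discussed in this post: lengthy callouts that run after this point can no longer turn into an SMTP-level rejection.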

The acceptance of this message appears to violate RFC 2821 (showing 2821bis, which is 99% the same with a few changes):

   4.1.1.10.  QUIT (QUIT)

   ...

   The receiver MUST NOT intentionally close the transmission channel
   until it receives and replies to a QUIT command (even if there was an
   error).  The sender MUST NOT intentionally close the transmission
   channel until it sends a QUIT command and SHOULD wait until it
   receives the reply (even if there was an error response to a previous
   command).  If the connection is closed prematurely due to violations
   of the above or system or network failure, the server MUST cancel any
   pending transaction, but not undo any previously completed
   transaction, and generally MUST act as if the command or transaction
   in progress had received a temporary error (i.e., a 4yz response).

   The QUIT command may be issued at any time.  Any current uncompleted
   mail transaction will be aborted.


So we are now debating if this is good or bad.

The customer received the message. His complaint is that mail clients sending large emails keep retrying, with the same thing happening each time. However, our dupe processor is catching the resends, so it isn't a problem of receiving new duplicate mail, just a processing overhead problem.
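
As a rough idea of what a dupe processor can key on, here is a hedged sketch. It is not a description of our implementation: the `DuplicateFilter` class is hypothetical, and it assumes a resend carries byte-identical content with the same Message-ID.

```python
import hashlib
from email import message_from_bytes


class DuplicateFilter:
    """Remembers messages already accepted (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        self._seen = set()

    def is_duplicate(self, raw_message: bytes) -> bool:
        """Return True if this exact message was already seen."""
        parsed = message_from_bytes(raw_message)
        message_id = (parsed.get("Message-ID") or "").strip()
        digest = hashlib.sha256(raw_message).hexdigest()
        key = (message_id, digest)
        if key in self._seen:
            return True
        self._seen.add(key)
        return False
```

A filter like this suppresses the duplicate deliveries, but each resend still costs a full transfer and callout run, which is the overhead problem described above.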

If we followed the QUIT rules in 2821 to the letter, even after a successful DATA termination was received, and delayed the router processing until a QUIT is finally issued (otherwise CANCELing the transaction), then the customer would never receive the message in the first place. This was exactly what he expressed when we found out the server was not canceling the message and indicated we might have to fix that:

    Customer comment:

    Well, unfortunately, your proposed fix just might prevent
    ANY large emails from being received.

    I'd rather get "23" of the same email than "0" of them.
    I really just want 1, of course.

Anyway, the callout issue can be adjusted per customer, but I think there is a conflict in that the standard recommends 10 minutes with "buts" attached.

I can understand that in older days many systems used post-SMTP processing, but we all know they are suffering from major blowback problems. So the direction is to apply DATA-level callouts to provide dynamic SMTP-level rejection capabilities. This is a godsend for our customers and will in no way be removed. Better notes will be provided regarding lengthy callouts taking more than 5 minutes, but I think we should also teach our SMTP clients to adjust to today's changing times.

Even RFC 1047 concludes that mailers should be aware of this and not just use a low 5 minutes across the board, especially when sending larger email payloads, which will obviously add processing time at today's AVS-aware receivers:

    Finally, some mailers allow remote mailers only a minute or two to
    acknowledge the final dot before timing out and trying again.  Given
    the increasing round-trip times on the Internet, and that some
    processing after the final dot is required, the timeout for reply to
    the final dot should probably be at least 5 minutes and a timeout of
    10 minutes would not be unreasonable.

I can understand how some may think "10 minutes" is too long, so the proposal is to adjust it based on the size of the file being transferred.

Comments?

--
Sincerely

Hector Santos, CTO
http://www.santronics.com
http://santronics.blogspot.com