[fetchmail] Re: Fast UIDL thoughts

Quoting from Matthias Andree's mail on Sat, Nov 01, 2003 at 10:13:11PM +0100:

fetchmail 6.2.5 brought a new fastuidl option, which added a lot of
baggage and deems itself unsafe (it falls back to "slowuidl" for few
mails or in flush mode) in some places.


For few mails, it falls back to 'slowuidl' because
- linear search is faster than binary search.
- getting UIDs of all mails with a single command is faster than that
  of few mails with multiple commands.

Note that the 'fastuidl' and the 'fetchsizelimit' options were added
keeping in mind a mailbox having 10000 mails where the overheads of
getting all sizes and UIDs right at the start is a waste of bandwidth
when using 'keep' and/or when there is a socket error. These options
however are not very useful when the mailbox is small.

'fastuidl' is disabled with 'flush' only for safety. However, we could
reverse the logic: disable 'flush' when 'fastuidl' is being used!
Since, 'fastuidl' is *not* used every tenth poll (by default), this
would mean that 'flush' would be actually used only every tenth poll.
If this seems reasonable, I could work on this.

CURRENT STATE

1. The current UIDL assumes that new messages show up at the end of the
   list.

2. We can know how many messages are on the server through STAT.  Let
   these be in a variable "totalmsgs".

3. We further know how many UIDLs we have seen, by counting the ids. Let
   these be in a variable "knownids".

FURTHER CONDITIONS

(may be met, haven't looked)

* Any DELE must be accompanied by dropping the corresponding IDs from
  the file AT THE SAME TIME. This ID dropping MUST NEVER be deferred, as
  a server is free to reuse the UID (see RFC-1939) - the entity to which
  the UID had referred is now gone!


On socket errors, servers normally do not update the state of the
mailbox (RFC 1939!). Mails are deleted only on a successful logout.

SUGGESTION


[I have referred to this method as "linear range search" method.]

* In FastUIDL mode, request UIDL for the interval
  [knownids ; totalmsgs].


In POP3, one cannot give ranges in commands (RFC 1939). You can get
either uid of one mail or uids of all mails. As you had mentioned
earlier, pipelining could speed up the processing. (The same was the
issue with 'fetchsizelimit' which has to be forced to 1 for POP3.)

- If ALL returned UIDs except the first are unseen, we know we have all
  mail. Poll completed.

- If ANY of the 2nd to last returned UIDs is known to us, we know that
  at least one is in the interval [ 1 ; knownids ] - either we skipped
  one (temporary error) or the server inserted the new UID into the
  middle of the list, rather than appending it to the end of the list
  (*1). Request UIDL 1-totalmsgs.

- If the 1st returned UID (which we should know) is unseen or mismatches
  the last previously seen UID, we know a previously seen mail has been
  deleted and our counts are off. Request UIDL 1-totalmsgs.


This entire method can be incorporated in binary search also!

- If knownids+1 does not match the first unseen mail (as obtained from
  binary search), force 'slowuidl' in the next poll.

- If any mail after the first unseen mail is seen, force 'slowuidl' in
  the next poll.

However, please read the DISADVANTAGES in the "linear range search"
method that you mention!

ADVANTAGES

* we don't need to probe (binary search) into the "known" range, and we
  can fetch all UIDLs and sizes (through LIST) for new mails in one
  round trip, rather than ld(totalmsgs) + (totalmsgs - knownids) round
  trips.


As mentioned above, it won't be one round trip unless pipelining is
supported. What you are potentially saving here is ld(totalmsgs) only
in the worst case.

* still the speed-up we got from the binary search


Whether a speed-up has occurred depends on how many mails are new. If
there are too many new mails, binary search will get the first unseen
mail using fewer UIDs.

Note that binary search does not require all the UIDs of unseen mail
right at the start. Your "linear range search" does. This will lead to
a loss of bandwidth with resource control options like 'fetchlimit' or
'expunge' where all unseen UIDs are not used in the same poll/session.

Note that binary search itself could be speeded up by using a skew
factor! Instead of using try=(first+last)/2, one could use
try=(first*0.1+last*0.9). This could lead to downloading of even fewer
seen UIDs!

DISADVANTAGES of "linear range search"

- 'fastuidl' will not work with 'limit'. Here, there will always be a
  mismatch once a mail is skipped due to 'limit'.

- If there was a temporary error in the previous poll for a particular
  mail, it will be more likely that the same error will occur for the
  same mail in the next poll. Such a mail will lead to a mismatch of
  ids leading to unnecessary traffic.

The only serious issue is that of new mails being inserted in rather than
appended to the mailbox. To counter this, 'fastuidl' is not done on
every poll. 'slowuidl' is done once every 10 polls just to get the
inserted mails. If the problem occurs in every poll, one could just
disable 'fastuidl' through the option. Note that 'fastuidl' is *not*
disabled by default due to the problem mentioned at the start for
large mailboxes.

-- 
Sunil Shetye.