fetchmail-friends

[fetchmail] Re: POP3 LAST vs UIDL

2003-10-13 00:08:25
Quoting from Matthias Andree's mail on Sun, Oct 12, 2003 at 02:10:04PM +0200:
1. the linear list leads to O(n^2) complexity; looking up a single
   UID is prohibitively expensive. I reduced some of the (function
   call) overhead by making the recursive function iterative instead,
   which is a linear speed-up of three, but it doesn't fix the problem
   that with --keep, fetchmail takes many seconds to find out what mail
   it has seen and not.

I have submitted a patch for fast uidl which does a binary search in
every poll (except the first one, where it does a linear search just
to get all the UIDs) in daemon mode. Please try it out.
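
To illustrate the idea (this is only a rough sketch, not the patch
itself): assume the server keeps messages in arrival order, so the
already-seen messages form a prefix of the message list. Then the first
unseen message can be found by asking for the UID of only O(log n)
messages. The names first_unseen and uid_of are made up for this
example; uid_of(n) stands in for sending "UIDL n" to the server.

```python
def first_unseen(count, uid_of, seen):
    """Return the lowest message number whose UID is not in `seen`.

    Assumes messages 1..k are all seen and k+1..count are all unseen.
    Returns count + 1 when every message has been seen.
    """
    lo, hi = 1, count + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if uid_of(mid) in seen:
            lo = mid + 1      # mid is seen, so the boundary is to the right
        else:
            hi = mid          # mid is unseen, so the boundary is here or left
    return lo


# Example: five messages on the server, the first three already seen.
uids = ["a", "b", "c", "d", "e"]
print(first_unseen(5, lambda n: uids[n - 1], {"a", "b", "c"}))
```

If the prefix assumption does not hold (for example, a seen message
was deleted by another client and the numbering shifted), a linear
pass would still be needed.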

2. Eric is chary about touching UID code, and he's probably right, it's
   delicate equipment.

Well, I am willing to simplify the current code. In fact, the fast
uidl patch also fixes a few bugs in the UID code.

   As suggested before, I'll repeat that there should be one UID file
   per account - that is, (user, server) tuple, so we don't need to
   worry about swapping and saving. Saves memory as well. There are
   several approaches:
   * use a database (BerkeleyDB, GDBM),
   * use a flat text file, but read it into a hash or rbtree.
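
A minimal sketch of the second approach, assuming a flat file with one
UID per line (the file layout and the function names are assumptions
for illustration only). Reading the file into a hash set makes each
lookup O(1) on average instead of a walk over a linear list.

```python
def load_seen_uids(lines):
    """Build a hash set of UIDs from an iterable of text lines."""
    return {line.strip() for line in lines if line.strip()}


def unseen(server_uids, seen):
    """Return the server UIDs that are not in the seen set, in order."""
    return [uid for uid in server_uids if uid not in seen]


# Example: two UIDs stored on disk, three reported by the server.
seen = load_seen_uids(["uid-1\n", "uid-2\n"])
print(unseen(["uid-1", "uid-2", "uid-3"], seen))
```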

Well, if you want IMAP UIDs also, then the tuple should include the
mailbox name as well! I am not sure what filename convention you are
looking at. Though the memory saving part is right, the filenames
could be lengthy if you generate filenames based on the tuple. Also,
what if some more information is to be included in the tuple?

I suggest the following format for each id file (the filename will be
ignored in this format). Each id file will have a parameter section
which holds the relevant information about the mailbox.

server=my.mailserver.com
user=myaccountname
protocol=imap
mailbox=themailboxname          # IMAP only, if the mailbox is not "INBOX"
uidvalidity=themailboxid        # IMAP only, if this changes clear the uid list
slowuidl=yes                    # POP3 only, the ids are based on slow uidl

This should be followed by ids with timestamps (the time at which the
mail was downloaded; this could be then used for the "delete after X
days") and parameters specific to each id.

id1 1066026931
id2 1066026932
id3 1066026933

The optional parameters could include information specific to some
ids.

id4 1066026934 deleted=yes      # this mail was deleted, but did not
                                # get expunged, possibly due to socket error.

id5 1066026935 errors=3         # there was an error in downloading
                                # this mail in the last 3 polls. If
                                # the count reaches a limit (say 5),
                                # this mail will be skipped over. Such
                                # mails can then be downloaded by
                                # increasing the limit through a
                                # command line option.

id6 1066026936 size=1000000     # this mail of size 1 MB was
                                # skipped over because of 'limit'.
                                # This mail will be skipped over
                                # unless the 'limit' option is changed
                                # to allow this mail.
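
To check that the format is easy to read back, here is a rough parser
sketch. It rests on assumptions that are not settled above: a "#"
preceded by whitespace (or at the start of a line) begins a comment,
parameter lines are a single key=value token, values contain no
spaces, and id lines are "id timestamp [key=value ...]". The function
names are made up for this example.

```python
import re


def parse_id_file(text):
    """Split an id file into (params, ids) dictionaries."""
    params, ids = {}, {}
    for raw in text.splitlines():
        # Drop comments, including comment-only continuation lines.
        line = re.sub(r"(^|\s)#.*$", "", raw).strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) == 1 and "=" in fields[0]:
            key, value = fields[0].split("=", 1)   # parameter section
            params[key] = value
        else:
            uid, stamp, *extras = fields           # id line
            entry = {"time": int(stamp)}
            entry.update(e.split("=", 1) for e in extras)
            ids[uid] = entry
    return params, ids


def older_than(ids, days, now):
    """Return the ids whose timestamp is more than `days` days before `now`."""
    cutoff = now - days * 86400
    return [uid for uid, entry in ids.items() if entry["time"] < cutoff]
```

older_than() shows how the timestamps could drive a "delete after X
days" option: the caller passes the current time and gets back the ids
that are due for deletion.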

Also, once LAST is removed, the issue of TOP-vs-RETR is meaningless.

Sort of. TOP still has one potential use: aid filtering. Assume we're
running with an antispam list or a potential future "policy" extension
program(1) (that would be a program that is shown the mail headers and
then says ACCEPT, BOUNCE or DISCARD). IMAP4 allows retrieving headers
only, and so does POP3. With "TOP 1234 0", we'd peek at the headers,
pass these to the policy extension and if it says BOUNCE or DISCARD, we
can drop the mail without saying RETR. Such a mechanism is very useful
when a mail virus or spam can be told from the header already. Of
course, such bandwidth reduction is only useful if the mail is large
enough that download time outweighs transit time(2).
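
A sketch of how such a header peek could look, only to make the flow
concrete. The conn argument is assumed to behave like Python's
poplib.POP3 (top() and retr() returning a (response, lines, octets)
tuple), and policy is a hypothetical callable that returns "ACCEPT",
"BOUNCE" or "DISCARD"; neither is an existing fetchmail interface.

```python
def fetch_with_policy(conn, msgnum, policy):
    """Peek at the headers with TOP and issue RETR only on ACCEPT.

    Returns (verdict, message_bytes_or_None).
    """
    # "TOP msgnum 0" asks for the headers and zero lines of the body.
    _resp, header_lines, _octets = conn.top(msgnum, 0)
    verdict = policy(b"\r\n".join(header_lines))
    if verdict != "ACCEPT":
        return verdict, None          # mail dropped without RETR
    _resp, lines, _octets = conn.retr(msgnum)
    return verdict, b"\r\n".join(lines)
```

What to do on BOUNCE or DISCARD beyond skipping RETR (deleting the
mail, generating a bounce) is left open here.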

Well, I did not mean from the header peeking point of view. TOP is an
optional command in RFC 1725. So, fetchmail should always use 'RETR n'
instead of 'TOP n some-large-number' when retrieving the header+body.

I am not sure if the pop3_slowuidl() code should be kept. Is there
anyone using the slow uidl method which downloads the headers of all
mails and uses the message-id as the UID?
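
For readers unfamiliar with the method, the following sketch shows the
general idea and why it is slow: one header download per message on
every poll. It is not the pop3_slowuidl() code. It assumes the server
supports TOP and that conn behaves like Python's poplib.POP3; the
function name is made up for this example.

```python
from email.parser import BytesParser


def slow_uidl(conn, count):
    """Map message numbers 1..count to their Message-ID header (or None)."""
    parser = BytesParser()
    ids = {}
    for n in range(1, count + 1):
        # One round trip per message: fetch headers only.
        _resp, lines, _octets = conn.top(n, 0)
        headers = parser.parsebytes(b"\r\n".join(lines), headersonly=True)
        ids[n] = headers.get("Message-ID")
    return ids
```

Mails without a Message-ID header come back as None, so a real
implementation would need some fallback for them.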

I have been running "proto pop3 uidl" for ages. ALL my upstream servers,
without any exception, support UIDL. That is a standard service users
should expect; if their ISP doesn't support POP3 + UIDL, the user should
complain - or switch ISP.

Anyone for slow uidl (by default)? I recommend that it should be used
only when an option like 'slowuidl' is specified.

-- 
Sunil Shetye.