Re: [fetchmail]fetchmail fork?

On Tue, May 25, 2004 at 10:00:44AM +0100, Brian Candler wrote:

This is easily done by checking the list of UIDs on each side, and noting
which messages are new and which have gone, and propagating changes
appropriately. Once we've retrieved a message we can always calculate the
MD5 hash of the headers, so we can have an optional duplicate-removal system
in the core to prevent the same message propagating twice even if its UID
changes. For POP3, we would not support 'LAST' at all; anyone who wants to
leave-mail-on-server on a box which doesn't support UIDL would have to use
this, or else use POP3 in the way it was intended (i.e. read and delete).


Actually, the MD5 (of certain headers or the whole message) can be more
important than just duplicate-removal; it can be the key which associates a
particular message on server A with the corresponding message on server B.
So:

- We use UID as a way to detect "new" or "removed" mail in a mailbox
- We use the MD5 hash as a way to decide whether to copy a message from
  A to B, or to locate the corresponding message to delete from B

Not only does this solve the problem with loss of UIDVALIDITY on an IMAP
server, it also solves the problem if fetchmail's own state file is lost,
when syncing two IMAP servers.
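
To make the "MD5 of certain headers" idea concrete, here is a minimal sketch in Python. The function name message_key and the particular header list are my own assumptions for illustration; which headers to hash is left open above.

```python
import hashlib
from email import message_from_bytes

# Hypothetical header selection; the proposal only says "certain headers".
KEY_HEADERS = ("Message-ID", "Date", "From", "To", "Subject")


def message_key(raw_message: bytes, headers=KEY_HEADERS) -> str:
    """Return an MD5 hex digest computed over selected headers of a raw message."""
    msg = message_from_bytes(raw_message)
    digest = hashlib.md5()
    for name in headers:
        # A missing header contributes an empty value, so the key stays defined.
        value = str(msg.get(name, ""))
        # Collapse folding and runs of whitespace so re-wrapped headers hash alike.
        normalized = " ".join(value.split())
        digest.update(name.lower().encode("ascii"))
        digest.update(b":")
        digest.update(normalized.encode("utf-8", "replace"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Hashing only headers that a server should not rewrite (and skipping things like Received: lines) is what would let the same message get the same key on both servers, but that is a design choice, not something settled here.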

For example, our state file could contain:

host    folder  uid     md5
----    ------  ---     ---
srvA    inbox   1234.1  f7173b688dbc8bb5261c485c227b8218
srvB    inbox   704.12  f7173b688dbc8bb5261c485c227b8218

After connecting to both servers and bringing the UID lists up to date,
let's say we find a new entry on A:

srvA    inbox   1234.2

We download this message, and calculate its MD5 checksum:

srvA    inbox   1234.2  ae16ed8abf51dbc239c743365ede7f51

Since no message already exists on srvB with this MD5, we copy it there.

Equally, if we noticed that a UID entry on srvA had vanished, then we'd find
any messages with the same MD5 in srvB's list, and delete them.

So after this, the state file looks like:

srvA    inbox   1234.1  f7173b688dbc8bb5261c485c227b8218
srvA    inbox   1234.2  ae16ed8abf51dbc239c743365ede7f51
srvB    inbox   704.12  f7173b688dbc8bb5261c485c227b8218
srvB    inbox   704.13  ae16ed8abf51dbc239c743365ede7f51
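
The walk-through above can be sketched as a planning step that compares the saved state with what each server holds now. This is only an illustration: the name plan_sync and the {host: {uid: md5}} layout are assumptions, it is scoped to a single folder, and it presumes the MD5s for any new UIDs have already been computed.

```python
def plan_sync(state, current, src, dst):
    """Decide what to copy to dst and what to delete from dst.

    state   -- {host: {uid: md5}} as saved after the previous run
    current -- {host: {uid: md5}} as seen on the servers now
    Returns (to_copy, to_delete): (uid, md5) pairs on src, and uids on dst.
    """
    old_src = state.get(src, {})
    new_src = current.get(src, {})
    dst_now = current.get(dst, {})
    dst_md5s = set(dst_now.values())

    to_copy = []
    queued = set()
    for uid, md5 in new_src.items():
        # A UID we have not seen before is "new"; copy it only if dst lacks that MD5.
        if uid not in old_src and md5 not in dst_md5s and md5 not in queued:
            to_copy.append((uid, md5))
            queued.add(md5)

    # Assumption: if the same MD5 still exists on src under another UID
    # (e.g. after the UIDs were renumbered), the message has not really gone.
    surviving = set(new_src.values())
    to_delete = []
    for uid, md5 in old_src.items():
        if uid not in new_src and md5 not in surviving:
            to_delete.extend(d_uid for d_uid, d_md5 in dst_now.items() if d_md5 == md5)
    return to_copy, to_delete


F7 = "f7173b688dbc8bb5261c485c227b8218"
AE = "ae16ed8abf51dbc239c743365ede7f51"
state = {"srvA": {"1234.1": F7}, "srvB": {"704.12": F7}}
current = {"srvA": {"1234.1": F7, "1234.2": AE}, "srvB": {"704.12": F7}}
print(plan_sync(state, current, "srvA", "srvB"))
```

Running the same function in the other direction (src and dst swapped) would cover changes made on srvB.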

Now, in the event that the state file is lost, fetchmail will download both
messages from both servers. But it will find that the MD5s are the same, so
it will not cross-copy from one to the other. (Hmm, this assumes it does all
the downloading of new messages before it does any copying, which could
require a lot of temporary disk space. It could just pull the headers, I
suppose.)

The assumption behind this is that you only ever want one copy of a message
in a folder. If server A has two identical copies of the message, only one
will find its way to server B; and if server B has two identical copies of a
message, and you delete it from server A, both copies will be deleted from
server B.

Is that acceptable? (I think it's OK as long as it's scoped to a single
folder.) Otherwise, instead of holding MD5s, we'd need to hold some direct
link between messages' UIDs, e.g. srvA 1234.1 <--> srvB 704.12
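
For comparison, the direct-link alternative might look something like the sketch below. The link table layout and the name deletions_for are hypothetical; the point is only that a deletion on srvA then removes exactly one message on srvB, even if srvB holds other identical copies.

```python
# Hypothetical link table: each message on srvA is paired with exactly one on srvB.
links = {
    ("srvA", "inbox", "1234.1"): ("srvB", "inbox", "704.12"),
}


def deletions_for(vanished, links):
    """Return the linked peer messages for source messages that have vanished."""
    return [links[key] for key in vanished if key in links]


# An unlinked identical copy on srvB (say 704.14) would be left alone.
print(deletions_for([("srvA", "inbox", "1234.1")], links))
```

The cost is that these links cannot be rebuilt once the state file is lost, which is exactly what the MD5 approach avoids.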

There are other interesting things you can do with plugins in this
architecture. For example, you could have IMAP folders 'spam' and 'notspam',
and you use an output plugin which updates your Bayes filters.

Cheers,

Brian.