Re: Attempts at establishing harmful conventions


On Dec 01 2004, Bruce Lilly wrote:

On Tue November 30 2004 21:46, Laird Breyer wrote:

Unfortunately, retrieving and analysing full bodies are expensive
operations. Think, for example, of an IMAP server where the MUA works
with headers only and downloads bodies only if absolutely necessary.


IMAP has a number of features that operate on the server's
message store as directed by the client, including searching
capability.  There's also Sieve-based filtering.


IMAP is great for many things, but statistical filtering probably
isn't among them. A typical Bayesian filter maintains one or more
word lists, each storing a frequency and other quantities for every
word. A word list is updated in complex ways whenever "learning"
occurs, and typical current implementations produce word lists
anywhere from a few hundred kilobytes to tens of megabytes or more.
The list must also be available for fast retrieval, because incoming
mail has many words to look up, and most words occur only once, which
makes size reductions difficult.
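
To make the word list idea concrete, here is a rough sketch of a toy
naive Bayes filter. It is not how any particular filter works: the
class name WordList, the tokenizer, and the add-one smoothing are all
assumptions chosen for brevity. It does show why every learning step
touches the list and why scoring needs one lookup per word.

```python
import math
import re
from collections import Counter


class WordList:
    """Toy per-word frequency table for two classes (hypothetical design)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase runs of letters, digits and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def learn(self, text, label):
        # "Learning" updates one counter entry per word in the message.
        self.counts[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_log_odds(self, text):
        # Positive result: more spam-like; negative: more ham-like.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        totals = {k: sum(c.values()) for k, c in self.counts.items()}
        score = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for word in self.tokenize(text):
            # One lookup per class for every word of the incoming message.
            p_spam = (self.counts["spam"][word] + 1) / (totals["spam"] + vocab + 1)
            p_ham = (self.counts["ham"][word] + 1) / (totals["ham"] + vocab + 1)
            score += math.log(p_spam / p_ham)
        return score
```

Even in this toy form the table grows with every new word seen during
training, and real filters typically store more per word than a bare
count.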

It is certainly conceivable that an IMAP server would implement a
statistical component, but there are drawbacks, among them:

1) The training data to construct and maintain word lists would be
restricted to mail accessible by the IMAP server, e.g. only the
folders it manages for the user.

2) If a user has several mail accounts, with a different Bayesian
filter set up on each, all the filters would be trained and maintained
independently, with varying rates of success. It is much easier to let
a single filter learn from and classify all the mail. Statistically,
this also gives a bigger dataset to work with, and therefore probably
better filtering accuracy.
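
As a small illustration of point 2, pooling the training data amounts
to summing the per-word counts from every account into one table. The
function name and the per-account counters below are hypothetical; a
real filter would store more than raw counts.

```python
from collections import Counter


def merge_word_counts(per_account_counts):
    """Sum per-word counts from several accounts into a single table."""
    merged = Counter()
    for counts in per_account_counts:
        merged.update(counts)  # adds counts for words already present
    return merged
```

A single filter trained on the merged table sees every occurrence of a
word, rather than only the fraction that arrived at one account.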

Laird.

P.S. I stand corrected on the Comments field.