Re: Attempts at establishing harmful conventions


On Dec 01 2004, Bruce Lilly wrote:


I don't think it's fair to single out IMAP; the issues that you mention
are far more germane to a protocol like POP3, where there are no
mechanisms for server-based searching/sorting/filtering or for
retrieval of message components.  IMAP has mechanisms which
could be extended to handle statistical filtering etc., with the
same set of benefits that are characteristic of IMAP's other
features (e.g. access to the same message store -- and in the
case of Bayesian filtering, to the same set of data -- from any
client anywhere).  POP3, on the other hand, doesn't have the
basic server-side processing that IMAP has, which would be a prerequisite
to implementation of specific server-side processing such as
statistical filtering.


I seem to be taken as more critical than I intended today, both on this and
on the Outlook Express comment I made to Keith just now. My intention
was to give specific concrete examples rather than generalities.

IMAP is not the issue. As a mail system, it's far superior to POP3,
except in simplicity. I wanted to illustrate an issue with Bayesian
filtering which was a response to Steve Dorner's comment about
detecting a change of topic within a thread. Perhaps a more abstract
description is more appropriate.

Bayesian analysis works best with full messages (or the human readable
text parts at any rate). For reasons of performance and
responsiveness, many mail user agents separate the mail headers
from the body. This allows, for example, messages to be displayed
quickly in alphabetical order by subject, an operation which need
only involve the mail headers. On-the-fly Bayesian analysis is not
well suited to operations which naturally involve only the mail
headers, so as part of the user interface it has certain natural
limitations. Precomputed Bayesian analysis can go stale when many
new atypical messages are received.
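
To make the header/body split concrete, here is a minimal Python sketch
using only the standard library's email package. The sample messages and
the function names (subjects_sorted, body_tokens) are made up for
illustration; this is not how any particular MUA does it.

```python
from email import message_from_string
from email.parser import HeaderParser

# Two made-up messages, for illustration only.
RAW_MESSAGES = [
    "Subject: Zebra crossing\nFrom: a@example.org\n\nBody text about zebras.\n",
    "Subject: Apple pie\nFrom: b@example.org\n\nBody text about baking.\n",
]


def subjects_sorted(raw_messages):
    """Sort by subject; only the headers are parsed, the body is not examined."""
    parser = HeaderParser()
    headers = [parser.parsestr(raw, headersonly=True) for raw in raw_messages]
    return sorted(h["Subject"] for h in headers)


def body_tokens(raw):
    """A statistical filter needs the body, so the whole message must be on hand."""
    msg = message_from_string(raw)
    return msg.get_payload().lower().split()


sorted_subjects = subjects_sorted(RAW_MESSAGES)
first_body = body_tokens(RAW_MESSAGES[0])
```

The first function could run against a header cache alone; the second
cannot, which is the mismatch described above.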

(Which is not to say Bayesian analysis is slow: my own implementation
[dbacl] handles about 100 messages per second on an oldish
Pentium 3/500 MHz. But that still means over a minute and a half for
10,000 messages.)

The complexity of a Bayesian system is similar to the complexity of
implementing and maintaining a full text index, except a full text
index is universal while a Bayesian system works much better if it is
user specific.  In particular, this means that while a full text index
is easily distributed, a statistical database works best when centralized
as near the user as possible, who would ideally carry his wordlists
with him. 
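
As a rough illustration of what "carrying his wordlists" could look like,
here is a small Python sketch of a per-user naive Bayes classifier whose
word counts can be dumped to JSON and reloaded elsewhere. It is a textbook
multinomial model with add-one smoothing, not a description of dbacl; the
class name UserWordlists, the JSON layout and the sample categories are all
assumptions made for the example.

```python
import json
import math
from collections import Counter


class UserWordlists:
    """Per-user word counts per category (hypothetical format, not dbacl's)."""

    def __init__(self):
        self.counts = {}       # category -> Counter of word frequencies
        self.docs = Counter()  # category -> number of messages learned

    def learn(self, category, text):
        self.counts.setdefault(category, Counter()).update(text.lower().split())
        self.docs[category] += 1

    def classify(self, text):
        """Return the category with the highest smoothed log-probability."""
        vocab = set()
        for words in self.counts.values():
            vocab.update(words)
        total_docs = sum(self.docs.values())
        best, best_score = None, -math.inf
        for category, words in self.counts.items():
            n = sum(words.values())
            score = math.log(self.docs[category] / total_docs)
            for w in text.lower().split():
                # Add-one smoothing so unseen words do not zero the score.
                score += math.log((words[w] + 1) / (n + len(vocab)))
            if score > best_score:
                best, best_score = category, score
        return best

    def dumps(self):
        """Serialize the wordlists so the user can take them along."""
        return json.dumps({"counts": self.counts, "docs": self.docs})

    @classmethod
    def loads(cls, data):
        raw = json.loads(data)
        obj = cls()
        obj.counts = {k: Counter(v) for k, v in raw["counts"].items()}
        obj.docs = Counter(raw["docs"])
        return obj


lists = UserWordlists()
lists.learn("spam", "cheap pills buy now")
lists.learn("work", "meeting agenda for monday")
portable = lists.dumps()                  # what the user would carry around
restored = UserWordlists.loads(portable)  # e.g. on another machine
```

The counts are specific to whoever trained them, which is why a shared
server-side copy is a poorer fit than it would be for a full text index.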

Which is why I thought IMAP illustrates the difficulties quite well.

I mentioned POPFile at some point, but it really is a hack due to its
location as a POP3 proxy rather than an MUA plugin, which forces the
authors to essentially reimplement an MUA for managing/learning message
categorizations. BTW, they also have an IMAP module, which tries
to keep in sync with changes happening on the server, but again purely for
managing/relearning message categorizations, with consequent network traffic.

While I'm plugging the competition, I might as well mention the following
internet survey that POPFile's author is running on how much spam people get:

http://getpopfile.org/cgi-bin/start.cgi

Laird.