[Asrg] Re: Feature selection for Bayesian filters

Cormac O'Brien wrote:

What type of features do people use for Bayesian filters? Most people talk
about looking at word tokens. Does anyone use character tokens? I would be
interested to know whether using words or using characters would give the
best filtering results.

By character tokens, do you mean each individual character of messages?

It's an interesting idea to simplify the filtering of bloated HTML spam
with excessive numbers of images and colored text, but that can
already be filtered easily. For spam that uses normal words and only
mentions a URL or email address once, this is not likely to be
effective. The only difference in character distribution that would
really be noticeable would be an increased proportion of uppercase
characters, as noted by Graham.

Another alternative would be for filter users or developers to dump in
any non-word 'feature' they can think of that would have a high
probability of indicating spam or not, and ask their Bayesian filters to
consider it along with the normal content tokens. From what I've heard,
there is a way to use SpamAssassin rules as meta-tokens in its Bayesian
engine rather than just to create a score.
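
To make the comparison concrete, here is a rough sketch of the three
kinds of features discussed above: word tokens, single-character tokens,
and non-word 'meta' features fed in as ordinary tokens. Everything in it
is an assumption made for illustration: the function names, the "META:"
prefix, the 0.5 uppercase threshold, and the URL check do not come from
SpamAssassin or any other real filter.

```python
import re
from collections import Counter


def word_tokens(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def char_tokens(text):
    """Treat every non-whitespace character as its own token."""
    return [c for c in text if not c.isspace()]


def uppercase_ratio(text):
    """Fraction of alphabetic characters that are uppercase (0.0 if none)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def meta_tokens(text):
    """Hypothetical non-word features, emitted as ordinary tokens.

    The threshold and the token names are illustrative assumptions.
    """
    tokens = []
    if uppercase_ratio(text) > 0.5:
        tokens.append("META:mostly_uppercase")
    if re.search(r"https?://", text):
        tokens.append("META:has_url")
    return tokens


def extract_features(text):
    """Count word tokens and meta-tokens together, as one feature bag."""
    return Counter(word_tokens(text) + meta_tokens(text))
```

A filter would then estimate a spam probability for "META:has_url" the
same way it does for any word, so a hand-picked feature carries only as
much weight as the training mail gives it.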


Philip Miller



_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg