ietf-asrg

[Asrg] Re: Feature selection for Bayesian filters

2003-07-02 08:44:40
Cormac O'Brien wrote:

What type of features do people use for bayesian filters? Most people talk
about looking at word tokens. Does anyone use character tokens? I would be
interested to know whether using words or using characters would give the
best filtering results.

By character tokens, do you mean each individual character of a message? It's an interesting idea for simplifying the filtering of bloated HTML spam with excessive numbers of images and colored text, but that kind of spam can already be filtered easily. For spam that uses normal words and mentions a URL or email address only once, it is not likely to be effective. The only difference in character distribution that would really be noticeable is the increased proportion of uppercase characters noted by Graham.

Another alternative would be for filter users or developers to add any non-word 'feature' they can think of that has a high probability of indicating spam or non-spam, and have their Bayesian filters consider it along with the normal content tokens. From what I've heard, there is a way to use SpamAssassin rules as meta-tokens in their Bayesian engine rather than just to create a score.
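To make the distinction between the three kinds of feature concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the `META:` prefix, the 0.3 uppercase threshold, and the word-splitting pattern are all hypothetical choices, and it does not reproduce how SpamAssassin or any other filter actually builds its tokens.

```python
import re
from collections import Counter


def word_tokens(text):
    """Word-level features: lowercase runs of letters, digits, $, ' and -."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def char_tokens(text):
    """Character-level features: every non-whitespace character, case kept."""
    return [c for c in text if not c.isspace()]


def meta_tokens(text, upper_threshold=0.3):
    """Hypothetical non-word features, emitted as synthetic tokens.

    Each token can be fed to a Bayesian filter alongside the ordinary
    content tokens, so the filter learns its spam probability itself.
    """
    tokens = []
    letters = [c for c in text if c.isalpha()]
    if letters:
        upper_ratio = sum(c.isupper() for c in letters) / len(letters)
        if upper_ratio > upper_threshold:  # assumed cutoff, not a tuned value
            tokens.append("META:mostly-uppercase")
    if re.search(r"https?://", text, re.IGNORECASE):
        tokens.append("META:has-url")
    return tokens


def features(text):
    """Content tokens plus meta-tokens, as one combined feature list."""
    return word_tokens(text) + meta_tokens(text)


if __name__ == "__main__":
    msg = "BUY NOW!!! CLICK http://x.com"
    print(features(msg))
    print(Counter(char_tokens(msg)).most_common(5))
```

The character counts from `char_tokens` say little about a message written in ordinary words, which is the point made above, while a meta-token such as `META:mostly-uppercase` captures the one character-level signal that does stand out.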

Philip Miller



_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg


