Cormac O'Brien wrote:
What type of features do people use for Bayesian filters? Most people talk
about looking at word tokens. Does anyone use character tokens? I would be
interested to know whether using words or using characters would give the
best filtering results.
By character tokens, do you mean each individual character of a message?
It's an interesting idea for simplifying the filtering of bloated HTML
spam with excessive numbers of images and colored text, but that kind of
spam can already be filtered easily. For spam that uses normal words and
only mentions a URL or email address once, it is not likely to be
effective. The only difference in character distribution that would
really be noticeable is the increased proportion of uppercase
characters noted by Graham.
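
To make the distinction concrete, here is a rough Python sketch of the
two tokenizations plus the uppercase proportion as a single number. The
function names and the word regex are my own assumptions for
illustration, not taken from any particular filter.

```python
import re

def word_tokens(text):
    """Split a message into lowercase word tokens (illustrative regex)."""
    return re.findall(r"[a-z0-9'$-]+", text.lower())

def char_tokens(text):
    """Treat every non-whitespace character as its own token."""
    return [c for c in text if not c.isspace()]

def uppercase_ratio(text):
    """Fraction of alphabetic characters that are uppercase (0.0 if none)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if c.isupper()) / len(letters)
```

Single characters like 'e' or 't' turn up in nearly every message, spam
or not, so on their own they carry little evidence. The uppercase ratio
is one summary of the character distribution that might still separate
the two classes.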
Another alternative would be for filter users or developers to dump in
any non-word 'feature' they can think of that has a high probability of
indicating spam or non-spam, and ask their Bayesian filters to consider
it along with the normal content tokens. From what I've heard, there is
a way to use SpamAssassin rules as meta-tokens in its Bayesian engine
rather than just to create a score.
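
As a sketch of what I mean by non-word features, the Python below adds
synthetic tokens next to the ordinary word tokens and feeds both to a
toy naive Bayes scorer. Everything here is hypothetical: the META:
token names, the 0.5 uppercase threshold, and the TinyBayes class are
assumptions for illustration, and this is not how SpamAssassin actually
implements it.

```python
import math
import re
from collections import Counter

def extract_tokens(text):
    """Word tokens plus hypothetical meta-tokens for non-word features."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    letters = [c for c in text if c.isalpha()]
    # Assumed threshold: more than half of the letters are uppercase.
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        tokens.append("META:MOSTLY_UPPERCASE")
    if re.search(r"https?://", text, re.IGNORECASE):
        tokens.append("META:HAS_URL")
    return tokens

class TinyBayes:
    """Toy naive Bayes over token presence, with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each token once per message (presence, not frequency).
        self.counts[label].update(set(extract_tokens(text)))
        self.docs[label] += 1

    def spam_log_odds(self, text):
        """Positive means the message looks more like spam than ham."""
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in set(extract_tokens(text)):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score
```

The meta-tokens contain uppercase letters and a colon, so they cannot
collide with the lowercased word tokens. The filter then learns a
probability for each of them from the training mail in exactly the same
way as it does for words.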
Philip Miller
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg