
Re: Chinese-spam filter

2000-02-12 19:38:03
Walter Dnes wrote:

 >     :0BD
 >     * -1^1 .
 >     *  2^1 =[0-9A-F][0-9A-F]
 >     * 20^1 [################################] #160-191 
 >     * 20^1 [################################] #192-223 
 >     * 20^1 [################################] #224-255 
 >     * 20^1 =[A-F][0-9A-F]
 >     { CNSPAM=spam2 }

  As Ruud mentions in a later email, this is my idea, so here are
my reasons for doing what I did.

First, why limit this to the body?

Second, why test for proportions? A message from a Francophone
colleague may well set this off.
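
A rough sketch of the arithmetic, in Python rather than procmail so it
can be run in isolation. It is an approximation under stated
assumptions: the body is raw 8-bit text, the quoted-printable
conditions are ignored, and `.` is taken to match every byte except
newline. The name proportion_score and the sample sentences are made up
for illustration. With these weights the score turns positive once
more than about one byte in twenty is in the 160-255 range.

```python
def proportion_score(body: bytes) -> int:
    """Approximate the quoted recipe's score for a raw 8-bit body.

    Assumptions (not from the original recipe text): quoted-printable
    conditions are skipped, and '.' matches every byte except newline.
    """
    # "-1^1 ." : one point off per character
    chars = sum(1 for b in body if b != 0x0A)
    # "20^1 [...]" x3 : twenty points per byte in the range 160..255
    high = sum(1 for b in body if b >= 160)
    return 20 * high - chars


# Hypothetical samples: ordinary French prose in Latin-1, and plain ASCII.
french = "Répétez après l'été, élève.".encode("latin-1")
english = b"Repeat after the summer, pupil."

print(proportion_score(french) > 0)   # positive score means the recipe would fire
print(proportion_score(english) > 0)
```

The French sample is deliberately dense with accents, but it shows how
little it takes to cross the break-even point.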

Perhaps better would be to look for long (four-character)
sequences of high-bit characters, as Era suggested in
http://www.xray.mpe.mpg.de/mailing-lists/procmail/2000-01/msg00040.html
on January 5. This has the possible advantage of being less expensive
to run, and certainly is easier to follow. This is as accurate in
identifying candidates as the scoring recipe, and likely has a lower
false positive rate.
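
The sequence test can be sketched the same way, again in Python rather
than as a procmail recipe. This is not Era's actual recipe; it assumes
"high-bit" means any byte from 128 to 255, and the name
has_high_bit_run and the sample strings are hypothetical.

```python
import re

# Four consecutive bytes with the high bit set (assumed range: 128..255).
HIGH_RUN = re.compile(rb"[\x80-\xff]{4}")


def has_high_bit_run(body: bytes) -> bool:
    """Return True if the body contains a run of four high-bit bytes."""
    return HIGH_RUN.search(body) is not None


# Hypothetical samples. In GB2312 both bytes of every Chinese character
# have the high bit set, so two adjacent characters already form a run.
chinese = "中文".encode("gb2312")
french = "Répétez après l'été, élève.".encode("latin-1")

print(has_high_bit_run(chinese))
print(has_high_bit_run(french))
```

Accented Latin-1 prose rarely puts four accented letters in a row, so
it passes, while any two adjacent GB2312 characters trip the test. So
does arbitrary binary data that happens to contain such a run.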

In any case, don't fool yourself into thinking that this identifies
Chinese spam. It identifies the proportion (or the presence of a
sequence, if you go with Era's suggestion) of high-bit characters in
a message body. The message may be Chinese spam, or a picture of your
mother's new parakeet, or a data file for an important research project.
Procmail can't tell which it is.

-- 
Rik Kabel          Old enough to be an adult              
rik(_at_)netcom(_dot_)com
