procmail

Re: Chinese-spam filter

2000-02-12 13:47:58
On Fri, 11 Feb 2000 16:56:26 +0200 (EET), era eriksson <era(_at_)iki(_dot_)fi>
wrote:

 >     :0BD
 >     * -1^1 .
 >     *  2^1 =[0-9A-F][0-9A-F]
 >     * 20^1 [################################] #160-191 
 >     * 20^1 [################################] #192-223 
 >     * 20^1 [################################] #224-255 
 >     * 20^1 =[A-F][0-9A-F]
 >     { CNSPAM=spam2 }

  As Ruud mentions in a later email, this is my idea, so here are
my reasons for doing what I did.

 > I don't understand why you penalize quoted-printable =AF pairs
 > with two extra points.

  I'm not sure if the following explanation is clear enough, but
I'm doing a "normalized" character count.  The filter is a two-step
process...
  1) it counts the "total number of characters"
  2) it then subtracts 20 times the count of high-bit characters
     in the range 160..255. This allows a 5% safety margin
     for the occasional "Copyright"/"Trademark"/"Registered"
     symbol.  If the safety margin is exceeded, the score is
     positive, and the filter activates.
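
  The arithmetic of the two steps can be sketched in Python. This is
only an illustration of the scoring, not procmail itself: the names
raw_score and is_flagged are made up for the example, and it assumes
the body is raw 8-bit text and that "." does not count newlines.

```python
def raw_score(body: bytes) -> int:
    """Model of the two-step count on raw (non-quoted-printable) text."""
    # Step 1: every character except newline counts -1 (like "* -1^1 .").
    total = sum(1 for b in body if b != 0x0A)
    # Step 2: every byte in the range 160..255 counts +20.
    high = sum(1 for b in body if 160 <= b <= 255)
    return -total + 20 * high


def is_flagged(body: bytes) -> bool:
    # The filter activates only when the final score is positive,
    # i.e. when more than 5% of the characters are high-bit.
    return raw_score(body) > 0


if __name__ == "__main__":
    exactly_5_percent = b"a" * 95 + bytes([169]) * 5
    over_5_percent = b"a" * 94 + bytes([169]) * 6
    print(raw_score(exactly_5_percent), is_flagged(exactly_5_percent))
    print(raw_score(over_5_percent), is_flagged(over_5_percent))
```

  A 100-character body with exactly 5 high-bit characters scores 0 and
passes; a sixth high-bit character pushes the score to 20 and trips
the filter.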

  So how do you define "one character" in step 1) above?  The
actual character CHR(160) counts as one character.  In a spam that
has been autoconverted to "Quoted-Printable" CHR(160) will show up
as the string "=A0".  So the "total character count" 
 * -1^1 .
will give it a weighting of *THREE* characters... oops.

  Meanwhile the high-bit counter
 * 20^1 =[A-F][0-9A-F]
will give it the same weighting as *ONE* (not three) high-bit
characters.  I'm not penalizing.  I'm merely compensating for
the fact that one high-bit character CHR(160) is expanded into
three low-bit characters.  The two steps...
 >     * -1^1 .
 >     *  2^1 =[0-9A-F][0-9A-F]
combine to give the string "=A0" the same weighting (i.e. -1) in
the total count as the actual character CHR(160).  Otherwise,
quoted-printable emails would need up to 3 times as many high-bit
characters to trigger the filter as non-quoted-printable ones.

-- 
Walter Dnes <waltdnes(_at_)waltdnes(_dot_)org> http://www.waltdnes.org
SpamDunk Project procmail spamfilters.
A picture is worth a thousand words; unfortunately,
it consumes the bandwidth of ten thousand words.
