Re: Chinese-spam filter

On Sat, 12 Feb 2000 21:26:34 -0500 (EST), Rik Kabel <rik(_at_)netcom(_dot_)com>
wrote:

First, why limit this to the body?

  Because many of my samples of Chinese spam have almost entirely
English headers.  The count of low-bit characters in the headers
would swamp the count of high-bit characters in the body, and might
let Chinese spam through.  I'm in a real-life situation which might
be described as a worst-case scenario...
  - I'm keeping my old account at Interlog as a backup.  So there
    is one set of headers at Interlog.
  - I've set that account to forward email to waltdnes(_at_)waltdnes(_dot_)org
    which is *NOT* an ISP.  It's a DNS entry in the servers at
    DomainDirect, a subsidiary of TuCOWS.
  - DomainDirect redirects the email to my actual ISP.
  In other words, I have 2 extra sets of headers, entirely in
English.
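
  To put rough numbers on the swamping effect, here is a minimal
Python sketch (not part of my procmail filter).  The header lines,
the body bytes and the helper name high_bit_ratio are all made up
for illustration; only the 5% threshold comes from the filter.

```python
def high_bit_ratio(data: bytes) -> float:
    """Fraction of bytes with the top bit set (value >= 128)."""
    if not data:
        return 0.0
    return sum(1 for b in data if b >= 0x80) / len(data)

# Hypothetical message: several hops' worth of English headers
# in front of a short body that is mostly high-bit bytes.
headers = b"Received: from mail.example.com by mx.example.org; Sat, 12 Feb 2000\r\n" * 12
body = bytes([0xA4, 0xA4, 0xA4, 0xE5]) * 10 + b" http://example.com/ \r\n"

body_ratio = high_bit_ratio(body)              # body alone
whole_ratio = high_bit_ratio(headers + body)   # headers included

print(f"body only:      {body_ratio:.1%}")
print(f"headers + body: {whole_ratio:.1%}")
```

  With these made-up inputs the body alone is far above 5%, while
the same body counted together with the headers falls below it.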

Second, why test for proportions? A message from a Francophone
colleague may well set this off.

  a) I currently don't correspond with anybody in French.

  b) My other recipes have trapped spam from France, because of
     the rule of flagging any non-white-listed (mailing lists,
     etc.) email that is not addressed to one of my valid
     addresses, i.e. that fails to match

     ^To: .*<list of valid addresses I accept email for>

     I am not kidding, it was actually addressed with the French
     equivalent of "To:"...

     A: waltdnes(_at_)interlog(_dot_)com

  c) Code page 437 (the IBM PC/DOS character set, which is a
     superset of us-ascii) has most of its French accented
     characters in the range 128..159, which my current filter
     avoids.

  d) In an earlier message to Liviu Daia in this same thread, I
     outlined how my filter can be tweaked to accommodate
     non-English (not just French) email which uses some high-bit
     characters.

  e) The 5% safety margin will help avoid false positives.  If you
     want to change it, replace "20^1" with "N^1", where
     N = 1/(allowable high-bit ratio).  To allow 4%, make
     N = 1/(.04) = 25.  To allow 10%, make N = 1/(.10) = 10.
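
     The arithmetic in e) can be checked with a few lines of
     Python.  This is only a sketch of the N = 1/ratio formula;
     the function name weight_for_ratio is hypothetical, and the
     result is simply the number that would go in front of "^1".

```python
def weight_for_ratio(allowed_ratio: float) -> int:
    """Return N = 1 / (allowable high-bit ratio), rounded to an integer."""
    if not 0 < allowed_ratio <= 1:
        raise ValueError("allowed_ratio must be in (0, 1]")
    return round(1 / allowed_ratio)

# The three ratios mentioned above: 4%, the current 5%, and 10%.
for pct in (4, 5, 10):
    print(f"allow {pct}% high-bit characters: N = {weight_for_ratio(pct / 100)}")
```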

In any case, don't fool yourself into thinking that this identifies
Chinese spam. It identifies the proportion (or the presence of a
sequence, if you go with Era's suggestion) of high-bit characters
IN A MESSAGE BODY

  ^^^^^^^^^^^^^^^^^ [My emphasis, WD]

The message may be Chinese spam, or a picture of your mother's
new parakeet, or a data file for an important research project.
Procmail can't tell which it is.

  Any email to me with binary files embedded *IN THE MESSAGE BODY*
absolutely deserves to be trashed.  Binary data files, zip files,
pictures (GIF, JPEG, etc.) are supposed to come *AS ATTACHMENTS*.
MIME base64, uuencode, etc. all encode to low-bit characters.
This is due to the historical fact that SMTP was originally
unable to handle high-bit characters.  The sequence is...

  1) Sender's mail client encodes binary file to low-bit characters
  2) Sender's client sends email (body+attachment) to SMTP server
  3) Sender's SMTP server sends email to my ISP's SMTP server
  4) My ISP's SMTP server hands off email (body+attachment) to
     procmail, *WHICH SEES THE ENCODED ATTACHMENT AS ALL LOW-BIT
     CHARACTERS*
  5) Procmail delivers the email to my inbox on my ISP's server
  6) I dial in, log on and download the email to my mail client
  7) *ONLY AT THIS STAGE* is the low-bit-encoded attachment
     decoded back into the original binary file format on my PC.
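
  Steps 1, 4 and 7 can be imitated in Python, as a rough sketch
only.  The "binary file" here is a made-up stand-in covering every
byte value, and is_all_low_bit is a hypothetical helper; base64 is
used as the example encoding.

```python
import base64

def is_all_low_bit(data: bytes) -> bool:
    """True if no byte in data has the top bit set."""
    return all(b < 0x80 for b in data)

# Stand-in for a GIF, zip file, etc.: every byte value 0..255, four times.
binary_file = bytes(range(256)) * 4

# Step 1: the sender's mail client encodes the file.
encoded = base64.encodebytes(binary_file)

# Step 4: what a filter on the server gets to look at.
print("original all low-bit:", is_all_low_bit(binary_file))
print("encoded all low-bit: ", is_all_low_bit(encoded))

# Step 7: the recipient's mail client decodes it again.
print("round trip intact:   ", base64.decodebytes(encoded) == binary_file)
```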

  If anything, a large encoded attachment to a Chinese spam will
swamp the count of high-bit characters, pushing it below the 5%
threshold, and getting past the filter.  False positives are the
least of my worries.  If you doubt me, try this experiment...
  - Set your procmail filter to copy to a separate folder, any
    messages with a special subject like "Binary attachment test"
  - Send an email to yourself with an attachment, and the subject
    that your filter is set to look for
  - Open up the diverted email *NOT WITH A MAIL CLIENT* but with
    a text editor, e.g. vi.  You will see what procmail sees,
    namely a bunch of low-bit gibberish as the attachment.
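
  The dilution effect of a large attachment can also be estimated
without sending any mail.  A minimal Python sketch follows; the
spam text, the 30000-byte attachment and the helper name
high_bit_ratio are invented for illustration, and only the 5%
threshold is taken from the filter.

```python
import base64

def high_bit_ratio(data: bytes) -> float:
    """Fraction of bytes with the top bit set (value >= 128)."""
    if not data:
        return 0.0
    return sum(1 for b in data if b >= 0x80) / len(data)

THRESHOLD = 0.05  # the 5% high-bit ratio discussed above

# Hypothetical spam text: 1000 bytes, all high-bit.
spam_text = bytes([0xB9, 0xFA, 0xBC, 0xCA]) * 250

# Hypothetical 30000-byte binary attachment, base64-encoded by the sender.
attachment = base64.encodebytes(bytes(30_000))

alone = high_bit_ratio(spam_text)
with_attachment = high_bit_ratio(spam_text + b"\r\n" + attachment)

print(f"text alone:      {alone:.1%}  flagged: {alone > THRESHOLD}")
print(f"with attachment: {with_attachment:.1%}  flagged: {with_attachment > THRESHOLD}")
```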

-- 
Walter Dnes <waltdnes(_at_)waltdnes(_dot_)org> http://www.waltdnes.org
SpamDunk Project procmail spamfilters.
A picture is worth a thousand words; unfortunately,
it consumes the bandwidth of ten thousand words.