On Sat, 12 Feb 2000 21:26:34 -0500 (EST), Rik Kabel <rik(_at_)netcom(_dot_)com>
wrote:
> First, why limit this to the body?
Because many of my samples of Chinese spam have almost entirely
English headers, the count of low-bit characters in the headers
would swamp the count of high-bit characters in the body, and might
let Chinese spam through. My real-life situation might be described
as a worst-case scenario...
- I'm keeping my old account at Interlog as a backup. So there
is one set of headers at Interlog.
- I've set that account to forward email to waltdnes(_at_)waltdnes(_dot_)org
which is *NOT* an ISP. It's a DNS entry in the servers at
DomainDirect, a subsidiary of TuCOWS.
- DomainDirect redirects the email to my actual ISP.
In other words, I have 2 extra sets of headers, entirely in
English.
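As a rough illustration of that dilution (in Python rather than
procmail; the sample header line, the sample body, and the helper name
high_bit_ratio are all invented for this sketch, not taken from the
actual recipe):

```python
def high_bit_ratio(data: bytes) -> float:
    """Fraction of bytes with the high bit set (value >= 128)."""
    if not data:
        return 0.0
    return sum(1 for b in data if b >= 0x80) / len(data)


# Hypothetical message: several sets of English (low-bit) headers ...
headers = b"Received: from mail.example.com by relay.example.org\r\n" * 40
# ... followed by a short body that is mostly high-bit bytes.
body = bytes([0xB0, 0xA1]) * 40 + b" click here now "

message = headers + b"\r\n" + body

body_ratio = high_bit_ratio(body)
whole_ratio = high_bit_ratio(message)

print(f"body only:      {body_ratio:.1%}")
print(f"headers + body: {whole_ratio:.1%}")
print("body only trips a 5% threshold:     ", body_ratio > 0.05)
print("whole message trips a 5% threshold: ", whole_ratio > 0.05)
```

With these made-up numbers the body alone is far over 5% high-bit,
while the same body counted together with its headers falls under it.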
> Second, why test for proportions? A message from a Francophone
> colleague may well set this off.
a) I currently don't correspond with anybody in French.
b) My other recipes have trapped spam from France, because of
the rule that flags any non-white-listed (mailing lists,
etc.) email whose headers do not match
^To: .*<list of valid addresses I accept email for>
I am not kidding, it was actually addressed...
A: waltdnes(_at_)interlog(_dot_)com
c) Code page 437 (the original IBM PC character set, an 8-bit
superset of us-ascii rather than us-ascii itself) has most of its
French accented characters in the range 128..159, which my
current filter avoids.
d) I outlined in an earlier message to Liviu Daia in this same
thread how my filter can be tweaked to accommodate non-English
(not just French) email which uses some high-bit characters.
e) The 5% safety margin will help avoid false positives. If you
want to boost it, the formula is to replace "20^1" with "N^1"
where N = 1/(allowable high-bit ratio). To allow 4%, make
N = 1/(.04) = 25. To allow 10%, make N = 1/(.10) = 10.
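The arithmetic behind N can be checked with a few lines of Python. This
is only a sketch: the names weight_for and exceeds are invented here,
and exceeds() ASSUMES the recipe fires when (high-bit count) * N is
greater than the total character count, which is one plausible reading
of the "N^1" weight and not a copy of the actual recipe.

```python
from fractions import Fraction


def weight_for(allowed_ratio: Fraction) -> Fraction:
    """N = 1 / (allowable high-bit ratio), as in the formula above."""
    return 1 / allowed_ratio


def exceeds(high_bit_count: int, total_count: int, n: Fraction) -> bool:
    # ASSUMPTION: the message is flagged when the weighted high-bit count
    # outweighs the total count, i.e. high/total > 1/N.
    return high_bit_count * n > total_count


for percent in (5, 4, 10):
    n = weight_for(Fraction(percent, 100))
    print(f"allow {percent}% high-bit characters -> N = {n}")

n5 = weight_for(Fraction(5, 100))
print(exceeds(6, 100, n5))   # 6% high-bit against a 5% allowance
print(exceeds(5, 100, n5))   # exactly 5% does not exceed the allowance
```

Fractions are used instead of floats so that 1/(.04) comes out as
exactly 25 rather than something like 24.999999999999996.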
> In any case, don't fool yourself into thinking that this identifies
> Chinese spam. It identifies the proportion (or the presence of a
> sequence, if you go with Era's suggestion) of high-bit characters
> IN A MESSAGE BODY
  ^^^^^^^^^^^^^^^^^ [My emphasis, WD]
> The message may be Chinese spam, or a picture of your mother's
> new parakeet, or a data file for an important research project.
> Procmail can't tell which it is.
Any email to me with binary files embedded *IN THE MESSAGE BODY*
absolutely deserves to be trashed. Binary data files, zip files,
pictures (GIF, JPEG, etc.) are supposed to come *AS ATTACHMENTS*.
The usual attachment encodings (base64 under MIME, uuencode, etc.)
all produce low-bit characters. This is due to the historical fact
that SMTP was originally unable to handle high-bit characters. The
sequence is...
1) Sender's mail client encodes binary file to low-bit characters
2) Sender's client sends email (body+attachment) to SMTP server
3) Sender's SMTP server sends email to my ISP's SMTP server
4) My ISP's SMTP server hands off email (body+attachment) to
procmail, *WHICH SEES THE ENCODED ATTACHMENT AS ALL LOW-BIT
CHARACTERS*
5) Procmail delivers the email to my inbox on my ISP's server
6) I dial in, log on and download the email to my mail client
7) *ONLY AT THIS STAGE* is the low-bit-encoded attachment
decoded back into the original binary file format on my PC.
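Steps 1 and 7 of the sequence above can be imitated in Python with the
standard base64 module. This is a sketch only: the variable names are
made up, and bytes(range(256)) * 4 is just a stand-in for a real binary
file.

```python
import base64

# Stand-in for a binary file: every byte value 0..255, repeated.
original = bytes(range(256)) * 4

# Step 1: the sender's mail client encodes the file for transport.
encoded = base64.b64encode(original)

# Steps 2-6: this is the form the data travels in, and the form
# procmail gets to look at.
high_before = sum(1 for b in original if b >= 0x80)
high_after = sum(1 for b in encoded if b >= 0x80)
print("high-bit bytes in the original file:", high_before)
print("high-bit bytes after base64 encoding:", high_after)

# Step 7: only the recipient's mail client turns it back into binary.
decoded = base64.b64decode(encoded)
print("round trip is lossless:", decoded == original)
```

Half of the original bytes have the high bit set; none of the encoded
bytes do, because the base64 alphabet is plain letters, digits, "+",
"/" and "=".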
If anything, a large encoded attachment to a Chinese spam will
swamp the count of high-bit characters, pushing the ratio below the
5% threshold and letting the spam past the filter. False positives
are the least of my worries. If you doubt me, try this experiment...
- Set your procmail filter to copy to a separate folder, any
messages with a special subject like "Binary attachment test"
- Send an email to yourself with an attachment, and the subject
that your filter is set to look for
- Open up the diverted email *NOT WITH A MAIL CLIENT* but with
a text editor, e.g. vi. You will see what procmail sees,
namely a bunch of low-bit gibberish as the attachment.
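For anyone who would rather not send mail, the experiment above can be
approximated with Python's standard email package. This is a hedged
sketch: the addresses, the filename sample.bin, the 64 KiB payload and
the 1000-byte high-bit "spam text" are all invented for illustration,
and the last part simply adds the two byte counts together rather than
building a real Chinese-language message.

```python
from email.message import EmailMessage

# Build a test message with a binary attachment.
msg = EmailMessage()
msg["From"] = "me@example.com"
msg["To"] = "me@example.com"
msg["Subject"] = "Binary attachment test"
msg.set_content("Test message with a binary attachment.")

payload = bytes(range(256)) * 256          # 64 KiB of binary data
msg.add_attachment(payload, maintype="application",
                   subtype="octet-stream", filename="sample.bin")

# This is roughly what a text editor (or procmail) would see.
raw = bytes(msg)
high_bit_in_raw = sum(1 for b in raw if b >= 0x80)
print("attachment is base64 encoded:",
      b"Content-Transfer-Encoding: base64" in raw)
print("high-bit bytes in the raw message:", high_bit_in_raw)

# Hypothetical high-bit spam text of 1000 bytes carrying the same
# attachment: the encoded attachment dominates the character count.
spam_text = bytes([0xB0, 0xA1]) * 500
diluted_ratio = len(spam_text) / (len(spam_text) + len(raw))
print(f"high-bit ratio with the attachment counted: {diluted_ratio:.1%}")
print("still above a 5% threshold:", diluted_ratio > 0.05)
```

The raw message contains no high-bit bytes at all, and a text that is
100% high-bit on its own ends up well under 5% once the encoded
attachment is counted with it.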
--
Walter Dnes <waltdnes(_at_)waltdnes(_dot_)org> http://www.waltdnes.org
SpamDunk Project procmail spamfilters.
A picture is worth a thousand words; unfortunately,
it consumes the bandwidth of ten thousand words.