procmail

Re: How to do high-bit characters in procmail regexp ?

1999-12-02 03:58:19
On Thu, 02 Dec 1999 03:02:07 -0500, Walter Dnes
<waltdnes(_at_)waltdnes(_dot_)org> wrote:
  With all the foreign-language/foreign-character-set spam hitting
now, I figure that a good test for use by English-speaking people
is to flag any email with lots of high-bit characters. So how do we
do it? Here's a try. Note that [X-Y] would really have
byte=>CHR(128) in place of "X" and byte=>CHR(255) in place of "Y".
I don't think they'll transmit too well, so I'm doing pseudocode
here.
 :0HB
 * -40
 * 1^1 [X-Y]
 | formail -A "X-Reject: High-bit character set in email"

The correct syntax here would be -40^0 on the first condition, since
a scoring condition needs the weight^exponent form. Other than that, I
think this ought to work, although I would perhaps prefer something
which counts the high-bit characters as a percentage of the whole. In
Finnish text, for example, accented characters can easily be several
per cent of a message, and a long message might easily be more than
4000 characters (although I find typical messages to be in the range
1500-3000 bytes, headers included), so a fixed threshold of 40 could
flag perfectly legitimate mail.
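
To make the percentage idea concrete, here is a rough sketch in Python
rather than procmail. It is only an illustration of the counting, not a
tested filter: the function names and the 10% cutoff are assumptions of
mine, and the right cutoff would have to be tuned against real mail.

```python
def high_bit_fraction(data: bytes) -> float:
    """Return the share of bytes in data whose value is 128-255."""
    if not data:
        return 0.0
    high = sum(1 for b in data if b >= 0x80)
    return high / len(data)


def exceeds_high_bit_share(data: bytes, threshold: float = 0.10) -> bool:
    """True if the high-bit share is above threshold (hypothetical default)."""
    return high_bit_fraction(data) > threshold
```

Because the result is a ratio, a long message is not penalized simply
for being long, which is the weakness of a fixed count.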

Another problem is that people might MIME-encode their messages, in
which case the high-bit characters travel as plain 7-bit escapes and a
test for raw high-bit bytes never sees them. I get some amount of
Quoted-Printable Chinese spam. My university's Sendmail setup
automatically converts that to 8-bit text, so I never actually see the
QP message, but I imagine most sites don't have a setup like that.
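
As a sketch of what undoing the encoding before counting might look
like, the following uses Python's standard email package. The helper
names are made up for illustration, and this is not what the Sendmail
conversion does internally; it just shows that the high-bit bytes only
become visible once the transfer encoding is removed.

```python
import email


def decoded_body_bytes(raw_message: bytes) -> bytes:
    """Concatenate the payloads of all non-multipart parts, with any
    Content-Transfer-Encoding (quoted-printable, base64) undone."""
    msg = email.message_from_bytes(raw_message)
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True)
        if payload:
            chunks.append(payload)
    return b"".join(chunks)


def high_bit_count(data: bytes) -> int:
    """Count the bytes with values 128-255."""
    return sum(1 for b in data if b >= 0x80)


raw = (
    b"Content-Type: text/plain; charset=big5\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"=A4=A4=A4=E5 ok\r\n"
)
body = decoded_body_bytes(raw)
```

Counting over raw finds no high-bit bytes at all, while counting over
body finds the ones the sender encoded.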

/* era */

-- 
 Too much to say to fit into this .signature anyway: <http://www.iki.fi/era/>
  Fight spam in Europe: <http://www.euro.cauce.org/> * Sign the EU petition