A recipe for filtering foreign characters

2000-01-17 16:24:32
  Because high-bit text may be mangled in transit, I'm sending the
recipe as a zipped file attachment.  Upload it as binary to a unix
system and unzip it there.  If anybody can't get their hands on a
unix unzip, I can email it in another compressed format.  The
following is a pseudo-code representation of my filter; each "#"
stands for a high-bit character in the real thing.  The first
bracketed row is CHR(128)..CHR(159).  I haven't seen anything in
that region, but the filter watches for it anyway.  The next rows
are CHR(160)..CHR(191), CHR(192)..CHR(223), and CHR(224)..CHR(255)
respectively.  The last row looks for "quoted-printable" versions
of high-bit characters...

 :0BDfh
 * -1^1 .
 * 19^1 [################################]
 * 19^1 [################################]
 * 19^1 [################################]
 * 19^1 [################################]
 * 57^1 =[89A-F][0-9A-F]
 | formail -A "X-Reject: Too many foreign characters."

  If more than about 5% of an email's body (one character in 19)
consists of unwanted characters, it is flagged with an X-Reject:
header.  If you want to divert it to a file immediately instead,
remove the "fh" flags and replace the formail line with the name of
the junkmail file, like so...

 :0BD
 * -1^1 .
 * 19^1 [################################]
 * 19^1 [################################]
 * 19^1 [################################]
 * 19^1 [################################]
 * 57^1 =[89A-F][0-9A-F]
 junkmail

  Here's the logic (if you're unfamiliar with procmail "scoring",
read "man procmailsc").
  -> "* -1^1 ."   - subtract 1 from the accumulator for each
                    character in the body.
  -> next 4 lines - add 19 to the accumulator for each forbidden
                    character in the body.
  -> 6th line     - add 57 (3 x 19) to the accumulator for each
                    group of 3 consecutive characters in the body
                    that form a "quoted-printable" character in the
                    range "=80" through "=FF".

  If the final result is positive, the action at the bottom of the
recipe is executed.
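
  As a rough check of the arithmetic, here is the score written out.
The symbols are shorthand of my own, not procmail names: n is the
number of characters the "." line counts in the body, h the number
of raw high-bit characters, and q the number of quoted-printable
sequences "=80" through "=FF".

```latex
S = -n + 19h + 57q, \qquad S > 0 \iff 19h + 57q > n
```

  For a body of 1000 counted characters with no quoted-printable
sequences, 52 high-bit characters give -1000 + 988 = -12 (not
flagged), while 53 give -1000 + 1007 = 7 (flagged).  That is where
the "about 5%" threshold comes from: 1/19 is roughly 5.3%.  A
quoted-printable sequence occupies three body characters, which is
why it carries three times the weight.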

 *BUT WHAT ABOUT LEGITIMATE NON-ENGLISH EMAIL?*  I assume you're
talking about European languages that use some accented characters.
You can add lines that subtract 19 for each acceptable high-bit
character and 57 for the quoted-printable version of that
character.  This will offset false matches in the recipe.
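
  A minimal sketch of that adjustment, applied to the flagging
recipe.  It assumes the mail is ISO-8859-1 (Latin-1) and that the
only characters you want to excuse are e-acute, e-grave and a-grave;
that selection is purely illustrative.  As in the pseudo-code, each
"#" stands for the real high-bit byte, and =E9, =E8 and =E0 are the
quoted-printable forms of those three characters in Latin-1.

```procmail
# Illustrative only: assumes Latin-1 mail and a hypothetical set of
# three acceptable characters.  Replace each "#" in the last bracket
# with the actual high-bit byte before use.
:0BDfh
* -1^1 .
* 19^1 [################################]
* 19^1 [################################]
* 19^1 [################################]
* 19^1 [################################]
* 57^1 =[89A-F][0-9A-F]
* -19^1 [###]
* -57^1 =(E9|E8|E0)
| formail -A "X-Reject: Too many foreign characters."
```

  Each acceptable character is still caught by one of the four
"19" rows, so the "-19" line brings its net contribution back to
zero; the "-57" line does the same for its quoted-printable form.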

-- 
Walter Dnes <waltdnes(_at_)waltdnes(_dot_)org>
http://www.waltdnes.org <SpamDunk Project procmail spamfilters>

Attachment: FOREIGN.ZIP
Description: Binary data
