procmail

RE: Base64spam documentation

2005-09-03 09:14:48


-----Original Message-----
From:Louis Proyect
Sent: Friday, September 02, 2005 11:45 AM

I have just received 2 emails that snuck through the recipe that Gary
supplied. The header and body for one of them can be found at:
http://www.columbia.edu/~lnp3/base64spam.htm

Any suggestions on how to drive a stake through its heart?


As a follow-up, matching strings in their base64-encoded form
isn't as trivial as I'd first suggested.  Base64 encodes each group
of 3 8-bit characters as 4 characters drawn from a 64-character
alphabet.  Thus, the beginning of the original string will appear at
the beginning of a base64 4-tuple only if the string begins on a
3-character boundary.  As a heuristic, we can also check whether the
substrings beginning at the second and third characters begin a
base64 4-tuple, but we give up confidence that we have an exact
match.  For example, we can match for "St0cks", "t0cks", and "0cks"
and still be fairly confident we've matched our spam string, but if
the distinguishing feature had been in the first two characters,
we're lost.  Note that if the original string, or its substring
beginning at offset 1 or 2, does not have a length that is a multiple
of 3, the last mod(length,3) characters cannot be matched exactly
either, because their encoding depends on whatever follows them.
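
To make the alignment issue concrete, here is a minimal sketch. The
helper name aligned_prefix and the sample prefixes "x" and "xy" are
made up for illustration. It puts 0, 1, or 2 characters in front of
"St0cks", encodes the result, and checks which suffix of the word
lands on a 3-character boundary and so shows up verbatim in the
encoding.

```perl
#!/usr/bin/perl -w
use strict;
use MIME::Base64;

# Encode $s and keep only the characters that are fully determined
# by $s itself: 4 per complete 3-character group, plus 1 or 2 for a
# trailing partial group.  (Hypothetical helper, for illustration.)
sub aligned_prefix {
    my ($s) = @_;
    my $len = length($s);
    my $c = encode_base64($s, '');    # '' = no line breaks
    return substr($c, 0, int($len/3)*4 + $len % 3);
}

my $word = "St0cks";

# Hypothetical message text: the word preceded by 0, 1, or 2 characters.
for my $prefix ("", "x", "xy") {
    my $enc = encode_base64($prefix . $word . " today", '');
    # Number of leading characters of $word that share a 4-tuple
    # with the prefix and therefore cannot be matched.
    my $skip = (3 - length($prefix)) % 3;
    my $pat  = aligned_prefix(substr($word, $skip));
    printf "prefix '%s': %s contains %s? %s\n", $prefix, $enc, $pat,
        (index($enc, $pat) >= 0 ? "yes" : "no");
}
```

With no prefix the whole word is matchable; with one character in
front only "0cks" is; with two, "t0cks" is.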

Here's a Perl script that accepts a list of strings, one per line,
and outputs a list of procmail conditions that attempt to match the
original strings as they might occur in a base64 encoding.

#!/usr/bin/perl -w
use strict;
use MIME::Base64;
my %pat_seen = ();
while (<>)
  {
    chomp;
    my $slen = length($_);
    if ($slen < 5)
      {
        print STDERR "too short: $_\n";
        next;
      }
    # The first, second, or third character can fall on the start
    # of a base64 4-tuple.  Try all three, but note that starting
    # with the second character reduces the matched length by 1
    # (and the third by 2), so those patterns should score less.
    # Further, the last one or two characters won't match
    # exactly unless the substring has a length that is a
    # multiple of 3, so score those below exact matches too.
    for my $i (0..2)
      {
        my $s = substr($_,$i);
        my $len = length($s);
        my $rem3 = $len % 3;
        # empty eol: no line breaks (a 0 here would be taken as
        # the literal line ending "0")
        my $c = encode_base64($s,'');
        my $score = ($slen - $i - $rem3*0.25)/$slen ;
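        # keep only the characters fully determined by $s
        $c = substr($c,0,int($len/3)*4+$rem3);
        next if $pat_seen{$c}++;
        # "+" is a regex operator in procmail conditions; escape it
        (my $pat = $c) =~ s/\+/\\+/g;
        printf "* %0.2f^1 %s\n", $score, $pat;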
      }
  }

Given a file with the words:
Penny-stocks
Penny stocks
st0ck
St0ck

The script produces:

* 1.00^1 UGVubnktc3RvY2tz
* 0.88^1 ZW5ueS1zdG9ja3
* 0.81^1 bm55LXN0b2Nrc
* 1.00^1 UGVubnkgc3RvY2tz
* 0.88^1 ZW5ueSBzdG9ja3
* 0.81^1 bm55IHN0b2Nrc
* 0.90^1 c3QwY2
* 0.75^1 dDBja
* 0.60^1 MGNr
* 0.90^1 U3QwY2

The weightings below 1.00 reflect lower confidence that
the matched text is actually the target string.  Rolling these
conditions into a simple test recipe:

DEFAULT=/dev/null
SENDMAIL
LOGFILE=`rm -f t.log; echo t.log`
RM_T_SPAM=`rm -f t.spam`
LOGABSTRACT=No
VERBOSE=Yes
 
:0 B:
* ^Content-Type: text/html
* ^Content-Transfer-Encoding: base64
* -2.0^0
* 1.00^1 UGVubnktc3RvY2tz
* 0.88^1 ZW5ueS1zdG9ja3
* 0.81^1 bm55LXN0b2Nrc
* 1.00^1 UGVubnkgc3RvY2tz
* 0.88^1 ZW5ueSBzdG9ja3
* 0.81^1 bm55IHN0b2Nrc
* 0.90^1 c3QwY2
* 0.75^1 dDBja
* 0.60^1 MGNr
* 0.90^1 U3QwY2
t.spam

We find that it will match an example penny stock email whose body
is base64 encoded:


procmail: Score:      -2      -2 ""
procmail: Score:       0      -2 "UGVubnktc3RvY2tz"
procmail: Score:       0      -2 "ZW5ueS1zdG9ja3"
procmail: Score:       0      -2 "bm55LXN0b2Nrc"
procmail: Score:       0      -2 "UGVubnkgc3RvY2tz"
procmail: Score:       0      -2 "ZW5ueSBzdG9ja3"
procmail: Score:       0      -2 "bm55IHN0b2Nrc"
procmail: Score:       1       0 "c3QwY2"
procmail: Score:       0      +0 "dDBja"
procmail: Score:       2       2 "MGNr"
procmail: Score:       0       2 "U3QwY2"
procmail: Locking "t.spam.lock"
procmail: Assigning "LASTFOLDER=t.spam"
procmail: Opening "t.spam"

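Note that the recipe above lacks the D flag, so procmail compares
the patterns case-insensitively, which loosens the match further.
Obviously, simply decoding the message body and testing against
that would be simpler and more reliable.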
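
A rough sketch of that decoding approach follows. It is not a tested
filter: the pattern list, the function names decoded_text and is_spam,
and the sample message are all assumptions for illustration, and the
MIME handling is deliberately naive (it treats any line starting with
"--" as a part boundary).

```perl
#!/usr/bin/perl -w
use strict;
use MIME::Base64;

# Hypothetical pattern list; in practice it might be read from a file.
my @patterns = ('penny[- ]stocks', 'st0ck');

# Decode every section that follows a "Content-Transfer-Encoding:
# base64" header, whether that header sits in the message header or
# in a MIME part header inside the body.
sub decoded_text {
    my ($msg) = @_;
    my $out = '';
    my ($b64, $in_data, $buf) = (0, 0, '');
    for my $line (split /\r?\n/, $msg) {
        if ($in_data) {
            if ($line =~ /^--/) {      # boundary line ends the part
                $out .= decode_base64($buf);
                ($b64, $in_data, $buf) = (0, 0, '');
            } else {
                $buf .= $line;
            }
        } elsif ($line =~ /^Content-Transfer-Encoding:\s*base64/i) {
            $b64 = 1;
        } elsif ($line eq '') {
            $in_data = 1 if $b64;      # blank line ends a header block
        }
    }
    $out .= decode_base64($buf) if $in_data;
    return $out;
}

# True if any pattern matches the decoded text (case-insensitive).
sub is_spam {
    my ($msg) = @_;
    my $text = decoded_text($msg);
    for my $p (@patterns) {
        return 1 if $text =~ /$p/i;
    }
    return 0;
}

# Made-up sample message with one base64-encoded HTML part.
my $sample = join("\n",
    'Content-Type: multipart/alternative; boundary="b1"',
    '',
    '--b1',
    'Content-Type: text/html',
    'Content-Transfer-Encoding: base64',
    '',
    encode_base64("<html>Hot Penny stocks today</html>"),
    '--b1--',
    '');
print is_spam($sample) ? "match\n" : "no match\n";
```

To use something like this from procmail, the script would read the
message from STDIN instead of the sample and exit 0 on a match, so
that an exit-code condition ("* ? perl /path/to/script", path being a
placeholder) could test it.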


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
