Re: Base64spam eliminated!

2005-09-23 08:54:03
On Fri, Sep 23, 2005 at 08:56:36AM -0400, Louis N Proyect wrote:

> On Fri, 23 Sep 2005, Dallman Ross wrote:
> > it is not difficult to find base64 stuff with procmail.  I do
> > it in my own .procmailrc, and it's not at all one of my more
> > complex or nuanced recipe sets.  What I have coded in has a
> > whole bunch of private dependencies (private vars), however, so
> > I can't all that easily just copy and paste it in here for your
> > consumption.  If I have time at some point and nobody else does
> > it first, maybe I'll get around to posting a sample solution.
>
> Dallman, the thing I could never understand is why a simple
> filter on the presence of "Content-Transfer-Encoding: base64"
> failed to work. I would love to see how to identify base64
> without looking for this. Or maybe my filter was not set up
> right? That is a distinct possibility.

Ruud showed some ways, though we can improve on them (to
avoid false positives) with just a bit more effort.  Unfortunately,
I'm short on time.

However, I suspect you looked only in the header, whereas the
string often appears in the body and not in the header.  That is,
you get a multipart MIME message, and one or more of the parts is
base64-encoded.

So a good algorithm would look for the phrase, anchored at the
start of the line (I noticed Ruud forgot that "^" in one place in
his follow-up), in either the header or the body.  Then, we have
to deal with irregular whitespace if we're going to catch all
comers.  Tabs or spaces, multiple or single, even fields folded
across lines are all RFC-compliant.  The tightest code will mirror
the possibilities for mime-subpart "headers" (pseudo-headers
in the mail body, controlling the mime part) in the RFCs.
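
To make the anchoring and whitespace points concrete, here is a
rough sketch in Python rather than procmail.  The pattern and the
function name are my own illustration, not anything from Ruud's
recipes or from Virus Snaggers; it only shows the shape of the
match.

```python
import re

# Illustrative pattern (an assumption, not taken from any posted recipe):
# "^" anchors at the start of a line; after the colon we allow any mix of
# spaces, tabs, or a folded continuation (newline followed by whitespace).
CTE_BASE64 = re.compile(
    r"^Content-Transfer-Encoding:(?:[ \t]|\r?\n[ \t])*base64[ \t]*\r?$",
    re.IGNORECASE | re.MULTILINE,
)


def mentions_base64_field(message_text: str) -> bool:
    """True if the field appears at a line start anywhere in the text,
    whether that is in the header or in the body."""
    return CTE_BASE64.search(message_text) is not None
```

Because the text is searched as a whole, the same test covers the
header and the body.  A line that merely mentions the phrase in
mid-sentence, or quotes it behind "> ", does not match.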

Then, as Ruud alluded, we'd want to make sure a companion header
is nearby, to confirm that this is really a mime subpart we've
caught and not just some discussion of base64 in the procmail
list, for example.
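
One way to read "nearby" is: within the same run of non-blank
lines.  The Python sketch below takes that reading.  The choice of
Content-Type and Content-Disposition as companions is my
assumption, and folded fields are ignored to keep it short.

```python
import re

CTE_LINE = re.compile(r"^Content-Transfer-Encoding:[ \t]*base64[ \t]*$", re.I)
# Assumed companion fields; a real recipe might accept others.
COMPANION = re.compile(r"^Content-(?:Type|Disposition):", re.I)


def has_confirmed_base64_part(body: str) -> bool:
    """True if a base64 encoding line shares a block of non-blank
    lines with at least one companion pseudo-header."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if not CTE_LINE.match(line):
            continue
        # Grow outward to the run of non-blank lines around the match.
        start = i
        while start > 0 and lines[start - 1].strip():
            start -= 1
        end = i
        while end + 1 < len(lines) and lines[end + 1].strip():
            end += 1
        if any(COMPANION.match(l) for l in lines[start:end + 1]):
            return True
    return False
```

A lone "Content-Transfer-Encoding: base64" line quoted in a list
discussion fails this test; a pasted, complete mime-part header
block would still pass, so this reduces false positives rather
than eliminating them.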

Finally, we'd need to look in the body of mail from mailer
daemons, too, since spam is often sent from a zombie acting like
a daemon, whether the message is actually a bounce or merely a
simulation of one to get past our filters.

I have some code in Virus Snaggers(tm) that you could look
at for ideas.  I do all those things above, there.

1) Does the header indicate a Content-Type commensurate with
   a possible base64-encoding?  If so, continue to 3.

2) If not 1, is this a daemon?  If not, exit the ruleset;
   this isn't going to be base64-encoded.

3) If we're here, we either already know, or still suspect
   from the Content-Type in the header or the fact of its
   being a daemon, that the mail is or might be base64.  Do
   we find the constellation of mime-subpart headers that
   will confirm this hypothesis?

   a. Have we tested with appropriate line-start anchors
      and any conceivably valid whitespace permutations in
      those lines?

   b. Have we found pseudo-headers in the mime-subpart,
      identifiable by various lines placed serially?  This
      eliminates (as well as we can) false positives based
      on mere body text describing base64 code.
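
The three steps above can be sketched as follows.  This is
Python, not the Virus Snaggers code, and every name in it is
hypothetical.  In particular, the set of "commensurate" content
types and the daemon test are simplified assumptions; procmail's
own FROM_DAEMON is far more thorough.

```python
import re
from email import message_from_string

SUSPECT_MAINTYPES = {"multipart", "message"}   # assumed set for step 1
DAEMON = re.compile(r"mailer-daemon|postmaster", re.I)  # simplified
CTE_LINE = re.compile(r"^Content-Transfer-Encoding:[ \t]*base64[ \t]*$", re.I)
COMPANION = re.compile(r"^Content-(?:Type|Disposition):", re.I)


def looks_like_daemon(msg) -> bool:
    """Crude stand-in for a daemon test: null return path or a
    daemon-like sender."""
    return_path = str(msg.get("Return-Path", "")).strip()
    sender = str(msg.get("From", ""))
    return return_path == "<>" or bool(DAEMON.search(sender))


def body_has_base64_subpart(body: str) -> bool:
    """Step 3: an anchored base64 line (3a) sitting in the same block
    of non-blank lines as a companion pseudo-header (3b)."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if not CTE_LINE.match(line):
            continue
        start, end = i, i
        while start > 0 and lines[start - 1].strip():
            start -= 1
        while end + 1 < len(lines) and lines[end + 1].strip():
            end += 1
        if any(COMPANION.match(l) for l in lines[start:end + 1]):
            return True
    return False


def is_base64_mail(raw: str) -> bool:
    raw = raw.replace("\r\n", "\n")
    msg = message_from_string(raw)
    # Step 1: does the header Content-Type make base64 plausible?
    suspect = msg.get_content_maintype() in SUSPECT_MAINTYPES
    # Step 2: if not, only daemons stay in the running.
    if not suspect and not looks_like_daemon(msg):
        return False
    # Step 3: confirm with the constellation of subpart headers.
    _, _, body = raw.partition("\n\n")
    return body_has_base64_subpart(body)
```

The early exit in step 2 is what keeps ordinary text mail that
merely discusses base64 from ever reaching the body scan.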

Since I know you're a programmer with some years' experience,
I believe you could construct this algorithm with some trial
and error.  Assembling the tools to "sandbox" or test your
code while you're working on it is an important early step.
That's been discussed here many times, by Sean Straw, by
Ruud, by me, and by others over the years.  The way Sean does
it is also in the URL found in his .sig.

The way I decided which Content-Type headers could possibly
be used with base64 in some subpart was, I simply examined
thousands of email messages (mostly spam), using grep, collecting
information about distributions, etc.  Trial-and-error.  (But
one can also just read the RFCs.  I'm better at empirical
research than reading theory, myself, however, so that
rules my approach to problem-solving.)
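
For anyone who would rather script that kind of survey than run
grep by hand, something like the following Python sketch would
do.  It is not what I actually ran; the function name and the
"(none)" label are made up for illustration.  It tallies the
header Content-Type of every message that contains a base64
encoding line anywhere.

```python
import re
from collections import Counter
from email import message_from_string

CTE_BASE64 = re.compile(
    r"^Content-Transfer-Encoding:[ \t]*base64[ \t]*\r?$", re.I | re.M
)


def tally_header_types(raw_messages):
    """Count header Content-Types among messages that contain a
    base64 encoding line in the header or the body."""
    counts = Counter()
    for raw in raw_messages:
        if not CTE_BASE64.search(raw):
            continue
        msg = message_from_string(raw)
        # get_content_type() falls back to text/plain when the field is
        # missing, so record that case separately.
        if msg.get("Content-Type"):
            counts[msg.get_content_type()] += 1
        else:
            counts["(none)"] += 1
    return counts
```

Types that never show up in the tally over a large enough corpus
are the candidates for an early exit.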

The way I decided that Content-Type was a useful, indeed,
necessary, thing to be looking at, was exactly the same
way: I examined thousands of messages with base64 and checked
which headers were frequently there, or which (this is a
neat way to approach experimentation) could never be there
in such a situation.  For example, if the message is not
a daemon, and if the header contains a Content-Type field that
says "text/plain", or contains *no* Content-Type field
at all, then it's not a message with any base64 encoding
(assuming it's RFC-compliant)!  (I think.  I'm not going and
checking anything I'm saying right now; this is all from memory.)

-- 
dman (Virus Snaggers(tm): <http://vsnag.spamless.us> )

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail