procmail

Re: regex syntax question

2004-03-02 20:05:53
At 21:14 2004-03-02 +0000, Alan Clifford wrote:
I have a 10% rule on that one:

# 10% of chars are &#
:0 BD
* -1^1 .
* 10^1 [&#]
action as spam line here

Uh, unless the entire message is encoded this way (well, more like 50% of it), you're unlikely to get much of a hit here - the technique seems to be used by spammers for URLs. To top it off, you're flagging & and # as individual characters rather than as a pair, so programming lists are sure to run into trouble. # is used as a comment char in shell scripts and perl, as well as a decorative separator (e.g. in large blocks). & is a logical and bitwise operator in many (programming) languages.

I ran a scan against my captured spew from February. Several hundred spams (hey, most of it gets avoided via DNSBLs; otherwise I'd have >10K spams a month), but ONLY the following matched the construct (not the weighting, just ANY match for an HTML ordinal escape, as a character pair):

4:3271  (used to signify bullets in an HTML list)
92:8544 (* an honest-to-goodness-ordinalized spam URL)
1:6308  (furrin character)
139:1687 (* ordinalized random characters of nearly every word in the body of the message)
3:5200 (ordinals for unicode characters like ellipses)
1:1963 (x3, ditto)

Note that this doesn't include checking the legit messages, and the majority of the above hits are for legitimate use (even if found in spam messages). This doesn't provide a large enough sampling, but judging from the above, the following would seem to establish a reasonable weighting:

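# 1% of chars start an &# pair
# (B so the body is scanned; -1 per char so the score can go negative)
:0 B
* -1^1 .
* 100^1 &#
{
        #action
}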
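
For anyone adjusting the numbers: the trip point of a recipe like this is simply the ratio of the two weights, assuming every character costs -1 (as in the quoted 10% rule) and the body is what gets scanned. A rough sketch of the arithmetic follows; the 2000-character body, the hit counts, and the 2% variant are made-up illustrations, not measurements from the scan above.

```procmail
# Hypothetical body of 2000 characters (newlines not counted, . skips them):
#   -1^1 .     2000 matches x   -1 = -2000
#   100^1 &#     25 matches x +100 = +2500   total +500, recipe fires
#   100^1 &#     15 matches x +100 = +1500   total -500, recipe does not fire
# Break-even is one &# pair per 100 characters (1%).
#
# Same pattern with a 2% trip point (weight 50 = 100 / 2):
:0 B
* -1^1 .
* 50^1 &#
{
        # action
}
```

Since procmail delivers only when the final score is positive, a message sitting exactly at the break-even ratio does not trigger the action.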

Note that besides ASCII, ordinalized codes can be hex (the &#x...; form, for instance), or may specify a Unicode character (&#8230; for an ellipsis, for instance).
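
A condition that recognizes a complete escape in either decimal or hex form might look like the sketch below. It is untested, and it assumes procmail's default case-insensitive matching (no D flag), so an uppercase X or A-F is covered as well. It also assumes the terminating semicolon is present, which sloppy HTML may omit.

```procmail
# Matches decimal escapes (&#38; &#8230;) and hex escapes (&#x26; &#X2026;)
:0 B
* &#(x[0-9a-f]+|[0-9]+);
{
        # action
}
```

The same regex could be dropped into a weighted condition in place of the bare &# pair if only well-formed escapes should count toward the score.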


What sort of performance diff do you realize with the D flag here when you're not checking anything where case would be significant?

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
