
regexp thoughts for future procmail version

2003-03-23 13:29:53

I use plussed addresses for some things. Recently, I added a BCC detector to my spam ruleset. From recent discussions about "spammishness", some members will already be aware that I simply let certain characteristics each contribute some amount towards the likelihood that a message is spam.

Basically, in my mainline ~/.procmailrc, I extract various commonly manipulated or compared headers: Subject, To, From, Sender, and the envelope To and From. This beats the heck out of extracting them again throughout the procmailrc each time they're needed.

So, let's say that the following is in .procmailrc:

        :0
        * ^X-Envelope-To: *<\/[^>]*
        {
                ENVTO=$MATCH
        }

(In case you're wondering, there is also a spammishness test for an absence of this header, which indicates multiple local recipients, but that works in conjunction with another test and isn't the point of this post.)

Elsewhere in the mainline, there's an includerc of a file which extracts a simple listname component. Suffice it to say, LISTNAME is either NULL, or contains a string identifying the root name of a discussion list.

Now, in my spam file (or, at the moment, the sandbox):

# No cleartexted recipients matching the X-Envelope-To
# (AND not a list, where that would be very normal).
# NOTE: Basically, this means we were BCC'd, which itself is perfectly
# valid, but also very commonly used in spam.
:0
* LISTNAME ?? ^^^^
* $! ^(To|Cc):.*${ENVTO}
{
        SPAMNOTES="${SPAMNOTES}SPAM: Advisory - no non-list cleartext recipient matching X-Envelope-To${NL}"
        SPAMMISHNESS="${SPAMMISHNESS}+45"
}

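All is well and good so long as ENVTO isn't a plussed address. Really, we're already letting it slide that the dots in the address aren't escaped either: an unescaped dot still matches a literal dot, just too loosely. An unescaped plus is worse: it is read as a repetition operator, so a plussed address will never match.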
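
To make the failure concrete, here is a small sketch that uses grep -E as a stand-in for procmail's egrep-style matcher (an assumption: the two engines are not identical, but they agree on what "+" and "." mean here). The header line and address are made up for illustration.

```shell
# Hypothetical header line and address, for illustration only.
header='To: user+tag@example.com'
addr='user+tag@example.com'

# Unescaped: "r+" means "one or more r", so the literal "+" in the
# header is never consumed and the pattern cannot match.
if printf '%s\n' "$header" | grep -Eq "^(To|Cc):.*$addr"; then
    unescaped=match
else
    unescaped=nomatch
fi

# Escaped by hand: "\+" and "\." are literals, so the pattern matches.
if printf '%s\n' "$header" | grep -Eq '^(To|Cc):.*user\+tag@example\.com'; then
    escaped=match
else
    escaped=nomatch
fi

echo "unescaped: $unescaped"
echo "escaped: $escaped"
```

The unescaped pattern fails on the very address it was built from, which is exactly the false "we were BCC'd" hit described above.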

I can fudge it, in an ugly way, by:

ENVTO=`echo "$ENVTO" | sed -e "s/\+/\\\+/g" -e "s/\./\\\./g"`

This invocation is problematic when the shell is bash, though, due to how it manages pipelining. It is also a serious waste of CPU. Unnecessary calls can be mitigated somewhat by first checking the variable for the presence of those characters (the left-hand side of ?? is not expanded as a regexp):

:0
* ENVTO ?? (\.|\+)
{
        ENVTO=`echo "$ENVTO" | sed -e "s/\+/\\\+/g" -e "s/\./\\\./g"`
}

(obviously, if one were looking to escape other regexp operators, support for those would be added - these are the only ones I envision having to deal with in a valid address though)
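
For what it's worth, the two sed expressions can probably be folded into one by using a bracket expression and "&" in the replacement. This is only a sketch on the shell side: escape_regex is a hypothetical helper name, not anything procmail provides, and it still costs a fork. Single quotes are used so the shell leaves the backslashes alone.

```shell
# escape_regex: hypothetical helper, prefixes each "." or "+" in its
# argument with a backslash. Other operators could be added inside the
# brackets if they were ever expected in an address.
escape_regex() {
    # "&" in the replacement stands for the matched character;
    # "\\" is a literal backslash.
    printf '%s\n' "$1" | sed 's/[.+]/\\&/g'
}

escape_regex 'user+tag@example.com'
```

Inside the brackets neither character needs escaping, which avoids having to count backslashes through multiple layers of quoting.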


My idea: in a future procmail rev, wouldn't it be useful to have a built-in variable expansion syntax which auto-escapes the variable content? A pair of "start regexp" and "stop regexp" tokens is not feasible, because the stop token might itself appear within the expanded variable. What if:

* To:.*${{SOMEVARIABLE}}

were to escape that variable so that it would match as a literal?

If you carefully inspect your procmailrc files, you'll probably find one or two of those special recipes which extract a literal value and reuse it in a condition, where it'll be interpreted as a regexp.

In the meantime, does anyone have any pointers on escaping variables that contain regexp operators (especially bash-friendly syntax, or solutions wholly internal to procmail that don't involve a shell)?
---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


