procmail

Re: regex syntax question

2004-03-02 12:56:58
At 09:37 2004-03-02 -0500, Curtis Maurand wrote:
external images and pages are also usually valid links.  The latest way
that I've seen it done is by completely obfuscating the link by encoding it as
"http:&#47&#47..." which translates to "http://..."

ordinal encoding. There's also BASE64 encoding of the body, deliberate MIME quoted-printable (or "overquoted-printable" <g>), use of IP addresses instead of domains...

> Header check first:
>
>   :0
>   * ^some-header-test
>   * B ?? some-body-test
>   some-action

too many rules for this to be efficient.  I need to be less
discriminatory.

There's a significant logic error in that script anyway: it assumes you're looking for the text as two separate occurrences, once in the header and once in the body, instead of one occurrence in either place (in which case the scoring approach makes the most sense).

I haven't checked, but a sandbox and several *LARGE* messages would confirm whether a simple scored method:

* HB ?? some-test

using a simple expression would actually check the header first, and therefore bail quickly if the text is in fact there. However, with multiple expression checks (or when running external apps), the engine is likely going to scan the entire message for each expression before moving along to the next one. Using maximal scoring and two separate condition lines, one for the header and the next for the body, would be a lot more efficient for large messages, but if you're invoking an external process (such as grep), you have that invocation overhead to contend with.

I'll look these up.  I've had to turn off rbl checking in most things as
one of the rbls started reporting everything as bad.

I have a program "megagrep" (a compiled C++ program) which is intended to take the place of the following form of grep invocation:

        grep [-i] -w -f somefile

'cept that it performs some domain-token and email optimizations on the input data WRT -w. This allows a datafile that contains:

        domain.com

to properly match:

        user@domain.com
        user@host.domain.com
        from host.domain.com [1.2.3.4]

but NOT trip up on:

        user@domain.com.otherdomain
        domain.community.com

which a simple wordsearch would do.
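
The matching rule described above can be sketched in a few lines of Python. This is not megagrep itself (that is compiled C++ built around an AVL tree); a plain set stands in for the tree, and the names TOKEN_RE, subtokens and line_matches are made up for illustration. The sketch assumes tokens are runs of letters, digits, hyphens and dots, and that each token is also tried with its leading labels stripped (host.domain.com -> domain.com):

```python
import re

# Assumption: a "word token" is a run of hostname-like labels joined by dots.
# '@', whitespace, brackets and other punctuation all end a token.
TOKEN_RE = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*")


def subtokens(token):
    """Yield the token itself, then the token with leading labels removed."""
    labels = token.lower().split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])


def line_matches(line, keywords):
    """True if any token in the line, or any right-hand sub-token, is a keyword."""
    for token in TOKEN_RE.findall(line):
        if any(sub in keywords for sub in subtokens(token)):
            return True
    return False


keywords = {"domain.com"}  # stand-in for the loaded datafile
for sample in (
    "user@domain.com",
    "user@host.domain.com",
    "from host.domain.com [1.2.3.4]",
    "user@domain.com.otherdomain",
    "domain.community.com",
):
    print(sample, "->", line_matches(sample, keywords))
```

Because only right-hand sub-tokens are tried, domain.com.otherdomain and domain.community.com never reduce to domain.com, which is the behaviour a plain word search gets wrong.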

This program loads the file into an AVL tree (auto balanced, very optimized for searching), and then walks through the input data, tokenizing it as it goes, and looking those tokens up in the AVL. Say there are one million records (which for this purpose, is quite a lot) in the AVL, and 200 word tokens in the headers. Assuming there are no hits in there, each word requires examining only about 20 records in the data tree (a balanced binary tree holding a million entries is about 20 levels deep, and the lookup is a pure memory operation once the tree is loaded), so worst case, the headers are checked in about 4000 record examinations (or a few more than that, assuming that some of the word tokens are subsequently broken into sub-tokens: host.domain.com -> domain.com). Seems like a lot of operations, but it really isn't. The most intensive bit is loading the initial wordlist.
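
The arithmetic behind those figures is just a base-2 logarithm. A quick check, using the record and token counts from the paragraph above (the function name is made up for illustration):

```python
import math

RECORDS = 1_000_000  # keywords in the tree (figure from the text)
TOKENS = 200         # word tokens in the headers (figure from the text)


def comparisons_per_lookup(records):
    """Depth of a well-balanced binary tree holding `records` keys."""
    return math.ceil(math.log2(records))


per_lookup = comparisons_per_lookup(RECORDS)
print(per_lookup, per_lookup * TOKENS)  # prints: 20 4000
```

An AVL tree is allowed to be somewhat deeper than a perfectly balanced one (roughly 1.44 times in the worst case), so treat 20 per lookup as an estimate rather than a hard ceiling.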

I had at one time considered loading the message data into an AVL of its own, which would handily eliminate duplicates, but that just trades a few searches in the keyword AVL for searches in the message AVL, and doesn't buy much.

Another optimization, which I might actually decide to perform, is to note the shortest and longest keywords as they are inserted into the keyword AVL. Then, when parsing the message, before searching the AVL for a specific token, see if the token in hand is shorter than the shortest keyword or longer than the longest one (two simple, non-lookup checks at that point). If so, we know that it isn't in the AVL, and can progress to the next check. This would be of particular benefit if, say, you find yourself looking up a line of BASE64 text, which could be skipped easily without lookup overhead.
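
A minimal sketch of that length pre-check, again with a Python set standing in for the keyword AVL. The class name KeywordIndex and the lookups counter are hypothetical, added only so the effect of the shortcut is visible:

```python
class KeywordIndex:
    """Keyword store that rejects tokens of impossible length before any lookup."""

    def __init__(self, keywords):
        self._keys = {k.lower() for k in keywords}
        lengths = [len(k) for k in self._keys]
        # Noted while loading: shortest and longest keyword seen.
        self.min_len = min(lengths, default=0)
        self.max_len = max(lengths, default=0)
        self.lookups = 0  # counts real lookups, for illustration only

    def __contains__(self, token):
        # Two cheap comparisons; no tree (here: set) access if either fails.
        if len(token) < self.min_len or len(token) > self.max_len:
            return False
        self.lookups += 1
        return token.lower() in self._keys


index = KeywordIndex(["domain.com", "example.org"])
base64_line = "TG9yZW0gaXBzdW0gZG9sb3Igc2l0IGFtZXQsIGNvbnNlY3RldHVyIGFkaXBp"
print("domain.com" in index, base64_line in index, index.lookups)
```

The long BASE64 line is rejected on length alone, so only the first test costs a real lookup.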

> > I have
> > hundreds of domain names that spamassassin is just not catching.

Sounds as if SA is either untrained, or has a limitation. If you use that tool, why not raise this issue with the developers there? Seems that if you're going to use it, it'd make more sense to see that it works properly rather than reinventing the wheel.

I can understand people who don't use SA (myself included) choosing to go through the motions to write stuff to intercept spam, but those who use SA would be better off getting SA improved. At least attempt to develop your fix within the SA framework.

> There are domain-names anywhere in a message: in the Received headers,
> in email addresses, in URLs, etc. etc. Which ones do you mean?

Encoded stuff is a PITA. Might make a lot of sense to pipe the message to a filter to recursively decode it (base64, QP, ordinals, etc) before performing body scans. Additionally, elevating the "spammishness" based on the presence of different encoding tricks would be useful.
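
As a rough idea of what such a decoding filter could look like, here is a sketch using Python's standard email and html modules. It assumes a well-formed MIME message, handles only one layer of transfer encoding per part (base64 or quoted-printable), and then expands ordinal/character references such as &#47 in the result. The function name decoded_text_parts is made up for illustration:

```python
import email
import html


def decoded_text_parts(raw_message):
    """Return the text parts of a message with transfer encoding and
    HTML character references (e.g. &#47) undone."""
    msg = email.message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)  # undoes base64 / QP
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown or bogus charset label
            text = payload.decode("latin-1")
        parts.append(html.unescape(text))
    return parts
```

The body scans would then run against the returned text instead of the raw message. Counting how many decoding steps actually changed something would be one way to feed the "spammishness" elevation mentioned above.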

Note that since SA reportedly deals with decoding (as I've heard - I don't use it, and this list isn't an SA support group), seems like integrating your solution within SA would be much easier.

Ce.|3brex, Fi0ri'c3t, T'ram(_at_)do|, U|tr(_at_)`m, L3v|'tra, Pr0p3.cia, 
A:cyc|0vir,
Pr0z:@c, P(_at_)x:il, Bu:sp(_at_)r

This stuff is notoriously difficult to match since they're deliberately misspelling and messing with the text.

[snip - spam excerpts, thanks!]

style="font-size: 1;">x</font>op 3 at the lo<font style="font-size:
1;">v</font>west pr<font style="font-size: 1;">n</font>ices any<font
style="font-size: 1;">w</font>where.<BR>
 <A href="http://ffr3ws.com/pc/";>Low man<font style="font-size:
1;">r</font>ufac<font style="font-size: 1;">y</font>turer direct p<font
style="font-size: 1;">i</font>ric<font style="font-size:

Suggestion: score on number of occurrences of font tags.
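
To illustrate the idea outside of procmail, a sketch that counts opening font tags in an already-decoded body and turns the count into a capped score. The per-tag weight and the cap are arbitrary assumptions, and the function names are made up:

```python
import re

# Matches opening tags only: "<font" followed by a word boundary,
# so "</font>" is not counted.
FONT_OPEN_RE = re.compile(r"<font\b", re.IGNORECASE)


def font_tag_count(body):
    """Number of opening <font ...> tags in the body text."""
    return len(FONT_OPEN_RE.findall(body))


def font_score(body, per_tag=1, cap=50):
    """Score grows with each font tag, up to an (arbitrary) cap."""
    return min(cap, per_tag * font_tag_count(body))
```

A legitimate HTML mail has a handful of font tags at most; a message that splits every other word with one will hit the cap quickly.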

> > how to do that from the documentation.
>
> Sandbox!

junkmail=sandbox.  On a systemwide basis it's /tmp/junkmail

No, USE A SANDBOX. A sandbox is not a mailbox - it's a testing environment. The name "sandbox" is supposed to be reminiscent of "playing in the sandbox", a construct which is supposed to contain the sand, instead of letting it out all over the place (otherwise, it'd be a sand pile), or more specifically, keeping the sand out of everything else. It also doesn't hurt as much when you fall down in a sandbox.

Refer to my .sig, where you can download a functional one which you can customize to your environ.

Basically, you write your filters, include them into the sandbox (test rig), and run the saved (or constructed) messages against it:

        formail -s procmail -m sandbox.rc < mailboxfile

The sandbox sets up verbose logging (something you'd rather not have running on all of your inbound mail, but which is quite useful in a test environ, since you can review the log and then purge it).

Properly implemented, there is ZERO change in your recipe code between how it is used in testing and implementation - no hacking of something to be a bit different because you're using it for testing instead of in a live config, etc -- all that should be handled in the sandbox itself, so your recipe remains stable, and you don't introduce errors into it when tweaking it into your live config.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
