Re: regex syntax question

At 09:37 2004-03-02 -0500, Curtis Maurand wrote:

external images and pages are also usually valid links.  The latest way
that I've seen it done is by completely obfuscating the link by encoding it as
"&#104;&#116;&#116;&#112;&#58;&#47&#47..." which translates to

ordinal encoding. There's also BASE64 encoding of the body, deliberate MIME quoted-printable (or "overquoted-printable" <g>), use of IP addresses instead of domains...

> Header check first:
>
>   :0
>   * ^some-header-test
>   * B ?? some-body-test
>   some-action

too many rules for this to be efficient.  I need to be less
discriminatory.

There's a significant logic error in that script anyway, that being that it assumes you're looking for the text as two separate occurrences - once in the header and once in the body, instead of one occurrence in either place (in which case, the scoring one makes the most sense).

I haven't checked, but a sandbox and several *LARGE* messages would confirm whether a simple scored method:


* HB ?? some-test

using a simple expression would actually check the header first, and therefore bail quickly if the text is in fact there. However, on multiple expression checks (or running of external apps), the engine is likely going to scan the entire message for each expression before moving along to the next expression. Using maximal scoring and two separate condition lines - one for the header, and the next for the body - would be a lot more efficient for large messages, but if you're invoking an external process (such as grep), you have that invocation overhead to contend with.

I'll look these up.  I've had to turn off rbl checking in most things as
one of the rbls started reporting everything as bad.

I have a program "megagrep" (a compiled C++ program) which is intended to take the place of the following form of grep invocation:


        grep [-i] -w -f somefile

'cept that it performs some domain-token and email optimizations on the input data WRT -w. This allows a datafile that contains:


        domain.com

to properly match:

        user@domain.com
        user@host.domain.com
        from host.domain.com [1.2.3.4]

but NOT trip up on:

        user@domain.com.otherdomain
        domain.community.com

which a simple wordsearch would do.
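
The matching behaviour listed above can be sketched roughly as follows. This is not megagrep itself (which is a compiled C++ program that isn't shown here): it is a minimal Python illustration, it uses a plain set where megagrep uses an AVL tree, and the names load_keywords, candidate_tokens and find_hits are made up for the example. It also assumes case-insensitive matching, as with grep -i.

```python
import re

# A token is a run of characters that can appear in a host or domain name.
TOKEN = re.compile(r"[A-Za-z0-9.-]+")


def load_keywords(lines):
    """Build the lookup structure from datafile lines (a set stands in for the AVL)."""
    return {line.strip().lower() for line in lines if line.strip()}


def candidate_tokens(token):
    """Yield the token, then each sub-token obtained by dropping a leading label.

    host.domain.com -> host.domain.com, domain.com, com
    """
    token = token.strip(".").lower()
    while token:
        yield token
        _, _, token = token.partition(".")


def find_hits(text, keywords):
    """Return the keywords matched by the tokens found in text."""
    hits = []
    for match in TOKEN.finditer(text):
        for candidate in candidate_tokens(match.group()):
            if candidate in keywords:
                hits.append(candidate)
                break
    return hits
```

Because sub-tokens are only ever formed by dropping labels from the left, domain.com.otherdomain and domain.community.com never produce the candidate domain.com, which is the property a plain word search lacks.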

This program loads the file into an AVL tree (auto balanced, very optimized for searching), and then walks through the input data, tokenizing it as it goes, and looking those tokens up in the AVL. Say there are one million records (which for this purpose, is quite a lot) in the AVL, and 200 word tokens in the headers. Assuming there are no hits in there, each word requires examining just 20 records in the data tree (log2 of one million is a shade under 20, and this is a pure memory operation once the tree is loaded), so worst case, the headers are checked in 4000 record examinations (or a few more than that assuming that some of the word tokens are subsequently broken into sub-tokens: host.domain.com -> domain.com). Seems like a lot of operations, but it really isn't. The most intensive bit is loading the initial wordlist.
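
For anyone who wants to check the arithmetic, here is a back-of-the-envelope version in Python. It assumes a perfectly balanced binary tree, so one lookup examines about log2(n) records; a real AVL tree can be somewhat deeper than that (up to roughly 1.44 times), so treat the figures as ballpark. The function name is made up for the example.

```python
import math


def worst_case_examinations(records, tokens):
    """Estimate record examinations when every token misses in a balanced tree."""
    per_lookup = math.ceil(math.log2(records))  # depth of a perfectly balanced tree
    return per_lookup, per_lookup * tokens


per_lookup, total = worst_case_examinations(records=1_000_000, tokens=200)
print(per_lookup, total)
```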

I had at one time considered loading the message data into an AVL of its own, which would handily eliminate duplicates, but that just trades a few searches in the keyword AVL for searches in the message AVL, and doesn't buy much.

Another optimization, which I might actually decide to perform, is to note the shortest and longest keywords as they are inserted into the keyword AVL, then, when parsing the message, before searching the AVL for a specific token, see if the token in hand is shorter than the shortest keyword or longer than the longest one (two simple, non-lookup checks at that point), and if so, we know that it isn't in the AVL, and can progress to the next check. This would be of particular benefit, say, if you find yourself looking up a line of BASE64 text, which could be skipped easily without lookup overhead.
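
A rough sketch of that length pre-check, again in Python and again with a set standing in for the AVL. The class name KeywordIndex and its lookups counter are inventions for the example (the counter is only there to show how many real lookups were avoided).

```python
class KeywordIndex:
    """Keyword store that rejects tokens of impossible length before any lookup."""

    def __init__(self, keywords):
        self.keywords = {k.lower() for k in keywords}
        lengths = [len(k) for k in self.keywords]
        self.shortest = min(lengths, default=0)
        self.longest = max(lengths, default=0)
        self.lookups = 0  # number of real lookups performed

    def contains(self, token):
        # Two cheap integer comparisons; no tree (or set) access needed.
        if not (self.shortest <= len(token) <= self.longest):
            return False
        self.lookups += 1
        return token.lower() in self.keywords
```

With keywords of 10 and 11 characters, a 72-character line of BASE64 is rejected on length alone and never reaches the lookup.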

> > I have
> > hundreds of domain names that spamassassin is just not catching.

Sounds as if SA is either untrained, or has a limitation. If you use that tool, why not raise this issue with the developers there? Seems that if you're going to use it, it'd make more sense to see that it works properly rather than reinventing the wheel.

I can understand people who don't use SA (myself included) choosing to go through the motions to write stuff to intercept spam, but those who use SA would be better off getting SA improved. At least attempt to develop your fix within the SA framework.

> There are domain-names anywhere in a message: in the Received headers,
> in email addresses, in URLs, etc. etc. Which ones do you mean?

Encoded stuff is a PITA. Might make a lot of sense to pipe the message to a filter to recursively decode it (base64, QP, ordinals, etc) before performing body scans. Additionally, elevating the "spammishness" based on the presence of different encoding tricks would be useful.
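
As an illustration of what such a decoding filter might do, here is a minimal Python sketch using only the standard library (email and html). The function name decode_message and the "tricks" labels are made up for the example, and which encodings actually deserve a spam score bump is a judgment call: plain quoted-printable turns up in plenty of legitimate mail.

```python
import email
import html
import re

# Numeric character references ("ordinals") such as &#104; or &#x68;
NUMERIC_REF = re.compile(r"&#(?:[0-9]+|[xX][0-9a-fA-F]+);?")


def decode_message(raw_message, max_rounds=5):
    """Return (decoded_text, tricks) for a raw message given as a string."""
    msg = email.message_from_string(raw_message)
    tricks = set()
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        encoding = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        if encoding in ("base64", "quoted-printable"):
            tricks.add(encoding)
        # get_payload(decode=True) undoes base64 and quoted-printable.
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # charset name not known to Python
            text = payload.decode("latin-1")
        # Undo character references until the text stops changing,
        # in case they have been applied more than once.
        for _ in range(max_rounds):
            if NUMERIC_REF.search(text):
                tricks.add("ordinals")
            unescaped = html.unescape(text)
            if unescaped == text:
                break
            text = unescaped
        chunks.append(text)
    return "\n".join(chunks), tricks
```

The decoded text is what you'd then run body scans against, and the size of the tricks set is one possible input to a score.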

Note that since SA reportedly deals with decoding (as I've heard - I don't use it, and this list isn't an SA support group), seems like integrating your solution within SA would be much easier.

Ce.|3brex, Fi0ri'c3t, T'ram@do|, U|tr@`m, L3v|'tra, Pr0p3.cia, A:cyc|0vir,
Pr0z:@c, P@x:il, Bu:sp@r

This stuff is notoriously difficult to match since they're deliberately misspelling and messing with the text.


[snip - spam excerpts, thanks!]

style="font-size: 1;">x</font>op 3 at the lo<font style="font-size:
1;">v</font>west pr<font style="font-size: 1;">n</font>ices any<font
style="font-size: 1;">w</font>where.<BR>
 <A href="http://ffr3ws.com/pc/";>Low man<font style="font-size:
1;">r</font>ufac<font style="font-size: 1;">y</font>turer direct p<font
style="font-size: 1;">i</font>ric<font style="font-size:


Suggestion: score on number of occurrences of font tags.
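
To show the idea outside of any particular filter, here is a small Python sketch that counts opening font tags and turns the count into a score. The function name, the weight and the optional cap are assumptions for the example; sensible values would need tuning against real mail.

```python
import re

# Opening <font ...> tags only; closing </font> tags are not counted.
FONT_TAG = re.compile(r"<font\b", re.IGNORECASE)


def font_tag_score(body, weight=1.0, cap=None):
    """Score a message body by its number of opening font tags."""
    score = len(FONT_TAG.findall(body)) * weight
    if cap is not None:
        score = min(score, cap)
    return score
```

A legitimate HTML message might have a handful of font tags; the excerpt above has one every few characters, so even a low weight per tag adds up quickly.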

> > how to do that from the documentation.
>
> Sandbox!

junkmail=sandbox.  on a systemwide basis it's /tmp/junkmail

No, USE A SANDBOX. A sandbox is not a mailbox - it's a testing environment. The name "sandbox" is supposed to be reminiscent of "playing in the sandbox", a construct which is supposed to contain the sand, instead of letting it out all over the place (otherwise, it'd be a sand pile), or more specifically, keeping the sand out of everything else. It also doesn't hurt as much when you fall down in a sandbox.

Refer to my .sig, where you can download a functional one which you can customize to your environ.

Basically, you write your filters, include them into the sandbox (test rig), and run the saved (or constructed) messages against it:


        formail -s procmail -m sandbox.rc < mailboxfile

The sandbox sets up verbose logging (something you might rather wasn't normally running on all of your inbound mail, but which is quite useful in a test environ since you can review the log and then purge it).

Properly implemented, there is ZERO change in your recipe code between how it is used in testing and implementation - no hacking of something to be a bit different because you're using it for testing instead of in a live config, etc -- all that should be handled in the sandbox itself, so your recipe remains stable, and you don't introduce errors into it when tweaking it into your live config.


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

