Re: regex syntax question
2004-03-02 12:56:58
At 09:37 2004-03-02 -0500, Curtis Maurand wrote:
external images and pages are also usually valid links. The latest way
that I've seen it done is by completely obfuscating the link by encoding
it as
"http://..." which translates to
ordinal encoding. There's also BASE64 encoding of the body, deliberate
MIME quoted-printable (or "overquoted-printable" <g>), use of IP addresses
instead of domains...
> Header check first:
>
> :0
> * ^some-header-test
> * B ?? some-body-test
> some-action
too many rules for this to be efficient. I need to be less
discriminatory.
There's a significant logic error in that script anyway, that being that it
assumes you're looking for the text as two separate occurrences - once in
the header and once in the body, instead of one occurrence in either place
(in which case, the scoring one makes the most sense).
I haven't checked, but a sandbox and several *LARGE* messages would confirm
whether a simple scored method:
* HB ?? some-test
using a simple expression would actually check the header first, and
therefore bail quickly if the text is in fact there. However, on multiple
expression checks (or running of external apps), the engine is likely going
to scan the entire message for each expression before moving along to the
next expression. Using maximal scoring and two separate condition lines -
one for the header, and the next for the body, would be a lot more
efficient for large messages, but if you're invoking an external process
(such as grep), you have that invocation overhead to contend with.
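To make the "header first, then body" idea concrete outside of procmail, here is a minimal Python sketch. It says nothing about how procmail's engine actually scans; it only shows the short-circuit being described. The function names are made up for illustration, and it assumes LF line endings with the header ending at the first blank line.

```python
import re
from typing import Optional, Tuple


def split_message(raw: str) -> Tuple[str, str]:
    """Split a raw message into (header, body) at the first blank line."""
    header, _sep, body = raw.partition("\n\n")
    return header, body


def match_location(raw: str, pattern: str) -> Optional[str]:
    """Return 'header', 'body', or None, scanning the header first."""
    rx = re.compile(pattern, re.IGNORECASE | re.MULTILINE)
    header, body = split_message(raw)
    if rx.search(header):
        return "header"  # bail early: the (possibly huge) body is never scanned
    if rx.search(body):
        return "body"
    return None
```

With a large body, a header hit returns without touching the body at all, which is the saving being argued for above.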
I'll look these up. I've had to turn off rbl checking in most things as
one of the rbls started reporting everything as bad.
I have a program "megagrep" (a compiled C++ program) which is intended to
take the place of the following form of grep invocation:
grep [-i] -w -f somefile
'cept that it performs some domain-token and email optimizations on the
input data WRT -w. This allows a datafile that contains:
domain.com
to properly match:
user@domain.com
user@host.domain.com
from host.domain.com [1.2.3.4]
but NOT trip up on:
user@domain.com.otherdomain
domain.community.com
which a simple wordsearch would do.
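The matching rule above can be sketched in a few lines of Python. This is not megagrep's source (which is C++): the names are hypothetical and a plain set stands in for the keyword tree. The assumption is that tokens are runs of letters, digits, dots and hyphens, and that a token matches if it, or any form of it with leading labels dropped, is a keyword.

```python
import re

# Assumed token shape: runs of letters, digits, dots and hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9.-]+")


def suffixes(token: str):
    """Yield the token, then each shorter form with leading labels dropped."""
    labels = token.strip(".").lower().split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])


def find_hits(text: str, keywords: set) -> set:
    """Return the keywords found in text as whole domain tokens or sub-tokens."""
    hits = set()
    for token in TOKEN_RE.findall(text):
        for candidate in suffixes(token):
            if candidate in keywords:
                hits.add(candidate)
                break  # one hit per token is enough
    return hits
```

Because only leading labels are dropped, host.domain.com reduces to domain.com and matches, while domain.com.otherdomain and domain.community.com never reduce to domain.com.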
This program loads the file into an AVL tree (auto balanced, very optimized
for searching), and then walks through the input data, tokenizing it as it
goes, and looking those tokens up in the AVL. Say there are one million
records (which for this purpose, is quite a lot) in the AVL, and 200 word
tokens in the headers. Assuming there are no hits in there, each word
requires examining just 20 records in the data tree (which is a pure memory
operation once the tree is loaded), so worst case, the headers are checked
in 4000 record examinations (or a few more than that assuming that some of
the word tokens are subsequently broken into sub-tokens: host.domain.com ->
domain.com). Seems like a lot of operations, but it really isn't. The
most intensive bit is loading the initial wordlist.
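The arithmetic behind those figures, as a quick Python check. It assumes a perfectly balanced binary tree; a real AVL tree can be somewhat deeper (its worst-case height is roughly 1.44 times the ideal), so treat 20 as the ideal figure rather than a hard bound. The function name is made up for illustration.

```python
import math


def comparisons_per_lookup(records: int) -> int:
    """Depth of a perfectly balanced binary tree holding `records` keys."""
    return math.ceil(math.log2(records + 1))


per_token = comparisons_per_lookup(1_000_000)  # 20
total_for_headers = 200 * per_token            # 4000 for 200 header tokens
```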
I had at one time considered loading the message data into an AVL of its
own, which would handily eliminate duplicates, but that just trades a few
searches in the keyword AVL for searches in the message AVL, and doesn't
buy much.
Another optimization, which I might actually decide to perform, is to note
shortest and longest keywords as inserted into the keyword AVL, then, when
parsing the message, before searching the AVL for a specific token, see if
the token in hand is shorter or longer than any of the keywords in the AVL
(two simple, non-lookup checks at that point), and if so, we know that it
isn't in the AVL, and can progress to the next check. This would be of
particular benefit say if you find yourself looking up a line of BASE64
text, which could be skipped easily without lookup overhead.
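A rough Python sketch of that length pre-check, under the same caveats as before: a set stands in for the AVL tree, and the class and attribute names are hypothetical. The lookups counter is only there to show how many real lookups were avoided.

```python
class KeywordIndex:
    """Keyword store that remembers the shortest and longest keyword lengths."""

    def __init__(self):
        self._keywords = set()  # stand-in for the keyword AVL tree
        self.shortest = None
        self.longest = None
        self.lookups = 0        # number of real lookups performed

    def add(self, word: str) -> None:
        word = word.lower()
        self._keywords.add(word)
        n = len(word)
        self.shortest = n if self.shortest is None else min(self.shortest, n)
        self.longest = n if self.longest is None else max(self.longest, n)

    def contains(self, token: str) -> bool:
        n = len(token)
        # Two cheap comparisons; a token outside the range cannot be a keyword.
        if self.shortest is None or n < self.shortest or n > self.longest:
            return False
        self.lookups += 1
        return token.lower() in self._keywords
```

A 76-character line of BASE64 fails the length test immediately, so it never reaches the tree.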
> > I have
> > hundreds of domain names that spamassassin is just not catching.
Sounds as if SA is either untrained, or has a limitation. If you use that
tool, why not raise this issue with the developers there? Seems that if
you're going to use it, it'd make more sense to see that it works properly
rather than reinventing the wheel.
I can understand people who don't use SA (myself included) choosing to go
through the motions to write stuff to intercept spam, but those who use SA
would be better off getting SA improved. At least attempt to develop your
fix within the SA framework.
> There are domain-names anywhere in a message: in the Received headers,
> in email addresses, in URLs, etc. etc. Which ones do you mean?
Encoded stuff is a PITA. Might make a lot of sense to pipe the message to
a filter to recursively decode it (base64, QP, ordinals, etc) before
performing body scans. Additionally, elevating the "spammishness" based on
the presence of different encoding tricks would be useful.
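As a rough illustration of such a filter, here is a Python sketch using the standard email and html modules. It makes a single pass over the text parts, undoing base64 or quoted-printable transfer encoding and numeric character references, and counts each trick so the caller can raise a score. It is not fully recursive (it won't peel an encoded blob nested inside an already decoded part), and the function name and the shape of the returned counts are assumptions for illustration.

```python
import email
import html
import re
from email import policy

# Numeric character references such as &#104; or &#x68; (the "ordinals").
NUMERIC_ENTITY_RE = re.compile(r"&#(?:\d+|[xX][0-9a-fA-F]+);")


def decode_message(raw: bytes):
    """Return (decoded_text, tricks) for the text parts of a raw message."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    tricks = {"base64": 0, "quoted-printable": 0, "numeric-entities": 0}
    texts = []
    for part in msg.walk():
        if part.is_multipart() or part.get_content_maintype() != "text":
            continue
        cte = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        if cte in ("base64", "quoted-printable"):
            tricks[cte] += 1
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the message
            text = payload.decode("latin-1", errors="replace")
        tricks["numeric-entities"] += len(NUMERIC_ENTITY_RE.findall(text))
        texts.append(html.unescape(text))
    return "\n".join(texts), tricks
```

The decoded text can then be handed to the body scans, and the counts in tricks can feed whatever scoring is already in place.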
Note that since SA reportedly deals with decoding (as I've heard - I don't
use it, and this list isn't an SA support group), seems like integrating
your solution within SA would be much easier.
Ce.|3brex, Fi0ri'c3t, T'ram(_at_)do|, U|tr(_at_)`m, L3v|'tra, Pr0p3.cia,
A:cyc|0vir,
Pr0z:@c, P(_at_)x:il, Bu:sp(_at_)r
This stuff is notoriously difficult to match since they're deliberately
misspelling and messing with the text.
[snip - spam excerpts, thanks!]
style="font-size: 1;">x</font>op 3 at the lo<font style="font-size:
1;">v</font>west pr<font style="font-size: 1;">n</font>ices any<font
style="font-size: 1;">w</font>where.<BR>
<A href="http://ffr3ws.com/pc/">Low man<font style="font-size:
1;">r</font>ufac<font style="font-size: 1;">y</font>turer direct p<font
style="font-size: 1;">i</font>ric<font style="font-size:
Suggestion: score on number of occurrences of font tags.
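One way to sketch that suggestion in Python: count the opening font tags and turn the count into a capped score. The weight and cap are made-up values for illustration, not tuned numbers, and the function name is hypothetical.

```python
import re

# Opening tags only; "</font>" does not match because of the slash.
FONT_OPEN_RE = re.compile(r"<font\b", re.IGNORECASE)


def font_tag_score(body: str, weight: int = 1, cap: int = 20) -> int:
    """Score a body by its number of opening <font> tags, up to a cap."""
    count = len(FONT_OPEN_RE.findall(body))
    return min(count * weight, cap)
```

A legitimate HTML mail might carry a handful of font tags, so in practice you would want the score to kick in only past some threshold found by testing against real mail.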
> > how to do that from the documentation.
>
> Sandbox!
junkmail=sandbox. On a systemwide basis it's /tmp/junkmail
No, USE A SANDBOX. A sandbox is not a mailbox - it's a testing
environment. The name "sandbox" is supposed to be reminiscent of "playing
in the sandbox", a construct which is supposed to contain the sand, instead
of letting it out all over the place (otherwise, it'd be a sand pile), or
more specifically, keeping the sand out of everything else. It also
doesn't hurt as much when you fall down in a sandbox.
Refer to my .sig, where you can download a functional one which you can
customize to your environ.
Basically, you write your filters, include them into the sandbox (test
rig), and run the saved (or constructed) messages against it:
formail -s procmail -m sandbox.rc < mailboxfile
The sandbox sets up verbose logging (something you might rather wasn't
normally running on all of your inbound mail, but which is quite useful in
a test environ since you can review the log and then purge it).
Properly implemented, there is ZERO change in your recipe code between how
it is used in testing and implementation - no hacking of something to be a
bit different because you're using it for testing instead of in a live
config, etc -- all that should be handled in the sandbox itself, so your
recipe remains stable, and you don't introduce errors into it when tweaking
it into your live config.
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.
_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail