procmail

Re: Score and _AND_

2002-10-10 10:40:53
At 18:23 2002-10-10 +0221, Udi Mottelo wrote:

        In the first one procmail scans the message three times.  In the
        second only once.  Is it because of the 1^1?

That expression means procmail will continue scanning the source until the end (otherwise, how could it know whether there were additional hits to affect the scoring?). Any time the number after the caret (^) is nonzero, you can figure you're going to scan to the end of the source to evaluate the condition.
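
As a rough sketch of a case where that full scan is what you actually want (the word "offer", the threshold, the B flag and the folder name are all made up for illustration): with 1^1, every additional hit adds another 1, so every hit has to be counted.

```procmail
# Hypothetical example: file the message only if "offer" shows up
# at least three times in the body.
# Score = -2 + (number of hits); it goes positive on the third hit.
:0 B
* -2^0
*  1^1 offer
many-offers
```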

        Does 1^0 make it scan like (A|B|C)?

No, not at all.

* 1^0 A
* 1^0 B
* 1^0 C

Will attempt *ALL THREE* conditions (regardless of whether the other two matched or not), and as necessary, will scan the entire source (three times) in the attempt to match.

* (A|B|C)

SHOULD scan the body once, and will stop on the first occurrence of any of the three texts. I'm not intimate with the internals of the procmail regexp engine; inside the regexp processor it will still be working over memory multiple times, but that is a very different approach from the three separate conditions above (and the more distinctive the patterns are, say a string versus single characters, the more efficient the regexp engine should become).

        I'm used to breaking _OR_ regexps into score style to make the
        recipes more readable.  Is that wrong (from the performance
        point of view)?

Use maximal scoring:

* 9876543210^0 A
* 9876543210^0 B
* 9876543210^0 C

The actual maximal value is much smaller: procmailsc(5) gives plus infinity as 2147483647, i.e. 2^31 - 1 (as an actual exponent, not as the scoring expression!), but the number above is VERY EASY to remember and is just as effective.

*AS*SOON*AS* there is a match on this scoring, it jumps to the delivery line, skipping the other scoring conditions. If you use some small number, you're going to have to run through ALL of the conditions.
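
For comparison, here is a sketch of the two forms written out as complete recipes. The B flag (search the body) and the folder name are assumptions for the example, not something taken from the original recipes.

```procmail
# Single regexp: one pass, stops at the first of A, B or C.
:0 B
* (A|B|C)
abc-folder

# Maximal scoring: one condition per line for readability; the first
# condition that matches pushes the score to the maximum and the
# remaining conditions are skipped.
:0 B
* 9876543210^0 A
* 9876543210^0 B
* 9876543210^0 C
abc-folder
```

Only one of the two would go in a real rcfile; they are shown together purely to compare them.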

When you have multiple conditions where, say, at least two need to match, you can adjust the score with an initial negative:

:0
* -1^0
* 1^0 word1
* 1^0 word2
* 1^0 word3

Since a score > 0 is a match, starting at -1 means at least two of the conditions need to match in order to make the score positive (assuming that no condition scores higher than 1).

:0
* -1^0
* 1^0 word1
* 1^0 word2
* 2^0 word3

This would require word1 & word2, *OR* word3 (with or without word1/word2).
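
Working the arithmetic through for that last recipe, as comments on a complete version of it (the B flag and the folder name are placeholders added for the example):

```procmail
# Final score for each combination of hits (a score > 0 is a match):
#   none                  -1              = -1   no match
#   word1 only            -1 + 1          =  0   no match
#   word2 only            -1 + 1          =  0   no match
#   word1 + word2         -1 + 1 + 1      =  1   match
#   word3 only            -1 + 2          =  1   match
#   word1 + word3         -1 + 1 + 2      =  2   match
#   all three             -1 + 1 + 1 + 2  =  3   match
:0 B
* -1^0
*  1^0 word1
*  1^0 word2
*  2^0 word3
some-folder
```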

From the efficiency standpoint, this is still scanning the source multiple times. I choose to apply a certain amount of manual optimization to the process and not fret over individual processor cycles. As an example, the above condition could be written with a maximal:

:0
* -1^0
* 9876543210^0 word3
* 1^0 word1
* 1^0 word2

so that if you hit word3, you've met the conditions, without needing to waste time looking for word1 or word2.

Of course, individual conditions might score negative as well (say, as a counterbalance when weighting certain texts as spam: when the message also makes reference to something else, it's less likely to be spam).
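
A small sketch of that kind of counterbalance; the words, weights, B flag and folder name are invented for the example and not tuned for real mail:

```procmail
# "free" and "winner" each push toward spam; "invoice" pulls back.
#   free + winner            1 + 1      =  2   match
#   free only                1          =  1   match
#   free + winner + invoice  1 + 1 - 2  =  0   no match
:0 B
*  1^0 free
*  1^0 winner
* -2^0 invoice
probably-spam
```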

        Also, Sean explained how important it is to learn the
        characteristics of the messages we are going to work on before
        deciding on the algorithm:

This is also the reason you should want to manually optimize by placing the MOST LIKELY condition as the FIRST one in an OR condition, and the LEAST LIKELY FIRST in an AND condition.
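
A sketch of that ordering, assuming a hypothetical mail mix in which the X-Hypothetical-Tag header is rare and mail from example.com is common (header names, address and folder names are placeholders):

```procmail
# AND: plain conditions stop at the first one that fails, so put the
# rarest test first and most messages are rejected after one check.
:0
* ^X-Hypothetical-Tag: rare
* ^From:.*@example\.com
tagged-and-from

# OR via maximal scores: the first hit ends the evaluation, so put the
# commonest test first.
:0
* 9876543210^0 ^From:.*@example\.com
* 9876543210^0 ^X-Hypothetical-Tag: rare
tagged-or-from
```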

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

