At 18:23 2002-10-10 +0221, Udi Mottelo wrote:
In the first one procmail scans the message three times. In the
second, only once. Is it because of the 1^1?
That expression means that it will continue scanning the source until the
end (otherwise, how would it know whether there were additional hits to
affect the scoring?). Any time there is a nonzero value after the caret (^),
you can figure you're going to scan to the end of the source to satisfy the
condition.
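As a rough sketch of why a nonzero exponent forces a full scan (the word
"unsubscribe" and the folder name "bulk" are made up for the example): with
1^1, every hit adds another 1 to the score, so procmail has to count them
all before it knows the total.

```procmail
# Hypothetical recipe: file the message if the body mentions
# "unsubscribe" three or more times.
# Start at -2; each match adds 1, so the total is positive from the
# third match onward.
:0 B
* -2^0
* 1^1 unsubscribe
bulk
```

With 1^0 in place of 1^1, only the first hit would count, so the score could
never climb above -1 here.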
Does 1^0 make it scan like (A|B|C) ?
No, not at all.
* 1^0 A
* 1^0 B
* 1^0 C
Will attempt *ALL THREE* conditions (regardless of whether the other two
matched or not), and as necessary, will scan the entire source (three
times) in the attempt to match.
* (A|B|C)
SHOULD scan the body once, and will stop on the first occurrence of any of
the three texts. I'm not intimate with the internals of the procmail
regexp engine - within the regexp processor, it will be manipulating the
memory multiple times, but that's a very different approach from the above
conditions (and the more distinctive the patterns are - a string versus
single characters - the more efficient the regexp engine should become).
I'm used to breaking _OR_ regexps into score
style to make the recipes more readable. Is that wrong (from the
performance point of view)?
Use maximal scoring:
* 9876543210^0 A
* 9876543210^0 B
* 9876543210^0 C
The actual number for maximal is much less, like 2^32 or thereabouts (as an
actual exponent, not as the scoring expression!), but the above number is
VERY EASY to remember and is just as effective.
*AS*SOON*AS* there is a match on this scoring, it jumps to the delivery
line, skipping the other scoring conditions. If you use some small number,
you're going to have to run through ALL of the conditions.
When you have multiple conditions where, say, at least two need to match,
you can adjust the score with an initial negative:
:0
* -1^0
* 1^0 word1
* 1^0 word2
* 1^0 word3
Since a score > 0 is a match, starting at -1 means at least two of the
conditions need to match in order to make the total positive (assuming that
no condition scores higher than 1).
:0
* -1^0
* 1^0 word1
* 1^0 word2
* 2^0 word3
This would require word1 & word2, *OR* word3 (with or without word1/word2).
From the efficiency standpoint, this is still scanning the source multiple
times. I choose to apply a certain amount of manual optimization to the
process and not fret over individual processor cycles. As an example, the
above condition could be written with a maximal:
:0
* -1^0
* 9876543210^0 word3
* 1^0 word1
* 1^0 word2
so that if you hit word3, you've met the conditions, without needing to
waste time looking for word1 or word2.
Of course, individual conditions might score as negative as well (say, to
counterbalance texts that are weighted as spam: when the message also makes
reference to something else, it's less likely to be spam).
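A minimal sketch of that kind of counterbalancing (the phrases, weights, and
the folder name "spam" are all invented for illustration, not tuned values):

```procmail
# Hypothetical weights: two spammy phrases score +1 each, but a mention
# of "procmail" scores -2, which cancels them out. A final score of 0
# is not a match, so list traffic quoting spam is left alone.
:0 B
* 1^0 mortgage
* 1^0 refinance
* -2^0 procmail
spam
```

A message containing either spammy phrase and no mention of procmail ends up
with a positive score and is filed; one that has both phrases but also
mentions procmail totals 0 and falls through to later recipes.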
Also, Sean explained how important it is to learn the characteristics
of the messages that we are going to work on before deciding on the
algorithm:
This is also the reason you want to manually optimize by placing the
MOST LIKELY condition FIRST in an OR condition, and the LEAST LIKELY
condition FIRST in an AND condition.
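Roughly, the ordering advice looks like this in practice (the header names,
addresses, and folder names below are assumptions made up for the sketch):

```procmail
# AND: all plain conditions must match, and evaluation stops at the
# first one that fails, so lead with the condition that fails most often.
:0
* ^X-Rare-Header:
* ^From:.*@example\.com
rare

# OR via maximal scoring: the first hit settles the recipe, so lead
# with the condition that matches most often.
:0
* 9876543210^0 ^List-Id:.*procmail
* 9876543210^0 ^(To|Cc):.*procmail@
procmail-list
```

Which condition really is the most or least likely depends on your own mail,
so it's worth checking a sample of messages before settling on an order.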
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.