
Re: Score and _AND_

2002-10-08 15:33:26
At 23:35 2002-10-08 +0221, Udi Mottelo wrote:

        I'm just wondering:  Suppose one wants to score three words and
        wants to be sure that all three words exist in the text.
        The only way that I can see is:

:0 B
* word1
* word2
* word3
{
        :0 Bfb
        * 1^1 word1
        * 1^1 word2
        * 1^1 word3
        | /do/something/with $=
}

Seems reasonable.

        Now, we can be sure that  $= >= 3  and that every word appears
        at least once, i.e. if  $= == 5  then word1 could not
        appear 3 times.

Actually, with $= == 5 under the conditions you provide, one of the words could appear three times:

        1       wordX
        1       wordY
        3       wordZ
        =5

        There is a big deficiency in this recipe - procmail passes over
        the data twice.  Any idea?

Sucks when you're scanning the BODY, but you're missing something - your conditions, in sum total, scan the BODY *SIX* times - there are *SIX* conditions!


Note that you can independently scan for each word, and store the score:


WORD1=0
WORD2=0
WORD3=0

:0B
* 1^1 word1
{
        WORD1=$=
}

:0B
* 1^1 word2
{
        WORD2=$=
}

:0B
* 1^1 word3
{
        WORD3=$=
}

# you've scanned the body THREE times, but technically, in your original
# condition, you did as much.

# Now, act upon the SAVED SCORES ONLY, with a precheck that none of the
# variables is ZERO.
:0
* ! WORD1 ?? ^0$
* ! WORD2 ?? ^0$
* ! WORD3 ?? ^0$
* $ ${WORD1}^0
* $ ${WORD2}^0
* $ ${WORD3}^0
| /do/something/with $=


All untested here, so there's bound to be a simple typo or omission on my part, but this all seems a LOT more efficient than what you're presenting.
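
Since none of this is tested, one way to check the intermediate counts is to write them to the procmail log before the final recipe runs. A minimal sketch, assuming LOGFILE is already set elsewhere in your rcfile (NL is just a helper variable holding a newline); it would go between the three scanning recipes and the final scoring recipe:

# Hypothetical debugging aid, not part of the recipes above.
# Assumes LOGFILE is set; NL holds a literal newline.
NL="
"
LOG="WORD1=$WORD1 WORD2=$WORD2 WORD3=$WORD3$NL"

Remove it again once the counts look right.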

Alternatively, for MORE efficiency, if the /do/something bit should only occur when all three keywords have been found, nest each successive operation - the second keyword matches within the action braces of the first, the third within the second, and the do something within the third (without needing to check for zero values, and without needing to SET zero values):

:0B
* 1^1 word1
{
        WORD1=$=

        :0B
        * 1^1 word2
        {
                WORD2=$=

                :0B
                * 1^1 word3
                {
                        WORD3=$=

                        # you've scanned the body THREE times, but
                        # technically, in your original condition, you
                        # did as much.

                        # Now, act upon the SAVED SCORES ONLY
                        :0
                        * $ ${WORD1}^0
                        * $ ${WORD2}^0
                        * $ ${WORD3}^0
                        | /do/something/with $=
                }
        }
}

Hit: three body scans - and ONLY if each successive scan results in a positive (which is also the case with your original - bailing early when there's a failure to match). Compared with yours, this shaves THREE (short) body scans off, and has independent scoring for each matched word.

OTOH, a drawback to this approach is that the initial body scans are COMPLETE body scans, not bail on first match, so if you have a match-match-nomatch condition, you scanned the WHOLEBODY-WHOLEBODY-WHOLEBODY, instead of JUSTTOTHEFIRSTMATCH-JUSTTOTHEFIRSTMATCH-WHOLEBODY. I'm not sure how significant an impact this will have on your average search, but when there IS a match on all three the results will be faster, and when those matches are towards the end of the document anyway, there should be negligible difference in the failed cases.

The simplest solution is:

:0B
* word1
* word2
* word3
* 1^1 (word1|word2|word3)
| /do/something/with $=

If any of the three "first match" body conditions fails, it bails right there; if they're ALL true, then the required portion of the conditions is fulfilled - and the scoring >0 is going to be a GIVEN. However, this involves four scans of the body in a match condition.


It all depends on what you're trying to accomplish, and whether you want intermediate match counts (say, because you want individual variables > 3 or something).
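
As a rough sketch of that last case, a saved count can be tested against a threshold with scoring alone, with no further body scan. This assumes WORD1 was already set by one of the earlier scanning recipes, and the threshold of 3 is only an example:

# Assumes WORD1 already holds a saved count from an earlier recipe.
# The score is WORD1 - 3, and a recipe only fires on a positive score,
# so this acts only when WORD1 is 4 or more.
:0
* $ ${WORD1}^0
* -3^0
| /do/something/with $WORD1

Untested as well, but it only does arithmetic on the saved variable, so it should add no body scans.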

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

