procmail

Re: Help figuring out SCORE-ing

1998-01-25 14:47:50
OK, finally I'm going to reply to Walter Dnes's original post on this thread.
There have been some errors in previous replies to it.

| ]] Start recipie

It's "recipe" ... but small matter.

| :0
| 
| ]] Set initial value to -250.
| ]] Add 200 for each match of "^Subject:.*\!\!\!"
| * -250^0* 200^0 ^Subject:.*\!\!\!

Walter, I gather you meant,
  * -250^0
  * 200^0 ^Subject:.*!!!

An empty regexp is always present, and ^0 essentially says (more about that
later) to find the regexp once and then stop looking for more.  As to the
second condition, let's look at it again:

| ]] Add 200 for each match of "^Subject:.*\!\!\!"

You're close; because the number after the caret is 0, it does not mean for
*each* match, but for *a* match.  See my explanation below.  If the weight
were 200^1, then it would be for *each* match.  Also, an exclamation point
in a procmailrc regexp does not need to be escaped unless it is the first
character, so you can leave out the backslashes there.

| ]] Add 100 for each match of "^Subject:.*\!\!\!\!"
| *  100^0        ^Subject:.*\!\!\!\!

Again, 
  * 100^0 ^Subject:.*!!!!
would do the job, and again, it is for *a* match, not for *each* match.  But
note that a subject line containing four adjacent exclamation points would
have scored 300 by now.

| ]] Add 100 for matching regexp
| ]] "^Subject:.*\<free|sex|opportunity|money|great\>"
| ]]

| ]] Question... what is the significance of the "^1"
| ]] suffix versus the "^0" everywhere else?  Is there
| ]] such a thing as "^2", "^3", etc.?
| *  100^1        ^Subject:.*\<free|sex|opportunity|money|great\>

It's all explained in the procmailsc(5) man page.  On a non-negated regexp
search, the second number says what to score if the expression shows up more
than once.  (On a size test or an exit code test, it means something else,
which I won't get into here; on a negated regexp condition, the second number
is meaningless, because absence of a regexp throughout the search area can
occur only once or not at all, but the syntax requires including something.)

So on a non-negated regexp condition x^0 means "score x if there's a match
and then stop looking."  x^1 means "score x for every non-overlapping match
to the regexp" ... note the word "non-overlapping."
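
To make the ^0 versus ^1 distinction concrete, here is a rough Python sketch.
It is only a model: score_regexp is a made-up helper, Python's re module
stands in for procmail's own matcher, and only second numbers of 0 and 1 are
handled.

```python
import re

def score_regexp(weight, second, pattern, text):
    """Toy model of a non-negated regexp condition weighted weight^second.

    Hypothetical helper, not part of procmail; only second = 0 or 1 is modeled.
    """
    if second not in (0, 1):
        raise ValueError("this sketch models only ^0 and ^1")
    # re.finditer reports non-overlapping matches, scanning left to right.
    hits = sum(1 for _ in re.finditer(pattern, text))
    if hits == 0:
        return 0
    # ^0: score once and stop looking; ^1: score for every match.
    return weight if second == 0 else weight * hits

text = "money money money"
print(score_regexp(100, 0, "money", text))  # 100
print(score_regexp(100, 1, "money", text))  # 300
```

With three occurrences in the text, the ^0 weight contributes 100 once and
the ^1 weight contributes 100 three times.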

It's very rare to have a second number other than 0 or 1 on a regexp
condition, but the procmailsc(5) man page explains the effect fully.

Your example has five alternatives to match on:

 ^Subject:.*\<free
 sex
 opportunity
 money
 great\>

Suppose it actually read like this (as I think you intended):

 *  100^1        ^Subject:.*\<(free|sex|opportunity|money|great)\>

That condition says to score 100 for every subject line that contains any
of those five words ... not to score 100 for every one of those words in
the subject, but 100 for every subject line that contains any of those
words.  So it will never score more than 100 unless there are multiple
subject lines.  You see, it offers five alternative regexps:

 ^Subject:.*\<free\>
 ^Subject:.*\<sex\>
 ^Subject:.*\<opportunity\>
 ^Subject:.*\<money\>
 ^Subject:.*\<great\>

If there is only one subject line, and it is

        Subject: great money-back opportunity for free sex

the only match procmail will find is "<preceding newline>Subject: great".

However, suppose the recipe had conditions like these:

  * ^Subject:\/.+$
  * 100^1 MATCH ?? ()\<(free|sex|money|opportunity|great)\>

It would score 300.  How?  $MATCH would contain
" great money-back opportunity for free sex", and there would be
non-overlapping matches to " great ", " opportunity ", and " free ".  If we
got rid of either or both of the word-border marks, it would score 500.
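
The 300-versus-500 arithmetic can be checked with a small Python model.  The
assumptions are mine: each word-border mark is modeled as one consumed
non-word character (\W), the text is padded with a newline at each end to
stand in for procmail's putative newlines, and count_matches is a
hypothetical helper, not anything procmail provides.

```python
import re

WORDS = r"(?:free|sex|money|opportunity|great)"
BORDER = r"\W"  # assumption: \< and \> each consume one non-word character

def count_matches(pattern, text):
    """Count non-overlapping matches, padding with putative newlines."""
    padded = "\n" + text + "\n"
    return sum(1 for _ in re.finditer(pattern, padded, re.IGNORECASE))

match_text = " great money-back opportunity for free sex"

both = count_matches(BORDER + WORDS + BORDER, match_text)
left_only = count_matches(BORDER + WORDS, match_text)
right_only = count_matches(WORDS + BORDER, match_text)
neither = count_matches(WORDS, match_text)

print(100 * both)        # 300
print(100 * left_only)   # 500
print(100 * right_only)  # 500
print(100 * neither)     # 500
```

With both borders, the space consumed after "great" is no longer available
in front of "money", and the space consumed after "free" is no longer
available in front of "sex", which is why only three words are counted.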

Now let's go back to your original example:

 *  100^1        ^Subject:.*\<free|sex|opportunity|money|great\>

There are five alternates in that test, as I listed above.  Our text reads:

  Subject: great money-back opportunity for free sex

Offhand, I think it would score 200: 100 for "<preceding newline>Subject:
great money-back opportunity for<space>free" and 100 for "sex".  Of course,
the score might be higher if other lines in the head included the strings
"sex", "opportunity", "money", or "great<word border>", but appearances of
"<word border>free" outside the subject wouldn't be counted.

| ]] Add 100 for each match of "^Subject:.*\$"
| *  100^0        ^Subject:.*\$

No, not for "each match".  The second number is 0, so add 100 for *a* match
to "^Subject:.*\$" ... that is, for a subject line that contains a dollar
sign.  But if there are eighteen subject lines and sixteen of them contain
at least one dollar sign, the score added is still 100.

| ]] Subtract 250 for each match of "^Subject: *Re:"
| * -250^0        ^Subject: *Re:

No, for *a* match.

| ]] Subtract 250 for each match of "^Subject: *Fwd:"
| * -250^0        ^Subject: *Fwd:

Again, for *a* match, not for *each*.

|   At the end of the recipie, execute the specified action if
| the accumulator > 0.  The value is lost if <= 0.  In order to
| recover the score in such a case, I have to execute...
| 
| { }
| VARIABLE = $=
| 
| ...immediately after the recipie.

Right.
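
As a sanity check on the whole recipe, here is a Python sketch of how the
corrected conditions add up for a message with a single subject line.  It is
a model under assumptions, not procmail: recipe_score and CONDITIONS are
hypothetical names, every condition is treated as contributing at most once
(true of the ^1 condition only when there is one subject line), and the word
borders are approximated with \W.

```python
import re

# (weight, regexp) pairs modeled on the corrected recipe.
CONDITIONS = [
    (-250, r""),                      # empty regexp: always matches
    (200, r"^Subject:.*!!!"),
    (100, r"^Subject:.*!!!!"),
    (100, r"^Subject:.*\W(free|sex|opportunity|money|great)(\W|$)"),
    (100, r"^Subject:.*\$"),
    (-250, r"^Subject: *Re:"),
    (-250, r"^Subject: *Fwd:"),
]

def recipe_score(header):
    """Sum the weights of all conditions that match (single-subject model)."""
    total = 0
    for weight, pattern in CONDITIONS:
        if re.search(pattern, header, re.IGNORECASE | re.MULTILINE):
            total += weight
    return total

spam = recipe_score("Subject: Make money fast!!!!")
reply = recipe_score("Subject: Re: lunch")

print(spam, spam > 0)    # 150 True  -> the action would be executed
print(reply, reply > 0)  # -500 False -> the action would be skipped
```

In the second case the action is skipped, which is the situation where the
empty braces and the assignment from $= are needed to keep the score.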

|   In addition to the significance of "^0", "^1", etc, I have the
| following questions...
|   1) how is a "moving match" handled?  E.g. will "\!!!"  be
| considered to match "!!!!" twice? (once for first 3 characters
| and once for last 3 characters?)

Nope, only once.  The occurrences must be non-overlapping.  (That's
one of the reasons that procmail's non-zero-width word border marks
[which look exactly like egrep's and perl's word BOUNDARY marks] can
bite you hard.)
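
Python's re module also scans for non-overlapping matches, so it makes a
handy illustration of the same rule (it is an analogy only; procmail's
matcher is a different implementation, and count_non_overlapping is a name
I made up).

```python
import re

def count_non_overlapping(pattern, text):
    """Count matches, resuming the scan after the end of each match."""
    return len(re.findall(pattern, text))

print(count_non_overlapping("!!!", "!!!!"))    # 1
print(count_non_overlapping("!!!", "!!!!!"))   # 1
print(count_non_overlapping("!!!", "!!!!!!"))  # 2
```

Four or five exclamation points hold only one non-overlapping "!!!"; it takes
six to get a second one.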

|   2) similar to 1), if a target word shows up 2, 3, or n times,
| how much is it counted?

That depends on the second number, the one after the caret.  If the weight
is w^x, and x!=0, then the nth non-overlapping occurrence counts for
w*x^(n-1).

Note that for -1 < x < 1, each successive occurrence scores a smaller and
smaller amount; procmail stops looking when the increment gets to be so small
that the math library can't differentiate it from zero.  For x=0, procmail
stops searching after finding one match.
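
The arithmetic for w^x can be written out as a short Python sketch.  The
function name total_score is hypothetical, and the sketch only evaluates the
formula for a given number of matches n; it does not model procmail's
decision about when to stop looking.

```python
def total_score(w, x, n):
    """Total for n non-overlapping matches under weight w^x.

    For x != 0 the k-th match adds w * x**(k - 1); for x == 0 only the
    first match counts.
    """
    if n <= 0:
        return 0.0
    if x == 0:
        return float(w)
    return float(sum(w * x ** (k - 1) for k in range(1, n + 1)))

print(total_score(100, 1, 3))    # 300.0
print(total_score(100, 0.5, 3))  # 175.0
print(total_score(100, 0, 3))    # 100.0
```

With x = 0.5 the three matches add 100, 50, and 25.  For -1 < x < 1 the
increments keep shrinking, so the total can never pass w/(1-x), which is 200
in that example.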

|   3) how many matches of .*\<free|sex|opportunity|money|great\>
| (i.e. 1 or 5) would be counted in a subject like...
| "Great opportunity for free sex; no money required!!!"

If the second number is 0, only one, because procmail knows that any more
will score 0.  If the second number is non-zero, all five: " free", "sex",
"opportunity", "money", and "great ".  But that's *very* *very* different from

 * 100^1 ^Subject.*\<(free|sex|opportunity|money|great)\>

which would score only 100; from

 * 100^1 ()\<(free|sex|opportunity|money|great)\>

which would score 300 (for " Great ", " free ", and " money "); or from

 * 100^1 ^Subject.*\<free|sex|opportunity|money|great\>

which would also score 300 for a very different reason (for "Subject: Great
opportunity for free", "sex", and "money").

There is one exception to the non-overlapping rule: if the second number is
non-zero and a match in the text ends with a newline (real or putative),
procmail will back up one character and start looking for the next match
at, not after, the closing newline of the previous match.  That way, in

 * 1^1 ^something$

the newline that was the $ of one appearance can also be the ^ of the next
one if the expression matches two consecutive lines.  (Unfortunately, that
is not true of \> and \< unless they are matched to newlines.)  As a
consequence, you cannot count lines of text this way,

 * 1^1 ^

because procmail will keep reusing the opening putative newline until it
hits the supremum score (2147483647).  You can count them this way:

 * 1^1 ^.*$
 * -1^0 ($)^^

The reason for the second condition is ... well, something for another time.
