procmail

Re: Scoring on spam

1996-06-16 11:21:58
I had suggested to Era Eriksson:

| dattier(_at_)wwa(_dot_)com (David W. Tamkin) wrote:

|  > Of course, if you're going to test for (,000)+ as an entire alternative
|  > by itself with no need for anything specific to the left or the right of
|  > it, you might as well just test for ,000;

Era replied,

| I was going to comment on this but my message got long enough. :-)
|   This is on purpose; without the plus, $1,000,000,000,000 counts as
| four matches in the system I'm using. But it's a single spook word, in
| the sense I intended. 

|   Of course, if you use a different method, the above argument might
| actually be valid. As long as I'm using this ancient version, I'm
| stuck with nuking $SPAM.*$SPAM.*$SPAM or something fairly much like
| it.

"+" means "one or more," not "two or more."  If you are testing for
$SPAM.*$SPAM.*$SPAM, "1,000,000,000,000" will match

 (,000)+.*(,000)+.*(,000)+

just as it will match

  ,000.*,000.*,000

so there's still no difference.  Your code is not doing what you intended.
"[^0],000(,000)*" (or, since there is nothing on either side of it,
"[^0],000") will make sure that "$1,000,000,000,000" counts only once.
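
As a quick check of the point above, here is a minimal sketch in Python.
Its `re` module is only a stand-in for procmail's egrep-style matching
(an assumption; the two agree on patterns this simple), and the variable
names are illustrative.

```python
import re

subject = "$1,000,000,000,000"

# Three-in-a-row test with and without the plus: both succeed, because
# each (,000)+ can backtrack down to a single ",000".
with_plus = re.search(r"(,000)+.*(,000)+.*(,000)+", subject)
without_plus = re.search(r",000.*,000.*,000", subject)

# Requiring a character other than 0 before the comma leaves one match.
anchored_hits = re.findall(r"[^0],000", subject)
anchored_twice = re.search(r"[^0],000.*[^0],000", subject)

print(bool(with_plus), bool(without_plus))   # True True
print(anchored_hits, bool(anchored_twice))   # ['1,000'] False
```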

|   (I suppose you could just kill anything matching ",000" and hope
| nobody ever starts a thread about "the $1,000 question" or something
| like that ...)

In the code you said you use now, a submission from a suspect site will be
shunted aside for even one occurrence of "$1,000" in the subject.  One
from any other .com site would need at least one more match to $SPAM in
addition to "$1,000" to be rejected, so it's not so bad.  Those from other
domains would need *two* more appearances of $SPAM in the subject besides
"$1,000", so again that's not so bad.
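
The thresholds just described can be sketched as follows. This is a
Python approximation, not the actual recipes: `SPAM` contains only the
three alternatives mentioned in this thread, the sender classes and the
`rejected` helper are hypothetical names, and counting non-overlapping
matches with `findall` stands in for the $SPAM.*$SPAM.*$SPAM chain.

```python
import re

# Stand-in for $SPAM: only the three alternatives mentioned in this thread.
SPAM = re.compile(r"(?:,000)+|\$\$+|!!!+")

# Matches needed before a subject is rejected, per class of sender.
REQUIRED = {"suspect": 1, "com": 2, "other": 3}

def rejected(subject, sender_class):
    """Return True if the subject has enough SPAM matches for this sender."""
    hits = len(SPAM.findall(subject))
    return hits >= REQUIRED[sender_class]

print(rejected("the $1,000 question", "suspect"))   # True
print(rejected("the $1,000 question", "com"))       # False
print(rejected("the $1,000 question!!!", "com"))    # True
print(rejected("the $1,000 question!!!", "other"))  # False
```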

To test for two or more adjacent appearances of ",000" without counting
"$1,000,000,000,000" twice but still counting
"$1,000,000,000,000 and $2,000,000,000,000" twice, you'll need an expression
like
  [1-9]0*,000,000
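
A small Python sketch of how that expression counts, again using `re`
as an assumed stand-in for procmail's matcher; `count_amounts` is a
hypothetical helper name.

```python
import re

# A non-zero digit, optional zeros, then at least two ",000" groups.
ADJACENT = re.compile(r"[1-9]0*,000,000")

def count_amounts(subject):
    """Count non-overlapping matches of ADJACENT in the subject."""
    return len(ADJACENT.findall(subject))

print(count_amounts("$1,000,000,000,000"))                         # 1
print(count_amounts("$1,000,000,000,000 and $2,000,000,000,000"))  # 2
print(count_amounts("the $1,000 question"))                        # 0
```

Each amount can match only at its leading digit, since every later
",000" group is preceded by a zero or a comma rather than by [1-9].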

|   Similarly for "\$\$+" and "!!!+". 

Yes, similarly.  Since you are not extracting with the "\/" operator and not
looking for anything specific left or right of those, \$\$ is just as good as
\$\$+, and !!! as a search expression is just as good as !!!+.
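
To illustrate, a minimal Python check that the plus never changes the
yes/no answer of a plain search. The sample subjects and the
`same_verdict` helper are made up for this sketch, and `re` is assumed
to behave like procmail's matcher for these patterns.

```python
import re

def same_verdict(plain, with_plus, subject):
    """True if both patterns agree on whether the subject matches."""
    return bool(re.search(plain, subject)) == bool(re.search(with_plus, subject))

subjects = [
    "Earn $$ now", "Earn $$$$ now", "Only $5 today",
    "Free!!!", "Free!!!!!!", "Wow!!", "Plain subject",
]

all_same = all(
    same_verdict(r"\$\$", r"\$\$+", s) and same_verdict(r"!!!", r"!!!+", s)
    for s in subjects
)
print(all_same)  # True
```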

| Excellent! Thanks for this example. (I thought it would have been
| easier, really.) Something for the "best-of" archive, IMHO. 

It *would* have been easier if it weren't for the extra requirement to count
"web" in the subject as a problem only from listed suspect sites.