
Re: Scoring for Capitals in the Subject line

2005-04-16 16:58:51
On Sat, Apr 16, 2005 at 09:48:34AM -0700, Bart Schaefer wrote:

:0
* ^Subject:\/.*
{
 :0D
 * -1^1 MATCH ?? [a-zA-Z]
 * -3^1 MATCH ?? [a-z]
 *  4^1 MATCH ?? [A-Z]
 { SEVENTYFIVEPCTCAPS=yes }
}

Good post.  I have a couple of comments, nonetheless.
First, here is a sample message whose Subject I just made
up.  I will show the Subject using a shell alias that I
call "headparse".  Here is the alias, btw -- some may find
it useful:

  formail < \!:$ -zfx \!:1 -s | sed "s/^<//; s/>//"


(I slice off the brackets because I often use the alias
on Message-IDs, and the brackets mess up piped actions.)



Okay, anyway:

 1:15am [~/Mail] 215[0]> headparse Subject $SPAMPLE 
NOW IS THE TIME FOR ALL GOOD men to come

 1:15am [~/Mail] 216[0]> headparse Subject $SPAMPLE | wc -c
      41

 1:15am [~/Mail] 217[0]> headparse Subject $SPAMPLE | tr -d -c '[:lower:]' | wc -c
       9

 1:15am [~/Mail] 218[0]> headparse Subject $SPAMPLE | tr -d -c '[:upper:]' | wc -c
      22


So the first question is, what is 75%?  Seventy-five percent of
the alphabetical chars?  Of all chars?  Here is a line with 40
chars -- remember that wc counts the trailing newline, hence 41 --
of which 22 are upper-case and 9 are lower-case.  (Nine are spaces.
Here there are no other non-alphabetical chars, but if there were,
they would obviously also not count as upper or lower.)
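
For anyone without the alias, the same counts and both candidate
ratios can be had from a plain string.  This is only a rough sketch:
caps_stats is a name made up for illustration, the Subject is passed
in by hand rather than pulled out with formail, and the percentages
are integers rounded down.

```shell
#!/bin/sh
# Count characters in a Subject string and report the share of capitals
# against both candidate denominators (all chars, alphabetical chars only).
# caps_stats is a hypothetical helper, not part of procmail or formail.
caps_stats() {
    subject=$1
    # printf '%s' adds no newline, so wc -c is not off by one here.
    total=$(printf '%s' "$subject" | wc -c | tr -d ' ')
    upper=$(printf '%s' "$subject" | tr -d -c '[:upper:]' | wc -c | tr -d ' ')
    lower=$(printf '%s' "$subject" | tr -d -c '[:lower:]' | wc -c | tr -d ' ')
    alpha=$((upper + lower))
    # Integer percentages, rounded down; guard against division by zero.
    if [ "$total" -gt 0 ]; then pct_all=$((100 * upper / total)); else pct_all=0; fi
    if [ "$alpha" -gt 0 ]; then pct_alpha=$((100 * upper / alpha)); else pct_alpha=0; fi
    echo "total=$total upper=$upper lower=$lower pct_of_all=$pct_all pct_of_alpha=$pct_alpha"
}

caps_stats "NOW IS THE TIME FOR ALL GOOD men to come"
```

For the sample Subject this prints total=40 upper=22 lower=9
pct_of_all=55 pct_of_alpha=70.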

Twenty-two upper-case chars out of 40 total chars is not 75%.
It's 55%.  Furthermore, 22 upper-case chars of 31 alphabetical chars
is still not 75%: it's just under 71%.  Still, this message
"passes" Bart's recipe:

 1:30am [~/Mail] 230[0]> harness $SPAMPLE | tail -15


procmail: Assigning "MATCH="
procmail: Matched " NOW IS THE TIME FOR ALL GOOD men to come"
procmail: Match on "^Subject:\/.*"
procmail: Score:     -31     -31 "[a-zA-Z]"
procmail: Score:     -27     -58 "[a-z]"
procmail: Score:      88      30 "[A-Z]"
procmail: Assigning "SEVENTYFIVEPCTCAPS=yes"
procmail: Assigning "HOST"
procmail: HOST mismatched "panix5.panix.com"
From tachenym(_at_)westonka(_dot_)k12(_dot_)mn(_dot_)us  Sun Apr 17 01:00:24 2005
 Subject: NOW IS THE TIME FOR ALL GOOD men to come
  Folder:                                                                  1528


So something is not kosher about Bart's recipe.  ("Harness" is
my test harness, or sandbox, for procmail.)
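
One way to see why the sample passes is to redo the scoring by hand.
Going by the log above, each ^1 condition under the D flag adds its
weight once per matching character.  With U upper-case and l
lower-case letters, the total is -(U+l) - 3l + 4U = 3U - 4l, which is
positive whenever U/(U+l) > 4/7, i.e. roughly 57% capitals rather
than 75%.  A rough shell sketch of that arithmetic follows;
quoted_recipe_score is a made-up name, and the once-per-character
assumption is taken from the log, nothing more.

```shell
#!/bin/sh
# Emulate the three scoring conditions of the quoted recipe for a given
# Subject string.  Assumes every matching character contributes the weight
# once, as the log output suggests.  quoted_recipe_score is a hypothetical
# name used only for this illustration.
quoted_recipe_score() {
    upper=$(printf '%s' "$1" | tr -d -c '[:upper:]' | wc -c | tr -d ' ')
    lower=$(printf '%s' "$1" | tr -d -c '[:lower:]' | wc -c | tr -d ' ')
    alpha=$((upper + lower))
    # -1 per letter, -3 per lower-case letter, +4 per upper-case letter
    echo $(( -1 * alpha - 3 * lower + 4 * upper ))
}

quoted_recipe_score "NOW IS THE TIME FOR ALL GOOD men to come"
```

This prints 30, the same final score as in the log: -31 - 27 + 88.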

Further, I wish to say that we shouldn't need to count things
in the Subject three times to come up with a useful test for
75%.  Twice is enough.  We only need -- if we're measuring
only against alphabetical chars, and not all chars -- to
have at least three out of four be upper-case.  I'd do
it this way:

:0
* ^Subject:.*\/[^       ].*
{
 :0 D
 *  1^1 MATCH ?? [A-Z]
 * -3^1 MATCH ?? [a-z]
 { MORETHANSEVENTYFIVEPCTCAPS = yes }
} 


A few other points.  First, by insisting on finding a non-space,
non-tab char in the Subject before we bother, we save something, as
some messages come with empty or missing Subjects.  (Not so many,
granted, but some.)  (And that's a space and a tab in the brackets
with the caret in the outer recipe.)  Second, exactly 75% will not
fall through to the assignment action here, because the score is then
exactly zero.  Either call it more correctly "MORETHAN" 75%, or put a
tiny scoring pad in the condition set:

 *  0.1^0
 *    1^1 MATCH ?? [A-Z]
 *   -3^1 MATCH ?? [a-z]
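
To illustrate the boundary case, here is a rough shell sketch of the
same arithmetic with and without the pad.  It relies on procmail
treating a weighted recipe as matching only when the final score is
above zero; caps_verdict is a hypothetical helper and "ABC d" is a
made-up Subject with exactly three capitals out of four letters.

```shell
#!/bin/sh
# Score a Subject as +1 per upper-case and -3 per lower-case letter, plus an
# optional pad, and report whether the total is positive.
# caps_verdict is a hypothetical helper for illustration only.
caps_verdict() {
    pad=$1
    upper=$(printf '%s' "$2" | tr -d -c '[:upper:]' | wc -c | tr -d ' ')
    lower=$(printf '%s' "$2" | tr -d -c '[:lower:]' | wc -c | tr -d ' ')
    # awk handles the fractional pad, which plain sh arithmetic cannot.
    awk -v p="$pad" -v u="$upper" -v l="$lower" 'BEGIN {
        score = p + u - 3 * l
        print score, (score > 0 ? "match" : "no match")
    }'
}

caps_verdict 0   "ABC d"    # exactly 75% capitals, no pad
caps_verdict 0.1 "ABC d"    # exactly 75% capitals, with the pad
```

The first call reports "0 no match" and the second "0.1 match", so
the pad only tips the exact-75% case.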

Okay, gotta hit the hay now. . .   :-)

-- 
dman

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail