Re: crud in subject line

in message 
<20021103122215(_dot_)GA1211(_at_)ancient-scotland(_dot_)co(_dot_)uk>,
wrote Martin McCarthy thusly...

(Sample subject lines)

  Subject: [1$0m]A&2H8 0#:4;g @Z0]Au 0x0m...>H3;9...
  Subject: (1$0m)<x0#@G <1EC@L @N;}@; AB?lGQ4Y!!!
  Subject: ':N:N0! GT22 :80m Aq1b4B 193; CVCJ <:@N;g@LF.'
  Subject: [1$0m]<v@T@/>F?kG0, @O:;<v@T18A& AA@:>F0!?J 180fGO<<?d!
  Subject: Hi Professor, Ultra-Thin Si Inventory 30um & 50um thin 2"-6" in stock...
  Subject: VP9zWn4s5DMxBgSNO7Mf<RLl5X
  Subject: Dates For All 1681HdVP8-2-10

Can someone point me to URLs etc. that discuss using procmail to find
the percentage of unreasonable stuff in a subject line?


Well here's an example of how you can do this kind of thing:

 :0
 * ^Subject:\/.*
 {
   :0:
   * MATCH ?? 10^1 [^0-9a-z ]
   * MATCH ?? -1^1 [0-9a-z ]
   tenpercent
 }

First the subject line is captured into $MATCH.  Then in the nested
recipe:
  10 is added to the score for each character that is not a digit,
letter or space;
  1 is subtracted from the score for each character that is a digit,
letter or space.

Since procmail matches case-insensitively by default, a-z covers
uppercase letters too.  The result is a positive score if non-space
non-alphanumeric characters make up more than 1/11 (roughly 10%) of the
subject line, and any such mails get delivered to the tenpercent mailbox.
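
For experimenting with the weights outside procmail, the arithmetic of
the two conditions can be mimicked in a few lines of Python.  This is
only a sketch: the names crud_score, is_crud, penalty and credit are
made up for illustration, and the function counts the characters of
whatever string it is given, whereas $MATCH as captured above also
contains the blank that follows "Subject:".

```python
import re

def crud_score(subject, penalty=10, credit=1):
    """Score a subject the way the two weighted conditions do.

    crud_score, penalty and credit are illustrative names, not procmail's.
    """
    # procmail matches case-insensitively by default, so the recipe's
    # [0-9a-z ] also covers uppercase letters; spell that out here.
    ok = len(re.findall(r"[0-9A-Za-z ]", subject))
    crud = len(subject) - ok  # everything else counts as crud
    return penalty * crud - credit * ok

def is_crud(subject):
    # Like procmail, treat only a strictly positive score as a match.
    return crud_score(subject) > 0

if __name__ == "__main__":
    for s in ("Dates For All 1681HdVP8-2-10",
              "[1$0m]A&2H8 0#:4;g @Z0]Au 0x0m...>H3;9..."):
        print(crud_score(s), is_crud(s), s)
```

With the default weights the score turns positive once crud exceeds one
character in eleven, so raising penalty lowers the tolerated percentage.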


Thanks, Martin.  Here is my contribution (remember that $subj is
extracted from Subject: in a recipe not shown, and '\t' stands for an
actual tab)...

  #  pattern to "mark" a reply
  RE_COUNT = "([\[\(\{<][0-9]+[\]\)\}>]|[0-9]+)"

  :0
  {
    REPLY = "([aA][wW]|[rR][eE]|[fF][wW][dD])(([ \t]*${RE_COUNT})?[ \t]*:|:[ \t]*${RE_COUNT})"

    :0
    { RE_BLANK = "[ \t]*${REPLY}[ \t]*" }
  }

  #  A- mail w/ mundane or missing subjects
  #
  :0:
  * $  9876543210^0  subj  ??  ^^(${RE_BLANK})?^^
  * $  9876543210^0  subj  ??  ()(your mail|no subject)
  Ignore/x.junk

    #  B- subject w/ lowercase-single-word subject lines
    #
    #:0 D E:
    #* $ subj  ??  ^^(${RE_BLANK})?[a-z]+^^
    #Ignore/x.junk

    #  C- subject w/o any lowercase character
    #  (this sure caught the ^^3^^ thread, so be aware.)
    #
    :0 D E:
    * $ subj  ??  ^^(${RE_BLANK})?[^a-z]+^^
    Ignore/x.junk-more

    #  D- an attempt to identify (mainly) language encoded subjects
    #
    :0 E:
    * $ subj  ??  ^^(${RE_BLANK})?[^a-z]*[-_.,+:/=?a-z0-9]+[^a-z]*^^
    Ignore/x.junk-more


...what is really needed is a language parser to tell how much of the
subject is recognizable language (like phrases, words, etc.) and how
much of it is made up of junk character groups.
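
Short of a real language parser, a crude approximation is to check what
fraction of the whitespace-separated tokens appear in a word list.  The
sketch below assumes Python and a tiny hard-coded KNOWN_WORDS set
standing in for a proper dictionary; the names recognizable_fraction and
KNOWN_WORDS are invented here, and nothing of the sort is built into
procmail.

```python
import string

# Hypothetical stand-in for a real word list (e.g. a system dictionary file).
KNOWN_WORDS = {"hi", "professor", "thin", "inventory", "in", "stock",
               "dates", "for", "all"}

def recognizable_fraction(subject, known_words=KNOWN_WORDS):
    """Return the share of tokens that are known words, from 0.0 to 1.0."""
    tokens = subject.split()
    if not tokens:
        return 0.0
    # Strip surrounding punctuation so "Professor," still counts as a word.
    hits = sum(1 for t in tokens
               if t.strip(string.punctuation).lower() in known_words)
    return hits / len(tokens)
```

A subject whose fraction falls well below some chosen cutoff could then
be treated like the crud cases above; where to put that cutoff is left
open.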


-- 


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
