Re: A tool for refining regex

PSE-L(_at_)mail(_dot_)professional(_dot_)org (Professional Software 
Engineering) writes:

You could set variables based on each set of tests which a message
tripped on, then


Saw your addendum ... thanks

[...]

Some false hits are most easily avoided by greenlisting some lists or
posters.


Greenlisting?

Depends on the nature of your email.  For instance, I use
scoring to accumulate a weighted count of things such as exclamation
points (which appear frequently in spam, because spammers are just so
enthusiastic).  Programming lists (including this one), as well as
digests, shouldn't be subjected to such logic because they either make
extensive use of bangs, or (in the case of digests) represent a large
number of messages.  I've thought about allowing X bangs per KB of
message or something, but haven't sat down to figure out how to apply
such logic.
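
The "X bangs per KB" idea could be prototyped outside procmail before deciding how to express it in a recipe. Below is a minimal Python sketch under stated assumptions: the function names `bangs_per_kb` and `too_enthusiastic` and the limit of 2.0 are made up for illustration and are not anything procmail provides. It only normalises the exclamation-point count by the size of the body, so that a long digest is not penalised for its length.

```python
def bangs_per_kb(body: str) -> float:
    """Return the number of '!' characters per 1024 bytes of body text."""
    size = len(body.encode("utf-8"))
    if size == 0:
        return 0.0
    return body.count("!") * 1024 / size


def too_enthusiastic(body: str, limit: float = 2.0) -> bool:
    """True when the bang density exceeds `limit`.

    `limit` is an arbitrary illustrative threshold, not a tuned value.
    """
    return bangs_per_kb(body) > limit
```

A short spam-like body such as "Buy now!!!" has a very high density, while a multi-kilobyte digest with a handful of bangs stays under the limit.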

The rub is that my .log file doesn't show me a very detailed report of
what was hit.  I have full logging on


Ouch.  If you define your spam rules within a separate file, which you
includerc into your main procmail.rc, then you can also include them
from a standalone testing .rc file, which defines a different logfile


Full logging is no problem here.  It's a single-user setup and I only
get some 300 messages daily.  Thanks for the tip about pulling it in
from INCLUDERC though.

I actually have a working system something like that, but more
primitive.  I have a test area set up and a skeleton .procmailrc that
sets the test-area MAILDIR, ORGMAIL, DEFAULT, test-area logging and other
defaults.

I just yank the real working .procmailrc, minus the head, into the
skeleton, then concatenate my directory of false hits or real hits,
depending on what I'm testing, and I'm in business.  I have
pre-established commands, easily edited in .inputrc, that I can
put on the command line with a key combo.  Then I cat the mail through
my test apparatus and analyze the results.

That last part is where the rub is.  I want to let procmail do most of
it by showing what was hit...exactly.  I will then be able to set the
regex accordingly or insert a new recipe as needed.

Eventually I should acquire a pretty tight spam trap setup with little
work.

32133 procmail: Match on 
"^To:.*@pop.newsguy.com|^Received:.*\
(\.tw |\.kr|[^0-9.]202\.|[^0-9]211\.|\
[^0-9.]6[1-6]\.|bogota\.supernet\.com\.co)"


Yea, not very useful.

    :0 D
    * ^To:.*@pop.newsguy.com|^Received:.*\
       (\.tw |\.kr|[^0-9.]202\.|[^0-9.]211\.|\
       [^0-9.]6[1-5]\.|bogota\.supernet\.com\.co)


:0D
* Other_conditions_just_like_you_have_them_but_for_efficiency_do_them_first
* 1^1 ^\/To:.*@pop\.newsguy\.com
* 1^1 ^\/Received:.*\/(\.tw |\.kr|[^0-9.]202\.|[^0-9.]211\.|\
         [^0-9.]6[1-5]\.|bogota\.supernet\.com\.co)
spam_suspect2.in

(note that in your recipe, you're not LOCKING, and you're also not
escaping the dots in your mail host)


I've never seen a problem from not using a lockfile here, or really
in 90 percent of my recipes.  Out of some 30 or so, I have one that
locks.  Most have been in use for at least months.

I'm not saying it's smart or right, only that I haven't seen a problem I
recognized to be caused by not locking.  What would such a problem
look like?

Concerning the host escaping:  I haven't seen a false hit I tracked to
being caused by that... probably sloppy, all right, but it seemed much more
important in the host numbers part.

With the scoring, I'm busting your combined expression out into
separate ones - that allows you to see the individual lines which


I would have thought that would cause a whole different action, since
then both must match.  But apparently the odd looking notation `1^1'
means something I have yet to learn about.  Grepping several of the
procmail manpages turned up no examples of its use.  What is it?
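
For what it's worth, the `w^x` prefix appears to be the weighted-scoring notation described in procmailsc(5): a regex condition contributes w for its first match, w*x for the second, w*x*x for the third, and so on, and the recipe fires if the total score ends up positive. With `1^1` every match simply adds 1, so the scored conditions behave like a counted "or" rather than an "and". The Python below is only an arithmetic sketch of that rule as I understand it; `condition_score` is a made-up name, and real procmail has further special cases that are not modelled here.

```python
def condition_score(w: float, x: float, matches: int) -> float:
    """Score contributed by one `* w^x regex` condition.

    The n-th match (counting from 1) adds w * x**(n - 1), so the total
    for `matches` matches is a finite geometric sum.
    """
    return sum(w * x ** n for n in range(matches))


def recipe_fires(scores: list[float]) -> bool:
    """A scored recipe is taken when the accumulated score is positive."""
    return sum(scores) > 0
```

For example, `condition_score(1, 1, 3)` gives 3 (three matching Received: lines each add 1), and a recipe with two `1^1` conditions fires as soon as either of them matches at least once.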

would have matched when you have verbose logging on (perhaps BOTH the
To: and a Received: would have matched - but with a combined
expression, you wouldn't know).


If I had the line that matched, I think it would be fairly easy to
tell what did it even with all those or things.

[...]

 With the scoring number, you'll also
have a feel for how many times an individual expression was matched -
and EACH one will be emitted in the verbose log, although only ONE
will remain in $MATCH to be used by your script.  Also, unless you
tack a '.*' onto the ends of those lines, the expression will
terminate at the part of the line which actually matches -


This is all starting to sound very complicated... Not to be a
ne'er-do-well slacker, but I had the idea this could be done in a much
lazier way by letting procmail show the way.  Nothing so
sophisticated as heuristics or mathematical formulas.

Can't I just make procmail capture the match and plop it into my log
somehow?  Maybe that is what you meant by your comments about setting
variables?

Even with all the or (|) characters it would be no problem to analyze
the match if the line it matched were known.  I had in mind something
like this:

Once procmail knows the destination it will write to, I want the line
that tripped any regex not containing a (!) to appear in my log.
Nothing too fancy or thought provoking, just let the tool show me what
is needed.

Maybe this is not the way procmail can be made to work... I couldn't think
of a way to do that with my limited experience.

Even a surefire way to translate procmail regex to egrep would allow
me to isolate the matching line pretty easily.  But I find it's not
that easy to get the same regex to work the same way with those tools.
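
As a rough stand-in for that egrep translation, something like the following Python sketch could report which header line a condition hits and what text it matched. It rests on several assumptions: Python's `re` is not procmail's regex engine, so results can differ; the recipe's backslash-newline continuations must already be joined into one pattern; procmail's `\/` match marker is simply stripped; and the names `unfold_headers` and `matching_lines`, as well as the sample header in the demo, are made up for illustration.

```python
import re


def unfold_headers(raw_header: str) -> list[str]:
    """Join continuation lines (starting with space or tab) onto their header."""
    lines: list[str] = []
    for line in raw_header.splitlines():
        if line[:1] in (" ", "\t") and lines:
            lines[-1] += " " + line.strip()
        else:
            lines.append(line)
    return lines


def matching_lines(raw_header: str, pattern: str,
                   case_sensitive: bool = False) -> list[tuple[str, str]]:
    """Return (header line, matched text) for every header line the pattern hits."""
    # procmail ignores case unless the recipe carries the D flag.
    flags = 0 if case_sensitive else re.IGNORECASE
    # The procmail-only `\/` marker has no meaning in Python, so drop it.
    regex = re.compile(pattern.replace(r"\/", ""), flags)
    hits = []
    for line in unfold_headers(raw_header):
        m = regex.search(line)
        if m:
            hits.append((line, m.group(0)))
    return hits


if __name__ == "__main__":
    # Made-up header, for demonstration only.
    header = (
        "Received: from mail.example.kr (unknown [211.1.2.3])\n"
        "\tby mx.example.net; Mon, 1 Jan 2001 00:00:00 +0000\n"
        "To: someone@example.net\n"
    )
    pattern = r"^Received:.*(\.tw |\.kr|[^0-9.]202\.|[^0-9.]211\.)"
    for line, matched in matching_lines(header, pattern, case_sensitive=True):
        print(repr(matched), "<-", line)
```

Because `.*` is greedy, the matched text runs up to the last alternative that hits on the line, which at least shows which part of the "or" list tripped.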
_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail