procmail

matching nulls

2002-01-08 12:12:52

I've been fighting with the problem that procmail's \< and \> match
actual characters rather than nulls (the empty string) at the beginning
and end of words.  This doesn't
match the behaviour of grep, for which "the symbols \< and \>
respectively match the empty string at the beginning and end of a word",
according to the man page.  Grep also gives us "the symbol \b [which]
matches the empty string at the edge of a word".

All of these would be really useful.  The fact that \< and \> behave
differently from grep is ... annoying.
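
To show the difference outside procmail, here is a minimal Python sketch.
Python's re module has \b, which is zero-width like grep's boundaries.  The
second pattern is only my emulation of what I see from procmail (one
non-word character gets consumed), and the sample header is made up.

```python
import re

# Made-up sample header for illustration.
header = "Return-Path: <someone@hotmail.com>"

# Zero-width boundary, as grep documents for \> and \b:
# nothing after "com" becomes part of the match.
zero_width = re.search(r"hotmail\.com\b", header)

# Emulation (an assumption, not procmail's actual code) of a consuming \>:
# one non-word character is swallowed into the match.
consuming = re.search(r"hotmail\.com[^a-zA-Z0-9_]", header)

print(zero_width.group(0))
print(consuming.group(0))
```

The first match stops at "com"; the second drags the trailing ">" along,
which is exactly the part I don't want in MATCH.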

I have a variable which I use for tracking down various things.  It is:

FREESERVICES="((chtah|wongfaye|i-?(quebec|france|suisse)|consultant|penn|comic|address|0845dial|mail(metoday|andnews)|(opera|dreame|eudora|tan|dubai)mail)\.com|((desert|free)?mail|voila|uol(\.com?)?)\.(ar|co|cz|fr|hu|mx|ve)|(slon|s5|k2|amis|siol|netsi|sint|mails|ukf(antastic)?)\.net|(yahoo|netscape|aol|msn|excite|lycos|juno|china|arabia|post1|([hg]ot|turbo|bta|gentle|cara|cheetah|rediff)mail|arabia|angelfire|uole|usa|bellsouth|maxleft|(mail|123)?india)\.(com|net(\.cn)?|org|ca)|k\.ro|peace\.is)"

(I realize it's messy, matches a few that are for-pay services and some
unrelated domains, but that doesn't matter; I'll clean it up if it
starts causing me problems.)

I have rules like:

 :0
 * $  ^Return-Path:.*@\/$FREESERVICES
 * !$ ^Received:.*\.$MATCH
 {
  :0 fw
  * $  ^From:.*@\/$FREESERVICES
  * !$ ^Received:.*\.$MATCH
  | formail -A "X-spamtrap: return address is free provider, but message didn't originate there"
 }
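
For anyone who doesn't read procmail fluently, the outer two conditions
amount to roughly the following Python sketch.  The shortened FREESERVICES
pattern and the looks_forged name are stand-ins I made up for illustration,
and unlike the recipe it escapes the matched domain before reusing it.

```python
import re

# Hypothetical, heavily shortened stand-in for the real FREESERVICES.
FREESERVICES = r"(?:(?:hot|got)mail|yahoo)\.(?:com|net)"


def looks_forged(headers: list[str]) -> bool:
    """True if Return-Path names a free provider that no Received line mentions."""
    domain = None
    for line in headers:
        # Rough equivalent of: * $ ^Return-Path:.*@\/$FREESERVICES
        m = re.match(r"Return-Path:.*@(" + FREESERVICES + ")", line, re.IGNORECASE)
        if m:
            domain = m.group(1)  # plays the role of MATCH
            break
    if domain is None:
        return False
    # Rough equivalent of: * !$ ^Received:.*\.$MATCH
    received = re.compile(r"Received:.*\." + re.escape(domain), re.IGNORECASE)
    return not any(received.match(line) for line in headers)
```

Calling looks_forged() with a Return-Path at hotmail.com and no Received
line containing ".hotmail.com" returns True, which is the case the
X-spamtrap header is meant to flag.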

I would very much like to be able to ensure that whatever this variable
matches is not followed by additional text, but I can't surround the
outer atom in FREESERVICES with \< and \> without also matching the
delimiting characters themselves, giving a match like "@hotmail.com>".
I could crop what I match with:

 * MATCH ?? [^a-z0-9]*\/[a-z0-9.-]+

but that seems awkward when all I really want is a version of \< that
matches the null before a word rather than the non-word character,
the same way grep does.
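
In Python terms, that cropping step does something like the sketch below.
The crop_match name is made up, and the capture group stands in for
procmail's \/ operator, so treat it as an approximation.

```python
import re


def crop_match(match: str) -> str:
    # Hypothetical helper approximating the crop condition quoted above:
    # skip leading non-alphanumerics, then keep the run of
    # hostname characters that follows.
    m = re.search(r"[^a-z0-9]*([a-z0-9.-]+)", match, re.IGNORECASE)
    return m.group(1) if m else match


print(crop_match("hotmail.com>"))
```

It works, but it is an extra condition on every recipe just to undo a
character that a zero-width \> would never have matched.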

Is this an unwarranted whine?  Am I just being uppity and lazy about an
insignificant design issue here?  Should I really use constructs like:

 * $ ^Return-Path:.*@\/$FREESERVICES\>
 * MATCH ?? \/[a-z0-9.-]+

or include the \< and \> in the FREESERVICES variable and make sure all
my uses of it are wrapped to suit?  Or might it possibly be safe to
assume that FREESERVICES will never be immediately followed by [a-z0-9_]?
Or is there some way to match nulls that I simply haven't been able to
find?  Is there some undocumented variable I can set which tells
procmail to treat \< and \> the way grep does instead of the way it
historically has treated them?

Wisdom would be appreciated, even if it's just telling me to stop
whining.  ;)

-- 
  Paul Chvostek                                     <paul(_at_)it(_dot_)ca>
  Operations / Development / Abuse / Whatever       vox: +1 416 598-0000
  it.canada                                            http://www.it.ca/

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
