[This was in response to a personal message from Gregory Woodhouse,
but since I'm arrogant enough to think that it might be of general
interest, I'm forwarding it now to the list, with a couple edits.
Gregory's text is included here with his permission. -Philip]
"Gregory J. Woodhouse" <gjw(_at_)wnetc(_dot_)com> writes:
> I think the original poster was looking for something like the ability to
> negate a single element within a regexp as in
> * ^FROM.*@[^alpha].beta.org
> The intention is that addresses like
> person1@gamma.beta.org or
> person2@delta.beta.org would match, but
> person3@alpha.beta.org would NOT
> match. Think of the regexp syntax of sed.
Well, in a sense, that's what I did interpret it as. I just optimized
my answer some. Taking your example, the normal way of
writing it would be:
* ^From.*@([a-z0-9_-]+\.)*beta\.org
* ! ^From.*@alpha\.beta\.org
Since his example was simply (to use your syntax):
* Subject:.*[^script]
this can be written normally as:
* ^Subject:.*
* ! ^Subject:.*script
The first of those happens to be a no-op, as a Subject: line is
required by rfc0822 and MTAs should (sendmail does) insert a blank one
if it's missing. Therefore I simply dropped that condition in my
response.
My main point is that if you try to describe in English what the above
(pseudo-)regexps are saying, you end up saying something like, "...that
match 'foo' but not 'foobar'...". Sounds like two conditions to me!
Heck, the last several times that this has come up in
comp.lang.perl.misc, TomC has recommended using two expressions anded
instead of a single grodie [sic] expression (which perl5's lookahead
extension does allow).
Besides, I'm not sure such a negation operator is really
definable. For example, in the following:
* ^Subject: foo[^bar]baz
what does that *mean*? Does that mean that the 'foo' and 'baz' (which
are mandatory) can be separated by anything except 'bar'? If so, what
about the following:
* ^TO_guenther@[^lunen].gac.edu
Does that mean that the 'guenther@' and '.gac.edu' can be separated by
anything except 'lunen'? That would match
To: guenther@stolaf.edu, jones@lunen.gac.edu
or even
To: guenther@lunen.gac.edu,
    jones@lunen.gac.edu
For a hypothetical negation operator to work/be useful, you need to be
able to define what _is_ taking the space that the negation operator
occupies. At that point, you've basically turned it into two regexps,
one which must match and one which must not. Yes, it means the regexp
engine has to run through the common part twice, but for something like
this, the cost is not a tragedy.
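To make the "one which must match and one which must not" form concrete,
here is a minimal sketch of a complete recipe for the beta.org example
from earlier in this message. The folder name "not-alpha" is made up for
illustration, and the loose ^From.* prefix is kept as it appeared in the
conditions above.

```procmail
# Hypothetical recipe: file mail from any host under beta.org
# except alpha.beta.org.  "not-alpha" is an assumed folder name.
:0:
* ^From.*@([a-z0-9_-]+\.)*beta\.org
* ! ^From.*@alpha\.beta\.org
not-alpha
```

Both conditions must hold for the recipe to fire, so the second one does
the job the hypothetical negation operator was meant to do.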
Perhaps for the *really* difficult cases where a negation operator
looks good, one should instead use \/ to throw the rest of the line
into $MATCH and then have your conditions match against that (though a
nesting block and a temp variable may be needed). That would even save
the duplicate regexp compilation of the initial common part.
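A rough sketch of that $MATCH approach, again using the beta.org example,
might look like the following. The variable name HOST and the folder name
"not-alpha" are hypothetical, and this version captures only the hostname
after the "@" rather than the whole rest of the line, so treat it as one
possible shape and not a tested recipe.

```procmail
# Sketch only: HOST and "not-alpha" are assumed names.
:0
* ^From:.*@\/[a-z0-9_.-]+
{
  # MATCH now holds the text matched to the right of \/ (the host
  # part).  Copy it to a temp variable in case a later condition
  # uses \/ again and overwrites MATCH.
  HOST=$MATCH

  :0:
  * HOST ?? ^([a-z0-9_-]+\.)*beta\.org$
  * ! HOST ?? ^alpha\.beta\.org$
  not-alpha
}
```

The common "^From:.*@" part is matched once by the outer recipe, and the
positive and negative tests then run only against the captured text.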
Philip Guenther
----------------------------------------------------------------
Philip Guenther UNIX Systems and Network Administrator
Internet: guenther(_at_)gac(_dot_)edu Phonenet: (507) 933-7596
Gustavus Adolphus College St. Peter, MN 56082-1498
Source code never lies (it just misleads). (Programming by Purloined Letter?)