From: Paul Chvostek [mailto:paul(_at_)it(_dot_)ca]
Sent: Wednesday, August 28, 2002 11:14 PM
To: Dallman Ross
Cc: Procmail List
Subject: Re: Local domain forgery detection?
On Wed, Aug 28, 2002 at 09:52:19PM +0200, Dallman Ross wrote:
The ^ and $ imply line start or end (they are interchangeable
in procmail, but we tend to use them linearly). Actually,
they each mean the literal newline char. "^^" means the
leftmost edge or rightmost edge of the field being examined.
If I've misstated something, I look forward to correction.
Well, the traditional use of ^ and $ is to match a null at the start
or end of a line, respectively. As such, they do not actually match
any character, including a null. Your explanation appears to imply
that they behave like \< and \>, which is not the case.
David Tamkin has since addressed this. His explanation of "putative
newlines" sounds reasonable to me.
Whenever I do procmail things, I try to make sure my regexps are as
close to "real" regexps as possible. Since ^^ is procmail-specific,
I have never used it. If there's a significant performance issue
with using the standard regexp atom, I'll reconsider....
Well, you're in procmail, right? So what's wrong with using procmail's
syntax? Anyway, in your example it probably won't ever matter. But
there are cases when you don't want just ^ or $ when you meant ^^.
One can force newlines into MATCH. Do you want to examine the start
of MATCH, or the start of any line in MATCH?
Here is an example from my .procmailrc. I was saving the value of
the Cc: header, if one existed, to a private variable called, oddly
enough, "CC". I was counting the number of address incantations
with the ATCOUNT thingie I posted yesterday. Then I started getting
spam with multiple Cc: lines. The ones past the first weren't being
saved to CC. I had to revise my MATCH recipe. It took some thought
and work (and the collaboration of a friend). I'll share it now. I
consider it prize procmailese. :)
Remember, $WS has been set in my rc to a space and a tab up above
somewhere.
:0 # find and save value of Cc:(s), if such exist(s):
* $ ^Cc:.*\/[^$WS].*(^Cc:.*)*
{ CC = $MATCH }
:0 E # else check for empty Cc:
* $ ^Cc:[$WS]*$
{ CC = [empty] }
The variable "CC" now contains the value of a Cc: header (if any
existed) *and all further Cc: headers that might exist in the mail*
(along with, unfortunately, any intervening headers, but that's
an outlier case and I don't care; in fact, I've not yet seen one).
Line breaks are, of course, included if there are multiple Cc: headers.
If there is no Cc: header, then the value of CC remains unset.
That happens in the first of the paired recipes.
If there is only one Cc: header, but it is empty, we set CC
to "[empty]". This we do because of Rule 2, "Spammers are stupid";
most empty Cc: fields I've seen were in spam. (Remember, from my
previous posts, I've said that my point in my spam tests is to
generate various quanta or indicia of spam, which I then combine
later on to get high levels of precision in identifying spam
or letting through non-spam.)
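For readers who don't speak procmail, the three outcomes described above (CC unset, CC set to "[empty]", CC holding one or more values) can be sketched in Python. This is only a simplified illustration, not a translation of the recipes: the function name `cc_value` and its list-of-lines input are assumptions made here, and unlike the recipe it skips intervening non-Cc: headers rather than capturing them.

```python
from typing import List, Optional

def cc_value(header_lines: List[str]) -> Optional[str]:
    """Mimic the three outcomes described above for the CC variable."""
    # Text after "Cc:" on every Cc: header line (case-insensitive).
    values = [line[3:].strip() for line in header_lines
              if line.lower().startswith("cc:")]
    if not values:
        return None        # no Cc: header at all: CC stays unset
    filled = [v for v in values if v]
    if not filled:
        return "[empty]"   # Cc: present but blank
    # First value bare, later ones keep a "Cc: " prefix, as in the log.
    return "\n".join([filled[0]] + ["Cc: " + v for v in filled[1:]])
```

The "[empty]" marker is what would later be treated as one spam indicium among several.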
Okay, back to your assertions: Suppose CC has a value of -- wait,
let me search my recent procmail logs --
>>>>> START LOG FOR NEW MESSAGE <<<<<
: We're exiting Section ENV
: We're entering Section HEADERS
===> FROM is >"Eric Whitney" <nnt(_at_)aol(_dot_)com><
===> SUBJECT is >Highest Rated Insurers, Instant Quotes, Hassle-Free!<
===> TO is >"mike(_dot_)kleintank(_at_)ccc-bbs(_dot_)com" <mike(_dot_)kleintank(_at_)ccc-bbs(_dot_)com><
===> CC is >mike(_dot_)kleintank(_at_)ccc-bbs(_dot_)com
Cc: rick(_dot_)bales(_at_)ccc-bbs(_dot_)com
Cc: scott(_dot_)glasgow(_at_)ccc-bbs(_dot_)com
Cc: tim(_dot_)nourie(_at_)ccc-bbs(_dot_)com
Cc: holger_busse(_at_)ccc-bbs(_dot_)lifenet(_dot_)org
Cc: alanv72(_at_)ccc-cable(_dot_)net
[50 lines deleted!!!!!!!!!!!!!!!!!!!!!!!!!]
Cc: rxb10(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: rxh20(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: rzc10(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: sec00(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: sgs00(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: slh10(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: sqn00(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: swt10(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: tad10(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: tjc50(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: ttp20(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: wba10(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: wem00(_at_)ccc(_dot_)amdahl(_dot_)com
Cc: bluebird(_at_)ccc(_dot_)at
Cc: cybergod(_at_)ccc(_dot_)at
Cc: gerhard(_at_)ccc(_dot_)at
Cc: haidner(_at_)ccc(_dot_)at
Cc: hzimmer(_at_)ccc(_dot_)at
Cc: illsin(_at_)ccc(_dot_)at<
You can see by now, I think, that ^ or $ will find the start of any
particular line in the CC value; while ^^ starts at the top (the chars
past the ">" in my log) or anchors to the end (the chars left of "<"
in my logs).
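The same distinction can be sketched in Python for readers more used to other regex dialects. This is an analogy only, not procmail's engine: with re.MULTILINE, Python's ^ anchors at the start of every line (roughly what ^ does against a multi-line value), while \A and \Z anchor only at the very start or very end of the whole value (roughly what a leading or trailing ^^ does). The sample value is a shortened, made-up stand-in for the CC value in the log.

```python
import re

# Hypothetical, shortened stand-in for a multi-line CC value.
cc = "mike@example.com\nCc: rick@example.com\nCc: scott@example.com"

# Analogue of ^ : anchors at the start of any line in the value.
any_line = re.findall(r"^Cc: (\S+)", cc, re.MULTILINE)

# Analogue of a leading ^^ : anchors only at the very top of the value.
cc_at_top = re.search(r"\ACc: ", cc)
mike_at_top = re.search(r"\Amike@", cc)

# Analogue of a trailing ^^ : anchors only at the very end of the value.
scott_at_end = re.search(r"scott@example\.com\Z", cc)
rick_at_end = re.search(r"rick@example\.com\Z", cc)
```

Here `any_line` picks up the addresses on the second and third lines, while only `mike_at_top` and `scott_at_end` succeed among the whole-value anchors.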
(Lots of legit mail has Message-ID's that
violate RFCs, including Microsoft Exchange's format, I believe.
So far, aside from spam, my conservative message-id validity checks
have only caught messages from OpenSRS' trouble ticketing system, for
which this issue is a known bug. If Microsoft Exchange breaks RFC,
then of the 30000 messages per day which I process, none are from
Exchange. Is that good news, or what? ;-)
From my reading of the RFCs, it looks like each Message-ID should have
a valid pointer to the originating host. To me, that means
"@name.tld>". (I suppose bang-addressing would be acceptable.) Much
legit mail doesn't comply. More spam doesn't. Maybe I'm being too
strict in my interpretation of the RFCs.
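One possible reading of the "@name.tld>" test can be sketched in Python. The pattern below is an assumption about what that shorthand means (an "@", one or more dot-separated host labels, an alphabetic final label, then the closing ">"). It is not the actual check used in anyone's rc file, and it deliberately does not cover bang-addressing.

```python
import re

# Assumed reading of "@name.tld>": "@", at least one "label." part,
# an alphabetic top-level label, then the closing ">".
MSGID_HOST = re.compile(r"@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}>\s*$")

def has_host_pointer(message_id: str) -> bool:
    """Return True if the Message-ID ends in something like @name.tld>."""
    return MSGID_HOST.search(message_id) is not None
```

A value such as "<abc123@mail.example.com>" passes, while "<abc123@localhost>" or "<abc123>" does not, which shows how strict this interpretation is.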
of what I call "indicia" (word of art taken from Supreme Court dicta
discussing the 13th Amendment).
Actually, it's a bit more common than that.
http://www.it.ca/bin/dict?indicia
Yes, I know that it's an English word. (I studied comparative lit
before I studied law.) I was merely stating why I have my
affinity to using it in the sense I do for spam markers. It has
a certain heritage in law.
HTH,
dman
--
Dallman Ross
"If you find a path with no obstacles, it probably does not lead to
anywhere."
Thoughts of Rev. Sunnan Kubose, from _Zen in the Markets_