Re: Re: CLEANTO anyone?

On 30 May, Jim Osborn wrote:
| On Sun, May 30, 2004 at 12:29:23AM -0400, Don Hammond wrote:
| > I think you can count the addresses with something as simple as:
| > 
| > :0
| > *     ^To:\/.*
| > * 1^1 MATCH ?? @
| > { TOCOUNT = $= }
| 
| That's fine if you're comfortable counting:
| 
|   To: "<guy1(_at_)place1(_dot_)tld>" <guy1(_at_)place1(_dot_)tld>
| 
| as two addresses.  That seems to be an all-too common format on
| legitimate mail, to my dismay.

You're right, that will be counted twice.  A reasonable alternative
then would be to count commas rather than @ (and add 1).  I'm sure
that's not perfect either, but probably better.  Of course that
assumes multiple recipients must be delimited by commas and I don't
have the time to check RFCs right now.

| [...]
| > 
| > http://www.xray.mpe.mpg.de/mailing-lists/procmail/2003-09/msg00199.html
| 
| I'll confess I don't read perl well enough to know if your one-liner
| ignores the comment fields; that is, would it count my example above
| as just one address?

No, it suffers the same flaw.  But a fix seems trivial.  Doing
*minimal* testing, the following seems to work.  It removes
everything between quote or parentheses pairs before doing the
rest of the counting and cleaning.  It still won't help with
an address like:

   aguy(_at_)dom(_dot_)tld <aguy(_at_)dom(_dot_)tld>

where the comment is not enclosed by parentheses or quotes.  I don't
know if that's possible, or common, but it's not anything I care
about for my purposes.

BTW, it's not clear in the original post, but the recipe depends
on these two variable definitions.  Each bracket pair holds one
literal space and one literal tab.

wsstar='[       ]*'
wsneg='[^       ]'

:0
* $ ^Cc:$wsstar\/$wsneg.*
{
  CC = "$MATCH"
  CCADDR = `perl -e'$_=$ENV{CC};s/"[^"]+?"//g;s/\([^)]+?\)//g;
                    $j=s/.*?([^\s<,]+@[^\s>,]+)>?/$1, /g;
                    s/,\s*$//;print join(",",$j||0,$_);'`
  :0
  * CCADDR ?? ^^\/[0-9]+
  { CCCOUNT = $MATCH }
  :0
  * CCADDR ?? ^^[0-9]+,\/.*
  { CCADDR = "$MATCH" }
}

It breaks down like this:

s/"[^"]+?"//g    removes everything between and including each "".
s/\([^)]+?\)//g  removes everything between and including each ().

These could almost have been done with one operation using a
back reference.  The problem is the parentheses where, after
matching the left one, you're looking next for a right one
rather than the same character again.  Note also that \1 is
not a back reference inside a character class, so [^\1] would
not do what it appears to.  If you want to simplify, alternation
is the safer route: s/"[^"]+?"|\([^)]+?\)//g .

$j=s/.*?([^\s<,]+@[^\s>,]+)>?/$1, /g  removes everything except
each @ preceded by anything that is not white space, <, or a comma
and followed by anything that is not white space, > , or a comma.
It leaves just those strings that match the regexp, each one with
a trailing comma, and returns the match count to $j.

s/,\s*$// removes the trailing comma and any white space.

This is a very loose regexp for an address.  For my purposes it's
good enough.  If your tolerances are more exact, you'll need to
weigh that vs. the work and processing requirements involved in a
more rigorous solution like where Sean was pointing you.

Hope that helps,

Don Hammond

P.S. I just noticed your Mail-followup-to: header, which I missed
last night.  I presume that means you'd like a courtesy cc:, which
I've done here.  Apologies if I've misunderstood.

-- 
Email address in From: header is valid  * but only for a couple of days *
This is my reluctant response to spammers' unrelenting address harvesting


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail