procmail

Re: generic matching for mailing lists...

2003-01-12 12:18:35
At 08:08 2003-01-12 -0800, Zack Brown did say:

> MONTHFOLDER=`date +%Y-%m`
> :0:
> * ^(Sender:[ ]*owner-|X-BeenThere:[ ]*|Delivered-To:[ ]*mailing list |X-Loop:[ ]*)\/[-A-Za-z0-9_+]+
> $MATCH/$MATCH.$MONTHFOLDER
>
> This works for all but about 8 lists I'm on, which is pretty good. For those 8 stragglers though, I can't seem to figure out anything robust.
>
> Well, checking Sender: would work for 6 of the 8, but the syntax isn't consistently owner-listname.
>
> I'm also a bit hesitant, because I don't have deep knowledge of the culture of email headers, that might indicate that a given header will behave the same for many lists.

So that you don't have to fret over breaking the recipe which works for the majority of your lists, you could implement a _second_ recipe, following this one, which implements the added logic. Only those messages not handled by the first recipe will be left around to be handled. This of course assumes that the lists which aren't getting "properly" handled are in fact NOT MATCHING the first recipe (which, say, could be extracting the wrong name for the lists).

:0:
* ^(List-Post:[ ]*(<mailto:)?|List-Owner:[ ]*(<mailto:)?owner-)\/[-A-Z0-9_+]+
$MATCH/$MATCH.$MONTHFOLDER

This works for framers, docbook-apps, docbook, techwr-l.

Note that the character class omits "a-z". Procmail regexps are CASE INSENSITIVE unless otherwise specified with a flag on the flags line.

Origami, list-managers, and oandp-l require a bit more logic, since the owner identifier trails the listname. Note that two of these last three don't actually contain any true "list-type" headers (and lack owner- designations on the sender addresses), so it's a bit more difficult to peg them down as lists.

:0:
* ^Sender:.* List <(mailto:)?\/[-A-Z0-9_+]+
$MATCH/$MATCH.$MONTHFOLDER

This picks up the Origami and oandp-l lists, which have "List" text preceding the (non-owner) list address in the Sender header, so we can be reasonably sure that the Sender header is actually identifying a list. Since this comes _after_ other filters which should hopefully have found owner-listname type identifiers, we should expect that the Sender address is the address of the list, not the listowner. As you subscribe to more and more lists, this may need to be revised, though I really think the listadmins should FIX their lists instead.

That leaves us with just the list-managers list. How ironic that the one list that doesn't get matched at this point is for list managers...
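If you want to check the three conditions so far against sample headers without touching your live rcfile, here is a rough Python approximation. It is only a sketch under assumptions: Python's re module is not procmail's regexp engine, the capture group stands in for everything to the right of \/ (what would land in $MATCH), [ ]* is taken to mean space-or-tab, and the name list_name is made up for illustration. The sample headers are the ones quoted further down in this message, with the archive's address obfuscation left in place.

```python
import re

# procmail conditions ignore case by default and ^ anchors at any header line.
FLAGS = re.IGNORECASE | re.MULTILINE

# One pattern per recipe condition, in the order the recipes would run.
CONDITIONS = [
    r"^(?:Sender:[ \t]*owner-|X-BeenThere:[ \t]*"
    r"|Delivered-To:[ \t]*mailing list |X-Loop:[ \t]*)([-A-Z0-9_+]+)",
    r"^(?:List-Post:[ \t]*(?:<mailto:)?"
    r"|List-Owner:[ \t]*(?:<mailto:)?owner-)([-A-Z0-9_+]+)",
    r"^Sender:.* List <(?:mailto:)?([-A-Z0-9_+]+)",
]


def list_name(headers):
    """Return the list name from the first condition that matches, else None."""
    for pattern in CONDITIONS:
        found = re.search(pattern, headers, FLAGS)
        if found:
            return found.group(1)
    return None


samples = {
    "framers": "List-Owner: <mailto:owner-framers(_at_)lists(_dot_)FrameUsers(_dot_)com>\n"
               "Sender: bounce-framers-71493(_at_)lists(_dot_)FrameUsers(_dot_)com\n",
    "docbook-apps": "List-Owner: <mailto:docbook-apps-help(_at_)lists(_dot_)oasis-open(_dot_)org>\n"
                    "List-Post: <mailto:docbook-apps(_at_)lists(_dot_)oasis-open(_dot_)org>\n",
    "origami": "Sender:       Origami Mailing List <Origami(_at_)MIT(_dot_)Edu>\n",
    "list-managers": "Sender: list-managers-owner(_at_)greatcircle(_dot_)com\n",
}

for label, headers in samples.items():
    print(label, "->", list_name(headers))
```

The list-managers sample should come back as None, which is the fall-through case described above. Note that the extracted name keeps whatever case the header used, so the folder name will too.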

For efficiency purposes, it may make sense to handle owner-suffixed lists in a separate recipe, because you need to run the name through sed. Note that we specifically include the -owner suffix in the condition rather than letting the regexp character class swallow it -- this is so that we KNOW the match actually contained -owner; otherwise, we couldn't differentiate between a regular (non-list) sender and a list message (though Sender: is really only present on list messages, AFAIK):

:0E
* ^Sender:[     ]*\/[-A-Z0-9_+]+-owner
{
        MATCH=`echo $MATCH | sed -e s/-owner//i`

        :0:
        $MATCH/$MATCH.$MONTHFOLDER
}

I specify the :0E here in case you ever add a 'c' flag to your list recipe for some reason. The reassignment of an internal procmail variable (MATCH) might not seem kosher, but it's valid. If it turns you off, assign it to a different variable name and be sure to use that in your mailbox spec.
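For checking the suffix handling outside procmail, the same match-then-strip step can be approximated in Python. This is a sketch, not the recipe itself: the function name owner_suffixed_list is hypothetical, and the strip is anchored to the end of the match, which differs slightly from the sed expression above (sed without an anchor removes the first -owner it finds, so a name like home-owners-list-owner would lose the wrong one).

```python
import re


def owner_suffixed_list(headers):
    """Return the list name from a 'Sender: listname-owner' header, else None."""
    # The -owner suffix is part of the pattern, so a plain Sender: address
    # without it does not match at all.
    found = re.search(r"^Sender:[ \t]*([-A-Z0-9_+]+-owner)",
                      headers, re.IGNORECASE | re.MULTILINE)
    if found is None:
        return None
    # Strip only the trailing -owner (any case) from the captured text.
    return re.sub(r"-owner$", "", found.group(1), flags=re.IGNORECASE)


print(owner_suffixed_list("Sender: list-managers-owner(_at_)greatcircle(_dot_)com\n"))
print(owner_suffixed_list("Sender:       Origami Mailing List <Origami(_at_)MIT(_dot_)Edu>\n"))
```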

You can get all of these conditions (except the suffixed owner) into a long one-line condition, or, to better separate the conditions, use maximal scoring, which makes it a bit easier to extend the conditions without necessarily breaking your initial regexp:

:0:
* 9876543210^0 ^(Sender:[ ]*owner-|X-BeenThere:[ ]*|Delivered-To:[ ]*mailing list |X-Loop:[ ]*)\/[-A-Z0-9_+]+
* 9876543210^0 ^(List-Post:[ ]*(<mailto:)?|List-Owner:[ ]*(<mailto:)?owner-)\/[-A-Z0-9_+]+
* 9876543210^0 ^Sender:.* List <(mailto:)?\/[-A-Z0-9_+]+
$MATCH/$MATCH.$MONTHFOLDER

:0E
* ^Sender:[     ]*\/[-A-Z0-9_+]+-owner
{
        MATCH=`echo $MATCH | sed -e s/-owner//i`

        :0:
        $MATCH/$MATCH.$MONTHFOLDER
}

The maximal scoring ensures that once any one condition matches, the recipe proceeds directly to the delivery portion, without evaluating the conditions on subsequent lines.

In the end, you don't have a "single recipe" to do it, but you are presented with a fairly generic way of identifying the listname.

> I've put up archives of the problem mailing lists at
> http://tumblerings.org/~zbrown/procmail/

Suggestion - if you really want people to take their time to evaluate your problem for you, you should take just a handful of messages from each of those lists, *AND* strip the BODY from each of them, then make them available as a single file to download. 60KB or so of headers would be one thing - 4MB of junk is quite another. We're not doing anything with the bodies, which represent the bulk of the messages, so they're a complete waste of everybody's bandwidth to download when evaluating your problem.

You doing that work once, from your end, greatly reduces the wasted time and bandwidth for everyone who might otherwise be willing to assist you. If we've got a single test mailbox which we can download and pipe into a filter running in a sandbox (you do know about sandboxes, right? If not, check my .sig), then we can actually be spending our time helping you instead of downloading your email.
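As a rough idea of how little work that is, here is a small Python sketch that keeps only the headers of the first few messages in an mbox. It is one possible approach, not the only one: the function name headers_only and the default of five messages are my own choices, and it assumes a conventional mbox where each message starts with a "From " separator line and any body line starting with "From " has been escaped.

```python
def headers_only(mbox_text, limit=5):
    """Return an mbox containing only the headers of the first `limit` messages."""
    out = []
    count = 0
    in_header = False
    for line in mbox_text.splitlines():
        if line.startswith("From "):
            # mbox separator line: a new message begins here.
            count += 1
            if count > limit:
                break
            if out:
                out.append("")  # blank line between the stripped messages
            in_header = True
        elif in_header and line == "":
            # First blank line ends the header; everything after is body.
            in_header = False
            continue
        if in_header:
            out.append(line)
    return "\n".join(out) + "\n" if out else ""


sample = (
    "From a@example.com Sun Jan 12 12:00:00 2003\n"
    "Sender: owner-foo@example.com\n"
    "Subject: one\n"
    "\n"
    "body one\n"
    "\n"
    "From b@example.com Sun Jan 12 12:05:00 2003\n"
    "Subject: two\n"
    "\n"
    "body two\n"
)
print(headers_only(sample), end="")
```

Run it once per list archive, concatenate the results, and you have a single small test mailbox to publish.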

Return-Path: can sometimes be useful, except that you've got to deal with things like mailman bounce encoding, and virtually always need to remove an owner-(listname) or (listname)-owner designation.

Return-path: <list-managers-owner+M1007(_at_)greatcircle(_dot_)com>

ListManagers contained the following header:

Sender: list-managers-owner(_at_)greatcircle(_dot_)com

This is the OPPOSITE of the order you check for owner addresses in your regexp (though your regexp does it in the fashion normally employed). Dealing with trailing text that you don't want included within a match is a PITA.

Some of the messages also contained:

X-MDaemon-Deliver-To: list-managers(_at_)greatcircle(_dot_)com

but that appears to be specific to the MTAs or MUAs of certain message authors, not the list itself.

Return-Path has MailMan style bounce encoding.

----

OANDP-L:

Return-Path would be directly useable

You should contact the listowner though and have them fix the Sender header -- this is inviting bounces from braindead MTAs (and believe me, there are many) TO THE LIST. The sender should be an owner- alias on the server, so as to not direct certain types of messages to the list, but rather to a person (even if they probably ignore the owner messages).

Sender: Orthotics and Prosthetics List <OANDP-L(_at_)LISTS(_dot_)UFL(_dot_)EDU>

----

Origami:

Return-Path would be useable.

The above comment about the Sender: header applies here as well.

Sender:       Origami Mailing List <Origami(_at_)MIT(_dot_)Edu>

----

TECHWR-L:

Return-Path is encoded: (bounce-)techwr-l(-number)@lists.raycomm.com

The following list-specific headers appear:

List-Unsubscribe: <mailto:leave-techwr-l-71444C(_at_)lists(_dot_)raycomm(_dot_)com>
List-Subscribe: <mailto:subscribe-techwr-l(_at_)lists(_dot_)raycomm(_dot_)com>
List-Owner: <mailto:owner-techwr-l(_at_)lists(_dot_)raycomm(_dot_)com>
Sender: bounce-techwr-l-71444(_at_)lists(_dot_)raycomm(_dot_)com

The supplementary recipe above matches this list via List-Owner.

----

docbook-apps (same applies to docbook, though it should be noted that several of the messages in your docbook archive are actually docbook-apps messages, and are clearly identified as such):

Return-path isn't a simple "owner" type, but is basically the same, substituting "errors" for "owner": <docbook-apps-errors(_at_)lists(_dot_)oasis-open(_dot_)org>

Lots of list-specific headers, though these boneheads don't provide a Sender: header. Tsk. You might have a word with the listadmin and ask why they fail to include this significant header.

List-Owner: <mailto:docbook-apps-help(_at_)lists(_dot_)oasis-open(_dot_)org>
List-Post: <mailto:docbook-apps(_at_)lists(_dot_)oasis-open(_dot_)org>
List-Subscribe: <http://lists.oasis-open.org/ob/adm.pl>,
 <mailto:docbook-apps-request(_at_)lists(_dot_)oasis-open(_dot_)org?body=subscribe>
List-Unsubscribe: <http://lists.oasis-open.org/ob/adm.pl>,
 <mailto:docbook-apps-request(_at_)lists(_dot_)oasis-open(_dot_)org?body=unsubscribe>
List-Archive: <http://lists.oasis-open.org/archives/docbook-apps/>
List-Help: <http://lists.oasis-open.org/elists/admin.shtml>,
 <mailto:docbook-apps-request(_at_)lists(_dot_)oasis-open(_dot_)org?body=help>
List-Id: <docbook-apps.lists.oasis-open.org>

These are matched with the List-Post header against the supplementary recipe I provide above (List-Owner isn't a good header to use here because they don't use an owner- style header and the added text is trailing the listname, which is harder to contend with using $MATCH, not to mention "-help" isn't uncommon for part of a listname).

----

framers:

Return-path: <bounce-framers-71493(_at_)lists(_dot_)FrameUsers(_dot_)com>

Plus the following:

List-Unsubscribe: <mailto:leave-framers-71493R(_at_)lists(_dot_)FrameUsers(_dot_)com>
List-Subscribe: <mailto:subscribe-framers(_at_)lists(_dot_)FrameUsers(_dot_)com>
List-Owner: <mailto:owner-framers(_at_)lists(_dot_)FrameUsers(_dot_)com>
Sender: bounce-framers-71493(_at_)lists(_dot_)FrameUsers(_dot_)com

This is matched with the List-Owner header against the supplementary recipe I provide above.
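If you did want to fall back on Return-Path, the decorations noted for the lists above (a +tag suffix, a bounce-...-number wrapper, an owner- prefix, an -owner or -errors suffix) would all have to be peeled off. A hedged Python sketch of that cleanup follows. The function name list_from_local_part is hypothetical, it takes only the part of the address before the @, and it covers just the forms seen in these samples; a list whose real name ends in -owner or -errors would be mangled.

```python
import re


def list_from_local_part(local):
    """Guess the list name from the local part of a Return-Path address."""
    # Drop a +tag suffix such as +M1007.
    name = local.split("+", 1)[0]
    # bounce-<list>-<digits>  ->  <list>
    name = re.sub(r"^bounce-(.+)-\d+$", r"\1", name, flags=re.IGNORECASE)
    # Leading owner- designation.
    name = re.sub(r"^owner-", "", name, flags=re.IGNORECASE)
    # Trailing -owner or -errors designation.
    name = re.sub(r"-(owner|errors)$", "", name, flags=re.IGNORECASE)
    return name


for local in ("list-managers-owner+M1007", "bounce-techwr-l-71444",
              "bounce-framers-71493", "docbook-apps-errors"):
    print(local, "->", list_from_local_part(local))
```

This is the sort of thing that is painful to express in a single procmail condition, which is why the recipes above lean on the Sender and List-* headers first.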


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
