Re: Get domain and tld ?

2009-01-25 16:31:17
At 12:55 2009-01-25 +0100, Xavier Maillard wrote:
In fact, I thought I would take the Return-Path but after having
analyzed different "target" messages, it won't work. So I need to
find the most useful header for that.

For generic identification of lists, there are _several_ headers which should be examined. The listname_id recipes go through a series of headers looking for an appropriate match, then parse it down to a token which should just be the listname, without needing to work from an array of known lists.
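To give a feel for the approach, here is a minimal sketch of the "try several headers, keep the first usable token" idea. It is NOT the actual listname_id.rc, which covers many more headers and edge cases. The two headers chosen (List-Id and Mailman's X-BeenThere) and the exact regexps are my assumptions for illustration only.

```procmail
# Hypothetical sketch only - NOT the real listname_id.rc.
# Start with an empty LISTNAME so stale values can't leak through.
LISTNAME=

# RFC 2919 List-Id, e.g.   List-Id: GNU tools <gnu-tools.gnu.org>
# MATCH receives the text after "<", up to (not including) the first dot.
:0
* ^List-Id:.*<\/[^.>]+
{
        LISTNAME=$MATCH
}

# Fallback: Mailman's X-BeenThere carries the list posting address.
# Only tried when LISTNAME is still empty (i.e. contains no character).
:0
* ! LISTNAME ?? .
* ^X-BeenThere:[ ]*\/[^ @]+
{
        LISTNAME=$MATCH
}
```

Each fallback is guarded by the "still empty" test, so the first header that yields a token wins and later recipes leave it alone.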

The idea behind my request is to write a rule for domains I am
both subscribed to their numerous mailing-list and where I am
also a moderator.

I use my generic list identification routines, but really only to set a variable that identifies the listname. Elsewhere in my procmail setup, I have recipes that check for a specific listname and then take other actions. For example, on lists where I'm a moderator, I'd otherwise have to wade through WAY too much bogus stuff submitted to the list - who has the time? So I identify listadmin messages and scan their bodies for tokens indicating that a held message isn't a foreign submission but rather an errant reply from a subscriber's alternate email account (which is pretty common), then flag those to be displayed in my client.


Currently, this is the closest rule I have found:

[snip]

You have way too much stuff dedicated to identifying the one list (or series of lists on one host).

INCLUDERC=listname_id.rc

:0
* LISTNAME ?? ^^gnu-tools^^
{
        # do something specific for this list, or just file away
}


If I weren't so inundated with other stuff right now, I'd consider extending the listname_id.rc recipes to include a section for identifying probable listadmin/moderator messages. I've only had to deal with mailman and majordomo myself though. The plethora of webforums out there would probably complicate this.

With mailman for instance, the listname_id stuff already identifies the moderator messages as belonging to the related list - all one needs to do is get a match on Sender:.*mailman-bounces@ or X-List-Administrivia:[ ]*yes, and you have a reasonable expectation that it is a list administration message, so you _set_ another variable indicating it is a LIST_ADMINISTRATIVE message or whatever. You do this generically one time for all messages, and then check it when you need to.
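A rough sketch of what that could look like, assuming LISTNAME has already been set by listname_id.rc. The yes/no values and the "gnu-tools-admin" mailbox name are made up for the example.

```procmail
# Hypothetical sketch: flag probable Mailman administrative messages.
LIST_ADMINISTRATIVE=no

:0
* ^Sender:.*mailman-bounces@
{
        LIST_ADMINISTRATIVE=yes
}

:0
* ^X-List-Administrivia:[ ]*yes
{
        LIST_ADMINISTRATIVE=yes
}

# Later, wherever it is needed: admin traffic for one particular list.
# "gnu-tools-admin" is a placeholder mailbox name.
:0:
* LISTNAME ?? ^^gnu-tools^^
* LIST_ADMINISTRATIVE ?? ^^yes^^
gnu-tools-admin
```

The two detection recipes run once per message. Everything after that only tests the variable rather than re-examining headers.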

For instance, some lists I'm on are set up to circumvent my spam filters or to have an elevated allowance (say, because there's a lot of spammy type stuff discussed on them). Having that LISTNAME variable at the ready makes this easy.
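As a sketch of the "circumvent" case, assuming your spam checks live in their own rcfile: "spam-discuss" is an invented list name and spamcheck.rc is a stand-in for whatever spam recipes you actually run.

```procmail
# Hypothetical sketch: skip the spam recipes for one trusted list.
# The nested block (and so the INCLUDERC) only runs when LISTNAME
# is NOT exactly "spam-discuss".
:0
* ! LISTNAME ?? ^^spam-discuss^^
{
        INCLUDERC=spamcheck.rc
}
```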

I want this rule to apply for gnu.org, lolica.org and several
other domains. TLD, DOMAIN and LIST would then be used to sort
mails in a TLD/DOMAIN/LIST hierarchy.

Honestly, that seems more trouble than it is worth - a token-by-token hierarchy makes sense if you have gobs of items to deal with (and if the tokens help categorize and find stuff).

> # first, match the domain down to JUST the rightmost two domain tokens
> # (i.e. remove the optional hostname levels).  As parsed here, I'm allowing
> # for the FROMDOMAIN to actually be an email address - this will still work.

Pretty impressive !

Not really, it just makes sense to examine a regexp and see how you can refine it so that it can happily digest a variety of potential inputs and still give the desired result.

> BTW, you do realize that outside of the country-generic TLDs such as
> .com, .org, .net, .biz, etc, that some country specific TLDs often
> have their own secondary hierarchy.  For example:
>
>         host.demon.co.uk

Ooops, I did not think about this case :/

Some simple changes to my previously posted recipe would handle a two-level TLD (so long as it has the form domain.xx.yy, i.e. a two-character second level followed by a two-character country code) alongside a regular TLD.

Note that DOMAIN and TLD orders are swapped (previously, it didn't matter what their order was, but in the revised approach, we use the domain to anchor the leading text before the match):

# first, match the domain down to JUST the rightmost two tokens
:0
* FROMDOMAIN ?? [@.]?\/[^@.]+\.([^.]+|[^.][^.]\.[^.][^.])$
{
        TOPDOMAIN=$MATCH

        # next, get the domain portion - this is everything up to,
        # but not including the first dot.
        :0
        * MATCH ?? ^\/[^.]+
        {
                DOMAIN=$MATCH
        }

        # we need to fall back to the saved TOPDOMAIN and get the
        # TLD portion - this is everything AFTER the domain and a dot.
        # this implementation allows for two-part TLDs (co.uk for example)
        # because the RHS of this condition includes a variable which
        # needs to be expanded, we use the $ flag on the condition.
        :0
        * $ TOPDOMAIN ?? ^$DOMAIN\.\/.*$
        {
                TLD=$MATCH
        }
}
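If you do decide you want the TLD/DOMAIN/LIST layout, a sketch of the filing step could look like the following. It assumes the recipe above has set DOMAIN and TLD (for a FROMDOMAIN of host.demon.co.uk it is meant to leave DOMAIN=demon and TLD=co.uk) and that LISTNAME comes from listname_id.rc. LISTDIR and the "lists" subdirectory are names invented for the example.

```procmail
# Hypothetical sketch: file into $MAILDIR/lists/TLD/DOMAIN/LISTNAME.
# Only proceed when all three tokens are non-empty, and refuse values
# containing a slash, since they originate in message headers.
:0
* TLD ?? .
* DOMAIN ?? .
* LISTNAME ?? .
* ! TLD ?? /
* ! DOMAIN ?? /
* ! LISTNAME ?? /
{
        LISTDIR=$MAILDIR/lists/$TLD/$DOMAIN

        # Create the directory on first use.  w = wait for mkdir,
        # i = ignore that it doesn't read the message, c = carry on.
        :0 wic
        * ? test ! -d "$LISTDIR"
        | mkdir -p "$LISTDIR"

        # Deliver to an mbox file named after the list (with a lockfile).
        :0:
        $LISTDIR/$LISTNAME
}
```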

I would really be thankful if somebody would explain the
listname_id.rc line by line :)

My suggestion to you would be to make a test harness - I call it a "sandbox" (sandboxes are intended to keep the sand in, and kids play in them). Then, get the listname_id.rc file and INCLUDERC it into your sandbox. DEFAULT can be set to /dev/null, and you turn on verbose logging (which is sort of the object of a sandbox). Take a pile of old email messages, then use formail to split the mailbox (assuming it's in mbox format) and feed each message to procmail:

        formail -s procmail -m sandbox.rc < saved_mail.mbx

Note that it is really important that the mailbox you're feeding into this is NOT a delivery target of the rules invoked by the sandbox.
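For reference, a bare-bones sandbox.rc along those lines might look like this. It is a sketch rather than the sandbox posted on my pages, and the /tmp/procmail-sandbox directory and log file name are assumptions you should change to suit.

```procmail
# sandbox.rc - hypothetical minimal test harness
SHELL=/bin/sh
MAILDIR=/tmp/procmail-sandbox   # assumed scratch directory; must exist
DEFAULT=/dev/null               # anything not filed is discarded
LOGFILE=$MAILDIR/sandbox.log
VERBOSE=on                      # log every condition as it is evaluated

# The recipes under test.
INCLUDERC=$MAILDIR/listname_id.rc

# Write the result to the log so it is easy to find per message.
# (The closing quote sits on the next line to embed a newline.)
LOG="LISTNAME=$LISTNAME
"
```

Because DEFAULT is /dev/null and nothing here delivers anywhere, the only output is the log, which is what you want to read.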

I have a standard sandbox and listname_id posted on my procmail pages (which are a bit long in the tooth for visual appearance, but the scripts are still valid). The rules I posted previously would have simply been put into their own file, such as filter.rc, and included into the sandbox (thus, the sandbox framework remains constant). This is what I use to do quick tests of things I go to post here (and of course, for my own stuff as well).

As fun as it would be to re-explain the logic behind each line of an rcfile, ultimately, if it has proven to do its job, has been subjected to peer review, and provides you with a string you can use to identify one list from another, is there really a need to comprehend each line of it? The procmail list archives will contain several threads discussing the development and use of the ruleset.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
