Re: Get domain and tld ?

At 12:55 2009-01-25 +0100, Xavier Maillard wrote:

In fact, I thought I would take the Return-Path but after having
analyzed different "target" messages, it won't work. So I need to
find the most useful header for that.

For generic identification of lists, there are _several_ headers which should be examined. The listname_id recipes go through a series of headers looking for an appropriate match, then parse it down to a token which should just be the listname, without needing to work from an array of known lists.

The idea behind my request is to write a rule for domains where I am
both subscribed to their numerous mailing lists and also a
moderator.

I use my generic list identification routines, but really only to set a variable that identifies the listname. Elsewhere in my procmail setup, I have recipes that check for a specific listname and then take other actions. Say, for lists where I'm a moderator and have to wade through WAY too much bogus stuff submitted to the list (who has the time?), I identify listadmin messages and scan the bodies for tokens that would indicate that it isn't a foreign submission but rather an errant reply from an alternate email account of a user, which is pretty common, then flag those to be displayed in my client.

Currently, this is the closest rule I have found:


[snip]

You have way too much stuff dedicated to identifying the one list (or series of lists on one host).


INCLUDERC=listname_id.rc

:0
* LISTNAME ?? ^^gnu-tools^^
{
        # do something specific for this list, or just file away
}

If I weren't so inundated with other stuff right now, I'd consider extending the listname_id.rc recipes to include a section for identifying probable listadmin/moderator messages. I've only had to deal with mailman and majordomo myself though. The plethora of webforums out there would probably complicate this.

With mailman for instance, the listname_id stuff already identifies the moderator messages as belonging to the related list - all one needs to do is get a match on Sender:.*mailman-bounces@ or X-List-Administrivia:[ ]*yes, and you have a reasonable expectation that it is a list administration message, so you _set_ another variable indicating it is a LIST_ADMINISTRATIVE message or whatever. You do this generically one time for all messages, and then check it when you need to.
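
As a rough sketch of the flagging step just described (not part of the posted listname_id.rc): the variable name LIST_ADMINISTRATIVE and the values "yes"/"no" are illustrative choices, and LISTNAME is assumed to have been set already by the INCLUDERC shown earlier.

```procmail
# ASSUMPTION: LIST_ADMINISTRATIVE and its yes/no values are illustrative.
# Start from a known state so a value inherited from the environment
# cannot leak in.
LIST_ADMINISTRATIVE=no

# Mailman moderator/admin traffic: either header is taken as a hint.
# The bracket holds a space; add a literal tab inside it if your
# headers may use one.
:0
* ^(Sender:.*mailman-bounces@|X-List-Administrivia:[ ]*yes)
{
        LIST_ADMINISTRATIVE=yes
}

# Later, wherever a list-specific decision is needed:
:0
* LISTNAME ?? ^^gnu-tools^^
* LIST_ADMINISTRATIVE ?? ^^yes^^
{
        # handle moderator traffic for this list
}
```

The flag is computed once, near the top of the rcfile, and the later recipes only test it.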

For instance, some lists I'm on are set up to circumvent my spam filters or to have an elevated allowance (say, because there's a lot of spammy type stuff discussed on them). Having that LISTNAME variable at the ready makes this easy.
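
One way this might look, as a sketch only: the list name "spam-l" and the file name spamfilter.rc are hypothetical stand-ins, and LISTNAME is assumed to have been set already by listname_id.rc.

```procmail
# ASSUMPTION: "spam-l" and spamfilter.rc are hypothetical names.
# Run the spam checks for everything EXCEPT the exempted list.
:0
* ! LISTNAME ?? ^^spam-l^^
{
        INCLUDERC=spamfilter.rc
}
```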

I want this rule to apply for gnu.org, lolica.org and several
other domains. TLD, DOMAIN and LIST would then be used to sort
mails in a TLD/DOMAIN/LIST hierarchy.

Honestly, that seems more trouble than it is worth - a token-by-token hierarchy makes sense if you have gobs of items to deal with (and if the tokens help categorize and find stuff).

> # first, match the domain down to JUST the rightmost two domain tokens
> # (i.e. remove the optional hostname levels).  As parsed here, I'm allowing
> # for the FROMDOMAIN to actually be an email address - this will still work.
Pretty impressive !

Not really, it just makes sense to examine a regexp and see how you can refine it so that it can happily digest a variety of potential inputs and still give the desired result.

> BTW, you do realize that outside of the country-generic TLDs such as
> .com, .org, .net, .biz, etc, that some country specific TLDs often
> have their own secondary hierarchy.  For example:
>
>         host.demon.co.uk

Ooops, I did not think about this case :/

Some simple changes to my previously posted recipe would handle the two-level TLD (so long as it takes the form domain.2-letter.country, as in demon.co.uk) alongside a regular TLD.

Note that DOMAIN and TLD orders are swapped (previously, it didn't matter what their order was, but in the revised approach, we use the domain to anchor the leading text before the match):


# first, match the domain down to JUST the rightmost two tokens
:0
* FROMDOMAIN ?? [@.]?\/[^@.]+\.([^.]+|[^.][^.]\.[^.][^.])$
{
        TOPDOMAIN=$MATCH

        # next, get the domain portion - this is everything up to,
        # but not including the first dot.
        :0
        * MATCH ?? ^\/[^.]+
        {
                DOMAIN=$MATCH
        }

        # we need to fall back to the saved TOPDOMAIN and get the
        # TLD portion - this is everything AFTER the domain and a dot.
        # this implementation allows for two-part TLDs (co.uk for example)
        # because the RHS of this condition includes a variable which
        # needs to be expanded, we use the $ flag on the condition.
        :0
        * $ TOPDOMAIN ?? ^$DOMAIN\.\/.*$
        {
                TLD=$MATCH
        }
}
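
If you do still want the TLD/DOMAIN/LIST filing, a sketch of how the variables above might be put to use follows. It assumes LISTNAME was set by listname_id.rc, that TLD and DOMAIN were set by the recipe above, and that the target directories already exist under MAILDIR; none of that is checked here beyond the variables being non-empty.

```procmail
# ASSUMPTION: TLD, DOMAIN and LISTNAME were all set earlier, and the
# directory $TLD/$DOMAIN already exists under MAILDIR.
# Each "! VAR ?? ^^^^" condition requires the variable to be non-empty.
:0
* ! TLD ?? ^^^^
* ! DOMAIN ?? ^^^^
* ! LISTNAME ?? ^^^^
$TLD/$DOMAIN/$LISTNAME
```

For a list hosted under demon.co.uk the intent is a path of the form co.uk/demon/<listname>, with the two-part TLD kept as a single directory name.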

I would really be thankful if somebody would explain the
listname_id.rc line by line :)

My suggestion to you would be to make a test harness - I call it a "sandbox" (sandboxes are intended to keep the sand in, and kids play in them). Then, get the listname_id.rc file and includerc it into your sandbox. DEFAULT can be set to /dev/null, and you set verbose logging (which is sort of the object of a sandbox). Take a pile of old email messages, then use formail to split the mailbox (assuming it's in mbox format):


        formail -s procmail -m sandbox.rc < saved_mail.mbx

Note that it is really important that the mailbox you're feeding into this is NOT a delivery target of the rules invoked by the sandbox.

I have a standard sandbox and listname_id posted on my procmail pages (which are a bit long in the tooth for visual appearance, but the scripts are still valid). The rules I posted previously would have simply been put into their own file, such as filter.rc, and included into the sandbox (thus, the sandbox framework remains constant). This is what I use to do quick tests of things I go to post here (and of course, for my own stuff as well).
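
For illustration only (this is NOT the sandbox posted on those pages), a minimal sandbox.rc along the lines described above might look like the following. The log file name is an arbitrary choice, and the relative paths assume you run the formail command from the directory holding these files.

```procmail
# sandbox.rc - a minimal, hypothetical test harness
# ASSUMPTION: run from the directory that holds this file,
# listname_id.rc and filter.rc, so the relative paths resolve.
SHELL=/bin/sh
DEFAULT=/dev/null       # nothing is really delivered
LOGFILE=sandbox.log     # arbitrary name; everything is logged here
VERBOSE=yes             # log each condition as it is evaluated

# the list identification recipes under test
INCLUDERC=listname_id.rc

# your own recipes under test (e.g. the DOMAIN/TLD recipe)
INCLUDERC=filter.rc

# record what was worked out for this message; the quoted newline
# keeps each log entry on its own line
LOG="LISTNAME=$LISTNAME DOMAIN=$DOMAIN TLD=$TLD
"
```

Swapping a different filter.rc in and out leaves the framework itself unchanged.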

As fun as it would be to re-explain the logic behind each line of an rcfile, ultimately, if it has proven to do its job, has been subjected to peer review, and provides you with a string you can use to identify one list from another, is there really a need to comprehend each line of it? The procmail list archives contain several threads discussing the development and use of the ruleset.


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail