procmail

Re: Domain based sorting

2011-08-18 15:32:03
At 11:31 2011-08-18, LuKreme wrote:
LuKreme <kremels(_at_)kreme(_dot_)com> squawked out on Thursday 18-Aug-2011@12:14:46
> So, I started to think (dangerous, I know) and I searched and found Sean's post from a few years ago about dealing with getting domains in domain.co.uk sorts of situations:

> And a few minutes later I found Dan's post in the same thread with (trimmed down to just the part I want)
>
> TLDREGEX = ([cC][oO][.][^.][^.]|[^.]+)

The expression doesn't need to spell out both cases: procmail matches case-insensitively unless someone explicitly makes a recipe case sensitive by specifying the 'D' flag. The following is more succinct, and accomplishes the same thing within the example recipe:

TLDREGEX = (co[.][^.][^.]|[^.]+)

Note that the [.] expression might more commonly be written as \. but one would have to double-escape it to \\. for the backslash to appear in the resulting regexp string, so using a character class is in fact clearer.

> # Get the domain name
> :0
> * $ FQDN  ?? ()\/[^.]+[.]$TLDREGEX^^
> * MATCH ?? ^^\/[^.]+
> { DOMPART = $MATCH }
>
> This works perfectly as far as I can tell.

Though the second condition line drops the TLD portion(s): this will grab the "domain" from "domain.tld", "mail.domain.tld", or "mail.domain.co.uk". However, the TLDREGEX is a 'co.xx'-specific expression, so it'll trip up on something such as 'k12.ca.us' (and there are many variations on that). For a host in the "k12.ca.us" hierarchy (California schools, http://www.ed-data.k12.ca.us/), for example, it will return "ca" as the domain. ca.us is also used for municipalities within the state, nv.us is Nevada, and, predictably, other states use the same syntax. Then there's ca.gov, with a host of subdomains including some municipalities and agencies (cdfa.ca.gov, etc.).
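
To see the "ca" result for yourself, a throwaway rcfile along these lines should do. It is only a sketch: the file name test.rc, the hard-coded FQDN value, the quoting of TLDREGEX, and the LOG line are additions for illustration and are not part of the recipe as posted.

```procmail
# test.rc (hypothetical file name): run the posted recipe against one host
TLDREGEX="(co[.][^.][^.]|[^.]+)"
FQDN=www.ed-data.k12.ca.us

# Get the domain name
:0
* $ FQDN ?? ()\/[^.]+[.]$TLDREGEX^^
* MATCH ?? ^^\/[^.]+
{ DOMPART = $MATCH }

# Record what ended up in DOMPART (assumption: you are watching the log)
LOG="DOMPART=$DOMPART
"
```

Invoked with something like "procmail -m ./test.rc < /dev/null", the first condition should leave "ca.us" in MATCH and the second should trim that to "ca", so the log shows the state code rather than anything you'd call the domain.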

The UK has "org.uk" and "net.uk" as well.
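
If the goal is only to cover a few more of these second-level cases, one option is to widen the alternation. A minimal sketch, assuming the labels listed here are the ones you care about (they are an illustrative pick, nowhere near a complete list):

```procmail
# Treat co.xx, org.xx, net.xx, ac.xx and gov.xx as two-part suffixes.
# Illustrative list only; it does nothing for k12.ca.us-style names.
TLDREGEX="((co|org|net|ac|gov)[.][^.][^.]|[^.]+)"
```

With that in place, the recipe should pick "domain" out of "mail.domain.org.uk". It keeps the weakness of the co.xx version, though: an ordinary second-level domain that happens to be named net.xx or org.xx will be treated as a suffix, and the host label in front of it gets mistaken for the domain.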

Admittedly, you're not likely to be ordering anything from .ca.us and the like, but in the context of parsing out a domain, there are many issues raised.

Considering the ICANN decision to open up the TLD naming to pretty much anything, some thought needs to be put into how domains are parsed. There's sure to be a LOT of logic that will break.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de
http://mailman.rwth-aachen.de/mailman/listinfo/procmail
