procmail

RE: Get domain and tld ?

2009-01-25 17:13:34
Xavier Maillard wrote Sunday, January 25, 2009 20:58:

Dallman Ross <dman(_at_)nomotek(_dot_)com> writes:

Let's refine the procmail syntax given yesterday in light
of the wishes described above.

[thanking me]

I'm glad it was helpful and that it excited you. :-)


 * FQDN ?? ^^()^^

Can you explain this rule to me?  What does it mean exactly?

There are a few other procmail gurus around, and one of them
first showed us on this list that kind of syntax maybe about
8, 9, or 10 years ago.  I believe I first saw it done by David Tamkin,
who still reads this list.

All that line does is check to see if the variable on the left
is empty or null.

See "^^" in the procmailrc man page.  It's an anchor, left or right,
to the expression.  We put an empty parenthesis set in the middle
so it's clearer that two anchors are being shown and not something
else that comprises four carets strung together.  But it works
equally without the parens.  They are for us humans only.
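
To make the point about the parentheses concrete, here is a minimal
sketch with both spellings side by side.  The variable name VAR and the
log texts are made up for illustration; only the two conditions come
from the discussion above.

```procmail
NL = '
' # define newline variable

# A bare variable name on a line of its own unsets the variable.
VAR

# Spelling 1: empty parentheses between the two anchors, for human eyes.
:0
* VAR ?? ^^()^^
{
  LOG = "VAR is empty (tested with parens)$NL"
}

# Spelling 2: the same two anchors written back to back.
:0
* VAR ?? ^^^^
{
  LOG = "VAR is empty (tested without parens)$NL"
}
```

Neither recipe delivers anything; each only writes a log line when its
condition holds, so processing continues with whatever follows in the
rcfile.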

Procmail has no easy way to distinguish between a var that is unset
and one that is merely empty, btw.  There is an extremely advanced
way to test for that, and David Tamkin once showed that as well,
but it's not something I or anybody I know about ever actually needed
to do in production.  It was merely an intellectual exercise, though
a very fine one.  I'm only telling you this part for the history,
not because it's anything you need to know.

Anyway, 

        ^^expression^^

is close in meaning to

         ^expression$

and the two can often be interchanged.  I noticed Sean used the latter form
in his contribution to your thread yesterday.  The double-caret anchor
is particular to procmail, though, and it is not exactly identical to the
second form shown immediately above.  The double-caret anchor means
the very start or the very end of the entire text being searched, not just
of the particular line being parsed.  So

   :0 B
   * ^^Dear John,$
   | some_action

looks for "Dear John," followed by a line-end, only on the first line of
the body of the message.  If the string appears only in a subsequent line
or lines, it won't trigger that condition.
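
As a rough illustration of the difference, the sketch below runs the
same body search twice, once with the double-caret anchor and once with
the ordinary line anchor.  The log texts are invented for the example;
the first condition is the one shown above.

```procmail
NL = '
' # define newline variable

# Double caret: "Dear John," must be the very first line of the body.
:0 B
* ^^Dear John,$
{
  LOG = "greeting found at the very start of the body$NL"
}

# Single caret: "Dear John," may be any complete line of the body,
# including the first one.
:0 B
* ^Dear John,$
{
  LOG = "greeting found on some line of the body$NL"
}
```

A message whose body opens with "Dear John," should log both lines; a
message that only has the greeting further down should log the second
line alone.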

Variables are usually, though not always, only one line long anyway, so for
them the double-caret anchor and the more familiar ^ or $ amount to the
same thing.

I could have done this instead:

   * ! FQDN ?? .

"If FQDN does not contain any character ..." is, after all, logically
equivalent to "if FQDN is either empty or null."


My algorithm was: "If FQDN is empty, then we've run out of command-
line args, so exit this rcfile without writing the message to
$DEFAULT."  That's what I meant by

   :0
   * FQDN ?? ^^()^^
   { HOST }

(You can also look up "HOST" in "man procmailrc" and see why unsetting
it causes procmail to exit.)
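
Stripped of the domain parsing, the argument loop can be sketched on its
own as below.  The variable name ARG and the log text are placeholders
chosen for this illustration; HOST, SHIFT, SWITCHRC and $_ are used in
the same way as in the full rcfile in this message.  Like that rcfile,
this is for command-line testing only, since it never delivers mail.

```procmail
NL = '
' # define newline variable

ARG = $1          # first remaining command-line argument

# No argument left: unset HOST so procmail stops processing this rcfile.
:0
* ARG ?? ^^()^^
{ HOST }

LOG = "got argument >$ARG<$NL"

SHIFT = 1         # drop the argument just handled
SWITCHRC = $_     # re-enter this same rcfile for the next argument
```

It would be run the same way as the tests in this message, for example
"procmail -m rc one two three < /dev/null", with "rc" standing for
whatever the file is called.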


Thank you for, once again, a brilliant example.  Maybe you should
add this to a FAQ?

I have some nascent web pages about procmail, but have had them in
that stage for a few years and have not yet published them.  Any day
now ... :-)

You will want to take stock of one thing about the proffered syntax
that may not have been made clear: it's only a heuristic test,
meaning it asks "does the FQDN look like it could be along the lines
of 'example.co.uk'?"  It can be fooled by anything that looks
that way.  So, for example, my site -- vsnag.spamless.us -- would
not fool the algorithm, but if my site's root domain were "xx"
instead, well, vsnag.xx.us would give the wrong answer.

Here, I'll prove it:

  10:39pm [~] 671[0]> procmail -m rc vsnag.spamless.us vsnag.xx.us < /dev/null
 FQDN is >vsnag.spamless.us<
 DOMPART is ><
 TLD is >spamless.us<
 ---
 FQDN is >vsnag.xx.us<
 DOMPART is >vsnag<
 TLD is >xx.us<

Uh-oh.  Not only did I prove it (second example), but I found a bug I didn't
know about (first example).  Crud.  Now I have to fix it.

[waitiminute]

All right, I fixed it:

 10:49pm [~] 676[0]> procmail -m rc vsnag.spamless.us vsnag.xx.us foo.$HOST mars.example.co.uk < /dev/null
FQDN is >vsnag.spamless.us<
DOMPART is >spamless<
TLD is >us<
---
FQDN is >vsnag.xx.us<
DOMPART is >vsnag<
TLD is >xx.us<
---
FQDN is >foo.panix5.panix.com<
DOMPART is >panix<
TLD is >com<
---
FQDN is >mars.example.co.uk<
DOMPART is >example<
TLD is >co.uk<


But thinking about it further, I think we should hard-code "co" as the
only label accepted in front of a two-letter country code, so we don't
have the erroneous parsing for vsnag.xx.us.  Here is the new relevant code:


##################### start rcfile #####################

 FQDN = $1           # for testing on the command line

 :0
 * FQDN ?? ^^()^^
 { HOST }  # exit without delivery (lose any mail!) if no arg.
           # Repeating myself: this part is for testing only, not
           # production, because it will not deliver mail fed to it
  

 NL = '
' # define newline variable


 ###############################################
 # THIS IS THE START OF USEFUL PRODUCTION CODE #
 ###############################################

 TLDREGEX = ([cC][oO][.][^.][^.]|[^.]+)

 # find last domain subpart; if co.xx format, move
 # left one degree more

 :0
 * $ FQDN  ?? ()\/[^.]+[.]$TLDREGEX^^
 * MATCH ?? ^^\/[^.]+
 { DOMPART = $MATCH }



 # find TLD or country-style-format TLD.
 # Example: "com"; "org"; "co.uk"

 :0
 * $ FQDN ?? $\DOMPART[.]\/$TLDREGEX^^
 { TLD = $MATCH }
  
 ###############################################
 # THIS IS THE END OF PRODUCTION-CODE SECTION  #
 ###############################################



 LOG = "FQDN is >$FQDN<$NL"
 LOG = "DOMPART is >$DOMPART<$NL"
 LOG = "TLD is >$TLD<$NL"
 LOG = "---$NL"   # log iteration separator


 SHIFT = 1
 SWITCHRC = $_   # recurse


###################### end rcfile ######################
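
For production use, the test scaffolding (FQDN = $1, the HOST exit, the
LOG lines, SHIFT and SWITCHRC) would be dropped and FQDN filled from the
message itself.  The sketch below is one possible way to do that, not
part of the rcfile above: it assumes the host part of the From: header
is what should be parsed, and the character class is a deliberately
rough guess at what a host name looks like.  Real-world address headers
can be messier than this.

```procmail
# Hypothetical entry point: take whatever follows an "@" in the From:
# header as the FQDN.  Matching is case-insensitive unless the D flag
# is given, so a-z covers upper case as well.
:0
* ^From:.*@\/[-a-z0-9.]+
{ FQDN = $MATCH }

# ... the production-code section from the rcfile above goes here,
# setting DOMPART and TLD from $FQDN ...
```

Because these recipes only assign variables, the message itself is not
delivered by them and carries on to the rest of the rcfile.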


Here is the successful test:

 11:03pm [~] 681[0]> procmail -m rc vsnag.spamless.us vsnag.xx.us foo.$HOST mars.example.co.uk < /dev/null
FQDN is >vsnag.spamless.us<
DOMPART is >spamless<
TLD is >us<
---
FQDN is >vsnag.xx.us<
DOMPART is >xx<
TLD is >us<
---
FQDN is >foo.panix5.panix.com<
DOMPART is >panix<
TLD is >com<
---
FQDN is >mars.example.co.uk<
DOMPART is >example<
TLD is >co.uk<
---

Dallman


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
