Re: Killing countries by email

Shane asked,

| Can someone explain in a little further detail what the purpose of this
| part of the country blocker does?
|
|         :0
|         * ^Received:.*\.\/(cn|nl|jp)[^  ][      ]
|         * MATCH ?? ^^\/[a-z]+
|
| I understand it until the part where it looks like we're trying to not
| match 2 blank spaces, but we do want to match 6 blanks?  What purpose
| does that serve?  Why couldn't you just use:
|
|       * ^Received:.*\.\/(cn|nl|jp)

The brackets do not contain runs of spaces.  The first one encloses caret,
space, tab; the second encloses space and tab.  Somewhere along the line
an editor program or a mail transport apparently replaced the tab with spaces
to the column where it ended.  That first condition should read,

        * ^Received:.*\.\/(cn|nl|jp)[^ 	][ 	]

The point is to make sure that the character after the country code is
something visible (usually a right-side angle bracket or a comma or a
semicolon, I'm guessing; it is folly to try to read Era's mind), followed by
a space or a tab.  The point is not to make a false match on, say, mail
from somebody@somemachine.cnn.com, for one example.
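
For anyone who wants to poke at the difference, here is a rough sketch in
Python rather than procmail.  It rests on several assumptions of mine:
Python's re module is not procmail's regexp engine, a capture group stands
in for the \/ extractor, the lazy .*? on the left is my own choice (procmail
may pick a different occurrence when several qualify), and both sample
headers are made up.

```python
import re

# Capture group 1 stands in for what procmail would put in $MATCH after \/.
# re.I mimics procmail's default case-insensitivity.
BARE = re.compile(r"^Received:.*?\.(cn|nl|jp)", re.I | re.M)
CHECKED = re.compile(r"^Received:.*?\.((?:cn|nl|jp)[^ \t][ \t])", re.I | re.M)

def extracted(pattern, header):
    """Return the text the group captured, or None if the condition fails."""
    m = pattern.search(header)
    return m.group(1) if m else None

# Made-up headers, for illustration only.
FROM_CNN = ("Received: from somemachine.cnn.com "
            "(somemachine.cnn.com [192.0.2.1]) by mx.example.org")
FROM_CN = ("Received: from unknown (HELO mail.example.cn) "
           "(192.0.2.7) by mx.example.org")

print(repr(extracted(BARE, FROM_CNN)))     # 'cn'   (false match inside .cnn.com)
print(repr(extracted(CHECKED, FROM_CNN)))  # None   (no visible char plus blank after "cn")
print(repr(extracted(CHECKED, FROM_CN)))   # 'cn) ' (code, closing paren, space)
```

The bare pattern is satisfied by the "cn" inside ".cnn.com"; the checked one
is not, because what follows "cnn" there is a period, not a blank.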

| I'm also confused about the next line.  We seemingly already have the
| 2-letter country designation in $MATCH at this point, so doesn't that
| get blown away by the next expression?  Namely: ^^\/[a-z]+

No, that is not what we have in $MATCH at that point.  We have the two
letters, the next non-blank character, and the space or tab: the four
characters matched by the right side of the previous condition's expression.

| And I don't understand what that comparison is checking for.  You're
| matching at the beginning of the expression (which is $MATCH) and
| splitting it into 2 halves, the second of which must match [a-z]+ and
| whose result ends up replacing what's in the $MATCH variable.

Yes; we're looking at the old value of $MATCH and taking, starting at the
beginning, as many alphabetic characters as there are (at least one).  That
way we end up with the two letters of the country code, without the closing
punctuation or the trailing space or tab from the previous value of $MATCH.
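
The two steps can be imitated in Python, again only as a sketch under
assumptions: the function names are mine, the header is invented, a capture
group plays the part of \/, and \A is the nearest Python equivalent I know
of for procmail's ^^ anchor.

```python
import re

def first_condition(header):
    # Group 1 plays the part of $MATCH: code, one visible character, one blank.
    m = re.search(r"^Received:.*?\.((?:cn|nl|jp)[^ \t][ \t])",
                  header, re.I | re.M)
    return m.group(1) if m else None

def second_condition(old_match):
    # \A stands in for ^^ (the very start of the text being searched);
    # [a-z]+ then takes the leading run of letters, at least one.
    m = re.search(r"\A([a-z]+)", old_match, re.I)
    return m.group(1) if m else None

# Made-up header, for illustration only.
header = ("Received: from unknown (HELO mail.example.nl) "
          "(192.0.2.7) by mx.example.org")
old = first_condition(header)
new = second_condition(old)
print(repr(old), "->", repr(new))   # 'nl) ' -> 'nl'
```

The first value is the four-character string; the second is just the
country code.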

Personally, I think that the first condition should have been something more
like this:

         * ^Received:.*\.\/(cn|nl|jp)[^-._0-9a-z]

to make sure that we were matching on the last two characters of the domain
(well, last three if we count the period before the extractor).  We would
still need the next condition to strip the extra character off and get only
the country code out of it.
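
Here is a Python sketch of how that stricter condition behaves next to the
original one.  The same caveats apply: this is Python's re module rather
than procmail, the names are mine, and the host names (including the
internal-looking "inside.nlx") are invented purely to exercise the patterns.

```python
import re

# Group 1 stands in for $MATCH in both patterns.
ORIGINAL = re.compile(r"^Received:.*?\.((?:cn|nl|jp)[^ \t][ \t])", re.I | re.M)
STRICTER = re.compile(r"^Received:.*?\.((?:cn|nl|jp)[^-._0-9a-z])", re.I | re.M)

def country(pattern, header):
    """Apply one condition, then keep only the leading letters of the match."""
    m = pattern.search(header)
    if not m:
        return None
    return re.search(r"\A([a-z]+)", m.group(1), re.I).group(1)

# Made-up headers, for illustration only.
ODD_HOST = "Received: from gw.example.org by inside.nlx with SMTP"
PLAIN_CN = ("Received: from mail.example.cn "
            "(mail.example.cn [192.0.2.7]) by mx.example.org")

print(country(ORIGINAL, ODD_HOST))   # nlx  (the "x" passed as the visible character)
print(country(STRICTER, ODD_HOST))   # None (a letter may not follow the code)
print(country(ORIGINAL, PLAIN_CN))   # None (the code is followed directly by a space)
print(country(STRICTER, PLAIN_CN))   # cn
```

With the stricter class $MATCH holds three characters rather than four, and
the letters-only step still trims it to the country code.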

(Of course, thanks to procmail, we *do* know somebody in the Netherlands.)