Re: Hostname regexp (was Re: no longer Re: Spam: ...life style...)

Talking to myself again:

On Mon, 29 Sep 1997 23:41:06 +0300 (EET DST), I wrote:
 > On Mon, 29 Sep 1997 11:54:55 -0400 (EDT), rik(_at_)netcom(_dot_)com
 > (Rik Kabel) wrote:
 >> In another followup in this thread, someone presented a recipe for
 >> analyzing certain Received headers. That recipe used a regexp like
 >> [a-z][-a-z0-9_.]* to identify what I believe was meant to be a host
 >> name. There are three problems with this. First, host names are now
 > The full expression was ([a-z][-a-z0-9_.]*)* which is something
 > slightly different. The intention is to force the match to contain
 > +some+ alphabetic characters somewhere. Forcing the first character to

I guess one should probably break down and make that more stringent
for a real, valid hostname, and use something like
([a-z][-0-9_!@+]*\.?)+
or something even more forgiving for a faked hostname.

 >> changed). Finally, this regexp as I have reconstructed it allows
 >> consecutive dots. It might be more accurate, though pedantic and

I note that the faked host name that started this thread actually
contained two consecutive dots. :^)
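
To illustrate the consecutive-dots point, here is a minimal Python sketch that runs the full expression quoted above against a made-up host name. It assumes the whole string has to match (re.fullmatch) and that matching is case-insensitive, as procmail is unless the D flag is set; the helper name original_accepts and the test name are hypothetical.

```python
import re

# The full expression from the quoted recipe. Because "." sits inside the
# character class, any run of dots is accepted after the first letter.
ORIGINAL = re.compile(r"([a-z][-a-z0-9_.]*)*", re.IGNORECASE)

def original_accepts(name: str) -> bool:
    """Hypothetical helper: True if the entire string matches."""
    return ORIGINAL.fullmatch(name) is not None

# A made-up name with two consecutive dots still gets through.
print(original_accepts("mail..example.com"))
```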

 >> verbose (my forte), to define a host for this purpose with a regexp
 >> like ([a-z0-9][-a-z0-9]*\.)+[a-z][a-z][a-z]+, as long as it is followed

Meet the .fi domain. Oh, and don't forget the other 300 or so
two-letter TLDs. OTOH, allowing more than three letters for the TLD
also seems a bit too generous.
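
A small Python sketch of the objection, under the same assumptions as one would make for a procmail condition (case-insensitive) plus whole-string matching via re.fullmatch. The helper name and the host names are invented for illustration.

```python
import re

# The quoted pattern: the final component needs three or more letters.
THREE_PLUS = re.compile(r"([a-z0-9][-a-z0-9]*\.)+[a-z][a-z][a-z]+",
                        re.IGNORECASE)

def three_plus_accepts(name: str) -> bool:
    """Hypothetical helper: True if the entire string matches."""
    return THREE_PLUS.fullmatch(name) is not None

# Made-up names: a three-letter TLD, a two-letter TLD, an overlong "TLD".
for name in ("example.com", "example.fi", "example.abcdefgh"):
    print(name, three_plus_accepts(name))
```

The two-letter name is rejected, while the eight-letter ending is accepted, which is the "too generous" half of the complaint.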

How's this: 

([a-z0-9](-?[a-z0-9])+\.)+[a-z0-9][a-z0-9][a-z0-9]?

/* era */


Mea culpa. I +d when I should have ?d, and I certainly do know better.

Your recipe still has a problem or two. It does not pass x.fi or
9--11.com, while it does pass abc.911. The first two are valid, the
third is not, although if you want to register it, AlterNIC will probably
do so. (AlterNIC has six character TLDs now, all numeric.) ISO 3166 is
the basis for two-character TLDs, and all of those are currently alpha.
I don't know their rules, however, so this may be a coincidence.

My next candidate:

  ([a-z0-9](-?[a-z0-9])*\.)+[a-z][a-z][a-z]?
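
For comparison, a Python sketch that runs the previous pattern and this candidate over the three test names mentioned above. It assumes whole-string, case-insensitive matching; the names ERA, RIK and verdicts are labels made up for this example.

```python
import re

# Previous pattern: every label needs at least two characters, and the
# last component may contain digits.
ERA = re.compile(r"([a-z0-9](-?[a-z0-9])+\.)+[a-z0-9][a-z0-9][a-z0-9]?",
                 re.IGNORECASE)
# Candidate: single-character labels allowed, last component alphabetic.
RIK = re.compile(r"([a-z0-9](-?[a-z0-9])*\.)+[a-z][a-z][a-z]?",
                 re.IGNORECASE)

def verdicts(name: str) -> tuple[bool, bool]:
    """Return (previous pattern accepts, candidate accepts)."""
    return (ERA.fullmatch(name) is not None,
            RIK.fullmatch(name) is not None)

for name in ("x.fi", "9--11.com", "abc.911"):
    print(name, verdicts(name))
```

Run this way, the candidate accepts x.fi and rejects abc.911, but 9--11.com is still rejected by both patterns, because -? permits only a single hyphen between alphanumerics.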

It all really depends, I suppose, on one's intent. Do you wish to
affirmatively catch [yes, that was a split infinitive] each
syntactically incorrect domain? If so, you will find yourself
continually (maybe continuously) changing your regexp to match each sly
or incompetent variation spammers put out. If your intent is to verify
conformity with the standards and possibly reject that which does not
conform, however, you need only recreate in your recipe the appropriate
parts of the appropriate RFCs, and need change that only when the RFCs
change. SMOP.  (No smiley here, someone might take it for a recipe and
analyze it.)
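
As a rough sketch of the standards-conformance approach, the Python below checks a name against the host name syntax of RFC 952 as relaxed by RFC 1123 section 2.1: labels of letters, digits and hyphens, beginning and ending with a letter or digit, an alphabetic last component, and the 63 and 255 character limits. The function name is hypothetical, and this is a syntax check only, not a complete validator.

```python
import re

FLAGS = re.IGNORECASE | re.ASCII

# One label: 1 to 63 characters, no leading or trailing hyphen.
LABEL = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?", FLAGS)
# RFC 1123 notes that the highest-level label will be alphabetic.
ALPHA = re.compile(r"[a-z]+", FLAGS)

def is_rfc_hostname(name: str) -> bool:
    """Hypothetical helper: syntax check only, no DNS lookup."""
    if not name or len(name) > 255:
        return False
    labels = name.split(".")
    # An empty label means a leading, trailing, or doubled dot.
    if not all(LABEL.fullmatch(label) for label in labels):
        return False
    return ALPHA.fullmatch(labels[-1]) is not None

for name in ("x.fi", "9--11.com", "abc.911", "mail..example.com"):
    print(name, is_rfc_hostname(name))
```

Checking label by label rather than with one large regexp keeps each rule traceable to the text of the RFC, so it needs changing only when the RFCs do.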

-- 
Rik Kabel