Re: what regexps work?

Philip Guenther <guenther(_at_)gac(_dot_)edu> replied to me:

engine with egrep. What egrep? There are many versions. Here are

The _original_ egrep: extended, but not bloated, regexps.  In newer


Fair enough. It is misleading, then, to have references to egrep(1)
in the man page, as it cannot be known which version of egrep is
on the current system.

Before I answer the rest of your questions, have you considered just
*TRYING* them?  Then you know the answers you get are correct.


I *could* try them, but I might miss something subtle. Some programs
like \{n\}, some {n}, some don't support it; in vim the following pair
of regexps are not equivalent: "($|j)" and "(j|$)"; one will use the
$ as an anchor, one will use it as a literal; etc.

Can I code this perl regexp in procmail?
    /^From:.*[ <]([^\@]+\@(\1|[^.]{13,})\.(com|net)([ >]|$)/i


That should have been:
        /^From:.*[ <]([^\@]+)\@(\1|[^.]{13,})\.(com|net)([ >]|$)/i

(Ie, is {} understood and can I do backreferences?)


Procmail doesn't support backreferences.  Besides, that's not a valid
perl regexp, as you left out a closing paren somewhere.  The check
implied by that regexp _can_ be faked really closely in procmail with
careful use of MATCH.


Chances are near 100% that in the absence of backreferences, I would
use an external program to process the regexp rather than rework it for
what procmail does support.

Oh, and braces are just syntactic sugar anyway.


The same thing can be said about [] if you have (|). Compare these
two regexp fragments:

        [^.]{13,}

        [^.][^.][^.][^.][^.][^.][^.][^.][^.][^.][^.][^.][^.]+

I'd prefer the sugar.
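
A minimal sketch for checking that the two fragments above really are
interchangeable. It uses Python's re module as a stand-in (an assumption;
procmail's own engine is not being exercised here), and the helper name
same_verdict is hypothetical.

```python
import re

# The sugared form, and the same thing spelled out by hand:
# twelve copies of [^.] followed by one [^.]+
sugared = re.compile(r"[^.]{13,}")
expanded = re.compile(r"[^.]" * 12 + r"[^.]+")

def same_verdict(text):
    """Return True if both patterns agree on whether all of text matches."""
    return bool(sugared.fullmatch(text)) == bool(expanded.fullmatch(text))

# Probe just below, at, and just above the 13-character threshold.
for n in (12, 13, 14):
    sample = "x" * n
    print(n, bool(sugared.fullmatch(sample)), same_verdict(sample))
```

The two forms accept the same strings; the braces only save typing and
make the count visible at a glance.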

Why can't procmail have a real (zero-width) word boundary operator?

You'll have to ask Stephen that question.  Before you do so, consider
the return question: what can you do with them that you can't with the
non-zero width boundary tokens that procmail implements?  Think _very_
carefully here...


Two things come to mind.

1) I can't easily move regexps that work in other programs to procmail.
Take this perl regexp

        /^<[^%]+\b[^@]+>$/

Which will match "<b@a%d>" but not "<@%>" or "<g%o%o@d>". Here is
an equivalent that does not need a word boundary marker:

        /^<([^%]*([^a-z0-9%][a-z0-9]|[a-z0-9][^a-z0-9@])[^@]*)>$/

(Note that this is only superficially similar to a test that could
invalidate some types of email addresses.) The RE without the \b
is a lot harder for humans to understand.
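
A small sketch that runs both forms against the three sample strings,
with @ written literally. Python's re module is used as a stand-in for
Perl here (an assumption; the two agree on \b for these ASCII inputs),
and the variable names are illustrative only.

```python
import re

# Perl-style pattern using the zero-width \b.
with_boundary = re.compile(r"^<[^%]+\b[^@]+>$")

# Hand-expanded pattern from the message, with no boundary operator.
without_boundary = re.compile(
    r"^<([^%]*([^a-z0-9%][a-z0-9]|[a-z0-9][^a-z0-9@])[^@]*)>$"
)

samples = ["<b@a%d>", "<@%>", "<g%o%o@d>"]

# Print each sample with the verdict of both patterns side by side.
for s in samples:
    print(s, bool(with_boundary.search(s)), bool(without_boundary.search(s)))
```

Both patterns accept the first sample and reject the other two. This
only shows agreement on these three strings, not equivalence in general.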

2) Consider the case that $FRAGMENT has unknown contents.

        [-a-z0-9._]+@$FRAGMENT\>usa\.net

Compare the results of that for zero-width and non-zero width \> with
a variety of fragments. Here are some to start with: "mail.", "net",
and "".
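
One rough way to run that comparison is to model both readings of \>
in Python. Everything here is an assumption rather than procmail's
actual behavior: the zero-width reading is modeled with lookarounds,
the non-zero-width reading as a token that consumes one non-word
character, and the fragment is spliced in unescaped. The test addresses
are made up for illustration.

```python
import re

USER = r"[-a-z0-9._]+@"
TAIL = r"usa\.net"

# Assumption: zero-width end-of-word test (previous char is a word
# character, next char is not); it consumes nothing.
ZERO_WIDTH = r"(?<=\w)(?!\w)"
# Assumption: width-one token that consumes a single non-word character.
WIDTH_ONE = r"\W"

def matches(fragment, boundary, address):
    """Splice fragment in unescaped and search address for the result."""
    pattern = USER + fragment + boundary + TAIL
    return re.search(pattern, address) is not None

fragments = ["mail.", "net", ""]
addresses = ["joe@mailx.usa.net", "joe@net.usa.net",
             "joe@.usa.net", "joe@usa.net"]

# One row per (fragment, address): zero-width verdict, width-one verdict.
for frag in fragments:
    for addr in addresses:
        print(repr(frag), addr,
              matches(frag, ZERO_WIDTH, addr),
              matches(frag, WIDTH_ONE, addr))
```

Under these assumptions the zero-width form can never match, since the
character right after the boundary is the word character "u". The
width-one form matches whenever some non-word character sits between
the fragment and "usa".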

I can come up with more esoteric cases that can distinguish between
the usefulness of a perl style \b (any word boundary) and vi style
\< and \> (start and end word boundaries).

Does procmail know about [:alpha:] and company? How about [=n=]?

No, no, and no.  What's the correct locale to use for an email message

...

BTW: if you find a program that uses locales for [a-z], then it's
broken.  Locales should only be used for the [:foo:], [.ch.] and [=n=]


I have not looked at the POSIX spec; this is an issue I was recently
made aware of while reading Friedl's _Mastering Regular Expressions_. See
page 81 of the book, which implies that [a-z] should expand according to
locale, and submit a correction to the errata if it is wrong.

Elijah
------
please do not CC me when replying to the list