procmail

Re: Simple Recipe

2001-06-13 01:40:25

This is a bit of a ramble on my part.  If something is too obscure,
just skip it; I promise I won't be insulted...


Dallman Ross <dman(_at_)nomotek(_dot_)com> writes:
> From: Philip Guenther <guenther(_at_)gac(_dot_)edu>
>
>> The "[        ]" right before the \/ token is actually unnecessary--the ".*"
>> is sufficient given what's after the \/ token--but it makes the intent
>> clearer (and it should be ever so slightly faster, I think).
>
> Thank you.  I have been wondering this for almost two years.  I never could
> understand why the expression right before the `\/' was ever being
> used.  To my mind it made things muddier, not simpler: the steps taken
> on the right to overcome explicitly the "left-handed greediness" that
> is default behavior seemed to me to be clear and quite enough.  But
> now I (think I) understand that the extra expression has been used as
> a form of "coder's documentation."

You got 'left' and 'right' swapped in that, but I believe you meant the
correct version.

Note that the constraint on what is matched directly to the left of the \/
token is only superfluous when it doesn't match any of those characters
that are matched by what is directly to the right of the \/ token.
Consider trying to extract the last two digits of the year in the Date:
header field:
        * ^Date:.*[1-9][0-9]+\/[0-9][0-9][^0-9]
        * MATCH ?? ^^\/[0-9][0-9]

The character classes just to the left of the \/ token are necessary
to constrain the match to the year part of the Date: and not the day of
the month.  (The "+" makes this regexp Y10K compliant...)
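
For anyone who wants to try that, here is a minimal sketch of the two
conditions wrapped in a complete recipe.  The variable name YY and the
sample Date: line in the comments are made up for illustration; only
the two conditions come from the text above.

```procmail
# Sketch only: YY is a hypothetical variable name.
#
# For a header such as "Date: Wed, 13 Jun 2001 01:40:25 -0500":
#   1st condition: MATCH becomes "01 "  (two digits plus one non-digit)
#   2nd condition: MATCH becomes "01"   (the trailing non-digit is dropped)
:0
* ^Date:.*[1-9][0-9]+\/[0-9][0-9][^0-9]
* MATCH ?? ^^\/[0-9][0-9]
{ YY=$MATCH }
```

The second condition is only there because the first one needs the
trailing [^0-9] to anchor the end of the year, and that character ends
up in MATCH along with the digits.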


Going back to the original example, it ran something like:

        * ^Subject:.*blah blah blah.*[  ]\/[^   ]+$

where we were trying to extract the last word from the Subject: line
after the text "blah blah blah".
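
As a rough illustration of how that condition splits a line, here it
is as a complete recipe with one traced example.  LASTWORD and the
sample Subject: are invented for the sketch, and the trace assumes I
have the stingy-left, greedy-right rule straight.

```procmail
# Sketch only: LASTWORD is a hypothetical variable name.
# Each bracket pair below holds one space and one tab.
#
# For "Subject: re: blah blah blah one two" the split should be:
#   left of \/ : "Subject: re: blah blah blah one "  (as short as possible)
#   MATCH      : "two"                               (as long as possible)
:0
* ^Subject:.*blah blah blah.*[ 	]\/[^ 	]+$
{ LASTWORD=$MATCH }
```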

The 'constraint on what is matched...left of the \/' is the "[   ]".
We can replace it by a regexp that matches the union of what it currently
matches and what is matched directly to the right of the \/ token without
changing the meaning of the regexp:

        * ^Subject:.*blah blah blah.*([  ]|[^   ])\/[^  ]+$

Because of procmail's "stingy" matching (the opposite of greedy matching)
to the left of the \/, that condition has the same effect as the
previous one.  It can be simplified quite a bit, because
        ([      ]|[^    ])
matches exactly one non-newline character, just like "." normally does.
That gives us:
        * ^Subject:.*blah blah blah.*.\/[^      ]+$

and then ".*." is the same as ".+", which leaves:
        * ^Subject:.*blah blah blah.+\/[^      ]+$


Aha!  We can now see I was wrong when I said that a simple ".*" was
equivalent to the ".*[  ]": should the expression match if the final
"blah" is part of the last word on the line?  That "[      ]" makes
the difference.


In the end, you can only simplify regexps so much before you find that
you had never stated *exactly* what you wanted.  Most of the time that's
because the simplification changes the meaning of the regexp (whether it
matches or not and, if so, what it extracts) but only for those cases
that you didn't care about anyway.  The simplification above changes
the behaviour when there's no whitespace after the last "blah".  If you
don't care what happens in that case, then the simplification is fine.
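
One way to see the difference for yourself is to run both conditions
against the same message and log what each extracts.  This is a sketch
under my own assumptions: STRICT and LOOSE are hypothetical variable
names, and the outcomes listed in the comments are what I would expect
from the matching rules rather than something I have logged.

```procmail
# Sketch only: STRICT and LOOSE are hypothetical variable names.
# Each bracket pair below holds one space and one tab.
#
# Expected, if the stingy rule works as described:
#   "Subject: blah blah blah one two"  -> strict=two  loose=two
#   "Subject: blah blah blahfoo"       -> strict=     loose=oo

# Original condition: whitespace must sit just before the extracted word.
STRICT=
:0
* ^Subject:.*blah blah blah.*[ 	]\/[^ 	]+$
{ STRICT=$MATCH }

# Simplified condition: any one character may sit before the extracted word.
LOOSE=
:0
* ^Subject:.*blah blah blah.+\/[^ 	]+$
{ LOOSE=$MATCH }

# Write both results to the logfile, one line per message.
LOG="strict=$STRICT loose=$LOOSE
"
```

Each result goes into its own variable, cleared beforehand, so that a
condition that fails cannot be mistaken for one that extracted
something.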


Telling a computer to do something is easy.  Figuring out _what_ you
want it to do is the hard part.


Philip Guenther

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
