procmail

Re: splitting digests, catching duplicates

1996-12-06 21:58:52
 :0
 * ^Subject: next-\/(announce|bugs|hardware|marketplace|misc|software|sysadmin)
 { NEWSGROUP=comp.sys.next.$MATCH }

That brings up the one thing I'd most like to see in a future version
of procmail: the ability to delimit multiple MATCHes, in the manner of
the /(.*)(.*)/$1$2/ syntax.  As it stands, \/ is match-to-end, which is
useful, but not as flexible as it could be.  (You can only capture one
thing, and only from the \/ to the end of the regexp.)
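
To make the contrast concrete, here is a small sketch using Python's re
module as a stand-in for a Perl-style engine (that substitution is my
assumption, and the subject line is made up for illustration).  The
first call rebuilds a newsgroup name from two separate captures; the
second imitates the single match-to-end capture that \/ gives today.

```python
import re

# Hypothetical header line, invented for this example.
subject = "Subject: next-bugs"

# Two numbered captures, reused in the replacement (\1 and \2 here
# play the role of $1 and $2 in the /(.*)(.*)/$1$2/ notation).
newsgroup = re.sub(r"^Subject: (\w+)-(\w+)$", r"comp.sys.\1.\2", subject)

# Rough imitation of the current behaviour: one capture that runs
# from a fixed point to the end of the match.
single = re.search(r"next-(.*)", subject).group(1)

print(newsgroup)
print(single)
```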

In my opinion, the regexps of ed/sed/grep/egrep are not useful enough;
Emacs' is too clumsy, but Perl's is just right.

If I could choose one to replace them all, I'd pick Perl's.  It has the
most convenient and powerful set of regexps of them all, by far.  I'm
not saying that procmail should become Perl, but it sure would be nice
if its regexp set were enhanced along the lines of the Perl regexps.

Here is a fairly complete set of regexp patterns for a "suggested"
enhancement to the Procmail regexp matcher:

   Syntax       Matches
   ------       -------
    \d          digit
    \D          non-digit
    \w          word character ([a-zA-Z0-9_])
    \W          non-word character
    \s          whitespace (space, tab, return, or newline)
    \S          non-whitespace

    \b          word boundary (between a word and a non-word character)
    [\b]        backspace (inside a character set)
    \B          non-(word boundary)
    \A          beginning of string
    \Z          end of string (or before newline at the end)

    \t          tab
    \n          newline
    \r          return
    \f          form feed
    \a          alarm
    \e          escape
    \033        octal char
    \x1B        hex char
    \c[         control char
    \l          lowercase next char
    \u          uppercase next char
    \L          lowercase till \E
    \U          uppercase till \E
    \E          end case modification
    \Q          quote regexp metacharacters till \E

    .           any character (except newline)
    ^           beginning of a line
    $           end of a line
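
As a quick illustration of a few of the classes and anchors above,
here is a sketch in Python, whose re module follows Perl's notation for
\d, \s and \b (some of the other escapes in the table, such as \e, \l
and \U, are Perl-only and are not shown).  The sample header line is
invented.

```python
import re

# Hypothetical header line, invented for this example.
line = "Received: from host42 by mail\tid AA1234"

# \d+ picks out each run of digits.
digits = re.findall(r"\d+", line)

# \s+ splits on any run of whitespace, including the tab.
fields = re.split(r"\s+", line)

# \b keeps "mail" from matching inside "email" or "mailer".
whole_words = re.findall(r"\bmail\b", "mail email mailer mail")

print(digits)
print(fields)
print(whole_words)
```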

These match multiple occurrences of a pattern, and match as many
as possible (an "aggressive" match):

    p*          zero or more occurrences of "p"
    p+          one or more occurrences of "p"
    p{n}        exactly n occurrences of "p"
    p{n,}       at least n occurrences of "p"
    p{,m}       at most m occurrences of "p"
    p{n,m}      at least n and at most m occurrences of "p"

These match multiple occurrences of a pattern, but as few as
possible (a "lazy" match):

    p*?         zero or more occurrences of "p" (lazy)
    p+?         one or more occurrences of "p" (lazy)
    p{n,}?      at least n occurrences of "p" (lazy)
    p{,m}?      at most m occurrences of "p" (lazy)
    p{n,m}?     at least n and at most m occurrences of "p" (lazy)
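
The difference between the two families is easiest to see side by
side.  A small sketch, again using Python's re module as a stand-in
for Perl-style matching (my assumption), with made-up addresses:

```python
import re

# Hypothetical text, invented for this example.
text = "<a@example.com> <b@example.com>"

# Aggressive: .+ runs to the LAST ">" it can reach.
greedy = re.search(r"<.+>", text).group(0)

# Lazy: .+? stops at the FIRST ">" that lets the match succeed.
lazy = re.search(r"<.+?>", text).group(0)

# The same contrast with a bounded repeat.
bounded = re.search(r"\d{2,4}", "123456").group(0)
bounded_lazy = re.search(r"\d{2,4}?", "123456").group(0)

print(greedy)
print(lazy)
print(bounded, bounded_lazy)
```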

Perl uses plain parens '(pat)' for naming and alternation, but
this would break the current procmail scripts, so we would have
to continue using '\(pat\)'.

    \(p\)               name the pattern matched by "p" (\1, \2, ...)

These would map to Procmail variables: $MATCH1, $MATCH2, $MATCH3, ...
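
Roughly, the mapping I have in mind could be sketched like this in
Python.  The helper name match_variables is hypothetical, Python
writes the groups as plain parens rather than the proposed \( \), and
the From: line is invented; only the MATCH1, MATCH2, ... naming comes
from the proposal itself.

```python
import re

def match_variables(pattern, text):
    """Return a dict of MATCH1, MATCH2, ... for the first match of
    pattern in text, or an empty dict if nothing matches."""
    m = re.search(pattern, text)
    if m is None:
        return {}
    # Groups are numbered from 1, in order of their opening paren.
    return {f"MATCH{i}": g for i, g in enumerate(m.groups(), start=1)}

# Hypothetical header line, invented for this example.
variables = match_variables(r"^From: (\w+)@(\w+)", "From: aks@sgi")
print(variables)
```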

And, borrowing Perl's non-naming syntax: \(?:p\)

    \(?:p\)             match pattern without naming it
    \(?!p\)             zero-width negative lookahead: succeeds only
                        where "p" does not match; names nothing
    \(?#comment\)       allows inline comments until \) or end of line
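
For reference, this is how the three constructs behave in an engine
that already has them.  The sketch uses Python's re module, which
spells them with plain parens, (?:p), (?!p) and (?#comment), rather
than the backslashed forms proposed above; the sample strings are
invented.

```python
import re

# (?:...) groups for alternation without creating a numbered capture,
# so the only capture is the word after the prefix.
m = re.search(r"(?:Re|Fwd): (\w+)", "Re: digests")
captures = m.groups()

# (?!...) is a zero-width test: match "next-<word>" unless the word
# starts with "bugs".
groups = "next-bugs next-misc next-software"
not_bugs = re.findall(r"\bnext-(?!bugs)\w+", groups)

# (?#...) is ignored by the matcher.
issue = re.search(r"\d+(?#issue number)", "digest 42").group(0)

print(captures)
print(not_bugs)
print(issue)
```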

Stephen, 

If I ripped Perl's regexp parser/engine out, worked it into procmail,
and documented it, would you support it?

___________________________________________________________
Alan Stebbens <aks(_at_)sgi(_dot_)com>      http://reality.sgi.com/aks
