procmail

Re: Defining comments

1999-11-27 19:59:47
era eriksson <era(_at_)iki(_dot_)fi> writes:
On Fri, 26 Nov 1999 08:22:46 +0100, Rejo Zenger
<subs(_at_)sisterray(_dot_)xs4all(_dot_)nl> wrote:
     CHAR        =  <any ASCII character>        ; (  0-177,  0.-127.)
  char         = "[-~]+"

(This is how this came through. Doesn't look right, does it? Perhaps
you have a real NUL and a real DEL there, though. Or NUL to ~ (126))

To match the range NUL to ~ you have to reverse it and match everything
not in the range \x80-\xFF (200-377):

        char =  '[^\x80-\xFF]'

This is because it is impossible to include a literal NUL in a procmail
regexp: it is treated as the end of the string instead.  E-mail messages
may contain NULs, but procmail variables and rcfiles may not meaningfully
do so.

Note that CHAR is "any ASCII character", not "any string of ASCII
characters".  It matches one, not many.
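
As a rough cross-check of the complemented class, here is a small sketch
using Python's re module on bytes.  Python's engine is not procmail's
(Python can match a NUL directly, and procmail's negated classes also
exclude newline), so this only illustrates the set arithmetic: whatever is
outside \x80-\xFF is exactly 0 through 127.  The names CHAR and matched
are made up for the illustration.

```python
import re

# Complemented class: any single byte NOT in the range 0x80-0xFF.
CHAR = re.compile(rb'[^\x80-\xFF]')

# Try every possible byte value and record which ones the class accepts.
matched = [b for b in range(256) if CHAR.fullmatch(bytes([b]))]

print(matched[0], matched[-1], len(matched))

# CHAR is a single character, so a two-byte string is not a full match.
print(CHAR.fullmatch(b'ab'))
```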


     ctext       =  <any CHAR excluding "(",     ; => may be folded
                     ")", "\" & CR, & including linear-white-space> 
  ctext        = "([-'*-[]-~])+"

This doesn't look right, either. In regular expressions in general,
any ] is the closing bracket unless it's the first character in the
class (after any ^ modifier and possibly -) but frankly, I'm not sure
Procmail follows tradition here 100%. Anyway, I think your regex makes
sense intuitively, but I wouldn't be too sure grep and friends would
agree.

The only place inside a character class where a close bracket is not
treated as the end of the class is as the first character in the class,
skipping an optional negation.  I.e., "[]]" matches a close bracket,
and "[^]]" matches everything but a close bracket or a newline.  So,
the character class in the regexp:

        ctext   = "([-'*-[]-~])+"

is closed by the first close bracket and everything from there to the
close paren is taken literally.

So, how does one define ctext?  As before, a complemented character class is
needed to include the NUL.  As for CR, it's not special to procmail, so
just stick one in the character class:
        ctext = '[^()\^M\x80-\xFF]'
(^M above stands for one literal carriage-return character.)

Note that backslash is not special inside character classes, so there's
no need to double it in the above.
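
For comparison, a sketch of the same ctext class in Python's re module,
which is only a stand-in for procmail's matcher.  One difference worth
noting: Python does treat backslash as an escape inside a class, so there
it has to be doubled.  CTEXT and excluded are illustrative names.

```python
import re

# Anything except "(", ")", "\", CR, and the bytes 0x80-0xFF.
# Python needs the backslash doubled inside the class; procmail does not.
CTEXT = re.compile(rb'[^()\\\r\x80-\xFF]')

# List the 7-bit characters that the class rejects.
excluded = [bytes([b]) for b in range(128) if not CTEXT.fullmatch(bytes([b]))]
print(excluded)
```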


I've been too lazy to ever test this part of Procmail fully, but here
are some things to look at:
...
The "Matched" log entry comes before the corresponding "Match" but
here's a deciphered version

 [^[]+   matched abalaba
 [[].*   matched []()[]
 []].*   matched ]()[]
 [^]]+   matched abalaba[
 [[-]].* matched []()[]

These are exactly the way you would expect, given general regex
principles of longest-leftmost matching and the rules for how these
special cases of classes should be interpreted.

 [^-][]+ matched []

Huh? I would have expected this to match abalaba (or maybe ()) but
certainly not this.

As with close bracket, a minus sign is not special as the first character
of a character class, skipping an optional negation.  Furthermore, if a
regexp ends during the parsing of a character class, procmail will close
the class internally.  So, the regexp
        [^-][]+
is parsed as:
        character class matching everything but - and NL
        character class matching ] and +

Now does the result make sense?


 [^-[]]+ didn't match anything
 [-'*-[]-~] didn't match anything

Perhaps there should be a syntax error or at least a warning from
Procmail. These are illegal or at least weird regex syntax. (But
perhaps it should cope with the latter, in fact.)

None of those are illegal.  They are even all correctly closed.  They're
just really weird expressions to be matching:

[^-[]]+
        character class matching everything but -, [ and NL
        one or more ]

[-'*-[]-~]
        character class matching -, ', and the range * to [
        literal -
        literal ~
        literal ]

The first of those might be more clearly written as
        [^-[]\]+

but that's minor.
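
The bracket rules used in these breakdowns can be sketched as a toy
tokenizer.  This is an illustration of the rules as described in this
message, not procmail's actual parser: parse_classes is a made-up name,
escapes and quantifiers are ignored, and the implicit exclusion of NL from
negated classes is not modelled.

```python
def parse_classes(regexp):
    """Split a regexp into character classes and other characters."""
    tokens = []
    i = 0
    n = len(regexp)
    while i < n:
        if regexp[i] != '[':
            # Outside a class every character is reported as-is.
            tokens.append(('char', regexp[i]))
            i += 1
            continue
        i += 1                      # skip the opening bracket
        negated = i < n and regexp[i] == '^'
        if negated:
            i += 1
        members = []
        first = True
        # A "]" closes the class unless it is the first member; running
        # off the end of the regexp closes the class implicitly.
        while i < n and (first or regexp[i] != ']'):
            if i + 2 < n and regexp[i + 1] == '-' and regexp[i + 2] != ']':
                members.append(regexp[i] + '-' + regexp[i + 2])   # a range
                i += 3
            else:
                members.append(regexp[i])                         # one character
                i += 1
            first = False
        i += 1                      # skip the closing bracket, if there was one
        tokens.append(('negated class' if negated else 'class', members))
    return tokens


for rx in ("[^-][]+", "[^-[]]+", "[-'*-[]-~]"):
    print(rx, parse_classes(rx))
```

A minus sign is only folded into a range when it sits between two members,
so a leading minus falls through to the single-character branch.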


1. I think I cannot have $comment be completely correct because of this
   call to itself. This will cause a problem with nested "(" and ")", I
   guess. Can I escape these problems by adding those parentheses to
   both $ctext and $quoted_pair?

Basic language theory. Regular expressions are not a powerful enough
formalism to deal with nested parentheses. Briefly, a regular
expression is computationally equivalent to a simple automaton where a
match causes a transition from one state to another. The automaton
doesn't have a way to remember whether it's been in the same state
before, so it can't know how many opening parens there have been.

Correct.  Regexps can't count.
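
What is missing is a counter.  A minimal sketch of the kind of code that
can find the end of a nested comment, assuming comments are delimited by
"(" and ")" and that a backslash quotes the next character (the
quoted-pair from the question).  comment_end is a hypothetical helper,
not part of procmail.

```python
def comment_end(text, start=0):
    """Return the index just past the comment that opens at text[start],
    or -1 if there is no comment there or it never closes."""
    if start >= len(text) or text[start] != '(':
        return -1
    depth = 0
    i = start
    while i < len(text):
        c = text[i]
        if c == '\\':
            i += 2              # quoted-pair: skip the escaped character
            continue
        if c == '(':
            depth += 1          # one level deeper
        elif c == ')':
            depth -= 1
            if depth == 0:
                return i + 1    # the outermost comment just closed
        i += 1
    return -1


s = "(a (nested) comment) tail"
print(s[:comment_end(s)])
```

The depth variable is exactly the unbounded memory that a finite automaton
lacks.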


2. How to exclude CR from ctext? I just don't see how I could specify
   this here.

Hey, I had never thought of that. How +would+ one do that? Philip?

CR = control-M is not special to procmail.  Just put one in your rcfile.


Philip Guenther
