Re: Regex matching complete headers.

Nigel Swinson wrote:

2 points and some suggestions about the Sieve Home page

----------------------------------------------------------------------
In the regex draft we have in the example:

            # or the subject is all uppercase (no lowercase)
            header :regex :comparator "i;octet" "subject"
              "^[^:lower:]*$" ) {
What if the Subject is multiline and one of the lines contains
uppercase letters, while the other contains only lowercase letters?
i.e.

Subject: this is the first line that contains only lowercase
  this is a continuation of the Subject header but it contains
UPPERCASE LETTERS

The regular expression "^[^:lower:]*$" is going to match the first
line, and therefore give us a match, but this isn't what we intended.


Actually, you've misread the regex.  It is looking for subjects that DO
NOT contain any lowercase chars.  This being the case, :regex should
work for multiline headers, just like the other comparators.  In fact, I
tested it with my regex implementation in cmu-sieve and it works fine. 
Using this script:

require "regex";

if header :regex :comparator "i;octet" "subject" "^[^:lower:]*$" {
        discard;
}


A subject of:

Subject: FIRST LINE
        SECOND LINE

will be discarded

and a subject of:

Subject: FIRST LINE
        Second LINE

will be kept
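
To illustrate why both results come out this way, here is a minimal
Python sketch of the behaviour I'm describing.  It is NOT the cmu-sieve
code; the helper names (unfold, subject_has_no_lowercase) are made up
for this example, and the ASCII range [^a-z] is only an assumed
stand-in for the "no lowercase" class used in the draft.

```python
import re


def unfold(raw_value: str) -> str:
    """Join folded continuation lines into one logical header value.

    A line break followed by a space or tab is a fold, so only the
    line break is removed and the whitespace is kept.
    """
    return re.sub(r"\r?\n(?=[ \t])", "", raw_value).strip()


# Assumption: ASCII-only stand-in for "contains no lowercase characters".
NO_LOWERCASE = re.compile(r"[^a-z]*")


def subject_has_no_lowercase(raw_value: str) -> bool:
    """Apply the pattern to the ENTIRE unfolded value, not line by line."""
    return NO_LOWERCASE.fullmatch(unfold(raw_value)) is not None


all_upper = "FIRST LINE\n        SECOND LINE"
mixed = "FIRST LINE\n        Second LINE"

print(subject_has_no_lowercase(all_upper))  # matches, so it would be discarded
print(subject_has_no_lowercase(mixed))      # no match, so it would be kept
```

Because the pattern is anchored at both ends of the unfolded value, a
single lowercase character on any physical line prevents the match.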


It is my understanding that all of the comparators work on the
ENTIRE header contents, regardless of the number of lines, i.e., the script

if header :contains "subject" "second" {
        discard;
}

will discard either of the examples above.
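
The same idea as a rough Python sketch, assuming the default
"i;ascii-casemap" comparator (which ignores ASCII case) and a header
value that is unfolded before the test runs.  The function names here
are hypothetical and are not taken from any Sieve implementation.

```python
import re
import string

# i;ascii-casemap treats ASCII lowercase and uppercase letters as equal;
# this table maps a-z onto A-Z and leaves every other character alone.
_ASCII_TO_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def ascii_casemap(text: str) -> str:
    return text.translate(_ASCII_TO_UPPER)


def unfold(raw_value: str) -> str:
    """Remove the line break of each fold, keeping the whitespace after it."""
    return re.sub(r"\r?\n(?=[ \t])", "", raw_value).strip()


def header_contains(raw_value: str, key: str) -> bool:
    """Substring test over the whole unfolded header value."""
    return ascii_casemap(key) in ascii_casemap(unfold(raw_value))


for subject in ("FIRST LINE\n        SECOND LINE",
                "FIRST LINE\n        Second LINE"):
    print(header_contains(subject, "second"))
```

Both subjects contain "second" once case is ignored, even though the
word sits on the continuation line.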

I have a customer who quite sensibly would like to filter all messages
that have either no from header, or an empty from header.  Our exists
test will pass if the header exists but is empty, so we need a regex
test too that tests to say if the header is completely empty.


Try:

if anyof (not exists "from", header :regex "from" "^$")


Any good header parser should gobble up leading whitespace, so even if
the from: header exists and contains nothing but whitespace, the test
above should work.
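
A small Python sketch of the same two-part test, using the standard
library email parser as a stand-in for a Sieve implementation's header
parser.  The function name from_missing_or_empty is made up for this
example, and the explicit strip() is an assumption that mirrors the
"gobble up whitespace" behaviour described above.

```python
import re
from email import message_from_string


def from_missing_or_empty(raw_message: str) -> bool:
    """True if there is no From header, or its value is empty/whitespace."""
    msg = message_from_string(raw_message)
    value = msg.get("From")
    if value is None:
        # Corresponds to: not exists "from"
        return True
    # Corresponds to: header :regex "from" "^$" once whitespace is stripped
    return re.fullmatch(r"", value.strip()) is not None


no_from = "To: nigel@example.com\n\nbody\n"
blank_from = "From:    \nTo: nigel@example.com\n\nbody\n"
real_from = "From: ken@example.com\nTo: nigel@example.com\n\nbody\n"

for message in (no_from, blank_from, real_from):
    print(from_missing_or_empty(message))
```

The first two messages satisfy the test; the third does not.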

----------------------------------------------------------------------

Could we also allow:

   * \w in place of [:word:]
   * \s in place of [:space:]
   * \d in place of [:digit:]
   * \l in place of [:lower:]
   * \u in place of [:upper:]

I find them really quite useful.  Or are we trying to stick to POSIX
rigidly?


I went with POSIX simply because it is a defined spec and I didn't want
to describe exactly what a regex should be (with all the ifs/ands/buts,
etc).  If there is consensus that shortcuts like these are desired, I'm
not opposed to adding them to the draft.
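
If they were added, one way an implementation might support them
without changing the underlying regex engine is to rewrite the
shortcuts before compiling the pattern.  The Python sketch below is
purely illustrative: the table copies the list above as written, the
function name expand_shorthand is hypothetical, and nothing like this
is in the draft.

```python
import re

# Assumed mapping, copied from the proposed list of shortcuts.
SHORTHAND = {
    "w": "[:word:]",
    "s": "[:space:]",
    "d": "[:digit:]",
    "l": "[:lower:]",
    "u": "[:upper:]",
}


def expand_shorthand(pattern: str) -> str:
    """Rewrite \\w, \\s, \\d, \\l and \\u into their named classes.

    Every backslash is consumed together with the character after it,
    so an escaped backslash followed by 'd' is left untouched.
    """
    def replace(match: re.Match) -> str:
        # Unknown escapes (for example \. or \\) are returned unchanged.
        return SHORTHAND.get(match.group(1), match.group(0))

    return re.sub(r"\\(.)", replace, pattern)


print(expand_shorthand(r"^\d+\s\w*$"))
```

This only shows the substitution step; how the named classes would then
be written inside a bracket expression is left to the regex spec.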

Regards,
Ken
-- 
Kenneth Murchison     Oceana Matrix Ltd.
Software Engineer     21 Princeton Place
716-662-8973 x26      Orchard Park, NY 14127
--PGP Public Key--    http://www.oceana.com/~ken/ksm.pgp