procmail

Re: Extracting in an Alternation (Was Scoring Function Question)

1999-04-15 15:59:30
I had suggested to Ralph Sobek,

T> Heck, if it weren't for the extraction operator, we could combine the first
T> two:

T> * 9876543210 HB ?? ^^(.+$)*((From|Subject):.*Condition1|$(.*$)*.*Condition2)

T> If "Condition1" and "Condition2" had been the same, we could manage it even
T> *with* the extraction.  The limitation is that you can't extract inside an
T> alternation (though you can alternate on either side of an extractor).  I've
T> occasionally wished it were possible, but changing it probably would break
T> something else (besides bloating the code).

| The two Conditions ARE THE SAME!  Is your one-liner more efficient
| than Philip's 3 line version?

I don't know; actually, it replaces only two lines of Philip's, not all three.
Era's point that perhaps the unweighted and always required condition should
come first bears consideration.

| I could be happy with just the
| extraction before the Condition.  I suppose that the result would look
| like:
| 
| * HB ?? ^^(.+$)*((From|Subject):.*Condition|$(.*$)*.*Condition)

It would look like this (in order to avoid extracting inside an alternation):

* HB ?? ^^(.+$)*((From|Subject):|$(.*$)*).*\/Condition

| Is this any more efficient?  On second thought, this would let through
| `Condition' on other headers besides From or Subject!  And that is
| bad!

No, it wouldn't!  It would accept Condition in From:, in Subject: or in the
body, but not in other header lines.  Let's break down the regexp:

HB ??  ^^    We'll search head and body together and start from the very
             beginning of the head+body pair: that is, from the beginning
             of the head.

(.+$)*       zero or more NON-empty lines (that is, stay in the head)

(            then EITHER

[1]

(From|Subject):   the field name for From: or Subject: and its colon
                  while we're still in the head,

|           OR

[2]

$(.*$)*     after enough non-empty lines to get to the end of the head,
            the newline at the neck, then any number of lines (maybe none)
            into the body

)           and then after either [1] or [2],

.*\/Condition   any number of non-newline characters to stay in the same
                line but to move away from the header line's colon or from
                the left margin of the body, then Condition, extracting
                the text that matches Condition.
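
For anyone who wants to check that breakdown outside procmail, here is a
rough Python sketch.  It only approximates procmail's matcher, under these
assumptions: procmail's `$` is rendered as a literal newline, `^^` as `\A`,
the `\/` extraction marker as a capturing group, and procmail's default
case-insensitivity as `re.IGNORECASE`.  The function name `extract` and the
sample messages are made up for illustration.

```python
import re

# Approximate Python rendering (an assumption, not procmail's own engine) of
#   ^^(.+$)*((From|Subject):|$(.*$)*).*\/Condition
PATTERN = re.compile(
    r"\A(?:.+\n)*"              # zero or more non-empty lines: stay in the head
    r"(?:(?:From|Subject):"     # [1] a From: or Subject: field name, or
    r"|\n(?:.*\n)*)"            # [2] the blank line at the neck, then body lines
    r".*(Condition)",           # rest of the same line, then the extracted text
    re.IGNORECASE,
)

def extract(message):
    """Return the text procmail would put in MATCH, or None if no match."""
    m = PATTERN.match(message)
    return m.group(1) if m else None

# Hypothetical sample messages: head, blank line, body.
in_subject = "From: a@example.com\nTo: b@example.com\nSubject: a Condition\n\nhello\n"
in_body = "From: a@example.com\nTo: b@example.com\nSubject: hi\n\nsome Condition here\n"
in_to_only = "From: a@example.com\nTo: Condition@example.com\nSubject: hi\n\nhello\n"

print(extract(in_subject))   # Condition in Subject: is accepted
print(extract(in_body))      # Condition in the body is accepted
print(extract(in_to_only))   # Condition only in To: is rejected
```

The third message is the interesting one: once the match has passed the
start of the To: line, the only ways forward are a From: or Subject: field
name at the start of a later header line or the blank line at the neck, so
text on the To: line is never reached by `.*`.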

| Let's keep improving this thread.

Little need for further improvement ... except to put the unweighted
condition first, as Era suggested.  Or there is this version, which reduces
the nesting of alternations and so should be a little more efficient:

* HB ?? ^^(.+$)*(From:|Subject:|$(.*$)*).*\/Condition

Maybe that change makes it easier to see why it would match only if Condition
appears in From:, in Subject:, or in the body, but not if it appears only in
some other header line.
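
As a sanity check that the flattened alternation accepts the same messages
as the nested one, here is a small Python sketch.  It is only an emulation,
assuming procmail's `$` can be written as a newline, `^^` as `\A`, and `\/`
as a capturing group; the helper `make_message` and the sample addresses are
hypothetical.

```python
import re

HEAD = r"\A(?:.+\n)*"        # ^^(.+$)* : stay within non-empty header lines
TAIL = r".*(Condition)"      # .*\/Condition : same line, then the extraction

# Nested form:    ((From|Subject):|$(.*$)*)
NESTED = re.compile(HEAD + r"(?:(?:From|Subject):|\n(?:.*\n)*)" + TAIL, re.I)
# Flattened form: (From:|Subject:|$(.*$)*)
FLAT = re.compile(HEAD + r"(?:From:|Subject:|\n(?:.*\n)*)" + TAIL, re.I)

def make_message(from_="a@example.com", to="b@example.com",
                 subject="hi", body="hello"):
    """Build a tiny message: three header lines, a blank line, one body line."""
    return f"From: {from_}\nTo: {to}\nSubject: {subject}\n\n{body}\n"

samples = {
    "from": make_message(from_="Condition@example.com"),
    "subject": make_message(subject="Condition"),
    "body": make_message(body="Condition"),
    "to": make_message(to="Condition@example.com"),
    "none": make_message(),
}

# For each sample, record whether the nested and flattened forms match.
results = {
    name: (NESTED.match(msg) is not None, FLAT.match(msg) is not None)
    for name, msg in samples.items()
}
for name, pair in results.items():
    print(name, pair)
```

On these samples both forms should agree: a match for Condition in From:,
Subject:, or the body, and no match when it appears only in To: or nowhere.
This says nothing about relative speed inside procmail itself.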