procmail

Re: Eliminating Linefeed embedded in a $MATCH result

1999-05-02 23:02:12
Rik Kabel <rik(_at_)netcom(_dot_)com> writes:
...
Furthermore, the best way I have found to also eliminate trailing
whitespace on the last line of the MATCH is to simply repeat David's
formulation with a similar one, this time ending on the last non-whitespace
character. This can be extended to trimming any known character from a
string, a technique which has been discussed in the past.

:0 flags as needed
* some condition which places text into MATCH
* MATCH ?? ^^\/(.*$)*.+
* MATCH ?? ^^\/(.*$)*.*[^     ]
{ Result=$MATCH }

Actually, the second of the two conditions will do the complete job,
unless MATCH starts with a single newline and you're using a version of
procmail before 3.12.  (If there's more than one newline, it still
won't work.  See below.)


One caveat. When the original match contains a leading empty line (newline
only), that empty line is removed from the result. Intermediate
newline-only lines and trailing whitespace are preserved. Thus, if MATCH
contains:
...

Correct.  This was a known 'feature' of procmail versions 3.10 through
3.11pre7.  'Feature' is in quotes because it wasn't one: the match
length wasn't shortened by one in this case, so that every time MATCH
starts with a newline, it ends up containing one character beyond the
last actually matched by the regexp.  As of version 3.12, procmail
doesn't strip a leading newline from MATCH, period.

Thus, the above pair of conditions still won't do the job if there's
more than one leading newline: the final match will still be one
character too long.


with only the first newline removed. This is true with 3.11pre4 and
3.11pre7. I have not yet tried other versions. Explanations of this are
welcomed, as are single-RE conditions to remove both trailing newlines and
whitespace.

You had it already:
        * MATCH ?? ^^\/(.*$)*.*[^       ]

It just needs a less buggy version of procmail.

I'll note that there's a subtle improvement that can be made:

        * MATCH ?? ^^\/.*($.*)*[^       ]

This is slightly faster, because the regexp engine doesn't have to
consider as many possible matches.  With the former, right after seeing
a newline, it has to consider both the possibility of looping on the
"(.*$)*" construct and that of moving on to the trailing ".*[^      ]".  It can
only eliminate the latter when it hits the next newline.  With the
latter, the engine has to consider multiple courses for every
non-whitespace character, but only for one character per-split.  It
should thus be more efficient by approximately the ratio of whitespace
characters to the total number of characters (e.g., if 10% of the
characters are whitespace, it'll be about 10% faster).
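
To show how the faster condition might sit in a complete recipe, here is
a minimal sketch.  The Subject: condition and the variable name TRIMMED
are assumptions made up for illustration; any condition that sets MATCH
would do.  Each bracket is meant to hold one space and one tab.

```procmail
# Sketch only: the Subject: condition and TRIMMED are illustrative.
# The first condition places the subject text into MATCH; the second
# re-matches MATCH so that it ends on the last non-whitespace character.
:0
* ^Subject:[ 	]*\/.+
* MATCH ?? ^^\/.*($.*)*[^ 	]
{ TRIMMED=$MATCH }
```

If MATCH can begin with a newline, this relies on procmail 3.12 or
later, for the reason given above.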


Also to be noted is that you can strip leading and trailing
whitespace/blank lines with both old versions of procmail in one
regexp:

        * MATCH ?? ()\/[^       ](.*($.*)*[^    ])?

(For the inquisitive, the above is faster than the anchored version:

        * MATCH ?? ^^[  ]*($[   ]*)*\/[^        ](.*($.*)*[^    ])?

and much faster than the simplistic anchored version:

        * MATCH ?? ^^.*($.*)*\/[^       ](.*($.*)*[^    ])?

Moral: the implicit initial ".*" on unanchored regexps is faster than
any equivalent you put on the regexp.)
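
As a sketch of the unanchored form in use, the recipe below trims both
ends of a captured value in one condition.  The Subject: condition and
the variable name TRIMMED are again assumptions chosen for illustration,
and each bracket is meant to hold one space and one tab.

```procmail
# Sketch only: the Subject: condition and TRIMMED are illustrative.
# MATCH may start and end with blanks after the first condition; the
# second keeps the text from the first non-whitespace character through
# the last one.
:0
* ^Subject:\/.*
* MATCH ?? ()\/[^ 	](.*($.*)*[^ 	])?
{ TRIMMED=$MATCH }
```

If MATCH holds nothing but whitespace, the second condition fails and
TRIMMED is left unset, so a caller may want to clear it beforehand.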


Here are some timings for a few conditions, each executed 10,000 times:

* FOO ?? ^^\/.*($.*)*[^         ]                                9.04
* FOO ?? ^^\/(.*$)*.*[^         ]                               10.83
* FOO ?? ()\/[^         ](.*($.*)*[^    ])?                      9.25
* FOO ?? ^^[    ]*($[   ]*)*\/[^        ](.*($.*)*[^    ])?      9.47
* FOO ?? ^^.*($.*)*\/[^         ](.*($.*)*[^    ])?             11.25

This was on a 360MHz Sun Ultra60.  FOO contained an entire message
(~1.2K in length) from my mailbox with a handful of whitespace and
newlines added before and after, for a total of 192 whitespace and
newline characters and 1110 non-whitespace/newline characters.  That
gives an expected speed ratio between the first two regexps of about 15%
(192/(1110+192)) and an actual of slightly more ((10.83-9.04)/10.83).
The numbers are too small to really say how closely they match.


Philip Guenther
