Rik Kabel <rik(_at_)netcom(_dot_)com> writes:
...
Furthermore, the best way I have found to also eliminate trailing
whitespace on the last line of the MATCH is to simply repeat David's
formulation with a similar one, this time ending on the last non-whitespace
character. This can be extended to trimming any known character from a
string, a technique which has been discussed in the past.
:0 flags as needed
* some condition which places text into MATCH
* MATCH ?? ^^\/(.*$)*.+
* MATCH ?? ^^\/(.*$)*.*[^ ]
{ Result=$MATCH }
Actually, the second of the two conditions will do the complete job,
unless MATCH starts with a single newline and you're using a version of
procmail before 3.12. (If there's more than one newline, it still
won't work. See below.)
One caveat. When the original match contains a leading empty line (newline
only), that empty line is removed from the result. Intermediate
newline-only lines and trailing whitespace are preserved. Thus, if MATCH
contains:
...
Correct. This was a known 'feature' of procmail versions 3.10 through
3.11pre7. 'Feature' is in quotes because it wasn't one: the match
length wasn't shortened by one in this case, so that every time MATCH
starts with a newline, it ends up containing one character beyond the
last actually matched by the regexp. As of version 3.12, procmail
doesn't strip a leading newline from MATCH, period.
Thus, the above pair of conditions still won't do the job if there's
more than one leading newline: the final match will still be one
character too long.
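
The off-by-one described above can be pictured with a toy model. This is
only a sketch in Python, not procmail's actual code: the function names
and the (start, length) representation of a match are assumptions made
purely for illustration.

```python
def buggy_match(text: str, start: int, length: int) -> str:
    """Toy model of the pre-3.12 behaviour (hypothetical, not procmail source)."""
    if text[start:start + 1] == "\n":
        start += 1  # the leading newline is skipped...
        # ...but length is left alone, which is the bug being modelled
    return text[start:start + length]


def fixed_match(text: str, start: int, length: int) -> str:
    """Toy model of 3.12 and later: nothing is stripped."""
    return text[start:start + length]


# Suppose the regexp matched "\nabc" (start 0, length 4) inside this text.
text = "\nabc def"
print(repr(buggy_match(text, 0, 4)))  # one character past the real match
print(repr(fixed_match(text, 0, 4)))
```

In this model the buggy result loses the newline but picks up the
character that follows the real match, so a second pass over the result
cannot undo the damage.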
with only the first newline removed. This is true with 3.11pre4 and
3.11pre7. I have not yet tried other versions. Explanations of this are
welcomed, as are single-RE conditions to remove both trailing newlines and
whitespace.
You had it already:
* MATCH ?? ^^\/(.*$)*.*[^ ]
It just needs a less buggy version of procmail.
I'll note that there's a subtle improvement that can be made:
* MATCH ?? ^^\/.*($.*)*[^ ]
This is slightly faster, because the regexp engine doesn't have to
consider as many possible matches. With the first form, right after
seeing a newline, it has to consider both looping on the "(.*$)*"
construct and moving on to the ".*[^ ]" that follows it. It can only
eliminate the second possibility when it hits the next newline. With
the improved form, the engine has to consider multiple courses for every
non-whitespace character, but only for one character per split. It
should thus be more efficient by approximately the ratio of whitespace
characters to the total number of characters (e.g., if 10% of the
characters are whitespace, it'll be about 10% faster).
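
As a sanity check that the two orderings select the same text, here is a
rough Python analogue. It rests on several assumptions: the bracket
expression is taken to hold a space and a tab, "\n" stands in for
procmail's "$", and newline is excluded from the class explicitly.
Python's re module is a different engine, so this says nothing about
relative speed under procmail.

```python
import re

# Assumed translations of the two procmail conditions (not procmail syntax).
ORIGINAL = re.compile(r"\A(?:.*\n)*.*[^ \t\n]")   # ^^\/(.*$)*.*[^ ]
REORDERED = re.compile(r"\A.*(?:\n.*)*[^ \t\n]")  # ^^\/.*($.*)*[^ ]


def trim_trailing(pattern, text):
    """Return the text up to its last non-blank character, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


sample = "\n\nfoo bar  \nbaz \t\n\n"
print(repr(trim_trailing(ORIGINAL, sample)))
print(repr(trim_trailing(REORDERED, sample)))
```

Both patterns keep the leading newlines and interior blanks and stop at
the last character that is not a space, tab or newline.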
Also to be noted is that you can strip leading and trailing
whitespace/blank lines with both old and new versions of procmail in one
regexp:
* MATCH ?? ()\/[^ ](.*($.*)*[^ ])?
(For the inquisitive, the above is faster than the anchored version:
* MATCH ?? ^^[ ]*($[ ]*)*\/[^ ](.*($.*)*[^ ])?
and much faster than the simplistic anchored version:
* MATCH ?? ^^.*($.*)*\/[^ ](.*($.*)*[^ ])?
Moral: the implicit initial ".*" on unanchored regexps is faster than
any equivalent you put on the regexp.)
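
For readers more at home in another regexp dialect, the unanchored
condition can be approximated in Python. This is a sketch under the same
assumptions as before: the class holds space and tab, "\n" replaces "$",
and re.search() scanning forward plays the part of procmail's implicit
leading ".*". The name trim_both is made up for the example.

```python
import re

# Assumed Python counterpart of  ()\/[^ ](.*($.*)*[^ ])?
BOTH_ENDS = re.compile(r"[^ \t\n](?:.*(?:\n.*)*[^ \t\n])?")


def trim_both(text: str) -> str:
    """Strip leading and trailing spaces, tabs and newlines in one match."""
    m = BOTH_ENDS.search(text)
    return m.group(0) if m else ""


print(repr(trim_both(" \n\n  hello\nworld \t\n\n")))
print(repr(trim_both("x")))
```

The optional group is what lets a one-character value such as "x"
match: the first class grabs the character and the group is skipped.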
Here are some timings for a few conditions, each executed 10,000 times:
* FOO ?? ^^\/.*($.*)*[^ ]                        9.04
* FOO ?? ^^\/(.*$)*.*[^ ]                       10.83
* FOO ?? ()\/[^ ](.*($.*)*[^ ])?                 9.25
* FOO ?? ^^[ ]*($[ ]*)*\/[^ ](.*($.*)*[^ ])?     9.47
* FOO ?? ^^.*($.*)*\/[^ ](.*($.*)*[^ ])?        11.25
This was on a 360 MHz Sun Ultra60. FOO contained an entire message
(~1.2K in length) from my mailbox with a handful of whitespace and
newlines added before and after, for a total of 192 whitespace and
newline characters and 1110 non-whitespace/newline characters. That
gives an expected speed ratio between the first two regexps of about 15%
(192/(1110+192)) and an actual ratio of slightly more ((10.83-9.04)/10.83).
The numbers are too small to really say how closely they match.
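
The two ratios can be recomputed from the character counts and the
table timings quoted above. The snippet below is plain arithmetic in
Python, using the 9.04 and 10.83 figures from the table; the variable
names are just for the example.

```python
whitespace = 192   # whitespace and newline characters in FOO
other = 1110       # all remaining characters in FOO

# Predicted saving: the share of characters that are whitespace/newline.
expected = whitespace / (whitespace + other)

t_fast, t_slow = 9.04, 10.83   # first two rows of the timing table
actual = (t_slow - t_fast) / t_slow

print(f"expected {expected:.1%}, actual {actual:.1%}")
```

That works out to roughly 14.7% predicted against roughly 16.5%
measured.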
Philip Guenther