
Re: regex - matching capitalisation

2001-09-10 07:56:20
On 10 Sep, Eric Smith wrote:
| I am not having any success in matching for excessive use of caps -
| neither with gnu egrep nor procmail regex.  I am creating a filter
| condition for spam, excessive use of caps, so in perl I might go:
| 
| /([A-Z]{3,20}\s){4,}/
| 
| to match 4 or more consecutively capitalised words.  I found something
| that suggests something like
| *[A-Z][A-Z][A-Z]([A-Z]([A-Z])?)? 
| [A-Z][A-Z][A-Z]([A-Z]([A-Z])?)?[A-Z][A-Z][A-Z]([A-Z]([A-Z])?)?
| 
| but that matches everything.
| 
| any suggestions? - also what is the most efficient way to debug regex in
| procmail?
| 

Without seeing the recipe I can only guess, but a reasonable guess is
that you didn't use the "D" flag to make the match case sensitive.
Procmail matching is case insensitive by default. Additionally, your
regexp would match only words of 3-5 characters, and would allow only
a single space character between words. So let's work on the first
question while answering the second.
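
To illustrate the two points above outside procmail, here is a minimal
Python sketch. It is not procmail syntax: PATTERN is an assumed
transcription of the idea in the quoted expression (four words of 3-5
capitals, exactly one space apart) into Python's re module, and
matches() and the sample strings are made up for the illustration.
re.IGNORECASE stands in for procmail's default behaviour; leaving it
out stands in for the "D" flag.

```python
import re

# Assumed transcription (not procmail syntax): four words of 3-5 capital
# letters, each pair separated by exactly one space.
WORD = r"[A-Z]{3,5}"
PATTERN = " ".join([WORD] * 4)


def matches(text, ignore_case):
    """Return True if PATTERN is found anywhere in text."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(PATTERN, text, flags) is not None


plain = "please send that file over when done"  # ordinary lowercase prose
loud = "BUY THIS NOW FOLKS"                     # four short capitalised words

print(matches(plain, ignore_case=True))    # case-insensitive, like procmail without "D"
print(matches(plain, ignore_case=False))   # case-sensitive, like procmail with "D"
print(matches(loud, ignore_case=False))
# Words longer than five letters, or more than one space, defeat the pattern.
print(matches("LIMITED OFFER AVAILABLE TODAY", ignore_case=False))
print(matches("BUY  THIS NOW FOLKS", ignore_case=False))
```

With case ignored, the ordinary lowercase sentence already matches,
which is consistent with the "matches everything" symptom. The last two
calls show why a fixed 3-5 letter, single-space pattern is too strict
even once case is handled.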

---(cut here)---
# $HOME/.procmail/test/rerc
VERBOSE=no
NL="
"
# Can't think of a way to assign a multi-line variable on command line,
# so provide this default value
VAR="${VAR:-This is ALL CAPS X
  FOUR, and ...}"
LOG="VAR=\"$VAR\"$NL"

# space and tab in brackets
:0 D
* VAR ?? ()\<\/[A-Z]+\>[        ]*[A-Z]+\>[     ]*[A-Z]+\>[     ]*[A-Z]+\>
{  LOG="Matched: $MATCH$NL" }

:0
/dev/null
---(cut here)---

Run it:
$ procmail ./rerc VAR="This is ALL CAPS X FOUR, and ..." </dev/null
VAR="This is ALL CAPS X FOUR, and ..."
Matched: ALL CAPS X FOUR,

Now let's try a multi-line match by not providing an assignment to VAR
on the command line:

$ procmail ./rerc </dev/null
VAR="This is ALL CAPS X
  FOUR, and ..."
Matched: ALL CAPS X
  FOUR,

So far so good (and maybe good enough). But if we change the default
assignment to VAR in the rcfile to read (note the trailing garbage on
line 1):

VAR="${VAR:-This is ALL CAPS X ...
  FOUR, and them some}"

and run it:
$ procmail ./rerc </dev/null
VAR="This is ALL CAPS X ...
  FOUR, and them some"

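there's no match, presumably because the recipe allows only spaces and
tabs (after the one character matched by \>) between words, and the
"..." is neither.  A more tolerant word-boundary pattern, WB, seems to
take care of it: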

---(cut here)---
# $HOME/.procmail/test/rerc
VERBOSE=no
NL="
"
WB='(\>|\>+$\>*)'
VAR="${VAR:-This is ALL CAPS X ...
  FOUR, and ...}"
LOG="VAR=\"$VAR\"$NL"

:0 D
* $ VAR ?? $WB\/[A-Z]+$WB[A-Z]+$WB[A-Z]+$WB[A-Z]+$WB
{ LOG="Matched: $MATCH$NL" }

:0
/dev/null
---(cut here)---

Run it (twice):
$ procmail ./rerc VAR="This is ALL CAPS X FOUR, and ..." </dev/null
VAR="This is ALL CAPS X FOUR, and ..."
Matched: ALL CAPS X FOUR,

$ procmail ./rerc  </dev/null
VAR="This is ALL CAPS X ...
  FOUR, and ..."
Matched: ALL CAPS X ...
  FOUR,
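
The same idea can be tried outside procmail. The following Python
sketch is only loosely modelled on the recipe: it treats any run of
non-word characters (\W+, which covers newlines and the "..." garbage)
as the separator, so it is more permissive than WB, and unlike $MATCH
it does not include the trailing separator. The names LOUD and
find_loud_run are hypothetical.

```python
import re

# Approximation only: four capitalised words separated by any run of
# non-word characters (spaces, tabs, punctuation, newlines).
# The \b anchors keep the match from starting or ending inside a word.
LOUD = re.compile(r"\b[A-Z]+(?:\W+[A-Z]+){3}\b")


def find_loud_run(text):
    """Return the first run of four capitalised words, or None."""
    m = LOUD.search(text)
    return m.group(0) if m else None


# Sample strings taken from the test runs in this message.
print(repr(find_loud_run("This is ALL CAPS X FOUR, and ...")))
print(repr(find_loud_run("This is ALL CAPS X ...\n  FOUR, and ...")))
```

Python's re is case sensitive unless told otherwise, so nothing
equivalent to the "D" flag is needed here.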

There are other discussions of LOUD matching in the archives which may
be helpful and/or more efficient. But with this, you now have a means to
do your own testing.

-- 
                   /"\
Don Hammond        \ /     ASCII Ribbon Campaign
Raleigh, NC US      X        Against HTML Mail,
                   / \      and News Too

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
