procmail

Re: equivalent for backreference \1

2002-08-05 12:15:57
On  5 Aug, Holger Wahlen wrote:
| To match a subject line that contains any character from a given set
| three times in a row, Don Hammond suggested:
| 
| > *   ^Subject:.*\/[-!-,:-@]
| > * $ ^Subject:.*$MATCH$MATCH$MATCH
| 
| This won't work in all cases because even if there's a character from
| that set that appears three times in a row in the subject line, the
| first condition won't necessarily assign that particular character to
| MATCH. Take
| 
|   Subject: -Yeah!!!-
| 
| as an example: MATCH is set to "-" in the first condition, the second
| therefore checks for "---" and fails. (You didn't see this in the
| tests you posted earlier because you only tried "555", "567" and
| "666", but nothing like "5666".)
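
For anyone who wants to see the failure outside procmail, here is a rough
Python model of the point Holger makes. It is only a sketch: the function
names are made up for illustration, and Python's re module stands in for
procmail's matcher (procmail itself has no \1).

```python
import re

# The character class from the recipe, written as a Python regex.
CHARCLASS = r"[-!-,:-@]"

def first_match_approach(subject):
    """Model of the two-condition recipe: grab the FIRST class character,
    then look for that same character three times in a row."""
    m = re.search(CHARCLASS, subject)
    if m is None:
        return None
    run = re.search(re.escape(m.group()) * 3, subject)
    return run.group() if run else None

def backreference_approach(subject):
    """What a real backreference would do: ([class])\\1\\1."""
    m = re.search("(" + CHARCLASS + r")\1\1", subject)
    return m.group() if m else None

print(first_match_approach("-Yeah!!!-"))    # None: it only ever tries "---"
print(backreference_approach("-Yeah!!!-"))  # !!!
```

The first function commits to "-" because that is the first class character
it meets, which is exactly why the "!!!" later in the line is never tried.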

Oh Foo. Try this on for size.

(Philip has already provided the real answer, but here goes anyway.)

First, NONE of this pretends to be the best solution, nor even a good
one. Sometimes I play around simply for recreational education. Or is
that educational recreation?  Whatever ...  There's probably some point
for most people where simplicity wins, even over efficiency. And I
can't even represent that this is efficient anyway. So let's say it's
just for fun.

Near the end of the message are 3 rcfiles.  The first, thisrc, simply
sets up the test.  The second, .rc.substr, is misnamed and has little
useful functionality. ;-)  In fact I never had any worthwhile use for it
until now, and even that's doubtful.  It pretends to do substrings, but
really just truncates strings.  One of these days it may turn into a
generalized substring solution, but not any time soon. Not at least by
my hand. The third, .rc.backref, uses .rc.substr to create an exclusive
list (and count) of matched characters from a provided character class,
then iterates through that list to implement a back reference kludge.

Looking for Hanspeter's character class in the Subject of a message that
contains:

  Subject: -text """##!!!!&& more text %%% $$$$$ ** still more text

yields:

$ procmail ./thisrc <zmsg 
Found 8 suspects in:  -text """##!!!!&& more text %%% $$$$$ **
They are: *$%&!#"-
Did NOT Match (-)\1\1
Matched """
Did NOT Match (#)\1\1
Matched !!!!
Did NOT Match (&)\1\1
Matched %%%
Matched $$$$$
Did NOT Match (*)\1\1
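
The same logic can be modelled in a few lines of Python, which may make the
two passes easier to follow. This is a sketch of what the rcfiles compute,
not a translation of them: collect_suspects and find_runs are names invented
here, and the right-to-left scan mirrors the way the recipes peel characters
off the end of the string.

```python
import re

CHARCLASS = r"[-!-,:-@]"

def collect_suspects(text):
    """Distinct class characters, scanning right to left (the 'countem' pass)."""
    found = ""
    for ch in reversed(text):
        if re.fullmatch(CHARCLASS, ch) and ch not in found:
            found += ch
    return found

def find_runs(text, found):
    """Test each suspect, last one first, for 3 or more in a row
    (the 'findem' pass)."""
    results = []
    for ch in reversed(found):
        m = re.search("(?:%s){3,}" % re.escape(ch), text)
        results.append((ch, m.group() if m else None))
    return results

subject = ' -text """##!!!!&& more text %%% $$$$$ ** still more text'
found = collect_suspects(subject)
print("Found %d suspects; they are: %s" % (len(found), found))
for ch, run in find_runs(subject, found):
    print("Matched " + run if run else "Did NOT Match (%s)\\1\\1" % ch)
```

Run against the sample Subject, the model reports the same eight suspects in
the same order and the same four matching runs as the log above.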

Following are the 3 rcfiles.

---(cut here: thisrc)---
# ~/.procmail/test/thisrc
DEFAULT = /dev/null
NL = "
"
PMDIR = $HOME/.procmail
TESTDIR = $PMDIR/test
RCSUBSTR =  $PMDIR/.rc.substr

CHARCLASS = '[-!-,:-@]'

:0
* $ ^Subject:\/.*$CHARCLASS
{
  BACKREF = "$MATCH"
  INCLUDERC = $TESTDIR/.rc.backref
}
---(cut here: thisrc)---

---(cut here: .rc.substr)---
# ~/.procmail/.rc.substr
# Usage (in the rcfile that INCLUDERC's this one):
# SUBSTR = "some string to truncate"
# maxSUBSTR = i  # where i=number of characters at which to truncate
# INCLUDERC = /pathto/.rc.substr
# SUBSTR now contains truncated string

:0
* SUBSTR ?? ^^^^
{ SUBSTR="substring of ??? (see .rc.substr)" }

:0 E
{
  :0
  * recurseSUBSTR ?? ^^^^
  {
     xSCORE = ${maxSUBSTR:-70}   maxSUBSTR
     max1 = '.?'
     max2 = "$max1$max1"
     max5 = "$max2$max2$max1"
     max10 = "$max5$max5"
     max20 = "$max10$max10"
  }
  :0
  * $ $xSCORE^0
  *       -69^0
  { maxSUBSTR = "$max20$max20$max20$max10"  xSCORE = 0 }
  :0 E
  * $ $xSCORE^0
  *       -19^0
  { maxSUBSTR = "${maxSUBSTR:+$maxSUBSTR}$max20"  xSCORE = $= }
  :0 E
  * $ $xSCORE^0
  *        -9^0
  { maxSUBSTR = "${maxSUBSTR:+$maxSUBSTR}$max10"  xSCORE = $= }
  :0 E
  * $ $xSCORE^0
  *        -4^0
  { maxSUBSTR = "${maxSUBSTR:+$maxSUBSTR}$max5"  xSCORE = $= }
  :0 E
  * $ $xSCORE^0
  *        -1^0
  { maxSUBSTR = "${maxSUBSTR:+$maxSUBSTR}$max2"  xSCORE = $= }
  :0 E
  * $ $xSCORE^0
  { maxSUBSTR = "${maxSUBSTR:+$maxSUBSTR}$max1" }

  :0
  * $ $xSCORE^0
  *        -1^0
  {
     xSCORE = $=
     recurseSUBSTR = yes
     INCLUDERC = $_
  }
  :0 E
  * $ SUBSTR ?? ^^\/$maxSUBSTR
  { SUBSTR = "$MATCH"  max1  max2  max5  max10  max20  recurseSUBSTR }
}

---(cut here: .rc.substr)---

---(cut here: .rc.backref)---
# ~/.procmail/test/.rc.backref
:0
* 1^0 ! CHARCLASS ?? ^^\[.+]
* 1^0   BACKREF   ?? ^^^^
{ LOG = "$_ missing required variables$NL" }
:0 E
* recurse ?? ^^^^
{
  COUNTEM = "$BACKREF"
  recurse = "countem"
  INCLUDERC = $_
}

:0
* recurse ?? ^^countem^^
{
  :0
  * $ COUNTEM ?? ^^\/.*$CHARCLASS
  {
    COUNTEM = "$MATCH"
    :0
    * $ COUNTEM ?? ()\/$CHARCLASS^^
    {
      :0
      * $ ! FOUND ?? $\MATCH
      {
        FOUND = "$FOUND$MATCH"
        :0
        * $ ${COUNTED:-0}^0
        *               1^0
        { COUNTED = $= }
      }
      :0
      *  1^1 COUNTEM ?? .
      * -1^0
      {
        rSCORE = $=
        maxSUBSTR = $rSCORE
        SUBSTR = "$COUNTEM"
        INCLUDERC = $RCSUBSTR
        COUNTEM = "$SUBSTR"
        INCLUDERC = $_
      }
    }
  }
  :0 E
  {
    LOG = "Found $COUNTED suspects in: $BACKREF$NL"
    LOG = "They are: $FOUND$NL"
    recurse = findem
  }
}

:0
* recurse ?? ^^findem^^
* $ $COUNTED^0
{
  :0
  * FOUND ?? ()\/.^^
  {
     SUSPECT = $MATCH
     :0
     * $ BACKREF ?? ()\/$\SUSPECT$\SUSPECT$\SUSPECT+
     { LOG = "Matched $MATCH$NL" }
     :0 E
     { LOG = "Did NOT Match ($SUSPECT)\\1\\1$NL" }
  }
  :0
  * $ $COUNTED^0
  *         -1^0
  {
    COUNTED = $=
    SUBSTR = "$FOUND"
    maxSUBSTR = $COUNTED
    INCLUDERC = $RCSUBSTR
    FOUND = "$SUBSTR"
  }
  :0 E
  { COUNTED = 0 }
}

---(cut here: .rc.backref)---

A couple of notes:

.rc.substr will not return a string longer than 70 characters.

There should probably be a test that maxSUBSTR is numeric.
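
To illustrate both of those notes, here is a small Python model of the
truncation trick: build a pattern of n optional "any character" atoms and
take whatever it matches from the start of the string. It is a sketch under
assumptions. build_pattern and truncate are hypothetical names, the block
sizes simply echo max20/max10/max5/max2/max1, and the numeric check is one
guess at what such a test could look like, not something .rc.substr does.

```python
import re

def build_pattern(n):
    """Return '.?' repeated n times, assembled from blocks of 20, 10, 5, 2
    and 1, and capped at 70 like .rc.substr."""
    n = min(n, 70)
    pattern = ""
    for size in (20, 10, 5, 2, 1):
        while n >= size:
            pattern += ".?" * size
            n -= size
    return pattern

def truncate(s, max_len=70):
    """Keep at most max_len (and never more than 70) leading characters."""
    # Hypothetical guard: accept only a plain run of digits.
    if not re.fullmatch(r"[0-9]+", str(max_len)):
        raise ValueError("max_len must be numeric: %r" % (max_len,))
    return re.match(build_pattern(int(max_len)), s).group()

print(truncate("some string to truncate", 11))  # some string
```

Because every atom is optional, a string shorter than the limit comes back
unchanged instead of failing to match.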

RCSUBSTR needs to be suitably set or eliminated.

Other variables might need tweaking or removal.

At this point in .rc.backref:

     :0
     * $ BACKREF ?? ()\/$\SUSPECT$\SUSPECT$\SUSPECT+
     { LOG = "Matched $MATCH$NL" }
     :0 E
     { LOG = "Did NOT Match ($SUSPECT)\\1\\1$NL" }

something other than "LOG=" would need to be substituted.  In fact,
without an explicit delivery at this point there may not be a way to
stop the recursion for the remaining SUSPECT(s) in FOUND.  I don't have
a host that's current enough to test SWITCHRC=/dev/null, but suspect it
would just short circuit the current pass but not prevent further
recursion. The bottom line is I'm *pretty sure* the recursion is ok as
done above, but it can get tricky.

The "E" recipe just above would probably just be eliminated.

There are other LOG= statements that would need to be removed.

Followup messages from Hanspeter indicate he really wants something
like ([class]).*\1.*\1.  That should work also, but is not the way it is
written here.
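
As a rough illustration of that variant, again in Python rather than
procmail: the first function uses a genuine backreference, the second
imitates the kludge by trying each distinct class character in turn as
"c.*c.*c". Both function names are invented for this sketch.

```python
import re

CHARCLASS = r"[-!-,:-@]"

def with_backreference(subject):
    """([class]).*\\1.*\\1 : same class character three times, anywhere."""
    return re.search("(" + CHARCLASS + r").*\1.*\1", subject) is not None

def per_suspect(subject):
    """Kludge equivalent: for each distinct class character c, try c.*c.*c."""
    for ch in set(re.findall(CHARCLASS, subject)):
        c = re.escape(ch)
        if re.search(c + ".*" + c + ".*" + c, subject):
            return True
    return False

print(with_backreference("a!b!c!"), per_suspect("a!b!c!"))      # True True
print(with_backreference("-Yeah!!-"), per_suspect("-Yeah!!-"))  # False False
```

In rcfile terms the only change would be the pattern tested for each
SUSPECT; the suspect-collecting pass would stay as it is.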

.rc.backref requires a character class as written. I can't think of any
obvious reason why that should be necessary, other than that's what
precipitated the whole thing and this is all (even more) unnecessary
without it.  I've already wasted too much time to think about it any
more.

-- 
Reply to list please, or append "8" to "procmail" in address if you must.
Spammers' unrelenting address harvesting forces me to this...reluctantly.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail