Re: Handling Excessive Quoting ?

At 14:00 2002-07-07 -0700, GreenTree Ground Station did say:

> here might have info on some kind of ratio calculator
> that might work in conjunction with procmail to check

We discussed this a few months back when I was developing a number of rulesets for a mailing list preprocessor. The following code is basically what I'm using now. It supports list-specific thresholds - which I get from a line extracted from a file, but you could mimic that for a standalone filter like so:


# as extracted, expects a listname at the beginning, so all options are
# expected to have a leading space as a delimiter
FILTER_OPTIONS=" BLOATOK BLOAT_IN=120"

This filter uses a few different quoting marks [:|>], and doesn't assume any special quote format. Some braindead MUAs - or their users - do things like putting a bracket at the beginning of a quoted section, and another at the end:


> several
lines of text
terminated with <

Good luck making any sense of that BS when so many people don't even do that consistently (their OWN text sometimes appears in response WITHIN the content they quote in that fashion).


The ruleset:

# First, determine if we should be COPYING bloated messages or not
# (that is, are these just warnings?).  Not only defines if a copy
# is processed, but also affects the advisory message sent.
:0
* $ FILTER_OPTIONS ?? [        ]BLOATOK\>
{
        BLOATCOPY=c
}

#[snip - other optional detection method using a list-added footer banner]

# Define the filter ID
FILTER_ID="BLOATQUOTE"

# The decidedly more involved method - the conditions were pulled from
# <http://pm-doc.sourceforge.net/pm-tips-body.html#195>
# "14.3 Excessive quoting of message"

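# X-Loop must match what is being used elsewhere.  Note that the E flag
# makes this recipe an "else" of whatever recipe directly precedes it
# (here, one of the snipped ones); drop the E if you leave those out.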
:0E
* $ FILTER_OPTIONS ?? [         ]$FILTER_ID\>
* ! ^X-Loop:[   ]+$LOOPALERT
* ! ^FROM_MAILER
{
        # Lines are scored in three classes:
        # - quoted lines
        # - non-blank, non-quoted lines
        # - completely blank lines

        # Establish defaults for all lists (used unless overridden)
        # INitial credit (-)
        BLOAT_IN=80
        # QUote cost (+)
        BLOAT_QU=10
        # NeW line credit (-)
        BLOAT_NW=14
        # BLank line credit (-)
        BLOAT_BL=5

        # locate option values for scoring
        :0
        * OPTIONS ?? [  ]BLOAT_IN=\/[0-9]+
        {
                BLOAT_IN=$MATCH
        }

        :0
        * OPTIONS ?? [  ]BLOAT_QU=\/[0-9]+
        {
                BLOAT_QU=$MATCH
        }

        :0
        * OPTIONS ?? [  ]BLOAT_NW=\/[0-9]+
        {
                BLOAT_NW=$MATCH
        }

        :0
        * OPTIONS ?? [  ]BLOAT_BL=\/[0-9]+
        {
                BLOAT_BL=$MATCH
        }

        # start with a zero extra score
        addscore=0

        :0
        * ^X-Mailer:[   ]*Microsoft Outlook
        {
                # Compute Outlook adjustments
                :0
                * $ $BLOAT_QU^0
                * $ $BLOAT_NW^0
                {
                        BLOAT_OL=$=
                }

                VERBOSE=ON

                # little extra check - MS uses non-conventional
                # way of quoting -- the buggers include most of the header
                # of the original message...
                # the count against values used here are equal to the value
                # a quoted header would NORMALLY have, *PLUS* the value that
                # the next rule will be granting these same lines because it
                # thinks they're NOT quoted lines.  This still doesn't
                # accommodate the extra blanks in there, but by and large
                # should deal with the overall quoting scheme anyway.  Only
                # triggers if the initial "original message" line is found.
                :0B
                * ^[       ]*----- Original Message -----
                * $ $BLOAT_OL^0
                * $ $BLOAT_OL^1 ^[       ]*(From|To|Sent|Subject):
                {
                        # note the score, so we can add it in the generic
                        # check
                        addscore=$=
                }
        }

        # see the details of the logic - then turn this off when you
        # understand it
        VERBOSE=ON

        # initial line between header and body is counted as one of the
        # blank lines when issuing credits..  [snip] lines and the sort
        # are as well..
        :0B$BLOATCOPY
        * $ -$BLOAT_IN^0
        * $ $BLOAT_QU^1 ^[      ]*[>|:]
        * $ -$BLOAT_NW^1 ^[     ]*[^>|:         ]
        * $ -$BLOAT_BL^1 ^[     ]*$
        * $ $addscore^0
        {
                ### What follows is diagnostic stats - you can omit it.

                BOUNCENOTES="Scored"
                # note the final (positive) score, so
                # we can add it to the advisory header
                BOUNCENOTES="$BOUNCENOTES ($= weight)"

                # Okay, we know we're going to bounce, but let's just
                # re-compute the values for each line (sadly, we can't
                # store results from each condition above)

                VERBOSE=OFF

                :0
                * ^X-Mailer:[   ]*Microsoft Outlook
                {
                        :0B
                        * ^[       ]*----- Original Message -----
                        * 1^0
                        * 1^1 ^[       ]*(From|To|Sent|Subject):
                        {
                                addscore=$=
                        }
                }

                :0B
                * 1^1 ^[       ]*[>|:]
                * $ $addscore^0
                {
                        BOUNCENOTES="$BOUNCENOTES ($= quoted)"
                }

                :0B
                * 1^1 ^[      ]*[^>|:         ]
                {
                        BOUNCENOTES="$BOUNCENOTES ($= new)"
                }

                :0B
                * 1^1 ^[       ]*$
                {
                        BOUNCENOTES="$BOUNCENOTES ($= blank)"
                }

                LOG="BOUNCENOTES: $BOUNCENOTES$NL"

                ### End diagnostic stats.

                # choose an advisory text.  Note that if you want two
                # different messages - one for those which are being
                # permitted ("warning") and those which are refused, you
                # can use the BLOATCOPY flag ('c') to differentiate the
                # two files.
                BOUNCEMSG=bloat${BLOATCOPY}.msg
                BOUNCESUBJ="Excessive message quoting"

                # Include some generic bounce handler code
                # (or you'd cram your own handler right here)
                INCLUDERC=bouncer.rc
        }

        VERBOSE=OFF
}
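
For anyone wanting to try the ruleset standalone, a wrapper along the following lines should supply the variables it reads. This is a sketch under assumptions: the address, the path, and the file name bloatquote.rc are placeholders, not values from the post, and bouncer.rc (the bounce handler the ruleset includes) still has to come from your own setup.

```procmail
# Standalone wrapper sketch.  The address, the path and the file name
# bloatquote.rc are placeholders, not the author's values.

# NL is not predefined by procmail; the LOG line in the ruleset uses it
NL="
"

# Must match the X-Loop value used elsewhere in your setup.  It is
# expanded inside a regex condition, so the dots match any character.
LOOPALERT="quotecheck@example.com"

# Each keyword needs a leading space as its delimiter.
#   BLOATQUOTE   the FILTER_ID; the main recipe only fires if present
#   BLOATOK      sets BLOATCOPY=c, i.e. warn but still pass the message
#   BLOAT_IN=120 replaces the default initial credit of 80
FILTER_OPTIONS=" BLOATQUOTE BLOATOK BLOAT_IN=120"

# As posted, the threshold overrides test OPTIONS while the gating
# recipes test FILTER_OPTIONS, so give both the same value here
OPTIONS="$FILTER_OPTIONS"

# The ruleset itself, saved under a hypothetical name
INCLUDERC=$HOME/.procmail/bloatquote.rc
```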

> the percentage of 'quoted content' in an email post
> and act accordingly.

The above doesn't weigh the byte count of the lines, just the raw lines themselves, assigning different weights to different types of lines, as per logic for a mailing list.
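
To make the arithmetic concrete, here is a stripped-down sketch of the same scoring recipe with the default weights (80, 10, 14, 5) written in literally. The folder name "bloated" is only a placeholder, and the match counts in the comments are made-up figures, not measurements from a real message.

```procmail
# Minimal sketch: default weights hard-coded, no per-list overrides.
# Each [ ] below holds a space and a tab.
#
# procmail adds up the weights and the recipe matches only if the final
# score is positive.  Suppose the body gives 30 matches of the quote
# condition, 5 of the new-text condition and 6 of the blank condition:
#     -80 + (30 * 10) - (5 * 14) - (6 * 5) = 120   -> matches
# With only 10 quote matches and the rest unchanged:
#     -80 + (10 * 10) - (5 * 14) - (6 * 5) = -80   -> no match
:0 B
* -80^0
*  10^1 ^[ 	]*[>|:]
* -14^1 ^[ 	]*[^>|: 	]
*  -5^1 ^[ 	]*$
bloated
```

Raising the initial credit (BLOAT_IN) makes the filter more tolerant of short replies to long quotes; raising the quote cost (BLOAT_QU) makes every quoted line hurt more.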

> Mainly for the first couple times it finds 'excessive quoting',

If you want to add count logic, you're free to do that. I'm planning on doing that with the above -- automatically shifting from warnings to "we've explained it several times, but since you haven't clued in, here's some incentive..."
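
One way such count logic might look is sketched below. It is untested and is not the author's implementation: WARNLOG, BLOAT_MAX, SENDER, WARNED and the lockfile name are all invented for the illustration. It assumes formail and grep are on the PATH, and it relies on the BLOATCOPY convention from the ruleset (an empty BLOATCOPY means the post is refused rather than copied).

```procmail
# --- Sketch only: every name below is hypothetical ---
WARNLOG=$HOME/.procmail/bloat-warned.log
BLOAT_MAX=3

# Place this part BEFORE the scoring recipe.
# Get the reply address, then count the warnings already logged for it
# (-F fixed string, -x whole line, -i ignore case, -c print the count).
SENDER=`formail -rtzxTo:`
WARNED=`grep -c -i -x -F -e "$SENDER" "$WARNLOG" 2>/dev/null`

# WARNED minus BLOAT_MAX is positive only once the log holds more than
# BLOAT_MAX entries for this sender; from then on clear the 'c' so the
# post is refused instead of merely warned about.
:0
* $ ${WARNED:-0}^0
* $ -$BLOAT_MAX^0
{
        BLOATCOPY=""
}

# Place this part INSIDE the scoring recipe's action block, just before
# INCLUDERC=bouncer.rc, so that only offending posts are logged.
# The local lockfile guards against concurrent appends.
:0 chwi: bloat-warned.lock
| echo "$SENDER" >> "$WARNLOG"
```

A log that only ever grows never forgives anyone, so some periodic pruning (a cron job that truncates or rotates the file, for instance) would probably be wanted as well.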

> PS: I've searched the net, but didn't see anything
>     relating along these lines.

I'd start with the procmail list archives and its FAQ. The basic logic of the above is from the PM Tips document (URL included in the code).



---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail