procmail

detecting long sigs in unsubscribe messages

1996-04-08 15:16:45
about a week ago reriksso(_at_)cc(_dot_)helsinki(_dot_)fi (era eriksson)
discussed ways to catch "unsubscribe/subscribe" messages, and
posted a few recipes.

| More ideas for this recipe would be most welcome. In particular, I'm
| not very happy about the byte counts, but they act as sort of a safety
| measure.
|   There's a bit of overlap in the recipes, I notice now. They've
| developed over several months, and I've never really gotten down to
| optimizing them.
|   On the whole, I have found this to work fairly well, so far. And I'm
| glad I've been keeping the incoming ones -- for me, it's enough to
| know I'm in control of the situation. Now, reading unsub messages can
| even be fun! (Because I can choose for myself when to read them. And
| they're not eating any of my precious disk quota.)

i mailed this to srb a little while ago.  while it's done in
perl and not procmail, some of you smartlist hackers might
find it useful -- or even figure out a way to reimplement
the logic in procmail.  it was originally written as part of
my email->pager gateway package.  as a whole, the system
strips out quoted material and the sig, and reformats things
slightly -- point being to minimize the amount of text sent
to the pager.  this bit just gets rid of the sig.

anyways.  hope this is useful to someone.

(if anyone actually uses the code, they should probably find
some way to avoid MIME-encoded messages so the mime
delimiters don't get mistaken for sigs.)
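
One way to act on that caveat is sketched below in perl. It is not part of the pager package: the helper names mime_boundary and is_mime_delimiter are made up for illustration, and it assumes the headers are available as a single string. The idea is to read the boundary out of a multipart Content-Type header and skip any body line that starts with "--" plus that boundary before testing it as a sig delimiter.

```perl
use strict;
use warnings;

# Hypothetical helper: return the multipart boundary named in the headers,
# or undef if the message is not multipart.
sub mime_boundary {
    my ($header) = @_;
    # Unfold continuation lines so the whole Content-Type sits on one line.
    (my $unfolded = $header) =~ s/\r?\n[ \t]+/ /g;
    return undef unless $unfolded =~ /^Content-Type:\s*multipart\//mi;
    # Quoted form: boundary="..."
    return $1 if $unfolded =~ /^Content-Type:[^\n]*\bboundary\s*=\s*"([^"]+)"/mi;
    # Unquoted form: boundary=token
    return $1 if $unfolded =~ /^Content-Type:[^\n]*\bboundary\s*=\s*([^\s;"]+)/mi;
    return undef;
}

# Hypothetical helper: true if $line is a MIME part delimiter for $boundary.
sub is_mime_delimiter {
    my ($line, $boundary) = @_;
    return 0 unless defined $boundary;
    return index($line, "--$boundary") == 0 ? 1 : 0;
}

my $header = "From: someone\n"
           . "Content-Type: multipart/mixed;\n"
           . " boundary=\"XYZ-123\"\n"
           . "Subject: unsubscribe\n";
my $boundary = mime_boundary($header);
print "boundary: $boundary\n" if defined $boundary;
```

A sig stripper could then call is_mime_delimiter on each body line first and only run the sig checks on lines for which it returns false.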

cheers
meng weng wong
mengwong(_at_)pobox(_dot_)com

    From mengwong Tue Apr  2 02:42:03 -0500 1996
    To: srb(_at_)cuci(_dot_)nl
    Subject: administrivia filtering & sigs
    From: mengwong(_at_)icg(_dot_)resnet(_dot_)upenn(_dot_)edu (Meng Weng Wong)

    smartlist's administrivia detection algorithm is pretty
    good, but in the cases where someone has a really long sig,
    it fails, because it thinks the sig is a message.

    this is some perl code that is pretty successful at picking
    up sigs.  it considers a sig delimiter either ^--, or a
    pattern of 1 to 4 characters that repeats to the end of the
    line, followed by 0 to 3 characters of anything.

    for some reason people with really long sigs tend to put
    borders around them, and this code works well at detecting
    those borders.

    i'd be pleased if you find it even minorly useful; maybe you
    could incorporate the concept somehow into the next release
    of smartlist.

    --meng

    # ----------------------------------------------------------
    # leave out the sig.  @body -> @toreturn.
    # assumes leading and trailing whitespaces were stripped.
    # ----------------------------------------------------------

    sub trypattern {
        local ($pattern, $line, $patternlength) = @_;
        $pattern = quotemeta($pattern); # escape regexp metacharacters
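        # pattern repeated at least twice, then a partial repeat of at most
        # $patternlength - 1 characters (0 to 3 for patterns of up to 4).
        local ($regexp) = "($pattern){2,}?" . ".?" x ($patternlength - 1);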
        return ($line =~ /^$regexp$/); # a single line of the same pattern
    }

    for (@body) {
        last if ($insig);

        # -- is a common sig starter.
        if (/^--/) { $insig = 1; last; }

        # patterns may have up to four characters.
        foreach $patternlength (1..4) {
            local ($regexp) = "." x $patternlength;
            if (/^($regexp)/ && &trypattern($1, $_, $patternlength)
                && $withinbody) {
                $insig = 1;
            }
            last if ($insig);
        }
        last if ($insig);

        push (@toreturn, $_);
        $withinbody++;
    }
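
For anyone who wants to try the idea outside the pager package, here is a self-contained sketch of the same logic with a small sample message. It is a restatement under assumptions, not the original code: the names is_border and strip_sig are invented for the example, and the trailing partial repeat is limited to one character fewer than the pattern length, which is one reading of "0 to 3 characters" above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line is a 1 to 4 character pattern repeated
# at least twice, optionally followed by a shorter partial repeat.
sub is_border {
    my ($line) = @_;
    for my $len (1 .. 4) {
        next if length($line) < 2 * $len;      # too short to repeat twice
        my $pattern = quotemeta(substr($line, 0, $len));
        my $tail    = ".?" x ($len - 1);       # up to $len - 1 extra chars
        return 1 if $line =~ /^(?:$pattern){2,}?$tail$/;
    }
    return 0;
}

# Return the body lines that precede the sig.  A border only counts once
# at least one body line has been kept, like $withinbody in the original.
sub strip_sig {
    my @body = @_;
    my @kept;
    for my $line (@body) {
        last if $line =~ /^--/;                # common sig starter
        last if @kept && is_border($line);     # decorative sig border
        push @kept, $line;
    }
    return @kept;
}

my @message = (
    'unsubscribe',
    '',
    '*=*=*=*=*=*=*=*=*=*=*=*',
    'J. Random Hacker   jrh@example.com',
    '*=*=*=*=*=*=*=*=*=*=*=*',
);
print "$_\n" for strip_sig(@message);
```

With the sample message, only the "unsubscribe" line and the blank line after it are kept; everything from the first border onward is dropped. Short body lines made of a repeated character, such as "aa", would also be taken for a border, so a real filter would probably want a minimum line length.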

