procmail

Re: processing incoming mail out to text files.

2003-05-06 19:10:05
On  6 May, Patrick Shanahan wrote:
| * Alan Clifford <lists(_at_)clifford(_dot_)ac> [05-06-03 14:47]:
| > On Tue, 6 May 2003, Michael J. Ayers wrote:
|  
| ["legal" disclaimer snipped]
|  
| > And I thought my signature was growing too long.  It took me a while to
| > find the question.  Wouldn't it be better if you asked for all copies of
| > the original message to be returned?
| 
| I do not consider myself just anyone, I guess I should bounce all of
| his posts unless he includes my specific address in the "To:" field.  I
| would not appreciate being sued for violation of his prohibition. 
| Seems like something that Mickey$loft would do.

You guys *may* be beating up on the wrong guy.  These ridiculous
disclaimers seem often to be forced on users, with no way out, by the
lawyer types.  If so, then it's a little unfair to jump in Michael's
chili.  OTOH, if Michael does have control over this I implore him to
reconsider.  Either way, there's no need to despair...

But first, in answer to Michael's question, it sounds like you need to
investigate maildir format.  I've never used it myself, so I don't want
to mess you up with possibly bogus usage examples. You'll find some
information in the procmailrc man page and discussions in the searchable
archives:

http://www.xray.mpe.mpg.de/mailing-lists/procmail/

If maildir is not what you're looking for, you'll have to come back
with more specifics about what you're trying to do.

For Alan and Patrick:

xWORDS = "(Not(ice)?:?|information|privileged|message|confidential|\
protected|disclosure|employee|intended|here(by|in)|distribution|\
copying|prohibited|recipient|dissemination)"

xREQUIRED = 6

# <10p> get rid of moron disclaimers
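# Work on a copy (c) of messages under 64 KB whose body has at least
# six matches of the trigger words: each match scores 1, offset by -5.
:0 c
* < 65536
* $ 1^1 B ?? $xWORDS
*  -5^0
{
  LOGABSTRACT = no
  xRULE = '<10p>'
  # Same words, space-separated, for the perl filter to split.
  xWORDS = "Not(ice)?:? information privileged message confidential \
protected disclosure employee intended here(by|in) distribution \
copying prohibited recipient dissemination"

  # Filter the body paragraph by paragraph.  A paragraph containing
  # xREQUIRED or more of the words is replaced by a marker line.
  # perl exits non-zero when nothing was stripped.
  :0 fbW
  | perl -e '$/="";$r=$ENV{xREQUIRED};@w=split" ",$ENV{xWORDS};$x=0;' \
      -e 'while($p=<>){$r=10000 if($p=~m!Content-Type: text/html;!i);' \
      -e '$i=grep($p=~/\b$_\b/i,@w);if($i<$r){print$p;}else{print' \
      -e '"[ moron disclaimer stripped ($ENV{xRULE} score: $i) ]\n\n";' \
      -e '$x+=$i;}}exit 1 unless$x;'

  # The filter failed (nothing was stripped): discard this copy.
  :0 e
  /dev/null

  # Otherwise deliver the stripped copy.
  :0:
  $DEFAULT
}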

This works reasonably well, but isn't perfect.  It'll pass through the
original message and strip disclaimers from a copy.  In other words
you'll get the original delivered intact (unless later recipes modify
it) AND the stripped one for comparison.  For reasons noted below, I've
found it useful to keep both copies long beyond the initial testing
time.  With Michael's original message, the filtered copy replaced the
disclaimer with:

[ moron disclaimer stripped (<10p> score: 9) ]

For me, it brings great joy to see that.  It feels like I won.
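
In case the perl one-liner is hard to follow, here is a rough Python
sketch of the same per-paragraph logic as I read it.  The function and
variable names are made up for illustration; only the word list, the
threshold of 6, the text/html check and the marker text come from the
recipe above.  It is a reading aid under those assumptions, not a
drop-in replacement for the perl filter.

```python
import re

# Copied from the recipe; the Python names themselves are hypothetical.
WORDS = ("Not(ice)?:? information privileged message confidential "
         "protected disclosure employee intended here(by|in) distribution "
         "copying prohibited recipient dissemination").split()
REQUIRED = 6
RULE = "<10p>"


def split_paragraphs(body):
    """Split on runs of empty lines, roughly like perl's paragraph mode."""
    return [p for p in re.split(r"\n{2,}", body.strip("\n")) if p]


def score(paragraph):
    """Count how many of the word patterns occur at least once."""
    return sum(1 for w in WORDS
               if re.search(r"\b" + w + r"\b", paragraph, re.IGNORECASE))


def strip_disclaimers(body):
    """Return (filtered body, total score of the stripped paragraphs)."""
    required = REQUIRED
    total = 0
    out = []
    for para in split_paragraphs(body):
        # Once an HTML part shows up, effectively stop stripping.
        if re.search(r"Content-Type: text/html;", para, re.IGNORECASE):
            required = 10000
        s = score(para)
        if s < required:
            out.append(para)
        else:
            out.append("[ moron disclaimer stripped (%s score: %d) ]"
                       % (RULE, s))
            total += s
    return "\n\n".join(out) + "\n", total
```

A total of 0 corresponds to the perl code's "exit 1" case: nothing was
stripped, so the recipe throws the copy away instead of delivering it.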

The caveats:
1. I make no representation that the perl code is efficient, let alone
good or idiomatic.

2. It misses some -- especially multiple paragraphs (like they can't
stuff enough inanity into one paragraph?) -- but to the best of my
knowledge has never stripped anything incorrectly except as noted below.
I'm sure there are other words that could be added, or tweaks to the
numerical threshold, that would make it more effective. I haven't touched
this in a long time.

3. It strips quoted disclaimers too, but there's the rub.  The perl code
processes the body by paragraph.  If the whole message is quoted, the
whole quoted part is treated as a single paragraph and the entire
quoted part is stripped.  This is one place where LookOut's annoying
quoting is actually sometimes useful if it's not been configured to
prepend a quote char/string.

4. It doesn't strip disclaimer delimiters (like lines with runs of * or
- or =).

5. Any message that has a disclaimer stripped will also have any runs of
multiple blank lines (separating paragraphs) compressed into a single one.
In that case the filtered copy is not true to the original, but not in any
meaningful way.
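
Caveats 3 and 5 both fall out of how paragraphs are split.  The small
Python sketch below imitates that splitting (the helper name and the
sample strings are made up for illustration; it only approximates perl's
paragraph mode) to show why a fully quoted reply counts as one paragraph
and why runs of blank lines collapse.

```python
import re


def paragraphs(body):
    """Approximate perl's paragraph mode: records end at runs of empty lines."""
    return [p for p in re.split(r"\n{2,}", body.strip("\n")) if p]


# A quoted reply: the "blank" line still carries a quote character,
# so it is not empty and does not end the paragraph.
quoted = ("> Thanks for the report.\n"
          ">\n"
          "> This message is confidential.\n")

# The same text unquoted, with extra blank lines between paragraphs.
plain = "Thanks for the report.\n\n\n\nThis message is confidential.\n"

quoted_count = len(paragraphs(quoted))       # one big paragraph
plain_count = len(paragraphs(plain))         # two separate paragraphs
rejoined = "\n\n".join(paragraphs(plain))    # extra blank lines are gone
```

With the whole quoted block seen as one paragraph, a disclaimer anywhere
inside it takes the rest of the quote down with it.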

One of these days I'm going to write something to strip and store quoting,
break a message down by paragraph, more intelligently handle multiple
paragraph crap, then faithfully reassemble all the pieces after proper
stripping.  But some day I'm going to do a lot of things. :-(
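
For what it's worth, a very rough Python sketch of that idea might look
like the following.  Everything in it is an assumption made for
illustration: the names, the choice to treat only ">" as a quote
character, and the caller-supplied is_disclaimer test.  It also
normalizes quote prefixes and blank lines rather than reassembling the
message faithfully, and it does nothing about multiple-paragraph
disclaimers.

```python
import re
from itertools import groupby

# Leading run of ">" quote characters, each optionally followed by a space.
QUOTE_RE = re.compile(r"^((?:>\s?)*)(.*)$")


def split_quote(line):
    """Return (quote marker such as '>>', remaining text) for one line."""
    prefix, text = QUOTE_RE.match(line).groups()
    return re.sub(r"\s", "", prefix), text


def filter_by_quote_level(body, is_disclaimer,
                          marker="[ disclaimer stripped ]"):
    """Strip disclaimer paragraphs separately within each quote level."""
    blocks = []
    pairs = [split_quote(line) for line in body.split("\n")]
    # Consecutive lines with the same quote marker form one block.
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        text = "\n".join(t for _, t in group)
        paras = [p for p in re.split(r"\n{2,}", text.strip("\n")) if p]
        if not paras:
            continue
        kept = "\n\n".join(marker if is_disclaimer(p) else p for p in paras)
        # Put the quote marker back on every line of the block.
        blocks.append("\n".join((key + " " + l).rstrip() if key else l
                                for l in kept.split("\n")))
    return "\n\n".join(blocks) + "\n"
```

The point of the grouping is that only the offending paragraph inside a
quoted block is replaced, instead of the whole quote.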

-- 
Email address in From: header is valid  * but only for a couple of days *
This is my reluctant response to spammers' unrelenting address harvesting



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
