procmail

Re: html tags "remover"?

2002-08-29 09:29:36
At 08:30 2002-08-29 -0700, Michael J. Rensing wrote:

I would like to use procmail to perform a number of mail tasks, including
anti-spam.

I don't directly see how stripping HTML has much bearing on spam filtering, except that it makes matching text strings a bit easier in general. That doesn't really matter, though: converting a message to plaintext is a perfectly normal goal with procmail, whether you're combatting spam or just dealing with people who think cutesy text is the coolest thing...

Typically, messages sent in HTML format are multipart - there's a plaintext version of the message preceding the HTML portion. Of course, there are exceptions out there, but for many messages, you might find that you don't really need to convert the HTML so much as drop that content part.
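As a rough illustration of dropping the HTML part rather than converting it, here is a minimal sketch in Python using only the standard library. Python is my choice for the example, not something the thread prescribes, and the function name plain_text_body is hypothetical. It assumes the message is a well-formed MIME message and simply returns an empty string when no text/plain part exists.

```python
from email import message_from_string, policy


def plain_text_body(raw_message):
    """Return the text/plain body of a raw message, or '' if it has none.

    For a multipart/alternative message this picks the plaintext
    alternative and ignores the HTML one entirely.
    """
    msg = message_from_string(raw_message, policy=policy.default)
    # Only accept a text/plain part; an HTML-only message yields None.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    return part.get_content()
```

A script built around this could read the message on standard input and write the result to standard output, which is the shape a procmail filter action expects. HTML-only messages would still need an actual conversion step.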

It seems to me that it should also be able to run everything
through a filter which I figure must exist somewhere. That filter would
remove all HTML coding from a message, except links that can be clicked on.
The resulting document could be a bit messy, but at least the html tags
wouldn't be cluttering up the content. Simply coded html messages would
likely come through without problems.

You could pipe it through lynx, more recent versions of which have an option to strip HTML. Search the list archives, linked from <http://www.procmail.org/>. Your primary limitation there will be dealing with links that are <A HREF="link">some text other than the real link</A>, which would be stripped down to the text, rather than the link itself. When you have <A HREF="link">link</A> type links, you'd obviously not have a problem in the translation.
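To make the link limitation concrete, here is a small sketch of a tag stripper that keeps the link target whenever the visible text differs from it. This is not the lynx approach described above; it is a swapped-in illustration using Python's standard html.parser module, and the names LinkKeepingStripper and strip_html are hypothetical. It makes no attempt to lay out the text nicely or to skip script and style content.

```python
from html.parser import HTMLParser


class LinkKeepingStripper(HTMLParser):
    """Drop HTML tags but append the HREF when link text hides it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []        # pieces of output text, in order
        self.href = None        # target of the <a> currently open, if any
        self.anchor_start = 0   # index in chunks where the link text began

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> matches.
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.anchor_start = len(self.chunks)

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            link_text = "".join(self.chunks[self.anchor_start:]).strip()
            # Only add the target when the visible text is something else.
            if link_text != self.href:
                self.chunks.append(" <" + self.href + ">")
            self.href = None


def strip_html(html_text):
    """Return html_text with tags removed and link targets preserved."""
    parser = LinkKeepingStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks)
```

With this sketch, a link whose text is already the URL comes through unchanged, while a link with descriptive text gets its target appended in angle brackets after the text.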

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

