Re: html tags "remover"?

2002-08-29 09:43:28

First, I agree with Sean that you're probably better off checking the
multipart issue.

However, I wrote a nice perl function (for a website, not email) that
strips unwanted html tags out of text.  You can toss that in a script and
call it from a :0fw rule...  You'll probably also want to change the
default tag list so that it contains just <a>.

        Dave

==begin perl==

# Takes text and an optional list of valid html tags,
# strips out invalid tags, and returns the result
#
# tags is an optional array reference specifying what tags to allow
# alltags is an optional parameter; set it to 1 to skip the default
# tag list, so that every tag is stripped when no tags are given
#
# NOTE: this regexp has issues with nested < > chars (e.g. a ">"
#       inside an attribute value), although I haven't figured out
#       whether that will break anything
#
sub html_tag_filter($;$$) {

  my ($text, $tags, $alltags) = @_;

  # work on a copy so the caller's tag list is never modified
  my @tags = $tags ? @{$tags} : ();

  unless(defined($alltags) && $alltags == 1) {
    if(!@tags) {
      @tags = qw|<a> <br> <i> <b> <h1> <h2> <h3> <h4>|;
    }
  }

  # remove optional < > around valid tags
  foreach my $tag (@tags) {
    $tag =~ s/^<?(.+?)>?$/$1/;
  }

  $text =~ s/(<\s*\/?([^\s=>]*).*?>)   # match exactly one html tag
           / my ($whole, $name) = ($1, $2);
             # if the first non-whitespace token in the tag
             # can be found in @tags, keep it
             grep(m|^\Q$name\E$|i, @tags)
             ? $whole : "" /isegx;     # otherwise, remove tag

  return $text;

}

==end perl==
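
A minimal sketch of the "toss that in a script" step, assuming the
filter is handed the whole message (headers and body), as a :0fw recipe
would do.  The helper names keep_only_links and filter_message and the
sample message are made up for illustration; they are not part of the
function above.  The tag list is fixed to <a>, as suggested for this
use case, and only the body is filtered so that angle brackets in
headers survive.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the post): same regexp idea as
# html_tag_filter, with the allowed list fixed to <a> and </a>.
sub keep_only_links {
  my ($text) = @_;
  $text =~ s{(<\s*/?([^\s=>]*).*?>)}{ lc($2) eq 'a' ? $1 : '' }seg;
  return $text;
}

# Hypothetical helper: split a whole message at the first blank line
# and filter the body only, so headers such as Message-ID: <...>
# are left alone.
sub filter_message {
  my ($message) = @_;
  my ($header, $body) = split(/\n\n/, $message, 2);
  return $message unless defined $body;   # no body, nothing to do
  return $header . "\n\n" . keep_only_links($body);
}

# Sample input; a real filter would read the message from STDIN instead.
my $sample = join("\n",
  'Subject: test',
  'Message-ID: <123@example.com>',
  '',
  '<html><body><p>See <a href="http://example.com/">this</a>.</p></body></html>',
  '');

print filter_message($sample);
```

To run it from procmail, the sample would be replaced by a slurp of
STDIN, and a recipe with the f and w flags would pipe the message to
wherever the script is installed.  Splitting at the first blank line is
a simplification: it does not look at MIME parts or encodings.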

On Thu, 29 Aug 2002, Professional Software Engineering wrote:

At 08:30 2002-08-29 -0700, Michael J. Rensing wrote:

I would like to use procmail to perform a number of mail tasks, including
anti-spam.

I don't directly see how stripping HTML has much of a bearing on spam
filtering, except that it makes matching text strings a bit easier on the
generic level.  That doesn't really matter - converting a message to
plaintext is a perfectly normal goal with procmail, whether you're
combatting spam, or just dealing with people who think cutesy text is the
coolest thing...

Typically, messages sent in HTML format are multipart - there's a plaintext
version of the message preceding the HTML portion.  Of course, there are
exceptions out there, but for many messages, you might find that you don't
really need to convert the HTML so much as drop that content part.

It seems to me that it should also be able to run everything
through a filter which I figure must exist somewhere. That filter would
remove all HTML coding from a message, except links that can be clicked on.
The resulting document could be a bit messy, but at least the html tags
wouldn't be cluttering up the content. Simply coded html messages would
likely come through without problems.

You could pipe it through lynx, more recent versions of which have an
option to strip HTML.  Search the list archives, linked from
<http://www.procmail.org/>.  Your primary limitation there will be dealing
with links that are <A HREF="link">some text other than the real
link</A>, which would be stripped down to the text, rather than the link
itself.  When you have <A HREF="link">link</A> type links, you'd
obviously not have a problem in the translation.
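
One possible way around that limitation, sketched here as an
assumption and not as anything lynx does itself: run a small pre-pass
over the HTML that copies each HREF target next to its link text
before the tags are stripped.  The helper name expose_links and the
"text [link]" output form are made up for illustration.  Square
brackets are used because a later tag-stripping pass would also remove
anything written as <link>.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite <a href="URL">text</a> as "text [URL]",
# or as just "URL" when the text already is the URL.  Only quoted HREF
# values are handled; unquoted ones are left untouched.
sub expose_links {
  my ($html) = @_;
  $html =~ s{<a\s[^>]*?href\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a\s*>}
            { $3 eq $2 ? $2 : $3 . ' [' . $2 . ']' }gise;
  return $html;
}

print expose_links('<A HREF="http://example.com/">click here</A>'), "\n";
# prints: click here [http://example.com/]

print expose_links('<A HREF="http://example.com/">http://example.com/</A>'), "\n";
# prints: http://example.com/
```

Like the regexp in the function quoted earlier, this is pattern
matching and not real HTML parsing, so unusual markup can slip through
unchanged.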

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail