procmail

Re: [Procmail] Re: HTML to ASCII recipe?

1997-01-16 23:58:16
On Thu, 16 Jan 1997 22:52:41 -0500 (EST), dmuth(_at_)ot(_dot_)com (Doug Muth)
wrote:
     Let's not forget the ampersand character that is also present in HTML.
     Wouldn't
          sed -e 's/&[^;]*;//g'
     take care of those tags?

(Strictly speaking, they're not tags, they're entities. Strictly
speaking, you probably don't want to remove them, because they stand
for actual characters. And strictly speaking, that turns the whole
mess into a different problem. You might want to replace each entity
with the closest respective character in the incarnation of ASCII you
think you are going to be using, for instance. The solution to
everything, of course, is to use Perl. :^)

/* era */

(As people elsewhere have pointed out, you might want to zap the
[Procmail] tag you're apparently adding locally from outgoing
replies.) 

#!/usr/local/bin/perl

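# Entity name => replacement character.
%ents = (
  'amp'   => '&',
  'lt'    => '<',
  'gt'    => '>',
  'aring' => "\xE5", # a-ring in ISO 8859-1 ... you get the idea
);

# Alternation of all known entity names, e.g. amp|lt|gt|aring
$ents = join ('|', keys %ents);

# Filter stdin (or the named files) to stdout, replacing known entities.
while (<>)
{
    s/\&($ents);/$ents{$1}/g;
    print;
}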
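
To see why replacing beats deleting, here is a rough side-by-side sketch. The sub names strip_entities and replace_entities are made up for this illustration, and the table is the same toy table as above, so treat it as a demonstration under those assumptions rather than a finished filter. Note that the delete pattern also eats a bare ampersand together with everything up to the next semicolon.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy table for illustration only; a real one would need many more entries.
my %ents = (
    'amp'   => '&',
    'lt'    => '<',
    'gt'    => '>',
    'aring' => "\xE5",   # a-ring in ISO 8859-1
);

# Alternation of the known entity names.
my $names = join('|', map { quotemeta } keys %ents);

# Hypothetical helper: what the sed-style pattern does, i.e. delete
# everything from an ampersand up to the next semicolon.
sub strip_entities {
    my ($text) = @_;
    $text =~ s/&[^;]*;//g;
    return $text;
}

# Hypothetical helper: replace only entities found in the table, in a
# single pass, and leave everything else untouched.
sub replace_entities {
    my ($text) = @_;
    $text =~ s/&($names);/$ents{$1}/g;
    return $text;
}

# Small demo on a few ASCII-only inputs.
for my $in ('AT&amp;T', 'a & b; c', '&amp;lt;', '&nbsp;x') {
    printf "%-10s strip: [%s]  replace: [%s]\n",
        $in, strip_entities($in), replace_entities($in);
}
```

Because the substitution runs in a single pass, text produced by one replacement is not scanned again, so "&amp;lt;" comes out as the literal text "&lt;" and not as "<". Entities missing from the table, such as "&nbsp;" here, pass through unchanged.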

-- 
See <http://www.ling.helsinki.fi/~reriksso/> for mantra, disclaimer, etc.
* If you enjoy getting spam, I'd appreciate it if you'd register yourself
  at the following URL:  <http://www.ling.helsinki.fi/~reriksso/spam.html>
