procmail

Re: Stripping HTML: Another Question

1997-10-09 16:04:43

Quoting _Clint_ (clint(_at_)cray-ymp(_dot_)acm(_dot_)stuorg(_dot_)vt(_dot_)edu):
> I will cut to the chase here:

Hmm. You used 18 lines more than you needed to cut to the chase...

> Isn't there a command-line method for stripping HTML using perl?

Yes, you could invoke perl like this:

perl -pe 's/<[^>]*>//g'

(though it would make more sense to use sed: "sed 's/<[^>]*>//g'")

But you'll have problems with multiline tags, e.g.:
"<a tag>text</a
tag>"
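
To see the problem concretely, here is a small sketch that feeds exactly that
two-line input through the line-at-a-time sed command. The function name
strip_by_line is just an illustrative label of mine, not anything standard:

```shell
# Line-at-a-time stripping, using the same sed expression as above.
# (strip_by_line is a made-up name for this example.)
strip_by_line() {
    sed 's/<[^>]*>//g'
}

# A closing tag that is split across two lines.
printf '<a tag>text</a\ntag>\n' | strip_by_line
```

Only "<a tag>" is removed. The split closing tag never matches, because sed
sees "</a" and "tag>" as two separate lines, so both halves come out untouched.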

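So then you'll have to read more than one line at a time, e.g. a paragraph
at a time with -00 (the same as $/ = "", but in effect before the first read):

perl -00 -pe 's/<[^>]*>//g'

(Assigning $/ inside the -p loop is too late for the first line, and $* isn't
needed because [^>] already matches newlines; $* was later removed from perl.
Use -0777 instead of -00 if a tag might span a blank line.)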

And you'll still have trouble with nested <'s, e.g.:
<!-- comment <a nested tag> more comment -->

You can get rid of that too, but I think it would be more efficient as a
C program along the lines of:

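#include <stdio.h>

int main(void)
{
        int c, count = 0;       /* count = number of unclosed '<' seen so far */

        while ((c = getchar()) != EOF) {
                if (c == '<')
                        count++;                /* entering a (possibly nested) tag */
                else if (c == '>' && count > 0)
                        count--;                /* leaving one level of tag */
                else if (count == 0)
                        putchar(c);             /* copy only text outside all tags */
        }
        return 0;
}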

-- 
Michael Stone, Sysadmin, ITRI     PGP: key 1024/76556F95 from mit keyserver,
mstone(_at_)itri(_dot_)loyola(_dot_)edu            finger, or email with 
"Subject: get pgp key"