Quoting _Clint_ (clint(_at_)cray-ymp(_dot_)acm(_dot_)stuorg(_dot_)vt(_dot_)edu):
> I will cut to the chase here:
Hmm. You used 18 lines more than you needed to cut to the chase...
> Isn't there a command-line method for stripping HTML using perl?
Yes, you could invoke perl like this:
perl -pe 's/<[^>]*>//g'
(though it would make more sense to use sed: "sed 's/<[^>]*>//g'")
But you'll have problems with multiline tags, e.g.:
"<a tag>text</a
tag>"
So then you'll have to read more than one line at a time, e.g. by
slurping the whole input (the character class already matches newlines):
perl -0777 -pe 's/<[^>]*>//g'
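To make the multiline problem concrete, here is a rough C equivalent of s/<[^>]*>//g applied to a single buffer. It is only a sketch: the name strip_simple and the caller-supplied output buffer are my own assumptions, not anything perl or sed provides.

```c
#include <assert.h>
#include <string.h>

/* Rough equivalent of s/<[^>]*>//g on one NUL-terminated buffer.
 * strip_simple is a hypothetical name used for illustration only.
 * dst must have room for at least strlen(src) + 1 bytes. */
static void strip_simple(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (*src == '<') {
            const char *close = strchr(src, '>');
            if (close != NULL) {
                src = close + 1;   /* complete tag: skip through '>' */
                continue;
            }
            /* no '>' left in this buffer, so nothing matches: keep the rest */
            strcpy(dst, src);
            return;
        }
        *dst++ = *src++;           /* ordinary text: copy through */
    }
    *dst = '\0';
}
```

Fed the two lines of the example separately, as -p or sed would see them, this leaves "text</a" and "tag>" behind, because neither line contains a complete tag. Fed both lines as one buffer, only "text" survives.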
And you'll still have trouble with nested <'s, e.g.:
<!-- comment <a nested tag> more comment -->
You can get rid of that too, but I think it would be more efficient as a
C program along the lines of:
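    #include <stdio.h>

    int main(void)
    {
        int c, count = 0;   /* count = current nesting depth of '<' */

        while ((c = getchar()) != EOF) {
            if (c == '<')
                count++;                /* entering a (possibly nested) tag */
            else if (c == '>' && count > 0)
                count--;                /* leaving the innermost tag */
            else if (count == 0)
                putchar(c);             /* outside every tag: copy through */
        }
        return 0;
    }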
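The same counting idea can be written as a function over a string, which makes it easier to try on the examples above. This is a minimal sketch; strip_nested is a made-up name, and it assumes every '<' in the input really opens a tag.

```c
#include <assert.h>
#include <string.h>

/* Depth-counting stripper: copies src to dst, dropping everything
 * between a '<' and its matching '>' (a nested '<' raises the depth).
 * strip_nested is a hypothetical name used for illustration only.
 * dst must have room for at least strlen(src) + 1 bytes. */
static void strip_nested(const char *src, char *dst)
{
    int depth = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        if (*src == '<')
            depth++;                 /* entering a (possibly nested) tag */
        else if (*src == '>' && depth > 0)
            depth--;                 /* leaving the innermost tag */
        else if (depth == 0)
            *dst++ = *src;           /* outside every tag: keep the byte */
    }
    *dst = '\0';
}
```

On the nested comment example this removes the whole comment, and on the multiline example it leaves just "text". The catch is a stray '<' in ordinary text (say "1 < 2"): the depth never returns to zero, so everything after it is dropped.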
--
Michael Stone, Sysadmin, ITRI
mstone(_at_)itri(_dot_)loyola(_dot_)edu
PGP: key 1024/76556F95 from mit keyserver, finger, or email with
"Subject: get pgp key"