Re: Fast stripping of HTML tags from MHonArc-generated files

1998-11-12 23:02:17
On November 12, 1998 at 16:25, Jason L Tibbitts III wrote:

I'm trying to improve the speed at which Wilma indexes.  Right now the real
bottleneck is that we pass every MHonArc-generated page through the
striphtml program, which is written in Perl.  The time to load the Perl
interpreter tens or hundreds of thousands of times is pretty harsh, and
occasionally we've seen HTML that the simple regexp-based approach freaks
out on, causing it to take near infinite time to process.

Does anyone know of any free (i.e. we can incorporate it into something
under the Artistic License) C code, or a small utility that we can call,
which will do this?

I have the SGML::StripParser as part of the perlSGML[1] package.
It is in Perl, so the performance is not as good as a C program.
But since it is a module, you can write a Perl program to iterate
through a list of files and use SGML::StripParser on each file to
avoid launching perl for each file separately.
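
To illustrate the batching idea only, here is a minimal sketch.  It is
not SGML::StripParser and does not use its API; it is written in Python
with the standard library's html.parser as a stand-in, and the names
TagStripper, strip_tags, and strip_files are hypothetical.  The point
is that the interpreter starts once and a real parser, rather than a
regexp, handles every file named on the command line.  The latin-1
encoding is an assumption about the archive files.

```python
import sys
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps character data, drops tags and comments."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; comments go to handle_comment,
        # which is left as a no-op, so they are discarded.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def strip_files(paths):
    """Yield (path, text) for each file, all within one process."""
    for path in paths:
        # Assumption: the files are readable as latin-1.
        with open(path, encoding="latin-1") as fh:
            yield path, strip_tags(fh.read())


if __name__ == "__main__":
    # Usage (hypothetical script name): python striphtml.py msg*.html
    for path, text in strip_files(sys.argv[1:]):
        sys.stdout.write(text)
```

Because the parser makes a single pass over the input, a malformed
page should not trigger the runaway matching that a regexp can hit.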

perlSGML is under the GPL, so if that will not work for you, I
can redistribute it under the Artistic License for your needs.

Another option may be to use James Clark's SP[2] package.


[1]     <URL:>
[2]     <URL:>
