Re: Fast stripping of HTML tags from MHonArc-generated files

1998-11-12 16:53:30
I'm trying to improve the speed at which Wilma indexes.  Right now the real
bottleneck is that we pass every MHonArc-generated page through the
striphtml program, which is written in Perl.  The time to load the Perl
interpreter tens or hundreds of thousands of times is pretty harsh, and
occasionally we've seen HTML that the simple regexp-based approach chokes
on, causing it to take near-infinite time to process.
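For illustration, here is a minimal sketch of the kind of regexp-based tag stripper described above (the original striphtml is Perl; this Python version is an assumption about its approach, not its actual code).  A simple single-line pattern like this is fast, but a more permissive pattern that allows tags to span lines or uses nested alternation is where the pathological backtracking on malformed HTML tends to come from.

```python
import re

# Match anything that looks like an HTML tag: a '<', any run of
# non-'>' characters, then '>'.  This pattern cannot backtrack
# catastrophically, but it also cannot handle tags containing '>'
# inside quoted attribute values.
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(html):
    """Remove HTML tags, leaving the text content."""
    return TAG_RE.sub('', html)

print(strip_tags('<p>Hello, <b>world</b>!</p>'))  # Hello, world!
```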

What does Wilma do with the stripped pages?  Are they stored on disk 
or deleted right after indexing?

The search interface at strips the pages and stores them in 
an alternate directory used for indexing.  Each night only files with 
time stamps newer than the stripped one are parsed, so it's only a 
thousand messages/night.  Of course, the foreach loop is called from
within Perl, so there's only one instantiation of the interpreter.
So the conversion from HTML takes only a minute or so -- for me 
the biggest problem is indexing with glimpse.  That takes several 
hours to complete (still looking at real-time MySQL updates instead).
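The timestamp comparison described above can be sketched roughly as follows (the path layout and function name are hypothetical; the real setup keeps stripped copies in an alternate directory):

```python
import os

def needs_restrip(src, dst):
    # Re-strip a message only when the MHonArc source page is newer
    # than its stripped copy, or the stripped copy doesn't exist yet.
    if not os.path.exists(dst):
        return True
    return os.path.getmtime(src) > os.path.getmtime(dst)
```

Running one such check per file inside a single long-lived interpreter is what keeps the nightly pass down to a minute or so, instead of paying interpreter startup per message.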

