I'm trying to improve the speed at which Wilma indexes. Right now the real
bottleneck is that we pass every MHonArc-generated page through the
striphtml program, which is written in Perl. The time to load the Perl
interpreter tens or hundreds of thousands of times is pretty harsh, and
occasionally we've seen HTML that the simple regexp-based approach freaks
out on, causing it to take nearly infinite time to process.
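Both problems (per-file interpreter startup and regexp pathology) go away if the stripping is done in one long-lived process with a real HTML parser. Here's a minimal sketch of that idea in Python -- this is not Wilma's actual striphtml, just an illustration of the technique:

```python
# Illustrative sketch, not Wilma's striphtml: strip tags in one long-lived
# process using a real parser, so there is no per-file interpreter startup
# and no regexp backtracking to blow up on pathological HTML.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # nesting depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def strip_html(markup):
    """Return the text content of an HTML fragment."""
    p = TextExtractor()
    p.feed(markup)
    p.close()
    return "".join(p.parts)
```

Looping over all the archive files and calling strip_html() on each one amortizes the startup cost across the whole run instead of paying it once per page.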
What does Wilma do with the stripped pages? Are they stored on disk
or deleted right after indexing?
The search interface at mallorn.com strips the pages and stores them in
an alternate directory used for indexing. Each night only files with
time stamps newer than the stripped one are parsed, so it's only a
thousand messages/night. Of course, the foreach loop is called from
within Perl, so there's only one instantiation of the interpreter.
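The timestamp check amounts to comparing each source page's mtime against its stripped copy's. A rough sketch of that nightly pass (directory names here are made up for illustration, and Python stands in for the actual Perl loop):

```python
# Illustrative sketch of the nightly incremental pass: re-strip only those
# source pages whose mtime is newer than the stripped copy's. The directory
# layout is hypothetical.
import os

def files_needing_restrip(src_dir, stripped_dir):
    """Yield source paths with no stripped copy, or a stale one."""
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        dst = os.path.join(stripped_dir, name)
        # A missing or older stripped copy means the page must be reprocessed.
        if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
            yield src
```

With an archive where only a thousand or so messages change per night, this keeps the stripping step down to a minute or two.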
So the conversion from HTML takes only a minute or so -- for me
the biggest problem is indexing with glimpse. That takes several
hours to complete (still looking at real-time MySQL updates instead).