Re: Speed of MHonArc when adding to databases of 4000+ messages

1998-04-21 16:15:31
On April 15, 1998 at 15:00, someone wrote:

> However, I have run into an issue (that may or may not be resolvable).
> Some of the lists I am archiving have 4000+ messages, and all the
> messages must be kept indefinitely.  When adding to these archives, the
> program spends a lot of time rearranging the indexes in memory (at least
> that's where I figure it's spending the time, since I'm only adding 50
> or so messages).  What I would like to know is how optimized MHonArc is
> when it comes to this sort of operation.  If you feel there is room for
> improvement, I could carve out some time here at TKG to improve upon the
> speed.  Even if the algorithm is optimal, speed gains could probably be
> realized by rewriting the critical section in C.  Any work that I do
> would not change the copyright of MHonArc.
>
> So, the real question is how much testing of MHonArc's speed when
> dealing with large archives have you conducted?

Not much.  Since there is no expiration planned for your messages, you
may be better off breaking your archive up by a well-defined time
period, like by month.  Check the mhonarc mailing list archives for a
good example.  No matter what, archive updates progressively slow down
as an archive increases in size.
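
To illustrate the idea of splitting by month, here is a minimal Python
sketch.  It is not MHonArc code: the (timestamp, message id) pairs and the
function names are hypothetical, and it only shows how messages could be
grouped so that each update touches one small per-month archive.

```python
from collections import defaultdict
from datetime import datetime, timezone

def month_bucket(epoch_seconds):
    """Return a 'YYYY-MM' label (UTC) for a message timestamp."""
    stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m")

def split_by_month(messages):
    """Group (epoch_seconds, message_id) pairs into per-month lists."""
    archives = defaultdict(list)
    for epoch_seconds, message_id in messages:
        archives[month_bucket(epoch_seconds)].append(message_id)
    return dict(archives)

# Hypothetical sample data: one message in March 1998, two in April 1998.
messages = [
    (891000000, "msg-a"),
    (893000000, "msg-b"),
    (893174131, "msg-c"),
]
archives = split_by_month(messages)
```

Adding 50 new messages then only requires re-sorting the current month's
list, not the full 4000+ message index.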

As for sorting performance, some sort orders are faster than others.
There are four types of sorting: date, subject, author, and numeric (I
will get to threading later).  Numeric is the fastest since it needs
only simple hash lookups to get the comparison keys.  Date is second
since the integer date is part of the message key; a split is done to
extract it.  Subject and author are the slowest since the strings must
be processed before doing a proper comparison.
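
The cost difference between the sort types can be sketched in Python as
follows.  This is an illustration only, not the actual MHonArc code: the
"date.number" key layout, the table names, and the subject normalization
rule (dropping reply prefixes and case) are all assumptions made for the
example.

```python
import re

# Hypothetical tables keyed by message key.  The "date.number" key layout
# is an assumption for this sketch.
msg_number = {"893000000.2": 2, "891000000.1": 1, "893174131.3": 3}
subjects = {
    "893000000.2": "Re: Speed of sorting",
    "891000000.1": "speed of sorting",
    "893174131.3": "Archive layout",
}

def numeric_key(key):
    """Cheapest: a single hash lookup."""
    return msg_number[key]

def date_key(key):
    """Next cheapest: split the key and take the integer date part."""
    return int(key.split(".", 1)[0])

# Matches one or more leading "Re:" / "Fw:" / "Fwd:" prefixes.
_REPLY_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)

def subject_key(key):
    """Most expensive: the string must be normalized before comparing."""
    return _REPLY_PREFIX.sub("", subjects[key]).strip().lower()

keys = list(subjects)
by_number = sorted(keys, key=numeric_key)
by_date = sorted(keys, key=date_key)
by_subject = sorted(keys, key=subject_key)
```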

I could store preprocessed forms of subjects and authors, but this
would take up more memory, and it would give up the ability to change,
at any time, the resources that affect the preprocessing step.

Threading is the worst.  First a sort is done by the specified
order (date, subject, numeric).  Then, the list is processed to
compute all the threads.
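
The two passes described above (sort first, then compute threads) can be
sketched roughly like this in Python.  It is not how MHonArc itself is
written: the 'id', 'date', and 'parent' fields are hypothetical, and the
sketch assumes the reply references contain no cycles.

```python
def thread_order(messages):
    """Return (depth, id) pairs in threaded order.

    messages: list of dicts with hypothetical fields 'id', 'date',
    and 'parent' (the id of the message replied to, or None).
    """
    # Pass 1: sort by the chosen order (date in this sketch).
    ordered = sorted(messages, key=lambda m: m["date"])

    # Pass 2: link each message to its parent when the parent is in the
    # archive; otherwise treat it as the start of a thread.
    known = {m["id"] for m in ordered}
    children = {}
    roots = []
    for m in ordered:
        if m["parent"] in known:
            children.setdefault(m["parent"], []).append(m["id"])
        else:
            roots.append(m["id"])

    # Depth-first walk so that replies follow the message they answer.
    result = []
    stack = [(root, 0) for root in reversed(roots)]
    while stack:
        msg_id, depth = stack.pop()
        result.append((depth, msg_id))
        for child in reversed(children.get(msg_id, [])):
            stack.append((child, depth + 1))
    return result

# Messages arrive in arbitrary order; "e" replies to a message that is
# not in the archive.
messages = [
    {"id": "c", "date": 3, "parent": "a"},
    {"id": "a", "date": 1, "parent": None},
    {"id": "b", "date": 2, "parent": None},
    {"id": "d", "date": 4, "parent": "c"},
    {"id": "e", "date": 5, "parent": "zzz"},
]
threaded = thread_order(messages)
```

Both passes run over the whole list on every update, which is why
threading costs more than any single sort.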

I should note that MHonArc is designed to process messages in any
order.  Hence, the last message processed may be the first one listed
in the index.  Or, a message added may belong in the middle of a thread.

Techniques like storing message indexes in pre-sorted order could be
tried, so that only binary searches are required when adding new
messages.  However, it is not clear when the payoff would occur, or
whether it would justify the development effort.
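
As a rough sketch of the pre-sorted idea (in Python, using the standard
bisect module, and with a hypothetical (date, message number) entry
format), adding a message becomes a binary search plus an insert instead
of a full re-sort:

```python
import bisect

def add_message(sorted_index, date, msg_number):
    """Insert (date, msg_number) into an already sorted list.

    Returns the position at which the entry was inserted.
    """
    entry = (date, msg_number)
    # Binary search: O(log n) comparisons to find the slot.
    pos = bisect.bisect_left(sorted_index, entry)
    # The insert itself still shifts the later entries in the list.
    sorted_index.insert(pos, entry)
    return pos

# Hypothetical index kept in date order; the new message belongs in the
# middle, not at the end.
index = [(10, 1), (20, 2), (40, 3)]
position = add_message(index, 30, 4)
```

Note that this only helps for the order the index is stored in; a
different sort order or threading would still need its own pass.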

I think C is overkill to try to improve things, but of course, you are
free to try.  However, it will make the program more troublesome to
install on various platforms.  In the development of MHonArc, I have
tried to avoid utilizing modules that require compilation (or creating
my own that would require compilation) so the program is more portable
(also, the development of MHonArc started before Perl 5 was released, so
using "modules" was not an option).  If I were to look into modules for
providing improvements to MHonArc, I'd look at the DB_File module and
the DBI::* modules.  The current mhonarc code base does not abstract
the db very well, so a redesign is needed to make it suited to hook in
db backends and still provide portability.  It is something on my todo
list, but who knows when I will get around to doing it.

  • Re: Speed of MHonArc when adding to databases of 4000+ messages, Earl Hood <=