On Sat, Mar 18, 2000 at 01:31:43AM +0900, Satoru Takabayashi wrote:
Peter Marelas <maral(_at_)phase-one(_dot_)com(_dot_)au> wrote:
Thank you for your information. It sounds great. Since
Namazu's indexer, called mknmz, is written in Perl, so indexing
takes rather a long time. Ryuji Abe has a plan to rewrite
mknmz in C. It would be great if we can employ mifluz to
Certainly mifluz is up to the task. You may have read already
that mifluz is designed to index a large number of words (10 million+).
Mifluz relies on a modified version of Berkeley DB's B+tree
(we added compression) for storing its index. The structure
employed makes updates very fast. There is some work going on
to improve the structure.
Speaking of Namazu, as the README says "for a small or medium
scale Web search engine", Namazu is not designed to index
a large number of documents. As far as I know, the largest
Namazu index ever made is as follows:
Documents: 878,914 files
Total size: 2,167,480,108 bytes
I'm curious how fast query performance is on that index.
On the other hand, mifluz Web site says:
| mifluz has been designed with the further upper limits in mind : 500
| million documents, 50 giga words, 20 million document updates per day.
It is terrific!
I would be interested if the people who designed Namazu's
index structure critiqued the mifluz structure, as the
structure is the key to fast updates and query performance.
I am the designer of Namazu's index structure. The
structure is a very simple inverted index. It is easy to
implement both indexer and search engine, but it is not fast
to update. See the following page for details.
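
For readers unfamiliar with the term, here is a minimal sketch of a
simple inverted index. It is not Namazu's actual code or on-disk
format; the function names build_index and search_all and the sample
documents are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Each document contributes at most one posting per word.
        for word in set(docs[doc_id].lower().split()):
            index[word].append(doc_id)
    return index


def search_all(index, words):
    """Return the ids of documents that contain every query word."""
    postings = [set(index.get(w.lower(), ())) for w in words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    1: "namazu full text search",
    2: "mifluz full text index",
    3: "berkeley db",
}
index = build_index(docs)
hits = search_all(index, ["full", "text"])
```

Both halves are short, which matches the "easy to implement" point. In
this sketch, changing one document means revisiting the posting list of
every word it contains.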
The major difference is that your index grows outwards,
i.e. from left to right. You also build different
indexes to answer different kinds of queries, e.g. phrase queries.
The mifluz index grows downwards, from top to bottom.
All key/value pairs are stored sorted in a B+tree.
The B+tree introduces prefix compression. This saves
space. Currently only the key is used to store data.
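
To show why sorted keys and prefix compression go well together, here
is a rough sketch of front coding over a sorted key list. It is an
illustration only, not the modified Berkeley DB code; the
"word:doc:position" key layout and the function names are assumptions
made for the example.

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_compress(sorted_keys):
    """Store each key as (chars shared with previous key, remaining suffix)."""
    entries = []
    prev = ""
    for key in sorted_keys:
        shared = shared_prefix_len(prev, key)
        entries.append((shared, key[shared:]))
        prev = key
    return entries


def prefix_decompress(entries):
    """Rebuild the original keys from (shared, suffix) pairs."""
    keys = []
    prev = ""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


# Hypothetical keys: word, document number, position within the document.
keys = sorted([
    "index:0001:0003",
    "index:0001:0009",
    "index:0002:0001",
    "namazu:0001:0000",
])
entries = prefix_compress(keys)
```

Because neighbouring keys in sorted order tend to share a long prefix,
most entries shrink to a small count plus a short suffix.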
There have been many discussions regarding the structure
and its design on the mailing list. If you like there are archives
I just printed out mifluz.texinfo and read it. I notice
that it is really a high-performance library. But at the
moment, I don't know whether or not it is good to employ
mifluz for Namazu.
Mifluz is generalised enough (at the moment) that it will
cater for most requirements. The fact that it is generalised
can be a problem though. In fact, I use my own indexer derived from
the mifluz structure. If I can prove my optimizations are
worthwhile they should end up in mifluz.
Since Namazu is an easy-to-use search system, features which
mifluz provides are perhaps too much.
I don't think mifluz was designed to provide many features.
In fact I would say it's the opposite.
It provides an API to put words and other data into, and take them
out of, an index in a user-defined sorted fashion. That's pretty much
what mifluz gives you. There are other products produced by Senga that
use mifluz for indexing, like the crawler and catalog system.
We mainly use Namazu
for intranet or personal use. In my opinion, the latter
will become more important because people get so many
emails nowadays. That's why Namazu emphasizes mail/news and
For the present, we in the Namazu project are concentrating on
the development of Namazu 2.x. The TODOs are:
* Support index compression with zlib.
Mifluz uses zlib and some bit compression. Coupled with
sorted B+trees, it achieves 1/8th compression on disk.
* Improve index merging. O(n^2) -> O(n log n)
* Rewrite query operations with lex and yacc.
* Make the source code clear. Throw legacy code away.
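
On the index merging item: the TODO does not say how the improvement
will be made. One common approach, sketched here under the assumption
that each index maps a word to a sorted list of document ids, is a
heap-based k-way merge, which costs O(n log k) for n postings spread
over k indexes. The name merge_indexes is hypothetical.

```python
import heapq


def merge_indexes(indexes):
    """Merge several {word: sorted doc ids} indexes into one."""
    merged = {}
    # Iterating a dict yields its keys, so this collects every word.
    for word in sorted(set().union(*indexes)):
        lists = [ix[word] for ix in indexes if word in ix]
        out = []
        # heapq.merge lazily merges already-sorted inputs.
        for doc_id in heapq.merge(*lists):
            if not out or out[-1] != doc_id:  # drop duplicate ids
                out.append(doc_id)
        merged[word] = out
    return merged


a = {"db": [1, 4], "index": [2]}
b = {"db": [3, 4], "tree": [5]}
merged = merge_indexes([a, b])
```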
When the above TODOs are completed, we will move on to 3.0
development and decide whether to employ mifluz. I hope
mifluz's APIs will be fixed and well documented at that
I've asked the main developer to join this
list, and he has. I'm sure he will pass on his thoughts as