Peter Marelas <maral(_at_)phase-one(_dot_)com(_dot_)au> wrote:
| | Documents: 878,914 files
| | Total size: 2,167,480,108 bytes
| I'm curious as to how fast query performance is on that index?
Unfortunately, I don't know, because I am not the creator of
the index. But I'm convinced that ordinary query operations
are fast. For simple word searching, Namazu looks up a given
keyword in the vocabulary list stored in NMZ.w, via its index
NMZ.wi, using binary search. That is done quickly.
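A minimal sketch of that lookup, assuming the vocabulary is kept in
sorted order (the file layout is simplified here to an in-memory
list; the names are illustrative, not Namazu's actual code):

```python
import bisect

# Sorted vocabulary, standing in for the contents of NMZ.w.
vocabulary = ["apple", "banana", "cherry", "search", "zebra"]

def find_word_id(word, vocab):
    """Binary-search the sorted vocabulary; return the word's
    position (its word ID), or -1 if the word is absent."""
    i = bisect.bisect_left(vocab, word)
    if i < len(vocab) and vocab[i] == word:
        return i
    return -1

print(find_word_id("search", vocabulary))  # -> 3
print(find_word_id("missing", vocabulary))  # -> -1
```

Binary search keeps the lookup at O(log n) comparisons, which is why
simple word queries stay fast even on a large vocabulary.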
I am the designer of Namazu's index structure. The
structure is a very simple inverted index. It is easy to
implement both the indexer and the search engine, but it is
not fast to update. See the following page for details.
| The major difference is that your index grows outwards,
| i.e. from left to right. You also build different
| indexes to solve different queries, e.g. phrase searching.
Yes, Namazu sacrifices scalability for ease of
implementation; moreover, Namazu's indices can't be updated
efficiently. That is a drawback.
Although I'm not a good programmer, I was an even worse one,
with no knowledge of information retrieval, two years ago
when I designed the first version of Namazu.
But I think building different indices to solve different
queries is a good approach. It keeps their structures
simple and easy to handle.
For instance, the vocabulary list NMZ.w can be grep'ed
with regular expressions because it is just a line-oriented
text file. Each line number of NMZ.w corresponds to a word
ID, so the list of documents which contain the word
identified by that ID can be retrieved.
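The grep-style scan described above can be sketched like this,
with the file contents simplified to an in-memory list (the names
are illustrative, not Namazu's actual code):

```python
import re

# One word per line, standing in for the contents of NMZ.w.
# The line number (0-based here) doubles as the word ID.
nmz_w_lines = ["namazu", "search", "searcher", "searching", "text"]

def grep_word_ids(pattern, lines):
    """Scan the line-oriented vocabulary with a regular expression
    and return (word_id, word) pairs for every matching line."""
    rx = re.compile(pattern)
    return [(i, w) for i, w in enumerate(lines) if rx.search(w)]

print(grep_word_ids(r"^search", nmz_w_lines))
# -> [(1, 'search'), (2, 'searcher'), (3, 'searching')]
```

Because the line number is the word ID, a plain text scan is all
that is needed to turn a regex match into posting-list lookups.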
NMZ.field.* are also line-oriented text files, and they make
field-specified searching possible. This is simple, but not
efficient for a large number of documents.
NMZ.p and NMZ.pi are the indices for phrase searching. To
save space, Namazu encodes each pair of words into a 16-bit
hash value instead of storing the locations of word
occurrences.
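The idea can be sketched as follows. Namazu's actual hash function
is not given in this message, so the function below is a simple
stand-in that merely folds a word pair into 16 bits:

```python
def phrase_hash(w1, w2):
    """Fold two adjacent words into a 16-bit value (illustrative
    stand-in; not Namazu's real hash function)."""
    h = 0
    for ch in w1 + " " + w2:
        h = (h * 31 + ord(ch)) & 0xFFFF  # keep only the low 16 bits
    return h

h = phrase_hash("full", "text")
assert 0 <= h < 2 ** 16
```

The trade-off is that a 16-bit hash admits collisions, so phrase
matching by hash can report false matches; that is the price of not
storing exact word positions.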
By the way, as Peter said on the mifluz list:
| Finally, they use BER compression of ints for just about
| everything. Again mentioned here.
Yes, Namazu uses BER compression, and document IDs are
stored as just their gaps. It's an easy way to reduce space.
For instance, "1, 5, 29, 34" is stored as "1, 4, 24, 5".
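The two techniques can be sketched together: gap encoding turns a
sorted ID list into small deltas, and BER compressed integers
(base-128 with a continuation bit, as in Perl's pack('w', ...))
store each small value in few bytes. This is an illustration,
not Namazu's actual code:

```python
def ber_encode(n):
    """Encode a non-negative integer as a BER compressed integer:
    big-endian base-128, high bit set on all but the final byte."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))

def gaps(ids):
    """Rewrite sorted document IDs as the first ID plus successive gaps."""
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

doc_ids = [1, 5, 29, 34]
print(gaps(doc_ids))  # -> [1, 4, 24, 5]
encoded = b"".join(ber_encode(g) for g in gaps(doc_ids))
print(len(encoded))  # each gap here fits in a single byte, so -> 4
```

Since gaps in a posting list tend to be small, most of them fit in
one byte, which is where the space saving comes from.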
I just printed out mifluz.texinfo and read it. I can see
that it is really a high-performance library. But at the
moment, I don't know whether or not it would be good to
employ mifluz for Namazu.
| Mifluz is generalised enough (at the moment) that it will
| cater for most requirements. The fact that it is generalised
| can be a problem, though. I in fact use my own indexer derived from
| the mifluz structure. If I can prove my optimizations are
| worthwhile, they should end up in mifluz.
We will study mifluz to decide whether to employ it for the
next generation of Namazu. The decision will be made by the
Namazu developers.
* Support index compression with zlib.
| Mifluz uses zlib and some bit compression. Coupled with
| sorted B+trees it achieves 1/8th compression on disk.
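For a rough feel of what zlib alone can do on repetitive,
posting-list-like data, here is a generic demonstration (this is
not mifluz code, and the data is synthetic):

```python
import zlib

# Synthetic, highly regular index-like data: sorted document keys,
# one per line, similar in spirit to on-disk posting data.
postings = b"".join(b"doc%08d\n" % i for i in range(10000))

compressed = zlib.compress(postings, 9)  # level 9 = best compression
print(len(postings), len(compressed))

# The round trip must be lossless.
assert zlib.decompress(compressed) == postings
```

Sorted, repetitive data compresses very well, which is consistent
with the large on-disk savings reported above for B+tree pages.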
Great. How much disk space does mifluz's index take compared
with the original text files? Namazu's NMZ.* files take
about 50-80% of the space of the original text.
When the above TODOs are completed, we will change over to
3.0 development and decide whether to employ mifluz. I hope
mifluz's APIs will be fixed and well documented by that time.
| I've asked the main developer to join this
| list, and he has. I'm sure he will pass on his thoughts as well.
Welcome to our list. I just subscribed to your list. :-)
-- Satoru Takabayashi