

6 Cache tuning

The cache memory used by mifluz has a tremendous impact on performance. Its size is set by the wordlist_cache_size attribute (see WordList(3) and mifluz(3)). The cache holds pages from the inverted index in memory (uncompressed, even if the file is compressed) to reduce disk access. Pages migrate from disk to memory under an LRU (least recently used) policy.
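To make the LRU behaviour concrete, here is a minimal sketch of a page cache that evicts the least recently used page when it is full. It is an illustration only and not mifluz code: the class LRUPageCache, its capacity_pages parameter and the read_from_disk callback are all hypothetical names.

```python
from collections import OrderedDict


class LRUPageCache:
    """Toy LRU cache keyed by page number (illustrative, not mifluz code)."""

    def __init__(self, capacity_pages):
        self.capacity_pages = capacity_pages
        self.pages = OrderedDict()  # page number -> page contents
        self.disk_reads = 0

    def get(self, page_no, read_from_disk):
        if page_no in self.pages:
            # Cache hit: mark the page as most recently used.
            self.pages.move_to_end(page_no)
            return self.pages[page_no]
        # Cache miss: read the page, then evict the oldest page if over capacity.
        self.disk_reads += 1
        page = read_from_disk(page_no)
        self.pages[page_no] = page
        if len(self.pages) > self.capacity_pages:
            self.pages.popitem(last=False)
        return page
```

With room for only two pages, the access sequence 1, 2, 1, 3, 2 costs four disk reads; with room for three pages it would cost three. This is why a cache too small for the working set hurts so much.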

Each page in the cache is really a node of the B-Tree used to store the inverted index entries. The internal pages are intermediate nodes that mifluz must traverse each time a key is searched. It is therefore very important to keep them in memory. Fortunately they account for at most 1% of the total size of the index. The cache must therefore be at least large enough to hold the internal pages.

The other factors that must be taken into account when sizing the cache are highly dependent on the application. A typical case is the insertion of many random words in the index. In this case two factors are of special importance:

distribution of unique words

When filling an inverted index it is very likely that the dictionary of unique words occurring in the index is limited. Suppose you have 1 000 000 unique words in an index of 100 000 000 occurrences. Now assume that 90 000 000 occurrences use only 20 000 unique words, that is, 90% of the index is filled with 2% of the complete vocabulary. In this situation, the indexing process will spend 90% of its time updating 20 000 pages. If you can afford 20 000 * pagesize bytes of cache, you will have the maximum insertion rate.

The general rule is: estimate or calculate how many unique words fill 90% of your index. Multiply this number by the page size and increase your cache by that amount. See the wordlist_page_size attribute in WordList(3) or mifluz(3).
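The rule can be sketched as a small calculation over word frequencies. This is a rough estimate under stated assumptions, not part of mifluz: the functions words_covering and extra_cache_bytes are hypothetical helpers, and the frequency counts are assumed to come from a sample of the text being indexed.

```python
from collections import Counter


def words_covering(counts, fraction=0.9):
    """Return how many of the most frequent words cover `fraction` of all occurrences."""
    total = sum(counts.values())
    covered = 0
    for n_words, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= fraction * total:
            return n_words
    return len(counts)


def extra_cache_bytes(counts, page_size, fraction=0.9):
    """Apply the rule: one page per frequent word, times the page size."""
    return words_covering(counts, fraction) * page_size
```

For the figures used above, 20 000 frequent words with an assumed page size of 4096 bytes give 20 000 * 4096 = 81 920 000 bytes of additional cache. The page size is an example value; use the one set by wordlist_page_size.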

order of numbers following the word

The cache calculation above is fine as long as the words inserted are associated with increasing numbers in the key. If the numbers following the word in the key are random, the cache efficiency will be reduced. Where possible, the application should therefore make sure that when the same word is inserted twice, the number following the first occurrence is lower than the number following the second. In other words, insert

foo 100
foo 103

rather than

foo 103
foo 100

This hint must not be considered in isolation but together with a careful analysis of the distribution of the key components (word and numbers). For instance, it does not matter much if a random number follows the word as long as the range of values of that number is small.
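One simple way to follow the ordering hint is to sort each batch of entries before inserting it. The sketch below assumes that entries are available as (word, number) pairs and that a batch fits in memory. The function in_insertion_order is a hypothetical helper, and the actual insertion call is left to the application.

```python
def in_insertion_order(entries):
    """Sort (word, number) pairs so that equal words carry increasing numbers."""
    return sorted(entries, key=lambda entry: (entry[0], entry[1]))


batch = [("foo", 103), ("bar", 7), ("foo", 100)]
ordered = in_insertion_order(batch)
# `ordered` now lists ("foo", 100) before ("foo", 103).
```

Sorting by word as well as by number is a choice of this sketch; the hint itself only requires increasing numbers for identical words.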

The conclusion is that the cache size should be at least 1% of the total index size (uncompressed) plus a number of bytes that depends on the usage pattern.
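Putting the two parts of this conclusion together gives a lower bound that can be computed as follows. This is a sketch only: estimate_cache_size is a hypothetical helper, the 1% share is the upper bound for internal pages quoted in this chapter, and the usage-dependent part is taken to be one page per frequently used word.

```python
def estimate_cache_size(index_bytes, hot_words, page_size):
    """Lower bound in bytes: 1% of the uncompressed index plus one page per hot word."""
    internal_pages = index_bytes // 100   # at most 1% of the index
    working_set = hot_words * page_size   # depends on the usage pattern
    return internal_pages + working_set


# Example values (assumptions): a 10 GiB uncompressed index,
# 20 000 frequent words and 4096-byte pages.
size = estimate_cache_size(10 * 1024**3, 20000, 4096)
```

The result is a starting point for the wordlist_cache_size attribute, to be adjusted after measuring the real insertion rate.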

