Next: Key Specification, Previous: Data Storage Spec, Up: Top [Contents][Index]
The cache memory used by mifluz has a tremendous impact on performance. Its size is set by the wordlist_cache_size attribute (see WordList(3) and mifluz(3)). The cache holds pages from the inverted index in memory (uncompressed, even if the file is compressed) to reduce disk access. Pages move between disk and memory under an LRU (least recently used) replacement policy.
Each page in the cache is a node of the B-Tree used to store the inverted index entries. The internal pages are intermediate nodes that mifluz must traverse each time a key is searched, so it is very important to keep them in memory. Fortunately, they account for at most 1% of the total size of the index.
The size of the cache must at least include enough space for the internal pages.
The other factors that must be taken into account when sizing the cache are highly dependent on the application. A typical case is the insertion of many random words into the index. In this case two factors are of special importance:
When filling an inverted index, it is very likely that the dictionary of unique words occurring in the index is limited. Say you have 1 000 000 unique words in an index of 100 000 000 occurrences. Now assume that 90 000 000 of those occurrences use only 20 000 unique words, that is, 90% of the index is filled with 2% of the complete vocabulary. In this situation, the indexing process will spend 90% of its time updating 20 000 pages. If you can afford 20 000 * pagesize bytes of cache, you will get the maximum insertion rate.
The general rule is: estimate or calculate how many unique words fill 90% of your index, multiply this number by the page size, and increase your cache by that amount. See the wordlist_page_size attribute in WordList(3) or mifluz(3).
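The general rule above can be sketched in Python as follows. This is only an illustration, assuming you already have a per-word occurrence count for your corpus; the function names are hypothetical and not part of mifluz, and the page size must be whatever your wordlist_page_size is set to.

```python
from collections import Counter


def hot_word_count(occurrences_per_word, coverage_percent=90):
    """Return how many of the most frequent words are needed to cover
    `coverage_percent` percent of all occurrences."""
    total = sum(occurrences_per_word.values())
    covered = 0
    ranked = Counter(occurrences_per_word).most_common()
    for rank, (_word, count) in enumerate(ranked, start=1):
        covered += count
        # Integer comparison avoids floating-point rounding surprises.
        if covered * 100 >= total * coverage_percent:
            return rank
    return len(occurrences_per_word)


def extra_cache_bytes(occurrences_per_word, page_size, coverage_percent=90):
    """Bytes to add to the cache: one page per frequently updated word."""
    return hot_word_count(occurrences_per_word, coverage_percent) * page_size
```

With the figures from the example above (20 000 words covering 90% of the occurrences), the result would be 20 000 times the page size.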
The cache calculation above is fine as long as the words inserted are associated with increasing numbers in the key. If the numbers following the word in the key are random, the cache efficiency will be reduced. Where possible the application should therefore make sure that when inserting two identical words, the first is followed by a number that is lower than the second. In other words, insert
foo 100
foo 103
rather than
foo 103
foo 100
This hint must not be considered in isolation but together with a careful analysis of the distribution of the key components (word and numbers). For instance, it does not matter much if a random number follows the word, as long as the range of values of that number is small.
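One way an application might follow the ordering hint is to sort each batch of entries before inserting it. The sketch below assumes a batch of (word, number) pairs that fits in memory and that the application is free to reorder it; `ordered_for_insertion` is a hypothetical helper, and the actual mifluz insertion call is deliberately left out.

```python
def ordered_for_insertion(entries):
    """Return (word, number) pairs sorted so that, for identical words,
    lower numbers come before higher ones."""
    # Sorting by word first also groups identical words together,
    # which is an extra assumption about what the application allows.
    return sorted(entries, key=lambda entry: (entry[0], entry[1]))


batch = [("foo", 103), ("bar", 7), ("foo", 100)]
for word, number in ordered_for_insertion(batch):
    # Placeholder: hand (word, number) to the real indexing call here.
    pass
```

Whether sorting is worth its cost depends on the distribution of the key components, as noted above.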
The conclusion is that the cache size should be at least 1% of the total index size (uncompressed) plus a number of bytes that depends on the usage pattern.
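The conclusion can be turned into a small arithmetic sketch. The function name, the 4 GiB index size and the 8192-byte page size below are assumptions chosen for illustration only; the 20 000 figure is the one used in the example earlier in this section.

```python
def min_cache_bytes(index_bytes_uncompressed, usage_bytes=0):
    """Lower bound for the cache: 1% of the uncompressed index size
    (room for the internal pages) plus a usage-dependent amount."""
    internal_pages = -(-index_bytes_uncompressed // 100)  # 1%, rounded up
    return internal_pages + usage_bytes


# Hypothetical figures: a 4 GiB uncompressed index, 8192-byte pages,
# and 20 000 frequently updated words (one page each).
index_bytes = 4 * 1024**3
page_size = 8192
suggested = min_cache_bytes(index_bytes, usage_bytes=20_000 * page_size)
```

The resulting value is only a starting point for wordlist_cache_size; the usage-dependent term should be adjusted to the actual insertion pattern.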