Next: Cache tuning, Previous: Document name scheme, Up: Top [Contents][Index]
Efficient management of the data storage space is an important aspect of managing inverted indexes. The needs of an inverted index are very similar to those of a regular file system. We need:
All these functionalities are provided by file systems and kernel services. Since we also wanted the mifluz library to be portable, we chose the Berkeley DB library, which implements all the services above. Transparent compression is not part of Berkeley DB; it is implemented as a patch to Berkeley DB (version 3.1.14).
Based on these low-level services, Berkeley DB also implements a Btree structure that mifluz uses to store the postings. Each posting is an entry in the Btree structure, so indexing 100 million words implies creating 100 million entries in the Btree. When transparent compression is used, and assuming 6-byte words and a document identifier using 7 * 8 bits, the average disk size used per entry is 6 bytes.
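The "one posting, one Btree entry" layout and the resulting disk estimate can be illustrated with a small sketch. This is not the mifluz or Berkeley DB API: `PostingStore` and `estimated_disk_bytes` are hypothetical names, a sorted Python list stands in for the Btree, and the key is simplified to a (word, document identifier) pair. The 6 bytes per entry is the compressed average quoted above.

```python
import bisect


class PostingStore:
    """Hypothetical stand-in for the Btree: a sorted list of posting keys."""

    def __init__(self):
        self._keys = []  # sorted list of (word, doc_id) tuples

    def add(self, word, doc_id):
        # Each posting becomes exactly one entry, kept in key order.
        key = (word, doc_id)
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def documents(self, word):
        # Postings for one word are contiguous in key order, so a lookup
        # is a seek to the first matching key followed by a short scan.
        start = bisect.bisect_left(self._keys, (word,))
        found = []
        for entry_word, doc_id in self._keys[start:]:
            if entry_word != word:
                break
            found.append(doc_id)
        return found

    def __len__(self):
        return len(self._keys)


def estimated_disk_bytes(entries, bytes_per_entry=6):
    # Assumes the average compressed entry size given in the text.
    return entries * bytes_per_entry
```

Under these assumptions, 100 million entries at an average of 6 bytes each come to about 600 million bytes on disk, which is what `estimated_disk_bytes(100_000_000)` computes.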
Unique word statistics are also stored in the inverted index. For each unique word, an entry is created in a dictionary and associated with a serial number (the word identifier) and the total number of occurrences.
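As a rough illustration of such a dictionary, the sketch below keeps a serial number and an occurrence count for each unique word. The class and method names are hypothetical, and an in-memory Python dict is used purely for illustration; it does not reflect how mifluz actually stores the dictionary.

```python
class WordDictionary:
    """Hypothetical per-word statistics: serial number and occurrence count."""

    def __init__(self):
        self._entries = {}  # word -> [word_id, occurrences]

    def record(self, word):
        # The first sighting of a word allocates the next serial number.
        entry = self._entries.get(word)
        if entry is None:
            entry = [len(self._entries) + 1, 0]
            self._entries[word] = entry
        entry[1] += 1  # every sighting increments the occurrence count
        return entry[0]

    def stats(self, word):
        # Returns (word_id, occurrences), or None for an unknown word.
        entry = self._entries.get(word)
        return None if entry is None else tuple(entry)
```

Recording the same word twice returns the same identifier both times and leaves its occurrence count at two.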