

5 Data Storage Spec

Efficient management of storage space is a central concern when maintaining inverted indexes. The needs of an inverted index are very similar to those of a regular file system. We need:

All of this functionality is provided by file systems and kernel services. Since we also wanted the mifluz library to be portable, we chose the Berkeley DB library, which implements all the services above. Transparent compression is not part of Berkeley DB; it is implemented as a patch to Berkeley DB (version 3.1.14).

Based on these low-level services, Berkeley DB also implements a Btree structure that mifluz uses to store the postings. Each posting is an entry in the Btree structure. Indexing 100 million words implies creating 100 million entries in the Btree. When transparent compression is used, and assuming 6-byte words and a document identifier using 7 * 8 bits, the average disk space used per entry is 6 bytes.
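The figures above can be turned into a rough size estimate. The following is only a back-of-the-envelope sketch in Python: the function name and constants are illustrative and not part of mifluz, and the uncompressed figure is an assumption that counts only the word and the document identifier, ignoring Btree page overhead and any other fields a posting may carry.

```python
# Rough disk-size estimate for the postings Btree.
# All names here are hypothetical; only the input figures come from the text.

ENTRIES = 100_000_000          # one Btree entry per indexed word occurrence
WORD_BYTES = 6                 # assumed average word length
DOC_ID_BYTES = (7 * 8) // 8    # document identifier of 7 * 8 bits = 7 bytes
COMPRESSED_ENTRY_BYTES = 6     # average per entry with transparent compression


def index_size_bytes(entries: int, bytes_per_entry: int) -> int:
    """Total size in bytes for a given number of entries."""
    return entries * bytes_per_entry


# Assumption: an uncompressed entry is just word + document identifier.
raw_size = index_size_bytes(ENTRIES, WORD_BYTES + DOC_ID_BYTES)
compressed_size = index_size_bytes(ENTRIES, COMPRESSED_ENTRY_BYTES)

print(f"uncompressed (assumed): {raw_size / 1e6:.0f} MB")
print(f"compressed:             {compressed_size / 1e6:.0f} MB")
```

Under these assumptions, 100 million postings occupy about 600 MB on disk with compression.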

Unique word statistics are also stored in the inverted index. For each unique word, an entry is created in a dictionary and associated with a serial number (the word identifier) and the total number of occurrences.
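A minimal in-memory sketch of such a dictionary is shown below. It is not the mifluz implementation or its API: the class and method names are hypothetical, and the choice to start serial numbers at 1 is an assumption made only for illustration.

```python
# Hypothetical sketch of a unique-word dictionary: each word maps to a
# serial number (word identifier) and a running occurrence count.

class WordDictionary:
    def __init__(self):
        self._entries = {}   # word -> [word_id, occurrences]
        self._next_id = 1    # assumption: serial numbers start at 1

    def add_occurrence(self, word: str) -> int:
        """Record one occurrence of `word` and return its word identifier."""
        entry = self._entries.get(word)
        if entry is None:
            # First time this word is seen: assign the next serial number.
            entry = [self._next_id, 0]
            self._entries[word] = entry
            self._next_id += 1
        entry[1] += 1
        return entry[0]

    def stats(self, word: str):
        """Return (word_id, occurrences), or None if the word is unknown."""
        entry = self._entries.get(word)
        return None if entry is None else (entry[0], entry[1])


words = WordDictionary()
for w in ["index", "btree", "index"]:
    words.add_occurrence(w)
```

Here `words.stats("index")` reports the identifier assigned on first sight together with a count of two occurrences, while an unseen word yields `None`.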

