Previous: Berkeley DB Compression, Up: Compression   [Contents][Index]


8.1.2 Page compression in Mifluz

The mifluz classes WordDBCompress and WordBitCompress do the compression and decompression work. From the list of keys stored in a page, the compressor extracts several lists of numbers. The numbers in each list share statistical properties that allow good compression.

The WordDBCompress_compress_c and WordDBCompress_uncompress_c functions are C callbacks that are called by the page compression code in Berkeley DB. The C callbacks then call the WordDBCompress compress/uncompress methods. WordDBCompress creates a WordBitCompress object that acts as a buffer holding the compressed stream.

Compression algorithm.

Most DB pages contain redundant data because mifluz stores one word occurrence per entry. Because of this choice, the pages have a very simple structure.

Here is a real-world example of what a page can look like (key structure: word identifier + 4 numerical fields):

756     1 4482    1  10b    
756     1 4482    1  142    
756     1 4484    1   40    
756     1 449f    1  11e    
756     1 4545    1   11    
756     1 45d3    1  545    
756     1 45e0    1  7e5    
756     1 45e2    1  830    
756     1 45e8    1  545    
756     1 45fe    1   ec    
756     1 4616    1  395    
756     1 461a    1  1eb    
756     1 4631    1   49    
756     1 4634    1   48    
.... etc ....

To compress a page, we chose to code only the differences between adjacent entries. A flag is stored for each entry indicating which fields have changed. When a field differs from the same field in the previous entry, the compression stores the difference, which is likely to be small since the entries are sorted.
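The scheme described above can be sketched as follows. This is a minimal Python illustration, not the mifluz C++ implementation: the function names `delta_encode` and `delta_decode`, the bitmask layout of the per-entry flag, the use of signed differences, and the choice to keep the first entry in full are all assumptions made for the example. The sample fields are read as hexadecimal.

```python
def delta_encode(entries):
    """Encode sorted, fixed-width entries as (flags, diffs) pairs.

    Bit i of flags is set when field i differs from the previous entry;
    diffs holds one signed difference per changed field, in field order.
    The first entry is returned unchanged (an assumption of this sketch).
    """
    if not entries:
        return None, []
    first = tuple(entries[0])
    encoded = []
    prev = first
    for cur in entries[1:]:
        flags = 0
        diffs = []
        for i, (p, c) in enumerate(zip(prev, cur)):
            if c != p:
                flags |= 1 << i
                diffs.append(c - p)
        encoded.append((flags, diffs))
        prev = cur
    return first, encoded


def delta_decode(first, encoded):
    """Rebuild the original entries from the output of delta_encode."""
    if first is None:
        return []
    entries = [tuple(first)]
    prev = list(first)
    for flags, diffs in encoded:
        it = iter(diffs)
        # Consume one stored difference for each field whose flag bit is set.
        cur = [p + next(it) if flags & (1 << i) else p
               for i, p in enumerate(prev)]
        entries.append(tuple(cur))
        prev = cur
    return entries


# First rows of the sample page shown earlier.
page = [
    (0x756, 0x1, 0x4482, 0x1, 0x10B),
    (0x756, 0x1, 0x4482, 0x1, 0x142),
    (0x756, 0x1, 0x4484, 0x1, 0x040),
    (0x756, 0x1, 0x449F, 0x1, 0x11E),
]
first, encoded = delta_encode(page)
```

For the second sample row only the last field changes, so the flag has a single bit set and one difference (0x142 - 0x10b = 0x37) is stored in place of five full fields.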

The basic idea is to build columns of numbers, one for each field, and then compress them individually. One can see that the first and second columns will compress very well, since all the values are the same. The third column will also compress well, since the differences between consecutive numbers are small, leading to a small set of numbers.
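The column view can be illustrated with a short Python sketch. It is not taken from the mifluz sources: the helper name `column_deltas` is hypothetical, and the sample fields are assumed to be hexadecimal.

```python
def column_deltas(rows):
    """Transpose rows into one column per field, then replace each
    column by the differences between consecutive values."""
    columns = list(zip(*rows))
    return [[b - a for a, b in zip(col, col[1:])] for col in columns]


# First six rows of the sample page shown earlier.
rows = [
    (0x756, 0x1, 0x4482, 0x1, 0x10B),
    (0x756, 0x1, 0x4482, 0x1, 0x142),
    (0x756, 0x1, 0x4484, 0x1, 0x040),
    (0x756, 0x1, 0x449F, 0x1, 0x11E),
    (0x756, 0x1, 0x4545, 0x1, 0x011),
    (0x756, 0x1, 0x45D3, 0x1, 0x545),
]
deltas = column_deltas(rows)

# Bits needed for the largest third-column difference and raw value.
delta_bits = max(deltas[2]).bit_length()
raw_bits = max(row[2] for row in rows).bit_length()
```

On these rows the first, second, and fourth columns yield only zeros, and the third-column differences fit in 8 bits whereas the raw values need 15.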

