Next: Data Storage Spec, Previous: Constraints, Up: Top [Contents][Index]
In all of the literature dealing with full text indexing a collection of documents is considered to be a flat set of documents containing words. Each document has a unique name. The inverted index associates terms found in the documents with a list of unique document names.
We found it more interesting to consider that the document names have a hierarchical structure, just like path names in file systems. The main difference is that each component of the document name (think path name in file system) may contain terms.
As shown in the figure above we can consider that the first component of the document name is the name of a collection, the second the logical name of a set of documents within the collection, the third the name of the document, the fourth the name of a part of the document.
This logical structure may be applied to URLs in the following way : there is only one collection, it contains servers (document sets) containing URLs (documents) containing tags such as TITLE (document parts).
This logical structure may be also be applied to databases in the following way : there is one collection for each database, it contains tables (document set) containing fields (document) containing records (document part).
What does this imply for full text indexing ? Instead of having only one dictionary to map the document name to a numerical identifier (this is needed to compress the postings for a term), we must have a dictionary for each level of the hierarchy.
Using the database example again:
When coding the document identifier in the postings for a term, we have to code a list of numerical identifiers instead of a single numerical identifier. Alternatively one could see the document identifier as an aribtrary precision number sliced in parts.
The advantage of this document naming scheme are:
uniq
query operator can be trivially implemented. This is mostly
useful to answer a query such as : I want URLs matching the word foo
but I only want to see one URL for a given server (avoid the problem of
having the first 40 URLs for a request on the same server).
Of course, the suggested hierarchy semantic is not mandatory and may be redefined according to sorting needs. For instance a relevance ranking algorithm can lead to a relevance ranking number being inserted into the hierarchy.
The space overhead implied by this name scheme is quite small for databases and URL pools. The big dictionary for URL pools maps URL to identifiers. The dictionary for tags (TITLE etc..) is only 10-50 at most. The dictionary for site names (www.domain.com) will be ~1/100 of the dictionary for URLs, assuming you have 100 URLs for a given site. For databases the situation is even better: the big dictionary would be the dictionary mapping rowids to numerical identifiers. But since rowids are already numerical we don’t need this. We only need the database name, field name and table name dictionaries and they are small. Since we are able to encode small numbers using only a few bits in postings, the overhead of hierarchical names is acceptable.
Next: Data Storage Spec, Previous: Constraints, Up: Top [Contents][Index]