10.14.5 WordType METHODS

int Normalize(String &s) const

Normalize a word according to configuration specifications and built-in transformations. Every word inserted in the inverted index goes through this function. If a word is rejected (the return value has the WORD_NORMALIZE_NOTOK bit set), it is not inserted in the index. If a word is accepted (the return value has the WORD_NORMALIZE_OK bit set), it is inserted in the index. In addition to these two bits, the return value carries informational bits that describe the processing done on the word. The bit field values and their meanings are as follows:

WORD_NORMALIZE_TOOLONG

the word length exceeds the value of the wordlist_maximum_word_length configuration parameter.

WORD_NORMALIZE_TOOSHORT

the word length is smaller than the value of the wordlist_minimum_word_length configuration parameter.

WORD_NORMALIZE_CAPITAL

the word contained capital letters and has been converted to lowercase. This bit is only set if the wordlist_lowercase configuration parameter is true.

WORD_NORMALIZE_NUMBER

the word contains digits and the configuration parameter wordlist_allow_numbers is set to false.

WORD_NORMALIZE_CONTROL

the word contains control characters.

WORD_NORMALIZE_BAD

the word is listed in the file pointed to by the wordlist_bad_word_list configuration parameter.

WORD_NORMALIZE_NULL

the word is a zero-length string.

WORD_NORMALIZE_PUNCTUATION

at least one character listed in the wordlist_valid_punctuation configuration parameter was removed from the word.

WORD_NORMALIZE_NOALPHA

the word does not contain any alphanumeric character.
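
As a rough illustration of how a caller might inspect the returned bit field, the sketch below tests individual bits with a bitwise AND. It does not include the library headers: the constants are redeclared in a hypothetical sketch namespace with placeholder values (an assumption; the real values are defined by the library), and IsAccepted and WasModified are helper names invented for this example.

```cpp
// Placeholder constants: the numeric values are assumptions made for this
// sketch and are not taken from the library headers.
namespace sketch {

constexpr int WORD_NORMALIZE_TOOSHORT    = 0x0002;
constexpr int WORD_NORMALIZE_CAPITAL     = 0x0004;
constexpr int WORD_NORMALIZE_PUNCTUATION = 0x0080;
constexpr int WORD_NORMALIZE_OK          = 0x4000;
constexpr int WORD_NORMALIZE_NOTOK       = 0x8000;

// Hypothetical helper: true when the flags say the word may be indexed.
inline bool IsAccepted(int flags) {
    return (flags & WORD_NORMALIZE_OK) != 0 &&
           (flags & WORD_NORMALIZE_NOTOK) == 0;
}

// Hypothetical helper: true when the flags report that the word text was
// changed (lowercased, or punctuation removed).
inline bool WasModified(int flags) {
    return (flags & (WORD_NORMALIZE_CAPITAL | WORD_NORMALIZE_PUNCTUATION)) != 0;
}

}  // namespace sketch
```

With these placeholder values, a result combining the OK and CAPITAL bits would be reported as accepted and modified, while a result combining NOTOK and TOOSHORT would be reported as rejected.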

static String NormalizeStatus(int flags)

Returns a string explaining the return flags of the Normalize method.
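
The exact text produced by NormalizeStatus is not specified here. Purely to show the idea of turning a bit field into a readable string, the following sketch walks a table of (bit, name) pairs and joins the names of the bits that are set. DescribeFlags, the table, the bit values and the output format are all assumptions made for this example, not the library's behavior.

```cpp
#include <string>

namespace sketch {

// One entry per flag. The bit values are placeholders chosen for this sketch.
struct FlagName {
    int bit;
    const char *name;
};

const FlagName kFlagNames[] = {
    {0x0001, "TOOLONG"},     {0x0002, "TOOSHORT"}, {0x0004, "CAPITAL"},
    {0x0008, "NUMBER"},      {0x0010, "CONTROL"},  {0x0020, "BAD"},
    {0x0040, "NULL"},        {0x0080, "PUNCTUATION"},
    {0x0100, "NOALPHA"},     {0x4000, "OK"},       {0x8000, "NOTOK"},
};

// Hypothetical helper: returns the names of all set bits, separated by
// single spaces, in table order. Returns an empty string when no known
// bit is set.
inline std::string DescribeFlags(int flags) {
    std::string out;
    for (const FlagName &f : kFlagNames) {
        if ((flags & f.bit) != 0) {
            if (!out.empty()) out += ' ';
            out += f.name;
        }
    }
    return out;
}

}  // namespace sketch
```

A table-driven loop keeps the mapping in one place, so adding a flag to such a helper would only require a new table entry.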