Standard tokenizer
This tokenizer is set up by default when you install Exalead CloudView. It breaks down text into tokens whenever it encounters a space or a punctuation mark. It recognizes all known punctuation marks for all languages.
You can have either:

- One Standard tokenizer, to process all languages without a dedicated tokenizer (this is the default setup).
- Multiple Standard tokenizers, to process one or several specified languages for which you want to guarantee a certain tokenization behavior.
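To illustrate the default behavior, here is a minimal sketch of space-and-punctuation tokenization. This is an illustrative toy, not the product's implementation: the real tokenizer recognizes punctuation for all languages, while this version only keeps alphanumeric runs.

```python
import re

def standard_tokenize(text):
    # \w+ matches runs of letters, digits, and underscores;
    # spaces and punctuation act as token boundaries.
    return re.findall(r"\w+", text)

print(standard_tokenize("Hello, world!"))
# ['Hello', 'world']
```

Note that a hyphenated word such as "e-mail" would be split into two tokens here; the pattern overrides described below are one way to change that behavior.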
Character Overrides
When Exalead CloudView tokenizes text, it examines each character and checks the character override rules for any special processing instructions for this particular character.
| Type | Description |
|---|---|
| token (default) | Processes the specified character as a normal alphanumeric character. |
| ignore | Does not process the specified character during indexing, and does not include it when building the query tree at search time. |
| punct | Processes the specified character as a punctuation mark. |
| sentence | The specified character denotes the end of a sentence. |
| separator | Processes the specified character as a separator. |
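The effect of these rules can be sketched as follows. The override entries and the tokenizer loop are illustrative assumptions, not the actual Exalead CloudView API; they only show how each type changes what happens to a character.

```python
# Hypothetical override rules: '@' kept as alphanumeric, apostrophe
# ignored, '.' marking the end of a sentence.
OVERRIDES = {"@": "token", "'": "ignore", ".": "sentence"}

def classify(ch):
    if ch in OVERRIDES:
        return OVERRIDES[ch]
    if ch.isalnum():
        return "token"
    if ch.isspace():
        return "separator"
    return "punct"

def tokenize(text):
    tokens, current = [], ""
    for ch in text:
        kind = classify(ch)
        if kind == "token":
            current += ch          # extend the current token
        elif kind == "ignore":
            continue               # drop the character entirely
        else:                      # punct, sentence, separator end the token
            if current:
                tokens.append(current)
                current = ""
    if current:
        tokens.append(current)
    return tokens

print(tokenize("user@host, don't stop."))
# ['user@host', 'dont', 'stop']
```

With the `token` override, "user@host" survives as a single token; with the `ignore` override, the apostrophe vanishes from "don't" rather than splitting it.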
Pattern Overrides
When Exalead CloudView tokenizes text, it searches for patterns from the current location to the end of the document, based on the pattern override rules. For each matching pattern, it applies the corresponding processing instruction.
Specify patterns using Perl 5 regular expressions.
| Type | Description |
|---|---|
| token (default) | Processes the specified pattern as a normal alphanumeric pattern. Override patterns of this type are used for standard tokenization. For example, to avoid slicing a compound word separated by hyphens into three tokens and get only one token, you can set a pattern of this type matching the whole compound. |
| ignore | Does not process the specified pattern during indexing, and does not include it when building the query tree at search time. |
| punct | Processes the specified pattern as a punctuation mark. |
| sentence | The specified pattern denotes the end of a sentence. |
| separator | Processes the specified pattern as a separator. |
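The hyphenated-compound example from the table can be sketched like this. The regular expression below is an illustrative `token` override written with Python's `re` module, whose syntax is close to Perl 5 for this simple case; the surrounding loop is a toy, not the product's matching engine.

```python
import re

# Token-type override pattern: one or more word runs joined by hyphens,
# e.g. "e-mail" or "state-of-the-art", kept as a single token.
COMPOUND = re.compile(r"\w+(?:-\w+)+")
WORD = re.compile(r"\w+")

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = COMPOUND.match(text, pos)   # try the override pattern first
        if m is None:
            m = WORD.match(text, pos)   # fall back to a plain word
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            pos += 1                    # skip punctuation and separators
    return tokens

print(tokenize("A state-of-the-art e-mail filter."))
# ['A', 'state-of-the-art', 'e-mail', 'filter']
```

Without the override pattern, "state-of-the-art" would be sliced at each hyphen into four tokens.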
Disagglutination Options
Selecting these options for German, Norwegian, or Dutch means that once the tokenizer has produced a token during indexing, it checks whether the token is a compound word. If it is, it adds annotations for each part of the compound word. This way, searching for one part of the compound word can match the whole word.
For more details on decompounding, see About Decompounding.
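The idea behind these options can be sketched with a toy dictionary-based splitter. The vocabulary and the greedy recursive strategy are illustrative assumptions only; the product's decompounding is language-specific and more sophisticated.

```python
VOCAB = {"haus", "tür", "schlüssel"}  # toy German vocabulary (assumption)

def decompound(token):
    """Return the parts of a compound word, or [] if no full split exists."""
    token = token.lower()
    for i in range(len(token), 0, -1):
        head = token[:i]
        if head in VOCAB:
            if i == len(token):
                return [head]          # the remainder is itself a word
            rest = decompound(token[i:])
            if rest:
                return [head] + rest   # head word + split of the tail
    return []

# "Haustürschlüssel" (front-door key) splits into its parts, so a
# search for "schlüssel" can match the whole compound.
print(decompound("Haustürschlüssel"))
# ['haus', 'tür', 'schlüssel']
```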
Concatenation Options
Selecting these options ensures that words including numbers are processed as a single token. For example:

- Windows7 is processed as a single token if concatAlphaNum is true. Otherwise, it is processed as two tokens.
- 9Mile is processed as a single token if concatNumAlpha is true. Otherwise, it is processed as two tokens.

Normal separator rules for tokenization still apply: if there is a separator between numbers and other letters, the numbers and letters are processed as two separate tokens.
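A sketch of what these flags do, assuming the flag names above; the splitting logic itself is illustrative, not the product's implementation:

```python
import re

def tokenize(text, concat_alpha_num=True, concat_num_alpha=True):
    tokens = []
    for word in re.findall(r"\w+", text):      # separators already split words
        parts = re.findall(r"[A-Za-z]+|\d+", word)
        merged = [parts[0]]
        for part in parts[1:]:
            prev = merged[-1]
            alpha_then_num = prev[-1].isalpha() and part[0].isdigit()
            num_then_alpha = prev[-1].isdigit() and part[0].isalpha()
            if (alpha_then_num and concat_alpha_num) or \
               (num_then_alpha and concat_num_alpha):
                merged[-1] = prev + part       # keep as a single token
            else:
                merged.append(part)            # split at the boundary
        tokens.extend(merged)
    return tokens

print(tokenize("Windows7"))                          # ['Windows7']
print(tokenize("Windows7", concat_alpha_num=False))  # ['Windows', '7']
print(tokenize("9Mile", concat_num_alpha=False))     # ['9', 'Mile']
print(tokenize("Windows 7"))                         # ['Windows', '7']
```

The last call shows the separator rule: with a space between "Windows" and "7", the flags no longer apply and two tokens are produced.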
Transliteration Option
You can activate the transliteration option in the Exalead CloudView XML configuration (it is not available in the Administration Console) to transliterate Unicode Latin Extended B characters. Several characters are converted to their closest Latin equivalent to facilitate query typing. This is useful when you want to match documents in a different charset from the one you use for searching.
For example, you may want to search for words containing "Ł" using the closest Latin character "L", even though it does not match phonetically.
For now, the supported transliterations are the following:
- 00D0 = Ð -> 'd'
- 0110 = Đ -> 'd'
- 00F0 = ð -> 'd'
- 0111 = đ -> 'd'
- 00D8 = Ø -> 'o'
- 00F8 = ø -> 'o'
- 0126 = Ħ -> 'h'
- 0127 = ħ -> 'h'
- 0131 = ı -> 'i'
- 0141 = Ł -> 'l'
- 0142 = ł -> 'l'
- 0166 = Ŧ -> 't'
- 0167 = ŧ -> 't'
- 0192 = ƒ -> 'f'
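The mappings in this list can be expressed as a simple translation table. The code points come from the list above; wrapping them with Python's `str.translate` is an illustrative sketch of the substitution, not how the product applies it internally.

```python
# Transliteration table built from the supported mappings above.
TRANSLIT = str.maketrans({
    "\u00D0": "d", "\u0110": "d", "\u00F0": "d", "\u0111": "d",  # Ð Đ ð đ
    "\u00D8": "o", "\u00F8": "o",                                # Ø ø
    "\u0126": "h", "\u0127": "h",                                # Ħ ħ
    "\u0131": "i",                                               # ı
    "\u0141": "l", "\u0142": "l",                                # Ł ł
    "\u0166": "t", "\u0167": "t",                                # Ŧ ŧ
    "\u0192": "f",                                               # ƒ
})

def transliterate(text):
    return text.translate(TRANSLIT)

print(transliterate("Łódź"))
# 'lódź' — only 'Ł' is in the table; 'ó' and 'ź' are left unchanged
```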