Standard tokenizer
This tokenizer is set up by default when you install Exalead CloudView. It breaks down text into tokens whenever it encounters a space or a punctuation mark. It recognizes all known punctuation marks for all languages.
You can have either:

- One Standard tokenizer, to process all languages without a dedicated tokenizer (this is the default setup).
- Multiple Standard tokenizers, to process one or several specified languages for which you want to guarantee a certain tokenization behavior.
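To illustrate the default behavior, here is a minimal sketch of space-and-punctuation tokenization. This is an illustrative toy, not the product's implementation: the real tokenizer recognizes punctuation for all languages, while this version only keeps alphanumeric runs.

```python
import re

def standard_tokenize(text):
    # \w+ matches runs of letters, digits, and underscores;
    # spaces and punctuation act as token boundaries.
    return re.findall(r"\w+", text)

print(standard_tokenize("Hello, world!"))
# ['Hello', 'world']
```

Note that a hyphenated word such as "e-mail" would be split into two tokens here; the pattern overrides described below are one way to change that behavior.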
Character Overrides
When Exalead CloudView tokenizes text, it examines each character and checks the character override rules for any special processing instructions for this particular character.
| Type | Description |
|---|---|
| token (default) | Processes the specified character as a normal alphanumeric character. |
| ignore | Does not process the specified character during indexing, and does not include it when building the query tree at search time. |
| punct | Processes the specified character as a punctuation mark. |
| sentence | The specified character denotes the end of a sentence. |
| separator | Processes the specified character as a separator. |
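The effect of these rules can be sketched as follows. The override entries and the tokenizer loop are illustrative assumptions, not the actual Exalead CloudView API; they only show how each type changes what happens to a character.

```python
# Hypothetical override rules: '@' kept as alphanumeric, apostrophe
# ignored, '.' marking the end of a sentence.
OVERRIDES = {"@": "token", "'": "ignore", ".": "sentence"}

def classify(ch):
    if ch in OVERRIDES:
        return OVERRIDES[ch]
    if ch.isalnum():
        return "token"
    if ch.isspace():
        return "separator"
    return "punct"

def tokenize(text):
    tokens, current = [], ""
    for ch in text:
        kind = classify(ch)
        if kind == "token":
            current += ch          # extend the current token
        elif kind == "ignore":
            continue               # drop the character entirely
        else:                      # punct, sentence, separator end the token
            if current:
                tokens.append(current)
                current = ""
    if current:
        tokens.append(current)
    return tokens

print(tokenize("user@host, don't stop."))
# ['user@host', 'dont', 'stop']
```

With the `token` override, "user@host" survives as a single token; with the `ignore` override, the apostrophe vanishes from "don't" rather than splitting it.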
Pattern Overrides
When Exalead CloudView tokenizes text, it searches for patterns from the current location to the end of the document, based on the pattern override rules. For each matching pattern, it applies the corresponding processing instruction.
Specify patterns using Perl 5 regular expressions.
| Type | Description |
|---|---|
| token (default) | Processes the specified pattern as a normal alphanumeric pattern. Override patterns of this type are used for standard tokenization. For example, to avoid slicing a compound word separated by hyphens into three tokens and get only one token, you can set a pattern of this type matching the whole compound. |
| ignore | Does not process the specified pattern during indexing, and does not include it when building the query tree at search time. |
| punct | Processes the specified pattern as a punctuation mark. |
| sentence | The specified pattern denotes the end of a sentence. |
| separator | Processes the specified pattern as a separator. |
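The hyphenated-compound example from the table can be sketched like this. The regular expression below is an illustrative `token` override written with Python's `re` module, whose syntax is close to Perl 5 for this simple case; the surrounding loop is a toy, not the product's matching engine.

```python
import re

# Token-type override pattern: one or more word runs joined by hyphens,
# e.g. "e-mail" or "state-of-the-art", kept as a single token.
COMPOUND = re.compile(r"\w+(?:-\w+)+")
WORD = re.compile(r"\w+")

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = COMPOUND.match(text, pos)   # try the override pattern first
        if m is None:
            m = WORD.match(text, pos)   # fall back to a plain word
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            pos += 1                    # skip punctuation and separators
    return tokens

print(tokenize("A state-of-the-art e-mail filter."))
# ['A', 'state-of-the-art', 'e-mail', 'filter']
```

Without the override pattern, "state-of-the-art" would be sliced at each hyphen into four tokens.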
Disagglutination Options
Selecting these options for German, Norwegian, or Dutch means that once the tokenizer has produced a token during indexing, it checks whether the token is a compound word. If it is, it adds annotations for each part of the compound word. This way, searching for one part of the compound word can match the whole word.
For more details on decompounding, see About Decompounding.
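The idea behind these options can be sketched with a toy dictionary-based splitter. The vocabulary and the greedy recursive strategy are illustrative assumptions only; the product's decompounding is language-specific and more sophisticated.

```python
VOCAB = {"haus", "tür", "schlüssel"}  # toy German vocabulary (assumption)

def decompound(token):
    """Return the parts of a compound word, or [] if no full split exists."""
    token = token.lower()
    for i in range(len(token), 0, -1):
        head = token[:i]
        if head in VOCAB:
            if i == len(token):
                return [head]          # the remainder is itself a word
            rest = decompound(token[i:])
            if rest:
                return [head] + rest   # head word + split of the tail
    return []

# "Haustürschlüssel" (front-door key) splits into its parts, so a
# search for "schlüssel" can match the whole compound.
print(decompound("Haustürschlüssel"))
# ['haus', 'tür', 'schlüssel']
```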
Concatenation Options
Selecting these options ensures that words including numbers are processed as a single token. For example:

- Windows7 is processed as a single token if concatAlphaNum is true. Otherwise, it is processed as two tokens.
- 9Mile is processed as a single token if concatNumAlpha is true. Otherwise, it is processed as two tokens.

Normal separator rules for tokenization still apply: if there is a separator between numbers and other letters, the numbers and letters are processed as two separate tokens.
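A sketch of what these flags do, assuming the flag names above; the splitting logic itself is illustrative, not the product's implementation:

```python
import re

def tokenize(text, concat_alpha_num=True, concat_num_alpha=True):
    tokens = []
    for word in re.findall(r"\w+", text):      # separators already split words
        parts = re.findall(r"[A-Za-z]+|\d+", word)
        merged = [parts[0]]
        for part in parts[1:]:
            prev = merged[-1]
            alpha_then_num = prev[-1].isalpha() and part[0].isdigit()
            num_then_alpha = prev[-1].isdigit() and part[0].isalpha()
            if (alpha_then_num and concat_alpha_num) or \
               (num_then_alpha and concat_num_alpha):
                merged[-1] = prev + part       # keep as a single token
            else:
                merged.append(part)            # split at the boundary
        tokens.extend(merged)
    return tokens

print(tokenize("Windows7"))                          # ['Windows7']
print(tokenize("Windows7", concat_alpha_num=False))  # ['Windows', '7']
print(tokenize("9Mile", concat_num_alpha=False))     # ['9', 'Mile']
print(tokenize("Windows 7"))                         # ['Windows', '7']
```

The last call shows the separator rule: with a space between "Windows" and "7", the flags no longer apply and two tokens are produced.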
Transliteration Option
You can activate the transliteration option in the Exalead CloudView XML configuration (it is not available in the Administration Console) to transliterate Unicode Latin Extended B characters. Several characters are converted to their closest Latin equivalent to facilitate query typing. This is useful when you want to match documents in a different charset from the one you use for searching.
For example, you may want to search for words containing "Ł" using the closest Latin character "L", even though it does not match phonetically.
For now, the supported transliterations are the following:
- 00D0 = Ð -> 'd'
- 0110 = Đ -> 'd'
- 00F0 = ð -> 'd'
- 0111 = đ -> 'd'
- 00D8 = Ø -> 'o'
- 00F8 = ø -> 'o'
- 0126 = Ħ -> 'h'
- 0127 = ħ -> 'h'
- 0131 = ı -> 'i'
- 0141 = Ł -> 'l'
- 0142 = ł -> 'l'
- 0166 = Ŧ -> 't'
- 0167 = ŧ -> 't'
- 0192 = ƒ -> 'f'
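The mappings in this list can be expressed as a simple translation table. The code points come from the list above; wrapping them with Python's `str.translate` is an illustrative sketch of the substitution, not how the product applies it internally.

```python
# Transliteration table built from the supported mappings above.
TRANSLIT = str.maketrans({
    "\u00D0": "d", "\u0110": "d", "\u00F0": "d", "\u0111": "d",  # Ð Đ ð đ
    "\u00D8": "o", "\u00F8": "o",                                # Ø ø
    "\u0126": "h", "\u0127": "h",                                # Ħ ħ
    "\u0131": "i",                                               # ı
    "\u0141": "l", "\u0142": "l",                                # Ł ł
    "\u0166": "t", "\u0167": "t",                                # Ŧ ŧ
    "\u0192": "f",                                               # ƒ
})

def transliterate(text):
    return text.translate(TRANSLIT)

print(transliterate("Łódź"))
# 'lódź' — only 'Ł' is in the table; 'ó' and 'ź' are left unchanged
```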