Tokenizing Text

Tokenization is the process of splitting a segment of text into smaller pieces, or tokens. Tokens can be broadly described as words, but it is more accurate to say that a token is a sequence of characters grouped together as a useful unit for semantic processing.
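For illustration, the sketch below (plain Java, not Exalead CloudView code; the class and method names are hypothetical) applies one simple tokenization rule: each unbroken run of letters or digits becomes a token, and everything else acts as a separator, so "e-mail" yields the two tokens "e" and "mail".

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TokenizeDemo {
        // One simple rule: a token is an unbroken run of letters or digits.
        private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

        static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Prints: [Exalead, CloudView, splits, e, mail, into, 2, tokens]
            System.out.println(tokenize("Exalead CloudView splits e-mail into 2 tokens."));
        }
    }

Real tokenizers apply many more rules (per-language segmentation, normalization, and so on), but the principle is the same: deterministic rules map raw text to a token sequence.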

Since tokenization is a required processing step for all searchable alphanumeric text, it is set up automatically as part of the Exalead CloudView installation. This setup is known as the default tokenization config, tok0.

A tokenization configuration specifies which tokenizers to use when Exalead CloudView analyzes incoming documents at index-time. It also specifies how to tokenize queries at search-time.
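Both sides belong in one configuration because a query term can only match its indexed form if the same tokenization is applied at index-time and at search-time. The following self-contained sketch (generic Java, not the Exalead CloudView API; names are hypothetical) builds a toy inverted index and tokenizes the query with the identical rule, so the query "E-Mail" matches the indexed document "Send an e-mail":

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class IndexQueryDemo {
        // The same rule runs at index-time and at search-time.
        static List<String> tokenize(String text) {
            return Arrays.asList(text.toLowerCase().split("[^\\p{L}\\p{N}]+"));
        }

        public static void main(String[] args) {
            // Toy inverted index: token -> ids of documents containing it.
            Map<String, Set<Integer>> index = new HashMap<>();
            String[] docs = { "Send an e-mail", "Read your mail" };
            for (int id = 0; id < docs.length; id++) {
                for (String token : tokenize(docs[id])) {
                    index.computeIfAbsent(token, k -> new HashSet<>()).add(id);
                }
            }
            // The query "E-Mail" is tokenized identically, so its tokens
            // are found: e -> [0], mail -> [0, 1].
            for (String token : tokenize("E-Mail")) {
                System.out.println(token + " -> " + index.getOrDefault(token, Set.of()));
            }
        }
    }

If the query were instead tokenized with a different rule (for example, keeping "e-mail" as a single token), the lookup would find nothing, which is why index-time and search-time tokenization are defined together.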

Note: You can test the result of the tokenization process in Index > Data Processing > Test.


In this section:

Using Native Tokenizers
Using Basis Tech Tokenizer
About Creating Additional Tokenization Configurations
Customizing the Tokenization Config
About Decompounding