When Do You Need to Create a New Tokenization Config?
You may want to create a new tokenization config to define:
- A certain character as a separator, or a character as NOT being a separator. See Character Overrides.
- A specific pattern to be a word only, instead of a word with separators. For example, to make sure C++ is always indexed as the token C++. See Pattern Overrides.
You can also specify a tokenization config for a specific index mapping or semantic type when you know that a certain meta contains characters that must be interpreted differently from those in other alphanumeric metas. Typical examples are identifiers such as user IDs, model numbers, and product codes.
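To make the effect of these overrides concrete, here is a minimal Python sketch of a toy tokenizer. It is not the product's tokenizer or its configuration format: the function names, the parameters (extra_separators, non_separators, patterns), and the CONFIGS and FIELD_CONFIG tables are all hypothetical and exist only for illustration. It assumes a default behavior where every non-alphanumeric character is a separator.

```python
def tokenize(text, extra_separators="", non_separators="", patterns=()):
    """Toy tokenizer (illustrative only, not the product's implementation).

    By default every non-alphanumeric character separates tokens.
    - extra_separators: characters additionally treated as separators.
    - non_separators: characters that must NOT be treated as separators.
    - patterns: literal strings that must always be kept as one token.
    """
    tokens = []
    current = []
    # Try longer patterns first so overlapping patterns prefer the longest.
    literals = sorted((p for p in patterns if p), key=len, reverse=True)

    def flush():
        if current:
            tokens.append("".join(current))
            current.clear()

    i = 0
    while i < len(text):
        # Pattern override: only checked at the start of a token.
        if not current:
            match = next((p for p in literals if text.startswith(p, i)), None)
            if match:
                tokens.append(match)
                i += len(match)
                continue
        ch = text[i]
        # Character override: adjust the default separator decision.
        is_separator = (
            (not ch.isalnum() or ch in extra_separators)
            and ch not in non_separators
        )
        if is_separator:
            flush()
        else:
            current.append(ch)
        i += 1
    flush()
    return tokens


# Hypothetical named configs and a hypothetical field-to-config mapping,
# mimicking "a tokenization config for a specific index mapping".
CONFIGS = {
    "default": {},
    "identifier": {"non_separators": "-_"},
}
FIELD_CONFIG = {"product_code": "identifier"}


def tokenize_field(field, value):
    """Tokenize a value with the config assigned to its field, if any."""
    config = CONFIGS[FIELD_CONFIG.get(field, "default")]
    return tokenize(value, **config)


print(tokenize("I like C++ and C#"))                     # ['I', 'like', 'C', 'and', 'C']
print(tokenize("I like C++ and C#", patterns=("C++",)))  # ['I', 'like', 'C++', 'and', 'C']
print(tokenize_field("title", "AB-1234"))                # ['AB', '1234']
print(tokenize_field("product_code", "AB-1234"))         # ['AB-1234']
```

In this sketch, the pattern override keeps C++ intact while C# is still split, and the identifier config keeps a product code such as AB-1234 as a single token, while the same string in an ordinary text field is split into two tokens.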