Explore the Configuration File

Each component declared in the configuration file (tokenizer, normalizer, semantic processor) can accept its own set of parameters.

See Also
Create a Simple MOTPipe
Perform Language Detection and Word Lemmatization
Extract Named Entities
Use Case
<ling:MOTConfig xmlns:ling="exa:com.exalead.linguistic.v10">
  <!-- Tokenizers -->
  <ling:StandardTokenizer>
    <ling:charOverrides>
      <ling:StandardTokenizerOverride type="token" toOverride=":" />
    </ling:charOverrides>
    <ling:patternOverrides>
      <ling:StandardTokenizerOverride type="token" toOverride="[[:alnum:]][&amp;][[:alnum:]]" />
      <ling:StandardTokenizerOverride type="token" toOverride="[[:alnum:]]*[.]net" />
      <ling:StandardTokenizerOverride type="token" toOverride="[[:alnum:]]+[+]+" />
      <ling:StandardTokenizerOverride type="token" toOverride="[[:alnum:]]+#" />
    </ling:patternOverrides>
  </ling:StandardTokenizer>

  <!-- Normalizer -->
  <ling:NormalizerConfig>
    <ling:NormalizerIndexLower language="fr" word="thé" />
    <ling:NormalizerIndexLower language="fr" word="maïs" />
  </ling:NormalizerConfig>

  <!-- Semantic Processors -->
  <ling:Lemmatizer name="englishLemmatizer" language="en" />
  <ling:Lemmatizer name="frenchLemmatizer" language="fr" />
</ling:MOTConfig>

You can see that the Standard Tokenizer accepts specific overrides:

  • For characters: the : character is treated as an alphabetic character rather than as punctuation. For example, a:b is a single token.

  • For patterns: any sequence of characters that matches one of the specified regular expressions is treated as a single token. For example, R&D matches the first pattern and is therefore kept as one token.
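
As an illustration of how a further pattern override might be declared, the fragment below reuses only the elements and attributes shown in the use case. The second override is a hypothetical addition, not part of the original file: it assumes that the same regular expression syntax (POSIX character classes and the + quantifier, both used above) applies, and that a variant with + on each side of & is what a form such as AT&T would need. The ling prefix is declared explicitly so that the fragment is well-formed on its own. Note that & must be written as &amp;amp; inside an XML attribute value.

```xml
<ling:MOTConfig xmlns:ling="exa:com.exalead.linguistic.v10">
  <ling:StandardTokenizer>
    <ling:patternOverrides>
      <!-- From the use case: one alphanumeric character on each side of "&" (for example R&D) -->
      <ling:StandardTokenizerOverride type="token" toOverride="[[:alnum:]][&amp;][[:alnum:]]" />
      <!-- Hypothetical addition: one or more alphanumeric characters on each side (for example AT&T) -->
      <ling:StandardTokenizerOverride type="token" toOverride="[[:alnum:]]+[&amp;][[:alnum:]]+" />
    </ling:patternOverrides>
  </ling:StandardTokenizer>
</ling:MOTConfig>
```

Whether the broader pattern is actually needed depends on how the tokenizer applies overrides, so it is worth checking the resulting tokens on sample text before relying on it.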

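You can also configure the Normalizer with normalization exceptions. Here, the forms thé, Thé, and THÉ are all normalized to thé, with the accent preserved, rather than to the. Likewise, maïs keeps its diaeresis instead of being normalized to mais.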
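
A further exception could presumably be declared in the same way. The sketch below repeats one entry from the use case and adds a hypothetical one for sûr ("sure"), which would otherwise collide with sur ("on") once the accent is dropped. The added word is an illustrative assumption; only the NormalizerIndexLower element and its language and word attributes come from the example above.

```xml
<ling:MOTConfig xmlns:ling="exa:com.exalead.linguistic.v10">
  <ling:NormalizerConfig>
    <!-- From the use case: keep the accent so that "thé" is not indexed as "the" -->
    <ling:NormalizerIndexLower language="fr" word="thé" />
    <!-- Hypothetical addition: keep "sûr" distinct from "sur" -->
    <ling:NormalizerIndexLower language="fr" word="sûr" />
  </ling:NormalizerConfig>
</ling:MOTConfig>
```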