Normalizer

The normalizer is usually placed right after the tokenizer in analysis pipelines. It computes the lowercase and unaccented (normalized) forms of each alphanumeric token.

For each alphanumeric token, the output must contain one LOWERCASE annotation and one or two NORMALIZE annotations.

See Also
Compound Words Splitter
Appendix - Configure Semantic Processors

The normalizer supports all languages that distinguish between lowercase and uppercase, and between accented and unaccented characters.

Moreover, normalization exceptions are defined for:

  • German

  • Spanish

  • French

  • Italian

And normalization alternatives are defined for:

  • German

For example, in German, grüne (green) has the alternative normalized form gruene. The normalizer therefore produces the following annotations:

  • LOWERCASE grüne

  • NORMALIZE grune

  • NORMALIZE gruene
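
The behavior described above can be approximated with a short Python sketch. It is only an illustration, not the product's implementation: the function names (strip_accents, annotate), the "de" language code, and the GERMAN_ALTERNATIVES table are all hypothetical, and the umlaut-to-"vowel + e" rule is assumed from the grüne example. Normalization exceptions are not modeled.

```python
import unicodedata

# Hypothetical table, assumed from the "grüne -> gruene" example:
# each German umlaut maps to the base vowel followed by "e".
GERMAN_ALTERNATIVES = {"ä": "ae", "ö": "oe", "ü": "ue"}


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def annotate(token, language=None):
    """Return (tag, value) pairs for one alphanumeric token."""
    # Compose first so that precomposed characters such as "ü" can be looked up.
    lower = unicodedata.normalize("NFC", token).lower()
    normalized = strip_accents(lower)
    annotations = [("LOWERCASE", lower), ("NORMALIZE", normalized)]

    if language == "de":
        # Build the alternative form, then strip any remaining accents.
        alternative = "".join(GERMAN_ALTERNATIVES.get(ch, ch) for ch in lower)
        alternative = strip_accents(alternative)
        if alternative != normalized:
            annotations.append(("NORMALIZE", alternative))

    return annotations


if __name__ == "__main__":
    for tag, value in annotate("Grüne", language="de"):
        print(tag, value)
```

In this sketch, a token gets a second NORMALIZE annotation only when the alternative form differs from the plain unaccented form, which matches the "one or two NORMALIZE annotations" rule stated earlier.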