Normalizer

The normalizer usually comes right after the tokenizer in analysis pipelines. It computes the lowercase form and the unaccented (normalized) form of each alphanumeric token.

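For each alphanumeric token, the output contains one LOWERCASE annotation and one or two NORMALIZE annotations. The second NORMALIZE annotation appears only when an alternative normalized form is defined for the token's language.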

See Also

Compound Words Splitter

Appendix - Configure Semantic Processors

The normalizer applies to all languages that distinguish lowercase from uppercase characters and accented from unaccented characters.

Moreover, normalization exceptions are defined for:

German
Spanish
French
Italian

And normalization alternatives are defined for:

German

For example, in German, grüne (green) has an alternative normalized form gruene. You get the following annotations:

LOWERCASE grüne
NORMALIZE grune
NORMALIZE gruene
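
The annotations above can be reproduced with a minimal Python sketch. This is an illustration only, not the normalizer's actual implementation: the function names annotate and strip_accents, the "de" language code, and the GERMAN_ALTERNATIVES table are all hypothetical, and the sketch ignores the normalization exceptions mentioned earlier.

```python
import unicodedata

# Hypothetical table of German alternative spellings (an assumption made
# for this sketch, not taken from the product's configuration).
GERMAN_ALTERNATIVES = {"ä": "ae", "ö": "oe", "ü": "ue"}


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def annotate(token, language=None):
    """Return (tag, value) pairs for one alphanumeric token."""
    # Compose first so that accented letters are single code points.
    lowercase = unicodedata.normalize("NFC", token).lower()
    normalized = strip_accents(lowercase)
    annotations = [("LOWERCASE", lowercase), ("NORMALIZE", normalized)]
    if language == "de":
        # Expand umlauts before stripping any remaining accents.
        expanded = "".join(GERMAN_ALTERNATIVES.get(ch, ch) for ch in lowercase)
        alternative = strip_accents(expanded)
        if alternative != normalized:
            annotations.append(("NORMALIZE", alternative))
    return annotations


if __name__ == "__main__":
    for tag, value in annotate("Grüne", language="de"):
        print(tag, value)
```

Under these assumptions, the token Grüne yields the LOWERCASE form grüne and the two NORMALIZE forms grune and gruene, while a token without an alternative form, such as Café, yields a single NORMALIZE annotation.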