Language Detection with Java
The MOTPipe.process() method accepts a language as a parameter. The tokenizer then uses this language for all the tokens it creates. Let us redo the previous example using Language.EN instead of Language.XX. Note how the lng attribute of the tokens has changed from lng[xx] to lng[en].
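The effect of the language parameter can be pictured with a small, self-contained sketch. It does not use the MOT classes: Lang, Tok and process() below are hypothetical stand-ins, and the whitespace split is only an assumption made to keep the example short. The point is simply that the language passed by the caller ends up on every token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins for illustration only; these are NOT the MOT classes.
class LanguageStampDemo {

    enum Lang { XX, EN, FR }

    static final class Tok {
        final String form;
        final Lang lng;

        Tok(String form, Lang lng) {
            this.form = form;
            this.lng = lng;
        }

        @Override
        public String toString() {
            return form + " lng[" + lng.name().toLowerCase(Locale.ROOT) + "]";
        }
    }

    // Splits on whitespace and stamps every token with the caller's language.
    static List<Tok> process(String text, Lang language) {
        List<Tok> tokens = new ArrayList<>();
        for (String form : text.trim().split("\\s+")) {
            if (!form.isEmpty()) {
                tokens.add(new Tok(form, language));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(process("hello world", Lang.XX));
        System.out.println(process("hello world", Lang.EN));
    }
}
```

Running main prints the same two tokens twice, first tagged lng[xx] and then lng[en], which mirrors the change described above.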
Processors may support all languages, a single language, or a subset of languages. A lemmatizer, for example, only handles tokens in the language of its resource.
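As a rough illustration of a language-restricted processor, the sketch below shows a toy lemmatizer that only touches tokens tagged with the language of its resource. Everything in it (Lang, Tok, EnglishLemmatizer and its two-entry dictionary) is hypothetical and made up for the example; it is not how the MOT lemmatizer is implemented.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT the MOT classes.
class LanguageFilterDemo {

    enum Lang { XX, EN, FR }

    static final class Tok {
        final String form;
        final Lang lng;
        String lemma; // stays null when no processor handled the token

        Tok(String form, Lang lng) {
            this.form = form;
            this.lng = lng;
        }
    }

    // A toy lemmatizer backed by a single English resource.
    static final class EnglishLemmatizer {
        private final Set<Lang> supported = EnumSet.of(Lang.EN);
        private final Map<String, String> resource = new HashMap<>();

        EnglishLemmatizer() {
            resource.put("mice", "mouse");
            resource.put("ran", "run");
        }

        // Annotates only the tokens whose language matches the resource.
        void process(List<Tok> tokens) {
            for (Tok t : tokens) {
                if (supported.contains(t.lng)) {
                    t.lemma = resource.getOrDefault(t.form, t.form);
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("mice", Lang.EN));
        tokens.add(new Tok("mice", Lang.XX));
        new EnglishLemmatizer().process(tokens);
        for (Tok t : tokens) {
            System.out.println(t.form + " lng[" + t.lng + "] lemma=" + t.lemma);
        }
    }
}
```

In this sketch the token tagged EN receives a lemma while the identical form tagged XX is left untouched, which is why the language attached to each token matters for downstream processors.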
The following snippet illustrates the creation of a language detector processor and its declaration in a pipe.
    List<Tokenizer> tokenizers = new ArrayList<Tokenizer>();
    tokenizers.add(new StandardTokenizer());

    LanguageDetector languageDetector = new LanguageDetector();
    languageDetector.setName("languageDetector");

    List<SemanticProcessor> processors = new ArrayList<SemanticProcessor>();
    processors.add(languageDetector);

    pipe = LinguisticFactory.buildPipe(tokenizers, new NormalizerConfig(), processors);
With this initialization, running the program detects the language of tokens. The processor replaces the language used in the call to MOTPipe.process() with the one it detects.
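The override behaviour can be sketched in isolation as follows. This is a toy model, not the LanguageDetector implementation: the stop-word counting heuristic, the word lists, and the choice to keep the requested language when there is no evidence are all assumptions made for the example, and every name below is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT the MOT classes.
class LanguageOverrideDemo {

    enum Lang { XX, EN, FR }

    static final class Tok {
        final String form;
        Lang lng; // mutable so that a later processor can overwrite it

        Tok(String form, Lang lng) {
            this.form = form;
            this.lng = lng;
        }
    }

    static final Set<String> EN_WORDS = new HashSet<>(Arrays.asList("the", "is", "and", "of"));
    static final Set<String> FR_WORDS = new HashSet<>(Arrays.asList("le", "la", "est", "et"));

    // Tokenizer step: stamps the language requested by the caller.
    static List<Tok> tokenize(String text, Lang requested) {
        List<Tok> tokens = new ArrayList<>();
        for (String form : text.trim().split("\\s+")) {
            if (!form.isEmpty()) {
                tokens.add(new Tok(form, requested));
            }
        }
        return tokens;
    }

    // Toy detector: counts stop-word hits; falls back when there is a tie or no hit.
    static Lang detect(List<Tok> tokens, Lang fallback) {
        int en = 0;
        int fr = 0;
        for (Tok t : tokens) {
            String w = t.form.toLowerCase(Locale.ROOT);
            if (EN_WORDS.contains(w)) en++;
            if (FR_WORDS.contains(w)) fr++;
        }
        if (en > fr) return Lang.EN;
        if (fr > en) return Lang.FR;
        return fallback;
    }

    // Pipe step: the detector replaces the requested language on every token.
    static List<Tok> process(String text, Lang requested) {
        List<Tok> tokens = tokenize(text, requested);
        Lang detected = detect(tokens, requested);
        for (Tok t : tokens) {
            t.lng = detected;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Tok t : process("le chat est la", Lang.EN)) {
            System.out.println(t.form + " lng[" + t.lng + "]");
        }
    }
}
```

Here the caller asks for EN, but the French stop words outweigh the English ones, so every token ends up tagged FR. A real detector would rely on much stronger evidence than a handful of stop words.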