Language Detection with Java
The MOTPipe.process() method accepts the language as a parameter. The
tokenizer then assigns this language to every token it creates.
Let us rerun the previous example using Language.EN instead of
Language.XX. Note how the lng attribute of the tokens
changes from lng[xx] to lng[en].
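To make the mechanism concrete, here is a minimal, self-contained sketch of a tokenizer that stamps the caller's language on each token. It does not use the library: Lang, Tok and tokenize() are hypothetical stand-ins for Language, the token type and the tokenizer, and the lng[..] rendering only mimics the output described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LanguageStampSketch {
    /** Hypothetical stand-in for the library's Language enum. */
    public enum Lang { XX, EN }

    /** Hypothetical token: the text plus the language it was created with. */
    public static final class Tok {
        public final String text;
        public final Lang lang;

        Tok(String text, Lang lang) {
            this.text = text;
            this.lang = lang;
        }

        @Override
        public String toString() {
            return text + " lng[" + lang.name().toLowerCase(Locale.ROOT) + "]";
        }
    }

    /** Splits on whitespace and tags every token with the caller's language. */
    public static List<Tok> tokenize(String text, Lang lang) {
        List<Tok> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(new Tok(word, lang));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello world", Lang.XX)); // [hello lng[xx], world lng[xx]]
        System.out.println(tokenize("hello world", Lang.EN)); // [hello lng[en], world lng[en]]
    }
}
```

The only point of the sketch is that the language is an input to tokenization and is not inferred from the text, which is why changing the argument changes the attribute on every token.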
Processors may support all languages, a single language, or a subset of languages. A lemmatizer, for example, processes only tokens in the language of its resource.
The following snippet creates a language detector processor and declares it in a pipe.
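// Tokenizers that split the input text into tokens
List<Tokenizer> tokenizers = new ArrayList<Tokenizer>();
tokenizers.add(new StandardTokenizer());
// Language detector, registered under the name "languageDetector"
LanguageDetector languageDetector = new LanguageDetector();
languageDetector.setName("languageDetector");
// Semantic processors of the pipe; here, only the language detector
List<SemanticProcessor> processors = new ArrayList<SemanticProcessor>();
processors.add(languageDetector);
// Build the pipe from the tokenizers, a new normalizer configuration, and the processors
pipe = LinguisticFactory.buildPipe(tokenizers, new NormalizerConfig(), processors);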
With this initialization, running the program detects the language of tokens. The processor
replaces the language used in the call to MOTPipe.process() with the one it
detects.
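The override behaviour can be pictured with the following self-contained sketch. It is an illustration only, not the library's implementation: Lang, Tok, detect() and process() are hypothetical names, the stopword count is a toy heuristic chosen for brevity, and keeping the caller's language when nothing is detected is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class DetectorOverrideSketch {
    /** Hypothetical language codes; XX stands for "unknown". */
    public enum Lang { XX, EN, FR }

    /** Hypothetical token whose language tag a processor may rewrite. */
    public static final class Tok {
        public final String text;
        public Lang lang;

        Tok(String text, Lang lang) {
            this.text = text;
            this.lang = lang;
        }
    }

    private static final Set<String> EN_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "and", "of"));
    private static final Set<String> FR_WORDS =
            new HashSet<>(Arrays.asList("le", "la", "est", "et"));

    /** Toy detector: counts stopword hits and returns XX when undecided. */
    public static Lang detect(List<Tok> tokens) {
        int en = 0;
        int fr = 0;
        for (Tok t : tokens) {
            String w = t.text.toLowerCase(Locale.ROOT);
            if (EN_WORDS.contains(w)) en++;
            if (FR_WORDS.contains(w)) fr++;
        }
        if (en > fr) return Lang.EN;
        if (fr > en) return Lang.FR;
        return Lang.XX;
    }

    /** Tokenizes with the caller's language, then lets the detector override it. */
    public static List<Tok> process(String text, Lang callerLang) {
        List<Tok> tokens = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                tokens.add(new Tok(w, callerLang));
            }
        }
        Lang detected = detect(tokens);
        if (detected != Lang.XX) {
            // Sketch assumption: the caller's language survives only if nothing was detected.
            for (Tok t : tokens) {
                t.lang = detected;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The caller says EN, but the text is French: the tokens end up tagged fr.
        for (Tok t : process("le chat est sur la table", Lang.EN)) {
            System.out.println(t.text + " lng[" + t.lang.name().toLowerCase(Locale.ROOT) + "]");
        }
    }
}
```

In this sketch the language passed to process() acts as a default that the detector is free to replace, which mirrors the behaviour described above for the pipe.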