Perform Language Detection and Word Lemmatization

This page discusses:

Language Detection with Java
Language Detection with XML Configuration File
Retrieve Document Annotations
Lemmatization with Java
Lemmatization with an XML Configuration File

Language Detection with Java

The MOTPipe.process() method accepts a language as a parameter. The tokenizer then assigns this language to every token it creates.

Rerun the previous example with Language.EN instead of Language.XX. The lng attribute of the tokens changes from lng[xx] to lng[en].

Processors may support all languages, a single language, or a subset of languages. A lemmatizer, for example, only handles tokens in the language of its resource.

Recommendation: Perform a detection when the language of the content is unknown, or when it may contain several languages. To do so, use the Language Detector processor.

The following snippet illustrates the creation of such a processor and its declaration in a pipe.

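// Tokenizers split the raw text into tokens.
List<Tokenizer> tokenizers = new ArrayList<Tokenizer>();
tokenizers.add(new StandardTokenizer());
// The language detector replaces the language passed to MOTPipe.process().
LanguageDetector languageDetector = new LanguageDetector();
languageDetector.setName("languageDetector");
// Semantic processors are declared in a list that is passed to the pipe.
List<SemanticProcessor> processors = new ArrayList<SemanticProcessor>();
processors.add(languageDetector);
// Build the pipe from the tokenizers, a new NormalizerConfig, and the processors.
pipe = LinguisticFactory.buildPipe(tokenizers, new NormalizerConfig(), processors);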

With this initialization, running the program detects the language of tokens. The processor replaces the language used in the call to MOTPipe.process() with the one it detects.

Note: A sufficient number of words is required to correctly perform the detection. In a long text containing different languages, the processor gives each token its most probable language.

Language Detection with XML Configuration File

The following configuration file illustrates the creation of the processor and its declaration in a pipe.

<ling:MOTConfig xmlns:ling="exa:com.exalead.linguistic.v10">
  <!-- Tokenizers -->
  <ling:StandardTokenizer/>

  <!-- Normalizer -->
  <ling:NormalizerConfig />

  <!-- Semantic Processors -->
  <ling:LanguageDetector name="languageDetector" />
</ling:MOTConfig>

Retrieve Document Annotations

Both tokens and documents can receive annotations. To retrieve document annotations, call MOTPipe.getDocumentAnnotations() after the call to MOTPipe.endDocument(). Processors usually use endDocument() as the trigger to add their annotations to the document.

The Language Detector processor adds one annotation to the document for each language it detects. The following snippet retrieves and prints these annotations:

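pipe.newDocument();
pipe.newField("field");
// The detector replaces Language.XX with the language it detects.
AnnotatedToken[] tokens = pipe.process(content, Language.XX);
print(tokens);
// Document annotations are retrieved after the document is closed.
pipe.endDocument();
Annotation[] annotations = pipe.getDocumentAnnotations();
System.out.println("  Document");
print(annotations);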

You obtain the following result:

Enter a sentence to parse: have a good day
  Token[have] kind[ALPHA] lng[en] offset[0]
    Annotation[have] tag[LOWERCASE] nbTokens[1]
    Annotation[have] tag[NORMALIZE] nbTokens[1]
  Token[ ] kind[SEP_SPACE] lng[en] offset[4]
  Token[a] kind[ALPHA] lng[en] offset[5]
    Annotation[a] tag[LOWERCASE] nbTokens[1]
    Annotation[a] tag[NORMALIZE] nbTokens[1]
  Token[ ] kind[SEP_SPACE] lng[en] offset[6]
  Token[good] kind[ALPHA] lng[en] offset[7]
    Annotation[good] tag[LOWERCASE] nbTokens[1]
    Annotation[good] tag[NORMALIZE] nbTokens[1]
  Token[ ] kind[SEP_SPACE] lng[en] offset[11]
  Token[day] kind[ALPHA] lng[en] offset[12]
    Annotation[day] tag[LOWERCASE] nbTokens[1]
    Annotation[day] tag[NORMALIZE] nbTokens[1]
  Document
    Annotation[en] tag[language] nbTokens[0]
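
If you only have the printed trace at hand (for example in a log), the document-level language annotations can be extracted from it. The following is a minimal sketch, assuming the exact line format shown above. The TraceLanguages class and its documentLanguages() method are hypothetical helpers written for this illustration and are not part of the linguistic API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses the printed trace, not the Annotation objects themselves.
final class TraceLanguages {
    // Matches lines such as: Annotation[en] tag[language] nbTokens[0]
    private static final Pattern ANNOTATION =
            Pattern.compile("Annotation\\[(.*?)\\] tag\\[(.*?)\\] nbTokens\\[(\\d+)\\]");

    // Returns the values of the "language" annotations listed after the "Document" line.
    static List<String> documentLanguages(String trace) {
        List<String> languages = new ArrayList<>();
        boolean inDocument = false;
        for (String line : trace.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.equals("Document")) {
                inDocument = true;   // everything after this line is document-level
                continue;
            }
            if (!inDocument) {
                continue;            // skip token lines and token-level annotations
            }
            Matcher m = ANNOTATION.matcher(trimmed);
            if (m.matches() && m.group(2).equals("language")) {
                languages.add(m.group(1));
            }
        }
        return languages;
    }

    public static void main(String[] args) {
        String trace = String.join("\n",
                "  Token[day] kind[ALPHA] lng[en] offset[12]",
                "    Annotation[day] tag[NORMALIZE] nbTokens[1]",
                "  Document",
                "    Annotation[en] tag[language] nbTokens[0]");
        System.out.println(documentLanguages(trace));
    }
}
```

In application code, reading the Annotation objects returned by MOTPipe.getDocumentAnnotations() directly is preferable to parsing printed output.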

Lemmatization with Java

Lemmatization is a common linguistic task that consists in identifying the lemma of each word using a language dictionary.
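
As a toy illustration of the idea, the sketch below looks up each word in a small dictionary and only handles tokens whose language matches the dictionary language. The ToyLemmatizer class and its sample dictionary are hypothetical and written for this page only. It is not the product's Lemmatizer processor, which relies on its own language resources.

```java
import java.util.Locale;
import java.util.Map;

// Toy illustration only: not the product's Lemmatizer class.
final class ToyLemmatizer {
    private final String language;             // language of the dictionary resource
    private final Map<String, String> lemmas;  // inflected form -> lemma

    ToyLemmatizer(String language, Map<String, String> lemmas) {
        this.language = language;
        this.lemmas = lemmas;
    }

    // Returns the lemma when the token language matches the dictionary language
    // and the form is known; otherwise returns the lowercased word unchanged.
    String lemmatize(String word, String tokenLanguage) {
        String form = word.toLowerCase(Locale.ROOT);
        if (!language.equals(tokenLanguage)) {
            return form;
        }
        return lemmas.getOrDefault(form, form);
    }

    public static void main(String[] args) {
        ToyLemmatizer english = new ToyLemmatizer("en", Map.of("books", "book", "days", "day"));
        System.out.println(english.lemmatize("Books", "en"));   // prints "book"
        System.out.println(english.lemmatize("livres", "fr"));  // prints "livres" (language mismatch)
    }
}
```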

To enable lemmatization in both English and French, add two dedicated processors, one per language. To do so, extend the pipe creation shown in Language Detection with Java with the following snippet, so that both lemmatizers are in the processors list when the pipe is built:

// English lemmatizer: handles tokens whose language is "en".
Lemmatizer englishLemmatizer = new Lemmatizer();
englishLemmatizer.setName("englishLemmatizer");
englishLemmatizer.setLanguage("en");
processors.add(englishLemmatizer);
// French lemmatizer: handles tokens whose language is "fr".
Lemmatizer frenchLemmatizer = new Lemmatizer();
frenchLemmatizer.setName("frenchLemmatizer");
frenchLemmatizer.setLanguage("fr");
processors.add(frenchLemmatizer);

The lemmatizers then enrich the generated tokens. The output looks like this:

Token[books] kind[ALPHA] lng[en] offset[65]
  Annotation[books] tag[LOWERCASE] nbTokens[1]
  Annotation[books] tag[NORMALIZE] nbTokens[1]
  Annotation['book', 'book', 'n', 'p', 'book', 'book', 'noun'] tag[lemmainformation] nbTokens[1]

Here, a new annotation is added to the token books. It is an array containing the following information: [lowercase masculine singular, normalized masculine singular, gender, number, lowercase singular, normalized singular, category]. To access these values more conveniently, call the Lemmatizer.deserialize() method, which returns a LemmaInformation instance that provides accessors.
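
To make the array layout concrete, the following sketch maps the seven positions to named fields, in the order listed above. The LemmaFields class and its field names are hypothetical stand-ins written for this illustration. The actual accessors of LemmaInformation are not shown on this page, so prefer Lemmatizer.deserialize() in real code.

```java
// Hypothetical stand-in for illustration; not the product's LemmaInformation class.
final class LemmaFields {
    final String lowercaseMasculineSingular;
    final String normalizedMasculineSingular;
    final String gender;
    final String number;
    final String lowercaseSingular;
    final String normalizedSingular;
    final String category;

    private LemmaFields(String[] v) {
        lowercaseMasculineSingular = v[0];
        normalizedMasculineSingular = v[1];
        gender = v[2];
        number = v[3];
        lowercaseSingular = v[4];
        normalizedSingular = v[5];
        category = v[6];
    }

    // Maps the seven-element array to named fields; rejects any other length.
    static LemmaFields fromArray(String[] values) {
        if (values == null || values.length != 7) {
            throw new IllegalArgumentException("expected exactly 7 values");
        }
        return new LemmaFields(values);
    }

    public static void main(String[] args) {
        // Values taken from the annotation of the token "books" shown above.
        LemmaFields fields = fromArray(
                new String[] {"book", "book", "n", "p", "book", "book", "noun"});
        // Prints: book (noun, number=p)
        System.out.println(fields.lowercaseSingular
                + " (" + fields.category + ", number=" + fields.number + ")");
    }
}
```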

A lemmatizer may add several annotations to the same token when the word is ambiguous. To disambiguate, add a processor such as a Part of Speech Tagger after the lemmatizers.

Lemmatization with an XML Configuration File

The following configuration file declares the language detector and the two lemmatizers in a pipe.

<ling:MOTConfig xmlns:ling="exa:com.exalead.linguistic.v10">
  <!-- Tokenizers -->
  <ling:StandardTokenizer />

  <!-- Normalizer -->
  <ling:NormalizerConfig />

  <!-- Semantic Processors -->
  <ling:LanguageDetector name="languageDetector" />
  <ling:Lemmatizer name="englishLemmatizer" language="en" />
  <ling:Lemmatizer name="frenchLemmatizer" language="fr" />
</ling:MOTConfig>