Extract Named Entities

You can extract more complex information, such as named entities.

Use Case
Note: Dependencies that are not declared in the configuration are implicitly included during initialization.

A dedicated processor, the NamedEntitiesMatcher, is available for this task. For more information, see the CloudView Configuration Reference and Named Entities Matcher in the Exalead CloudView Configuration Guide.

To use it, configure it in the XML configuration file as follows:

<NamedEntitiesMatcher name="neMatcher" prefix="NE" />

The required dependencies (RelatedTerms and, recursively, PartOfSpeechTagger) are added automatically during initialization. You do not need to declare them explicitly in the configuration.

The NamedEntitiesMatcher processor adds named-entity annotations to the tokens.

Note: An annotation is added only to the first token of each entity.
The nbTokens property indicates the scope of the annotation, that is, the number of consecutive tokens it covers, starting from the annotated token and counting separator tokens. The result looks like the output below:

  Token[Barack] kind[ALPHA] lng[xx] offset[0]
    Annotation[barack] tag[LOWERCASE] nbTokens[1]
    Annotation[barack] tag[NORMALIZE] nbTokens[1]
    Annotation[barack obama] tag[relatedTerm] nbTokens[3]
    Annotation[Barack Obama] tag[relatedTermDisplay] nbTokens[3]
    Annotation[Barack Obama] tag[exalead.people] nbTokens[3]
    Annotation[] tag[exalead.nlp.firstnames] nbTokens[1]
    Annotation[famouspeople] tag[NE] nbTokens[3]
    Annotation[1] tag[sub] nbTokens[3]
    Annotation[Barack Obama] tag[NE.famouspeople] nbTokens[3]
  Token[ ] kind[SEP_SPACE] lng[xx] offset[6]
  Token[Obama] kind[ALPHA] lng[xx] offset[7][...]
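
To make the role of nbTokens concrete, here is a minimal Python sketch that parses a dump in the textual format shown above and rebuilds the text covered by each annotation. It is not part of the CloudView API: the names parse_dump and covered_text are hypothetical, and the sketch assumes the dump always follows the Token[...] / Annotation[...] layout of this example. The sample is limited to the three tokens shown, with only two of the annotations kept for brevity.

```python
import re

# Dump copied from the output above: the three tokens it shows,
# with only two of the annotations kept for brevity.
DUMP = """\
Token[Barack] kind[ALPHA] lng[xx] offset[0]
    Annotation[barack] tag[LOWERCASE] nbTokens[1]
    Annotation[Barack Obama] tag[NE.famouspeople] nbTokens[3]
Token[ ] kind[SEP_SPACE] lng[xx] offset[6]
Token[Obama] kind[ALPHA] lng[xx] offset[7]
"""

# Assumption: every line follows one of these two layouts.
TOKEN_RE = re.compile(r"Token\[(.*?)\] kind\[(\w+)\] lng\[(\w+)\] offset\[(\d+)\]")
ANNOT_RE = re.compile(r"Annotation\[(.*?)\] tag\[(.*?)\] nbTokens\[(\d+)\]")


def parse_dump(text):
    """Return a list of tokens, each with the annotations attached to it."""
    tokens = []
    for raw in text.splitlines():
        line = raw.strip()
        token = TOKEN_RE.match(line)
        if token:
            tokens.append({
                "form": token.group(1),
                "kind": token.group(2),
                "offset": int(token.group(4)),
                "annotations": [],
            })
            continue
        annotation = ANNOT_RE.match(line)
        if annotation and tokens:
            # An annotation line belongs to the most recent token line.
            tokens[-1]["annotations"].append(
                (annotation.group(1), annotation.group(2), int(annotation.group(3)))
            )
    return tokens


def covered_text(tokens, start, nb_tokens):
    """Join the forms of the nb_tokens tokens starting at index start."""
    return "".join(t["form"] for t in tokens[start:start + nb_tokens])


tokens = parse_dump(DUMP)
for index, token in enumerate(tokens):
    for value, tag, nb_tokens in token["annotations"]:
        print(f"{tag}: {value!r} covers {covered_text(tokens, index, nb_tokens)!r}")
```

In this sample, the NE.famouspeople annotation sits on the first token only, and its nbTokens value of 3 spans "Barack", the space separator, and "Obama".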

Here is a summary of the tag values:

  • relatedTerm = noun phrase in lemmatized and normalized form

  • relatedTermDisplay = noun phrase full-text form

  • sub = internal indicator that you can ignore

  • exalead.nlp.firstnames = internal indicator (presence of a first name)

  • exalead.people = internal indicator (entity present in the NETVIBES persons dictionary built from Wikipedia, Freebase, and DBpedia)

  • NE and NE.famouspeople = result of named entity detection, in two different forms based on the specified prefix (that is, "NE" in our example)

The last annotation (NE.famouspeople in this example) is the most useful. Its content is the canonical form of the entity, and its tag specifies the entity type. You can rely on the tag prefix (defined in the configuration file) to locate named-entity annotations. When possible, the content of the annotation is a canonical form of the entity that is either specified in a thesaurus (for example, "United States" and "U.S." are normalized to "USA") or inferred with linguistic rules.
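
As an illustration of locating entities by tag prefix, the following Python sketch filters a list of annotations and keeps those whose tag starts with the configured prefix followed by a dot. It works on plain (value, tag, nbTokens) tuples copied from the example output rather than on CloudView objects, and the function name named_entities is hypothetical.

```python
# Annotations carried by the first token, copied from the example output
# as (value, tag, nbTokens) tuples.
ANNOTATIONS = [
    ("barack", "LOWERCASE", 1),
    ("barack obama", "relatedTerm", 3),
    ("Barack Obama", "exalead.people", 3),
    ("famouspeople", "NE", 3),
    ("1", "sub", 3),
    ("Barack Obama", "NE.famouspeople", 3),
]


def named_entities(annotations, prefix="NE"):
    """Keep annotations tagged '<prefix>.<type>' and split out the type."""
    marker = prefix + "."
    entities = []
    for value, tag, nb_tokens in annotations:
        if tag.startswith(marker):
            entities.append({
                "type": tag[len(marker):],   # for example "famouspeople"
                "canonical": value,          # canonical form of the entity
                "nbTokens": nb_tokens,       # scope of the annotation
            })
    return entities


print(named_entities(ANNOTATIONS))
```

Matching on the prefix plus a dot skips the bare NE annotation, whose content is the entity type rather than the canonical form, and avoids picking up unrelated tags that merely begin with the same letters.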