Extract Named Entities

Note: Dependencies that are not declared in the configuration are implicitly included during initialization.

A dedicated processor, the NamedEntitiesMatcher, is available for this task. For more information, see the CloudView Configuration Reference and Named Entities Matcher in the Exalead CloudView Configuration Guide.

To use it, configure it in the XML configuration file as follows:

<NamedEntitiesMatcher name="neMatcher" prefix="NE" />

Required dependencies (RelatedTerms and, recursively, PartOfSpeechTagger) are added automatically during initialization, so you do not need to declare them explicitly in the configuration.

The NamedEntitiesMatcher processor adds named entity annotations to the tokens.

Note: Annotations are added only on the first token of each entity.

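The nbTokens property indicates the scope of the annotation, that is, the number of consecutive tokens it covers, starting from the token that carries it. In the output below, an annotation with nbTokens[3] on the token "Barack" spans "Barack", the space separator, and "Obama":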

Token[Barack] kind[ALPHA] lng[xx] offset[0]
    Annotation[barack] tag[LOWERCASE] nbTokens[1]
    Annotation[barack] tag[NORMALIZE] nbTokens[1]
    Annotation[barack obama] tag[relatedTerm] nbTokens[3]
    Annotation[Barack Obama] tag[relatedTermDisplay] nbTokens[3]
    Annotation[Barack Obama] tag[exalead.people] nbTokens[3]
    Annotation[] tag[exalead.nlp.firstnames] nbTokens[1]
    Annotation[famouspeople] tag[NE] nbTokens[3]
    Annotation[1] tag[sub] nbTokens[3]
    Annotation[Barack Obama] tag[NE.famouspeople] nbTokens[3]
Token[ ] kind[SEP_SPACE] lng[xx] offset[6]
Token[Obama] kind[ALPHA] lng[xx] offset[7]
[...]
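
To make the role of nbTokens concrete, the following minimal Python sketch recomputes the text covered by an annotation from the token it is attached to. It works on hand-written data mirroring the output above and does not call any CloudView API; the Annotation class and the covered_text function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical in-memory mirror of one Annotation line in the dump."""
    value: str
    tag: str
    nb_tokens: int


def covered_text(tokens, start, annotation):
    """Return the surface text spanned by an annotation carried by tokens[start]."""
    return "".join(tokens[start:start + annotation.nb_tokens])


# Token stream of the example: separators count as tokens.
tokens = ["Barack", " ", "Obama"]

ne = Annotation("Barack Obama", "NE.famouspeople", 3)
first_name = Annotation("", "exalead.nlp.firstnames", 1)

print(covered_text(tokens, 0, ne))          # Barack Obama
print(covered_text(tokens, 0, first_name))  # Barack
```

Because the space separator is itself a token, a two-word entity has nbTokens[3], not 2.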

Here is a summary of the tag values:

relatedTerm = noun phrase in lemmatized and normalized form
relatedTermDisplay = noun phrase in full-text form
sub = internal indicator that you can ignore
exalead.nlp.firstnames = internal indicator (presence of a first name)
exalead.people = internal indicator (entity present in the NETVIBES persons dictionary built from Wikipedia/Freebase/DBpedia)
NE and NE.famouspeople = result of named entity detection in two different forms, both based on the specified prefix (that is, "NE" in our example)

The last annotation (NE.famouspeople in our example) is the most useful. Its content is the canonical form of the entity, and its tag specifies the entity type. You can rely on the tag prefix (defined in the configuration file) to locate named entity annotations. When possible, the content of the annotation is a canonical form of the entity, either specified in a thesaurus (for example, "United States" and "U.S." are normalized to "USA") or inferred with linguistic rules.
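
As an illustration of locating entities by tag prefix, here is a minimal Python sketch that scans a textual dump in the format shown above and keeps only the annotations whose tag starts with the configured prefix followed by a dot. It parses the printed output only and does not use any CloudView API; extract_entities, ANNOTATION_RE, and SAMPLE are hypothetical names, and the regular expression assumes the line layout of the example.

```python
import re

# Matches lines such as: Annotation[Barack Obama] tag[NE.famouspeople] nbTokens[3]
ANNOTATION_RE = re.compile(
    r"Annotation\[(?P<value>.*?)\]\s+tag\[(?P<tag>[^\]]*)\]\s+nbTokens\[(?P<n>\d+)\]"
)


def extract_entities(dump, prefix="NE"):
    """Return (entity_type, canonical_form) pairs for tags of the form '<prefix>.<type>'."""
    entities = []
    for line in dump.splitlines():
        match = ANNOTATION_RE.search(line)
        if match and match.group("tag").startswith(prefix + "."):
            entity_type = match.group("tag")[len(prefix) + 1:]
            entities.append((entity_type, match.group("value")))
    return entities


SAMPLE = """\
Token[Barack] kind[ALPHA] lng[xx] offset[0]
    Annotation[barack obama] tag[relatedTerm] nbTokens[3]
    Annotation[] tag[exalead.nlp.firstnames] nbTokens[1]
    Annotation[famouspeople] tag[NE] nbTokens[3]
    Annotation[1] tag[sub] nbTokens[3]
    Annotation[Barack Obama] tag[NE.famouspeople] nbTokens[3]
"""

print(extract_entities(SAMPLE))  # [('famouspeople', 'Barack Obama')]
```

Filtering on the prefix plus a dot skips the bare NE annotation, whose content is only the entity type, and keeps the form that carries the canonical entity name.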