Test the Semantic Processing of your Analysis Pipeline

To test the semantic processing of your analysis pipeline, use the semantic annotate function of the cvdebug command-line tool, located in the <DATADIR>/bin directory.

cvconsole cvdebug > semantic annotate [args]

Where possible arguments [args] are:

[buildGroup] – Build group name (default: bg0)
[context] – Context of the chunk (type: STRING)
[language] – ISO code of the language (type: ISO_CODE)
[pipeline] – Analysis pipeline to use (default: ap0)
[value] – (Required) Text to process, standard input if missing (type: STRING)

Example:

Assume that our analysis configuration contains only one pipeline, and that this pipeline contains a single semantic processor, the Named Entities Matcher. This processor produces Named Entities annotations.

We start the semantic annotate function to test the Named Entities Matcher with the following textual input.

cvconsole cvdebug > semantic annotate value="Bill Keller and B. Obama" language=en

Running this command gives the following XML output.

<AnnotatedToken token="Bill" kind="ALPHA" lang="en" offset="0">
  <Annotation displayForm="bill" displayKind="lowercase" tag="LOWERCASE" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="bill" displayKind="normalized" tag="NORMALIZE" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="" displayKind="exact" tag="exalead.nlp.firstnames" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="BILL" displayKind="exact" tag="exalead.loose.nlp.firstnames" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="person" displayKind="exact" tag="NE" nbTokens="3" trustLevel="100" />
  <Annotation displayForm="2" displayKind="exact" tag="sub" nbTokens="1" trustLevel="100" />
  <Annotation displayForm="Bill Keller" displayKind="exact" tag="NE.person" nbTokens="3" trustLevel="100" />
</AnnotatedToken><AnnotatedToken token=" " kind="SEP_SPACE" lang="en" offset="4">
</AnnotatedToken><AnnotatedToken token="Keller" kind="ALPHA" lang="en" offset="5">
  <Annotation displayForm="keller" displayKind="lowercase" tag="LOWERCASE" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="keller" displayKind="normalized" tag="NORMALIZE" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="" displayKind="exact" tag="exalead.nlp.firstnames" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="KELLER" displayKind="exact" tag="exalead.loose.nlp.firstnames" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="3" displayKind="exact" tag="sub" nbTokens="1" trustLevel="100" />
</AnnotatedToken><AnnotatedToken token="and" kind="ALPHA" lang="en" offset="12">
  <Annotation displayForm="and" displayKind="lowercase" tag="LOWERCASE" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="and" displayKind="normalized" tag="NORMALIZE" nbTokens="1" trustLevel="0" />
</AnnotatedToken><AnnotatedToken token=" " kind="SEP_SPACE" lang="en" offset="15">
</AnnotatedToken><AnnotatedToken token="B" kind="ALPHA" lang="en" offset="16">
  <Annotation displayForm="b" displayKind="lowercase" tag="LOWERCASE" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="b" displayKind="normalized" tag="NORMALIZE" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="Barack Obama" displayKind="exact" tag="exalead.people" nbTokens="4" trustLevel="100" />
  <Annotation displayForm="famousperson" displayKind="exact" tag="NE" nbTokens="4" trustLevel="100" />
  <Annotation displayForm="1" displayKind="exact" tag="sub" nbTokens="4" trustLevel="100" />
  <Annotation displayForm="Barack Obama" displayKind="exact" tag="NE.famousperson" nbTokens="4" trustLevel="100" />
</AnnotatedToken><AnnotatedToken token="." kind="PUNCT" lang="en" offset="17">
  <Annotation displayForm="PUNCT" displayKind="exact" tag="tagger" nbTokens="1" trustLevel="100" />
</AnnotatedToken><AnnotatedToken token=" " kind="SEP_SPACE" lang="en" offset="18">
</AnnotatedToken><AnnotatedToken token="Obama" kind="ALPHA" lang="en" offset="19">
  <Annotation displayForm="obama" displayKind="lowercase" tag="LOWERCASE" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="obama" displayKind="normalized" tag="NORMALIZE" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="Obama" displayKind="exact" tag="exalead.loose.world" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="Obama" displayKind="exact" tag="exalead.loose.world.subnational_entities" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="Obama" displayKind="exact" tag="exalead.loose.world.subnational_entities.cities" nbTokens="1" trustLevel="0" />
</AnnotatedToken>

Note: For details about the XML tags, see Appendix - Semantic Resources Reference. Keep in mind that this XML output is a serialization of the underlying Java objects manipulated by the semantic pipeline.
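
To inspect output like this outside the console, the following is a minimal Python sketch, not part of the product. It assumes the output has been captured as text; because the output consists of several top-level AnnotatedToken nodes, the sketch wraps it in a dummy root element (the name "tokens" is arbitrary) before parsing. The sample string is a shortened excerpt of the output above, and the function name annotations_by_token is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the output shown above (attribute values copied from it).
SAMPLE = '''
<AnnotatedToken token="Bill" kind="ALPHA" lang="en" offset="0">
  <Annotation displayForm="bill" displayKind="lowercase" tag="LOWERCASE" nbTokens="1" trustLevel="0" />
  <Annotation displayForm="Bill Keller" displayKind="exact" tag="NE.person" nbTokens="3" trustLevel="100" />
</AnnotatedToken><AnnotatedToken token=" " kind="SEP_SPACE" lang="en" offset="4">
</AnnotatedToken><AnnotatedToken token="Keller" kind="ALPHA" lang="en" offset="5">
  <Annotation displayForm="keller" displayKind="lowercase" tag="LOWERCASE" nbTokens="1" trustLevel="0" />
</AnnotatedToken>
'''


def annotations_by_token(xml_fragment):
    """Return a list of (token, offset, [(tag, displayForm, nbTokens), ...])."""
    # The fragment has no single root node, so wrap it in a dummy one.
    root = ET.fromstring("<tokens>" + xml_fragment + "</tokens>")
    rows = []
    for token_node in root.findall("AnnotatedToken"):
        annotations = [
            (a.get("tag"), a.get("displayForm"), int(a.get("nbTokens")))
            for a in token_node.findall("Annotation")
        ]
        rows.append((token_node.get("token"), int(token_node.get("offset")), annotations))
    return rows


rows = annotations_by_token(SAMPLE)
for token, offset, annotations in rows:
    print(repr(token), offset, annotations)
```

Listing the annotations per token this way makes it easier to check which tags a processor produced and how many tokens each annotation spans.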

This is how the pipeline processes the textual input:

The pipeline processes each token ("Bill", " ", "Keller", "and", " ", "B", ".", " ", "Obama") separately. The output therefore contains one AnnotatedToken node per token in the textual input.
Each token goes through the pipeline, and each processor can generate one or more Annotation Java objects, which are appended to the AnnotatedToken object.

Focus on the two main attributes of an Annotation, tag and displayForm, which you can think of as a (key, value) pair describing the content of the Annotation.
- tag – the name of the annotation. For example, "Bill" is labeled as a first name using the tag "exalead.nlp.firstnames"
- displayForm – the value of the annotation (may be empty). This attribute is very useful for normalization purposes. For example, in a sentence containing "Barack Obama", "B. Obama", and "Obama Barack", all three n-grams may be annotated with the tag "exalead.people" and the same displayForm "Barack Obama".
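
The (key, value) view can be sketched in a few lines of Python. This is only an illustration of the normalization idea: the mentions list is hand-written from the example above, not produced by the tool, and the variable names are hypothetical.

```python
from collections import Counter

# (surface text, tag, displayForm) triples, written by hand from the example above.
mentions = [
    ("Barack Obama", "exalead.people", "Barack Obama"),
    ("B. Obama", "exalead.people", "Barack Obama"),
    ("Obama Barack", "exalead.people", "Barack Obama"),
]

# Counting on (tag, displayForm) ignores the surface text, so all three
# variants fall under the same normalized key.
counts = Counter((tag, display_form) for _, tag, display_form in mentions)
print(counts)
```

Because the three surface forms share one displayForm, they collapse into a single entry.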
The Named Entities Matcher processor annotates the "Bill" token because it begins "Bill Keller" (the display form), which is tagged NE.person. Notice that this Annotation object has its nbTokens attribute set to 3. This reveals the way processors work:
- When processing a token (here "Bill"), the processor checks in its resources whether the token matches the beginning of a displayForm. If so, the generated Annotation includes the number of tokens involved, for example, 3 for "Bill Keller" ("Bill", " ", "Keller").
  Note: Processors work forward, never backward. They consider the tokens that follow the current one, but not the preceding ones.
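
The forward-only behavior described above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the actual Named Entities Matcher: the RESOURCE dictionary, its keying by exact token sequences, and the annotate function are all hypothetical.

```python
# Hypothetical resource: token sequence -> (tag, displayForm).
RESOURCE = {
    ("Bill", " ", "Keller"): ("NE.person", "Bill Keller"),
    ("B", ".", " ", "Obama"): ("exalead.people", "Barack Obama"),
}

TOKENS = ["Bill", " ", "Keller", " ", "and", " ", "B", ".", " ", "Obama"]


def annotate(tokens, resource):
    """Attach annotations to the token where a resource entry starts."""
    result = []
    for i, token in enumerate(tokens):
        found = []
        for pattern, (tag, display_form) in resource.items():
            n = len(pattern)
            # Look only at the current token and the ones that follow it.
            if tuple(tokens[i:i + n]) == pattern:
                found.append({"tag": tag, "displayForm": display_form, "nbTokens": n})
        result.append((token, found))
    return result


result = annotate(TOKENS, RESOURCE)
for token, found in result:
    print(repr(token), found)
```

In this model, "Bill" carries the NE.person annotation with nbTokens set to 3, while "Keller" carries none, because the matcher never looks back at tokens it has already passed.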

Once the pipeline has produced these annotations, they may be mapped to produce as many index fields or categories as required.