Use Case

For such a task, a simple RulesMatcher processor is adequate. This processor matches patterns, expressed as rules, against a token stream. Rules are described in a dedicated XML configuration file. For the syntax description, see Rules Matcher (Rule-Based) in the Exalead CloudView Configuration Guide.

Example: the following sample shows how to extract French vehicle registration plates.

<TRules xmlns="exa:com.exalead.mot.components.transducer">
  <!-- SIV (e.g. AA-229-AA) -->
  <TRule priority="0">
    <MatchAnnotation kind="NE.plates.SIV"/>
    <Seq>
      <Or>
        <TokenRegexp value="[A-Za-z]{2}"/>
        <!-- to match temporary plates starting with only one W (e.g. W-001-AA) -->
        <Word value="W" level="exact"/>
      </Or>
      <Opt> <Word value="-" level="exact"/> </Opt>
      <TokenRegexp value="[0-9]{3}"/>  <!-- digits may include 0 (e.g. 001) -->
      <Opt> <Word value="-" level="exact"/> </Opt>
      <TokenRegexp value="[A-Za-z]{2}"/>
    </Seq>
  </TRule>

  <!-- FNI (e.g. 1233 CD 33) -->
  <TRule priority="1">
    <MatchAnnotation kind="NE.plates.FNI"/>
    <Or>
      <Seq>
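        <TokenRegexp value="[1-9][0-9]{0,3}"/>  <!-- 1 to 9999, zeros allowed after the first digit -->
        <TokenRegexp value="[A-Za-z]{1,3}"/>
        <Or>
          <TokenRegexp value="[0-9]{2}"/>  <!-- two-digit department code (e.g. 01, 10, 33) -->
          <TokenRegexp value="2[AB]"/>  <!-- Corsica -->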
          <TokenRegexp value="97[1-8]"/>  <!-- DOM-TOM -->
        </Or>
      </Seq>
      <Seq>
        <TokenRegexp value="[1-9][0-9]{0,5}"/>
        <Or>
          <Word value="NC" level="exact"/>  <!-- New Caledonia -->
          <Word value="P" level="exact"/>  <!-- French Polynesia -->
        </Or>
      </Seq>
      <Seq>  <!-- TAAF - Kerguelen islands  -->
        <TokenRegexp value="[05-9][1-9]"/>
        <TokenRegexp value="[1-9]{4}"/>
      </Seq>
      <Seq>  <!-- Wallis-and-Futuna -->
        <TokenRegexp value="[1-9][0-9]{0,3}"/>
        <Word value="WF" level="exact"/>
      </Seq>
    </Or>
  </TRule>
</TRules>

To use these rules, add a RulesMatcher to the configuration file:

<RulesMatcher name="platesMatcher" resourceFile="resource:///tutorial-mot/plates.xml" />

This example uses the NETVIBES resource protocol (resource:///), which means that the resource path is relative to the NGRESOURCEPATH directory (see Java Project Requirements). The file protocol is also supported; use it to specify an absolute path. The result looks like the output below.

Token[AA] kind[ALPHA] lng[fr] offset[0]
    Annotation[aa] tag[LOWERCASE] nbTokens[1]
    Annotation[aa] tag[NORMALIZE] nbTokens[1]
    Annotation[AA-123-BB] tag[NE.plates.SIV] nbTokens[5]
Token[-] kind[DASH] lng[fr] offset[2]
Token[123] kind[NUMBER] lng[fr] offset[3]
[...]
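
As noted above, the file protocol can be used instead of the resource protocol to point at an absolute path. A minimal sketch of that variant follows. The path /path/to/plates.xml is hypothetical, and the file:/// URL form is an assumption made by analogy with the resource:/// example, so check the exact syntax against your installation:

<!-- Assumption: hypothetical absolute path; replace it with the real location of plates.xml -->
<RulesMatcher name="platesMatcher" resourceFile="file:///path/to/plates.xml" />

This form may be convenient when the rules file lives outside the NGRESOURCEPATH directory, at the cost of a configuration that is no longer portable across machines.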