Dependencies
If the matching rules for this processor depend on phonetic, stem, or lemma matching, you
must add the corresponding processor above this one in the pipeline.
For example, if your rules require phonetic forms, place the Phonetizer processor above
this processor in the analysis pipeline.
Basics of Creating Rules
Rule matching is based on an XML file that lists the patterns to identify, defined using regexp-like expressions.
- The XML file may contain several rules. Each rule has a priority and a specific annotation.
- Use the priority to disambiguate multiple matches for the same content.
The best match is determined by the criteria below, applied in this order:
1. Leftmost match.
2. Longest match.
3. Highest priority value (0 is the lowest priority).
4. The order in which the rules are defined in the rules file.
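For example, the priority attribute can disambiguate two rules that match the same content. This is a minimal sketch with hypothetical annotation kinds; the rule bodies are illustrative and not taken from the sample file below:

```xml
<!-- Both rules can match the token "may"; the higher priority value (5) wins -->
<TRule priority="5">
  <MatchAnnotation kind="NE.month"/>
  <Word value="may"/>
</TRule>
<TRule priority="0">
  <MatchAnnotation kind="NE.verb.modal"/>
  <Word value="may"/>
</TRule>
```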
- The annotation marks the matched tokens.
For example, a Named-Entity matcher could tag people's names with the annotation (NE, people), and the Rules Matcher could then tag the first names with (sub, 1) and the last names with (sub, 2), thus allowing match normalization.
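The people's names example can be sketched as a rule. This is a deliberately simplistic sketch: the single-token capitalized-word patterns are illustrative assumptions, not a realistic name matcher:

```xml
<TRule priority="0">
  <MatchAnnotation kind="NE.people"/>
  <Seq>
    <!-- (sub, 1): first name, assumed here to be one capitalized token -->
    <Sub no="1">
      <TokenRegexp value="[A-Z][a-z]+"/>
    </Sub>
    <!-- (sub, 2): last name, assumed here to be one capitalized token -->
    <Sub no="2">
      <TokenRegexp value="[A-Z][a-z]+"/>
    </Sub>
  </Seq>
</TRule>
```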
Note:
For an explanation of the available Booleans and operators, see Rules Syntax.
Sample Rules Matcher XML File
In this example, we want to identify email addresses, long dates, and short dates in French
and English. We use the Rules Matcher to create an annotation for a document when it
identifies an email address, based on the character sequence defined in the Rules XML file.
<TRules xmlns="exa:com.exalead.mot.components.transducer">
<!-- 1st rule tags emails with the annotation NE.email -->
<TRule priority="0">
<MatchAnnotation kind="NE.email"/>
<Seq>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
<Iter min="0" max="4">
<Or>
<Word value="-" level="exact"/>
<Word value="." level="exact"/>
</Or>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
</Iter>
<Word value="@" level="exact"/>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
<Iter min="0" max="6">
<Or>
<Word value="-" level="exact"/>
<Word value="." level="exact"/>
</Or>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
</Iter>
<Word value="." level="exact"/>
<Noblank/>
<Or>
<TokenRegexp value="[A-Za-z][A-Za-z]"/>
<Word value="gov"/>
<!-- government entities in the US -->
<Word value="com"/>
<!-- commercial entities -->
<Word value="net"/>
<!-- network providers -->
<Word value="org"/>
<!-- non-profit organizations -->
<Word value="edu"/>
<!-- educational institutions -->
<Word value="info"/>
<!-- informative websites -->
<Word value="biz"/>
<!-- business -->
<Word value="pro"/>
<!-- business use by qualified professionals -->
<Word value="name"/>
<!-- individuals' real names, nicknames, pseudonyms -->
<Word value="aero"/>
<!-- aviation-related businesses -->
<Word value="asia"/>
<!-- region of Asia, Australia, and the Pacific -->
<Word value="cat"/>
<!-- Catalan language and culture -->
<Word value="coop"/>
<!-- cooperatives -->
<Word value="int"/>
<!-- international treaty-based organizations -->
<Word value="jobs"/>
<!-- employment-related sites -->
<Word value="mil"/>
<!-- US Department of Defense -->
<Word value="tel"/>
<!-- publishing contact data -->
<Word value="museum"/>
<!-- museums -->
<Word value="travel"/>
<!-- travel industry -->
</Or>
</Seq>
</TRule>
<!-- 2nd rule tags dates with the annotations NE.date.full and NE.date.compact -->
<TRule priority="9">
<!-- All that's been matched by the captures, French way -->
<MatchAnnotation kind="NE.date.full" value="%1 %2 %3 %4"/>
<!-- Don't use day of week -->
<MatchAnnotation kind="NE.date.compact" value="%2/%3/%4"/>
<!-- matches dates like Samedi 31 janvier, 2004 or 16th of November 2003 -->
<Seq>
<!-- 1st capture: optional week day -->
<Opt>
<Sub no="1">
<!-- Day names ontology, taken from Named Entities matcher -->
<Annotation kind="exalead.nlp.date.days"/>
</Sub>
<Opt>
<Word value=","/>
</Opt>
<Opt>
<Word value="the"/>
</Opt>
</Opt>
<!-- 2nd capture: day number/ordinal -->
<Sub no="2">
<Or>
<!-- day ordinal: 16, 28, 15th, 1st, 3rd ... -->
<Annotation kind="exalead.nlp.date.ordinals"/>
<!-- day number with no ordinals: 16, 28, 1 ... -->
<TokenRegexp value="0?[1-9]"/>
<TokenRegexp value="[12][[:digit:]]" />
<TokenRegexp value="3[01]" />
</Or>
</Sub>
<Opt>
<Or>
<Word value="of"/>
<Word value="-"/>
</Or>
</Opt>
<!-- 3rd capture: month name -->
<Sub no="3">
<Annotation kind="exalead.nlp.date.months"/>
</Sub>
<Opt>
<Word value="-"/>
</Opt>
<!-- 4th capture: year -->
<Sub no="4">
<Or>
<!-- year like 06 -->
<TokenRegexp value="[[:digit:]]{2}"/>
<!-- full year [1000, 2999] -->
<TokenRegexp value="[12][[:digit:]]{3}" />
</Or>
</Sub>
</Seq>
</TRule>
</TRules>
Rules Syntax
Booleans
Booleans express constraints on a single token. These constraints can be combined in a tree using the classic operators AND, OR, and NOT. The individual leaf conditions must be met for matching to continue. These conditions can concern the format of the token, its possible annotations, its type or language, and so on.
Even though Booleans express constraints on a single token, an annotation may match more than one token. A match on a token bearing a specific annotation results in a match of all the tokens delimited by that annotation, which can span more than one token. This applies to the conditions concerning the keywords annotation and path (see the table below). All others match at most one token.
Boolean Operators

Boolean OR
An Or matches if at least one of its sub expressions matches. The length of the annotation matched is the longest of the sub expression matches.

Boolean AND
An And matches a token if all its sub expressions match. The length of the annotation matched is the longest of all sub expression matches.

Boolean NOT
A Not matches a token if its sub expression does not match. The length of the annotation matched is 1.

Boolean NOR
A Nor matches a token for the combined Boolean operators Not Or. The length of the annotation matched is 1.
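These operators nest freely. For example, a sketch (with hypothetical annotation kinds) of a constraint matching a token that bears one annotation but not another:

```xml
<And>
  <Annotation kind="NE.city"/>
  <Not>
    <Annotation kind="NE.country"/>
  </Not>
</And>
```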
Boolean Atoms

TokenRegexp
A TokenRegexp matches if the anchored regular expression matches the exact token string. This is the default behavior; the match can, however, be defined as normalized or case-insensitive. The following regexp features are not implemented:
- assertions like \b, \B, ?=, ?!, ?<=, ?<!
- back references like \1, \2, ...
- UNICODE support like \u0020 or \p{name}
- nongreedy repeat operators like ??, *?, +?
- octal notation like \0333
For example:
<TokenRegexp value="0?[1-9]|[12][[:digit:]]|3[01]"/>

Word
A Word matches if its value matches the normalized form of the token string. This is the default behavior; the match can, however, be defined as "exact" or "case-insensitive". For example:
<Word value="-" level="exact"/>

Annotation
An Annotation matches if the token bears an annotation matching the specified kind and, optionally, a value. For example:
<And>
<Annotation kind="some"/>
<Annotation kind="other"/>
</And>

Path
A Path matches a path value in an ontology. The implementation relies on annotations emitted by an OntologyMatcher somewhere upstream in the analysis pipeline.

AnyToken
AnyToken matches any token.

Noblank
An assertion matching a nonblank token. Its use is restricted to the root of a Boolean expression.

Digit
A Digit matches a token whose kind is TOKEN_NUMBER (set by the tokenizer for tokens made of a sequence of one or more digits). This is semantically equivalent to using the regular expression [0-9]+ but is more efficient, since the work has already been done by the tokenizer.

Alpha
Alpha matches a token made of uppercase or lowercase letters.

Alnum
Alnum matches a token made of uppercase/lowercase letters or digits.

Paragraph
A Paragraph matches a token whose kind is TOKEN_SEP_PARAGRAPH (set by the tokenizer).

TokenLanguage
A TokenLanguage matches a token with a specific language id. This allows you to write rules that are triggered for certain languages only.

Punct
A Punct matches a token whose kind is TOKEN_SEP_PUNCT (set by the tokenizer).

Dash
A Dash matches a token whose kind is TOKEN_SEP_DASH (set by the tokenizer).

Sentence
A Sentence matches a token whose kind is TOKEN_SEP_SENTENCE (set by the tokenizer).

TokenKind
A TokenKind matches a token whose kind matches the specified value (set by the tokenizer). Allowed values are:
- SEP_PARAGRAPH
- SEP_SENTENCE
- SEP_PUNCT
- SEP_QUOTE
- SEP_DASH
- NUMBER
- ALPHANUM (note that this means alphabetical and numerical, not alphabetical or numerical; use <Alnum> for the latter)
- ALPHA
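For example, a number token can be matched by its kind; as noted above for <Digit>, this is equivalent to, and cheaper than, the regular expression [0-9]+:

```xml
<!-- Matches a token the tokenizer classified as TOKEN_NUMBER -->
<TokenKind value="NUMBER"/>
<!-- The dedicated atom below is equivalent -->
<Digit/>
```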
Operators
Rules operators resemble those of standard regular expressions, but with an XML syntax. Only the <Near> operator has been added to the usual set.

Concatenation
<Seq> is a concatenation pattern.

Disjunction
<Or> is a disjunction pattern.

Proximity
<Near> matches subpatterns at a maximum distance of n nonblank tokens. By default, the order is free, but it can be imposed by setting the Boolean attribute ordered to true. This pattern matches the longest possible match. Use the slop attribute to set the number of nonblank tokens allowed between A and B (default 0).

Option
<Opt> matches its subpattern zero or one time.

Bounded Repetition
<Iter> matches the sequence of its subpatterns between min and max times. The maximum is 128. For example:
<Iter min="0" max="6">
<Or>
<Word value="-" level="exact"/>
<Word value="." level="exact"/>
</Or>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
</Iter>

Capture
<Sub> tags subparts of an annotation (kind, value) for later retrieval. For example, the day of the week is annotated (sub, 1), the day of the month (sub, 2), the month (sub, 3), and the year (sub, 4). By concatenating subs in increasing order, we can get normalized dates.

Pattern referencing
Each operator can have a name (attribute name) that is used later to reference the operator with <PatternRef>.

Pattern reuse
The <TImport> operator allows the reuse of existing patterns from a file (file name attribute). To be reusable, a pattern must have a name.

Including Rules
The <TInclude> operator works like a #include in C/C++: it adds to the current TRules set all the TRule objects found in the specified file (file name attribute).
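As a sketch of the proximity operator, using the slop and ordered attributes described above (the subpatterns are illustrative):

```xml
<!-- Matches "price" followed by a number token, with at most 2
     nonblank tokens allowed between them -->
<Near slop="2" ordered="true">
  <Word value="price"/>
  <Digit/>
</Near>
```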
Rules Best Practices
Do the following:
- Use the <Digit> operator, instead of a regular expression, to match tokens made of a sequence of digits. It uses the token kind computed by the tokenizer and is therefore more efficient.
- Use a normalizer. You are likely to need one somewhere upstream in the analysis pipeline, because the <Word> and <TokenRegexp> operators may match against the lowercase or normalized forms of tokens.
- Use the <Noblank> operator if you do NOT want to skip spaces. The Rules Matcher skips spaces and tabs so that rules are not littered with hundreds of references to blanks. You can, however, assert that there is no blank token between two tokens at a precise position in the stream with the <Noblank> operator.
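For example, a sketch of a sequence that uses <Noblank> to accept "3.14" but reject "3 . 14" (the decimal-number pattern is illustrative):

```xml
<Seq>
  <Digit/>
  <Noblank/> <!-- asserts there is no blank between the two tokens -->
  <Word value="." level="exact"/>
  <Noblank/>
  <Digit/>
</Seq>
```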
Avoid the following:
- Using the <Near> operator too often. It uses repeat operators, and so matches the longest possible match. This can be costly in terms of compilation time and RAM consumption, so keep the subpatterns A and B as simple as possible and limit NEAR overlapping.
Caveats
The Rules Matcher has the following caveats:
- It does not report overlapping or embedded matches; the earliest and longest match is reported. If there are ambiguities, the tokens matched by the highest-priority rule are kept. If two or more rules have the same priority, the first rule in declaration order takes precedence. For the repeat operators (<Opt>, <Iter>) and <Near>, the longest possible match is preferred.
- The <Sub> operator allows for subexpression match retrieval. It has two attributes, kind and value, which define the annotation emitted each time the subexpression matches. These "numerical subs" are useful in match normalization: they are defined so that concatenating the text they have matched, in increasing order of their value, yields normalized matches. For example, people's name detection rules could mark first names with (sub, 1) and last names with (sub, 2), thus giving equal results after concatenation for "John Smith" and "Smith John". These submatch annotations are only emitted when the overall pattern matches.
- During compilation, the Rules Matcher performs a number of optimizations, in particular for Boolean ORs, which are, whenever possible, replaced with a single regular expression that indicates whether any of the conditions in the list match. For example, the first expression below is automatically rewritten to the second, simplified expression, which is more efficient. The same is done with words: <Word> atoms are transformed into regular expressions.
<Or>
<TokenRegexp value="0?[1-9]"/>
<TokenRegexp value="[12][[:digit:]]" />
<TokenRegexp value="3[01]" />
</Or>
<TokenRegexp value="0?[1-9]|[12][[:digit:]]|3[01]"/>
- The <And> operator does not require the lengths of submatches to be equal. A match is found if all subpatterns match, irrespective of the length of the matches. The following example matches even if one annotation does not have the same length as the other, provided that both are present on the same token. The match length is the length of the longest annotation. For example:
<And>
<Annotation kind="some"/>
<Annotation kind="other"/>
</And>
Limitations
The Rules Matcher has the following limitations:
- The window size is 200 tokens. This is the maximum length of a match.
- Keep the upper bound of the <Iter> operator as low as possible, as it is costly in terms of resources. The maximum is limited to 128.
- UNICODE is not handled; matching is done on UTF-8 strings without specific processing, at the byte level.
Create a Rules Matcher Resource File
Create a Resource File from the Administration Console
The most convenient method is to create an empty resource file in the Administration Console and to define its content with the Business Console. See Create a Resource File from the Administration Console.
To Compile a Resource File from the Command Line
1. Create a rules XML file and save it in the resource directory. For more information, see Basics of Creating Rules.
2. In the Administration Console, select Index > Data processing > Pipeline name > Semantic Processors.
3. Drag the Rules Matcher to the required position in the Current Processors list.
4. Enter the Resource file path.
Map the Annotation to a Category Facet
We now need to configure the Rules Matcher processor to map the
NE.email annotation to a category facet that represents the email
address. This allows the document to be related to all email addresses found in
it.
1. In the Administration Console, select Index > Data processing > Pipeline name > Semantic Processors.
2. On the Mappings subtab, click Add mapping source, and define the mapping source:
   - Name: Enter the annotation name that you created in the rules file, for example, NE.email for the sample file above.
   - Type: Select Annotation.
   - (Optional) In the Input from field of the mapping, restrict the mapping so it only applies to a subset of comma-separated metas (also known as contexts) associated with this annotation.
3. Click Add mapping target and add a category target.
4. Modify the category-mapping properties.
   For example, the Create categories under this root property must be set to Top/Email in our example.
5. Go to Search > Search Logics > Your_Search_Logic > Facets and add a category group.
6. Click Add facet and enter the name to display in the Mashup UI Refinements panel.
7. For Root, enter the value you entered for Create categories under this root in step 4, for example, Top/Email.
8. Click Apply.