Dependencies
If the matching rules for this processor depend on phonetic, stem, or lemma matching, you
must add the corresponding processor above this one in the pipeline.
For example, if your rules require phonetic forms, place the Phonetizer processor above
this processor in the analysis pipeline.
Basics of Creating Rules
Rule matching is based on an XML file that lists the patterns to identify, defined using regexp-like expressions.
- The XML file may contain several rules. Each rule has a priority and a specific annotation.
- Use the priority to disambiguate multiple matches for the same content.
The best match is determined by the criteria below, applied in this order:
1. Leftmost match.
2. Longest match.
3. Highest priority value (0 is the lowest priority).
4. The order in which the rules are defined in the rules file.
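For example, the priority attribute can disambiguate two rules that match the same content. This is a minimal sketch with hypothetical annotation kinds; the rule bodies are illustrative and not taken from the sample file below:

```xml
<!-- Both rules can match the token "may"; the higher priority value (5) wins -->
<TRule priority="5">
  <MatchAnnotation kind="NE.month"/>
  <Word value="may"/>
</TRule>
<TRule priority="0">
  <MatchAnnotation kind="NE.verb.modal"/>
  <Word value="may"/>
</TRule>
```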
- The annotation marks the matched tokens.
For example, a Named-Entity matcher could tag people's names with the annotation (NE, people), and the Rules Matcher could then tag the first names with (sub, 1) and the last names with (sub, 2), thus allowing match normalization.
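The people's names example can be sketched as a rule. This is a deliberately simplistic sketch: the single-token capitalized-word patterns are illustrative assumptions, not a realistic name matcher:

```xml
<TRule priority="0">
  <MatchAnnotation kind="NE.people"/>
  <Seq>
    <!-- (sub, 1): first name, assumed here to be one capitalized token -->
    <Sub no="1">
      <TokenRegexp value="[A-Z][a-z]+"/>
    </Sub>
    <!-- (sub, 2): last name, assumed here to be one capitalized token -->
    <Sub no="2">
      <TokenRegexp value="[A-Z][a-z]+"/>
    </Sub>
  </Seq>
</TRule>
```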
Note:
For an explanation of the available Booleans and operators, see Rules Syntax.
Sample Rules Matcher XML File
In this example, we want to identify email addresses, long dates, and short dates in French
and English. We use the Rules Matcher to create an annotation for a document when it
identifies an email address, based on the character sequence defined in the Rules XML file.
<TRules xmlns="exa:com.exalead.mot.components.transducer">
<!-- 1st rule tags emails with the annotation NE.email -->
<TRule priority="0">
<MatchAnnotation kind="NE.email"/>
<Seq>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
<Iter min="0" max="4">
<Or>
<Word value="-" level="exact"/>
<Word value="." level="exact"/>
</Or>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
</Iter>
<Word value="@" level="exact"/>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
<Iter min="0" max="6">
<Or>
<Word value="-" level="exact"/>
<Word value="." level="exact"/>
</Or>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
</Iter>
<Word value="." level="exact"/>
<Noblank/>
<Or>
<TokenRegexp value="[A-Za-z][A-Za-z]"/>
<Word value="gov"/>
<!-- government entities in the US -->
<Word value="com"/>
<!-- commercial entities -->
<Word value="net"/>
<!-- network providers -->
<Word value="org"/>
<!-- non-profit organizations -->
<Word value="edu"/>
<!-- educational institutions -->
<Word value="info"/>
<!-- informative websites -->
<Word value="biz"/>
<!-- business -->
<Word value="pro"/>
<!-- business use by qualified professionals -->
<Word value="name"/>
<!-- individuals' real names, nicknames, pseudonyms -->
<Word value="aero"/>
<!-- aviation-related businesses -->
<Word value="asia"/>
<!-- region of Asia, Australia, and the Pacific -->
<Word value="cat"/>
<!-- Catalan language and culture -->
<Word value="coop"/>
<!-- cooperatives -->
<Word value="int"/>
<!-- international treaty-based organizations -->
<Word value="jobs"/>
<!-- employment-related sites -->
<Word value="mil"/>
<!-- US Department of Defense -->
<Word value="tel"/>
<!-- publishing contact data -->
<Word value="museum"/>
<!-- museums -->
<Word value="travel"/>
<!-- travel industry -->
</Or>
</Seq>
</TRule>
<!-- 2nd rule tags dates with the annotations NE.date.full and NE.date.compact -->
<TRule priority="9">
<!-- All that's been matched by the captures, French way -->
<MatchAnnotation kind="NE.date.full" value="%1 %2 %3 %4"/>
<!-- Don't use day of week -->
<MatchAnnotation kind="NE.date.compact" value="%2/%3/%4"/>
<!-- matches dates like Samedi 31 janvier, 2004 or 16th of November 2003 -->
<Seq>
<!-- 1st capture: optional week day -->
<Opt>
<Sub no="1">
<!-- Day names ontology, taken from Named Entities matcher -->
<Annotation kind="exalead.nlp.date.days"/>
</Sub>
<Opt>
<Word value=","/>
</Opt>
<Opt>
<Word value="the"/>
</Opt>
</Opt>
<!-- 2nd capture: day number/ordinal -->
<Sub no="2">
<Or>
<!-- day ordinal: 16, 28, 15th, 1st, 3rd ... -->
<Annotation kind="exalead.nlp.date.ordinals"/>
<!-- day number with no ordinals: 16, 28, 1 ... -->
<TokenRegexp value="0?[1-9]"/>
<TokenRegexp value="[12][[:digit:]]" />
<TokenRegexp value="3[01]" />
</Or>
</Sub>
<Opt>
<Or>
<Word value="of"/>
<Word value="-"/>
</Or>
</Opt>
<!-- 3rd capture: month name -->
<Sub no="3">
<Annotation kind="exalead.nlp.date.months"/>
</Sub>
<Opt>
<Word value="-"/>
</Opt>
<!-- 4th capture: year -->
<Sub no="4">
<Or>
<!-- year like 06 -->
<TokenRegexp value="[[:digit:]]{2}"/>
<!-- full year [1000, 2999] -->
<TokenRegexp value="[12][[:digit:]]{3}" />
</Or>
</Sub>
</Seq>
</TRule>
</TRules>
Rules Syntax
Booleans
Booleans express constraints on a single token. These constraints can be combined in a tree using the classic operators AND, OR, and NOT. The individual leaf conditions must be met for matching to continue. These conditions can concern the format of the token, its possible annotations, its type or language, and so on.
Even though Booleans express constraints on a single token, an annotation may match more than one token. A match on a token bearing a specific annotation results in a match of all the tokens delimited by that annotation, which can span more than one token. This applies to the conditions concerning the keywords annotation and path (see the table below). All others match at most one token.
Boolean Operators

Boolean OR
An Or matches if at least one of its sub expressions matches. The length of the annotation matched is the longest of the sub expression matches.

Boolean AND
An And matches a token if all its sub expressions match. The length of the annotation matched is the longest of all sub expression matches.

Boolean NOT
A Not matches a token if its sub expression does not match. The length of the annotation matched is 1.

Boolean NOR
A Nor matches a token for the combined Boolean operators Not Or. The length of the annotation matched is 1.
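These operators nest freely. For example, a sketch (with hypothetical annotation kinds) of a constraint matching a token that bears one annotation but not another:

```xml
<And>
  <Annotation kind="NE.city"/>
  <Not>
    <Annotation kind="NE.country"/>
  </Not>
</And>
```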
Boolean Atoms

TokenRegexp
A TokenRegexp matches if the anchored regular expression matches the exact token string. This is the default behavior; the match can, however, be defined as normalized or case-insensitive. The following regexp features are not implemented:
- assertions like \b, \B, ?=, ?!, ?<=, ?<!
- back references like \1, \2, ...
- UNICODE support like \u0020 or \p{name}
- nongreedy repeat operators like ??, *?, +?
- octal notation like \0333
For example:
<TokenRegexp value="0?[1-9]|[12][[:digit:]]|3[01]"/>

Word
A Word matches if its value matches the normalized form of the token string. This is the default behavior; the match can, however, be defined as "exact" or "case-insensitive". For example:
<Word value="-" level="exact"/>

Annotation
An Annotation matches if the token bears an annotation matching the specified kind and, optionally, a value. For example:
<And>
<Annotation kind="some"/>
<Annotation kind="other"/>
</And>

Path
A Path matches a path value in an ontology. The implementation relies on annotations emitted by an OntologyMatcher somewhere upstream in the analysis pipeline.

AnyToken
AnyToken matches any token.

Noblank
An assertion matching a nonblank token. Its use is restricted to the root of a Boolean expression.

Digit
A Digit matches a token whose kind is TOKEN_NUMBER (set by the tokenizer for tokens made of a sequence of one or more digits). This is semantically equivalent to using the regular expression [0-9]+ but is more efficient, since the work has already been done by the tokenizer.

Alpha
Alpha matches a token made of uppercase or lowercase letters.

Alnum
Alnum matches a token made of uppercase/lowercase letters or digits.

Paragraph
A Paragraph matches a token whose kind is TOKEN_SEP_PARAGRAPH (set by the tokenizer).

TokenLanguage
A TokenLanguage matches a token with a specific language id. This allows you to write rules that are triggered for certain languages only.

Punct
A Punct matches a token whose kind is TOKEN_SEP_PUNCT (set by the tokenizer).

Dash
A Dash matches a token whose kind is TOKEN_SEP_DASH (set by the tokenizer).

Sentence
A Sentence matches a token whose kind is TOKEN_SEP_SENTENCE (set by the tokenizer).

TokenKind
A TokenKind matches a token whose kind matches the specified value (set by the tokenizer). Allowed values are:
- SEP_PARAGRAPH
- SEP_SENTENCE
- SEP_PUNCT
- SEP_QUOTE
- SEP_DASH
- NUMBER
- ALPHANUM (note that this means alphabetical and numerical, not alphabetical or numerical; use <Alnum> for the latter)
- ALPHA
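For example, a number token can be matched by its kind; as noted above for <Digit>, this is equivalent to, and cheaper than, the regular expression [0-9]+:

```xml
<!-- Matches a token the tokenizer classified as TOKEN_NUMBER -->
<TokenKind value="NUMBER"/>
<!-- The dedicated atom below is equivalent -->
<Digit/>
```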
Operators
Rules operators resemble those of standard regular expressions, but with an XML syntax. Only the <Near> operator has been added to the usual set.

Concatenation
<Seq> is a concatenation pattern.

Disjunction
<Or> is a disjunction pattern.

Proximity
<Near> matches subpatterns at a maximum distance of n nonblank tokens. By default, the order is free, but it can be imposed by setting the Boolean attribute ordered to true. This pattern matches the longest possible match. Use the slop attribute to set the number of nonblank tokens allowed between A and B (default 0).

Option
<Opt> matches its subpattern zero or one time.

Bounded Repetition
<Iter> matches the sequence of its subpatterns between min and max times. The maximum is 128. For example:
<Iter min="0" max="6">
<Or>
<Word value="-" level="exact"/>
<Word value="." level="exact"/>
</Or>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
</Iter>

Capture
<Sub> tags subparts of an annotation (kind, value) for later retrieval. For example, the day of the week is annotated (sub, 1), the day of the month (sub, 2), the month (sub, 3), and the year (sub, 4). By concatenating subs in increasing order, we can get normalized dates.

Pattern referencing
Each operator can have a name (attribute name) that is used later to reference the operator with <PatternRef>.

Pattern reuse
The <TImport> operator allows the reuse of existing patterns from a file (file name attribute). To be reusable, a pattern must have a name.

Including Rules
The <TInclude> operator works like a #include in C/C++: it adds to the current TRules set all the TRule objects found in the specified file (file name attribute).
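As a sketch of the proximity operator, using the slop and ordered attributes described above (the subpatterns are illustrative):

```xml
<!-- Matches "price" followed by a number token, with at most 2
     nonblank tokens allowed between them -->
<Near slop="2" ordered="true">
  <Word value="price"/>
  <Digit/>
</Near>
```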
Rules Best Practices
Do the following:
- Use the <Digit> operator, instead of a regular expression, to match tokens made of a sequence of digits. It uses the token kind computed by the tokenizer and is therefore more efficient.
- Use a normalizer. You are likely to need one somewhere upstream in the analysis pipeline, because the <Word> and <TokenRegexp> operators may match against the lowercase or normalized forms of tokens.
- Use the <Noblank> operator if you do NOT want to skip spaces. The Rules Matcher skips spaces and tabs so that rules are not littered with hundreds of references to blanks. You can, however, assert that there is no blank token between two tokens at a precise position in the stream with the <Noblank> operator.
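For example, a sketch of a sequence that uses <Noblank> to accept "3.14" but reject "3 . 14" (the decimal-number pattern is illustrative):

```xml
<Seq>
  <Digit/>
  <Noblank/> <!-- asserts there is no blank between the two tokens -->
  <Word value="." level="exact"/>
  <Noblank/>
  <Digit/>
</Seq>
```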
Avoid the following:
- Using the <Near> operator too often. It uses repeat operators, and so matches the longest possible match. This can be costly in terms of compilation time and RAM consumption, so keep the subpatterns A and B as simple as possible and limit NEAR overlapping.
Caveats
The Rules Matcher has the following caveats:
- It does not report overlapping or embedded matches; the earliest and longest match is reported. If there are ambiguities, the tokens matched by the highest-priority rule are kept. If two or more rules have the same priority, the first rule in declaration order takes precedence. For the repeat operators (<Opt>, <Iter>) and <Near>, the longest possible match is preferred.
- The <Sub> operator allows for subexpression match retrieval. It has two attributes, kind and value, which define the annotation emitted each time the subexpression matches. These "numerical subs" are useful in match normalization: they are defined so that concatenating the text they have matched, in increasing order of their value, yields normalized matches. For example, people's name detection rules could mark first names with (sub, 1) and last names with (sub, 2), thus giving equal results after concatenation for "John Smith" and "Smith John". These submatch annotations are only emitted when the overall pattern matches.
- During compilation, the Rules Matcher performs a number of optimizations, in particular for Boolean ORs, which are, whenever possible, replaced with a single regular expression that indicates whether any of the conditions in the list match. For example, the first expression below is automatically rewritten to the second, simplified expression, which is more efficient. The same is done with words: <Word> atoms are transformed into regular expressions.
<Or>
<TokenRegexp value="0?[1-9]"/>
<TokenRegexp value="[12][[:digit:]]" />
<TokenRegexp value="3[01]" />
</Or>
<TokenRegexp value="0?[1-9]|[12][[:digit:]]|3[01]"/>
- The <And> operator does not require the lengths of submatches to be equal. A match is found if all subpatterns match, irrespective of the length of the matches. The following example matches even if one annotation does not have the same length as the other, provided that both are present on the same token. The match length is the length of the longest annotation. For example:
<And>
<Annotation kind="some"/>
<Annotation kind="other"/>
</And>
Limitations
The Rules Matcher has the following limitations:
- The window size is 200 tokens. This is the maximum length of a match.
- Keep the upper bound of the <Iter> operator as low as possible, as it is costly in terms of resources. The maximum is limited to 128.
- UNICODE is not handled; matching is done on UTF-8 strings without specific processing, at the byte level.
Create a Rules Matcher Resource File
Create a Resource File from the Administration Console
The most convenient method is to create an empty resource file in the Administration Console and to define its content with the Business Console. See Create a Resource File from the Administration Console.
To Compile a Resource File from the Command Line
1. Create a rules XML file and save it in the resource directory. For more information, see Basics of Creating Rules.
2. In the Administration Console, select Index > Data processing > Pipeline name > Semantic Processors.
3. Drag the Rules Matcher to the required position in the Current Processors list.
4. Enter the Resource file path.
Map the Annotation to a Category Facet
We now need to configure the Rules Matcher processor to map the
NE.email annotation to a category facet that represents the email
address. This allows the document to be related to all email addresses found in
it.
1. In the Administration Console, select Index > Data processing > Pipeline name > Semantic Processors.
2. On the Mappings subtab, click Add mapping source, and define the mapping source:
   - Name: Enter the annotation name that you created in the rules file, for example, NE.email for the sample file above.
   - Type: Select Annotation.
   - (Optional) In the Input from field of the mapping, restrict the mapping so it only applies to a subset of comma-separated metas (also known as contexts) associated with this annotation.
3. Click Add mapping target and add a category target.
4. Modify the category-mapping properties.
   For example, the Create categories under this root property must be set to Top/Email in our example.
5. Go to Search > Search Logics > Your_Search_Logic > Facets and add a category group.
6. Click Add facet and enter the name to display in the Mashup UI Refinements panel.
7. For Root, enter the value you entered for Create categories under this root in step 4, for example, Top/Email.
8. Click Apply.