Crawl Rules
Crawl rules perform Boolean matching on a URL string. The rules can therefore operate on any part of a URL (scheme, host, path, query, fragment) and match either strings or regular expressions.
You can configure the behavior of the match to define whether it is case-sensitive and whether it is left-anchored, right-anchored, or both. For example, a rule of kind 'Length' matches the length of a part of the URL against an integer range. Shortcuts are available as shown in the table below.
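To make the anchoring and case-sensitivity options concrete, here is a minimal Python sketch of how such a match could behave. It is an illustration only, not the crawler's implementation: the function names `url_part` and `match_atom`, the `case_sensitive` parameter, and the kind names `exact`, `prefix`, `suffix`, and `inside` are assumptions modeled on the shortcut descriptions in the table.

```python
from urllib.parse import urlsplit


def url_part(url: str, field: str) -> str:
    """Return one part of the URL; the field 'url' means the whole string."""
    parts = urlsplit(url)
    return {
        "url": url,
        "scheme": parts.scheme,
        # Note: urlsplit() always lowercases the hostname.
        "host": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }[field]


def match_atom(url, field, kind, value, case_sensitive=True):
    """Hypothetical string match of one URL part against a value."""
    text = url_part(url, field)
    if not case_sensitive:
        text, value = text.lower(), value.lower()
    if kind == "exact":    # left and right anchored
        return text == value
    if kind == "prefix":   # left anchored
        return text.startswith(value)
    if kind == "suffix":   # right anchored
        return text.endswith(value)
    if kind == "inside":   # not anchored
        return value in text
    raise ValueError(f"unknown kind: {kind}")
```

With the URL `http://www.example.com/cgi-bin/search.GIF?q=foo#top`, a `prefix` match of `/cgi-bin` on the path succeeds, while a `suffix` match of `.gif` on the path succeeds only when `case_sensitive` is `False`.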
The val attribute is a regular expression. To match one of the regexp special characters ^$.*+?[](){}| literally, you must escape it with a backslash (\).
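The escaping rule can be illustrated with a short Python sketch. The helper `escape_literal` is hypothetical, and Python's `re` module stands in for the crawler's regexp engine, whose exact dialect is not specified here. The backslash itself is also escaped, which is an assumption beyond the character list above.

```python
import re

# Special characters listed above, plus the backslash itself (assumption).
SPECIAL = set("^$.*+?[](){}|\\")


def escape_literal(text: str) -> str:
    """Prefix each regexp special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


literal = "/wiki/C++_(language)?v=1.0"
pattern = escape_literal(literal)

# The escaped pattern matches the literal text and nothing else.
assert re.fullmatch(pattern, literal) is not None
assert re.fullmatch(pattern, "/wiki/C++_(language)?v=1x0") is None
```

Without escaping, a pattern such as `a.c` would also match `abc`, because an unescaped `.` matches any character.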
Rule category | Rule key | Example | Description and parameters |
---|---|---|---|
Boolean | | <Or>...</Or> | Performs Boolean matching on the nested rules. Commonly used for Boolean matching on URL strings. |
generic | | Syntax: Example: | This generic rule must define the type of match. The field defines the type of element to match. Possible values are |
shortcut | | Exact match on domain. For example, <Domain val="foo.com" /> Matches | This is a shortcut for the following combination of rules: <Or><Atom field="host" kind="suffix" value=".foo.com" /><Atom field="host" kind="exact" value="foo.com" /></Or> |
 | | <Path val="/cgi-bin" /> | This rule is a shortcut for atom path-prefix. It is a left-anchored match on the path. |
 | | <Ext val=".gif" /> | This rule is a shortcut for atom path-suffix. It is a right-anchored match on the path. |
 | | <Host val="www.wikipedia.org" /> | This rule is a shortcut for atom host-exact. It performs an exact match on the host. |
 | | <Url val="http://en.wikipedia.org/wiki" /> | This rule is a shortcut for atom url-exact. |
 | | <Scheme val="http" /> | This rule is a shortcut for atom scheme-exact. Possible values: |
 | | <Query val="q=foo" /> | This rule is a shortcut for atom query-exact. It performs an exact match on the query. |
 | | <InQuery val="q=foo" /> | This rule is a shortcut for atom query-inside. It performs an unanchored match on the query. |
 | | <Length field="path" val="[30:]" /> matches URLs with a path length >= 30 | This rule is a shortcut for atom field-length. It matches the length of the given URL field against an integer range. |
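The Domain and Length shortcuts from the table can be sketched in Python as follows. This is an illustration under stated assumptions, not the crawler's code: the function names are hypothetical, and the range syntax is inferred from the single `[30:]` example as `[min:max]` with optional, inclusive bounds.

```python
from urllib.parse import urlsplit


def domain_matches(url: str, domain: str) -> bool:
    """Host equals the domain or ends with '.' + domain,
    mirroring the <Or> of a host-exact and a host-suffix atom."""
    host = urlsplit(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def parse_range(spec: str):
    """Parse '[min:max]'; either bound may be empty (assumed syntax)."""
    low, _, high = spec.strip("[]").partition(":")
    return (int(low) if low else 0, int(high) if high else None)


def length_matches(url: str, spec: str) -> bool:
    """Check the path length against the range (bounds assumed inclusive)."""
    low, high = parse_range(spec)
    n = len(urlsplit(url).path)
    return n >= low and (high is None or n <= high)
```

Under these assumptions, `domain_matches` accepts both `foo.com` and `www.foo.com` but rejects `notfoo.com`, because the suffix atom requires the leading dot.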
exa:com.exalead.actionrules.v21