Advanced Configuration

You can configure the Crawler connector using the Crawl Manager of the Exalead CloudView Management APIs. For advanced configuration of the Crawler connector, edit the <DATADIR>/config/CrawlConfig.xml file.

You can edit this XML file from the Administration Console by clicking the Edit as XML (Experts only) link.

Crawl Rules

Crawl rules perform Boolean matching on URL strings. Rules can operate on any part of a URL (scheme, host, path, query, fragment) and match against literal strings or regular expressions.

You can configure the behavior of a match, for example, whether it is case-sensitive, or left- or right-anchored. A rule of kind Length matches the length of a part of the URL against an integer range. Shortcuts are available, as shown below.

Note: For all parts of the URL, the val attribute is a regular expression. To match one of the regexp special characters ^$.*+?[](){}| literally, you must escape it with a \.
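For example, to match paths ending in the literal string ".php", escape the dot so it is not interpreted as the regexp "any character" operator (a minimal sketch; the extension is illustrative):

<Atom field="path" kind="suffix" norm="none" value="\.php" />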

Rule category: Boolean
Rule keys: And, Not, Or
Example: <Or>...</Or>
Description: Performs Boolean matching on the nested rules it contains. Commonly used for Boolean matching on URL strings.

Rule category: generic
Rule key: Atom
Syntax: <Atom field="" kind="" norm="" value="" />
Example: <Atom field="path" kind="prefix" norm="none" value="/watch" />
Description: This generic rule must define the type of match:

  • field defines the part of the URL to match. Possible values: url|scheme|host|path|query
  • kind defines how the field is matched. Possible values:
    • exact|prefix|suffix
    • inside, where you specify a regexp and its anchoring in value.
    • length, where you specify the length of a field ([:10], [11:12], [30:]) in value.
  • norm sets the normalization level. Possible values: norm|lower|none. The default is norm, a case-insensitive match.
  • value is the regexp matched against links during the crawl, according to the field and kind parameters.

Rule category: shortcut
Rule keys:

  • Domain: exact match on the domain. For example, <Domain val="foo.com" /> matches http://foo.com/ and http://bar.foo.com, but not http://barfoo.com. This is a shortcut for the following combination of rules:

    <Or>
      <Atom field="host" kind="suffix" value=".foo.com" />
      <Atom field="host" kind="exact" value="foo.com" />
    </Or>

  • Path: left-anchored match on the path. Shortcut for atom path-prefix. Example: <Path val="/cgi-bin" />
  • Ext: right-anchored match on the path. Shortcut for atom path-suffix. Example: <Ext val=".gif" />
  • Host: exact match on the host. Shortcut for atom host-exact. Example: <Host val="www.wikipedia.org" />
  • Url: exact match on the URL. Shortcut for atom url-exact. Example: <Url val="http://en.wikipedia.org/wiki" />
  • Scheme: exact match on the scheme. Shortcut for atom scheme-exact. Possible values: http|https. Example: <Scheme val="http" />
  • Query: exact match on the query. Shortcut for atom query-exact. Example: <Query val="q=foo" />
  • InQuery: unanchored match within the query. Shortcut for atom query-inside. Example: <InQuery val="q=foo" />
  • Length: matches the length of a field against an integer range. Shortcut for atom field-length. Example: <Length field="path" val="[30:]" /> matches URLs with a path length >= 30.

Note: All objects belong to the namespace exa:com.exalead.actionrules.v21.
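Putting these rules together, the following sketch matches any URL on the hypothetical domain foo.com except GIF images, by nesting shortcut rules inside Boolean rules:

<And>
  <Domain val="foo.com" />
  <Not>
    <Ext val=".gif" />
  </Not>
</And>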

Crawl Rule Actions

The list below shows the crawl rule actions that you can use and the XML tags each action adds.

For example, a crawl rule with an index and follow action is written as follows:

<Rules group="example" key="auto">
  <Rule>
    <ar:Atom litteral="true" value="http://www.example.com/" norm="none" kind="prefix" field="url"/>
    <Index/>
    <Follow/>
    <Accept/>
  </Rule>
</Rules>

  • Index and follow: <Index/> <Follow/> <Accept/>
  • Index and don't follow: <Index/> <NoFollow/> <Accept/>
  • Follow but don't index: <NoIndex/> <Follow/> <Accept/>
  • Index: <Index/> <Accept/>
  • Follow: <Follow/> <Accept/>
  • Don't index: <NoIndex/>
  • Don't follow: <NoFollow/>
  • Ignore: <NoIndex/> <NoFollow/> <Ignore/>
  • Source: <Source name=""/>
  • Add meta: <AddMeta value="" name=""/>
  • Priority: <Priority shift=""/>
    Possible values for shift: -2 = Highest, -1 = Higher, 0 = Normal, 1 = Lower, 2 = Lowest
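For example, the following sketch combines several actions: it ignores CGI scripts and tags documentation pages with a source name and a custom meta. The URLs and values are hypothetical, and it assumes several Rule elements can be grouped under one Rules element:

<Rules group="example" key="tagging">
  <Rule>
    <!-- Ignore CGI scripts entirely. -->
    <ar:Atom litteral="true" value="/cgi-bin" norm="none" kind="prefix" field="path"/>
    <NoIndex/>
    <NoFollow/>
    <Ignore/>
  </Rule>
  <Rule>
    <!-- Index and follow documentation pages, tagging them. -->
    <ar:Atom litteral="true" value="http://www.example.com/docs/" norm="none" kind="prefix" field="url"/>
    <Index/>
    <Follow/>
    <Accept/>
    <Source name="docs"/>
    <AddMeta name="section" value="documentation"/>
  </Rule>
</Rules>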

How Priorities Work

When Smart refresh is enabled, the crawler scheduler may contain up to 6 URL sources: 5 FIFOs and 1 refresh source. If Smart refresh is disabled, you can use the Exalead CloudView scheduler to refresh sources at a specific time; in that case, URLs are sent to fifo:index.

  • fifo:user (source 0): only user-submitted root URLs with priority 0, and roots with default priority. Default weight (priority): 10000
  • fifo:redir (source 1): targets of redirections. Default weight: 2000
  • fifo:index (source 2): documents that are indexed but whose links are not followed. Default weight: 1000
  • fifo:index_follow (source 3): documents that are indexed and whose links are followed. Default weight: 100
  • fifo:follow (source 4): documents whose links are followed, but which are not indexed. Default weight: 10
  • smart refresh source (source 5): documents to refresh. Default weight: 1

The crawler scheduler picks URLs from each URL source according to its weight: the higher the weight of a source, the more links are picked from it. For example, with the default weights, URLs from fifo:user (weight 10000) are picked far more often than URLs from the smart refresh source (weight 1).

If you define a crawl rule with a Priority action, the priority shift raises or lowers the scheduling priority depending on its value (see the sketch after this list):

  • -2 = Highest
  • -1 = Higher
  • 0 = Normal
  • 1 = Lower
  • 2 = Lowest
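For example, the following sketch raises the crawl priority of URLs under a hypothetical /news path (names and values are illustrative, following the rule syntax shown earlier):

<Rules group="example" key="priority">
  <Rule>
    <!-- Match URLs whose path starts with /news. -->
    <ar:Atom litteral="true" value="/news" norm="none" kind="prefix" field="path"/>
    <Index/>
    <Follow/>
    <Accept/>
    <!-- shift="-1" corresponds to Higher. -->
    <Priority shift="-1"/>
  </Rule>
</Rules>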

Error Handling

Many errors can occur when crawling a URL.

These errors are split into the following categories:

  • Permanent errors
    • HTTP 404 Not Found
    • HTTP 304 Not Modified when the GET was not conditional (inconsistent server behavior)
    • Redirection to a malformed URL
    • HTTP 5XX server errors
    • HTTP 4XX content errors
    • DNS permanent errors (host not found, etc.)
    • Other connection errors
  • Temporary errors
    • Connection timed out, connection reset by peer, connection refused
    • HTTP 503 error with a Retry-After header (see the example after this list)
    • DNS temporary errors (no answer)
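For illustration, a standard HTTP response that the crawler treats as a temporary error looks like this (the retry delay, in seconds, is arbitrary):

HTTP/1.1 503 Service Unavailable
Retry-After: 120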

The error status is remembered for each URL. When a URL triggers a permanent error and a document was indexed for that URL, a deletion order is immediately sent to the Push API.

Documents in error are refreshed like other documents:

  • When a refreshed URL triggers too many temporary errors, a deletion order is sent to the Push API if a document was indexed for that URL, and the URL status is removed. The URL is not crawled again unless the crawler comes across a new link to it.

  • When a refreshed URL triggers a permanent error, the URL status is removed. The URL is not crawled again unless the crawler comes across a new link to it.