Configuring the Crawler

This section assumes that the Crawler connector has already been added. See Creating a Standard Connector.

See Also
Define the Groups of URLs to Crawl
Define Crawl Rules
Specify the File Name Extension and MIME Types to Crawl
Sample Configurations for Different Use Cases
Hardware Sizing
Advanced Configuration
Table 1. Crawler General Options

Smart refresh

When enabled, the crawler continuously goes through the list of already crawled URLs and adds them to the crawling queue so they are refreshed.

URLs that have been refreshed or crawled recently, that is, that have not yet reached the Min age threshold, are not added to the crawling queue immediately.

The URLs added by the refresh loop have a lower priority than the new URLs discovered during the crawl, so the refresh process does not slow down the discovery of new pages. For more details, see How Priorities Work.
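
As a rough illustration only (not the connector's internals; the priority values and names below are assumptions), a crawl queue can be modeled as a priority heap in which refresh entries always rank below newly discovered URLs:

```python
import heapq

# Illustrative priorities only; lower numbers are dequeued first.
PRIORITY_NEW = 0      # URLs discovered while crawling
PRIORITY_REFRESH = 1  # URLs re-queued by the smart refresh loop

queue = []
counter = 0  # tie-breaker so the heap never compares URLs directly

def enqueue(url, priority):
    global counter
    heapq.heappush(queue, (priority, counter, url))
    counter += 1

# The refresh loop re-queues an already crawled URL...
enqueue("http://example.com/old-page", PRIORITY_REFRESH)
# ...but a freshly discovered link is still served before it.
enqueue("http://example.com/new-page", PRIORITY_NEW)

while queue:
    _, _, url = heapq.heappop(queue)
    print("crawl", url)  # prints new-page first, then old-page
```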

Min age / Max age

Documents older than Min age and younger than Max age are automatically refreshed depending on their update frequency. Documents that change often are refreshed more frequently than documents that never change.

In other words, documents are automatically refreshed according to their update frequency, but never faster than Min age, and never slower than Max age.

Reasonable values for a small intranet site are min = 1 hour, max = 1 week. Reasonable values for a large web crawl are much higher: a minimum of 1 day and a maximum of 6 months to 1 year.
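
As a sketch of this idea only (the function name and the doubling/halving rule are assumptions, not the connector's actual algorithm), a per-document refresh interval can be adapted to how often the document changes and clamped between Min age and Max age:

```python
from datetime import timedelta

# Example settings for a small intranet site (see above).
MIN_AGE = timedelta(hours=1)
MAX_AGE = timedelta(weeks=1)

def next_refresh_interval(previous, content_changed):
    """Refresh changing documents more often and stable ones less often,
    but never faster than MIN_AGE and never slower than MAX_AGE."""
    candidate = previous / 2 if content_changed else previous * 2
    return max(MIN_AGE, min(MAX_AGE, candidate))

interval = timedelta(days=1)
interval = next_refresh_interval(interval, content_changed=True)
print(interval)   # 12:00:00 - the document is now refreshed twice as often
interval = next_refresh_interval(interval, content_changed=False)
print(interval)   # 1 day, 0:00:00 - back to a slower refresh
```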

No. of crawler threads

Specifies the number of crawler threads to run simultaneously.

If the crawler overloads your network, specify fewer threads.

Web server throttle time (ms)

Specifies the time to wait between two queries sent to the same server, in milliseconds.

By default, the connector sends one query at a time to a given HTTP server. This avoids overloading the HTTP server.

Note: The throttle time is ignored if you select Aggressive mode. However, this mode may generate a heavy load on web servers. Use it only when crawling your own servers.
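
A minimal sketch, with hypothetical names and a single-threaded loop, of how a per-server throttle time translates into a pause between two queries to the same host:

```python
import time
from urllib.parse import urlparse

THROTTLE_MS = 2500      # illustrative value for "Web server throttle time (ms)"
last_request = {}       # host -> monotonic time of the last query to that host

def wait_for_host(url):
    """Sleep so that two queries to the same HTTP server are at least
    THROTTLE_MS milliseconds apart; other hosts are not delayed."""
    host = urlparse(url).netloc
    now = time.monotonic()
    earliest = last_request.get(host, 0.0) + THROTTLE_MS / 1000
    if now < earliest:
        time.sleep(earliest - now)
    last_request[host] = time.monotonic()

for url in ["http://foo.com/a", "http://foo.com/b", "http://bar.com/c"]:
    wait_for_host(url)
    print("fetch", url)  # foo.com/b waits ~2.5 s, bar.com/c does not
```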

Ignore robots.txt

Ignores the robots.txt file that may be defined on the site you want to crawl.

Caution! By default, the crawler follows the rules defined in the robots.txt file when crawling the HTTP server.
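
For reference, Python's standard library robotparser module shows what honoring robots.txt looks like in practice; this is an illustration, not the connector's implementation:

```python
from urllib.robotparser import RobotFileParser

# A polite crawler checks robots.txt before fetching a URL (illustration only).
robots = RobotFileParser("http://example.com/robots.txt")
robots.read()   # downloads and parses the rules

url = "http://example.com/private/report.html"
if robots.can_fetch("MyCrawler", url):
    print("allowed to crawl", url)
else:
    print("disallowed by robots.txt", url)
```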

Detect near duplicates

Detects similar documents.

When a document is identical to a previously crawled page (the content is exactly the same), it is not indexed and its links are not followed.

When it is only similar to a previously crawled document, it is indexed but its links are not followed.

This can avoid looping on infinitely growing URLs with duplicate content, and speeds up the crawl on sites where content can be accessed from different URLs.
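
The general idea can be sketched with an exact content hash for identical pages and word-shingle overlap for near duplicates; the threshold and function names below are assumptions, not the connector's actual detection algorithm:

```python
import hashlib

def shingles(text, size=4):
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

seen_hashes = set()     # exact-duplicate detection
seen_shingles = []      # near-duplicate detection

def classify(content):
    """Return 'duplicate' (skip entirely), 'near-duplicate' (index, do not
    follow links), or 'new' (index and follow links)."""
    digest = hashlib.sha1(content.encode()).hexdigest()
    if digest in seen_hashes:
        return "duplicate"
    current = shingles(content)
    verdict = "new"
    for previous in seen_shingles:
        if jaccard(current, previous) > 0.9:   # illustrative threshold
            verdict = "near-duplicate"
            break
    seen_hashes.add(digest)
    seen_shingles.append(current)
    return verdict

print(classify("the quick brown fox jumps over the lazy dog again and again"))
print(classify("the quick brown fox jumps over the lazy dog again and again"))  # duplicate
```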

Detect patterns

Detects sites based on specific patterns, such as CMS sites, parking sites, spam, and adult content. It then applies the specific crawl rules defined for these sites. For example, it can detect search or login pages on forums that the connector must not crawl.

Convert processor

Searches for links in binary documents.

To save CPU and memory, disable this option when crawling sites where there are no new links in binary documents.

Site collapsing

Activates site collapsing.

By default, the site collapsing ID is based on the host part of the URL: all URLs belonging to the same host share the same ID and are shown as one collapsed result. To distinguish sites hosted on the same host, you can configure the site ID to include more segments of the URL path by setting the Depth parameter. The default value, 0, corresponds to the host part of the URL only.

For more details, see About Site Collapsing.
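
For example, a collapse ID built from the host plus the first Depth path segments could be computed as follows (illustrative only; the connector's actual ID format may differ):

```python
from urllib.parse import urlparse

def site_id(url, depth=0):
    """Depth 0 collapses on the host; higher values keep more path segments,
    so different sites hosted under the same domain stay distinct."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    return "/".join([parsed.netloc] + segments[:depth])

print(site_id("http://host.com/siteA/page1.html"))            # host.com
print(site_id("http://host.com/siteA/page1.html", depth=1))   # host.com/siteA
print(site_id("http://host.com/siteB/page2.html", depth=1))   # host.com/siteB
```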

Default behavior

Allows you to specify a combination of index and follow rules for URLs that do not match any rules, as illustrated in the sketch after this list.

  • Accept – crawls URLs that do not match any rules. Beware: this can crawl the entire internet.
  • Index – indexes the contents of the documents at this URL.
  • Follow – follows the hyperlinks found in the documents at this URL. This can find content outside of this URL. Beware: it can crawl the entire internet.
  • Follow roots – follows root URLs that do not match any rules.
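
A minimal sketch, with hypothetical flag names, of how these defaults might be applied to a URL that matches no crawl rule:

```python
# Illustrative only: the flag names mirror the options above but are not the
# connector's real configuration keys. "Follow roots" is omitted for brevity.
DEFAULTS = {"accept": True, "index": True, "follow": False}

def handle_unmatched(url, fetch, extract_links, index):
    """Apply the default behavior to a URL that matches no crawl rule."""
    if not DEFAULTS["accept"]:
        return                          # the URL is not crawled at all
    document = fetch(url)
    if DEFAULTS["index"]:
        index(url, document)            # store the document contents
    if DEFAULTS["follow"]:
        for link in extract_links(document):
            print("queue", link)        # may lead far outside this site

handle_unmatched(
    "http://foo.com/unmatched",
    fetch=lambda url: "<html>...</html>",
    extract_links=lambda doc: [],
    index=lambda url, doc: print("index", url),
)
```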

Default data model class

Allows you to specify the class in which the documents must be indexed for URLs that do not have any data model class rules.

License document type

Specifies the type of documents that this connector produces.

Blacklisted session ID

Allows you to specify the session IDs to be excluded (blacklisted) from the matching URLs.

Many sites use session IDs, which add random parameters to URLs. This is problematic because it creates different links for the same pages.

For example, to blacklist the parameter s from URLs, enter s in the field. Consequently, if the crawler finds a link to:

http://foo.com/toto?s=123456789&t=2

... it crawls:

http://foo.com/toto?t=2
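
The effect of blacklisting a parameter can be illustrated with a small URL-rewriting sketch (not the connector's implementation; the function name is hypothetical):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

BLACKLISTED = {"s"}   # session-ID parameters to strip, as configured above

def strip_session_ids(url):
    """Remove blacklisted query parameters so the same page is not crawled
    once per session ID."""
    parsed = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parsed.query) if k not in BLACKLISTED]
    return urlunparse(parsed._replace(query=urlencode(kept)))

print(strip_session_ids("http://foo.com/toto?s=123456789&t=2"))
# -> http://foo.com/toto?t=2
```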