Configuring the Crawler

This section assumes that the Crawler connector has already been added. See Creating a Standard Connector.

See Also
Define the Groups of URLs to Crawl
Define Crawl Rules
Specify the File Name Extension and MIME Types to Crawl
Sample Configurations for Different Use Cases
Hardware Sizing
Advanced Configuration
Table 1. Crawler General Options

Smart refresh

When enabled, the crawler continuously goes through the list of already crawled URLs and adds them to the crawling queue so they are refreshed.

URLs that have been refreshed or crawled recently, that is, that have not yet reached the Min age threshold, are not added to the crawling queue immediately.

The URLs added by the refresh loop have a lower priority than the new URLs discovered during the crawl, so the refresh process does not slow down the discovery of new pages. For more details, see How Priorities Work.
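
As a rough illustration only (not the connector's internals; the priority values and names below are assumptions), a crawl queue can be modeled as a priority heap in which refresh entries always rank below newly discovered URLs:

```python
import heapq

# Illustrative priorities only; lower numbers are dequeued first.
PRIORITY_NEW = 0      # URLs discovered while crawling
PRIORITY_REFRESH = 1  # URLs re-queued by the smart refresh loop

queue = []
counter = 0  # tie-breaker so the heap never compares URLs directly

def enqueue(url, priority):
    global counter
    heapq.heappush(queue, (priority, counter, url))
    counter += 1

# The refresh loop re-queues an already crawled URL...
enqueue("http://example.com/old-page", PRIORITY_REFRESH)
# ...but a freshly discovered link is still served before it.
enqueue("http://example.com/new-page", PRIORITY_NEW)

while queue:
    _, _, url = heapq.heappop(queue)
    print("crawl", url)  # prints new-page first, then old-page
```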

Min age / Max age

Documents older than Min age and younger than Max age are automatically refreshed depending on their update frequency. Documents that change often are refreshed more frequently than documents that never change.

In other words, documents are automatically refreshed according to their update frequency, but never faster than Min age, and never slower than Max age.

Reasonable values for a small intranet site are min = 1 hour, max = 1 week. Reasonable values for a large web crawl are much higher: a minimum of 1 day and a maximum of 6 months to 1 year.
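
As a sketch of this idea only (the function name and the doubling/halving rule are assumptions, not the connector's actual algorithm), a per-document refresh interval can be adapted to how often the document changes and clamped between Min age and Max age:

```python
from datetime import timedelta

# Example settings for a small intranet site (see above).
MIN_AGE = timedelta(hours=1)
MAX_AGE = timedelta(weeks=1)

def next_refresh_interval(previous, content_changed):
    """Refresh changing documents more often and stable ones less often,
    but never faster than MIN_AGE and never slower than MAX_AGE."""
    candidate = previous / 2 if content_changed else previous * 2
    return max(MIN_AGE, min(MAX_AGE, candidate))

interval = timedelta(days=1)
interval = next_refresh_interval(interval, content_changed=True)
print(interval)   # 12:00:00 - the document is now refreshed twice as often
interval = next_refresh_interval(interval, content_changed=False)
print(interval)   # 1 day, 0:00:00 - back to a slower refresh
```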

No. of crawler threads

Specifies the number of crawler threads to run simultaneously.

If the crawler overloads your network, specify fewer threads.

Web server throttle time (ms)

Specifies the time to wait between two queries sent to the same server, in milliseconds.

By default, the connector sends one query at a time to a given HTTP server. This avoids overloading the HTTP server.

Note: The throttle time is ignored if you select Aggressive mode. However, this mode may generate a heavy load on web servers. Use it only when crawling your own servers.
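
A minimal sketch, with hypothetical names and a single-threaded loop, of how a per-server throttle time translates into a pause between two queries to the same host:

```python
import time
from urllib.parse import urlparse

THROTTLE_MS = 2500      # illustrative value for "Web server throttle time (ms)"
last_request = {}       # host -> monotonic time of the last query to that host

def wait_for_host(url):
    """Sleep so that two queries to the same HTTP server are at least
    THROTTLE_MS milliseconds apart; other hosts are not delayed."""
    host = urlparse(url).netloc
    now = time.monotonic()
    earliest = last_request.get(host, 0.0) + THROTTLE_MS / 1000
    if now < earliest:
        time.sleep(earliest - now)
    last_request[host] = time.monotonic()

for url in ["http://foo.com/a", "http://foo.com/b", "http://bar.com/c"]:
    wait_for_host(url)
    print("fetch", url)  # foo.com/b waits ~2.5 s, bar.com/c does not
```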

Ignore robots.txt

Ignores the robots.txt file that may be defined on the site you want to crawl.

Caution! By default, the crawler follows the rules defined in the robots.txt file when crawling the HTTP server.
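
For reference, Python's standard library robotparser module shows what honoring robots.txt looks like in practice; this is an illustration, not the connector's implementation:

```python
from urllib.robotparser import RobotFileParser

# A polite crawler checks robots.txt before fetching a URL (illustration only).
robots = RobotFileParser("http://example.com/robots.txt")
robots.read()   # downloads and parses the rules

url = "http://example.com/private/report.html"
if robots.can_fetch("MyCrawler", url):
    print("allowed to crawl", url)
else:
    print("disallowed by robots.txt", url)
```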

Detect near duplicates

Detects similar documents.

When a document is identical to a previously crawled page (the content is exactly the same), it is not indexed and its links are not followed.

When it is only similar to a previously crawled document, it is indexed but its links are not followed.

This can avoid looping on infinitely growing URLs with duplicate content, and speeds up the crawl on sites where content can be accessed from different URLs.
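
The general idea can be sketched with an exact content hash for identical pages and word-shingle overlap for near duplicates; the threshold and function names below are assumptions, not the connector's actual detection algorithm:

```python
import hashlib

def shingles(text, size=4):
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

seen_hashes = set()     # exact-duplicate detection
seen_shingles = []      # near-duplicate detection

def classify(content):
    """Return 'duplicate' (skip entirely), 'near-duplicate' (index, do not
    follow links), or 'new' (index and follow links)."""
    digest = hashlib.sha1(content.encode()).hexdigest()
    if digest in seen_hashes:
        return "duplicate"
    current = shingles(content)
    verdict = "new"
    for previous in seen_shingles:
        if jaccard(current, previous) > 0.9:   # illustrative threshold
            verdict = "near-duplicate"
            break
    seen_hashes.add(digest)
    seen_shingles.append(current)
    return verdict

print(classify("the quick brown fox jumps over the lazy dog again and again"))
print(classify("the quick brown fox jumps over the lazy dog again and again"))  # duplicate
```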

Detect patterns

Detects sites based on specific patterns, such as CMS sites, parking sites, spam, and adult content. It then applies the specific crawl rules defined for these sites. For example, it can detect search or login pages on forums that the connector must not crawl.

Convert processor

Searches for links in binary documents.

To save CPU and memory, disable this option when crawling sites where there are no new links in binary documents.

Site collapsing

Activates site collapsing.

By default, the site collapsing ID is based on the host part of the URL: all URLs belonging to the same host share the same ID and are shown as one collapsed result. To distinguish sites hosted on the same host, you can configure the site ID to include more segments of the URL path by setting the Depth parameter. The default value, 0, corresponds to the host part of the URL only.

For more details, see About Site Collapsing.
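
For example, a collapse ID built from the host plus the first Depth path segments could be computed as follows (illustrative only; the connector's actual ID format may differ):

```python
from urllib.parse import urlparse

def site_id(url, depth=0):
    """Depth 0 collapses on the host; higher values keep more path segments,
    so different sites hosted under the same domain stay distinct."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    return "/".join([parsed.netloc] + segments[:depth])

print(site_id("http://host.com/siteA/page1.html"))            # host.com
print(site_id("http://host.com/siteA/page1.html", depth=1))   # host.com/siteA
print(site_id("http://host.com/siteB/page2.html", depth=1))   # host.com/siteB
```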

Default behavior

Allows you to specify a combination of index and follow rules for URLs that do not match any rules, as illustrated in the sketch after this list.

  • Accept – crawls URLs that do not match any rules. Beware: this can crawl the entire internet.
  • Index – indexes the contents of the documents at this URL.
  • Follow – follows the hyperlinks found in the documents at this URL. This can find content outside of this URL. Beware: it can crawl the entire internet.
  • Follow roots – follows root URLs that do not match any rules.
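
A minimal sketch, with hypothetical flag names, of how these defaults might be applied to a URL that matches no crawl rule:

```python
# Illustrative only: the flag names mirror the options above but are not the
# connector's real configuration keys. "Follow roots" is omitted for brevity.
DEFAULTS = {"accept": True, "index": True, "follow": False}

def handle_unmatched(url, fetch, extract_links, index):
    """Apply the default behavior to a URL that matches no crawl rule."""
    if not DEFAULTS["accept"]:
        return                          # the URL is not crawled at all
    document = fetch(url)
    if DEFAULTS["index"]:
        index(url, document)            # store the document contents
    if DEFAULTS["follow"]:
        for link in extract_links(document):
            print("queue", link)        # may lead far outside this site

handle_unmatched(
    "http://foo.com/unmatched",
    fetch=lambda url: "<html>...</html>",
    extract_links=lambda doc: [],
    index=lambda url, doc: print("index", url),
)
```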

Default data model class

Allows you to specify the class in which the documents must be indexed for URLs that do not have any data model class rules.

License document type

Specifies the type of documents that this connector produces.

Blacklisted session ID

Allows you to specify the session IDs to be excluded (blacklisted) from the matching URLs.

Many sites use session IDs, which add random parameters to URLs. This is problematic because it creates different links for the same pages.

For example, to blacklist the parameter s from URLs, enter s in the field. Consequently, if the crawler finds a link to:

http://foo.com/toto?s=123456789&t=2

... it crawls:

http://foo.com/toto?t=2
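
The effect of blacklisting a parameter can be illustrated with a small URL-rewriting sketch (not the connector's implementation; the function name is hypothetical):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

BLACKLISTED = {"s"}   # session-ID parameters to strip, as configured above

def strip_session_ids(url):
    """Remove blacklisted query parameters so the same page is not crawled
    once per session ID."""
    parsed = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parsed.query) if k not in BLACKLISTED]
    return urlunparse(parsed._replace(query=urlencode(kept)))

print(strip_session_ids("http://foo.com/toto?s=123456789&t=2"))
# -> http://foo.com/toto?t=2
```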