Sample Configurations for Different Use Cases

This page discusses:

  • Crawl One Intranet Site
  • Crawl an Intranet Site with SSO Authentication
  • Crawl Many Intranet Sites
  • Crawl Specific Internet Sites
  • Crawl Many Internet Sites

Crawl One Intranet Site

When crawling only intranet sites, the configuration is simple: enter the root URLs and, if needed, a few rules to avoid indexing useless documents.
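
For example (the URL and pattern are hypothetical), you could enter http://intranet.example.com/ as the root URL and add a rule such as "don't index URLs matching */print/*" to skip printer-friendly duplicates; the exact rule syntax depends on your crawl rule configuration.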

Since an intranet server can usually handle the load, you can remove crawl speed limits by configuring the crawler in "aggressive" mode, which removes the throttle time between requests and allows faster crawling and refreshing.

You can also enable "smart refresh" to index new documents quickly. This option automatically refreshes frequently changing pages more often, though it needs some time to adapt.

If required, you can configure a deterministic refresh to ensure that all pages are refreshed within a given period. Check that the refresh period is long enough to recrawl all pages, based on the server response time. You can schedule a deterministic refresh using the product's scheduler. See Refresh.
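
For example, assuming a site of 50,000 pages and an average response time of 0.5 seconds per request (illustrative figures), a single-threaded recrawl takes about 50,000 × 0.5 s ≈ 7 hours, so a refresh period of one day leaves a comfortable margin.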

Crawl an Intranet Site with SSO Authentication

The crawler configuration is the same as above, but the fetcher needs some authentication configuration too.

Follow the fetcher's authentication setup instructions. Usually, you can copy the URL patterns from the crawl rules into the authentication configuration.
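
For example (the pattern is hypothetical), if a crawl rule covers http://intranet.example.com/, the same pattern can usually be reused in the authentication configuration so that credentials are only sent to those URLs.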

Crawl Many Intranet Sites

Follow these recommendations when crawling many different intranet sites.

  • Usually, a single crawler containing all the sites is more efficient than several crawlers: one crawler shares its resources (CPU, memory, threads) across sites, whereas two independent crawlers consume twice the resources without being any faster.

  • The only good reasons to split the sites across several crawlers are the following:

    • You need different crawler options (throttle time, aggressive mode, refresh).
    • You need documents indexed in a different build group.
    • You deploy different crawlers on different hosts for performance.
  • Deterministic refresh is still an option; see Crawl One Intranet Site.
  • The number of threads matters: too many threads use too much memory, while too few cannot crawl fast enough. To find the appropriate number, remember that a host can be crawled by only one thread at a time, so any threads beyond the total number of hosts are useless (see the example after this list). You can also fine-tune the value using the performance graphs in the Monitoring Console: the crawler has a graph showing the average number of idle threads. If many threads are idle, you can lower the thread count.
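
For example, if the crawl covers 8 intranet hosts, more than 8 threads cannot make fetching any faster, because each host is crawled by at most one thread at a time.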

Crawl Specific Internet Sites

If you need to crawl specific pages on a small number of sites (for example, product pages on e-commerce sites, or articles on a news site), write crawl rules carefully to define precisely which URLs to index, and which URLs to follow but not index.

The easiest way is to use the Site option, so that the crawler crawls every URL in the site. With a few additional rules, you can then select which pages to index among the crawled documents (see the sketch after this list):

  • A rule matching the whole site with the don't index action overrides the default behavior of the Site option: the site is still crawled, but no longer indexed.

  • Enough rules with the index action to match the pages you want to index. These override the previous rule for the wanted pages only. You can also use the index and don't follow action if the links from these pages are not required.
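
As a sketch (the patterns are hypothetical and the exact rule syntax depends on your configuration), indexing only the product pages of a site could combine:

  • the Site option on http://www.example.com/, so the whole site is crawled,
  • a rule matching http://www.example.com/ with the don't index action,
  • a rule matching http://www.example.com/product/ with the index (or index and don't follow) action.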

To discover new links earlier without missing any, use Smart refresh or a specific deterministic refresh on intermediate listing pages. For more information about deterministic refresh, see Refresh.

Be aware of the limited crawl speed on the internet (usually 1 document every 2.5 seconds), which sets an upper bound on the number of pages crawled per day. You can tune the refresh periods with the Min/Max document age options. To optimize the crawl behavior:

  • Increase the acceptable document age to slow down the refresh,

  • Decrease it to improve freshness if you have resources available.
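
As a reference point, at one document every 2.5 seconds, a day allows 86,400 / 2.5 = 34,560 fetches, so a site crawled at that speed cannot exceed roughly 34,000 pages per day (assuming the throttle applies per site).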

Crawl Many Internet Sites

If you need to crawl a large or very large list of websites, configuring each one in the Administration Console becomes tedious.

The easiest way to handle this involves two features:

  • Site mode,

  • Root sets.

The Site mode allows you to avoid setting crawl rules per site.

The root set configuration allows the crawler to refresh the list of sites to crawl dynamically, from a file outside the configuration.

The root set URL locates the file containing the site list. It can be a regular file:

  • On the local filesystem (file://),

  • In the product's data directory (data://),

  • In the configuration directory (config://latest/master/),

  • Or even a remote server (http://).

The file can be as simple as one URL per line.
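
For example, a minimal root set file could contain:

http://www.example1.com/
http://www.example2.com/
http://www.example3.com/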

You can also store the root set in the resource manager, using a resourcemanager:// URL. For more information, see the resource manager's documentation.

For refresh, Smart refresh with longer periods (1 week to 1 month) works best, as it avoids unnecessarily aggressive refreshing.

The crawler can have as many crawl threads as the hardware resources allow; large crawls generally use 20 to 100 threads.

You can categorize sites using the root set group (splitting root sets into several smaller sets, each identified by its group), or on a per-site basis using forced metas. Example of a root set using forced metas:

http://www.example1.com/ category=news theme=sports relevancy=100

http://www.example2.com/ category=blogs relevancy=10

To optimize resource usage, you can configure a maximum number of crawled pages per site. The default value is 1 million pages.