Define Crawl Rules

If you do not want to crawl an entire site, you can use crawl rules to define precisely how to crawl each root URL in the HTTP source.

These rules allow you to specify the action to perform for each URL, for example, following the hyperlinks and indexing the contents found on the pages.

Keep in mind that the crawler applies the rules to the URLs of all groups. A pattern is a prefix matched against the beginning of the URL. Make sure that each pattern only matches the URLs of its own group; otherwise, unexpected behavior may occur.
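
As a rough sketch of this prefix behavior (illustrative only, not the crawler's actual implementation; the URLs and group layout are hypothetical), a pattern matches any URL that starts with it, so a pattern that is too short can capture URLs from another group:

    def matches(pattern: str, url: str) -> bool:
        # A crawl-rule pattern is a simple prefix test against the URL.
        return url.startswith(pattern)

    # A sufficiently specific pattern stays within its own group:
    matches("http://www.example.com/docs/", "http://www.example.com/docs/intro.html")    # True
    # A pattern that is too short also captures URLs from other groups:
    matches("http://www.example.com/", "http://www.example.com/other-group/page.html")   # True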

  1. Expand Advanced rules.
  2. Add as many rules as required and for each:
    1. Specify a URL pattern.
    2. Define the action to take when crawling URLs that match this pattern:

      • Index and follow: Indexes the contents at this URL. The links found in the page are followed and crawled.
      • Index and don’t follow: Indexes the contents but ignores the hyperlinks found in the page.
      • Follow but don’t index: Follows the hyperlinks found in the pages at this URL but does not index the contents.
      • Index: Indexes the contents of the pages at this URL.
      • Follow: Follows the hyperlinks found in the pages at this URL. This can find content outside of this URL.
      • Don’t index: Does not index the contents at this URL.
      • Don’t follow: Ignores the hyperlinks found in the page.
      • Ignore: Ignores the defined URL completely.
      • Source: For compatibility with version 5.1, where several sources could be defined for the same crawler.
      • Add meta: Adds a meta as a key/value pair to flag the contents and hyperlinks of the pages at this URL.
      • Priority: If you define several crawl rules, you can sort their priority from the Priority select box. This changes the priority of URLs matching the pattern. See How Priorities Work.
      • Data model class: Allows you to specify the data model class of the documents pushed to the index.

    For example, we can crawl a single URL “http://www.example.com” and define the following patterns and actions (a sketch showing how these rules resolve follows the list):

    • http://www.example.com/ index and follow
    • http://www.example.com/test ignore
    • http://www.example.com/test/new index and follow
    • http://www.example.com/listings/ don’t index
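
    The following sketch shows how these rules could resolve for a few URLs, assuming prefix matching and last-matching-rule precedence as described in step 3 (illustrative only, not the crawler's actual implementation):

      RULES = [  # (pattern, action), in list order
          ("http://www.example.com/", "index and follow"),
          ("http://www.example.com/test", "ignore"),
          ("http://www.example.com/test/new", "index and follow"),
          ("http://www.example.com/listings/", "don't index"),
      ]

      def resolve(url):
          # The last rule whose pattern prefixes the URL wins.
          action = None
          for pattern, rule_action in RULES:
              if url.startswith(pattern):
                  action = rule_action
          return action

      resolve("http://www.example.com/about.html")       # index and follow
      resolve("http://www.example.com/test/old.html")    # ignore
      resolve("http://www.example.com/test/new/a.html")  # index and follow
      resolve("http://www.example.com/listings/1.html")  # don't index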

    Note: You can quickly check the effect of your crawl rules (whether they work or not) in the Test rules section. This is useful when rules are complex and you want to make sure that they do not break anything before applying the configuration to the crawler. Select Advanced mode to specify the expected behavior for each URL and test the rules without applying the configuration. Checkboxes turn green when the actual behavior matches the expected one; otherwise, they turn red.

  3. Use the up and down arrows on the right of the Actions field to sort the rules. Precedence is given to the last matching rule: when several patterns match a URL, the action of the last matching rule in the list applies.
  4. Expert users can also click Edit rules as xml to fine-tune rules manually. See Advanced Configuration.