Define the Groups of URLs to Crawl

The Crawler connector can have one or more groups containing root URLs.

Use Add group to organize your sources by topic. For example, you can create a news group to gather the source URLs of online newspapers.

See Also
Configuring the Crawler
Define Crawl Rules
  1. In the Groups section, click Add group to add a new group.
  2. In the URLs pane, enter the URLs of the sites, or of the parts of the sites, that you want to crawl.

    Select Site to crawl the entire site. If you crawl an entire site, there is no need to define crawl rules.

    - If the URL is http://foo.com/bar, the crawler crawls only pages whose URLs begin with http://foo.com/bar.

    - If the URL is http://foo.com, the crawler crawls only URLs beginning with http://foo.com/ or http://www.foo.com/.

    Note: If the home page is a redirection or a meta redirection to another site, the crawler follows it and crawls the destination site. Except for the home page, the crawler does not follow redirections to pages outside the site.
    If you want to restrict the crawl, clear the Site option and define crawl rules.
    Note: If you have used the Site option and then want to use advanced rules, clear the crawler documents first; otherwise, unexpected behavior may occur.

  3. If you define several source URLs, you can set their crawl priority from the Priority select box. See How Priorities Work.
  4. Optional: Add more groups and define the URLs to crawl.
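
The scope that the Site option implies in step 2 can be illustrated with a short sketch. This is not the connector's own code: the function names allowed_prefixes and in_site_scope are hypothetical, and the logic only mirrors the two prefix rules stated above (a root URL with a path matches by prefix; a bare-host root URL also matches the www. variant of the host). Redirection handling is not modeled.

```python
from urllib.parse import urlsplit


def allowed_prefixes(root_url: str) -> list[str]:
    """Return the URL prefixes considered in scope for a root URL.

    Hypothetical helper: it only mirrors the two rules described for
    the Site option, not the connector's actual implementation.
    """
    parts = urlsplit(root_url)
    if parts.path in ("", "/"):
        # Bare-host root such as http://foo.com: accept the host itself
        # and its "www." variant.
        host = parts.netloc
        prefixes = [f"{parts.scheme}://{host}/"]
        if not host.startswith("www."):
            prefixes.append(f"{parts.scheme}://www.{host}/")
        return prefixes
    # Root with a path such as http://foo.com/bar: plain prefix match.
    return [root_url]


def in_site_scope(root_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url begins with one of the allowed prefixes."""
    return any(candidate_url.startswith(p) for p in allowed_prefixes(root_url))


if __name__ == "__main__":
    for root, url in [
        ("http://foo.com/bar", "http://foo.com/bar/page.html"),
        ("http://foo.com/bar", "http://foo.com/other.html"),
        ("http://foo.com", "http://www.foo.com/news/index.html"),
        ("http://foo.com", "http://bar.com/index.html"),
    ]:
        print(root, url, in_site_scope(root, url))
```

A check like this can help you decide whether the Site option already covers the pages you need, or whether you should clear it and define crawl rules instead.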