Feed Fetcher Connector

This section describes what the Feed Fetcher connector does and how to configure it to crawl RSS, Atom, and RDF feeds.

This task shows you how to configure the connector.

Introducing the Feed Fetcher Connector

This section assumes that the connector has already been added.

See Creating a Standard Connector.

The General configuration is the same as for the Crawler connector. For more information, see Configuring the Crawler. In the following use case, we want to crawl two feeds and refresh them every hour.
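
The hourly refresh in this use case can be pictured with a small sketch. It is only an illustration of what a refresh period means, not the connector's actual scheduler; the `FeedState` class and the `is_due` and `due_feeds` functions are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class FeedState:
    """Hypothetical record for one configured feed."""
    url: str
    refresh_period_s: int  # plays the role of the "Refresh period(s)" setting
    last_fetch_s: float    # time of the last fetch, in seconds


def is_due(feed: FeedState, now_s: float) -> bool:
    """A feed is due once a full refresh period has elapsed since its last fetch."""
    return now_s - feed.last_fetch_s >= feed.refresh_period_s


def due_feeds(feeds: list[FeedState], now_s: float) -> list[str]:
    """Return the URLs of the feeds that should be fetched again at time now_s."""
    return [feed.url for feed in feeds if is_due(feed, now_s)]


# The two feeds of the use case, both refreshed every 3600 seconds.
feeds = [
    FeedState("http://www.example.com/feed1", 3600, last_fetch_s=0.0),
    FeedState("http://www.example.com/feed2", 3600, last_fetch_s=1800.0),
]
```

With these assumed timestamps, only the first feed is due at `now_s = 3600.0`; both are due at `now_s = 5400.0`.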

Configure the Connector

  1. In the General pane, use the default options.
  2. In the URLs pane, define the 2 RSS feeds to crawl.

    Parameter | Setting

    Feed | Enter the URLs of the feeds. For example, we can crawl the following two feeds: http://www.example.com/feed1 and http://www.example.com/feed2

    Refresh period(s) | Specify how often the feeds are refreshed, in seconds. For example, 3600 seconds (one hour).

    Actions | Choose how to crawl the feeds:

    • Index feed and article content: Indexes both what is displayed in the feed and the content of the article (by following the link).
    • Index feed content only: Indexes only what is displayed in the feed, NOT the content of the article.
    • Index metas quickly: Indexes metas first when crawling the feed, and then again when crawling the entire article. This is useful when you want article titles and summaries indexed very quickly. The drawbacks are that your index contains both partial and complete articles, and you may have to configure your Mashup UI to avoid displaying hits with empty content (those showing titles and summaries only).

    Extract image/video links | In addition to the media enclosures that are extracted by default, looks for links to images and links to YouTube or Dailymotion videos in the content of the feed items.

    Priority | If you define several RSS feeds, you can set their relative priority from the Priority select box. This changes the priority of URLs matching the pattern. See How Priorities Work.

  3. Click Apply.
  4. On the Home page, click Start crawl.
    The crawl of the 2 RSS feeds begins.
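
To make the difference between feed content, article content, and extracted media links more concrete, here is a minimal Python sketch that parses a small RSS 2.0 document with the standard library. It is not the connector's implementation: the sample feed, the regular expressions, the `parse_items` and `urls_to_fetch` functions, and the action names `feed_only` and `feed_and_article` are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample feed; URLs and the video id are placeholders.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First article</title>
      <link>http://www.example.com/articles/1</link>
      <description>Summary with &lt;img src="http://www.example.com/img/a.png"&gt; and a clip at https://www.youtube.com/watch?v=abc123</description>
      <enclosure url="http://www.example.com/media/a.mp3" type="audio/mpeg" length="1000"/>
    </item>
  </channel>
</rss>"""

# Simplified patterns (assumptions): image URLs by file extension,
# video URLs by YouTube or Dailymotion host name.
IMAGE_RE = re.compile(r'https?://[^\s"\'<>]+\.(?:png|jpe?g|gif)', re.IGNORECASE)
VIDEO_RE = re.compile(
    r'https?://(?:www\.)?(?:youtube\.com|dailymotion\.com)/[^\s"\'<>]+',
    re.IGNORECASE,
)


def parse_items(rss_text: str) -> list[dict]:
    """Collect, for each feed item, what the feed itself exposes."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        items.append({
            "title": item.findtext("title", default=""),
            "summary": description,                            # feed content
            "article_url": item.findtext("link", default=""),  # link to the full article
            "enclosures": [e.get("url") for e in item.findall("enclosure")],
            "image_links": IMAGE_RE.findall(description),
            "video_links": VIDEO_RE.findall(description),
        })
    return items


def urls_to_fetch(item: dict, action: str) -> list[str]:
    """Hypothetical action names: 'feed_only' never follows the article link."""
    if action == "feed_only":
        return []
    return [item["article_url"]]


items = parse_items(SAMPLE_RSS)
```

In this sketch, `feed_only` corresponds to indexing just the title and summary carried by the feed, whereas `feed_and_article` would also fetch the page behind `article_url`. The enclosure is read from the feed markup itself, while the image and video links are found by scanning the item text.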