Smart refresh
|
When enabled, the crawler continuously goes through the list
of already crawled URLs and adds them to the crawling queue so they are
refreshed.
URLs that have been refreshed or crawled recently, that is, URLs that have not
yet crossed the Min age threshold, are not added to the
crawling queue immediately.
The URLs added by the refresh loop have a lower priority than
the new URLs discovered during the crawl, so the refresh process does not slow
down the discovery of new pages (see the sketch after this entry). For more details, see
How Priorities Work.
|
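As a rough illustration of the queueing behavior described above, refresh entries can be pictured as sitting behind newly discovered URLs in a priority queue. The numeric priority levels below are hypothetical; the connector's internal values are not documented here.

```python
import heapq

# Hypothetical priority levels: lower number = dequeued first.
PRIORITY_NEW_URL = 0   # URLs discovered during the crawl
PRIORITY_REFRESH = 1   # already crawled URLs re-queued by smart refresh

queue = []  # min-heap of (priority, url)

def enqueue_new(url):
    heapq.heappush(queue, (PRIORITY_NEW_URL, url))

def enqueue_refresh(url):
    heapq.heappush(queue, (PRIORITY_REFRESH, url))

enqueue_refresh("http://foo.com/old-page")
enqueue_new("http://foo.com/new-page")

# Newly discovered pages are dequeued before refresh entries.
while queue:
    priority, url = heapq.heappop(queue)
    print(priority, url)
```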
Min age /
Max age
|
Documents older than Min age and younger than
Max age are automatically refreshed depending on their
update frequency. Documents that change often are refreshed more frequently than
documents that never change.
In other words, documents are automatically refreshed
according to their update frequency, but never faster than
Min age, and never slower than
Max age (see the sketch after this entry).
Reasonable values for a small intranet site are min = 1 hour,
max = 1 week. Reasonable values for a large web crawl are much higher, for example a
minimum of 1 day and a maximum of 6 months to 1 year.
|
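The clamping rule above can be summarized in a few lines. The estimated change interval is a placeholder for whatever update-frequency estimate the crawler maintains; only the min/max bounds come from the settings.

```python
HOUR, WEEK = 3600, 7 * 24 * 3600  # the "small intranet" example: min = 1 hour, max = 1 week

def refresh_interval(estimated_change_interval, min_age=HOUR, max_age=WEEK):
    """Refresh according to update frequency, but never faster than min age
    and never slower than max age (all values in seconds)."""
    return min(max(estimated_change_interval, min_age), max_age)

print(refresh_interval(10 * 60))         # changes every 10 minutes -> 3600 (1 hour)
print(refresh_interval(30 * 24 * 3600))  # changes monthly -> 604800 (1 week)
```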
No. of crawler threads
|
Specifies the number of crawler threads to run simultaneously.
If the crawler overloads your network, specify fewer threads.
|
Web server throttle time (ms)
|
Specifies the delay between queries, in milliseconds.
By default, the connector sends one query at a time to a given
HTTP server, to avoid overloading it (see the sketch after this entry).
Note:
The throttle time is ignored if you select the Aggressive
mode. However, Aggressive mode can generate a heavy load on Web servers. Use it
only when crawling your own servers.
|
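A minimal sketch of per-server throttling, assuming a single-threaded fetch loop and a hypothetical wait_for_host helper; the connector applies this setting internally.

```python
import time
from urllib.parse import urlsplit

THROTTLE_MS = 1000   # example value of "Web server throttle time (ms)"
_last_request = {}   # host -> time of the last request sent to it

def wait_for_host(url):
    """Sleep so that two consecutive queries to the same host are
    at least THROTTLE_MS milliseconds apart."""
    host = urlsplit(url).netloc
    last = _last_request.get(host)
    if last is not None:
        remaining = THROTTLE_MS / 1000.0 - (time.monotonic() - last)
        if remaining > 0:
            time.sleep(remaining)
    _last_request[host] = time.monotonic()

wait_for_host("http://foo.com/page-1")
wait_for_host("http://foo.com/page-2")   # waits roughly one second
wait_for_host("http://bar.com/page-1")   # different host, no wait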
Ignore robots.txt
|
Ignores the robots.txt file that may be defined on the site you
want to crawl (see the sketch after this entry).
Caution! By default, the crawler follows the rules defined in
robots.txt when crawling the HTTP server.
|
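To picture what the default behavior amounts to, here is a sketch using Python's standard robots.txt parser. The ignore_robots_txt flag mirrors this option; the function itself is only an illustration, not part of the connector.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def allowed(url, user_agent="*", ignore_robots_txt=False):
    """Check whether robots.txt permits fetching the URL."""
    if ignore_robots_txt:
        return True                      # skip the check entirely
    parts = urlsplit(url)
    robots = RobotFileParser()
    robots.set_url(urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", "")))
    robots.read()                        # downloads and parses robots.txt
    return robots.can_fetch(user_agent, url)

print(allowed("http://foo.com/toto"))                          # obeys robots.txt
print(allowed("http://foo.com/toto", ignore_robots_txt=True))  # always True
```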
Detect near duplicates
|
Detects similar documents.
When a document is detected as identical to a previously crawled
page (the content is exactly the same), it is not indexed and its
links are not followed. When it is only similar to a previously
crawled document, it is indexed but its links are not followed
(see the sketch after this entry).
This avoids looping on infinitely growing URLs with duplicate content, and
speeds up the crawl on sites where the same content can be accessed from different URLs.
|
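The decision logic can be pictured as follows. The exact-hash and shingle-similarity tests below are stand-ins for the purpose of illustration, not the connector's actual algorithm.

```python
import hashlib

def shingles(text, k=5):
    """Sets of overlapping k-word sequences, used to compare documents."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def duplicate_decision(content, seen_hashes, seen_shingles, threshold=0.9):
    """Return (index, follow) flags for a newly fetched document."""
    digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False, False        # identical content: do not index, do not follow
    current = shingles(content)
    if any(jaccard(current, s) >= threshold for s in seen_shingles):
        return True, False         # near duplicate: index, but do not follow links
    seen_hashes.add(digest)
    seen_shingles.append(current)
    return True, True              # genuinely new content: index and follow
```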
Detect patterns
|
Detects sites based on specific patterns, such as CMS-generated sites, parking
sites, spam, and adult content. It then applies the specific crawl rules defined for
these sites. For example, it can detect search or login pages on forums that the
connector must not crawl.
|
Convert processor
|
Searches for links in binary documents.
To save CPU and memory, disable this option when crawling
sites where there are no new links in binary documents.
|
Site collapsing
|
Activates site collapsing.
By default, the site collapsing ID is based on the host part of the URL. All URLs
belonging to the same host share the same ID and are shown as one collapsed result.
You can configure the site ID to include more segments of the URL path, to
distinguish sites hosted on the same host. To do so, specify the recursion depth with
the Depth parameter. The default value, 0, corresponds to the
host part of the URL (see the sketch after this entry).
For more details, see
About Site Collapsing.
|
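A sketch of how a collapsing ID could be derived from the host plus the first Depth path segments; the function name and the returned format are illustrative only.

```python
from urllib.parse import urlsplit

def site_id(url, depth=0):
    """Host part of the URL, optionally extended with the first `depth` path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s][:depth]
    return "/".join([parts.netloc] + segments)

print(site_id("http://foo.com/blogs/alice/post-1"))           # foo.com
print(site_id("http://foo.com/blogs/alice/post-1", depth=2))  # foo.com/blogs/alice
print(site_id("http://foo.com/blogs/bob/post-7", depth=2))    # foo.com/blogs/bob (distinct site)
```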
Default behavior
|
Allows you to specify a combination of index and follow rules
for URLs that do not have any rules (see the sketch after this entry).
- Accept – crawls URLs that do not match any rules. Beware,
this can crawl the entire internet.
- Index – indexes the contents of the documents at this URL.
- Follow – follows the hyperlinks found in the documents at
this URL. This finds content outside of this URL. Beware, it can crawl the entire
internet.
- Follow roots – follows root URLs that do not match any rules.
|
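One possible reading of how the four check boxes combine for a URL that matches no rule is sketched below. The precedence shown here is an assumption for illustration, not a description of the connector's internals.

```python
# Hypothetical defaults mirroring the check boxes above.
DEFAULTS = {"accept": True, "index": True, "follow": False, "follow_roots": True}

def default_decision(is_root, defaults=DEFAULTS):
    """Return (crawl, index, follow) for a URL that matches no rule."""
    if not defaults["accept"]:
        return False, False, False                 # URL is skipped entirely
    follow = defaults["follow"] or (is_root and defaults["follow_roots"])
    return True, defaults["index"], follow

print(default_decision(is_root=False))   # (True, True, False)
print(default_decision(is_root=True))    # (True, True, True) because of Follow roots
```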
Default data model class
|
Allows you to specify the class in which the documents must be indexed for URLs
that do not have any data model class rules.
|
License document type
|
Select the type of documents that this connector produces.
|
Blacklisted session ID
|
Allows you to specify the session IDs to exclude
(blacklist) from matching URLs.
Many sites use session IDs that randomly add parameters to URLs.
This is troublesome because it can create different links for the
same pages.
For example, if you want to blacklist the parameter s from URLs, enter
s in the field. Then, if the crawler finds a link to:
http://foo.com/toto?s=123456789&t=2
... it crawls:
http://foo.com/toto?t=2
A sketch of this rewriting follows this entry.
|
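The URL rewriting in the example above can be reproduced with the standard library; the BLACKLISTED set and the helper name are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

BLACKLISTED = {"s"}   # session-ID parameters to drop, as in the example above

def strip_session_ids(url, blacklisted=BLACKLISTED):
    """Remove blacklisted query parameters so equivalent URLs collapse to one."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in blacklisted]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_session_ids("http://foo.com/toto?s=123456789&t=2"))
# -> http://foo.com/toto?t=2
```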