The crawler uses the fetcher, but it is also the product's generic fetcher, used to generate
previews, thumbnails, and direct downloads from the Mashup UI.
The fetcher configuration, defined in the Administration Console
> CRAWLER connector > Fetcher tab, is also used by the
fetchapi remote API, which is called by the Mashup UI for downloads, thumbnails, and previews.
You can define several fetcher configurations. By default, in the Administration Console, each new crawler is associated with a new fetcher configuration.
HTTP Common Parameters
Cookies are always processed, but never stored by default. They are,
however, automatically enabled, stored, and sent back on sites configured with HTML form
authentication. The global Enable cookies for all sites parameter
allows you to store and send cookies for all sites. Only cookies that follow the standard
privacy rules are stored, as in a web browser.
By default, the User agent identifies the fetcher as a Mozilla-compatible
Exalead CloudView. When crawling the public internet, customize it with a URL pointing to a page
that describes the crawler and how to contact the administrator.
The From field is empty by default, and can contain a contact email
address, which is useful when crawling the public internet.
You can add any HTTP header to the default request. The default
configuration only contains Accept: */*, which is required
by some sites.
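For illustration only, the hedged Python sketch below shows what such a request looks like from a generic HTTP client; the header values are hypothetical placeholders for the User agent, From, and additional header settings, and this is not the fetcher's own code:

```python
import requests

# Hypothetical values standing in for the fetcher configuration.
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MyCrawler/1.0; +https://example.com/crawler-info)",
    "From": "crawl-admin@example.com",  # contact address, empty by default
    "Accept": "*/*",                    # default additional header, required by some sites
}

response = requests.get("https://example.com/some/page", headers=headers)
print(response.status_code)
```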
Conditional Fetch
The fetcher uses the Last-Modified date and ETag
HTTP headers to perform conditional fetches.
The crawler uses this mechanism when refreshing pages, to avoid fetching unchanged
documents: if the document has not changed, the server only answers 304 Not Modified.
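To illustrate the mechanism (not the fetcher's internal code), this hedged Python sketch re-fetches a placeholder URL with If-None-Match and If-Modified-Since headers built from a previous response, and detects the 304 Not Modified answer:

```python
import requests

url = "https://example.com/some/page"  # placeholder URL

# First fetch: remember the validators returned by the server.
first = requests.get(url)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# Refresh: send the validators back as a conditional request.
conditional_headers = {}
if etag:
    conditional_headers["If-None-Match"] = etag
if last_modified:
    conditional_headers["If-Modified-Since"] = last_modified

second = requests.get(url, headers=conditional_headers)
if second.status_code == 304:
    print("Not modified: keep the previously fetched document")
else:
    print("Changed: process the new content")
```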
HTTP Authentication
The fetcher can access pages protected through different kinds of authentication protocols:
- Basic only requires a username and a password. They are sent as plain text in the HTTP request.
- Digest requires a username, a password, and an optional realm. Unlike Basic, the Digest protocol avoids sending the password as plain text.
- NTLM requires a username, a password, and either a hostname or a domain.
Configure URL patterns to tell the fetcher which sites the login information applies to.
They have the same syntax and meaning as the patterns used in the crawler configuration.
They must match only the URLs where the login information is required.
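To make the difference between these protocols concrete, here is a hedged Python sketch based on the requests library; the URLs and credentials are placeholders, and this is not the fetcher's implementation:

```python
import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

# Basic: the username and password travel as plain text (base64-encoded) in the request.
basic_resp = requests.get(
    "https://intranet.example.com/basic/protected.html",
    auth=HTTPBasicAuth("crawl_user", "secret"),
)

# Digest: the password itself is never sent; a hash computed against the
# server's challenge (and optional realm) is sent instead.
digest_resp = requests.get(
    "https://intranet.example.com/digest/protected.html",
    auth=HTTPDigestAuth("crawl_user", "secret"),
)

print(basic_resp.status_code, digest_resp.status_code)

# NTLM additionally needs a hostname or domain; with the third-party
# requests_ntlm package the call would look roughly like:
#   from requests_ntlm import HttpNtlmAuth
#   requests.get(url, auth=HttpNtlmAuth("DOMAIN\\crawl_user", "secret"))
```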
HTML Form Authentication
The fetcher also supports authentication using HTML forms. This applies to any website
where the login procedure involves filling in an HTML form and submitting it through a
POST request.
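For orientation, the hedged Python sketch below shows what such a form login looks like from a generic HTTP client: a session POSTs the form fields, receives a session cookie, and then fetches protected pages with that cookie. The URL and field names are hypothetical; the fetcher performs the equivalent steps transparently from its configuration:

```python
import requests

session = requests.Session()  # keeps authentication cookies between requests

# Hypothetical login form: the URL and field names must match the site's HTML form.
login_form = {
    "username": "crawl_user",
    "password": "secret",
}
session.post("https://example.com/login/login.html", data=login_form)

# Subsequent requests carry the session cookie and return authenticated content.
page = session.get("https://example.com/private/report.html")
print(page.status_code)
```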
What You Must Know
To properly configure the fetcher to authenticate using an HTML form, gather the
following information:
- Find patterns to identify which URLs require authentication. Authentication is sometimes not required, or not relevant, to retrieve some parts of a website, for example static resources or images. URL patterns are the same as for standard HTTP authentication (see above), and the same recommendations apply: they work the same way as crawl rule patterns, and support the same syntax.
- Describe how to distinguish a successfully authenticated page from an unauthenticated one. This authentication test requires a success/failure condition, and is applied to all crawled URLs matching the above patterns. You can either choose a characteristic of the failed authentication page, to make a "failure condition", or a characteristic of a successfully authenticated page, to make a "success condition". See below for examples.
- Find the following information to authenticate:
Write a Correct Success/Failure Condition
A good condition must return:
- Failure if and only if the fetch result shows that you are not authenticated.
- Success if and only if the fetch result shows that you have successfully
authenticated.
Body text match condition
- Finding a string like Welcome $USER in the content of the page may be a good success criterion.
- Finding a string like Please authenticate to access this document would be a good failure test.
Redirection condition
- The fetcher only tests whether the page contains a redirection to another page; it does not follow the redirection (a page and its redirection are two different documents).
- Many sites redirect unauthenticated requests to a login page. Assuming the first redirection destination is /login/login.html, a good condition would be Failure if redirection matches /login/login.html. Matching only the presence of a redirection is usually a bad idea: sites often contain lots of invisible redirections, even when authenticated.
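As an illustration only, the hedged Python sketch below evaluates such conditions on a fetched response; the strings and the /login/login.html path come from the examples above, and this is not the fetcher's actual logic:

```python
import requests

def is_authenticated(response):
    """Return True (success) or False (failure) using the example conditions."""
    # Redirection condition: inspect the redirect target without following it.
    if response.is_redirect and "/login/login.html" in response.headers.get("Location", ""):
        return False
    # Body text match conditions.
    if "Please authenticate to access this document" in response.text:
        return False
    if "Welcome crawl_user" in response.text:  # placeholder for Welcome $USER
        return True
    # Default when neither marker is found; a real condition should be decisive.
    return True

resp = requests.get("https://example.com/private/report.html", allow_redirects=False)
print("authenticated" if is_authenticated(resp) else "need to re-authenticate")
```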
Limitations: login procedures requiring JavaScript are not supported, except in the
simple case of auto-submitted forms (usually used to transfer cookies or session IDs
across domains without setting large query parameters).
Common Misconceptions
- The authentication process and the authentication test (the Success/Failure condition) are two independent mechanisms. The authentication test condition is evaluated every time a document is fetched, but the authentication itself is executed only when the condition returns a failure.
- The crawler and the fetcher are independent. The crawler does not know about authentication; the process is transparent to it. The fetcher is like a black box, accepting URLs as input and returning authenticated documents. This means that login pages must never be indexed. If they are, your authentication configuration is wrong: either the authentication process has failed and the fetcher is not authenticated, or the condition erroneously always returns success (and never triggers the authentication).
- Cookies are automatically enabled and handled; there is no need to enable them manually. Use the Enable cookies for all sites option only if an unauthenticated site requires them. Cookies are handled transparently by the fetcher, which behaves as much as possible like a standard web browser.