Configure the Fetcher
- In the Fetcher tab, expand More options.
- Select Enable cookies for all sites to enable cookies for every site, not only for sites authenticated with an HTML form and a cookie.
- In User agent, enter the User-Agent header string to send in HTTP requests.
- In From, enter the email address to send in the From HTTP header.
- In Proxy address, enter the proxy server address.
- Define your proxy server credentials in Proxy username and Proxy password.
- Define the timeout parameters:
  - Read timeout (s): The read timeout, in seconds.
  - Write timeout (s): The write timeout, in seconds.
  - Connect timeout (s): The connect timeout, in seconds.
  - Download timeout (s): The maximum connection time, in seconds, before a download is canceled.
- Click Apply.
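The settings above correspond to ordinary HTTP client options. As a rough illustration only, and not a description of how the Fetcher is implemented, the following Python sketch uses the standard urllib.request module to build a request that carries User-Agent and From headers and an opener that routes traffic through a proxy. The function name, crawler name, email address, and proxy address are all placeholder assumptions.

```python
import urllib.request


def build_opener_and_request(url, user_agent, from_email, proxy_address=None):
    """Build a request with User-Agent and From headers, plus an opener.

    When proxy_address is given, the opener routes HTTP and HTTPS
    traffic through that proxy. Nothing is sent over the network here.
    """
    request = urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "From": from_email},
    )
    handlers = []
    if proxy_address:
        handlers.append(
            urllib.request.ProxyHandler(
                {"http": proxy_address, "https": proxy_address}
            )
        )
    opener = urllib.request.build_opener(*handlers)
    return opener, request


# Placeholder values, chosen for illustration only.
opener, request = build_opener_and_request(
    "http://example.com/",
    "ExampleCrawler/1.0",
    "crawler-admin@example.com",
    "http://proxy.example.com:3128",
)
# opener.open(request, timeout=30) would actually send the request.
```

Note that urllib accepts a single timeout value per call, so the separate read, write, connect, and download timeouts of the Fetcher have no one-to-one equivalent in this sketch.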
Crawl Secured Sites
This subsection describes the parameters that you can configure to crawl secured sites,
from Fetcher > Rules Configuration.
Some configuration parameters can be rule-based. Each configuration
whose rules match the URL to be fetched is applied.
Ensure that there are no conflicts in the definition of your Fetcher configuration rules.
For example, do not define two different authentication schemes with overlapping rules. Each
rule-based configuration depends on the type of rule.
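To make the risk of overlapping rules concrete, here is a minimal Python sketch. It assumes that a rule matches a URL when the URL starts with one of the rule's prefixes; the exact matching semantics of the Fetcher are not described here, so treat the Rule class and both helper functions as hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rule:
    name: str
    rule_type: str       # for example "Authentication"
    url_prefixes: tuple  # site prefixes the rule applies to


def matching_rules(rules, url):
    """Return every rule that has at least one prefix matching the URL."""
    return [r for r in rules if any(url.startswith(p) for p in r.url_prefixes)]


def overlapping_pairs(rules, rule_type):
    """Return name pairs of same-type rules whose prefixes overlap."""
    same_type = [r for r in rules if r.rule_type == rule_type]
    conflicts = []
    for a, b in combinations(same_type, 2):
        if any(
            p.startswith(q) or q.startswith(p)
            for p in a.url_prefixes
            for q in b.url_prefixes
        ):
            conflicts.append((a.name, b.name))
    return conflicts


# Two authentication rules whose prefixes overlap (illustrative names).
rules = [
    Rule("basic-private", "Authentication", ("http://example.com/private/",)),
    Rule("form-site", "Authentication", ("http://example.com/",)),
]
```

With these two rules, a URL under http://example.com/private/ matches both of them, which is the kind of conflict between authentication schemes that the text warns against.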
Choose an Authentication Type
From the Auth type select box, you can choose one of the
following authentication types.
Auth type | Requires
Basic | Valid user name; clear text password
Digest | Valid user name; clear text password; realm
Ntlm | Valid user name; clear text password; NTLM (Microsoft Windows NT LAN Manager) domain and host to authenticate users to Microsoft Windows 2000 domains
Form | A login through an HTML authentication form with cookie handling
Configure Authentication Parameters
Once an authentication type is selected, configure its related parameters. Below
are example procedures for the Basic and Form
authentication types.
Configure Basic Authentication
- Click Add rule.
  - Enter a rule name, for example user.
  - For Type, select Authentication.
  - Click Accept.
- From Auth type, select Basic.
- In Username, enter a valid user name for the authentication type.
- In Password, enter a valid password for the user name.
- In URL patterns, enter the site prefix to be authenticated. For example, http://example.com/private/basic
- Click Add pattern to enter additional site prefixes using the current authentication configuration.
- Click Apply.
The Digest and Ntlm authentication types are configured in a similar way. For
Digest, also define the authentication realm. For Ntlm, also define the
domain name and the host name.
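For background on what Basic authentication sends over the wire, the sketch below builds the Authorization header defined by the HTTP Basic scheme: the user name and password are joined with a colon and Base64-encoded, which is an encoding and not encryption. The function name is hypothetical, and the sketch does not describe how the Fetcher itself builds its requests.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of the Authorization header for HTTP Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Sample credentials taken from the HTTP Basic authentication RFC.
header_value = basic_auth_header("Aladdin", "open sesame")
```

Because the encoded value can be decoded by anyone who sees it, Basic authentication is best restricted to URL patterns served over HTTPS.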
Configure Form Authentication
- Click Add rule.
  - Enter a rule name, for example user.
  - For Type, select Authentication.
  - Click Accept.
- From Auth type, select Form.
- In Login page URL, enter the URL of the page that contains the authentication form.
- If the login page contains several HTML forms, identify the form to submit by its id, class, or name in Form Id, Form Class, or Form Name. These parameters are not required if the page contains a single form.
- From the Form fields pane, configure the fields required to authenticate on the authentication form, using the Name and Value fields. Specify the input name of the field, not its label.
  Note:
  The best way to find input names is to open the HTML source of the page, search for input nodes with the attributes type="text" and type="password", and copy the values of their name="<input name>" attributes.
  Click Add form field for additional fields, or Add encrypted form field if required.
  For example, you can add:
  - username – John
  - password – ******
  - lang – EN
  Note:
  These fields are case-sensitive.
- The Crawler Connector must know whether a fetched page is successfully authenticated or not. This is expressed with a condition based on Authentication success or Authentication failure. Pages that fail the test trigger a new authentication and are fetched again.
  Select the Success/Failure condition based on authentication success or failure:
  - if a redirection occurs
  - if status is <value>
  - if header with name <name> is present or equals <value>
  - if body contains <value>
  Or click Edit condition as xml (experts only) to enter a complex condition.
  Examples:
  - Set the condition to authentication failure if the redirection URL matches "/login.php", that is, if the authentication fails and the site redirects you to the login page.
  - For a private forum, you can set the condition to: authentication failure if body contains "Please login" OR authentication success if body contains "Welcome John".
- In URL patterns, enter the site prefixes to be authenticated. For example, http://www.example.com/.
  Note:
  This might seem redundant with the URL specified in the Configuration tab, but the fetcher needs to know which URLs of the site require authentication.
- Click Add pattern for additional site prefixes.
- Click Apply.
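The note in the Form fields step suggests reading the page source to find input names. That lookup can be scripted; the following Python sketch uses the standard html.parser module to list the name attributes of text and password inputs. The class name, function name, and sample login form are invented for illustration, and an input without a type attribute is treated as a text input, which is the HTML default.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collect (name, type) pairs for text and password inputs."""

    def __init__(self):
        super().__init__()
        self.fields = []  # pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        # An input without a type attribute defaults to "text" in HTML.
        input_type = (attributes.get("type") or "text").lower()
        name = attributes.get("name")
        if name and input_type in ("text", "password"):
            self.fields.append((name, input_type))


def input_names(html):
    """Return the named text and password inputs found in the HTML."""
    collector = InputNameCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical login page source.
SAMPLE = """
<form id="login" action="/login.php" method="post">
  <label for="u">User</label> <input id="u" type="text" name="username">
  <label for="p">Password</label> <input id="p" type="password" name="password">
  <input type="submit" value="Log in">
</form>
"""

fields = input_names(SAMPLE)
```

The names returned (here username and password) are the values to enter in the Name column of the Form fields pane, with the same letter case as in the page source.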
About Success/Failure Conditions
The Crawler Connector must know whether a fetched page is successfully authenticated or not. This is
expressed with a condition based on authentication success or
authentication failure. Pages that fail the test trigger a new
authentication and are then fetched again.
Condition | Description
redirection | Redirections include HTTP status codes 301, 302, 303, and 307. The connector can also test redirections to a URL containing a specific string.
status | Checks the HTTP status code for a specific value. For example, you can specify the 200 OK status to indicate that pages are considered as authenticated.
header with name | Checks for a specific header in the HTTP response. is present: checks whether the header is present. contains: checks whether the header is present with a given value.
body contains | Looks for a specific string in the body of the HTTP response. Applies only to non-empty documents with text/* content types. You can test for a string found either on non-authenticated pages (for example, a login form or prompt) or on authenticated pages (for example, "welcome $username" or a logout form).
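The conditions in the table can be read as simple predicates over an HTTP response. The Python sketch below is one possible interpretation, not the connector's actual implementation: the Response class and all function names are hypothetical, and the header test covers only the "is present" case.

```python
from dataclasses import dataclass, field

# Status codes that the table lists as redirections.
REDIRECT_STATUSES = {301, 302, 303, 307}


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def get_header(response, name):
    """Case-insensitive header lookup; returns None when absent."""
    for key, value in response.headers.items():
        if key.lower() == name.lower():
            return value
    return None


def is_redirection(response, url_contains=None):
    """True for a redirect, optionally only to a URL containing a string."""
    if response.status not in REDIRECT_STATUSES:
        return False
    if url_contains is None:
        return True
    return url_contains in (get_header(response, "Location") or "")


def has_status(response, value):
    return response.status == value


def header_present(response, name):
    return get_header(response, name) is not None


def body_contains(response, value):
    """Applies only to non-empty documents with a text/* content type."""
    content_type = get_header(response, "Content-Type") or ""
    return (
        bool(response.body)
        and content_type.startswith("text/")
        and value in response.body
    )
```

For the private forum example, authentication failure would correspond to body_contains(response, "Please login"), and authentication success to body_contains(response, "Welcome John").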