The deterministic refresh mode is available through a MAMI command accessible by cvcommand.
The command is called RefreshDocuments and can take two parameters:
- maxAgeS allows you to select only the URLs that have not been refreshed for a period of time (in seconds),
- prefix allows you to select only the URLs matching a prefix. This prefix is not a regular expression.
The command extracts all matching URLs from the crawler storage and adds them to the fifos
(the persistent structure containing the URLs to crawl). You can use the product scheduler
to run this command periodically.
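As a rough illustration of the selection described above, the following Python sketch filters an in-memory table of URLs by age and by literal prefix, then appends the matches to a queue. It is not the crawler's implementation: the select_for_refresh function, the last_refreshed mapping, and the deque standing in for the fifos are all hypothetical, and treating maxAgeS as a strict "older than" comparison is an assumption.

```python
from collections import deque


def select_for_refresh(last_refreshed, now, max_age_s=None, prefix=None):
    """Return the URLs that a refresh with these parameters would pick.

    last_refreshed maps URL -> time of last refresh (in seconds).
    Both filters are optional; omitting one disables it.
    """
    selected = []
    for url, refreshed_at in sorted(last_refreshed.items()):
        # Age filter (assumed strict): skip URLs refreshed within max_age_s.
        if max_age_s is not None and now - refreshed_at <= max_age_s:
            continue
        # Prefix filter: plain string comparison, not a regular expression.
        if prefix is not None and not url.startswith(prefix):
            continue
        selected.append(url)
    return selected


# Hypothetical storage content; times are seconds since an arbitrary origin.
storage = {
    "http://www.example.com/a": 0,        # refreshed two days ago
    "http://www.example.com/b": 150_000,  # refreshed recently
    "http://wwwXexample.com/c": 0,        # would match only if "." were a wildcard
}
now = 172_800

fifo = deque()  # stand-in for the persistent fifos
fifo.extend(select_for_refresh(storage, now, max_age_s=86_400,
                               prefix="http://www.example.com/"))
print(list(fifo))
```

Only the first URL ends up in the queue: the second is too recent, and the third does not start with the literal prefix even though it would match if the dot were read as a regular-expression wildcard.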
Note:
This refresh can only be guaranteed if the crawler has enough computing and
network resources to crawl all refreshed URLs before the next refresh. The number of URLs
to refresh on a given host and the crawl delay (2.5 seconds by default) can also limit the
refresh rate.
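To see how the crawl delay bounds the refresh rate, here is a back-of-the-envelope estimate in Python. It assumes, for illustration only, that the URLs of one host are fetched strictly one at a time, separated by the crawl delay, and it ignores download time. Both helper functions are hypothetical and not part of the product.

```python
def min_refresh_seconds(url_count, crawl_delay_s=2.5):
    """Lower bound on the time needed to crawl url_count URLs on one host."""
    return url_count * crawl_delay_s


def max_urls_per_period(period_s, crawl_delay_s=2.5):
    """Upper bound on the URLs of one host that fit in one refresh period."""
    return int(period_s // crawl_delay_s)


day = 86_400  # seconds
print(max_urls_per_period(day))            # 34560 URLs per host per day
print(min_refresh_seconds(100_000) / day)  # about 2.9 days for 100,000 URLs
```

Under these assumptions, a daily refresh of a single host cannot cover more than 34,560 URLs with the default delay, so a larger host needs a longer refresh period or a shorter crawl delay.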
In the following example, the command refreshes all pages that are older than a day on example.com:
./bin/cvcommand :$MAMIPORT /mami/crawl refreshDocuments crawlerName=$CRAWLER maxAgeS=86400 prefix=http://www.example.com/
Output in run/crawler-exa0/log.log:
[2019/07/11-15:52:34.164] [info] [exa.bee.BeeJob 1127] [crawler.feedfetcher]Refreshing 8756 out of 10000 documents in prefix <http://www.example.com/>
[2019/07/11-15:52:38.781] [info] [exa.bee.BeeJob 1127] [crawler.feedfetcher]Refreshing 17216 out of 20000 documents in prefix <http://www.example.com/>
[2019/07/11-15:52:49.790] [info] [exa.bee.BeeJob 1127] [crawler.feedfetcher]Refreshing 25782 out of 30000 documents in prefix <http://www.example.com/>
...
[2019/07/11-15:54:35.684] [info] [exa.bee.BeeJob 1127] [crawler.feedfetcher]Refreshed 145344 out of 172549 documents in prefix <http://www.example.com/>
Note:
In this log, the last message means that the URLs have been scheduled for refresh, but
they are not necessarily refreshed yet.
The URLs to refresh are added to a fifo that, by default, uses the highest priority. See
How Priorities Work.
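The progress messages in the log excerpt above follow a regular pattern, so a small script can extract the counts from them. The Python sketch below assumes that each message sits on a single line in the actual log file, and it reads "N out of M" as the number of URLs selected out of the number examined, which is an interpretation rather than a documented meaning. The parse_progress helper is hypothetical.

```python
import re

# Matches the "Refreshing"/"Refreshed" messages shown in the log excerpt.
PROGRESS = re.compile(
    r"\[crawler\.feedfetcher\]\s*(Refreshing|Refreshed)\s+"
    r"(\d+) out of (\d+) documents in prefix <([^>]*)>"
)


def parse_progress(line):
    """Return (finished, selected, examined, prefix), or None if no match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    verb, selected, examined, prefix = match.groups()
    # "Refreshed" marks the final message: scheduling is done, crawling is not.
    return verb == "Refreshed", int(selected), int(examined), prefix


line = ("[2019/07/11-15:54:35.684] [info] [exa.bee.BeeJob 1127] "
        "[crawler.feedfetcher]Refreshed 145344 out of 172549 documents "
        "in prefix <http://www.example.com/>")
print(parse_progress(line))
```

A True value in the first position only tells you that all matching URLs have been scheduled. To follow the crawl itself, use the monitoring options described next.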
You can check the refresh progress in:
- The Monitoring Console (in <HOST> > Services > Exalead > Connectors > CRAWLER NAME)
- The Crawler connector > Operations & Inspection tab, if you observe the source.0.size crawler probe values.
You can also run a deterministic refresh job periodically (at fixed times, dates, or
intervals) by editing the <DATADIR>/config/Scheduling.xml file and specifying a Quartz
cron expression. Exalead CloudView automatically executes the MAMI commands defined in
this XML file.
In the following example, the refresh_crawler command uses a cron expression
that runs the crawler refresh at "0 0 0 * * ?", meaning daily at midnight.
You can specify several schedules by adding as many <TriggerConfigGroup> elements
as required.
<master:SchedulingConfig xmlns="exa:com.exalead.mercury.mami.master.v10">
  <master:JobConfigGroup name="refresh_crawlers">
    <master:DispatchJobConfig name="refresh_crawler">
      <master:DispatchMessage xmlns="exa:exa.bee" messageName="refreshDocuments"
          serviceName="/mami/crawl">
        <master:messageContent>
          <master:KeyValue key="crawlerName" value="CRAWLERNAME" />
          <master:KeyValue key="maxAgeS" value="3600" /> <!-- optional, don't refresh
              documents younger than 1 hour, else all documents regardless of age -->
          <master:KeyValue key="prefix" value="URLPREFIX" /> <!-- optional, only
              refresh documents with URL beginning with prefix, else all documents -->
        </master:messageContent>
      </master:DispatchMessage>
    </master:DispatchJobConfig>
  </master:JobConfigGroup>
  <master:TriggerConfigGroup name="exa">
    <master:CronTriggerConfig cronExpression="0 0 0 * * ?" misfireInstruction="do_nothing"
        jobName="refresh_crawler" jobGroupName="refresh_crawlers"
        name="refresh_crawlers" calendarName="dummy_calendar" />
  </master:TriggerConfigGroup>
</master:SchedulingConfig>
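If you are more used to five-field Unix cron, note that a Quartz cron expression starts with a seconds field and accepts an optional year field at the end. The short Python sketch below labels the fields of the expression used in the trigger above. The label_cron_fields helper is hypothetical and only splits the string; it does not validate the values.

```python
# Field order of a Quartz cron expression; the trailing year is optional.
QUARTZ_FIELDS = ("seconds", "minutes", "hours",
                 "day-of-month", "month", "day-of-week", "year")


def label_cron_fields(expression):
    """Map each field of a Quartz cron expression to its name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("a Quartz cron expression has 6 or 7 fields")
    return dict(zip(QUARTZ_FIELDS, parts))


for name, value in label_cron_fields("0 0 0 * * ?").items():
    print(f"{name:>12}: {value}")
```

Read this way, "0 0 0 * * ?" means second 0, minute 0, hour 0, on every day of every month, with no specific day of the week. Following the same field order, an expression such as "0 0 */6 * * ?" should run the refresh every six hours.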