Analyze
The critical element in analysis is the number of threads to use, so as to make the most of your
indexing machine's CPU.
Most deployments have a dedicated machine for index building. An index consists of slices, and
you normally want to set up your analysis to use one CPU per index slice. This maximizes
performance at both index time and search time.
Best Practice
Analysis can maximize CPU use when there is a dedicated machine for index building. The
best option is one thread per slice. When you have many CPUs but a relatively slow hard
disk drive, you can increase this to two threads per slice.
An exception to this best practice is when you have a document processor that calls a web
service. While waiting for the response from the web service, Exalead CloudView does not
consume CPU. In this case, configure more threads than there are CPUs to compensate for
the periods when the CPU is idle.
Why 1 CPU Per Slice?
Say that we have an index with 4 slices. At search time, when a user sends a query for
processing, the search server sends the query to each slice. Query evaluation is
mono-threaded in each slice, so dedicating one CPU to each slice maximizes query
performance.
At index building time, you must import the new index generation into each slice.
Recommendation:
Have at least one CPU per thread at indexing time.
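As a rough illustration of these sizing rules, the sketch below encodes them in Python. The function and its parameters are hypothetical, not part of any Exalead CloudView API, and the oversubscription factor for web-service processors is an assumption to tune per deployment.

def analysis_threads(slices: int, slow_disk: bool = False,
                     web_service_processors: bool = False,
                     cpus: int | None = None) -> int:
    """Illustrative sizing rule, not a CloudView API.

    Baseline: one thread per index slice. With a slow disk, two threads
    per slice. With document processors that call a web service,
    oversubscribe beyond the CPU count, since threads spend much of
    their time waiting on I/O rather than consuming CPU.
    """
    threads = slices * (2 if slow_disk else 1)
    if web_service_processors and cpus is not None:
        # The factor of 2 per CPU is a deployment-specific assumption.
        threads = max(threads, cpus * 2)
    return threads

# Example: a 4-slice index on an 8-CPU machine.
print(analysis_threads(slices=4, slow_disk=True))                       # 8
print(analysis_threads(slices=4, web_service_processors=True, cpus=8))  # 16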
Compact
Compacting is the merging of multiple index generations, known as slots. Doing this
regularly helps maintain good search performance.
Note:
Changes made to the compact configuration do not require re-indexing. A full
compact is enough to clean index slots.
Regular vs. Full Compact
There are two types of compacting you must configure for Exalead CloudView:
- "Regular" compacts: these are for daily housekeeping on the index. They regularly merge slots for more efficient use of index space.
- Full compacts: by contrast, these are for spring cleaning on the index. They are triggered once regular compacting has lost its effectiveness.
"Regular" Compact
Each analyzer imports the new index generation into each slice. To ensure consistency,
Exalead CloudView creates new files, or slots, for each generation. Since more slots slow search
performance, you do not want to add new slots indefinitely. You sometimes need to merge
these slots by compacting the index.
In general, small slots are faster to compact, while large slots maintain good search
performance. The available compact policies are described below.
Number of slots (default)
Compacts as soon as there are "No. slots" slots. This is a pyramidal system: it leads to frequent compacting of small slots and less frequent compacting of large slots.
- No. slots: number of slots with the same number of index generations needed to trigger a compact. Default is 4.
- Max slot size (MB): once a slot reaches this size, it can never be compacted again unless you activate a full compact policy. Default is 1000.
Latency reduction
Compact policy designed to improve realtime indexing performance. Small slots (small size on disk) provide fast compacts, while large slots (large size on disk) maintain good search performance. Whenever an index generation is created, this compact policy sorts the slots by size, keeping No. large slots large slots and at most Max small slots small slots. Use this mode when most of your index imports are incremental changes, which typically create small slots.
- No. large slots: keeps at least N large slots in the index. Default is 10.
- Max small slots: keeps no more than N small slots. Exceeding N small slots triggers a compact. Default is 20.
Slots size
Compact policy based on size that produces slots with similar sizes.
- Target size for compaction (MB): slots are compacted until they reach this size. They are no longer compacted afterward, except if you run a full compact operation. Default is 200.
- Min size for compaction (MB): minimum size for a slot to be compacted. Default is 50.
- Min. slots: minimum number of slots to trigger a compact.
No compact
Compact policy that does not run compact operations, and fills the smallest slot at each import. Use it for initial indexing, when all you are doing is importing. Follow this with a full compact (see below).
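To see why the default Number of slots policy is called pyramidal, the following simulation models each slot as the number of generations it holds and merges any four slots holding the same count, as described above. It is an illustrative sketch, not CloudView code.

def import_generation(slots: list[int], trigger: int = 4) -> list[int]:
    """Add one generation (a slot holding 1 generation), then merge any
    group of `trigger` slots that hold the same number of generations."""
    slots = slots + [1]
    merged = True
    while merged:
        merged = False
        for size in sorted(set(slots)):
            if slots.count(size) >= trigger:
                # Merge `trigger` equal-sized slots into one larger slot.
                for _ in range(trigger):
                    slots.remove(size)
                slots.append(size * trigger)
                merged = True
                break
    return slots

slots: list[int] = []
for generation in range(1, 17):
    slots = import_generation(slots)
    print(f"after generation {generation:2}: {sorted(slots)}")
# Slots of 1 generation merge every 4 imports; the resulting 4-generation
# slots merge only every 16 imports, and so on up the pyramid.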
Full Compact
A full compact is like spring cleaning for your index.
To ensure good index latency, Exalead CloudView creates lots of small slots, one for each generation of the index. For better search
latency, every so often these slots are compacted into a large slot. Later on, once you
have lots of larger slots, these too get compacted. For the sake of clarity, let us call
this a 'regular' compact.
Once a slot reaches 1 GB, however, regular compaction stops. This is a safeguard put in
place to ensure that regular compaction does not impact other Exalead CloudView operations. This means that over time, your index becomes full of 1 GB slots. This
is particularly wasteful when you are indexing the same documents repeatedly, since each slot
contains new versions of the same documents.
This is when it is time to run a full compact, which takes all these 1 GB slots and merges
them into a single slot.
By default, a full compact is triggered using the size of the largest slot in your index
as the threshold. Once the size of the rest of your index (excluding the largest slot)
exceeds the size of the largest slot, a full compact is triggered.
Note:
You can also trigger a full compact for a given build group directly, from Administration
Console > Home, using Full compact.
The available full compact policies are:
Size
This full compact policy is launched when the cumulated size of small slots exceeds N percent of the largest slot.
Recommendation:
Set the Min slots option so that full compact operations are not launched too frequently, as they are costly in disk consumption.
- Percentage: minimum percentage to start a full compact. It compacts all slots into a single one whenever the tail of small slots exceeds this percentage of the largest slot.
- Min slots: minimum number of slots to trigger a full compact. Default is 2.
Number of slots
This full compact policy applies to compacts based on Number of slots. Since the pyramidal system tends to compact large slots less frequently, this policy lets you define the maximum arity of long tails before triggering a full compact.
- Max arity: whenever the total arity of the long tail reaches this value, a full compact is launched. The long tails are the slots whose span has an arity inferior to this parameter. Default is 256.
No full compact (default)
Disables full compact operations.
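As a quick sketch of the size-based trigger described above, the hypothetical function below compares the cumulated size of the smaller slots against a percentage of the largest slot. With 100 percent, it reproduces the default behavior: a full compact fires once the rest of the index outgrows the largest slot.

def should_full_compact(slot_sizes_mb: list[float],
                        percentage: float = 100.0,
                        min_slots: int = 2) -> bool:
    """Illustrative check for the size-based full compact trigger."""
    if len(slot_sizes_mb) < min_slots:
        return False
    largest = max(slot_sizes_mb)
    tail = sum(slot_sizes_mb) - largest
    return tail > largest * percentage / 100.0

# Example: three 1000 MB slots plus one 500 MB slot. The tail
# (1000 + 1000 + 500 = 2500 MB) exceeds the largest slot (1000 MB),
# so a full compact would be triggered.
print(should_full_compact([1000, 1000, 1000, 500]))  # True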
Schedule Full Compacts
During full compacts, index queries may be slower than usual if the index service is on
the same machine as the indexingservice process. To mitigate this,
schedule full compacts when there is less traffic on the system. Depending on the update
volume, you may want to trigger full compacts every night, or once a week.
You can do this in Scheduling.xml.
For example, to trigger a full compact every night at 01:00:
<master:SchedulingConfig version="1381920589000" xmlns:bee="exa:exa.bee"
xmlns:cdesc="exa:com.exalead.mercury.component.config.descriptor"
xmlns:secs="exa:com.exalead.security.sources.common"
xmlns:config="exa:exa.bee.config"
xmlns:master="exa:com.exalead.mercury.mami.master.v10">
<master:JobConfigGroup name="full_compact">
<master:DispatchJobConfig name="launch_full_compact">
<bee:DispatchMessage messageName="fullCompactIndex" serviceName="/mami/indexing">
<bee:messageContent>
<bee:KeyValue key="buildGroup" value="bg0" />
</bee:messageContent>
</bee:DispatchMessage>
</master:DispatchJobConfig>
</master:JobConfigGroup>
<master:TriggerConfigGroup name="full_compact">
<!-- schedule full compact -->
<master:CronTriggerConfig name="launch_full_compact" startTime="0" endTime="0"
jobGroupName="full_compact" jobName="launch_full_compact" cronExpression="00 00 1 * * ?"/>
</master:TriggerConfigGroup>
</master:SchedulingConfig>
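The cronExpression follows Quartz-style syntax, with fields for seconds, minutes, hours, day of month, month, and day of week, so 00 00 1 * * ? fires every day at 01:00. Assuming standard Quartz expressions are accepted, a variant such as 00 00 1 ? * SUN would run the full compact weekly, on Sundays at 01:00.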
Synchronous Option
By default, compacting is asynchronous. When importing the latest generation of the
index, Exalead CloudView creates the slot. While replicating this slot to all slices,
Exalead CloudView can start a compact, but does not wait for this compact to fully
replicate before responding to user queries.
With synchronous compacting, Exalead CloudView ensures that compaction is fully
replicated before starting an import. This prevents machines from being overloaded with
multiple compacting or importing jobs.
Commit
Exalead CloudView indexes documents on the fly, all in memory. As soon as a connector pushes documents
to Exalead CloudView, their analysis begins. During analysis, if a threshold is
reached, the processed documents are committed to the index (on disk). The commit is what
creates a new generation (slot) in the index.
The commit thresholds are commit conditions. You can define them to occur:
- At regular intervals (periodic condition)
- After processing N MB (size-based condition)
- After processing N tasks (documents) (number-of-tasks condition)
- After N seconds of inactivity (inactivity condition)
Note:
You can also explicitly choose to commit from the Home page >
Indexing section > Force commit.
Minimizing disk writes during the analysis phase makes indexing significantly faster
and reduces index latency. In other words, your users can search the
updated documents sooner.
Keep in mind, though, that while scanning (pushing) documents from a data source, connectors
also make commits. For example, at the end of every scan, a managed connector performs a
commit, and consequently, an import to the index. Importing too frequently could negate the
advantages of RAM-based analysis.
The available commit conditions are:
Max. RAM threshold
- Enabled (default option): commits when the RAM size reaches the specified Threshold value (by default, 2048 MB).
- Auto: commits when the RAM size reaches 2048 MB.
When the specified RAM value is reached, analysis stops and the analyzed documents are written to the index. Then analysis starts again.
Inactivity
Commits when there is no new data for the specified time period (N seconds) and at least N tasks have been analyzed.
No. of tasks
Commits after N tasks have been issued.
Elapsed time
Commits every N seconds after the first push order launched after the last commit.
Size
Commits when the total size of the documents to be processed reaches N MB.
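To tie these conditions together, here is an illustrative Python sketch of how such thresholds could be combined. The function, its parameters, and its defaults mirror the table above but are assumptions, not the actual CloudView implementation.

import time

def should_commit(ram_mb: float,
                  tasks: int,
                  processed_mb: float,
                  first_push_at: float | None,
                  last_activity_at: float,
                  now: float | None = None,
                  max_ram_mb: float = 2048,
                  max_tasks: int | None = None,
                  size_mb: float | None = None,
                  elapsed_s: float | None = None,
                  inactivity_s: float | None = None,
                  min_tasks: int = 1) -> bool:
    """Illustrative combination of commit conditions; None disables one."""
    if now is None:
        now = time.monotonic()
    # Max. RAM threshold: analyzed data held in memory grew too large.
    if ram_mb >= max_ram_mb:
        return True
    # No. of tasks: enough documents have been issued since the last commit.
    if max_tasks is not None and tasks >= max_tasks:
        return True
    # Size: total size of documents processed since the last commit.
    if size_mb is not None and processed_mb >= size_mb:
        return True
    # Elapsed time: N seconds after the first push following the last commit.
    if (elapsed_s is not None and first_push_at is not None
            and now - first_push_at >= elapsed_s):
        return True
    # Inactivity: no new data for N seconds, with at least min_tasks analyzed.
    if (inactivity_s is not None and tasks >= min_tasks
            and now - last_activity_at >= inactivity_s):
        return True
    return False

# Example: 1024 MB of analyzed data does not yet reach the default
# 2048 MB RAM threshold, and no other condition is enabled.
print(should_commit(ram_mb=1024, tasks=500, processed_mb=300,
                    first_push_at=None, last_activity_at=0.0, now=10.0))  # False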