About the Indexing Process
Indexing is the process of scanning a document to create an index and to store its metas (content such as fields) into a collection.
The following diagram explains the process to update an index and its replicas.
Phase 1: Push
The connectors retrieve documents from data sources, convert them into PAPI documents, and send these documents to the Indexing Server process.
For each document, the Indexing Server assigns:
-
a Document Identifier (DID)
-
an index slice
Important: You can add the Consolidation Phase between Phase 1 and Phase 2. It allows you to transform and aggregate documents coming from different sources before pushing them to the Indexing Server. For more details on consolidation, see the Exalead CloudView Consolidation Server Guide.
Phase 2: Analysis
Analysis is the process of formatting content and extracting information from documents pushed by connectors before storing them in the index.
In Exalead CloudView, an indexing job starts as soon as it receives a document. The analysis pipeline processes it immediately, using several threads for better performance.
Processors play an important role during the analysis phase. The document and semantic processors parse each document in the job to perform text extraction, semantic processing, custom operations, and mapping.
Commit Triggers define the conditions that prompts the saving of the analysis to the index.
When you commit, the results of the analysis create:
-
An import to the index, which merges the data computed during the analysis with the data present in the index. This results in a new generation of the index. The new data resides in a new, separate slot in the index.
-
Semantic annotations (linguistic statistical data) about the corpus to the dictionary builder of the indexing server process. This data can be used for query expansion and index-time semantic processing.
The index is now committed to disk.
Phase 3: Replicate
After the new index data is committed, the new index generation is replicated on all index slices in the deployment. Once fully replicated, the new documents are available for searching.
Once the dictionary builder has received new semantic annotations, it updates the dictionary (or dictionaries, if you configured multiple ones) on the search server.