About Data Processing

This page discusses:

About the Overall Document Lifecycle
Focus on Data Processing Phases in the Analysis Pipeline

About the Overall Document Lifecycle

This section explains the document lifecycle through the various components of the Indexing Server.

In the Push API Server

When connectors send documents to the Indexing Server, the documents first arrive at the Push API server (PAPI server) of the Indexing Server.

A document that you push to the Push API server contains:

document URI
document stamp to indicate the version
metadata
parts containing the bytes from the document
directives for data extraction
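
As a rough illustration of the items listed above, the following Python sketch models a pushed document as a plain data structure. It is not the Push API client: the class names, field names, the part name, and the directive key are all hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Part:
    """Hypothetical model of one document part: a name plus raw bytes."""
    name: str
    content: bytes


@dataclass
class PushedDocument:
    """Hypothetical stand-in for a document sent to the Push API server."""
    uri: str                                                      # document URI
    stamp: str                                                    # version indicator
    metas: Dict[str, List[str]] = field(default_factory=dict)     # metadata
    parts: List[Part] = field(default_factory=list)               # bytes of the document
    directives: Dict[str, str] = field(default_factory=dict)      # data extraction hints


# Example values are invented for illustration only.
doc = PushedDocument(
    uri="file:///reports/q3.txt",
    stamp="2024-01-15T10:00:00Z",
    metas={"author": ["J. Smith"], "department": ["Finance"]},
    parts=[Part(name="main", content=b"Quarterly results improved.")],
    directives={"extract_text": "true"},  # hypothetical directive key
)

print(doc.uri, doc.stamp, len(doc.parts))
```

A connector would typically change the stamp whenever the source document changes, so that the receiving side can tell one version from another.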

In the Analysis Pipeline

Documents are then pushed to the analysis pipeline of the Indexing Server.

The first step is to transform the content of the document parts and the metadata items into internal items that the document and semantic analysis processors can process. These internal items are called document chunks.

Note:

Each document chunk is tagged with a context, whose name is case-sensitive.
The contexts that are created depend on the extraction process. By default, extraction creates the text and title contexts for each part, and one context for each metadata item.

The following example shows the mapping of a document with one part and metadata items to several contexts.
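
The mapping can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual extraction code: the function `to_contexts` and its parameters are hypothetical, and each context is represented as a list of string chunks.

```python
from typing import Dict, List


def to_contexts(title: str, text: str, metas: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map one extracted part and its metadata items to contexts of document chunks."""
    contexts: Dict[str, List[str]] = {
        "title": [title],  # default context for the title of the part
        "text": [text],    # default context for the text of the part
    }
    for name, values in metas.items():
        # One context per metadata item; each value becomes a document chunk.
        contexts.setdefault(name, []).extend(values)
    return contexts


contexts = to_contexts(
    title="Q3 report",
    text="Quarterly results improved.",
    metas={"author": ["J. Smith"], "Author": ["Finance team"]},
)

# Context names are case-sensitive, so "author" and "Author" remain distinct.
for name, chunks in contexts.items():
    print(name, chunks)
```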

Once the document is represented as contexts with one or more document chunks, the analysis processors can process it. The processors can perform one of the following:

create new contexts
transform existing contexts

If you want to perform semantic analysis, you must first tokenize the context. The semantic processors can then create annotations for the tokens. The following example shows a possible representation of the document after analysis.
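
To make the relationship between tokens and annotations concrete, here is a minimal Python sketch. It assumes a naive word tokenizer and a toy lexicon lookup in place of real semantic processors; the `Token` class, the `entity` annotation tag, and the lexicon contents are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    """Hypothetical token: its text, its offset in the chunk, and its annotations."""
    text: str
    offset: int
    annotations: Dict[str, str] = field(default_factory=dict)


def tokenize(chunk: str) -> List[Token]:
    """Split a document chunk into word tokens (naive, for illustration only)."""
    return [Token(m.group(), m.start()) for m in re.finditer(r"\w+", chunk)]


def annotate(tokens: List[Token], lexicon: Dict[str, str]) -> List[Token]:
    """Attach an 'entity' annotation to every token found in the lexicon."""
    for token in tokens:
        value = lexicon.get(token.text.lower())
        if value is not None:
            token.annotations["entity"] = value
    return tokens


LEXICON = {"paris": "city", "london": "city"}  # toy data
tokens = annotate(tokenize("Flights from Paris to London"), LEXICON)

for token in tokens:
    print(token.offset, token.text, token.annotations)
```

The annotations sit on the tokens rather than replacing them, which is why a context must be tokenized before it can be annotated.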

In the Index

The final phase maps the contexts and the annotations to index fields so that the document can be searched.

When a document matches a query, the results contain hit fields, metadata (which can differ from the Push API metadata), categories, and related terms.

Focus on Data Processing Phases in the Analysis Pipeline

Data processing involves the following phases in the Exalead CloudView analysis pipeline.


#	Phase
1	Document Processing provides a set of analysis filters that can modify the content of the documents for indexing. For example, the language and MIME type of a document are detected automatically, and you can use the detection result as a document category.
2	Contexts appear at the end of the document processing phase. They hold the new or transformed content that can be mapped to index fields in the final phase.
3	Semantic Processing provides a set of semantic processors to detect related terms, perform semantic query processing, categorization thesaurus, extractions, and ontology matching.
4	Annotations provide additional information about a piece of text in the form of names, attributes, descriptions, and so on. Semantic processors attach them to documents during analysis. To search on an annotation, map it to an index field. To use it as a facet, map it to a category.
5	Use Content and Annotation Mapping to send data from several metadata items of a document to the same field, or to send a given part of the document to multiple index fields, with multiple options. The index fields then hold the heterogeneous content and annotations gathered from the connectors and created during document analysis.
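
The mapping phase can be pictured with the following Python sketch, which shows several sources feeding one field and one source feeding several fields. It is a simplified model, not the CloudView mapping configuration: the rule list, the `entity:city` source name, the field names, and the function `apply_mappings` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping rules: (source context or annotation, target index field).
MAPPINGS: List[Tuple[str, str]] = [
    ("author", "people"),     # two metadata items ...
    ("editor", "people"),     # ... sent to the same field
    ("text", "text"),         # one part ...
    ("text", "summary"),      # ... sent to two fields
    ("entity:city", "city"),  # an annotation mapped to a field
]


def apply_mappings(
    sources: Dict[str, List[str]], mappings: List[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Copy the values of each source into its target index field."""
    index_fields: Dict[str, List[str]] = {}
    for source, target in mappings:
        values = sources.get(source, [])
        if values:
            index_fields.setdefault(target, []).extend(values)
    return index_fields


sources = {
    "author": ["J. Smith"],
    "editor": ["A. Lee"],
    "text": ["Flights from Paris to London"],
    "entity:city": ["Paris", "London"],
}
index_fields = apply_mappings(sources, MAPPINGS)

for name, values in index_fields.items():
    print(name, values)
```

Keeping the rules as data separate from the function mirrors the idea that mapping is configured rather than coded, although the real configuration format is not shown here.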