Text Processing Pipeline
Semantic processors allow you to analyze, transform, and annotate document texts.
They are usually assembled sequentially to build a text processing pipeline.
MOTPipe
Exalead CloudView's text processing pipeline is named MOTPipe (Mining Of Text Pipe).
The MOTPipe architecture is designed to annotate natural language documents. Annotations
can be of different kinds: part of speech, named entity, ontology entry, etc. It is similar
to the UIMA or GATE frameworks. The pipe contains a set of "processors" that are applied
in a given sequence, and a set of linguistic resources they rely on.
The main difference between the MOTPipe and other frameworks is that it handles documents as an annotated token stream (for performance purposes).
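To make the token-stream idea concrete, here is a minimal Python sketch. It is not the CloudView API: the `Token` class and the `tokenize` and `tag_capitalized` functions are hypothetical names chosen for illustration. The point is only that each stage consumes tokens one at a time and attaches annotations to them, instead of building a full document structure first.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List


@dataclass
class Token:
    """Hypothetical token that carries its own annotations (not the CloudView API)."""
    text: str
    offset: int
    annotations: Dict[str, List[str]] = field(default_factory=dict)


def tokenize(text: str) -> Iterator[Token]:
    """Naive whitespace segmentation, yielding tokens lazily."""
    offset = 0
    for word in text.split():
        offset = text.index(word, offset)  # character position of this word
        yield Token(word, offset)
        offset += len(word)


def tag_capitalized(tokens: Iterable[Token]) -> Iterator[Token]:
    """Toy processor: annotate tokens that start with an uppercase letter."""
    for token in tokens:
        if token.text[:1].isupper():
            token.annotations.setdefault("capitalized", []).append("true")
        yield token


# Stages are chained as generators, so tokens flow through one by one.
stream = tag_capitalized(tokenize("Ada Lovelace wrote programs"))
annotated = list(stream)
```

Because both stages are generators, no token is produced until the consumer asks for it, which is one plausible reading of the performance motivation mentioned above.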
A MOTPipe is composed of:
- A converter, which handles text segmentation.
- A list of resources (thread-safe, shared by several processors).
- A list of processors using resources (thread local).
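The three parts listed above can be sketched as follows. This is an illustrative Python model under stated assumptions, not MOTPipe's actual implementation: the `Pipe` class, `whitespace_converter`, and `make_last_name_tagger` are hypothetical names. Resources are modelled as a read-only dictionary shared by every thread, while each thread lazily builds its own processor instances.

```python
import threading


class Pipe:
    """Hypothetical pipe: one converter, shared resources, per-thread processors."""

    def __init__(self, converter, resources, processor_factories):
        self.converter = converter
        self.resources = resources            # read-only, shared across threads
        self.factories = processor_factories
        self._local = threading.local()       # one processor list per thread

    def _processors(self):
        # Build this thread's processors on first use.
        if not hasattr(self._local, "processors"):
            self._local.processors = [make(self.resources) for make in self.factories]
        return self._local.processors

    def process(self, text):
        tokens = self.converter(text)
        for processor in self._processors():  # applied in the given sequence
            tokens = [processor(token) for token in tokens]
        return tokens


def whitespace_converter(text):
    """Toy segmentation: a token is a (text, set of annotation tags) pair."""
    return [(word, set()) for word in text.split()]


def make_last_name_tagger(resources):
    """Factory for a toy processor backed by a shared dictionary resource."""
    last_names = resources["last_names"]

    def tag(token):
        word, tags = token
        if word in last_names:
            tags.add("last_name")
        return token

    return tag


pipe = Pipe(
    whitespace_converter,
    {"last_names": frozenset({"Lovelace"})},
    [make_last_name_tagger],
)
result = pipe.process("Ada Lovelace")
```

Using an immutable `frozenset` for the resource is one simple way to make sharing safe in this sketch; a real resource would need its own thread-safety guarantees.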
With this approach, the performance of each successive component depends on the
performance of each of the components that preceded it in the pipeline.
Note: Errors made by an "upstream" processor, like a part-of-speech tagging system, can negatively impact the
performance of each "downstream" processor (such as a named entity recognizer).
Dependencies
A given processor can use the results of previous processors in the pipeline. For
example, an OntologyMatcher processor can annotate last names, and a RulesMatcher (that is,
NamedEntitiesMatcher) can then exploit this information to extract people's
names.
The RulesMatcher processor therefore needs an efficient way to retrieve the last-name annotations added by the OntologyMatcher processor.
The MOTPipe embeds a Referential component designed to share information between
processors efficiently. When the MOTPipe is initialized, each processor of the pipe
registers the annotations it will add with the Referential, and gets the corresponding
AnnotationId (each annotation is identified by an Id for performance
purposes). When a downstream processor requires an annotation added by an upstream
processor, it requests it from the Referential during its own registration step.
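The registration step described above can be sketched as follows. This is a minimal Python model, not the CloudView API: the `Referential` methods `register` and `require`, the processors' `init` and `process` methods, and the integer ids are all assumptions made for illustration. The OntologyMatcher and RulesMatcher classes here are toy stand-ins that only borrow the names used in the text.

```python
from typing import Dict


class Referential:
    """Hypothetical registry mapping annotation names to integer ids."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def register(self, name: str) -> int:
        """Declare an annotation a processor will add and return its id."""
        if name not in self._ids:
            self._ids[name] = len(self._ids)
        return self._ids[name]

    def require(self, name: str) -> int:
        """Return the id of an annotation that an upstream processor registered."""
        if name not in self._ids:
            raise KeyError(f"no upstream processor registered {name!r}")
        return self._ids[name]


class OntologyMatcher:
    """Toy stand-in: annotates known last names."""

    def init(self, referential):
        self.last_name_id = referential.register("last_name")

    def process(self, tokens):
        for token in tokens:
            if token["text"] in {"Lovelace"}:
                token["annotations"].add(self.last_name_id)


class RulesMatcher:
    """Toy stand-in: a capitalized word followed by a last name is a person."""

    def init(self, referential):
        self.last_name_id = referential.require("last_name")  # added upstream
        self.person_id = referential.register("person")

    def process(self, tokens):
        for prev, cur in zip(tokens, tokens[1:]):
            if self.last_name_id in cur["annotations"] and prev["text"][:1].isupper():
                prev["annotations"].add(self.person_id)
                cur["annotations"].add(self.person_id)


referential = Referential()
processors = [OntologyMatcher(), RulesMatcher()]
for processor in processors:   # registration happens in pipeline order
    processor.init(referential)

tokens = [{"text": w, "annotations": set()} for w in "Ada Lovelace wrote".split()]
for processor in processors:
    processor.process(tokens)
```

In this sketch, name lookups happen once at initialization. During processing, each processor only compares integer ids, which is the kind of saving the Id-based design suggests.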