Write a Java Custom Semantic Processor

The Java custom semantic processor allows you to plug in your code as the last semantic processor in the pipeline.

Your code receives a flow of annotated tokens from the pipeline as input, can add or remove any annotation, and then sends the tokens back to the indexing chain.

Note: For more information, see About Semantic Processors in the Exalead CloudView Configuration Guide.

This page discusses:

Write a Java Custom Semantic Processor
Caveats for Semantic Processors
Sample Semantic Processor

Write a Java Custom Semantic Processor

The difference from the Java Custom Tokenizer lies in the input:

The Java Custom Tokenizer receives a text chunk to process.
The Java Custom Semantic Processor has to get the tokens from the pipeline itself (see Sample Semantic Processor).

Derive your MySemanticProcessor class from com.exalead.pdoc.analysis.JavaCustomSemanticProcessor and implement:

@PropertyLabel(value = "JavaCustomSemanticProcessor")
@CVComponentConfigClass(configClass = MySemanticProcessorConfig.class)
@CVComponentDescription(value = "My semantic processor in Java")

public class MySemanticProcessor extends com.exalead.pdoc.analysis.JavaCustomSemanticProcessor
{
    public MySemanticProcessor(MySemanticProcessorConfig config) throws Exception;

    /**
     * Called when a new document is about to get processed.
     */
    public void newDocument();

    /**
     * Called when there is no more input to process in the current document.
     * This is the last chance to attach annotations to the document if needed.
     */
    public void endDocument();

    /**
     * Called at initialization to retrieve the annotation tags that are planned to be used during processing.
     * Only declared annotations will be accessible on tokens retrieved with getNextToken().
     * @return the list of all annotation tags needed or null if none
     */
    public String[] declareAnnotations();

    /**
     * Called when a new input chunk is to be processed.
     * The processor must pump tokens from the pipe using getNextToken()
     * and return them once processed to the pipe with pushToken().
     *
     * @param chunk the chunk text
     * @param language the chunk language
     * @param context the chunk context
     * @see getNextToken(), newAnnotation(), pushToken()
     */
    public void processChunk(String chunk, int language, String context) throws Exception;
}

The JavaCustomSemanticProcessor class also provides helper methods:

package com.exalead.pdoc.analysis;

public abstract class JavaCustomSemanticProcessor extends CustomDocumentProcessor
 {
    ...
    /**
     * Pump the next token from the input stream.
     * @return the next token from the pipe or null if end of input is reached
     */
    protected final AnnotatedToken getNextToken();
    /**
     * Allocate a new annotation with the provided tag, display form and length.
     * The annotation is either created or recycled from a previous use.
     *
     * @param tag the new annotation tag
     * @param displayForm the new annotation value
     * @param nbTokens the new annotation length, in tokens
     * @return a fresh or recycled annotation
     * @pre tag must have been declared in declareAnnotations()
     * @pre displayForm is not null
     * @pre nbTokens > 0
     */
    protected final Annotation newAnnotation(String tag, String displayForm, int nbTokens)
            throws InvalidAnnotationException;
    /**
     * Send a token to the output stream.
     *
     * - validity of the token is checked
     * - the token is added to the output buffer
     * - if needed, the output buffer is flushed
     * - the token is recycled
     *
     * In all cases, the token and its annotations are not usable anymore after the call.
     *
     * @param token A token retrieved through a call to getNextToken()
     * @pre token is not null
     * @pre token form is not null nor empty
     * @pre token type is defined
     * @see getNextToken(), newAnnotation()
     */
    protected final void pushToken(AnnotatedToken token) throws InvalidTokenException;
    /**
     * Attach an annotation to the currently processed document
     *
     * @param annotation the annotation to attach
     * @pre annotation must have been allocated with newAnnotation()
     * @see newAnnotation()
     */
    protected final void addDocumentAnnotation(Annotation annotation) throws InvalidAnnotationException;
    ...
}
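
The pump-and-push contract behind getNextToken() and pushToken() can be illustrated with a small, self-contained model. This is only a sketch: PipeSketch, its nested Token class, and the "capitalized" tag are hypothetical stand-ins invented for this illustration, not CloudView classes. It assumes nothing beyond what the Javadoc above states, namely that getNextToken() returns null at the end of input and that every token is handed back with pushToken().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in: models the pump/push contract only, NOT the CloudView API.
final class PipeSketch {
    static final class Token {
        final String form;
        final List<String> tags = new ArrayList<>();
        Token(String form) { this.form = form; }
    }

    private final Deque<Token> input = new ArrayDeque<>();
    private final List<Token> output = new ArrayList<>();

    PipeSketch(String... forms) {
        for (String f : forms) {
            input.add(new Token(f));
        }
    }

    // Mirrors getNextToken(): returns null once the input is exhausted.
    Token getNextToken() { return input.poll(); }

    // Mirrors pushToken(): hands the token back to the pipeline.
    void pushToken(Token t) { output.add(t); }

    // One token in, one token out: nothing is buffered between iterations.
    void processChunk() {
        for (Token t = getNextToken(); t != null; t = getNextToken()) {
            if (!t.form.isEmpty() && Character.isUpperCase(t.form.charAt(0))) {
                t.tags.add("capitalized"); // hypothetical tag, for illustration only
            }
            pushToken(t);
        }
    }

    List<Token> output() { return output; }
}
```

In this model, every token pulled from the input ends up in the output in its original order, which is the property a real processor has to preserve.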

Caveats for Semantic Processors

When creating new annotations, you have to define their tag, display form, number of tokens, and possibly their trust level. If the annotation is malformed, newAnnotation() or addDocumentAnnotation() throws an InvalidAnnotationException.
To remove an annotation from a token, assign it a null value in the annotations[] array.
As with the custom tokenizer, do not allocate annotations yourself; always use newAnnotation(). To save RAM, annotations are pooled and recycled once pushed to the pipeline.
Since tokens and annotations are recycled, they are no longer usable once pushed to the pipeline. The only way to get a new token is through getNextToken(). Always allocate annotations through newAnnotation().
Keep as few tokens in memory as possible before pushing them back to the pipeline. Ideally, push each token before getting the next one from the pipeline.
Your code is executed in a multi-threaded environment where each thread has its own processor, so your code does not have to be thread-safe. However, threads share static objects, which can cause problems if those objects are modified concurrently.
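
One way to stay clear of the static-object caveat is to make shared static data read-only and keep all mutable state in instance fields. The following is a minimal sketch under that assumption; SharedStateSketch, FIRST_NAMES, and the see() and matches() methods are hypothetical names used for illustration and are not part of the CloudView API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class, for illustration only; not part of the CloudView API.
final class SharedStateSketch {
    // Shared by every thread: built once, then read-only, so concurrent reads are safe.
    static final Set<String> FIRST_NAMES =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("John", "Bill", "Steve")));

    // Per-instance state: each thread owns its own processor, so no locking is needed.
    private int matches = 0;

    void see(String word) {
        if (FIRST_NAMES.contains(word)) {
            matches++;
        }
    }

    int matches() { return matches; }
}
```

Any attempt to modify FIRST_NAMES after initialization throws an UnsupportedOperationException, so an accidental concurrent modification fails immediately instead of corrupting shared data.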

Sample Semantic Processor

This sample demonstrates how to write a custom semantic processor that:

Gets tokens from the semantic pipeline
Looks for a first name that is part of its dictionary
Checks if the following word starts with a capitalized letter
Adds a people annotation if there is no NE.people annotation already
Pushes tokens back to the pipeline
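
The matching rule in the steps above can be tried in isolation, without the pipeline. The sketch below is a simplification: NameRuleSketch and findPeople() are hypothetical names, tokens are plain strings, a separator is recognized by comparing it to a single space rather than by token kind, and the check for an existing NE.people annotation is left out. It reuses the same regular expression as the sample processor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only; it works on plain strings, not pipeline tokens.
final class NameRuleSketch {
    private static final Set<String> NAMES =
            new HashSet<>(Arrays.asList("John", "Bill", "Steve"));

    // Returns the "First Last" display forms found in a sequence of word and separator tokens.
    static List<String> findPeople(List<String> tokens) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i + 2 < tokens.size(); i++) {
            String first = tokens.get(i);
            String sep = tokens.get(i + 1);
            String second = tokens.get(i + 2);
            // Known first name, one space, then a word of at least two characters
            // that starts with an uppercase letter.
            if (NAMES.contains(first) && sep.equals(" ") && second.matches("\\p{Lu}.+")) {
                found.add(first + " " + second);
            }
        }
        return found;
    }
}
```

Note that the pattern \p{Lu}.+ requires at least two characters, so a single-letter word such as an initial does not match.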

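package com.exalead.customcode.analysis;

import java.util.HashSet;

import com.exalead.pdoc.analysis.JavaCustomSemanticProcessor;
// Imports for the other types used below (AnnotatedToken, Annotation, CVComponentConfig,
// the CV* annotations, ArrayUtils, Logger) are not shown in this sample.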

/**
 * This processor can be configured with:
 *  <CustomDocumentProcessor classId="com.exalead.customcode.analysis.CustomSemanticProcessor">
 *   <KeyValue key="Meta" value="mymeta" />
 *  </CustomDocumentProcessor>
 */

@CVComponentConfigClass(configClass = com.exalead.customcode.analysis.CustomSemanticProcessor.CustomSemanticProcessorConfig.class)
@CVComponentDescription(value = "My semantic processor in Java")
public class CustomSemanticProcessor extends JavaCustomSemanticProcessor
{
    public static class CustomSemanticProcessorConfig implements CVComponentConfig
    {
        private String meta;
        public String getMeta() {
            return meta;
        }
        @IsMandatory(true)
        public void setMeta(String meta) {
            this.meta = meta;
        }
    }

    public CustomSemanticProcessor(CustomSemanticProcessorConfig config) throws Exception {
        super(config);
        for (String s : firstNames) {
            names.add(s);
        }
    }

    @Override
    public void newDocument() {
        logger.debug("I'm about to start processing a new document!");
    }

    @Override
    public void endDocument() {
        logger.debug("Done with this document!");
    }

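    @Override
    public String[] declareAnnotations() {
        // "people" is created by this processor. "NE.people" is only read, but it is declared too,
        // because only declared annotations are accessible on tokens retrieved with getNextToken().
        return new String[] { "people", "NE.people" };
    }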

    @Override
    public void processChunk(String chunk, int language, String context) throws Exception {
        for (AnnotatedToken token = getNextToken(); token != null; token = getNextToken()) {
            // Candidate: a known first name that does not already carry an NE.people annotation.
            if (token.kind == AnnotatedToken.TOKEN_ALPHA
                    && names.contains(token.token)
                    && token.getAnnotationsWithTag("NE.people").isEmpty()) {
                AnnotatedToken next = getNextToken();
                if (next != null) {
                    if (next.kind == AnnotatedToken.TOKEN_SEP_SPACE) {
                        AnnotatedToken next2 = getNextToken();
                        if (next2 != null) {
                            // The word after the space must start with an uppercase letter.
                            if (next2.kind == AnnotatedToken.TOKEN_ALPHA
                                    && next2.token.matches("\\p{Lu}.+")) {
                                // The annotation spans 3 tokens: first name, space, following word.
                                Annotation annotation =
                                        newAnnotation("people", token.token + " " + next2.token, 3);
                                token.annotations =
                                        (Annotation[]) ArrayUtils.add(token.annotations, annotation);
                            }
                            // Push every pumped token back, in order, whether annotated or not.
                            pushToken(token);
                            pushToken(next);
                            pushToken(next2);
                            continue;
                        }
                    }
                    pushToken(token);
                    pushToken(next);
                    continue;
                }
            }
            pushToken(token);
        }
    }

    private Logger logger = Logger.getLogger("custom-semantic-processor");
    // Static data is shared by all threads: it is only read here, never modified.
    private static final String[] firstNames = { "John", "Bill", "Steve", "Robert", "George", "William", "Frank" };
    private HashSet<String> names = new HashSet<String>();
}