Write a Java Custom Semantic Processor
The difference from the Java Custom Tokenizer lies in the input:
- The tokenizer receives a text chunk to process.
- The Java Custom Semantic Processor must get its tokens from the pipeline (see Sample Semantic Processor).
Derive your MySemanticProcessor class from com.exalead.pdoc.analysis.JavaCustomSemanticProcessor and implement the following methods:
@PropertyLabel(value = "JavaCustomSemanticProcessor")
@CVComponentConfigClass(configClass = MySemanticProcessorConfig.class)
@CVComponentDescription(value = "My semantic processor in Java")
public class MySemanticProcessor extends com.exalead.pdoc.analysis.JavaCustomSemanticProcessor
{
    public MySemanticProcessor(MySemanticProcessorConfig config) throws Exception;

    /**
     * Called when a new document is about to get processed.
     */
    public void newDocument();

    /**
     * Called when there is no more input to process in the current document.
     * This is the last chance to attach annotations to the document if needed.
     */
    public void endDocument();

    /**
     * Called at initialization to retrieve the annotation tags that are planned to be used during processing.
     * Only declared annotations will be accessible on tokens retrieved with getNextToken().
     * @return the list of all annotation tags needed or null if none
     */
    public String[] declareAnnotations();

    /**
     * Called when a new input chunk is to be processed.
     * The processor must pump tokens from the pipe using getNextToken()
     * and return them to the pipe with pushToken() once processed.
     *
     * @param chunk the chunk text
     * @param language the chunk language
     * @param context the chunk context
     * @see getNextToken(), newAnnotation(), pushToken()
     */
    public void processChunk(String chunk, int language, String context) throws Exception;
}
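The pump-and-push contract of processChunk() can be illustrated with a small self-contained simulation. This is only a sketch: SketchToken, SketchPipe, CapitalizedTagger and PumpLoopSketch are hypothetical stand-ins written for this example, not classes of the product library, and the way a tag is attached to a token here (a plain list of strings) is an assumption, since the AnnotatedToken methods are not shown on this page. What it does mirror from the description above is the loop itself: pump until getNextToken() returns null, and push every token back, modified or not.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Drives the sketch: feeds tokens through the processor and renders the output. */
class PumpLoopSketch {
    static List<String> run(List<String> forms) {
        CapitalizedTagger p = new CapitalizedTagger();
        for (String f : forms) p.input.add(new SketchToken(f));
        p.newDocument();
        p.processChunk(String.join(" ", forms), 0, "text");
        p.endDocument();
        List<String> rendered = new ArrayList<>();
        for (SketchToken t : p.output) rendered.add(t.toString());
        return rendered;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Paris", "is", "big")));
    }
}

/** Hypothetical stand-in for a token: a form plus the tags attached to it. */
final class SketchToken {
    final String form;
    final List<String> tags = new ArrayList<>();
    SketchToken(String form) { this.form = form; }
    @Override public String toString() { return tags.isEmpty() ? form : form + tags; }
}

/** Hypothetical stand-in for the pipeline side of the base class (NOT the real API). */
abstract class SketchPipe {
    final Deque<SketchToken> input = new ArrayDeque<>();
    final List<SketchToken> output = new ArrayList<>();

    /** Mirrors getNextToken(): the next token, or null at end of input. */
    protected final SketchToken getNextToken() { return input.poll(); }

    /** Mirrors pushToken(): rejects null tokens and null or empty forms. */
    protected final void pushToken(SketchToken t) {
        if (t == null || t.form == null || t.form.isEmpty()) {
            throw new IllegalArgumentException("invalid token");
        }
        output.add(t);
    }

    public void newDocument() {}
    public void endDocument() {}
    public abstract void processChunk(String chunk, int language, String context);
}

/** Tags tokens that start with an uppercase letter; passes all tokens through. */
class CapitalizedTagger extends SketchPipe {
    @Override
    public void processChunk(String chunk, int language, String context) {
        SketchToken token;
        while ((token = getNextToken()) != null) {
            if (Character.isUpperCase(token.form.charAt(0))) {
                token.tags.add("capitalized");
            }
            pushToken(token); // untouched tokens are pushed back too
        }
    }
}
```

Running main prints [Paris[capitalized], is, big]. In this simulation, a token that is pumped but never pushed simply disappears from the output, which is why the loop pushes unconditionally.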
The JavaCustomSemanticProcessor base class also provides helper methods:
package com.exalead.pdoc.analysis;
public abstract class JavaCustomSemanticProcessor extends CustomDocumentProcessor
{
    ...
    /**
     * Pump the next token from the input stream.
     * @return the next token from the pipe or null if end of input is reached
     */
    protected final AnnotatedToken getNextToken();

    /**
     * Allocate a new annotation with the provided tag, value and length.
     * The annotation is either created or recycled from a previous use.
     *
     * @param tag the new annotation tag
     * @param displayForm the new annotation value
     * @param nbTokens the new annotation length
     * @return a fresh or recycled annotation
     * @pre tag must have been declared in declareAnnotations()
     * @pre displayForm is not null
     * @pre nbTokens > 0
     */
    protected final Annotation newAnnotation(String tag, String displayForm, int nbTokens)
            throws InvalidAnnotationException;

    /**
     * Send a token to the output stream.
     *
     * - validity of the token is checked
     * - the token is added to the output buffer
     * - if needed, the output buffer is flushed
     * - the token is recycled
     *
     * In all cases, the token and its annotations are not usable anymore after the call.
     *
     * @param token A token allocated through a call to newToken()
     * @pre token is not null
     * @pre token form is not null nor empty
     * @pre token type is defined
     * @see newToken(), newAnnotation()
     */
    protected final void pushToken(AnnotatedToken token) throws InvalidTokenException;

    /**
     * Attach an annotation to the currently processed document.
     *
     * @param annotation the annotation to attach
     * @pre annotation must have been allocated with newAnnotation()
     * @see newAnnotation()
     */
    protected final void addDocumentAnnotation(Annotation annotation) throws InvalidAnnotationException;
    ...
}
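How the helpers fit together across the document lifecycle can be sketched in the same simulated way. Again, nothing below is the product API: SketchProcessorBase, TokenCounter and DocumentAnnotationSketch are hypothetical stand-ins, tokens and annotations are reduced to plain strings, the tag name "tokenCount" is invented for the example, and passing 1 as the annotation length for a document-level annotation is an assumption (the page only states that nbTokens must be greater than 0). The sketch shows three points taken from the Javadoc above: tags are declared once through declareAnnotations(), newAnnotation() refuses an undeclared tag, and endDocument() is the last place to attach a document annotation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Drives the sketch: one document made of several chunks. */
class DocumentAnnotationSketch {
    static List<String> run(List<List<String>> chunks) {
        TokenCounter p = new TokenCounter();
        p.init();
        p.newDocument();
        for (List<String> chunk : chunks) {
            p.input.addAll(chunk);
            p.processChunk(String.join(" ", chunk), 0, "text");
        }
        p.endDocument();
        return p.documentAnnotations;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(List.of("a", "b"), List.of("c"))));
    }
}

/** Hypothetical stand-in for the helper side of the base class (NOT the real API). */
abstract class SketchProcessorBase {
    final Deque<String> input = new ArrayDeque<>();
    final List<String> output = new ArrayList<>();
    final List<String> documentAnnotations = new ArrayList<>();
    private final Set<String> declared = new HashSet<>();

    /** Simulates initialization: collects the declared tags once. */
    void init() {
        String[] tags = declareAnnotations();
        if (tags != null) declared.addAll(Arrays.asList(tags));
    }

    protected final String getNextToken() { return input.poll(); }

    protected final void pushToken(String token) { output.add(token); }

    /** Mirrors the documented preconditions of newAnnotation(). */
    protected final String newAnnotation(String tag, String value, int nbTokens) {
        if (!declared.contains(tag)) throw new IllegalStateException("undeclared tag: " + tag);
        if (value == null || nbTokens <= 0) throw new IllegalArgumentException("bad annotation");
        return tag + "=" + value;
    }

    protected final void addDocumentAnnotation(String annotation) {
        documentAnnotations.add(annotation);
    }

    public abstract String[] declareAnnotations();
    public abstract void newDocument();
    public abstract void endDocument();
    public abstract void processChunk(String chunk, int language, String context);
}

/** Counts tokens over all chunks of a document and attaches the total at the end. */
class TokenCounter extends SketchProcessorBase {
    private int count;

    @Override public String[] declareAnnotations() { return new String[] {"tokenCount"}; }

    @Override public void newDocument() { count = 0; } // reset per-document state

    @Override
    public void processChunk(String chunk, int language, String context) {
        String token;
        while ((token = getNextToken()) != null) {
            count++;
            pushToken(token);
        }
    }

    @Override
    public void endDocument() {
        // Last chance to attach something to the document.
        addDocumentAnnotation(newAnnotation("tokenCount", Integer.toString(count), 1));
    }
}
```

Running main prints [tokenCount=3]. Resetting the counter in newDocument() matters in this design because the same processor instance is reused from one document to the next in the simulation.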