Write a Java Custom Semantic Processor
The difference from the Java Custom Tokenizer lies in the input:
- The tokenizer receives a text chunk to process.
- The Java Custom Semantic Processor has to get its tokens from the pipeline (see Sample Semantic Processor).
Derive your MySemanticProcessor class from com.exalead.pdoc.analysis.JavaCustomSemanticProcessor and implement:
@PropertyLabel(value = "JavaCustomSemanticProcessor")
@CVComponentConfigClass(configClass = MySemanticProcessorConfig.class)
@CVComponentDescription(value = "My semantic processor in Java")
public class MySemanticProcessor extends com.exalead.pdoc.analysis.JavaCustomSemanticProcessor {

    public MySemanticProcessor(MySemanticProcessorConfig config) throws Exception;

    /**
     * Called when a new document is about to get processed.
     */
    public void newDocument();

    /**
     * Called when there is no more input to process in the current document.
     * This is the last chance to attach annotations to the document if needed.
     */
    public void endDocument();

    /**
     * Called at initialization to retrieve the annotation tags that are planned to be used during processing.
     * Only declared annotations will be accessible on tokens retrieved with getNextToken().
     * @return the list of all annotation tags needed or null if none
     */
    public String[] declareAnnotations();

    /**
     * Called when a new input chunk is to be processed.
     * The processor must pump tokens from pipe using getNextToken()
     * and return them once processed to the pipe with pushToken().
     *
     * @param chunk the chunk text
     * @param language the chunk language
     * @param context the chunk context
     * @see getNextToken(), newAnnotation(), pushToken()
     */
    public void processChunk(String chunk, int language, String context) throws Exception;
}
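The pump-and-push loop that processChunk() is expected to run can be sketched as follows. This is a minimal illustration, not the product API: FakeSemanticProcessor, feed(), pushed(), LowercasingProcessor and PumpLoopDemo are hypothetical stand-ins, and tokens are modelled as plain strings instead of AnnotatedToken so that the example compiles without the Exalead libraries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the real base class: it only mimics the
// getNextToken()/pushToken() contract described on this page.
abstract class FakeSemanticProcessor {
    private final Deque<String> input = new ArrayDeque<>();
    private final List<String> output = new ArrayList<>();

    // Demo helper (not part of the real API): preload the pipe with tokens.
    final void feed(List<String> tokens) {
        input.addAll(tokens);
    }

    // Demo helper (not part of the real API): tokens pushed downstream so far.
    final List<String> pushed() {
        return output;
    }

    // Mimics getNextToken(): returns null once the input is exhausted.
    protected final String getNextToken() {
        return input.poll();
    }

    // Mimics pushToken(): rejects a null or empty token form.
    protected final void pushToken(String token) {
        if (token == null || token.isEmpty()) {
            throw new IllegalArgumentException("token form must not be null or empty");
        }
        output.add(token);
    }

    public abstract void processChunk(String chunk, int language, String context);
}

// Example processor: lowercases every token it pumps from the pipe.
class LowercasingProcessor extends FakeSemanticProcessor {
    @Override
    public void processChunk(String chunk, int language, String context) {
        String token;
        // Pump until the pipe reports end of input, pushing each token back.
        while ((token = getNextToken()) != null) {
            pushToken(token.toLowerCase(Locale.ROOT));
        }
    }
}

final class PumpLoopDemo {
    static List<String> run(List<String> tokens) {
        LowercasingProcessor processor = new LowercasingProcessor();
        processor.feed(tokens);
        processor.processChunk("Hello World", 0, "text");
        return processor.pushed();
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Hello", "World"))); // prints [hello, world]
    }
}
```

The point of the sketch is the loop shape: every token taken from the pipe is pushed back once processed, and the loop stops when getNextToken() returns null.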
The JavaCustomSemanticProcessor base class also provides helper methods:
package com.exalead.pdoc.analysis;

public abstract class JavaCustomSemanticProcessor extends CustomDocumentProcessor {
    ...

    /**
     * Pump the next token from the input stream.
     * @return the next token from the pipe or null if end of input is reached
     */
    protected final AnnotatedToken getNextToken();

    /**
     * Allocate a new annotation with the provided tag, value and length.
     * The annotation is either created or recycled from a previous use.
     *
     * @param tag the new annotation tag
     * @param displayForm the new annotation value
     * @param nbTokens the new annotation length
     * @return a fresh or recycled annotation
     * @pre tag must have been declared in declareAnnotations()
     * @pre displayForm is not null
     * @pre nbTokens > 0
     */
    protected final Annotation newAnnotation(String tag, String displayForm, int nbTokens) throws InvalidAnnotationException;

    /**
     * Send a token to the output stream.
     *
     * - validity of the token is checked
     * - the token is added to the output buffer
     * - if needed, the output buffer is flushed
     * - the token is recycled
     *
     * In all cases, the token and its annotations are not usable anymore after the call.
     *
     * @param token A token allocated through a call to newToken()
     * @pre token is not null
     * @pre token form is not null nor empty
     * @pre token type is defined
     * @see newToken(), newAnnotation()
     */
    protected final void pushToken(AnnotatedToken token) throws InvalidTokenException;

    /**
     * Attach an annotation to the currently processed document
     *
     * @param annotation the annotation to attach
     * @pre annotation must have been allocated with newAnnotation()
     * @see newAnnotation()
     */
    protected final void addDocumentAnnotation(Annotation annotation) throws InvalidAnnotationException;

    ...
}
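The annotation helpers above come with preconditions: a tag must be declared before use, the value must be non-null, and the length must be positive. The sketch below illustrates how these could interact with endDocument(). Everything in it is a hypothetical stand-in (FakeAnnotation, FakeAnnotatingProcessor, WordCountProcessor, sawToken(), misuse(), DocumentAnnotationDemo), and it throws IllegalArgumentException where the real class declares InvalidAnnotationException. The actual checks performed by the product may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an annotation (not the real Annotation class).
final class FakeAnnotation {
    final String tag;
    final String value;
    final int nbTokens;

    FakeAnnotation(String tag, String value, int nbTokens) {
        this.tag = tag;
        this.value = value;
        this.nbTokens = nbTokens;
    }
}

// Hypothetical stand-in for the base class, enforcing the documented preconditions.
abstract class FakeAnnotatingProcessor {
    private Set<String> declared; // filled lazily from declareAnnotations()
    private final List<FakeAnnotation> documentAnnotations = new ArrayList<>();

    public abstract String[] declareAnnotations();

    public abstract void endDocument();

    protected final FakeAnnotation newAnnotation(String tag, String value, int nbTokens) {
        if (declared == null) {
            declared = new HashSet<>();
            String[] tags = declareAnnotations();
            if (tags != null) {
                declared.addAll(Arrays.asList(tags));
            }
        }
        if (!declared.contains(tag)) {
            throw new IllegalArgumentException("undeclared annotation tag: " + tag);
        }
        if (value == null || nbTokens <= 0) {
            throw new IllegalArgumentException("value must be non-null and nbTokens > 0");
        }
        return new FakeAnnotation(tag, value, nbTokens);
    }

    protected final void addDocumentAnnotation(FakeAnnotation annotation) {
        documentAnnotations.add(annotation);
    }

    // Demo helper (not part of the real API): annotations attached to the document.
    final List<FakeAnnotation> documentAnnotations() {
        return documentAnnotations;
    }
}

// Example processor: attaches the number of seen tokens as a document annotation.
class WordCountProcessor extends FakeAnnotatingProcessor {
    private int count;

    @Override
    public String[] declareAnnotations() {
        return new String[] { "wordcount" };
    }

    public void newDocument() {
        count = 0;
    }

    // In a real processor this counting would happen inside processChunk().
    void sawToken() {
        count++;
    }

    @Override
    public void endDocument() {
        // Last chance to attach annotations to the document.
        addDocumentAnnotation(newAnnotation("wordcount", Integer.toString(count), 1));
    }

    // Deliberately uses a tag that was never declared.
    void misuse() {
        newAnnotation("undeclared", "x", 1);
    }
}

final class DocumentAnnotationDemo {
    static List<FakeAnnotation> run(int tokens) {
        WordCountProcessor processor = new WordCountProcessor();
        processor.newDocument();
        for (int i = 0; i < tokens; i++) {
            processor.sawToken();
        }
        processor.endDocument();
        return processor.documentAnnotations();
    }
}
```

In this sketch, calling DocumentAnnotationDemo.run(3) yields a single document annotation with tag "wordcount" and value "3", while misuse() is rejected because its tag is not returned by declareAnnotations().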