Write a Java Custom Tokenizer
If you derive your MyTokenizer class from com.exalead.pdoc.analysis.JavaCustomTokenizer, you have to implement at least the following:
@PropertyLabel(value = "JavaCustomTokenizer")
@CVComponentConfigClass(configClass = MyTokenizerConfig.class)
@CVComponentDescription(value = "My tokenizer in Java")
public class MyTokenizer extends com.exalead.pdoc.analysis.JavaCustomTokenizer {
    /**
     * CustomDocumentProcessor requires a constructor accepting a custom configuration
     * This constructor must call JavaCustomTokenizer constructor
     */
    public MyTokenizer(MyTokenizerConfig config);

    /**
     * Called when a new document is about to get processed.
     */
    public void newDocument();

    /**
     * Called when there is no more input to process in the current document.
     * This is the last chance to attach annotations to the document if needed.
     */
    public void endDocument();

    /**
     * Called when a new input chunk is to be processed.
     * The processor must:
     * - produce tokens from the text using newToken() and newAnnotation()
     * - send the said tokens to the semantic pipe with pushToken()
     *
     * @param text the chunk text
     * @param language the chunk language
     * @param context the chunk context
     * @throws InvalidTokenException
     * @see newToken(), newAnnotation(), pushToken()
     * @post Concatenation of the token forms must be strictly equal to the input string
     */
    public void processChunk(String text, int language, String context) throws Exception;

    /**
     * Called at initialization to retrieve the annotation tags that are planned to be produced during tokenization.
     * @return the list of all annotation tags needed or null if none
     */
    public String[] declareAnnotations();
}
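The postcondition on processChunk() (the concatenated token forms must equal the input text) is easiest to meet when the splitting step never drops a character, whitespace included. The following is a minimal sketch of such a splitting step in plain Java, independent of the Exalead classes. The class name ChunkSplitter, the method split(), and the rule of grouping letter/digit runs, whitespace runs, and single other characters are all illustrative assumptions, not part of the product API. In an actual tokenizer, each resulting form would presumably be passed to newToken() and then pushToken().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Exalead API. */
public class ChunkSplitter {

    /** Character classes used to decide where one token form ends. */
    private enum Kind { WORD, SPACE, OTHER }

    private static Kind kindOf(int codePoint) {
        if (Character.isLetterOrDigit(codePoint)) return Kind.WORD;
        if (Character.isWhitespace(codePoint)) return Kind.SPACE;
        return Kind.OTHER;
    }

    /**
     * Splits text into non-empty forms that together cover every character
     * of the input, so concatenating the forms gives back the input exactly.
     */
    public static List<String> split(String text) {
        List<String> forms = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int next = i + Character.charCount(cp);
            Kind kind = kindOf(cp);
            // A form ends at the end of the input, after every OTHER character,
            // or when the next character belongs to a different class.
            boolean boundary = next >= text.length()
                    || kind == Kind.OTHER
                    || kindOf(text.codePointAt(next)) != kind;
            if (boundary) {
                forms.add(text.substring(start, next));
                start = next;
            }
            i = next;
        }
        return forms;
    }

    public static void main(String[] args) {
        String text = "Hello, world  42!";
        List<String> forms = split(text);
        System.out.println(forms.size() + " forms");
        // Check the postcondition required of processChunk(): prints "true"
        System.out.println(String.join("", forms).equals(text));
    }
}
```

Keeping whitespace runs as forms of their own, rather than discarding them, is what makes the round-trip check hold in this sketch.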
The JavaCustomTokenizer class provides a handful of helper methods:
package com.exalead.pdoc.analysis;

public abstract class JavaCustomTokenizer extends CustomDocumentProcessor {
    ...

    /**
     * Allocate a new token of the provided form.
     * The token is either created or recycled from a previous use.
     * The token type and language are deduced from the form and the chunk input language (they can be overridden though).
     * @param form the new token form
     * @return a fresh or recycled token
     * @pre form is not null
     * @pre form is not empty
     * @post token kind is set using default typing
     */
    protected AnnotatedToken newToken(String form) throws InvalidTokenException;

    /**
     * Allocate a new annotation with the provided tag, value and length.
     * The annotation is either created or recycled from a previous use.
     * @param tag the new annotation tag
     * @param displayForm the new annotation value
     * @param nbTokens the new annotation length
     * @return a fresh or recycled annotation
     * @pre tag must have been declared in declareAnnotations()
     * @pre displayForm is not null
     * @pre nbTokens > 0
     */
    protected Annotation newAnnotation(String tag, String displayForm, int nbTokens) throws InvalidAnnotationException;

    /**
     * Send a token to the output stream.
     *
     * - validity of the token is checked
     * - the token is added to the output buffer
     * - if needed, the output buffer is flushed
     * - the token is recycled
     * In all cases, the token and its annotations are no longer usable after the call.
     * @param token A token allocated through a call to newToken()
     * @pre token is not null
     * @pre token form is neither null nor empty
     * @pre token type is defined
     * @see newToken(), newAnnotation()
     */
    protected void pushToken(AnnotatedToken token) throws InvalidTokenException;

    /**
     * Attach an annotation to the currently processed document after checking its validity.
     * @param annotation the annotation to attach
     * @pre annotation must have been allocated with newAnnotation()
     * @see newAnnotation()
     */
    protected void addDocumentAnnotation(Annotation annotation) throws InvalidAnnotationException;

    ...
}
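The helper documentation states that tokens may be recycled and that a token is no longer usable once it has been passed to pushToken(). The sketch below illustrates that contract with a stand-in pool written in plain Java. TokenPool, PooledToken, and their methods are hypothetical names invented for this illustration and do not reflect how JavaCustomTokenizer is actually implemented. It only shows why a reference kept after pushing a token can later expose unrelated content.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Stand-in illustration of a recycling token pool; NOT the Exalead implementation. */
public class TokenPool {

    /** Hypothetical mutable token that holds only a form. */
    public static final class PooledToken {
        private String form;
        public String form() { return form; }
    }

    private final Deque<PooledToken> free = new ArrayDeque<>();
    private final List<String> output = new ArrayList<>();

    /** Returns a recycled token when one is available, otherwise a fresh one. */
    public PooledToken newToken(String form) {
        if (form == null || form.isEmpty()) {
            throw new IllegalArgumentException("form must be non-null and non-empty");
        }
        PooledToken token = free.isEmpty() ? new PooledToken() : free.pop();
        token.form = form;
        return token;
    }

    /** Copies the form to the output, then recycles the token. */
    public void pushToken(PooledToken token) {
        output.add(token.form);
        token.form = null;   // the caller must not use the token after this point
        free.push(token);
    }

    public List<String> output() { return output; }

    public static void main(String[] args) {
        TokenPool pool = new TokenPool();
        PooledToken first = pool.newToken("Hello");
        pool.pushToken(first);
        PooledToken second = pool.newToken("world");
        // The same instance was reused, so 'first' now exposes the new form
        System.out.println(first == second);   // prints "true"
        System.out.println(first.form());      // prints "world"
        pool.pushToken(second);
        System.out.println(pool.output());     // prints "[Hello, world]"
    }
}
```

Under this reading of the contract, anything needed from a token (its form, for example) should be copied before the call to pushToken() rather than read from the token afterwards.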