Write a Java Custom Tokenizer
If you derive your MyTokenizer class from
com.exalead.pdoc.analysis.JavaCustomTokenizer, you have to implement at
least the following:
@PropertyLabel(value = "JavaCustomTokenizer")
@CVComponentConfigClass(configClass = MyTokenizerConfig.class)
@CVComponentDescription(value = "My tokenizer in Java")
public class MyTokenizer extends com.exalead.pdoc.analysis.JavaCustomTokenizer
{
/**
* CustomDocumentProcessor requires a constructor that accepts a custom configuration.
* This constructor must call the JavaCustomTokenizer constructor.
*/
public MyTokenizer(MyTokenizerConfig config);
/**
* Called when a new document is about to get processed.
*/
public void newDocument();
/**
* Called when there is no more input to process in the current document.
* This is the last chance to attach annotations to the document if needed.
*/
public void endDocument();
/**
* Called when a new input chunk is to be processed.
* The processor must:
* - produce tokens from the text using newToken() and newAnnotation()
* - send the said tokens to the semantic pipe with pushToken()
*
* @param text the chunk text
* @param language the chunk language
* @param context the chunk context
* @throws InvalidTokenException
* @see newToken(), newAnnotation(), pushToken()
* @post Concatenation of the token forms must be strictly equal to the input string
*/
public void processChunk(String text, int language, String context) throws Exception;
/**
* Called at initialization to retrieve the annotation tags that are planned to be produced during
* tokenization.
* @return the list of all annotation tags needed or null if none
*/
public String[] declareAnnotations();
}
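The postcondition on processChunk() (the concatenation of the token forms must equal the input string) means that separators such as whitespace cannot simply be dropped; they have to be emitted as tokens too. The following is a minimal sketch of one way to cut a chunk so that nothing is lost. ChunkSplitter and split() are hypothetical names used for illustration and are not part of the Exalead API. In a real tokenizer, each returned form would presumably be passed to newToken() and the resulting token to pushToken().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Exalead API.
public class ChunkSplitter {

    /**
     * Splits the text into alternating runs of whitespace and
     * non-whitespace characters. No character is discarded, so joining
     * the returned forms gives back the original text.
     */
    public static List<String> split(String text) {
        List<String> forms = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            boolean space = Character.isWhitespace(text.charAt(start));
            int end = start + 1;
            // Extend the run while the characters are of the same kind.
            while (end < text.length()
                    && Character.isWhitespace(text.charAt(end)) == space) {
                end++;
            }
            forms.add(text.substring(start, end));
            start = end;
        }
        return forms;
    }

    public static void main(String[] args) {
        String text = "Hello  big world!";
        List<String> forms = split(text);
        // Inside processChunk(), each form would become one token:
        //   pushToken(newToken(form));
        System.out.println(forms);
        System.out.println(String.join("", forms).equals(text));
    }
}
```

Because every run is non-empty, each form also satisfies the "form is not empty" precondition documented for newToken().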
The JavaCustomTokenizer provides a handful of helper methods:
package com.exalead.pdoc.analysis;
public abstract class JavaCustomTokenizer extends CustomDocumentProcessor
{
...
/**
* Allocate a new token of the provided form.
* The token is either created or recycled from a previous use.
* The token type and language are deduced from the form and the chunk input language
* (they can be overridden though).
* @param form the new token form
* @return a fresh or recycled token
* @pre form is not null
* @pre form is not empty
* @post token kind is set using default typing
*/
protected AnnotatedToken newToken(String form) throws InvalidTokenException;
/**
* Allocate a new annotation with the provided tag, value and length.
* The annotation is either created or recycled from a previous use.
* @param tag the new annotation tag
* @param displayForm the new annotation value
* @param nbTokens the new annotation length
* @return a fresh or recycled annotation
* @pre tag must have been declared in declareAnnotations()
* @pre displayForm is not null
* @pre nbTokens > 0
*/
protected Annotation newAnnotation(String tag, String displayForm, int nbTokens) throws InvalidAnnotationException;
/**
* Send a token to the output stream.
*
* - validity of the token is checked
* - the token is added to the output buffer
* - if needed, the output buffer is flushed
* - the token is recycled
* In all cases, the token and its annotations are no longer usable after the call.
* @param token A token allocated through a call to newToken()
* @pre token is not null
* @pre token form is neither null nor empty
* @pre token type is defined
* @see newToken(), newAnnotation()
*/
protected void pushToken(AnnotatedToken token) throws InvalidTokenException;
/**
* Attach an annotation to the currently processed document after checking its validity.
* @param annotation the annotation to attach
* @pre annotation must have been allocated with newAnnotation()
* @see newAnnotation()
*/
protected void addDocumentAnnotation(Annotation annotation) throws InvalidAnnotationException;
...
}
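The allocate-then-push flow described by newToken() and pushToken() can be sketched without the Exalead libraries. In the example below, StubToken and StubTokenizer are hypothetical stand-ins that only mirror the documented preconditions (non-null, non-empty form; non-null token); they are not the real AnnotatedToken or JavaCustomTokenizer classes, and the real helpers throw InvalidTokenException rather than IllegalArgumentException. The language value 0 and the null context are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

public class HelperFlowSketch {

    /** Stand-in for AnnotatedToken (assumption: the real class is richer). */
    static class StubToken {
        final String form;
        StubToken(String form) { this.form = form; }
    }

    /** Stand-in for the JavaCustomTokenizer helpers; records pushed forms. */
    static abstract class StubTokenizer {
        final List<String> output = new ArrayList<>();

        protected StubToken newToken(String form) {
            if (form == null || form.isEmpty()) {
                throw new IllegalArgumentException("form must be non-null and non-empty");
            }
            return new StubToken(form);
        }

        protected void pushToken(StubToken token) {
            if (token == null) {
                throw new IllegalArgumentException("token must not be null");
            }
            output.add(token.form);
        }

        public abstract void processChunk(String text, int language, String context)
                throws Exception;
    }

    /** Emits runs of letters or digits as one token, any other char as its own token. */
    static class SketchTokenizer extends StubTokenizer {
        @Override
        public void processChunk(String text, int language, String context) {
            int i = 0;
            while (i < text.length()) {
                int end = i + 1;
                if (Character.isLetterOrDigit(text.charAt(i))) {
                    while (end < text.length()
                            && Character.isLetterOrDigit(text.charAt(end))) {
                        end++;
                    }
                }
                StubToken token = newToken(text.substring(i, end));
                pushToken(token); // the token must not be reused after this call
                i = end;
            }
        }
    }

    /** Runs the sketch tokenizer on one chunk and returns the pushed forms. */
    public static List<String> tokenize(String text) {
        SketchTokenizer tokenizer = new SketchTokenizer();
        tokenizer.processChunk(text, 0, null); // placeholder language and context
        return tokenizer.output;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a-b c1"));
    }
}
```

The sketch works on char values, so a character outside the Basic Multilingual Plane would be split into two single-char tokens; a production tokenizer would more likely iterate over code points.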