Write a Java Custom Tokenizer

A Java Custom Tokenizer is useful when you need to process text with an external analyzer or implement a specific tokenization behavior. The JavaCustomTokenizer class lets you write your own code to split the input and, optionally, add annotations to the tokens it produces.

These tokens then continue through the indexing chain as usual (see Sample Tokenizer).

This page discusses:

Write a Java Custom Tokenizer
Caveats for Tokenizers
Sample Tokenizer

Write a Java Custom Tokenizer

If you derive your MyTokenizer class from com.exalead.pdoc.analysis.JavaCustomTokenizer, you have to implement at least the following:

@PropertyLabel(value = "JavaCustomTokenizer")
@CVComponentConfigClass(configClass = MyTokenizerConfig.class)
@CVComponentDescription(value = "My tokenizer in Java")
public class MyTokenizer extends com.exalead.pdoc.analysis.JavaCustomTokenizer
{
    /**
     * CustomDocumentProcessor requires a constructor accepting a custom configuration
     * This constructor must call JavaCustomTokenizer constructor
     */
    public MyTokenizer(MyTokenizerConfig config);

    /**
     * Called when a new document is about to get processed.
     */
    public void newDocument();

    /**
     * Called when there is no more input to process in the current document.
     * This is the last chance to attach annotations to the document if needed.
     */
    public void endDocument();

    /**
     * Called when a new input chunk is to be processed.
     * The processor must:
     * - produce tokens from the text using newToken() and newAnnotation()
     * - send the said tokens to the semantic pipe with pushToken()
     *
     * @param text the chunk text
     * @param language the chunk language
     * @param context the chunk context
     * @throws InvalidTokenException
     * @see newToken(), newAnnotation(), pushToken()
     * @post Concatenation of the token forms must be strictly equal to the input string
     */
    public void processChunk(String text, int language, String context) throws Exception;

    /**
     * Called at initialization to retrieve the annotation tags that are planned to be produced
     * during tokenization.
     * @return the list of all annotation tags needed or null if none
     */
    public String[] declareAnnotations();
}
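The @post condition on processChunk() implies that a tokenizer must not drop any character of the chunk, blanks and punctuation included. The following standalone sketch uses only the JDK and no Exalead classes; ChunkCoverage and its methods are hypothetical names introduced for illustration. It shows one possible way to check that invariant on a list of token forms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Exalead API: it only illustrates the
// "concatenation of the token forms equals the input" postcondition.
class ChunkCoverage {
    // Each character belongs to exactly one alternative, so nothing is skipped.
    private static final Pattern LOSSLESS =
        Pattern.compile("\\p{L}+|\\p{N}+|\\p{Z}+|\\p{P}|[^\\p{L}\\p{N}\\p{Z}\\p{P}]+");

    /** Splits the text into forms without dropping any character. */
    static List<String> losslessSplit(String text) {
        List<String> forms = new ArrayList<>();
        Matcher m = LOSSLESS.matcher(text);
        while (m.find()) {
            forms.add(m.group());
        }
        return forms;
    }

    /** Returns true if the forms, joined in order, rebuild the input exactly. */
    static boolean coversInput(String text, List<String> forms) {
        return String.join("", forms).equals(text);
    }

    public static void main(String[] args) {
        String text = "Hello, world 42";
        // The lossless split keeps blanks and punctuation as tokens: prints true.
        System.out.println(coversInput(text, losslessSplit(text)));
        // Splitting on whitespace drops the blanks: prints false.
        System.out.println(coversInput(text, Arrays.asList(text.split("\\s+"))));
    }
}
```

A check of this kind can be useful in unit tests of the splitting logic, before the tokenizer is plugged into the indexing chain.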

The JavaCustomTokenizer provides a handful of helper methods:

package com.exalead.pdoc.analysis;

public abstract class JavaCustomTokenizer extends CustomDocumentProcessor
{
    ...
    /**
     * Allocate a new token of the provided form.
     * The token is either created or recycled from a previous use.
     * The token type and language are deduced from the form and the chunk input language
     * (they can be overridden though).
     * @param form the new token form
     * @return a fresh or recycled token
     * @pre form is not null
     * @pre form is not empty
     * @post token kind is set using default typing
     */
    protected AnnotatedToken newToken(String form) throws InvalidTokenException;
    /**
     * Allocate a new annotation with the provided tag, value and length.
     * The annotation is either created or recycled from a previous use.
     * @param tag the new annotation tag
     * @param displayForm the new annotation value
     * @param nbTokens the new annotation length
     * @return a fresh or recycled annotation
     * @pre tag must have been declared in declareAnnotations()
     * @pre displayForm is not null
     * @pre nbTokens > 0
     */
    protected Annotation newAnnotation(String tag, String displayForm, int nbTokens)
            throws InvalidAnnotationException;
    /**
     * Send a token to the output stream.
     *
     * - validity of the token is checked
     * - the token is added to the output buffer
     * - if needed, the output buffer is flushed
     * - the token is recycled
     * In all cases, the token and its annotations are no longer usable after the call.
     * @param token A token allocated through a call to newToken()
     * @pre token is not null
     * @pre token form is not null nor empty
     * @pre token type is defined
     * @see newToken(), newAnnotation()
     */
    protected void pushToken(AnnotatedToken token) throws InvalidTokenException;
    /**
     * Attach an annotation to the currently processed document after checking its validity.
     * @param annotation the annotation to attach
     * @pre annotation must have been allocated with newAnnotation()
     * @see newAnnotation()
     */
    protected void addDocumentAnnotation(Annotation annotation) throws InvalidAnnotationException;
    ...
}
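The comments on newToken() and pushToken() state that tokens are recycled and are no longer usable after being pushed. The sketch below is a simplified model of such a pool, not the Exalead implementation; TokenPoolDemo and PooledToken are hypothetical stand-ins. It illustrates why keeping a reference to a pushed token is unsafe: the same instance may be handed out again with a different form.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of an object pool; NOT the Exalead implementation.
class TokenPoolDemo {
    /** Hypothetical stand-in for a pooled token. */
    static final class PooledToken {
        String form;
    }

    private final Deque<PooledToken> free = new ArrayDeque<>();
    private final StringBuilder output = new StringBuilder();

    /** Hands out a recycled token if one is available, else a fresh one. */
    PooledToken newToken(String form) {
        PooledToken t = free.isEmpty() ? new PooledToken() : free.pop();
        t.form = form;
        return t;
    }

    /** Emits the token, then returns it to the pool for reuse. */
    void pushToken(PooledToken t) {
        output.append('[').append(t.form).append(']');
        t.form = null; // the pushed instance no longer carries its form
        free.push(t);
    }

    String output() {
        return output.toString();
    }

    public static void main(String[] args) {
        TokenPoolDemo pool = new TokenPoolDemo();
        PooledToken first = pool.newToken("Hello");
        pool.pushToken(first);
        PooledToken second = pool.newToken("world");
        // The same instance was handed out again: 'first' now aliases 'second'.
        System.out.println(first == second);   // prints true
        System.out.println(first.form);        // prints world
        pool.pushToken(second);
        System.out.println(pool.output());     // prints [Hello][world]
    }
}
```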

Caveats for Tokenizers

When creating new tokens, you must specify their form (the token attribute) and, optionally, their annotation set. The kind, language, and offset are set automatically. You can override the kind if the default typing does not suit your needs, but do so carefully: the token kind largely determines how, and whether, a token is indexed. If a token is malformed, newToken() or pushToken() throws an InvalidTokenException.
When creating new annotations, you must specify their tag, display form, and number of tokens, and, optionally, their trust level. If an annotation is malformed, newAnnotation() or addDocumentAnnotation() throws an InvalidAnnotationException.
Do not allocate annotated tokens and annotations yourself. Always use newToken() and newAnnotation(): to save RAM, these objects are pooled and recycled once pushed to the pipeline.
Because tokens and annotations are recycled, they are no longer usable once pushed to the pipeline. Request a new token or annotation through newToken() or newAnnotation() if required.
Avoid allocating many tokens and annotations before pushing them to the pipeline. Ideally, to keep RAM consumption low, push each token before allocating the next one.
Your code runs in a multi-threaded environment. Each thread has its own tokenizer instance, so instance-level code does not have to be thread safe. However, static objects are shared across threads, and concurrent modifications to them can cause issues.
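The last caveat can be illustrated with a small standalone sketch. CountingTokenizer is a hypothetical stand-in for a tokenizer class and uses only the JDK. It assumes one instance per thread, as described above, and keeps shared statistics in static fields. Instance fields need no synchronization, whereas the static fields rely on thread-safe types; a plain static long or HashMap in their place could lose updates.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a tokenizer: one instance per thread,
// static state shared by all instances.
class CountingTokenizer {
    // Shared by every instance, hence by every thread: must be thread safe.
    private static final AtomicLong TOTAL_TOKENS = new AtomicLong();
    private static final ConcurrentMap<String, AtomicLong> PER_FORM = new ConcurrentHashMap<>();

    // Instance state is confined to one thread and needs no synchronization.
    private long localTokens;

    void count(String form) {
        localTokens++;
        TOTAL_TOKENS.incrementAndGet();
        PER_FORM.computeIfAbsent(form, k -> new AtomicLong()).incrementAndGet();
    }

    long localTokens() {
        return localTokens;
    }

    static long totalTokens() {
        return TOTAL_TOKENS.get();
    }

    static long countFor(String form) {
        AtomicLong c = PER_FORM.get(form);
        return c == null ? 0 : c.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                CountingTokenizer tokenizer = new CountingTokenizer(); // one per thread
                for (int n = 0; n < 10_000; n++) {
                    tokenizer.count("the");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println(totalTokens()); // prints 40000
    }
}
```

Where possible, keeping state in instance fields avoids the problem altogether; thread-safe static structures are only needed for data that genuinely has to be shared.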

Sample Tokenizer

This sample demonstrates how to write a custom tokenizer that:

Splits the input text into alphabetical, numerical, punctuation, or blank tokens.
Annotates with "Capitalized" each token that starts with an upper-case letter.
Annotates with "Number" each token made of digits.
Pushes the produced tokens to the semantic pipeline.

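package com.exalead.customcode.analysis;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.exalead.pdoc.analysis.JavaCustomTokenizer;
// Imports for the remaining Exalead and logging types (AnnotatedToken, Annotation,
// Language, Logger, and the CVComponent annotations) depend on your SDK and are not shown.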

/**
 * This processor can be configured with:
 *  <CustomDocumentProcessor classId="com.exalead.customcode.analysis.CustomTokenizer">
 *   <KeyValue key="Meta" value="mymeta" />
 *  </CustomDocumentProcessor>
 */
@CVComponentConfigClass(configClass = com.exalead.customcode.analysis.CustomTokenizer.CustomTokenizerConfig.class)
@CVComponentDescription(value = "My tokenizer in Java")
public class CustomTokenizer extends JavaCustomTokenizer
{
    public static class CustomTokenizerConfig implements CVComponentConfig
    {
        private String meta;
        public String getMeta() {
            return meta;
        }
        @IsMandatory(true)
        public void setMeta(String meta) {
            this.meta = meta;
        }
    }

    public CustomTokenizer(CustomTokenizerConfig config) throws Exception {
        super(config);
    }
    @Override
    public void newDocument() {
        logger.debug("I'm about to start tokenizing a new document!");
    }
    @Override
    public void endDocument() {
        logger.debug("Done with this document!");
    }
    @Override
    public void processChunk(String text, int language, String context) throws Exception {
        logger.debug("Tokenizing [" + text + "] in context [" + context + "] with language ["
                + Language.name(language) + "]");
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            AnnotatedToken token = newToken(text.substring(matcher.start(), matcher.end()));
            if (token.token.matches("^\\p{Lu}.*$")) {
                Annotation[] a = { newAnnotation("Capitalized", token.token.toLowerCase(), 1) };
                token.annotations = a;
            } else if (token.token.matches("^[0-9]+$")) {
                Annotation[] a = { newAnnotation("Number", token.token.toLowerCase(), 1) };
                token.annotations = a;
            }
            pushToken(token);
        }
    }
    @Override
    public String[] declareAnnotations() {
        return new String[] { "Capitalized", "Number" };
    }

    private static final Pattern pattern =
            Pattern.compile("\\p{L}+|\\p{N}+|\\p{Z}+|\\p{P}|[^\\p{L}\\p{N}\\p{Z}\\p{P}]+");
    private Logger logger = Logger.getLogger("mycustomtokenizer");
}
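To preview the splitting and annotation rules of the sample outside the indexing chain, they can be exercised with the JDK alone. The sketch below is an approximation under that assumption: SampleRules and describe() are hypothetical names, and Exalead tokens and annotations are replaced by plain strings of the form "form" or "form/Tag".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone approximation of the sample's rules; Exalead tokens and
// annotations are replaced by plain strings for illustration.
class SampleRules {
    private static final Pattern SPLIT =
        Pattern.compile("\\p{L}+|\\p{N}+|\\p{Z}+|\\p{P}|[^\\p{L}\\p{N}\\p{Z}\\p{P}]+");
    private static final Pattern CAPITALIZED = Pattern.compile("^\\p{Lu}.*$");
    private static final Pattern NUMBER = Pattern.compile("^[0-9]+$");

    /** Returns one "form" or "form/Tag" entry per token, in input order. */
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SPLIT.matcher(text);
        while (m.find()) {
            String form = m.group();
            if (CAPITALIZED.matcher(form).matches()) {
                out.add(form + "/Capitalized");
            } else if (NUMBER.matcher(form).matches()) {
                out.add(form + "/Number");
            } else {
                out.add(form); // blanks, punctuation, lower-case words
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One entry per token, blanks included, for example "Paris/Capitalized"
        // and "20/Number".
        for (String entry : describe("Paris has 20 districts.")) {
            System.out.println("<" + entry + ">");
        }
    }
}
```

For the input "Paris has 20 districts.", this yields eight entries: "Paris" tagged Capitalized, "20" tagged Number, and the blanks, the lower-case words, and the final period left untagged.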