Implementing Format Plugins

The Text Extractor (all mime types) component is used during analysis to extract text and metadata from various file types (Office files, PDF, and so on). A similar component also produces HTML previews for the same set of file types, and generates thumbnails when results are displayed.

This system can be extended through the plugin mechanism to support more file types, both for extraction at indexing time and for HTML preview and thumbnail generation.

Technical Overview

A format plugin is a regular plugin and its main component can have an associated configuration object. For more information, see Packaging the connector as a plugin.

The format component must implement:

  • DocumentPartTransformer (com.exalead.pdoc.plugins.DocumentPartTransformer)

  • and, as usual, CVComponent (com.exalead.mercury.component.CVComponent).

It must provide one of the following constructors:

  • Either a default constructor (taking no argument), if the component has no associated configuration.

  • Or, as usual, a constructor taking an object implementing CVComponentConfig. In that case, a CVComponentConfigClass annotation must be present. See Top level component class(es).

    Note: You may want to extend the DocumentPartTransformer.DefaultImpl class (com.exalead.pdoc.plugins.DocumentPartTransformer.DefaultImpl) which already provides default methods.
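
The two constructor options can be sketched as follows. This is a minimal illustration, not the product API: the interfaces and the annotation below are local stand-ins declared only so that the sketch compiles on its own (a real plugin imports the Exalead types instead), the annotation element name configClass is an assumption, and SampleFormatConfig with its maxBytes setting is hypothetical.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Stand-ins so the sketch compiles alone; real code imports the Exalead types.
interface CVComponent {}
interface CVComponentConfig {}
interface DocumentPartTransformer {}

// Stand-in annotation; the element name "configClass" is an assumption.
@Retention(RetentionPolicy.RUNTIME)
@interface CVComponentConfigClass {
    Class<? extends CVComponentConfig> configClass();
}

// Option 1: no associated configuration, so a default constructor is enough.
class PlainFormatPlugin implements DocumentPartTransformer, CVComponent {
    public PlainFormatPlugin() {
    }
}

// Hypothetical configuration object with a single illustrative setting.
class SampleFormatConfig implements CVComponentConfig {
    private int maxBytes = 1024 * 1024;

    public int getMaxBytes() {
        return maxBytes;
    }

    public void setMaxBytes(int maxBytes) {
        this.maxBytes = maxBytes;
    }
}

// Option 2: the constructor takes the configuration object, and the
// annotation names the configuration class.
@CVComponentConfigClass(configClass = SampleFormatConfig.class)
class ConfiguredFormatPlugin implements DocumentPartTransformer, CVComponent {
    private final SampleFormatConfig config;

    public ConfiguredFormatPlugin(SampleFormatConfig config) {
        this.config = config;
    }

    int maxBytes() {
        return config.getMaxBytes();
    }
}
```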

Two methods must be implemented as described in the following sections.

First method

The first method allows you to advertise the supported output MIME types, that is to say, the MIME type of data produced after transformation, for the two kinds of transformation (extraction of data, or display).

/**
  * Get the list of supported MIME output formats per transformation kind.
  * For TransformKind.display, the types advertise the MIME type of the
  * first produced part (a hypertext text/html document may have
  * additional parts with different formats).
  *
  * Note: the returned list may include Format.MIME_GENERIC to advertise
  * a filter producing any type of document or an unknown subset of
  * document types.
  *
  * @param kind
  *        the kind of transformation
  * @return an empty list if the transformation is not supported
**/
     public List<String> getSupporterOutputMime(TransformKind kind);

Example

If the component is able to produce plain text when extracting, and HTML for display, the code can be:

public List<String> getSupporterOutputMime(TransformKind kind)
     {
         final List<String> list = new ArrayList<String>();
         if (kind == TransformKind.extract) {
             list.add("text/plain");
         } else if (kind == TransformKind.display) {
             list.add("text/html");
         }
         return list;
     }
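
When a component supports several output types, the same contract can also be expressed as a lookup table. The sketch below is one possible variant, not the product's implementation: TransformKind is a local stand-in enum with only the two kinds named on this page, and OutputMimeTable is a hypothetical class name. It returns an empty list for any kind that has no entry.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Stand-in for the TransformKind type used above (only the two kinds
// named in the text); real code uses the type provided by the product.
enum TransformKind { extract, display }

class OutputMimeTable {
    private static final Map<TransformKind, List<String>> SUPPORTED =
            new EnumMap<>(TransformKind.class);

    static {
        SUPPORTED.put(TransformKind.extract, Collections.singletonList("text/plain"));
        SUPPORTED.put(TransformKind.display, Collections.singletonList("text/html"));
    }

    // Same contract as the example above: an empty list means that the
    // transformation kind is not supported.
    static List<String> getSupporterOutputMime(TransformKind kind) {
        final List<String> mimes = SUPPORTED.get(kind);
        return mimes == null ? Collections.<String>emptyList() : mimes;
    }
}
```

Unlike the ArrayList version, the lists returned here are immutable, so callers cannot alter the advertised types by accident.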

Second method

The second method processes the input document and produces extracted data or previews.

/**
  * Transform a part 'part' using the format 'format' into a destination
  * document. Use getSupporterOutputMime() to get the list of supported
  * output MIME types.
  *
  * @param part
  *        the input part
  * @param input
  *        the source document (part is probably in this document)
  * @param output
  *        the target document (parts are stored inside this document).
  *        Note: input and output may be the same object
  * @param format
  *        the transformation format
  * @throws UnsupportedInputFormatException
  *        if the input format is not supported by the filter (in this
  *        case, the upstream client will give up with the current filter)
  * @throws UnsupportedOutputFormatException
  *        if the output format is not supported by the filter (in this
  *        case, the upstream client may retry with a different format,
  *        using the same transformer)
  * @throws NoSuchMethodException
  *        if the method is not supported
  * @throws TransformationException
  *        upon error (in this case, the upstream client may choose to
  *        give up on the input, or select another filter)
**/
  public void transform(DocumentPart part, ProcessableDocument input,
          ProcessableDocument output, Format format)
          throws TransformationException, UnsupportedInputFormatException,
          UnsupportedOutputFormatException, NoSuchMethodException;

This function takes:

  • a part as input (the part contains data, and associated metadata),

  • the related input document, which is generally unused,

  • the output document where multiple parts might be added,

  • and the requested format (whether extraction or display processing is requested, and the output MIME type).
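
The exception contract described in the Javadoc above can be pictured from the caller's side. The sketch below is only an illustration of that contract, not the actual upstream client: the exception classes are local stand-ins for the ones named in the Javadoc, and SimpleTransformer and UpstreamClientSketch are hypothetical, simplified types that work on a filename and an output MIME type instead of document parts.

```java
import java.util.List;

// Stand-ins for the exception types named in the Javadoc.
class UnsupportedInputFormatException extends Exception {
    UnsupportedInputFormatException(String message) { super(message); }
}

class UnsupportedOutputFormatException extends Exception {
    UnsupportedOutputFormatException(String message) { super(message); }
}

class TransformationException extends Exception {
    TransformationException(String message) { super(message); }
}

// Hypothetical, simplified transformer: filename and output MIME in, text out.
interface SimpleTransformer {
    String transform(String filename, String outputMime)
            throws UnsupportedInputFormatException,
                   UnsupportedOutputFormatException,
                   TransformationException;
}

class UpstreamClientSketch {
    // Tries each filter in turn and, for each filter, each output MIME type.
    // Returns the first produced content, or null if nothing could be produced.
    static String firstResult(List<SimpleTransformer> filters,
                              String filename, List<String> outputMimes) {
        for (SimpleTransformer filter : filters) {
            try {
                for (String mime : outputMimes) {
                    try {
                        return filter.transform(filename, mime);
                    } catch (UnsupportedOutputFormatException e) {
                        // Same transformer, retry with the next output format.
                    }
                }
            } catch (UnsupportedInputFormatException e) {
                // This filter cannot read the input: give up with this filter.
            } catch (TransformationException e) {
                // Error during transformation: this sketch selects another filter.
            }
        }
        return null;
    }
}
```

The practical consequence for a plugin author is to pick the exception carefully: throwing UnsupportedInputFormatException for a part that the filter could read with another output format would stop the caller from retrying.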

When producing content:

  • Each produced file must be embedded in a Part document, created through the addPart method of the output object.

  • Referenced parts must advertise the proper MIME type and have proper filenames. If the first HTML part embeds relative links to resources, those resources must be named with the same relative filenames.

    Note: The part name is usually preview for a preview, and document for extracted metadata, but the naming is free. The first part must be the leading part; in particular, if the produced content is an HTML preview, the first part must be the master document.
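
The naming rule for referenced parts can be checked mechanically before the output is returned. The sketch below assumes a simplified model: PartSketch is a hypothetical stand-in for a produced part (it is not the DocumentPart API), and PreviewLayout.missingResources is a hypothetical helper that lists the relative links of the master HTML part that no other part provides under the same relative filename.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a produced part: name, filename, MIME type, data.
class PartSketch {
    final String name;
    final String filename;
    final String mime;
    final byte[] content;

    PartSketch(String name, String filename, String mime, byte[] content) {
        this.name = name;
        this.filename = filename;
        this.mime = mime;
        this.content = content;
    }
}

class PreviewLayout {
    // Matches src="..." and href="..." values without a scheme or fragment,
    // which this sketch treats as relative links.
    private static final Pattern LINK =
            Pattern.compile("(?:src|href)=\"([^\"#:]+)\"");

    // Returns the relative links of the first (master) part that are not
    // backed by a later part with the same relative filename.
    static List<String> missingResources(List<PartSketch> parts) {
        final PartSketch master = parts.get(0);
        final Set<String> filenames = new HashSet<>();
        for (PartSketch p : parts.subList(1, parts.size())) {
            filenames.add(p.filename);
        }
        final List<String> missing = new ArrayList<>();
        final Matcher m = LINK.matcher(
                new String(master.content, StandardCharsets.UTF_8));
        while (m.find()) {
            if (!filenames.contains(m.group(1))) {
                missing.add(m.group(1));
            }
        }
        return missing;
    }
}
```

For example, a master part that embeds images/logo.png and style.css, accompanied only by a part named images/logo.png, yields a list containing style.css.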

The following example shows the skeleton of a transform() method:

@Override
public void transform(DocumentPart part, ProcessableDocument input,
         ProcessableDocument output, Format format)
         throws TransformationException, UnsupportedInputFormatException,
         UnsupportedOutputFormatException
{
     // Validate requested output format
     final boolean isText = format.getMime().equalsIgnoreCase(
             Format.MIME_TEXT);
     final boolean isHtml = format.getMime().equalsIgnoreCase(
             Format.MIME_HTML);
     final String outMime = format.getMime();
     if (!isText && !isHtml) {
         throw new UnsupportedOutputFormatException("unsupported format");
     }
 
     // Validate input format (isNotMyFormat is a helper defined by the plugin)
     String mime = part.getComputedMime();
     if (isNotMyFormat(part.getFilename(), part.getForcedMime())) {
         throw new UnsupportedInputFormatException("unsupported MIME type");
     }

     // Transform
     try {
         byte[] data = part.getContentAsBytes();
...
         if (isHtml) {
             // Prepare final part
             final DocumentPart dp = output.addPart("preview");
             dp.setEncoding("utf-8");
             dp.setForcedMime(format.getMime());
...
             dp.setContent(xml.toString().getBytes("UTF-8"));
         }
     } catch (IOException io) {
         throw new TransformationException(io);
     }
}
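
The skeleton relies on an isNotMyFormat helper that is not shown on this page. The sketch below is one possible shape for it, under stated assumptions: the handled format (a .foo file extension and the application/x-foo MIME type) is purely hypothetical, as is the SampleFormatMatcher class name. The helper returns true when neither the MIME type nor the filename identifies the part as belonging to the plugin.

```java
import java.util.Locale;

class SampleFormatMatcher {
    // Hypothetical format handled by the plugin.
    private static final String EXTENSION = ".foo";
    private static final String MIME = "application/x-foo";

    // Returns true when the part should be rejected by this plugin.
    // Both arguments may be null (no filename, no forced MIME type).
    static boolean isNotMyFormat(String filename, String mime) {
        if (mime != null && MIME.equalsIgnoreCase(mime.trim())) {
            return false; // the MIME type matches
        }
        if (filename != null
                && filename.toLowerCase(Locale.ROOT).endsWith(EXTENSION)) {
            return false; // the file extension matches
        }
        return true;
    }
}
```

Accepting either signal keeps the plugin usable when the MIME type is missing, at the cost of trusting the file extension alone in that case.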