Text Extraction

HTML Relevant Content Extractor

Extracts the most relevant parts of an HTML document.

Generally, the relevant part of an HTML document is the article in the middle of the page. The header, the footer, and the menus are often the same on all pages and should not be indexed.

The extraction can be tuned using different attributes (see below).

A 'readability:concat' annotation is added to the first chunk when several relevant chunks must be processed together.

Internally, the HTMLRelevantContentExtractor assigns a score to each chunk of its input and keeps only the chunks whose score is greater than 'minScore'.

This score is based on different weighting factors:

word length
paragraph length
HTML tags
CSS classes
word/hyperlink ratio
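
The scoring idea can be illustrated with a minimal sketch. It is not the actual HTMLRelevantContentExtractor implementation: the Chunk class, the weight tables, and every numeric constant below are assumptions chosen only to show how the listed factors could combine into one score that is compared against 'minScore'.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for an HTML chunk (not a CloudView type)."""
    text: str
    tag: str = "div"
    css_class: str = ""
    link_words: int = 0  # number of words that sit inside hyperlinks


# Assumed weights, for illustration only.
TAG_WEIGHTS = {"article": 2.0, "p": 1.0, "div": 0.0, "nav": -3.0, "footer": -3.0}
CLASS_WEIGHTS = {"content": 1.5, "menu": -2.0, "sidebar": -2.0}


def score(chunk: Chunk) -> float:
    """Combine the five weighting factors into a single score."""
    words = chunk.text.split()
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    link_ratio = chunk.link_words / len(words)

    total = 0.0
    total += min(avg_word_len / 5.0, 1.0)         # word length
    total += min(len(words) / 20.0, 2.0)          # paragraph length
    total += TAG_WEIGHTS.get(chunk.tag, 0.0)      # HTML tag
    total += sum(weight for name, weight in CLASS_WEIGHTS.items()
                 if name in chunk.css_class.split())  # CSS classes
    total -= 3.0 * link_ratio                     # word/hyperlink ratio
    return total


def split_chunks(chunks, min_score):
    """Return (relevant, irrelevant): relevant chunks score above min_score."""
    relevant = [c for c in chunks if score(c) > min_score]
    irrelevant = [c for c in chunks if score(c) <= min_score]
    return relevant, irrelevant


article = Chunk(text="word " * 40, tag="p")
menu = Chunk(text="Home About Contact", tag="nav", css_class="menu", link_words=3)
relevant, irrelevant = split_chunks([article, menu], min_score=1.0)
```

In this sketch the long paragraph ends up in the relevant list, while the short, link-only navigation chunk falls below the threshold.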

The extractor copies relevant HTML chunks to a new context defined by 'relevantChunkContext' and handles irrelevant chunks in one of two ways:

Copies irrelevant chunks to another context defined by irrelevantChunkContext (default).
Annotates irrelevant chunks with an annotation defined by irrelevantChunkAnnotation.

By default, irrelevant chunks are indexed as text with a lower score. Remove the 'excludedcontent' index mapping if you do not want to index them.

Chunks from other types of documents (non-HTML documents) are all copied to the relevant context.

No more 'text' chunks are available after this processor. Use 'htmlcontent' if you want to manipulate the original text.

MIME Detector

Operates on each DocumentPart for which a MIME-type is not available.

For each DocumentPart in the PAPI, you can specify the MIME-type.

For each DocumentPart without a MIME-type, the 'bytes' and the 'filename' are used to guess the real MIME-type and charset.

The guessed MIME-type and the charset are then set as attributes of the DocumentPart.
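
As a rough illustration of guessing from 'bytes' and 'filename', the sketch below uses only the Python standard library. It does not reproduce the MIME Detector's real algorithm: the function name, the small magic-number table, and the UTF-8-then-ISO-8859-1 charset fallback are all assumptions. Only the output attribute names 'mime' and 'encodingToUse' come from the description above.

```python
import codecs
import mimetypes

# Assumed, deliberately tiny table of well-known file signatures.
MAGIC_PREFIXES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
]


def guess_mime_and_charset(data: bytes, filename: str) -> dict:
    """Hypothetical helper: guess a MIME-type and charset for one part."""
    mime = None
    # 1. Content first: look for a known signature at the start of the bytes.
    for prefix, candidate in MAGIC_PREFIXES:
        if data.startswith(prefix):
            mime = candidate
            break
    # 2. Otherwise fall back to the filename extension.
    if mime is None:
        mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"

    # 3. A charset is only guessed for textual content (assumed policy).
    charset = None
    if mime.startswith("text/"):
        if data.startswith(codecs.BOM_UTF8):
            charset = "utf-8"
        else:
            try:
                data.decode("utf-8")
                charset = "utf-8"
            except UnicodeDecodeError:
                charset = "iso-8859-1"
    return {"mime": mime, "encodingToUse": charset}


pdf_part = guess_mime_and_charset(b"%PDF-1.4 sample", "report.bin")
text_part = guess_mime_and_charset("héllo".encode("latin-1"), "note.txt")
```

Checking the bytes before the filename means a mislabeled file, such as a PDF saved with a '.bin' extension, is still classified by its content.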

Input: The DocumentPart of the document.

Output: 'mime' and 'encodingToUse' attributes of DocumentParts.

This document processor does not create any document chunks.

Mime Type Setter

Manually specifies the MIME-type.

Semantic Web Document Processor

Extracts microdata and microformats from indexed documents.

Standard Parts Merger

This processor needs one DocumentPart called the 'Master Part'. If there is only one part, this part is the 'Master Part'.

If there are multiple parts, the part named after the 'masterPart' attribute is the 'Master Part'.

The DocumentChunks from the 'Master Part' are removed if there is a root DocumentChunk (that is, one not within any part) with the same ContextName, except for those whose ContextName appears in 'partSpecificContexts'. For example, if there is a DocumentChunk with ContextName 'title' in the root document, any DocumentChunk with ContextName 'title' in the Master Part is deleted. This behavior allows you to keep the metadata extracted from the master part, unless it is overridden through a global 'meta' in the PAPI.
The DocumentChunks from the other DocumentParts are deleted, except for those whose ContextName appears in 'partSpecificContexts'. This prevents the index fields from being polluted by the metadata of additional DocumentParts. For example, when processing an email with attachments, the date of the email is kept as the date, rather than any date extracted from the metadata of an attachment.
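
The two deletion rules can be sketched as follows. This is only an illustrative model, not the processor's code: chunks are represented as (ContextName, value) tuples, parts as a dictionary keyed by part name, and the function name merge_parts is hypothetical. The parameter names mirror the 'masterPart' and 'partSpecificContexts' attributes.

```python
def merge_parts(root_chunks, parts, master_part=None, part_specific_contexts=()):
    """Return the chunks each part keeps after applying the merge rules.

    root_chunks: list of (context_name, value) tuples outside any part.
    parts: dict mapping a part name to its list of (context_name, value).
    """
    specific = set(part_specific_contexts)
    # With a single part, that part is the Master Part.
    master_name = next(iter(parts)) if len(parts) == 1 else master_part
    root_contexts = {context for context, _ in root_chunks}

    kept = {}
    for name, chunks in parts.items():
        if name == master_name:
            # Master Part: drop chunks overridden by a root chunk with the
            # same ContextName, unless the context is part-specific.
            kept[name] = [(c, v) for c, v in chunks
                          if c in specific or c not in root_contexts]
        else:
            # Other parts: keep only part-specific contexts.
            kept[name] = [(c, v) for c, v in chunks if c in specific]
    return kept


root = [("title", "Global title")]
parts = {
    "mail": [("title", "Mail subject"), ("date", "2020-01-01"), ("text", "body")],
    "attachment": [("date", "1999-12-31"), ("text", "attached text")],
}
result = merge_parts(root, parts, master_part="mail",
                     part_specific_contexts=["text"])
```

Here the mail keeps its own date and text but loses its title to the root chunk, while the attachment keeps only its text.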

Input: All DocumentChunks from parts.

Output: This processor does not create any new DocumentChunk. It deletes existing DocumentChunks.

Text Extractor (All Mime Types)

Performs text content extraction for all MIME-types (300+ file formats are handled).

Text, HTML, and built-in data types must be processed by the 'NativeTextExtractor'.

Make sure to have a 'NativeTextExtractor' before the ConvertTextExtractor in your pipeline.

Input: Document Binary Part.

Output: One or more DocumentChunks are created for each part, using the text and metadata of the binary content. Each DocumentChunk is associated with a ContextName. There can be one or more ContextNames depending on the extraction process. The ContextNames created depend on each mime-type.

Text Extractor (text, html, exalead)

Extraction is performed for the following data types:

text/plain for Text files.
text/html for HTML Files.
application/x-exalead-document for CloudView 4.6 document format (com.exalead.document)
application/x-exalead-ndoc for CloudView 5 internal document format, binary.
application/x-exalead-ndoc-v10+xml for CloudView internal document format, XML.

Input: Each DocumentPart. The 'mimeType' and 'charset' of the DocumentPart are used by the extractor. Only the DocumentParts for which the mime-type matches the specified list are processed.

Output: Creates one or more chunks for each part, using the text and metadata for the binary content. Each DocumentChunk is associated with a ContextName.

There can be one or more ContextNames depending on the extraction process.

The 'bytes' of each Document Binary Part are deleted.

Xpath Extractor

Extraction is performed for the following data types:

text/html for HTML files.
application/xml for XML files.

Input: Each DocumentPart. The 'mimeType' and 'charset' of the DocumentPart are used by the extractor. Only DocumentParts for which mime-type matches the specified list are processed.

Output: DocumentChunks are created for each Xpath Rule. Each DocumentChunk is associated with the specified 'Meta name' ContextName.

Note: The output is not processed through the TextExtractors. If you want only the textual content, you must add //text() to your rule.

Warning: This extractor must be placed before the NativeTextExtractor because the 'bytes' of each Document Binary Part are deleted by the NativeTextExtractor.

Limitations: This extractor handles expressions that return a node set or a string, not those that return a number or a Boolean. You can still use number or Boolean functions inside an XPath expression. For example, //img[starts-with(@src, "http://")] works because the expression as a whole returns a set of nodes (<img> elements), but count(//img) does not work because it returns a number.

Xpath Fragment Extractor

Input: All DocumentChunks associated with the specified 'inputContext' ContextNames. The input can be an XML or HTML fragment.

Output: DocumentChunks are created for each Xpath Fragment Rule. Each DocumentChunk is associated with the specified 'Meta name' ContextName.

Warning: This extractor must be placed before the NativeTextExtractor because the 'bytes' of each Document Binary Part are deleted by the NativeTextExtractor.
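
The ordering constraints stated for these extractors can be checked mechanically. The sketch below is a hypothetical helper, not a CloudView API: the pipeline is modeled as a plain list of processor names, and the names 'XPathExtractor' and 'XPathFragmentExtractor' are assumed labels for the two XPath processors. It only checks the relative order of processors that are present, not whether a required processor is missing.

```python
# (earlier, later) pairs: 'earlier' must appear before 'later'.
MUST_PRECEDE = [
    ("XPathExtractor", "NativeTextExtractor"),
    ("XPathFragmentExtractor", "NativeTextExtractor"),
    ("NativeTextExtractor", "ConvertTextExtractor"),
]


def ordering_violations(pipeline):
    """Return the (earlier, later) rules that the given pipeline breaks."""
    position = {name: index for index, name in enumerate(pipeline)}
    violations = []
    for earlier, later in MUST_PRECEDE:
        if earlier in position and later in position:
            if position[earlier] > position[later]:
                violations.append((earlier, later))
    return violations


good = ["XPathExtractor", "NativeTextExtractor", "ConvertTextExtractor"]
bad = ["NativeTextExtractor", "XPathExtractor", "ConvertTextExtractor"]
good_result = ordering_violations(good)
bad_result = ordering_violations(bad)
```

An empty result means that no ordering rule is broken. In the second pipeline the XPath processor runs after the 'bytes' have already been deleted, so that rule is reported.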