HTML Relevant Content Extractor
Extracts the most relevant parts of an HTML document.
Generally, the relevant part of an HTML document is the article in the middle of the page. The header, the footer and the menus are often the same on all pages and should not be indexed.
The extraction can be tuned using different attributes (see below).
A readability:concat annotation is added (on the first chunk) when several relevant chunks must be processed together.
Internally, the HTMLRelevantContentExtractor assigns a score to each chunk of its input, and keeps only chunks having a score greater than "minScore".
This score is based on different weighting factors:
- word length
- paragraph length
- HTML tags
- CSS classes
- word/hyperlink ratio
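The scoring described above can be illustrated with a minimal Python sketch. It is not the processor's actual implementation: the Chunk class, the weight tables and every numeric constant are hypothetical and chosen only to show how the listed factors might combine into a score that is then compared with "minScore".

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str        # visible text of the chunk
    tag: str         # enclosing HTML tag, e.g. "p" or "nav"
    css_class: str   # CSS class of the enclosing element ("" if none)
    link_words: int  # number of words located inside hyperlinks

# Hypothetical weights, for illustration only.
TAG_WEIGHTS = {"p": 1.0, "article": 1.5, "nav": -2.0, "footer": -2.0}
CLASS_WEIGHTS = {"content": 1.0, "menu": -1.5, "sidebar": -1.0}

def score(chunk):
    """Combine the five weighting factors into one score."""
    words = re.findall(r"\w+", chunk.text)
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    link_ratio = chunk.link_words / len(words)
    s = 0.0
    s += min(avg_word_len / 5.0, 1.0)             # word length
    s += min(len(words) / 50.0, 2.0)              # paragraph length
    s += TAG_WEIGHTS.get(chunk.tag, 0.0)          # HTML tags
    s += CLASS_WEIGHTS.get(chunk.css_class, 0.0)  # CSS classes
    s -= 2.0 * link_ratio                         # word/hyperlink ratio
    return s

def keep_relevant(chunks, min_score):
    """Keep only chunks whose score is strictly greater than min_score."""
    return [c for c in chunks if score(c) > min_score]
```

In this sketch, a long paragraph with few links scores high, while a short navigation block made only of links scores low and is filtered out.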
It copies relevant HTML chunks to a new context defined by 'relevantChunkContext' and either:
- Copies irrelevant chunks to another context defined by irrelevantChunkContext (default).
- Annotates irrelevant chunks with an annotation defined by irrelevantChunkAnnotation.
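The two ways of handling irrelevant chunks can be sketched as follows. This is an illustration under assumptions, not the processor's code: route_chunks, its parameters and the context names "relevant_ctx" and "irrelevant_ctx" are hypothetical placeholders standing in for the values of relevantChunkContext, irrelevantChunkContext and irrelevantChunkAnnotation.

```python
def route_chunks(scored_chunks, min_score,
                 relevant_context="relevant_ctx",
                 irrelevant_context="irrelevant_ctx",
                 irrelevant_annotation=None):
    """Route (text, score) pairs according to min_score.

    Returns a dict mapping context names to lists of chunk texts,
    and a list of (text, annotation) pairs for annotated chunks.
    """
    contexts = {relevant_context: []}
    annotations = []
    for text, chunk_score in scored_chunks:
        if chunk_score > min_score:
            # Relevant chunk: copied to the relevant context.
            contexts[relevant_context].append(text)
        elif irrelevant_annotation is not None:
            # Alternative behavior: annotate instead of copying.
            annotations.append((text, irrelevant_annotation))
        else:
            # Default behavior: copied to the irrelevant context.
            contexts.setdefault(irrelevant_context, []).append(text)
    return contexts, annotations
```

With the default arguments, low-scoring chunks end up in the second context; passing irrelevant_annotation switches the sketch to the annotation behavior.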
By default, irrelevant chunks are indexed as text with a lower score. Remove the 'excludedcontent' index mapping if you do not want to index them.
Chunks from other types of documents (non-HTML documents) are all copied to the relevant context.
After this processor runs, 'text' chunks are no longer available. Use 'htmlcontent' if you want to manipulate the original text.