About Decompounding

Decompounding splits a word into its components so that a search for one of these components also finds the whole compound. For instance, Lastwagen or Fahrer should match Lastwagenfahrer.

To achieve this, Exalead CloudView splits compound words into at most two components using a dictionary. Three issues may occur during this process (a simplified sketch follows the list below):

  • Since the dictionary cannot cover a language exhaustively (very specific or technical words may be unknown to the algorithm), a compound may not be split because its components are not found in the dictionary.
  • When there is more than one way to split a word, the most likely split is selected, but it may not be the expected one.
  • Some compound words should not be decompounded at all. The algorithm uses statistics to detect them, but some words may still be decompounded even though the split makes little sense (Handschuh, Volkswagen).
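
To make these issues concrete, here is a minimal, hypothetical Python sketch of splitting a word into two components with a dictionary. It is not CloudView's actual algorithm; the dictionary, the frequency scores, and the ranking are illustrative assumptions only.

# Illustrative sketch only, not CloudView's algorithm.
# The dictionary and its frequency scores are made up for the example.
DICTIONARY = {
    "lastwagen": 120,
    "fahrer": 300,
    "last": 80,
    "wagenfahrer": 5,
}

def decompound(word):
    """Return the most likely two-component split of word, or None."""
    word = word.lower()
    candidates = []
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in DICTIONARY and tail in DICTIONARY:
            # Rank ambiguous splits with a simple score; the real ranking is statistical.
            candidates.append((DICTIONARY[head] + DICTIONARY[tail], head, tail))
    if not candidates:
        return None            # issue 1: components missing from the dictionary
    _, head, tail = max(candidates)
    return head, tail          # issue 2: the top-ranked split may not be the expected one

print(decompound("lastwagenfahrer"))   # ('lastwagen', 'fahrer')
print(decompound("handschuh"))         # None with this toy dictionary

The third issue (words that should never be split, such as Volkswagen) is exactly what the user dictionary described below addresses.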

In these cases, you can enrich the dictionary with your own rules by defining a list of words that must not be decompounded and a list of explicit decompoundings that take precedence over the resource provided with Exalead CloudView.

Custom Resource Creation

The user dictionary is a UTF-8 text file (user.txt) placed in the directory KITDIR/resource/all-arch/subtokenizer/ID, where KITDIR is the root of the unzipped CloudView kit directory and ID is one of the following language identifiers: de (German), nl (Dutch), no (Norwegian).

The file format must be as follows (lines failing to match this format are ignored):

  • Lines starting with # are comments (ignored)
  • Lines containing one word define uncompoundable words
  • Lines containing three words explicitly define how the first word of the line must be decompounded into the second and third words

Note that words are matched case-insensitively but accents matter (Kuchen is not Küchen).

# this is an ignored comment
# this line states that Volkswagen shouldn't be split:
volkswagen
# this line forces decompounding of Lastwagenfahrer into Lastwagen+Fahrer:
lastwagenfahrer lastwagen fahrer
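
As an illustration of these rules, the following hypothetical Python sketch reads a user.txt file and classifies each line. It is not part of CloudView, and the file path is only a placeholder.

# Illustrative sketch only, not part of CloudView.
from pathlib import Path

def load_user_dictionary(path):
    """Return (uncompoundable_words, explicit_splits) parsed from a user.txt file."""
    uncompoundable = set()
    explicit = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                          # comments and blank lines are ignored
        words = line.lower().split()          # matching is case-insensitive
        if len(words) == 1:
            uncompoundable.add(words[0])      # one word: never decompound it
        elif len(words) == 3:
            explicit[words[0]] = (words[1], words[2])   # compound -> (component 1, component 2)
        # lines with any other number of words are ignored
    return uncompoundable, explicit

# Placeholder path: replace KITDIR with the root of your unzipped kit directory.
words, splits = load_user_dictionary("KITDIR/resource/all-arch/subtokenizer/de/user.txt")
print(words)    # {'volkswagen'}
print(splits)   # {'lastwagenfahrer': ('lastwagen', 'fahrer')}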

Applying Changes

When you specify a new tokenization config for document analysis at index-time, you must also specify the same tokenization config for interpreting queries at search-time.

After editing the user dictionary:

  • the indexingserver and the searchserver must be restarted for the modifications to take effect
  • documents must be reindexed

If issues appear, set the logging level of the indexingserver to debug and restart it. Search for [subtokenizer] in the logs to filter the relevant lines.
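
If the log file is large, a simple filter such as the following Python sketch can help; the log path is an assumption and depends on your installation.

# Illustrative sketch only: print log lines mentioning [subtokenizer].
log_path = "path/to/indexingserver.log"   # placeholder: point this at your indexingserver log

with open(log_path, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "[subtokenizer]" in line:
            print(line, end="")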