About Decompounding

Decompounding splits a word into its components so that a search for one of these components also finds the whole compound. For instance, Lastwagen or Fahrer should match Lastwagenfahrer.

To achieve this, Exalead CloudView splits compound words into at most two components using a dictionary. Three issues may occur during this process (a simplified sketch follows the list below):

  • Since the dictionary cannot cover a language exhaustively (very specific or technical words may be unknown to the algorithm), a compound may not be split because its components are not found in the dictionary.
  • When there is more than one way to split a word, the most likely split is selected, but it may not be the expected one.
  • Some compound words should not be decompounded at all. The algorithm uses statistics to detect them, but some words may still be decompounded even though the split makes little sense (Handschuh, Volkswagen).
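
To make these issues concrete, here is a minimal, hypothetical Python sketch of splitting a word into two components with a dictionary. It is not CloudView's actual algorithm; the dictionary, the frequency scores, and the ranking are illustrative assumptions only.

# Illustrative sketch only, not CloudView's algorithm.
# The dictionary and its frequency scores are made up for the example.
DICTIONARY = {
    "lastwagen": 120,
    "fahrer": 300,
    "last": 80,
    "wagenfahrer": 5,
}

def decompound(word):
    """Return the most likely two-component split of word, or None."""
    word = word.lower()
    candidates = []
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in DICTIONARY and tail in DICTIONARY:
            # Rank ambiguous splits with a simple score; the real ranking is statistical.
            candidates.append((DICTIONARY[head] + DICTIONARY[tail], head, tail))
    if not candidates:
        return None            # issue 1: components missing from the dictionary
    _, head, tail = max(candidates)
    return head, tail          # issue 2: the top-ranked split may not be the expected one

print(decompound("lastwagenfahrer"))   # ('lastwagen', 'fahrer')
print(decompound("handschuh"))         # None with this toy dictionary

The third issue (words that should never be split, such as Volkswagen) is exactly what the user dictionary described below addresses.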

In these cases, you can enrich the dictionary with your own rules by defining a list of words that must not be decompounded and a list of explicit decompoundings that take precedence over the resource provided with Exalead CloudView.

Custom Resource Creation

The user dictionary is a UTF-8 text file (user.txt) placed in the directory KITDIR/resource/all-arch/subtokenizer/ID, where KITDIR is the root of the unzipped CloudView kit directory and ID is one of the following language identifiers: de (German), nl (Dutch), no (Norwegian).

The file format must be as follows (lines failing to match this format are ignored):

  • Lines starting with # are comments (ignored)
  • Lines containing one word define uncompoundable words
  • Lines containing three words explicitly define how the first word of the line must be decompounded into the second and third words

Note that words are matched case-insensitively but accents matter (Kuchen is not Küchen).

# this is an ignored comment
# this line states that Volkswagen shouldn't be split:
volkswagen
# this line forces decompounding of Lastwagenfahrer into Lastwagen+Fahrer:
lastwagenfahrer lastwagen fahrer
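
As an illustration of these rules, the following hypothetical Python sketch reads a user.txt file and classifies each line. It is not part of CloudView, and the file path is only a placeholder.

# Illustrative sketch only, not part of CloudView.
from pathlib import Path

def load_user_dictionary(path):
    """Return (uncompoundable_words, explicit_splits) parsed from a user.txt file."""
    uncompoundable = set()
    explicit = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                          # comments and blank lines are ignored
        words = line.lower().split()          # matching is case-insensitive
        if len(words) == 1:
            uncompoundable.add(words[0])      # one word: never decompound it
        elif len(words) == 3:
            explicit[words[0]] = (words[1], words[2])   # compound -> (component 1, component 2)
        # lines with any other number of words are ignored
    return uncompoundable, explicit

# Placeholder path: replace KITDIR with the root of your unzipped kit directory.
words, splits = load_user_dictionary("KITDIR/resource/all-arch/subtokenizer/de/user.txt")
print(words)    # {'volkswagen'}
print(splits)   # {'lastwagenfahrer': ('lastwagen', 'fahrer')}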

Applying Changes

When you specify a new tokenization config for document analysis at index-time, you must also specify the same tokenization config for interpreting queries at search-time.

After editing the user dictionary:

  • the indexingserver and the searchserver must be restarted for the modifications to take effect
  • documents must be reindexed

If issues appear, set the logging level of the indexingserver to debug and restart it. Search for [subtokenizer] in the logs to filter the relevant lines.
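
If the log file is large, a simple filter such as the following Python sketch can help; the log path is an assumption and depends on your installation.

# Illustrative sketch only: print log lines mentioning [subtokenizer].
log_path = "path/to/indexingserver.log"   # placeholder: point this at your indexingserver log

with open(log_path, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "[subtokenizer]" in line:
            print(line, end="")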