Configuring Form Indexing

The form indexing configuration defines pairs of semantic annotations and matching modes (also called index levels): the values of each semantic annotation are indexed at the matching mode paired with it. The matching mode is an arbitrary integer used to access the inverted lists (word, level), which in turn give access to the word positions in all documents.

Three matching modes have a predefined meaning: 0 is exact, 1 is lowercase, and 2 is normalized. All other values are available for custom use.

For example, a hidden form indexing configuration (NORMALIZE, 2) specifies that the normalizer's NORMALIZE annotations must be indexed at level 2. At query time, if the prefix handler's matching mode is normalized, these annotations give access to the index so that the requested words can be looked up.
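
Expressed with the ling:Form elements shown in the Linguistic.xml example later in this section, such a pair would look like the following sketch. The built-in (NORMALIZE, 2) pair is hidden, so you do not add it yourself; the snippet only illustrates the (annotation tag, matching mode) notation.

  <!-- Sketch only: index the normalizer's NORMALIZE annotations at level 2 (normalized).
       This pair is built in and does not actually appear in Linguistic.xml. -->
  <ling:FormIndexingConfig>
    <ling:Form tag="NORMALIZE" indexKind="2"/>
  </ling:FormIndexingConfig>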

This task shows you how to use form indexing to over-index acronyms, and how to set a weight (trust level) on a form indexing pair.

Use Form Indexing for Over-Indexing Acronyms

The form indexing customization helps with over-indexing. For example, we want the query NASA to match occurrences of both NASA and N.A.S.A. In other words, each time N.A.S.A. appears in a document, we want to over-index it with the form NASA.

  1. Add an acronym detector in the analysis pipe.
    1. Go to Data Processing > Semantic Processors.
    2. Drag the Acronym Detector into the analysis pipeline.
  2. Add a form indexing pair (acronym, 2) so that the acronym detector's annotations are indexed at level 2.
    1. Go to Index > Linguistics > Tokenizations > Advanced.
    2. Click Add form.
    3. For Tag, enter acronym and for Matching Mode, enter 2 (normalized).

Since our prefix handler targets matching mode 2, any query word can match any over-indexed value coming from the acronym detector.
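
For reference, here is a minimal sketch of the entry this produces in the form indexing configuration, assuming the same ling:Form element used in the Linguistic.xml example of the next section (the surrounding tokenization elements are omitted):

  <ling:FormIndexingConfig>
    <!-- Index the acronym detector's annotations at level 2 (normalized),
         so that N.A.S.A. is also indexed as the over-indexed form NASA. -->
    <ling:Form tag="acronym" indexKind="2"/>
  </ling:FormIndexingConfig>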

Set Weight

To set a weight (or distance) in the form indexing configuration, specify an additional Trust level in Index > Linguistics > Tokenizations > Advanced. This attribute ranges from 1 to 100, where 100 is the highest value and the default. The query expander uses it to compute a weight for the expansion.

  1. In the following Linguistic.xml file, the compound form is declared with a trustLevel of 50, which corresponds to an expansion weight of 0.5.
    <ling:LinguisticConfig version="0" xmlns:bee="exa:exa.bee" xmlns="exa:com.exalead.linguistic.v10" 
    xmlns:config="exa:exa.bee.config">
      <ling:TokenizationConfig name="tok0">
        <ling:StandardTokenizer concatNumAlpha="true" concatAlphaNum="true">
          <ling:BasisTechTokenizationCompatibility languages="en,de,fr,sv,es,it,nl,pt,no,fi,da,bg,ca,cs,el,hr,hu,pl,sk,sl,sr"/>
          <ling:GermanDisagglutiner/>
          <ling:DutchDisagglutiner/>
          <ling:NorwegianDisagglutiner/>
          <ling:charOverrides/>
          <ling:patternOverrides>
            <ling:StandardTokenizerOverride toOverride="[[:alnum:]][&amp;][[:alnum:]]" type="token"/>
            <ling:StandardTokenizerOverride toOverride="[[:alnum:]]*[.](?i:net)" type="token"/>
            <ling:StandardTokenizerOverride toOverride="[[:alnum:]]+[+]+" type="token"/>
            <ling:StandardTokenizerOverride toOverride="[[:alnum:]]+#" type="token"/>
          </ling:patternOverrides>
        </ling:StandardTokenizer>
        <ling:JapaneseTokenizer addMorphology="false" addRomanji="true"/>
        <ling:ChineseTokenizer addSimplified="false"/>
        <ling:BasisTechTokenizer language="ko"/>
        <ling:FormIndexingConfig>
          <ling:Form tag="SubTokenizerLowercase" indexKind="1"/>
          <ling:Form tag="SubTokenizerNormalize" indexKind="2"/>
          <ling:Form tag="SubTokenizerConcatLowercase" indexKind="1"/>
          <ling:Form tag="SubTokenizerConcatNormalize" indexKind="2"/>
          <ling:Form tag="cjk" indexKind="2"/>
          <ling:Form tag="lemma" indexKind="3"/>
          <ling:Form tag="jafactorized" indexKind="42"/>
          <ling:Form tag="jaexpanded" indexKind="43"/>
          <ling:Form tag="jaromanji" indexKind="44"/>
          <ling:Form tag="jaradicalfactor" indexKind="45"/>
          <ling:Form tag="jaradicalexpand" indexKind="46"/>
          <ling:Form tag="compound" indexKind="2" trustLevel="50"/>
        </ling:FormIndexingConfig>
        <ling:NormalizerConfig useGermanExceptions="false" disableBasisTechNormalizerForLanguages="ko,zh" useNormalizationExceptions="true" transliteration="true"/>
      </ling:TokenizationConfig>
    </ling:LinguisticConfig>
  2. In the resulting query expansion, the decompounded forms produced by the BasisTech Korean tokenizer have a weight of 0.5:
    http://localhost:10010/search-api/search?q=산악회&l=ko
    
    #query{nbdocs=0, text_relevance.expr="@term.score * @proximity + @b", proximity.maxDistance=1000, term.score=RANK_TFIDF}
    (#or{*.policy=MAX}
    (#alphanum{source="MOT",seqid=0,groupid=0,k=2}(text,"산악회")
     #alphanum{w=0.5,source="MOT",seqid=0,groupid=0,k=2}(text,"산악")
     #alphanum{w=0.5,source="MOT",seqid=0,groupid=0,k=2}(text,"회") ))
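
To give these decompounded expansions a different weight, change the trustLevel value of the corresponding form. As a rough illustration, assuming the weight scales proportionally with the trust level (50 yields 0.5 in the example above), the following sketch should produce expansions weighted at about 0.8:

  <!-- Illustrative sketch only: assuming weight is roughly trustLevel / 100,
       this yields w=0.8 terms instead of w=0.5 in the expansion above. -->
  <ling:Form tag="compound" indexKind="2" trustLevel="80"/>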