Use Form Indexing for Over-Indexing Acronyms
The form indexing customization helps with over-indexing. For example, we want the
query NASA to match occurrences of both NASA and
N.A.S.A. That is to say, each time N.A.S.A. appears in
a document, we want to over-index it as NASA.
- Add an acronym detector in the analysis pipeline:
  - Go to Data Processing > Semantic Processors.
  - Drag the Acronym Detector into the analysis pipeline.
- Add a form indexing (acronym, 2) so that the acronym detector's annotations are indexed at level 2:
  - Go to Index > Linguistics > Tokenizations > Advanced.
  - Click Add form.
  - For Tag, enter acronym, and for Matching Mode, enter 2 (normalized).
Since our prefix handler targets matching mode 2, any query word can match any
over-indexed value coming from the acronym detector.
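Under the hood, the Add form step adds a Form entry to the FormIndexingConfig section of the Linguistic.xml file. A minimal sketch of the resulting fragment, with the surrounding elements omitted, might look like this:

```xml
<!-- Form entry produced by the "Add form" step above: annotations tagged
     "acronym" are indexed at matching mode 2 (normalized) -->
<ling:FormIndexingConfig>
  <ling:Form tag="acronym" indexKind="2"/>
</ling:FormIndexingConfig>
```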
Set Weight
To set a distance (or weight) in the Form indexing configuration, you can specify an
additional Trust level in Index > Linguistics >
Tokenizations > Advanced. This attribute ranges from 1 to
100, where 100 is the highest value and the default. The query
expander uses it to compute a weight for each expansion.
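For example, a hypothetical fragment reusing the acronym form from the previous procedure could set a trust level of 80. As the Korean decompounding example below suggests (a trustLevel of 50 yields an expansion weight of 0.5), this would give acronym expansions a weight of 0.8:

```xml
<!-- Hypothetical: acronym expansions get 80% of the exact-match weight -->
<ling:Form tag="acronym" indexKind="2" trustLevel="80"/>
```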
- Let us say that in the Linguistic.xml file, the
  trustLevel attribute is set to
  50, which corresponds to an expansion weight of 0.5.
<ling:LinguisticConfig version="0" xmlns:bee="exa:exa.bee" xmlns="exa:com.exalead.linguistic.v10"
                       xmlns:config="exa:exa.bee.config">
  <ling:TokenizationConfig name="tok0">
    <ling:StandardTokenizer concatNumAlpha="true" concatAlphaNum="true">
      <ling:BasisTechTokenizationCompatibility
          languages="en,de,fr,sv,es,it,nl,pt,no,fi,da,bg,ca,cs,el,hr,hu,pl,sk,sl,sr"/>
      <ling:GermanDisagglutiner/>
      <ling:DutchDisagglutiner/>
      <ling:NorwegianDisagglutiner/>
      <ling:charOverrides/>
      <ling:patternOverrides>
        <ling:StandardTokenizerOverride toOverride="[[:alnum:]][&amp;][[:alnum:]]" type="token"/>
        <ling:StandardTokenizerOverride toOverride="[[:alnum:]]*[.](?i:net)" type="token"/>
        <ling:StandardTokenizerOverride toOverride="[[:alnum:]]+[+]+" type="token"/>
        <ling:StandardTokenizerOverride toOverride="[[:alnum:]]+#" type="token"/>
      </ling:patternOverrides>
    </ling:StandardTokenizer>
    <ling:JapaneseTokenizer addMorphology="false" addRomanji="true"/>
    <ling:ChineseTokenizer addSimplified="false"/>
    <ling:BasisTechTokenizer language="ko"/>
    <ling:FormIndexingConfig>
      <ling:Form tag="SubTokenizerLowercase" indexKind="1"/>
      <ling:Form tag="SubTokenizerNormalize" indexKind="2"/>
      <ling:Form tag="SubTokenizerConcatLowercase" indexKind="1"/>
      <ling:Form tag="SubTokenizerConcatNormalize" indexKind="2"/>
      <ling:Form tag="cjk" indexKind="2"/>
      <ling:Form tag="lemma" indexKind="3"/>
      <ling:Form tag="jafactorized" indexKind="42"/>
      <ling:Form tag="jaexpanded" indexKind="43"/>
      <ling:Form tag="jaromanji" indexKind="44"/>
      <ling:Form tag="jaradicalfactor" indexKind="45"/>
      <ling:Form tag="jaradicalexpand" indexKind="46"/>
      <ling:Form tag="compound" indexKind="2" trustLevel="50"/>
    </ling:FormIndexingConfig>
    <ling:NormalizerConfig useGermanExceptions="false" disableBasisTechNormalizerForLanguages="ko,zh"
                           useNormalizationExceptions="true" transliteration="true"/>
  </ling:TokenizationConfig>
</ling:LinguisticConfig>
- In the resulting expansion, decompounding has a weight of 0.5 when using the BasisTech
  Korean tokenizer:
http://localhost:10010/search-api/search?q=산악회&l=ko
#query{nbdocs=0, text_relevance.expr="@term.score * @proximity + @b", proximity.maxDistance=1000,
term.score=RANK_TFIDF}
(#or{*.policy=MAX}
  (#alphanum{source="MOT",seqid=0,groupid=0,k=2}(text,"산악회")
   #alphanum{w=0.5,source="MOT",seqid=0,groupid=0,k=2}(text,"산악")
   #alphanum{w=0.5,source="MOT",seqid=0,groupid=0,k=2}(text,"회")))