Indexing Mixed Characters

3DSpace Index can index content with mixed characters (both double-byte and single byte).

See Also
Allowing Double-Byte Characters
Indexing Options

Consider, for example, the Japanese content provided in the following five files:

  • DOC00.txt: 文書
  • DOC01.txt: 文書 and 01
  • DOC02.txt: 文書 and 01 ("０１" normalized to "01")
  • DOC4.txt: 文書 and 01 ("０１" normalized to "01") and 大
  • DOC5.txt: 大文書 and 01

It is normal for 文書01大 to match both DOC4 and DOC5: the 01 part splits the Japanese sequence, so 大 becomes a free token that can reattach anywhere in the document.

If you examine how the 文書01大 query is processed, it becomes ( (文 NEXT 書) OR (文書) ) AND 01 AND 大.

Therefore, it is normal for both DOC4 and DOC5 to match it.

Query 大文書01 gets processed to ( (大 NEXT 文 NEXT 書) OR (大文書) ) AND 01, so DOC4 does not match because it does not contain the correct sequence of three ideograms.
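The matching behavior described above can be sketched as a toy boolean evaluation over the example documents. This is an illustration only, not the actual 3DSpace Index engine; the document texts below are assumed layouts of the example files, and the NEXT operator on ideograms is modeled as substring adjacency in the raw text.

```python
# Toy model of the query expansions described above (illustration only,
# not the actual 3DSpace Index engine). The document texts below are
# assumed layouts of the example files.
docs = {
    "DOC4": "文書 01 大",   # 文書 and 01 and 大
    "DOC5": "大文書 01",    # 大文書 and 01
}

def match_q1(text):
    # Query 文書01大  ->  ((文 NEXT 書) OR 文書) AND 01 AND 大
    # Both branches reduce to the ideograms 文書 being adjacent, which this
    # toy model checks as a substring of the raw text.
    return "文書" in text and "01" in text and "大" in text

def match_q2(text):
    # Query 大文書01  ->  ((大 NEXT 文 NEXT 書) OR 大文書) AND 01
    return "大文書" in text and "01" in text

print([d for d, t in docs.items() if match_q1(t)])  # ['DOC4', 'DOC5']
print([d for d, t in docs.items() if match_q2(t)])  # ['DOC5']
```

DOC4 fails the second query because 大 is separated from 文書 by 01, so the three ideograms never appear in sequence.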

The processing of Japanese requests is context-dependent. At index time, a string may be indexed either as the single word 文書 or as the separate ideograms 文 and 書; the language-specific processing knows enough Japanese grammar to make this decision. At search time, the result depends on the query, but a query is generally too short to provide the necessary context, so the executed query becomes: (the sequence of ideograms) OR (recognized valid concatenations of ideograms that form valid Japanese words).

At index and search time, we also perform conversions between Katakana and Hiragana, as well as Romanji (Latin-script) conversions.
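The Katakana/Hiragana conversion can be illustrated with the fixed code-point offset between the two scripts in Unicode (Hiragana U+3041–U+3096, Katakana U+30A1–U+30F6, offset 0x60). This is a simplified sketch; the product's actual conversion tables are more complete.

```python
# Hiragana and Katakana occupy parallel Unicode blocks separated by a
# fixed offset of 0x60, so a basic normalization can map Katakana to
# Hiragana. Simplified sketch, not the product's conversion logic.
KATA_TO_HIRA = {code: code - 0x60 for code in range(0x30A1, 0x30F7)}

def katakana_to_hiragana(text):
    return text.translate(KATA_TO_HIRA)

print(katakana_to_hiragana("カタカナ"))  # かたかな
```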

Note: 3DSpace Index uses the following linguistic settings:
<LinguisticConfig xmlns="exa:com.exalead.linguistic.v10" version="1330354820991"> 
  <TokenizationConfig name="tok0"> 
    <StandardTokenizer concatAlphaNum="true" concatNumAlpha="true">
      <GermanDesagglutination /> 
      <DutchDesagglutination /> 
      <NorwegianDesagglutination /> 
      <charOverrides /> 
      <patternOverrides>
        <StandardTokenizerOverride type="token" toOverride="[[:alnum:]][&][[:alnum:]]" /> 
        <StandardTokenizerOverride type="token" toOverride="[[:alnum:]]*[.](?i:net)" /> 
        <StandardTokenizerOverride type="token" toOverride="[[:alnum:]]+[+]+" /> 
        <StandardTokenizerOverride type="token" toOverride="[[:alnum:]]+#" /> 
        <StandardTokenizerOverride type="token" toOverride="0"/>
        <StandardTokenizerOverride type="token" toOverride="1"/>
        <StandardTokenizerOverride type="token" toOverride="2"/>
        <StandardTokenizerOverride type="token" toOverride="3"/>
        <StandardTokenizerOverride type="token" toOverride="4"/>
        <StandardTokenizerOverride type="token" toOverride="5"/>
        <StandardTokenizerOverride type="token" toOverride="6"/>
        <StandardTokenizerOverride type="token" toOverride="7"/>
        <StandardTokenizerOverride type="token" toOverride="8"/>
        <StandardTokenizerOverride type="token" toOverride="9"/>
      </patternOverrides>
    </StandardTokenizer>
    <JapaneseTokenizer addRomanji="true" addMorphology="false" /> 
    <ChineseTokenizer addSimplified="false" /> 
    <FormIndexingConfig>
      <Form tag="SubTokenizerLowercase" indexKind="1" /> 
      <Form tag="SubTokenizerNormalize" indexKind="2" /> 
      <Form tag="SubTokenizerConcatLowercase" indexKind="1" /> 
      <Form tag="SubTokenizerConcatNormalize" indexKind="2" /> 
      <Form tag="cjk" indexKind="2" /> 
      <Form tag="jafactorized" indexKind="42" /> 
      <Form tag="jaexpanded" indexKind="43" /> 
      <Form tag="jaromanji" indexKind="44" /> 
      <Form tag="jaradicalfactor" indexKind="45" /> 
      <Form tag="jaradicalexpand" indexKind="46" /> 
    </FormIndexingConfig>
    <NormalizerConfig useNormalizationExceptions="true" useGermanExceptions="false" /> 
  </TokenizationConfig>
</LinguisticConfig>

When indexing a string with a combination of single-byte and double-byte characters (for example, aiuあいうeok), the string is tokenized and indexed as follows:

  • aiu
  • あいう
  • eok
The various forms generated for the query then match these indexed tokens.
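This script-boundary splitting can be sketched with a regular expression over Unicode ranges. This is an illustration of the principle only, not the actual StandardTokenizer; the Japanese range below covers Hiragana, Katakana, and common Kanji.

```python
import re

# Split a mixed string into runs of Japanese characters
# (Hiragana/Katakana/Kanji) and runs of other non-space characters.
# Simplified sketch of script-boundary tokenization, not the actual
# StandardTokenizer behavior.
JAPANESE = r"\u3040-\u30FF\u4E00-\u9FFF"
RUNS = re.compile(rf"[{JAPANESE}]+|[^{JAPANESE}\s]+")

def split_scripts(text):
    return RUNS.findall(text)

print(split_scripts("aiuあいうeok"))  # ['aiu', 'あいう', 'eok']
```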