Indexing Mixed Characters

3DSpace Index can index content with mixed characters (both double-byte and single byte).

See Also
Allowing Double-Byte Characters
Indexing Options

Consider, for example, the Japanese content provided in the following five files:

  • DOC00.txt: 文書
  • DOC01.txt: 文書 and 01
  • DOC02.txt: 文書 and 01 ("０１" normalized to "01")
  • DOC4.txt: 文書 and 01 ("０１" normalized to "01") and 大
  • DOC5.txt: 大文書 and 01

It is normal for 文書01大 to match both DOC4 and DOC5: the 01 part splits the Japanese sequence, so 大 becomes a free token that can reattach anywhere in the document.

If you examine how the 文書01大 query is processed, it becomes ( (文 NEXT 書) OR (文書) ) AND 01 AND 大.

Therefore, it is normal for both DOC4 and DOC5 to match it.

Query 大文書01 gets processed to ( (大 NEXT 文 NEXT 書) OR (大文書) ) AND 01, so DOC4 does not match because it does not contain the correct sequence of three ideograms.
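The matching behavior described above can be sketched as a toy boolean evaluation over the example documents. This is an illustration only, not the actual 3DSpace Index engine; the document texts below are assumed layouts of the example files, and the NEXT operator on ideograms is modeled as substring adjacency in the raw text.

```python
# Toy model of the query expansions described above (illustration only,
# not the actual 3DSpace Index engine). The document texts below are
# assumed layouts of the example files.
docs = {
    "DOC4": "文書 01 大",   # 文書 and 01 and 大
    "DOC5": "大文書 01",    # 大文書 and 01
}

def match_q1(text):
    # Query 文書01大  ->  ((文 NEXT 書) OR 文書) AND 01 AND 大
    # Both branches reduce to the ideograms 文書 being adjacent, which this
    # toy model checks as a substring of the raw text.
    return "文書" in text and "01" in text and "大" in text

def match_q2(text):
    # Query 大文書01  ->  ((大 NEXT 文 NEXT 書) OR 大文書) AND 01
    return "大文書" in text and "01" in text

print([d for d, t in docs.items() if match_q1(t)])  # ['DOC4', 'DOC5']
print([d for d, t in docs.items() if match_q2(t)])  # ['DOC5']
```

DOC4 fails the second query because 大 is separated from 文書 by 01, so the three ideograms never appear in sequence.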

The processing of Japanese requests is context-dependent. At index time, a string may be indexed either as the single word 文書 or as the separate ideograms 文 and 書; the language-specific processing knows enough Japanese grammar to make this decision. At search time, the result depends on the query, but a query is generally too short to provide the necessary context, so the executed query becomes: (the sequence of ideograms) OR (recognized valid concatenations of ideograms that form valid Japanese words).

At index and search time, we also perform conversions between Katakana and Hiragana, as well as Romanji (Latin-script) conversions.
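The Katakana/Hiragana conversion can be illustrated with the fixed code-point offset between the two scripts in Unicode (Hiragana U+3041–U+3096, Katakana U+30A1–U+30F6, offset 0x60). This is a simplified sketch; the product's actual conversion tables are more complete.

```python
# Hiragana and Katakana occupy parallel Unicode blocks separated by a
# fixed offset of 0x60, so a basic normalization can map Katakana to
# Hiragana. Simplified sketch, not the product's conversion logic.
KATA_TO_HIRA = {code: code - 0x60 for code in range(0x30A1, 0x30F7)}

def katakana_to_hiragana(text):
    return text.translate(KATA_TO_HIRA)

print(katakana_to_hiragana("カタカナ"))  # かたかな
```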

Note: 3DSpace Index uses the following linguistic settings:
<LinguisticConfig xmlns="exa:com.exalead.linguistic.v10" version="1330354820991"> 
  <TokenizationConfig name="tok0"> 
    <StandardTokenizer concatAlphaNum="true" concatNumAlpha="true">
      <GermanDesagglutination /> 
      <DutchDesagglutination /> 
      <NorwegianDesagglutination /> 
      <charOverrides /> 
      <patternOverrides>
        <StandardTokenizerOverride type="token" toOverride="[[:alnum:]][&][[:alnum:]]" /> 
        <StandardTokenizerOverride type="token" toOverride="[[:alnum:]]*[.](?i:net)" /> 
        <StandardTokenizerOverride type="token" toOverride="[[:alnum:]]+[+]+" /> 
        <StandardTokenizerOverride type="token" toOverride="[[:alnum:]]+#" /> 
        <StandardTokenizerOverride type="token" toOverride="0"/>
        <StandardTokenizerOverride type="token" toOverride="1"/>
        <StandardTokenizerOverride type="token" toOverride="2"/>
        <StandardTokenizerOverride type="token" toOverride="3"/>
        <StandardTokenizerOverride type="token" toOverride="4"/>
        <StandardTokenizerOverride type="token" toOverride="5"/>
        <StandardTokenizerOverride type="token" toOverride="6"/>
        <StandardTokenizerOverride type="token" toOverride="7"/>
        <StandardTokenizerOverride type="token" toOverride="8"/>
        <StandardTokenizerOverride type="token" toOverride="9"/>
      </patternOverrides>
    </StandardTokenizer>
    <JapaneseTokenizer addRomanji="true" addMorphology="false" /> 
    <ChineseTokenizer addSimplified="false" /> 
    <FormIndexingConfig>
      <Form tag="SubTokenizerLowercase" indexKind="1" /> 
      <Form tag="SubTokenizerNormalize" indexKind="2" /> 
      <Form tag="SubTokenizerConcatLowercase" indexKind="1" /> 
      <Form tag="SubTokenizerConcatNormalize" indexKind="2" /> 
      <Form tag="cjk" indexKind="2" /> 
      <Form tag="jafactorized" indexKind="42" /> 
      <Form tag="jaexpanded" indexKind="43" /> 
      <Form tag="jaromanji" indexKind="44" /> 
      <Form tag="jaradicalfactor" indexKind="45" /> 
      <Form tag="jaradicalexpand" indexKind="46" /> 
    </FormIndexingConfig>
    <NormalizerConfig useNormalizationExceptions="true" useGermanExceptions="false" /> 
  </TokenizationConfig>
</LinguisticConfig>

When indexing a string with a combination of single-byte and double-byte characters (for example, aiuあいうeok), the string is tokenized and indexed as follows:

  • aiu
  • あいう
  • eok
The various forms generated for the query then match these indexed tokens.
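This script-boundary splitting can be sketched with a regular expression over Unicode ranges. This is an illustration of the principle only, not the actual StandardTokenizer; the Japanese range below covers Hiragana, Katakana, and common Kanji.

```python
import re

# Split a mixed string into runs of Japanese characters
# (Hiragana/Katakana/Kanji) and runs of other non-space characters.
# Simplified sketch of script-boundary tokenization, not the actual
# StandardTokenizer behavior.
JAPANESE = r"\u3040-\u30FF\u4E00-\u9FFF"
RUNS = re.compile(rf"[{JAPANESE}]+|[^{JAPANESE}\s]+")

def split_scripts(text):
    return RUNS.findall(text)

print(split_scripts("aiuあいうeok"))  # ['aiu', 'あいう', 'eok']
```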