Default Ranking Model
The text_relevance virtual field is used in a single
SortBy clause. The expression of this virtual field is:
@term.score * @proximity + @b
Where:
- @term.score – A value assigned to each alphanumeric node in a query. A
node's term.score value is determined by the textual ranking algorithm
for the node. For more information, see Term Scoring.
- @proximity – Proximity boost, applied to the document as a whole. For
more information, see Proximity Boost.
- @b – Boost. It is a node property that is commonly used to indicate
that elements that match a particular term must be boosted. Boost is defined on a
query-by-query basis. For more information, see Boost.
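The arithmetic of this expression can be sketched in a few lines of Python. This is only an illustration, not CloudView code: the function name `text_relevance` and the sample values are assumptions made for the example.

```python
def text_relevance(term_score: float, proximity: float, boost: float = 0.0) -> float:
    """Illustrative only: mirrors @term.score * @proximity + @b."""
    return term_score * proximity + boost

# Hypothetical values: summed term score 10, proximity 2.0, boost 1000.
print(text_relevance(10.0, 2.0, 1000.0))  # 1020.0
```

Because the boost is added after the multiplication, it shifts the final score without being scaled by proximity.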
Term Scoring
Each alphanumeric node in a query has a special property called
term.score. A node’s term.score value is the result of
the textual ranking algorithm for the node.
The term.score uses the default merge policy, which is to sum its values
over the whole query. In the Administration Console, term scoring is configured in
Search > Search Logics > Sort & Relevance > Term Scoring.
Scoring Algorithms
The following table describes the available scoring algorithms.
Note:
TF-IDF, IDF, and BM25 are standard algorithms and are not described in detail in this
section. For more information about them, look for documentation on the internet.
Algorithm
|
Description
|
No Ranking
|
With No Ranking, the term score is always 0 for all
alphanumeric nodes of the query.
When term scoring is not required, disabling term.score
significantly improves hit matching performance (by up to 30%).
|
Rank
|
The Rank term score uses only the statically defined
rank of each word. In the index, each word can have a rank for each document.
The term.score value is rank * w .
w is a special node option, which can be set on each
alphanumeric value, both in ELLQL and in UQL:
- in UQL:
a OR b{w=2}
- in ELLQL:
#alphanum{w=2}(text, "a")
The default value of w is 1.0. Use w to
increase, decrease, or cancel the importance of the presence of one word with a
specific rank. For example, for the query a AND b, we have
two matching documents:
- doc 1:
a[rank=4] b[rank=6]
- doc 2:
a[rank=6] b[rank=4]
With the default configuration, both doc1 and doc2 have
term.score=10. With the query a AND b{w=2}:
- doc1 has term.score=16
- doc2 has term.score=14
You can also use w to ignore a given word for the textual
relevance calculation, by setting w=0.
|
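The Rank example above can be reproduced with a short Python sketch. The function name `rank_term_score` and the dictionary layout are assumptions made for illustration; the sketch only sums rank * w over the query words, as the default merge policy describes.

```python
def rank_term_score(ranks: dict, weights: dict) -> float:
    """Sum rank * w over the query words; w defaults to 1.0 when not set."""
    return sum(rank * weights.get(word, 1.0) for word, rank in ranks.items())

doc1 = {"a": 4, "b": 6}  # a[rank=4] b[rank=6]
doc2 = {"a": 6, "b": 4}  # a[rank=6] b[rank=4]

# Query: a AND b
print(rank_term_score(doc1, {}), rank_term_score(doc2, {}))  # 10.0 10.0
# Query: a AND b{w=2}
print(rank_term_score(doc1, {"b": 2}), rank_term_score(doc2, {"b": 2}))  # 16.0 14.0
```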
Rank IDF
|
The Rank IDF term score adds the notion of IDF, or
Inverse Document Frequency. The IDF represents the relative rarity of a
word in a corpus. The more frequently the word appears in the corpus, the lower
its IDF. The idea behind this algorithm is that the rarer the word, the greater its
importance. For example, for the query the OR economy, we want
the documents matching economy first, because they are more
specific. For a given word:
IDF(word) = 1 + log(number of docs in corpus / number of docs containing this word)
The term.score of one word with this algorithm is:
rank * w * idf * 10000
IDF is a positive double greater than or equal to 1.0 (it equals 1.0 for a word
that is in all documents). For example, for a word present in only one document out of a
corpus of one million, IDF = 20.9 (that is, 1 plus the base-2 logarithm of one million).
|
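The following Python sketch illustrates the IDF computation. The logarithm base is not stated in the formula; base 2 is an assumption here, inferred from the 20.9 value given for one document in a million. The function names are hypothetical.

```python
import math

def idf(total_docs: int, docs_with_word: int) -> float:
    # Base 2 is an assumption inferred from the documented example value.
    return 1.0 + math.log2(total_docs / docs_with_word)

def rank_idf_term_score(rank: float, w: float, idf_value: float) -> float:
    """rank * w * idf * 10000, as in the Rank IDF formula."""
    return rank * w * idf_value * 10000

print(round(idf(1_000_000, 1), 1))   # 20.9 (word in one document out of a million)
print(idf(1_000_000, 1_000_000))     # 1.0  (word in every document)
```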
Rank TF-IDF
|
The Rank TF-IDF term score adds the notion of Term
Frequency. To represent the importance of a term within a document, it takes
into account term density instead of term occurrences. For example, we
have the query: iphone and the following documents:
- Doc1: {iPhone}
- Doc2: {iPhone accessories}
Both have the same number of iphone occurrences, but doc1 is
more dense with iphone and intuitively a better match. We
consider that the number of occurrences is not as meaningful as the term
density. To use this algorithm, go to Data Model > Advanced
Schema, click the index field to modify, select Compute
TF, and click Apply. For a word
w in a document d, a simple version of TF
would be:
SimpleTF(w, d) = number of occurrences of w in d
To avoid overranking documents where a word occurs frequently, Exalead CloudView uses a more advanced version of the formula:
TF(w, d) = 2.2 * SimpleTF(w, d) / (1.2 + SimpleTF(w, d))
The term.score of one word with this algorithm is:
rank * w * tf * idf * 10000
TF varies from 1 (for a word present only once) up to a limit of 2.2. Therefore, for a
given IDF, the term.score varies between rank * w * idf * 10000
and rank * w * idf * 10000 * 2.2.
|
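A short Python sketch shows how the saturating TF formula behaves as the number of occurrences grows. The function name `tf` is an assumption made for the example.

```python
def tf(simple_tf: float) -> float:
    """Saturating term frequency: 2.2 * SimpleTF / (1.2 + SimpleTF)."""
    return 2.2 * simple_tf / (1.2 + simple_tf)

for occurrences in (1, 2, 10, 1000):
    print(occurrences, round(tf(occurrences), 3))
# 1 1.0
# 2 1.375
# 10 1.964
# 1000 2.197
```

The value grows quickly for the first few occurrences and then flattens out below 2.2, which is what prevents a heavily repeated word from dominating the score.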
BM25F
|
TF-IDF does not use the length of the document to normalize the term frequency.
The BM25 term score uses a more complete version of the TF
formula:
SimpleTF(w, d) = (number of occurrences of w in d) * (average length of all the documents) /
(length of the document)
As with TF-IDF, this SimpleTF is normalized to avoid overranking. Moreover, Exalead CloudView combines this term frequency with the TF-IDF value, using the following
formula:
The term.score of one word with this algorithm is:
rank * w * tf * idf * 10000
TF varies between almost 0 (for a word that occurs once in a very large
document) and 2.2 (for a word that occurs once in a small document and where all
the other documents are large).
|
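The effect of document length can be sketched in Python. This is an illustration under assumptions, not the product's implementation: it assumes the length factor is the average document length divided by the document length, which is what the stated range implies (near 0 for one occurrence in a very large document, near 2.2 for one occurrence in a small document among large ones). The function name is hypothetical.

```python
def length_normalized_tf(occurrences: int, doc_length: int, avg_length: float) -> float:
    # Assumption: shorter-than-average documents get a larger SimpleTF.
    simple_tf = occurrences * avg_length / doc_length
    # Same saturation as in the TF-IDF formula.
    return 2.2 * simple_tf / (1.2 + simple_tf)

# One occurrence in a very long document (average length 500 words).
print(round(length_normalized_tf(1, 100_000, 500), 3))  # 0.009
# One occurrence in a very short document, same corpus.
print(round(length_normalized_tf(1, 5, 500), 3))        # 2.174
```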
Custom
|
You can define your own custom ranking by selecting the
Custom scoring algorithm and defining a formula.
|
Ranks Remapping
During indexing, a static rank or relevance class is set for each meta. This
relevance class is a numerical value that is used to rank search results.
You can display current relevance classes for each meta in Index > Data
Processing > Mappings > Details > Relevancy options > Relevance
class. Nine values are available (from 0 to 8), 8 being the highest
rank:
- 0: No score
- 1: Hidden text
- 2: Text
- 3: Boosted text
- 4: Relevant text
- 5: Boosted relevant text
- 6: Title
- 7: Boosted title
- 8: URL
After indexing, relevance classes can no longer be modified. To change the relevance
class set for a meta, you must use the Ranks remapping field in
Search > Search Logics > Sort & Relevance > Term
scoring. Use numerical values in increasing order, separated by commas, to
set the new rank of existing relevance classes. Example: I want to give more weight to
titles. I must specify that titles (relevance class=6) now have the highest rank
(relevance class=8). I fill the Ranks remapping field as follows:
0,1,2,3,4,5,6,9,10
Proximity Boost
A special ranking element called proximity is the result of the proximity
algorithm.
Proximity is a double value, between 0 and 10, where 0 is out of range.
To set proximity boost in the Administration Console, go to Search > Search Logics > Sort
& Relevance > Proximity boost.
Boost
Boost, or b , is summed over all nodes.
A common use of b is to assign a score to nodes that do not normally have
one, such as categories:
a AND b AND (source:important_source{b=1000} OR
source:less_important_source{b=0})
The score of an alphanumeric value can be forced, replacing the term.score, by setting
{w=0,b=DESIRED_SCORE}.
b can also be negative, to unboost certain terms.
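The interaction between w and b can be sketched in Python. The sketch assumes the Rank scoring algorithm (term.score = rank * w per node) and represents each matched node as a hypothetical (rank, w, b) tuple; none of these names come from the product.

```python
def relevance(nodes, proximity: float = 1.0) -> float:
    """nodes: iterable of (rank, w, b) tuples for the matched query nodes."""
    term_score = sum(rank * w for rank, w, _ in nodes)  # term.score, summed over nodes
    boost = sum(b for _, _, b in nodes)                 # b, summed over nodes
    return term_score * proximity + boost

normal = [(4, 1.0, 0.0), (6, 1.0, 0.0)]
forced = [(4, 1.0, 0.0), (6, 0.0, 500.0)]     # {w=0,b=500} replaces the second term's score
unboosted = [(4, 1.0, 0.0), (6, 1.0, -3.0)]   # negative b lowers the score

print(relevance(normal), relevance(forced), relevance(unboosted))  # 10.0 504.0 7.0
```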
Custom Ranking Elements
For advanced ranking use cases, you can create custom numerical key-value pairs
attached to each node of the query tree to use as ranking elements.
For example, to create a behavior similar to the boost one, we can define the following
query:
- in UQL: fruit{interest=1} tomato{interest=10}
- in ELLQL: #and(#alphanum{interest=1}(text, "fruit") #alphanum{interest=10}(text,
"tomato"))
With this query, a hit that matches:
By adding a sort expression on @interest, we get the interesting hits
first.
The default policy is to sum the values for numerical ranking keys, from all children nodes
where it matches. You can also keep the maximum or minimum values among children:
- #and{interest.policy=MAX}(#alphanum{interest=10}(text, "tomato")
#alphanum{interest=1}(text, "fruit")) (score 10)
- #and{interest.policy=MIN}(#alphanum{interest=10}(text, "tomato")
#alphanum{interest=1}(text, "fruit")) (score 1)
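The three merge policies can be illustrated with a minimal Python sketch. The `POLICIES` table and the `merge` function are hypothetical names; the sketch only shows how the children's values combine at the parent node.

```python
POLICIES = {"SUM": sum, "MAX": max, "MIN": min}

def merge(policy: str, child_values: list) -> float:
    """Combine the ranking-key values of the matching children nodes."""
    return POLICIES[policy](child_values)

children = [10, 1]  # interest values of "tomato" and "fruit"
for policy in ("SUM", "MAX", "MIN"):
    print(policy, merge(policy, children))
# SUM 11
# MAX 10
# MIN 1
```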
Reusing Ranking Elements in Virtual Fields
You can re-use node properties and predefined ranking elements as expressions in the
virtual field syntax for:
For more information, see Calculating Results On-The-Fly with Virtual Fields.
For example, if you define a complex ranking element to calculate the relevance of a hit,
you may want to reuse this calculated value to compute an aggregation, which is the sum of
this relevance score for each value of a facet, indicating the "total relevance" of this
facet value.
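The "total relevance" aggregation described above amounts to summing a per-hit score for each facet value. A minimal Python sketch, with made-up hits and a hypothetical function name, could look like this:

```python
from collections import defaultdict

# Hypothetical hits: each has a facet value and a computed relevance score.
hits = [
    {"source": "news", "relevance": 12.0},
    {"source": "blog", "relevance": 3.5},
    {"source": "news", "relevance": 8.0},
]

def total_relevance_by_facet(hits, facet):
    """Sum the relevance of the hits for each value of the given facet."""
    totals = defaultdict(float)
    for hit in hits:
        totals[hit[facet]] += hit["relevance"]
    return dict(totals)

print(total_relevance_by_facet(hits, "source"))  # {'news': 20.0, 'blog': 3.5}
```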
The syntax to access a given ranking element is @elementname.
As the ranking elements are computed once a hit has been identified, there is a major
restriction, which is that they generally cannot be used in a virtual field for querying.
For example, you cannot use #attrnum(@proximity, ==, 42) because when we
want to evaluate whether the hit matches, the proximity has not yet been computed.
The main consequence is that if you define a numerical facet, which uses a ranking element,
you cannot refine on it. For example, if you defined a numerical facet with expression
#floor(@proximity) , you can use this to obtain a histogram of the
documents by the proximity of the query terms within them. However, you cannot refine,
because a query "I want all documents where the computed proximity score is between 1.3 and
1.7" is not supported.
One exception is the #filter ELLQL node, see Filtering Search Results in ELLQL.
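To illustrate why a #floor(@proximity) facet yields a histogram, the following Python sketch buckets a few made-up proximity values after the hits have been identified. The sample values are assumptions; the point is that the bucketing happens on already-computed scores, which is why it cannot be used as a query-time condition.

```python
import math
from collections import Counter

# Hypothetical proximity values computed for five matching hits.
proximities = [0.4, 1.3, 1.7, 2.9, 9.5]

# Equivalent of a numerical facet on floor(proximity): count hits per bucket.
histogram = Counter(math.floor(p) for p in proximities)
print(sorted(histogram.items()))  # [(0, 1), (1, 2), (2, 1), (9, 1)]
```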