The similarity measure depends on the function used to compare vectors pairwise.
To choose a function, use the following syntax:
#attrsimilar{name=SIGNATURE_SCORE_OUTPUT,
function=euclidian_normed}(SIGNATURE_INDEX_FIELD,SIGNATURE_FLOAT_VALUES)
Similarity is calculated as follows: similarity = 1 - distance. For all _normed functions, we can summarize the calculation as:
similarity = 1 <--> close; similarity = 0 <--> far
dist = 1 <--> far; dist = 0 <--> close
For non-normed similarity functions (for example Manhattan, Euclidian, etc.), the calculation is identical, but the distance bounds change from [0;1] to [0;Infinity] and the similarity is bounded by [-Infinity;1].
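As an illustration of the difference between normed and non-normed ranges, here is a minimal Python sketch based on the manhattan_normed and manhattan_dist formulas listed in the function table. The helper names and sample vectors are hypothetical, and the code assumes non-zero vectors; it is not the engine's implementation.

```python
def norm_l1(x):
    # NormL1(x) = sum_i{abs(x[i])}; assumed non-zero for this sketch
    return sum(abs(v) for v in x)

def manhattan_dist(x1, x2):
    # Raw L1 distance, unbounded above
    return sum(abs(a - b) for a, b in zip(x1, x2))

def manhattan_normed(x1, x2):
    # L1-normalize both vectors first, then sim = 1 - (distance / 2)
    n1, n2 = norm_l1(x1), norm_l1(x2)
    return 1 - sum(abs(a / n1 - b / n2) for a, b in zip(x1, x2)) / 2

x1 = [1.0, 3.0]
x2 = [2.0, 6.0]  # same direction as x1, twice as long
x3 = [3.0, 0.0]

print(manhattan_dist(x1, x2))      # 4.0
print(1 - manhattan_dist(x1, x2))  # -3.0, a non-normed similarity can be negative
print(manhattan_normed(x1, x2))    # 1.0, identical once normalized
print(manhattan_normed(x1, x3))    # 0.25
```

The normed variant ignores vector length, which is why x1 and x2 score 1 even though their raw distance is 4.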
The cosine similarity function is the exception, with bounds -1 (dissimilar) and 1 (similar). The angular similarity function maps cosine similarity to a value between 0 and 1, so that it is consistent with the other similarity functions.
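The relationship between the two functions can be sketched in Python as follows. The sketch applies the arccos(COSINE) / PI expression given for angular in the function table literally; the helper names are hypothetical and the vectors are assumed to be non-zero. With this expression, vectors pointing the same way give 0 and opposite vectors give 1; whether the engine reports this value or 1 minus it is not confirmed here.

```python
import math

def cosine(x1, x2):
    # Dot product divided by the product of the L2 norms
    dot = sum(a * b for a, b in zip(x1, x2))
    n1 = math.sqrt(sum(a * a for a in x1))
    n2 = math.sqrt(sum(b * b for b in x2))
    return dot / (n1 * n2)

def angular(x1, x2):
    # Clamp to [-1, 1] to guard against floating-point rounding before acos
    c = max(-1.0, min(1.0, cosine(x1, x2)))
    return math.acos(c) / math.pi

a = [1.0, 0.0]
for b in ([2.0, 0.0], [0.0, 2.0], [-3.0, 0.0]):
    # cosine spans [-1, 1]; angular spans [0, 1]
    print(b, cosine(a, b), angular(a, b))
```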
Function |
Use |
manhattan (default function) |
For L1-normalized vectors.
Formula: sim = 1 - (Sum{abs(x1[i] - x2[i])}/2)
The similarity is between 0 and 1.
|
manhattan_normed
|
Same as manhattan, with the vectors L1-normalized first.
Formula: sim = 1 - (Sum_i{abs(x1[i]/NormL1(x1) - x2[i]/NormL1(x2))}/2),
where NormL1(x) = Sum_i{abs(x[i])}
The similarity is between 0 and 1.
|
manhattan_dist
|
For any vectors.
Formula: dist = Sum {abs(x1[i] - x2[i])}
The distance is between 0 and infinity.
|
multi_manhattan_normed
|
Compares 2 sets of vectors having the same dimension, for example 2 vectors
of 8 floats against 3 vectors of 8 floats, using the exclusive min between
all manhattan_normed distances.
The similarity is between 0 and 1.
|
euclidian
|
For L2-normalized vectors.
Formula: sim = 1 - sqrt((Sum_i{(x1[i]-x2[i])^2})/2)
The similarity is between 0 and 1.
|
euclidian_normed
|
Same as euclidian, with the vectors L2-normalized first.
Formula: sim = 1 - sqrt((Sum_i{(x1[i]/NormL2(x1) - x2[i]/NormL2(x2))^2})/2),
where NormL2(x) = sqrt(Sum_i{x[i]^2})
The similarity is between 0 and 1.
|
euclidian_dist
|
For any vectors.
Formula: dist = sqrt(Sum_i{(x1[i]-x2[i])^2})
The distance is between 0 and infinity.
|
cosine
|
Cosine of the angle between 2 vectors.
Formula: COSINE = Sum_i{x1[i]*x2[i]} / (NormL2(x1)*NormL2(x2))
The similarity is between -1 and 1, where -1 is dissimilar and 1 is similar.
|
angular
|
Formula: arccos(COSINE) / PI
The similarity is between 0 and 1.
|
dice
|
For binary bit strings. It computes the intersection between the bits set to 1
in the 2 sequences.
Formula: D = 2*|X inter Y| / (|X| + |Y|)
The similarity is between 0 and 1.
|
jaccard
|
For binary bit strings. It computes the ratio of the intersection to the union
of the bits set to 1 in the 2 sequences.
Formula: J = |X inter Y| / (|X| + |Y| - |X inter Y|)
The similarity is between 0 and 1.
Note: jaccard is sometimes called TANIMOTO.
|
hamming
|
For binary bit strings. It computes the number of ones in the XOR of the 2 bit
sequences.
Formula: H = 1 - (|XOR(X,Y)|/lenBit(X))
The raw distance |XOR(X,Y)| is between 0 and lenBit(X), so H is between 0 and 1.
|
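The three bit-string functions can be compared side by side with a short Python sketch. It is illustrative only: bit strings are represented as Python integers, the helper names and the len_bit parameter are hypothetical, and at least one bit is assumed to be set in the inputs. Here jaccard uses the standard intersection-over-union definition, which keeps the result within the stated 0 to 1 range.

```python
def popcount(n):
    # Number of bits set to 1
    return bin(n).count("1")

def dice(x, y):
    # D = 2*|X inter Y| / (|X| + |Y|)
    return 2 * popcount(x & y) / (popcount(x) + popcount(y))

def jaccard(x, y):
    # J = |X inter Y| / |X union Y|, with |X union Y| = |X| + |Y| - |X inter Y|
    inter = popcount(x & y)
    return inter / (popcount(x) + popcount(y) - inter)

def hamming(x, y, len_bit):
    # H = 1 - (number of differing bits / bit length)
    return 1 - popcount(x ^ y) / len_bit

x = 0b11001010  # 4 bits set
y = 0b10001110  # 4 bits set, 3 shared with x, 2 positions differ

print(dice(x, y))        # 0.75
print(jaccard(x, y))     # 0.6
print(hamming(x, y, 8))  # 0.75
```

Unlike dice and jaccard, hamming also counts positions where both bits are 0 as agreement, so it depends on the declared bit length.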