Kendall tau distance

Metric to compare ordering

The Kendall tau rank distance is a metric (distance function) that counts the number of pairwise disagreements between two ranking lists. The larger the distance, the more dissimilar the two lists are. Kendall tau distance is also called bubble-sort distance since it is equivalent to the number of swaps that the bubble sort algorithm would take to place one list in the same order as the other list. The Kendall tau distance was created by Maurice Kendall.

Definition

The Kendall tau ranking distance between two lists τ 1 {\displaystyle \tau _{1}} and τ 2 {\displaystyle \tau _{2}} is

K d ( τ 1 , τ 2 ) = | { ( i , j ) : i < j , [ τ 1 ( i ) < τ 1 ( j ) τ 2 ( i ) > τ 2 ( j ) ] [ τ 1 ( i ) > τ 1 ( j ) τ 2 ( i ) < τ 2 ( j ) ] } | . {\displaystyle K_{d}(\tau _{1},\tau _{2})=|\{(i,j):i<j,[\tau _{1}(i)<\tau _{1}(j)\wedge \tau _{2}(i)>\tau _{2}(j)]\vee [\tau _{1}(i)>\tau _{1}(j)\wedge \tau _{2}(i)<\tau _{2}(j)]\}|.}
where τ 1 ( i ) {\displaystyle \tau _{1}(i)} and τ 2 ( i ) {\displaystyle \tau _{2}(i)} are the rankings of the element i {\displaystyle i} in τ 1 {\displaystyle \tau _{1}} and τ 2 {\displaystyle \tau _{2}} respectively.

K d ( τ 1 , τ 2 ) {\displaystyle K_{d}(\tau _{1},\tau _{2})} will be equal to 0 if the two lists are identical and 1 2 n ( n 1 ) {\textstyle {\frac {1}{2}}n(n-1)} (where n {\displaystyle n} is the list size) if one list is the reverse of the other.

Kendall tau distance may also be defined as

K d ( τ 1 , τ 2 ) = { i , j } P , i < j K ¯ i , j ( τ 1 , τ 2 ) {\displaystyle K_{d}(\tau _{1},\tau _{2})=\sum _{\{i,j\}\in P,i<j}{\bar {K}}_{i,j}(\tau _{1},\tau _{2})}

where

  • P is the set of unordered pairs of distinct elements in τ 1 {\displaystyle \tau _{1}} and τ 2 {\displaystyle \tau _{2}}
  • K ¯ i , j ( τ 1 , τ 2 ) {\displaystyle {\bar {K}}_{i,j}(\tau _{1},\tau _{2})} = 0 if i and j are in the same order in τ 1 {\displaystyle \tau _{1}} and τ 2 {\displaystyle \tau _{2}}
  • K ¯ i , j ( τ 1 , τ 2 ) {\displaystyle {\bar {K}}_{i,j}(\tau _{1},\tau _{2})} = 1 if i and j are in the opposite order in τ 1 {\displaystyle \tau _{1}} and τ 2 . {\displaystyle \tau _{2}.}

Kendall tau distance can also be defined as the total number of discordant pairs.

Kendall tau distance in Rankings: A permutation (or ranking) is an array of N integers where each of the integers between 0 and N-1 appears exactly once.

The Kendall tau distance between two rankings is the number of pairs that are in different order in the two rankings. For example, the Kendall tau distance between 0 3 1 6 2 5 4 and 1 0 3 6 4 2 5 is four because the pairs 0-1, 3-1, 2-4, 5-4 are in different order in the two rankings, but all other pairs are in the same order.[1]

The normalized Kendall tau distance K n {\displaystyle K_{n}} is K d 1 2 n ( n 1 ) = 2 K d n ( n 1 ) {\displaystyle {\frac {K_{d}}{{\frac {1}{2}}n(n-1)}}={\frac {2K_{d}}{n(n-1)}}} and therefore lies in the interval [0,1].

If Kendall tau distance function is performed as K ( L 1 , L 2 ) {\displaystyle K(L1,L2)} instead of K ( τ 1 , τ 2 ) {\displaystyle K(\tau _{1},\tau _{2})} (where τ 1 {\displaystyle \tau _{1}} and τ 2 {\displaystyle \tau _{2}} are the rankings of L 1 {\displaystyle L1} and L 2 {\displaystyle L2} elements respectively), then triangular inequality is not guaranteed. The triangular inequality fails sometimes also in cases where there are repetitions in the lists. So then we are not dealing with a metric anymore.

Generalised versions of Kendall tau distance have been proposed to give weights to different items and different positions in the ranking.[2]

Comparison to Kendall tau rank correlation coefficient

The  Kendall tau distance ( K d {\displaystyle K_{d}} ) must not be confused with the Kendall tau rank correlation coefficient ( K c {\displaystyle K_{c}} )  used in statistics.

They are related by   K c = 1 4 K d / ( n ( n 1 ) ) {\displaystyle K_{c}=1-4K_{d}/(n(n-1))} , K d = ( 1 K c ) ( n ( n 1 ) ) / 4 {\displaystyle K_{d}=(1-K_{c})(n(n-1))/4}

Or simpler by   K c = 1 2 K n , K n = ( 1 K c ) / 2 {\displaystyle K_{c}=1-2K_{n},K_{n}=(1-K_{c})/2} where K n {\displaystyle K_{n}} is the normalised distance 2 K d / ( n ( n 1 ) ) {\displaystyle 2K_{d}/(n(n-1))} see above)

The distance is a value between 0 and n ( n 1 ) / 2 {\displaystyle n(n-1)/2} . (The normalised distance is between 0 and 1)

The correlation is between -1 and 1.

The distance between equals is 0, the correlation between equals is 1.

The distance between reversals is n ( n 1 ) / 2 {\displaystyle n(n-1)/2} , the correlation between reversals is -1


For example comparing the rankings A>B>C>D and A>B>C>D the distance is 0 the correlation is 1.

Comparing the rankings A>B>C>D and D>C>B>A the distance is 6 the correlation is -1

Comparing the rankings A>B>C>D and B>D>A>C the distance is 3 the correlation is 0

Example

Suppose one ranks a group of five people by height and by weight:

Person A B C D E ranking
Rank by height 1 2 3 4 5 A>B>C>D>E
Rank by weight 3 4 1 2 5 C>D>A>B>E

Here person A is tallest and third-heaviest, B is the second -tallest and fourth-heaviest and so on.

In order to calculate the Kendall tau distance, pair each person with every other person and count the number of times the values in list 1 are in the opposite order of the values in list 2.

Pair Height Weight Count
(A,B) 1 < 2 3 < 4
(A,C) 1 < 3 3 > 1 X
(A,D) 1 < 4 3 > 2 X
(A,E) 1 < 5 3 < 5
(B,C) 2 < 3 4 > 1 X
(B,D) 2 < 4 4 > 2 X
(B,E) 2 < 5 4 < 5
(C,D) 3 < 4 1 < 2
(C,E) 3 < 5 1 < 5
(D,E) 4 < 5 2 < 5

Since there are four pairs whose values are in opposite order, the Kendall tau distance is 4. The normalized Kendall tau distance is

4 5 ( 5 1 ) / 2 = 0.4. {\displaystyle {\frac {4}{5(5-1)/2}}=0.4.}

A value of 0.4 indicates that 40% of pairs differ in ordering between the two lists.

Computing the Kendall tau distance

A naive implementation in Python (using NumPy) is:

import numpy as np

def normalised_kendall_tau_distance(values1, values2):
    """Compute the Kendall tau distance."""
    n = len(values1)
    assert len(values2) == n, "Both lists have to be of equal length"
    i, j = np.meshgrid(np.arange(n), np.arange(n))
    a = np.argsort(values1)
    b = np.argsort(values2)
    ndisordered = np.logical_or(np.logical_and(a[i] < a[j], b[i] > b[j]), np.logical_and(a[i] > a[j], b[i] < b[j])).sum()
    return ndisordered / (n * (n - 1))

However, this requires n 2 {\displaystyle n^{2}} memory, which is inefficient for large arrays.

Given two rankings τ 1 , τ 2 {\displaystyle \tau _{1},\tau _{2}} , it is possible to rename the items such that τ 1 = ( 1 , 2 , 3 , . . . ) {\displaystyle \tau _{1}=(1,2,3,...)} . Then, the problem of computing the Kendall tau distance reduces to computing the number of inversions in τ 2 {\displaystyle \tau _{2}} —the number of index pairs i , j {\displaystyle i,j} such that i < j {\displaystyle i<j} while τ 2 ( i ) > τ 2 ( j ) {\displaystyle \tau _{2}(i)>\tau _{2}(j)} . There are several algorithms for calculating this number.

  • A simple algorithm based on merge sort requires time O ( n log n ) {\displaystyle O(n\log n)} .[3]
  • A more advanced algorithm requires time O ( n log n ) {\displaystyle O(n{\sqrt {\log {n}}})} .[4]

Here is a basic C implementation.

#include <stdbool.h>

int kendallTau(short x[], short y[], int len) {
    int i, j, v = 0;
    bool a, b;

    for (i = 0; i < len; i++) {
        for (j = i + 1; j < len; j++) {

            a = x[i] < x[j] && y[i] > y[j];
            b = x[i] > x[j] && y[i] < y[j];

            if (a || b)
                v++;

        }
    }

    return abs(v);
}

float normalize(int kt, int len) {
    return kt / (len * (len - 1) / 2.0);
}

See also

References

  1. ^ "Sorting Applications".
  2. ^ Ravi Kumar and Sergei Vassilvitskii (2010). Generalized Distances between Rankings (PDF).
  3. ^ Ionescu, Vlad. "calculating the number of "inversions" in a permutation". Stack overflow. Retrieved 24 February 2017.
  4. ^ Chan, Timothy M.; Pătraşcu, Mihai (2010). "Counting Inversions, Offline Orthogonal Range Counting, and Related Problems". Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms. p. 161. CiteSeerX 10.1.1.208.2715. doi:10.1137/1.9781611973075.15. ISBN 978-0-89871-701-3.
  • Fagin, R.; Kumar, R.; Sivakumar, D. (2003). "Comparing top k lists". SIAM Journal on Discrete Mathematics. 17 (1): 134–160. CiteSeerX 10.1.1.86.3234. doi:10.1137/S0895480102412856. S2CID 6249357.
  • Kendall, M. (1948). Rank Correlation Methods. Charles Griffin & Company Limited.
  • Kendall, M. (1938). "A New Measure of Rank Correlation". Biometrika. 30 (1/2): 81–89. doi:10.2307/2332226. JSTOR 2332226.

External links

  • Online software: computes Kendall's tau rank correlation
  • QuickVote — A website that calculates the Kendall tau distance between two interactive ranking lists and shows the pairwise disagreements between the lists.