Variation of information

Measure of distance between two clusterings related to mutual information

In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality.[1][2][3]

Information diagram illustrating the relation between information entropies, mutual information and variation of information.

Definition

Suppose we have two partitions $X$ and $Y$ of a set $A$ into disjoint subsets, namely $X = \{X_{1}, X_{2}, \ldots, X_{k}\}$ and $Y = \{Y_{1}, Y_{2}, \ldots, Y_{l}\}$.

Let:

$n = \sum_{i} |X_{i}| = \sum_{j} |Y_{j}| = |A|$
$p_{i} = |X_{i}|/n$ and $q_{j} = |Y_{j}|/n$
$r_{ij} = |X_{i} \cap Y_{j}|/n$

Then the variation of information between the two partitions is:

$\mathrm{VI}(X;Y) = -\sum_{i,j} r_{ij} \left[\log(r_{ij}/p_{i}) + \log(r_{ij}/q_{j})\right]$.

This is equivalent to the shared information distance between the random variables $i$ and $j$ with respect to the uniform probability measure on $A$ defined by $\mu(B) := |B|/n$ for $B \subseteq A$.
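
For concreteness, here is a minimal Python sketch of the definition above (the function name `variation_of_information` and the representation of partitions as lists of Python sets are our own choices, not taken from any reference implementation); it uses natural logarithms, so the result is in nats.

```python
# A minimal sketch of the definition above (hypothetical helper, not a
# reference implementation). Partitions are lists of disjoint Python sets
# covering the same ground set A.
from math import log

def variation_of_information(X, Y):
    """VI(X; Y) = -sum_{i,j} r_ij [log(r_ij / p_i) + log(r_ij / q_j)]."""
    n = sum(len(block) for block in X)      # n = |A|
    vi = 0.0
    for x in X:
        p = len(x) / n                      # p_i = |X_i| / n
        for y in Y:
            r = len(x & y) / n              # r_ij = |X_i ∩ Y_j| / n
            if r > 0.0:                     # empty intersections contribute nothing
                q = len(y) / n              # q_j = |Y_j| / n
                vi -= r * (log(r / p) + log(r / q))
    return vi

# Example: two partitions of A = {1, 2, 3, 4, 5}
X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
print(variation_of_information(X, Y))       # ≈ 0.764 nats
```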

Explicit information content

We can rewrite this definition in terms that explicitly highlight the information content of this metric.

The set of all partitions of a set forms a compact lattice in which the partial order induces two operations, the meet $\wedge$ and the join $\vee$; the maximum $\overline{1}$ is the partition with only one block, i.e., all elements grouped together, and the minimum is $\overline{0}$, the partition consisting of all elements as singletons. The meet of two partitions $X$ and $Y$ is the partition formed by all pairwise intersections of one block $X_{i}$ of $X$ and one block $Y_{j}$ of $Y$. It then follows that $X \wedge Y \subseteq X$ and $X \wedge Y \subseteq Y$.

Let's define the entropy of a partition X {\displaystyle X} as

$H(X) = -\sum_{i} p_{i} \log p_{i}$,

where $p_{i} = |X_{i}|/n$. Clearly, $H(\overline{1}) = 0$ and $H(\overline{0}) = \log n$. The entropy of a partition is a monotone function on the lattice of partitions, in the sense that $X \subseteq Y \Rightarrow H(X) \geq H(Y)$.

Then the VI distance between X {\displaystyle X} and Y {\displaystyle Y} is given by

$\mathrm{VI}(X,Y) = 2H(X \wedge Y) - H(X) - H(Y)$.
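
As a quick consistency check, the sketch below (continuing the hypothetical helper style of the earlier example) builds the meet explicitly and evaluates $2H(X \wedge Y) - H(X) - H(Y)$; on the same pair of partitions it reproduces the value obtained from the direct definition.

```python
# A quick check of VI(X, Y) = 2 H(X ∧ Y) - H(X) - H(Y); helper names are
# our own, not from a reference implementation.
from math import log

def partition_entropy(X, n):
    """H(X) = -sum_i p_i log p_i for a partition X of n elements."""
    return -sum((len(b) / n) * log(len(b) / n) for b in X)

def meet(X, Y):
    """X ∧ Y: all non-empty intersections of a block of X with a block of Y."""
    return [x & y for x in X for y in Y if x & y]

X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
n = 5
vi = (2 * partition_entropy(meet(X, Y), n)
      - partition_entropy(X, n) - partition_entropy(Y, n))
print(vi)                                   # ≈ 0.764 nats, matching the direct formula
```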

The difference $d(X,Y) \equiv |H(X) - H(Y)|$ is a pseudo-metric, as $d(X,Y) = 0$ does not necessarily imply that $X = Y$. From the definition of $\overline{1}$, it follows that $\mathrm{VI}(X, \overline{1}) = H(X)$.

If in the Hasse diagram we draw an edge from every partition to the maximum $\overline{1}$ and assign it a weight equal to the VI distance between the given partition and $\overline{1}$, we can interpret the VI distance as a sum of differences of edge weights to the maximum:

$\mathrm{VI}(X,Y) = |\mathrm{VI}(X,\overline{1}) - \mathrm{VI}(X \wedge Y,\overline{1})| + |\mathrm{VI}(Y,\overline{1}) - \mathrm{VI}(X \wedge Y,\overline{1})| = d(X, X \wedge Y) + d(Y, X \wedge Y)$.

For $H(X)$ as defined above, it holds that the joint information of two partitions coincides with the entropy of the meet:

$H(X,Y) = H(X \wedge Y)$

and we also have that $d(X, X \wedge Y) = H(X \wedge Y \mid X)$ coincides with the conditional entropy of the meet (intersection) $X \wedge Y$ relative to $X$.
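
Spelling this out, using only the identities already stated ($H(X,Y) = H(X \wedge Y)$ and the monotonicity of $H$ on the lattice):

```latex
% Why d(X, X \wedge Y) is a conditional entropy: use H(X,Y) = H(X \wedge Y)
% and H(X \wedge Y) \ge H(X) from the monotonicity above.
\begin{align*}
d(X, X\wedge Y) &= \bigl|H(X\wedge Y) - H(X)\bigr|
                 = H(X\wedge Y) - H(X) \\
                &= H(X,Y) - H(X)
                 = H(Y\mid X)
                 = H(X\wedge Y\mid X).
\end{align*}
% Hence VI(X,Y) = d(X, X\wedge Y) + d(Y, X\wedge Y) = H(Y|X) + H(X|Y).
```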

Identities

The variation of information satisfies

$\mathrm{VI}(X;Y) = H(X) + H(Y) - 2I(X,Y)$,

where $H(X)$ is the entropy of $X$, and $I(X,Y)$ is the mutual information between $X$ and $Y$ with respect to the uniform probability measure on $A$. This can be rewritten as

$\mathrm{VI}(X;Y) = H(X,Y) - I(X,Y)$,

where $H(X,Y)$ is the joint entropy of $X$ and $Y$, or

$\mathrm{VI}(X;Y) = H(X|Y) + H(Y|X)$,

where $H(X|Y)$ and $H(Y|X)$ are the respective conditional entropies.
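
These three forms are equivalent; each reduces to $2H(X,Y) - H(X) - H(Y)$ (and hence, by $H(X,Y) = H(X \wedge Y)$, to the lattice expression above), as the short derivation below spells out.

```latex
% Equivalence of the three forms, using I(X,Y) = H(X) + H(Y) - H(X,Y)
% and H(X|Y) = H(X,Y) - H(Y):
\begin{align*}
H(X) + H(Y) - 2I(X,Y)   &= 2H(X,Y) - H(X) - H(Y),\\
H(X,Y) - I(X,Y)         &= 2H(X,Y) - H(X) - H(Y),\\
H(X\mid Y) + H(Y\mid X) &= \bigl[H(X,Y) - H(Y)\bigr] + \bigl[H(X,Y) - H(X)\bigr]\\
                        &= 2H(X,Y) - H(X) - H(Y).
\end{align*}
```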

The variation of information can also be bounded, either in terms of the number of elements:

$\mathrm{VI}(X;Y) \leq \log(n)$,

or with respect to a maximum number of clusters, $K^{*}$:

$\mathrm{VI}(X;Y) \leq 2\log(K^{*})$.
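
The first bound is attained, for example, by two "crossing" partitions of a four-element set (a small worked example added here for concreteness):

```latex
% Example attaining VI(X;Y) = log(n): take A = {1,2,3,4},
% X = {{1,2},{3,4}}, Y = {{1,3},{2,4}}. Then p_i = q_j = 1/2 and
% r_{ij} = 1/4 for every pair (i,j), so
\begin{align*}
\mathrm{VI}(X;Y)
  = -\sum_{i,j}\tfrac{1}{4}\Bigl[\log\tfrac{1/4}{1/2}+\log\tfrac{1/4}{1/2}\Bigr]
  = 2\log 2 = \log 4 = \log(n).
\end{align*}
```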

Triangle inequality

To verify the triangle inequality $\mathrm{VI}(X;Z) \leq \mathrm{VI}(X;Y) + \mathrm{VI}(Y;Z)$, expand using the identity $\mathrm{VI}(X;Y) = H(X|Y) + H(Y|X)$. It suffices to prove

$H(X|Z) \leq H(X|Y) + H(Y|Z)$
The right side is at least $H(X,Y|Z)$, because $H(X,Y|Z) = H(Y|Z) + H(X|Y,Z)$ and $H(X|Y,Z) \leq H(X|Y)$; and $H(X,Y|Z)$ is no less than the left side $H(X|Z)$, as $X,Y \mid Z$ contains no less randomness than $X \mid Z$.
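
Written out in full, the key step is the following chain of standard entropy inequalities:

```latex
\begin{align*}
H(X\mid Z) &\le H(X,Y\mid Z)              && \text{(adding a variable cannot decrease entropy)}\\
           &=   H(Y\mid Z) + H(X\mid Y,Z) && \text{(chain rule)}\\
           &\le H(Y\mid Z) + H(X\mid Y)   && \text{(conditioning cannot increase entropy).}
\end{align*}
% Adding the symmetric bound H(Z|X) <= H(Y|X) + H(Z|Y) gives
% VI(X;Z) = H(X|Z) + H(Z|X) <= VI(X;Y) + VI(Y;Z).
```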

References

  1. P. Arabie, S. A. Boorman, "Multidimensional scaling of measures of distance between partitions", Journal of Mathematical Psychology (1973), vol. 10, no. 2, pp. 148–203, doi:10.1016/0022-2496(73)90012-6.
  2. W. H. Zurek, Nature, vol. 341, p. 119 (1989); W. H. Zurek, Physical Review A, vol. 40, p. 4731 (1989).
  3. Marina Meila, "Comparing Clusterings by the Variation of Information", Learning Theory and Kernel Machines (2003), Lecture Notes in Computer Science, vol. 2777, pp. 173–187, doi:10.1007/978-3-540-45167-9_14, ISBN 978-3-540-40720-1.

Further reading

  • Arabie, P.; Boorman, S. A. (1973). "Multidimensional scaling of measures of distance between partitions". Journal of Mathematical Psychology. 10 (2): 148–203. doi:10.1016/0022-2496(73)90012-6.
  • Meila, Marina (2003). "Comparing Clusterings by the Variation of Information". Learning Theory and Kernel Machines. Lecture Notes in Computer Science. Vol. 2777. pp. 173–187. doi:10.1007/978-3-540-45167-9_14. ISBN 978-3-540-40720-1. S2CID 4341039.
  • Meila, M. (2007). "Comparing clusterings—an information based distance". Journal of Multivariate Analysis. 98 (5): 873–895. doi:10.1016/j.jmva.2006.11.013.
  • Kingsford, Carl (2009). "Information Theory Notes" (PDF). Retrieved 22 September 2009.
  • Kraskov, Alexander; Harald Stögbauer; Ralph G. Andrzejak; Peter Grassberger (2003). "Hierarchical Clustering Based on Mutual Information". arXiv:q-bio/0311039.

External links

  • Partanalyzer includes a C++ implementation of VI and other metrics and indices for analyzing partitions and clusterings
  • C++ implementation with MATLAB mex files