Cophenetic correlation

In statistics, and especially in biostatistics, cophenetic correlation[1] (more precisely, the cophenetic correlation coefficient) is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. Although it has been most widely applied in the field of biostatistics (typically to assess cluster-based models of DNA sequences, or other taxonomic models), it can also be used in other fields of inquiry where raw data tend to occur in clumps, or clusters.[2] This coefficient has also been proposed for use as a test for nested clusters.[3]

Calculating the cophenetic correlation coefficient

Suppose that the original data $\{X_i\}$ have been modeled using a clustering method to produce a dendrogram $\{T_i\}$; that is, a simplified model in which data that are "close" have been grouped into a hierarchical tree. Define the following distance measures:

  • $x(i,j) = |X_i - X_j|$, the Euclidean distance between the ith and jth observations.
  • $t(i,j)$, the dendrogrammatic distance between the model points $T_i$ and $T_j$. This distance is the height of the node at which these two points are first joined together.

Then, letting $\bar{x}$ be the average of the $x(i,j)$, and letting $\bar{t}$ be the average of the $t(i,j)$, the cophenetic correlation coefficient $c$ is given by[4]

$$c = \frac{\sum_{i<j}\,[x(i,j)-\bar{x}]\,[t(i,j)-\bar{t}]}{\sqrt{\sum_{i<j}[x(i,j)-\bar{x}]^{2}\,\sum_{i<j}[t(i,j)-\bar{t}]^{2}}}.$$
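The coefficient is the Pearson correlation between the two lists of pairwise distances, taken over all pairs with i < j. The following minimal sketch evaluates it directly in plain Python. The function name `cophenetic_correlation`, the four one-dimensional sample points, and the dendrogram heights are illustrative assumptions, not part of any library. The heights were derived by hand for single-linkage clustering of these points, in which points 0 and 1 join at height 1, points 2 and 3 join at height 1, and the two groups join at height 4.

```python
import math
from itertools import combinations


def cophenetic_correlation(x, t):
    """Correlation between original distances x and dendrogram distances t.

    Both arguments list one value per pair (i, j) with i < j, in the same order.
    The result is undefined (division by zero) if either list is constant.
    """
    if len(x) != len(t) or len(x) < 2:
        raise ValueError("x and t must have the same length, at least 2")
    x_bar = sum(x) / len(x)
    t_bar = sum(t) / len(t)
    numerator = sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, t))
    denominator = math.sqrt(
        sum((xi - x_bar) ** 2 for xi in x) * sum((ti - t_bar) ** 2 for ti in t)
    )
    return numerator / denominator


# Hypothetical sample data: four points on a line.
points = [0.0, 1.0, 5.0, 6.0]
pairs = list(combinations(range(len(points)), 2))  # all (i, j) with i < j

# x(i, j): Euclidean distance between observations i and j.
x = [abs(points[i] - points[j]) for i, j in pairs]

# t(i, j): height at which i and j are first joined (hand-derived, single linkage).
joined_at_height_1 = {(0, 1), (2, 3)}
t = [1.0 if pair in joined_at_height_1 else 4.0 for pair in pairs]

c = cophenetic_correlation(x, t)
print(c)
```

A value of c close to 1 indicates that the dendrogram heights track the original distances closely for this sample.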

Software implementation

In R, the cophenetic correlation can be calculated using the dendextend package.[5]

In Python, the SciPy package also has an implementation.[6]
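As a brief illustration of the SciPy route, the sketch below builds a dendrogram with `scipy.cluster.hierarchy.linkage` and passes it, together with the original distances, to `scipy.cluster.hierarchy.cophenet`. The sample points and the choice of average linkage are assumptions made only for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Hypothetical sample data: five points in the plane forming two loose groups.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])

# Condensed vector of Euclidean distances, one entry per pair (i, j) with i < j.
Y = pdist(X)

# Hierarchical clustering; average linkage is an arbitrary choice here.
Z = linkage(Y, method="average")

# When the original distances are supplied, cophenet returns the coefficient
# and the condensed vector of cophenetic (dendrogram) distances.
c, coph_dists = cophenet(Z, Y)
print(c)
```

Here `coph_dists` has the same length and pair ordering as `Y`, so the two vectors can be compared entry by entry.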

In MATLAB, the Statistics and Machine Learning Toolbox contains an implementation.[7]

See also

  • Cophenetic

References

  1. ^ Sokal, R. R. and F. J. Rohlf. 1962. The comparison of dendrograms by objective methods. Taxon, 11:33-40
  2. ^ Dorthe B. Carr, Chris J. Young, Richard C. Aster, and Xioabing Zhang, Cluster Analysis for CTBT Seismic Event Monitoring (a study prepared for the U.S. Department of Energy)
  3. ^ Rohlf, F. J. and David L. Fisher. 1968. Test for hierarchical structure in random data sets. Systematic Zool., 17:407-412
  4. ^ Mathworks statistics toolbox
  5. ^ "Introduction to dendextend".
  6. ^ "scipy.cluster.hierarchy.cophenet — SciPy v0.14.0 Reference Guide". docs.scipy.org. Retrieved 2019-07-11.
  7. ^ "Cophenetic correlation coefficient - MATLAB cophenet".

External links

  • Numerical example of cophenetic correlation
  • Computing and displaying Cophenetic distances