Cubic Clustering Criterion (CCC) Overview

Cubic Clustering Criterion CCC
SAS Technical Report #108
Here I’ll change notation from the technical report to show the relationship between
CCC and typical regression computations.
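$n$ = number of observations
$n_k$ = number of observations in cluster $k$
$p$ = number of variables
$q$ = number of clusters
$Y$ = $n \times p$ data matrix
$M$ = $q \times p$ matrix of cluster means
$X$ = $n \times q$ cluster indicator matrix ($x_{ik} = 1$ if observation $i$ is in cluster $k$, 0 otherwise)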
Assume each variable has mean 0 (center the data).
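$X'X = \operatorname{diag}(n_1, \ldots, n_q)$, $\hat{M} = (X'X)^{-1} X'Y$ (the $q \times p$ matrix of cluster means)
SS(total) matrix (uncorrected) $= T = Y'Y$
SS(between clusters) matrix (uncorrected) $= B = \hat{M}' X'X \hat{M}$
SS(within clusters) matrix $= W = T - B$
$R^2 = 1 - \operatorname{trace}(W)/\operatorname{trace}(T)$
(trace = sum of diagonal elements)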
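The computations above can be sketched in NumPy. This is only an illustration under stated assumptions: the function name cluster_r2, the integer label coding 0, ..., q-1, and the toy data are made up here and do not come from the technical report. Every cluster is assumed to be non-empty so that X'X is invertible.

```python
import numpy as np

def cluster_r2(Y, labels):
    """R^2 = 1 - trace(W)/trace(T) for data Y (n x p) and integer labels 0..q-1.

    Hypothetical helper for illustration; assumes every cluster is non-empty.
    """
    Y = np.asarray(Y, dtype=float)
    Y = Y - Y.mean(axis=0)                  # center each variable (mean 0)
    labels = np.asarray(labels)
    q = labels.max() + 1
    X = np.eye(q)[labels]                   # n x q cluster indicator matrix
    XtX = X.T @ X                           # diag(n_1, ..., n_q)
    M_hat = np.linalg.solve(XtX, X.T @ Y)   # q x p matrix of cluster means
    T = Y.T @ Y                             # total SS matrix
    B = M_hat.T @ XtX @ M_hat               # between-cluster SS matrix
    W = T - B                               # within-cluster SS matrix
    return 1.0 - np.trace(W) / np.trace(T)

# Toy data: two tight, well-separated groups of two points each.
Y = np.array([[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
r2 = cluster_r2(Y, labels)
print(r2)
```

Because the two groups are far apart relative to their spread, almost all of the total sum of squares is between clusters and the printed value is close to 1.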
Equivalently, stack the columns of $Y$ into one long column of length $np$.
Regress it on the Kronecker product $I_p \otimes X$, where $I_p$ is the $p \times p$ identity matrix.
Compute $R^2$ for this regression: it is the same $R^2$.
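A small numerical check of this equivalence, as a sketch only: the random data, the dimensions, and all variable names are assumptions chosen for illustration. The columns of Y are stacked in column-major order and regressed, without an intercept, on the Kronecker product of the p x p identity with X.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 12, 3, 4
labels = np.arange(n) % q                 # three observations per cluster
Y = rng.normal(size=(n, p))
Y = Y - Y.mean(axis=0)                    # center each variable
X = np.eye(q)[labels]                     # n x q cluster indicator matrix

# Trace-based R^2 from the SS matrices.
M_hat = np.linalg.solve(X.T @ X, X.T @ Y)
T = Y.T @ Y
B = M_hat.T @ (X.T @ X) @ M_hat
r2_trace = 1.0 - np.trace(T - B) / np.trace(T)

# Stacked regression: vec(Y) on kron(I_p, X).
y_vec = Y.reshape(-1, order="F")          # stack the columns of Y
Z = np.kron(np.eye(p), X)                 # (n*p) x (q*p) design matrix
beta, *_ = np.linalg.lstsq(Z, y_vec, rcond=None)
resid = y_vec - Z @ beta
r2_stacked = 1.0 - (resid @ resid) / (y_vec @ y_vec)

print(r2_trace, r2_stacked)
```

The fitted coefficient vector is the column-stacked matrix of cluster means, which is why the two R^2 values agree up to rounding.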
The CCC idea is to compare the $R^2$ you get for a given set of clusters with the $R^2$ you
would get by clustering a uniformly distributed set of points in $p$-dimensional space.
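To make the comparison concrete, here is a rough simulation sketch. It is not the CCC formula from the technical report: it estimates the uniform reference R^2 by Monte Carlo, and the clustering method is a bare-bones k-means written for this example. All function names, the unit-cube reference distribution, and the toy data are assumptions for illustration.

```python
import numpy as np

def r2_of_partition(Y, labels):
    """R^2 = 1 - (within-cluster SS) / (total SS) after centering Y."""
    Y = Y - Y.mean(axis=0)
    total = (Y ** 2).sum()
    within = 0.0
    for k in np.unique(labels):
        Yk = Y[labels == k]
        within += ((Yk - Yk.mean(axis=0)) ** 2).sum()
    return 1.0 - within / total

def kmeans_labels(Y, q, rng, n_iter=50):
    """Bare-bones Lloyd iterations (illustrative stand-in for any clustering method)."""
    centers = Y[rng.choice(len(Y), size=q, replace=False)]
    for _ in range(n_iter):
        d = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)             # assign each point to nearest center
        for k in range(q):
            if np.any(labels == k):           # leave an empty cluster's center unchanged
                centers[k] = Y[labels == k].mean(axis=0)
    return labels

def uniform_reference_r2(n, p, q, rng, n_rep=20):
    """Average R^2 from clustering n uniform points in the p-dimensional unit cube."""
    vals = []
    for _ in range(n_rep):
        U = rng.uniform(size=(n, p))
        vals.append(r2_of_partition(U, kmeans_labels(U, q, rng)))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
# Toy data: two well-separated groups in two dimensions.
Y = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(3, 0.3, size=(50, 2))])
r2_obs = r2_of_partition(Y, kmeans_labels(Y, 2, rng))
r2_ref = uniform_reference_r2(n=100, p=2, q=2, rng=rng)
print(r2_obs, r2_ref)
```

An observed R^2 well above the uniform reference suggests real cluster structure, while a value near the reference suggests the partition explains no more than it would for structureless data.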