
Cubic Clustering Criterion (CCC)
SAS Technical Report #108

Here I'll change notation from the technical report to show the relationship between the CCC and typical regression computations.

n = number of observations
n_k = number of observations in cluster k
p = number of variables
q = number of clusters
Y = n x p data matrix
M = q x p matrix of cluster means
X = n x q cluster indicator matrix (x_ik = 1 if observation i is in cluster k, 0 otherwise)

Assume each variable has mean 0 (center the data). Then

$X'X = \mathrm{diag}(n_1, \ldots, n_q), \qquad \hat{M} = (X'X)^{-1} X'Y$

so the regression coefficients $\hat{M}$ are exactly the cluster means.

SS(total) matrix (uncorrected): $T = Y'Y$
SS(between clusters) matrix (uncorrected): $B = \hat{M}' X'X \hat{M}$
SS(within clusters) matrix: $W = T - B$

$R^2 = 1 - \mathrm{trace}(W) / \mathrm{trace}(T)$ (trace = sum of diagonal elements)

Stack the columns of Y into one long column and regress it on the Kronecker product of X with the p x p identity matrix, $I_p \otimes X$. The $R^2$ for this regression is the same $R^2$.

The CCC idea is to compare the $R^2$ you get for a given set of clusters with the $R^2$ you would get by clustering a uniformly distributed set of points in p-dimensional space.
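
The equivalence between the trace-based R2 and the stacked regression R2 can be checked numerically. The following is a minimal sketch in Python with NumPy, not code from the technical report. The simulated two-cluster data, the function names `trace_r2` and `stacked_r2`, and the name `M_hat` for the coefficient matrix (X'X)^-1 X'Y are all assumptions made for illustration. It covers only the R2 part and does not compute the CCC itself.

```python
import numpy as np

def trace_r2(Y, labels):
    """R2 = 1 - trace(W)/trace(T) from the cluster indicator regression."""
    Y = np.asarray(Y, dtype=float)
    Y = Y - Y.mean(axis=0)                    # center each variable
    labels = np.asarray(labels).ravel()
    ids, idx = np.unique(labels, return_inverse=True)
    n, q = Y.shape[0], ids.size
    X = np.zeros((n, q))                      # indicator: x_ik = 1 if obs i in cluster k
    X[np.arange(n), idx] = 1.0
    XtX = X.T @ X                             # diag(n_1, ..., n_q)
    M_hat = np.linalg.solve(XtX, X.T @ Y)     # q x p matrix of cluster means
    T = Y.T @ Y                               # SS(total), uncorrected
    B = M_hat.T @ XtX @ M_hat                 # SS(between clusters)
    W = T - B                                 # SS(within clusters)
    return 1.0 - np.trace(W) / np.trace(T), X, Y

def stacked_r2(X, Y):
    """R2 from regressing the stacked columns of Y on kron(I_p, X)."""
    p = Y.shape[1]
    y_long = Y.reshape(-1, order="F")         # stack columns into one long column
    Z = np.kron(np.eye(p), X)                 # (n*p) x (q*p) design matrix
    coef, *_ = np.linalg.lstsq(Z, y_long, rcond=None)
    resid = y_long - Z @ coef
    return 1.0 - (resid @ resid) / (y_long @ y_long)

# Hypothetical data: two clusters of sizes 30 and 20 in p = 2 dimensions.
rng = np.random.default_rng(0)
Y_raw = np.vstack([rng.normal(-2.0, 1.0, size=(30, 2)),
                   rng.normal(2.0, 1.0, size=(20, 2))])
labels = np.array([0] * 30 + [1] * 20)

r2_trace, X, Y_centered = trace_r2(Y_raw, labels)
r2_stacked = stacked_r2(X, Y_centered)
print(r2_trace, r2_stacked)                   # the two values should agree
```

The two numbers agree because the residual sum of squares of the stacked regression is the sum of the within-cluster sums of squares over all p variables, which is trace(W), while the total sum of squares of the stacked column is trace(T).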