Clustering Metrics

Quality of Clusterings
• Two metrics:
– SSE
– Dissimilarity Ratio
Computing SSE
• Save clusters. Two new columns are
created: Cluster and Distance.
• Create a new formula column. Name it
dist-sqr and define it as Distance^2
• Run Analyze > Distribution on dist-sqr. Take the
mean and multiply by N (the number of rows) to obtain SSE
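The same arithmetic can be checked outside JMP. The following is a minimal sketch, assuming the saved Distance column has been exported as a plain list of numbers; the function name `sse_from_distances` and the sample values are hypothetical and only illustrate the steps.

```python
import numpy as np

def sse_from_distances(distance):
    """Mirror the slide's steps: square each distance, take the mean, multiply by N."""
    d = np.asarray(distance, dtype=float)
    dist_sqr = d ** 2                  # the dist-sqr column
    return dist_sqr.mean() * d.size    # mean * N equals the sum of squares

# Hypothetical distances of three rows to their cluster centroids
distances = [1.0, 2.0, 2.0]
sse = sse_from_distances(distances)
print(sse)
```

Multiplying the mean by N simply recovers the sum of the squared distances, so summing dist-sqr directly gives the same SSE.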
Computing Dissimilarity Ratio
• Dissimilarity ratio = (inter-cluster distance / intra-cluster distance)
• Inter-cluster distance is the smallest distance
between centroids
• Normalize centroid coordinates:
– Coordinates are given in cluster output
– Find mean and std dev for each dimension from
histogram (distribution) output
– Normalize each centroid coordinate:
• (x - mean) / std dev
– Compute distances between each pair of centroids:
• Inter-cluster distance is given by the smallest of
the normalized centroid distances
• Distance between two normalized centroids x and y: $d = \sqrt{\sum_i (x_i - y_i)^2}$
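The normalization and pairwise-distance steps can be sketched in Python as follows. This is an illustration under stated assumptions, not JMP output: the function name `inter_cluster_distance` and all sample numbers are hypothetical, the distance is assumed to be Euclidean, and the means and standard deviations would be read from the distribution output.

```python
from itertools import combinations

import numpy as np

def inter_cluster_distance(centroids, means, std_devs):
    """Smallest Euclidean distance between normalized centroids (needs >= 2 centroids)."""
    c = np.asarray(centroids, dtype=float)
    # Normalize each coordinate: (x - mean) / std dev, per dimension
    z = (c - np.asarray(means, dtype=float)) / np.asarray(std_devs, dtype=float)
    # Distance for every pair of centroids, then keep the smallest
    return min(
        float(np.linalg.norm(z[a] - z[b]))
        for a, b in combinations(range(len(z)), 2)
    )

# Hypothetical centroids in two dimensions, with per-dimension mean and std dev
centroids = [[1.0, 2.0], [4.0, 6.0], [10.0, 2.0]]
d_inter = inter_cluster_distance(centroids, means=[1.0, 2.0], std_devs=[3.0, 4.0])
print(d_inter)
```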
Dissimilarity Ratio – cont.
• Intra-cluster distance is the average of the
clusters' maximum distances (max dist).
• The max dist of each cluster is reported in
the cluster output in JMP.
• Compute the dissimilarity ratio (DR) for each
clustering
• The higher the DR the better the
clustering.
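Putting the two pieces together, the ratio can be sketched as below. The function name `dissimilarity_ratio` and the numbers are hypothetical, and the sketch assumes the inter-cluster distance and the per-cluster max dist values are expressed on the same scale.

```python
def dissimilarity_ratio(inter_cluster_distance, max_dists):
    """DR = inter-cluster distance / average of the per-cluster max distances."""
    intra_cluster_distance = sum(max_dists) / len(max_dists)
    return inter_cluster_distance / intra_cluster_distance

# Hypothetical values for two candidate clusterings of the same data
dr_a = dissimilarity_ratio(3.0, [1.0, 2.0, 3.0])   # intra = 2.0
dr_b = dissimilarity_ratio(2.0, [2.0, 2.0])        # intra = 2.0
best = "A" if dr_a > dr_b else "B"
print(dr_a, dr_b, best)
```

Under the slide's rule, the clustering with the higher DR (here the hypothetical clustering A) would be preferred.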