Clustering
Stat 430 Fall 2011

Outline
• Distance Measures
• Linkage
• Hierarchical Clustering
• KMeans

Data set: Letters
• from the UCI repository: Letters Data
• 20,000 instances of letters
• Variables:
1. lettr: capital letter (26 values from A to Z)
2. x-box: horizontal position of box (integer)
3. y-box: vertical position of box (integer)
4. width: width of box (integer)
5. high: height of box (integer)
6. onpix: total # on pixels (integer)
7. x-bar: mean x of on pixels in box (integer)
8. y-bar: mean y of on pixels in box (integer)
9. x2bar: mean x variance (integer)
10. y2bar: mean y variance (integer)
11. xybar: mean x y correlation (integer)
12. x2ybr: mean of x * x * y (integer)
13. xy2br: mean of x * y * y (integer)
14. x-ege: mean edge count left to right (integer)
15. xegvy: correlation of x-ege with y (integer)
16. y-ege: mean edge count bottom to top (integer)
17. yegvx: correlation of y-ege with x (integer)

Data set: Letters
Data Set Information (UCI repository):
Objective: identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15. We typically train on the first 16000 items and then use the resulting model to predict the letter category for the remaining 4000. See the article cited above for more details.

Clustering
• part of unsupervised classification (i.e. we do not use or have a dependent variable)
• classification is based on object similarity
• What is similar?

tools are used to assess the results of numerical methods. Section 5.5 summarizes the chapter and revisits the data analysis strategies used in the examples.
A good companion to the material presented in this chapter is Venables & Ripley (2002), which provides data and code for practical examples of cluster analysis using R.

Distances

Before we can begin finding groups of cases that are similar, we need to decide on a definition of similarity. How is similarity defined? Consider a dataset with three cases and four variables, described in matrix format as three objects: one 'T', one 'I' and one 'M'.

The Euclidean distance between two cases (rows of the matrix) is defined as
d_Euc(X_i, X_j) = ||X_i − X_j||,  i, j = 1, ..., n

For example, the Euclidean distance between cases 1 and 2 in the above data is 1.0.
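
The definition above can be sketched in a few lines of Python. The feature values for 'T', 'I' and 'M' below are made up for illustration (they are not the lecture's data), and the helper names `euclidean` and `distance_matrix` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(rows):
    """Symmetric matrix of pairwise Euclidean distances between rows."""
    return [[euclidean(r, s) for s in rows] for r in rows]

# Hypothetical feature values, NOT the lecture's data.
objects = {
    "T": [2, 8, 3, 5],
    "I": [6, 9, 7, 5],
    "M": [10, 8, 11, 6],
}
names = list(objects)
D = distance_matrix([objects[n] for n in names])

# Print one row of the distance matrix per object.
for name, row in zip(names, D):
    print(name, [round(d, 2) for d in row])
```

With these made-up values, the matrix is symmetric with zeros on the diagonal, and 'I' comes out slightly closer to 'T' than to 'M'.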

i.e. I is closer to T than to M (by a sliver), M and T are quite far apart

Linkage
• Compute a distance matrix of all objects
• Start connecting the closest objects
• The first step is the same for every method: combine the two closest objects into one cluster
• Next step: combine the next two closest objects or clusters
• How do we define the distance to a cluster? This is called linkage
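
The merge procedure in the bullets above might be sketched as follows. This is a naive illustration under stated assumptions, not an efficient or definitive implementation: the function names `agglomerate` and `closest_pair`, the stopping rule `n_clusters`, and the sample points are all hypothetical. The linkage rule is passed in as a function so that any definition of cluster-to-cluster distance can be plugged in.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, linkage, n_clusters=1):
    """Naive agglomerative clustering.

    points:  list of coordinate tuples
    linkage: function (cluster_a, cluster_b) -> distance,
             where each cluster is a list of points
    Returns the final clusters and the distance at which each merge happened.
    """
    clusters = [[p] for p in points]  # every object starts as its own cluster
    heights = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        heights.append(linkage(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, heights

def closest_pair(a, b):
    """One possible linkage rule: distance between the closest members."""
    return min(euclidean(p, q) for p in a for q in b)

# Made-up points: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, heights = agglomerate(points, closest_pair, n_clusters=3)
print(clusters)
print(heights)
```

Recomputing every pairwise linkage distance in each pass keeps the sketch short, but it is far too slow for something the size of the 20,000-instance letters data.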

Single Linkage
• Different linkage mechanisms can produce very different clusters
• Single Linkage: the distance between a cluster and an object or another cluster is the minimal distance between any pair of elements, one taken from each. "My friend's friend is my friend"
• This leads to long and stringy clusters
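
A small sketch of why single linkage tends to chain, using made-up points on a line. The names `single_linkage` and `abs_diff` are hypothetical.

```python
def abs_diff(x, y):
    """Distance between two points on a line."""
    return abs(x - y)

def single_linkage(V, W, dist):
    """Minimal distance over all pairs (x, y) with x in V and y in W."""
    return min(dist(x, y) for x in V for y in W)

# Hypothetical 1-D clusters laid end to end.
V = [0.0, 1.0, 2.0]
W = [3.0, 4.0]

d_single = single_linkage(V, W, abs_diff)
print(d_single)  # 1.0
```

The two clusters count as only 1.0 apart because their nearest members are neighbours, even though their far ends are 4.0 apart. Repeated merges of this kind produce the long, stringy shape.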

Complete Linkage
• The distance between two clusters V and W is defined as the maximum distance between any pair of elements, one from V and one from W
• This leads to very tight, compact clusters
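
As a sketch, complete linkage replaces the minimum over pairs by the maximum. The 2-D points and the name `complete_linkage` below are made up for illustration. `math.dist` requires Python 3.8 or later.

```python
import math

def complete_linkage(V, W):
    """Maximal Euclidean distance over all pairs (x, y), x in V and y in W."""
    return max(math.dist(x, y) for x in V for y in W)

# Hypothetical 2-D clusters.
V = [(0.0, 0.0), (1.0, 0.0)]
W = [(4.0, 0.0), (4.0, 3.0)]

d_complete = complete_linkage(V, W)
print(d_complete)  # 5.0, the distance from (0, 0) to (4, 3)
```

Because the merge cost is set by the two farthest members, a merge is cheap only when every point of one cluster is near every point of the other, which is what keeps the clusters tight.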

Average Linkage
• The distance between two clusters V and W is defined as the average distance over all pairs of elements, one from each cluster:
D(V, W) = 1/(|V| |W|) Σ_{x ∈ V} Σ_{y ∈ W} |x − y|
• Cluster size is a compromise between single and complete linkage
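
A worked sketch of the average-linkage formula on made-up 1-D points, with the single- and complete-linkage values computed alongside for comparison. The name `average_linkage` is hypothetical.

```python
def average_linkage(V, W):
    """Mean of |x - y| over all |V| * |W| pairs, x in V and y in W."""
    total = sum(abs(x - y) for x in V for y in W)
    return total / (len(V) * len(W))

# Hypothetical 1-D clusters.
V = [0.0, 1.0, 2.0]
W = [3.0, 4.0]

d_avg = average_linkage(V, W)
d_min = min(abs(x - y) for x in V for y in W)  # single linkage
d_max = max(abs(x - y) for x in V for y in W)  # complete linkage
print(d_min, d_avg, d_max)  # 1.0 2.5 4.0
```

The six pairwise distances are 3, 4, 2, 3, 1 and 2, which sum to 15, so the average is 2.5. An average always lies between the minimum and the maximum, which is the sense in which average linkage is a compromise.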

Ward's Method
Distance between two clusters V and W is defined as the increase in the error sum of squares (i.e. variance) if the two clusters are merged.
ESS(X) = Σ_i |X_i − X̄|², where X̄ is the mean of the cases in cluster X
D(V, W) = ESS(V ∪ W) − (ESS(V) + ESS(W))
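
The two formulas above can be checked on a small made-up example. The function names `centroid`, `ess` and `ward_distance` and the points are hypothetical. The last lines compare the result with a standard equivalent form of the merge cost, |V||W|/(|V|+|W|) times the squared distance between the two centroids.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def ess(points):
    """Error sum of squares: squared distances of the points to their centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def ward_distance(V, W):
    """Increase in ESS caused by merging clusters V and W."""
    return ess(V + W) - (ess(V) + ess(W))

# Hypothetical 2-D clusters.
V = [(0.0, 0.0), (2.0, 0.0)]
W = [(5.0, 0.0), (5.0, 2.0)]

d_ward = ward_distance(V, W)
print(ess(V), ess(W), ess(V + W), d_ward)  # 2.0 2.0 21.0 17.0

# Equivalent form via the centroids (1, 0) and (5, 1).
cv, cw = centroid(V), centroid(W)
shortcut = len(V) * len(W) / (len(V) + len(W)) * sum((a - b) ** 2 for a, b in zip(cv, cw))
print(shortcut)  # 17.0
```

Each cluster has ESS 2 on its own, while the merged cluster has ESS 21 around its centroid (3, 0.5), so merging costs 17.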
This results in spherical clusters

KMeans
• Centers are re-calculated based on members (centroids are means or medians)
• Repeat this step until cluster assignments no longer change
• KMeans can deal with quite large data sets
• Problematic: K needs to be known
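
The re-calculate-and-repeat loop can be sketched as below. Several details are assumptions, not taken from the lecture: the initial centers are supplied by the caller, each point is assigned to its nearest center by Euclidean distance, centroids are means (not medians), and an empty cluster keeps its old center. The function name `kmeans` and the sample points are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain K-means sketch; K is the number of initial centers given."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step (assumed): each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments no longer change: stop
            break
        labels = new_labels
        # Update step: re-calculate each center as the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # assumption: an empty cluster keeps its old center
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Made-up points forming two well-separated groups; K = 2.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels, centers = kmeans(points, centers=[(0, 0), (10, 10)])
print(labels)
print(centers)
```

Each pass touches every point once per center, which is why the method scales to large data sets. The caller still has to choose K, and the result generally depends on the starting centers.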