Binary Classification

Notes on Cluster Analysis

Overview
• Explanation of Observation Clustering (versus Variable Clustering)
• Types of Clustering
• Preparing Data for Clustering
• Assessing Clustering Results
• Demonstrations
  – Fisher's Iris Data
  – Credit Data

STAT4330/8330 PRIESTLEY

Cluster Analysis
A few concepts:
1. Clustering is UNSUPERVISED.
2. It is used for segmentation of observations (versus variable clustering, which is used for variable and redundancy reduction).
3. You will ALWAYS find a mathematical solution, which may or may not be meaningful.
4. The solution may not be generalizable.

Cluster Analysis
Generally, we are trying to partition data into groups that are as internally similar as possible and as dissimilar as possible to other groups.

Cluster Analysis
Types of questions answered by cluster analysis:
1. Are there characteristics of patients that define the state of a disease?
2. What sorts of complaints are most common in a call center?
3. What kinds of cars are people buying? And can I predict this?
4. Can I use credit attributes to create profitability segments?

Cluster Analysis
Types of Clustering:
1. Hierarchical
   a. Agglomerative
      i. Each obs starts in its own cluster
      ii. Merge the clusters that are the most similar
      iii. Repeat
   b. Divisive
      i. All obs start in a single cluster
      ii. Partition the observations that are the least similar into a second cluster
      iii. Repeat
2. Partitive (k-means)

Hierarchical Clustering
[Figure: hierarchical clustering over iterations 1 to 4, showing the agglomerative and divisive directions]

Partitive Clustering
[Figure: observations and "seeds" in the initial state and the final state; each seed moves from its old location to a new location]

Cluster Analysis
While there are pros and cons of each technique, Partitive (k-means) clustering is generally preferred with large datasets. However, Partitive clustering also:
1. Requires that you estimate the number of clusters present in the data (trial and error).
2. Is heavily influenced by selection of the seed, so outliers need to be "controlled".
3. Is inappropriate for small datasets, because the solution becomes sensitive to the order in which the data is read.

Cluster Analysis
All of the clustering techniques depend upon measurements of similarity or "distance" to assign an observation to a cluster. While there are many types of distance measurements, consider the common Euclidean Distance measurement.

Euclidean Distance
Similarity Metric, for two points x and w in d dimensions:

$D_E = \sqrt{\sum_{i=1}^{d} (x_i - w_i)^2}$

• Pythagorean Theorem: The square of the hypotenuse is equal to the sum of the squares of the other two sides.

$h^2 = \sum_{i=1}^{2} x_i^2$

[Figure: point (x_1, x_2) plotted against the x_1 and x_2 axes, with h the distance from (0, 0)]

Euclidean Distance
Consider the impact on distance between RBAL and Age in their original units.

Cluster Analysis
To control the impact of scale, standardization of the variables is recommended. PROC STDIZE provides a wide range of standardization options, including:
• Mean
• Median
• Sum
• Euclidean Length
• Standard Deviation (Z)
• Range
• MidRange (Range/2)
• MaxABS: Max Absolute Value
• IQR
• MAD
• AHUBER: Huber estimate
• AWAVE: Wave estimate
• L(p): Minkowski distances

Cluster Analysis
Again, each of these options has pros and cons. I encourage you to test multiple options, much like we did with the imputation options. For the current exercises, we will use the Range option and the Median option. Note that the Std (Z) option is highly sensitive to outliers.

Determining Optimal Number of Clusters
We will use three basic metrics to determine the optimal number of clusters:
1. Cubic Clustering Criterion (CCC)
2. Pseudo-F Statistic (PSF)
3. Pseudo-T² Statistic (PST)

FISHER'S IRIS DATA
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features of the flowers were measured from each sample: the length and the width of the sepals and the petals. Based on the combination of these four features, Fisher developed a discriminant model to distinguish the species from each other.

Clustering Fisher's Data
Step 1: Standardize the Data
Step 2: Proc Cluster (hierarchical)
Step 3: Analyze the CCC, PST and PSF to determine the optimal number of clusters
Step 4: Use the information from Step 3 and run Proc Fastclus (k-means)

Credit Data
Step 1: Standardize the Data (use ~ 10 variables)
Step 2: Proc Cluster (hierarchical)
Step 3: Analyze the CCC, PST and PSF to determine the optimal number of clusters
Step 4: Use the information from Step 3 and run Proc Fastclus (k-means)
Step 5: Determine the profitability by cluster
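
The standardize-then-cluster steps above can be illustrated with a small sketch. It is written in Python using only the standard library, not in SAS, so it is a rough analogue of the Range standardization option and of a k-means pass rather than a reproduction of PROC STDIZE or PROC FASTCLUS. The function names (`euclidean`, `range_standardize`, `kmeans`), the choice of seeds, and the six (balance, age) rows are all assumptions made up for illustration; they are not the credit data. The hierarchical step and the CCC, PSF and PST diagnostics are not covered here.

```python
import math


def euclidean(x, w):
    """Euclidean distance between two equal-length points x and w."""
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))


def range_standardize(rows):
    """Rescale every column to [0, 1] using (value - min) / (max - min)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # Guard against constant columns, whose range would be zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def kmeans(rows, seeds, max_iter=100):
    """Basic k-means: assign rows to the nearest center, then move each
    center to the mean of its members, until assignments stop changing."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: euclidean(row, centers[k]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for k in range(len(centers)):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # leave an empty cluster's center where it is
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers


# Made-up (balance, age) rows for illustration only.
raw = [
    [500.0, 25.0], [800.0, 30.0], [650.0, 28.0],
    [9000.0, 60.0], [9500.0, 65.0], [8700.0, 58.0],
]

# In original units, balance dominates the distance between rows 0 and 1.
raw_distance = euclidean(raw[0], raw[1])

# After range standardization both variables contribute on the same scale.
scaled = range_standardize(raw)
scaled_distance = euclidean(scaled[0], scaled[1])

# Seeds are picked by hand here (rows 0 and 3); a different choice of
# seeds can lead to a different solution.
labels, centers = kmeans(scaled, seeds=[scaled[0], scaled[3]])

print("raw distance:", raw_distance)
print("scaled distance:", scaled_distance)
print("cluster labels:", labels)
```

In the raw units the 300-unit gap in balance swamps the 5-year gap in age, while after rescaling the age difference is the larger contributor. This is the scale effect that motivates standardizing before clustering.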