©2009 Philip J. Ramsey, Ph.D.
K-means clustering is quite different in its approach from hierarchical clustering: it produces a set of disjoint clusters.
K-means clustering tends to have the same biases as Ward's method in terms of the characteristics of the clusters it finds.
In K-means clustering we start with a predetermined number of clusters K and assign observations to the clusters in such a way that the within-cluster variation is minimized and the between-cluster variation is maximized.
It is for this reason that Everitt and Dunn refer to this as an optimization method: the idea is to minimize the distances between observations within each cluster while maximizing the distances between the K clusters.
This method should typically only be used for larger datasets, say at least 200 observations; the solutions are highly unstable with smaller datasets.
The basic K-means cluster algorithm starts with some prespecified number of clusters, or uses an algorithm to pick an optimal number of clusters if the user has no a priori idea how many should exist in the data.
The first step is to designate a set of K seed values and assign observations to clusters based on nearness to the seeds.
Next, compute the initial K centroids from the groupings around the seed points. Points are then reassigned based on nearness to one of the centroids.
Keep reassigning observations to clusters based on nearness to the centroids. Each time a point is moved from one cluster to another, recalculate the centroids for the clusters involved in the swap.
Continue the process until no further reassignments take place.
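The steps above can be sketched in a few lines of code. The following is a minimal illustration in Python/NumPy, not JMP's implementation; the function and parameter names are mine:

```python
import numpy as np

def k_means(X, k, seeds, max_iter=100):
    """Minimal K-means sketch: assign points to the nearest centroid,
    recompute centroids, repeat until assignments stop changing."""
    centroids = np.asarray(seeds, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Distance from every observation to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no reassignments took place; stop
        labels = new_labels
        # Recalculate each centroid as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

Here the seeds double as the initial centroids, matching the algorithm's first step.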
Example: We use the JMP sample data set Cytometry.JMP. We will perform K-means clustering with K = 4 clusters specified. The variables CD3 and CD8 are used for the initial clustering attempt.
Example: In the K-Means cluster report window we are given a number of options under the Control Panel; however, the JMP defaults are recommended – see the JMP help file if you wish to change some of them. Click "Go" to have JMP find a solution, or click "Step" to see each iteration of the process.
Example: After 16 iterations JMP converges to a solution for 4 clusters.
Example: Under the hotspot at the top of the report you can choose to see a biplot that displays the variable rays and the clusters.
Example: An important question is how many clusters to begin with when there is no scientific basis for specifying the number. One good approach is to try some scatterplots and see whether any natural clustering appears. Of course this is limited to 2D or 3D plots, but it is a valid basis when no a priori information exists.

[Figure: Bivariate Fit of CD8 By CD3 – scatterplot of CD8 (0–500) versus CD3 (0–500).]

A Fit Y by X plot indicates that 4 clusters may exist. A 3D plot of CD3, CD8, and CD4 also indicated 4 clusters.
Example: A 3D plot of CD3, CD8, and MCB indicated the possibility of 5 clusters.
Example: K-means clustering was repeated with CD3, CD8, and MCB, with 5 clusters specified.
Example: Below is a 3D biplot for the five clusters and three variables.
JMP provides a fairly modern variant of K-means clustering based on Self-Organizing Maps (SOMs). K-means clustering can be thought of as a special case of the SOM methodology. JMP provides a limited but useful application of SOMs to cluster analysis.
SOMs are a data visualization technique that reduces the dimensionality of data through the use of self-organizing neural networks. SOMs reduce dimensions by producing a 2D map that plots the similarities of the data by grouping similar data items together.
The goal of a SOM is to form clusters in a particular layout on a cluster grid, such that clusters that are near each other in the SOM grid are also near each other in multivariate space.
In classical K-means clustering the arrangement of the clusters is arbitrary, but in SOMs the clusters have the grid structure. This grid structure may be of value in interpreting the clusters.
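The grid idea can be made concrete with a small sketch. This is a generic SOM in Python/NumPy, not JMP's algorithm; the Gaussian neighborhood and the `bandwidth` parameter here are my stand-ins for JMP's (undocumented) weighting, and all names are hypothetical:

```python
import numpy as np

def som_fit(X, grid_rows, grid_cols, bandwidth=0.75, epochs=10, lr=0.5):
    """Minimal self-organizing-map sketch. Cluster centers live on a
    grid_rows x grid_cols grid; each update pulls the best-matching unit
    and its grid neighbors toward a data point, weighted by grid distance
    through a Gaussian neighborhood of width 'bandwidth'."""
    rng = np.random.default_rng(0)
    n, p = X.shape
    # Grid coordinates of each unit: (0,0), (0,1), ...
    coords = np.array([(r, c) for r in range(grid_rows)
                       for c in range(grid_cols)], dtype=float)
    # Initialize centers at randomly chosen data points
    centers = X[rng.choice(n, grid_rows * grid_cols, replace=False)].astype(float)
    for _ in range(epochs):
        for x in X[rng.permutation(n)]:
            bmu = np.argmin(((centers - x) ** 2).sum(axis=1))   # best-matching unit
            gdist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)  # squared grid distance
            h = np.exp(-gdist2 / (2 * bandwidth ** 2))          # neighborhood weight
            centers += lr * h[:, None] * (x - centers)
    # Assign each observation to its nearest unit
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(axis=2), axis=1)
    return centers, labels, coords
```

Because neighboring grid units are pulled together, clusters adjacent on the grid end up adjacent in multivariate space, which is the property described above.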
Example: We continue with the Cytometry example.
The bandwidth option is a weighting function used in estimating the clusters; JMP help does not provide details on its form. However, the choice of bandwidth can affect the final clustering. Try various values of the bandwidth and see whether a more interpretable cluster analysis results.
Iterative Clustering – Control Panel
  Method: Self Organizing Map
  Standardize data by Std Dev
  Color while clustering
  Shift distances using sampling rates
  Use within-cluster std deviations
  Bandwidth: 0.75

Cluster Summary (Step 18, Criterion 0)
  Cluster   Count   Max Dist
  1         1785    2.75310404
  2          988    2.21211425
  3          815    1.77335177
  4         1412    1.30760703

Cluster Means
  Cluster   CD3          CD8
  1         314.022388   291.282325
  2         344.491485   117.459102
  3         139.366061   135.557004
  4         218.502141   108.402641

Cluster Standard Deviations
  Cluster   CD3          CD8
  1         22.1007724   25.7207367
  2         28.5748537   37.0754192
  3         33.4412448   40.2863212
  4         23.7036989   35.3287975
Example: Below is the initial SOM layout of the four cluster locations for the Cytometry example.
[Biplot: initial SOM layout of the four clusters on Prin 1 vs Prin 2, with CD3 and CD8 variable rays. Eigenvalues: 1.4604529, 0.5395471.]
Example: Below is the final SOM for the Cytometry example using a bandwidth of 0.75 (left biplot) and 0.5 (right biplot). Changing the bandwidth can significantly change the map.
[Biplots: final SOM with bandwidth 0.75 (left) and 0.50 (right), plotted on Prin 1 vs Prin 2 with CD3 and CD8 variable rays. Eigenvalues: 1.4604529, 0.5395471.]
Example: Below is the final SOM for the Cytometry example using a bandwidth of 0.90. Notice that with a high bandwidth we do not even have four clusters – there is too much smoothing.
[Biplot: final SOM with bandwidth 0.90 on Prin 1 vs Prin 2 with CD3 and CD8 variable rays; the four clusters no longer separate. Eigenvalues: 1.4604529, 0.5395471.]
Normal Mixtures is an iterative technique implemented in the K-means clustering platform in JMP, although it is not a traditional K-means clustering algorithm.
Both K-means and hierarchical clustering methods attempt to group observations into clusters. The normal mixtures approach, however, is more of an estimation method for characterizing the cluster groups.
If clusters overlap, assigning each observation to a single cluster is problematic: in the overlap areas, observations from several clusters share the same space.
Rather than classifying each observation into a cluster, the mixtures method estimates the probability that an observation belongs to each cluster, based on the assumption of a multivariate normal distribution for each cluster. The user must specify the number of clusters.
The method produces fuzzy clusters that may overlap, with assignments based on the probabilities of cluster membership.
The concept of normal mixtures is that the observed multivariate data are generated as a finite mixture of K multivariate normal distributions, with each distribution contributing some proportion p of the total number of observations N.
In cluster analysis the user specifies the number of multivariate distributions that must be estimated.
Once the parameters of the K multivariate distributions are estimated, it is possible to calculate, for each of the N observations, the probability that it came from each of the distributions.
Generally, an observation is assigned to the cluster for which it has the highest estimated probability of membership. In this sense the observations are not strictly assigned to clusters, but rather are given classifications based on the probabilities.
An algorithm estimates the distribution parameters and proportions.
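The standard algorithm for this estimation is expectation-maximization (EM); the slides do not say which algorithm JMP uses, so the following Python/NumPy sketch is a generic EM for normal mixtures, with hypothetical names throughout. The E-step computes each observation's membership probabilities; the M-step re-estimates the proportions, means, and covariances from them:

```python
import numpy as np

def mvn_pdf(X, mu, cov):
    """Multivariate normal density evaluated at the rows of X."""
    p = len(mu)
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** p * np.linalg.det(cov))
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

def em_normal_mixture(X, k, n_iter=200):
    """EM sketch for a k-component normal mixture."""
    n, p = X.shape
    # Deterministic farthest-point initialization of the k means
    idx = [0]
    for _ in range(1, k):
        d = np.min(((X[:, None] - X[idx][None]) ** 2).sum(axis=2), axis=1)
        idx.append(int(d.argmax()))
    mus = X[idx].astype(float)
    props = np.full(k, 1.0 / k)
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each cluster
        dens = np.column_stack([props[j] * mvn_pdf(X, mus[j], covs[j])
                                for j in range(k)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update proportions, means, and covariances
        nk = resp.sum(axis=0)
        props = nk / n
        mus = (resp.T @ X) / nk[:, None]
        for j in range(k):
            d = X - mus[j]
            covs[j] = (resp[:, j, None] * d).T @ d / nk[j] + 1e-6 * np.eye(p)
    return props, mus, covs, resp
```

The returned `resp` matrix holds the membership probabilities described above; taking the row-wise maximum gives the "highest probability" cluster assignment.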
To illustrate the concept of finite normal mixtures, suppose we have the bivariate case – there are two variables of interest.
These two variables take on values that come from one of two distributions (K = 2) with proportions p and (1 − p) respectively.
For each of the two distributions we have to estimate the centroid μ, the covariance matrix Σ, and the proportion (unless known in advance). In clustering we assume the proportions are unknown.
Our overall bivariate distribution is estimated from the observations as a combination of the two underlying normal distributions:

    f(x) = p · MVN(μ_A, Σ_A) + (1 − p) · MVN(μ_B, Σ_B)

For some number of distributions m > 2 and a vector of variables X, the formula generalizes to

    f(x) = Σ_{i=1}^{m} p_i · MVN(μ_i, Σ_i),   where Σ_{i=1}^{m} p_i = 1.
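The mixture formula can be evaluated directly once parameters are chosen. A small Python/NumPy sketch (the parameter values below are illustrative, not estimates from the Cytometry data):

```python
import numpy as np

def mvn_density(x, mu, cov):
    """Density of a multivariate normal at a single point x."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(cov) @ diff
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** p * np.linalg.det(cov))

def mixture_density(x, props, mus, covs):
    """f(x) = sum_i p_i * MVN(x; mu_i, Sigma_i) -- the finite-mixture formula."""
    return sum(p_i * mvn_density(x, mu_i, cov_i)
               for p_i, mu_i, cov_i in zip(props, mus, covs))
```

For example, a 50/50 mixture of two standard bivariate normals centered far apart reduces, near either center, to half that component's density, since the other component contributes essentially nothing there.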
The parameters and proportions for the distributions are estimated using the method of maximum likelihood. We will not delve into the mathematical details; a sketch of the approach is provided in the Everitt and Dunn text in the Cluster Analysis chapter.
Example: We will use the Cytometry data. We have K = 4 clusters for the two variables CD3 and CD8, so we have to estimate m = 4 bivariate normal distributions and their proportions. We will use the K-means platform's Normal Mixtures option to estimate the proportions and parameters of the distributions.
Example continued: The estimated proportions and distribution parameters are given below. JMP only provides correlation estimates.
Cluster Summary (Step 500, Criterion 3.793e-9)
  Cluster   Proportion   Mixture Count
  1         0.39396573   1905.11684
  2         0.17119194    844.329154
  3         0.34261552   1686.30176
  4         0.09222681    564.252252

Cluster Means
  Cluster   CD3          CD8
  1         218.339067   129.156625
  2         339.684674    88.7314432
  3         318.018017   306.756268
  4         120.184407    96.4038892

Cluster Standard Deviations
  Cluster   CD3          CD8
  1         52.5587037   45.1128235
  2         25.8393687   33.2087219
  3         19.8353249   20.8102507
  4         39.0601389   26.6025811
Note: StdDev slightly inflated for regularization.

Correlations for Normal Mixtures (within-cluster CD3–CD8 correlation)
  Cluster 1: 0.3751    Cluster 2: 0.0121    Cluster 3: 0.3801    Cluster 4: 0.2413
Example continued: Below is a 3D view of the mixture density.
Notice that cluster 4 does not resolve due to a low proportion.
[3D surface plot of the estimated mixture density with clusters #1–#4 labeled.]
Example continued: Below is a Fit Y by X plot of the clusters showing 90% density ellipses for each cluster.
[Figure: Bivariate Fit of CD8 By CD3 with 90% bivariate normal density ellipses for clusters 1–4.]
Example continued: Below is a partial view of the spreadsheet with the calculated probabilities of cluster membership for the observations.
It is possible to use K-means clustering with predetermined or fixed clusters. As an example, there may be known species that make up a multivariate dataset and the species are the basis for the clusters.
The estimated cluster formula could then be saved to the spreadsheet and applied to a new set of measurements where the identity of the species is not known. This takes clustering into the realm of classification (supervised learning).
We use the JMP sample dataset OwlDiet.JMP to demonstrate fixed clustering using Normal Mixtures. The data consist of measurements on seven species of Malaysian rats that are eaten by owls. There are two data sets. The first (classification) data set has 179 observations and contains measures of skull length, teeth row, palatine foramen, and jaw length for seven known species. The second (observational) set has 115 observations and no information regarding species. We will try to identify the unknown species.
Example: For the owl data we specify species as the "Centers" variable in the Cluster Analysis launch window. The number of clusters for K-means clustering is now set by "Species."
Example: The observations with no designated species will be classified into their nearest cluster. Select the “Save Clusters” option to see the classifications for the observations missing a species designation.
Cluster Summary (Step 0, Criterion 0) – cluster centers defined by species
  Cluster   Species         Proportion   Mixture Count
  1         annandalfi      0.14285714   0
  2         argentiventer   0.14285714   0
  3         exulans         0.14285714   0
  4         rajah           0.14285714   0
  5         surifer         0.14285714   0
  6         tiomanicus      0.14285714   0
  7         whiteheadi      0.14285714   0

Cluster Summary (Step 999, Criterion 3.43e-13) – cluster centers defined by species
  Cluster   Species         Proportion   Mixture Count
  1         annandalfi      0.36207334    28.4645603
  2         argentiventer   0.12937488    28.680347
  3         exulans         0.05468071    13.4915849
  4         rajah           0.17687071    52
  5         surifer         0.01698639     3.8828717
  6         tiomanicus      0.04568538     6.96155194
  7         whiteheadi      0.2143286    160.519084
Example : Notice in the partial spreadsheet view that the observations without a species are assigned a cluster (or species).
Example: If you would like the Cluster column to display species designations, change the column's data type to character (Column Info window); then, with the Cluster column highlighted, click on the Column menu, select "Recode," and recode the cluster numbers to species names.
Example: A biplot with the 7 clusters marked on the plot. Palatine foramen is not highly correlated with the other variables. The species whiteheadi (cluster 7) has a very small palatine foramen. The species exulans (cluster 3) has a small skull length, teeth row, and jaw length. Notice the considerable overlap in the clusters.

[Biplot: 7 clusters with variable rays for palatine foramen, jaw length, teeth row, and skull length; the exulans and whiteheadi clusters are labeled.]
Example : A 3D biplot with the 7 clusters marked on the plot as ellipsoids.
[Scatterplot 3D: 3D biplot on principal components Prin1, Prin2, and Prin3 with the 7 clusters shown as ellipsoids; the exulans and whiteheadi clusters are labeled.]