ClusterAnalysisPart2


Advanced Statistical Methods for Research Math 736/836

Cluster Analysis

Part 2:

K-means clustering.

Watch out for the occasional paper clip!

©2009 Philip J. Ramsey, Ph.D.


K-means clustering is quite different in its approach from hierarchical clustering. It is based on disjoint clusters.

K-means clustering tends to have the same biases as Ward’s method in terms of characteristics of the clusters.

In K-means clustering we may start with a predetermined number of clusters K and assign observations to the clusters in such a way that the within-cluster variation is minimized and the between-cluster variation is maximized.

It is for this reason that Everitt and Dunn refer to this as an optimization method. The idea is to minimize the distances between observations within each cluster while maximizing the distance between the K clusters.

This method should typically only be used for larger datasets, say at least 200 observations. The solutions are highly unstable with smaller datasets.


The basic K-means cluster algorithm is to start with some prespecified number of clusters or use an algorithm to pick an optimal number of clusters if the user has no a priori idea how many clusters should exist in the data.

The first step is to designate a set of K seed values and assign each observation to the cluster of its nearest seed value.

Next compute the initial K centroids based on the groupings from the seed points. Then points are reassigned based upon nearness to one of the centroids.

Keep reassigning observations to clusters based upon the nearness of the centroids. Each time a point is moved from one cluster to another, recalculate the centroids for the clusters involved in the swap.

Continue the process until no further reassignments take place.
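The steps above can be sketched in a few lines of Python. This is a minimal illustration, not JMP's implementation: the function names `kmeans` and `sq_dist` and the toy data are assumptions, and the sketch uses the batch variant that recomputes all centroids after a full reassignment pass rather than after each individual move.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, seeds, max_iter=100):
    """Batch K-means; K is the number of seed points supplied."""
    centroids = [tuple(s) for s in seeds]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assign every observation to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Stop when no observation changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy data (made up for illustration): two well-separated groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(data, seeds=[data[0], data[3]])
```

Passing the seeds explicitly keeps the run reproducible; different seeds can lead to different final clusters, which is one reason small datasets give unstable solutions.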


Example: We use the JMP sample data set Cytometry.JMP. We will perform K-means clustering with K = 4 clusters specified. The variables CD3 and CD8 are used for the initial clustering attempt.


Example: In the K-Means cluster report window we are given a number of options under the Control Panel; however, the JMP defaults are recommended (see the JMP help file if you wish to change some of the defaults). We can click on “Go” to have JMP find a solution, or click on “Step” to see each iteration of the process.


Example: After 16 iterations JMP converges to a solution for 4 clusters.


Example: Under the hotspot at the top of the report you can choose to see a biplot that displays the variable rays and the clusters.


Example: An important question is how many clusters to begin with when there is no scientific basis for specifying the number of clusters. One good approach is to try some scatter plots and see if any natural clustering seems to occur. Of course this is limited to 2D or 3D plots, but it is a valid basis when no a priori information exists.

[Figure: Bivariate Fit of CD8 By CD3. Scatter plot with CD3 on the horizontal axis and CD8 on the vertical axis, both running from 0 to 500.]

A Fit Y by X plot indicates that 4 clusters may exist. A 3D plot in CD3, CD8, and CD4 also indicated 4 clusters.


Example: A 3D plot of CD3, CD8, and MCB indicated the possibility of 5 clusters.


Example: K-means clustering was repeated with CD3, CD8, and MCB and 5 clusters specified.


Example: Below is a 3D biplot for the five clusters and three variables.


JMP provides a fairly modern version of K-means clustering based on Self Organizing Maps (SOMs). K-means clustering can be thought of as a special case of the SOM methodology. JMP provides a limited but useful application of SOMs to cluster analysis.

SOMs are a data visualization technique that reduces the dimensions of data through the use of self-organizing neural networks. A SOM reduces dimensions by producing a 2D map on which similar data items are grouped together.

The goal of a SOM is to form clusters in a particular layout on a cluster grid, such that points in clusters that are near each other in the SOM grid are also near each other in multivariate space.

In classical K-means clustering, the structure of the clusters is arbitrary, but in SOMs the clusters have the grid structure. This grid structure may be of value in interpreting the clusters.
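To make the grid idea concrete, here is a toy online SOM in Python. It is only a sketch under stated assumptions: the function `train_som`, the Gaussian neighbourhood, and the `bandwidth`, `rate`, and `epochs` parameters are illustrative choices, not JMP's algorithm (the form of JMP's bandwidth weighting is not given in these slides).

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_som(points, rows, cols, bandwidth=0.75, epochs=20, rate=0.5, seed=0):
    """Toy online SOM; returns {(row, col): weight vector} for a rows x cols grid."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Start each grid node at a randomly chosen observation.
    weights = {node: list(rng.choice(points)) for node in grid}
    for epoch in range(epochs):
        step = rate * (1 - epoch / epochs)  # shrink the step size over time
        for p in rng.sample(points, len(points)):  # visit points in random order
            # Best-matching unit: the node whose weight vector is closest to p.
            bmu = min(grid, key=lambda n: sq_dist(p, weights[n]))
            for node in grid:
                # Neighbourhood weight measured on the GRID, not in data space:
                # nodes next to the best-matching unit are pulled toward p too.
                h = math.exp(-sq_dist(node, bmu) / (2 * bandwidth ** 2))
                w = weights[node]
                for i, x in enumerate(p):
                    w[i] += step * h * (x - w[i])
    return weights


# Toy data (made up for illustration), mapped onto a 2 x 2 grid of clusters.
data = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9), (1.0, 8.0), (8.0, 1.0)]
som = train_som(data, rows=2, cols=2)
```

Because neighbouring grid nodes are updated together, clusters that sit next to each other on the grid tend to end up close in the data space. In this sketch, shrinking the neighbourhood so that only the best-matching unit moves gives an online form of K-means.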


Example: We continue with the Cytometry example.

The bandwidth option controls a weighting function used in estimating the clusters. JMP help does not provide details on its form. However, the choice of bandwidth can change the final clusters. Try various values of the bandwidth and see if a more interpretable cluster analysis results.


Iterative Clustering: Control Panel (Self Organizing Map)

Options shown: Standardize data by Std Dev; Color while clustering; Shift distances using sampling rates; Use within-cluster std deviations.
Bandwidth: 0.75

Cluster Summary (Step 18, Criterion 0)

Cluster   Count   Max Dist
1         1785    2.75310404
2          988    2.21211425
3          815    1.77335177
4         1412    1.30760703

Cluster Means

Cluster   CD3          CD8
1         314.022388   291.282325
2         344.491485   117.459102
3         139.366061   135.557004
4         218.502141   108.402641

Cluster Standard Deviations

Cluster   CD3          CD8
1         22.1007724   25.7207367
2         28.5748537   37.0754192
3         33.4412448   40.2863212
4         23.7036989   35.3287975


Example: Below is the initial SOM layout of the four cluster locations for the Cytometry example.

[Figure: Biplot showing the CD3 and CD8 rays and the initial positions of clusters 1 to 4, with Prin 1 on the horizontal axis. Eigenvalues: 1.4604529, 0.5395471.]

Example: Below is the final SOM for the Cytometry example using a bandwidth of 0.75 (left biplot) and 0.5 (right biplot). Changing the bandwidth can significantly change the map.

[Figure: Two biplots of the final SOM, bandwidth 0.75 (left) and 0.5 (right). Each shows the CD3 and CD8 rays and the cluster markers, with Prin 1 on the horizontal axis. Eigenvalues in both: 1.4604529, 0.5395471.]

Example: Below is the final SOM for the Cytometry example using a bandwidth of 0.90. Notice that with a high bandwidth we do not even have four clusters – there is too much smoothing.

[Figure: Biplot of the final SOM with bandwidth 0.90, showing the CD3 and CD8 rays and the cluster markers, with Prin 1 on the horizontal axis. Eigenvalues: 1.4604529, 0.5395471.]

Normal Mixtures is an iterative technique implemented in the K-means clustering platform in JMP, although it is not a traditional K-means clustering algorithm.

Both K-means and hierarchical clustering methods attempt to group observations into clusters. However, the normal mixture approach is more of an estimation method to characterize the cluster groups.

If clusters overlap, assigning each observation to one cluster is problematic. In the overlap areas, there are observations from several clusters sharing the same space.

Rather than classifying each observation into a cluster, the mixtures method estimates the probability that an observation is in each cluster based upon the assumption of a multivariate normal distribution for each cluster. The user must specify the number of clusters.

The method produces fuzzy clusters that may overlap based on the assignments using the probabilities of cluster membership.


The concept of normal mixtures is that the observed multivariate data are generated as a finite mixture of K multivariate normal distributions, with each distribution contributing some proportion p of the total number of observations N.

In cluster analysis the user specifies the number of multivariate distributions that must be estimated.

Once the parameters of the K multivariate distributions are estimated, it is possible to calculate the probability that each of the N observations came from each of the distributions.

Generally, the observation can be assigned to the cluster for which it has the highest estimated probability of belonging. In this sense the observations are not strictly assigned to clusters, but rather can be assigned classifications based on the probabilities.

An algorithm estimates the distribution parameters and proportions.
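The membership probabilities described above follow from Bayes' rule. As a sketch, write p_k for the proportion of distribution k and MVN(x; μ_k, Σ_k) for its multivariate normal density evaluated at an observation x (this notation is assumed here for illustration). The estimated probability that x came from distribution k is then:

```latex
P(k \mid x) = \frac{p_k \, \mathrm{MVN}(x; \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} p_j \, \mathrm{MVN}(x; \mu_j, \Sigma_j)}
```

The K probabilities for one observation sum to 1, and assigning the observation to the cluster with the largest value gives the classification mentioned above.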


To illustrate the concept of finite normal mixtures, suppose we have the bivariate case – there are two variables of interest.

These two variables take on values that come from one of two distributions ( K = 2) with proportions p and ( 1-p ) respectively.

For each of the two distributions we have to estimate the centroid μ, the covariance matrix Σ, and the proportion (unless known in advance). In clustering we assume the proportions are unknown.

Our overall bivariate distribution is estimated from the observations as a combination of the two underlying normal distributions.

$$f(x_1, x_2) = p \, \mathrm{MVN}(\mu_A, \Sigma_A) + (1 - p) \, \mathrm{MVN}(\mu_B, \Sigma_B)$$

For some number of distributions m > 2 and a vector of variables X, the formula generalizes to

$$f(X) = \sum_{i=1}^{m} p_i \, \mathrm{MVN}(\mu_i, \Sigma_i)$$


The parameters and proportions for the distributions are estimated using the method of maximum likelihood. We will not delve into the mathematical details; however, a sketch of the approach is provided in the Cluster Analysis chapter of the Everitt and Dunn text.
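The slides do not name the algorithm used for the maximum likelihood fit. One common choice for normal mixtures is the EM (expectation-maximization) algorithm, and the sketch below assumes it. To stay short it handles only the univariate case; the function names, starting values, and toy data are illustrative assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def em_mixture(xs, means, sds, props, n_iter=50):
    """EM for a univariate normal mixture, starting from the given values."""
    k = len(means)
    means, sds, props = list(means), list(sds), list(props)
    for _ in range(n_iter):
        # E-step: probability that each observation came from each component.
        resp = []
        for x in xs:
            w = [props[j] * normal_pdf(x, means[j], sds[j]) for j in range(k)]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: probability-weighted maximum likelihood updates.
        for j in range(k):
            nj = sum(r[j] for r in resp)  # effective number of observations
            props[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), 1e-6)  # floor guards against collapse
    return means, sds, props


# Toy data (made up for illustration): two groups centered near 1 and 5.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]
est_means, est_sds, est_props = em_mixture(
    xs, means=[0.0, 6.0], sds=[1.0, 1.0], props=[0.5, 0.5]
)
```

The multivariate case follows the same two steps, with mean vectors and covariance matrices in place of the scalar means and standard deviations.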

Example: We will use the Cytometry data. We have K = 4 clusters for two variables CD3 and CD8, so we have to estimate m = 4 bivariate normal distributions and proportions. We will use the K-means, Normal Mixtures option to estimate the proportions and parameters of the distributions.


Example continued: The estimated proportions and distribution parameters are given below. JMP only provides correlation estimates.

Cluster Summary (Step 500, Criterion 3.793e-9)

Cluster   Proportion   Mixture Count
1         0.39396573   1905.11684
2         0.17119194   844.329154
3         0.34261552   1686.30176
4         0.09222681   564.252252

Cluster Means

Cluster   CD3          CD8
1         218.339067   129.156625
2         339.684674   88.7314432
3         318.018017   306.756268
4         120.184407   96.4038892

Cluster Standard Deviations

Cluster   CD3          CD8
1         52.5587037   45.1128235
2         25.8393687   33.2087219
3         19.8353249   20.8102507
4         39.0601389   26.6025811

Note: StdDev slightly inflated for regularization.


Correlations for Normal Mixtures

Cluster   Correlation of CD3 and CD8
1         0.3751
2         0.0121
3         0.3801
4         0.2413
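The reported proportions, means, standard deviations, and correlations are enough to compute cluster membership probabilities for any observation. The sketch below copies the estimates from the JMP output above. The function names and the test point are illustrative assumptions, and because JMP notes that the standard deviations are slightly inflated, the results need not match JMP's saved probabilities exactly.

```python
import math

# Estimates copied from the JMP output above (Cytometry data, K = 4).
proportions = [0.39396573, 0.17119194, 0.34261552, 0.09222681]
means = [(218.339067, 129.156625), (339.684674, 88.7314432),
         (318.018017, 306.756268), (120.184407, 96.4038892)]
std_devs = [(52.5587037, 45.1128235), (25.8393687, 33.2087219),
            (19.8353249, 20.8102507), (39.0601389, 26.6025811)]
correlations = [0.3751, 0.0121, 0.3801, 0.2413]


def bivariate_normal_pdf(x, y, mean, sd, rho):
    """Bivariate normal density from means, std devs, and correlation."""
    zx = (x - mean[0]) / sd[0]
    zy = (y - mean[1]) / sd[1]
    q = (zx * zx - 2 * rho * zx * zy + zy * zy) / (1 - rho * rho)
    norm = 2 * math.pi * sd[0] * sd[1] * math.sqrt(1 - rho * rho)
    return math.exp(-0.5 * q) / norm


def membership_probabilities(cd3, cd8):
    """Probability of each cluster for one (CD3, CD8) observation."""
    weighted = [
        p * bivariate_normal_pdf(cd3, cd8, m, s, r)
        for p, m, s, r in zip(proportions, means, std_devs, correlations)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


# A covariance can be recovered as correlation * sd(CD3) * sd(CD8).
cov_cluster1 = correlations[0] * std_devs[0][0] * std_devs[0][1]

# Hypothetical observation chosen near the cluster 3 mean.
probs = membership_probabilities(318.0, 307.0)
best_cluster = probs.index(max(probs)) + 1
```

An observation lying between two cluster means would instead receive appreciable probability for both clusters, which is the sense in which the clusters are fuzzy.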


Example continued: Below is a 3D view of the mixture density.

Notice that cluster 4 does not resolve due to a low proportion.

[Figure: 3D view of the mixture density, with regions labeled #1, #2, #3, and #4.]

Example continued: Below is a Fit Y by X plot of the clusters showing a density ellipse for each cluster. The plot legend lists the ellipses at P = 0.900, that is, 90% density ellipses.

[Figure: Bivariate Fit of CD8 By CD3 with clusters 1 to 4 marked. CD3 on the horizontal axis and CD8 on the vertical axis, both running from 0 to 500. Legend: Bivariate Normal Ellipse P=0.900 for Cluster==1, Cluster==2, Cluster==3, and Cluster==4.]

Example continued: Below is a partial view of the spreadsheet with the calculated probabilities of cluster membership for the observations.


It is possible to use K-means clustering with predetermined or fixed clusters. As an example, there may be known species that make up a multivariate dataset and the species are the basis for the clusters.

The estimated cluster formula could then be saved to the spreadsheet and then applied to a new set of measurements where the identity of the species is not known. This takes clustering into the realm of classification (supervised learning).

We use the JMP sample dataset OwlDiet.JMP to demonstrate fixed clustering using Normal Mixtures. The data consist of measurements on seven species of Malaysian rats that are eaten by owls. There are two data sets. The first (classification) data set has 179 observations and contains measures of skull length, teeth row, palatine foramen, and jaw width for seven known species. The second (observational) set has 115 observations and no information regarding species. We will try to identify the unknown species.
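The idea of fixed clusters can be sketched as a two-step procedure: estimate each cluster from the labeled observations, then assign unlabeled observations to the nearest cluster. The sketch below is a deliberate simplification that uses cluster centroids and Euclidean distance rather than JMP's Normal Mixtures fit. The function names, species labels, and measurements are all made up for illustration and are not taken from OwlDiet.JMP.

```python
from collections import defaultdict


def fit_centroids(rows):
    """rows: list of (label, features); returns {label: centroid}."""
    groups = defaultdict(list)
    for label, feats in rows:
        groups[label].append(feats)
    return {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in groups.items()
    }


def classify(feats, centroids):
    """Return the label whose centroid is nearest to feats."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(feats, centroids[lab])),
    )


# Hypothetical labeled measurements: (species, (skull length, jaw width)).
known = [
    ("species_a", (30.0, 5.0)), ("species_a", (31.0, 5.2)),
    ("species_b", (40.0, 7.0)), ("species_b", (41.0, 7.4)),
]
centroids = fit_centroids(known)

# Hypothetical observations with no species recorded.
unknown = [(30.5, 5.1), (39.0, 6.9)]
predicted = [classify(obs, centroids) for obs in unknown]
```

A mixture-based version would replace the distance comparison with the estimated membership probability for each species, which also accounts for differences in cluster spread and proportion.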


Example: For the owl data we specify species as the “Centers” variable in the Cluster Analysis launch window. The number of clusters for K-means clustering is now set by “Species.”


Example: The observations with no designated species will be classified into their nearest cluster. Select the “Save Clusters” option to see the classifications for the observations missing a species designation.

Cluster Summary (Step 0, Criterion 0). Cluster centers defined by species.

Cluster   Species         Proportion   Mixture Count
1         annandalfi      0.14285714   0
2         argentiventer   0.14285714   0
3         exulans         0.14285714   0
4         rajah           0.14285714   0
5         surifer         0.14285714   0
6         tiomanicus      0.14285714   0
7         whiteheadi      0.14285714   0

Cluster Summary (Step 999, Criterion 3.43e-13). Cluster centers defined by species.

Cluster   Species         Proportion   Mixture Count
1         annandalfi      0.36207334   28.4645603
2         argentiventer   0.12937488   28.680347
3         exulans         0.05468071   13.4915849
4         rajah           0.17687071   52
5         surifer         0.01698639   3.8828717
6         tiomanicus      0.04568538   6.96155194
7         whiteheadi      0.2143286    160.519084

Example: Notice in the partial spreadsheet view that the observations without a species are assigned a cluster (or species).


Example: If you would like the Cluster column to display species designations, change the column data type to character (Column Info window). Then, while the Cluster column is highlighted, click on the Column menu and select “recode.” Now recode the cluster numbers to species.


Example: A biplot with the 7 clusters marked on the plot. Palatine foramen is not highly correlated with the other variables. The species whiteheadi (cluster 7) has a very small palatine foramen. The species exulans (cluster 3) has a small skull length, teeth row, and jaw. Notice the considerable overlap in the clusters.

[Figure: Biplot with the exulans and whiteheadi clusters labeled and variable rays for palatine foramen, jaw length, teeth row, and skull length.]

Example: A 3D biplot with the 7 clusters marked on the plot as ellipsoids.

[Figure: Scatterplot 3D of the principal components Prin1, Prin2, and Prin3, with the exulans and whiteheadi clusters labeled.]