Cluster Analysis
Grouping Cases or Variables

Clustering Cases
• Goal is to cluster cases into groups based on shared characteristics.
• Start out with each case being a one-case cluster.
• The clusters are located in k-dimensional space, where k is the number of variables.
• Compute the squared Euclidean distance between each case and each other case.

Squared Euclidean Distance
$$\sum_{i=1}^{v} (X_i - Y_i)^2$$
• The sum across variables (from i = 1 to v) of the squared difference between the score on variable i for the one case (Xi) and the score on variable i for the other case (Yi).

Agglomerate
• The two cases closest to each other are agglomerated into a cluster.
• The distances between entities (clusters and cases) are recomputed.
• The two entities closest to each other are agglomerated.
• This continues until all cases end up in one cluster.

What is the Correct Solution?
• You may have theoretical reasons to expect a certain k cluster solution.
• Look at that solution and see if it matches your expectations.
• Alternatively, you may try to make sense out of solutions at two or more levels of the analysis.

Faculty Salaries
• Subjects were faculty in Psychology at ECU.
• Variables were rank, experience, number of publications, course load, and salary.
• Data are at ClusterAnonFaculty.sav
• Also see the statistical output

Analyze, Classify, Hierarchical Cluster
• Dialogs shown: Statistics, Plots, Method, Save

Proximity Matrix
• We did not request this, but if we had it would display a measure of dissimilarity for each pair of entities.
• The pair of cases with the smallest squared Euclidean distance is clustered first.

Look at the Agglomeration Schedule
• Cases 32 and 33 are clustered.
• They are very similar (distance = 0.000).

Stage   Cluster 1   Cluster 2   Coefficients
1       32          33          .000

Steps 2 Through 5

Agglomeration Schedule
        Cluster Combined                       Stage Cluster First Appears
Stage   Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2         Next Stage
1       32          33          .000           0           0                 9
2       41          42          .000           0           0                 6
3       43          44          .000           0           0                 6
4       37          38          .000           0           0                 5
5       37          39          .001           4           0                 7
6       41          43          .002           2           3                 27

Stages 2-5
• The agglomeration schedule shows that in Stage 2 cases 41 and 42 are clustered.
• In Stage 3 cases 43 and 44 are clustered.
• In Stage 4 cases 37 and 38 are clustered.
• In Stage 5 case 39 is added to the cluster that contains cases 37 and 38.
• And so on.

Vertical Icicle, Two Clusters
• Look at the top of the display (next slide).
• You can see two clusters
  – On the left, Boris through Willy
  – On the right, Deanna through Sunila
• The 2 cluster solution was adjuncts versus full time faculty.

Vertical Icicle, Three Clusters
• Look at the second highest white bar in the icicle plot.
• Now there are three clusters
  – Adjuncts
  – Junior faculty (Deanna through Mickey)
  – Senior faculty (Lawrence through Roslyn)

Vertical Icicle, Four Clusters
• Look at the white bar furthest to the right.
• Now there are four clusters
  – Adjuncts
  – Junior faculty
  – The acting chair (Lawrence)
  – The rest of the senior faculty (Catalina through Roslyn)

The Dendrogram
• At the far right you can see the two cluster solution.
• The next step to the left shows the three cluster solution.
• The next step to the left shows the four cluster solution.
• And so on.
• Truncated and rotated dendrogram on next slide.

Compare Two Clusters
• The 2 cluster solution was adjuncts versus everybody else.
• Look at the t tests in the output.
• Adjuncts had lower rank, experience, number of publications, course load, and salary.

Compare Three Clusters
• Look at the ANOVAs and plots.
• The senior faculty had higher salary, experience, rank, and number of publications.
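
The agglomeration procedure, the schedule it produces, and the k cluster solutions read from it can be sketched in a few lines of Python. This is only an illustration, not the SPSS routine: the five two-variable cases are made up, case numbers start at 0, and the distance between clusters is taken to be the mean squared Euclidean distance over all between-cluster pairs of cases (average linkage), which is an assumption because the slides do not name the linkage method. The function names `squared_euclidean`, `agglomerate`, and `cut` are hypothetical.

```python
from itertools import combinations


def squared_euclidean(x, y):
    """Sum across variables of the squared difference between two cases."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def agglomerate(cases):
    """Return an agglomeration schedule: (stage, cluster_a, cluster_b, coefficient).

    A cluster is labelled by its lowest case number. Between-cluster distance
    is the mean squared Euclidean distance over all cross-cluster pairs
    (average linkage; an assumption of this sketch).
    """
    clusters = {i: [i] for i in range(len(cases))}  # each case starts alone
    schedule = []
    stage = 0
    while len(clusters) > 1:
        stage += 1
        best = None
        for a, b in combinations(sorted(clusters), 2):
            pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
            dist = sum(squared_euclidean(cases[p], cases[q]) for p, q in pairs) / len(pairs)
            if best is None or dist < best[2]:
                best = (a, b, dist)
        a, b, dist = best
        clusters[a] = clusters[a] + clusters.pop(b)  # merge b into a
        schedule.append((stage, a, b, dist))
    return schedule


def cut(schedule, n_cases, k):
    """Replay the first n_cases - k merges to recover the k cluster solution."""
    clusters = {i: [i] for i in range(n_cases)}
    for _, a, b, _ in schedule[: n_cases - k]:
        clusters[a] = clusters[a] + clusters.pop(b)
    return sorted(sorted(members) for members in clusters.values())


# Made-up scores on two variables for five cases (not the faculty data).
cases = [(1.0, 1.0), (1.0, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 2.0)]
schedule = agglomerate(cases)
for row in schedule:
    print(row)
print(cut(schedule, len(cases), 2))  # two cluster solution
print(cut(schedule, len(cases), 3))  # three cluster solution
```

As in the schedule above, identical cases (here cases 0 and 1) merge first with a coefficient of zero, and each later stage joins the two closest remaining entities.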

Compare Four Clusters
• The acting chair had a higher salary and number of publications.

I Could Not Help Myself
• With these data on hand, I could not resist predicting salary from the other variables.
• Salary was well correlated with Rank, FTEs, Publications, and Experience.
• In the multiple regression, only Rank and FTEs had significant unique effects.
• The residuals suggest who was being overpaid and who underpaid.

Split by Sex
• For men, the unique effect of number of publications was positive – more publications, higher salary.
• For women it was negative – more publications, lower salary.
• Curious.

Workaholism
• Aziz & Zickar (2005)
• Workaholics may be defined as those
  – High in work involvement,
  – High in drive to work, and
  – Low in work enjoyment.
• For each case, a score was obtained for each of these three dimensions.

The Three Cluster Solution
• Workaholics
  – High work involvement
  – High drive to work
  – Low work enjoyment
• Positively engaged workers
  – High work involvement
  – Medium drive to work
  – High work enjoyment
• Unengaged workers
  – Low work involvement
  – Low drive to work
  – Low work enjoyment
• Past research/theory indicated there should be six clusters, but the theorized six clusters were not obtained.

Clustering Variables
• FactBeer.sav
• The statistical output.
• Analyze, Classify, Hierarchical Cluster
• Dialogs shown: Statistics, Plots, Method

Proximity Matrix
• Is simply the intercorrelation matrix
• The two most correlated variables are Color and Aroma (r = .909) – they are clustered on the first step.
• Stage 2: Size and Alcohol (r = .904) are clustered.
• Stage 3: Taste added to the cluster that already contains Color and Aroma

Also See Other Tables & Plots
• Stage 4: Cost added to the cluster that already contains Size and Alcohol.
• Stage 5: The two clusters are combined
  – But they are not very similar (similarity coefficient = .038)
  – Now we have one cluster with six variables and one with one (Reputation)
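
Clustering variables follows the same agglomerative steps, except that the proximity matrix holds correlations (similarities) and the most similar pair is joined at each stage. The sketch below illustrates this under stated assumptions: the scores are made up (they are not the FactBeer.sav data), the variable names A through D are placeholders, the signed Pearson r is used as the similarity, and similarity between clusters is the mean of the between-cluster correlations (average linkage), which the slides do not specify. The function names `pearson_r` and `cluster_variables` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cluster_variables(data):
    """Return a schedule of (stage, cluster_a, cluster_b, similarity).

    data maps a variable name to its list of scores. The proximity matrix is
    the intercorrelation matrix; cluster similarity is the mean r across
    between-cluster pairs of variables (an assumption of this sketch).
    """
    r = {frozenset(pair): pearson_r(data[pair[0]], data[pair[1]])
         for pair in combinations(list(data), 2)}

    def similarity(pair):
        a, b = pair
        values = [r[frozenset((u, v))] for u in a for v in b]
        return sum(values) / len(values)

    clusters = [(name,) for name in data]  # each variable starts alone
    schedule = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=similarity)
        schedule.append((len(schedule) + 1, a, b, similarity((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return schedule


# Made-up scores for four placeholder variables across five cases.
data = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, 2, 3, 5, 4],
    "C": [5, 1, 4, 2, 3],
    "D": [4, 2, 5, 1, 3],
}
schedule = cluster_variables(data)
for stage, a, b, sim in schedule:
    print(stage, a, b, round(sim, 3))
```

With these toy scores the two highly correlated pairs are joined first, and the final stage combines two clusters whose similarity is low, which mirrors the pattern described in Stage 5 above.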