Cluster Analysis
Grouping Cases or Variables
Clustering Cases
• Goal is to cluster cases into groups based
on shared characteristics.
• Start out with each case being a one-case
cluster.
• The clusters are located in k-dimensional
space, where k is the number of variables.
• Compute the squared Euclidean distance
between each case and each other case.
Squared Euclidean Distance
$$\sum_{i=1}^{v} (X_i - Y_i)^2$$
• the sum across variables (from i = 1 to v)
of the squared difference between the
score on variable i for the one case (Xi)
and the score on variable i for the other
case (Yi)
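A minimal sketch of this distance in Python. The function name and the two example cases are hypothetical, chosen only to illustrate the sum.

```python
def squared_euclidean(x, y):
    """Sum over variables of the squared difference between two cases' scores."""
    if len(x) != len(y):
        raise ValueError("both cases need a score on every variable")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

# Two hypothetical cases measured on v = 3 variables.
case_a = [1.0, 2.0, 3.0]
case_b = [2.0, 4.0, 6.0]

# (1 - 2)^2 + (2 - 4)^2 + (3 - 6)^2 = 1 + 4 + 9 = 14
d = squared_euclidean(case_a, case_b)
```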
Agglomerate
• The two cases closest to each other are
agglomerated into a cluster.
• The distances between entities (clusters
and cases) are recomputed.
• The two entities closest to each other are
agglomerated.
• This continues until all cases end up in
one cluster.
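The agglomeration steps above can be sketched in plain Python. The slides do not say how the distance between two multi-case clusters is defined, so this sketch assumes between-groups average linkage (the mean squared Euclidean distance over all cross-cluster pairs of cases). The data and function names are hypothetical.

```python
from itertools import combinations

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def agglomerate(cases):
    """Return the merge schedule as (members_1, members_2, distance) tuples.

    Assumption: the distance between two clusters is the average squared
    Euclidean distance over all cross-cluster pairs of cases.
    """
    clusters = [[i] for i in range(len(cases))]  # every case starts alone
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            d = sum(sq_dist(cases[i], cases[j]) for i, j in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        schedule.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Drop the two merged entities and add the new cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return schedule

# Hypothetical data: four cases on two variables.
data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
schedule = agglomerate(data)
for stage, (c1, c2, d) in enumerate(schedule, start=1):
    print(stage, c1, c2, d)
```

With n cases the loop always runs n - 1 times, which is why the schedule ends with every case in one cluster.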
What is the Correct Solution?
• You may have theoretical reasons to
expect a certain k cluster solution.
• Look at that solution and see if it matches
your expectations.
• Alternatively, you may try to make sense
out of solutions at two or more levels of
the analysis.
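One way to inspect solutions at several levels is to build the tree once and cut it at different numbers of clusters. The sketch below uses SciPy's `linkage` and `fcluster`. The data are hypothetical, and average linkage on squared Euclidean distances is an assumed choice, not necessarily the one used for the analyses in these slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical scores: six cases on two variables, in three loose groups.
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9],
              [9.0, 0.0], [9.2, 0.3]])

# Build the full hierarchy once (assumed linkage method and metric).
Z = linkage(X, method="average", metric="sqeuclidean")

# Cut the same tree at two levels and compare the memberships.
two = fcluster(Z, t=2, criterion="maxclust")
three = fcluster(Z, t=3, criterion="maxclust")
print("two clusters:  ", two)
print("three clusters:", three)
```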
Faculty Salaries
• Subjects were faculty in Psychology at
ECU.
• Variables were rank, experience, number
of publications, course load, and salary.
• Data are at ClusterAnonFaculty.sav
• Also see the statistical output
Analyze, Classify, Hierarchical Cluster
Statistics
Plots
Method
Save
Proximity Matrix
• We did not request this, but if we had it
would display a measure of dissimilarity
for each pair of entities.
• The pair of cases with the smallest
squared Euclidean distance are clustered.
Look at the Agglomeration Schedule.
Cases 32 and 33 are clustered. They
are very similar (distance = 0.000)
Stage | Cluster 1 | Cluster 2 | Coefficient
1 | 32 | 33 | .000
Steps 2 Through 5
Agglomeration Schedule
Stage | Cluster Combined: Cluster 1 | Cluster Combined: Cluster 2 | Coefficient | Stage Cluster First Appears: Cluster 1 | Stage Cluster First Appears: Cluster 2 | Next Stage
1 | 32 | 33 | .000 | 0 | 0 | 9
2 | 41 | 42 | .000 | 0 | 0 | 6
3 | 43 | 44 | .000 | 0 | 0 | 6
4 | 37 | 38 | .000 | 0 | 0 | 5
5 | 37 | 39 | .001 | 4 | 0 | 7
6 | 41 | 43 | .002 | 2 | 3 | 27
Stages 2-5
• The agglomeration schedule shows that in
Stage 2 cases 41 and 42 are clustered.
• In Stage 3 cases 43 and 44 are clustered.
• In Stage 4 cases 37 and 38 are clustered.
• In Stage 5 case 39 is added to the cluster
that contains cases 37 and 38.
• And so on.
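The "Stage Cluster First Appears" columns of the schedule can be derived from the merges alone. The sketch below uses the Stage 1 through 6 merges from the schedule. It assumes a merged cluster keeps the label shown in the Cluster 1 column, and the function name is hypothetical.

```python
# Merges for Stages 1-6, copied from the Agglomeration Schedule.
merges = [(32, 33), (41, 42), (43, 44), (37, 38), (37, 39), (41, 43)]

def first_appears(merges):
    """For each stage, the earlier stage that formed each entity (0 = single case)."""
    formed_at = {}  # cluster label -> stage at which that cluster was last formed
    rows = []
    for stage, (c1, c2) in enumerate(merges, start=1):
        rows.append((stage, formed_at.get(c1, 0), formed_at.get(c2, 0)))
        formed_at[c1] = stage      # assumption: merged cluster keeps Cluster 1's label
        formed_at.pop(c2, None)    # Cluster 2's label is absorbed
    return rows

rows = first_appears(merges)
for row in rows:
    print(row)
```

Stage 5 comes out as (5, 4, 0): cluster 37 was formed at Stage 4 and case 39 is still a single case, which matches the bullet about case 39 joining cases 37 and 38.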
Vertical Icicle, Two Clusters
• Look at the top of the display (next slide).
• You can see two clusters
– On the left Boris through Willy
– On the right, Deanna through Sunila
• The two-cluster solution was adjuncts versus
full-time faculty.
Vertical Icicle, Three Clusters
• Look at the second-highest white bar in
the icicle plot.
• Now there are three clusters
– Adjuncts
– Junior faculty (Deanna through Mickey)
– Senior faculty (Lawrence through Roslyn)
Vertical Icicle, Four Clusters
• Look at the white bar furthest to the right.
• Now there are four clusters
– Adjuncts
– Junior faculty
– The acting chair (Lawrence)
– The rest of the senior faculty (Catalina
through Roslyn)
The Dendrogram
• At the far right you can see the two cluster
solution.
• The next step to the left shows the three
cluster solution.
• The next step to the left shows the four
cluster solution.
• And so on.
• Truncated and rotated dendrogram on next
slide.
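A dendrogram with the final merge at the far right can be sketched with SciPy. The cases and labels below are hypothetical placeholders, and the linkage settings are assumed. `no_plot=True` returns the layout without drawing. Removing it, with matplotlib installed, draws the tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical cases on two variables; labels are placeholders.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
labels = ["case1", "case2", "case3", "case4", "case5"]

# Assumed settings: average linkage on squared Euclidean distances.
Z = linkage(X, method="average", metric="sqeuclidean")

# orientation="right" puts the root (the last merge) at the right edge.
info = dendrogram(Z, labels=labels, orientation="right", no_plot=True)
leaf_order = info["ivl"]  # case labels in the order they appear along the tree
print(leaf_order)
```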
Compare Two Clusters
• The 2 cluster solution was adjuncts versus
everybody else.
• Look at the t tests in the output
• Adjuncts had lower rank, experience,
number of publications, course load, and
salary.
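A comparison like this can be sketched with an independent-samples t test, one variable at a time. The salaries below are made up for illustration (in thousands of dollars) and are not the faculty data.

```python
from scipy.stats import ttest_ind

# Hypothetical salaries (thousands of dollars) for the two clusters.
adjunct = [10.0, 12.0, 11.0, 9.0, 13.0]
full_time = [50.0, 55.0, 60.0, 52.0, 58.0]

# Pooled-variance t test (SciPy's default, equal_var=True).
result = ttest_ind(adjunct, full_time)
print(result.statistic, result.pvalue)
```

A negative t here means the first group (adjuncts) has the lower mean. The same call would be repeated for rank, experience, publications, and course load.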
Compare Three Clusters
• Look at the ANOVAs and plots.
• The senior faculty had higher salary,
experience, rank, and number of pubs.
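With three clusters, a one-way ANOVA per variable plays the role the t test played for two. A minimal sketch on made-up salaries (thousands of dollars), not the faculty data:

```python
from scipy.stats import f_oneway

# Hypothetical salaries (thousands of dollars) for the three clusters.
adjuncts = [10.0, 12.0, 11.0]
junior = [40.0, 42.0, 44.0]
senior = [70.0, 75.0, 80.0]

anova = f_oneway(adjuncts, junior, senior)
print(anova.statistic, anova.pvalue)
```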
Compare Four Clusters
• The acting chair had a higher salary and
number of publications.
I Could Not Help Myself
• With these data on hand, I could not resist
predicting salary from the other variables.
• Salary was well correlated with Rank,
FTEs, Publications, and Experience.
• In the multiple regression, only Rank and
FTEs had significant unique effects.
• The residuals suggest who was being
overpaid and who underpaid.
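The residual idea can be sketched with ordinary least squares in NumPy. The numbers are hypothetical and only two predictors are used, so this illustrates the logic rather than reproducing the faculty analysis.

```python
import numpy as np

# Hypothetical data: columns are rank and FTEs; salary is in thousands of dollars.
predictors = np.array([[1, 0.5], [1, 1.0], [2, 1.0],
                       [3, 1.0], [3, 0.75], [2, 0.5]])
salary = np.array([20.0, 38.0, 52.0, 70.0, 55.0, 33.0])

# Add an intercept column and fit by ordinary least squares.
design = np.column_stack([np.ones(len(salary)), predictors])
coef, *_ = np.linalg.lstsq(design, salary, rcond=None)

predicted = design @ coef
residuals = salary - predicted  # positive: paid more than the model predicts

for case, r in enumerate(residuals, start=1):
    print(case, round(float(r), 2))
```

Because the model includes an intercept, the residuals sum to zero, so "overpaid" and "underpaid" are relative to the other cases in the same model.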
Split by Sex
• For men, the unique effect of number of
publications was positive – more
publications, higher salary.
• For women it was negative – more
publications, lower salary.
• Curious.
Workaholism
• Aziz & Zickar (2005)
• Workaholics may be defined as those
– High in work involvement,
– High in drive to work, and
– Low in work enjoyment.
• For each case, a score was obtained for
each of these three dimensions.
The Three Cluster Solution
• Workaholics
– High work involvement
– High drive to work
– Low work enjoyment
• Positively engaged workers
– High work involvement
– Medium drive to work
– High work enjoyment
• Unengaged workers
– Low work involvement
– Low drive to work
– Low work enjoyment
• Past research/theory indicated there
should be six clusters, but the theorized
six clusters were not obtained.
Clustering Variables
• FactBeer.sav
• The statistical output.
• Analyze, Classify, Hierarchical Cluster
Statistics
Plots
Method
Proximity Matrix
• Is simply the intercorrelation matrix
• The two most correlated variables are
Color and Aroma (r = .909) – they are
clustered on the first step.
• Stage 2: Size and Alcohol (r = .904) are
clustered.
• Stage 3: Taste added to the cluster that
already contains Color and Aroma
Also See Other Tables & Plots
• Stage 4: Cost added to the cluster that
already contains Size and Alcohol.
• Stage 5: The two clusters are combined
– But they are not very similar (similarity
coefficient = .038)
– Now we have one cluster with six variables
and one with one (Reputation)
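When variables are clustered, the correlation matrix serves as the similarity matrix and the most highly correlated pair is joined first. A sketch of that first step on made-up ratings follows. The variable names echo the slides, but the numbers are hypothetical, and raw (not absolute) correlations are assumed.

```python
import numpy as np

names = ["color", "aroma", "size"]

# Hypothetical ratings: rows are cases, columns are the variables above.
scores = np.array([[1.0, 1.1, 5.0],
                   [2.0, 2.0, 1.0],
                   [3.0, 3.2, 4.0],
                   [4.0, 3.9, 2.0],
                   [5.0, 5.1, 3.0]])

# Correlations between columns act as similarities between variables.
r = np.corrcoef(scores, rowvar=False)

# Search only the upper triangle so each pair is considered once.
iu, ju = np.triu_indices(len(names), k=1)
best = int(np.argmax(r[iu, ju]))
first_pair = (names[iu[best]], names[ju[best]])
print(first_pair, round(float(r[iu[best], ju[best]]), 3))
```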