proc cluster - Learn Via Web .com

Cluster Analysis
Some of the examples on cluster analysis will come from Base SAS.

What is Cluster Analysis?
Cluster analysis is an exploratory data analysis tool. It classifies records (people, things, events, etc.) into groups, or clusters, in a manner that produces a strong degree of association between members of the same cluster and a weak degree of association between members of different clusters. Cluster analysis can illuminate associations and structure in the data that were not previously evident but are logical and useful once found.
When Do We Use Cluster Analysis?

Use cluster analysis when you want to classify people or things into groups/clusters. There are many situations in which cluster analysis can be helpful; it is often useful in marketing.
Why marketing?

In marketing it is often desirable to group customers with similar likes and dislikes together, which enables more effective marketing campaigns. It is also often difficult in marketing to determine how to place customers into categories.
Measures of Distance

Two main types of distance are used for cluster analysis:

1. Euclidean distance:

   $d(r, s) = \sqrt{(x_r - x_s)'(x_r - x_s)}$

   where $x_r$ is the vector of values of the clustering variables for point $r$ and $x_s$ is the vector for point $s$.

2. Standardized Euclidean distance: standardize the data first, then calculate the Euclidean distance:

   $z_i = \frac{x_i - \mu}{\sigma}$

   Typically we use the sample mean and sample standard deviation for $\mu$ and $\sigma$ respectively.
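The two distance measures above can be sketched in a few lines of Python (a hedged illustration, not SAS code; the small example points are hypothetical):

```python
import math

def euclidean(xr, xs):
    # d(r, s) = sqrt((xr - xs)'(xr - xs)): square root of the summed
    # squared coordinate differences between points r and s.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xr, xs)))

def standardize(data):
    # z_i = (x_i - mean) / sd, applied column by column using the
    # sample mean and sample standard deviation.
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)]
            for row in data]

points = [(150.0, 45.0), (155.0, 50.0), (160.0, 40.0)]
print(euclidean(points[0], points[1]))  # plain Euclidean distance, sqrt(50)
z = standardize(points)
print(euclidean(z[0], z[1]))            # standardized Euclidean distance
```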
An Example of the Power of Cluster Analysis

1. We will now give an example of the power of cluster analysis with SAS. Generate data (code to generate random data, expecting 3 clusters):
DATA body1;                     *the data represents the heights and weights of 300 women;
  KEEP height1 weight1;
  N=100; SCALE=1;               *N=observations per cluster, SCALE=standard deviation;
  MX=150; MY=45; LINK GENERATE; *means of x,y for cluster 1;
  MX=155; MY=50; LINK GENERATE; *means of x,y for cluster 2;
  MX=160; MY=40; LINK GENERATE; *means of x,y for cluster 3;
  STOP;
GENERATE:
  DO I=1 TO N;
    height1=RANNOR(0)*SCALE+MX; *draws from a normal distribution;
    weight1=RANNOR(0)*SCALE+MY; *draws from a normal distribution;
    OUTPUT;
  END;
  RETURN;
RUN;
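For readers without SAS, the same three-cluster dataset can be simulated in Python (a sketch; the seed and the helper name `generate` are my own, and note that SCALE multiplies a standard normal draw, so it acts as a standard deviation):

```python
import random

def generate(n, scale, mx, my, rng):
    # Mirrors the GENERATE link: n draws from a normal distribution
    # centered at the cluster means (mx, my), spread controlled by scale.
    return [(rng.gauss(mx, scale), rng.gauss(my, scale)) for _ in range(n)]

rng = random.Random(1)  # fixed seed for reproducibility; RANNOR(0) seeds from the clock
body1 = (generate(100, 1, 150, 45, rng)    # cluster 1
         + generate(100, 1, 155, 50, rng)  # cluster 2
         + generate(100, 1, 160, 40, rng)) # cluster 3
print(len(body1))  # 300 (height1, weight1) pairs
```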
2. How will we group the women looking at height and weight? Now let us see the data:

   proc print data=body1; run;
3. Let us see some descriptive statistics for more information:

   proc univariate data=body1;
     var height1 weight1;
   run;

   [Output: descriptive statistics on height and on weight]
The descriptive statistics were informative, but it is still very difficult to know how to group the women using their heights and weights.

4. Cluster analysis:

   PROC CLUSTER DATA=body1 OUTTREE=TREE METHOD=SINGLE PRINT=20;
   run;

   DATA= and OUTTREE= name the input (body1) and output (tree) datasets respectively, and METHOD=SINGLE selects the single linkage clustering method.
   PROC TREE DATA=tree NOPRINT OUT=OUT N=3;
     COPY height1 weight1;
   RUN;

   NOPRINT prevents printing of the tree; N=3 specifies the desired number of clusters (3 for this example); COPY copies the variables height1 and weight1 to the output dataset called OUT.

   PROC GPLOT;
     PLOT height1*weight1=CLUSTER;
   RUN;

   In the plot, height1 and weight1 are the vertical (y) axis and the horizontal (x) axis respectively, and =CLUSTER makes the plot distinguish between the clusters produced.
[Plot: the results of clustering using SAS (height1 by weight1, marked by cluster)]
In SAS the options for METHOD= in PROC CLUSTER are:

1. AVE or AVERAGE for the average linkage method.
2. CEN or CENTROID for the centroid method.
3. COM or COMPLETE for the complete linkage method.
4. DEN or DENSITY for density linkage methods using nonparametric probability density estimation.
5. EML, used for coordinate data; much slower than the other clustering methods.
6. FLE or FLEXIBLE for the Lance-Williams flexible-beta method.
7. MCQ or MCQUITTY for McQuitty's similarity analysis.
8. MED or MEDIAN for Gower's median method.
9. SIN or SINGLE for the single linkage method.
10. TWO or TWOSTAGE for two-stage density linkage.
11. WAR or WARD for Ward's minimum variance method.

That is a lot of possible methods! The most common are single, average, complete, centroid, and Ward's. The clustering method used in the example was the single linkage method.
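The difference between the common linkage methods is simply how the distance between two clusters is built from the pairwise point distances. A minimal Python sketch, using hypothetical 1-D points for illustration (this is not SAS's implementation):

```python
def single_linkage(c1, c2, dist):
    # Single linkage: cluster distance is the MINIMUM pairwise
    # distance (nearest neighbors).
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, dist):
    # Complete linkage: cluster distance is the MAXIMUM pairwise
    # distance (furthest neighbors).
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, dist):
    # Average linkage: mean of all pairwise distances.
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

d = lambda a, b: abs(a - b)  # 1-D Euclidean distance for illustration
g1, g2 = [1.0, 2.0], [5.0, 9.0]
print(single_linkage(g1, g2, d))    # 3.0
print(complete_linkage(g1, g2, d))  # 8.0
print(average_linkage(g1, g2, d))   # 5.5
```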
Cluster Analysis: Two Main Procedures in SAS

1. PROC CLUSTER, which is hierarchical (used together with PROC TREE).
2. PROC FASTCLUS, which is nonhierarchical.
Cluster Analysis: Hierarchical

1. Proc Cluster

Hierarchical: the data points are placed into clusters in a nested sequence of clusterings. The most efficient type of hierarchical clustering methods are the single link clustering methods.

Types of single link clustering:
• Nearest Neighbor Method
• Furthest Neighbor Method
• Many others

The Nearest Neighbor Method tends to create fewer clusters than the Furthest Neighbor Method, and most other methods tend to give results somewhere in between these two. It is good to try more than one method: if multiple methods produce consistent/similar results, then your results are more credible.
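The nested-merging idea can be sketched as a naive agglomerative loop in Python: start with singletons and repeatedly merge the two closest clusters until the desired number remains (an O(n^3) illustration with made-up points, not PROC CLUSTER's algorithm):

```python
import math

def agglomerate(points, k, linkage):
    # Start with every point in its own cluster, then repeatedly merge
    # the two closest clusters (by the chosen linkage) until k remain.
    clusters = [[p] for p in points]
    d = lambda a, b: math.dist(a, b)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = linkage(clusters[i], clusters[j], d)
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

def single(c1, c2, d):
    # Single linkage: minimum pairwise distance between clusters.
    return min(d(a, b) for a in c1 for b in c2)

pts = [(150, 45), (151, 46), (155, 50), (156, 51), (160, 40), (161, 41)]
groups = agglomerate(pts, 3, single)
print(sorted(len(g) for g in groups))  # three clusters of two points each
```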
Cluster Analysis: Nonhierarchical

2. Proc FastClus

Nonhierarchical: choose a set of cluster seed points and build clusters around the seeds, using a dissimilarity measure of the distance between the data points and the cluster seed points. An example of a dissimilarity measure is Euclidean distance.

Three main disadvantages:
1. You must initially guess the number of clusters that exist.
2. The result is greatly influenced by the location of the cluster seed points.
3. It is possibly not feasible computationally, due to the multitude of possible choices for the number of clusters and for the locations of the cluster seeds.
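The seed-based idea behind PROC FASTCLUS is essentially k-means. A minimal Python sketch (the toy data and seed positions are hypothetical):

```python
import math

def kmeans(points, seeds, iters=20):
    # Nonhierarchical clustering in the FASTCLUS spirit: assign each
    # point to its nearest seed, then move each seed to the mean of
    # its assigned points, and repeat.
    centers = [list(s) for s in seeds]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: math.dist(p, centers[c]))
            groups[i].append(p)
        for i, g in enumerate(groups):
            if g:  # leave a seed in place if no points were assigned to it
                centers[i] = [sum(v) / len(g) for v in zip(*g)]
    return centers, groups

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, groups = kmeans(pts, seeds=[(0, 0), (9, 9)])
print(centers)  # converges to [[0.0, 0.5], [10.0, 10.5]]
```

Notice how disadvantage 2 shows up directly: a different choice of `seeds` can converge to a different grouping.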
Reference: much of the following material is taken from the SPSS Help/Tutorial.
SPSS’s: The TwoStep Cluster Analysis

The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:

• The ability to create clusters based on both categorical and continuous variables.
• Automatic selection of the number of clusters.
• The ability to analyze large data files efficiently.
In order to handle categorical and continuous variables, the TwoStep Cluster Analysis procedure uses a likelihood distance measure which assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

The two steps of the TwoStep Cluster Analysis procedure's algorithm:

Step 1. The procedure begins with the construction of a Cluster Features (CF) Tree. The tree begins by placing the first case at the root of the tree in a leaf node that contains variable information about that case. Each successive case is then added to an existing node or forms a new node, based upon its similarity to existing nodes and using the distance measure as the similarity criterion. A node that contains multiple cases contains a summary of variable information about those cases. Thus, the CF tree provides a capsule summary of the data file.

Step 2. The leaf nodes of the CF tree are then grouped using an agglomerative clustering algorithm. The agglomerative clustering can be used to produce a range of solutions. To determine which number of clusters is "best", each of these cluster solutions is compared using Schwarz's Bayesian Criterion (BIC) or the Akaike Information Criterion (AIC) as the clustering criterion.
Car manufacturers need to be able to appraise the current market to determine the likely competition for their vehicles. If cars can be grouped according to available data, this task can be largely automated using cluster analysis.

Information for various makes and models of motor vehicles is contained in car_sales.sav. Use the TwoStep Cluster Analysis procedure to group automobiles according to their prices and physical properties.
In the TwoStep Cluster Analysis dialog box:
1. Select Vehicle type as a categorical variable.
2. Select Price in thousands through Fuel efficiency as continuous variables.
3. Click Plots.
In the Plots dialog box:
1. Select Rank of variable importance.
2. Select By variable in the Rank Variables group.
3. Select Confidence level.
4. Click Continue.
5. Then click Output in the TwoStep Cluster Analysis dialog box.
In the Output dialog box:
1. Select Information criterion (AIC or BIC) in the Statistics group.
2. Click Continue.
3. Finally, click OK in the TwoStep Cluster Analysis dialog box.
Auto-Clustering

Number of   Schwarz's Bayesian   BIC          Ratio of BIC   Ratio of Distance
Clusters    Criterion (BIC)      Change (a)   Changes (b)    Measures (c)
 1          1214.377
 2           974.051             -240.326      1.000          1.829
 3           885.924              -88.128       .367          2.190
 4           897.559               11.635      -.048          1.368
 5           931.760               34.201      -.142          1.036
 6           968.073               36.313      -.151          1.576
 7          1026.000               57.927      -.241          1.083
 8          1086.815               60.815      -.253          1.687
 9          1161.740               74.926      -.312          1.020
10          1237.063               75.323      -.313          1.239
11          1316.271               79.207      -.330          1.046
12          1396.192               79.921      -.333          1.075
13          1477.199               81.008      -.337          1.076
14          1559.230               82.030      -.341          1.301
15          1644.366               85.136      -.354          1.044

a. The changes are from the previous number of clusters in the table.
b. The ratios of changes are relative to the change for the two-cluster solution.
c. The ratios of distance measures are based on the current number of clusters against the previous number of clusters.

1. The Auto-Clustering table summarizes the process by which the number of clusters is chosen.
2. The clustering criterion (in this case the BIC) is computed for each potential number of clusters. Smaller values of the BIC indicate better models, and in this situation, the "best" cluster solution has the smallest BIC.
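Footnotes a and b of the Auto-Clustering table can be reproduced directly from the BIC column; a small Python check:

```python
# BIC values from the Auto-Clustering table, for 1 to 15 clusters.
bic = [1214.377, 974.051, 885.924, 897.559, 931.760, 968.073,
       1026.000, 1086.815, 1161.740, 1237.063, 1316.271, 1396.192,
       1477.199, 1559.230, 1644.366]

# Footnote a: each BIC Change is relative to the previous number of clusters.
changes = [bic[i] - bic[i - 1] for i in range(1, len(bic))]

# Footnote b: ratios of changes are relative to the two-cluster change.
# (Tiny rounding differences versus the printed table are possible.)
ratios = [c / changes[0] for c in changes]

print(round(changes[0], 3))  # -240.326, the two-cluster change
print(round(ratios[1], 3))   # 0.367, the three-cluster ratio
```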
However, there are clustering problems in which the BIC will continue to decrease as the number of clusters increases, but the improvement in the cluster solution, as measured by the BIC Change, is not worth the increased complexity of the cluster model, as measured by the number of clusters.
In such situations, the changes in BIC and the changes in the distance measure are evaluated to determine the "best" cluster solution. A good solution will have a reasonably large Ratio of BIC Changes and a large Ratio of Distance Measures.
Cluster Distribution

                      N     % of Combined   % of Total
Cluster 1             62        40.8%          39.5%
Cluster 2             39        25.7%          24.8%
Cluster 3             51        33.6%          32.5%
Combined             152       100.0%          96.8%
Excluded Cases         5                        3.2%
Total                157                      100.0%

The cluster distribution table shows the frequency of each cluster. Of the 157 total cases, 5 were excluded from the analysis due to missing values on one or more of the variables. Of the 152 cases assigned to clusters, 62 were assigned to the first cluster, 39 to the second, and 51 to the third.
Centroids (Mean / Std. Deviation)

                     Cluster 1             Cluster 2              Cluster 3              Combined
Price in thousands   19.61671 / 7.644070   26.56182 / 10.185175   37.29980 / 17.381187   27.33182 / 14.418669
Engine size          2.194 / .4238         3.559 / .9358          3.700 / .9493          3.049 / 1.0498
Horsepower           143.24 / 30.259       187.92 / 39.049        232.96 / 54.408        184.81 / 56.823
Wheelbase            102.595 / 4.0799      112.972 / 9.6537       109.022 / 5.7644       107.414 / 7.7178
Width                68.539 / 1.9366       72.744 / 4.1781        72.924 / 2.1855        71.089 / 3.4647
Length               178.235 / 9.6534      191.110 / 14.4415      194.688 / 10.3512      187.059 / 13.4712
Curb weight          2.83742 / .310867     3.96759 / .671766      3.57890 / .297204      3.37618 / .636593
Fuel capacity        14.979 / 1.8699       22.064 / 4.2894        18.443 / 2.0445        17.959 / 3.9376
Fuel efficiency      27.24 / 3.578         19.51 / 2.910          23.02 / 2.060          23.84 / 4.305

The centroids show that the clusters are well separated by the continuous variables.
Motor vehicles in cluster 1 are cheap, small, and fuel efficient.
Motor vehicles in cluster 2 are moderately priced, heavy, and have a large gas tank, presumably to compensate for their poor fuel efficiency.
Motor vehicles in cluster 3 are expensive, large, and moderately fuel efficient.
Vehicle type

             Automobile              Truck
             Frequency   Percent     Frequency   Percent
Cluster 1        61       54.5%          1         2.5%
Cluster 2         0         .0%         39        97.5%
Cluster 3        51       45.5%          0          .0%
Combined        112      100.0%         40       100.0%

The cluster frequency table by Vehicle type further clarifies the properties of the clusters.
Go to the SPSS output for the rest. (Note for myself, so as not to forget.)
Using the TwoStep Cluster Analysis procedure, you have separated the motor vehicles into three fairly broad categories. In order to obtain finer separations within these groups, you should collect information on other attributes of the vehicles. For example, you could note the crash test performance or the options available.

The TwoStep Cluster Analysis procedure is useful for finding natural groupings of cases or variables. It works well with categorical and continuous variables, and can analyze very large data files.

If you have a small number of cases and want to choose between several methods for cluster formation, variable transformation, and measuring the dissimilarity between clusters, try the Hierarchical Cluster Analysis procedure. The Hierarchical Cluster Analysis procedure also allows you to cluster variables instead of cases.

The K-Means Cluster Analysis procedure is limited to scale variables, but it can be used to analyze large data files and allows you to save the distances from cluster centers for each object.