SW 983 LECTURE NOTES
CLUSTER ANALYSIS
Definition: Any of several procedures in multivariate analysis designed to determine whether
individuals (or other units of analysis) are similar or dissimilar enough to fall into groups or
clusters.
Aka: Q analysis, typology construction, classification analysis, and numerical taxonomy.
More an art than a science
The cluster variate is the set of variables representing the characteristics used to compare objects
in the cluster analysis. Cluster analysis is the only multivariate technique that does not estimate
the variate empirically but instead uses the variate as specified by the researcher. The focus of
cluster analysis is on the comparison of objects based on the variate, not on the estimation of the
variate itself.
Cluster analysis can be characterized as descriptive, atheoretical, and noninferential. Cluster
analysis has no statistical basis upon which to draw statistical inferences from a sample to a
population, and it is used primarily as an exploratory technique. The solutions are not unique,
as the cluster membership for any number of solutions is dependent upon many elements of the
procedure and many different solutions can be obtained by varying one or more elements.
Moreover, cluster analysis will always create clusters regardless of the “true” existence of any
structure in the data. Finally, the cluster solution is totally dependent upon the variables used
as the basis for the similarity measure.
Related to (but significantly different from) two other techniques:
1. Factor analysis – FA groups variables while CA groups subjects. Having said that, you
will see that the SPSS program allows you to perform cluster analysis on variables or
cases. FA is sometimes done prior to CA, with the factor scores then used to perform the
CA. Hair et al. cite some research suggesting this may not be appropriate (p. 491). FA
generally has a more theoretical basis and provides statistical tests; CA is more ad hoc.
2. Discriminant analysis – DA is similar in that the goal is to classify a set of cases into
groups or categories, but in CA neither the number nor the membership of the groups is
known in advance. DA can be performed in conjunction with CA after the clusters are
identified to derive weights for the variate or clustering variables (see the sketch below).
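
A minimal sketch of that DA-after-CA sequence, assuming scikit-learn and simulated data; the variable dimensions and the choice of three clusters are illustrative assumptions, not from the notes:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))            # 200 cases, 4 clustering variables

    # Cluster first (Ward's agglomerative method on standardized data).
    Z = StandardScaler().fit_transform(X)
    labels = AgglomerativeClustering(n_clusters=3).fit_predict(Z)

    # Then fit DA on the cluster labels to derive weights for the
    # clustering variables.
    lda = LinearDiscriminantAnalysis().fit(Z, labels)
    print(lda.scalings_)                     # one column of weights per discriminant function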
Steps:
1. Look for outliers – Requires a multivariate test; Mahalanobis D² provides one (see
p. 66 of the text). It is available as an output variable in the SPSS regression module.
2. Choose a distance measure – Squared Euclidean distance is most commonly used.
3. Standardize the data (z-scores) so that variables with larger variances do not dominate
the distance measure.
4. Choose a clustering method – The hierarchical agglomerative method is most commonly used.
5. Determine how many groups – This could be guided by theory but is mostly a subjective
judgment. You want clusters with fair representation (numbers of items) that maximize
between-cluster differences and minimize within-cluster differences. The distance measure
provides a more objective guide here but is still relative.
6. Describe the clusters – May involve naming. Which variables are hanging together for a
particular cluster? Run mean or proportion comparisons across clusters. The ability to make
sense of the clusters may influence your decision about how many clusters should exist.
(A sketch walking through these steps in code follows this list.)
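
A minimal end-to-end sketch of steps 1–6, assuming scipy/numpy and simulated data; the 0.999 outlier cutoff, average linkage, and the choice of three clusters are illustrative assumptions, not prescriptions from the notes:

    import numpy as np
    from scipy.stats import chi2
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    X = rng.normal(size=(150, 3))                    # 150 cases, 3 variables

    # Step 1: screen for multivariate outliers with Mahalanobis D².
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu)
    X = X[d2 < chi2.ppf(0.999, df=X.shape[1])]       # drop extreme cases

    # Step 3: z-scores, so no variable dominates the distances.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Steps 2 and 4: squared Euclidean distances, hierarchical agglomerative
    # clustering (average linkage is one common agglomerative choice).
    d = pdist(Z, metric='sqeuclidean')
    tree = linkage(d, method='average')

    # Step 5: pick a number of clusters; a large jump in the fusion
    # distances (tree[:, 2]) is one relative guide.
    labels = fcluster(tree, t=3, criterion='maxclust')

    # Step 6: describe the clusters with per-cluster sizes and means.
    for k in np.unique(labels):
        print(k, (labels == k).sum(), Z[labels == k].mean(axis=0).round(2))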
Cluster Analysis with SPSS
SPSS provides three modules for doing cluster analysis. The TwoStep procedure is new and the
easiest to run. Also available are the Hierarchical and K-Means cluster procedures.
The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural
groupings (or clusters) within a data set that would otherwise not be apparent. The algorithm
employed by this procedure has several desirable features that differentiate it from traditional
clustering techniques:
• Handling of categorical and continuous variables. By assuming variables to be
independent, a joint multinomial-normal distribution can be placed on categorical and
continuous variables.
• Automatic selection of number of clusters. By comparing the values of a model-choice
criterion across different clustering solutions, the procedure can automatically determine
the optimal number of clusters (see the sketch after this list).
• Scalability. By constructing a cluster features (CF) tree that summarizes the records, the
TwoStep algorithm allows you to analyze large data files.
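
TwoStep itself is proprietary to SPSS, so as a rough stand-in for its criterion-based selection of the number of clusters, this sketch fits Gaussian mixture models (an assumed substitute technique, not the TwoStep algorithm) for several candidate k and keeps the one with the lowest BIC; the CF-tree step is not reproduced:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    # Two well-separated blobs, so the "right" answer is k = 2.
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
           for k in range(1, 7)}
    best_k = min(bic, key=bic.get)              # lowest BIC wins
    print(best_k, {k: round(v, 1) for k, v in bic.items()})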
Example. Retail and consumer product companies regularly apply clustering techniques to data
that describe their customers' buying habits, gender, age, income level, etc. These companies
tailor their marketing and product development strategies to each consumer group to increase
sales and build brand loyalty.
Statistics. The procedure produces information criteria (AIC or BIC) by number of clusters in
the solution, cluster frequencies for the final clustering, and descriptive statistics by cluster for
the final clustering.
Plots. The procedure produces bar charts of cluster frequencies, pie charts of cluster frequencies,
and variable importance charts.
Distance Measure. This selection determines how the similarity between two clusters is
computed.
• Log-likelihood. The likelihood measure places a probability distribution on the variables.
Continuous variables are assumed to be normally distributed, while categorical variables
are assumed to be multinomial. All variables are assumed to be independent.
• Euclidean. The Euclidean measure is the "straight line" distance between two clusters. It
can be used only when all of the variables are continuous (see the sketch below).
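
A tiny sketch of the Euclidean (and squared Euclidean) distance between two cluster centers, with made-up coordinates:

    import numpy as np

    center_a = np.array([1.0, 2.0, 0.5])        # illustrative cluster centers
    center_b = np.array([3.0, 0.0, 1.5])
    d = np.linalg.norm(center_a - center_b)     # "straight line" distance
    print(d, d ** 2)                            # Euclidean and squared Euclidean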
Number of Clusters. This selection allows you to specify how the number of clusters is to be
determined.
• Determine automatically. The procedure will automatically determine the "best" number
of clusters, using the criterion specified in the Clustering Criterion group. Optionally,
enter a positive integer specifying the maximum number of clusters that the procedure
should consider.
• Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive
integer.
Count of Continuous Variables. This group provides a summary of the continuous variable
standardization specifications made in the Options dialog box.
Clustering Criterion. This selection determines how the automatic clustering algorithm
determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the
Akaike Information Criterion (AIC) can be specified.
If you have a small number of cases, and want to choose between several methods for cluster
formation, variable transformation, and measuring the dissimilarity between clusters, try the
Hierarchical Cluster Analysis procedure. The Hierarchical Cluster Analysis procedure also
allows you to cluster variables instead of cases.
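
A minimal sketch of clustering variables rather than cases, assuming scipy and using 1 − |r| as the dissimilarity between variables (one common convention, assumed here rather than taken from the SPSS procedure):

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 5))               # 100 cases, 5 variables

    R = np.corrcoef(X, rowvar=False)            # variable-by-variable correlations
    D = 1 - np.abs(R)                           # dissimilarity between variables
    np.fill_diagonal(D, 0.0)
    tree = linkage(squareform(D, checks=False), method='average')
    print(fcluster(tree, t=2, criterion='maxclust'))  # cluster label per variable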
The K-Means Cluster Analysis procedure is limited to continuous variables, but can be used to
analyze large data sets and allows you to save the distances from cluster centers for each object.
Unlike the Hierarchical Cluster Analysis procedure, which results in a series of solutions
corresponding to different numbers of clusters, the K-Means procedure produces only one
solution, for the number of clusters requested by the user.
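
A minimal sketch of a k-means run that keeps each object's distance to the cluster centers, analogous to what the SPSS procedure lets you save; scikit-learn and the simulated data are assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 4))               # 300 cases, 4 continuous variables

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    dists = km.transform(X)                     # distance from each case to every center
    own = dists[np.arange(len(X)), km.labels_]  # distance to the assigned center
    print(own[:5].round(3))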