Cluster Analysis (CLA)

Cluster Analysis (CLA) – also known as Q analysis, typology construction,
classification analysis, or numerical taxonomy
The primary objective of CLA is to classify objects into relatively homogeneous
groups based on the set of variables considered. Objects in a group are relatively
similar in terms of these variables, and different from objects in other groups.
o When used in this manner, CLA is the obverse of FA (factor analysis), in
that it reduces the number of objects, not variables (as FA does), by
grouping them into a much smaller number of clusters.
CLA can also classify variables into relatively homogeneous groups based on the
set of objects considered (the inverse of the above)
CLA places similar observations into groups, trying to
o Minimize within-group variance (i.e. build groups with homogeneous
contents)
and, at the same time,
o Maximize between-group variance (i.e. build heterogeneous groups)
CLA is different from MDA (Multiple Discriminant Analysis) in that groups are
suggested by the data and not defined a priori
CLA is only descriptive and exploratory (in the same way as the MDS techniques
are); it has no statistical foundation and imposes no distributional requirements on
the data (unlike FA, MDA, or Multiple Regression Analysis, which assume e.g.
normality, linearity, or homoscedasticity)
o The only concerns are:
 The sample should be representative (and free of outliers – these
should be deleted after proper screening)
 There should be no multicollinearity among the variables – related
variables are effectively weighted more heavily, thereby receiving
improper emphasis in the analysis (a screening sketch for outliers
and collinearity follows this list)
 Naturally-occurring groups must exist in the data, but the analysis
itself cannot confirm the validity of these groups.
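A minimal sketch of these screening concerns, assuming a hypothetical pandas
DataFrame df that holds only the clustering variables (the cutoff values are
illustrative, not prescriptive):

```python
import pandas as pd

def screen_for_clustering(df: pd.DataFrame, z_cutoff: float = 3.0, r_cutoff: float = 0.9):
    """Flag outlier cases and highly correlated variable pairs before running CLA."""
    # Z-scores per variable; cases with any |z| beyond the cutoff are outlier candidates
    z = (df - df.mean()) / df.std(ddof=0)
    outlier_rows = (z.abs() > z_cutoff).any(axis=1)
    # Pairwise correlations; very high |r| signals multicollinearity among variables
    corr = df.corr().abs()
    high_pairs = [(a, b, round(corr.loc[a, b], 2))
                  for i, a in enumerate(corr.columns)
                  for b in corr.columns[i + 1:]
                  if corr.loc[a, b] > r_cutoff]
    return df.loc[~outlier_rows], high_pairs
```

Deleting outliers and dropping (or combining) one variable from each highly
correlated pair should, of course, follow proper substantive screening, not the
cutoffs alone.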
Purposes of CLA
o Data reduction (to assess structure), usually followed by other multivariate
analyses.
 For example, to describe differences in consumers’ product usage
behaviour, the consumers may first be clustered into groups, and
then the differences among the groups may be examined using
MDA (multiple discriminant analysis).
o Hypothesis development
 For example, a researcher may believe that attitudes toward the
consumption of diet vs. regular soft drinks could be used to divide
soft-drink consumers into distinct groups. Hence, he/she can use
CLA to classify soft-drink consumers by their attitudes about diet
vs. regular soft drinks, and then profile the resulting clusters for
demographic similarities and/or differences.
o Classification of objects or variables
 For example, used in marketing for:
 Segmenting the market (e.g. consumers may be clustered
on the basis of benefits sought from the purchase of a
product)
 Understanding buyer behaviours (e.g. the buying behaviour
of each cluster may be examined separately)
 Identifying new product opportunities (e.g. brands in the
same cluster compete more fiercely with each other than
with brands in other clusters)
 Selecting test markets (e.g. by grouping cities into
homogeneous clusters, it is possible to select comparable
cities to test various marketing strategies)
Steps in CLA
1. Define the variables on which clustering of objects will be based
a. Note: inclusion of even one or two (not to mention more than that)
irrelevant variables may distort a clustering solution!
b. Results of CLA are only as good as the variables included in the
analysis (each variable should have a specific reason for being
included; if you cannot identify why a variable should be included
in the analysis – exclude it).
i. Remember: CLA has no means of differentiating relevant
from irrelevant variables (it is the researcher who has to
perform this task)
ii. Hint: Always examine the results and eliminate the
variables that are not distinctive across the derived clusters,
then repeat the CLA (a sketch of this check follows this
step).
c. The variables should be selected based on past research, theory, or
a consideration of the hypotheses being tested
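A hedged sketch of the hint in 1(b)(ii), assuming df (the clustering variables)
and labels (cluster memberships from an initial CLA run) already exist; a low
F-statistic suggests a variable is not distinctive across the derived clusters and
is a candidate for exclusion:

```python
import pandas as pd
from scipy import stats

def variable_distinctiveness(df: pd.DataFrame, labels) -> pd.Series:
    """One-way ANOVA F-statistic per variable across the derived clusters."""
    labels = pd.Series(labels).to_numpy()
    f_stats = {}
    for col in df.columns:
        groups = [df.loc[labels == k, col] for k in sorted(set(labels))]
        f_stats[col], _ = stats.f_oneway(*groups)
    return pd.Series(f_stats).sort_values(ascending=False)  # low F => candidate to drop
```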
2. Select a similarity measure from one of the following:
a. Correlational measures (analyze patterns across the variables)
i. For each pair of objects, the correlation coefficient is
calculated across all the variables
b. Distance measures (analyze the proximity between objects across
the variables)
i. The Euclidean distance (or its square) – the most popular
choice
ii. The Manhattan distance (or city-block distance)
iii. and many other distance measures (e.g. Chebychev
distance, Minkowski power distance, Mahalanobis
distance, cosine, chi-square (for count data))
1) Note: the choice of distance can have a great
impact on the clustering solution; therefore, use
different distances and compare the results (the
sketch after this step illustrates several of them).
2) When variables have different scales (e.g. semantic
differential 7-point rating scale, 9-point Likert-type
scale, 5-point Likert-type scale) or are measured in
vastly different units (e.g. percentages, dollar
amounts, frequencies), they should be re-scaled in
one of the following ways:
* Standardize each variable (i.e. from each
variable value subtract the mean and then
divide the difference by the standard deviation –
this produces so-called Z-scores). [This is the
most widely used approach.]
* Divide the variable values only by their
standard deviation
* Divide the variable values only by their mean
* Divide the variable values by their range
* Divide the variable values by their maximum
* Note: Not only the variables but also the
objects (i.e. cases or respondents) may be
standardized. Standardizing respondents
(so-called "ipsatizing") sometimes helps to
remove so-called "response-style effects".
c. Association measures (for nominal or ordinal measurements).
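The sketch below illustrates several of the measures from this step using scipy;
the toy array X and the particular metrics shown are purely illustrative:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[7.0, 2.0, 5.0],
              [6.0, 3.0, 5.0],
              [1.0, 7.0, 2.0]])               # toy objects-by-variables data

# Re-scale first when variables use different scales (Z-scores: the most common choice)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

d_euclidean = squareform(pdist(Z, metric="euclidean"))    # most popular choice
d_squared   = squareform(pdist(Z, metric="sqeuclidean"))  # for Ward's / centroid methods
d_manhattan = squareform(pdist(Z, metric="cityblock"))    # city-block distance
d_chebychev = squareform(pdist(Z, metric="chebyshev"))
d_cosine    = squareform(pdist(Z, metric="cosine"))

# Correlational measure: correlation between each pair of objects across the variables
r_objects = np.corrcoef(X)                    # pattern similarity rather than proximity
```

Comparing the cluster solutions produced under two or three different distances
is a cheap robustness check, per the note above.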
3. Select a clustering procedure
a. Hierarchical clustering = stepwise clustering procedure involving
successive combination (or division) of objects into clusters
i. Agglomerative clustering (starts with each object in a
separate cluster, and then clusters are formed by grouping
objects into bigger and bigger clusters – like snowballs)
1) Linkage methods
(i) Single linkage algorithm (based on
the minimum distance between the two
closest points of two clusters)
(ii) Complete linkage algorithm (based
on maximum distance between the two
furthest points of two clusters)
(iii) Average linkage algorithm (based
on the average of the distances
between all pairs of objects from each
of the clusters)
2) Variance methods
(i) Ward’s method (based on the
squared Euclidean distances from
each object to the cluster’s mean)
3) Centroid method (based on the distances between
the clusters’ centroids, i.e. their means for all of the
variables)
ii. Divisive clustering (just the opposite: starts with all the
objects grouped in one cluster, which is then divided or
split into smaller groups)
*Note: Agglomerative clustering is more
popular than divisive clustering in marketing
research
* Within agglomerative clustering, the most
popular approaches are average linkage and
Ward's method; they have been shown to
perform better than the other procedures.
* Squared Euclidean distances should be
used with Ward's and the centroid methods.
b. Non-hierarchical clustering (also referred to as K-means
clustering)
i. Sequential threshold method
ii. Parallel threshold method
iii. Optimizing partitioning method
o Disadvantages of non-hierarchical methods:
 The number of clusters must be specified
in advance, and the selection of cluster
centres (seed points) is arbitrary
 The clustering results may depend on how
the centres are selected
o Advantages of non-hierarchical methods:
 Are better for large data sets
 Note: One may use hierarchical and
non-hierarchical methods in tandem:
o First, an initial clustering
solution is obtained with a
hierarchical method, such as
average linkage or Ward's
procedure
o Then, the number of clusters
and cluster centroids so
obtained (in hierarchical
method) may be used as
inputs to the optimizing
partitioning method.
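A minimal sketch of this tandem approach, assuming X is an (n x p) numpy array
of (re-scaled) clustering variables and that k = 3 is the number of clusters
suggested by the hierarchical stage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def tandem_clustering(X: np.ndarray, k: int = 3):
    """Hierarchical (Ward's) solution first; its centroids then seed K-means."""
    merge_tree = linkage(X, method="ward")                   # agglomerative stage
    hier_labels = fcluster(merge_tree, t=k, criterion="maxclust")
    # Cluster centroids from the hierarchical solution become K-means seed points
    seeds = np.vstack([X[hier_labels == c].mean(axis=0) for c in range(1, k + 1)])
    km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)   # optimizing partitioning stage
    return km.labels_, km.cluster_centers_
```

Seeding K-means this way removes the arbitrariness of seed-point selection noted
under the disadvantages above.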
4. Decide on the number of clusters – there are no hard and fast rules, the
following guidelines are available:
a. A number of clusters may be determined by theoretical, practical,
or conceptual considerations. For example, if the purpose of
clustering is to identify market segments, management may want a
particular number of clusters.
b. The relative sizes of the clusters may be taken into account for
determining the number of clusters (e.g. if with six clusters, one of
the clusters contains only one or two objects, one may decide to
create only five clusters, thus increasing the size of each of them)
c. In hierarchical clustering, the following outputs are obtained:
i. The icicle plot (will be presented and explained in class)
ii. The agglomeration schedule
iii. The dendrogram
 The number of clusters can be determined based on the
analysis of the above outputs (will be explained in class)
d. In non-hierarchical clustering, the ratio of total within-group
variance to between-group variance can be plotted against the
number of clusters. The point at which an elbow (or sharp bend)
occurs indicates an appropriate number of clusters (will be
explained in class)
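Hedged sketches of the diagnostics in 4(c) and 4(d), assuming X is a data array
with many more objects than the candidate numbers of clusters (KMeans' inertia_,
the total within-cluster sum of squares, is used here as a common stand-in for
the within-/between-group variance ratio):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import KMeans

# Dendrogram: inspect the merge heights for a natural place to cut the tree
dendrogram(linkage(X, method="ward"))
plt.show()

# Elbow plot: within-cluster variance vs. number of clusters
ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]
plt.plot(ks, inertias, marker="o")   # look for the elbow (sharp bend)
plt.xlabel("number of clusters")
plt.ylabel("within-cluster sum of squares")
plt.show()
```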
5. Interpret the clusters
 involves assigning a name (label) to each cluster
 MDA may be applied to assist in labeling the clusters
6. Validate and profile the clusters
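Since step 5 notes that MDA may assist in labeling, here is a minimal sketch
using linear discriminant analysis on the final cluster labels (X and labels are
assumed from the earlier steps) to see which variables drive the separation and
hence suggest names:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis().fit(X, labels)
print(np.round(lda.means_, 2))     # per-cluster variable means: profiles each cluster
print(np.round(lda.scalings_, 2))  # discriminant weights: large |weight| separates clusters
```

For validation, a common approach (consistent with step 6) is to split the
sample, cluster each half separately, and check that comparable cluster profiles
emerge.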