Clustering Procedure in SAS
Cheng Lei
Department of Electrical and Computer Engineering
University of Victoria, Canada
rexlei86@uvic.ca
I. Introduction
Clustering groups data into different groups by computing the distance between observations, where the distance defines how close the observations are to one another. In machine learning, clustering is also called unsupervised learning, because the data used do not have labeled classes or groups. The methods applied to compute the distance vary from simple measures to very complicated ones, depending on the specific research purpose; the only difference among them is how the distance between two clusters is computed.
The general process of clustering is as follows: each observation or data record begins in a cluster by itself; then the two closest clusters are merged to form a new cluster that replaces the two old ones; the merging of the two closest clusters is repeated until only one cluster is left.
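The merging loop described above can be sketched in a few lines of Python. This is only an illustration of the general agglomerative idea, not the implementation used by SAS: the function name `agglomerate`, the sample points, and the choice of the minimum pairwise distance as the between-cluster distance are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Merge the two closest clusters until one is left; return the merge history."""
    # Every observation starts in a cluster by itself.
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest between-cluster distance.
        # Assumption: the distance is the minimum pairwise Euclidean distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # The merged cluster replaces the two old ones.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


# Hypothetical two-dimensional observations.
history = agglomerate([(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)])
for left, right in history:
    print(left, "+", right)
```

With n observations the loop always records n - 1 merges, which is why the history of a hierarchical clustering can be drawn as a tree.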
The CLUSTER procedure in SAS hierarchically clusters the data records or observations in a SAS data set by using one of eleven methods. The data can be coordinates or distances. If the data are coordinates, PROC CLUSTER computes (possibly squared) Euclidean distances. For coordinate data, Euclidean distances are computed from differences between coordinate values, and the use of differences has several important consequences:
- For differences to be valid, the variables must have an interval or stronger scale of measurement. Ordinal or ranked data are generally not appropriate for cluster analysis.
- For Euclidean distances to be comparable, equal differences should have equal practical importance.
- Variables with large variances tend to have more effect on the resulting clusters than variables with small variances.
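The effect of variable variances on Euclidean distance can be seen in a small Python sketch. The data values and the helper names `euclidean` and `standardize` are hypothetical, and rescaling each variable to unit variance is shown only as one common remedy that the text above does not itself prescribe.

```python
from math import sqrt
from statistics import mean, pstdev


def euclidean(x, y):
    """Euclidean distance computed from coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def standardize(rows):
    """Rescale every variable (column) to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]


# Hypothetical observations: (income in dollars, age in years).
rows = [(30000.0, 25.0), (30500.0, 60.0), (42000.0, 26.0)]

raw_01 = euclidean(rows[0], rows[1])  # observations differ mostly in age
raw_02 = euclidean(rows[0], rows[2])  # observations differ mostly in income

z = standardize(rows)
z_01 = euclidean(z[0], z[1])
z_02 = euclidean(z[0], z[2])

print(raw_01, raw_02)  # the income difference dominates the raw distances
print(z_01, z_02)      # after rescaling, both differences count comparably
```

On the raw scale the 35-year age gap is almost invisible next to the income gap, which is the situation the third consequence warns about.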
The eleven clustering methods supported in SAS are: average linkage, the centroid method, complete linkage, density linkage (including Wong’s hybrid and kth-nearest-neighbor methods), maximum likelihood for mixtures of spherical multivariate normal distributions with equal variances but possibly unequal mixing proportions, the flexible-beta method, McQuitty’s similarity analysis, the median method, single linkage, two-stage density linkage, and Ward’s minimum-variance method. All of the methods are based on the usual agglomerative hierarchical clustering procedure.
The CLUSTER procedure is not practical for very large data sets because the CPU time is roughly proportional to the square or cube of the number of observations. The FASTCLUS procedure, in contrast, requires time proportional to the number of observations. Therefore, for a very large data set, one can first apply the FASTCLUS procedure to do a preliminary clustering and then use the CLUSTER procedure to cluster the preliminary clusters hierarchically.
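The two-stage idea can be illustrated with a rough Python sketch. It is a crude stand-in and not the FASTCLUS algorithm: the seeds, the sample points, and the function names `nearest` and `preliminary_clusters` are assumptions chosen for the example. A single pass assigns each observation to its nearest seed, so the cost grows linearly with the number of observations, and only the resulting cluster means and counts are passed on to the hierarchical step.

```python
from math import dist


def nearest(point, seeds):
    """Index of the seed closest to the point."""
    return min(range(len(seeds)), key=lambda i: dist(point, seeds[i]))


def preliminary_clusters(points, seeds):
    """One pass over the data: group points by nearest seed, then summarize."""
    groups = [[] for _ in seeds]
    for p in points:
        groups[nearest(p, seeds)].append(p)
    groups = [g for g in groups if g]  # drop seeds that attracted no points
    means = [tuple(sum(col) / len(g) for col in zip(*g)) for g in groups]
    sizes = [len(g) for g in groups]
    return means, sizes


# Hypothetical observations and seeds.
points = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.0, 5.2), (9.0, 0.0), (9.0, 0.4)]
seeds = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]

means, sizes = preliminary_clusters(points, seeds)
print(means)  # the much smaller input for the hierarchical step
print(sizes)  # number of original observations behind each mean
```

The hierarchical step then works on three cluster means instead of six observations; on a large data set the reduction is what makes the quadratic or cubic cost affordable.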
With additional options, PROC CLUSTER can display a history of the clustering process, showing statistics useful for estimating the number of clusters in the population from which the data are sampled.
II. CLUSTER Procedure
The CLUSTER procedure has the
following syntax available in SAS.
PROC CLUSTER METHOD=name <options>;
   BY variables;
   COPY variables;
   FREQ variables;
   ID variables;
   RMSSTD variables;
   VAR variables;
In the syntax above, the uppercase words are SAS keywords, while the lowercase words are the names of parameters supplied to the procedure. Only the PROC CLUSTER statement is required, except that the FREQ statement is required when the RMSSTD statement is used; otherwise the FREQ statement is optional.
The BY statement is used to obtain separate analyses of observations in groups that are defined by the BY variables.
The variables in the COPY
statement are copied from the input
data set to the OUTTREE= data set.
Observations in the OUTTREE= data
set that represent clusters of more
than one observation from the input
data set have missing values for the
COPY variables.
If one variable in the input data set represents the frequency of occurrence for the other values in the observation, specify the variable’s name in a FREQ statement. PROC CLUSTER then treats the data set as if each observation appeared n times, where n is the value of the FREQ variable for the observation.
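What "as if each observation appeared n times" means can be checked with a short Python sketch. The rows, the frequencies, and the function names are hypothetical; the sketch only shows that a frequency-weighted statistic (here the mean) agrees with the same statistic computed on the physically repeated rows.

```python
def expand_by_freq(rows, freqs):
    """Repeat each row as many times as its frequency value says."""
    expanded = []
    for row, n in zip(rows, freqs):
        expanded.extend([row] * n)
    return expanded


def weighted_mean(rows, freqs):
    """Column means with each row weighted by its frequency."""
    total = sum(freqs)
    return tuple(
        sum(n * v for v, n in zip(col, freqs)) / total for col in zip(*rows)
    )


# Hypothetical data: two distinct observations, the first seen three times.
rows = [(1.0, 2.0), (3.0, 4.0)]
freqs = [3, 1]

expanded = expand_by_freq(rows, freqs)
mean_expanded = tuple(sum(col) / len(expanded) for col in zip(*expanded))
mean_weighted = weighted_mean(rows, freqs)

print(mean_expanded, mean_weighted)
```

Storing a frequency column instead of the repeated rows keeps the data set small without changing the statistics computed from it.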
The values of the ID variable identify observations in the displayed cluster history and in the OUTTREE= data set. If the ID statement is omitted, each observation is denoted by OBn, where n is the observation number.
RMSSTD stands for root mean square standard deviation. If the coordinates in the DATA= data set represent cluster means, accurate statistics can be obtained in the cluster histories for METHOD=AVERAGE, METHOD=CENTROID, or METHOD=WARD if the data set contains both of the following:
- a variable giving the number of original observations in each cluster
- a variable giving the root mean square standard deviation of each cluster
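The two summary variables listed above can be computed for a preliminary cluster as in the following Python sketch. It assumes the usual reading of the term, namely the square root of the average of the per-variable sample variances; the exact divisor used by SAS is not stated in this document, so the formula, the function name `rmsstd`, and the sample cluster should all be treated as assumptions.

```python
from math import sqrt
from statistics import variance


def rmsstd(cluster):
    """Root mean square standard deviation of one cluster.

    Assumption: square root of the mean of the per-variable sample variances.
    """
    cols = list(zip(*cluster))
    return sqrt(sum(variance(c) for c in cols) / len(cols))


# Hypothetical cluster of three observations on two variables.
cluster = [(1.0, 10.0), (3.0, 14.0), (5.0, 18.0)]

summary = {
    "freq": len(cluster),                                         # number of original observations
    "mean": tuple(sum(c) / len(cluster) for c in zip(*cluster)),  # cluster mean (the coordinates)
    "rmsstd": rmsstd(cluster),                                    # spread within the cluster
}
print(summary)
```

A record of this kind carries the cluster mean as the coordinates, the count as the frequency variable, and the spread as the RMSSTD variable.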
The VAR statement lists the numeric variables to be used in the cluster analysis. If it is omitted, all numeric variables not listed in other statements are used.
III. Clustering Methods
There are eleven clustering methods provided in SAS. The method of average linkage (METHOD = AVERAGE or AVG) computes the average distance between pairs of observations, one in each cluster. Average linkage tends to join clusters with small variances, and it is slightly biased toward producing clusters with the same variance.
In the centroid method (METHOD = CENTROID), the distance between two clusters is defined as the (squared) Euclidean distance between their centroids or means. This method is more robust to outliers than most other hierarchical methods but in other respects might not perform as well as Ward’s method or average linkage.
The complete linkage method (METHOD = COMPLETE) uses the maximum distance between an observation in one cluster and an observation in the other cluster as the distance between two clusters. It is strongly biased toward producing clusters with roughly equal diameters, and it can be severely distorted by moderate outliers.
Density linkage (METHOD = DENSITY) refers to a class of clustering methods that use nonparametric probability density estimates. It consists of two steps:
- A new dissimilarity measure based on density estimates and adjacencies is computed.
- A single linkage cluster analysis is performed.
There are three types of density
linkage: the kth-nearest-neighbor
method, the uniform-kernel method
and Wong’s hybrid method.
The EML method (METHOD = EML) joins clusters to maximize the likelihood at each level of the hierarchy under the following assumptions:
- multivariate normal mixture
- equal spherical covariance matrices
- unequal sampling probabilities
This method is similar to Ward’s minimum-variance method but removes the bias toward equal-sized clusters.
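The distance computed by the flexible-beta method (METHOD = FLEXIBLE) is given by
$$D_{JM} = (D_{JK} + D_{JL}) \frac{1 - b}{2} + D_{KL} \, b$$
where clusters K and L are merged to form cluster M, J is any other cluster, and b is the value of the BETA= option, or -0.25 by default.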
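The flexible-beta update can be evaluated directly, as in the Python sketch below. The function name `flexible_beta` and the numeric distances are hypothetical. The arguments follow the usual reading of the subscripts: `d_jk` and `d_jl` are the distances from another cluster J to the two clusters K and L being merged, `d_kl` is the distance between K and L, and the result is the distance from J to the merged cluster M.

```python
def flexible_beta(d_jk, d_jl, d_kl, b=-0.25):
    """Distance from cluster J to the cluster formed by merging K and L."""
    return (d_jk + d_jl) * (1 - b) / 2 + d_kl * b


# Hypothetical distances before the merge.
d_jm_default = flexible_beta(4.0, 6.0, 2.0)        # b = -0.25, the default
d_jm_zero = flexible_beta(4.0, 6.0, 2.0, b=0.0)    # b = 0

print(d_jm_default)  # 10 * 1.25 / 2 - 0.5 = 5.75
print(d_jm_zero)     # reduces to the plain average of d_jk and d_jl: 5.0
```

With b = 0 the expression reduces to the simple average of the two old distances, which shows how the BETA= value moves the method away from that baseline.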
McQuitty’s similarity analysis (METHOD = MCQUITTY) was independently developed by Sokal and Michener and by McQuitty. The median method (METHOD = MEDIAN) was developed by Gower in 1967.
In single linkage (METHOD = SINGLE), the distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster.
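The between-cluster distances used by single, complete, and average linkage and by the centroid method, as described in this section, can be compared side by side in a short Python sketch. The function names and the two sample clusters are hypothetical, and plain (unsquared) Euclidean distance is assumed for the three linkage rules, while the centroid rule returns the squared distance between the means.

```python
from itertools import product
from math import dist


def single_linkage(a, b):
    """Minimum distance between an observation in a and one in b."""
    return min(dist(x, y) for x, y in product(a, b))


def complete_linkage(a, b):
    """Maximum distance between an observation in a and one in b."""
    return max(dist(x, y) for x, y in product(a, b))


def average_linkage(a, b):
    """Average distance over all pairs, one observation from each cluster."""
    return sum(dist(x, y) for x, y in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    """Mean of the observations in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def centroid_distance(a, b):
    """Squared Euclidean distance between the two cluster means."""
    return dist(centroid(a), centroid(b)) ** 2


# Hypothetical clusters.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]

d_single = single_linkage(a, b)
d_complete = complete_linkage(a, b)
d_average = average_linkage(a, b)
d_centroid = centroid_distance(a, b)
print(d_single, d_average, d_complete, d_centroid)
```

For any pair of clusters the average-linkage distance lies between the single-linkage and complete-linkage distances, since it averages the same set of pairwise distances of which the other two take the minimum and the maximum.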
The option METHOD = TWOSTAGE is a modification of density linkage that ensures that all points are assigned to modal clusters before the modal clusters are permitted to join.
Ward’s minimum-variance method (METHOD = WARD) computes the distance between two clusters as the ANOVA sum of squares between the two clusters added up over all the variables.
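The between-cluster sum of squares can be computed in two equivalent ways, as the Python sketch below illustrates. The function names and sample clusters are hypothetical. The first function takes the increase in the within-cluster sum of squares caused by the merge; the second uses the standard closed form, the squared distance between the cluster means divided by (1/N_K + 1/N_L), which is a known identity rather than something stated in the text above.

```python
def centroid(cluster):
    """Mean of the observations in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def within_ss(cluster):
    """Sum of squared deviations from the cluster mean, over all variables."""
    m = centroid(cluster)
    return sum((v - mv) ** 2 for row in cluster for v, mv in zip(row, m))


def ward_distance(a, b):
    """Between-cluster sum of squares: growth of the within-cluster SS on merging."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)


def ward_closed_form(a, b):
    """Same quantity from the cluster means and sizes only."""
    ma, mb = centroid(a), centroid(b)
    sq = sum((x - y) ** 2 for x, y in zip(ma, mb))
    return sq / (1 / len(a) + 1 / len(b))


# Hypothetical clusters (lists, so that a + b is the merged cluster).
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]

d_ward = ward_distance(a, b)
d_closed = ward_closed_form(a, b)
print(d_ward, d_closed)
```

The closed form needs only the cluster means and sizes, so the distance can be updated after each merge without revisiting the original observations.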