MCLUST is a software package for cluster analysis based on... Gaussian mixture models.

advertisement
MCLUST is a software package for cluster analysis based on parameterized
Gaussian mixture models.
MCLUST is written in Fortran with an interface to the S-PLUS commercial
package (http://www.mathsoft.com/splus). Commercial S-PLUS includes two
functions `mclust' and `mclass', which have been superseded by functions
`mhtree' and `mhclass' in MCLUST. In addition, MCLUST a great many other
functions for cluster analysis (see below).
copyright 1996, 1998 Department of Statistics, University of Washington
funded by ONR contracts N00014-96-1-0192 and N00014-96-1-0330
Permission granted for unlimited redistribution for non-commercial use
only.
Please report all anomalies to fraley@stat.washington.edu.
References : (see http://www.stat.washington.edu/fraley/reports.shtml)
Banfield and Raftery, Biometrics 49:803-821, 1993.
DasGupta and Raftery, J. Amer. Stat. Assoc. 93:294-302, 1998.
(earlier version Tech. Rep. 295)
Fraley, SIAM J. on Scientific Computing 20(1):270-281, 1999.
(earlier version Tech. Rep. 311)
Fraley and Raftery, Tech. Rep. 329, U. of Washington Statistics, Feb.
1998.
(to appear in The Computer Journal,
special issue on Clustering and Classification)
Fraley and Raftery, Tech. Rep. 342, U. of Washington Statistics, Nov.
1998.
---------------------------------------------------------------------------Anyone can use MCLUST for noncommercial purposes, but in order to do so
without modification they must have access to the S-PLUS (version 3.3 or
higher).
The most up to date version of MCLUST can be obtained via anonymous ftp
from
ftp.u.washington.edu in the directory public/mclust. There is a gzipped
tar
file `MCLUST.tar.gz' for UNIX users, and a WinZip file `mclust.zip' for
Windows users.
-------------------------------------------------------------------------MCLUST Installation:
--------------------------------------------------------------------------
UNIX:
----The file MCLUST.tar.gz is a compressed version of a directory called
MCLUST
containing the MCLUST software and documentation. The commands for
extracting
this directory (and removing the tar file) are:
%gunzip MCLUST.tar.gz
%tar xvf MCLUST.tar
%rm MCLUST.tar
The following commands compile the Fortran code to create an object file
for dynamic or static loading into S-PLUS:
%cd MCLUST
%Splus COMPILE mclust.o
To statically load the object file into S-PLUS, move it to the working
directory and use the following command:
%Splus LOAD mclust.o
This creates a file called `local.Sqpe', a local version of S-PLUS that
includes the MCLUST Fortran functions. When S-PLUS is invoked in a
directory
containing `local.Sqpe', this local version of S-PLUS will be run.
Dynamic loading, which is available in S-PLUS on some but not all UNIX
platforms, may be preferable to static loading since `local.Sqpe' is a
large file.
The following commands put the S-PLUS functions in the MCLUST/.Data
directory:
%cd MCLUST
%cat mclust.S | Splus
Help files are located in MCLUST/.Data/.Help. You can access eveything
from
your S-PLUS session if you make MCLUST a subdirectory of your working
directory and invoke the command
> attach("MCLUST/.Data")
in S-PLUS. You can put this `attach' command in a `.First' function if
you
want the MCLUST S_PLUS functions to be attached automatically when you
enter
S-PLUS.
Windows:
--------
The file `mclust.zip' is a compressed version of a folder called `mclust'
containing the MCLUST documentation, S-PLUS functions, help files, and
WATCOM Fortran object code. Open `mclust.zip' to invoke WinZip, and then
extract the `mclust' folder. In what follows we assume the `mclust'
folder
has been placed in your working Splus directory.
You can put the S-PLUS functions for MCLUST into your working `_Data'
directory with the S-PLUS command:
> source("mclust/mclust.S")
Alternatively, you may want to place them in a separate directory to and
use
the `attach' command in S-PLUS to access them, to ensure that you don't
change them by accident.
Dynamically load the object code `mclust.obj' with the S-PLUS command:
> dyn.load("mclust/mclust.obj")
This command needs to be executed each time you enter S-PLUS. You may
want to include this command in a `.First' file if you plan to use MCLUST
every time you run S-PLUS in this directory.
The help files are located in the `mclust.HLP' archive.
-------------------------------------------------------------------------MCLUST Usage:
-------------------------------------------------------------------------To illustrate the use of MCLUST we use the `iris' data in S-PLUS for
which a description is available from the command line via `help(iris)'
or
from the help window under `datasets'. The iris data is a 3-dimensional
array,
giving 4 measurements on 50 flowers from each of 3 species of iris.
MCLUST requires 2-dimensional numeric data as input, so we collapse
the iris data into a matrix in which row dimension corresponds to the 150
individual flowers and the column dimension corresponds to the 4
measurements.
The species information has been discarded.
> iris.matrix <matrix(aperm(iris,c(1,3,2)),150,4,dimn=dimnames(iris)[1:2])
The function mhtree() for hierarchical clustering can be applied to this
data
to obtain a classification tree:
> cltree <- mhtree(iris.matrix)
The classification produced by mhtree() for various numbers of clusters
can be obtained via the function mhclass(). For example, to obtain the
classification for 2 through 4 clusters:
> cl <- mhclass(cltree, 2:4)
which produces a matrix whose columns represent the mhtree()
classification
for 2, 3, and 4 clusters, respectively.
A given classification can be plotted using the function clpairs():
> clpairs(iris.matrix, cl[,"2"])
> clpairs(iris.matrix, cl[,"3"])
plots the 3-cluster class from mhtree().
The number of dimensions may need to be reduced in order to get a good
display of the classification.
mhtree() will accept an initial coarse partition as input if one is
available.
In the above example, the default model (Gaussian mixtures with no
constraints on the means and covariance matrices of each component).
Other parameterizations of Gaussian mixtures are available in mhtree().
They differ as to whether certain geometric characteristics (volume,
shape,
and or orientation) are uniform among clusters.
For example, the following produces a classification tree for
hierarchical
clustering based on a spherical covariances with equal volumes
> cltree <- mhtree( iris.matrix, modelid = "EI")
> clpairs(iris.matrix, mhclass(cltree,2))
This is equivalent to the sum-of-squares method or Ward's method.
Other model options that are available for mhtree() are described in the
help file.
Classifications produced by mhtree() are often good but not necessairly
optimal. MCLUST provides iterative `EM' (Expectation-Maximization)
methods
for maximum likelihood clustering with parameterized Gaussian mixture
models.
`EM' iterates between an `E'-step, which computes a matrix `z' whose ikth
element is an estimate of the conditional probability that observation i
belongs to group k given the current parameter estimates, and an `Mstep',
which computes maximum-likelihood parameter estimates given $z$.
Basically, `z' can be thought of as a continuous representation of a
classification; `z' has a row for each observation and an column for each
group, and its entries entries of `z' are lie in the interval [0,1],
with each row summing to 1.
In limit, the parameters converge to the maximum likelihood values
for the Gaussian mixture model, and the sums of the columns of the
`z' converge to `n' times the mixing proportions, where `n' is the
number of observations in the data.
MCLUST provides a function me() that does the M-step followed by the
E-step for EM. Given the data, initial estimates for the conditional
probabilities `z', and the model specification, it produces the
conditional
probabities associated with the maximum likelihood parameters.
Initial estimates of `z' may be obtained from a discrete classification.
For example, me() can be used to `refine' a classification produced by
mhtree():
> cltree <- mhtree( iris.matrix, modelid = "VI") # non-uniform spherical
model
> cl <- mhclass(cltree, 3) # 3-group mhtree classification
> z <- me( iris.matrix, modelid = "VI", ctoz(cl)) # optimal z
The function ctoz() converts a discrete classification into
a `z' matrix that has only 0,1 entries with exactly one 1 per row.
In general, the models used in mhtree() and me() need not be the same.
It may be desirable to use one of the faster methods in mhtree()
(e.~g. spherical or unconstrained models), followed by specification of
a more complex model in EM.
To plot the classification produced by me(), do
> clpairs( iris.matrix, ztoc(z))
The function ztoc() converts conditional probabilties into a discrete
classification by classifying each observation into the class
corresponding
to the column in which the `z' value for that observation is maximized.
The uncertainty in the classification corresponding to `z' can be
obtained
as follows:
> uncer <- 1 - apply( z, 1, "max")
The Splus function quantile() applied to uncertainly gives a measure of
the
quality of the classification.
> quantile(uncer)
In this case the indication is that the majority of observations are well
classified. Note, however, that when groups intersect
uncertain classifications would be expected in the overlapping regions.
The function mstep() allows recovery of the parameters in the model:
> pars <- mstep( iris.matrix, modelid = "VI", z)
> pars
The means and standard deviation of the clusters (and optionally the
partition or classification) may be plotted using the function mixproj(),
which
> mixproj(iris.matrix, ms=pars, part=ztoc(z), dimens = c(1,2))
> mixproj(iris.matrix, ms=pars, part=ztoc(z), dimens = c(3,4))
It's also possible to compute the Bayesian Information Criterion, or
`BIC'
for a given data set, model, and conditional probability estimates.
> cltree <- mhtree(iris.matrix) # uses
> cl <- mhclass(cltree, 2:3)
> z <- me( iris.matrix, modelid = "EI",
spherical
> bic(iris.matrix, modelid = "EI", z)
> z <- me( iris.matrix, modelid = "EI",
> bic(iris.matrix, modelid = "EI", z)
> z <- me( iris.matrix, modelid = "VI",
> bic(iris.matrix, modelid = "VI", z)
> z <- me( iris.matrix, modelid = "VI",
> bic(iris.matrix, modelid = "VI", z)
default unconstrained model
ctoz(cl[,"2"])) # uniform
ctoz(cl[,"3"]))
ctoz(cl[,"2"])) # spherical
ctoz(cl[,"3"]))
The function unclass() applied to the bic value will reveal an attribute
labeled {\tt "rcond"} representing the reciprocal condition estimate
will,
which falls between 0 and 1.
The closer this estimate is to 0, the more likely it is that the
estimated
values are numerically inaccurate due to a covariance that is nearly
singular.
You should check this quantity if you are getting bic values that don't
seem to make sense.
Of these four possiblities (2 models x 2 numbers of groups) illustrated
above,
the best model and number of clusters is "VI" (spherical), with 3 groups.
MCLUST provides two functions, emclust1() and emclust(), for cluster
analysis
with BIC. The input to emclust1() is the data, desired numbers of groups,
and a pair of models, one to be used in the hierarchical clustering
phase,
and the other to be used in the EM phase:
> bicvals <- emclust1( iris.matrix, nclus = 1:6, modelid = c("VI",
"VVV"))
> bicvals
> plot(bicvals)
> sumry <- summary(bicvals, iris.matrix) # summary object for emclust1()
> sumry
The input to emclust() is data, desired numbers of groups, and a list of
models to apply in the EM phase (it uses the unconstrained model for
hierarchical clustering). By default, all available models are used:
>
>
>
>
>
bicvals <- emclust( iris.matrix, nclus = 1:6)
bicvals
plot(bicvals)
sumry <- summary(bicvals, iris.matrix) # summary object for emlust()
sumry
The best model among those fitted by emclust() is the uniform shape model
"VEV", with 2 clusters. The same model with 3 clusters has a BIC value
that is
not signifcantly different from the maximum, so that the conclusion is
that
there are either 2 or 3 clusters in the data. The 2 cluster EM result
separates the first species from the other two, while the 3 cluster
result
nearly separates the three species (there are 5 missclassifications out
of
150).
The summary function allows the summary to be taken over a subset of the
number of clusters (as well as models in the case of emclust()) for which
the analysis was originally run.
It returns a list whose components are the parameter and z values
for the best fit among the models selected according to BIC, as well
as the associated classification and its uncertainty.
For a complete analysis, it may be desirable to try different
permutations of
the observations, use different subsets of the observations, and/or
perturb
the data, to see if the classification remains stable.
Scaling or otherwise transforming the data may also affect the results.
One should always examine the data beforehand, in case the dimensions can
be reduced due to highly correlated variables.
The EM functions in MCLUST have the option of including a Poisson term
for
noise in the model. For me(), mstep(), estep() and bic(), use the `noise
= T'
option to use this model. The initial conditional probabilities for the
noise
term correspond to the last column of the conditional probabilities.
In emclust1() and emclust(), an initial estimate of the noise (in the
form of
a logical vector indicating which observations are to be considered
noise)
must be supplied in order to specify this model. The Splus function
NNclean()
in statlib is one option for estimating noise --- see Byers and Raftery,
tech rep no 305, U of Washington Statistics (April 1996) --- to appear in
JASA
in June 1998). To obtain NNclean, send a message of the form
`send nnclean from S' to statlib@lib.stat.cmu.edu.
The following reproduces the `chevron' minefield example from
Fraley and Raftery (1998) for which the data is included in this
distribution:
> chevron.nnclean <- NNclean(chevron, k = 5) # 5 nearest-neighbor
denoising
> chevron.noise <- !chevron.nnclean$z
> chev.bic <- emclust(chevron, noise = chevron.noise)
-----------------------------------------------------------------------------
Download