Tianxi's model-based clustering presentation

Model-based Clustering in R
Tianxi Dong
Theoretical model
Heuristic methods
• How many clusters do we need?
• How do we compare performance between methods?
• How do we deal with outliers in heuristic methods?
Solution???
Model-based Method
• Assume that the data come from a mixture of different probability models;
• Assign each of the N items to the distribution it most likely belongs to;
• Evaluate the clustering performance.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
Model-based Clustering
• We define the density of a mixture of g distributions as the weighted average of the g component densities.
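The formula itself did not survive on this slide. A standard way to write the weighted average is sketched below. It assumes G components (g above) with mixing proportions p_k, and Gaussian component densities f_k with mean m_k and covariance matrix Σ_k. The Gaussian form and the symbol Σ_k are assumptions here, chosen to match the parameters on the next slide.

```latex
f(x) = \sum_{k=1}^{G} p_k \, f_k(x;\, m_k, \Sigma_k),
\qquad p_k \ge 0, \qquad \sum_{k=1}^{G} p_k = 1
```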
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
Model-based Clustering
• Find the values of the parameters by maximizing the likelihood (usually the log of the likelihood) of the observations:
  $\max \log f(x_1, \dots, x_N)$ over $m_1, \dots, m_G$, $\Sigma_1, \dots, \Sigma_G$ and $p_1, \dots, p_G$
• where N is the number of observations
• This turns out to be a nonlinear mess; the Expectation-Maximization (EM) algorithm makes it tractable
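To make the EM idea concrete, here is a minimal sketch in base R for the simplest case: a mixture of two univariate Gaussians. It is not the algorithm as implemented in mclust. The simulated data, starting values, and stopping rule are all assumptions chosen for illustration.

```r
# EM for a two-component univariate Gaussian mixture (base R only).
# Simulated data: an illustrative assumption, not the deck's data set.
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1.5))
N <- length(x)

# Starting values for mixing proportions p, means m, standard deviations s
p <- c(0.5, 0.5)
m <- as.numeric(quantile(x, c(0.25, 0.75)))
s <- rep(sd(x), 2)

# Log-likelihood of the mixture at the current parameter values
mixLogLik <- function(x, p, m, s) {
  sum(log(p[1] * dnorm(x, m[1], s[1]) + p[2] * dnorm(x, m[2], s[2])))
}

logLikTrace <- mixLogLik(x, p, m, s)
for (iter in 1:200) {
  # E-step: posterior probability z[i, k] that x[i] came from component k
  dens <- cbind(p[1] * dnorm(x, m[1], s[1]), p[2] * dnorm(x, m[2], s[2]))
  z <- dens / rowSums(dens)

  # M-step: weighted updates of proportions, means and standard deviations
  nk <- colSums(z)
  p <- nk / N
  m <- colSums(z * x) / nk
  s <- sqrt(colSums(z * outer(x, m, "-")^2) / nk)

  # Stop when the log-likelihood no longer improves
  logLikTrace <- c(logLikTrace, mixLogLik(x, p, m, s))
  if (diff(tail(logLikTrace, 2)) < 1e-8) break
}

# Assign each observation to its most probable component
classification <- apply(z, 1, which.max)
```

Each iteration cannot decrease the log-likelihood, which is why watching `logLikTrace` is a reasonable convergence check in this sketch.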
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
Covariance Structure
• The covariance structure allows for a variety of constraints.
Example: model VVV, G = 5, P = 8: # of parameters = ?
• The best covariance structure is chosen based on BIC.
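The parameter-count question above can be worked out by hand. The sketch below assumes the usual reading of VVV as the fully unconstrained model, where each of the G components has its own mean vector and its own full covariance matrix, and P is the number of variables. The function name `countVVV` is hypothetical.

```r
# Count the free parameters of an unconstrained (VVV) Gaussian mixture.
# countVVV is a hypothetical helper written for this example.
countVVV <- function(G, P) {
  nProportions <- G - 1                # mixing proportions sum to 1
  nMeans       <- G * P                # one mean vector per component
  nCovariances <- G * P * (P + 1) / 2  # one symmetric P x P matrix per component
  c(proportions = nProportions,
    means       = nMeans,
    covariances = nCovariances,
    total       = nProportions + nMeans + nCovariances)
}

counts <- countVVV(G = 5, P = 8)
counts
# total = 4 + 40 + 180 = 224
```

The covariance matrices dominate the count, which is why constrained covariance structures can greatly reduce the number of parameters.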
Source: http://ms.mcmaster.ca/canty/seminars/paulmcnicholas.pdf
Overfitting
• Occurs when a model is excessively complex (too many parameters)
• An overfitted model has poor predictive performance
• [Figure: training error shown in blue, validation error in red]
Source: http://en.wikipedia.org/wiki/Overfitting
Recall: BIC
• $\mathrm{BIC} = 2\,\mathrm{loglik}_M(x, \theta) - (\#\,\mathrm{params})_M \log(N)$ (higher is better)
• $\mathrm{BIC} = -2\,\mathrm{loglik}_M(x, \theta) + (\#\,\mathrm{params})_M \log(N)$ (lower is better)
$\mathrm{loglik}_M(x, \theta)$: the maximized log-likelihood for the model and data
$(\#\,\mathrm{params})_M$: the number of independent parameters to be estimated in the model M
$N$: the number of observations in the data
The first form is the one used in Mclust.
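The two sign conventions are easy to mix up, so here is a small base R sketch that computes both for the simplest possible model: a single Gaussian with 2 parameters (mean and variance). The built-in `faithful` data are used purely for illustration. The cross-check against `BIC()` from the stats package relies on that function using the lower-is-better form.

```r
# BIC under both sign conventions for a single Gaussian fitted by maximum likelihood.
x <- faithful$eruptions   # illustrative data shipped with R
N <- length(x)

# Maximum-likelihood estimates of the mean and standard deviation
mu    <- mean(x)
sigma <- sqrt(mean((x - mu)^2))

loglik  <- sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
nParams <- 2              # mean and variance

bicHigherIsBetter <-  2 * loglik - nParams * log(N)  # the form used in Mclust
bicLowerIsBetter  <- -2 * loglik + nParams * log(N)  # the form used by stats::BIC

# Cross-check: an intercept-only linear model is the same Gaussian fit
bicFromStats <- BIC(lm(x ~ 1))
```

The two values differ only in sign, so the model ranking is identical. Only the direction of "better" changes.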
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
R Procedure – MCLUST
MCLUST Package
• MCLUST is probably the best-known software for model-based clustering in the literature.
• http://cran.r-project.org/web/packages/mclust/index.html
MCLUST Package Syntax
Mclust(data, G=NULL, modelNames=NULL, prior=NULL, warn=FALSE, ...)
Source: http://cran.r-project.org/web/packages/mclust/mclust.pdf
Arguments of Mclust
• G
  • An integer vector specifying the possible numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G=1:9.
• modelNames
  • A vector of character strings indicating the models to be fitted in the EM phase of clustering.
• prior
  • The default assumes no prior, but this argument allows the specification of a conjugate prior on the means and variances.
Source: http://cran.r-project.org/web/packages/mclust/mclust.pdf
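The arguments above might be combined as in the following sketch. It assumes the mclust package is installed and uses the built-in `iris` measurements rather than the deck's data set. The choice of `G = 1:5` and of the two model names is arbitrary.

```r
# Illustrative Mclust calls; iris is a stand-in for the deck's data.
library(mclust)

X <- iris[, 1:4]

# Restrict the search to 1..5 components and two covariance structures
fit <- Mclust(X, G = 1:5, modelNames = c("EEE", "VVV"))
fit$G          # number of components selected by BIC
fit$modelName  # covariance structure selected by BIC

# Same search with the default conjugate prior on the means and variances
fitPrior <- Mclust(X, G = 1:5, modelNames = c("EEE", "VVV"),
                   prior = priorControl())
```

Narrowing `G` and `modelNames` shortens the search, at the risk of excluding the model that BIC would otherwise prefer.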
Real Example
Dataset
• Same as Hal's dataset
1. GP_PER: gross profit %
2. ROA: return on assets
3. ROE: return on equity
4. SK_RET: % stock return
5. B_TO_M: book to market
6. NL_ASSETS: log of assets, for size control
7. NL_SALARY: log of CEO salary
8. NL_SALE: log of sales, for size control
Source: http://www.r-bloggers.com/r-tutorial-series-exploratory-factor-analysis/
Model-based Clustering – Mclust
• This procedure cannot handle missing data natively.
• If there are missing values, drop the incomplete rows first:
  • datam <- na.omit(data)   # data: the data set that contains missing values
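As a quick illustration of what `na.omit` does, the sketch below builds a tiny data frame with made-up values. The column names are borrowed from the variable list above, but the numbers are hypothetical. It then removes every row that contains a missing value.

```r
# Made-up values for illustration only; not taken from the deck's data set.
toy <- data.frame(
  ROA = c(0.05, NA, 0.12, 0.08),
  ROE = c(0.10, 0.15, NA, 0.20)
)

# Listwise deletion: keep only rows with no missing values
datam <- na.omit(toy)
nrow(datam)  # 2 (rows 1 and 4 remain)
```

Listwise deletion is simple but can discard a lot of data when missing values are spread across many rows.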
Model-based Clustering – Mclust
Posterior Probability
Classification using Model-based Clustering
• Discriminant analysis
• Tests the significance of a set of discriminant functions
• Categories are known before classification
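The deck does not show which function was used for this step. One option in the mclust package is `MclustDA`, sketched below on the built-in `iris` data, where the species labels play the role of the known categories. The train/test split and the seed are assumptions made for illustration.

```r
# Model-based discriminant analysis with mclust; iris is a stand-in data set.
library(mclust)

set.seed(123)
trainIdx <- sample(nrow(iris), 100)  # arbitrary 100/50 split

# Fit a Gaussian mixture within each known class
mod <- MclustDA(iris[trainIdx, 1:4], class = iris$Species[trainIdx])

# Classify the held-out observations
pred <- predict(mod, newdata = iris[-trainIdx, 1:4])

# Cross-tabulate predicted against actual classes
confusion <- table(Predicted = pred$classification,
                   Actual    = iris$Species[-trainIdx])
confusion
```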
Classification
BIC Plot
BIC Table
Means for each cluster
Conclusion
Hierarchical Clustering
• Distance-based
• Usually does not use covariance information
• Standardization matters
• Descriptive
Model-based Clustering
• Coordinate-based
• Uses covariance information
• Standardization doesn't matter
• Inferential
R Code

library(mclust)

data <- read.csv("compsetrex.csv")

# select the eight columns used for clustering
salary <- data[, c(1, 2, 3, 4, 5, 7, 8, 10)]

# fit the mixture model
salaryMclust <- Mclust(salary)

mysummary <- summary(salaryMclust)
# classification (cluster label for each observation)
mysummary$classification

# BIC plot and table (compute the BIC object before summarizing it)
salaryMclustBIC <- mclustBIC(salary)
salaryMclustBIC
plot(salaryMclustBIC)
BICSummary <- summary(salaryMclustBIC, data = salary)
BICSummary

# posterior probabilities
salaryMclust$z
# means for each cluster
salaryMclust$parameters$mean