Model-based Clustering in R
Tianxi Dong

Theoretical model

Heuristic methods
- How many clusters do we need?
- How do we compare performance between methods?
- How do we deal with outliers in heuristic methods?
- Solution?

Model-based Method
- Assume that the data come from a mixture of different probability models.
- Assign each of the N items to the distribution it most likely belongs to.
- Evaluate the clustering performance.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf

Model-based Clustering
We define the density of a mixture of G distributions as the weighted average
$$f(x) = \sum_{k=1}^{G} \pi_k \, f_k(x; \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1$$
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf

Model-based Clustering
Find the values of the parameters by maximizing the likelihood (usually the log of the likelihood) of the observations:
$$\max_{\mu_1, \dots, \mu_G,\; \Sigma_1, \dots, \Sigma_G,\; \pi_1, \dots, \pi_G} \log f(x_1, \dots, x_N)$$
where N is the number of observations.
This is a difficult nonlinear optimization problem; in practice it is solved with the Expectation-Maximization (EM) algorithm.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf

Covariance Structure
- The covariance structure allows for a variety of constraints.
- Example: model VVV, G = 5, p = 8. Number of parameters = ?
- The best covariance structure is decided based on BIC.
Source: http://ms.mcmaster.ca/canty/seminars/paulmcnicholas.pdf

Overfitting
- Occurs when a model is excessively complex (too many parameters).
- An overfitted model has poor predictive performance.
(Figure: training error is shown in blue, validation error in red.)
Source: http://en.wikipedia.org/wiki/Overfitting

Recall: BIC
$$\mathrm{BIC} = 2\,\mathrm{loglik}_M(x, \theta) - (\#\,\mathrm{params})_M \log(N) \quad \text{(higher is better)}$$
$$\mathrm{BIC} = -2\,\mathrm{loglik}_M(x, \theta) + (\#\,\mathrm{params})_M \log(N) \quad \text{(lower is better)}$$
- loglik_M(x, θ): the maximized log-likelihood for the model and data
- (# params)_M: the number of independent parameters to be estimated in model M
- N: the number of observations in the data
The first form is the one used in Mclust.
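The parameter-count question from the Covariance Structure slide and the two BIC conventions can be worked through in a few lines of base R. This is a minimal sketch, not part of the original slides: the function names are made up for illustration, and the log-likelihood and sample size are hypothetical values chosen only to show the arithmetic.

```r
# Number of free parameters in an unconstrained (VVV) Gaussian mixture
# with G components in p dimensions.
n_params_vvv <- function(G, p) {
  n_mix  <- G - 1                 # mixing proportions (they sum to 1)
  n_mean <- G * p                 # one mean vector per component
  n_cov  <- G * p * (p + 1) / 2   # one symmetric covariance matrix per component
  n_mix + n_mean + n_cov
}

# BIC in the two sign conventions shown on the slide.
bic_higher_is_better <- function(loglik, n_params, N) {
  2 * loglik - n_params * log(N)
}
bic_lower_is_better <- function(loglik, n_params, N) {
  -2 * loglik + n_params * log(N)
}

k <- n_params_vvv(G = 5, p = 8)
k  # 4 + 40 + 180 = 224

# Hypothetical maximized log-likelihood and sample size (illustration only)
loglik <- -1500
N <- 400
bic_higher_is_better(loglik, k, N)
bic_lower_is_better(loglik, k, N)
```

The two conventions differ only in sign, so they always rank models identically; what changes is whether the preferred model has the largest or the smallest value.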
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf

R Procedure – MCLUST

MCLUST Packages
MCLUST is probably the best-known model-based clustering technique in the literature.
http://cran.r-project.org/web/packages/mclust/index.html

MCLUST Packages: Syntax
Mclust(data, G=NULL, modelNames=NULL, prior=NULL, warn=FALSE, ...)
http://cran.r-project.org/web/packages/mclust/mclust.pdf

Parameters to define in Mclust
- G: An integer vector specifying the possible numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G=1:9.
- modelNames: A vector of character strings indicating the models to be fitted in the EM phase of clustering.
- prior: The default assumes no prior; a conjugate prior on the means and variances can be specified.
http://cran.r-project.org/web/packages/mclust/mclust.pdf

Real Example
Dataset: the same as Hal's dataset
1. GP_PER: gross profit %
2. ROA: return on assets
3. ROE: return on equity
4. SK_RET: % stock return
5. B_TO_M: book to market
6. NL_ASSETS: log of assets, for size control
7. NL_SALARY: log of CEO salary
8. NL_SALE: log of sales, for size control
Source: http://www.r-bloggers.com/r-tutorial-series-exploratory-factor-analysis/

Model-based Clustering – Mclust
This procedure cannot handle missing data natively.
If there are missing values, drop the incomplete rows first:
datam <- na.omit(dataset_with_missing_values)

Model-based Clustering – Mclust
Posterior Probability

Classification using Model-based Clustering

Discriminant analyses
- Test the significance of a set of discriminant functions
- Categories are known before classification

Classification

BIC Plot

BIC Table

Means for each cluster

Conclusion
Hierarchical Clustering                        Model-based Clustering
Distance-based                                 Coordinate-based
Usually does not use covariance information    Uses covariance information
Standardization matters                        Standardization doesn't matter
Descriptive                                    Inferential

R Code
library(mclust)

data <- read.csv("compsetrex.csv")
# keep the eight variables used for clustering
salary <- data[, c(1, 2, 3, 4, 5, 7, 8, 10)]

# fit the mixture models; the best model and number of clusters are chosen by BIC
salaryMclust <- Mclust(salary)
mysummary <- summary(salaryMclust)

# classification (cluster assignment of each observation)
mysummary$classification

# BIC plot and matrix (the BIC object must be created before it is summarized)
salaryMclustBIC <- mclustBIC(salary)
salaryMclustBIC
plot(salaryMclustBIC)
BICSummary <- summary(salaryMclustBIC, data = salary)
BICSummary

# posterior probability of each observation belonging to each cluster
salaryMclust$z

# mean matrix (one column of means per cluster)
salaryMclust$parameters$mean
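The link between the posterior probabilities and the classification can be illustrated without fitting a model. The sketch below uses a small, made-up posterior matrix (not output from the salary data) and assumes the usual rule that each observation is assigned to the cluster with the highest posterior probability; one minus that maximum gives a simple measure of how uncertain the assignment is.

```r
# Hypothetical posterior probabilities for 4 observations and 3 clusters
# (each row sums to 1); in the slides this role is played by salaryMclust$z.
z <- matrix(c(0.90, 0.08, 0.02,
              0.10, 0.30, 0.60,
              0.45, 0.50, 0.05,
              0.01, 0.01, 0.98),
            ncol = 3, byrow = TRUE)

# Assign each observation to its most probable cluster
classification <- apply(z, 1, which.max)
classification  # 1 3 2 3

# Uncertainty of each assignment: 1 minus the largest posterior probability
uncertainty <- 1 - apply(z, 1, max)
uncertainty     # 0.10 0.40 0.50 0.02
```

Observations such as the third one, where two clusters have nearly equal posterior probability, are the ones worth inspecting before interpreting the cluster means.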