Tianxi's model-based clustering presentation

Model-based Clustering in R
Tianxi Dong

Theoretical model

Heuristic methods
- How many clusters do we need?
- How do we compare performance between methods?
- How do we deal with outliers in heuristic methods?
Solution?

Model-based Method
- Assume that the data come from a mixture of different probability models;
- Assign each of the N items to the distribution it most likely belongs to;
- Evaluate clustering performance with a likelihood-based criterion (BIC).
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf

Model-based Clustering
- We define the density of a mixture of G distributions as the weighted average
  $$f(x) = \sum_{g=1}^{G} p_g \, f_g(x \mid m_g, \Sigma_g)$$
  where p_g is the mixing weight of component g (the weights are nonnegative and sum to 1), and m_g and \Sigma_g are its mean and covariance.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf

Model-based Clustering
- Find the values of the parameters by maximizing the likelihood (usually the log of the likelihood) of the observations:
  $$\max_{m_1, \dots, m_G,\; \Sigma_1, \dots, \Sigma_G,\; p_1, \dots, p_G} \log f(x_1, \dots, x_N)$$
- where N is the number of observations.
- This is a nonlinear optimization problem with no closed-form solution; it is greatly aided by the Expectation-Maximization (EM) algorithm.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf

Covariance Structure
- The covariance structure allows for a variety of constraints.
- VVV (volume, shape, and orientation all vary between clusters), G = 5 clusters, P = 8 variables: # of parameters = ?
- The best covariance structure is chosen based on BIC.
Source: http://ms.mcmaster.ca/canty/seminars/paulmcnicholas.pdf

Overfitting
- Occurs when a model is excessively complex (too many parameters)
- Overfitted models have poor predictive performance
- [Figure: training error shown in blue, validation error in red]
http://en.wikipedia.org/wiki/Overfitting

Recall: BIC
- $$\mathrm{BIC} = 2\,\mathrm{loglik}_M(x, \theta) - (\#\,\mathrm{params})_M \log(N)$$ (higher is better)
- $$\mathrm{BIC} = -2\,\mathrm{loglik}_M(x, \theta) + (\#\,\mathrm{params})_M \log(N)$$ (lower is better)
loglik_M(x, θ): the maximized log-likelihood for the model M and the data
(# params)_M: the number of independent parameters to be estimated in the model M
N: the number of observations in the data
The first form is the one used in Mclust.
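The parameter count asked on the Covariance Structure slide and the two BIC forms can be worked through numerically. The following is a minimal base-R sketch, not part of the original slides: the helper names `n_params_vvv`, `bic_higher_better`, and `bic_lower_better` are hypothetical, and the log-likelihood and sample size are made-up values used only to show that the two forms differ by sign.

```r
# Free parameters of an unconstrained (VVV) Gaussian mixture
# with G components in P dimensions.
n_params_vvv <- function(G, P) {
  mixing <- G - 1                # mixing proportions sum to 1
  means  <- G * P                # one mean vector per component
  covs   <- G * P * (P + 1) / 2  # one symmetric P x P covariance per component
  mixing + means + covs
}

# BIC in the "higher is better" form (the form the slides attribute to Mclust).
bic_higher_better <- function(loglik, n_params, N) {
  2 * loglik - n_params * log(N)
}

# BIC in the "lower is better" form: the same quantity with the sign flipped.
bic_lower_better <- function(loglik, n_params, N) {
  -2 * loglik + n_params * log(N)
}

k <- n_params_vvv(G = 5, P = 8)
k  # 4 + 40 + 180 = 224

# Hypothetical values, for illustration only.
loglik <- -1500
N <- 400
bic_higher_better(loglik, k, N)
bic_lower_better(loglik, k, N)
```

With 224 free parameters the penalty term grows quickly, which is why constrained covariance structures can win on BIC even when VVV has the higher likelihood.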
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf

R Procedure – MCLUST

MCLUST Packages
- MCLUST is probably the best-known model-based clustering technique in the literature.
- http://cran.r-project.org/web/packages/mclust/index.html

MCLUST Packages: Syntax
Mclust(data, G=NULL, modelNames=NULL, prior=NULL, warn=FALSE, ...)
http://cran.r-project.org/web/packages/mclust/mclust.pdf

Parameters to define Mclust
- G: An integer vector specifying the possible numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G=1:9.
- modelNames: A vector of character strings indicating the models to be fitted in the maximization phase of clustering.
- prior: The default assumes no prior; optionally, a conjugate prior on the means and variances can be specified.
http://cran.r-project.org/web/packages/mclust/mclust.pdf

Real Example Dataset
- Same as Hal's dataset
1. GP_PER: gross profit %
2. ROA: return on assets
3. ROE: return on equity
4. SK_RET: % stock return
5. B_TO_M: book-to-market ratio
6. NL_ASSETS: log of assets, for size control
7. NL_SALARY: log of CEO salary
8. NL_SALE: log of sales, for size control
Source: http://www.r-bloggers.com/r-tutorial-series-exploratory-factor-analysis/

Model-based Clustering – Mclust
- This procedure cannot handle missing data natively.
- If there are missing values, drop the incomplete rows first: datam <- na.omit(data), where data is the data frame that contains the missing values.

Model-based Clustering – Mclust
[Figure: Posterior Probability]

Classification using Model-based Clustering
- Discriminant analyses
- Test significance of a set of discriminant functions
- Categories are known before classification

[Figure: Classification]
[Figure: BIC Plot]
[Figure: BIC Table]
[Figure: Means for each cluster]

Conclusion

Hierarchical Clustering                      Model-based Clustering
Distance-based                               Coordinate-based
Usually does not use covariance information  Uses covariance information
Standardization matters                      Standardization doesn't matter
Descriptive                                  Inferential

R Code

    library(mclust)

    # Load the data and keep the eight variables used for clustering
    data <- read.csv("compsetrex.csv")
    salary <- data[, c(1, 2, 3, 4, 5, 7, 8, 10)]

    # Fit the mixture model; Mclust selects G and the covariance structure by BIC
    salaryMclust <- Mclust(salary)
    mysummary <- summary(salaryMclust)

    # Classification (cluster label for each observation)
    mysummary$classification

    # BIC plot and table (the BIC object must be created before it is summarized)
    salaryMclustBIC <- mclustBIC(salary)
    salaryMclustBIC
    plot(salaryMclustBIC)
    BICSummary <- summary(salaryMclustBIC, data = salary)
    BICSummary

    # Posterior probability of each observation belonging to each cluster
    salaryMclust$z

    # Mean matrix (one column of means per cluster)
    salaryMclust$parameters$mean
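The posterior probabilities in `z` and the cluster means reported by the R code above are the products of the EM iterations mentioned in the theory slides. As a rough illustration only, the sketch below runs EM by hand in base R for a one-dimensional, two-component Gaussian mixture. It is not how mclust is implemented, the data are simulated rather than taken from the presentation's dataset, and all variable names (`pro`, `mu`, `sigma`, `loglik_trace`) are assumptions made for this example.

```r
set.seed(1)
# Simulated data from two hypothetical groups (not the salary dataset).
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

G <- 2
# Crude starting values for mixing proportions, means, and standard deviations.
pro   <- rep(1 / G, G)
mu    <- as.numeric(quantile(x, c(0.25, 0.75)))
sigma <- rep(sd(x), G)

loglik_trace <- numeric(0)
for (iter in 1:200) {
  # E-step: weighted component densities, normalised per row to give z,
  # the posterior probability that observation i belongs to component g.
  dens <- sapply(1:G, function(g) pro[g] * dnorm(x, mu[g], sigma[g]))
  z <- dens / rowSums(dens)
  loglik_trace <- c(loglik_trace, sum(log(rowSums(dens))))

  # M-step: update the parameters using z as observation weights.
  n_g   <- colSums(z)
  pro   <- n_g / length(x)
  mu    <- colSums(z * x) / n_g
  sigma <- sqrt(colSums(z * outer(x, mu, "-")^2) / n_g)

  # Stop once the log-likelihood has essentially stopped increasing.
  if (iter > 1 && abs(diff(tail(loglik_trace, 2))) < 1e-8) break
}

# Hard classification: assign each observation to its most probable component.
classification <- apply(z, 1, which.max)
table(classification)
round(mu, 2)
```

Each row of `z` sums to 1, and taking the largest entry per row gives the hard classification, which mirrors the relationship between `salaryMclust$z` and the classification vector in the Mclust output.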
