Expectation Maximization (EM) – Basic Notions [Dempster, Laird & Rubin 1977]

Consider points p = (x_1, ..., x_d) from a d-dimensional Euclidean vector space.
Clusters are represented by probability distributions, typically a mixture of Gaussian distributions.

Representation of a cluster C:
- Center point \mu_C of all points in the cluster
- d \times d covariance matrix \Sigma_C for the points in the cluster C

Density function for cluster C:

P(x \mid C) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma_C|}} \; e^{-\frac{1}{2} (x - \mu_C)^T \Sigma_C^{-1} (x - \mu_C)}

WS 2003/04    Data Mining Algorithms    6 – 126

EM: Gaussian Mixture – 2D Examples

[Figure: two 2D density plots — a Gaussian mixture model with k = 3, and a single Gaussian density function]

EM – Basic Notions

Density function for a clustering M = {C_1, ..., C_k}. Let W_i denote the fraction of cluster C_i in the entire data set D:

P(x) = \sum_{i=1}^{k} W_i \cdot P(x \mid C_i)

Assignment of points to clusters: a point may belong to several clusters, with different probabilities

P(C_i \mid x) = W_i \cdot \frac{P(x \mid C_i)}{P(x)}

Maximize E(M), a measure for the quality of a clustering M; E(M) indicates the probability that the data D have been generated by following the distribution model M:

E(M) = \sum_{x \in D} \log(P(x))

EM – Algorithm

ClusteringByExpectationMaximization(point set D, int k)
    Generate an initial model M' = (C_1', ..., C_k')
    repeat
        // (re-)assign points to clusters
        Compute P(x | C_i), P(x) and P(C_i | x)
            for each object x from D and each cluster (= Gaussian) C_i
        // (re-)compute the models
        Compute a new model M = {C_1, ..., C_k} by recomputing
            W_i, \mu_i and \Sigma_i for each cluster C_i
        Replace M' by M
    until |E(M) - E(M')| < \varepsilon
    return M

EM – Recomputation of Parameters

Weight W_i of cluster C_i, center \mu_i of cluster C_i, and shape (covariance) matrix \Sigma_i of cluster C_i:
W_i = \frac{1}{n} \sum_{x \in D} P(C_i \mid x)

\mu_i = \frac{\sum_{x \in D} x \cdot P(C_i \mid x)}{\sum_{x \in D} P(C_i \mid x)}

\Sigma_i = \frac{\sum_{x \in D} P(C_i \mid x) \, (x - \mu_i)(x - \mu_i)^T}{\sum_{x \in D} P(C_i \mid x)}

EM – Discussion

- Convergence to a (possibly local) maximum of E(M)
- Computational effort: O(n * |M| * #iterations); the number of iterations is quite high in many cases
- Both result and runtime strongly depend on
  - the initial assignment
  - a proper choice of the parameter k (= desired number of clusters)
- Modification to obtain a truly partitioning variant: objects may belong to several clusters, so assign each object to the cluster to which it belongs with the highest probability

Clustering (II) – Summary

- Advanced Clustering Topics: Incremental Clustering, Generalized DBSCAN, Outlier Detection
- Scaling Up Clustering Algorithms: BIRCH, Data Bubbles, Index-based Clustering, GRID Clustering
- Sets of Similarity Queries (Similarity Join)
- Subspace Clustering
- Expectation Maximization: a statistical approach
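The EM algorithm and the recomputation formulas above can be sketched in NumPy as follows. This is a minimal illustration, not the original course material: the helper names (`gaussian_density`, `em_clustering`), the regularization term, and the synthetic two-blob data set are assumptions added for the example.

```python
import numpy as np

def gaussian_density(X, mu, sigma):
    """P(x | C): multivariate Gaussian density, evaluated for each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(sigma)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu) for every point at once
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    return np.exp(expo) / norm

def em_clustering(X, k, eps=1e-6, max_iter=200, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Initial model M': k data points as centers, unit covariances, uniform weights
    mu = X[rng.choice(n, k, replace=False)]
    sigma = np.array([np.eye(d) for _ in range(k)])
    w = np.full(k, 1.0 / k)
    e_old = -np.inf
    for _ in range(max_iter):
        # (re-)assign: compute P(x|Ci), P(x) and P(Ci|x)
        pxc = np.column_stack([gaussian_density(X, mu[i], sigma[i]) for i in range(k)])
        px = pxc @ w                        # P(x) = sum_i W_i * P(x|Ci)
        resp = pxc * w / px[:, None]        # P(Ci|x)
        # (re-)compute the model: W_i, mu_i, Sigma_i
        sums = resp.sum(axis=0)             # sum_x P(Ci|x)
        w = sums / n                        # W_i = (1/n) sum_x P(Ci|x)
        mu = (resp.T @ X) / sums[:, None]   # weighted means
        for i in range(k):
            diff = X - mu[i]
            sigma[i] = (resp[:, i, None] * diff).T @ diff / sums[i]
            sigma[i] += 1e-6 * np.eye(d)    # small ridge for numerical stability
        # E(M) = sum_x log P(x); stop when |E(M) - E(M')| < eps
        e_new = np.log(px).sum()
        if abs(e_new - e_old) < eps:
            break
        e_old = e_new
    return w, mu, sigma, resp

# Illustrative data: two well-separated 2D Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
w, mu, sigma, resp = em_clustering(X, k=2)
# "Partitioning variant" from the discussion slide: pick the
# cluster with the highest probability P(Ci|x) for each object
labels = resp.argmax(axis=1)
```

Note that `resp` holds the soft assignments P(C_i | x), so each row sums to 1; the final `argmax` is exactly the modification described above for obtaining a hard partitioning.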