A Spectral Algorithm for Learning Mixtures of Spherical Gaussians

by

Grant J. Wang

B.S. in Computer Science, Cornell University (2001)

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

August 29, 2003

© MMIII Massachusetts Institute of Technology. All rights reserved. The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, August 29, 2003

Certified by: Santosh Vempala, Associate Professor of Mathematics, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

A Spectral Algorithm for Learning Mixtures of Spherical Gaussians

by Grant J. Wang¹

Submitted to the Department of Electrical Engineering and Computer Science on August 29, 2003, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electrical Engineering and Computer Science

ABSTRACT

Mixtures of Gaussians are a fundamental model in statistics and learning theory, and the task of learning such mixtures is a problem of both practical and theoretical interest. While there are many notions of learning, in this thesis we consider the following one: given $m$ points sampled from the mixture, classify the points according to the Gaussian from which they were sampled. While there are algorithms for learning mixtures of Gaussians used in practice, such as the EM algorithm, very few provide any guarantees; for instance, it is known that the EM algorithm can fail to converge. Recently, polynomial-time algorithms have been developed that provably learn mixtures of spherical Gaussians when assumptions are made on the separation between the Gaussians. A spherical Gaussian distribution is a Gaussian distribution in which the variance is the same in every direction. In this thesis, we develop a polynomial-time algorithm to learn a mixture of $k$ spherical Gaussians in $\mathbb{R}^n$ when the separation depends polynomially on $k$ and $\log n$. Previous works provided algorithms that learn mixtures of Gaussians when the separation between the Gaussians grows polynomially in $n$. We also show how the algorithm can be modified slightly to learn, in exponential time, mixtures of spherical Gaussians when the separation is essentially the least possible. The main tools we use to develop this algorithm are the singular value decomposition and distance concentration. While distance concentration has been used to learn Gaussians in previous works, the application of the singular value decomposition is a new technique that we believe can be applied to learning mixtures of non-spherical Gaussians.

Thesis Supervisor: Santosh Vempala
Title: Associate Professor of Mathematics

¹Supported in part by National Science Foundation Career award CCR-987024 and an NTT Fellowship.

Acknowledgments

When I came to MIT two years ago, I was without an advisor. I was unclear as to whether this was normal, and I spent much of my first semester worrying about this. I am lucky that Santosh took me in as a student, and didn't rush me, gently leading me along my first research experience.
For this, the endless support, and teaching me so much, I am grateful. Thanks to everybody on the third floor for sharing many meals, laughs, conversations, and crossword puzzles. Two things stand out - the 313 and the intramural volleyball team. Chris, Abhi, and Sofya have been great officemates, even though I'm sure I've done less work because of them. And who knew that my first intramural championship ever would be the result of playing volleyball with computer science graduate students and my advisor? As a residence, 21R has been good to me - the guys have always been there to make sure I was having enough fun. Wing deserves a special thanks, for being something like a surrogate mom and cooking me several meals. My parents and sister, of course, deserve the most thanks, for always being there and supporting my interests, regardless of what they were. 3 Chapter 1 Introduction The Gaussian (or Normal) distribution is a staple distribution of science and engineering. It is slightly misnamed, because the first use of it dates to de Moivre in 1733, who used it to approximate binomial distributions for large n [11]. Gauss himself first used it in 1809 in the analysis of errors of experiments. In modern times, the distribution is used in diverse fields to model various phenomena. Why the Gaussian distribution is used in practice can be linked to the Central Limit Theorem, which states that the mean of a set S of identically distributed, independent random variables with finite mean and variance approaches a Gaussian random variable as the size of the set S goes to infinity. However, the assumption that the data comes from a Gaussian distribution can often be one made out of convenience; as Lippman stated: Everybody believes in the exponential law of errors: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation. Regardless, the Gaussian distribution certainly is a distribution of both practical and theoretical interest. One common assumption that is made by practitioners is that data samples come from a mixture of Gaussian distributions. A mixture of Gaussian distributions is a distribution in which a sample is picked according to the following procedure: the ith Gaussian F is picked with probability wi, and a sample is chosen according to F. The use of mixture of Gaussians is prevalent across many fields such as medicine, geology, and astrophysics [12]. Here, we briefly touch upon one of the uses in medicine described in [12]. In a typical scenario, a patient suffering from some unknown disease is subjected to numerous clinical tests, and data from each test is recorded. Each disease will affect the results of the clinical tests in different ways. In this way, each disease induces some sort of distribution on clinical test data. The results of a particular patient's clinical test data is then a sample from this distribution. As noted in [12], this distribution is assumed to be a Gaussian. Given many different samples (i.e. clinical test data from patients), one task would be to classify the samples according to which distribution they came from, i.e. cluster the patients according to their diseases. 4 In this thesis, we investigate a special case of this easily stated problem: given points sampled from a mixture of Gaussian distributions, learn the mixture of Gaussians. The term learn is intentionally vague; there are several notions of learning, each slightly different. 
The notion we consider here is that of classification: given $m$ points in $\mathbb{R}^n$, determine for each point the distribution from which it was sampled. Another popular notion is that of maximum-likelihood learning. Here, we are asked to compute the underlying properties of the mixture of Gaussians: the mixing weights as well as the mean and variance of each underlying distribution. When $n \le 3$, this problem is easy in the sense that the points can be plotted in $n$ dimensions (line, plane, $\mathbb{R}^3$), and the human eye can easily classify the points, as long as the distributions do not overlap too much. In figure 1-1, we see two Gaussians in $\mathbb{R}^2$ which are easily discernible. However, when $n \ge 4$, the structure of the sampled points is no longer observable.

[Figure 1-1 (scatter plot titled "Easily discernable Gaussians") appears here.]

Figure 1-1: Two spherical Gaussians in $\mathbb{R}^2$. Points sampled from one of the Gaussians are labelled with dots, those from the other with circles. Since the distributions do not overlap too much, classifying points is easy.

1.1 Our work

This thesis presents an algorithm that correctly classifies sufficiently many samples from a mixture of $k$ spherical Gaussians in $\mathbb{R}^n$. A spherical Gaussian is one in which the variance in any direction is the same. A key property of a mixture of Gaussians that affects the ability to classify points from a sample is the distance between the means of the distributions. When the means of the distributions are close together, correctly classifying samples is a difficult task, as can be seen in figure 1-2. However, as the means of the distributions grow further apart, it becomes easier to distinguish them, since points from different Gaussians are almost separated in space. Recently, algorithms have been designed to learn Gaussians when the distance between the means of the distributions grows polynomially with $n$. This thesis presents the first algorithm to provably learn spherical Gaussians where the distance between means grows polynomially in $k$ and $\log n$. When the number of Gaussians is much smaller than the dimension, our results provide algorithms that can learn a larger class of mixtures of distributions.

[Figure 1-2 (scatter plot titled "Separation between Gaussians too small to distinguish") appears here.]

Figure 1-2: Two spherical Gaussians in $\mathbb{R}^2$. Since the distance between the two means of the Gaussians is small, it is hard to classify points.

The two main techniques we use are spectral projection and distance concentration. While distance concentration has been used previously to learn Gaussians (see Chapter 3), only random projection had been used. The use of spectral projection, however, is new and is the key technique in this thesis. The effect of spectral projection is to shrink the size of each Gaussian while at the same time maintaining the distance between means. This allows us to apply previous techniques after spectral projection to classify Gaussians when the distance between the means is much smaller.

1.2 Bibliographic note

The work in this thesis is joint work with Santosh Vempala; much of it appears in [14] and [13].

Chapter 2

Definitions and Notation

In this chapter, we introduce the necessary definitions and notation to present the results in this thesis.
2.1 Spherical Gaussian Distributions

A spherical Gaussian distribution over $\mathbb{R}^n$ can be viewed as a product distribution of $n$ univariate Gaussian distributions. A univariate random variable from the Gaussian distribution with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 \in \mathbb{R}$ has the density function

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

and is denoted by $N(\mu, \sigma)$. With this in hand, we can define a spherical Gaussian over $\mathbb{R}^n$.

Definition 1. A spherical Gaussian $F$ over $\mathbb{R}^n$ with mean $\mu$ and variance $\sigma^2$ has the following product distribution over $\mathbb{R}^n$: $(N(\mu_1, \sigma), N(\mu_2, \sigma), \ldots, N(\mu_n, \sigma))$.

In this thesis, the distance between the means of two spherical Gaussians will often be discussed relative to the radius of a Gaussian, which we define next.

Definition 2. The radius of a spherical Gaussian $F$ in $\mathbb{R}^n$ is defined as $\sigma\sqrt{n}$.

2.2 Mixtures of Gaussian distributions

The main algorithm in this thesis classifies points sampled from a distribution that is a mixture of spherical Gaussians. Here, we give a precise definition of such a distribution.

Definition 3. A mixture of $k$ spherical Gaussians in $\mathbb{R}^n$ is a collection of weights $w_1, \ldots, w_k$ and a collection of spherical Gaussians $F_1, \ldots, F_k$ over $\mathbb{R}^n$. Each spherical Gaussian $F_i$ has an associated mean $\mu^i$ and variance $\sigma_i^2$. Sampling a point from a mixture of $k$ spherical Gaussians involves first picking a Gaussian $F_i$ with probability $w_i$, and then sampling a point from $F_i$.

A crucial feature of a mixture of Gaussians that affects the ability of an algorithm to classify points from that mixture is the distance between the means of the distributions, which we define to be the separation below.

Definition 4. The separation between two Gaussians $F_i, F_j$ is the distance between their means: $\|\mu^i - \mu^j\|$. The separation of a mixture of spherical Gaussians is the minimum separation between any two Gaussians.

In this thesis the separation of a mixture will often be discussed with respect to the radius of the Gaussians in the mixture. The intuition for these definitions is more easily seen visually; see Figure 2-1 for an illustration.

[Figure 2-1 (scatter plot titled "Radius and separation of Gaussians") appears here.]

Figure 2-1: Two spherical Gaussians $F_1, F_2$ in $\mathbb{R}^2$. $F_1$ has mean $\mu^1 = (0, 0)$, variance $\sigma_1^2 = 4$, and radius $2\sqrt{2}$. $F_2$ has mean $\mu^2 = (20, 20)$, variance $\sigma_2^2 = 16$, and radius $4\sqrt{2}$. The circle around each Gaussian has the radius of that Gaussian. The separation between the Gaussians ($\|\mu^1 - \mu^2\| = \sqrt{20^2 + 20^2}$) is the length of the line connecting their means.

2.3 Singular Value Decomposition

One of the main tools that we use in our algorithm is the singular value decomposition (SVD) of a matrix. Its use here is commonly known in the literature as spectral projection.

Definition 5. The singular value decomposition of a matrix $A \in \mathbb{R}^{m \times n}$ is the decomposition

$$A = U \Sigma V^T$$

where $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, and $V \in \mathbb{R}^{n \times n}$. The columns of $U$ form an orthonormal basis for $\mathbb{R}^m$ and are referred to as the left singular vectors of $A$. The columns of $V$ form an orthonormal basis for $\mathbb{R}^n$ and are referred to as the right singular vectors of $A$. $\Sigma$ is a diagonal matrix, and the entries $\sigma_1 \ge \cdots \ge \sigma_r$ on its diagonal are defined as the singular values of $A$, where $r$ is the rank of $A$.

One of the key uses of the singular value decomposition is spectral projection. For a vector $w \in \mathbb{R}^n$ and an $r$-dimensional subspace $V \subseteq \mathbb{R}^n$, we write $\mathrm{proj}_V w$ to mean the projection of $w$ onto the subspace $V$. Specifically, let $v_1, \ldots, v_r$ be a set of orthogonal unit vectors such that $\mathrm{span}\{v_1, \ldots, v_r\} = V$. Then $\mathrm{proj}_V w = \sum_{i=1}^r (w \cdot v_i) v_i$.
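As a concrete illustration of the projection operation just defined, the following short NumPy sketch computes the SVD of a sample matrix and projects a vector, and the rows of the matrix, onto the span of the top $r$ right singular vectors. The matrix, the vector, and the choice $r = 2$ are arbitrary values made up for this sketch; they are not taken from the thesis.

```python
import numpy as np

# A small, arbitrary "sample matrix": m = 6 points in n = 4 dimensions.
A = np.array([
    [2.0, 0.1, -0.3, 1.2],
    [1.8, -0.2, 0.4, 0.9],
    [2.2, 0.0, 0.1, 1.1],
    [-1.0, 3.0, 0.5, -0.7],
    [-1.2, 2.8, -0.4, -0.9],
    [-0.8, 3.1, 0.2, -1.1],
])

r = 2  # dimension of the subspace we project onto (illustrative choice)

# SVD: A = U diag(s) Vt; the rows of Vt are the right singular vectors of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_r = Vt[:r].T            # n x r matrix; columns are the top r right singular vectors

# Spectral projection of a single vector w onto span{v_1, ..., v_r}:
# proj_V w = sum_i (w . v_i) v_i = V_r V_r^T w
w = np.array([1.0, 2.0, 3.0, 4.0])
proj_w = V_r @ (V_r.T @ w)

# Projecting every row of A onto the same subspace (used later in the thesis):
proj_A = A @ V_r @ V_r.T

print("top", r, "singular values:", s[:r])
print("proj_V w =", proj_w)
print("||proj_V A||_F^2 =", np.sum(proj_A**2))
```

The quantity $\|\mathrm{proj}_V A\|_F^2$ printed at the end is the one that Theorem 6 below shows is maximized, over all rank-$r$ subspaces, by this particular choice of $V$.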
For a matrix $W \in \mathbb{R}^{m \times n}$, $\mathrm{proj}_V W$ is the matrix whose rows are the rows of $W$ projected onto $V$. By the spectral projection of a vector $w \in \mathbb{R}^n$, we mean the projection of $w$ onto the subspace $V$ spanned by the top $r$ right singular vectors of a matrix $A$. The subspace $V$ has the desirable property that it maximizes the norm of the projection of the rows of $A$ over all rank-$r$ subspaces. The following theorem makes this precise.

Theorem 6. Let $A \in \mathbb{R}^{m \times n}$, and let $r \le n$. Let $V$ be the subspace spanned by the top $r$ right singular vectors of $A$. Then we have that

$$\|\mathrm{proj}_V A\|_F = \max_{W : \mathrm{rank}(W) = r} \|\mathrm{proj}_W A\|_F.$$

Proof. Let $W$ be an arbitrary subspace of dimension $r$ spanned by the orthonormal vectors $w_1, \ldots, w_r$, and let $V$ be the subspace spanned by the top $r$ right singular vectors $v_1, \ldots, v_r$. Note that we have

$$\|\mathrm{proj}_W A\|^2 = \sum_{i=1}^r \|A w_i\|^2.$$

For each $w_i$, project $w_i$ onto $v_1, \ldots, v_n$, the $n$ right singular vectors of $A$, which span $\mathbb{R}^n$. That is, let

$$w_i = \sum_{j=1}^n \alpha_{ij} v_j.$$

Since $A v_1, \ldots, A v_n$ is an orthogonal set of vectors, we have that for every $i$:

$$\|A w_i\|^2 = \sum_{j=1}^n \alpha_{ij}^2 \|A v_j\|^2.$$

So we have:

$$\|\mathrm{proj}_W A\|^2 = \sum_{i=1}^r \sum_{j=1}^n \alpha_{ij}^2 \|A v_j\|^2 = \sum_{j=1}^n \left( \sum_{i=1}^r \alpha_{ij}^2 \right) \|A v_j\|^2.$$

Now, for a fixed $j$, note that $\sum_{i=1}^r \alpha_{ij}^2 \le 1$. This is true because $\alpha_{ij} = (w_i \cdot v_j)$, so $\sum_{i=1}^r \alpha_{ij}^2 = \|\mathrm{proj}_W v_j\|^2 \le \|v_j\|^2 = 1$. Also, since $\|w_i\| = 1$, we have $\sum_{j=1}^n \alpha_{ij}^2 = 1$ for every $i$, and so $\sum_{j=1}^n \sum_{i=1}^r \alpha_{ij}^2 = r$. Since $\|A v_j\|^2 = \sigma_j^2$ and $\sigma_1^2 \ge \sigma_2^2 \ge \cdots$, it follows that

$$\|\mathrm{proj}_W A\|^2 \le \sum_{j=1}^r \sigma_j^2.$$

Since $V$ achieves this value, we have the desired result.

This is the key property of spectral projection that we use in this thesis.

2.4 Gaussians and projections

Let $F$ be a spherical Gaussian over $\mathbb{R}^n$ with variance $\sigma^2$ and mean $\mu$. If we project $F$ to a subspace $V$ of dimension $r \le n$, $F$ becomes a spherical Gaussian with radius $\sigma\sqrt{r}$. To see this, let $\{v_1, \ldots, v_r\}$ be a set of orthonormal vectors spanning $V$, and let $\{v_{r+1}, \ldots, v_n\}$ be a set of orthonormal vectors orthogonal to $V$, so that $\mathrm{span}\{v_1, \ldots, v_n\} = \mathbb{R}^n$. $F$ has the following product distribution:

$$(N(\mu_1, \sigma), N(\mu_2, \sigma), \ldots, N(\mu_n, \sigma)).$$

In the basis $\{v_1, \ldots, v_n\}$, this is a similar product distribution, since $F$ is spherical:

$$(N(v_1 \cdot \mu, \sigma), N(v_2 \cdot \mu, \sigma), \ldots, N(v_n \cdot \mu, \sigma)).$$

Projecting onto $V$ is just a restriction to the first $r$ coordinates in this basis, and therefore the projection of $F$ onto $V$ is a spherical Gaussian with radius $\sigma\sqrt{r}$.

2.5 Learning a mixture of Gaussian distributions

As we noted in the introduction, the notion of learning a mixture of Gaussian distributions can be interpreted in a number of ways. Here, we formally define the notion of learning that the algorithm in this thesis achieves.

Definition 7. Let $M$ be a mixture of $k$ spherical Gaussians $F_1, \ldots, F_k$ over $\mathbb{R}^n$ with weights $w_1, \ldots, w_k$. Let $\mu^i, \sigma_i^2$ be the mean and variance of $F_i$. An algorithm $A$ learns a mixture $M$ with separation $s$ if, for any $\delta > 0$, given at most a polynomial number of samples in $\log\frac{1}{\delta}$, $n$, $k$ and $\frac{1}{w_{\min}}$, the algorithm correctly classifies the samples according to which Gaussian they were sampled from with probability $1 - \delta$.

Another popular notion of learning is that of maximum likelihood estimation. Here, an algorithm is given samples from a mixture of distributions, and is asked to compute the parameters of a mixture (in the case of Gaussians, the means, variances, and mixing weights) that maximize the likelihood that the observed data was sampled from that mixture. In general, these notions of learning are equivalent in the sense that an algorithm for one notion can be used to achieve another notion of learning.
For 10 example, note that an algorithm that achieves the above definition of learning can also be used to approximate the means, variances and mixing weights of the mixture model. For sufficiently large sample sets, these parameters are exactly those that can be derived from a correct classification of the sample sets. A maximum likelihood algorithm can also be used to achieve the classification notion of learning by classifying each point according to whichever Gaussian has the highest probability that it generated it. 11 Chapter 3 Previous work 3.1 EM Algorithm The EM algorithm is the most well-known algorithm for learning mixture models. The notion of learning it achieves is that of maximum likelihood estimation. Introduced by Dempster, Laird, and Rubin [5], EM is the algorithm that is used in practice to learn mixture models. Despite this practical success, there are few theoretical guarantees known that the algorithm achieves. The problem that the EM algorithm solves is actually more general than simply learning mixture models. In full generality, EM is a parameter estimation algorithm for "hidden", or "incomplete" data [3]. Given this data, EM attempts to find parameters E that maximize the likelihood of the data. In the particular case of a mixture of spherical Gaussians, the data given to EM are the sample points. The data is considered hidden or incomplete because each point is not labelled with the distribution from which it was sampled. EM attempts to estimate the parameters of the mixture (mixing weights, and means and variances of the underlying Gaussians) that maximize the log-likelihood of the sample data A. EM works with the log-likelihood, rather than the likelihood because it is analytically simpler. The algorithm proceeds in an iterative fashion; that is, it makes successive estimates 01, 0 2 ,... on the parameters. The key property of EM is that the log-likelihood of the parameters increases; that is, L(Ot+1 ) > L(0t). Each iteration involves two steps: an expectation step and a maximization step. In the expectation step, we compute the expected value of the log-likelihood of the complete data for some parameters 0', given the observed data and the current value of E. In the maximization step, we maximize this expected value, subject to the parameters 0'. The one key property proven by [5] is that the log-likelihood increases; however, one can see that this can result in a local optimum. There are no promises that EM will compute the global optimum; furthermore, the rate of convergence can be exponential in the dimension. In light of this, recent research has been focused towards polynomial time algorithms for learning Gaussians. 12 3.2 Recent work Recent work has produced algorithms with provable guarantees on the ability to learn mixtures Gaussians with large separation. The tools used to produce such results include random projection, isoperimetry, distance concentration, and clustering algorithms. The first work to provide a provably correct algorithm for learning Gaussians is due to Dasgupta[4]. The notion of learning Gaussians in [4] is to approximately determine the means of each of the component Gaussians. Stated roughly, he learns a mixture of k sphere-like Gaussians over 11 at minimum separation proportional to the radius: ||Ipi - pI > C max{uoi, oj} Ii. There are also other technical conditions: the Gaussians must have the same variance in every direction, and the minimum mixing weight must be Q(1/k). 
By sphere-like, we mean that the ratio I-ax between the maximum variance in any direction and minimum variance in any direction is bounded. This is known as the eccentricity of the Gaussian. The two main tools he uses are random projection and distance concentration. The first step of his algorithm involves random projection: given a sample matrix where each row is a sampled from a mixture of Gaussians, he projects to a randomly chosen O(log k)-dimensional subspace. After projection, the separation between Gaussians and the radii of Gaussians now are a function of log k: specifically, the separation is now Caulog k and the radii of each Gaussian is u log k. Note that in terms of the dimension, things are no different. However, the eccentricity will be reduced to a constant, thus making the Gaussians approximately spherical. Once in O(log k) dimensions, the algorithm proceeds in k iterations; in each iteration, it attempts to find the points sampled from the ith Gaussian, F. The exact details follow: initially, the set S contains all of the sample points. In each iteration, the algorithm finds the point x whose distance to at least p other points is smallest. Every point within a distance q from x is removed from S and placed in the set Si, the set of points belonging to F. Back in n-dimensions, the estimate for the mean of the ith Gaussian is the average of the preimage of the points in Si. The reason such a simple algorithm works is because of distance concentration. Roughly stated, this says that the squared distance between any two sampled points lies close to its expectation. The expected squared distance between two points from the same Gaussian is 2a 2 log k, whereas the expected squared distance between two points from different Gaussians is (or + ) log k + pI 2. If the squared separation is on the order of Cu 2 log k, then points from each Gaussian are separated in space: from any point, all other points within a squared distance of Ciu 2 log k belong to the same Gaussian. Arora and Kannan extended these techniques to learn mixtures of different Gaussians of arbitrary shape [1]; that is, the eccentricity of the Gaussians need not be bounded, and they need not possess the same covariance matrix. By using isoperimetry to find new distance concentration results for non-spherical Gaussians, they are able to learn Gaussians at a separation of pi - II > max{ai, a }n. 13 While [1] introduce several algorithms, the same key ideas are used: random projection shrinks the eccentricity, and distance concentration allows relatively simple classification algorithms to succeed. Later, DasGupta and Schulman discovered that a variant of the EM algorithm can be proven to learn Gaussians at a slightly smaller separation of: |1pi - pj. ;> -nl/4 To prove that this variant works, they also appeal to distance concentration. Learning mixtures of distributions when the separation between Gaussians is not a polynomial in the dimension has not been known, since distance concentration techniques cannot be directly applied. 14 Chapter 4 The Algorithm Our algorithm for learning mixtures of spherical Gaussians is most easily described by the following succinct description. Let M be a mixture of k spherical Gaussians. Algorithm SpectralLearn Input: Matrix A E R"fX" whose rows are sampled from M Output: A partition (S 1,... , Sk) of the rows of A. 1. Compute the singular value decomposition of A. 2. Let r = max{k,Clog(n/wmin)}. Project the samples to the rank r subspace spanned by the top r right singular vectors. 
3. Perform a distance-based clustering in the r-dimensional space. Step 1 can be done using one of many algorithms developed for computing the SVD, which can be found in a standard numerical analysis step, such as [7]. For concreteness, we refer the reader to the R-SVD algorithm [7]. The running time of this algorithm is O(n3 + m 2 n). With the singular value decomposition A = UEVT in hand, the projection in step 2 is trivial. In particular, this can be done by the matrix multiplication AV, where 17 is the matrix containing only the first r columns of V. The rows of AV are the projected rows of A in Rk. Step 3 is very similar in nature to the algorithms described in [1]. However, due to the specific properties of spectral projection, our algorithm is a bit more complicated, and we refer the reader to the full description in Chapter 6, The above algorithm can be applied to any mixture of spherical Gaussians. The main result of this thesis is a guarantee on the ability of this algorithm to correctly classify the points in the Gaussian when the separation between Gaussians is polynomially dependent on k and not n. The full statement of our main theorem follows: Theorem 8. Let M be a mixture of k spherical Gaussians with separation pi - pjI| > C max{u-a, -j} ((k log (n/wmin)) 1/4 + (log (n/wmi.))/2) 15 The algorithm SpectralLearn learns M with at least M m =0 n in I n +min I ma+nka |pi|2n) I samples. 4.1 Intuition and Proof sketch To see why spectral projection aids in learning a mixture of Gaussians, consider the task of learning a mixture of k Gaussians in n dimensions given a sample matrix A E R"'". If the separation between all Gaussians is on the order of 0(nl/ 4 ), we can use the clustering algorithm used in [1] to correctly classify the sample points. If the separation between the Gaussians is much smaller than 6(nl/ 4 ), it is clear that the same techniques cannot be applied. In particular, distances between points from the same Gaussian and distances between points from different Gaussians are indistinguishable; since the clustering algorithm uses only this information, it will clearly fail. Therefore, when the separation between Gaussians is 6(k1 / 4 ), polynomial in k and log n, and not the dimension n, we cannot just naively apply the clustering algorithm to the sample points. The key observation is this: assume that we know the means of the Gaussians are P1 ... Pk, and let U be the k-dimensional subspace spanned by {P1 ... II}. If we project the Gaussians to U, note that the separation between any two Gaussians does not change, since their means are in the subspace U. However, we have reduced the dimension to k, and a Gaussian remains a Gaussian after projection (see Chapter 2). In our particular case, the results of the projection are the following: a mixture of Gaussians in k dimensions with a separation of O(k 1 / 4 ). Therefore, we can apply the clustering algorithm in [1] to correctly classify points! Of course, we do not know the means of the Gaussians p1 ... Ilk, nor the subspace U, so we cannot project the rows of A to U. Furthermore, any arbitrary subspace does not possess the crucial property that the separation between Gaussians does not change upon projection. Here is where the singular value decomposition helps. Let V be the subspace spanned by the top k right singular vectors of A. If we project the rows of A to V, the separation between the mean vectors is approximately preserved at 6(k1 / 4 ) while reducing the dimension to k. 
Now, we can apply the clustering algorithm described above! The key property is that projection to V behaves in much the same way that projection to U does: separation is preserved. To see why the separation between mean vectors is preserved upon projection to V, consider the relationship between U and V. Recall that V is the subspace that maximizes ||projwAll for any k dimensional subspace W. Consider the expected value of this random variable for an arbitrary W: E[IlprojwAII]. The key is that U achieves the maximum of this expected value. Why is this the case? Consider one spherical Gaussian. If we project it to a line, the line passing through the mean is the best line. Why? Since the Gaussian is spherical, the variance is the same in all directions. 16 Therefore, any line that does not pass through the mean does not take advantage of this. For a mixture of Gaussians, the same argument can be applied. For a particular Gaussian, as long as the subspace we are projecting to passes through the mean, any subspace is equally good for it. It follows that for a mixture of k Gaussians, the best k dimensional subspace is U: the subspace that passes through each of the mean vectors. Therefore, U is the subspace that maximizes E[|lprojwAl|], and V is the subspace that maximizes I|projwA I. To show that separation is preserved, we need to prove that the random variable |lprojwAll is tightly concentrated around its expectation. With this in hand, we can show that flprojvAl| is not much smaller than E[JlprojuAll] and that E[IlprojvA] is not much smaller than E[IlprojuAl ]. In particular, this will allow us to show that the average distance between the mean vectors and the projected mean vectors is small, meaning that the average separation has been maintained. In the above informal argument, we projected to k dimensions. This was only for clarity; the actual algorithm projects to r = max{k, C In n/wmin} dimensions. We need to project to a slightly higher dimension because of the exact distance concentration bounds. We should also not that because we only have a guarantee on the average separation, rather than separation for each pair of Gaussians, applying distance concentration at this point is nontrivial. However, our algorithms will use the same main idea: points from the same Gaussian are closer than points from different Gaussians. The rest of this thesis is devoted to proving the main theorem. Section 5 proves that, on average, separation between main vectors is approximately maintained. Section 6 describes the distance concentration algorithm in more detail, and proves its correctness. These two results suffice to prove the main theorem. 17 Chapter 5 Spectral Projection In this section, we will prove that with high probability, projecting a suitably large sample of points from a mixture of k spherical Gaussians to the subspace spanned by the top r = max{k, C log n/wmin} right singular vectors of the sample matrix preserves the average distance from the mean vectors to the projected mean vectors. To build up to this result, we first show that the subspace spanned by the mean vectors of the mixture maximizes the expected norm of the projection of the sample matrix. 5.1 The Best Expected Subspace Let M be a mixture of spherical Gaussians, and let A E W"' be a matrix whose rows are sampled from M. We show that the k-dimensional subspace W that maximizes E[IlprojvAl ] is the subspace spanned by the mean vectors P1 ... pk of the mixture M. We prove this result through a series of lemmas. 
The first lemma shows that for any random vector $X \in \mathbb{R}^n$ whose coordinates are uncorrelated, i.e. $E[X_i X_j] = E[X_i]E[X_j]$ for $i \ne j$, and each have variance $\sigma^2$, the expected squared norm of the projection of $X$ onto a vector $v$ is equal to the squared norm of the projection of $E[X] = \mu$ onto $v$, plus a term depending only on the variance of $X$ and the length of $v$.

Lemma 1. For any $v \in \mathbb{R}^n$,

$$E[(X \cdot v)^2] = (\mu \cdot v)^2 + \sigma^2 \|v\|^2.$$

Proof.

$$E[(X \cdot v)^2] = E\left[\left(\sum_{i=1}^n X_i v_i\right)^2\right] = E\left[\sum_{i,j=1}^n X_i X_j v_i v_j\right] = \sum_{i,j=1}^n E[X_i X_j] v_i v_j.$$

Using the assumption that $E[X_i X_j] = E[X_i]E[X_j]$ for $i \ne j$, together with $E[X_i^2] = E[X_i]^2 + \sigma^2$,

$$E[(X \cdot v)^2] = \sum_{i,j=1}^n E[X_i]E[X_j] v_i v_j + \sigma^2 \sum_{i=1}^n v_i^2 = (E[X] \cdot v)^2 + \sigma^2 \|v\|^2 = (\mu \cdot v)^2 + \sigma^2 \|v\|^2.$$

With this lemma in hand, it is easy to prove that, among all vectors of the same length, the vector in the direction of $\mu$ maximizes the expected squared norm of the projection. This follows from the previous lemma: the expected squared norm is the sum of two terms, and the second is the same for all vectors of the same length, so we only need to concern ourselves with maximizing the first term. The vector in the direction of $\mu$ maximizes the first term, so the corollary follows.

Corollary 1. For all $v \in \mathbb{R}^n$ such that $\|v\| = \|\mu\|$,

$$E[(X \cdot \mu)^2] \ge E[(X \cdot v)^2].$$

Proof. By the above lemma,

$$E[(X \cdot \mu)^2] = (\mu \cdot \mu)^2 + \sigma^2 \|\mu\|^2 \quad \text{and} \quad E[(X \cdot v)^2] = (\mu \cdot v)^2 + \sigma^2 \|v\|^2.$$

So $E[(X \cdot \mu)^2] - E[(X \cdot v)^2] = (\mu \cdot \mu)^2 - (\mu \cdot v)^2 \ge 0$ by the Cauchy-Schwarz inequality (since $\|v\| = \|\mu\|$), and we have the desired result.

Next, we consider the projection of $X$ to a higher-dimensional subspace. We write $\|\mathrm{proj}_V X\|^2$ to denote the squared length of the projection of $X$ onto a subspace $V$. For an orthonormal basis $\{v_1, \ldots, v_r\}$ of $V$, this is just $\sum_{i=1}^r (X \cdot v_i)^2$. The next lemma is an analogue of the first lemma in this section; it shows that the expected squared norm of the projection is just the squared norm of the projection of $\mu$ onto the subspace $V$, plus a term that depends only on the dimension of the subspace and the variance.

Lemma 2. Let $V \subseteq \mathbb{R}^n$ be a subspace of dimension $r$ with orthonormal basis $\{v_1, \ldots, v_r\}$. Then

$$E[\|\mathrm{proj}_V X\|^2] = \|\mathrm{proj}_V E[X]\|^2 + r\sigma^2.$$

Proof.

$$E[\|\mathrm{proj}_V X\|^2] = E\left[\left\|\sum_{i=1}^r (X \cdot v_i) v_i\right\|^2\right] = E\left[\sum_{i=1}^r (X \cdot v_i)^2\right] = \sum_{i=1}^r E[(X \cdot v_i)^2].$$

The second equality follows from the fact that $\{v_1, \ldots, v_r\}$ is an orthonormal basis, the third from linearity of expectation. By Lemma 1, we have:

$$\sum_{i=1}^r E[(X \cdot v_i)^2] = \sum_{i=1}^r \left((\mu \cdot v_i)^2 + \sigma^2\right) = \|\mathrm{proj}_V E[X]\|^2 + r\sigma^2.$$

Now consider a mixture of $k$ distributions $F_1, \ldots, F_k$, with mean vectors $\mu^i$ and variances $\sigma_i^2$. Let $A \in \mathbb{R}^{m \times n}$ be generated randomly from the mixture of distributions $F_1, \ldots, F_k$ with mixing weights $w_1, \ldots, w_k$. For a matrix $A$, $\|\mathrm{proj}_V A\|^2$ denotes the squared 2-norm of the projections of its rows onto the subspace $V$, i.e. $\|\mathrm{proj}_V A\|^2 = \sum_{i=1}^m \|\mathrm{proj}_V A_i\|^2$, where $A_i$ is the $i$th row of $A$. For a sample matrix $A$, we can separate the rows of $A$ into those generated from the same distribution. Then, applying Lemma 2, we can show that the expected squared norm of the projection of $A$ onto a subspace $V$ is just the squared norm of the projection of $E[A]$ onto $V$, plus terms that depend only on the mixture and the dimension of the subspace. This is formalized in the following theorem.

Theorem 9. Let $V \subseteq \mathbb{R}^n$ be a subspace of dimension $r$ with an orthonormal basis $\{v_1, \ldots, v_r\}$. Then

$$E[\|\mathrm{proj}_V A\|^2] = \|\mathrm{proj}_V E[A]\|^2 + mr \sum_{i=1}^k w_i \sigma_i^2.$$

Proof.

$$E[\|\mathrm{proj}_V A\|^2] = \sum_{i=1}^m E[\|\mathrm{proj}_V A_i\|^2] = \sum_{l=1}^k \sum_{i \in F_l} E[\|\mathrm{proj}_V A_i\|^2].$$

The first equality follows from linearity of expectation. In expectation, there are $w_l m$ samples from the $l$th distribution. By applying Lemma 2, we have

$$E[\|\mathrm{proj}_V A\|^2] = \sum_{l=1}^k m w_l \left(\|\mathrm{proj}_V \mu^l\|^2 + r\sigma_l^2\right) = \|\mathrm{proj}_V E[A]\|^2 + mr \sum_{l=1}^k w_l \sigma_l^2.$$
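As a quick numerical sanity check of Lemma 2 (and of the expectation computation behind Theorem 9), the sketch below compares a Monte Carlo estimate of $E[\|\mathrm{proj}_V X\|^2]$ for a single spherical Gaussian with the closed form $\|\mathrm{proj}_V \mu\|^2 + r\sigma^2$. The dimension, variance, mean, and subspace are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (chosen here, not taken from the thesis).
n, r, sigma = 20, 5, 2.0
mu = rng.normal(size=n)                      # mean of a single spherical Gaussian

# An arbitrary r-dimensional subspace V, given by an orthonormal basis.
Q, _ = np.linalg.qr(rng.normal(size=(n, r)))  # columns of Q are orthonormal

# Monte Carlo estimate of E[ ||proj_V X||^2 ] for X ~ N(mu, sigma^2 I).
m = 200_000
X = mu + sigma * rng.normal(size=(m, n))
proj_sq = np.sum((X @ Q) ** 2, axis=1)        # ||proj_V X||^2 for each sample

estimate = proj_sq.mean()
predicted = np.sum((Q.T @ mu) ** 2) + r * sigma**2   # ||proj_V mu||^2 + r*sigma^2

print(f"Monte Carlo estimate: {estimate:.3f}")
print(f"Lemma 2 prediction:   {predicted:.3f}")
```

The two printed numbers agree up to Monte Carlo error; repeating the experiment with different subspaces shows that the first term, and hence the total, is largest when $V$ contains $\mu$, which is the content of the corollary that follows.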
From this it follows that the expected best subspace for a mixture of k distributions is simply the subspace that contains the mean vectors of the distributions, as the following corollary shows. Corollary 2 (Expected Best Subspace). Let V C IV be a subspace of dimension r > k, and let U contain the mean vectors il ... M . Then, E[||projuA|2] > E[||projvA||2] Proof. By Theorem 9 we have, E[||projuA|| 2 ] - E[f|projvAl 2] = |projuE [A]|11 = |E[A]|| 2 - ||projvE[A]|| 2 2 - ||projvE[A]|1 2 > 0. D As a notational convenience, from this point on in this thesis, the subspace W is an arbitrary subspace, U is the subspace that contains the mean vectors of the mixture, and V is the subspace that maximizes the squared norm of the projection of A. 5.2 Concentration of the expected norm of projection We have shown that U maximizes the expected squared norm of the projection of the sample matrix A. Surely, we cannot hope that with high probability V is exactly U. We would like to show that with high probability, V lies close to U in the sense that it also preserves the distance between the mean vectors of the distributions. To prove this, we will need to first show that, for an arbitrary subspace W, the squared norm of the projection of A onto W is close to its expectation. With this in hand, we can show that every subspace has this property. As a result, we can argue that, with high probability, I|projvA|| 2 is close to |IprojuA112 . This will essentially prove the main result of the section. The next lemma proves the first necessary idea: IIprojwA |2 is tightly concentrated about its expectation E[I projwA 2]. Lemma 3 (Concentration). Let W be a subspace of IR of dimension r, with an orthonormalbasis {w1 ... w,}. Let A E R' xn be generatedfrom a mixtures of spherical Gaussians F1,... , Fk with mixing weights w1,... , Wk. Assume that A contains at least m rows from each Gaussian. Then for any E > 0: 1. Pr (||projwA||2 > (1 + c)E[I|projwA||2 ]) < ke2 . . 2. Pr (||projwA| 2 < (1 - e)E[||projwA||2 ]) < keProof. Let £ be the event in consideration. We further condition F on the event that exactly mi rows are generated by the ith Gaussian. 21 iF IprojwAi 2 . Therefore, we have the following bound on the probability of the conditioned event: Note that |lprojwAll 2 , k |Iprojw Ai Ij2 > (1 + E)E[ k max Pr Pr (E/(mj ... mk)) \iEFI IIprojwAi I2 / iFI Let B be the event that: I projwBI 2 > (1 + c)E[|IprojwB|I 2 ] where B E W"" is a matrix containing the rows of A that are generated by F1 , an arbitrary Gaussian. We bound the probability of 8/(ml ... ik) by bounding the probability of B. _1 I Y . We are interested in the Let Yij = (Bi -w). Note that IIprojwB11 2 = E11 2 MI'1 r_> Yij > (1 + E)E[jjprojwBj| ]. Note that Yj is a Gaussian random event B variable with mean ([-ti wj) and variance or,. We can write Yi = uXij, where Xjj is Gaussian random variable with mean (PIjWj) and variance 1. Rewriting in terms of Xjj, we are interested in the event mI r Z(oXa) 2 > (1 + E)E[|jprojwBj B -- 2 i=1 j=1 W3 ) 2 + of), we have: 1 w 1 m1 ((pu Since E[|lprojwBI| 2 ] = MIT JU _(1+ 2 EX2 E) =_ 1 m ((p_ _ i > W) + 02) i=1 j=1 By Markov's inequality, Pr(B <E[ exp (t E 1 De Pr (B) K t(1+f) 1 1 ml (('V Note that Z = 1i 2(w parameter E_>1 for Z is (see e.g. [6]): E[e=e Zi= 1 X -1 X j Xj) -w)+<r) is a chi-squared random variable with noncentrality and mjr degrees of freedom. 
The moment generating function (1 - 2t)~mir/ 2 e Z] = w2 m [ 2t So we obtain the bound on Pr (B): Pr (B) < (1 - 2t)-mlr/ 2 exp < t( Ejrsl X m1(p,_ (1- 2t)-nr/2 exp -)2-(1 + E) zE; m((p - wj) 2 + o)(1 - 2t) 7 ) (1- 2t)oI t(-(1 + E)mlr - (c - 2t(e + 1)) 22 -2= 0 (I - 2t) 1, Using the fact that 1 < e2t+4t2 and setting t = , we have E - 2t(e +1) > 0, and so Pr (B) P (2t+4t2 ) M e 1+E) letir Therefore, we obtain the following bound on 8/(mI Pr (E/(mi ... m)) < kmaxe . . _.2M 8 1- Since this is true regardless of the choice of m 1 ,... at most the above. , . mk) <ke- e 2 mr 8 , the probability of E itself is F- The above lemma shows tight concentration for a single subspace. As noted before, we would like to show that all subspaces stay close to their expectation. The following lemma proves this: Lemma 4. Suppose A e RMxn has at least m rows from each of the k spherical Gaussians F1 ... Fk. Then, for any 1 > e > 0, -> > 0, and any r such that 1 < r < n, the probability that there exists a subspace W of dimension r that satisfies ||projwA|| 2 < (1 - E)E[I|projwA||2 ] - (6r rIna) E [| A1| 2] is at most ke smr. E___ (2) rn Proof. Let S be the desired event, and let W be the set of all r-dimensional subspaces. We consider a finite set S of r dimensional subspaces, with |S| = N, such that for any W E W with orthonormal basis {W1 ... Wr}, there exists a W* E S with orthonormal basis {w* . .. w*}, such that for all i and j, wi - w* < (, component-wise. The main idea is to show that if any subspace W E W satisfies the desired inequality, there exists a subspace W* E S that satisfies a similar inequality. In particular, we will show that IIprojw.A|| 2 is not much larger than IprojwA 12, and that E[jprojw-A| 2 ] is not much smaller than E[IlprojwA |2]. Now we can bound the probability that there is a W* E S that satisfies this similar inequality by the union bound, thus bounding the probability of the desired event. 23 - W* )2 projw- a||I(a = i=1 r (a - (w* = -wi +wi))2 i=1 r (a . W1)2 + (a -(w* - Wi))2 + 2(a -(w - w))(a -wi) < n r I|projwa||2 +( 2 E |aj||W* - Wijl j=1 ) i=1 (j=1 + Il|projw al| + rn aia| + 2r da||el||2 " I|projwal1 2 + 3rna||al 2 The last line follows from a < --. The above also gives us a bound on the projection of matrices: ||projwA||2 | projwA|| 2 - 3r a||AI| 2. Similarly, E[||projwAJ |2] E[f|projwA||2 ] + 3rvliaE[| Al| 2] Therefore, it follows that if there exists a W such that: HprojwAH| 2 < (1 - c)E[JHprojwAJ| 2] - (6rva)E[IA112] Then there exists a W* G S such that: |lprojw*A| 2 + (1 + E)3rv' YaE[I|AI| 2 ] K (1 - c)E[JHprojwA1 12 ] + 3rv'ia||AI 2. Applying a union bound, we have that: Pr(&) < Pr (3W* : |projw*AJ 2 < (1- e)E[JHprojwAJ|2]) + Pr (3rV aAI 12 > (1 + e)3rx/naE[IIA112 ]) Using Lemma 3 and the union bound, we get that 2 _6 _, Pr (E) < (N +1)ke 8 A simple upper bound on N+1 is (2)r', the number of grid points in the cube [-1, 1]nr El with grid size a. The lemma follows. With this in hand, we can prove the main theorem of this section. 24 5.3 Average distance between mean vectors and projected mean vectors is small The following main theorem shows that projecting the mean vectors of the distributions to the subspace found by the singular value decomposition of the sample matrix preserves, on average, the location of the mean vectors. Theorem 10. Let the rows of A E R"' be picked according to a mixture of spher- ical Gaussians {F 1,..., Fk} with mixing weights (w 1,... means (P 1 , Wk), ... Pk) and variances (o.. .0). Let r = max{k,961n (1")}. 
Let V C Rn be the r-dimensional subspace spanned by the top r right singular vectors, and let U be an r-dimensional subspace that contains the mean vectors p1 ... pA. Then for any } > c > 0, with rn> ( 5000 ynln C2 Wmin n + m ax i |p||2 ego )+ 1 (k) n n - k In6 n we have with probability at least 1 - 6, k 2 ||projuE[A] 1| - ||projvE[A]||2< rm(n - r) wig 2 Proof. We first describe the intuition behind the proof. We would like to show that with high probability, I|projuE[A] 1 2 - IIprojvE[A]1| 2 is small. Note that by theorem 9, we have that: lprojuE[A]11 2 - ||projvE[A]| 2 = E[||projuA|| 2 ] - E[\|projvA|| 2] To prove the desired result, we will show that the following event is unlikely: 91 - E[IjprojuAl|2] - E[lJprojvAJf 2 ] is large. We actually show that a stronger event is unlikely: E2 = 3W : E[||projuAl 2] - E[||projwA|| 2 ] is large and ||projwA|| 2 > | IprojuA|| 2. Note that E2 implies Ei. In order for Si to occur, it must be the case that there is some W such that IIprojwA 112 is much larger than its expectation or IIprojuA| 2 is much smaller than its expectation, since E[IHprojuA|| 2 ] - E[IlprojwA|| 2 ] is large. These are unlikely events by the lemmas we have proven in this section. While the above approach derives a bound on IIprojuE[A]1| 2 - projvE[A]J| 2 , better results can be obtained by looking at the orthogonal subspaces. The same argument as above works, and we follow through with that here. We will show that the following event is unlikely: k projuE [A]~~ - JprojvE[A]J~ 25 > em(n - r) wia Let U be the subspace orthogonal to U, and similarly for V. Since IA l 2 for any subspace W, we have that: 11projWA1| ||projuE[A]|| 2 - ||projvE[A]|| 2 = ||projVE [A]|| 2 - 2 = ||projwA|12+ ||projUE [A]|| 2 Applying theorem 9, we can rewrite the desired event as: k E[IIprojVA11 ] - E[l 2 projUA11 2 ] > Em(n r) - (5.1) wioQ We also have by theorem 9 that: k E[IlprojUAll ] 2 = I proj-jE[A]112 + m(n - r) wiOr2 i= k = m(n - k) wioU. i=1 since none of the mean vectors are in U. Therefore, we can rewrite 5.1 as E l projA 112] > (1 + c)E[flproj-All 2 ]. Following the same argument as the intuition mentioned previously, we will show that the stronger event is unlikely: E = 3W: E[jjprojwAlI 2] > (1 + c)E[llprojUAjl 2 ] and ||projWA1 2 < IIprojUA11 2 . In order for E to occur, it must be the case that either of the following two events hold: A 3W : E[llprojwAll 2 ] > (1 + c)E[jjprojUAll 2 ] and B IIprojUA11 2 - E[IlprojyAll 2 ] >: (E[IIprojwA11 2 ] - E[llprojUAll 2 1) 3W : E[llprojwAl 2 ] > (1 + c)E[jjprojUAll 2 ] and E[|lprojwA|| 2] _ |IprojWA|| 2 > (E[||projW A2] - E[l|projg A| 2 ]) This is because if there exists a W such that E[llprojwAll2] > (1 + E)E[lprojyA|1 12] and IIprojWA 12 < IIprojUA 12, it must be the case that either there exists a W such that 2 IIprojwA112 is much smaller than its expectation or IIprojUA1l is much larger than its expectation. However, these are low probability events, as we will see. The probability of A is at most the probability that there exists W such that projUA11 2 - E[IIproj-A11 2 ] > 1 ((1 + c)E[ljprojUAj| 2 ] - E[IlprojUAI|2]), which is at most Pr (A) < Pr (IIprojA2 (1 + E)E[|projA2 2 32 The last inequality follows from lemma 3, and the fact that each distribution generates at least win' rows (the probability this does not happen is much smaller than 6). 
Now the probability of $B$ is at most

$$\Pr(B) \le \Pr\left(\exists W : E[\|\mathrm{proj}_{\bar{W}} A\|^2] - \|\mathrm{proj}_{\bar{W}} A\|^2 \ge \frac{\epsilon}{8} E[\|\mathrm{proj}_{\bar{W}} A\|^2]\right) = \Pr\left(\exists W : \|\mathrm{proj}_{\bar{W}} A\|^2 \le \left(1 - \frac{\epsilon}{8}\right) E[\|\mathrm{proj}_{\bar{W}} A\|^2]\right).$$

Here we used the fact that $\epsilon < 1$. By applying lemma 4 with $\alpha = \epsilon E[\|\mathrm{proj}_{\bar{U}} A\|^2] / \left(48\sqrt{n}\,(n - r)\,E[\|A\|^2]\right)$, and with the number of samples from each Gaussian guaranteed by the lower bound on $m$ in the statement of the theorem, we have that the probability of $B$, and therefore of $E$, is at most $\delta$, which is the desired result.

The theorem above essentially says that the average distance between the mean vectors and the mean vectors projected to the subspace found by the SVD is not too large. The following corollary helps to see this.

Corollary 3 (Likely Best Subspace). Let $\mu_1, \ldots, \mu_k$ be the means of the $k$ spherical Gaussians in the mixture and $w_{\min}$ the smallest mixing weight. Let $\mu'_1, \ldots, \mu'_k$ be their projections onto the subspace spanned by the top $r$ right singular vectors of the sample matrix $A$. With a sample of size $m = O^*\!\left(\frac{n^3}{\epsilon^2 w_{\min}}\right)$ we have, with high probability,

$$\sum_{i=1}^k w_i \left(\|\mu_i\|^2 - \|\mu'_i\|^2\right) \le \epsilon (n - r) \sum_{i=1}^k w_i \sigma_i^2.$$

Proof. Let $V$ be the optimal rank-$r$ subspace. Let $m$ be as large as required in Theorem 10. By the theorem, we have that with probability at least $1 - \delta$,

$$\|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 \le \epsilon m (n - r) \sum_{i=1}^k w_i \sigma_i^2.$$

Since

$$\|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 = m \sum_{i=1}^k w_i \|\mu_i\|^2 - m \sum_{i=1}^k w_i \|\mu'_i\|^2,$$

we have

$$m \sum_{i=1}^k w_i \left(\|\mu_i\|^2 - \|\mu'_i\|^2\right) \le \epsilon m (n - r) \sum_{i=1}^k w_i \sigma_i^2,$$

and dividing by $m$ gives the claim.

5.3.1 Random projection vs. spectral projection: examples

As mentioned in the introduction, random projection does not preserve the distance between the means. The following figures illustrate the difference between random projection and the SVD-based projection for a mixture of Gaussians. This gives a more intuitive presentation of the main theorem.

[Figure 1: RP1 and Figure 2: SVD1 (scatter plots) appear here.]

Figure 1 and figure 3 are 2-dimensional random projections of samples from two different 49-dimensional mixtures (one with $k = 2$, the other with $k = 4$). Figure 2 and figure 4 are the projections to the best rank-2 subspaces of the same data sets.

[Figure 3: RP2 and Figure 4: SVD2 (scatter plots) appear here.]

Chapter 6

After Projection

Let us recall the main theorem of the last chapter: given a sufficiently large sample of $m$ points from a mixture of $k$ spherical Gaussians in $\mathbb{R}^n$, we have that after projection to the top $r$ right singular vectors,

$$\sum_{i=1}^k w_i \left(\|\mu_i\|^2 - \|\mu'_i\|^2\right) \le \epsilon (n - r) \sum_{i=1}^k w_i \sigma_i^2,$$

where the $\mu_i$ are the mean vectors of the Gaussians and the $\mu'_i$ are the mean vectors projected to the subspace $V$ spanned by the top $r$ right singular vectors of $A$, the sample matrix.

Suppose we knew the ratio $D$ between the largest variance $\sigma_{\max}^2$ and the smallest variance $\sigma_{\min}^2$. Applying Theorem 10 with $\epsilon$ replaced by $\epsilon^2 w_{\min}/(D(n - r))$ (this is where knowledge of $D$ is used), we have:

$$\sum_{i=1}^k w_i \left(\|\mu_i\|^2 - \|\mu'_i\|^2\right) \le \epsilon^2 w_{\min} \sigma_{\min}^2.$$

This implies that for every $i$,

$$\|\mu_i - \mu'_i\|^2 \le \epsilon^2 \sigma_{\min}^2 \le \epsilon^2 \sigma_i^2.$$

Since each projected mean only moves a little from the original mean, the separation between the projected means is at least the original separation minus this small amount. That is, for every $i, j$:

$$\|\mu'_i - \mu'_j\| \ge \|\mu_i - \mu_j\| - 2\epsilon\sigma_{\min}.$$

If the initial separation between the means of the mixture is $C \max\{\sigma_i, \sigma_j\}\left(r \ln(4m/\delta)\right)^{1/4}$, then the separation after projection is roughly the same! Since the radius of each Gaussian drops to $\sigma\sqrt{r}$ after projection to the $r$-dimensional subspace, we can apply the algorithms based on distance concentration described in chapter 3 to classify the points
However, D may not be known or may be very large. In this case, we cannot apply distance concentration naively, since the separation between a mean and its projection may be quite large for a small Gaussian. That is, without a bound on D, we can only insure that the average distance between means and projected means is small. The algorithm we present in the next section deals with this difficulty and learns a mixture of k spherical Gaussians with separation 14 max{ca o-}(r ln 4))1/4. (6.1) Recall that r = max{k, 96 ln(4m/6)}. The main idea of finding the smallest Gaussian and removing points from it is still the core of our algorithm. The key modification our algorithm makes is to first throw away points from Gaussians for which separation does not hold. Then, the algorithm classifies points from Gaussians (roughly) in order of increasing variance. Later in this chapter we will describe non-polynomial time algorithms when the separation of the mixture is smaller than (6.1). 6.1 The main algorithm The algorithm is best described in the following box. Let S be the points sampled from the mixture of spherical Gaussians. This algorithm is to be used after projecting to the top r right singular vectors of the sample matrix. Algorithm. (a) Let R = maxEs minys Ix - y-l. (b) Discard all points from S whose closest point lies at squared distance at most 3UR 2 to form a new set S'. (c) Let x, w be the two closest points in the remaining set, and let H be the set of all points at squared distance at most 1 = I|I 6n4m w| 2 (1+ ±8 ) from x. (d) Report H as a Gaussian, and remove the points in H from S'. Repeat step (c) till there are no points left. (e) Output all Gaussians learned in step (d) with variance greater than 3R 2 /r. (f) Remove the points from step (e) from the original sample S, and repeat the entire algorithm (including the SVD calculation and projection) on the rest. 30 We now briefly describe the details of the algorithm. Step (a) computes an estimate of the radius of the largest Gaussian. By distance concentration, the closest point to any point is a point from the same Gaussian with high probability. The pair of points with the largest, shortest distance should come from the Gaussian with the largest variance. Now, step (b) removes those points from Gaussians whose variance is too small. Since we know only that the average separation between the projected mean vectors is maintained, Gaussians with small variance are not necessarily well-separated, so we remove them from the set S to obtain S'. Now, with points only from the large Gaussians, we can group those points from the smallest Gaussian together. The two closest points form an estimate of the radius of the smallest Gaussian in S', and 1 is an estimate of the largest distance any point from that Gaussian may lie from x. This is the reasoning behind steps (c) and (d). In step (e), though, we only output those Gaussians with large variance. Why? When we removed points as in step (b), we may have accidentally removed some points from "medium-sized" Gaussians for which separation holds. Therefore, in step (e) we can only output those Gaussians with radii that are not too small. Lastly, in step (f), we can remove those points correctly classified from S, and repeat the entire process on the remaining set. Distance Concentration This section includes the necessary lemmas on distance concentration that we will use repeatedly in the analysis. 
The first lemma shows that in expectation, points from the same Gaussian lie at a distance on the order of the radii of that Gaussian, whereas points from different Gaussians lie on the order of the radii of the larger Gaussian plus the distance between the mean vectors of the Gaussians. The second lemma shows that with high probability, the distance between two points lies close to its expectation. Lemma 5. Let X E F, Y E F be spherical Gaussians in r dimensions with means . Then ps, pt , and variances o, 2 E[||X - Y|| 2] - (2r + Or k + ||/' - '|2 Proof. E[I|X -EY 2 ] E[E(Xi = -Y) 2 ] SE[X2] + E[Y. ]2 2E[Xi]E[Y] r = 1|p' 112 + a2r + I||p' 1|2 + Or2 r + a2 r + | p =o2r 31 - 2 yP's i=1 Lemma 6. Let X E F, and Y E F, where F,, Ft are r dimensional Gaussians with Then for a > 0, the probability that means p', p', and variances or 2r. ||X -Y||2 - E[||X -Y|| 2]| is at most 4e- 2 > 2+ a -p'| + 2|p V oU+ U)) /8 Proof. We consider the probability that ||X -Y|| 2- - Y| y[X2 ] ;> e((. E The other part is analogous. We write + u.)V+ 2||p' - oi + uZi ± (pi - Ptz)) 2 , =X _( = - U + of). t'| where the Zi are N(O, 1) random variables. Therefore, we can rewrite the above event as: oU2 + otZ2+ (p,i t i S - p'i))2 _ t4 + (o2 + o2)r + |p' -- -12 i=1 a ( (Cr + U2) Vr+ 2|1|p' - pt| V02 + ort2 The probability of this occurring is at most the sum of the probabilities of the following events r A=I Z(or2+ r)Zi !(O2+ 0.) (r+ eV'-) i=1 p' -- p' Simpify-ngi) w S2( 2 + ot Zi > a ,(21 - 2|' 11 '1 [ 2 + or . Simplifying, we get: Pr (A) < Pr Pr (B) < Pr Z > r + af IT) (psi - pti)Zi > a||[' - < e-a2/8 '||) < e- The above inequalities hold by applying Markov's inequality and moment generating L functions as in the proof of Lemma 3. 32 Proof of correctness Now we are ready to prove that the algorithm correctly classifies a sufficiently large sample from a mixture of spherical Gaussians with high probability. Theorem 11. With a sample of size (n Wmin + In (max m = Q I~t 2 ) + In') and initial separation |bpi - p|II>14 max{o-g, u-} r In the algorithm correctly classifies all Gaussians with probability at least 1 - 6. Proof. First, let us apply Corollary 3 with E = follows that, .in. Let W k g.2 be the largest variance. It k ZWI i1 2 ) < WminZ I '1 2 Wig2 Wminnj2 So in particular, for all i, _IJ |112 - pp 212 12 < e 2. This implies that for any Gaussian F with variance larger than all j: eo 2 , we have that for 1/4 p || '>- 14o-i r ln 4m)1/4 - 2/-ac > 12 o-i (r ln 4) provided that e < 1. By applying Lemma 6 with this separation between mean vectors and a= 24 n (4), we obtain the following with probability at least 1 : . For any Gaussian F and any two points x, y drawn from it, 2cgr - 4g 2 6rln (4k) < |x - y |2 < 2u 2 r +4U2 It will also be helpful to have the following bounds on I- 6r ln (6.2) y 12 in terms of only 2 or or |IX - y 1 2 < 3o 2 r This follows by upper bounding the deviation term 4uf holds by applying r > 96 ln (). 33 (6.3) 6r ln (w) by ofr. This * For any Gaussians F # Fj, with ou > of and or > ea' and any two points x from F, and y from Fj, |x - y 2 + ( 2)r + 38o 6rln( (6.4) . This follows from lemma 6, which states that: ||z - y ;(o + orj)r +| I's- p|| - a ((af +u2o v/ + 211 p' - p-' l f+ of) and r > 96 In (4), we 241n (i), | ' - p'. > 12-i (rln (g)) With a = obtain the above bound. We can also obtain a lower bound only in terms of U2 as we did above for two points from the same Gaussian by the same upper bound on the deviation term: - y1 2 > 12o r. 
(6.5) Using these bounds, we can classify the smallest Gaussian using (6.2), since interGaussian distances are smaller than intra-Gaussian distances. However, this only holds for Gaussians with variance larger than oa2 . We first show that the first step in the algorithm removes all such small Gaussians, and then show that any Gaussians classified in step (e) are complete Gaussians. The next few observations prove the correctness of the algorithm, conditioned on Lemma 6 to obtain (6.2), (6.3), (6.4) and (6.5). 1. Let y E S' be from some Gaussian F. Then a?2 > or2 . 2. Let x c F, H be the point and set used in any iteration of step (c). H = S' n F, i.e. it is the points in S' from Fj. Then 3. For any Gaussian F with ou > 3eR 2/r, we have F C S', i.e. any Gaussian with sufficiently large variance will be contained in S'. We proceed to prove 1-3. 1. Let x be any point from F, the Gaussian with largest variance, and let w be any other point. By (6.3) and (6.5), 1|X -- wH12 > a2 r, so R2 > r2 r. Now suppose by way of contradiction that or < eU2 . We will show that if this is the case, y would have removed from S, contradicting its membership in S'. Let z be another point in S from F. Then by (6.3) we have that: y - z1 2 < < 30rc 3ea 2 r < 3eR 2 Since y's closest point lies at a distance at most 3eR 2 , this contradicts y c S'. 34 2. First, we show that x, w in step (c) of the algorithm belong to the same Gaussian. If not, suppose without loss of generality that w E F, and that o-r < o-r. But then by (6.2) and (6.4), there exists a point z from F such that: <z 2<ur +4 |x - 6r ln( . However, ||x- w||2>22 2 r+38 6r ln( . This contradicts the fact that x, w are the two closest points in S'. With the bounds on |Hx - w|12 for x, w E F from (6.2), we obtain the following bounds on 1, our estimate for the furthest point from x that is still in Fj: 2 -,r + 4a3 6r In ( < 2 r + 28 6rln I . The lower bound on I ensures that every point in F and S' is included in H. The upper bound on I ensures that any point z E F 0 Fj will not be included in H. If a > uj, this follows from the fact that the upper bound on 1 is less than the lower bound on | - zI| 2 in (6.4). Now suppose ao o<. Since x, w are the closest points in S', it must be the case that: 1|x- 2 w 2r 6r In( +4 ) since otherwise, two points from F would be the two closest points in S'. Since 11x- w112 > 2or -4 6r ln (i), oa2r - we have that u-r - 42 6r In . 6rln4~) Applying this to (6.4), we have that: ~x- zH|2>2a2 o-r + 34 - 6r ln). As this is larger than the upper bound on 1, we have that F n S'= H. 3. Let x C F, with or > 3R 2 /r. We want to show that x E S', so we need to show that step (b) never removes x from S. This holds if: Vz, |1x - z1 2 > 3RR2 35 which is true by (6.3) and (6.5). We have just shown that any Gaussian with variance at least 3eR 2 /r will be correctly classified. By setting e < 1, at least the largest Gaussian is classified in step (e), since R 2 < 3orr by (6.3). So we remove at least one Gaussian from the sample during each iteration. It remains to bound the success probability of the algorithm. For this we need " Corollary 3 holds up to k times, " Steps (a)-(e) are successful up to k times A )k > 1 - 6/2. The probability The probability of the first event is at least (1 of the latter event is just the probability that distance concentration holds, which is (1 - -) > (1 - '). Therefore, we have the probability that the algorithm succeeds is: (1 1)k(1 - 6/2) > 1 - 6. 
Immediate from this proof is the main theorem of this thesis (Theorem 8).

6.2 Non-polynomial time algorithms

Note that if we assume only the minimum possible separation, $\|\mu_i - \mu_j\| \ge C\max\{\sigma_i, \sigma_j\}$, then after projection to the top $k$ right singular vectors the separation is
$$\|\mu_i' - \mu_j'\| \;\ge\; (C - 2\varepsilon)\max\{\sigma_i, \sigma_j\},$$
which is on the order of the radius divided by $\sqrt{k}$ (note that we started with a separation on the order of the radius divided by $\sqrt{n}$). Here we could use an algorithm that is exponential in $k$, using $O(k/w_{\min})$ samples from the mixture to obtain a maximum-likelihood estimate. First, project the samples to the top $k$ right singular vectors of the sample matrix. Then, consider each of the $k^{O(k/w_{\min})}$ partitions of the points into $k$ clusters. For each cluster of a partition, we can compute the mean, the variance, and the mixing weight of the cluster. Since the points were generated by spherical Gaussians, and we know the density function $F_i$ of a spherical Gaussian with a given mean and variance, we can compute the likelihood of the partition. Let $x$ be any point in the sample, and let $l(x)$ denote the cluster that contains it. Then the likelihood of the sample is
$$\prod_{x \in S} F_{l(x)}(x).$$
By examining each of the partitions, we can determine the partition of maximum likelihood, obtaining estimates of the means, variances, and mixing weights of the mixture.

Chapter 7

Conclusion

We have presented an algorithm that learns a mixture of spherical Gaussians in the sense that, given a suitably large sample of points, it correctly classifies the points according to the distribution from which they were generated, with high probability. The work here raises several interesting questions and remarks, with which we conclude.

7.1 Remarks

The algorithm and its guarantees can be extended to mixtures of weakly isotropic distributions, provided they satisfy two types of concentration bounds. For a guarantee as in Theorem 10, we require an analog of Lemma 3 to hold, and for a guarantee as in Theorem 11, we need a lemma similar to Lemma 6. A particular class of distributions that possesses good concentration bounds is the class of logconcave distributions. A distribution is said to be logconcave if its density function $f$ is logconcave, i.e. for any $x, y \in \mathbb{R}^n$ and any $0 < \alpha < 1$,
$$f(\alpha x + (1 - \alpha)y) \;\ge\; f(x)^{\alpha} f(y)^{1-\alpha}.$$
Examples of logconcave distributions include the uniform distribution on a weakly isotropic convex body, e.g. a cube or a ball. Although the weak isotropy property might sound restrictive, it is worth noting that any single logconcave distribution can be made weakly isotropic by a linear transformation (see, e.g., [8]).

We remark that the SVD is tolerant to noise whose 2-norm is bounded [10, 9, 2]. Thus, even after corruption, the SVD of the sample matrix will recover a subspace that is close to one spanning the mean vectors of the underlying distributions. In this low-dimensional space, one could exhaustively examine subsets of the data to learn the mixture (the ignored portion of the data corresponds to the noise).

7.2 Open problems and related areas

The first natural question is whether the technique of projecting the sample matrix to the top $k$ right singular vectors can be applied when the Gaussians in the mixture are no longer spherical. Although a variant of Theorem 9 does not hold when the Gaussians are not spherical, perhaps it is the case that the subspace that maximizes the expected norm of the projection of the sample matrix is not far from the subspace spanned by the mean vectors.
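For reference, the projection that both the algorithm of Chapter 6 and the exponential-time variant of Section 6.2 rely on can be written in a few lines; the sketch below is a minimal illustration (assuming the rows of the matrix A are the sample points), not the thesis's implementation.

\begin{verbatim}
import numpy as np

def spectral_project(A, k):
    """Project each sample (row of A) onto the span of the top k right
    singular vectors of the sample matrix A, returning k coordinates."""
    # Economy-size SVD: A = U diag(s) Vt; the rows of Vt are the right
    # singular vectors of A.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    V_k = Vt[:k].T        # n x k orthonormal basis of the top-k subspace
    return A @ V_k        # m x k matrix of projected samples

# Example: proj = spectral_project(samples, k); classification (or, for the
# exponential-time variant, maximum likelihood over partitions) then runs on proj.
\end{verbatim}

Whether this projection remains useful when the Gaussians are not spherical is precisely the question raised above.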
Even if this conjecture does hold, however, classifying the points after projection would be a much more difficult task than it is for the clustering algorithm described in Chapter 6. It will probably be necessary to use isoperimetry, as in [1].

A more open-ended direction of research is finding the right parameter that naturally characterizes the difficulty of learning mixtures of Gaussians. While this work and previous works take the separation (the distance between the means of the underlying Gaussians) to be the main parameter, small separation between mean vectors does not necessarily mean that it is impossible to distinguish two Gaussians. Consider two Gaussians, one with very large variance and one with very small variance. Even if the two Gaussians have the same mean, it is possible to distinguish points drawn from one from points drawn from the other: there is very little probability overlap between the two distributions, and using distance concentration we can first remove the points from the smaller Gaussian. Learning a mixture of Gaussians when the only assumption is small probability overlap appears to be a challenging problem.

Lastly, while the work in this thesis is primarily theoretical, it would be interesting to see how the algorithm performs in practice. While it is difficult to find data that takes the form of spherical Gaussians, finding data that is approximately Gaussian is not difficult. Applying the algorithms developed in this thesis to such experimental data may show that the algorithm performs much better in practice than the theoretical bounds suggest. Further, experiments can test the success of the algorithm when the Gaussians are non-spherical, and suggest how to prove that spectral projection works for non-spherical Gaussians.

Bibliography

[1] S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the 33rd ACM STOC, 2001.

[2] Y. Azar, A. Fiat, A. Karlin, F. McSherry, and J. Saia. Spectral analysis of data. In Proceedings of the 33rd ACM STOC, 2001.

[3] M. Collins. The EM algorithm. Unpublished manuscript, 1997.

[4] S. Dasgupta. Learning mixtures of Gaussians. In Proceedings of the 40th IEEE FOCS, 1999.

[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.

[6] M. L. Eaton. Multivariate Statistics. Wiley, New York, 1983.

[7] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, third edition, 1996.

[8] L. Lovász and S. Vempala. Logconcave functions: geometry and efficient sampling algorithms. In Proceedings of the 44th IEEE FOCS (to appear), 2003.

[9] C. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: a probabilistic analysis. Journal of Computer and System Sciences, 61:217-235, 2000.

[10] G. Stewart. Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM Review, 15(4):727-764, 1973.

[11] S. Stigler. Statistics on the Table. Harvard University Press, 1999.

[12] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.

[13] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences (to appear).

[14] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In Proceedings of the 43rd IEEE FOCS, 2002.