A Spectral Algorithm for Learning Mixtures of Spherical Gaussians

by

Grant J. Wang

B.S. in Computer Science, Cornell University (2001)

Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology

August 29, 2003

© MMIII Massachusetts Institute of Technology. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, August 29, 2003

Certified by: Santosh Vempala, Associate Professor of Mathematics, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses
A Spectral Algorithm for Learning Mixtures of Spherical Gaussians
by
Grant J. Wang¹
Submitted to the Department of Electrical Engineering and Computer Science
on August 29, 2003
in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Electrical Engineering and Computer Science

ABSTRACT
Mixtures of Gaussians are a fundamental model in statistics and learning theory,
and the task of learning such mixtures is a problem of both practical and theoretical
interest. While there are many notions of learning, in this thesis we will consider
the following notion: given m points sampled from the mixture, classify the points
according to the Gaussian from which they were sampled.
While there are algorithms for learning mixtures of Gaussians used in practice, such
as the EM algorithm, very few provide any guarantees. For instance, it is known that
the EM algorithm can fail to converge. Recently, polynomial-time algorithms have
been developed that provably learn mixtures of spherical Gaussians when assumptions
are made on the separation between the Gaussians. A spherical Gaussian distribution
is a Gaussian distribution in which the variance is the same in every direction.
In this thesis, we develop a polynomial-time algorithm to learn a mixture of k
spherical Gaussians in R^n when the separation depends polynomially on k and log n.
Previous works provided algorithms that learn mixtures of Gaussians when the separation between the Gaussians grows polynomially in n. We also show how the algorithm
can be modified slightly to learn mixtures of spherical Gaussians, in exponential time, when the separation
is essentially the least possible. The main tools we use to develop
this algorithm are the singular value decomposition and distance concentration. While
distance concentration has been used to learn Gaussians in previous works, the application of singular value decomposition is a new technique that we believe can be
applied to learning mixtures of non-spherical Gaussians.
Thesis Supervisor: Santosh Vempala
Title: Associate Professor of Mathematics
¹Supported in part by National Science Foundation Career award CCR-987024 and an NTT Fellowship.
Acknowledgments
When I came to MIT two years ago, I was without an advisor. I was unclear as to
whether this was normal, and I spent much of my first semester worrying about this.
I am lucky that Santosh took me in as a student, and didn't rush me, gently leading
me along my first research experience. For this, the endless support, and teaching me
so much, I am grateful.
Thanks to everybody on the third floor for sharing many meals, laughs, conversations, and crossword puzzles. Two things stand out - the 313 and the intramural
volleyball team. Chris, Abhi, and Sofya have been great officemates, even though I'm
sure I've done less work because of them. And who knew that my first intramural
championship ever would be the result of playing volleyball with computer science
graduate students and my advisor?
As a residence, 21R has been good to me - the guys have always been there to make
sure I was having enough fun. Wing deserves a special thanks, for being something
like a surrogate mom and cooking me several meals. My parents and sister, of course,
deserve the most thanks, for always being there and supporting my interests, regardless
of what they were.
Chapter 1
Introduction
The Gaussian (or Normal) distribution is a staple distribution of science and engineering. It is slightly misnamed, because the first use of it dates to de Moivre in 1733, who
used it to approximate binomial distributions for large n [11]. Gauss himself first used
it in 1809 in the analysis of errors of experiments. In modern times, the distribution
is used in diverse fields to model various phenomena. Why the Gaussian distribution
is used in practice can be linked to the Central Limit Theorem, which states that the
mean of a set S of identically distributed, independent random variables with finite
mean and variance approaches a Gaussian random variable as the size of the set S goes
to infinity. However, the assumption that the data comes from a Gaussian distribution
can often be one made out of convenience; as Lippman stated:
Everybody believes in the exponential law of errors: the experimenters, because they think it can be proved by mathematics; and the mathematicians,
because they believe it has been established by observation.
Regardless, the Gaussian distribution certainly is a distribution of both practical and
theoretical interest.
One common assumption that is made by practitioners is that data samples come
from a mixture of Gaussian distributions. A mixture of Gaussian distributions is a
distribution in which a sample is picked according to the following procedure: the i-th
Gaussian F_i is picked with probability w_i, and a sample is chosen according to F_i. The
use of mixtures of Gaussians is prevalent across many fields such as medicine, geology,
and astrophysics [12]. Here, we briefly touch upon one of the uses in medicine described
in [12]. In a typical scenario, a patient suffering from some unknown disease is subjected
to numerous clinical tests, and data from each test is recorded. Each disease will affect
the results of the clinical tests in different ways. In this way, each disease induces some
sort of distribution on clinical test data. The results of a particular patient's clinical
test data is then a sample from this distribution. As noted in [12], this distribution is
assumed to be a Gaussian. Given many different samples (i.e. clinical test data from
patients), one task would be to classify the samples according to which distribution
they came from, i.e. cluster the patients according to their diseases.
In this thesis, we investigate a special case of this easily stated problem: given points
sampled from a mixture of Gaussian distributions, learn the mixture of Gaussians. The
term learn is intentionally vague; there are several notions of learning, each slightly
different. The notion we consider here is that of classification: given m points in R,
determine for each point the distribution from which it was sampled. Another popular
notion is that of maximum-likelihood learning. Here, we are asked to compute the
underlying properties of the mixture of Gaussians: the mixing weights as well as the
mean and variances of each underlying distribution.
When n ≤ 3, this problem is easy in the sense that the points can be plotted in
n dimensions (line, plane, R^3), and the human eye can easily classify the points, as
long as the distributions do not overlap too much. In Figure 1-1, we see two Gaussians
in R^2, which are easily discernible. However, when n ≥ 4, the structure of the sampled
points is no longer directly observable.
[Figure 1-1: scatter plot titled "Easily discernable Gaussians"; see caption below.]
Figure 1-1: Two spherical Gaussians in R^2. Points sampled from one of the Gaussians
are labelled with dots, points from the other with circles. Since the distributions do not
overlap too much, classifying points is easy.
1.1 Our work
This thesis presents an algorithm that correctly classifies sufficiently many samples
from a mixture of k spherical Gaussians in R^n. A spherical Gaussian is one in which
the variance in any direction is the same. A key property of a mixture of Gaussians that
affects the ability to classify points from a sample is the distance between the means
of the distributions. When the means of the distributions are close together, correctly
classifying samples is a difficult task, as can be seen in Figure 1-2. However, as the means
of the distributions grow further apart, it becomes easier to distinguish them, since points
from different Gaussians are almost separated in space. Recently, algorithms have
been designed to learn Gaussians when the distance between the means of the distributions
grows polynomially with n. This thesis presents the first algorithm to provably learn
spherical Gaussians where the distance between means grows polynomially in k and
log n. When the number of Gaussians is much smaller than the dimension, our results
provide algorithms that can learn a larger class of mixtures of distributions.
[Figure 1-2: scatter plot titled "Separation between Gaussians too small to distinguish"; see caption below.]
Figure 1-2: Two spherical Gaussians in R2 . Since the distance between the two means
of the Gaussians is small, it is hard to classify points.
The two main techniques we use are spectral projection and distance concentration.
While distance concentration has been used previously to learn Gaussians (see Chapter
3), only random projection had been used. The use of spectral projection, however,
is new and is the key technique in this thesis. The effect of spectral projection is
to shrink the size of each Gaussian while at the same time maintaining the distance
between means. This allows us to apply previous techniques after applying spectral
projection to classify Gaussians when the distance between the means is much smaller.
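To make this concrete, the following is a minimal sketch (not from the thesis; the dimensions, means, variances, and random seed are made up for illustration) that samples from a mixture of two spherical Gaussians and compares the typical radius of each component and the distance between the empirical means before and after projecting onto the top right singular vectors of the sample matrix.

```python
# Minimal sketch (illustrative parameters) of the effect of spectral projection:
# the radius of each Gaussian shrinks while the distance between means is kept.
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 500, 2000, 1.0                 # dimension, samples, common std. dev.
mu = np.zeros((2, n))
mu[1, :4] = 5.0                              # means differ only in a few coordinates

labels = rng.integers(0, 2, size=m)          # equal mixing weights
A = mu[labels] + sigma * rng.standard_normal((m, n))

# Project onto the top k = 2 right singular vectors of the sample matrix.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
P = A @ Vt[:2].T

# Radius shrinks from about sigma*sqrt(n) to about sigma*sqrt(2), while the
# distance between the empirical means is approximately preserved.
for X, name in [(A, "original"), (P, "projected")]:
    means = [X[labels == i].mean(axis=0) for i in (0, 1)]
    sep = np.linalg.norm(means[0] - means[1])
    spread = np.mean([np.linalg.norm(X[labels == i] - means[i], axis=1).mean() for i in (0, 1)])
    print(f"{name}: separation ~ {sep:.2f}, typical radius ~ {spread:.2f}")
```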
1.2 Bibliographic note
The work in this thesis is joint work with Santosh Vempala; much of it appears in [14]
and [13].
Chapter 2
Definitions and Notation
In this chapter, we introduce the necessary definitions and notation to present the
results in this thesis.
2.1 Spherical Gaussian Distributions
A spherical Gaussian distribution over R^n can be viewed as a product distribution of
n univariate Gaussian distributions. A univariate random variable from the Gaussian
distribution with mean μ ∈ R and variance σ² ∈ R has the density function

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

and is denoted by N(μ, σ). With this in hand, we can define a spherical Gaussian over
R^n.

Definition 1. A spherical Gaussian F over R^n with mean μ and variance σ² has the
following product distribution over R^n:

$$(N(\mu_1, \sigma), N(\mu_2, \sigma), \ldots, N(\mu_n, \sigma)).$$
In this thesis, the distance between the means of two spherical Gaussians will often
be discussed relative to the radius of a Gaussian, which we define next.
Definition 2. The radius of a spherical Gaussian F in R^n with variance σ² is defined as σ√n.
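As a quick sanity check (a sketch, not part of the thesis; the dimension and variance are illustrative), the squared norm of a centered spherical Gaussian sample is a sum of n independent σ² times chi-squared terms, so its norm concentrates around the radius σ√n:

```python
# Minimal sketch: samples from a centered spherical Gaussian in R^n have norm
# concentrating around the radius sigma * sqrt(n).  Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 400, 3.0
samples = sigma * rng.standard_normal((10000, n))
norms = np.linalg.norm(samples, axis=1)
print(norms.mean(), sigma * np.sqrt(n))   # the two values are close
```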
2.2 Mixtures of Gaussian distributions
The main algorithm in this thesis classifies points sampled from a distribution that
is the mixture of spherical Gaussians. Here, we give a precise definition of such a
distribution.
Definition 3. A mixture of k spherical Gaussians in R^n is a collection of weights
w_1, ..., w_k and a collection of spherical Gaussians F_1, ..., F_k over R^n. Each spherical
Gaussian F_i has an associated mean μ_i and variance σ_i². Sampling a point from a mixture
of k spherical Gaussians involves first picking a Gaussian F_i with probability w_i, and
then sampling a point from F_i.
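A minimal sketch of this sampling procedure follows (illustrative code, not from the thesis; the helper name sample_mixture and the example parameters are made up):

```python
# Minimal sketch of the sampling procedure in Definition 3: pick component i
# with probability w_i, then draw from the spherical Gaussian N(mu_i, sigma_i^2 I).
import numpy as np

def sample_mixture(m, weights, means, sigmas, rng=np.random.default_rng()):
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)          # shape (k, n)
    labels = rng.choice(len(weights), size=m, p=weights / weights.sum())
    noise = rng.standard_normal((m, means.shape[1]))
    return means[labels] + np.asarray(sigmas)[labels, None] * noise, labels

# Example: two components in R^10 with unequal weights and variances.
A, labels = sample_mixture(1000, [0.3, 0.7],
                           np.vstack([np.zeros(10), np.full(10, 4.0)]), [1.0, 2.0])
```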
A crucial feature of a mixture of Gaussians that affects the ability of an algorithm to
classify points from that mixture is the distance between the means of the distributions,
which we define to be separation below.
Definition 4. The separation between two Gaussians F_i, F_j is the distance between
their means: ‖μ_i − μ_j‖. The separation of a mixture of spherical Gaussians is the
minimum separation between any two Gaussians in the mixture.
In this thesis, the separation of a mixture will often be discussed with respect to the
radius of the Gaussians in the mixture. The intuition behind these definitions is more
easily seen visually; see Figure 2-1 for an illustration.
[Figure 2-1: scatter plot titled "Radius and separation of Gaussians"; see caption below.]
Figure 2-1: Two spherical Gaussians F_1, F_2 in R^2. F_1 has mean μ_1 = (0, 0),
variance σ_1² = 4, and radius 2√2. F_2 has mean μ_2 = (20, 20), variance σ_2² = 16,
and radius 4√2. The circle around each Gaussian indicates its radius. The
separation ‖μ_1 − μ_2‖ = √(20² + 20²) between the means of the Gaussians is the length of the line
connecting their means.
2.3 Singular Value Decomposition
One of the main tools that we use in our algorithm is the singular value decomposition
(SVD) of a matrix. This is commonly known in the literature as spectral projection.
Definition 5. The singular value decomposition of a matrix A ∈ R^{m×n} is the decomposition

$$A = U \Sigma V^T$$

where U ∈ R^{m×m}, Σ ∈ R^{m×n}, and V ∈ R^{n×n}. The columns of U form an orthonormal
basis for R^m and are referred to as the left singular vectors of A. The columns of V
form an orthonormal basis for R^n and are referred to as the right singular vectors of
A. Σ is a diagonal matrix, and the entries σ_1 ≥ ... ≥ σ_r on its diagonal are defined as the
singular values of A, where r is the rank of A.
One of the key uses of the singular value decomposition is spectral projection. For
a vector w ∈ R^n and an r-dimensional subspace V ⊆ R^n, we write proj_V w to mean the
projection of w onto the subspace V. Specifically, let v_1, ..., v_r be a set of orthogonal
unit vectors such that span{v_1, ..., v_r} = V. Then proj_V w = Σ_{i=1}^r (w · v_i) v_i. For a matrix
W ∈ R^{m×n}, proj_V W is the matrix whose rows are the rows of W projected onto V.
By the spectral projection of a vector w ∈ R^n, we mean the projection of w onto the
subspace V spanned by the top r right singular vectors of a matrix A. The subspace
V has the desirable property that it maximizes the norm of the projection of the rows of A over
all rank-r subspaces. The following theorem makes this precise:
Theorem 6. Let A ∈ R^{m×n}, and let r ≤ n. Let V be the subspace spanned by the top
r right singular vectors of A. Then we have that:

$$\|\mathrm{proj}_V A\|_F = \max_{W : \operatorname{rank}(W) = r} \|\mathrm{proj}_W A\|_F.$$
Proof. Let W be an arbitrary subspace of dimension r spanned by the orthogonal unit
vectors w_1, ..., w_r, and let V be the subspace spanned by the top r right singular vectors
v_1, ..., v_r. Note that we have:

$$\|\mathrm{proj}_W A\|^2 = \sum_{i=1}^r \|A w_i\|^2.$$

For each w_i, write w_i in terms of v_1, ..., v_n, the n right singular vectors of A, which span R^n.
That is, let:

$$w_i = \sum_{j=1}^n \alpha_{ij} v_j.$$

Since Av_1, ..., Av_n is an orthogonal set of vectors, we have that for every i:

$$\|A w_i\|^2 = \sum_{j=1}^n \alpha_{ij}^2 \|A v_j\|^2.$$

So we have:

$$\|\mathrm{proj}_W A\|^2 = \sum_{i=1}^r \sum_{j=1}^n \alpha_{ij}^2 \|A v_j\|^2 = \sum_{j=1}^n \left( \sum_{i=1}^r \alpha_{ij}^2 \right) \|A v_j\|^2.$$

Now, for each j, note that Σ_{i=1}^r α_{ij}² = ‖proj_W v_j‖² ≤ 1. This is true because α_{ij} = (w_i · v_j) and
‖v_j‖ = 1. Moreover, since ‖w_i‖ = 1, we have Σ_{j=1}^n α_{ij}² = 1 for each i, and so
Σ_{j=1}^n Σ_{i=1}^r α_{ij}² = r. Since ‖A v_j‖² = σ_j² and σ_1² ≥ σ_2² ≥ ... ≥ σ_n², it follows that

$$\|\mathrm{proj}_W A\|^2 \le \sum_{j=1}^r \sigma_j^2.$$

Since V achieves this value, we have the desired result. □
This is the key property of spectral projection that we use in this thesis.
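A minimal numerical check of Theorem 6 follows (illustrative only; the matrix sizes and random subspace are made up for the sketch):

```python
# Minimal numerical check of Theorem 6: the top-r right singular vectors maximize
# the Frobenius norm of the projection of the rows of A among rank-r subspaces.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 50))
r = 5

_, s, Vt = np.linalg.svd(A, full_matrices=False)
best = np.linalg.norm(A @ Vt[:r].T)            # ||proj_V A||_F = sqrt(sum of top-r sigma_i^2)
print(best, np.sqrt(np.sum(s[:r] ** 2)))

# Any other rank-r subspace (here a random one) gives a smaller projection norm.
Q, _ = np.linalg.qr(rng.standard_normal((50, r)))
print(np.linalg.norm(A @ Q), "<=", best)
```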
2.4 Gaussians and projections
Let F be a spherical Gaussian over R^n with variance σ² and mean μ. If we project F to
a subspace V of dimension r < n, F becomes a spherical Gaussian with radius σ√r.
To see this, let {v_1, ..., v_r} be a set of orthonormal vectors spanning V, and let
{v_{r+1}, ..., v_n} be a set of orthonormal vectors orthogonal to V, so that span{v_1, ..., v_n} = R^n.
In the standard basis, F has the product distribution

$$(N(\mu_1, \sigma), N(\mu_2, \sigma), \ldots, N(\mu_n, \sigma)).$$

In the basis {v_1, ..., v_n}, this is a similar product distribution, since F is spherical:

$$(N(v_1 \cdot \mu, \sigma), N(v_2 \cdot \mu, \sigma), \ldots, N(v_n \cdot \mu, \sigma)).$$

Projecting onto V is just a restriction to the first r of these coordinates, and therefore
the projection of F onto V is a spherical Gaussian with radius σ√r.
2.5 Learning a mixture of Gaussian distributions
As we noted in the introduction, the notion of learning a mixture of Gaussian distributions can be interpreted in a number of ways. Here, we formally define the notion
of learning that the algorithm in this thesis achieves.
Definition 7. Let M be a mixture of k spherical Gaussians F_1, ..., F_k over R^n with
weights w_1, ..., w_k. Let μ_i, σ_i² be the mean and variance of F_i. An algorithm A learns a
mixture M with separation s if, for any δ > 0, given a number of samples that is at most
polynomial in log(1/δ), n, k, and 1/w_min, the algorithm correctly classifies the samples according
to the Gaussian from which they were sampled, with probability 1 − δ.
Another popular notion of learning is that of maximum likelihood estimation. Here,
an algorithm is given samples from a mixture of distributions, and is asked to compute
the parameters of a mixture (in the case of Gaussians, this would be the means, variances, mixing weights) that maximize the likelihood that the data seen was sampled
from that mixture. In general, these notions of learning are equivalent in the sense that
an algorithm for one notion can be used to achieve another notion of learning. For
example, note that an algorithm that achieves the above definition of learning can also
be used to approximate the means, variances and mixing weights of the mixture model.
For sufficiently large sample sets, these parameters are exactly those that can be derived from a correct classification of the sample sets. A maximum likelihood algorithm
can also be used to achieve the classification notion of learning by classifying each point
according to whichever Gaussian is most likely to have generated it.
Chapter 3
Previous work
3.1 EM Algorithm
The EM algorithm is the most well-known algorithm for learning mixture models. The
notion of learning it achieves is that of maximum likelihood estimation. Introduced by
Dempster, Laird, and Rubin [5], EM is the algorithm that is used in practice to learn
mixture models. Despite this practical success, there are few theoretical guarantees
known that the algorithm achieves.
The problem that the EM algorithm solves is actually more general than simply
learning mixture models. In full generality, EM is a parameter estimation algorithm for
"hidden", or "incomplete" data [3]. Given this data, EM attempts to find parameters E
that maximize the likelihood of the data. In the particular case of a mixture of spherical
Gaussians, the data given to EM are the sample points. The data is considered hidden
or incomplete because each point is not labelled with the distribution from which it
was sampled. EM attempts to estimate the parameters of the mixture (mixing weights,
and means and variances of the underlying Gaussians) that maximize the log-likelihood
of the sample data A. EM works with the log-likelihood, rather than the likelihood
because it is analytically simpler.
The algorithm proceeds in an iterative fashion; that is, it makes successive estimates
Θ_1, Θ_2, ... of the parameters. The key property of EM is that the log-likelihood of
the parameters never decreases; that is, L(Θ_{t+1}) ≥ L(Θ_t). Each iteration involves two steps:
an expectation step and a maximization step. In the expectation step, we compute the
expected value of the log-likelihood of the complete data as a function of candidate parameters Θ', given
the observed data and the current estimate Θ_t. In the maximization step, we maximize
this expected value over the parameters Θ'.
The one key property proven by [5] is that the log-likelihood increases; however, one
can see that this can result in a local optimum. There are no promises that EM will
compute the global optimum; furthermore, the rate of convergence can be exponential
in the dimension. In light of this, recent research has been focused towards polynomial
time algorithms for learning Gaussians.
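For concreteness, here is a minimal sketch of EM specialized to a mixture of k spherical Gaussians (the standard E- and M-step updates, not code from the thesis or from [5]; the helper name em_spherical and the initialization are assumptions of the sketch):

```python
# Minimal sketch of EM for a mixture of k spherical Gaussians (covariances sigma_i^2 I).
import numpy as np

def em_spherical(A, k, iters=100, rng=np.random.default_rng()):
    m, n = A.shape
    mu = A[rng.choice(m, size=k, replace=False)]           # initial means: random points
    var = np.full(k, A.var())                               # initial variances
    w = np.full(k, 1.0 / k)                                 # initial mixing weights
    for _ in range(iters):
        # E-step: posterior responsibility of component i for each point (log-space).
        sq = ((A[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)      # (m, k)
        logp = np.log(w) - 0.5 * n * np.log(2 * np.pi * var) - sq / (2 * var)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and per-component spherical variances.
        Nk = resp.sum(axis=0)
        w = Nk / m
        mu = (resp.T @ A) / Nk[:, None]
        sq = ((A[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (resp * sq).sum(axis=0) / (n * Nk)
    return w, mu, var
```

As the text notes, each iteration of these two steps never decreases the log-likelihood, but the procedure may still converge to a local optimum.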
3.2 Recent work
Recent work has produced algorithms with provable guarantees on the ability to learn
mixtures of Gaussians with large separation. The tools used to produce such results
include random projection, isoperimetry, distance concentration, and clustering algorithms.
The first work to provide a provably correct algorithm for learning Gaussians is due
to Dasgupta[4]. The notion of learning Gaussians in [4] is to approximately determine
the means of each of the component Gaussians. Stated roughly, he learns a mixture of
k sphere-like Gaussians over R^n at minimum separation proportional to the radius:

$$\|\mu_i - \mu_j\| \ge C \max\{\sigma_i, \sigma_j\} \sqrt{n}.$$
There are also other technical conditions: the Gaussians must have the same variance
in every direction, and the minimum mixing weight must be Ω(1/k). By sphere-like, we
mean that the ratio between the maximum variance in any direction and the minimum
variance in any direction is bounded. This is known as the eccentricity of the Gaussian.
The two main tools he uses are random projection and distance concentration.
The first step of his algorithm involves random projection: given a sample matrix
where each row is a point sampled from a mixture of Gaussians, he projects to a randomly
chosen O(log k)-dimensional subspace. After projection, the separation between Gaussians
and the radii of the Gaussians are now functions of log k: specifically, the separation
is now Cσ√(log k) and the radius of each Gaussian is σ√(log k). Note that in terms of
the dimension, things are no different. However, the eccentricity is reduced to a
constant, thus making the Gaussians approximately spherical.

Once in O(log k) dimensions, the algorithm proceeds in k iterations; in each iteration,
it attempts to find the points sampled from the i-th Gaussian, F_i. The exact
details follow: initially, the set S contains all of the sample points. In each iteration,
the algorithm finds the point x whose distance to at least p other points is smallest.
Every point within a distance q of x is removed from S and placed in the set S_i,
the set of points belonging to F_i. Back in n dimensions, the estimate for the mean of
the i-th Gaussian is the average of the preimage of the points in S_i. The reason such a
simple algorithm works is distance concentration. Roughly stated, this says
that the squared distance between any two sampled points lies close to its expectation.
The expected squared distance between two points from the same Gaussian is about 2σ² log k,
whereas the expected squared distance between two points from different Gaussians is about
(σ_i² + σ_j²) log k + ‖μ_i − μ_j‖². If the squared separation is on the order of Cσ² log k, then
points from each Gaussian are separated in space: from any point, all other points
within a squared distance of Cσ² log k belong to the same Gaussian.
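A rough sketch of this clustering step follows (illustrative only; the thresholds p_neighbors and q_radius are hypothetical placeholders, since [4] derives their specific values from the separation assumptions, and the function name is made up):

```python
# Illustrative sketch of the distance-based clustering step described above.
import numpy as np

def cluster_by_distance(points, k, p_neighbors, q_radius):
    points = np.asarray(points, dtype=float)
    S = list(range(len(points)))
    clusters = []
    for _ in range(k):
        # Find the point whose distance to its p-th nearest neighbour is smallest.
        dists = np.linalg.norm(points[S][:, None, :] - points[S][None, :, :], axis=2)
        pth = np.sort(dists, axis=1)[:, p_neighbors]
        center = S[int(np.argmin(pth))]
        # Remove every point within distance q of that point and call it a cluster.
        near = np.linalg.norm(points[S] - points[center], axis=1) <= q_radius
        clusters.append([S[i] for i in np.flatnonzero(near)])
        S = [S[i] for i in np.flatnonzero(~near)]
    return clusters
```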
Arora and Kannan extended these techniques to learn mixtures of different Gaussians of arbitrary shape [1]; that is, the eccentricity of the Gaussians need not be
bounded, and they need not possess the same covariance matrix. By using isoperimetry to find new distance concentration results for non-spherical Gaussians, they are
able to learn Gaussians at a separation of
$$\|\mu_i - \mu_j\| > C \max\{\sigma_i, \sigma_j\}\, n^{1/4}.$$
While [1] introduces several algorithms, the same key ideas are used: random projection
shrinks the eccentricity, and distance concentration allows relatively simple classification algorithms to succeed.
Later, Dasgupta and Schulman discovered that a variant of the EM algorithm can
be proven to learn Gaussians at a slightly smaller separation of:

$$\|\mu_i - \mu_j\| \ge C \max\{\sigma_i, \sigma_j\}\, n^{1/4}.$$
To prove that this variant works, they also appeal to distance concentration. Learning
mixtures of distributions when the separation between Gaussians is not polynomial
in the dimension was not previously known to be possible, since distance concentration
techniques cannot be directly applied.
Chapter 4
The Algorithm
Our algorithm for learning mixtures of spherical Gaussians is most easily described by
the following succinct description. Let M be a mixture of k spherical Gaussians.
Algorithm SpectralLearn
Input: A matrix A ∈ R^{m×n} whose rows are sampled from M.
Output: A partition (S_1, ..., S_k) of the rows of A.

1. Compute the singular value decomposition of A.
2. Let r = max{k, C log(n/w_min)}. Project the samples to the rank-r
   subspace spanned by the top r right singular vectors.
3. Perform a distance-based clustering in the r-dimensional space.
Step 1 can be done using one of many algorithms developed for computing the SVD,
which can be found in a standard numerical analysis text, such as [7]. For concreteness,
we refer the reader to the R-SVD algorithm [7]. The running time of this algorithm is
O(n³ + m²n).
With the singular value decomposition A = UΣV^T in hand, the projection in step 2
is trivial. In particular, it can be done by the matrix multiplication A V_r, where V_r is
the matrix containing only the first r columns of V. The rows of A V_r are the coordinates
of the projected rows of A in R^r.
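A minimal sketch of steps 1 and 2 follows (illustrative; the default constant C and the helper name spectral_project are assumptions, not values fixed by the thesis):

```python
# Minimal sketch of steps 1 and 2 of SpectralLearn.
import numpy as np

def spectral_project(A, k, w_min, C=100):
    m, n = A.shape
    r = max(k, int(np.ceil(C * np.log(n / w_min))))
    r = min(r, n)                                   # cannot exceed the ambient dimension
    # Step 1: singular value decomposition A = U S V^T.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Step 2: coordinates of the rows of A in the span of the top r right singular vectors.
    return A @ Vt[:r].T
```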
Step 3 is very similar in nature to the algorithms described in [1]. However, due to
the specific properties of spectral projection, our algorithm is a bit more complicated,
and we refer the reader to the full description in Chapter 6.
The above algorithm can be applied to any mixture of spherical Gaussians. The
main result of this thesis is a guarantee on the ability of this algorithm to correctly classify the points in the mixture when the separation between Gaussians is polynomially
dependent on k and not n. The full statement of our main theorem follows:
Theorem 8. Let M be a mixture of k spherical Gaussians with separation

$$\|\mu_i - \mu_j\| > C \max\{\sigma_i, \sigma_j\} \left( \big(k \log (n/w_{\min})\big)^{1/4} + \big(\log (n/w_{\min})\big)^{1/2} \right).$$
The algorithm SpectralLearn learns M with at least

$$m = O\!\left( \frac{n}{w_{\min}}\left( n + \max_i \frac{\|\mu_i\|^2}{\sigma_i^2} \right) \right)$$

samples.
4.1 Intuition and Proof sketch
To see why spectral projection aids in learning a mixture of Gaussians, consider the
task of learning a mixture of k Gaussians in n dimensions given a sample matrix
A ∈ R^{m×n}. If the separation between all Gaussians is on the order of Õ(n^{1/4}), we can
use the clustering algorithm used in [1] to correctly classify the sample points. If the
separation between the Gaussians is much smaller than Õ(n^{1/4}), it is clear that the same
techniques cannot be applied. In particular, distances between points from the same
Gaussian and distances between points from different Gaussians are indistinguishable;
since the clustering algorithm uses only this information, it will clearly fail.
Therefore, when the separation between Gaussians is Õ(k^{1/4}), polynomial in k and
log n, and not the dimension n, we cannot just naively apply the clustering algorithm
to the sample points.
The key observation is this: assume that we know that the means of the Gaussians
are μ_1, ..., μ_k, and let U be the k-dimensional subspace spanned by {μ_1, ..., μ_k}. If we
project the Gaussians to U, note that the separation between any two Gaussians does
not change, since their means are in the subspace U. However, we have reduced the
dimension to k, and a Gaussian remains a Gaussian after projection (see Chapter 2).
In our particular case, the results of the projection are the following: a mixture of
Gaussians in k dimensions with a separation of Õ(k^{1/4}). Therefore, we can apply the
clustering algorithm in [1] to correctly classify points!
Of course, we do not know the means of the Gaussians μ_1, ..., μ_k, nor the subspace
U, so we cannot project the rows of A to U. Furthermore, an arbitrary subspace
does not possess the crucial property that the separation between Gaussians does not
change upon projection. Here is where the singular value decomposition helps. Let V
be the subspace spanned by the top k right singular vectors of A. If we project the
rows of A to V, the separation between the mean vectors is approximately preserved at
Õ(k^{1/4}) while reducing the dimension to k. Now, we can apply the clustering algorithm
described above! The key property is that projection to V behaves in much the same
way that projection to U does: separation is preserved.
To see why the separation between mean vectors is preserved upon projection to
V, consider the relationship between U and V. Recall that V is the subspace that
maximizes ‖proj_W A‖ over all k-dimensional subspaces W. Consider the expected value
of this random variable for an arbitrary W: E[‖proj_W A‖]. The key is that U achieves
the maximum of this expected value. Why is this the case? Consider one spherical
Gaussian. If we project it to a line, the line passing through the mean is the best
line. Why? Since the Gaussian is spherical, the variance is the same in all directions.
Therefore, any line that does not pass through the mean does not take advantage of
this. For a mixture of Gaussians, the same argument can be applied. For a particular
Gaussian, as long as the subspace we are projecting to passes through the mean, any
subspace is equally good for it. It follows that for a mixture of k Gaussians, the best k
dimensional subspace is U: the subspace that passes through each of the mean vectors.
Therefore, U is the subspace that maximizes E[‖proj_W A‖], and V is the subspace
that maximizes ‖proj_W A‖. To show that separation is preserved, we need to prove that
the random variable ‖proj_W A‖ is tightly concentrated around its expectation. With
this in hand, we can show that ‖proj_V A‖ is not much smaller than E[‖proj_U A‖] and
that E[‖proj_V A‖] is not much smaller than E[‖proj_U A‖]. In particular, this will allow us
to show that the average distance between the mean vectors and the projected mean
vectors is small, meaning that the average separation has been maintained.
In the above informal argument, we projected to k dimensions. This was only for
clarity; the actual algorithm projects to r = max{k, C ln(n/w_min)} dimensions. We need
to project to a slightly higher dimension because of the exact distance concentration
bounds. We should also note that because we only have a guarantee on the average
separation, rather than the separation for each pair of Gaussians, applying distance concentration
at this point is nontrivial. However, our algorithm will use the same main
idea: points from the same Gaussian are closer together than points from different Gaussians.
The rest of this thesis is devoted to proving the main theorem. Chapter 5 proves
that, on average, the separation between mean vectors is approximately maintained. Chapter 6
describes the distance concentration algorithm in more detail and proves its
correctness. These two results suffice to prove the main theorem.
Chapter 5
Spectral Projection
In this chapter, we will prove that, with high probability, projecting a suitably large
sample of points from a mixture of k spherical Gaussians to the subspace spanned by
the top r = max{k, C log(n/w_min)} right singular vectors of the sample matrix keeps
the average distance between the mean vectors and the projected mean vectors small. To build
up to this result, we first show that the subspace spanned by the mean vectors of the
mixture maximizes the expected norm of the projection of the sample matrix.
5.1 The Best Expected Subspace
Let M be a mixture of spherical Gaussians, and let A ∈ R^{m×n} be a matrix whose rows
are sampled from M. We show that the k-dimensional subspace W that maximizes
E[‖proj_W A‖] is the subspace spanned by the mean vectors μ_1, ..., μ_k of the mixture M.
We prove this result through a series of lemmas.
The first lemma shows that for any random vector X ∈ R^n with the property that
E[X_i X_j] = E[X_i]E[X_j] for i ≠ j and Var(X_i) = σ² for each i, the expected squared norm of the
projection of X onto a vector v is equal to the squared norm of the projection of E[X] = μ onto v
plus a term depending only on the variance of X and the length of v.

Lemma 1. For any v ∈ R^n,

$$E[(X \cdot v)^2] = (\mu \cdot v)^2 + \sigma^2 \|v\|^2.$$
Proof.

$$E[(X \cdot v)^2] = E\left[\Big(\sum_{i=1}^n X_i v_i\Big)^2\right] = \sum_{i,j=1}^n E[X_i X_j] v_i v_j.$$

Using the assumption that E[X_i X_j] = E[X_i]E[X_j] for i ≠ j, together with E[X_i²] = E[X_i]² + σ²,

$$E[(X \cdot v)^2] = \sum_{i,j=1}^n E[X_i]E[X_j] v_i v_j + \sigma^2 \sum_{i=1}^n v_i^2 = (E[X] \cdot v)^2 + \sigma^2 \|v\|^2 = (\mu \cdot v)^2 + \sigma^2 \|v\|^2. \qquad \square$$
With this lemma in hand, it is easy to prove that, among all vectors of the same length,
a vector in the direction of μ maximizes the expected squared norm of the projection.
This follows from the previous lemma: the expected squared norm is equal to the sum
of two terms. The second is the same for all vectors of the same length, so we only
need to concern ourselves with maximizing the first term. A vector in the direction
of μ maximizes the first term, so the corollary follows.
Corollary 1. For all v ∈ R^n such that ‖v‖ = ‖μ‖,

$$E[(X \cdot \mu)^2] \ge E[(X \cdot v)^2].$$

Proof. By the above lemma,

$$E[(X \cdot \mu)^2] = (\mu \cdot \mu)^2 + \sigma^2 \|\mu\|^2, \qquad E[(X \cdot v)^2] = (\mu \cdot v)^2 + \sigma^2 \|v\|^2.$$

So E[(X · μ)²] − E[(X · v)²] = (μ · μ)² − (μ · v)² ≥ 0 by the Cauchy-Schwarz inequality, and we have the desired result. □
Next, we consider the projection of X onto a higher-dimensional subspace. We write
‖proj_V X‖² to denote the squared length of the projection of X onto a subspace V.
For an orthonormal basis {v_1, ..., v_r} of V, this is just Σ_{i=1}^r (X · v_i)². The next lemma
is an analogue of the first lemma in this section; it shows that the expected squared
norm of the projection is just the squared norm of the projection of μ onto the subspace V, plus a term
that depends only on the dimension of the subspace and the variance.
Lemma 2. Let V ⊆ R^n be a subspace of dimension r with orthonormal basis {v_1, ..., v_r}.
Then

$$E[\|\mathrm{proj}_V X\|^2] = \|\mathrm{proj}_V E[X]\|^2 + r\sigma^2.$$

Proof.

$$E[\|\mathrm{proj}_V X\|^2] = E\left[\Big\|\sum_{i=1}^r (X \cdot v_i) v_i\Big\|^2\right] = \sum_{i=1}^r E[(X \cdot v_i)^2].$$

The second equality follows from the fact that {v_1, ..., v_r} is an orthonormal basis. By
linearity of expectation and Lemma 1, we have:

$$E[\|\mathrm{proj}_V X\|^2] = \sum_{i=1}^r E[(X \cdot v_i)^2] = \sum_{i=1}^r (\mu \cdot v_i)^2 + r\sigma^2 = \|\mathrm{proj}_V E[X]\|^2 + r\sigma^2. \qquad \square$$
Now consider a mixture of k distributions F_1, ..., F_k, with mean vectors μ_i and
variances σ_i². Let A ∈ R^{m×n} be generated randomly from this mixture with
mixing weights w_1, ..., w_k. For a matrix A, ‖proj_V A‖² denotes the squared norm of
the projections of its rows onto the subspace V, i.e. ‖proj_V A‖² = Σ_i ‖proj_V A_i‖², where
A_i is the i-th row of A.

For a sample matrix A, we can separate the rows of A into groups generated by
the same distribution. Then, applying Lemma 2, we can show that the expected squared
norm of the projection of A onto a subspace V is just the squared norm of the projection of E[A] onto V,
plus a term that depends only on the mixture and the dimension of the subspace. This is
formalized in the following theorem.
Theorem 9. Let V ⊆ R^n be a subspace of dimension r with an orthonormal basis
{v_1, ..., v_r}. Then

$$E[\|\mathrm{proj}_V A\|^2] = \|\mathrm{proj}_V E[A]\|^2 + m r \sum_{i=1}^k w_i \sigma_i^2.$$

Proof.

$$E[\|\mathrm{proj}_V A\|^2] = \sum_{i=1}^m E[\|\mathrm{proj}_V A_i\|^2] = \sum_{l=1}^k \sum_{i \in F_l} E[\|\mathrm{proj}_V A_i\|^2].$$

The second equality follows from linearity of expectation. In expectation, there are
w_l m samples from each distribution F_l. By applying Lemma 2, we have

$$E[\|\mathrm{proj}_V A\|^2] = \sum_{l=1}^k \sum_{i \in F_l} \left( \|\mathrm{proj}_V \mu_l\|^2 + r\sigma_l^2 \right) = \sum_{l=1}^k m w_l \left( \|\mathrm{proj}_V \mu_l\|^2 + r\sigma_l^2 \right) = \|\mathrm{proj}_V E[A]\|^2 + m r\sum_{l=1}^k w_l \sigma_l^2. \qquad \square$$
From this it follows that the expected best subspace for a mixture of k distributions
is simply a subspace that contains the mean vectors of the distributions, as the
following corollary shows.

Corollary 2 (Expected Best Subspace). Let V ⊆ R^n be a subspace of dimension
r ≥ k, and let U be a subspace of dimension r that contains the mean vectors μ_1, ..., μ_k. Then,

$$E[\|\mathrm{proj}_U A\|^2] \ge E[\|\mathrm{proj}_V A\|^2].$$

Proof. By Theorem 9 we have

$$E[\|\mathrm{proj}_U A\|^2] - E[\|\mathrm{proj}_V A\|^2] = \|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 = \|E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 \ge 0. \qquad \square$$
As a notational convenience, from this point on in this thesis, the subspace W is an
arbitrary subspace, U is the subspace that contains the mean vectors of the mixture,
and V is the subspace that maximizes the squared norm of the projection of A.
5.2 Concentration of the expected norm of projection
We have shown that U maximizes the expected squared norm of the projection of the
sample matrix A. Surely, we cannot hope that with high probability V is exactly U.
We would like to show that, with high probability, V lies close to U in the sense that
it also preserves the distances between the mean vectors of the distributions.

To prove this, we will first show that, for an arbitrary subspace W, the
squared norm of the projection of A onto W is close to its expectation. With this in
hand, we can show that every subspace has this property. As a result, we can argue
that, with high probability, ‖proj_V A‖² is close to ‖proj_U A‖². This will essentially prove
the main result of the section.

The next lemma proves the first necessary idea: ‖proj_W A‖² is tightly concentrated
about its expectation E[‖proj_W A‖²].
Lemma 3 (Concentration). Let W be a subspace of R^n of dimension r, with an
orthonormal basis {w_1, ..., w_r}. Let A ∈ R^{M×n} be generated from a mixture of spherical
Gaussians F_1, ..., F_k with mixing weights w_1, ..., w_k. Assume that A contains at least
m rows from each Gaussian. Then for any ε > 0:

1. Pr(‖proj_W A‖² > (1 + ε)E[‖proj_W A‖²]) ≤ k e^{−ε²mr/8}.
2. Pr(‖proj_W A‖² < (1 − ε)E[‖proj_W A‖²]) ≤ k e^{−ε²mr/8}.

Proof. Let ℰ be the event in consideration; we treat the first event, the second being
analogous. We further condition ℰ on the event that exactly m_i rows are generated by
the i-th Gaussian. Note that ‖proj_W A‖² = Σ_{l=1}^k Σ_{i∈F_l} ‖proj_W A_i‖². Therefore, we have the
following bound on the probability of the conditioned event:

$$\Pr(\mathcal{E} \mid (m_1, \ldots, m_k)) \le k \max_l \Pr\left( \sum_{i \in F_l} \|\mathrm{proj}_W A_i\|^2 > (1+\epsilon)\, E\Big[\sum_{i \in F_l} \|\mathrm{proj}_W A_i\|^2\Big] \right).$$

Let B be the event that

$$\|\mathrm{proj}_W B\|^2 > (1+\epsilon) E[\|\mathrm{proj}_W B\|^2],$$

where B ∈ R^{m_1×n} is the matrix containing the rows of A generated by F_1, an arbitrary
Gaussian. We bound the probability of ℰ conditioned on (m_1, ..., m_k) by bounding the
probability of B.

Let Y_{ij} = (B_i · w_j). Note that ‖proj_W B‖² = Σ_{i=1}^{m_1} Σ_{j=1}^r Y_{ij}². Each Y_{ij} is a Gaussian random
variable with mean (μ_1 · w_j) and variance σ_1². We can write Y_{ij} = σ_1 X_{ij}, where X_{ij} is a
Gaussian random variable with mean (μ_1 · w_j)/σ_1 and variance 1. Rewriting in terms of the
X_{ij}, we are interested in the event

$$\mathcal{B} \equiv \sum_{i=1}^{m_1} \sum_{j=1}^r (\sigma_1 X_{ij})^2 > (1+\epsilon)\, E[\|\mathrm{proj}_W B\|^2].$$

Since E[‖proj_W B‖²] = Σ_{j=1}^{r} m_1 ((μ_1 · w_j)² + σ_1²), this is equivalent to

$$\sum_{i=1}^{m_1} \sum_{j=1}^r X_{ij}^2 > (1+\epsilon) \sum_{j=1}^r m_1 \frac{(\mu_1 \cdot w_j)^2 + \sigma_1^2}{\sigma_1^2}.$$

By Markov's inequality applied to e^{tZ}, where Z = Σ_{i,j} X_{ij}²,

$$\Pr(\mathcal{B}) \le \frac{E[e^{tZ}]}{\exp\!\left( t(1+\epsilon) \sum_{j=1}^r m_1 \frac{(\mu_1 \cdot w_j)^2 + \sigma_1^2}{\sigma_1^2} \right)}.$$

Note that Z is a noncentral chi-squared random variable with noncentrality parameter
λ = Σ_{j=1}^r m_1 (μ_1 · w_j)²/σ_1² and m_1 r degrees of freedom. The moment generating function
of Z is (see e.g. [6]):

$$E[e^{tZ}] = (1-2t)^{-m_1 r/2} \exp\!\left( \frac{\lambda t}{1-2t} \right).$$

So we obtain the bound

$$\Pr(\mathcal{B}) \le (1-2t)^{-m_1 r/2} \exp\!\left( \frac{\lambda t}{1-2t} - t(1+\epsilon)(\lambda + m_1 r) \right).$$

Using the fact that 1/(1−2t) ≤ e^{2t+4t²} for 0 ≤ t ≤ 1/4, and setting t = ε/4, we have
ε − 2t(ε+1) ≥ 0, so the term involving λ is nonpositive, and

$$\Pr(\mathcal{B}) \le e^{(2t+4t^2) m_1 r/2}\, e^{-t(1+\epsilon) m_1 r} = e^{m_1 r (2t^2 - t\epsilon)} \le e^{-\epsilon^2 m_1 r/8}.$$

Therefore, we obtain the following bound:

$$\Pr(\mathcal{E} \mid (m_1, \ldots, m_k)) \le k \max_l e^{-\epsilon^2 m_l r/8} \le k\, e^{-\epsilon^2 m r/8}.$$

Since this is true regardless of the choice of m_1, ..., m_k, the probability of ℰ itself is
at most the above. □
The above lemma shows tight concentration for a single subspace. As noted before,
we would like to show that all subspaces stay close to their expectation. The following
lemma proves this:
Lemma 4. Suppose A ∈ R^{M×n} has at least m rows from each of the k spherical
Gaussians F_1, ..., F_k. Then, for any 1 > ε > 0, any α > 0, and any r such that
1 ≤ r ≤ n, the probability that there exists a subspace W of dimension r that satisfies

$$\|\mathrm{proj}_W A\|^2 < (1-\epsilon)E[\|\mathrm{proj}_W A\|^2] - (6 r \sqrt{n}\, \alpha)\, E[\|A\|^2]$$

is at most

$$\left(\frac{2}{\alpha}\right)^{rn} k\, e^{-\epsilon^2 m r/8}.$$
Proof. Let ℰ be the desired event, and let 𝒲 be the set of all r-dimensional subspaces.
We consider a finite set S of r-dimensional subspaces, with |S| = N, such that for any
W ∈ 𝒲 with orthonormal basis {w_1, ..., w_r} there exists a W* ∈ S with orthonormal
basis {w_1^*, ..., w_r^*} such that for all i and j, |(w_i)_j − (w_i^*)_j| < α, i.e. the bases agree
component-wise up to α.

The main idea is to show that if any subspace W ∈ 𝒲 satisfies the desired inequality,
then there exists a subspace W* ∈ S that satisfies a similar inequality. In particular, we
will show that ‖proj_{W*} A‖² is not much larger than ‖proj_W A‖², and that E[‖proj_{W*} A‖²]
is not much smaller than E[‖proj_W A‖²]. We can then bound the probability that there
is a W* ∈ S satisfying this similar inequality by the union bound, thus bounding
the probability of the desired event.

For any vector a,

$$\|\mathrm{proj}_{W^*} a\|^2 = \sum_{i=1}^r \big(a \cdot (w_i^* - w_i + w_i)\big)^2 = \sum_{i=1}^r \Big[ (a \cdot w_i)^2 + \big(a \cdot (w_i^* - w_i)\big)^2 + 2\big(a \cdot (w_i^* - w_i)\big)(a \cdot w_i) \Big]$$
$$\le \|\mathrm{proj}_W a\|^2 + \sum_{i=1}^r \Big( \sum_{j=1}^n |a_j|\, |(w_i^*)_j - (w_i)_j| \Big)^2 + 2 r \alpha \sqrt{n}\, \|a\|^2 \le \|\mathrm{proj}_W a\|^2 + r n \alpha^2 \|a\|^2 + 2 r \alpha \sqrt{n}\, \|a\|^2 \le \|\mathrm{proj}_W a\|^2 + 3 r \sqrt{n}\, \alpha \|a\|^2.$$

The last inequality follows from α ≤ 1/√n. Summing over the rows of A, the above also gives
us a bound for matrices:

$$\|\mathrm{proj}_{W^*} A\|^2 \ge \|\mathrm{proj}_W A\|^2 - 3 r \sqrt{n}\, \alpha \|A\|^2.$$

Similarly,

$$E[\|\mathrm{proj}_W A\|^2] \le E[\|\mathrm{proj}_{W^*} A\|^2] + 3 r \sqrt{n}\, \alpha\, E[\|A\|^2].$$

Therefore, it follows that if there exists a W such that

$$\|\mathrm{proj}_W A\|^2 < (1-\epsilon)E[\|\mathrm{proj}_W A\|^2] - (6 r \sqrt{n}\, \alpha)\, E[\|A\|^2],$$

then there exists a W* ∈ S such that

$$\|\mathrm{proj}_{W^*} A\|^2 + (1+\epsilon)\, 3 r \sqrt{n}\, \alpha\, E[\|A\|^2] < (1-\epsilon)E[\|\mathrm{proj}_{W^*} A\|^2] + 3 r \sqrt{n}\, \alpha \|A\|^2.$$

Applying a union bound, we have that

$$\Pr(\mathcal{E}) \le \Pr\big( \exists W^* \in S : \|\mathrm{proj}_{W^*} A\|^2 < (1-\epsilon)E[\|\mathrm{proj}_{W^*} A\|^2] \big) + \Pr\big( 3 r \sqrt{n}\, \alpha \|A\|^2 > (1+\epsilon)\, 3 r \sqrt{n}\, \alpha\, E[\|A\|^2] \big).$$

Using Lemma 3 and the union bound over the subspaces in S (and over the event on ‖A‖²,
which is the norm of the projection onto all of R^n), we get that

$$\Pr(\mathcal{E}) \le (N+1)\, k\, e^{-\epsilon^2 m r/8}.$$

A simple upper bound on N + 1 is (2/α)^{rn}, the number of grid points in the cube [−1, 1]^{nr}
with grid size α. The lemma follows. □
With this in hand, we can prove the main theorem of this section.
5.3 Average distance between mean vectors and projected mean vectors is small
The following main theorem shows that projecting the mean vectors of the distributions to the subspace found by the singular value decomposition of the sample matrix
preserves, on average, the location of the mean vectors.
Theorem 10. Let the rows of A ∈ R^{m×n} be picked according to a mixture of spherical
Gaussians {F_1, ..., F_k} with mixing weights (w_1, ..., w_k), means (μ_1, ..., μ_k), and
variances (σ_1², ..., σ_k²). Let r = max{k, 96 ln(n/w_min)}. Let V ⊆ R^n be the r-dimensional
subspace spanned by the top r right singular vectors of A, and let U be an r-dimensional
subspace that contains the mean vectors μ_1, ..., μ_k. Then, for any 1/2 > ε > 0, with
$$m \ge \frac{5000}{\epsilon^2 w_{\min}} \left( n \ln n + \max_i \frac{\|\mu_i\|^2}{\sigma_i^2} \right) + \frac{n}{n-k} \ln\frac{k}{\delta}$$

samples, we have, with probability at least 1 − δ,
$$\|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 \le \epsilon\, m (n-r) \sum_{i=1}^k w_i \sigma_i^2.$$
Proof. We first describe the intuition behind the proof. We would like to show that,
with high probability, ‖proj_U E[A]‖² − ‖proj_V E[A]‖² is small. Note that by Theorem 9
we have:

$$\|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 = E[\|\mathrm{proj}_U A\|^2] - E[\|\mathrm{proj}_V A\|^2].$$

To prove the desired result, we will show that the following event is unlikely:

$$\mathcal{E}_1 \equiv E[\|\mathrm{proj}_U A\|^2] - E[\|\mathrm{proj}_V A\|^2] \text{ is large.}$$

We actually show that a stronger statement holds: the following event is unlikely:

$$\mathcal{E}_2 \equiv \exists W : E[\|\mathrm{proj}_U A\|^2] - E[\|\mathrm{proj}_W A\|^2] \text{ is large and } \|\mathrm{proj}_W A\|^2 \ge \|\mathrm{proj}_U A\|^2.$$

Note that ℰ_1 implies ℰ_2 (take W = V, which always satisfies ‖proj_V A‖² ≥ ‖proj_U A‖²),
so it suffices to bound the probability of ℰ_2. In order for ℰ_2 to occur, it must be the case
that there is some W such that ‖proj_W A‖² is much larger than its expectation, or ‖proj_U A‖²
is much smaller than its expectation, since E[‖proj_U A‖²] − E[‖proj_W A‖²] is large. These are
unlikely events by the lemmas we have proven in this section.

While the above approach derives a bound on ‖proj_U E[A]‖² − ‖proj_V E[A]‖², better
results can be obtained by looking at the orthogonal subspaces. The same argument
as above works, and we follow through with that here. We will show that the following
event is unlikely:

$$\|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 > \epsilon\, m(n-r) \sum_{i=1}^k w_i \sigma_i^2.$$
Let Ū denote the subspace orthogonal to U, and similarly V̄ for V. Since
‖A‖² = ‖proj_W A‖² + ‖proj_W̄ A‖² for any subspace W, we have that:

$$\|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 = \|\mathrm{proj}_{\bar V} E[A]\|^2 - \|\mathrm{proj}_{\bar U} E[A]\|^2.$$

Applying Theorem 9, we can rewrite the desired event as:

$$E[\|\mathrm{proj}_{\bar V} A\|^2] - E[\|\mathrm{proj}_{\bar U} A\|^2] > \epsilon\, m(n-r) \sum_{i=1}^k w_i \sigma_i^2. \tag{5.1}$$

We also have by Theorem 9 that:

$$E[\|\mathrm{proj}_{\bar U} A\|^2] = \|\mathrm{proj}_{\bar U} E[A]\|^2 + m(n-r) \sum_{i=1}^k w_i \sigma_i^2 = m(n-r) \sum_{i=1}^k w_i \sigma_i^2,$$

since none of the mean vectors has a component in Ū. Therefore, we can rewrite (5.1) as
E[‖proj_V̄ A‖²] > (1 + ε)E[‖proj_Ū A‖²]. Following the same argument as in the intuition
described previously, we will show that the stronger event is unlikely:

$$\mathcal{E} \equiv \exists W \text{ of dimension } n-r : E[\|\mathrm{proj}_W A\|^2] > (1+\epsilon)E[\|\mathrm{proj}_{\bar U} A\|^2] \text{ and } \|\mathrm{proj}_W A\|^2 \le \|\mathrm{proj}_{\bar U} A\|^2.$$
In order for ℰ to occur, at least one of the following two events must hold:

$$\mathcal{A} \equiv \exists W : E[\|\mathrm{proj}_W A\|^2] > (1+\epsilon)E[\|\mathrm{proj}_{\bar U} A\|^2] \text{ and } \|\mathrm{proj}_{\bar U} A\|^2 - E[\|\mathrm{proj}_{\bar U} A\|^2] \ge \tfrac{1}{2}\big(E[\|\mathrm{proj}_W A\|^2] - E[\|\mathrm{proj}_{\bar U} A\|^2]\big),$$

$$\mathcal{B} \equiv \exists W : E[\|\mathrm{proj}_W A\|^2] > (1+\epsilon)E[\|\mathrm{proj}_{\bar U} A\|^2] \text{ and } E[\|\mathrm{proj}_W A\|^2] - \|\mathrm{proj}_W A\|^2 \ge \tfrac{1}{2}\big(E[\|\mathrm{proj}_W A\|^2] - E[\|\mathrm{proj}_{\bar U} A\|^2]\big).$$

This is because if there exists a W such that E[‖proj_W A‖²] > (1+ε)E[‖proj_Ū A‖²] and
‖proj_W A‖² ≤ ‖proj_Ū A‖², it must be the case that either ‖proj_W A‖² is much smaller than its
expectation or ‖proj_Ū A‖² is much larger than its expectation. However, these are
low-probability events, as we will see.

The probability of 𝒜 is at most the probability that ‖proj_Ū A‖² − E[‖proj_Ū A‖²] ≥
½((1+ε)E[‖proj_Ū A‖²] − E[‖proj_Ū A‖²]), which is at most

$$\Pr(\mathcal{A}) \le \Pr\left( \|\mathrm{proj}_{\bar U} A\|^2 \ge \big(1 + \tfrac{\epsilon}{2}\big) E[\|\mathrm{proj}_{\bar U} A\|^2] \right) \le k\, e^{-\epsilon^2 m' (n-r)/32},$$

where m' = w_min m/2. The last inequality follows from Lemma 3 and the fact that each
distribution generates at least w_min m/2 rows (the probability that this does not happen is
much smaller than δ). Now
the probability of ℬ is at most

$$\Pr(\mathcal{B}) \le \Pr\left( \exists W : \|\mathrm{proj}_W A\|^2 \le E[\|\mathrm{proj}_W A\|^2] - \tfrac{1}{2}\big(E[\|\mathrm{proj}_W A\|^2] - E[\|\mathrm{proj}_{\bar U} A\|^2]\big) \right)$$
$$\le \Pr\left( \exists W : \|\mathrm{proj}_W A\|^2 \le \big(1 - \tfrac{\epsilon}{4}\big) E[\|\mathrm{proj}_W A\|^2] \right)$$
$$\le \Pr\left( \exists W : \|\mathrm{proj}_W A\|^2 \le \big(1 - \tfrac{\epsilon}{8}\big) E[\|\mathrm{proj}_W A\|^2] - \tfrac{\epsilon}{8} E[\|\mathrm{proj}_{\bar U} A\|^2] \right).$$

Here we used the fact that ε ≤ 1, together with E[‖proj_W A‖²] > E[‖proj_Ū A‖²] for any W
satisfying ℬ. By applying Lemma 4 with

$$\alpha = \frac{\epsilon\, E[\|\mathrm{proj}_{\bar U} A\|^2]}{48 \sqrt{n}\, (n-r)\, E[\|A\|^2]}$$

and at least

$$\frac{512\, n}{\epsilon^2 (n-k)} \left( n + \ln\frac{k}{\delta} \right)$$

samples from each Gaussian, we have that the probability of ℬ, and therefore of ℰ, is at
most δ, which is the desired result. □
The theorem above essentially says that the average distance between the mean
vectors and the mean vectors projected to the subspace found by the SVD is not too
large. The following corollary helps to see this.
Corollary 3 (Likely Best Subspace). Let μ_1, ..., μ_k be the means of the k spherical
Gaussians in the mixture and w_min the smallest mixing weight. Let μ_1', ..., μ_k' be their
projections onto the subspace spanned by the top r right singular vectors of
the sample matrix A. With a sample of size m = O*(n/(ε² w_min)), we have with high
probability

$$\sum_{i=1}^k w_i \left( \|\mu_i\|^2 - \|\mu_i'\|^2 \right) \le \epsilon (n-r) \sum_{i=1}^k w_i \sigma_i^2.$$

Proof. Let V be the optimal rank-r subspace, and let m be as large as required in Theorem
10. By the theorem, we have that with probability at least 1 − δ,

$$\|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 \le \epsilon\, m(n-r) \sum_{i=1}^k w_i \sigma_i^2.$$

Since

$$\|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 = m \sum_{i=1}^k w_i \|\mu_i\|^2 - m \sum_{i=1}^k w_i \|\mu_i'\|^2,$$

it follows that

$$m \sum_{i=1}^k w_i \left( \|\mu_i\|^2 - \|\mu_i'\|^2 \right) \le \epsilon\, m(n-r) \sum_{i=1}^k w_i \sigma_i^2. \qquad \square$$
5.3.1 Random projection vs. spectral projection: examples
As mentioned in the introduction, random projection does not preserve the distance
between the means. The following figures illustrate the difference between random
projection and the SVD-based projection for a mixture of Gaussians. This gives a
more intuitive presentation of the main theorem.
[Figure 1: RP1 and Figure 2: SVD1; scatter plots omitted.]
Figure 1 and figure 3 are 2-dimensional random projections of samples from two
different 49-dimensional mixtures (one with k = 2, the other with k = 4). Figure 2
and figure 4 are the projections to the best rank 2 subspaces of the same data sets.
[Figure 3: RP2 and Figure 4: SVD2; scatter plots omitted.]
Chapter 6
After Projection
Let us recall the main result of the last chapter: given a sufficiently large sample of
m points from a mixture of k spherical Gaussians in R^n, we have that after projection
to the subspace spanned by the top r right singular vectors,

$$m \sum_{i=1}^k w_i \left( \|\mu_i\|^2 - \|\mu_i'\|^2 \right) \le \epsilon\, m(n-r) \sum_{i=1}^k w_i \sigma_i^2,$$

where the μ_i are the mean vectors of the Gaussians and the μ_i' are the mean vectors projected
to the subspace V spanned by the top r right singular vectors of A, the sample matrix.

Suppose we knew the ratio D between the largest variance σ_max² and the smallest
variance σ_min². Applying Theorem 10 with ε replaced by ε² w_min / ((n − r) D), we have:

$$\sum_{i=1}^k w_i \left( \|\mu_i\|^2 - \|\mu_i'\|^2 \right) \le \epsilon^2\, w_{\min}\, \sigma_{\min}^2.$$
This implies that for every i,

$$\|\mu_i - \mu_i'\|^2 = \|\mu_i\|^2 - \|\mu_i'\|^2 \le \epsilon^2 \sigma_{\min}^2 \le \epsilon^2 \sigma_i^2.$$

Since each projected mean only moves a little from the original mean, the separation between
the projected means is at least the original separation minus this small amount.
That is, for every i, j:

$$\|\mu_i' - \mu_j'\| \ge \|\mu_i - \mu_j\| - 2\epsilon\sigma_{\min}.$$

If the initial separation between the means of the mixture is

$$\|\mu_i - \mu_j\| \ge C \max\{\sigma_i, \sigma_j\} \left( r \ln\frac{4m}{\delta} \right)^{1/4},$$

then the separation after projection is roughly the same! Since the radius of each
Gaussian drops to σ√r after projection to the r-dimensional subspace, we can apply the
algorithms based on distance concentration described in Chapter 3 to classify the points
of the Gaussian. Recall that these algorithms classify points from the (approximately)
smallest Gaussian first, and repeatedly learn Gaussians of increasing size.
However, D may not be known or may be very large. In this case, we cannot apply
distance concentration naively, since the separation between a mean and its projection
may be quite large for a small Gaussian. That is, without a bound on D, we can only
ensure that the average distance between means and projected means is small.
The algorithm we present in the next section deals with this difficulty and learns a
mixture of k spherical Gaussians with separation
$$\|\mu_i - \mu_j\| \ge 14 \max\{\sigma_i, \sigma_j\} \left( r \ln\frac{4m}{\delta} \right)^{1/4}. \tag{6.1}$$
Recall that r = max{k, 96 ln(4m/δ)}. The main idea of finding the smallest Gaussian
and removing points from it is still the core of our algorithm. The key modification
our algorithm makes is to first throw away points from Gaussians for which separation
does not hold. Then, the algorithm classifies points from Gaussians (roughly) in order
of increasing variance.
Later in this chapter we will describe non-polynomial time algorithms when the
separation of the mixture is smaller than (6.1).
6.1 The main algorithm
The algorithm is best described in the following box. Let S be the points sampled
from the mixture of spherical Gaussians. This algorithm is to be used after projecting
to the top r right singular vectors of the sample matrix.
Algorithm.

(a) Let R = max_{x∈S} min_{y∈S, y≠x} ‖x − y‖.

(b) Discard all points from S whose closest point lies at squared distance
    at most 3εR², to form a new set S'.

(c) Let x, w be the two closest points in the remaining set S', and let
    H be the set of all points at squared distance at most
    l = ‖x − w‖² (1 + 8√(6 ln(4m/δ)/r)) from x.

(d) Report H as a Gaussian, and remove the points in H from S'.
    Repeat step (c) till there are no points left.

(e) Output all Gaussians learned in step (d) with variance greater than
    3εR²/r.

(f) Remove the points from step (e) from the original sample S, and
    repeat the entire algorithm (including the SVD calculation and projection) on the rest.
We now briefly describe the details of the algorithm. Step (a) computes an estimate
of the radius of the largest Gaussian. By distance concentration, the closest point to
any point is a point from the same Gaussian with high probability. The pair of points
realizing the largest such nearest-neighbor distance should come from the Gaussian with the largest
variance. Now, step (b) removes those points from Gaussians whose variance is too
small. Since we know only that the average separation between the projected mean
vectors is maintained, Gaussians with small variance are not necessarily well-separated,
so we remove them from the set S to obtain S'.
Now, with points only from the large Gaussians, we can group those points from
the smallest Gaussian together. The two closest points form an estimate of the radius
of the smallest Gaussian in S', and l is an estimate of the largest distance any point
from that Gaussian may lie from x. This is the reasoning behind steps (c) and (d). In
step (e), though, we only output those Gaussians with large variance. Why? When
we removed points as in step (b), we may have accidentally removed some points from
"medium-sized" Gaussians for which separation holds. Therefore, in step (e) we can
only output those Gaussians with radii that are not too small. Lastly, in step (f), we
can remove those points correctly classified from S, and repeat the entire process on
the remaining set.
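The following is a rough Python sketch of steps (a)-(e) of the algorithm box above (illustrative only; it assumes the points have already been projected as in Chapter 4, the per-group variance estimate in step (e) and the function name are ad hoc choices of the sketch, and ε, r, m, δ enter only through the thresholds):

```python
# Rough sketch of steps (a)-(e) of the clustering algorithm after projection.
import numpy as np

def cluster_after_projection(points, eps, r, m, delta):
    pts = np.asarray(points, dtype=float)
    D2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(D2, np.inf)

    # (a) R^2: the largest squared nearest-neighbour distance.
    R2 = D2.min(axis=1).max()

    # (b) Discard points whose nearest neighbour is too close (small Gaussians).
    S = list(np.flatnonzero(D2.min(axis=1) > 3 * eps * R2))

    # (c)-(d) Repeatedly peel off the cluster around the two closest points.
    groups = []
    while S:
        sub = D2[np.ix_(S, S)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        x = S[i]
        l = sub[i, j] * (1 + 8 * np.sqrt(6 * np.log(4 * m / delta) / r))
        H = [S[t] for t in np.flatnonzero(D2[x, S] <= l)] + [x]
        groups.append(H)
        S = [t for t in S if t not in set(H)]

    # (e) Keep only groups whose (rough) empirical variance is large enough.
    keep = []
    for H in groups:
        var = ((pts[H] - pts[H].mean(axis=0)) ** 2).sum(axis=1).mean() / r
        if var > 3 * eps * R2 / r:
            keep.append(H)
    return keep
```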
Distance Concentration
This section includes the necessary lemmas on distance concentration that we will use
repeatedly in the analysis.
The first lemma shows that in expectation, points from the same Gaussian lie at
a distance on the order of the radii of that Gaussian, whereas points from different
Gaussians lie on the order of the radii of the larger Gaussian plus the distance between
the mean vectors of the Gaussians.
The second lemma shows that with high probability, the distance between two
points lies close to its expectation.
Lemma 5. Let X ∈ F_s and Y ∈ F_t, where F_s, F_t are spherical Gaussians in r dimensions
with means μ^s, μ^t and variances σ_s², σ_t². Then

$$E[\|X - Y\|^2] = (\sigma_s^2 + \sigma_t^2) r + \|\mu^s - \mu^t\|^2.$$

Proof.

$$E[\|X - Y\|^2] = E\left[\sum_{i=1}^r (X_i - Y_i)^2\right] = \sum_{i=1}^r \left( E[X_i^2] + E[Y_i^2] - 2 E[X_i] E[Y_i] \right)$$
$$= \|\mu^s\|^2 + \sigma_s^2 r + \|\mu^t\|^2 + \sigma_t^2 r - 2\, \mu^s \cdot \mu^t = (\sigma_s^2 + \sigma_t^2) r + \|\mu^s - \mu^t\|^2. \qquad \square$$
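A minimal Monte Carlo check of Lemma 5 (illustrative parameters only, not part of the thesis):

```python
# Minimal Monte Carlo check of Lemma 5.
import numpy as np

rng = np.random.default_rng(3)
r, trials = 30, 200000
mu_s, mu_t = np.zeros(r), np.full(r, 2.0)
sigma_s, sigma_t = 1.0, 3.0

X = mu_s + sigma_s * rng.standard_normal((trials, r))
Y = mu_t + sigma_t * rng.standard_normal((trials, r))
empirical = ((X - Y) ** 2).sum(axis=1).mean()
predicted = (sigma_s**2 + sigma_t**2) * r + np.linalg.norm(mu_s - mu_t) ** 2
print(empirical, predicted)   # the two values agree closely
```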
Lemma 6. Let X ∈ F_s and Y ∈ F_t, where F_s, F_t are r-dimensional spherical Gaussians
with means μ^s, μ^t and variances σ_s², σ_t². Then for α > 0, the probability that

$$\left| \|X - Y\|^2 - E[\|X - Y\|^2] \right| > \alpha \left( (\sigma_s^2 + \sigma_t^2)\sqrt{r} + 2\|\mu^s - \mu^t\|\sqrt{\sigma_s^2 + \sigma_t^2} \right)$$

is at most 4e^{−α²/8}.

Proof. We consider the probability that

$$\|X - Y\|^2 - E[\|X - Y\|^2] \ge \alpha \left( (\sigma_s^2 + \sigma_t^2)\sqrt{r} + 2\|\mu^s - \mu^t\|\sqrt{\sigma_s^2 + \sigma_t^2} \right).$$

The other direction is analogous. We write

$$X_i - Y_i = \sqrt{\sigma_s^2 + \sigma_t^2}\, Z_i + (\mu^s_i - \mu^t_i),$$

where the Z_i are N(0, 1) random variables. Therefore, we can rewrite the above event as:

$$\sum_{i=1}^r \left( \sqrt{\sigma_s^2 + \sigma_t^2}\, Z_i + (\mu^s_i - \mu^t_i) \right)^2 - \left( (\sigma_s^2 + \sigma_t^2) r + \|\mu^s - \mu^t\|^2 \right) \ge \alpha \left( (\sigma_s^2 + \sigma_t^2)\sqrt{r} + 2\|\mu^s - \mu^t\|\sqrt{\sigma_s^2 + \sigma_t^2} \right).$$

The probability of this occurring is at most the sum of the probabilities of the following
two events:

$$\mathcal{A} \equiv \sum_{i=1}^r (\sigma_s^2 + \sigma_t^2) Z_i^2 \ge (\sigma_s^2 + \sigma_t^2)(r + \alpha\sqrt{r}),$$

$$\mathcal{B} \equiv 2\sqrt{\sigma_s^2 + \sigma_t^2} \sum_{i=1}^r (\mu^s_i - \mu^t_i) Z_i \ge 2\alpha \|\mu^s - \mu^t\| \sqrt{\sigma_s^2 + \sigma_t^2}.$$

Simplifying, we get:

$$\Pr(\mathcal{A}) \le \Pr\left( \sum_{i=1}^r Z_i^2 \ge r + \alpha\sqrt{r} \right) \le e^{-\alpha^2/8}, \qquad \Pr(\mathcal{B}) \le \Pr\left( \sum_{i=1}^r (\mu^s_i - \mu^t_i) Z_i \ge \alpha \|\mu^s - \mu^t\| \right) \le e^{-\alpha^2/2}.$$

The above inequalities hold by applying Markov's inequality and moment generating
functions as in the proof of Lemma 3. □
Proof of correctness
Now we are ready to prove that the algorithm correctly classifies a sufficiently large
sample from a mixture of spherical Gaussians with high probability.
Theorem 11. With a sample of size

$$m = \Omega\!\left( \frac{n}{w_{\min}} \left( n + \ln\Big(\max_i \frac{\|\mu_i\|^2}{\sigma_i^2}\Big) + \ln\frac{1}{\delta} \right) \right)$$

and initial separation

$$\|\mu_i - \mu_j\| \ge 14 \max\{\sigma_i, \sigma_j\} \left( r \ln\frac{4m}{\delta} \right)^{1/4},$$

the algorithm correctly classifies all Gaussians with probability at least 1 − δ.
Proof. First, let us apply Corollary 3 with its parameter set to ε w_min/(n − r), and let
σ_max² be the largest variance. It follows that

$$\sum_{i=1}^k w_i \left( \|\mu_i\|^2 - \|\mu_i'\|^2 \right) \le \epsilon\, w_{\min} \sum_{i=1}^k w_i \sigma_i^2 \le \epsilon\, w_{\min}\, \sigma_{\max}^2.$$

So, in particular, for all i,

$$\|\mu_i - \mu_i'\|^2 = \|\mu_i\|^2 - \|\mu_i'\|^2 \le \epsilon\, \sigma_{\max}^2.$$

This implies that for any Gaussian F_i with variance larger than ε σ_max², we have, for all j,

$$\|\mu_i' - \mu_j'\| \ge 14 \sigma_i \left( r \ln\frac{4m}{\delta} \right)^{1/4} - 2\sqrt{\epsilon}\, \sigma_{\max} \ge 12 \sigma_i \left( r \ln\frac{4m}{\delta} \right)^{1/4},$$

provided that ε ≤ 1. By applying Lemma 6 with this separation between mean vectors
and α = √(24 ln(4m/δ)), we obtain the following with probability at least 1 − δ/(2k):
• For any Gaussian F_i and any two points x, y drawn from it,

$$2\sigma_i^2 r - 4\sigma_i^2 \sqrt{6 r \ln\frac{4m}{\delta}} \;\le\; \|x - y\|^2 \;\le\; 2\sigma_i^2 r + 4\sigma_i^2 \sqrt{6 r \ln\frac{4m}{\delta}}. \tag{6.2}$$

It will also be helpful to have the following bounds on ‖x − y‖² in terms of σ_i² only:

$$\sigma_i^2 r \;\le\; \|x - y\|^2 \;\le\; 3\sigma_i^2 r. \tag{6.3}$$

This follows by upper bounding the deviation term 4σ_i²√(6r ln(4m/δ)) by σ_i² r, which
holds because r ≥ 96 ln(4m/δ).

• For any Gaussians F_i ≠ F_j with σ_i ≥ σ_j and σ_i² ≥ ε σ_max², and any two points x
from F_i and y from F_j,

$$\|x - y\|^2 \;\ge\; (\sigma_i^2 + \sigma_j^2) r + 38\, \sigma_i^2 \sqrt{6 r \ln\frac{4m}{\delta}}. \tag{6.4}$$

This follows from Lemma 6, which states that

$$\|x - y\|^2 \;\ge\; (\sigma_i^2 + \sigma_j^2) r + \|\mu_i' - \mu_j'\|^2 - \alpha\left( (\sigma_i^2 + \sigma_j^2)\sqrt{r} + 2\|\mu_i' - \mu_j'\|\sqrt{\sigma_i^2 + \sigma_j^2} \right).$$

With α = √(24 ln(4m/δ)), ‖μ_i' − μ_j'‖ ≥ 12σ_i (r ln(4m/δ))^{1/4}, and r ≥ 96 ln(4m/δ), we
obtain the above bound. We can also obtain a lower bound in terms of σ_i² only, as we did
above for two points from the same Gaussian, by the same upper bound on the deviation
term:

$$\|x - y\|^2 \;\ge\; \sigma_i^2 r. \tag{6.5}$$
Using these bounds, we can classify the smallest Gaussian using (6.2), since intra-Gaussian
distances are smaller than inter-Gaussian distances. However, this only holds
for Gaussians with variance larger than ε σ_max². We first show that the first step of
the algorithm removes all such small Gaussians, and then show that any Gaussians
classified in step (e) are complete Gaussians. The next few observations prove the
correctness of the algorithm, conditioned on Lemma 6 to obtain (6.2), (6.3), (6.4) and
(6.5).

1. Let y ∈ S' be from some Gaussian F_j. Then σ_j² ≥ ε σ_max².

2. Let x ∈ F_j and H be the point and the set used in any iteration of step (c). Then
H = S' ∩ F_j, i.e. H is exactly the set of points in S' from F_j.

3. For any Gaussian F_i with σ_i² > 3εR²/r, we have F_i ⊆ S', i.e. any Gaussian with
sufficiently large variance will be contained in S'.
We proceed to prove 1-3.
1. Let x be any point from the Gaussian with largest variance σ̂², and let w be any other point. By (6.3) and (6.5), ||x − w||² ≥ σ̂² r, so R² ≥ σ̂² r.

Now suppose by way of contradiction that σ_j² < ε σ̂². We will show that if this is the case, y would have been removed from S, contradicting its membership in S'. Let z be another point in S from F_j. Then by (6.3) we have that:

||y − z||² ≤ 3σ_j² r < 3ε σ̂² r ≤ 3εR².

Since y's closest point lies at a distance at most 3εR², this contradicts y ∈ S'.
2. First, we show that x, w in step (c) of the algorithm belong to the same Gaussian. If not, suppose without loss of generality that w ∈ F_j and that σ_i ≤ σ_j. But then by (6.2) and (6.4), there exists a point z from F_i such that:

||x − z||² ≤ 2σ_i² r + 4σ_i² √(6 r ln(4mk/δ)).

However,

||x − w||² ≥ (σ_i² + σ_j²) r + 38 σ_j² √(6 r ln(4mk/δ)).

This contradicts the fact that x, w are the two closest points in S'.
With the bounds on ||x − w||² for x, w ∈ F_i from (6.2), we obtain the following bounds on l, our estimate for the furthest point from x that is still in F_i:

2σ_i² r + 4σ_i² √(6 r ln(4mk/δ)) ≤ l ≤ 2σ_i² r + 28 σ_i² √(6 r ln(4mk/δ)).

The lower bound on l ensures that every point in F_i ∩ S' is included in H. The upper bound on l ensures that any point z ∈ F_j ≠ F_i will not be included in H. If σ_i ≥ σ_j, this follows from the fact that the upper bound on l is less than the lower bound on ||x − z||² in (6.4). Now suppose σ_i ≤ σ_j. Since x, w are the closest points in S', it must be the case that:

||x − w||² ≤ 2σ_j² r + 4σ_j² √(6 r ln(4mk/δ)),

since otherwise, two points from F_j would be the two closest points in S'. Since ||x − w||² ≥ 2σ_i² r − 4σ_i² √(6 r ln(4mk/δ)), we have that

σ_j² r ≥ σ_i² r − 4σ_j² √(6 r ln(4mk/δ)).

Applying this to (6.4), we have that:

||x − z||² ≥ 2σ_i² r + 34 σ_i² √(6 r ln(4mk/δ)).

As this is larger than the upper bound on l, we have that F_i ∩ S' = H.
3. Let x ∈ F_i with σ_i² ≥ 3εR²/r. We want to show that x ∈ S', so we need to show that step (b) never removes x from S. This holds if:

∀z, ||x − z||² ≥ 3εR²,
which is true by (6.3) and (6.5).
We have just shown that any Gaussian with variance at least 3εR²/r will be correctly classified. By setting ε < 1, at least the largest Gaussian is classified in step (e), since R² ≤ 3σ̂² r by (6.3). So we remove at least one Gaussian from the sample during each iteration.
It remains to bound the success probability of the algorithm. For this we need:

• Corollary 3 to hold up to k times,

• Steps (a)-(e) to be successful up to k times.

The probability of the first event is at least (1 − δ/(2k))^k ≥ 1 − δ/2. The probability of the latter event is just the probability that distance concentration holds, which is at least 1 − δ/2. Therefore, the probability that the algorithm succeeds is at least:

(1 − δ/(2k))^k (1 − δ/2) ≥ 1 − δ.
Immediate from this proof is the main theorem of this thesis (Theorem 8).
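To make the structure of the argument easier to follow, here is a rough, non-authoritative Python sketch of the distance-based classification loop that the proof refers to as steps (a)-(e). The algorithm itself, together with the precise definitions of R and the cutoff l, is given earlier in this chapter; in the sketch below, R² is taken to be the largest pairwise distance, the threshold 3εR² follows the proof, and the constant used to inflate ||x − w||² into l is only a placeholder. The function name and parameters are illustrative, not the thesis's pseudocode.

import numpy as np

def classify_by_distances(X, k, eps):
    """Sketch of the clustering loop analyzed above.

    X   : (m, r) array of sample points, already projected onto the top
          right singular vectors of the sample matrix.
    k   : number of Gaussians in the mixture.
    eps : the parameter epsilon from the proof.
    """
    remaining = np.arange(len(X))
    clusters = []
    for _ in range(k):
        if len(remaining) == 0:
            break
        P = X[remaining]
        if len(P) == 1:
            clusters.append(remaining.copy())
            break
        # step (a): squared pairwise distances and the scale R^2
        D = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)
        np.fill_diagonal(D, np.inf)
        R2 = D[np.isfinite(D)].max()          # stand-in for R^2
        # step (b): set aside points whose nearest neighbour is within 3*eps*R^2
        # (points from small-variance Gaussians; they are classified later)
        S_prime = np.where(D.min(axis=1) > 3.0 * eps * R2)[0]
        if len(S_prime) < 2:
            clusters.append(remaining.copy())
            break
        # step (c): the two closest points x, w in S'
        Dp = D[np.ix_(S_prime, S_prime)]
        i, j = np.unravel_index(np.argmin(Dp), Dp.shape)
        x = S_prime[i]
        # step (d): an estimate l of the largest intra-Gaussian distance from x;
        # the factor 8 is a placeholder for the thesis's choice
        l = 8.0 * Dp[i, j]
        H = S_prime[D[x, S_prime] <= l]
        H = np.union1d(H, [x])
        # step (e): report H as one Gaussian and remove it from the sample
        clusters.append(remaining[H])
        remaining = np.delete(remaining, H)
    return clusters

On a well-separated mixture this peels off one Gaussian per iteration, matching the inductive structure of the proof; in practice the spectral projection (computing the top right singular vectors of the sample matrix) would be performed before calling such a routine.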
6.2
Non-polynomial time algorithms
Note that if we assume only the minimum possible separation,

||μ_i − μ_j|| ≥ C max{σ_i, σ_j},

then after projection to the top k right singular vectors the separation is

||μ'_i − μ'_j|| ≥ (C − 2ε) max{σ_i, σ_j},

which is the radius divided by √k (note that we started with a separation of the radius divided by √n). Here we could use an algorithm that is exponential in k, using O(k/w_min) samples from the mixture to obtain a maximum-likelihood estimate. First, project the samples to the top k right singular vectors of the sample matrix. Then, consider each of the k^{O(k/w_min)} partitions of the points into k clusters. For each of these clusters,
we can compute the mean and variance, as well as the mixing weight of the cluster.
Since the points were generated from a spherical Gaussian, and we know the density
function F for a spherical Gaussian with a given mean and variance, we can compute
the likelihood of the partition. Let x be any point in the sample, and let l(x) denote the cluster that contains it. Then the likelihood of the sample is:

∏_{x ∈ S} F_{l(x)}(x).
By examining each of the partitions, we can determine the partition that has the
maximum-likelihood, obtaining estimates of the means, variances, and mixing weights
of the mixture.
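As an illustration of this exhaustive procedure, here is a small, hedged Python sketch. It is not the thesis's pseudocode: the function name, the non-degeneracy check, and the inclusion of the mixing-weight term in the score are choices made here for the sake of a runnable example, and the k^m enumeration is only feasible for toy sample sizes.

import numpy as np
from itertools import product
from math import log, pi

def best_partition(X, k):
    """Exhaustively search all assignments of the m points in X (an (m, n)
    array) to k clusters, fit a mean, spherical variance and mixing weight
    to each cluster, and return the assignment with the largest
    log-likelihood."""
    m, n = X.shape
    best_ll, best_labels = -np.inf, None
    for labels in product(range(k), repeat=m):
        labels = np.array(labels)
        # require at least two points per cluster so the variance estimate
        # is well defined (a choice made for this sketch)
        if any((labels == j).sum() < 2 for j in range(k)):
            continue
        ll = 0.0
        for j in range(k):
            C = X[labels == j]
            w = len(C) / m                         # mixing weight of the cluster
            mu = C.mean(axis=0)                    # mean of the cluster
            var = ((C - mu) ** 2).mean() + 1e-12   # spherical variance (small
                                                   # constant avoids log(0))
            # log-density of N(mu, var * I) at the cluster's points, weighted
            # by the mixing weight
            ll += len(C) * (log(w) - 0.5 * n * log(2 * pi * var))
            ll -= ((C - mu) ** 2).sum() / (2 * var)
        if ll > best_ll:
            best_ll, best_labels = ll, labels
    return best_labels, best_ll

After projecting to the top k right singular vectors, X would be m × k rather than m × n, which is what makes this (still exponential) enumeration meaningful at the minimum separation.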
Chapter 7
Conclusion
We have presented an algorithm that learns a mixture of spherical Gaussians in the
sense that, given a suitably large sample of points, it can correctly classify the points
according to which distribution they were generated from with high probability. The
work here raises many interesting questions and remarks, with which we conclude.
7.1
Remarks
The algorithm and its guarantees can be extended to mixtures of weakly-isotropic
distributions, provided they have two types of concentration bounds. For a guarantee
as in Theorem 10, we require an analog of Lemma 3 to hold, and for a guarantee as in
Theorem 11, we need a lemma similar to Lemma 6. A particular class of distributions
that possess good concentration bounds is the class of logconcave distributions. A
distribution is said to be logconcave if its density function f is logconcave, i.e. for any
x, y ∈ R^n, and any 0 < α < 1, f(αx + (1 − α)y) ≥ f(x)^α f(y)^{1−α}. Examples of logconcave distributions include the special case of the uniform distribution on a weakly
isotropic convex body, e.g. cubes, balls, etc. Although the weakly isotropic property
might sound restrictive, it is worth noting that any single log-concave distribution can
be made weakly isotropic by a linear transformation (see, e.g. [8]).
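As a quick worked check of this definition (not taken from the thesis), the spherical Gaussian density itself is logconcave, which is consistent with treating logconcave distributions as the natural generalization here:

\[
f(x) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{\|x-\mu\|^2}{2\sigma^2}\right),
\qquad
\log f(x) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{\|x-\mu\|^2}{2\sigma^2},
\]
and since \(x \mapsto \|x-\mu\|^2\) is convex, \(\log f\) is concave; equivalently,
\[
f(\alpha x + (1-\alpha) y) \;\ge\; f(x)^{\alpha} f(y)^{1-\alpha}
\qquad \text{for all } x, y \in \mathbb{R}^n,\ 0 < \alpha < 1.
\]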
We remark that the SVD is tolerant to noise whose 2-norm is bounded [10, 9, 2].
Thus even after corruption, the SVD of the sample matrix will recover a subspace
that is close to one spanning the mean vectors of the underlying distributions. In this
low-dimensional space, one could exhaustively examine subsets of the data to learn the
mixture (the ignored portion of the data corresponds to the noise).
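For reference, the spectral step discussed here is simply a projection onto the span of the top k right singular vectors of the sample matrix; the following minimal numpy sketch (not the thesis's pseudocode; the function name is illustrative) shows the operation whose robustness to bounded-2-norm noise is being invoked.

import numpy as np

def project_to_top_k(A, k):
    """Project the rows of the (m, n) sample matrix A onto the subspace
    spanned by its top k right singular vectors."""
    # economy-size SVD: A = U @ np.diag(s) @ Vt, where the rows of Vt are the
    # right singular vectors ordered by decreasing singular value
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k]                  # (k, n): top k right singular vectors
    return A @ Vk.T @ Vk         # rows of A projected back into R^n

The remark above says that if A is replaced by A + E with ||E||_2 small, the subspace spanned by these singular vectors (and hence the projected points) moves only slightly, which is why the low-dimensional exhaustive step remains meaningful after corruption.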
7.2
Open problems and related areas
The first natural question is whether the technique of projecting the sample matrix to
the top k right singular vectors can be applied when the Gaussians in the mixture are
no longer spherical. Although a variant of Theorem 9 does not hold when the Gaussians
are not spherical, perhaps it is the case that the subspace that maximizes the expected
norm of the projection of the sample matrix is not far from the subspace spanned by the
mean vectors. Even if this does hold, however, classifying the points after projection
would prove to be a much more difficult task than the clustering algorithm we describe
in Chapter 6. It will probably be necessary to use isoperimetry as in [1].
One area of more open-ended research is finding the right parameter that naturally
characterizes the difficulty of learning mixtures of Gaussians. While this work and
previous works considered separation (the distance between means of the underlying
Gaussians) to be the main parameter, small separation between mean vectors does
not necessarily mean that it is impossible to distinguish two Gaussians. Consider two
Gaussians, one with very large variance and one with very small variance. Even if
the two Gaussians have the same mean, it is still possible to distinguish points drawn from one from points drawn from the other. With distance concentration, we can first remove points from the
smaller Gaussian. In this case, there is very little probability overlap between the two
distributions. Learning a mixture of Gaussians when the only assumption is probability
overlap seems to be a challenging problem.
Lastly, while the work in this thesis is primarily theoretical, it would be interesting
to see how this algorithm performs in practice. While it is difficult to find data that
takes the form of spherical Gaussians, finding data that is approximately Gaussian is
not difficult. Applying the algorithms developed in this thesis to this experimental
data may show that the algorithm actually performs much better in practice than
the theoretical bounds suggest. Further, experiments can test the success of this algorithm when the Gaussians are non-spherical, and suggest how to prove that spectral
projection works with non-spherical Gaussians.
Bibliography
[1] S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings
of the 33rd ACM STOC, 2001.
[2] Y. Azar, A. Fiat, A. Karlin, F. McSherry, and J. Saia. Spectral analysis of data.
In Proceedings of the 33rd ACM STOC, 2001.
[3] M. Collins. The EM Algorithm. Unpublished manuscript, 1997.
[4] S. DasGupta. Learning mixtures of Gaussians. In Proceedings of the 40th IEEE
FOCS, 1999.
[5] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. In Journal of the Royal Statistical Society, Series B, volume 39,
pages 1-38, 1977.
[6] M. L. Eaton. Multivariate Statistics. Wiley, New York, 1983.
[7] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins, third edition,
1996.
[8] L. Lovász and S. Vempala. Logconcave functions: Geometry and efficient sampling
algorithms. In Proceedings of the 44th IEEE FOCS (to appear), 2003.
[9] C. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic
indexing: a probabilistic analysis. In Journal of Computer and System Sciences,
volume 61, pages 217-235, 2000.
[10] G. Stewart. Error and perturbation bounds for subspaces associated with certain
eigenvalue problems. In SIAM Review, volume 15(4), pages 727-764, 1973.
[11] S. Stigler. Statistics on the Table. Harvard University Press, 1999.
[12] D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical analysis of finite
mixture distributions. Wiley, 1985.
[13] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. In
Journal of Computer and System Sciences (to appear).
[14] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In Proceedings of the 43rd IEEE FOCS, 2002.