Efficiently Learning Mixtures of Gaussians
Ankur Moitra, MIT
joint work with Adam Tauman Kalai and Gregory Valiant
May 11, 2010
What is a Mixture of Gaussians?
Distribution on ℝ^n (w1, w2 ≥ 0, w1 + w2 = 1):
[Figure: sampling oracle SA(F) — with probability w1 draw x from SA(F1), with probability w2 draw x from SA(F2)]
F(x) = w1 N(µ1, Σ1, x) + w2 N(µ2, Σ2, x)
Pearson and the Naples Crabs
[Figure: histogram of Pearson's Naples crab measurements with a fitted mixture of two Gaussians (figure due to Peter Macdonald)]
Let F(x) = w1 F1(x) + w2 F2(x), where Fi(x) = N(µi, σi², x)
Definition
We will refer to E_{x←Fi(x)}[x^r] as the r-th raw moment of Fi(x)
1. There are five unknown variables: w1, µ1, σ1², µ2, σ2²
2. The r-th raw moment of Fi(x) is a polynomial in µi, σi
Definition
Let E_{x←Fi(x)}[x^r] = Mr(µi, σi²)
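For reference (standard formulas, not from the slides), these polynomials can be written out explicitly; the first six raw moments of a single Gaussian N(µ, σ², x) are:
    M1(µ, σ²) = µ
    M2(µ, σ²) = µ² + σ²
    M3(µ, σ²) = µ³ + 3µσ²
    M4(µ, σ²) = µ⁴ + 6µ²σ² + 3σ⁴
    M5(µ, σ²) = µ⁵ + 10µ³σ² + 15µσ⁴
    M6(µ, σ²) = µ⁶ + 15µ⁴σ² + 45µ²σ⁴ + 15σ⁶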
Question
What if we knew the r-th raw moment of F(x) perfectly?
Each value yields a constraint:
E_{x←F(x)}[x^r] = w1 Mr(µ1, σ1²) + w2 Mr(µ2, σ2²)
Definition
We will refer to M̃r = (1/|S|) Σ_{i∈S} x_i^r as the empirical r-th raw moment of F(x)
Pearson's Sixth Moment Test
1. Compute the empirical r-th raw moments M̃r for r ∈ {1, 2, ..., 6}
2. Find all simultaneous roots of {w1 Mr(µ1, σ1²) + (1 − w1) Mr(µ2, σ2²) = M̃r}_{r∈{1,2,...,5}}
3. This yields a list of candidate parameters θ⃗a, θ⃗b, ...
4. Choose the candidate that is closest in sixth moment: w1 M6(µ1, σ1²) + (1 − w1) M6(µ2, σ2²) ≈ M̃6
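A minimal numpy sketch (ours, not Pearson's original computation) of the two ingredients of this test: empirical raw moments of a sample, and analytic raw moments of a candidate two-Gaussian mixture. The helper names are made up for illustration; solving the polynomial system in step 2 is not attempted here.

    import math
    import numpy as np

    def empirical_moments(xs, rmax=6):
        # M~_r = (1/|S|) * sum_{i in S} x_i^r, for r = 1..rmax
        xs = np.asarray(xs, dtype=float)
        return np.array([np.mean(xs ** r) for r in range(1, rmax + 1)])

    def gaussian_raw_moment(mu, var, r):
        # E[x^r] for x ~ N(mu, var): expand (mu + sigma*z)^r, use E[z^k] = (k-1)!! for even k
        sigma = math.sqrt(var)
        total = 0.0
        for k in range(0, r + 1, 2):                      # odd powers of z have zero expectation
            double_factorial = math.prod(range(1, k, 2))  # (k-1)!! = 1*3*...*(k-1); empty product = 1
            total += math.comb(r, k) * mu ** (r - k) * sigma ** k * double_factorial
        return total

    def mixture_moments(w1, mu1, var1, mu2, var2, rmax=6):
        # M_r(F) = w1 * M_r(mu1, var1) + (1 - w1) * M_r(mu2, var2)
        return np.array([w1 * gaussian_raw_moment(mu1, var1, r)
                         + (1 - w1) * gaussian_raw_moment(mu2, var2, r)
                         for r in range(1, rmax + 1)])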
"Given the probable error of every ordinate of a frequency-curve, what are the probable errors of the elements of the two normal curves into which it may be dissected?" [Karl Pearson]
Question
How does noise in the empirical moments translate to noise in the derived parameters?
Gaussian Mixture Models
Applications in physics, biology, geology, social sciences ...
Goal
Estimate parameters in order to understand the underlying process
Question
Can we PROVABLY recover the parameters EFFICIENTLY? (Dasgupta, 1999)
Definition
D(f(x), g(x)) = ½ ‖f(x) − g(x)‖₁
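A quick numerical sketch (ours) of the statistical distance D just defined, approximated on a grid for two univariate Gaussians:

    import numpy as np
    from scipy.stats import norm

    def statistical_distance(f, g, lo=-20.0, hi=20.0, num=200_001):
        # D(f, g) = (1/2) * integral of |f(x) - g(x)| dx, via a Riemann sum on [lo, hi]
        x = np.linspace(lo, hi, num)
        dx = x[1] - x[0]
        return 0.5 * np.sum(np.abs(f.pdf(x) - g.pdf(x))) * dx

    print(statistical_distance(norm(0, 1), norm(3, 1)))     # large: the components barely overlap
    print(statistical_distance(norm(0, 1), norm(0.05, 1)))  # small: the components nearly coincide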
A History of Learning Mixtures of Gaussians
n^{1/2}  [Dasgupta, 1999]
n^{1/4}  [Dasgupta, Schulman, 2000]
n^{1/4}  [Arora, Kannan, 2001]
k^{1/4}  [Vempala, Wang, 2002]
k^{1/4}  [Achlioptas, McSherry, 2005]
[Brubaker, Vempala, 2008]
All previous results required D(F1, F2) ≈ 1...
... because the results relied on CLUSTERING
Question
Can we learn the parameters of the mixture without clustering?
Question
Can we learn the parameters when D(F1, F2) is close to ZERO?
Goal
Learn a mixture F̂ = ŵ1 F̂1 + ŵ2 F̂2 so that there is a permutation π : {1, 2} → {1, 2} and for i = {1, 2}:
|wi − ŵπ(i)| ≤ ε, D(Fi, F̂π(i)) ≤ ε
We will call such a mixture F̂ an ε-close estimate of F.
Question
When can we hope to learn an ε-close estimate?
Question
What if w1 = 0?
We never sample from F1!
Question
What if D(F1, F2) = 0?
For any w1, w2, F = w1 F1 + w2 F2 is the same distribution!
Definition
A mixture of Gaussians F = w1 F1 + w2 F2 is ε-statistically learnable if for i = {1, 2}, wi ≥ ε and D(F1, F2) ≥ ε.
Efficiently Learning Mixtures of Two Gaussians
Given oracle access to an ε-statistically learnable mixture of two Gaussians F:
Theorem (Kalai, M, Valiant)
There is an algorithm that (with probability at least 1 − δ) learns a mixture of two Gaussians F̂ that is an ε-close estimate of F, and the running time and data requirements are poly(1/ε, n, 1/δ).
Previously, not even an estimator with an inverse-exponential rate was known for univariate mixtures of two Gaussians.
Question
What about mixtures of k Gaussians?
Definition
A mixture of k Gaussians F = Σ_i wi Fi is ε-statistically learnable if for i = {1, 2, ..., k}, wi ≥ ε and for all i ≠ j, D(Fi, Fj) ≥ ε.
Definition
An estimate F̂ = Σ_i ŵi F̂i of a mixture of k Gaussians is ε-close to F if there is a permutation π : {1, 2, ..., k} → {1, 2, ..., k} and for i = {1, 2, ..., k}:
|wi − ŵπ(i)| ≤ ε, D(Fi, F̂π(i)) ≤ ε
Efficiently Learning Mixtures of Gaussians
Given oracle access to an ε-statistically learnable mixture of k Gaussians F:
Theorem (M, Valiant)
There is an algorithm that (with probability at least 1 − δ) learns a mixture of k Gaussians F̂ that is an ε-close estimate of F, and the running time and data requirements are poly(1/ε, n, 1/δ).
The running time and sample complexity depend exponentially on k, but such a dependence is necessary!
Corollary: First polynomial-time density estimation for mixtures of Gaussians with no assumptions!
Question
Can we give additive guarantees?
Cannot give additive guarantees without defining an appropriate normalization
Definition
A distribution F(x) is in isotropic position if
1. E_{x←F(x)}[x] = 0⃗
2. E_{x←F(x)}[(u^T x)²] = 1 for all ‖u‖ = 1
Fact
For any distribution F(x) on ℝ^n, there is an affine transformation T that places F(x) in isotropic position
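A minimal sketch of estimating such a transformation T from samples (standard whitening; assumes a full-rank empirical covariance, and the function name is ours). The transformation (mean, A) is returned so that estimates learned in isotropic position can be mapped back afterwards.

    import numpy as np

    def isotropic_transform(X):
        """Return Y = (X - mean) @ A.T with empirical mean ~0 and empirical covariance ~I."""
        X = np.asarray(X, dtype=float)
        mean = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)                  # cov = V diag(vals) V^T
        A = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T  # A = cov^{-1/2}
        return (X - mean) @ A.T, mean, A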
Isotropic Position
[Figure: a mixture F1, F2 before and after the isotropic transformation]
Given
Mixture F of two Gaussians, ε-statistically learnable, and in isotropic position
Output
F̂ = ŵ1 F̂1 + ŵ2 F̂2 s.t. |wi − ŵπ(i)|, ‖µi − µ̂π(i)‖, ‖Σi − Σ̂π(i)‖_F ≤ ε
Rough Idea
1. Consider a series of projections down to one dimension
2. Run a univariate learning algorithm
3. Use these estimates as constraints in a system of equations
4. Solve this system to obtain higher-dimensional estimates
Claim
Proj_r[F1] = N(r^T µ1, r^T Σ1 r, x)
Each univariate estimate yields an approximate linear constraint on the parameters
Definition
Dp(N(µ1, σ1²), N(µ2, σ2²)) = |µ1 − µ2| + |σ1² − σ2²|
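A quick numerical check of the Claim above, with toy parameters of our choosing: the projection of N(µ1, Σ1) onto a unit vector r is univariate Gaussian with mean r^T µ1 and variance r^T Σ1 r.

    import numpy as np

    rng = np.random.default_rng(0)
    mu1 = np.array([1.0, -2.0, 0.5])
    B = rng.normal(size=(3, 3))
    Sigma1 = B @ B.T                        # an arbitrary positive definite covariance
    r = rng.normal(size=3)
    r /= np.linalg.norm(r)                  # unit direction

    X = rng.multivariate_normal(mu1, Sigma1, size=200_000)
    proj = X @ r
    print(proj.mean(), r @ mu1)             # both approximately r^T mu1
    print(proj.var(), r @ Sigma1 @ r)       # both approximately r^T Sigma1 r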
Problem
What if we choose a direction r s.t. Dp(Proj_r[F1], Proj_r[F2]) is extremely small?
Then we would need to run the univariate algorithm with extremely fine precision!
Isotropic Projection Lemma: With high probability, Dp(Proj_r[F1], Proj_r[F2]) is at least polynomially large (i.e. at least poly(ε, 1/n))
Isotropic Projection Lemma
Suppose F = w1 F1 + w2 F2 is in isotropic position and is ε-statistically learnable:
Lemma (Isotropic Projection Lemma)
With probability ≥ 1 − δ over a randomly chosen direction r,
Dp(Proj_r[F1], Proj_r[F2]) ≥ ε⁵δ²/(50n²) = ε₃.
Isotropic Projection Lemma
[Figure: the components F1, F2, the direction u between them, a random direction r, and the projections Proj_r[F1], Proj_r[F2] and Proj_r[u]]
Suppose we learn estimates F̂^r, F̂^s from directions r, s
F̂1^r, F̂1^s each yield constraints on multidimensional parameters of one Gaussian in F
Problem
How do we know that they yield constraints on the SAME Gaussian?
Searching Nearby Directions
[Figure: a direction r and a nearby direction s; which univariate component along s corresponds to which along r?]
Suppose we learn estimates F̂^r, F̂^s from directions r, s
F̂1^r, F̂1^s each yield constraints on multidimensional parameters of one Gaussian in F
Problem
How do we know that they yield constraints on the SAME Gaussian?
Pairing Lemma: If we choose directions close enough, then pairing becomes easy
("close enough" depends on the Isotropic Projection Lemma)
Pairing Lemma
Suppose ‖r − s‖ ≤ ε₂ (for ε₂ ≪ ε₃)
We still assume F = w1 F1 + w2 F2 is in isotropic position and is ε-statistically learnable
Lemma (Pairing Lemma)
Dp(Proj_r[Fi], Proj_s[Fi]) ≤ O(ε₂) ≪ ε₃/3
Searching Nearby Directions
[Figure: the projections onto a direction r and a nearby direction s stay close, so components can be matched across the two directions]
Each univariate estimate yields a linear constraint on the parameters:
Proj_r[F1] = N(r^T µ1, r^T Σ1 r)
Problem
What is the condition number of this system? (i.e. How do errors in univariate estimates translate to errors in multidimensional estimates?)
Recovery Lemma: Condition number is polynomially bounded: O(n²/ε₂)
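A minimal sketch (ours) of the recovery step: each direction r contributes the linear constraints r^T µ1 ≈ (estimated mean) and r^T Σ1 r ≈ (estimated variance), and the system is solved by least squares. Here the univariate estimates are simulated from ground truth plus noise rather than produced by a univariate learner.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 4
    mu = rng.normal(size=n)
    B = rng.normal(size=(n, n))
    Sigma = B @ B.T                                   # ground truth (illustrative only)

    dirs = rng.normal(size=(5 * n * n, n))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    # Simulated univariate estimates r^T mu and r^T Sigma r, with small errors
    noise = 1e-3
    mean_est = dirs @ mu + noise * rng.normal(size=len(dirs))
    var_est = np.einsum('ij,jk,ik->i', dirs, Sigma, dirs) + noise * rng.normal(size=len(dirs))

    # Means: each direction gives one linear equation r^T mu = mean_est
    mu_hat, *_ = np.linalg.lstsq(dirs, mean_est, rcond=None)

    # Covariance: r^T Sigma r is linear in the n^2 entries of Sigma
    design = np.einsum('ij,ik->ijk', dirs, dirs).reshape(len(dirs), n * n)
    Sigma_hat = np.linalg.lstsq(design, var_est, rcond=None)[0].reshape(n, n)
    Sigma_hat = (Sigma_hat + Sigma_hat.T) / 2         # symmetrize

    print(np.linalg.norm(mu - mu_hat))                # small
    print(np.linalg.norm(Sigma - Sigma_hat))          # small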
r
r
r
3
s
r
3
s
2
r
3
s
2
r
3
s
__
1
3
3
2
r
3
s
__
1
3
3
2
r
3
s
__
1
3
3
2
1
1
r
3
s
__
1
3
3
2
1
1
r
3
1
1
s
__
1
3
3
2
1
1
r
3
s
__
1
3
3
2
1
1
r
3
1
1
s
__
1
3
3
2
1
1
r
3
1
1
s
__
1
3
3
2
1
1
r
3
1
1
s
2
1
1
r
1
s
2
1
r
1
s
2
1
r
~
SA(F)
Additive Isotropic
Using Additive Estimates to Cluster
[Figure: additive estimates F̂1, F̂2 of F1, F2 and the projections Proj_r[F1], Proj_r[F2] onto a direction r, used to cluster samples]
[Figure: overall algorithm — from sample access SA(F̃), compute an isotropic transformation T̃; apply T̃ to get SA(F) in isotropic position; learn additive isotropic estimates; use them for clustering; output F̂]
Question
Can we learn an additive approximation in one dimension? How many free parameters are there?
µ1, σ1², µ2, σ2², w1
Additionally, each parameter is bounded:
Claim
1. w1, w2 ∈ [ε, 1]
2. |µ1|, |µ2| ≤ 1/√ε
3. σ1², σ2² ≤ 1/ε
In this case, we call the parameters ε-bounded
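(A one-line check of the Claim, added here for completeness: in isotropic position the projected mixture has E_{x←F}[x] = 0 and E_{x←F}[x²] = w1(µ1² + σ1²) + w2(µ2² + σ2²) = 1, and ε-statistical learnability gives wi ≥ ε; hence wi µi² ≤ 1 and wi σi² ≤ 1, i.e. |µi| ≤ 1/√ε and σi² ≤ 1/ε.)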
So we can use a grid search over µ̂1 × σ̂1² × µ̂2 × σ̂2² × ŵ1
Question
How do we test if a candidate set of parameters is accurate?
1. Compute the empirical moments M̃r = (1/|S|) Σ_{i∈S} x_i^r for r ∈ {1, 2, ..., 6}
2. Compute the analytical moments Mr(F̂) = E_{x←F̂}[x^r] where F̂ = ŵ1 N(µ̂1, σ̂1², x) + ŵ2 N(µ̂2, σ̂2², x), for r ∈ {1, 2, ..., 6}
3. Accept if M̃r ≈ Mr(F̂) for all r ∈ {1, 2, ..., 6}
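A brute-force sketch (ours) of the grid search with the six-moment test; the grid resolution and acceptance threshold are arbitrary illustrative choices, not the values from the analysis. The theorem that follows is what guarantees that matching the first six moments forces the parameters to be close.

    import itertools
    import numpy as np
    from scipy.stats import norm

    def mixture_moment(w1, mu1, var1, mu2, var2, r):
        # r-th raw moment of w1*N(mu1, var1) + (1 - w1)*N(mu2, var2)
        return (w1 * norm(mu1, np.sqrt(var1)).moment(r)
                + (1 - w1) * norm(mu2, np.sqrt(var2)).moment(r))

    def grid_search(xs, eps, grid=8, tol=0.05):
        # Brute force and slow; purely illustrative
        xs = np.asarray(xs, dtype=float)
        m_emp = np.array([np.mean(xs ** r) for r in range(1, 7)])
        mus = np.linspace(-1 / np.sqrt(eps), 1 / np.sqrt(eps), grid)   # epsilon-bounded ranges
        variances = np.linspace(eps, 1 / eps, grid)
        weights = np.linspace(eps, 1 - eps, grid)
        best, best_err = None, np.inf
        for mu1, v1, mu2, v2, w1 in itertools.product(mus, variances, mus, variances, weights):
            m_hat = np.array([mixture_moment(w1, mu1, v1, mu2, v2, r) for r in range(1, 7)])
            err = np.max(np.abs(m_hat - m_emp))
            if err < best_err:
                best, best_err = (w1, mu1, v1, mu2, v2), err
        return best if best_err <= tol else None   # accept only if all six moments nearly match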
Definition
The pair F, F̂ is ε-standard if
1. the parameters of F, F̂ are ε-bounded
2. Dp(F1, F2), Dp(F̂1, F̂2) ≥ ε
3. ε ≤ min_π Σ_i |wi − ŵπ(i)| + Dp(Fi, F̂π(i))
Theorem
There is a constant c > 0 such that, for any ε < c and any ε-standard F, F̂,
max_{r∈{1,2,...,6}} |Mr(F) − Mr(F̂)| ≥ ε^67
Method of Moments
[Figure: F1(x), F2(x), the estimates F̂1(x), F̂2(x), the difference f(x) = F(x) − F̂(x), and a polynomial p(x)]
Question
Why does this imply one of the first six moments of F, F̂ is different?
0 < ∫ p(x) f(x) dx = ∫ Σ_{r=1}^{6} pr x^r f(x) dx ≤ Σ_{r=1}^{6} |pr| |Mr(F) − Mr(F̂)|
So ∃ r ∈ {1, 2, ..., 6} s.t. |Mr(F) − Mr(F̂)| > 0
Proposition
Let f(x) = Σ_{i=1}^{k} αi N(µi, σi², x) be a linear combination of k Gaussians (the αi can be negative). Then if f(x) is not identically zero, f(x) has at most 2k − 2 zero crossings.
Theorem (Hummel, Gidas)
Given f(x) : ℝ → ℝ that is analytic and has n zeros, then for any σ² > 0, the function g(x) = f(x) ∗ N(0, σ², x) has at most n zeros.
Convolving with a Gaussian does not increase the number of zero crossings!
Fact
N(0, σ1², x) ∗ N(0, σ2², x) = N(0, σ1² + σ2², x)
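A quick numerical illustration of the Proposition (toy coefficients of our choosing): count the sign changes of a linear combination of k = 3 Gaussian densities on a fine grid.

    import numpy as np
    from scipy.stats import norm

    def sign_changes(values):
        signs = np.sign(values)
        signs = signs[signs != 0]                 # ignore exact zeros on the grid
        return int(np.sum(signs[1:] != signs[:-1]))

    x = np.linspace(-10, 10, 200_001)
    components = [(1.0, -2.0, 1.0), (-1.5, 0.0, 0.5), (0.8, 3.0, 2.0)]   # (alpha_i, mu_i, var_i)
    f = sum(a * norm(mu, np.sqrt(var)).pdf(x) for a, mu, var in components)
    print(sign_changes(f))                        # the Proposition says this is at most 2k - 2 = 4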
Zero Crossings and the Heat Equation
[Figure: a function with several zero crossings being smoothed by Gaussian convolution; the number of zero crossings never increases]
[Figure: a mixture of three Gaussians F1, F2, F3 and its projections Proj_r[Fi], Proj_s[Fi] onto nearby directions r and s]
Generalized Isotropic Projection Lemma
Lemma (Generalized Isotropic Projection Lemma)
With probability ≥ 1 − δ over a randomly chosen direction r, for all i ≠ j, Dp(Proj_r[Fi], Proj_r[Fj]) ≥ ε₃.
FALSE!
Lemma (Generalized Isotropic Projection Lemma)
With probability ≥ 1 − δ over a randomly chosen direction r, there exist i ≠ j such that Dp(Proj_r[Fi], Proj_r[Fj]) ≥ ε₃.
Windows
[Figure: candidate univariate components F̂a, F̂b, F̂c, F̂d, F̂e, F̂f, F̂g along a direction r, grouped into windows]
Pairing Lemma, Part 2?
[Figure: pairing the components recovered from a direction r with those from a nearby direction s]
Thanks!