Efficiently Learning Mixtures of Gaussians
Ankur Moitra, MIT
Joint work with Adam Tauman Kalai and Gregory Valiant
May 11, 2010

What is a Mixture of Gaussians?
A distribution on ℝ^n (w1, w2 ≥ 0, w1 + w2 = 1):
[Figure: sampling access SA(F) draws from SA(F1) with probability w1 and from SA(F2) with probability w2.]
F(x) = w1 N(µ1, Σ1, x) + w2 N(µ2, Σ2, x)

Pearson and the Naples Crabs
[Figure: histogram of Pearson's Naples crab measurements with a fitted mixture of two univariate Gaussians; figure due to Peter Macdonald.]

Let F(x) = w1 F1(x) + w2 F2(x), where Fi(x) = N(µi, σi², x).

Definition
We will refer to E_{x←Fi(x)}[x^r] as the r-th raw moment of Fi(x).

1 There are five unknown variables: w1, µ1, σ1², µ2, σ2².
2 The r-th raw moment of Fi(x) is a polynomial in µi, σi.

Definition
Let E_{x←Fi(x)}[x^r] = Mr(µi, σi²).

Question
What if we knew the r-th raw moment of F(x) perfectly?

Each value yields a constraint: E_{x←F(x)}[x^r] = w1 Mr(µ1, σ1²) + w2 Mr(µ2, σ2²).

Definition
We will refer to M̃r = (1/|S|) Σ_{i∈S} x_i^r as the empirical r-th raw moment of F(x).

Pearson's Sixth Moment Test
1 Compute the empirical r-th raw moments M̃r for r ∈ {1, 2, ..., 6}.
2 Find all simultaneous roots of {w1 Mr(µ1, σ1²) + (1 − w1) Mr(µ2, σ2²) = M̃r} for r ∈ {1, 2, ..., 5}.
3 This yields a list of candidate parameters θ_a, θ_b, ...
4 Choose the candidate that is closest in sixth moment: w1 M6(µ1, σ1²) + (1 − w1) M6(µ2, σ2²) ≈ M̃6.

"Given the probable error of every ordinate of a frequency-curve, what are the probable errors of the elements of the two normal curves into which it may be dissected?" [Karl Pearson]

Question
How does noise in the empirical moments translate to noise in the derived parameters?
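To make the quantities in Pearson's test concrete, here is a minimal Python sketch (my own illustration, not Pearson's procedure or the talk's code) of the empirical raw moments M̃r and of the analytical raw moment Mr(µ, σ²) of a single Gaussian, which is indeed a polynomial in µ and σ; the mixture moment is the w-weighted combination. The parameter values in the example are arbitrary.

```python
# Empirical raw moments of a sample, and analytical raw moments of a Gaussian
# and of a two-component mixture (a sketch; not the talk's implementation).
import numpy as np
from math import comb

def empirical_raw_moment(samples, r):
    """M~_r = (1/|S|) * sum_{i in S} x_i^r."""
    return np.mean(np.asarray(samples, dtype=float) ** r)

def gaussian_raw_moment(mu, var, r):
    """M_r(mu, sigma^2) = E[(mu + sigma*Z)^r] for Z ~ N(0, 1).

    Uses E[Z^j] = (j-1)!! for even j and 0 for odd j, so the result is a
    polynomial in mu and sigma, as claimed above.
    """
    sigma = np.sqrt(var)
    total = 0.0
    for j in range(0, r + 1, 2):                      # odd j contribute nothing
        double_fact = np.prod(np.arange(j - 1, 0, -2)) if j > 0 else 1.0
        total += comb(r, j) * mu ** (r - j) * sigma ** j * double_fact
    return total

def mixture_raw_moment(w1, mu1, var1, mu2, var2, r):
    """E_{x<-F}[x^r] = w1*M_r(mu1, var1) + (1 - w1)*M_r(mu2, var2)."""
    return (w1 * gaussian_raw_moment(mu1, var1, r)
            + (1 - w1) * gaussian_raw_moment(mu2, var2, r))

# The first six empirical moments of a synthetic mixture should be close to
# the analytical ones.
rng = np.random.default_rng(0)
w1, mu1, var1, mu2, var2 = 0.4, -1.0, 0.5, 2.0, 1.5
n = 200_000
labels = rng.random(n) < w1
x = np.where(labels, rng.normal(mu1, np.sqrt(var1), n), rng.normal(mu2, np.sqrt(var2), n))
for r in range(1, 7):
    print(r, empirical_raw_moment(x, r), mixture_raw_moment(w1, mu1, var1, mu2, var2, r))
```

Pearson's procedure then searches for parameters whose first five analytical moments match the empirical ones and uses the sixth moment to break ties among the candidates.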
Gaussian Mixture Models
Applications in physics, biology, geology, the social sciences, ...

Goal
Estimate the parameters in order to understand the underlying process.

Question
Can we PROVABLY recover the parameters EFFICIENTLY? (Dasgupta, 1999)

Definition
D(f(x), g(x)) = (1/2) ||f(x) − g(x)||_1

A History of Learning Mixtures of Gaussians (required separation between components)
n^(1/2)  [Dasgupta, 1999]
n^(1/4)  [Dasgupta, Schulman, 2000]
n^(1/4)  [Arora, Kannan, 2001]
k^(1/4)  [Vempala, Wang, 2002]
k^(1/4)  [Achlioptas, McSherry, 2005]
         [Brubaker, Vempala, 2008]

All previous results required D(F1, F2) ≈ 1...
... because the results relied on CLUSTERING.

Question
Can we learn the parameters of the mixture without clustering?

Question
Can we learn the parameters when D(F1, F2) is close to ZERO?

Goal
Learn a mixture F̂ = ŵ1 F̂1 + ŵ2 F̂2 so that there is a permutation π : {1, 2} → {1, 2} and for i ∈ {1, 2}
|wi − ŵπ(i)| ≤ ε,  D(Fi, F̂π(i)) ≤ ε.
We will call such a mixture F̂ ε-close to F.

Question
When can we hope to learn an ε-close estimate?

Question
What if w1 = 0? We never sample from F1!

Question
What if D(F1, F2) = 0? For any w1, w2, F = w1 F1 + w2 F2 is the same distribution!

Definition
A mixture of Gaussians F = w1 F1 + w2 F2 is ε-statistically learnable if for i ∈ {1, 2}, wi ≥ ε and D(F1, F2) ≥ ε.
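As a concrete illustration of the statistical distance D just defined, here is a small numerical sketch of my own (simple grid quadrature, not anything from the talk) that estimates D between two univariate Gaussian components; ε-statistical learnability asks this quantity to be at least ε, while the clustering-based results above needed it to be close to 1.

```python
# Numerically approximate D(f, g) = (1/2) * ||f - g||_1 for two univariate
# Gaussian densities on a grid wide enough to cover both components.
import numpy as np
from scipy.stats import norm

def statistical_distance(mu1, var1, mu2, var2, grid_points=100_001):
    s1, s2 = np.sqrt(var1), np.sqrt(var2)
    lo = min(mu1 - 10 * s1, mu2 - 10 * s2)
    hi = max(mu1 + 10 * s1, mu2 + 10 * s2)
    xs = np.linspace(lo, hi, grid_points)
    dx = xs[1] - xs[0]
    f = norm.pdf(xs, loc=mu1, scale=s1)
    g = norm.pdf(xs, loc=mu2, scale=s2)
    return 0.5 * np.sum(np.abs(f - g)) * dx

print(statistical_distance(0.0, 1.0, 0.05, 1.0))   # nearly identical components: D is small
print(statistical_distance(0.0, 1.0, 10.0, 1.0))   # well-separated components: D is approximately 1
```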
Efficiently Learning Mixtures of Two Gaussians
Given oracle access to an ε-statistically learnable mixture of two Gaussians F:

Theorem (Kalai, M, Valiant)
There is an algorithm that (with probability at least 1 − δ) learns a mixture of two Gaussians F̂ that is an ε-close estimate to F, and the running time and data requirements are poly(1/ε, n, 1/δ).

Previously, not even an inverse-exponential estimator was known for univariate mixtures of two Gaussians.

Question
What about mixtures of k Gaussians?

Definition
A mixture of k Gaussians F = Σ_i wi Fi is ε-statistically learnable if for i ∈ {1, 2, ..., k}, wi ≥ ε and for all i ≠ j, D(Fi, Fj) ≥ ε.

Definition
An estimate F̂ = Σ_i ŵi F̂i (a mixture of k Gaussians) is ε-close to F if there is a permutation π : {1, 2, ..., k} → {1, 2, ..., k} such that for i ∈ {1, 2, ..., k}
|wi − ŵπ(i)| ≤ ε,  D(Fi, F̂π(i)) ≤ ε.

Efficiently Learning Mixtures of Gaussians
Given oracle access to an ε-statistically learnable mixture of k Gaussians F:

Theorem (M, Valiant)
There is an algorithm that (with probability at least 1 − δ) learns a mixture of k Gaussians F̂ that is an ε-close estimate to F, and the running time and data requirements are poly(1/ε, n, 1/δ).

The running time and sample complexity depend exponentially on k, but such a dependence is necessary!

Corollary: the first polynomial-time density estimation for mixtures of Gaussians with no assumptions!

Question
Can we give additive guarantees?
We cannot give additive guarantees without defining an appropriate normalization.
Definition
A distribution F(x) is in isotropic position if
1 E_{x←F(x)}[x] = 0
2 E_{x←F(x)}[(u^T x)²] = 1 for all ||u|| = 1

Fact
For any distribution F(x) on ℝ^n, there is an affine transformation T that places F(x) in isotropic position.

[Figure: Isotropic Position — the same two-component mixture F1, F2 shown before (not isotropic) and after (isotropic) applying T.]
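The Fact above is constructive when working with samples: estimate the mean and covariance and whiten. Here is a minimal sketch (function names and details are mine, assuming the empirical covariance is nondegenerate), not the talk's code.

```python
# Whitening: map x -> T (x - mean) with T = Sigma^{-1/2}, so the empirical
# distribution has zero mean and identity covariance (isotropic position).
import numpy as np

def isotropic_transform(samples):
    """Return T, the empirical mean, and the transformed samples y_i = T (x_i - mean)."""
    X = np.asarray(samples, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    # Symmetric inverse square root of the covariance (assumes cov is full rank).
    evals, evecs = np.linalg.eigh(cov)
    T = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    Y = (X - mean) @ T.T
    return T, mean, Y

# Example: a skewed two-dimensional mixture before and after the transformation.
rng = np.random.default_rng(1)
A = np.array([[3.0, 1.0], [0.0, 0.5]])
X = np.concatenate([rng.normal(size=(5000, 2)) @ A.T + np.array([2.0, 0.0]),
                    rng.normal(size=(5000, 2)) @ A.T - np.array([2.0, 0.0])])
T, mean, Y = isotropic_transform(X)
print(np.round(Y.mean(axis=0), 3))            # ~ [0, 0]
print(np.round(np.cov(Y, rowvar=False), 3))   # ~ identity
```

In the algorithm the transformation is computed from samples and applied before any directions are chosen; it can be inverted at the end to state estimates for the original mixture.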
Given: a mixture F of two Gaussians that is ε-statistically learnable and in isotropic position.
Output: F̂ = ŵ1 F̂1 + ŵ2 F̂2 s.t. |wi − ŵπ(i)|, ||µi − µ̂π(i)||, ||Σi − Σ̂π(i)||_F ≤ ε.

Rough Idea
1 Consider a series of projections down to one dimension.
2 Run a univariate learning algorithm.
3 Use these estimates as constraints in a system of equations.
4 Solve this system to obtain higher-dimensional estimates.

Claim
Proj_r[F1] = N(r^T µ1, r^T Σ1 r, x)

Each univariate estimate yields an approximate linear constraint on the parameters.

Definition
Dp(N(µ1, σ1²), N(µ2, σ2²)) = |µ1 − µ2| + |σ1² − σ2²|

Problem
What if we choose a direction r s.t. Dp(Proj_r[F1], Proj_r[F2]) is extremely small?
Then we would need to run the univariate algorithm with extremely fine precision!

Isotropic Projection Lemma: with high probability, Dp(Proj_r[F1], Proj_r[F2]) is at least polynomially large (i.e. at least poly(ε, 1/n)).

Isotropic Projection Lemma
Suppose F = w1 F1 + w2 F2 is in isotropic position and is ε-statistically learnable:

Lemma (Isotropic Projection Lemma)
With probability ≥ 1 − δ over a randomly chosen direction r,
Dp(Proj_r[F1], Proj_r[F2]) ≥ ε^5 δ² / (50 n²) =: ε3.

[Figure: a mixture F1, F2 in isotropic position, the vector u between the component means, a random direction r, and the projected components Proj_r[F1], Proj_r[F2] together with Proj_r[u].]

Suppose we learn estimates F̂^r, F̂^s from directions r, s.
F̂1^r, F̂1^s each yield constraints on the multidimensional parameters of one Gaussian in F.

Problem
How do we know that they yield constraints on the SAME Gaussian?

[Figure: Searching Nearby Directions — univariate estimates along two directions r and s; which estimate along s corresponds to which estimate along r?]

Pairing Lemma: if we choose directions close enough, then pairing becomes easy ("close enough" depends on the Isotropic Projection Lemma).

Pairing Lemma
Suppose ||r − s|| ≤ ε2 (for ε2 ≪ ε3).
We still assume F = w1 F1 + w2 F2 is in isotropic position and is ε-statistically learnable.

Lemma (Pairing Lemma)
Dp(Proj_r[Fi], Proj_s[Fi]) ≤ O(ε2/ε3) ≪ ε3

[Figure: Searching Nearby Directions — for nearby directions r and s, each projected component along r is Dp-close to exactly one projected component along s, so the estimates can be paired.]

Each univariate estimate yields a linear constraint on the parameters:
Proj_r[F1] = N(r^T µ1, r^T Σ1 r)
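To illustrate how such constraints combine across directions (steps 3 and 4 of the Rough Idea), here is a least-squares sketch of my own, not the talk's exact procedure: it assumes we are handed the paired univariate parameters (r^T µ, r^T Σ r) of a single component along many directions — which is exactly what the univariate algorithm plus the Pairing Lemma are meant to supply — and recovers µ and Σ from them.

```python
# Recover mu and Sigma of one component from noisy one-dimensional constraints
# r^T mu = m_r and r^T Sigma r = v_r over many unit directions r.
import numpy as np

def recover_mean_and_covariance(directions, proj_means, proj_vars, n):
    R = np.asarray(directions, dtype=float)                       # shape (m, n)
    mu_hat, *_ = np.linalg.lstsq(R, np.asarray(proj_means, dtype=float), rcond=None)
    # r^T Sigma r is linear in the entries of Sigma: parametrize the symmetric
    # matrix by its upper triangle, counting off-diagonal entries twice.
    iu = np.triu_indices(n)
    weights = np.where(iu[0] == iu[1], 1.0, 2.0)
    design = np.array([np.outer(r, r)[iu] * weights for r in R])
    sigma_vec, *_ = np.linalg.lstsq(design, np.asarray(proj_vars, dtype=float), rcond=None)
    Sigma_hat = np.zeros((n, n))
    Sigma_hat[iu] = sigma_vec
    Sigma_hat = Sigma_hat + Sigma_hat.T - np.diag(np.diag(Sigma_hat))
    return mu_hat, Sigma_hat

# Noisy projected parameters of a single 3-dimensional Gaussian component,
# standing in for the output of the univariate learning algorithm.
rng = np.random.default_rng(2)
n, m, noise = 3, 40, 1e-3
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)
R = rng.normal(size=(m, n))
R /= np.linalg.norm(R, axis=1, keepdims=True)
proj_means = R @ mu + noise * rng.normal(size=m)
proj_vars = np.einsum('mi,ij,mj->m', R, Sigma, R) + noise * rng.normal(size=m)
mu_hat, Sigma_hat = recover_mean_and_covariance(R, proj_means, proj_vars, n)
print(np.round(mu_hat - mu, 3))        # small recovery error in the mean
print(np.round(Sigma_hat - Sigma, 3))  # small recovery error in the covariance
```

How the noise in the univariate parameters is amplified in µ̂ and Σ̂ is governed by the conditioning of these least-squares systems, which is precisely the next question.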
Problem
What is the condition number of this system? (i.e. how do errors in the univariate estimates translate to errors in the multidimensional estimates?)

Recovery Lemma: the condition number is polynomially bounded: O(n²/ε2).

[Figure: univariate estimates along nearby directions r and s, with the relevant scales ε2 and ε3 marked, being combined into multidimensional estimates.]

[Figure: overall algorithm — from sampling access SA(F̃), compute the isotropizing transformation T̃, apply T̃, learn additive estimates of the isotropic mixture, and then cluster.]

Using Additive Estimates to Cluster
[Figure: additive estimates F̂1, F̂2 of a well-separated mixture F1, F2; projecting onto a direction r separates Proj_r[F1] from Proj_r[F2], so samples can be clustered.]

Question
Can we learn an additive approximation in one dimension?

How many free parameters are there? µ1, σ1², µ2, σ2², w1.
Additionally, each parameter is bounded:

Claim
1 w1, w2 ∈ [ε, 1]
2 |µ1|, |µ2| ≤ 1/√ε
3 σ1², σ2² ≤ 1/ε
In this case, we call the parameters ε-bounded.

So we can use a grid search over µ̂1 × σ̂1² × µ̂2 × σ̂2² × ŵ1.

Question
How do we test if a candidate set of parameters is accurate?
1 Compute the empirical moments for r ∈ {1, 2, ..., 6}: M̃r = (1/|S|) Σ_{i∈S} x_i^r.
2 Compute the analytical moments Mr(F̂) = E_{x←F̂}[x^r] for r ∈ {1, 2, ..., 6}, where F̂ = ŵ1 N(µ̂1, σ̂1², x) + ŵ2 N(µ̂2, σ̂2², x).
3 Accept if M̃r ≈ Mr(F̂) for all r ∈ {1, 2, ..., 6}.
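Here is a minimal sketch of the acceptance test just described; the sample size, tolerance, and example candidates are my own illustrative choices — the talk's analysis is what dictates how fine the grid and how tight the tolerance must be in terms of ε.

```python
# Accept a candidate (w1, mu1, var1, mu2, var2) if its first six analytical raw
# moments match the empirical moments of the sample up to a tolerance.
import numpy as np
from scipy.stats import norm

def mixture_moment(w1, mu1, var1, mu2, var2, r):
    # E_{x<-F^}[x^r]; scipy's .moment(r) is the r-th raw (non-central) moment.
    return (w1 * norm(loc=mu1, scale=np.sqrt(var1)).moment(r)
            + (1 - w1) * norm(loc=mu2, scale=np.sqrt(var2)).moment(r))

def accept(candidate, empirical_moments, tol):
    w1, mu1, var1, mu2, var2 = candidate
    return all(abs(empirical_moments[r] - mixture_moment(w1, mu1, var1, mu2, var2, r)) <= tol
               for r in range(1, 7))

rng = np.random.default_rng(3)
true = (0.3, -1.0, 1.0, 1.5, 0.4)
w1, mu1, var1, mu2, var2 = true
n = 500_000
z = rng.random(n) < w1
x = np.where(z, rng.normal(mu1, np.sqrt(var1), n), rng.normal(mu2, np.sqrt(var2), n))
emp = {r: np.mean(x ** r) for r in range(1, 7)}

# A generous tolerance for this sample size, since the higher empirical
# moments have larger sampling error.
print(accept(true, emp, tol=2.0))                        # True: moments match
print(accept((0.3, -1.0, 1.0, 2.5, 0.4), emp, tol=2.0))  # False: wrong mu2
```

The full algorithm would run this test over a fine grid of ε-bounded candidates and keep any candidate that passes.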
Definition
The pair F, F̂ is ε-standard if
1 the parameters of F, F̂ are ε-bounded
2 Dp(F1, F2), Dp(F̂1, F̂2) ≥ ε
3 ε ≤ min_π Σ_i ( |wi − ŵπ(i)| + Dp(Fi, F̂π(i)) )

Theorem
There is a constant c > 0 such that, for any ε < c and any ε-standard F, F̂,
max_{r∈{1,2,...,6}} |Mr(F) − Mr(F̂)| ≥ ε^67.

Method of Moments
[Figure: the densities F1(x), F2(x), F̂1(x), F̂2(x), the difference f(x) = F(x) − F̂(x), and a polynomial p(x) chosen to agree in sign with f(x).]

Question
Why does this imply one of the first six moments of F, F̂ is different?

Since f(x) = F(x) − F̂(x) is a linear combination of four Gaussians, it has at most six zero crossings (see the Proposition below), so we can choose a nonzero polynomial p(x) = Σ_r p_r x^r of degree at most six that agrees in sign with f(x) everywhere (the constant term can be dropped, since ∫ f(x) dx = 0). Then

0 < ∫_x p(x) f(x) dx = Σ_{r=1}^{6} p_r ∫_x x^r f(x) dx ≤ Σ_{r=1}^{6} |p_r| |Mr(F) − Mr(F̂)|.

So there exists r ∈ {1, 2, ..., 6} s.t. |Mr(F) − Mr(F̂)| > 0.

Proposition
Let f(x) = Σ_{i=1}^{k} αi N(µi, σi², x) be a linear combination of k Gaussians (the αi can be negative). Then if f(x) is not identically zero, f(x) has at most 2k − 2 zero crossings.

Theorem (Hummel, Gidas)
Given f(x) : ℝ → ℝ that is analytic and has n zeros, then for any σ² > 0, the function g(x) = f(x) ∗ N(0, σ², x) has at most n zeros.

Convolving by a Gaussian does not increase the number of zero crossings!

Fact
N(0, σ1², x) ∗ N(0, σ2², x) = N(0, σ1² + σ2², x)
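A small numerical illustration (my own, not a proof) of the Proposition and the Fact: count the sign changes of a linear combination of k = 4 Gaussians on a fine grid, and of its convolution with N(0, τ²) — which, by the Fact, is the same combination with every variance increased by τ².

```python
# Count zero crossings of f(x) = sum_i alpha_i N(mu_i, var_i, x) and of its
# Gaussian smoothing; expect at most 2k - 2 crossings, and no increase.
import numpy as np
from scipy.stats import norm

def combo(xs, alphas, mus, vars_):
    return sum(a * norm.pdf(xs, loc=m, scale=np.sqrt(v))
               for a, m, v in zip(alphas, mus, vars_))

def zero_crossings(ys, eps=1e-12):
    signs = np.sign(np.where(np.abs(ys) < eps, 0.0, ys))
    signs = signs[signs != 0]                 # ignore (near-)zeros on the grid
    return int(np.sum(signs[1:] != signs[:-1]))

xs = np.linspace(-15, 15, 400_001)
alphas = [1.0, -0.8, 0.5, -0.7]               # k = 4, so at most 2k - 2 = 6 crossings
mus    = [-3.0, -1.0, 1.5, 4.0]
vars_  = [0.5, 1.0, 0.8, 1.2]
tau2   = 2.0

f = combo(xs, alphas, mus, vars_)
g = combo(xs, alphas, mus, [v + tau2 for v in vars_])   # f convolved with N(0, tau^2)
print(zero_crossings(f), zero_crossings(g))             # both <= 6, and the second <= the first
```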
Zero Crossings and the Heat Equation
[Figure: a linear combination of Gaussians smoothed by convolution with Gaussians of increasing variance, as in the heat equation; the number of zero crossings never increases.]

[Figure: three components F1, F2, F3 in isotropic position projected onto directions r and s; along a single direction, two of the projected components may nearly coincide.]

Generalized Isotropic Projection Lemma

Lemma (Generalized Isotropic Projection Lemma)
With probability ≥ 1 − δ over a randomly chosen direction r, for all i ≠ j, Dp(Proj_r[Fi], Proj_r[Fj]) ≥ ε3.
FALSE!

Lemma (Generalized Isotropic Projection Lemma)
With probability ≥ 1 − δ over a randomly chosen direction r, there exist i ≠ j such that Dp(Proj_r[Fi], Proj_r[Fj]) ≥ ε3.

Windows
[Figure: the projection onto a direction r is partitioned into windows, each yielding a candidate univariate sub-mixture F̂a, F̂b, ..., F̂g.]

Pairing Lemma, Part 2?
[Figure: for k components, pairing the univariate estimates from nearby directions r and s is no longer straightforward.]

Thanks!