Gaussian Distribution
Sargur N. Srihari

The Gaussian Distribution (Carl Friedrich Gauss, 1777-1855)
• For a single real-valued variable x,
  N(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2 \right\}
• Parameters:
  – Mean µ and variance σ²
  – Standard deviation σ; precision β = 1/σ²
  – E[x] = µ, var[x] = σ²
  – 68% of the data lies within σ of the mean, 95% within 2σ
• For a D-dimensional vector x, the multivariate Gaussian is
  N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}
  where µ is the mean vector, Σ is a D×D covariance matrix, and |Σ| is the determinant of Σ. Σ⁻¹ is also referred to as the precision matrix.

Covariance Matrix
• Gives a measure of the dispersion of the data
• It is a D×D matrix
  – The element in position i, j is the covariance between the ith and jth variables
• The covariance between two variables xi and xj is defined as E[(xi − µi)(xj − µj)]
• It can be positive or negative
  – If the variables are independent the covariance is zero; all elements are then zero except the diagonal elements, which are the variances

Importance of the Gaussian
• The Gaussian arises in many different contexts, e.g.
  – For a single variable, the Gaussian maximizes entropy for a given mean and variance
  – The sum of a set of random variables becomes increasingly Gaussian
• Illustration: histogram of one variable uniform over [0, 1]; histogram of the mean of two such variables; histogram of the mean of ten variables. The two values could be 0.8 and 0.2, whose average is 0.5; there are more ways of obtaining 0.5 than, say, 0.1, so the distribution of the mean concentrates and becomes increasingly Gaussian.

Geometry of the Gaussian
• Two-dimensional Gaussian, x = (x1, x2)
• The functional dependence of the Gaussian on x is through
  \Delta^2 = (x-\mu)^T \Sigma^{-1} (x-\mu)
  – Called the Mahalanobis distance; it reduces to the Euclidean distance when Σ is the identity matrix
• The matrix Σ is symmetric, with eigenvector equation Σuᵢ = λᵢuᵢ, where the uᵢ are eigenvectors and the λᵢ are eigenvalues
• The elliptical contour of constant density (red in the figure) has its major axes along the eigenvectors uᵢ

Contours of Constant Density
• Determined by the covariance matrix; covariances represent how features vary together
  (a) General covariance: arbitrarily oriented ellipses
  (b) Diagonal covariance matrix: ellipses aligned with the coordinate axes
  (c) Covariance proportional to the identity matrix: concentric circles

Joint Gaussian implies that Marginal and Conditional are Gaussian
• If two sets of variables xa, xb are jointly Gaussian, then the two conditional densities and the two marginals are also Gaussian
• Given the joint Gaussian N(x | µ, Σ) with precision Λ = Σ⁻¹ and x = [xa, xb]ᵀ, where xa are the first m components of x and xb are the remaining D − m components
• Conditionals:
  p(x_a \mid x_b) = N(x_a \mid \mu_{a|b}, \Lambda_{aa}^{-1}), \qquad \mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b - \mu_b)
• Marginals:
  p(x_a) = N(x_a \mid \mu_a, \Sigma_{aa}), \qquad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}
• (Figures: joint p(xa, xb), marginal p(xa) and conditional p(xa | xb))

If the Marginals are Gaussian, the Joint need not be Gaussian
• Constructing such a joint pdf:
  – Consider a 2-D Gaussian with zero-mean, uncorrelated random variables x and y
  – Due to symmetry about the x- and y-axes, each marginal can be written by integrating over only two opposite (hatched) quadrants
  – Take the original 2-D Gaussian, set it to zero over the non-hatched quadrants, and multiply the remainder by 2: the marginals are unchanged, but the resulting 2-D pdf is definitely NOT Gaussian
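A minimal numerical sketch (not from the slides; it assumes NumPy and SciPy, and the parameter values are purely illustrative) that evaluates the multivariate Gaussian density defined above through the squared Mahalanobis distance and checks the result against scipy.stats:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) via the squared Mahalanobis distance."""
    D = len(mu)
    diff = x - mu
    Sigma_inv = np.linalg.inv(Sigma)            # the precision matrix
    maha_sq = diff @ Sigma_inv @ diff           # Delta^2 = (x - mu)^T Sigma^-1 (x - mu)
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * maha_sq)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                  # symmetric, positive definite
x = np.array([0.5, 0.5])

print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should agree
```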
Maximum Likelihood for the Gaussian
• Given a data set X = (x1,..,xN)ᵀ in which the observations {xn} are drawn independently
• The log-likelihood function is
  \ln p(X \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu)
• The derivative with respect to µ is
  \frac{\partial}{\partial\mu}\ln p(X \mid \mu, \Sigma) = \sum_{n=1}^{N}\Sigma^{-1}(x_n-\mu)
  whose solution is
  \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n
• Maximization with respect to Σ is more involved and yields
  \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^T

Bias of the Maximum Likelihood Estimate of the Covariance Matrix
• For N(µ, Σ), the m.l.e. of Σ from samples x1,..,xN is
  \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^T
  i.e. the arithmetic average of the N matrices (xn − µML)(xn − µML)ᵀ
• Taking the expectation gives
  E[\Sigma_{ML}] = \frac{N-1}{N}\Sigma
  – The m.l.e. is smaller than the true value of Σ, so it is biased: irrespective of the number of samples, its expectation does not equal Σ
  – For large N this is inconsequential
• Rule of thumb: divide by N when the mean is known and by N − 1 when the mean is estimated
• The bias does not arise in the Bayesian solution

Sequential Estimation
• In on-line applications and with large data sets, batch processing of all data points is infeasible
  – In a real-time learning scenario a steady stream of data is arriving and predictions must be made before all data is seen
• Sequential methods allow data points to be processed one at a time and then discarded
  – Sequential learning also arises naturally with the Bayesian viewpoint
• The m.l.e. for the parameters of a Gaussian gives a convenient opportunity for a more general discussion of sequential estimation for maximum likelihood

Sequential Estimation of the Gaussian Mean
• Dissecting out the contribution of the final data point,
  \mu_{ML}^{(N)} = \frac{1}{N}\sum_{n=1}^{N}x_n = \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right)
• Same as the earlier batch result, with a nice interpretation:
  – After observing N − 1 data points we have estimated µ by µML^(N−1)
  – We now observe data point xN and obtain a revised estimate by moving the old estimate a small amount in the direction of the error
  – As N increases, the contribution from successive points gets smaller

General Sequential Estimation
• A sequential algorithm cannot always be factored out of the batch result in this way
• Robbins and Monro (1951) gave a general solution
• Consider a pair of random variables θ and z with joint distribution p(z, θ). The conditional expectation of z given θ,
  f(\theta) = E[z \mid \theta] = \int z\, p(z \mid \theta)\, dz,
  is called a regression function
  – The same function that minimizes the expected squared loss seen earlier
• It can be shown that finding the maximum likelihood solution is equivalent to finding the root of the regression function
  – The goal is to find θ* at which f(θ*) = 0

Robbins-Monro Algorithm
• Defines a sequence of successive estimates of the root θ* as follows:
  \theta^{(N)} = \theta^{(N-1)} + a_{N-1}\, z\!\left(\theta^{(N-1)}\right)
  where z(θ^(N)) is the observed value of z when θ takes the value θ^(N)
• The coefficients {aN} satisfy the reasonable conditions
  \lim_{N\to\infty} a_N = 0, \qquad \sum_{N=1}^{\infty} a_N = \infty, \qquad \sum_{N=1}^{\infty} a_N^2 < \infty
• The maximum likelihood solution has this form, with z involving a derivative of p(x | θ) with respect to θ
• The sequential estimate of the Gaussian mean is a special case of Robbins-Monro
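The sequential update of the Gaussian mean is easy to verify numerically. A small sketch (assuming NumPy; the data are synthetic and the names illustrative) comparing the one-pass recursion µML^(N) = µML^(N−1) + (1/N)(xN − µML^(N−1)) with the batch mean:

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.normal(loc=0.8, scale=0.3, size=1000)   # synthetic data stream

mu = 0.0
for n, x_n in enumerate(stream, start=1):
    mu += (x_n - mu) / n        # move the old estimate toward x_n by 1/n of the error

print(mu)                # sequential estimate after seeing all points
print(stream.mean())     # batch ML estimate: identical up to floating-point error
```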
Bayesian Inference for the Gaussian
• The MLE framework gives point estimates for the parameters µ and Σ
• The Bayesian treatment introduces prior distributions over the parameters
• Case of known variance, unknown mean:
• The likelihood of N observations X = {x1,..,xN} is
  p(X \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2 \right\}
• The likelihood function is not a probability distribution over µ and is not normalized
• Note that, viewed as a function of µ, the exponent of the likelihood is quadratic in µ
• Thus if we choose a prior p(µ) which is Gaussian, it is a conjugate distribution for this likelihood, because the product of the two exponentials of quadratics is again Gaussian:
  p(\mu) = N(\mu \mid \mu_0, \sigma_0^2)

Bayesian Inference: Mean of the Gaussian
• Given the Gaussian prior p(µ) = N(µ | µ0, σ0²), the posterior is p(µ | X) ∝ p(X | µ) p(µ), which simplifies to
  p(\mu \mid X) = N(\mu \mid \mu_N, \sigma_N^2)
  where
  \mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
• Prior and posterior have the same form: conjugacy
• If N = 0 the posterior mean reduces to the prior mean; as N → ∞ the posterior mean approaches the ML solution
• Precision is additive: the posterior precision is the precision of the prior plus one contribution of the data precision from each observed data point
• (Figure: posteriors for data points drawn from a Gaussian with mean 0.8 and known variance 0.1)

Bayesian Inference of the Variance
• Known mean; we wish to infer the variance
• The analysis is simplified if we choose a conjugate form for the prior distribution
• The likelihood function in terms of the precision λ = 1/σ² is
  p(X \mid \lambda) = \prod_{n=1}^{N} N(x_n \mid \mu, \lambda^{-1}) \;\propto\; \lambda^{N/2}\exp\left\{ -\frac{\lambda}{2}\sum_{n=1}^{N}(x_n-\mu)^2 \right\}
• The conjugate prior is given by the Gamma distribution

Gamma Distribution
• Gaussian with known mean but unknown variance
• The conjugate prior for the precision λ = 1/σ² of a Gaussian is the Gamma distribution
  \text{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^a \lambda^{a-1} \exp(-b\lambda), \qquad \Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du
• Mean and variance: E[λ] = a/b, var[λ] = a/b²
• (Figure: Gam(λ | a, b) for various values of a and b)

Gamma Distribution Inference
• Given the prior distribution Gam(λ | a0, b0) and multiplying by the likelihood function, the posterior distribution has the form Gam(λ | aN, bN), where
  a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^2 = b_0 + \frac{N}{2}\sigma_{ML}^2
• The effect of N observations is to increase a by N/2, so 2a0 can be interpreted as the number of effective prior observations

Both Mean and Variance Unknown
• Consider the dependence of the likelihood function on µ and λ:
  p(X \mid \mu, \lambda) = \prod_{n=1}^{N} \left(\frac{\lambda}{2\pi}\right)^{1/2} \exp\left\{ -\frac{\lambda}{2}(x_n-\mu)^2 \right\}
• We seek a prior distribution p(µ, λ) that has the same functional dependence on µ and λ as the likelihood function
• The normalized prior takes the form
  p(\mu, \lambda) = N\!\left(\mu \mid \mu_0, (\beta\lambda)^{-1}\right) \text{Gam}(\lambda \mid a, b)
  – Called the normal-gamma or Gaussian-gamma distribution
• (Figure: contour plot of the normal-gamma with µ0 = 0, β = 2, a = 5 and b = 6)

Estimation for the Multivariate Case
• For a multivariate Gaussian N(x | µ, Λ⁻¹) over a D-dimensional variable x:
  – The conjugate prior for the mean µ, assuming known precision, is Gaussian
  – For known mean and unknown precision matrix Λ, the conjugate prior is the Wishart distribution
  – If both mean and precision are unknown, the conjugate prior is the Gaussian-Wishart distribution
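A brief sketch (assuming NumPy; the prior parameters and the data are invented for illustration) of the conjugate posterior update for the mean with known variance, using the µN and σN² formulas above:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.1                                   # known data variance
data = rng.normal(loc=0.8, scale=np.sqrt(sigma2), size=10)

mu0, sigma0_2 = 0.0, 1.0                       # Gaussian prior N(mu | mu0, sigma0^2)
N = len(data)
mu_ml = data.mean()

# Posterior p(mu | X) = N(mu | mu_N, sigma_N^2)
mu_N = (sigma2 * mu0 + N * sigma0_2 * mu_ml) / (N * sigma0_2 + sigma2)
sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)   # precisions add

print(mu_N, sigma_N2)   # the posterior mean moves from mu0 toward mu_ml as N grows
```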
Student's t-Distribution
• The conjugate prior for the precision of a Gaussian is the Gamma distribution
• If we have a univariate Gaussian N(x | µ, τ⁻¹) together with a Gamma prior Gam(τ | a, b) and integrate out the precision, we obtain the marginal distribution of x:
  p(x \mid \mu, a, b) = \int_0^\infty N(x \mid \mu, \tau^{-1})\, \text{Gam}(\tau \mid a, b)\, d\tau
• This has the form
  \text{St}(x \mid \mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2 - 1/2}
• The parameter ν = 2a is called the degrees of freedom and λ = a/b the precision of the t-distribution; as ν → ∞ it becomes a Gaussian
• It is an infinite mixture of Gaussians with the same mean but different precisions
  – Obtained by adding many Gaussian distributions
  – The result has longer tails than a Gaussian

Robustness of Student's t
• Has longer tails than the Gaussian
• The ML solution can be found by the expectation-maximization algorithm
• The effect of a small number of outliers is much less on the t-distribution
• A multivariate form of the t-distribution can also be obtained
• (Figure: maximum likelihood fits using the t-distribution and the Gaussian; the Gaussian is strongly distorted by the outliers)

Periodic Variables
• The Gaussian is inappropriate for continuous variables that are periodic or angular
  – Wind direction on several days
  – Calendar time
  – Fingerprint minutiae direction
• If we choose a standard Gaussian, the results depend on the choice of origin
  – With 0° as origin, two observations θ1 = 1° and θ2 = 359° have mean 180° and standard deviation 179°
  – With 180° as origin, the mean is 0° and the standard deviation 1°
• Such a quantity is represented by a polar (angular) coordinate 0 ≤ θ < 2π

Conversion to Polar Coordinates
• Observations D = {θ1,..,θN}, with θ measured in radians
• Viewed as points on the unit circle
  – Represent the data as 2-d vectors x1,..,xN with ||xn|| = 1; in Cartesian coordinates xn = (cos θn, sin θn)
  – Sample mean \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n, with coordinates \bar{x} = (\bar{r}\cos\bar{\theta}, \bar{r}\sin\bar{\theta})
• Simplifying and solving gives
  \bar{\theta} = \tan^{-1}\left\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \right\}

Von Mises Distribution
• Periodic generalization of the Gaussian: p(θ) must satisfy
  p(\theta) \ge 0, \qquad \int_0^{2\pi} p(\theta)\, d\theta = 1, \qquad p(\theta + 2\pi) = p(\theta)
• Consider a 2-d Gaussian with x = (x1, x2), µ = (µ1, µ2), Σ = σ²I:
  p(x_1, x_2) = \frac{1}{2\pi\sigma^2} \exp\left\{ -\frac{(x_1-\mu_1)^2 + (x_2-\mu_2)^2}{2\sigma^2} \right\}
  – Contours of constant density are circles
• Transform to polar coordinates (r, θ) with x1 = r cos θ, x2 = r sin θ and µ1 = r0 cos θ0, µ2 = r0 sin θ0
• Defining m = r0/σ², the distribution of θ along the unit circle is the von Mises distribution
  p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{ m\cos(\theta - \theta_0) \}
  where I_0(m) = \frac{1}{2\pi}\int_0^{2\pi} \exp\{ m\cos\theta \}\, d\theta is the zeroth-order Bessel function of the first kind

Von Mises Plots
• (Figures: Cartesian and polar plots for two different parameter values; for large m the distribution is approximately Gaussian; plot of the Bessel function I0(m))

ML Estimates of the von Mises Parameters
• The parameters are θ0 and m
• The log-likelihood function is
  \ln p(D \mid \theta_0, m) = -N\ln(2\pi) - N\ln I_0(m) + m\sum_{n=1}^{N}\cos(\theta_n - \theta_0)
• Setting the derivative with respect to θ0 equal to zero gives
  \theta_0^{ML} = \tan^{-1}\left\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \right\}
• Maximizing with respect to m gives a solution for A(m_ML), which can be inverted to obtain m_ML

Mixtures of Gaussians
• A single Gaussian has limitations in modeling real data sets
• Old Faithful (hydrothermal geyser in Yellowstone)
  – 272 observations: eruption duration (minutes, horizontal axis) vs time to next eruption (vertical axis)
  – A single Gaussian is unable to capture the structure; a linear superposition of two Gaussians is better
• Linear combinations of Gaussians can give very complex densities:
  p(x) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \Sigma_k)
  where the πk are mixing coefficients that sum to one
• One-dimensional example: three Gaussians (blue) and their sum (red)

Mixture of 3 Gaussians
• (Figures: contours of constant density for the components, contours of the mixture density p(x), and a surface plot of p(x))

Estimation for Gaussian Mixtures
• The log-likelihood function is
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln\left\{ \sum_{k=1}^{K} \pi_k\, N(x_n \mid \mu_k, \Sigma_k) \right\}
• The situation is now more complex than for a single Gaussian: there is no closed-form solution
• Use either iterative numerical optimization techniques or Expectation Maximization
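A compact sketch (NumPy/SciPy; the two-component parameters are made-up examples, not fitted values) that evaluates the mixture density p(x) = Σk πk N(x | µk, Σk) and the corresponding log-likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component mixture in two dimensions
pis = np.array([0.6, 0.4])                               # mixing coefficients, sum to one
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.5]])]

def mixture_density(X):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k), evaluated for each row of X (shape N x 2)."""
    return sum(pi * multivariate_normal(mean=mu, cov=S).pdf(X)
               for pi, mu, S in zip(pis, mus, Sigmas))

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))
print(mixture_density(X))                  # mixture density at each point
print(np.sum(np.log(mixture_density(X))))  # log-likelihood ln p(X | pi, mu, Sigma)
```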