Gaussian Distribution

Sargur N. Srihari
The Gaussian Distribution
Carl Friedrich Gauss
1777-1855
•  For single real-valued variable x
N(x | µ, σ2) = [1/(2πσ2)1/2] exp{ −(x − µ)2 / 2σ2 }
•  Parameters:
–  Mean µ, variance σ2
–  Standard deviation σ
–  Precision β = 1/σ2, E[x] = µ, var[x] = σ2
–  68% of data lies within σ of the mean, 95% within 2σ
•  For D-dimensional vector x, multivariate Gaussian
N(x | µ, Σ) = [1/((2π)D/2 |Σ|1/2)] exp{ −(1/2)(x − µ)T Σ−1 (x − µ) }
µ is a mean vector, Σ is a D x D covariance matrix, |Σ| is the determinant of Σ
Σ-1 is also referred to as the precision matrix
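•  To make the density concrete, here is a minimal NumPy sketch (not from the slides) that evaluates the multivariate Gaussian formula above; the example mean and covariance are arbitrary illustrative values.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) for a D-dimensional point x."""
    D = mu.shape[0]
    diff = x - mu
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu)
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

# Illustrative 2-D example (values are arbitrary)
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```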
Covariance Matrix
•  Gives a measure of the dispersion of the data
•  It is a D x D matrix
–  Element in position i,j is the covariance between
the ith and jth variables.
•  Covariance between two variables xi and xj is defined as
E[(xi − µi)(xj − µj)]
•  Can be positive or negative
–  If the variables are independent then the
covariance is zero.
•  Then all matrix elements are zero except diagonal
elements which represent the variances
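•  As a quick illustration (assumption: synthetic data, not from the slides), the sample covariance matrix can be computed with NumPy; the diagonal holds the variances and the off-diagonal entries the covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 samples of a 3-dimensional variable; columns are the variables
X = rng.normal(size=(500, 3))
X[:, 2] += 0.8 * X[:, 0]          # introduce correlation between variables 0 and 2

C = np.cov(X, rowvar=False)       # D x D covariance matrix
print(np.diag(C))                 # variances
print(C[0, 2])                    # covariance between variables 0 and 2
```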
Importance of Gaussian
•  Gaussian arises in many different contexts, e.g.,
–  For a single variable, the Gaussian maximizes entropy (for a given mean and variance)
–  The sum of a set of random variables becomes increasingly Gaussian (central limit theorem)
•  Figure: histograms of the mean of N variables uniform over [0,1], for N = 1, 2 and 10
–  With two variables the values could be 0.8 and 0.2, whose average is 0.5; there are more ways of getting 0.5 than, say, 0.1, so the distribution concentrates around the mean
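•  A small NumPy sketch (illustrative, not from the slides) of the effect: averaging more uniform variables makes the distribution of the mean look increasingly Gaussian and increasingly concentrated.

```python
import numpy as np

rng = np.random.default_rng(0)
for n_vars in (1, 2, 10):
    # 10,000 samples of the mean of n_vars uniform [0,1] variables
    means = rng.uniform(size=(10_000, n_vars)).mean(axis=1)
    # Spread of the mean shrinks roughly as 1/sqrt(12 * n_vars)
    print(n_vars, means.mean().round(3), means.std().round(3))
```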
Geometry of Gaussian
Figure: two-dimensional Gaussian over x = (x1, x2)
•  Functional dependence of Gaussian on x is through
Δ2 = (x − µ)T Σ−1(x − µ)
–  Called the Mahalanobis distance
–  Reduces to Euclidean distance when Σ is the identity matrix
•  Matrix Σ is symmetric
–  Has an eigenvector equation Σui = λiui
ui are eigenvectors, λi are eigenvalues
Red: Elliptical contour of constant density
Major axes: eigenvectors ui
Contours of Constant Density
•  Determined by the covariance matrix
–  Covariances represent how features vary together
•  Figure panels:
(a) General covariance matrix
(b) Diagonal covariance matrix (contours aligned with coordinate axes)
(c) Covariance proportional to the identity matrix (concentric circles)
Joint Gaussian implies that Marginal and Conditional are Gaussian
•  If two sets of variables xa,xb are
jointly Gaussian then the two
conditional densities and the two
marginals are also Gaussian
•  Given joint Gaussian N(x|µ,Σ) with
Λ=Σ-1 and x = [xa,xb]T where xa are
first m components of x and xb are
next D-m components
•  Conditionals
p ( x a | x b ) = N ( x | µ a|b , Λ−aa1 ) where µ a|b = µ a − Λ−aa1 Λ ab ( x b − µb )
Figure: joint p(xa, xb), marginal p(xa) and conditional p(xa|xb)
•  Marginals
p(xa) = N(xa | µa, Σaa)  where  Σ = [[Σaa, Σab], [Σba, Σbb]]
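•  A minimal NumPy sketch (illustrative values assumed, not from the slides) of the conditional-Gaussian formulas above, computing µa|b and Λaa−1 from a partitioned precision matrix.

```python
import numpy as np

# Partitioned joint Gaussian: first m components are x_a, the rest are x_b
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m = 1                                   # dimension of x_a
Lambda = np.linalg.inv(Sigma)           # precision matrix
Laa, Lab = Lambda[:m, :m], Lambda[:m, m:]

xb = np.array([0.5, 1.0])               # observed x_b
# Conditional p(x_a | x_b) = N(mu_cond, inv(Laa))
mu_cond = mu[:m] - np.linalg.solve(Laa, Lab @ (xb - mu[m:]))
cov_cond = np.linalg.inv(Laa)
print(mu_cond, cov_cond)

# Marginal p(x_a) = N(mu_a, Sigma_aa)
print(mu[:m], Sigma[:m, :m])
```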
If Marginals are Gaussian, Joint need not be Gaussian
•  Constructing such a joint pdf:
–  Consider a 2-D Gaussian with zero-mean, uncorrelated random variables x and y
–  Due to symmetry about the x- and y-axes, we can write the marginals by integrating only over the hatched regions (see figure)
–  Take the original 2-D Gaussian, set it to zero over the non-hatched quadrants and multiply the remainder by 2
–  The marginals are unchanged, but the resulting 2-D pdf is definitely NOT Gaussian
Maximum Likelihood for the Gaussian
•  Given a data set X=(x1,..xN)T where the
observations {xn} are drawn independently
•  Log-likelihood function is given by
ln p(X | µ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) ∑n=1..N (xn − µ)T Σ−1 (xn − µ)
•  Derivative w.r.t. µ is
(∂/∂µ) ln p(X | µ, Σ) = ∑n=1..N Σ−1 (xn − µ)
•  Whose solution is
µML = (1/N) ∑n=1..N xn
•  Maximization w.r.t. Σ is more involved. Yields
ΣML = (1/N) ∑n=1..N (xn − µML)(xn − µML)T
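•  A short NumPy sketch (illustrative, not from the slides) of these closed-form maximum likelihood estimates from a data matrix X with one observation per row.

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood estimates of mean and covariance for N(mu, Sigma)."""
    N = X.shape[0]
    mu_ml = X.mean(axis=0)                 # (1/N) sum_n x_n
    diff = X - mu_ml
    Sigma_ml = (diff.T @ diff) / N         # (1/N) sum_n (x_n - mu)(x_n - mu)^T
    return mu_ml, Sigma_ml

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.4], [0.4, 2.0]], size=1000)
mu_ml, Sigma_ml = gaussian_mle(X)
print(mu_ml, Sigma_ml, sep="\n")
```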
Bias of M. L. Estimate of Covariance Matrix
•  For N(µ,Σ), the m.l.e. of Σ for samples x1,..,xN is
ΣML = (1/N) ∑n=1..N (xn − µML)(xn − µML)T
•  This is the arithmetic average of the N matrices (xn − µML)(xn − µML)T
•  Taking the expectation we have
E[ΣML] = ((N−1)/N) Σ
–  m.l.e. is smaller than the true value of Σ
–  Thus m.l.e. is biased
•  Irrespective of the number of samples, it does not give the exact value
–  For large N the bias is inconsequential
•  Rule of thumb: use 1/N for known mean and 1/(N-1) for
estimated mean.
•  Bias does not exist in Bayesian solution.
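•  A quick NumPy check (illustrative, not from the slides) of the 1/N versus 1/(N−1) convention for estimating a variance when the mean is also estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
trials = 20_000
biased, unbiased = 0.0, 0.0
for _ in range(trials):
    x = rng.normal(loc=0.0, scale=1.0, size=N)       # true variance = 1
    mu_ml = x.mean()
    biased += ((x - mu_ml) ** 2).sum() / N            # m.l.e., divisor N
    unbiased += ((x - mu_ml) ** 2).sum() / (N - 1)    # corrected, divisor N-1
# Expect roughly (N-1)/N = 0.8 for the m.l.e. and 1.0 for the corrected estimate
print(biased / trials, unbiased / trials)
```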
Sequential Estimation
•  In on-line applications and with large data sets, batch processing of all data points is infeasible
–  Real-time learning scenario where a steady stream of data is arriving and predictions must be made before all data is seen
•  Sequential methods allow data points to be
processed one-at-a-time and then discarded
–  Sequential learning arises naturally with Bayesian
viewpoint
•  M.L.E. for the parameters of the Gaussian gives a convenient opportunity for a more general discussion of sequential estimation for maximum likelihood
Sequential Estimation of Gaussian Mean
•  By dissecting contribution of final data point
µML(N) = (1/N) ∑n=1..N xn = µML(N−1) + (1/N)(xN − µML(N−1))
•  Same as the earlier batch result
•  Nice interpretation:
–  After observing N−1 data points we have estimated µ by µML(N−1)
–  We now observe data point xN and obtain a revised estimate by moving the old estimate a small amount
–  As N increases contribution from successive points smaller
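•  A minimal Python sketch (not from the slides) of this sequential update; each incoming point nudges the running mean without storing the data.

```python
import numpy as np

rng = np.random.default_rng(0)
data_stream = rng.normal(loc=3.0, scale=1.0, size=1000)

mu = 0.0
for n, x_n in enumerate(data_stream, start=1):
    mu = mu + (x_n - mu) / n         # mu^(N) = mu^(N-1) + (1/N)(x_N - mu^(N-1))
print(mu, data_stream.mean())         # sequential and batch results agree
```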
General Sequential Estimation
•  A sequential algorithm cannot always be obtained by factoring out the contribution of the final data point in this way
•  Robbins and Monro (1951) gave a general solution
•  Consider pair of random variables θ and z with joint
distribution p(z,θ)
•  Conditional expectation of z given θ is
f (θ ) = E[ z | θ ] = ∫ zp ( z | θ )dz
•  Which is called a regression function
–  Same as one that minimizes expected squared loss seen
earlier
•  It can be shown that maximum likelihood solution is
equivalent to finding the root of the regression
function
–  Goal is to find θ* at which f(θ*)=0
Robbins-Monro Algorithm
•  Defines sequence of successive estimates of root θ*
as follows
θ(N) = θ(N−1) + aN−1 z(θ(N−1))
•  Where z(θ(N)) is the observed value of z when θ takes the value θ(N)
•  Coefficients {aN} satisfy the conditions
lim N→∞ aN = 0,   ∑N=1..∞ aN = ∞,   ∑N=1..∞ aN2 < ∞
•  Solution has a form where z involves a derivative of
p(x|θ) wrt θ
•  A special case of Robbins-Monro is the solution for the Gaussian mean
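•  A small Python sketch (illustrative, not from the source) of Robbins-Monro applied to estimating a Gaussian mean: with step size aN−1 = σ2/N and z = (xN − θ)/σ2, the update reduces to the sequential mean formula above.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                                 # known variance
stream = rng.normal(loc=-2.0, scale=np.sqrt(sigma2), size=5000)

theta = 0.0                                  # initial estimate of the mean
for n, x in enumerate(stream, start=1):
    z = (x - theta) / sigma2                 # derivative of log-likelihood wrt theta
    a = sigma2 / n                           # step sizes satisfy the Robbins-Monro conditions
    theta = theta + a * z                    # theta^(N) = theta^(N-1) + a_{N-1} z(theta^(N-1))
print(theta, stream.mean())
```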
Bayesian Inference for the Gaussian
•  MLE framework gives point estimates for
parameters µ and Σ
•  Bayesian treatment introduces prior distributions
over parameters
•  Case of known variance
•  Likelihood of N observations X={x1,..xN} is
p(X | µ) = ∏n=1..N p(xn | µ) = [1/(2πσ2)N/2] exp{ −(1/2σ2) ∑n=1..N (xn − µ)2 }
•  Likelihood function is not a probability distribution
over µ and is not normalized
•  Note that likelihood function is quadratic in µ
Bayesian formulation for Gaussian mean
•  Likelihood function
p(X | µ) = ∏n=1..N p(xn | µ) = [1/(2πσ2)N/2] exp{ −(1/2σ2) ∑n=1..N (xn − µ)2 }
•  Note that likelihood function is quadratic in µ
•  Thus if we choose a prior p(µ) which is Gaussian, it will be a conjugate distribution for the likelihood because the product of the two exponentials will also be a Gaussian
p(µ) = N(µ|µ0,σ02)
Bayesian inference: Mean of Gaussian
•  Given Gaussian prior
p(µ) = N(µ|µ0,σ02)
•  Posterior is given by
p(µ|X) ∝ p(X|µ)p(µ)
•  Simplifies to
P(µ|X) = N(µ|µN,σN2) where
Prior and posterior have the same form: conjugacy
Figure: data points drawn from mean = 0.8 and known variance = 0.1
µN = [σ2/(Nσ02 + σ2)] µ0 + [Nσ02/(Nσ02 + σ2)] µML
–  If N = 0, reduces to the prior mean
–  If N → ∞, the posterior mean is the ML solution
1/σN2 = 1/σ02 + N/σ2
–  Precision is additive: precision of the prior plus one contribution of data precision from each observed data point
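•  A compact NumPy sketch (illustrative numbers, not from the slides) of these posterior updates for the mean with known variance.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.1                       # known data variance
x = rng.normal(loc=0.8, scale=np.sqrt(sigma2), size=10)

mu0, sigma0_sq = 0.0, 1.0          # Gaussian prior N(mu | mu0, sigma0^2)
N = len(x)
mu_ml = x.mean()

# Posterior N(mu | mu_N, sigma_N^2)
mu_N = (sigma2 * mu0 + N * sigma0_sq * mu_ml) / (N * sigma0_sq + sigma2)
sigma_N_sq = 1.0 / (1.0 / sigma0_sq + N / sigma2)
print(mu_N, sigma_N_sq)
```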
Bayesian Inference of the Variance
•  Known Mean
•  Wish to infer variance
•  Analysis simplified if we choose conjugate form
for prior distribution
•  Likelihood function with precision λ = 1/σ2
p(X | λ) = ∏n=1..N N(xn | µ, λ−1) ∝ λN/2 exp{ −(λ/2) ∑n=1..N (xn − µ)2 }
•  Conjugate prior is given by Gamma distribution
Gamma Distribution
•  Gaussian with known mean but
unknown variance
•  Conjugate prior for the
precision of a Gaussian is
given by a Gamma distribution
Figure: Gamma distribution Gam(λ|a,b) for various values of a and b
–  Precision λ = 1/σ2
Gam(λ | a, b) = [1/Γ(a)] ba λa−1 exp(−bλ),  where  Γ(x) = ∫0∞ ux−1 e−u du
–  Mean and variance
E[λ] = a/b,   var[λ] = a/b2
Gamma Distribution Inference
•  Given prior distribution Gam(λ|a0,b0)
•  Multiplying by likelihood function
•  The posterior distribution has the form
Gam(λ|aN,bN) where
aN = a0 + N/2
–  Effect of N observations is to increase a by N/2
–  Interpret a0 in terms of 2a0 effective prior observations
bN = b0 + (1/2) ∑n=1..N (xn − µ)2 = b0 + (N/2) σML2
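•  A brief NumPy sketch (illustrative, not from the slides) of this conjugate update for the precision with known mean.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, true_lambda = 0.0, 2.0                       # known mean, true precision
x = rng.normal(loc=mu, scale=1.0 / np.sqrt(true_lambda), size=50)

a0, b0 = 1.0, 1.0                                # Gam(lambda | a0, b0) prior
N = len(x)
aN = a0 + N / 2.0
bN = b0 + 0.5 * np.sum((x - mu) ** 2)

print(aN / bN)                                   # posterior mean of the precision E[lambda]
```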
Both Mean and Variance are Unknown
•  Consider dependence of likelihood function on µ
and λ
p(X | µ, λ) = ∏n=1..N (λ/2π)1/2 exp{ −(λ/2)(xn − µ)2 }
•  Identify a prior distribution p(µ,λ) that has same
functional dependence on µ and λ as likelihood
function
•  Normalized prior takes the form
p(µ, λ) = N(µ | µ0, (βλ)−1) Gam(λ | a, b)
–  Called normal-gamma or Gaussian-gamma
distribution
Normal Gamma
•  Both mean and
precision unknown
•  Contour plot with
µ0=0, β=2, a=5 and b=6
Estimation for Multivariate Case
•  For a multivariate Gaussian distribution N(x | µ, Λ−1) for a D-dimensional variable x
–  Conjugate prior for mean µ assuming known
precision is Gaussian
–  For known mean and unknown precision matrix Λ,
conjugate prior is Wishart distribution
–  If both mean and precision are unknown conjugate
prior is Gaussian-Wishart
Student’s t-distribution
•  Conjugate prior for precision of Gaussian is given by Gamma
•  If we have a univariate Gaussian N(x|µ,τ -1) together with
Gamma prior Gam(τ|a,b) and we integrate out the precision we
obtain marginal distribution
of x
p(x | µ, a, b) = ∫0∞ N(x | µ, τ−1) Gam(τ | a, b) dτ
•  Has the form
St(x | µ, λ, ν) = [Γ(ν/2 + 1/2)/Γ(ν/2)] (λ/πν)1/2 [1 + λ(x − µ)2/ν]−ν/2−1/2
•  Parameter ν = 2a is called the degrees of freedom and λ = a/b the precision of the t-distribution
•  As ν → ∞ it becomes a Gaussian
•  Infinite mixture of Gaussians with same mean but different
precisions
–  Obtained by adding many Gaussian distributions
–  Result has longer tails than Gaussian
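•  A quick SciPy check (illustrative, not from the slides): integrating out the precision numerically matches a scaled Student's t density with ν = 2a and λ = a/b.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, a, b = 0.0, 2.0, 3.0
nu, lam = 2 * a, a / b                    # degrees of freedom and precision

x = 1.5
# Marginal: integrate N(x | mu, tau^-1) Gam(tau | a, b) over the precision tau
integrand = lambda tau: stats.norm.pdf(x, loc=mu, scale=1/np.sqrt(tau)) \
                        * stats.gamma.pdf(tau, a, scale=1/b)
marginal, _ = quad(integrand, 0, np.inf)

# Same value from the Student's t density with loc=mu, scale=1/sqrt(lam)
print(marginal, stats.t.pdf(x, df=nu, loc=mu, scale=1/np.sqrt(lam)))
```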
Robustness of Student’s t
•  Has longer tails than
Gaussian
•  ML solution can be found
by expectation
maximization algorithm
•  Effect of a small number of outliers is much less on t-distributions
•  Can also obtain multivariate
form of t-distribution
Figure: maximum likelihood fits using t and Gaussian distributions; the Gaussian is strongly distorted by outliers
Periodic Variables
•  Gaussian inappropriate for continuous variables that
are periodic or angular
–  Wind direction on several days
–  Calendar time
–  Fingerprint minutiae direction
•  If we choose a standard Gaussian, results depend on the choice of origin
–  With 0° as origin, two observations θ1 = 1° and θ2 = 359° have mean 180° and std dev 179°
–  With 180° as origin, mean = 0° and std dev = 1°
•  Quantity represented by the polar coordinate 0 ≤ θ < 2π
Conversion to Polar Coords
•  Observations D={θ1,..θN},θ measured in radians
•  Viewed as points on unit circle
–  Represent data as 2-d vectors x1,..xN where ||xn||=1
–  Mean x̄ = (1/N) ∑n=1..N xn
•  Cartesian coords: xn=(cosθn,sinθn)
•  Coords of sample mean: x̄ = (r̄ cos θ̄, r̄ sin θ̄)
•  Simplifying and solving gives
θ̄ = tan−1{ ∑n sin θn / ∑n cos θn }
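•  A small NumPy sketch (not from the slides) of this circular mean; np.arctan2 handles the quadrant so that, e.g., 1° and 359° average to 0° rather than 180°.

```python
import numpy as np

theta = np.deg2rad([1.0, 359.0])                      # observations in radians
x = np.column_stack([np.cos(theta), np.sin(theta)])   # points on the unit circle
mean_vec = x.mean(axis=0)
theta_bar = np.arctan2(mean_vec[1], mean_vec[0])      # tan^{-1}(sum sin / sum cos), quadrant-aware
print(np.rad2deg(theta_bar))                          # -> 0.0 (not 180)
```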
Von Mises Distribution
•  Periodic generalization of
Gaussian
–  p(θ) satisfies:  p(θ) ≥ 0,   ∫0 2π p(θ) dθ = 1,   p(θ + 2π) = p(θ)
•  Consider 2-d Gaussian x= (x1,x2), µ = (µ1,µ2), Σ = σ 2I
p(x1, x2) = [1/(2πσ2)] exp{ −[(x1 − µ1)2 + (x2 − µ2)2] / 2σ2 }
–  Contours of constant density are circles
•  Transforming to polar coordinates (r,θ)
x1 = r cos θ , x2 = r sin θ and µ1 = r0 cosθ 0 , µ 2 = r0sinθ 0
•  Defining m = r0/σ2, the distribution p(θ) along the unit circle is
p(θ | θ0, m) = [1/(2π I0(m))] exp{ m cos(θ − θ0) }   (von Mises distribution)
Where I0(m) is the zeroth-order Bessel function:
I0(m) = (1/2π) ∫0 2π exp{ m cos θ } dθ
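•  A short SciPy/NumPy sketch (illustrative parameters, not from the slides) evaluating this density using the modified Bessel function I0; scipy.stats.vonmises gives the same values.

```python
import numpy as np
from scipy.special import i0
from scipy.stats import vonmises

def von_mises_pdf(theta, theta0, m):
    """p(theta | theta0, m) = exp(m cos(theta - theta0)) / (2 pi I0(m))."""
    return np.exp(m * np.cos(theta - theta0)) / (2 * np.pi * i0(m))

theta0, m = np.pi / 4, 3.0                       # illustrative parameters
theta = np.linspace(0, 2 * np.pi, 5)
print(von_mises_pdf(theta, theta0, m))
print(vonmises.pdf(theta, kappa=m, loc=theta0))  # agrees with the hand-written density
```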
Von Mises Plots
•  Figure: Cartesian and polar plots of the von Mises density for two different parameter values
•  For large m the distribution is approximately Gaussian
•  Zeroth-order Bessel function of the first kind, I0(m)
ML estimates of von Mises parameters
•  Parameters are θ0 and m
•  Log-likelihood function
ln p(D | θ0, m) = −N ln(2π) − N ln I0(m) + m ∑n=1..N cos(θn − θ0)
•  Setting the derivative w.r.t. θ0 equal to zero gives
θ0ML = tan−1{ ∑n sin θn / ∑n cos θn }
•  Maximizing w.r.t. m gives a solution for A(mML) which can be inverted to obtain m
Mixtures of Gaussians
•  Gaussian has limitations in modeling real data sets
•  Old Faithful (hydrothermal geyser in Yellowstone)
–  272 observations
–  Duration (mins, horiz axis) vs Time to next
eruption (vertical axis)
–  Simple Gaussian unable to capture
structure
–  Linear superposition of two Gaussians is
better
•  Linear combinations of Gaussians can give very complex densities
p(x) = ∑k=1..K πk N(x | µk, Σk)
πk are mixing coefficients that sum to one
•  One-dimensional illustration
–  Three Gaussians in blue
–  Their sum in red
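•  A compact NumPy sketch (illustrative parameters, not from the slides) evaluating a one-dimensional mixture density p(x) = ∑k πk N(x | µk, σk2).

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Three components: mixing coefficients sum to one
pis   = np.array([0.5, 0.3, 0.2])
mus   = np.array([-2.0, 0.0, 3.0])
vars_ = np.array([0.5, 1.0, 2.0])

x = np.linspace(-6, 8, 7)
p = sum(pi * gauss(x, mu, v) for pi, mu, v in zip(pis, mus, vars_))
print(p)
```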
Mixture of 3 Gaussians
•  Contours of constant density for the components
•  Contours of the mixture density p(x)
•  Surface plot of the distribution p(x)
Estimation for Gaussian Mixtures
•  Log likelihood function is
ln p(X | π, µ, Σ) = ∑n=1..N ln { ∑k=1..K πk N(xn | µk, Σk) }
•  Situation is more complex
•  No closed form solution
•  Use either iterative numerical optimization
techniques or Expectation Maximization