ECE 8527: Introduction to Machine Learning and Pattern Recognition
LECTURE 05: BAYESIAN ESTIMATION
• Objectives:
Bayesian Estimation
Example
• Resources:
D.H.S.: Chapter 3 (Part 2)
J.O.S.: Bayesian Parameter Estimation
A.K.: The Holy Trinity
A.E.: Bayesian Methods
J.H.: Euro Coin
Introduction to Bayesian Parameter Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the
prior probabilities, P(ωi), and class-conditional densities, p(x|ωi).
• Bayes: treat the parameters as random variables having some known prior
distribution. Observation of samples converts this to a posterior.
• Bayesian learning: sharpen the a posteriori density causing it to peak near
the true value.
• Supervised vs. unsupervised: do we know the class assignments of the
training data?
• Bayesian estimation and ML estimation produce very similar results in many
cases.
• Reduces statistical inference (prior knowledge or beliefs about the world) to
probabilities.
ECE 8527: Lecture 05, Slide 1
Class-Conditional Densities
• Posterior probabilities, P(ωi|x), are central to Bayesian classification.
• Bayes formula allows us to compute P(ωi|x) from the priors, P(ωi), and the
likelihood, p(x|ωi).
• But what if the priors and class-conditional densities are unknown?
• The answer is that we can compute the posterior, P(ωi|x), using all of the
information at our disposal (e.g., training data).
• For a training set, D, Bayes formula becomes:
P(i | x, D) 
likelihood  prior

evidence
p(x i , D) P(i D)
c
 p(x  j , D) P( j D)
j 1
• We assume priors are known: P(ωi|D) = P(ωi).
• Also, assume functional independence: the samples in Di have no influence on p(x|ωj, Dj) if i ≠ j. This gives:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\, P(\omega_j)}$$
ECE 8527: Lecture 05, Slide 2
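• As an illustration of the formula above, here is a minimal sketch in Python. It assumes, purely for illustration, that each class-conditional density p(x|ωj, Dj) has already been estimated as a univariate Gaussian from its own training set Dj; the parameter values and priors in the demo are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def class_posteriors(x, class_params, priors):
    """Evaluate P(w_i | x, D) for each class using Bayes formula.

    class_params: list of (mean, var) pairs estimated from each class's data D_i.
    priors:       list of prior probabilities P(w_i).
    """
    likelihoods = np.array([gaussian_pdf(x, m, v) for (m, v) in class_params])
    joint = likelihoods * np.array(priors)   # likelihood * prior
    return joint / joint.sum()               # normalize by the evidence

# Hypothetical two-class example; parameters assumed already estimated from D_1, D_2.
print(class_posteriors(x=1.2, class_params=[(0.0, 1.0), (2.0, 1.0)], priors=[0.5, 0.5]))
```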
The Parameter Distribution
• Assume the parametric form of the evidence, p(x), is known: p(x|θ).
• Any information we have about θ prior to collecting samples is contained in a
known prior density p(θ).
• Observation of samples converts this to a posterior, p(θ|D), which we hope is
peaked around the true value of θ.
• Our goal is to compute p(x|D), which requires integrating over the parameter vector θ:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x}, \boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

• We can write the joint distribution as a product:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta}, D)\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta} = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

because the samples are drawn independently (once θ is known, x does not depend on D).
• This equation links the class-conditional density p(x|D) to the posterior, p(θ|D). But numerical solutions are typically required!
ECE 8527: Lecture 05, Slide 3
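• Since this integral rarely has a closed form, a numerical sketch can help. The snippet below is illustrative only: it assumes a one-dimensional parameter θ, evaluates the integral on a grid, and uses a hypothetical posterior N(1.0, 0.2²) just to have something to integrate against.

```python
import numpy as np

def predictive_density(x, theta_grid, posterior, likelihood_fn):
    """Approximate p(x|D) = ∫ p(x|θ) p(θ|D) dθ on a discrete grid of θ values.

    theta_grid:    1-D array of θ values.
    posterior:     p(θ|D) evaluated on theta_grid (should integrate to 1).
    likelihood_fn: callable returning p(x|θ).
    """
    integrand = np.array([likelihood_fn(x, th) for th in theta_grid]) * posterior
    return np.trapz(integrand, theta_grid)

# Hypothetical example: θ is the mean of a unit-variance Gaussian,
# and the posterior p(θ|D) is taken to be N(1.0, 0.2^2) for illustration.
theta = np.linspace(-4, 6, 2001)
post = np.exp(-0.5 * ((theta - 1.0) / 0.2) ** 2) / (0.2 * np.sqrt(2 * np.pi))
lik = lambda x, th: np.exp(-0.5 * (x - th) ** 2) / np.sqrt(2 * np.pi)
print(predictive_density(0.5, theta, post, lik))
```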
Univariate Gaussian Case
• Case: only the mean is unknown: p(x|μ) ~ N(μ, σ²).
• Known prior density: p(μ) ~ N(μ0, σ0²).
• Using Bayes formula:

$$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{p(D)} = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha\, p(D \mid \mu)\, p(\mu) = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)$$

• Rationale: once a value of μ is known, the density for x is completely known. α is a normalization factor that depends on the data, D.
ECE 8527: Lecture 05, Slide 4
Univariate Gaussian Case
• Applying our Gaussian assumptions:

$$\begin{aligned}
p(\mu \mid D) &= \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^{2}\right] \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^{2}\right] \\
&= \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{\mu - \mu_0}{\sigma_0}\right)^{2} + \sum_{k=1}^{n}\left(\frac{x_k - \mu}{\sigma}\right)^{2}\right)\right]
\end{aligned}$$

where factors that do not depend on μ have been absorbed into the normalization constants α and α'.
ECE 8527: Lecture 05, Slide 5
Univariate Gaussian Case (Cont.)
• Now we need to work this into a simpler form:

$$\begin{aligned}
p(\mu \mid D) &= \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{\mu - \mu_0}{\sigma_0}\right)^{2} + \sum_{k=1}^{n}\left(\frac{x_k - \mu}{\sigma}\right)^{2}\right)\right] \\
&= \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{1}{\sigma^{2}}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right] \\
&= \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right]
\end{aligned}$$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean and terms independent of μ have been absorbed into the constants.
ECE 8527: Lecture 05, Slide 6
Univariate Gaussian Case (Cont.)
• p(μ|D) is an exponential of a quadratic function, which makes it a normal
distribution. Because this is true for any n, it is referred to as a reproducing
density.
• p(μ) is referred to as a conjugate prior.
• Write p(μ|D) ~ N(μn, σn²):

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^{2}\right]$$

• Expand the quadratic term:

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu^{2} - 2\mu\mu_n + \mu_n^{2}}{\sigma_n^{2}}\right)\right]$$

• Equate coefficients of our two functions:

$$\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu^{2} - 2\mu\mu_n + \mu_n^{2}}{\sigma_n^{2}}\right)\right] = \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right]$$

ECE 8527: Lecture 05, Slide 7
Univariate Gaussian Case (Cont.)
• Rearrange terms so that the dependencies on μ are clear:

$$\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{\mu_n^{2}}{2\sigma_n^{2}}\right] \exp\!\left[-\frac{1}{2}\left(\frac{1}{\sigma_n^{2}}\,\mu^{2} - \frac{2\mu_n}{\sigma_n^{2}}\,\mu\right)\right] = \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right]$$

• Associate the terms in μ² and μ to obtain equations for σn² and μn:

$$\frac{1}{\sigma_n^{2}} = \frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}} \qquad\text{and}\qquad \frac{\mu_n}{\sigma_n^{2}} = \frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}$$

• There is actually a third equation, involving the terms not related to μ:

$$\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{\mu_n^{2}}{2\sigma_n^{2}}\right] = \alpha''$$

but we can ignore this since it is not a function of μ and is a complicated equation to solve.
ECE 8527: Lecture 05, Slide 8
Univariate Gaussian Case (Cont.)
• Two equations and two unknowns. Solve for μn and σn². First, solve for σn²:

$$\sigma_n^{2} = \frac{1}{\dfrac{n}{\sigma^{2}} + \dfrac{1}{\sigma_0^{2}}} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}$$

• Next, solve for μn:

$$\mu_n = \sigma_n^{2}\left(\frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right) = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\left(\frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right) = \left(\frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\right)\hat{\mu}_n + \left(\frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\right)\mu_0$$

• Summarizing:

$$\mu_n = \left(\frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\right)\hat{\mu}_n + \left(\frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\right)\mu_0 \qquad\quad \sigma_n^{2} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}$$

ECE 8527: Lecture 05, Slide 9
Bayesian Learning
• μn represents our best guess after n samples.
• σn2 represents our uncertainty about this guess.
• σn2 approaches σ2/n for large n – each additional observation decreases our
uncertainty.
• The posterior, p(μ|D), becomes more sharply peaked as n grows large. This is
known as Bayesian learning.
ECE 8527: Lecture 05, Slide 10
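• A minimal sketch of the update equations from the previous slide (a sketch only: it assumes the observation variance σ² is known, and uses synthetic data) shows both the posterior mean μn and how σn² shrinks toward σ²/n:

```python
import numpy as np

def gaussian_mean_posterior(x, mu0, var0, var):
    """Posterior N(mu_n, var_n) for an unknown Gaussian mean with known variance.

    x:    1-D array of samples.
    mu0:  prior mean.  var0: prior variance.  var: known observation variance.
    """
    n = len(x)
    mu_hat = x.mean()
    mu_n = (n * var0 / (n * var0 + var)) * mu_hat + (var / (n * var0 + var)) * mu0
    var_n = (var0 * var) / (n * var0 + var)
    return mu_n, var_n

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)   # synthetic data, true mean = 2.0
for n in (1, 10, 100, 1000):
    mu_n, var_n = gaussian_mean_posterior(data[:n], mu0=0.0, var0=1.0, var=1.0)
    print(f"n={n:4d}  mu_n={mu_n:.3f}  var_n={var_n:.5f}  sigma^2/n={1.0/n:.5f}")
```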
Class-Conditional Density
• How do we obtain p(x|D)? (The derivation is tedious.)

$$\begin{aligned}
p(x \mid D) &= \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \\
&= \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^{2}\right] d\mu \\
&= \frac{1}{2\pi\sigma\sigma_n} \exp\!\left[-\frac{1}{2}\,\frac{(x-\mu_n)^{2}}{\sigma^{2}+\sigma_n^{2}}\right] f(\sigma, \sigma_n)
\end{aligned}$$

where:

$$f(\sigma, \sigma_n) = \int \exp\!\left[-\frac{1}{2}\,\frac{\sigma^{2}+\sigma_n^{2}}{\sigma^{2}\sigma_n^{2}}\left(\mu - \frac{\sigma_n^{2}x + \sigma^{2}\mu_n}{\sigma^{2}+\sigma_n^{2}}\right)^{2}\right] d\mu$$

• Note that: p(x|D) ~ N(μn, σ² + σn²).
• The conditional mean, μn, is treated as the true mean.
• p(x|D) and P(ωj) can be used to design the classifier.
ECE 8527: Lecture 05, Slide 11
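• Continuing the earlier sketch, the predictive density in this case needs no numerical integration; an illustrative snippet (the values of μn and σn² below are hypothetical, as if carried over from a posterior update):

```python
import numpy as np

def predictive_pdf(x, mu_n, var_n, var):
    """p(x|D) = N(x; mu_n, var + var_n) for the univariate Gaussian case."""
    total_var = var + var_n
    return np.exp(-0.5 * (x - mu_n) ** 2 / total_var) / np.sqrt(2 * np.pi * total_var)

# Hypothetical posterior parameters: mu_n = 1.9, var_n = 0.01, known var = 1.0.
xs = np.array([0.0, 1.0, 2.0, 3.0])
print(predictive_pdf(xs, mu_n=1.9, var_n=0.01, var=1.0))
```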
Multivariate Case
• Assume: p(x ) ~ N (, ) and p() ~ N (0 , 0 )
where  0 , and  0 are assumed to be known.
• Applying Bayes formula:
n
p | D     p(x k | ) p 
k 1
n
 1 t

1
1
t
1

  exp  ( (n    0 )  2 (  x k   01  0 )) 
 2

i 1
which has the form:
 1

p | D     exp  (   n ) t  1 (   n )
 2

• Once again: p | D  ~ N ( n ,  n )
and we have a reproducing density.
ECE 8527: Lecture 05, Slide 12
Estimation Equations
• Equating coefficients between the two Gaussians:

$$\boldsymbol{\Sigma}_n^{-1} = n\boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1} \qquad\quad \boldsymbol{\Sigma}_n^{-1}\boldsymbol{\mu}_n = n\boldsymbol{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_n + \boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0 \qquad\quad \hat{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$

• The solution to these equations is:

$$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\hat{\boldsymbol{\mu}}_n + \tfrac{1}{n}\boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\boldsymbol{\mu}_0 \qquad\quad \boldsymbol{\Sigma}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\tfrac{1}{n}\boldsymbol{\Sigma}$$

• It also follows that: p(x|D) ~ N(μn, Σ + Σn).
ECE 8527: Lecture 05, Slide 13
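• A small numerical sketch of these multivariate update equations (illustrative only: Σ, μ0, and Σ0 are assumed known, as stated above, and the data are synthetic):

```python
import numpy as np

def mv_gaussian_mean_posterior(X, mu0, Sigma0, Sigma):
    """Posterior N(mu_n, Sigma_n) for an unknown multivariate Gaussian mean.

    X:      (n, d) array of samples.
    mu0:    (d,) prior mean.  Sigma0: (d, d) prior covariance.
    Sigma:  (d, d) known observation covariance.
    """
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    A = np.linalg.inv(Sigma0 + Sigma / n)      # (Sigma0 + (1/n) Sigma)^-1
    mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=np.eye(2), size=200)
mu_n, Sigma_n = mv_gaussian_mean_posterior(X, mu0=np.zeros(2), Sigma0=np.eye(2), Sigma=np.eye(2))
print(mu_n)       # close to the true mean [1, -1]
print(Sigma_n)    # roughly Sigma / n for large n
```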
General Theory
• p(x | D) computation can be applied to any situation in which the unknown
density can be parameterized.
• The basic assumptions are:
 The form of p(x | θ) is assumed known, but the value of θ is not known
exactly.
 Our knowledge about θ is assumed to be contained in a known prior
density p(θ).
 The rest of our knowledge about θ is contained in a set D of n random
variables x1, x2, …, xn drawn independently according to the unknown
probability density function p(x).
ECE 8527: Lecture 05, Slide 14
Formal Solution
• The desired density is given by:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

• Using Bayes formula, we can write p(θ|D) as:

$$p(\boldsymbol{\theta} \mid D) = \frac{p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}}$$

• and by the independence assumption:

$$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$
• This constitutes the formal solution to the problem because we have an
expression for the probability of the data given the parameters.
• This also illuminates the relation to the maximum likelihood estimate:
• Suppose p(D|θ) reaches a sharp peak at θ = θ̂.
ECE 8527: Lecture 05, Slide 15
Comparison to Maximum Likelihood
• This also illuminates the relation to the maximum likelihood estimate:
 Suppose p(D|θ) reaches a sharp peak at θ = θ̂.
 p(θ|D) will also peak at the same place if p(θ) is well-behaved.
 p(x|D) will be approximately p(x|θ̂), which is the ML result.
 If the peak of p(D| θ ) is very sharp, then the influence of prior information
on the uncertainty of the true value of θ can be ignored.
 However, the Bayes solution tells us how to use all of the available
information to compute the desired density p(x|D).
ECE 8527: Lecture 05, Slide 16
Recursive Bayes Incremental Learning
• To indicate explicitly the dependence on the number of samples, let:
Dn  x1, x2 ,..., xn
• We can then write our expression for p(D| θ ):
p( Dn | )  p(xn | ) p( Dn1 | )
0
where p( D | )  p() .
• We can write the posterior density using a recursive relation:

$$p(\boldsymbol{\theta} \mid D^{n}) = \frac{p(D^{n} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(D^{n} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}} = \frac{p(\mathbf{x}_n \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D^{n-1})}{\int p(\mathbf{x}_n \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D^{n-1})\, d\boldsymbol{\theta}}$$

• where p(θ|D⁰) = p(θ).
• This is called Recursive Bayes Incremental Learning because it gives us a method for incrementally updating our estimates.
ECE 8527: Lecture 05, Slide 17
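• A minimal grid-based sketch of this recursion (a one-dimensional θ, a flat prior, and synthetic data are assumed purely for illustration); each new sample multiplies the current posterior by p(xn|θ) and renormalizes:

```python
import numpy as np

def recursive_bayes(samples, theta_grid, prior, likelihood_fn):
    """Recursively update p(θ|D^n) ∝ p(x_n|θ) p(θ|D^{n-1}) on a grid of θ values."""
    posterior = prior / np.trapz(prior, theta_grid)     # p(θ|D^0) = p(θ)
    for x in samples:
        posterior = likelihood_fn(x, theta_grid) * posterior
        posterior /= np.trapz(posterior, theta_grid)    # renormalize after each sample
    return posterior

theta = np.linspace(-5, 5, 1001)
flat_prior = np.ones_like(theta)
lik = lambda x, th: np.exp(-0.5 * (x - th) ** 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(2)
data = rng.normal(2.0, 1.0, size=50)                    # synthetic data, true mean = 2.0
post = recursive_bayes(data, theta, flat_prior, lik)
print(theta[np.argmax(post)])                           # peaks near the true mean
```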
When do ML and Bayesian Estimation Differ?
• For infinite amounts of data, the solutions converge. However, limited data
is always a problem.
• If prior information is reliable, a Bayesian estimate can be superior.
• Bayesian estimates for uniform priors are similar to an ML solution.
• If p(θ| D) is broad or asymmetric around the true value, the approaches are
likely to produce different solutions.
• When designing a classifier using these techniques, there are three
sources of error:
 Bayes Error: the error due to overlapping distributions
 Model Error: the error due to an incorrect model or incorrect assumption
about the parametric form.
 Estimation Error: the error arising from the fact that the parameters are
estimated from a finite amount of data.
ECE 8527: Lecture 05, Slide 18
Noninformative Priors and Invariance
• The information about the prior is based on the designer’s knowledge of the
problem domain.
• We expect the prior distributions to be “translation and scale invariant” – they should not depend on the actual value of the parameter.
• A prior that satisfies this property is referred to as a
“noninformative prior”:
 The Bayesian approach remains applicable even when little or no prior
information is available.
 Such situations can be handled by choosing a prior density giving equal
weight to all possible values of θ.
 Priors that seemingly impart no prior preference, the so-called
noninformative priors, also arise when the prior is required to be invariant
under certain transformations.
 Frequently, the desire to treat all possible values of θ equitably leads to
priors with infinite mass. Such noninformative priors are called improper
priors.
ECE 8527: Lecture 05, Slide 19
Example of Noninformative Priors
• For example, if we assume the prior distribution of a mean of a continuous
random variable is independent of the choice of the origin, the only prior
that could satisfy this is a uniform distribution (which isn’t possible).
• Consider a parameter σ, and a transformation of this variable to a new variable, σ̃ = ln(σ). Suppose we also scale σ by a positive constant: σ̂ = ln(aσ) = ln(a) + ln(σ). A noninformative prior on σ is the inverse distribution p(σ) = 1/σ, which is also improper.
ECE 8527: Lecture 05, Slide 20
Sufficient Statistics
• Direct computation of p(D|θ) and p(θ|D) for large data sets is challenging
(e.g. neural networks)
• We need a parametric form for p(x|θ) (e.g., Gaussian)
• Gaussian case: computation of the sample mean and covariance, which was
straightforward, contained all the information relevant to estimating the
unknown population mean and covariance.
• This property exists for other distributions.
• A sufficient statistic is a function s of the samples D that contains all the
information relevant to a parameter, θ.
• A statistic, s, is said to be sufficient for θ if p(D|s,θ) is independent of θ:
p ( | s , D ) 
p ( D | s , ) p ( | s )
 p ( | s )
p( D | s)
ECE 8527: Lecture 05, Slide 21
The Factorization Theorem
• Theorem: A statistic, s, is sufficient for θ, if and only if p(D|θ) can be written
as: p(D | )  g (s, )h(D) .
• There are many ways to formulate sufficient statistics
(e.g., define a vector of the samples themselves).
• Useful only when the function g() and the sufficient statistic are simple
(e.g., sample mean calculation).
• The factoring of p(D|θ) is not unique:
g ( s, )  f ( s) g ( s, )
h( D)  h( D) / f ( s)
• Define a kernel density invariant to scaling:
g ( s , )
g~ ( s, ) 
 g ( s ,  ) d
• Significance: most practical applications of parameter estimation involve
simple sufficient statistics and simple kernel densities.
ECE 8527: Lecture 05, Slide 22
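• A quick numerical check of this idea (a sketch under the Gaussian assumption with known variance): two data sets of the same size and the same sample mean yield log-likelihoods that differ only by a θ-independent constant, so the sample mean carries all of the information about θ:

```python
import numpy as np

def log_likelihood(theta_grid, X, var=1.0):
    """Log-likelihood of a Gaussian mean θ for data X (known variance)."""
    return np.array([-0.5 * np.sum((X - th) ** 2) / var for th in theta_grid])

theta = np.linspace(-2, 4, 7)
D1 = np.array([0.0, 1.0, 2.0, 3.0])      # sample mean = 1.5
D2 = np.array([1.4, 1.6, 0.5, 2.5])      # same size, same sample mean = 1.5

diff = log_likelihood(theta, D1) - log_likelihood(theta, D2)
print(diff)   # constant across θ: the data enter only through n and the sample mean
```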
Gaussian Distributions
 1

t 1




exp

x



x


k
k
 2

d / 2 1/ 2
k 1 2 

n
1
p ( D )  
 1 n t 1
t 1
t 1 

exp







x

x

k
k 
 2
d / 2 1/ 2
 k 1

2  
1
n
 n t 1

t 1
 exp          x k 
 k 1 
 2

1
 1 n t 1  


exp   x k  x k 
 2 d / 2  1/ 2
 2 k 1
 

• This isolates the θ dependence in the first term, and hence, the sample mean
is a sufficient statistic using the Factorization Theorem.
• The kernel is:
~ ,  
g~ 
n
 1
~ t  1  1   
~ 

exp





n
n 
n


2 d / 2  1/ 2  2 
1
ECE 8527: Lecture 05, Slide 23
The Exponential Family
• This can be generalized:

$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \alpha(\mathbf{x}) \exp\!\left[a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^{t}\mathbf{c}(\mathbf{x})\right]$$

and:

$$p(D \mid \boldsymbol{\theta}) = \exp\!\left[n\,a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^{t}\sum_{k=1}^{n}\mathbf{c}(\mathbf{x}_k)\right] \prod_{k=1}^{n}\alpha(\mathbf{x}_k) = g(\mathbf{s}, \boldsymbol{\theta})\,h(D)$$
• Examples:
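 For instance, the univariate Gaussian with an unknown mean θ and known variance σ² is one member of this family:

$$p(x \mid \theta) = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{x^{2}}{2\sigma^{2}}\right]}_{\alpha(x)}\,\exp\!\Bigg[\underbrace{-\frac{\theta^{2}}{2\sigma^{2}}}_{a(\theta)} + \underbrace{\frac{\theta}{\sigma^{2}}}_{b(\theta)}\,\underbrace{x}_{c(x)}\Bigg]$$

so the sufficient statistic is s = Σk c(xk) = Σk xk, the sum (equivalently the sample mean) of the observations.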
ECE 8527: Lecture 05, Slide 24
Summary
• Introduction of Bayesian parameter estimation.
• The role of the class-conditional distribution in a Bayesian estimate.
• Estimation of the posterior and probability density function assuming the only
unknown parameter is the mean, and the conditional density of the “features”
given the mean, p(x|θ), can be modeled as a Gaussian distribution.
• Bayesian estimates of the mean for the multivariate Gaussian case.
• General theory for Bayesian estimation.
• Comparison to maximum likelihood estimates.
• Recursive Bayesian incremental learning.
• Noninformative priors.
• Sufficient statistics
• Kernel density.
ECE 8527: Lecture 05, Slide 25
“The Euro Coin”
• Getting ahead a bit, let’s see how we can put these ideas to work on a simple
example due to David MacKay, and explained by Jon Hamaker.
ECE 8527: Lecture 05, Slide 26