Bayesian Learning
Thanks to Nir Friedman, HU
Example
- Suppose we are required to build a controller that removes bad oranges from a packaging line
- Decisions are made based on a sensor that reports the overall color of the orange
Classifying oranges
Suppose we know all the aspects of the problem:
Prior probabilities:
- Probability of good (+1) and bad (-1) oranges
- P(C = +1) = probability of a good orange
- P(C = -1) = probability of a bad orange
- Note: P(C = +1) + P(C = -1) = 1
Assumption: oranges are independent
- The occurrence of a bad orange does not depend on previous oranges
Classifying oranges (cont)
Sensor performance:
- Let X denote the sensor measurement for each type of orange
[Figure: class-conditional densities p(X | C = -1) and p(X | C = +1)]
Bayes Rule
- Given this knowledge, we can compute the posterior probabilities using Bayes rule:

  P(C \mid X = x) = \frac{P(C)\, P(X = x \mid C)}{P(X = x)}

  where

  P(X = x) = P(C = +1)\, P(X = x \mid C = +1) + P(C = -1)\, P(X = x \mid C = -1)
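A minimal Python sketch of this computation (the prior and likelihood values are illustrative, not taken from the slides):

```python
# Posterior for a binary class via Bayes rule.
# The prior and likelihood values below are illustrative numbers.

def posterior(prior_pos, lik_pos, lik_neg):
    """Return (P(C=+1 | X=x), P(C=-1 | X=x)) given
    prior_pos = P(C=+1), lik_pos = p(x | C=+1), lik_neg = p(x | C=-1)."""
    prior_neg = 1.0 - prior_pos
    evidence = prior_pos * lik_pos + prior_neg * lik_neg   # P(X = x)
    return prior_pos * lik_pos / evidence, prior_neg * lik_neg / evidence

# Example: mostly good oranges, but the sensor reading favors "bad".
print(posterior(prior_pos=0.9, lik_pos=0.2, lik_neg=0.7))   # ~ (0.72, 0.28)
```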
Posterior of Oranges
[Figure: the data likelihoods p(X | C = -1) and p(X | C = +1), combined with the prior to give P(C = -1) p(X | C = -1) and P(C = +1) p(X | C = +1), and, after normalization, the posteriors P(C = -1 | X) and P(C = +1 | X).]
Decision making
Intuition:
- Predict "Good" if P(C = +1 | X) > P(C = -1 | X)
- Predict "Bad" otherwise
[Figure: the posteriors P(C = -1 | X) and P(C = +1 | X), with the X axis partitioned into "bad", "good", "bad" regions by this rule.]
Loss function
- Suppose we have classes +1, -1
- Suppose we can make predictions a_1, …, a_k
- A loss function L(a_i, c_j) describes the loss associated with making prediction a_i when the class is c_j

  Prediction \ Real label |  -1 |  +1
  Bad                     |   1 |   5
  Good                    |  10 |   0
Expected Risk
- Given the estimates of P(C | X) we can compute the expected conditional risk of each decision:

  R(a \mid X) = \sum_{c} L(a, c)\, P(C = c \mid X)
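A minimal sketch of this computation for the orange example, using the loss table above (the posterior value is illustrative):

```python
# Expected conditional risk R(a | X) = sum_c L(a, c) * P(C = c | X),
# using the loss table from the "Loss function" slide.
# The posterior value below is an illustrative number.

LOSS = {            # LOSS[action][true_class]
    "Bad":  {-1: 1,  +1: 5},
    "Good": {-1: 10, +1: 0},
}

def expected_risk(action, posterior):
    """posterior maps class label (-1 or +1) to P(C = c | X)."""
    return sum(LOSS[action][c] * p for c, p in posterior.items())

posterior = {+1: 0.7, -1: 0.3}          # e.g. P(C = +1 | X) = 0.7
risks = {a: expected_risk(a, posterior) for a in LOSS}
print(risks)                             # ~ {'Bad': 3.8, 'Good': 3.0}
print(min(risks, key=risks.get))         # minimum-risk decision: 'Good'
```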
The Risk in Oranges
[Figure: the posteriors P(C = -1 | X) and P(C = +1 | X) together with the expected risks R(Good | X) and R(Bad | X), computed from the loss table:]

  Prediction \ Real label |  -1 |  +1
  Bad                     |   1 |   5
  Good                    |  10 |   0
Optimal Decisions
Goal:
- Minimize the risk
Optimal decision rule:
- "Given X = x, predict a_i if R(a_i | X = x) = min_a R(a | X = x)"
- (break ties arbitrarily)
Note: randomized decisions do not help
0-1 Loss
- If we don't have prior knowledge, it is common to use the 0-1 loss:
  - L(a, c) = 0 if a = c
  - L(a, c) = 1 otherwise
Consequence:
- R(a | X) = P(a ≠ c | X)
- Decision rule: "choose a_i if P(C = a_i | X) = max_a P(C = a | X)"
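A short sketch showing that under the 0-1 loss the minimum-risk decision is just the maximum-posterior (MAP) class (the posterior value is illustrative):

```python
# With the 0-1 loss, R(a | X) = 1 - P(C = a | X), so the minimum-risk
# decision is simply the class with maximum posterior probability (MAP).
# The posterior below is an illustrative number.

posterior = {+1: 0.75, -1: 0.25}

risks = {a: 1.0 - p for a, p in posterior.items()}   # R(a | X) under 0-1 loss
print(risks)                                          # {1: 0.25, -1: 0.75}
print(max(posterior, key=posterior.get))              # MAP decision: 1 (i.e. +1)
```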
Bayesian Decisions: Summary
Decisions are based on two components:
- Conditional distribution P(C | X)
- Loss function L(A, C)
Pros:
- Specifies optimal actions in the presence of noisy signals
- Can deal with skewed loss functions
Cons:
- Requires P(C | X)
Simple Statistics: Binomial Experiment
[Figure: a thumbtack landing as Head or Tail.]
- When tossed, it can land in one of two positions: Head or Tail
- We denote by \theta the (unknown) probability P(H)
Estimation task:
- Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities P(H) = \theta and P(T) = 1 - \theta
Why Is Learning Possible?
- Suppose we perform M independent flips of the thumbtack
- The number of heads we see follows a binomial distribution:

  P(\#\text{Heads} = k) = \binom{M}{k} \theta^k (1 - \theta)^{M - k}

  and thus E[\#\text{Heads}] = M\theta

- This suggests that we can estimate \theta by \#\text{Heads} / M
Maximum Likelihood Estimation
MLE Principle: learn the parameters that maximize the likelihood function
- This is one of the most commonly used estimators in statistics
- Intuitively appealing
- Well-studied properties
Computing the Likelihood Functions
- To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

  L(\theta : D) = \theta^{N_H} (1 - \theta)^{N_T}

- Applying the MLE principle we get

  \hat{\theta} = \frac{N_H}{N_H + N_T}

- N_H and N_T are sufficient statistics for the binomial distribution
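A minimal sketch of this estimator (the toss sequence is illustrative):

```python
# MLE for the thumbtack: theta_hat = N_H / (N_H + N_T).
# The toss sequence below is an illustrative example.

tosses = ["H", "T", "H", "H", "T", "H", "H", "T"]

n_heads = tosses.count("H")   # sufficient statistic N_H
n_tails = tosses.count("T")   # sufficient statistic N_T

theta_mle = n_heads / (n_heads + n_tails)
print(theta_mle)              # 0.625
```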
Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood
- Formally, s(D) is a sufficient statistic if for any two datasets D and D':

  s(D) = s(D')  \Rightarrow  L(\theta \mid D) = L(\theta \mid D')

[Figure: many datasets mapping to the same sufficient statistics.]
Maximum A Posteriori (MAP)
- Suppose we observe the sequence H, H
- The MLE estimate is P(H) = 1, P(T) = 0
- Should we really believe that tails are impossible at this stage?
- Such an estimate can have a disastrous effect
  - If we assume that P(T) = 0, then we are willing to act as though this outcome is impossible
Laplace Correction
- Suppose we observe n coin flips with k heads
- MLE:

  P(H) = \frac{k}{n}

- Laplace correction:

  P(H) = \frac{k + 1}{n + 2}

  As though we observed one additional H and one additional T

- Can we justify this estimate? Uniform prior!
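A minimal sketch of the corrected estimate (the counts are illustrative):

```python
# Laplace-corrected estimate: (k + 1) / (n + 2), i.e. pretend we saw
# one extra head and one extra tail. The counts below are illustrative.

def laplace_estimate(k, n):
    """Estimate P(H) from k heads in n flips with the Laplace correction."""
    return (k + 1) / (n + 2)

print(laplace_estimate(k=2, n=2))   # 0.75, not 1.0 as the MLE would give
print(laplace_estimate(k=0, n=5))   # 0.1428..., heads are not ruled impossible
```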
Bayesian Reasoning
- In Bayesian reasoning we represent our uncertainty about the unknown parameter \theta by a probability distribution
- This probability distribution can be viewed as a subjective probability
  - This is a personal judgment of uncertainty
Bayesian Inference
We start with:
- P(\theta) - prior distribution about the values of \theta
- P(x_1, …, x_n \mid \theta) - likelihood of examples given a known value \theta
Given examples x_1, …, x_n, we can compute the posterior distribution on \theta:

  P(\theta \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_n)}

where the marginal likelihood is

  P(x_1, \ldots, x_n) = \int P(x_1, \ldots, x_n \mid \theta)\, P(\theta)\, d\theta
Binomial Distribution: Laplace Est.
- In this case the unknown parameter is \theta = P(H)
- Simplest prior: P(\theta) = 1 for 0 < \theta < 1
- Likelihood:

  P(x_1, \ldots, x_n \mid \theta) = \theta^k (1 - \theta)^{n - k}

  where k is the number of heads in the sequence

- Marginal likelihood:

  P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1 - \theta)^{n - k}\, d\theta
Marginal Likelihood
Using integration by parts we have:

  P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta
                      = \left[ \frac{1}{k+1}\, \theta^{k+1} (1 - \theta)^{n-k} \right]_0^1
                        + \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta
                      = \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta

(the boundary term vanishes). Multiplying both sides by \binom{n}{k}, we have

  \binom{n}{k} \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta
    = \binom{n}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta
Marginal Likelihood (cont.)
- The recursion terminates when k = n:

  \binom{n}{n} \int_0^1 \theta^n (1 - \theta)^{n-n}\, d\theta = \int_0^1 \theta^n\, d\theta = \frac{1}{n+1}

- Thus

  P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta = \frac{1}{(n+1) \binom{n}{k}}

- We conclude that the posterior is

  P(\theta \mid x_1, \ldots, x_n) = (n+1) \binom{n}{k}\, \theta^k (1 - \theta)^{n-k}
Bayesian Prediction
- How do we predict using the posterior?
- We can think of this as computing the probability of the next element in the sequence:

  P(x_{n+1} \mid x_1, \ldots, x_n) = \int P(x_{n+1}, \theta \mid x_1, \ldots, x_n)\, d\theta
    = \int P(x_{n+1} \mid \theta, x_1, \ldots, x_n)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta
    = \int P(x_{n+1} \mid \theta)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta

- Assumption: if we know \theta, the probability of X_{n+1} is independent of X_1, …, X_n:

  P(x_{n+1} \mid \theta, x_1, \ldots, x_n) = P(x_{n+1} \mid \theta)
Bayesian Prediction (cont.)
- Thus, we conclude that

  P(x_{n+1} = H \mid x_1, \ldots, x_n) = \int P(x_{n+1} = H \mid \theta)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta
    = \int \theta\, P(\theta \mid x_1, \ldots, x_n)\, d\theta
    = (n+1) \binom{n}{k} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k}\, d\theta
    = (n+1) \binom{n}{k}\, \frac{1}{(n+2) \binom{n+1}{k+1}}
    = \frac{k+1}{n+2}
Naïve Bayes
Bayesian Classification: Binary Domain
Consider the following situation:
- Two classes: -1, +1
- Each example is described by N attributes
- Each attribute X_i is a binary variable with values 0, 1
Example dataset:

  X1  X2  …  XN  |  C
   0   1  …   1  | +1
   1   0  …   1  | -1
   1   1  …   0  | +1
   …   …  …   …  |  …
   0   0  …   0  | +1
Binary Domain - Priors
How do we estimate P(C)?
- Simple binomial estimation
- Count the number of instances with C = -1 and with C = +1 in the example dataset above
Binary Domain - Attribute Probability
How do we estimate P(X1, …, XN | C)?
Two sub-problems:
- Training set for P(X1, …, XN | C = +1): the rows of the example dataset with C = +1
- Training set for P(X1, …, XN | C = -1): the rows of the example dataset with C = -1
Naïve Bayes
Naïve Bayes assumption:

  P(X_1, \ldots, X_N \mid C) = P(X_1 \mid C) \cdots P(X_N \mid C)

This is an independence assumption:
- Each attribute X_i is independent of the other attributes once we know the value of C
Naïve Bayes: Boolean Domain
- Parameters: \theta_{i|+1} and \theta_{i|-1} for each i, where

  \theta_{i|+1} = P(X_i = 1 \mid C = +1)
  \theta_{i|-1} = P(X_i = 1 \mid C = -1)

- How do we estimate \theta_{1|+1}?
  - Simple binomial estimation
  - Count the #1 and #0 values of X_1 in the instances where C = +1 (in the example dataset above)
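A minimal sketch of this counting-based estimation (the small dataset is illustrative, patterned on the example table; a Laplace correction could be added to avoid zero estimates):

```python
# Estimating Naive Bayes parameters in the Boolean domain:
# priors P(C = c) and theta_{i|c} = P(X_i = 1 | C = c) by counting.
# The small dataset below is illustrative.

data = [  # each row: ([x1, x2, x3], class)
    ([0, 1, 1], +1),
    ([1, 0, 1], -1),
    ([1, 1, 0], +1),
    ([0, 0, 0], +1),
]

classes = (+1, -1)
n_attrs = len(data[0][0])

# Priors: simple binomial estimation over the class column.
prior = {c: sum(1 for _, y in data if y == c) / len(data) for c in classes}

# theta_{i|c}: fraction of instances with X_i = 1 among those with C = c.
theta = {
    c: [
        sum(x[i] for x, y in data if y == c) / sum(1 for _, y in data if y == c)
        for i in range(n_attrs)
    ]
    for c in classes
}

print(prior)   # {1: 0.75, -1: 0.25}
print(theta)   # {1: [~0.33, ~0.67, ~0.33], -1: [1.0, 0.0, 1.0]}
```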
Interpretation of Naïve Bayes

  \log \frac{P(+1 \mid X_1, \ldots, X_n)}{P(-1 \mid X_1, \ldots, X_n)}
    = \log \frac{P(X_1, \ldots, X_n \mid +1)\, P(+1)}{P(X_1, \ldots, X_n \mid -1)\, P(-1)}
    = \log \frac{P(+1)}{P(-1)} + \log \prod_i \frac{P(X_i \mid +1)}{P(X_i \mid -1)}
    = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}
Interpretation of Naïve Bayes

  \log \frac{P(+1 \mid X_1, \ldots, X_n)}{P(-1 \mid X_1, \ldots, X_n)}
    = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}

- Each X_i "votes" about the prediction
  - If P(X_i | C = -1) = P(X_i | C = +1) then X_i has no say in the classification
  - If P(X_i | C = -1) = 0 then X_i overrides all other votes ("veto")
Interpretation of Naïve Bayes

  \log \frac{P(+1 \mid X_1, \ldots, X_n)}{P(-1 \mid X_1, \ldots, X_n)}
    = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}

Set

  w_i = \log \frac{P(X_i = 1 \mid +1)}{P(X_i = 1 \mid -1)} - \log \frac{P(X_i = 0 \mid +1)}{P(X_i = 0 \mid -1)}

  b = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i = 0 \mid +1)}{P(X_i = 0 \mid -1)}

Classification rule: \operatorname{sign}\left( b + \sum_i w_i x_i \right)
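This linear form can be implemented directly; a minimal Python sketch with illustrative (already smoothed) parameter values:

```python
# Naive Bayes over binary attributes written as a linear classifier:
# predict sign(b + sum_i w_i * x_i). Parameter values are illustrative
# (in practice they come from smoothed counts, so no probability is 0 or 1).
from math import log

prior_pos, prior_neg = 0.6, 0.4
theta_pos = [0.8, 0.3, 0.5]   # P(X_i = 1 | C = +1)
theta_neg = [0.2, 0.6, 0.5]   # P(X_i = 1 | C = -1)

w = [log(tp / tn) - log((1 - tp) / (1 - tn))
     for tp, tn in zip(theta_pos, theta_neg)]
b = log(prior_pos / prior_neg) + sum(log((1 - tp) / (1 - tn))
                                     for tp, tn in zip(theta_pos, theta_neg))

def predict(x):
    """Return +1 or -1 for a binary attribute vector x."""
    score = b + sum(wi * xi for wi, xi in zip(w, x))
    return +1 if score >= 0 else -1

print(predict([1, 0, 1]))   # +1 for this illustrative input
print(predict([0, 1, 0]))   # -1 for this illustrative input
```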
Normal Distribution
The Gaussian distribution: X \sim N(\mu, \sigma^2) if

  p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right)

[Figure: the densities of N(0, 1^2) and N(4, 2^2).]
Maximum Likelihood Estimate
- Suppose we observe x_1, …, x_M
- Simple calculations show that the MLE is

  \hat{\mu} = \frac{1}{M} \sum_m x_m

  \hat{\sigma}^2 = \frac{1}{M} \sum_m (x_m - \hat{\mu})^2 = \frac{1}{M} \sum_m x_m^2 - \left( \frac{1}{M} \sum_m x_m \right)^2

- Sufficient statistics are \sum_m x_m and \sum_m x_m^2
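A minimal sketch of computing the Gaussian MLE from the two sufficient statistics (the sample below is illustrative):

```python
# Gaussian MLE from a sample: mean and (biased, 1/M) variance, computed
# from the sufficient statistics sum(x) and sum(x^2). Data is illustrative.

xs = [2.1, 3.4, 1.8, 2.9, 3.0, 2.5]
M = len(xs)

s1 = sum(xs)                 # sufficient statistic: sum of x_m
s2 = sum(x * x for x in xs)  # sufficient statistic: sum of x_m^2

mu_hat = s1 / M
sigma2_hat = s2 / M - mu_hat**2   # equals (1/M) * sum((x - mu_hat)^2)

print(mu_hat, sigma2_hat)
```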
Naïve Bayes with Gaussian Distributions
- Recall:

  P(X_1, \ldots, X_N \mid C) = P(X_1 \mid C) \cdots P(X_N \mid C)

- Assume: P(X_i \mid C) \sim N(\mu_{i,C}, \sigma_i^2)
  - The mean of X_i depends on the class
  - The variance of X_i does not
[Figure: the two class-conditional densities of X_i, with means \mu_{i,-1} and \mu_{i,+1}.]
Naïve Bayes with Gaussian Distributions (cont.)
Recall:

  \log \frac{P(+1 \mid X_1, \ldots, X_n)}{P(-1 \mid X_1, \ldots, X_n)}
    = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}

With Gaussian class-conditionals and a shared variance:

  \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}
    = \frac{\mu_{i,+1} - \mu_{i,-1}}{\sigma_i} \cdot \frac{1}{\sigma_i} \left( X_i - \frac{\mu_{i,+1} + \mu_{i,-1}}{2} \right)

  i.e. (distance between the means) × (distance of X_i to the midway point), each scaled by \sigma_i
Different Variances?
- If we allow different variances, the classification rule is more complex
- The term \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)} is quadratic in X_i