9.913 Pattern Recognition for Vision
Class IV
Part I – Bayesian Decision Theory
Yuri Ivanov
Fall 2004
TOC
• Roadmap to Machine Learning
• Bayesian Decision Making
– Minimum Error Rate Decisions
– Minimum Risk Decisions
– Minimax Criterion
– Operating Characteristics
Notation
$x$ - scalar variable
$\mathbf{x}$ - vector variable, sometimes written $x$ when clear from context
$p(x)$ - probability density function (continuous variable)
$P(x)$ - probability mass function (discrete variable)
$p(\underbrace{a, b, c, d}_{\text{density of these}} \mid \underbrace{e, f, g, h}_{\text{function of these}})$ - conditional density
$\int f(\mathbf{x})\, d\mathbf{x} \;\equiv\; \int f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n \;\equiv\; \int\!\!\int\!\ldots\!\int f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n$
$A \underset{\omega_2}{\overset{\omega_1}{\gtrless}} B$ - if $A > B$ then the answer is $\omega_1$, otherwise $\omega_2$
Machine Learning
Learning machine:
[Block diagram: the Generator G emits $\mathbf{x}$, the Supervisor S emits $y$ given $\mathbf{x}$, and the Learning Machine LM emits $\hat{y} = f(\mathbf{x}, \mathbf{w})$]
G – Generator, or Nature: implements p(x)
S – Supervisor: implements p(y|x)
LM - Learning Machine: implements f(x, w)
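A minimal sketch (mine, not from the slides) of this setup in Python, with an assumed Gaussian generator, a noisy linear supervisor, and a linear learning machine:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(n):                 # G: implements p(x)
    return rng.normal(0.0, 1.0, size=n)

def supervisor(x):                # S: implements p(y|x); here y = 2x + 1 plus noise (assumed)
    return 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)

def learning_machine(x, w):       # LM: implements f(x, w); here a linear model
    return w[0] * x + w[1]

x = generator(100)
y = supervisor(x)
w = np.polyfit(x, y, deg=1)       # fit w from the observed (x, y) pairs
y_hat = learning_machine(x, w)
```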
Loss and Risk
$L(y, f(\mathbf{x}, \mathbf{w}))$ - Loss Function – how much penalty we get for deviations from the true $y$

$R(\mathbf{w}) = \int L(y, f(\mathbf{x}, \mathbf{w}))\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy$ - Expected Risk – how much penalty we get on average
Goal of learning is to find f(x,w*) such that R(w*) is minimal.
Quick Illustration
What does it mean?
$R(w) = \int\!\!\int L(y, f(x, w))\, p(x, y)\, dx\, dy$

From basic probability: $p(x, y) = p(y \mid x)\, p(x)$
If there is no noise: $p(y \mid x) = \delta(y - g(x))$
$\Rightarrow\; R(w) = \int L(g(x), f(x, w))\, p(x)\, dx$
Illustration cont.
So,
$R(w) = \int L(g(x), f(x, w))\, p(x)\, dx$

[Figure: illustration with axes $y$ vs. $x$]
Example
Classification:
Measurements: $x \in [-1, 1]$
Labels: $y \in \{0, 1\}$

[Figure: the joint density $p(x, y)$ over $x \in [-1, 1]$, shown separately for $y = 0$ and $y = 1$]
Find f(x,w*) such that R(w*) is minimal.
Example
Let’s choose:
$f(x, a) = H(x, a)$ - a step function with threshold $a$
$L(y, f(x, a)) = 1 - \delta(y - H(x, a))$ - adds 1 for every mistake (superimposed on the densities above)

$R(a) = \sum_{Y=0}^{1} \int \left[ 1 - \delta(Y - H(x, a)) \right] p(x, y = Y)\, dx$
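A minimal sketch (mine, not from the slides) that evaluates $R(a)$ numerically for an assumed joint density: class $y=0$ taken uniform on $[-1, 0]$ and class $y=1$ uniform on $[0, 1]$, each with probability 1/2.

```python
import numpy as np

def p_xy(x, y):
    # assumed joint density: y=0 uniform on [-1,0], y=1 uniform on [0,1], priors 1/2
    if y == 0:
        return 0.5 * ((x >= -1) & (x < 0))
    return 0.5 * ((x >= 0) & (x <= 1))

def H(x, a):
    # step function: predict 1 to the right of the threshold a, else 0
    return (x > a).astype(int)

def risk(a, n_grid=2001):
    # R(a) = sum_Y integral [1 - delta(Y - H(x,a))] p(x, y=Y) dx,
    # i.e. integrate the 0-1 loss over x for each label value
    x = np.linspace(-1, 1, n_grid)
    r = 0.0
    for Y in (0, 1):
        mistakes = (H(x, a) != Y).astype(float)
        r += np.trapz(mistakes * p_xy(x, Y), x)
    return r

a_grid = np.linspace(-1, 1, 201)
a_star = a_grid[np.argmin([risk(a) for a in a_grid])]   # minimizer is near 0 for this density
```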
Learning in Reality
Fundamental problem: where do we get p(x, y)???
What we want:
$R(\mathbf{w}) = \int L(y, f(\mathbf{x}, \mathbf{w}))\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy$
Approximate: estimate risk functional by averaging loss over
observed (training) data.
What we get:
$R(\mathbf{w}) \;\leftarrow\; R_{\mathrm{emp}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, \mathbf{w}))$
Replace expected risk with empirical risk
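A minimal sketch (assumptions mine: squared-error loss and a linear model) of replacing the expected risk with the empirical risk over a training set:

```python
import numpy as np

def empirical_risk(loss, f, X, y, w):
    # R_emp(w) = (1/n) * sum_i L(y_i, f(x_i, w))
    return np.mean([loss(y_i, f(x_i, w)) for x_i, y_i in zip(X, y)])

# example choices (not from the slides): squared-error loss, linear model
squared_loss = lambda y, y_hat: (y - y_hat) ** 2
linear_model = lambda x, w: w[0] * x + w[1]

X = np.array([0.0, 0.5, 1.0])
y = np.array([1.1, 2.0, 3.1])
print(empirical_risk(squared_loss, linear_model, X, y, w=(2.0, 1.0)))
```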
What We Call Learning
Taxonomies of machine learning:
• by source of evaluation – Supervised, Transductive,
Unsupervised, Reinforcement
• by inductive principle – ERM, SRM, MDL, Bayesian
estimation
• by objective – classification, regression, vector
quantization and density estimation
Taxonomy by Evaluation Source
• Supervised (classification, regression)
Evaluation source - immediate error vector, that is, we get to see the true y
• Transductive
Evaluation source – immediate error vector for SOME of the data
• Unsupervised (clustering, density estimation)
Evaluation source - internal metric – we don’t get to see true y
• Reinforcement
Evaluation source - environment – we get to see some scalar value
(possibly delayed) that is in some way related to whether the label we chose
was correct…
Taxonomy by Inductive Principle
• Empirical Risk Minimization (ERM)
$\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, \mathbf{w})), \qquad f(\mathbf{x}_i, \mathbf{w}) \in \Lambda$

• Structural Risk Minimization (SRM) – adds a “complexity” penalty
$\min_{\mathbf{w}, h} \left( \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, \mathbf{w})) + \Phi(h) \right), \qquad f(\mathbf{x}_i, \mathbf{w}) \in \Lambda_k, \quad \Lambda_1 \subset \Lambda_2 \subset \ldots \subset \Lambda_k \subset \ldots, \quad k \sim h$

• Minimum Description Length (MDL)
$\ell(D, H) = \ell(D \mid H) + \ell(H) \;\Rightarrow\; \min_{\mathbf{w}} \left( \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, \mathbf{w})) + \Phi(\mathbf{w}) \right)$

• Bayesian Estimation
$P(\mathbf{x} \mid C) = \int P(\mathbf{x} \mid \theta)\, p(\theta \mid C)\, d\theta \;\Rightarrow\; \min_{\mathbf{w}} \left( \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, \mathbf{w})) + \Phi(P(\mathbf{w})) \right)$
Taxonomy by Learning Objective
• Classification
$L(y, f(\mathbf{x}, \mathbf{w})) = 1 - \delta(y, f(\mathbf{x}, \mathbf{w}))$
• Regression
$L(y, f(\mathbf{x}, \mathbf{w})) = (y - f(\mathbf{x}, \mathbf{w}))^2$
• Density Estimation
$L(f(\mathbf{x}, \mathbf{w})) = -\log(f(\mathbf{x}, \mathbf{w}))$
• Clustering/Vector Quantization
$L(f(\mathbf{x}, \mathbf{w})) = (\mathbf{x} - f(\mathbf{x}, \mathbf{w}))^{\top}(\mathbf{x} - f(\mathbf{x}, \mathbf{w}))$
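A minimal sketch (mine, not from the slides) of these four loss functions in Python:

```python
import numpy as np

def classification_loss(y, y_hat):
    # 0-1 loss: 1 - delta(y, y_hat)
    return float(y != y_hat)

def regression_loss(y, y_hat):
    # squared error
    return float((y - y_hat) ** 2)

def density_loss(p_hat):
    # negative log of the estimated density evaluated at the sample
    return float(-np.log(p_hat))

def quantization_loss(x, x_hat):
    # squared reconstruction error ||x - x_hat||^2
    d = np.asarray(x) - np.asarray(x_hat)
    return float(d @ d)
```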
The Lay of the Land
find $f$ (or $\mathbf{w}$) $= \arg\min_{f \text{ (or } \mathbf{w})} \left( \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, \mathbf{w})) + \Phi \right)$
Objective: Classification, Regression, Density Estimation, Vector Quantization/Clustering
Evaluation source: Supervised, Transductive, Unsupervised, Reinforcement
Inductive principle: ERM, SRM, MDL, Bayesian, Regularization
Class Priors
Making a decision about observation x is finding a rule that says:
If x is in region A, decide a; if x is in region B, decide b…
$\omega = \{\omega_i\}_{i=1}^{C}$ - state of nature

$P(\omega)$ - Prior probability. Shorthand: $P(\omega_i) \equiv P(\omega = \omega_i)$

$\sum_{i=1}^{C} P(\omega_i) = 1$
Poor man’s decision rule:
Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise $\omega_2$
Class-Conditional Density
$p(x \mid \omega)$ - class-conditional density function

$\int_{-\infty}^{\infty} p(x \mid \omega_i)\, dx = 1$

$p(x \mid \omega)$ - density of $x$ given that nature is in the state $\omega$
How do we decide which class x came from?
Deciding based on $p(x \mid \omega)$ alone is not fair – it ignores the class priors.
Joint Density
$p(x, \omega) = p(x \mid \omega)\, P(\omega)$ - joint density function

[Figure: joint densities $p(x, \omega_1)$ and $p(x, \omega_2)$ with priors $P(\omega_1) = 0.2$, $P(\omega_2) = 0.8$]

Good – it is fair (the priors are taken into account).
Bad – not very convenient: it relates to the measurement only probabilistically, and we need a function of $x$ to compare classes.
Bayes Rule and Posterior
We want a probability of $\omega$ for each value of $x$: $P(\omega \mid x)$

Note that
$p(x, \omega) = p(x \mid \omega)\, P(\omega) = P(\omega \mid x)\, p(x)$

$\Rightarrow$ Bayes rule:
$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}: \qquad P(\omega \mid x) = \frac{p(x \mid \omega)\, P(\omega)}{p(x)}$

The posterior can be viewed as a continuum of binary distributions, one for each value of $x$.
Marginalization
Bayes Rule: how to convert prior to posterior by using measurements:
$P(\omega \mid x) = \frac{p(x \mid \omega)\, P(\omega)}{p(x)}$
What is p(x)?
$p(x) = \sum_{i=1}^{C} p(x, \omega_i) = \sum_{i=1}^{C} p(x \mid \omega_i)\, P(\omega_i)$

Or, more generally:
$p(x) = \int p(x, y)\, dy$ - marginalization
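A minimal sketch (mine; Gaussian class-conditionals assumed) of computing the posterior via Bayes rule, with the evidence obtained by marginalizing over the classes:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior(x, likelihoods, priors):
    # P(w_i | x) = p(x | w_i) P(w_i) / p(x),  with  p(x) = sum_i p(x | w_i) P(w_i)
    joint = np.array([lik(x) * pr for lik, pr in zip(likelihoods, priors)])
    return joint / joint.sum()

likelihoods = [lambda x: gaussian_pdf(x, -1.0, 1.0),   # p(x | w1), assumed
               lambda x: gaussian_pdf(x,  1.0, 1.0)]   # p(x | w2), assumed
priors = [0.2, 0.8]
print(posterior(0.0, likelihoods, priors))             # [P(w1|0), P(w2|0)]
```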
Making Decisions
With posteriors we can compare class probabilities
Intuitively:
$\hat{\omega} = \arg\max_{\omega_i} P(\omega_i \mid x)$

What is the probability of error?

$P(\text{error} \mid x) = \begin{cases} P(\omega_2 \mid x) & \text{if we choose } \omega_1 \\ P(\omega_1 \mid x) & \text{if we choose } \omega_2 \end{cases}$

$P(\text{error}) = \int P(\text{error}, x)\, dx = \int P(\text{error} \mid x)\, p(x)\, dx$
Errors
How bad is a decision threshold?
$P(e) = P(x \in R_1, \omega_2) + P(x \in R_2, \omega_1)$

[Figure: two class densities with a decision threshold $T^*$ separating regions $R_1$ and $R_2$]

Recall that $P(a \in A) = \int_{a \in A} p(a)\, da$, so

$P(e) = \int_{R_1} p(x, \omega_2)\, dx + \int_{R_2} p(x, \omega_1)\, dx = \int_{R_1} p(x \mid \omega_2)\, P(\omega_2)\, dx + \int_{R_2} p(x \mid \omega_1)\, P(\omega_1)\, dx$
Bayes Decision Rule
To minimize $P(\text{error}) = \int P(\text{error} \mid x)\, p(x)\, dx$
we need to make $P(\text{error} \mid x)$ as small as possible for all $x$:

$\min[P(\text{error})] = \int \min_{\omega} [P(\text{error} \mid x)]\, p(x)\, dx = \int \min[P(\omega_1 \mid x), P(\omega_2 \mid x)]\, p(x)\, dx$ - Bayes error

Bayes decision rule:
Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise $\omega_2$
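A minimal sketch (mine; Gaussian class-conditionals assumed as before) of the Bayes decision rule and a numerical estimate of the Bayes error:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# assumed two-class problem
priors = np.array([0.5, 0.5])
p_x_given_w = [lambda x: gaussian_pdf(x, -1.0, 1.0),
               lambda x: gaussian_pdf(x,  1.0, 1.0)]

def bayes_decide(x):
    # decide w1 if P(w1|x) > P(w2|x); the evidence p(x) cancels in the comparison
    return 0 if p_x_given_w[0](x) * priors[0] > p_x_given_w[1](x) * priors[1] else 1

# Bayes error = integral of min(P(w1|x), P(w2|x)) p(x) dx = integral of min_i p(x, w_i) dx
x = np.linspace(-10, 10, 20001)
joint = np.stack([p_x_given_w[i](x) * priors[i] for i in (0, 1)])   # p(x, w_i)
bayes_error = np.trapz(joint.min(axis=0), x)
print(bayes_error)   # about 0.159 for these assumed densities
```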
Bayes Decision Rule
For Bayes decision rule (back to the 1st example):
R1 – region where we always choose ω1
R2 – region where we always choose ω2
Loss Function
$\{\omega_1, \omega_2, \ldots, \omega_c\}$ - set of classes
$\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$ - set of actions
$\lambda(\alpha_i \mid \omega_j)$ - loss function, the penalty sustained for taking
action $\alpha_i$ when the state of nature is $\omega_j$ (we make this up)

For $\mathbf{x} \in R^d$, the conditional risk of taking action $\alpha_i$ is the
expected (average) loss for classifying ONE $\mathbf{x}$:

$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$
Bayes Risk
We will see many $\mathbf{x}$'s. To see how well we do, we average again:

$R = \int R(\alpha_i \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = \int \left[ \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}) \right] p(\mathbf{x})\, d\mathbf{x} = \int \left[ \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, p(\omega_j, \mathbf{x}) \right] d\mathbf{x}$

This is exactly the expression for expected risk from before.

Similarly to the earlier argument about $P(\text{error})$:

$\min[R] = \int \min_{i} [R(\alpha_i \mid \mathbf{x})]\, p(\mathbf{x})\, d\mathbf{x} = R^*$ - Bayes risk
Quick Summary
$\lambda(\alpha_i \mid \omega_j)$ - loss
$R(\alpha_i \mid \mathbf{x}) = E_{\omega \mid \mathbf{x}}\!\left[ \lambda(\alpha_i \mid \omega_j) \right]$ - conditional risk (expected loss)
$R = E_{\mathbf{x}}\!\left[ R(\alpha_i \mid \mathbf{x}) \right]$ - total risk (expected conditional risk)
$R^* = \min[R]$ - Bayes risk (minimum risk)
Minimum Risk Classification
Two classes and a simple action: $\alpha_i$ - decide to choose $\omega_i$, $\omega = \{\omega_1, \omega_2\}$, $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$, e.g.

$\Lambda = \begin{pmatrix} -1 & 0.2 \\ 3 & 0 \end{pmatrix}$

Then:
$R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$
$R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$

Obvious decision – decide in favor of the class with minimal risk:
decide $\omega_1$ if $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$; otherwise $\omega_2$
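A minimal sketch (mine) of the minimum-risk decision: compute both conditional risks from the posteriors and the loss matrix, and pick the action with the smaller risk.

```python
import numpy as np

# loss matrix from the slide: Lambda[i, j] = loss for action alpha_i when truth is omega_j
Lambda = np.array([[-1.0, 0.2],
                   [ 3.0, 0.0]])

def min_risk_decision(posteriors, loss=Lambda):
    # R(alpha_i | x) = sum_j lambda_ij * P(omega_j | x); choose the action with minimal risk
    risks = loss @ np.asarray(posteriors)
    return int(np.argmin(risks)), risks

action, risks = min_risk_decision([0.3, 0.7])   # example posteriors (assumed)
print(action, risks)
```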
Likelihood Ratio Test
Rewriting the $R(\alpha_i \mid x)$'s, decide $\omega_1$ when:

$(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid x)$

$\Rightarrow\quad (\lambda_{21} - \lambda_{11})\, p(x \mid \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(x \mid \omega_2)\, P(\omega_2)$

$\Rightarrow\quad \underbrace{\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}}_{\text{“class model”}} \;\underset{\omega_2}{\overset{\omega_1}{>}}\; \frac{(\lambda_{12} - \lambda_{22})}{(\lambda_{21} - \lambda_{11})} \underbrace{\frac{P(\omega_2)}{P(\omega_1)}}_{\text{“class prior”}}$ - Likelihood Ratio Test
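A minimal sketch (mine) of the likelihood ratio test: compare the likelihood ratio against a threshold built from the losses and priors.

```python
def lrt_decide(p_x_w1, p_x_w2, prior1, prior2, loss):
    # loss[i][j] = lambda_ij; threshold = (l12 - l22)/(l21 - l11) * P(w2)/P(w1)
    threshold = (loss[0][1] - loss[1][1]) / (loss[1][0] - loss[0][0]) * prior2 / prior1
    likelihood_ratio = p_x_w1 / p_x_w2
    return 1 if likelihood_ratio > threshold else 2   # decide omega_1 or omega_2

# example (assumed numbers): class-conditional densities evaluated at an observed x
print(lrt_decide(p_x_w1=0.5, p_x_w2=0.2, prior1=0.5, prior2=0.5,
                 loss=[[0.0, 1.0], [1.0, 0.0]]))
```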
LRT Example
• You are driving to Blockbuster’s to return a video due today
• It is 5 min to midnight
• You hit a red light
• You see a car that you 60% sure looks like a police car
• Traffic fine is $5 AND you are late
• Blockbuster’s fine is $10
Should you run the red light?
Minimum Risk
$P(\text{police} \mid x) = 0.6$, $P(\overline{\text{police}} \mid x) = 0.4$

You pay:
            police    not police
run         $15       $0
not run     $10       $10

$\Lambda = \begin{pmatrix} 15 & 0 \\ 10 & 10 \end{pmatrix}$

$R(\text{run} \mid x) = \lambda_{11} P(\text{police} \mid x) + \lambda_{12} P(\overline{\text{police}} \mid x) = \$9$
$R(\text{wait} \mid x) = \lambda_{21} P(\text{police} \mid x) + \lambda_{22} P(\overline{\text{police}} \mid x) = \$10$
The risk is higher if you wait
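The same computation as a minimal sketch (mine):

```python
import numpy as np

posteriors = np.array([0.6, 0.4])          # [P(police | x), P(not police | x)]
Lambda = np.array([[15.0,  0.0],           # row 0: run  (pay $15 if police, $0 otherwise)
                   [10.0, 10.0]])          # row 1: wait (pay the $10 Blockbuster fine either way)

risks = Lambda @ posteriors                # [R(run | x), R(wait | x)] = [9.0, 10.0]
action = ["run", "wait"][int(np.argmin(risks))]
print(risks, action)                       # running the light has the lower expected cost
```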
LRT Way
Let’s say we have this “policeness” feature.

[Figure: class-conditional densities over the “policeness” feature with the decision threshold marked; the regions are labeled OK and WAIT]

$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{>}}\; \frac{(\lambda_{12} - \lambda_{22})}{(\lambda_{21} - \lambda_{11})} \frac{P(\omega_2)}{P(\omega_1)}$
LRT Example
[Figure: two settings of the priors shift the decision threshold, changing the extent of the OK and WAIT regions]

Threshold is dependent on priors:

$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{>}}\; \frac{(\lambda_{12} - \lambda_{22})}{(\lambda_{21} - \lambda_{11})} \frac{P(\omega_2)}{P(\omega_1)}$
LRT Example
$\Lambda = \begin{pmatrix} 15 & -60 \\ 10 & 10 \end{pmatrix}$ versus $\Lambda = \begin{pmatrix} 15 & 0 \\ 10 & 10 \end{pmatrix}$

[Figure: the two loss matrices give different decision thresholds, changing the extent of the OK and WAIT regions]

Threshold is dependent on loss:

$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{>}}\; \frac{(\lambda_{12} - \lambda_{22})}{(\lambda_{21} - \lambda_{11})} \frac{P(\omega_2)}{P(\omega_1)}$
Minimum Error Rate Classification
Let’s simplify the Min. Risk classification:

$\Lambda = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$ - zero-one loss, just counts errors

Then the conditional risk becomes:

$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \ne i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$

So the $\omega_i$ with the highest posterior minimizes the risk:

Decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \ne i$ - good ol’ Bayes decision rule
Minimax Criterion
Is there a decision rule such that the risk is insensitive to priors?
$R = \int_{R_1} \left[ \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x) \right] p(x)\, dx + \int_{R_2} \left[ \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x) \right] p(x)\, dx$

After some algebra:

$R(P(\omega_1)) = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{R_1} p(x \mid \omega_2)\, dx + P(\omega_1)\, f(T)$ - minimax risk

Goal – find the decision boundary $T$ such that $f(T) = 0$.
At the boundary for which the minimum risk is maximal, the risk is
independent of priors.
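A minimal sketch (mine) of finding the minimax boundary numerically for two assumed Gaussian class-conditionals. Under 0-1 loss (my assumption, not spelled out on the slide) the condition $f(T) = 0$ reduces to equal error integrals, $\int_{R_2} p(x \mid \omega_1)\, dx = \int_{R_1} p(x \mid \omega_2)\, dx$:

```python
import math

def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# assumed class-conditional densities: p(x|w1) = N(-1, 1), p(x|w2) = N(1, 1)
mu1, s1 = -1.0, 1.0
mu2, s2 =  1.0, 1.0

def f(T):
    # under 0-1 loss, f(T) = P(x in R2 | w1) - P(x in R1 | w2)
    # with R1 = {x < T} (decide w1) and R2 = {x > T} (decide w2)
    return (1.0 - norm_cdf(T, mu1, s1)) - norm_cdf(T, mu2, s2)

# bisection search for the minimax boundary T with f(T) = 0
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if f(lo) * f(mid) <= 0:
        hi = mid
    else:
        lo = mid
T_minimax = 0.5 * (lo + hi)   # approximately 0 for these symmetric assumptions
```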
Messy Illustration of the Minimax Solution
[Figure: the Bayes risk $R(P(\omega_1))$, plotted over $P(\omega_1) \in [0, 1]$, is concave down; the line $R_{mm} + P(\omega_1) f(T)$ cannot be smaller than the Bayes risk. For a boundary with $f(T) = 0$ the line is horizontal, touching the curve at its maximum; for $f(T) \ne 0$ the line is tilted]

Worst possible Bayes risk is independent of priors.
Discriminant Functions and Decision Surfaces
Discriminant functions conveniently represent classifiers:
$C = \{g_1(\mathbf{x}), g_2(\mathbf{x}), \ldots, g_n(\mathbf{x})\}$, decide $\omega_i$ with $i = \arg\max_i\, g_i(\mathbf{x})$

E.g.:
$g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$
$g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x})$
$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$
Discriminants DO NOT have to relate to probabilities.
Discriminant for Binary Classification
For two-class problem:
$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$

Then (assuming that classes are encoded as $-1$ and $+1$):
$\omega_i = \operatorname{sign}(g(\mathbf{x}))$

E.g.:
$g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})$
$g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$
Boundary for Two Normal Distributions
If we assume a Gaussian for a class model:
$p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \mu_i)^{\top} \Sigma_i^{-1} (\mathbf{x} - \mu_i) \right)$

… and the minimum error rate classifier:

$g_i(\mathbf{x}) = \ln\left[ p(\mathbf{x} \mid \omega_i)\, P(\omega_i) \right] = -\frac{1}{2} (\mathbf{x} - \mu_i)^{\top} \Sigma_i^{-1} (\mathbf{x} - \mu_i) - \ln\left[ (2\pi)^{d/2} |\Sigma_i|^{1/2} \right] + \ln P(\omega_i)$
Boundary Between Two Normal Distributions
Discriminant (after some algebra):
$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = \mathbf{x}^{\top} W \mathbf{x} + \mathbf{w}^{\top} \mathbf{x} + w_0$ - quadratic

where

$W = -\frac{1}{2}\left( \Sigma_1^{-1} - \Sigma_2^{-1} \right)$ - a matrix
$\mathbf{w} = \Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2$ - a vector
$w_0 = \ldots$ well, the rest of it - a scalar

Special cases:
1) $\Sigma_1 = \Sigma_2 \;\Rightarrow\; W = 0 \;\Rightarrow\; g(\mathbf{x})$ - linear
2) $\Sigma_i = \sigma_i I \;\Rightarrow\; W = \left[ \frac{1}{2\sigma_2} - \frac{1}{2\sigma_1} \right] I = \sigma I \;\Rightarrow\; g(\mathbf{x})$ - a circle
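A minimal sketch (mine) that builds the quadratic discriminant from two assumed Gaussian class models; the "rest of it" constant $w_0$ is filled in from the log-discriminant above:

```python
import numpy as np

def quadratic_discriminant(mu1, S1, P1, mu2, S2, P2):
    """Return (W, w, w0) so that g(x) = x^T W x + w^T x + w0."""
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    W = -0.5 * (S1i - S2i)
    w = S1i @ mu1 - S2i @ mu2
    w0 = (-0.5 * mu1 @ S1i @ mu1 + 0.5 * mu2 @ S2i @ mu2
          - 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2))
          + np.log(P1 / P2))
    return W, w, w0

def g(x, W, w, w0):
    # decide omega_1 if g(x) > 0, omega_2 otherwise
    return x @ W @ x + w @ x + w0

# assumed 2-D example: equal covariances give W = 0, i.e. a linear boundary
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
S = np.eye(2)
W, w, w0 = quadratic_discriminant(mu1, S, 0.5, mu2, S, 0.5)
print(W, w, w0, g(np.array([1.0, 1.0]), W, w, w0))   # midpoint lies on the boundary: g = 0
```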
Evaluating Decisions
Is 70% classification rate good or bad?
[Figure: two pairs of class densities with the same 70% classification rate – for well-separated classes 70% is BAD, for heavily overlapping classes 70% is GOOD]

Discriminability: $d' = \frac{|\mu_1 - \mu_2|}{\sigma}$

High $d'$ means that the classes are easy to discriminate.
ROC
We do not know $\mu_1$, $\mu_2$, $\sigma$, but we can get:

$P(x > x^* \mid x \in \omega_2)$ - probability of hit
$P(x > x^* \mid x \in \omega_1)$ - probability of false alarm
$P(x < x^* \mid x \in \omega_2)$ - probability of miss
$P(x < x^* \mid x \in \omega_1)$ - probability of correct rejection
Each $x^*$ corresponds to a point on the hit/false-alarm plane.
The set of such points is called an ROC curve.
ROC Curve
[Figure: ROC curve – probability of hit versus probability of false alarm]
Reality
In practice, the ROC curve is computed by sweeping a single parameter.
Using the data for which true ω is known:
• Identify a parameter of interest
• Identify the parameter range
• Vary the parameter within the range
• Compute P(hit) and P(false_alarm) empirically for each value
It tells us how well the classifier can deal with the data set.
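A minimal sketch (mine) of computing an empirical ROC by sweeping a decision threshold over classifier scores for data with known labels:

```python
import numpy as np

def empirical_roc(scores, labels, n_thresholds=101):
    """Sweep a threshold over the scores; labels: 1 = target class (w2), 0 = other (w1)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    hits, false_alarms = [], []
    for t in thresholds:
        decided = scores > t                                 # "x > x*" decisions
        hits.append(np.mean(decided[labels == 1]))           # P(x > x* | w2), empirical
        false_alarms.append(np.mean(decided[labels == 0]))   # P(x > x* | w1), empirical
    return np.array(false_alarms), np.array(hits)

# assumed toy data: class w2 scores tend to be larger than class w1 scores
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])
labels = np.concatenate([np.zeros(500), np.ones(500)])
fa, hit = empirical_roc(scores, labels)
```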
Homework – Part I
• A “chance” puzzle
– Try to solve it
– Understand the solution
– Simulate in Matlab
• Build an ROC curve
– Almost like in class
– Can you implement it efficiently?