Bayesian Decision Theory (Classification)

Bayesian Decision Theory (Classification)
Lecturer: 虞台文
Contents
- Introduction
- Generalized Bayesian Decision Rule
- Discriminant Functions
- The Normal Distribution
- Discriminant Functions for the Normal Populations
- Minimax Criterion
- Neyman-Pearson Criterion
Bayesian Decision Theory (Classification)
Introduction
What is Bayesian Decision Theory?
- A mathematical foundation for decision making.
- A probabilistic approach to making decisions (e.g., classification) so as to minimize the risk (cost).
Preliminaries and Notations
- $\omega_i \in \{\omega_1, \omega_2, \dots, \omega_c\}$ : a state of nature (class)
- $P(\omega_i)$ : prior probability
- $\mathbf{x}$ : feature vector
- $p(\mathbf{x} \mid \omega_i)$ : class-conditional density
- $P(\omega_i \mid \mathbf{x})$ : posterior probability
Bayesian Rule
$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\,P(\omega_j)$$
Decision
$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{p(\mathbf{x})} \qquad \text{($p(\mathbf{x})$ is unimportant in making the decision)}$$
$$D(\mathbf{x}) = \arg\max_i P(\omega_i \mid \mathbf{x})$$
Decide $\omega_i$ if $P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x})$ for all $j \ne i$; equivalently, decide $\omega_i$ if $p(\mathbf{x} \mid \omega_i)P(\omega_i) > p(\mathbf{x} \mid \omega_j)P(\omega_j)$ for all $j \ne i$.
Special cases:
1. $P(\omega_1) = P(\omega_2) = \dots = P(\omega_c)$: only the class-conditional densities matter.
2. $p(\mathbf{x} \mid \omega_1) = p(\mathbf{x} \mid \omega_2) = \dots = p(\mathbf{x} \mid \omega_c)$: only the priors matter.
Two Categories
Decide $\omega_1$ if $P(\omega_1 \mid \mathbf{x}) > P(\omega_2 \mid \mathbf{x})$; otherwise decide $\omega_2$.
Decide $\omega_1$ if $p(\mathbf{x} \mid \omega_1)P(\omega_1) > p(\mathbf{x} \mid \omega_2)P(\omega_2)$; otherwise decide $\omega_2$.
Special cases:
1. $P(\omega_1) = P(\omega_2)$: decide $\omega_1$ if $p(\mathbf{x} \mid \omega_1) > p(\mathbf{x} \mid \omega_2)$; otherwise decide $\omega_2$.
2. $p(\mathbf{x} \mid \omega_1) = p(\mathbf{x} \mid \omega_2)$: decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$.
Example (figure): decision regions $\mathcal{R}_1$ and $\mathcal{R}_2$ for the equal-prior case $P(\omega_1) = P(\omega_2)$.
Example (figure): decision regions $\mathcal{R}_1$ and $\mathcal{R}_2$ for $P(\omega_1) = 2/3$, $P(\omega_2) = 1/3$; decide $\omega_1$ if $p(\mathbf{x} \mid \omega_1)P(\omega_1) > p(\mathbf{x} \mid \omega_2)P(\omega_2)$, otherwise decide $\omega_2$.
Classification Error
$$P(\text{error}) = \int p(\text{error}, \mathbf{x})\,d\mathbf{x} = \int P(\text{error} \mid \mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$
Consider two categories and decide $\omega_1$ if $P(\omega_1 \mid \mathbf{x}) > P(\omega_2 \mid \mathbf{x})$, otherwise decide $\omega_2$. Then
$$P(\text{error} \mid \mathbf{x}) = \begin{cases} P(\omega_2 \mid \mathbf{x}) & \text{if we decide } \omega_1 \\ P(\omega_1 \mid \mathbf{x}) & \text{if we decide } \omega_2 \end{cases} = \min[\,P(\omega_1 \mid \mathbf{x}),\, P(\omega_2 \mid \mathbf{x})\,]$$
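A minimal numerical sketch (not from the slides) of the two-category minimum-error-rate rule above, in Python. The 1-D Gaussian class-conditional densities and the priors are assumptions chosen only for illustration:

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities and priors (assumed, not from the slides).
p_x_given_w1 = norm(loc=0.0, scale=1.0).pdf   # p(x | w1)
p_x_given_w2 = norm(loc=2.0, scale=1.0).pdf   # p(x | w2)
P_w1, P_w2 = 2/3, 1/3                         # prior probabilities

def decide(x):
    """Decide w1 if p(x|w1)P(w1) > p(x|w2)P(w2); otherwise decide w2."""
    return 1 if p_x_given_w1(x) * P_w1 > p_x_given_w2(x) * P_w2 else 2

def p_error_given_x(x):
    """P(error|x) = min[P(w1|x), P(w2|x)]."""
    joint1 = p_x_given_w1(x) * P_w1
    joint2 = p_x_given_w2(x) * P_w2
    return min(joint1, joint2) / (joint1 + joint2)

print(decide(0.5), p_error_given_x(0.5))
```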
Bayesian Decision Theory (Classification)
Generalized Bayesian Decision Rule
The Generalization
- $\Omega = \{\omega_1, \omega_2, \dots, \omega_c\}$ : a set of $c$ states of nature
- $A = \{\alpha_1, \alpha_2, \dots, \alpha_a\}$ : a set of $a$ possible actions
- $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$ : the loss incurred for taking action $\alpha_i$ when the true state of nature is $\omega_j$ (it can be zero)
We want to minimize the expected loss in making decisions.
Risk
Conditional risk: given $\mathbf{x}$, the expected loss (risk) associated with taking action $\alpha_i$ is
$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda_{ij}\,P(\omega_j \mid \mathbf{x})$$
0/1 Loss Function
$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & \text{if } \alpha_i \text{ is the correct decision for } \omega_j \\ 1 & \text{otherwise} \end{cases}$$
With this loss, $R(\alpha_i \mid \mathbf{x}) = P(\text{error} \mid \mathbf{x})$.
Decision
Bayesian decision rule:
$$\alpha(\mathbf{x}) = \arg\min_i R(\alpha_i \mid \mathbf{x})$$
Overall Risk
$$R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$
where $\alpha(\mathbf{x})$ is the decision function. The Bayesian decision rule is the optimal one for minimizing the overall risk; its resulting overall risk is called the Bayesian risk.
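A small sketch of the general rule $\alpha(\mathbf{x}) = \arg\min_i R(\alpha_i \mid \mathbf{x})$, assuming the posteriors for one $\mathbf{x}$ are already available as a vector and the loss matrix $\lambda_{ij}$ is given; the numbers are made up for illustration:

```python
import numpy as np

def bayes_decision(posteriors, loss):
    """
    posteriors : shape (c,), P(w_j | x) for j = 1..c
    loss       : shape (a, c), loss[i, j] = lambda(alpha_i | w_j)
    Returns the index of the action minimizing the conditional risk
    R(alpha_i | x) = sum_j loss[i, j] * P(w_j | x).
    """
    risks = loss @ posteriors          # conditional risk of each action
    return int(np.argmin(risks))

# Illustrative 3-class example with a 0/1 loss (so the rule reduces to argmax posterior).
posteriors = np.array([0.2, 0.5, 0.3])
zero_one_loss = 1.0 - np.eye(3)
print(bayes_decision(posteriors, zero_one_loss))   # action index 1, the largest posterior
```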
Two-Category Classification
State of nature: $\Omega = \{\omega_1, \omega_2\}$. Actions: $A = \{\alpha_1, \alpha_2\}$.
Loss function:
$$\begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix} \qquad \text{(rows: actions } \alpha_1, \alpha_2\text{; columns: states } \omega_1, \omega_2\text{)}$$
$$R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})$$
$$R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x})$$
Two-Category Classification
Perform $\alpha_1$ if $R(\alpha_2 \mid \mathbf{x}) > R(\alpha_1 \mid \mathbf{x})$; otherwise perform $\alpha_2$. That is, perform $\alpha_1$ if
$$\lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x}) > \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})$$
$$(\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid \mathbf{x}) > (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid \mathbf{x})$$
Both $(\lambda_{21} - \lambda_{11})$ and $(\lambda_{12} - \lambda_{22})$ are positive, so the posterior probabilities are scaled before comparison. Substituting $P(\omega_i \mid \mathbf{x}) = p(\mathbf{x} \mid \omega_i)P(\omega_i)/p(\mathbf{x})$ (the factor $p(\mathbf{x})$ is irrelevant):
$$(\lambda_{21} - \lambda_{11})\,p(\mathbf{x} \mid \omega_1)P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(\mathbf{x} \mid \omega_2)P(\omega_2)$$
Equivalently (this will be recalled later), perform $\alpha_1$ if the likelihood ratio exceeds the threshold:
$$\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}$$
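The same two-category rule written as a likelihood-ratio test in code; the densities, losses, and priors below are assumptions used only to exercise the formula:

```python
from scipy.stats import norm

p1 = norm(0.0, 1.0).pdf     # p(x | w1), assumed
p2 = norm(2.0, 1.0).pdf     # p(x | w2), assumed
P1, P2 = 0.5, 0.5           # priors, assumed
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0   # losses lambda_ij, assumed

threshold = (l12 - l22) * P2 / ((l21 - l11) * P1)

def decide(x):
    """Perform alpha_1 if the likelihood ratio exceeds the threshold."""
    return 1 if p1(x) / p2(x) > threshold else 2

print(threshold, decide(0.2), decide(1.8))
```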
Bayesian Decision Theory (Classification)
Discriminant Functions
How to define discriminant functions?
The Multicategory Classification
(Figure: the feature vector $\mathbf{x}$ is fed into $c$ discriminant functions $g_1(\mathbf{x}), g_2(\mathbf{x}), \dots, g_c(\mathbf{x})$, and the maximum selects the action, e.g., classification.)
The $g_i(\mathbf{x})$'s are called the discriminant functions. Assign $\mathbf{x}$ to $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \ne i$.
If $f(\cdot)$ is a monotonically increasing function, then the $f(g_i(\cdot))$'s are also discriminant functions.
Simple Discriminant Functions
Minimum-risk case:
$$g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$$
Minimum error-rate case:
$$g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x}), \qquad g_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)P(\omega_i), \qquad g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$$
Decision Regions
$$\mathcal{R}_i = \{\mathbf{x} \mid g_i(\mathbf{x}) > g_j(\mathbf{x}) \;\; \forall j \ne i\}$$
Two-category example (figure): the decision regions are separated by decision boundaries.
Bayesian Decision Theory (Classification)
The Normal Distribution
Basics of Probability
Discrete random variable $X$ (assume integer-valued):
- Probability mass function (pmf): $p(x) = P(X = x)$
- Cumulative distribution function (cdf): $F(x) = P(X \le x) = \sum_{t=-\infty}^{x} p(t)$
Continuous random variable $X$:
- Probability density function (pdf): $p(x)$ or $f(x)$, not a probability
- Cumulative distribution function (cdf): $F(x) = P(X \le x) = \int_{-\infty}^{x} p(t)\,dt$
Expectations
Let $g$ be a function of the random variable $X$:
$$E[g(X)] = \begin{cases} \sum_{x} g(x)\,p(x) & X \text{ is discrete} \\ \int_{-\infty}^{\infty} g(x)\,p(x)\,dx & X \text{ is continuous} \end{cases}$$
The $k$th moment: $E[X^k]$. The 1st moment: $\mu_X = E[X]$. The $k$th central moment: $E[(X - \mu_X)^k]$.
Fact: $\operatorname{Var}[X] = E[X^2] - (E[X])^2$.
Important Expectations
Mean:
$$\mu_X = E[X] = \begin{cases} \sum_{x} x\,p(x) & X \text{ is discrete} \\ \int_{-\infty}^{\infty} x\,p(x)\,dx & X \text{ is continuous} \end{cases}$$
Variance:
$$\sigma_X^2 = \operatorname{Var}[X] = E[(X - \mu_X)^2] = \begin{cases} \sum_{x} (x - \mu_X)^2\,p(x) & X \text{ is discrete} \\ \int_{-\infty}^{\infty} (x - \mu_X)^2\,p(x)\,dx & X \text{ is continuous} \end{cases}$$
Entropy
The entropy measures the fundamental uncertainty in the value of points selected randomly from a distribution:
$$H[X] = \begin{cases} -\sum_{x} p(x)\ln p(x) & X \text{ is discrete} \\ -\int_{-\infty}^{\infty} p(x)\ln p(x)\,dx & X \text{ is continuous} \end{cases}$$
Properties (why the normal distribution matters):
1. It maximizes the entropy (among distributions with a given mean and variance).
2. Central limit theorem.
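A quick sketch of the discrete entropy formula above (natural logarithm, as in the slide); the example distributions are arbitrary:

```python
import numpy as np

def entropy(p):
    """H[X] = -sum_x p(x) ln p(x) for a discrete distribution p (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

print(entropy([0.5, 0.5]))        # ln 2 ~ 0.693
print(entropy([1.0, 0.0, 0.0]))   # 0: no uncertainty
```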
Univariate Gaussian Distribution
$X \sim N(\mu, \sigma^2)$:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
$E[X] = \mu$, $\operatorname{Var}[X] = \sigma^2$.
(Figure: the bell-shaped curve $p(x)$ with markers at $\mu \pm \sigma$, $\mu \pm 2\sigma$, $\mu \pm 3\sigma$.)
Random Vectors
A $d$-dimensional random vector $\mathbf{X} = (X_1, X_2, \dots, X_d)^T$, $\mathbf{X}: \Omega \to \mathbb{R}^d$.
Vector mean:
$$\boldsymbol{\mu} = E[\mathbf{X}] = (\mu_1, \mu_2, \dots, \mu_d)^T$$
Covariance matrix:
$$\boldsymbol{\Sigma} = E[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{pmatrix}$$
Multivariate Gaussian Distribution
$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, a $d$-dimensional random vector:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$$
$E[\mathbf{X}] = \boldsymbol{\mu}$, $E[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T] = \boldsymbol{\Sigma}$.
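A direct sketch of the $d$-dimensional density formula using plain numpy (scipy.stats.multivariate_normal provides an equivalent routine); the $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ below are illustrative:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    d = mu.size
    diff = x - mu
    r2 = diff @ np.linalg.solve(Sigma, diff)        # squared Mahalanobis distance
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * r2) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(gaussian_pdf([1.0, -0.5], mu, Sigma))
```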
Properties of $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, a $d$-dimensional random vector. Let $\mathbf{Y} = \mathbf{A}^T\mathbf{X}$, where $\mathbf{A}$ is a $d \times k$ matrix. Then
$$\mathbf{Y} \sim N(\mathbf{A}^T\boldsymbol{\mu},\, \mathbf{A}^T\boldsymbol{\Sigma}\mathbf{A})$$
On Parameters of $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, $\mathbf{X} = (X_1, X_2, \dots, X_d)^T$:
$$\boldsymbol{\mu} = E[\mathbf{X}] = (\mu_1, \mu_2, \dots, \mu_d)^T, \qquad \mu_i = E[X_i]$$
$$\boldsymbol{\Sigma} = E[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T] = [\sigma_{ij}]_{d\times d}$$
$$\sigma_{ij} = E[(X_i-\mu_i)(X_j-\mu_j)] = \operatorname{Cov}(X_i, X_j), \qquad \sigma_{ii} = \sigma_i^2 = E[(X_i-\mu_i)^2] = \operatorname{Var}(X_i)$$
$$X_i \perp X_j \;\Rightarrow\; \sigma_{ij} = 0$$
More On Covariance Matrix
$\boldsymbol{\Sigma}$ is symmetric and positive semidefinite:
$$\boldsymbol{\Sigma} = \boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^T = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{1/2}\boldsymbol{\Lambda}^{1/2}\boldsymbol{\Phi}^T = (\boldsymbol{\Phi}\boldsymbol{\Lambda}^{1/2})(\boldsymbol{\Phi}\boldsymbol{\Lambda}^{1/2})^T$$
$\boldsymbol{\Phi}$: orthonormal matrix whose columns are the eigenvectors of $\boldsymbol{\Sigma}$.
$\boldsymbol{\Lambda}$: diagonal matrix of the eigenvalues.
Whitening Transform
$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $\mathbf{Y} = \mathbf{A}^T\mathbf{X} \sim N(\mathbf{A}^T\boldsymbol{\mu},\, \mathbf{A}^T\boldsymbol{\Sigma}\mathbf{A})$. Let $\mathbf{A}_w = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2}$ (projection onto the eigenvectors followed by scaling: a whitening linear transform). Then
$$\mathbf{A}_w^T\boldsymbol{\Sigma}\mathbf{A}_w = (\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2})^T(\boldsymbol{\Phi}\boldsymbol{\Lambda}^{1/2})(\boldsymbol{\Phi}\boldsymbol{\Lambda}^{1/2})^T(\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2}) = \mathbf{I}$$
$$\mathbf{A}_w^T\mathbf{X} \sim N(\mathbf{A}_w^T\boldsymbol{\mu},\, \mathbf{I})$$
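A sketch of the whitening transform $\mathbf{A}_w = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2}$ via numpy's eigendecomposition; the covariance matrix is an assumed example, used only to verify that $\mathbf{A}_w^T\boldsymbol{\Sigma}\mathbf{A}_w \approx \mathbf{I}$:

```python
import numpy as np

def whitening_matrix(Sigma):
    """Return A_w = Phi Lambda^(-1/2), where Sigma = Phi Lambda Phi^T (Sigma positive definite)."""
    eigvals, Phi = np.linalg.eigh(Sigma)      # eigh: symmetric eigendecomposition
    return Phi @ np.diag(eigvals ** -0.5)

Sigma = np.array([[4.0, 1.5], [1.5, 1.0]])    # assumed covariance
Aw = whitening_matrix(Sigma)
print(np.round(Aw.T @ Sigma @ Aw, 6))         # ~ identity matrix
```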
Mahalanobis Distance
$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$:
$$p(\mathbf{x}) = \underbrace{\frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}}_{\text{constant}}\exp\!\Big[-\tfrac{1}{2}\underbrace{(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}_{r^2}\Big]$$
The density depends on $\mathbf{x}$ only through the value of $r^2 = (\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$, the squared Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$.
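The squared Mahalanobis distance $r^2$ as a small helper, following the same conventions as the density sketch above:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """r^2 = (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = np.asarray(x, float) - np.asarray(mu, float)
    return float(diff @ np.linalg.solve(Sigma, diff))
```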
Bayesian Decision Theory (Classification)
Discriminant Functions for the Normal Populations
Minimum-Error-Rate Classification
Recall the discriminant functions $g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x})$, $g_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)P(\omega_i)$, and $g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$.
For $\mathbf{X}_i \sim N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:
$$p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right]$$
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$
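A sketch of the Gaussian discriminant $g_i(\mathbf{x})$ above; the means, covariances, and priors are whatever the caller supplies, so this is only an illustration of the formula, not a fitted classifier:

```python
import numpy as np

def gaussian_discriminant(x, mu_i, Sigma_i, prior_i):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - d/2 ln(2*pi) - 1/2 ln|Sigma| + ln P(w_i)."""
    x, mu_i = np.asarray(x, float), np.asarray(mu_i, float)
    d = mu_i.size
    diff = x - mu_i
    r2 = diff @ np.linalg.solve(Sigma_i, diff)
    return (-0.5 * r2
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma_i))
            + np.log(prior_i))

def classify(x, mus, Sigmas, priors):
    """Assign x to the class with the largest discriminant value."""
    scores = [gaussian_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores))
```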
Minimum-Error-Rate Classification: Three Cases
- Case 1: $\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$. Classes are centered at different means; their feature components are pairwise independent and have the same variance.
- Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$. Classes are centered at different means but have the same covariance.
- Case 3: $\boldsymbol{\Sigma}_i$ arbitrary.
Case 1. $\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$
Here $\boldsymbol{\Sigma}_i^{-1} = \frac{1}{\sigma^2}\mathbf{I}$, and the $\frac{d}{2}\ln 2\pi$ and $\frac{1}{2}\ln|\boldsymbol{\Sigma}_i|$ terms are the same for all classes (irrelevant), so
$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i) = -\frac{1}{2\sigma^2}\left(\mathbf{x}^T\mathbf{x} - 2\boldsymbol{\mu}_i^T\mathbf{x} + \boldsymbol{\mu}_i^T\boldsymbol{\mu}_i\right) + \ln P(\omega_i)$$
The $\mathbf{x}^T\mathbf{x}$ term is also irrelevant (identical for every class), leaving
$$g_i(\mathbf{x}) = \frac{1}{\sigma^2}\boldsymbol{\mu}_i^T\mathbf{x} - \frac{1}{2\sigma^2}\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + \ln P(\omega_i)$$
This is a linear discriminant:
$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \frac{1}{\sigma^2}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2\sigma^2}\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + \ln P(\omega_i)$$
Boundary between $\omega_i$ and $\omega_j$: setting $g_i(\mathbf{x}) = g_j(\mathbf{x})$ gives
$$\mathbf{w}_i^T\mathbf{x} + w_{i0} = \mathbf{w}_j^T\mathbf{x} + w_{j0} \;\Longleftrightarrow\; (\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} = w_{j0} - w_{i0}$$
$$(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T\mathbf{x} - \tfrac{1}{2}(\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i - \boldsymbol{\mu}_j^T\boldsymbol{\mu}_j) = -\sigma^2\ln\frac{P(\omega_i)}{P(\omega_j)}$$
$$(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T\mathbf{x} - \tfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j)^T(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) = -\frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
The decision boundary is therefore a hyperplane perpendicular to the line between the means.
Case 1. $\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$
The boundary can be written as $\mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0$, with
$$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j, \qquad \mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
The first term of $\mathbf{x}_0$ is the midpoint of the two means; the second term is 0 if $P(\omega_i) = P(\omega_j)$.
(Figure: the hyperplane through $\mathbf{x}_0$ with normal $\mathbf{w}$, separating $\omega_i$ and $\omega_j$.)
2
P(1 )
x 0  (μ1  μ 2 ) 
ln
(μ1  μ 2 )
2
|| μ1  μ 2 ||
P(2 )
1
2
Case 1. i =
2
I
P(1 )  P(2 )
Minimum distance classifier (template matching)
2
P(1 )
x 0  (μ1  μ 2 ) 
ln
(μ1  μ 2 )
2
|| μ1  μ 2 ||
P(2 )
1
2
Case 1. i =
P(1 )  P(2 )
2
I
2
P(1 )
x 0  (μ1  μ 2 ) 
ln
(μ1  μ 2 )
2
|| μ1  μ 2 ||
P(2 )
1
2
Case 1. i =
P(1 )  P(2 )
2
I
2
P(1 )
x 0  (μ1  μ 2 ) 
ln
(μ1  μ 2 )
2
|| μ1  μ 2 ||
P(2 )
1
2
Case 1. i =
P(1 )  P(2 )
2
I
Demo
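A sketch of the Case 1 linear machine, $g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$ with $\mathbf{w}_i = \boldsymbol{\mu}_i/\sigma^2$; with equal priors it behaves as the minimum-distance (template-matching) classifier. The means, $\sigma^2$, and priors are illustrative assumptions:

```python
import numpy as np

def case1_classifier(mus, sigma2, priors):
    """Build the linear discriminants for Sigma_i = sigma^2 I."""
    W = np.stack([mu / sigma2 for mu in mus])                     # rows are w_i
    w0 = np.array([-(mu @ mu) / (2 * sigma2) + np.log(p)
                   for mu, p in zip(mus, priors)])                # w_i0
    return lambda x: int(np.argmax(W @ np.asarray(x, float) + w0))

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
clf = case1_classifier(mus, sigma2=1.0, priors=[0.5, 0.5])
print(clf([1.0, 1.0]), clf([2.5, 2.0]))   # nearest-mean behaviour with equal priors
```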
Case 2. $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$
Now only the $\frac{d}{2}\ln 2\pi$ and $\frac{1}{2}\ln|\boldsymbol{\Sigma}|$ terms are irrelevant, so
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i)$$
The first term involves the squared Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}_i$; the $\ln P(\omega_i)$ term is irrelevant if $P(\omega_i) = P(\omega_j)$ for all $i, j$. Expanding,
$$g_i(\mathbf{x}) = -\frac{1}{2}\left(\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - 2\boldsymbol{\mu}_i^T\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_i^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i\right) + \ln P(\omega_i)$$
and the $\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}$ term is irrelevant, leaving the linear discriminant
$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\tfrac{1}{2}\boldsymbol{\mu}_i^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)$$
The boundary between $\omega_i$ and $\omega_j$ ($g_i(\mathbf{x}) = g_j(\mathbf{x})$) is again a hyperplane $\mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0$, with
$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j), \qquad \mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln[P(\omega_i)/P(\omega_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
(Figure and demo.)
Case 3. $\boldsymbol{\Sigma}_i \ne \boldsymbol{\Sigma}_j$ (arbitrary covariances)
Only the $\frac{d}{2}\ln 2\pi$ term is irrelevant:
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$
This is a quadratic discriminant,
$$g_i(\mathbf{x}) = \mathbf{x}^T\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$
with
$$\mathbf{W}_i = -\tfrac{1}{2}\boldsymbol{\Sigma}_i^{-1}, \qquad \mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\tfrac{1}{2}\boldsymbol{\mu}_i^T\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i - \tfrac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$
The quadratic term $\mathbf{x}^T\mathbf{W}_i\mathbf{x}$ is absent in Cases 1 and 2. Decision surfaces are hyperquadrics, e.g. hyperplanes, hyperspheres, hyperellipsoids, and hyperhyperboloids. Non-simply connected decision regions can arise even in one dimension for Gaussians having unequal variance.
(Figures and demo: examples of quadratic decision boundaries.)
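A sketch of the Case 3 quadratic discriminant with the coefficients above; the parameters are supplied by the caller and are not from any particular example in the slides:

```python
import numpy as np

def case3_discriminant(mu_i, Sigma_i, prior_i):
    """Return (W_i, w_i, w_i0) for g_i(x) = x^T W_i x + w_i^T x + w_i0."""
    mu_i = np.asarray(mu_i, float)
    Sigma_inv = np.linalg.inv(Sigma_i)
    W_i = -0.5 * Sigma_inv
    w_i = Sigma_inv @ mu_i
    w_i0 = (-0.5 * mu_i @ Sigma_inv @ mu_i
            - 0.5 * np.log(np.linalg.det(Sigma_i))
            + np.log(prior_i))
    return W_i, w_i, w_i0

def g(x, params):
    """Evaluate the quadratic discriminant at x."""
    W_i, w_i, w_i0 = params
    x = np.asarray(x, float)
    return float(x @ W_i @ x + w_i @ x + w_i0)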
Multi-Category Classification
Bayesian Decision Theory (Classification)
Minimax Criterion
Bayesian Decision Rule (two-category classification): decide $\omega_1$ if the likelihood ratio exceeds the threshold,
$$\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}$$
The minimax criterion deals with the case in which the prior probabilities are unknown.
Basic Concept on Minimax
Choose the worst-case prior probabilities (the maximum loss) and then pick the decision rule that minimizes the overall risk under them; that is, minimize the maximum possible overall risk.
Overall Risk
$$R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} = \int_{\mathcal{R}_1} R(\alpha_1 \mid \mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} + \int_{\mathcal{R}_2} R(\alpha_2 \mid \mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$
$$R = \int_{\mathcal{R}_1}[\lambda_{11}P(\omega_1 \mid \mathbf{x}) + \lambda_{12}P(\omega_2 \mid \mathbf{x})]\,p(\mathbf{x})\,d\mathbf{x} + \int_{\mathcal{R}_2}[\lambda_{21}P(\omega_1 \mid \mathbf{x}) + \lambda_{22}P(\omega_2 \mid \mathbf{x})]\,p(\mathbf{x})\,d\mathbf{x}$$
Using $P(\omega_i \mid \mathbf{x})\,p(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)P(\omega_i)$:
$$R = \int_{\mathcal{R}_1}[\lambda_{11}P(\omega_1)p(\mathbf{x} \mid \omega_1) + \lambda_{12}P(\omega_2)p(\mathbf{x} \mid \omega_2)]\,d\mathbf{x} + \int_{\mathcal{R}_2}[\lambda_{21}P(\omega_1)p(\mathbf{x} \mid \omega_1) + \lambda_{22}P(\omega_2)p(\mathbf{x} \mid \omega_2)]\,d\mathbf{x}$$
Substituting $P(\omega_2) = 1 - P(\omega_1)$:
$$R = \int_{\mathcal{R}_1}\{\lambda_{11}P(\omega_1)p(\mathbf{x} \mid \omega_1) + \lambda_{12}[1 - P(\omega_1)]p(\mathbf{x} \mid \omega_2)\}\,d\mathbf{x} + \int_{\mathcal{R}_2}\{\lambda_{21}P(\omega_1)p(\mathbf{x} \mid \omega_1) + \lambda_{22}[1 - P(\omega_1)]p(\mathbf{x} \mid \omega_2)\}\,d\mathbf{x}$$
Collecting the terms in $P(\omega_1)$ and using $\int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_i)\,d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_i)\,d\mathbf{x} = 1$:
$$R = \lambda_{12}\int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} + \lambda_{22}\int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} + P(\omega_1)\left[\lambda_{11}\int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} - \lambda_{12}\int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} + \lambda_{21}\int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} - \lambda_{22}\int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x}\right]$$
Overall Risk
Rewriting with $\int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} = 1 - \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x}$ and $\int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} = 1 - \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x}$, the risk as a function of $P(\omega_1)$ is
$$R[P(\omega_1)] = \lambda_{22} + (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} + P(\omega_1)\left[(\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11})\int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} - (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x}\right]$$
For a fixed decision boundary, the overall risk is a linear function of the prior, $R[P(\omega_1)] = a\,P(\omega_1) + b$, where both the slope $a$ and the intercept $b$ depend on the setting of the decision boundary (the regions $\mathcal{R}_1$ and $\mathcal{R}_2$). The expression above gives the overall risk for a particular $P(\omega_1)$.
Minimax solution: choose the decision boundary so that the bracketed coefficient of $P(\omega_1)$ is zero. The overall risk is then independent of the value of $P(\omega_1)$, and the remaining constant term equals $R_{mm}$, the minimax risk.
Minimax Risk
$$R_{mm} = \lambda_{22} + (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} = \lambda_{11} + (\lambda_{21} - \lambda_{11})\int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x}$$
Error Probability
With the 0/1 loss function, the overall risk becomes the error probability:
$$P_{\text{error}}[P(\omega_1)] = \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} + P(\omega_1)\left[\int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} - \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x}\right]$$
Minimax Error-Probability
With the 0/1 loss, the minimax solution makes the bracketed term zero, i.e. it equalizes the two conditional errors:
$$P_{mm}(\text{error}) = \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} = \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x}$$
that is, $P(\alpha_1 \mid \omega_2) = P(\alpha_2 \mid \omega_1)$.
(Figure: the boundary between $\mathcal{R}_1$ and $\mathcal{R}_2$ placed so that the two error areas under $p(\mathbf{x} \mid \omega_1)$ and $p(\mathbf{x} \mid \omega_2)$ are equal.)
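A numerical sketch of the minimax condition under the 0/1 loss: slide a 1-D decision boundary until the two error integrals $\int_{\mathcal{R}_2} p(x \mid \omega_1)\,dx$ and $\int_{\mathcal{R}_1} p(x \mid \omega_2)\,dx$ are equal. The Gaussian densities are assumed for illustration:

```python
from scipy.stats import norm
from scipy.optimize import brentq

# Assumed 1-D class-conditional densities; R1 = (-inf, t], R2 = (t, inf).
w1 = norm(0.0, 1.0)
w2 = norm(2.0, 1.0)

def error_gap(t):
    p_21 = w1.sf(t)      # integral of p(x|w1) over R2, i.e. P(alpha_2 | w1)
    p_12 = w2.cdf(t)     # integral of p(x|w2) over R1, i.e. P(alpha_1 | w2)
    return p_21 - p_12

t_mm = brentq(error_gap, -10, 10)   # boundary where the two errors are equal
print(t_mm, w1.sf(t_mm))            # minimax boundary and minimax error probability
```

For this symmetric example the root lands at the midpoint of the two means, t = 1.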
Bayesian Decision Theory (Classification)
Neyman-Pearson Criterion
Bayesian Decision Rule (two-category classification): decide $\omega_1$ if the likelihood ratio exceeds the threshold,
$$\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}$$
The Neyman-Pearson criterion deals with the case in which both the loss functions and the prior probabilities are unknown.
Signal Detection Theory
- The theory of signal detection evolved from the development of communications and radar equipment during the first half of the last century.
- It migrated to psychology, initially as part of sensation and perception, in the 1950s and 60s, as an attempt to understand some features of human behavior when detecting very faint stimuli that were not being explained by traditional theories of thresholds.
The situation of interest
- A person is faced with a stimulus (signal) that is very faint or confusing.
- The person must decide whether the signal is there or not.
- What makes this situation confusing and difficult is the presence of other "mess" that is similar to the signal. Let us call this mess noise.
Example
Noise is present both in the
environment and in the sensory
system of the observer.
The observer reacts to the
momentary total activation of
the sensory system, which
fluctuates from moment to
moment, as well as responding
to environmental stimuli, which
may include a signal.
Example
- A radiologist is examining a CT scan, looking for evidence of a tumor. This is a hard job, because there is always some uncertainty.
- There are four possible outcomes:
  – hit (tumor present and doctor says "yes")
  – miss (tumor present and doctor says "no")
  – false alarm (tumor absent and doctor says "yes")
  – correct rejection (tumor absent and doctor says "no")
  The miss and the false alarm are the two types of error.
Signal detection theory was developed to help us understand how a continuous and ambiguous signal can lead to a binary yes/no decision.
The Four Cases

                      Signal (tumor)
  Decision       Absent (ω1)                    Present (ω2)
  No (α1)        Correct Rejection  P(α1|ω1)    Miss  P(α1|ω2)
  Yes (α2)       False Alarm        P(α2|ω1)    Hit   P(α2|ω2)
Discriminability
$$d' = \frac{|\mu_2 - \mu_1|}{\sigma}$$
Decision Making
(Figure: the noise distribution and the noise-plus-signal distribution, their means separated by $d'$; a criterion, based on expectancy (decision bias), splits the axis into "No ($\alpha_1$)" and "Yes ($\alpha_2$)". The area of the signal distribution beyond the criterion is the hit rate $P(\alpha_2 \mid \omega_2)$; the area of the noise distribution beyond it is the false-alarm rate $P(\alpha_2 \mid \omega_1)$.)
ROC Curve (Receiver Operating Characteristic)
The ROC curve plots the hit rate $P_H = P(\alpha_2 \mid \omega_2)$ against the false-alarm rate $P_{FA} = P(\alpha_2 \mid \omega_1)$ as the criterion is varied.
Neyman-Pearson Criterion
NP: maximize $P_H$ subject to $P_{FA} \le a$.
(Figure: the corresponding operating point on the ROC curve, with $P_H = P(\alpha_2 \mid \omega_2)$ and $P_{FA} = P(\alpha_2 \mid \omega_1)$.)
Likelihood Ratio Test
$$\phi(\mathbf{x}) = \begin{cases} 0 & \text{if } \dfrac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > T \\[6pt] 1 & \text{if } \dfrac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} < T \end{cases}$$
$$\mathcal{R}_1 = \{\mathbf{x} \mid p(\mathbf{x} \mid \omega_1) > T\,p(\mathbf{x} \mid \omega_2)\}, \qquad \mathcal{R}_2 = \{\mathbf{x} \mid p(\mathbf{x} \mid \omega_1) < T\,p(\mathbf{x} \mid \omega_2)\}$$
where $T$ is a threshold that meets the $P_{FA}$ constraint ($\le a$). How do we determine $T$?
$$P_{FA} = E[\phi(\mathbf{X}) \mid \omega_1], \qquad P_H = E[\phi(\mathbf{X}) \mid \omega_2]$$
Likelihood Ratio Test
$$P_{FA} = \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} = \int \phi(\mathbf{x})\,p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} = E[\phi(\mathbf{X}) \mid \omega_1]$$
$$P_H = \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} = \int \phi(\mathbf{x})\,p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} = E[\phi(\mathbf{X}) \mid \omega_2]$$
(Figure: $P_H$ and $P_{FA}$ as the areas of $p(\mathbf{x} \mid \omega_2)$ and $p(\mathbf{x} \mid \omega_1)$ over the region $\mathcal{R}_2$.)
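A sketch of determining $T$ in a 1-D Gaussian setting with assumed densities: because the likelihood ratio is monotone here, $\mathcal{R}_2$ is an interval $\{x > t\}$, so $t$ can be picked directly from the allowed false-alarm level $a$, and the likelihood-ratio threshold $T$ read off afterwards:

```python
from scipy.stats import norm

# Assumed densities: omega_1 = noise, omega_2 = signal + noise.
w1 = norm(0.0, 1.0)
w2 = norm(2.0, 1.0)
a = 0.05                      # allowed false-alarm probability

# With a monotone likelihood ratio, R2 = {x > t}; pick t so that PFA = P(X > t | w1) = a.
t = w1.isf(a)                 # inverse survival function of the noise density
P_H = w2.sf(t)                # hit probability at that operating point
T = w1.pdf(t) / w2.pdf(t)     # the corresponding likelihood-ratio threshold
print(t, P_H, T)
```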
Neyman-Pearson Lemma
Consider the rule above with $T$ chosen to give $P_{FA}(\phi) = a$. There is no decision rule $\phi'$ such that $P_{FA}(\phi') \le a$ and $P_H(\phi') > P_H(\phi)$.
Proof. Let $\phi'$ be a decision rule with $P_{FA}(\phi') = E[\phi'(\mathbf{X}) \mid \omega_1] \le a$. Consider
$$\int [\phi(\mathbf{x}) - \phi'(\mathbf{x})]\,[T\,p(\mathbf{x} \mid \omega_2) - p(\mathbf{x} \mid \omega_1)]\,d\mathbf{x} \ge 0.$$
The integrand is nonnegative: where $\phi(\mathbf{x}) = 1$ we have $T\,p(\mathbf{x} \mid \omega_2) - p(\mathbf{x} \mid \omega_1) > 0$ and $\phi(\mathbf{x}) - \phi'(\mathbf{x}) \ge 0$; where $\phi(\mathbf{x}) = 0$ both factors are $\le 0$. Expanding,
$$T\!\int \phi(\mathbf{x})p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} - T\!\int \phi'(\mathbf{x})p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} - \int \phi(\mathbf{x})p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} + \int \phi'(\mathbf{x})p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} \ge 0$$
$$T\,[P_H(\phi) - P_H(\phi')] - [P_{FA}(\phi) - P_{FA}(\phi')] \ge 0.$$
Since $P_{FA}(\phi) - P_{FA}(\phi') = a - P_{FA}(\phi') \ge 0$ and $T > 0$, it follows that $P_H(\phi) \ge P_H(\phi')$.