Bayesian Decision Theory (Classification)
Lecturer: 虞台文

Contents
- Introduction
- Generalized Bayesian Decision Rule
- Discriminant Functions
- The Normal Distribution
- Discriminant Functions for the Normal Populations
- Minimax Criterion
- Neyman-Pearson Criterion

Introduction

What is Bayesian decision theory? It is a mathematical foundation for decision making: a probabilistic approach that helps us make decisions (e.g., classification) so as to minimize the risk (cost).

Preliminaries and Notation
- ω_i ∈ {ω_1, ω_2, …, ω_c}: a state of nature (class)
- P(ω_i): prior probability
- x: feature vector
- p(x | ω_i): class-conditional density
- P(ω_i | x): posterior probability

Bayes Rule

  P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x),   where p(x) = Σ_{j=1}^{c} p(x | ω_j) P(ω_j).

Decision

Since p(x) is the same for every class, it is unimportant in making the decision:

  ω(x) = argmax_i P(ω_i | x) = argmax_i p(x | ω_i) P(ω_i).

Equivalently: decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i, i.e., if p(x | ω_i) P(ω_i) > p(x | ω_j) P(ω_j) for all j ≠ i.

Special cases:
1. Equal priors, P(ω_1) = P(ω_2) = … = P(ω_c): decide ω_i if p(x | ω_i) > p(x | ω_j) for all j ≠ i (the likelihoods alone decide).
2. Equal likelihoods, p(x | ω_1) = p(x | ω_2) = … = p(x | ω_c): decide ω_i if P(ω_i) > P(ω_j) for all j ≠ i (the priors alone decide).

Two Categories

Decide ω_1 if P(ω_1 | x) > P(ω_2 | x); otherwise decide ω_2. Equivalently, decide ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2); otherwise decide ω_2.

Special cases:
1. P(ω_1) = P(ω_2): decide ω_1 if p(x | ω_1) > p(x | ω_2); otherwise decide ω_2.
2. p(x | ω_1) = p(x | ω_2): decide ω_1 if P(ω_1) > P(ω_2); otherwise decide ω_2.

Example

[Figure: class-conditional densities with the decision regions R_1 and R_2, shown for equal priors and for P(ω_1) = 2/3, P(ω_2) = 1/3; the rule "decide ω_1 if p(x | ω_1)P(ω_1) > p(x | ω_2)P(ω_2)" shifts the boundary toward the less probable class when the priors are unequal.]

Classification Error

  P(error) = ∫ p(error, x) dx = ∫ P(error | x) p(x) dx.

For two categories, under the rule "decide ω_1 if P(ω_1 | x) > P(ω_2 | x); otherwise decide ω_2":

  P(error | x) = P(ω_2 | x) if we decide ω_1; P(ω_1 | x) if we decide ω_2
               = min[ P(ω_1 | x), P(ω_2 | x) ].

Bayesian Decision Theory (Classification): Generalized Bayesian Decision Rule

The Generalization
- {ω_1, ω_2, …, ω_c}: the set of c states of nature
- {α_1, α_2, …, α_a}: the set of a possible actions
- λ_ij = λ(α_i | ω_j): the loss incurred for taking action α_i when the true state of nature is ω_j (it can be zero)

We want to minimize the expected loss in making decisions.

Conditional Risk

  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x) = Σ_{j=1}^{c} λ_ij P(ω_j | x).

Given x, this is the expected loss (risk) associated with taking action α_i.
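As a concrete illustration (not from the slides), the Bayes rule and the minimum-error-rate decision above can be sketched for a two-class problem; the 1-D Gaussian class-conditional densities and all numeric values here are assumptions made for the example.

```python
import math

# Assumed 1-D Gaussian class-conditional densities p(x|wi) (illustrative values).
def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

priors = [2 / 3, 1 / 3]            # P(w1), P(w2)
params = [(0.0, 1.0), (2.0, 1.0)]  # (mean, std) of p(x|w1), p(x|w2)

def posteriors(x):
    """Bayes rule: P(wi|x) = p(x|wi) P(wi) / p(x)."""
    joint = [gauss_pdf(x, m, s) * P for (m, s), P in zip(params, priors)]
    px = sum(joint)                # evidence p(x): a class-independent scale factor
    return [j / px for j in joint]

def decide(x):
    """Minimum-error-rate rule: pick the class with the largest posterior."""
    post = posteriors(x)
    return max(range(len(post)), key=lambda i: post[i])  # 0 -> w1, 1 -> w2
```

At x = 1, the midpoint of the two assumed means, the likelihoods are equal, so the posteriors reduce to the priors (special case 2 above) and the larger prior decides.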
0/1 Loss Function

With λ(α_i | ω_j) = 0 when α_i is the correct decision for ω_j (i.e., i = j) and 1 otherwise:

  R(α_i | x) = Σ_{j=1}^{c} λ_ij P(ω_j | x) = Σ_{j ≠ i} P(ω_j | x) = P(error | x).

Decision

Bayesian decision rule:

  α(x) = argmin_i R(α_i | x).

Overall Risk

  R = ∫ R(α(x) | x) p(x) dx,

where α(x) is the decision function. The Bayesian decision rule α(x) = argmin_i R(α_i | x) is the optimal one for minimizing the overall risk; its resulting overall risk is called the Bayes risk.

Two-Category Classification

States of nature {ω_1, ω_2}; actions {α_1, α_2}; loss function λ_ij = λ(α_i | ω_j):

  R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x)
  R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x)

Perform α_1 if R(α_2 | x) > R(α_1 | x); otherwise perform α_2. That is, perform α_1 if

  λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x) > λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x),

i.e.,

  (λ_21 − λ_11) P(ω_1 | x) > (λ_12 − λ_22) P(ω_2 | x).

Both factors (λ_21 − λ_11) and (λ_12 − λ_22) are normally positive, so the posterior probabilities are merely scaled before comparison. Substituting P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x) and cancelling the irrelevant p(x):

  (λ_21 − λ_11) p(x | ω_1) P(ω_1) > (λ_12 − λ_22) p(x | ω_2) P(ω_2).

(This result will be recalled later.)

Likelihood Ratio

Perform α_1 if

  p(x | ω_1) / p(x | ω_2) > [(λ_12 − λ_22) P(ω_2)] / [(λ_21 − λ_11) P(ω_1)],

where the right-hand side is a threshold independent of x.

Bayesian Decision Theory (Classification): Discriminant Functions

How do we define discriminant functions? For multicategory classification, compute g_1(x), g_2(x), …, g_c(x) and take the action (e.g., classification)

  ω(x): assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i.

The g_i(x)'s are called the discriminant functions. If f(·) is a monotonically increasing function, then the f(g_i(·))'s are also discriminant functions.
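The two-category minimum-risk rule above reduces to a likelihood-ratio test against a fixed threshold. A minimal sketch, with assumed losses, priors, and 1-D Gaussian class-conditional densities (all numbers illustrative):

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Assumed losses: lam[i][j] = loss of action a_{i+1} when the truth is w_{j+1}.
lam = [[0.0, 2.0],   # a1: correct for w1; missing w2 is costly
       [1.0, 0.0]]   # a2: a false alarm is cheaper than a miss
priors = [0.5, 0.5]                  # P(w1), P(w2)
params = [(0.0, 1.0), (2.0, 1.0)]    # (mean, std) of p(x|w1), p(x|w2)

# Threshold of the likelihood-ratio test:
# perform a1 iff p(x|w1)/p(x|w2) > (lam12 - lam22) P(w2) / ((lam21 - lam11) P(w1)).
T = (lam[0][1] - lam[1][1]) * priors[1] / ((lam[1][0] - lam[0][0]) * priors[0])

def min_risk_action(x):
    ratio = gauss_pdf(x, *params[0]) / gauss_pdf(x, *params[1])
    return 0 if ratio > T else 1     # 0 -> a1 (say w1), 1 -> a2 (say w2)
```

With the 0/1 loss the threshold would be P(ω_2)/P(ω_1) = 1 and the boundary would sit at the midpoint of the means; the asymmetric loss above raises the threshold to 2, enlarging the region where we take action α_2, because missing ω_2 costs more than a false alarm.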
Simple Discriminant Functions

Minimum-risk case: g_i(x) = −R(α_i | x).

Minimum error-rate case (equivalent choices):
  g_i(x) = P(ω_i | x)
  g_i(x) = p(x | ω_i) P(ω_i)
  g_i(x) = ln p(x | ω_i) + ln P(ω_i)

Decision Regions

  R_i = {x | g_i(x) > g_j(x) for all j ≠ i}.

Decision regions are separated by decision boundaries. [Figure: a two-category example.]

Bayesian Decision Theory (Classification): The Normal Distribution

Basics of Probability

Discrete random variable X (assume integer-valued):
  Probability mass function (pmf): p(x) = P(X = x)
  Cumulative distribution function (cdf): F(x) = P(X ≤ x) = Σ_{t ≤ x} p(t)

Continuous random variable X:
  Probability density function (pdf): p(x) (or f(x)); a density value is not a probability
  Cumulative distribution function (cdf): F(x) = P(X ≤ x) = ∫_{−∞}^{x} p(t) dt

Expectations

Let g be a function of the random variable X:

  E[g(X)] = Σ_x g(x) p(x)   (X discrete),
  E[g(X)] = ∫ g(x) p(x) dx   (X continuous).

The kth moment is E[X^k]; the 1st moment is the mean μ_X = E[X]; the kth central moment is E[(X − μ_X)^k].

Fact: Var[X] = E[X²] − (E[X])².

Important Expectations

Mean:
  μ_X = E[X] = Σ_x x p(x)   (discrete),   ∫ x p(x) dx   (continuous).

Variance:
  σ_X² = Var[X] = E[(X − μ_X)²] = Σ_x (x − μ_X)² p(x)   (discrete),   ∫ (x − μ_X)² p(x) dx   (continuous).

Entropy

The entropy measures the fundamental uncertainty in the value of points selected randomly from a distribution:

  H[X] = −Σ_x p(x) ln p(x)   (discrete),   H[X] = −∫ p(x) ln p(x) dx   (continuous).

Two properties that make the Gaussian special:
1. Maximum entropy: among all distributions with a given mean and variance, the Gaussian has the largest entropy.
2. The central limit theorem.

Univariate Gaussian Distribution

X ~ N(μ, σ²):

  p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)),   E[X] = μ,   Var[X] = σ².

[Figure: the bell-shaped density p(x) with markers at μ ± σ, μ ± 2σ, μ ± 3σ.]

Random Vectors

A d-dimensional random vector X = (X_1, X_2, …, X_d)^T, X taking values in R^d.

Mean vector:
  μ = E[X] = (μ_1, μ_2, …, μ_d)^T.

Covariance matrix:
  Σ = E[(X − μ)(X − μ)^T] = [σ_ij]_{d×d}, with diagonal entries σ_11 = σ_1², …, σ_dd = σ_d².

Multivariate Gaussian Distribution

X ~ N(μ, Σ), a d-dimensional random vector:

  p(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) exp(−½ (x − μ)^T Σ^{−1} (x − μ)),

with E[X] = μ and E[(X − μ)(X − μ)^T] = Σ.

Properties of N(μ, Σ)

Let Y = A^T X, where A is a d × k matrix. Then Y ~ N(A^T μ, A^T Σ A).

On the Parameters of N(μ, Σ)

  μ_i = E[X_i]
  σ_ij = E[(X_i − μ_i)(X_j − μ_j)] = Cov(X_i, X_j)
  σ_ii = σ_i² = E[(X_i − μ_i)²] = Var(X_i)
  X_i, X_j independent ⇒ σ_ij = 0

More on the Covariance Matrix

Σ is symmetric and positive semidefinite, so it has the eigendecomposition

  Σ = ΦΛΦ^T = (ΦΛ^{1/2})(ΦΛ^{1/2})^T,

where Φ is an orthonormal matrix whose columns are the eigenvectors of Σ, and Λ is the diagonal matrix of the eigenvalues.

Whitening Transform

X ~ N(μ, Σ). Take the linear transform Y = A^T X with A_w = ΦΛ^{−1/2}. Then

  A_w^T Σ A_w = Λ^{−1/2} Φ^T (ΦΛΦ^T) Φ Λ^{−1/2} = Λ^{−1/2} Λ Λ^{−1/2} = I,

so A_w^T X ~ N(A_w^T μ, I): after the rotation by Φ and per-axis rescaling by Λ^{−1/2}, the components are uncorrelated with unit variance.

Mahalanobis Distance

For X ~ N(μ, Σ), the density

  p(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) exp(−½ (x − μ)^T Σ^{−1} (x − μ))

depends on x only through the squared Mahalanobis distance

  r² = (x − μ)^T Σ^{−1} (x − μ);

the factor (2π)^{−d/2} |Σ|^{−1/2} is constant, so surfaces of constant r² are surfaces of constant density.

Bayesian Decision Theory (Classification): Discriminant Functions for the Normal Populations

Minimum-Error-Rate Classification

Use g_i(x) = ln p(x | ω_i) + ln P(ω_i). With X_i ~ N(μ_i, Σ_i),

  p(x | ω_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) exp(−½ (x − μ_i)^T Σ_i^{−1} (x − μ_i)),

so

  g_i(x) = −½ (x − μ_i)^T Σ_i^{−1} (x − μ_i) − (d/2) ln 2π − ½ ln |Σ_i| + ln P(ω_i).

Three cases:
- Case 1: Σ_i = σ²I. Classes are centered at different means; the feature components are pairwise independent and share the same variance σ².
- Case 2: Σ_i = Σ. Classes are centered at different means but share the same covariance.
- Case 3: Σ_i arbitrary (Σ_i ≠ Σ_j in general).

Case 1: Σ_i = σ²I

Here Σ_i^{−1} = (1/σ²) I, and (dropping the class-independent terms of g_i)

  g_i(x) = −||x − μ_i||² / (2σ²) + ln P(ω_i)
         = −(1/(2σ²)) (x^T x − 2 μ_i^T x + μ_i^T μ_i) + ln P(ω_i).

The x^T x term is the same for every class and is irrelevant, leaving the linear discriminant

  g_i(x) = w_i^T x + w_i0,   with   w_i = μ_i / σ²,   w_i0 = −μ_i^T μ_i / (2σ²) + ln P(ω_i).

Boundary between ω_i and ω_j: g_i(x) = g_j(x) gives w^T (x − x_0) = 0 with

  w = μ_i − μ_j,
  x_0 = ½ (μ_i + μ_j) − [σ² / ||μ_i − μ_j||²] ln[P(ω_i)/P(ω_j)] (μ_i − μ_j).

The decision boundary is a hyperplane perpendicular to the line between the means. It passes through the midpoint ½(μ_i + μ_j) when P(ω_i) = P(ω_j); otherwise the ln[P(ω_i)/P(ω_j)] term shifts it along the line joining the means, away from the more probable class. With equal priors this is the minimum-distance classifier (template matching): assign x to the nearest mean.

[Figures/Demo: boundaries between two spherical Gaussians for P(ω_1) = P(ω_2) and for increasingly unequal priors.]

Case 2: Σ_i = Σ

  g_i(x) = −½ (x − μ_i)^T Σ^{−1} (x − μ_i) + ln P(ω_i)

(the (d/2) ln 2π and ½ ln |Σ| terms are class-independent and dropped). The quadratic form is the squared Mahalanobis distance from x to μ_i, so if P(ω_i) = P(ω_j) for all i, j, the rule assigns x to the class whose mean is nearest in Mahalanobis distance. Expanding,

  g_i(x) = −½ (x^T Σ^{−1} x − 2 μ_i^T Σ^{−1} x + μ_i^T Σ^{−1} μ_i) + ln P(ω_i),

and dropping the class-independent x^T Σ^{−1} x term again yields a linear discriminant

  g_i(x) = w_i^T x + w_i0,   with   w_i = Σ^{−1} μ_i,   w_i0 = −½ μ_i^T Σ^{−1} μ_i + ln P(ω_i).

Boundary between ω_i and ω_j: w^T (x − x_0) = 0 with

  w = Σ^{−1} (μ_i − μ_j),
  x_0 = ½ (μ_i + μ_j) − [ln(P(ω_i)/P(ω_j)) / ((μ_i − μ_j)^T Σ^{−1} (μ_i − μ_j))] (μ_i − μ_j).

The boundary is still a hyperplane through x_0, but in general it is not perpendicular to the line between the means. [Demo]

Case 3: Σ_i ≠ Σ_j

  g_i(x) = −½ (x − μ_i)^T Σ_i^{−1} (x − μ_i) − ½ ln |Σ_i| + ln P(ω_i)
         = x^T W_i x + w_i^T x + w_i0,

with

  W_i = −½ Σ_i^{−1},
  w_i = Σ_i^{−1} μ_i,
  w_i0 = −½ μ_i^T Σ_i^{−1} μ_i − ½ ln |Σ_i| + ln P(ω_i).

The quadratic term x^T W_i x, absent in Cases 1 and 2, makes the decision surfaces hyperquadrics, e.g.:
- hyperplanes
- hyperspheres
- hyperellipsoids
- hyperhyperboloids

Non-simply connected decision regions can arise even in one dimension for Gaussians having unequal variance.

[Figures/Demo: quadratic decision boundaries for various covariance pairs; a multi-category example.]

Bayesian Decision Theory (Classification): Minimax Criterion

Recall the Bayesian two-category rule: decide ω_1 if the likelihood ratio p(x | ω_1) / p(x | ω_2) exceeds the threshold (λ_12 − λ_22) P(ω_2) / ((λ_21 − λ_11) P(ω_1)). The minimax criterion deals with the case where the prior probabilities are unknown.

Basic Concept of Minimax

Assume the worst-case prior probabilities (the ones giving the maximum loss), and then pick the decision rule that minimizes the overall risk under them; that is, minimize the maximum possible overall risk.
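Before continuing to the minimax derivation, here is a minimal sketch of the Case 3 discriminant above, which subsumes Cases 1 and 2 as special covariances; the two example classes and their parameters are illustrative assumptions.

```python
import numpy as np

def quad_discriminant(mu, Sigma, prior):
    """Return g_i for a class N(mu, Sigma) with the given prior (Case 3 form):
       g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 ln|Sigma| + ln P(w_i).
       The class-independent (d/2) ln 2*pi term is dropped, as in the text."""
    inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    def g(x):
        d = x - mu
        return -0.5 * d @ inv @ d - 0.5 * logdet + np.log(prior)
    return g

# Two illustrative classes with unequal covariances (Case 3: quadratic boundary).
g1 = quad_discriminant(np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = quad_discriminant(np.array([3.0, 0.0]), np.diag([4.0, 1.0]), 0.5)

def classify(x):
    """Assign x to the class with the larger discriminant value."""
    return 1 if g1(x) > g2(x) else 2
```

Setting both covariances to σ²I (Case 1) or to one shared Σ (Case 2) makes the quadratic parts of g1 and g2 cancel in the comparison, recovering the linear boundaries derived above.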
Overall Risk as a Function of the Prior

  R = ∫_{R_1} R(α_1 | x) p(x) dx + ∫_{R_2} R(α_2 | x) p(x) dx
    = ∫_{R_1} [λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x)] p(x) dx + ∫_{R_2} [λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x)] p(x) dx.

Using P(ω_i | x) p(x) = p(x | ω_i) P(ω_i):

  R = ∫_{R_1} [λ_11 P(ω_1) p(x | ω_1) + λ_12 P(ω_2) p(x | ω_2)] dx
    + ∫_{R_2} [λ_21 P(ω_1) p(x | ω_1) + λ_22 P(ω_2) p(x | ω_2)] dx.

Substituting P(ω_2) = 1 − P(ω_1) and using ∫_{R_1} p(x | ω_i) dx + ∫_{R_2} p(x | ω_i) dx = 1, the risk is linear in P(ω_1):

  R[P(ω_1)] = λ_22 + (λ_12 − λ_22) ∫_{R_1} p(x | ω_2) dx
    + P(ω_1) [ (λ_11 − λ_22) + (λ_21 − λ_11) ∫_{R_2} p(x | ω_1) dx − (λ_12 − λ_22) ∫_{R_1} p(x | ω_2) dx ],

of the form R = a P(ω_1) + b, where both the slope a and the constant b depend on the setting of the decision boundary (i.e., on R_1 and R_2); for a fixed boundary this gives the overall risk at each particular P(ω_1). The minimax solution chooses the boundary that makes the slope vanish:

  (λ_11 − λ_22) + (λ_21 − λ_11) ∫_{R_2} p(x | ω_1) dx − (λ_12 − λ_22) ∫_{R_1} p(x | ω_2) dx = 0,

so that the resulting risk is independent of the value of P(ω_1).
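Under the 0/1 loss (λ_11 = λ_22 = 0, λ_12 = λ_21 = 1), the slope-zero condition above says the boundary must equalize the two conditional error masses. A sketch (not from the slides) that finds such a boundary by bisection, for assumed 1-D Gaussian class-conditional densities:

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumed 1-D Gaussian class-conditional densities (illustrative, unequal spread).
mu1, s1 = 0.0, 1.0
mu2, s2 = 3.0, 2.0

def slope(t):
    """Slope of R[P(w1)] under 0/1 loss for the boundary R1 = (-inf, t], R2 = (t, inf):
       integral_{R2} p(x|w1) dx - integral_{R1} p(x|w2) dx."""
    p2_given_1 = 1.0 - Phi((t - mu1) / s1)  # class-1 mass misclassified into R2
    p1_given_2 = Phi((t - mu2) / s2)        # class-2 mass misclassified into R1
    return p2_given_1 - p1_given_2

# Bisection for the minimax boundary, where the slope vanishes.
lo, hi = mu1, mu2
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if slope(mid) > 0:
        lo = mid       # too much class-1 error: move the boundary right
    else:
        hi = mid
t_mm = 0.5 * (lo + hi)
```

At t_mm the two error integrals are equal, so the error probability no longer depends on P(ω_1); that common value is the minimax error probability.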
Minimax Risk

With the slope set to zero, the minimax risk is the remaining constant term, which can then be written either way:

  R_mm = λ_22 + (λ_12 − λ_22) ∫_{R_1} p(x | ω_2) dx
       = λ_11 + (λ_21 − λ_11) ∫_{R_2} p(x | ω_1) dx.

Error Probability

With the 0/1 loss function (λ_11 = λ_22 = 0, λ_12 = λ_21 = 1) the overall risk is the error probability:

  P_error[P(ω_1)] = ∫_{R_1} p(x | ω_2) dx + P(ω_1) [ ∫_{R_2} p(x | ω_1) dx − ∫_{R_1} p(x | ω_2) dx ].

Minimax Error Probability

Setting the bracketed slope to zero gives

  P_mm(error) = ∫_{R_1} p(x | ω_2) dx = ∫_{R_2} p(x | ω_1) dx,

i.e., P(1 | ω_2) = P(2 | ω_1): the minimax boundary equalizes the two conditional error probabilities.

[Figure: the two class-conditional densities with the minimax boundary between regions R_1 and R_2; the two shaded error areas are equal.]

Bayesian Decision Theory (Classification): Neyman-Pearson Criterion

Recall again the Bayesian two-category rule: decide ω_1 if the likelihood ratio p(x | ω_1) / p(x | ω_2) exceeds the threshold (λ_12 − λ_22) P(ω_2) / ((λ_21 − λ_11) P(ω_1)). The Neyman-Pearson criterion deals with the case where both the loss function and the prior probabilities are unknown.

Signal Detection Theory

The theory of signal detection evolved from the development of communications and radar equipment in the first half of the last century. It migrated to psychology, initially as part of sensation and perception, in the 1950s and 1960s, as an attempt to explain features of human behavior in detecting very faint stimuli that traditional threshold theories could not.

The situation of interest: a person is faced with a stimulus (signal) that is very faint or confusing, and must decide whether the signal is there or not. What makes this situation confusing and difficult is the presence of other activity similar to the signal; call this activity noise.

Example: noise is present both in the environment and in the sensory system of the observer.
The observer reacts to the momentary total activation of the sensory system, which fluctuates from moment to moment, as well as responding to environmental stimuli, which may include a signal.

Example: a radiologist is examining a CT scan, looking for evidence of a tumor. This is a hard job, because there is always some uncertainty. There are four possible outcomes:
- hit: tumor present and the doctor says "yes"
- miss: tumor present and the doctor says "no"
- false alarm: tumor absent and the doctor says "yes"
- correct rejection: tumor absent and the doctor says "no"

Misses and false alarms are the two types of error. Signal detection theory was developed to help us understand how a continuous and ambiguous signal can lead to a binary yes/no decision.

The Four Cases

                    Signal (tumor)
  Decision          Absent (ω_1)                    Present (ω_2)
  No  (α_1)         P(1 | ω_1): correct rejection   P(1 | ω_2): miss
  Yes (α_2)         P(2 | ω_1): false alarm         P(2 | ω_2): hit

Discriminability and Criterion

With the noise and noise+signal activations modeled as two distributions of equal spread σ, the discriminability is d' = |μ_2 − μ_1| / σ. The decision criterion, a cut point between "no" and "yes", is set by the observer's expectancy (decision bias); it determines the hit rate P_H = P(2 | ω_2) and the false-alarm rate P_FA = P(2 | ω_1).

ROC Curve (Receiver Operating Characteristic)

Sweeping the criterion traces out the ROC curve: the hit rate P_H = P(2 | ω_2) plotted against the false-alarm rate P_FA = P(2 | ω_1).

Neyman-Pearson Criterion

NP: maximize P_H subject to P_FA ≤ a.

Likelihood Ratio Test

Define the decision function

  φ(x) = 0 if p(x | ω_1) / p(x | ω_2) > T   (decide ω_1),
  φ(x) = 1 if p(x | ω_1) / p(x | ω_2) < T   (decide ω_2),

equivalently

  R_1 = {x | p(x | ω_1) > T p(x | ω_2)},   R_2 = {x | p(x | ω_1) < T p(x | ω_2)},

where T is a threshold that meets the P_FA constraint (≤ a). How do we determine T? Note that

  P_FA = ∫_{R_2} p(x | ω_1) dx = ∫ φ(x) p(x | ω_1) dx = E[φ(X) | ω_1],
  P_H  = ∫_{R_2} p(x | ω_2) dx = ∫ φ(x) p(x | ω_2) dx = E[φ(X) | ω_2].

Neyman-Pearson Lemma

Consider the aforementioned rule with T chosen so that P_FA(φ) = a. There is no decision rule φ' such that P_FA(φ') ≤ a and P_H(φ') > P_H(φ).

Proof) Let φ' be any decision rule with P_FA(φ') = E[φ'(X) | ω_1] ≤ a. Consider

  ∫ [φ(x) − φ'(x)] [T p(x | ω_2) − p(x | ω_1)] dx ≥ 0.

The integrand is nonnegative everywhere: where p(x | ω_1) < T p(x | ω_2), we have φ(x) = 1, so φ(x) − φ'(x) ≥ 0 while the second factor is > 0; where p(x | ω_1) > T p(x | ω_2), we have φ(x) = 0, so φ(x) − φ'(x) ≤ 0 while the second factor is < 0; where p(x | ω_1) = T p(x | ω_2), the second factor is 0. Expanding the integral,

  T [∫ φ(x) p(x | ω_2) dx − ∫ φ'(x) p(x | ω_2) dx] − [∫ φ(x) p(x | ω_1) dx − ∫ φ'(x) p(x | ω_1) dx] ≥ 0,

i.e.,

  T [P_H(φ) − P_H(φ')] ≥ P_FA(φ) − P_FA(φ') = a − P_FA(φ') ≥ 0.

Since T > 0, it follows that P_H(φ) ≥ P_H(φ'). ∎
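As an illustration of choosing the threshold to meet the P_FA constraint: for 1-D Gaussian noise and signal distributions with equal variance, the likelihood ratio is monotone in x, so the NP rule reduces to a one-sided threshold on x itself, and the criterion can be set directly from the noise distribution. The distributions and the level a = 0.05 below are assumptions for the example.

```python
from statistics import NormalDist

# Assumed equal-variance noise and noise+signal distributions (illustrative).
noise = NormalDist(mu=0.0, sigma=1.0)   # p(x | w1)
signal = NormalDist(mu=2.0, sigma=1.0)  # p(x | w2)
alpha = 0.05                            # allowed false-alarm rate: P_FA <= alpha

# Monotone likelihood ratio: "p(x|w1)/p(x|w2) < T" is equivalent to "x > c",
# so pick the criterion c that uses up the full false-alarm budget.
c = noise.inv_cdf(1.0 - alpha)          # P_FA = P(X > c | w1) = alpha
P_H = 1.0 - signal.cdf(c)               # hit rate achieved at that criterion
```

With these numbers the criterion lands at about 1.645 (the upper 5% point of the noise distribution), and the achieved hit rate is about 0.64; sweeping alpha from 0 to 1 traces out the ROC curve described above.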