CSE555: Introduction to Pattern Recognition
Midterm Exam Solution (100 points, Closed book/notes)

There are 5 questions in this exam. The last page is the Appendix, which contains some useful formulas.

1. (15pts) Bayes Decision Theory.

(a) (5pts) Assume there are c classes ω1, ..., ωc and one feature vector x. Give the Bayes rule for classification in terms of the a priori probabilities of the classes and the class-conditional probability densities of x.

Answer: The Bayes rule for classification is

    Decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) for all j ≠ i, where i, j = 1, ..., c.

(b) (10pts) Suppose we have a two-class problem (A, ∼A) with a single binary-valued feature (x, ∼x). Assume the prior probability P(A) = 0.33. Given the distribution of the samples shown in the following table, use Bayes rule to compute the posterior probabilities of the classes.

              x     ∼x
    A       248     82
    ∼A      167    503

Answer: By Bayes formula,

    P(A|x) = p(x|A)P(A) / p(x)

where

    p(x) = p(x|A)P(A) + p(x|∼A)P(∼A).

From the table,

    p(x|A) = 248/(248 + 82) ≈ 0.7515
    p(x|∼A) = 167/(167 + 503) ≈ 0.2493
    P(A) = 0.33,   P(∼A) = 1 − P(A) = 0.67

thus

    P(A|x) = (0.7515 × 0.33) / (0.7515 × 0.33 + 0.2493 × 0.67) ≈ 0.5976.

Similarly, we have

    P(∼A|x) ≈ 0.4024,   P(A|∼x) ≈ 0.1402,   P(∼A|∼x) ≈ 0.8598.

2. (25pts) Fisher Linear Discriminant.

(a) (5pts) What is the Fisher linear discriminant method?

Answer: The Fisher linear discriminant finds a low-dimensional subspace in which the categories are best separated in a least-squares sense; other, general classification techniques can then be applied in that subspace.

(b) Given the 2-d data for two classes:

    ω1 = [(1, 1), (1, 2), (1, 4), (2, 1), (3, 1), (3, 3)]
    ω2 = [(2, 2), (3, 2), (3, 4), (5, 1), (5, 4), (5, 5)]

[Figure: scatter plot of the two classes in the plane, both axes running from 0 to 6.]

i. (10pts) Determine the optimal projection line in a single dimension.

Answer: Let w be the direction of the projection line. The Fisher linear discriminant method finds the w that maximizes the criterion function

    J(w) = \frac{w^t S_B w}{w^t S_W w},

namely

    w = S_W^{-1}(m_1 - m_2),

where S_W = S_1 + S_2 and

    S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t,   i = 1, 2.

We first compute the sample means of each class:

    m_1 = \begin{bmatrix} 11/6 \\ 2 \end{bmatrix},   m_2 = \begin{bmatrix} 23/6 \\ 3 \end{bmatrix}.

Subtracting the sample mean from each sample of its class gives

    x - m_1 = \begin{bmatrix} -5/6 & -5/6 & -5/6 & 1/6 & 7/6 & 7/6 \\ -1 & 0 & 2 & -1 & -1 & 1 \end{bmatrix}

    x - m_2 = \begin{bmatrix} -11/6 & -5/6 & -5/6 & 7/6 & 7/6 & 7/6 \\ -1 & -1 & 1 & -2 & 1 & 2 \end{bmatrix}

therefore

    S_1 = \begin{bmatrix} \frac{25+25+25+1+49+49}{36} & \frac{5+0-10-1-7+7}{6} \\ \frac{5+0-10-1-7+7}{6} & 1+0+4+1+1+1 \end{bmatrix} = \begin{bmatrix} 29/6 & -1 \\ -1 & 8 \end{bmatrix}

    S_2 = \begin{bmatrix} \frac{121+25+25+49+49+49}{36} & \frac{11+5-5-14+7+14}{6} \\ \frac{11+5-5-14+7+14}{6} & 1+1+1+4+1+4 \end{bmatrix} = \begin{bmatrix} 53/6 & 3 \\ 3 & 12 \end{bmatrix}

and then

    S_W = S_1 + S_2 = \begin{bmatrix} 41/3 & 2 \\ 2 & 20 \end{bmatrix}

    S_W^{-1} = \frac{1}{|S_W|} \begin{bmatrix} 20 & -2 \\ -2 & 41/3 \end{bmatrix} = \frac{3}{808} \begin{bmatrix} 20 & -2 \\ -2 & 41/3 \end{bmatrix} = \begin{bmatrix} 15/202 & -3/404 \\ -3/404 & 41/808 \end{bmatrix}.

Finally, with m_1 − m_2 = (−2, −1)^t,

    w = S_W^{-1}(m_1 - m_2) = \begin{bmatrix} 15/202 & -3/404 \\ -3/404 & 41/808 \end{bmatrix} \begin{bmatrix} -2 \\ -1 \end{bmatrix} = \begin{bmatrix} -57/404 \\ -29/808 \end{bmatrix} \approx \begin{bmatrix} -0.1411 \\ -0.0359 \end{bmatrix}.
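A minimal numerical sketch of the computation in part (i), assuming NumPy is available (the helper name `scatter` is only illustrative):

```python
import numpy as np

# Two-class data from Problem 2(b).
D1 = np.array([(1, 1), (1, 2), (1, 4), (2, 1), (3, 1), (3, 3)], dtype=float)
D2 = np.array([(2, 2), (3, 2), (3, 4), (5, 1), (5, 4), (5, 5)], dtype=float)

def scatter(D):
    """Return the scatter matrix S_i = sum_x (x - m_i)(x - m_i)^t and the mean m_i."""
    m = D.mean(axis=0)
    X = D - m
    return X.T @ X, m

S1, m1 = scatter(D1)
S2, m2 = scatter(D2)
SW = S1 + S2                       # within-class scatter matrix
w = np.linalg.solve(SW, m1 - m2)   # Fisher direction w = SW^{-1}(m1 - m2)

print(m1, m2)   # m1 ≈ [1.8333, 2], m2 ≈ [3.8333, 3]
print(w)        # approximately [-0.1411, -0.0359]
```

Any nonzero scaling of w defines the same projection line, so only the direction of w matters for the discriminant.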
ii. (10pts) Show the mapping of the points to the line as well as the Bayes discriminant, assuming a suitable distribution.

Answer: The samples are mapped onto the line by x′ = w^t x, which gives the projected values

    ω1: [−0.1770, −0.2129, −0.2847, −0.3181, −0.4592, −0.5309]
    ω2: [−0.3540, −0.4950, −0.5668, −0.7413, −0.8490, −0.8849]

The mean and standard deviation of the projected samples of each class are

    µ1 = −0.3304,   σ1 = 0.1388
    µ2 = −0.6485,   σ2 = 0.2106

If we assume both p(x′|ω1) and p(x′|ω2) have a Gaussian distribution, the Bayes decision rule is

    Decide ω1 if p(x′|ω1)P(ω1) > p(x′|ω2)P(ω2); otherwise decide ω2,

where

    p(x′|ωi) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{1}{2}\left(\frac{x′ - \mu_i}{\sigma_i}\right)^2\right].

If we further assume equal priors, P(ω1) = P(ω2) = 0.5, the resulting threshold is about −0.4933. That is, we decide ω1 if w^t x > −0.4933, otherwise decide ω2.

3. (20pts) Suppose p(x|ω1) and p(x|ω2) are defined as follows:

    p(x|ω1) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2},   for all x
    p(x|ω2) = \frac{1}{4},   for −2 < x < 2

(a) (7pts) Find the minimum-error classification rule g(x) for this two-class problem, assuming P(ω1) = P(ω2) = 0.5.

Answer:
(i) For −2 < x < 2, because P(ω1) = P(ω2) = 0.5, the discriminant function is

    g(x) = ln \frac{p(x|ω1)}{p(x|ω2)} = ln \frac{4}{\sqrt{2\pi}} - \frac{x^2}{2}.

The Bayes rule for classification is

    Decide ω1 if g(x) > 0; otherwise decide ω2.

Since g(x) > 0 exactly when x² < 2 ln(4/√(2π)) ≈ 0.9347, this is equivalent to

    Decide ω1 if −0.9668 < x < 0.9668; otherwise decide ω2.

(ii) For x ≥ 2 or x ≤ −2, p(x|ω2) = 0, so we always decide ω1.

(b) (10pts) There is a prior probability of class 1, designated π1*, such that if P(ω1) > π1*, the minimum-error classification rule is to always decide ω1 regardless of x. Find π1*.

Answer: Always deciding ω1 is minimum-error when p(x|ω1)P(ω1) ≥ p(x|ω2)(1 − P(ω1)) for every x; on (−2, 2) this inequality is tightest at x = ±2, where p(x|ω1) is smallest. Hence π1* satisfies

    p(x|ω1) π1* = p(x|ω2)(1 − π1*)   at x = 2 or x = −2.

Therefore

    \frac{1}{\sqrt{2\pi}} e^{-2} \pi_1^* = \frac{1}{4}(1 - \pi_1^*),

which gives π1* ≈ 0.8224.

(c) (3pts) There is no π2* such that if P(ω2) > π2*, we would always decide ω2. Why not?

Answer: Because p(x|ω2) is nonzero only for −2 < x < 2, we always decide ω1 for x ≥ 2 or x ≤ −2, regardless of the prior probability P(ω2).

4. (20pts) Let samples be drawn by successive, independent selections of a state of nature ωi with unknown probability P(ωi). Let z_ik = 1 if the state of nature for the k-th sample is ωi, and z_ik = 0 otherwise.

(a) (7pts) Show that

    P(z_{i1}, ..., z_{in} | P(ωi)) = \prod_{k=1}^{n} P(ωi)^{z_{ik}} (1 - P(ωi))^{1 - z_{ik}}.

Answer: We are given that z_ik = 1 if the state of nature for the k-th sample is ωi, and z_ik = 0 otherwise. The samples are drawn by successive independent selections of a state of nature ωi with probability P(ωi). We then have

    Pr[z_{ik} = 1 | P(ωi)] = P(ωi)   and   Pr[z_{ik} = 0 | P(ωi)] = 1 - P(ωi).

These two equations can be unified as

    P(z_{ik} | P(ωi)) = [P(ωi)]^{z_{ik}} [1 - P(ωi)]^{1 - z_{ik}}.

By the independence of the successive selections,

    P(z_{i1}, ..., z_{in} | P(ωi)) = \prod_{k=1}^{n} P(z_{ik} | P(ωi)) = \prod_{k=1}^{n} [P(ωi)]^{z_{ik}} [1 - P(ωi)]^{1 - z_{ik}}.

(b) (10pts) Given the equation above, show that the maximum likelihood estimate for P(ωi) is

    \hat{P}(ωi) = \frac{1}{n} \sum_{k=1}^{n} z_{ik}.

Answer: The log-likelihood as a function of P(ωi) is

    l(P(ωi)) = \ln P(z_{i1}, ..., z_{in} | P(ωi))
             = \ln \prod_{k=1}^{n} [P(ωi)]^{z_{ik}} [1 - P(ωi)]^{1 - z_{ik}}
             = \sum_{k=1}^{n} [z_{ik} \ln P(ωi) + (1 - z_{ik}) \ln(1 - P(ωi))].

Therefore the maximum-likelihood value of P(ωi) must satisfy

    \nabla_{P(ωi)} l(P(ωi)) = \frac{1}{P(ωi)} \sum_{k=1}^{n} z_{ik} - \frac{1}{1 - P(ωi)} \sum_{k=1}^{n} (1 - z_{ik}) = 0.

Solving this equation, we find

    (1 - \hat{P}(ωi)) \sum_{k=1}^{n} z_{ik} = \hat{P}(ωi) \sum_{k=1}^{n} (1 - z_{ik}),

which can be rewritten as

    \sum_{k=1}^{n} z_{ik} = \hat{P}(ωi) \sum_{k=1}^{n} z_{ik} + n\hat{P}(ωi) - \hat{P}(ωi) \sum_{k=1}^{n} z_{ik} = n\hat{P}(ωi).

The final solution is then

    \hat{P}(ωi) = \frac{1}{n} \sum_{k=1}^{n} z_{ik}.
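The result in part (b) can be illustrated by comparing the closed-form estimate with a brute-force maximization of the log-likelihood. A minimal sketch, assuming NumPy is available and simulating the indicators z_ik from a hypothetical true prior of 0.3:

```python
import numpy as np

rng = np.random.default_rng(0)
true_prior = 0.3                                     # hypothetical P(w_i), used only to simulate data
z = (rng.random(1000) < true_prior).astype(float)    # indicator variables z_ik for n = 1000 samples

# Closed-form ML estimate from part (b): the sample frequency.
p_hat = z.mean()

# Brute-force check: evaluate the log-likelihood on a grid and take the argmax.
grid = np.linspace(1e-3, 1 - 1e-3, 2001)
loglik = z.sum() * np.log(grid) + (len(z) - z.sum()) * np.log(1 - grid)

print(p_hat, grid[np.argmax(loglik)])   # the two values agree up to the grid resolution
```

With 1000 draws the estimate is close to, but not exactly, 0.3; the point is that the grid argmax coincides with the sample frequency, as the derivation predicts.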
(c) (3pts) Interpret the meaning of your result in words.

Answer: Here we apply the maximum-likelihood method to estimate a prior probability. The result in part (b) says that the estimate of P(ωi) is simply the fraction of the n training samples whose state of nature is ωi (the relative frequency of z_ik = 1), just as we would expect.

5. (20pts) Consider an HMM with an explicit absorber state ω0 and unique null visible symbol v0, with the following transition probabilities a_ij and symbol probabilities b_jk (where the matrix indices begin at 0):

    a_{ij} = \begin{bmatrix} 1 & 0 & 0 \\ 0.2 & 0.3 & 0.5 \\ 0.4 & 0.5 & 0.1 \end{bmatrix},   b_{jk} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0.7 & 0.3 \\ 0 & 0.4 & 0.6 \end{bmatrix}

(a) (7pts) Give a graph representation of this Hidden Markov Model.

Answer: [Figure: directed graph with state nodes ω0, ω1, ω2. ω0 has a self-loop with probability 1 and emits only v0. The transition arcs carry the entries of a_ij (ω1 → ω0: 0.2, ω1 → ω1: 0.3, ω1 → ω2: 0.5; ω2 → ω0: 0.4, ω2 → ω1: 0.5, ω2 → ω2: 0.1), and the emission arcs carry the entries of b_jk (ω1 emits v1 with 0.7 and v2 with 0.3; ω2 emits v1 with 0.4 and v2 with 0.6).]

(b) (10pts) Suppose the initial hidden state at t = 0 is ω1. Starting from t = 1, what is the probability that it generates the particular sequence V³ = {v2, v1, v0}?

Answer: The probability of observing the sequence V³ is 0.03678. The forward probabilities α_j(t) in the trellis are:

    t = 0:         α_0 = 0,         α_1 = 1,         α_2 = 0
    t = 1 (v2):    α_0 = 0,         α_1 = 0.09,      α_2 = 0.3
    t = 2 (v1):    α_0 = 0,         α_1 = 0.1239,    α_2 = 0.03
    t = 3 (v0):    α_0 = 0.03678,   α_1 = 0,         α_2 = 0

For example, α_1(2) = [α_1(1)·a_11 + α_2(1)·a_21]·b_1(v1) = [0.09 × 0.3 + 0.3 × 0.5] × 0.7 = 0.1239, and finally α_0(3) = [α_1(2)·a_10 + α_2(2)·a_20]·b_0(v0) = [0.1239 × 0.2 + 0.03 × 0.4] × 1 = 0.03678.

(c) (3pts) Given the above sequence V³, what is the most probable sequence of hidden states?

Answer: Applying the decoding algorithm to the same trellis, the most probable sequence of hidden states is {ω1, ω2, ω1, ω0}.

Appendix: Useful formulas.

• For a 2 × 2 matrix

    A = \begin{bmatrix} a & b \\ c & d \end{bmatrix},

the matrix inverse is

    A^{-1} = \frac{1}{|A|} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}.

• The scatter matrices S_i are defined as

    S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t,

where m_i is the d-dimensional sample mean. The within-class scatter matrix is defined as

    S_W = S_1 + S_2.

The between-class scatter matrix is defined as

    S_B = (m_1 - m_2)(m_1 - m_2)^t.

The solution for the w that optimizes J(w) = \frac{w^t S_B w}{w^t S_W w} is

    w = S_W^{-1}(m_1 - m_2).
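Referring back to Problem 5, the forward probability in (b) and the decoded state sequence in (c) can both be cross-checked with a short script. A minimal sketch, assuming NumPy is available and indexing states and symbols from 0 as in the problem statement:

```python
import numpy as np

# Transition and emission probabilities from Problem 5 (indices start at 0).
A = np.array([[1.0, 0.0, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.5, 0.1]])
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.4, 0.6]])
obs = [2, 1, 0]                   # the sequence V^3 = {v2, v1, v0}

# Forward algorithm: P(V^3), given the chain starts in state w1 at t = 0.
alpha = np.array([0.0, 1.0, 0.0])
for v in obs:
    alpha = (alpha @ A) * B[:, v]
print(alpha.sum())                # 0.03678

# Viterbi decoding: most probable hidden-state sequence for V^3.
delta = np.array([0.0, 1.0, 0.0])
back = []
for v in obs:
    scores = delta[:, None] * A * B[None, :, v]   # scores[i, j]: come from state i, land in state j
    back.append(scores.argmax(axis=0))            # best predecessor of each state
    delta = scores.max(axis=0)
path = [int(delta.argmax())]
for ptr in reversed(back):
    path.append(int(ptr[path[-1]]))
print(list(reversed(path)))       # [1, 2, 1, 0] -> {w1, w2, w1, w0}, matching part (c)
```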