Solutions to Selected Problems In:
Pattern Classification by Duda, Hart, Stork

John L. Weatherwax (wax@alum.mit.edu)

February 24, 2008

Problem Solutions

Chapter 2 (Bayesian Decision Theory)

Problem 11 (randomized rules)

Part (a): Let $R(x)$ be the average risk given the measurement $x$. Since this risk depends on the action taken and on the probability of taking that action, it can be expressed as
\[ R(x) = \sum_{i=1}^{m} R(a_i|x) P(a_i|x) . \]
The total risk (over all possible $x$) is then the average of $R(x)$, or
\[ R = \int_{\Omega_x} \left( \sum_{i=1}^{m} R(a_i|x) P(a_i|x) \right) p(x)\, dx . \]

Part (b): One would pick
\[ P(a_i|x) = \begin{cases} 1 & i = \operatorname{argmin}_j R(a_j|x) \\ 0 & \text{otherwise.} \end{cases} \]
That is, for every $x$ pick the action that minimizes the conditional risk. This amounts to a deterministic rule, i.e. there is no benefit to using randomness.

Part (c): Yes: in a setting where the deterministic decision rule in use is suboptimal, randomizing some decisions might further improve performance, although how this would be done in practice is difficult to say.

Problem 13

Let
\[ \lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j \\ \lambda_r & i = c+1 \\ \lambda_s & \text{otherwise.} \end{cases} \]
Then, since risk is defined as expected loss, we have
\[ R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j) P(\omega_j|x), \quad i = 1, 2, \ldots, c \tag{1} \]
\[ R(\alpha_{c+1}|x) = \lambda_r . \tag{2} \]
Now for $i = 1, 2, \ldots, c$ the risk can be simplified as
\[ R(\alpha_i|x) = \lambda_s \sum_{j=1,\, j \neq i}^{c} P(\omega_j|x) = \lambda_s \left( 1 - P(\omega_i|x) \right) . \]
Our decision rule is to choose the action with the smallest risk. This means we pick the reject option if and only if
\[ \lambda_r < \lambda_s \left( 1 - P(\omega_i|x) \right) \quad \forall\, i = 1, 2, \ldots, c , \]
which is equivalent to
\[ P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s} \quad \forall\, i = 1, 2, \ldots, c . \]
This can be interpreted as follows: if all posterior probabilities are below the threshold $1 - \lambda_r/\lambda_s$ we should reject. Otherwise we pick the action $i$ such that
\[ \lambda_s \left( 1 - P(\omega_i|x) \right) < \lambda_s \left( 1 - P(\omega_j|x) \right) \quad \forall\, j = 1, 2, \ldots, c,\; j \neq i , \]
which is equivalent to
\[ P(\omega_i|x) > P(\omega_j|x) \quad \forall\, j \neq i . \]
That is, we select the class with the largest posterior probability. In summary, we should reject if and only if
\[ P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s} \quad \forall\, i = 1, 2, \ldots, c . \tag{3} \]
Note that if $\lambda_r = 0$ there is no loss for rejecting (an equivalent benefit/reward to correctly classifying) and we always reject. If $\lambda_r > \lambda_s$, the loss from rejecting is greater than the loss from misclassifying and we should never reject. This can be seen from the above since $\lambda_r/\lambda_s > 1$, so $1 - \lambda_r/\lambda_s < 0$ and equation (3) can never be satisfied.
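To make the reject rule in equation (3) concrete, here is a small illustrative sketch. Python is used for these snippets since none of the MATLAB scripts referenced later in these notes are reproduced here; the posterior vectors and loss values below are hypothetical examples, not taken from the text.

```python
import numpy as np

def classify_with_reject(posteriors, lambda_r, lambda_s):
    """Return the index of the chosen class, or -1 for the reject action.

    Implements the rule derived above: reject when every posterior
    P(omega_i | x) is below the threshold 1 - lambda_r / lambda_s,
    otherwise choose the class with the largest posterior.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    threshold = 1.0 - lambda_r / lambda_s
    if np.all(posteriors < threshold):
        return -1  # reject option (action alpha_{c+1})
    return int(np.argmax(posteriors))

# Hypothetical three-class examples with lambda_r / lambda_s = 0.5.
print(classify_with_reject([0.40, 0.35, 0.25], lambda_r=0.5, lambda_s=1.0))  # -1 (reject)
print(classify_with_reject([0.70, 0.20, 0.10], lambda_r=0.5, lambda_s=1.0))  # 0
```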
Problem 31 (the probability of error)

Part (a): We can assume without loss of generality that $\mu_1 < \mu_2$; if this is not the case, relabel the two classes so that it is. We begin by recalling the probability of error $P_e$,
\[ P_e = P(\hat{H} = H_1|H_2)P(H_2) + P(\hat{H} = H_2|H_1)P(H_1) = \left( \int_{-\infty}^{\xi} p(x|H_2)\,dx \right) P(H_2) + \left( \int_{\xi}^{\infty} p(x|H_1)\,dx \right) P(H_1) . \]
Here $\xi$ is the decision boundary between the two classes, i.e. we classify $x$ as class $H_1$ when $x < \xi$ and as class $H_2$ when $x > \xi$, and $P(H_i)$ are the priors for the two classes.

We begin by finding the Bayes decision boundary $\xi$ under the minimum error criterion. From the discussion in the book, the decision boundary is given by a likelihood ratio test: we decide class $H_2$ if
\[ \frac{p(x|H_2)}{p(x|H_1)} \geq \frac{P(H_1)}{P(H_2)} \cdot \frac{C_{21} - C_{11}}{C_{12} - C_{22}} , \]
and classify the point as a member of $H_1$ otherwise. The decision boundary $\xi$ is the point at which this inequality becomes an equality. If we use a minimum probability of error criterion, the costs $C_{ij}$ are given by $C_{ij} = 1 - \delta_{ij}$, and when $P(H_1) = P(H_2) = \frac{1}{2}$ the boundary condition reduces to
\[ p(\xi|H_1) = p(\xi|H_2) . \]
If each conditional distribution is Gaussian, $p(x|H_i) = N(\mu_i, \sigma_i^2)$, the above expression becomes
\[ \frac{1}{\sqrt{2\pi}\sigma_1} \exp\left\{ -\frac{1}{2}\frac{(\xi-\mu_1)^2}{\sigma_1^2} \right\} = \frac{1}{\sqrt{2\pi}\sigma_2} \exp\left\{ -\frac{1}{2}\frac{(\xi-\mu_2)^2}{\sigma_2^2} \right\} . \]
If we assume (for simplicity) that $\sigma_1 = \sigma_2 = \sigma$, this simplifies to
\[ (\xi - \mu_1)^2 = (\xi - \mu_2)^2 , \]
or, on taking the square root of both sides,
\[ \xi - \mu_1 = \pm(\xi - \mu_2) . \]
Taking the plus sign gives the contradiction $\mu_1 = \mu_2$. Taking the minus sign and solving for $\xi$ we find our decision threshold
\[ \xi = \frac{1}{2}(\mu_1 + \mu_2) . \]
Again, since we have uniform priors ($P(H_1) = P(H_2) = \frac{1}{2}$), the error probability is given by
\[ 2 P_e = \int_{-\infty}^{\xi} \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ -\frac{1}{2}\frac{(x-\mu_2)^2}{\sigma^2} \right\} dx + \int_{\xi}^{\infty} \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ -\frac{1}{2}\frac{(x-\mu_1)^2}{\sigma^2} \right\} dx , \]
or
\[ 2\sqrt{2\pi}\,\sigma P_e = \int_{-\infty}^{\xi} \exp\left\{ -\frac{1}{2}\frac{(x-\mu_2)^2}{\sigma^2} \right\} dx + \int_{\xi}^{\infty} \exp\left\{ -\frac{1}{2}\frac{(x-\mu_1)^2}{\sigma^2} \right\} dx . \]
To evaluate the first integral use the substitution $v = \frac{x-\mu_2}{\sqrt{2}\sigma}$, so $dv = \frac{dx}{\sqrt{2}\sigma}$, and in the second use $v = \frac{x-\mu_1}{\sqrt{2}\sigma}$, so $dv = \frac{dx}{\sqrt{2}\sigma}$. This gives
\[ 2\sqrt{2\pi}\,\sigma P_e = \sigma\sqrt{2} \int_{-\infty}^{\frac{\mu_1-\mu_2}{2\sqrt{2}\sigma}} e^{-v^2}\,dv + \sigma\sqrt{2} \int_{-\frac{\mu_1-\mu_2}{2\sqrt{2}\sigma}}^{\infty} e^{-v^2}\,dv , \]
or
\[ P_e = \frac{1}{2\sqrt{\pi}} \int_{-\infty}^{\frac{\mu_1-\mu_2}{2\sqrt{2}\sigma}} e^{-v^2}\,dv + \frac{1}{2\sqrt{\pi}} \int_{-\frac{\mu_1-\mu_2}{2\sqrt{2}\sigma}}^{\infty} e^{-v^2}\,dv . \]
Now since $e^{-v^2}$ is an even function, the substitution $u = -v$ transforms the second integral into one identical to the first. Doing so we find
\[ P_e = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\frac{\mu_1-\mu_2}{2\sqrt{2}\sigma}} e^{-v^2}\,dv . \]
The same transformation converts the first integral into the second, giving the combined representation
\[ P_e = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\frac{\mu_1-\mu_2}{2\sqrt{2}\sigma}} e^{-v^2}\,dv = \frac{1}{\sqrt{\pi}} \int_{-\frac{\mu_1-\mu_2}{2\sqrt{2}\sigma}}^{\infty} e^{-v^2}\,dv . \]
Since $\mu_1 < \mu_2$, the lower limit on the second integral is positive. (In the case $\mu_1 > \mu_2$ we would switch the names of the two classes and still arrive at the above.) In the general case our probability of error is given by
\[ P_e = \frac{1}{\sqrt{\pi}} \int_{a/\sqrt{2}}^{\infty} e^{-v^2}\,dv , \]
with $a$ defined as $a = \frac{|\mu_1 - \mu_2|}{2\sigma}$. To match the book exactly let $v = \frac{u}{\sqrt{2}}$, so that $dv = \frac{du}{\sqrt{2}}$, and the above integral becomes
\[ P_e = \frac{1}{\sqrt{2\pi}} \int_{a}^{\infty} e^{-u^2/2}\,du . \]
In terms of the error function (defined as $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$), we have
\[ P_e = \frac{1}{2}\left[ \frac{2}{\sqrt{\pi}} \int_{0}^{\infty} e^{-v^2}\,dv - \frac{2}{\sqrt{\pi}} \int_{0}^{a/\sqrt{2}} e^{-v^2}\,dv \right] = \frac{1}{2}\left[ 1 - \frac{2}{\sqrt{\pi}} \int_{0}^{a/\sqrt{2}} e^{-v^2}\,dv \right] = \frac{1}{2} - \frac{1}{2}\,\mathrm{erf}\!\left( \frac{|\mu_1 - \mu_2|}{2\sqrt{2}\,\sigma} \right) . \]
Since the error function is programmed into many mathematical libraries, this form is easily evaluated.

Part (b): From the given inequality
\[ P_e = \frac{1}{\sqrt{2\pi}} \int_{a}^{\infty} e^{-t^2/2}\,dt \leq \frac{1}{\sqrt{2\pi}\,a}\, e^{-a^2/2} , \]
and the definition of $a$ as the lower limit of integration in Part (a) above, we have
\[ P_e \leq \frac{1}{\sqrt{2\pi}} \cdot \frac{2\sigma}{|\mu_1 - \mu_2|}\, e^{-\frac{(\mu_1-\mu_2)^2}{2(4\sigma^2)}} = \frac{2\sigma}{\sqrt{2\pi}\,|\mu_1 - \mu_2|} \exp\left\{ -\frac{1}{8}\frac{(\mu_1 - \mu_2)^2}{\sigma^2} \right\} \to 0 , \]
as $\frac{|\mu_1 - \mu_2|}{\sigma} \to +\infty$. This can happen if $\mu_1$ and $\mu_2$ get far apart or if $\sigma$ shrinks to zero. In either case the classification problem gets easier.

Problem 32 (probability of error in higher dimensions)

As in Problem 31, the classification decision boundary is now the set of points $x$ that satisfy
\[ \|x - \mu_1\|^2 = \|x - \mu_2\|^2 . \]
Expanding each side of this expression we find
\[ \|x\|^2 - 2x^t\mu_1 + \|\mu_1\|^2 = \|x\|^2 - 2x^t\mu_2 + \|\mu_2\|^2 . \]
Canceling the common $\|x\|^2$ from each side and grouping the terms linear in $x$ we have
\[ x^t(\mu_1 - \mu_2) = \frac{1}{2}\left( \|\mu_1\|^2 - \|\mu_2\|^2 \right) , \]
which is the equation of a hyperplane.
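As a quick numerical check of the closed form for $P_e$ just derived, here is an illustrative sketch (the means, variance, and sample count below are arbitrary) comparing the erf expression with a Monte Carlo estimate of the error of the midpoint-threshold rule on equal-prior Gaussian data.

```python
import math
import random

def bayes_error_erf(mu1, mu2, sigma):
    """Closed form from Part (a): 1/2 - 1/2 * erf(|mu1 - mu2| / (2*sqrt(2)*sigma))."""
    return 0.5 - 0.5 * math.erf(abs(mu1 - mu2) / (2.0 * math.sqrt(2.0) * sigma))

def bayes_error_monte_carlo(mu1, mu2, sigma, n=200_000, seed=0):
    """Estimate the error of the midpoint-threshold rule on simulated equal-prior data.

    Assumes mu1 < mu2, as in the solution above (classify as H1 when x < xi).
    """
    rng = random.Random(seed)
    xi = 0.5 * (mu1 + mu2)
    errors = 0
    for _ in range(n):
        if rng.random() < 0.5:          # sample from class H1
            errors += rng.gauss(mu1, sigma) > xi
        else:                           # sample from class H2
            errors += rng.gauss(mu2, sigma) < xi
    return errors / n

print(bayes_error_erf(0.0, 2.0, 1.0))          # about 0.1587
print(bayes_error_monte_carlo(0.0, 2.0, 1.0))  # close to the value above
```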
Problem 34 (error bounds on non-Gaussian data)

Part (a): Since the Bhattacharyya bound is a specialization of the Chernoff bound, we begin by considering the Chernoff bound. Specifically, using the bound $\min(a,b) \leq a^{\beta} b^{1-\beta}$, for $a > 0$, $b > 0$, and $0 \leq \beta \leq 1$, one can show that
\[ P(\text{error}) \leq P(\omega_1)^{\beta} P(\omega_2)^{1-\beta}\, e^{-k(\beta)} \quad \text{for } 0 \leq \beta \leq 1 . \]
Here the expression $k(\beta)$ is given by
\[ k(\beta) = \frac{\beta(1-\beta)}{2} (\mu_2 - \mu_1)^t \left[ \beta\Sigma_1 + (1-\beta)\Sigma_2 \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln\!\left( \frac{|\beta\Sigma_1 + (1-\beta)\Sigma_2|}{|\Sigma_1|^{\beta}\, |\Sigma_2|^{1-\beta}} \right) . \]
To compute the Bhattacharyya bound we set $\beta = 1/2$. For a two-class scalar problem with symmetric means and equal variances, that is $\mu_1 = -\mu$, $\mu_2 = +\mu$, and $\sigma_1 = \sigma_2 = \sigma$, the expression for $k(\beta)$ above becomes
\[ k(\beta) = \frac{\beta(1-\beta)}{2} (2\mu) \left[ \beta\sigma^2 + (1-\beta)\sigma^2 \right]^{-1} (2\mu) + \frac{1}{2} \ln\!\left( \frac{|\beta\sigma^2 + (1-\beta)\sigma^2|}{\sigma^{2\beta}\, \sigma^{2(1-\beta)}} \right) = \frac{2\beta(1-\beta)\mu^2}{\sigma^2} , \]
since the logarithmic term vanishes. Since this expression is valid for all $\beta$ in the range $0 \leq \beta \leq 1$, we obtain the tightest possible bound by maximizing $k(\beta)$ (since the dominant $\beta$-dependent factor is $e^{-k(\beta)}$). To find this maximum we take the derivative with respect to $\beta$ and set the result equal to zero,
\[ \frac{d}{d\beta}\left( \frac{2\beta(1-\beta)\mu^2}{\sigma^2} \right) = 0 , \]
which gives $1 - 2\beta = 0$, or $\beta = 1/2$. Thus in this case the Chernoff bound equals the Bhattacharyya bound. When $\beta = 1/2$ we have the following bound on our error probability:
\[ P(\text{error}) \leq \sqrt{P(\omega_1) P(\omega_2)}\; e^{-\frac{\mu^2}{2\sigma^2}} . \]
To simplify things further, if we assume that the variance is such that $\sigma^2 = \mu^2$, and that the class priors are uniform ($P(\omega_1) = P(\omega_2) = 1/2$), the above becomes
\[ P(\text{error}) \leq \frac{1}{2} e^{-1/2} \approx 0.3033 . \]

Chapter 3 (Maximum Likelihood and Bayesian Estimation)

Problem 34

In the problem statement we are told that operating on data of "size" $n$ requires $f(n)$ numeric calculations. Assuming that each calculation requires $10^{-9}$ seconds to compute, we can process a data structure of size $n$ in $10^{-9} f(n)$ seconds. If we want our results finished by a time $T$, then the largest data structure that we can process must satisfy
\[ 10^{-9} f(n) \leq T \quad \text{or} \quad n \leq f^{-1}(10^9 T) . \]
Converting the given range of times into seconds we find
\[ T = \{ 1\ \text{sec},\ 1\ \text{hour},\ 1\ \text{day},\ 1\ \text{year} \} = \{ 1\ \text{sec},\ 3600\ \text{sec},\ 86400\ \text{sec},\ 31536000\ \text{sec} \} . \]
Using a very naive algorithm we can compute the largest $n$ such that $f(n) \leq 10^9 T$ by simply incrementing $n$ repeatedly. This is done in the MATLAB script prob 34 chap 3.m and the results are displayed in Table XXX.
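The MATLAB script prob 34 chap 3.m and its results table are not reproduced in this extract, so as a stand-in here is a minimal Python sketch of the same naive search; the complexity function $f(n) = 2^n$ used below is a hypothetical example, not necessarily one of the functions given in the problem statement.

```python
def largest_n(f, budget_ops):
    """Naive search from the text: increment n until f(n) exceeds the budget."""
    n = 0
    while f(n + 1) <= budget_ops:
        n += 1
    return n

# Each calculation takes 1e-9 seconds, so a budget of T seconds allows 1e9 * T calculations.
OPS_PER_SECOND = 10**9
times_sec = {"1 sec": 1, "1 hour": 3600, "1 day": 86400, "1 year": 31536000}

f = lambda n: 2**n  # hypothetical f(n); substitute the f's from the problem statement
for label, T in times_sec.items():
    print(label, largest_n(f, OPS_PER_SECOND * T))
```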
Problem 35 (recursive vs. direct calculation of means and covariances)

Part (a): To compute $\hat{\mu}_n$ we have to add $n$ $d$-dimensional vectors, resulting in $nd$ computations. We then have to divide each component by the scalar $n$, resulting in another $n$ computations. Thus in total we have
\[ nd + n = (1+d)n \]
calculations. To compute the sample covariance matrix $C_n$, we first have to compute the differences $x_k - \hat{\mu}_n$, requiring $d$ subtractions for each vector $x_k$. Since we must do this for each of the $n$ vectors $x_k$, we have a total of $nd$ calculations to compute all of the difference vectors. We then have to compute the outer products $(x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)'$, each of which requires $d^2$ calculations (in a naive implementation). Since we have $n$ such outer products, computing them all requires $nd^2$ operations. We next have to add all of these $n$ outer products together, resulting in another $(n-1)d^2$ operations. Finally, dividing by the scalar $n-1$ requires another $d^2$ operations. Thus in total we have
\[ nd + nd^2 + (n-1)d^2 + d^2 = 2nd^2 + nd \]
calculations.

Part (b): We can derive recursive formulas for $\hat{\mu}_n$ and $C_n$ in the following way. First, for $\hat{\mu}_n$ we note that
\[ \hat{\mu}_{n+1} = \frac{1}{n+1} \sum_{k=1}^{n+1} x_k = \frac{x_{n+1}}{n+1} + \frac{n}{n+1} \left( \frac{1}{n} \sum_{k=1}^{n} x_k \right) = \frac{x_{n+1}}{n+1} + \frac{n}{n+1} \hat{\mu}_n . \]
Writing $\frac{n}{n+1}$ as $\frac{n+1-1}{n+1} = 1 - \frac{1}{n+1}$, the above becomes
\[ \hat{\mu}_{n+1} = \frac{1}{n+1} x_{n+1} + \hat{\mu}_n - \frac{1}{n+1} \hat{\mu}_n = \hat{\mu}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{\mu}_n \right) , \]
as expected. To derive the recursive update formula for the covariance matrix $C_n$ we begin by recalling the definition
\[ C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_{n+1})(x_k - \hat{\mu}_{n+1})' . \]
Now, introducing the recursive expression for the mean $\hat{\mu}_{n+1}$ computed above, we have
\[ C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} \left( x_k - \hat{\mu}_n - \tfrac{1}{n+1}(x_{n+1} - \hat{\mu}_n) \right) \left( x_k - \hat{\mu}_n - \tfrac{1}{n+1}(x_{n+1} - \hat{\mu}_n) \right)' \]
\[ = \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)' - \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' - \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_{n+1} - \hat{\mu}_n)(x_k - \hat{\mu}_n)' + \frac{1}{n(n+1)^2} \sum_{k=1}^{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' . \]
Since $\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$ we see that
\[ \hat{\mu}_n - \frac{1}{n} \sum_{k=1}^{n} x_k = 0 \quad \text{or} \quad \frac{1}{n} \sum_{k=1}^{n} \left( \hat{\mu}_n - x_k \right) = 0 . \]
Using this fact, the second and third terms above can be greatly simplified. For example, we have that
\[ \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n) = \sum_{k=1}^{n} (x_k - \hat{\mu}_n) + (x_{n+1} - \hat{\mu}_n) = x_{n+1} - \hat{\mu}_n . \]
Using this identity, and the fact that the summand of the fourth term does not depend on $k$, we have $C_{n+1}$ given by
\[ C_{n+1} = \frac{n-1}{n} \left( \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)' \right) + \frac{1}{n} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' - \frac{2}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' + \frac{1}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' \]
\[ = \left( 1 - \frac{1}{n} \right) C_n + \left( \frac{1}{n} - \frac{1}{n(n+1)} \right) (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' = \frac{n-1}{n} C_n + \frac{n+1-1}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' \]
\[ = \frac{n-1}{n} C_n + \frac{1}{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' , \]
which is the result in the book.

Part (c): To compute $\hat{\mu}_n$ using the recursive formulation we need $d$ subtractions to compute $x_n - \hat{\mu}_{n-1}$, the scalar division requires another $d$ operations, and the addition of $\hat{\mu}_{n-1}$ requires another $d$ operations, giving $3d$ operations in total. This is to be compared with the $(d+1)n$ operations in the non-recursive case. To compute $C_n$ recursively we need $d$ computations for the difference $x_n - \hat{\mu}_{n-1}$, then $d^2$ operations to compute the outer product $(x_n - \hat{\mu}_{n-1})(x_n - \hat{\mu}_{n-1})'$, another $d^2$ operations to scale this outer product by $\frac{1}{n}$, another $d^2$ operations to multiply $C_{n-1}$ by $\frac{n-2}{n-1}$, and finally $d^2$ operations to add the two matrices. Thus in total we have
\[ d + d^2 + d^2 + d^2 + d^2 = 4d^2 + d \]
operations, to be compared with $2nd^2 + nd$ in the non-recursive case. For large $n$ the non-recursive calculation is much more computationally intensive.

Chapter 6 (Multilayer Neural Networks)

Problem 39 (derivatives of matrix inner products)

Part (a): We present a slightly different method of proof here. We desire the gradient (or derivative) of the function
\[ \phi \equiv x^T K x . \]
The $i$-th component of this derivative is given by
\[ \frac{\partial \phi}{\partial x_i} = \frac{\partial}{\partial x_i} \left( x^T K x \right) = e_i^T K x + x^T K e_i , \]
where $e_i$ is the $i$-th elementary basis vector for $\mathbb{R}^n$, i.e. it has a 1 in the $i$-th position and zeros everywhere else. Now since $x^T K e_i$ is a scalar it equals its own transpose,
\[ x^T K e_i = (x^T K e_i)^T = e_i^T K^T x , \]
so the above becomes
\[ \frac{\partial \phi}{\partial x_i} = e_i^T K x + e_i^T K^T x = e_i^T (K + K^T) x . \]
Since multiplying by $e_i^T$ on the left selects the $i$-th row of the expression to its right, the full gradient expression is given by
\[ \nabla\phi = (K + K^T) x , \]
as requested in the text. Note that this expression can also be proved easily by writing each term in component notation, as suggested in the text.

Part (b): If $K$ is symmetric, then since $K^T = K$ the above expression simplifies to
\[ \nabla\phi = 2Kx , \]
as claimed in the book.
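As a sanity check on the gradient identity derived in Part (a), the following illustrative sketch (the random matrix K and point x are arbitrary) compares $(K + K^T)x$ against a finite-difference approximation of $\nabla(x^T K x)$.

```python
import numpy as np

def analytic_grad(K, x):
    """Gradient of phi(x) = x^T K x derived above: (K + K^T) x."""
    return (K + K.T) @ x

def numeric_grad(K, x, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    phi = lambda v: v @ K @ v
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (phi(x + e) - phi(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
K = rng.normal(size=(4, 4))   # a generic (non-symmetric) matrix
x = rng.normal(size=4)
print(np.allclose(analytic_grad(K, x), numeric_grad(K, x), atol=1e-5))  # True
```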
Chapter 9 (Algorithm-Independent Machine Learning)

Problem 9 (sums of binomial coefficients)

Part (a): We first recall the binomial theorem,
\[ (x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k} . \]
If we let $x = 1$ and $y = 1$ then $x + y = 2$ and the sum above becomes
\[ 2^n = \sum_{k=0}^{n} \binom{n}{k} , \]
which is the stated identity. As an aside, note also that if we let $x = -1$ and $y = 1$ then $x + y = 0$ and we obtain another interesting identity involving the binomial coefficients,
\[ 0 = \sum_{k=0}^{n} \binom{n}{k} (-1)^k . \]

Problem 23 (the average of the leave one out means)

We define the leave one out mean $\mu_{(i)}$ as
\[ \mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j , \]
which can obviously be computed for every $i$. We now want to consider the mean of the leave one out means. To do so, first notice that $\mu_{(i)}$ can be written as
\[ \mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j = \frac{1}{n-1} \left( \sum_{j=1}^{n} x_j - x_i \right) = \frac{n}{n-1} \left( \frac{1}{n} \sum_{j=1}^{n} x_j \right) - \frac{x_i}{n-1} = \frac{n}{n-1} \hat{\mu} - \frac{x_i}{n-1} . \]
With this we see that the mean of all of the leave one out means is given by
\[ \frac{1}{n} \sum_{i=1}^{n} \mu_{(i)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{n}{n-1} \hat{\mu} - \frac{x_i}{n-1} \right) = \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) = \frac{n}{n-1} \hat{\mu} - \frac{\hat{\mu}}{n-1} = \hat{\mu} , \]
as expected.

Problem 40 (maximum likelihood estimation with a binomial random variable)

Since $k$ has a binomial distribution with parameters $(n', p)$ we have
\[ P(k) = \binom{n'}{k} p^k (1-p)^{n'-k} . \]
The $p$ that maximizes this expression is found by taking the derivative of the above with respect to $p$, setting the resulting expression equal to zero, and solving for $p$. This derivative is given by
\[ \frac{d}{dp} P(k) = \binom{n'}{k} k p^{k-1} (1-p)^{n'-k} - \binom{n'}{k} p^k (n'-k) (1-p)^{n'-k-1} . \]
Setting this equal to zero and solving for $p$, we find $p = \frac{k}{n'}$, the empirical counting estimate of the probability of success, as we were asked to show.
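As a small numerical illustration of this last result, the sketch below (the values $n' = 50$ and $k = 18$ are made up) evaluates the binomial likelihood on a fine grid of $p$ and confirms that the maximum sits at $p = k/n'$.

```python
import math

def binomial_likelihood(p, k, n_prime):
    """P(k) = C(n', k) p^k (1 - p)^(n' - k)."""
    return math.comb(n_prime, k) * p**k * (1 - p)**(n_prime - k)

n_prime, k = 50, 18                      # made-up example values
grid = [i / 10000 for i in range(1, 10000)]
p_best = max(grid, key=lambda p: binomial_likelihood(p, k, n_prime))
print(p_best, k / n_prime)               # both approximately 0.36
```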