Statistics 550 Notes 3

Reading: Section 1.3

Decision Theoretic Framework: a framework for evaluating and choosing statistical inference procedures.

I. Motivating Example

A cofferdam protecting a construction site was designed to withstand flows of up to 1870 cubic feet per second (cfs). An engineer wishes to estimate the probability that the dam will be overtopped during the upcoming year. Over the previous 25-year period, the annual maximum flood level exceeded 1870 cfs 5 times. The engineer models the data on whether the flood level exceeded 1870 cfs as independent Bernoulli trials with the same probability $p$ that the flood level will exceed 1870 cfs in each year.

Some possible estimates of $p$ based on iid Bernoulli trials $X_1, \ldots, X_n$:

(1) $\hat{p} = \frac{\sum_{i=1}^n X_i}{n}$;

(2) $\hat{p} = \frac{\sum_{i=1}^n X_i + 1}{n+2}$, the posterior mean for a uniform prior on $p$;

(3) $\hat{p} = \frac{\sum_{i=1}^n X_i + 2}{n+4}$, the posterior mean for a Beta(2,2) prior on $p$ (called the Wilson estimate, recommended by Moore and McCabe, Introduction to the Practice of Statistics).

How should we decide which of these estimates to use? The answer depends in part on how errors in the estimation of $p$ affect us.

Example 1 of a decision problem: The firm wants the engineer to provide her best "guess" of $p$, the probability of an overflow, i.e., to estimate $p$ by $\hat{p}$. The firm wants the probability of an overflow to be at most 0.05. Based on the estimate $\hat{p}$ of $p$, the engineer's firm plans to spend an additional $f(\max(0, \hat{p} - 0.05))$ dollars to shore up the dam, where $f(0) = 0$ and $f$ is an increasing function. By spending this money, the firm will make the probability of an overflow $\max(0, p - \max(0, \hat{p} - 0.05))$. The cost of an overflow to the firm is \$C. The expected cost to the firm of using an estimate $\hat{p}$ (for a true initial probability of overflow $p$) is
$$f(\max(0, \hat{p} - 0.05)) + C \cdot \max(0, p - \max(0, \hat{p} - 0.05)).$$
We want to choose an estimate which provides low expected cost.

Example 2 of a decision problem: Another decision problem besides estimating $p$ might be that the firm wants to decide whether $p \le 0.15$ or $p > 0.15$; if $p > 0.15$, the firm would like to build additional support for the dam. This is an example of a testing problem: deciding whether the parameter lies in one of two subsets that form a partition of the parameter space. The cost to the firm of making the wrong decision depends on what type of error was made (deciding that $p > 0.15$ when in fact $p \le 0.15$, or deciding that $p \le 0.15$ when in fact $p > 0.15$).

The decision theoretic framework involves:
(1) clarifying the objectives of the study;
(2) pointing to what the different possible actions are;
(3) providing assessments of risk, accuracy, and reliability of statistical procedures;
(4) providing guidance in the choice of procedures for analyzing outcomes of experiments.

II. Components of the Decision Theory Framework (Section 1.3.1)

We observe data $X$ from a distribution $P_\theta$, where we do not know the true $\theta$ but only know that $P_\theta \in \mathcal{P} = \{P_\theta : \theta \in \Theta\}$ (the statistical model). The true parameter vector $\theta$ is sometimes called the "state of nature."

Action space: The action space $\mathcal{A}$ is the set of possible actions, decisions or claims that we can contemplate making after observing the data $X$. For Example 1, the action space is the set of possible estimates of $p$ (the probability of the dam being overtopped), $\mathcal{A} = [0,1]$. For Example 2, the action space is {decide that $p \le 0.15$, decide that $p > 0.15$}.
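For the cofferdam data ($n = 25$ years, 5 exceedances), the three estimates are easy to compute. Here is a minimal Python sketch (the variable names are ours):

```python
# Three estimates of p for the cofferdam data:
# n = 25 years, 5 of which had maximum flood level above 1870 cfs.
n, s = 25, 5

p_mle = s / n                  # (1) sample proportion
p_uniform = (s + 1) / (n + 2)  # (2) posterior mean under a uniform prior
p_wilson = (s + 2) / (n + 4)   # (3) posterior mean under a Beta(2,2) prior

print(p_mle, p_uniform, p_wilson)  # 0.2, 0.2222..., 0.2413...
```

The estimates differ enough (0.200 vs. 0.241) that the choice among them could matter for the firm's decision.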
Loss function: The loss function $l(\theta, a)$ is the loss incurred by taking the action $a$ when the true parameter vector is $\theta$. The loss function is assumed to be nonnegative. We want the loss to be small.

Relationship between the loss function and the utility function in economics: If the utility of taking the action $a$ when the true state of nature is $\theta$ is $U(\theta, a)$, then we can define the loss as
$$l(\theta, a) = \max_{\theta', a'} U(\theta', a') - U(\theta, a).$$
When there is uncertainty about the outcome of interest after taking the action (as in Example 1), we can replace the utility with the expected utility under the von Neumann-Morgenstern axioms for decision making under uncertainty (W. Nicholson, Microeconomic Theory, 6th ed., Ch. 12).

Ideally, we choose the loss function based on the economics of the decision problem, as in Example 1. More commonly, however, the loss function is chosen to qualitatively reflect what we are trying to do and to be mathematically convenient.

Commonly used loss functions for point estimation of a real-valued parameter $q(\theta)$: Denote our estimate of $q(\theta)$ by $a$. The most commonly used loss function is quadratic (squared error) loss: $l(\theta, a) = (q(\theta) - a)^2$. Other choices that are less computationally convenient, but that penalize large errors less severely and perhaps more realistically, are:

(1) absolute value loss, $l(\theta, a) = |q(\theta) - a|$;

(2) Huber's loss function,
$$l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k|q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k \end{cases}$$
for some constant $k$;

(3) the zero-one loss function,
$$l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k \end{cases}$$
for some constant $k$.

Decision procedures: A decision procedure or decision rule $\delta$ specifies how we use the data to choose an action $a$. A decision procedure is a function $\delta(X)$ from the sample space of the experiment to the action space. For Example 1, decision procedures include $\delta(X) = \frac{\sum_{i=1}^n X_i}{n}$ and $\delta(X) = \frac{\sum_{i=1}^n X_i + 1}{n+2}$.

Risk function: The loss of a decision procedure will vary over repetitions of the experiment because the data $X$ from the experiment are random. The risk function $R(\theta, \delta)$ is the expected loss from using the decision procedure $\delta$ when the true parameter vector is $\theta$:
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))].$$

Example: For quadratic loss in point estimation of $q(\theta)$, the risk function is the mean squared error:
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))] = E_\theta[(q(\theta) - \delta(X))^2].$$
This mean squared error can be decomposed as bias squared plus variance.

Proposition 3.1:
$$E_\theta[(q(\theta) - \delta(X))^2] = (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta[(\delta(X) - E_\theta[\delta(X)])^2].$$

Proof: We have
$$\begin{aligned}
E_\theta[(q(\theta) - \delta(X))^2] &= E_\theta[(\{q(\theta) - E_\theta[\delta(X)]\} + \{E_\theta[\delta(X)] - \delta(X)\})^2] \\
&= (q(\theta) - E_\theta[\delta(X)])^2 + 2\{q(\theta) - E_\theta[\delta(X)]\} E_\theta\{E_\theta[\delta(X)] - \delta(X)\} + E_\theta[\{E_\theta[\delta(X)] - \delta(X)\}^2] \\
&= (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta[\{E_\theta[\delta(X)] - \delta(X)\}^2] \\
&= \{\text{Bias}_\theta[\delta(X)]\}^2 + \text{Var}_\theta[\delta(X)],
\end{aligned}$$
where the cross term vanishes because $E_\theta\{E_\theta[\delta(X)] - \delta(X)\} = 0$. $\blacksquare$
(A Monte Carlo check of this decomposition appears after Example 3 below.)

Example 3: Suppose that an iid sample $X_1, \ldots, X_n$ is drawn from the uniform distribution on $[0, \theta]$, where $\theta > 0$ is an unknown parameter and the density of $X_i$ is
$$f_X(x; \theta) = \begin{cases} \frac{1}{\theta} & 0 < x < \theta \\ 0 & \text{elsewhere.} \end{cases}$$

Several point estimators of $\theta$:

1. $W_1 = \max_i X_i$. Note: $W_1$ is biased, $E_\theta(W_1) = \frac{n}{n+1}\theta$.

2. $W_2 = \frac{n+1}{n} \max_i X_i$. Note: Unlike $W_1$, $W_2$ is unbiased because $E_\theta(W_2) = \frac{n+1}{n} E_\theta(W_1) = \frac{n+1}{n} \cdot \frac{n}{n+1}\theta = \theta$.

3. $W_3 = 2\bar{X}$. Note: $W_3$ is unbiased: $E_\theta[X] = \int_0^\theta x \frac{1}{\theta}\, dx = \frac{\theta}{2}$, so $E_\theta[W_3] = 2 E_\theta[\bar{X}] = 2 \cdot \frac{\theta}{2} = \theta$.
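Here is the promised Monte Carlo check of Proposition 3.1, applied to $W_1$ (a minimal sketch, assuming numpy is available; the values of $\theta$ and $n$ are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of Proposition 3.1 (MSE = bias^2 + variance)
# for W1 = max_i X_i with X_1, ..., X_n iid Uniform(0, theta).
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 200_000

x = rng.uniform(0, theta, size=(reps, n))
w1 = x.max(axis=1)

mse = np.mean((w1 - theta) ** 2)
bias_sq = (w1.mean() - theta) ** 2
var = w1.var()
print(mse, bias_sq + var)  # the two numbers agree up to simulation error
```

Both quantities should be close to the exact value $2\theta^2/((n+1)(n+2)) \approx 0.061$ derived next.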
Comparison of the three estimators for the uniform example using the mean squared error criterion:

1. $W_1 = \max_i X_i$. The sampling distribution of $W_1$ follows from
$$P(W_1 \le w_1) = P(X_1 \le w_1, \ldots, X_n \le w_1) = \left(\frac{w_1}{\theta}\right)^n,$$
so
$$f_{W_1}(w_1) = \begin{cases} \frac{n w_1^{n-1}}{\theta^n} & 0 \le w_1 \le \theta \\ 0 & \text{elsewhere,} \end{cases}$$
and
$$E_\theta[W_1] = \int_0^\theta w_1 f_{W_1}(w_1)\, dw_1 = \int_0^\theta \frac{n w_1^n}{\theta^n}\, dw_1 = \left[\frac{n w_1^{n+1}}{(n+1)\theta^n}\right]_0^\theta = \frac{n}{n+1}\theta.$$
Thus
$$\text{Bias}(W_1) = E_\theta[W_1] - \theta = -\frac{\theta}{n+1}.$$
To calculate $\text{Var}(W_1)$, we calculate $E_\theta(W_1^2)$ and use the formula $\text{Var}(X) = E(X^2) - [E(X)]^2$:
$$E_\theta(W_1^2) = \int_0^\theta w_1^2 f_{W_1}(w_1)\, dw_1 = \int_0^\theta \frac{n w_1^{n+1}}{\theta^n}\, dw_1 = \left[\frac{n w_1^{n+2}}{(n+2)\theta^n}\right]_0^\theta = \frac{n}{n+2}\theta^2,$$
$$\text{Var}(W_1) = \frac{n}{n+2}\theta^2 - \left(\frac{n}{n+1}\right)^2 \theta^2 = \frac{n}{(n+2)(n+1)^2}\theta^2.$$
Thus,
$$\text{MSE}(W_1) = \{\text{Bias}(W_1)\}^2 + \text{Var}(W_1) = \frac{\theta^2}{(n+1)^2} + \frac{n\theta^2}{(n+2)(n+1)^2} = \frac{2\theta^2}{(n+1)(n+2)}.$$

2. $W_2 = \frac{n+1}{n} \max_i X_i = \frac{n+1}{n} W_1$. Note that $E_\theta(W_2) = \frac{n+1}{n} E_\theta(W_1) = \theta$. Thus $\text{Bias}(W_2) = 0$ and
$$\text{Var}(W_2) = \text{Var}\left(\frac{n+1}{n} W_1\right) = \left(\frac{n+1}{n}\right)^2 \text{Var}(W_1) = \left(\frac{n+1}{n}\right)^2 \frac{n\theta^2}{(n+2)(n+1)^2} = \frac{\theta^2}{n(n+2)}.$$
Because $W_2$ is unbiased,
$$\text{MSE}(W_2) = \text{Var}(W_2) = \frac{\theta^2}{n(n+2)}.$$

3. $W_3 = 2\bar{X}$. To find the mean squared error, we use the fact that if $X_1, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$ has mean $\mu$ and variance $\sigma^2/n$. We have
$$E_\theta(X) = \int_0^\theta x \frac{1}{\theta}\, dx = \frac{\theta}{2}, \qquad E_\theta(X^2) = \int_0^\theta x^2 \frac{1}{\theta}\, dx = \frac{\theta^2}{3},$$
$$\text{Var}(X) = \frac{\theta^2}{3} - \frac{\theta^2}{4} = \frac{\theta^2}{12}.$$
Thus $E_\theta(\bar{X}) = \frac{\theta}{2}$ and $\text{Var}(\bar{X}) = \frac{\theta^2}{12n}$, so $E_\theta(W_3) = 2E_\theta(\bar{X}) = \theta$ and $\text{Var}(W_3) = 4\,\text{Var}(\bar{X}) = \frac{\theta^2}{3n}$. $W_3$ is unbiased and has mean squared error $\frac{\theta^2}{3n}$.

The mean squared errors of the three estimators are:

   Estimator   MSE
   W1          2θ²/((n+1)(n+2))
   W2          θ²/(n(n+2))
   W3          θ²/(3n)

For $n = 1$, the three estimators have the same MSE. For $n > 1$,
$$\frac{\theta^2}{n(n+2)} < \frac{2\theta^2}{(n+1)(n+2)} \le \frac{\theta^2}{3n},$$
with equality on the right only at $n = 2$. So $W_2$ is best, $W_1$ is second best, and $W_3$ is the worst.
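Evaluating the three formulas for a few sample sizes makes the ordering concrete (a small sketch, in units of $\theta^2$):

```python
# Evaluate the three analytic MSE formulas (in units of theta^2)
# for several sample sizes n.
for n in [1, 2, 5, 10, 25]:
    mse_w1 = 2 / ((n + 1) * (n + 2))
    mse_w2 = 1 / (n * (n + 2))
    mse_w3 = 1 / (3 * n)
    print(n, round(mse_w1, 4), round(mse_w2, 4), round(mse_w3, 4))

# n=1: all three equal 1/3; n=2: W1 and W3 tie at 1/6;
# n>2: MSE(W2) < MSE(W1) < MSE(W3).
```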
III. Admissibility/Inadmissibility of Decision Procedures

A decision procedure $\delta$ is inadmissible if there exists another decision procedure $\delta'$ such that
$$R(\theta, \delta') \le R(\theta, \delta) \text{ for all } \theta \quad \text{and} \quad R(\theta, \delta') < R(\theta, \delta) \text{ for at least one } \theta.$$
The decision procedure $\delta'$ is then said to dominate $\delta$; there is no justification for using $\delta$ rather than $\delta'$.

In Example 3, $W_1$ and $W_3$ are inadmissible point estimators under squared error loss for $n > 1$.

A decision procedure $\delta$ is admissible if it is not inadmissible, i.e., if there does not exist a decision procedure $\delta'$ such that $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta$ and $R(\theta, \delta') < R(\theta, \delta)$ for at least one $\theta$.

IV. Selection of a Decision Procedure

We would like to choose a decision procedure which has a "good" risk function.

Ideal: We would like to construct a decision procedure that is at least as good as all other decision procedures for all $\theta$, i.e., $\delta(x)$ such that $R(\theta, \delta) \le R(\theta, \delta')$ for all $\theta$ and all other decision procedures $\delta'$. This is generally impossible! Admissibility alone is also too weak a requirement, as the following example shows.

Example 4: For $X_1, \ldots, X_n$ iid $N(\theta, 1)$, $\delta(X) \equiv 1$ is an admissible point estimator of $\theta$ under squared error loss.

Proof: Suppose $\delta(X) \equiv 1$ is inadmissible. Then there exists a decision procedure $\delta'$ that dominates $\delta$. This implies that $R(1, \delta') \le R(1, \delta) = 0$. Hence,
$$0 = R(1, \delta') = E_1[(\delta'(X_1, \ldots, X_n) - 1)^2].$$
Since $(\delta'(X_1, \ldots, X_n) - 1)^2$ is nonnegative, this implies $P_1[(\delta'(X_1, \ldots, X_n) - 1)^2 = 0] = 1$. Let $B$ be the event that $(\delta'(X_1, \ldots, X_n) - 1)^2 > 0$. We will show that $P_\theta(B) = 0$ for all $\theta \in (-\infty, \infty)$. This means that $\delta'(X_1, \ldots, X_n) = 1$ with probability 1 for all $\theta \in (-\infty, \infty)$, which means that $R(\theta, \delta) = R(\theta, \delta')$ for all $\theta \in (-\infty, \infty)$; this contradicts the assumption that $\delta'$ dominates $\delta$ and proves that $\delta(X) \equiv 1$ is admissible.

To show that $P_\theta(B) = 0$ for all $\theta \in (-\infty, \infty)$, we use the importance sampling idea that the expectation of a random variable $h(X)$ under a density $f$ can be evaluated as the expectation of the random variable $h(X)f(X)/g(X)$ under a density $g$, as long as $f$ and $g$ have the same support:
$$\begin{aligned}
P_\theta(B) &= \int \cdots \int I_B \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{\sum_{i=1}^n (x_i - \theta)^2}{2}\right) dx_1 \cdots dx_n \\
&= \int \cdots \int I_B \frac{\frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{\sum_{i=1}^n (x_i - \theta)^2}{2}\right)}{\frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{\sum_{i=1}^n (x_i - 1)^2}{2}\right)} \cdot \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{\sum_{i=1}^n (x_i - 1)^2}{2}\right) dx_1 \cdots dx_n \\
&= E_1\left[I_B \frac{\exp\left(-\frac{1}{2}\sum_{i=1}^n (X_i - \theta)^2\right)}{\exp\left(-\frac{1}{2}\sum_{i=1}^n (X_i - 1)^2\right)}\right]. \qquad (0.1)
\end{aligned}$$
Since $P_1(B) = 0$, the random variable
$$I_B \frac{\exp\left(-\frac{1}{2}\sum_{i=1}^n (X_i - \theta)^2\right)}{\exp\left(-\frac{1}{2}\sum_{i=1}^n (X_i - 1)^2\right)}$$
is zero with probability one under $\theta = 1$. Thus, by (0.1), $P_\theta(B) = 0$ for all $\theta \in (-\infty, \infty)$. $\blacksquare$

Comparison of risk under squared error loss for $\delta_1(X) \equiv 1$ and $\delta_2(X) = \bar{X}$:
$$R(\theta, \delta_1) = E_\theta[(1 - \theta)^2] = (1 - \theta)^2, \qquad R(\theta, \delta_2) = E_\theta[(\bar{X} - \theta)^2] = \text{Var}_\theta(\bar{X}) = \frac{1}{n}.$$
Although $\delta_1(X) \equiv 1$ is admissible, it does not have good risk properties for many values of $\theta$.

Approaches to choosing a decision procedure with good risk properties:
(1) Restrict the class of decision procedures and try to choose an optimal procedure within this class; e.g., for point estimation, we might only consider unbiased estimators $\delta(X)$ of $q(\theta)$, i.e., those satisfying $E_\theta[\delta(X)] = q(\theta)$ for all $\theta$.
(2) Compare risk functions by a global criterion. We shall discuss the Bayes and minimax criteria.

Example (Example 1.3.5 from Bickel and Doksum): We are trying to decide whether to drill a location for oil. There are two possible states of nature: $\theta_1$ = location contains oil and $\theta_2$ = location doesn't contain oil. We are considering three actions: $a_1$ = drill for oil, $a_2$ = sell the location, or $a_3$ = sell partial rights to the location. The following loss function is decided on:

                    a1 (Drill)   a2 (Sell)   a3 (Partial rights)
   θ1 (Oil)             0            10              5
   θ2 (No oil)         12             1              6

An experiment is conducted to obtain information about $\theta$, resulting in the random variable $X$ with possible values 0, 1 and frequency function $p(x, \theta)$ given by the following table:

                  x = 0    x = 1
   θ1 (Oil)        0.3      0.7
   θ2 (No oil)     0.6      0.4

$X = 1$ represents the presence of a certain geological (rock) formation that is more likely to be present when there is oil.

The possible nonrandomized decision procedures $\delta(x)$ are:

   Rule    1    2    3    4    5    6    7    8    9
   x = 0   a1   a1   a1   a2   a2   a2   a3   a3   a3
   x = 1   a1   a2   a3   a1   a2   a3   a1   a2   a3

The risk of $\delta$ at $\theta$ is
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))] = l(\theta, a_1) P_\theta[\delta(X) = a_1] + l(\theta, a_2) P_\theta[\delta(X) = a_2] + l(\theta, a_3) P_\theta[\delta(X) = a_3].$$
The risk functions are:

   Rule        1     2     3     4     5     6     7     8     9
   R(θ1, δ)    0     7    3.5    3    10    6.5   1.5   8.5    5
   R(θ2, δ)   12    7.6   9.6   5.4    1     3    8.4    4     6

(A code sketch reproducing this table appears at the end of this section.) The decision rules 2, 3, 8 and 9 are inadmissible, but the decision rules 1, 4, 5, 6 and 7 are all admissible.

V. Bayes Criterion

The Bayesian point of view leads to a natural global criterion. Suppose a person's prior distribution for $\theta$ is $\pi(\theta)$ and the model is that $X \mid \theta$ has probability density function (or probability mass function) $p(x \mid \theta)$. Then the joint (subjective) pdf (or pmf) of $(X, \theta)$ is $\pi(\theta) p(x \mid \theta)$. The Bayes risk of a decision procedure $\delta$ for a prior distribution $\pi$, denoted by $r_\pi(\delta)$, is the expected value of the risk over the joint distribution of $(X, \theta)$:
$$r_\pi(\delta) = E[E[l(\theta, \delta(X)) \mid \theta]] = E[R(\theta, \delta)].$$
For a person with subjective prior probability distribution $\pi$, the decision procedure which minimizes $r_\pi(\delta)$ minimizes the person's (subjective) expected loss and is the best procedure from this person's point of view. The decision procedure which minimizes the Bayes risk for a prior $\pi$ is called the Bayes rule for the prior $\pi$.
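Here is the promised sketch reproducing the risk table for Example 1.3.5 (plain Python; the variable names are ours):

```python
from itertools import product

# Compute R(theta, delta) for the nine nonrandomized rules in
# Example 1.3.5 from the loss table and the frequency function p(x, theta).
loss = {("oil", "a1"): 0,     ("oil", "a2"): 10,   ("oil", "a3"): 5,
        ("no oil", "a1"): 12, ("no oil", "a2"): 1, ("no oil", "a3"): 6}
p_x = {("oil", 0): 0.3,    ("oil", 1): 0.7,
       ("no oil", 0): 0.6, ("no oil", 1): 0.4}

# Rules 1-9 as (action if x=0, action if x=1), in the order of the table.
rules = list(product(["a1", "a2", "a3"], repeat=2))

for i, (act0, act1) in enumerate(rules, start=1):
    risk = {th: p_x[th, 0] * loss[th, act0] + p_x[th, 1] * loss[th, act1]
            for th in ("oil", "no oil")}
    print(i, risk["oil"], risk["no oil"])
# Reproduces the table: e.g., rule 4 -> (3, 5.4), rule 7 -> (1.5, 8.4).
```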
Example 1.3.5 continued: For the prior $\pi(\theta_1) = 0.2$ and $\pi(\theta_2) = 0.8$, the Bayes risks $r_\pi(\delta) = 0.2 R(\theta_1, \delta) + 0.8 R(\theta_2, \delta)$ are:

   Rule      1     2      3      4     5     6     7      8     9
   r_π(δ)   9.6   7.48   8.38   4.92  2.8   3.7   7.02   4.9   5.8

Thus, rule 5 is the Bayes rule for this prior distribution.

The Bayes rule depends on the prior. For the prior $\pi(\theta_1) = 0.5$ and $\pi(\theta_2) = 0.5$, the Bayes risks $r_\pi(\delta) = 0.5 R(\theta_1, \delta) + 0.5 R(\theta_2, \delta)$ are:

   Rule      1    2     3      4     5     6      7      8     9
   r_π(δ)    6   7.3   6.55   4.2   5.5   4.75   4.95   6.25  5.5

Thus, rule 4 is the Bayes rule for this prior distribution.

A non-subjective interpretation of Bayes rules: The Bayes approach leads us to compare procedures on the basis of
$$r_\pi(\delta) = \sum_\theta R(\theta, \delta)\, \pi(\theta)$$
if $\theta$ is discrete with frequency function $\pi$, or
$$r_\pi(\delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta$$
if $\theta$ is continuous with density $\pi$. Such comparisons make sense even if we do not interpret $\pi$ as a prior density or frequency function, but only as a weight function that reflects the importance we place on doing well at the different possible values of $\theta$. For example, in Example 1.3.5, if we felt that doing well at both $\theta_1$ and $\theta_2$ were equally important, we would set $\pi(\theta_1) = \pi(\theta_2) = 0.5$.

VI. Minimax Criterion

The minimax criterion minimizes the worst possible risk. That is, we prefer $\delta$ to $\delta'$ if and only if
$$\sup_\theta R(\theta, \delta) < \sup_\theta R(\theta, \delta').$$
A procedure $\delta^*$ is minimax (over a class of considered decision procedures) if it satisfies
$$\sup_\theta R(\theta, \delta^*) = \inf_\delta \sup_\theta R(\theta, \delta).$$

Among the nine decision rules considered in Example 1.3.5, rule 4 is the minimax rule:

   Rule                        1     2     3     4     5     6     7     8     9
   R(θ1, δ)                    0     7    3.5    3    10    6.5   1.5   8.5    5
   R(θ2, δ)                   12    7.6   9.6   5.4    1     3    8.4    4     6
   max{R(θ1, δ), R(θ2, δ)}    12    7.6   9.6   5.4   10    6.5   8.4   8.5    6

Game theory motivation for the minimax criterion: Suppose we play a two-person zero-sum game against Nature. Then the minimax decision procedure is the minimax strategy for the game.

Comments on the minimax criterion: The minimax criterion is very conservative. It aims to give maximum protection against the worst that can happen. The principle would be compelling if the statistician believed that Nature was a malevolent "opponent," but in fact Nature is just the inanimate state of the world. Although the minimax criterion is conservative, in many cases it does lead to reasonable procedures.

VII. Other Global Criteria for Decision Procedures

Two compromises between the Bayes and minimax criteria that have been proposed are:

(1) Restricted risk Bayes: Suppose that $M$ is the maximum risk of the minimax decision procedure. Then one may be willing to consider decision procedures whose maximum risk exceeds $M$, if the excess is controlled, say, if
$$R(\theta, \delta) \le M(1 + \epsilon) \text{ for all } \theta, \qquad (0.2)$$
where $\epsilon$ is the proportional increase in risk that one is willing to tolerate. A restricted risk Bayes decision procedure for the prior $\pi$ is then obtained by minimizing the Bayes risk $r_\pi(\delta)$ among all decision procedures that satisfy (0.2).

For Example 1.3.5 and the prior $\pi(\theta_1) = 0.2$, $\pi(\theta_2) = 0.8$:

   Rule        1     2      3      4     5     6     7      8     9
   r_π(δ)     9.6   7.48   8.38   4.92  2.8   3.7   7.02   4.9   5.8
   Max risk   12    7.6    9.6    5.4   10    6.5   8.4    8.5    6

For $\epsilon = 0.1$ (maximum risk allowed $= 1.1 \times 5.4 = 5.94$), decision rule 4 is the restricted risk Bayes procedure; for $\epsilon = 0.25$ (maximum risk allowed $= 1.25 \times 5.4 = 6.75$), decision rule 6 is the restricted risk Bayes procedure.

(2) Gamma minimaxity: Let $\Gamma$ be a class of prior distributions. A decision procedure $\delta^*$ is $\Gamma$-minimax (over a class of considered decision procedures) if
$$\sup_{\pi \in \Gamma} r_\pi(\delta^*) = \inf_\delta \sup_{\pi \in \Gamma} r_\pi(\delta).$$
Thus, the procedure $\delta^*$ minimizes the maximum Bayes risk over the priors in the class $\Gamma$.

Computational issues: We will study how to find Bayes and minimax point estimators in Chapter 3.
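Continuing the earlier sketch, the Bayes, minimax, and restricted risk Bayes selections for Example 1.3.5 can be checked in a few lines (risk values copied from the table):

```python
# Bayes, minimax, and restricted risk Bayes selection for Example 1.3.5.
risks = {1: (0, 12),    2: (7, 7.6),  3: (3.5, 9.6),
         4: (3, 5.4),   5: (10, 1),   6: (6.5, 3),
         7: (1.5, 8.4), 8: (8.5, 4),  9: (5, 6)}
prior = (0.2, 0.8)  # (pi(theta1), pi(theta2))

bayes = {i: prior[0] * r1 + prior[1] * r2 for i, (r1, r2) in risks.items()}
max_risk = {i: max(r) for i, r in risks.items()}

print(min(bayes, key=bayes.get))        # 5: the Bayes rule for this prior
print(min(max_risk, key=max_risk.get))  # 4: the minimax rule, M = 5.4

# Restricted risk Bayes: minimize the Bayes risk subject to
# max risk <= M * (1 + eps), with M = 5.4.
M = min(max_risk.values())
for eps in [0.1, 0.25]:
    feasible = [i for i in risks if max_risk[i] <= M * (1 + eps)]
    print(eps, min(feasible, key=lambda i: bayes[i]))  # 0.1 -> 4, 0.25 -> 6
```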
The restricted risk Bayes procedure is appealing, but it is difficult to compute.

VIII. Randomized Decision Procedures

A randomized decision procedure is a decision procedure which assigns to each possible outcome of the data $X$ a random variable $Y(x)$, where the values of $Y(x)$ are actions in the action space. When $X = x$, a draw from the distribution of $Y(x)$ is taken and constitutes the action.

We will show in Chapter 3 that for any prior, there is always a nonrandomized decision procedure with Bayes risk at least as small as that of any randomized decision procedure (so we can ignore randomized decision procedures in looking for the Bayes rule).

Students of game theory will realize that a randomized decision procedure may lead to a lower maximum risk than any nonrandomized decision procedure.

Example: For Example 1.3.5, a randomized decision procedure is to flip a fair coin and use decision rule 4 if the coin lands heads and decision rule 6 if the coin lands tails, i.e., $Y(x = 0) = a_2$ with probability 1, and $Y(x = 1) = a_1$ with probability 0.5 and $Y(x = 1) = a_3$ with probability 0.5. The risk of this randomized decision procedure is
$$0.5\, R(\theta, \delta_4) + 0.5\, R(\theta, \delta_6) = \begin{cases} 4.75 & \text{if } \theta = \theta_1 \\ 4.20 & \text{if } \theta = \theta_2, \end{cases}$$
which has lower maximum risk than decision rule 4 (the minimax rule among the nonrandomized decision rules).

Randomized decision procedures are somewhat impractical: it makes the statistician's inferences seem less credible if she has to explain to a scientist that she flipped a coin after observing the data to determine the inferences.

We will show in Section 1.5 that a randomized decision procedure cannot lower the maximum risk if the loss function is convex.
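The risk of the coin-flip mixture can be verified directly; the weight scan in the second half of the sketch is our own illustration (not part of the notes) of the game-theoretic point that the best mixture equalizes the two risks:

```python
# Risk of the randomized procedure mixing rules 4 and 6 in Example 1.3.5.
r4 = (3.0, 5.4)  # (R(theta1, rule 4), R(theta2, rule 4))
r6 = (6.5, 3.0)

mixed = tuple(0.5 * a + 0.5 * b for a, b in zip(r4, r6))
print(mixed, max(mixed))  # (4.75, 4.2): max risk 4.75 < 5.4

# Scanning the weight alpha placed on rule 4 shows the maximum risk is
# smallest where the two risks are equalized (alpha ~ 0.593, max risk ~ 4.42).
best = min((max(a * r4[0] + (1 - a) * r6[0],
                a * r4[1] + (1 - a) * r6[1]), a)
           for a in (i / 1000 for i in range(1001)))
print(best)
```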