Statistics 512 Notes 25: Decision Theory

Decision Theoretic Approach to Statistics: Views statistics as a mathematical theory for making decisions in the face of uncertainty.

The Decision Theory Paradigm: The decision maker chooses an action \(a\) from a set \(A\) of all possible actions based on the observation of a random variable, or data, \(X = (X_1, \ldots, X_n)\), which has a probability distribution that depends on a parameter \(\theta\) called the state of nature. The set of all possible values of \(\theta\) is denoted by \(\Theta\). The decision is made by a statistical decision function, \(d\), which maps the sample space (the set of all possible data values) to the action space \(A\). Denoting the data by \(X\), the action is random and is given by \(a = d(X)\). By taking the action \(a = d(X)\), the decision maker incurs a loss \(l(\theta, d(X))\), which depends on both \(\theta\) and \(d(X)\). The comparison of different decision functions is based on the risk function, which is the expected loss,

\[ R(\theta, d) = E_\theta[l(\theta, d(X))]. \]

Here, the expectation is taken with respect to the probability distribution of \(X\), which depends on \(\theta\). Note that the risk function depends on the true state of nature, \(\theta\), and on the decision function, \(d\). Decision theory is concerned with methods of determining "good" decision functions, that is, decision functions that have small risk.

Examples:

(1) Sampling inspection: A manufacturer produces a lot consisting of \(N\) items, \(n\) of which are sampled randomly and determined to be either defective or nondefective. Let \(p\) denote the proportion of the \(N\) items that are defective. Let \(X_i = 0\) or \(1\) according to whether the \(i\)th sampled item is nondefective or defective, and let \(X = (X_1, \ldots, X_n)\) denote the sample. Suppose that the lot is sold for a price, $M, with a guarantee that if the proportion of defective items exceeds \(p_0\), the manufacturer will pay a penalty, $P, to the buyer. For any lot, the manufacturer has two possible actions: either sell the lot or junk it at a cost, $C.
The action space is therefore \(A = \{\text{sell}, \text{junk}\}\). The data are \(X = (X_1, \ldots, X_n)\) and the state of nature is \(p\). The loss function depends on the action \(d(X)\) and the state of nature as shown in the following table:

State of nature      Sell      Junk
\(p \le p_0\)        -$M       $C
\(p > p_0\)          $P        $C

Here a profit is expressed as a negative loss. Note that the decision rule depends on \(X = (X_1, \ldots, X_n)\), which is random; the risk function is the expected loss, and the expectation is computed with respect to the probability distribution of \(X\). This distribution depends on \(p\).

(2) Classification: On the basis of several physiological measurements, \(X\), a decision must be made concerning whether a patient has suffered a myocardial infarction (MI) and should be admitted to intensive care. Here \(A = \{\text{admit}, \text{do not admit}\}\) and \(\Theta = \{\text{MI}, \text{no MI}\}\). The probability distribution of \(X\) depends on \(\theta\), perhaps in a complicated way. Some elements of the loss function may be difficult to specify; the economic cost of admission can be calculated, but the cost of not admitting a patient who has in fact suffered a myocardial infarction is more subjective. To make this problem more realistic, the action space could be expanded to include actions such as "send home" and "hospitalize for further observation."

(3) Estimation: Suppose that we want to estimate some function \(v(\theta)\) on the basis of a sample \(X = (X_1, \ldots, X_n)\), where the distribution of the \(X_i\) depends on \(\theta\). Here \(d(X)\) is an estimator of \(v(\theta)\). The quadratic loss function

\[ l(\theta, d(X)) = [v(\theta) - d(X)]^2 \]

is often used. The risk function is then

\[ R(\theta, d) = E_\theta[v(\theta) - d(X)]^2, \]

which is the familiar mean squared error. Note that, here again, the expectation is taken with respect to the distribution of \(X = (X_1, \ldots, X_n)\), which depends on \(\theta\).

Bayes Rules and Minimax Rules

Decision theory is concerned with choosing a "good" decision function, that is, one that has a small risk

\[ R(\theta, d) = E_\theta[l(\theta, d(X))]. \]

We have to face the difficulty that \(R\) depends on \(\theta\), which is not known.
For example, there might be two decision rules, \(d_1\) and \(d_2\), and two values of \(\theta\), \(\theta_1\) and \(\theta_2\), such that

\[ R(\theta_1, d_1) < R(\theta_1, d_2) \quad \text{but} \quad R(\theta_2, d_1) > R(\theta_2, d_2). \]

Thus \(d_1\) is better if the state of nature is \(\theta_1\), but \(d_2\) is better if the state of nature is \(\theta_2\). The two most widely used methods for confronting this difficulty are to use either a minimax rule or a Bayes rule.

The minimax method proceeds as follows. First, for a given decision function \(d\), consider the worst that the risk could be:

\[ \max_{\theta} R(\theta, d). \]

Then choose a decision function, \(d^*\), that minimizes this maximum risk:

\[ \max_{\theta} R(\theta, d^*) = \min_{d} \max_{\theta} R(\theta, d). \]

Such a decision rule is called a minimax rule.

The weakness of the minimax rule is intuitively apparent. It is a very conservative procedure that places all its emphasis on guarding against the worst possible case. In fact, this worst case might not be very likely to occur. To make this idea more precise, we can assign a probability distribution to the state of nature \(\theta\); this distribution is called the prior distribution of \(\theta\) and is denoted by \(\pi(\theta)\). Given such a prior distribution, we can calculate the Bayes risk of a decision function \(d\):

\[ B(d) = E_{\pi(\theta)}[R(\theta, d)], \]

where the expectation is taken with respect to the distribution \(\pi(\theta)\). The Bayes risk is the average of the risk function with respect to the prior distribution of \(\theta\). A function \(d^{**}\) that minimizes the Bayes risk is called a Bayes rule.

Example: As part of the foundation of a building, a steel section is to be driven down to a firm stratum below ground. The engineer has two choices (actions):

\(a_1\): select a 40-ft section
\(a_2\): select a 50-ft section

There are two possible states of nature:

\(\theta_1\): depth of firm stratum is 40 ft
\(\theta_2\): depth of firm stratum is 50 ft

If the 40-ft section is incorrectly chosen, an additional length of steel must be spliced on at a cost of $400. If the 50-ft section is incorrectly chosen, 10 ft of steel must be scrapped at a cost of $100.
The loss function is therefore represented in the following table:

             \(\theta_1\)    \(\theta_2\)
\(a_1\)      0               $400
\(a_2\)      $100            0

A depth sounding is taken by means of a sonic test. Suppose that the measured depth, \(X\), has three possible values, \(x_1 = 40\), \(x_2 = 45\), and \(x_3 = 50\), and that the probability distribution of \(X\) depends on \(\theta\) as follows:

\(x\)        \(\theta_1\)    \(\theta_2\)
\(x_1\)      .6              .1
\(x_2\)      .3              .2
\(x_3\)      .1              .7

We will consider the following four decision rules:

             \(x_1\)     \(x_2\)     \(x_3\)
\(d_1\)      \(a_1\)     \(a_1\)     \(a_1\)
\(d_2\)      \(a_1\)     \(a_2\)     \(a_2\)
\(d_3\)      \(a_1\)     \(a_1\)     \(a_2\)
\(d_4\)      \(a_2\)     \(a_2\)     \(a_2\)

We will first find the minimax rule. To do so, we need to compute the risk of each of the decision functions in the case where \(\theta = \theta_1\) and in the case where \(\theta = \theta_2\). For \(\theta = \theta_1\), each risk function is computed as

\[ R(\theta_1, d_i) = E_{\theta_1}[l(\theta_1, d_i(X))] = \sum_{j=1}^{3} l(\theta_1, d_i(x_j)) \, P(X = x_j \mid \theta = \theta_1). \]

We thus have

\[ R(\theta_1, d_1) = 0 \times .6 + 0 \times .3 + 0 \times .1 = 0 \]
\[ R(\theta_1, d_2) = 0 \times .6 + 100 \times .3 + 100 \times .1 = 40 \]
\[ R(\theta_1, d_3) = 0 \times .6 + 0 \times .3 + 100 \times .1 = 10 \]
\[ R(\theta_1, d_4) = 100 \times .6 + 100 \times .3 + 100 \times .1 = 100 \]

Similarly, in the case where \(\theta = \theta_2\), we have

\[ R(\theta_2, d_1) = 400, \quad R(\theta_2, d_2) = 40, \quad R(\theta_2, d_3) = 120, \quad R(\theta_2, d_4) = 0. \]

To find the minimax rule, we note that the maximum values of the risk of \(d_1\), \(d_2\), \(d_3\), and \(d_4\) are 400, 40, 120, and 100, respectively. Thus, \(d_2\) is the minimax rule.

We now consider computation of a Bayes rule. Suppose that on the basis of previous experience and from large-scale maps, we take as the prior distribution \(\pi(\theta_1) = .8\) and \(\pi(\theta_2) = .2\). Using this prior distribution and the risk functions computed above, we find for each decision function its Bayes risk

\[ B(d) = E_{\pi(\theta)}[R(\theta, d)] = R(\theta_1, d)\,\pi(\theta_1) + R(\theta_2, d)\,\pi(\theta_2). \]

Thus, we have

\[ B(d_1) = 0 \times .8 + 400 \times .2 = 80 \]
\[ B(d_2) = 40 \times .8 + 40 \times .2 = 40 \]
\[ B(d_3) = 10 \times .8 + 120 \times .2 = 32 \]
\[ B(d_4) = 100 \times .8 + 0 \times .2 = 80 \]

Comparing these numbers, we see that \(d_3\) is the Bayes rule (among these four rules). Note that this Bayes rule is less conservative than the minimax rule in that it chooses action \(a_1\) (40-ft length) based on observation \(x_2\) (45-ft sounding). That is because the prior distribution for this Bayes rule puts more weight on \(\theta_1\).
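The risk, minimax, and Bayes risk calculations for this example can be reproduced with a short script. The following is a minimal sketch, not part of the original notes: the dictionary layout and the function names (`risk`, `risk_table`, `minimax_rule`, `bayes_risks`, `bayes_rule`) are illustrative choices, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction as F

# Loss l(theta, a) in dollars, copied from the loss table in the notes.
LOSS = {("theta1", "a1"): 0, ("theta2", "a1"): 400,
        ("theta1", "a2"): 100, ("theta2", "a2"): 0}

# Sampling distribution P(X = x | theta) of the sonic sounding.
LIK = {"theta1": {"x1": F(6, 10), "x2": F(3, 10), "x3": F(1, 10)},
       "theta2": {"x1": F(1, 10), "x2": F(2, 10), "x3": F(7, 10)}}

# The four decision rules d1..d4, each mapping an observation to an action.
RULES = {"d1": {"x1": "a1", "x2": "a1", "x3": "a1"},
         "d2": {"x1": "a1", "x2": "a2", "x3": "a2"},
         "d3": {"x1": "a1", "x2": "a1", "x3": "a2"},
         "d4": {"x1": "a2", "x2": "a2", "x3": "a2"}}


def risk(theta, rule):
    """R(theta, d): expected loss over the distribution of X given theta."""
    return sum(LOSS[(theta, rule[x])] * p for x, p in LIK[theta].items())


def risk_table():
    """Risk of every rule under every state of nature."""
    return {name: {theta: risk(theta, rule) for theta in LIK}
            for name, rule in RULES.items()}


def minimax_rule(table):
    """Name of the rule whose worst-case (maximum) risk is smallest."""
    return min(table, key=lambda name: max(table[name].values()))


def bayes_risks(table, prior):
    """B(d): prior-weighted average of the risk, for every rule."""
    return {name: sum(prior[theta] * r for theta, r in risks.items())
            for name, risks in table.items()}


def bayes_rule(table, prior):
    """Name of the rule with the smallest Bayes risk."""
    b = bayes_risks(table, prior)
    return min(b, key=b.get)


if __name__ == "__main__":
    table = risk_table()
    prior = {"theta1": F(8, 10), "theta2": F(2, 10)}
    for name, b in bayes_risks(table, prior).items():
        worst = max(table[name].values())
        print(f"{name}: max risk = {float(worst):g}, Bayes risk = {float(b):g}")
    print("minimax rule:", minimax_rule(table))
    print("Bayes rule:  ", bayes_rule(table, prior))
```

With the prior (.8, .2) the script selects d2 as the minimax rule and d3 as the Bayes rule. As a hypothetical variation, reversing the prior to (.2, .8) gives Bayes risks 320, 40, 98, and 20, so `bayes_rule` then returns d4.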
If the prior distribution were changed sufficiently, the Bayes rule would change.

Posterior Analysis

We now develop a method for finding the Bayes rule. The Bayes risk for a prior distribution \(\pi(\theta)\) is the expected loss of a decision rule \(d(X)\) when \(X\) is generated from the following probability model:

First, the state of nature \(\theta\) is generated according to the prior distribution \(\pi(\theta)\).
Then, the data \(X\) are generated according to the distribution \(f(X; \theta)\), which we will denote by \(f(X \mid \theta)\).

Under this probability model (call it the Bayes model), the marginal distribution of \(X\) is (for the continuous case)

\[ f_X(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta. \]

Applying Bayes' theorem, the conditional distribution of \(\theta\) given \(X = x\) is

\[ h(\theta \mid x) = \frac{f_{X,\theta}(x, \theta)}{f_X(x)} = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,d\theta}. \]

The conditional distribution \(h(\theta \mid x)\) is called the posterior distribution of \(\theta\) given \(X = x\). The words prior and posterior derive from the facts that \(\pi(\theta)\) is specified before (prior to) observing \(X\) and \(h(\theta \mid x)\) is calculated after (posterior to) observing \(X = x\). We will say more later about the interpretation of prior and posterior distributions.

Suppose that we have observed \(X = x\). We define the posterior risk of an action \(a = d(x)\) as the expected loss, where the expectation is taken with respect to the posterior distribution of \(\theta\). For continuous random variables, we have

\[ E_{h(\theta \mid X = x)}[l(\theta, d(x))] = \int l(\theta, d(x))\,h(\theta \mid X = x)\,d\theta. \]

Theorem: Suppose there is a function \(d_0(x)\) that minimizes the posterior risk for each \(x\). Then \(d_0(x)\) is a Bayes rule.

Proof: We will prove this for the continuous case. The discrete case is proved analogously. The Bayes risk of a decision function \(d\) is

\[ B(d) = E_{\pi(\theta)}[R(\theta, d)] = E_{\pi(\theta)}\big[E_X(l(\theta, d(X)) \mid \theta)\big] \]
\[ = \int \left[ \int l(\theta, d(x))\,f(x \mid \theta)\,dx \right] \pi(\theta)\,d\theta \]
\[ = \int\!\!\int l(\theta, d(x))\,f_{X,\theta}(x, \theta)\,dx\,d\theta \]
\[ = \int \left[ \int l(\theta, d(x))\,h(\theta \mid x)\,d\theta \right] f_X(x)\,dx. \]

(We have used the relations \(f(x \mid \theta)\,\pi(\theta) = f_{X,\theta}(x, \theta) = f_X(x)\,h(\theta \mid x)\) and interchanged the order of integration.) Now the inner integral is the posterior risk, and since \(f_X(x)\) is nonnegative, \(B(d)\) can be minimized by choosing \(d(x) = d_0(x)\) for each \(x\).
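The decomposition used in the proof can be checked numerically in the discrete setting of the engineering example. The sketch below is an illustration under assumptions, not part of the original notes: it enumerates all eight deterministic rules from {x1, x2, x3} to {a1, a2} (the notes consider only four of them), uses the prior (.8, .2), and relies on hypothetical helper names (`bayes_risk`, `marginal`, `posterior_risk`, `bayes_risk_via_posterior`).

```python
from fractions import Fraction as F
from itertools import product

THETAS = ("theta1", "theta2")
XS = ("x1", "x2", "x3")
ACTIONS = ("a1", "a2")

# Loss, sampling distribution, and prior from the engineering example.
LOSS = {("theta1", "a1"): 0, ("theta2", "a1"): 400,
        ("theta1", "a2"): 100, ("theta2", "a2"): 0}
LIK = {"theta1": {"x1": F(6, 10), "x2": F(3, 10), "x3": F(1, 10)},
       "theta2": {"x1": F(1, 10), "x2": F(2, 10), "x3": F(7, 10)}}
PRIOR = {"theta1": F(8, 10), "theta2": F(2, 10)}


def bayes_risk(rule):
    """B(d) = sum over theta and x of l(theta, d(x)) f(x | theta) pi(theta)."""
    return sum(LOSS[(t, rule[x])] * LIK[t][x] * PRIOR[t]
               for t in THETAS for x in XS)


def marginal(x):
    """f_X(x): marginal probability of x under the Bayes model."""
    return sum(LIK[t][x] * PRIOR[t] for t in THETAS)


def posterior_risk(action, x):
    """Expected loss of an action under the posterior h(theta | x)."""
    return sum(LOSS[(t, action)] * LIK[t][x] * PRIOR[t]
               for t in THETAS) / marginal(x)


def bayes_risk_via_posterior(rule):
    """B(d) written as the f_X-weighted sum of posterior risks."""
    return sum(posterior_risk(rule[x], x) * marginal(x) for x in XS)


# Every deterministic decision rule, as a dict from observation to action.
ALL_RULES = [dict(zip(XS, acts)) for acts in product(ACTIONS, repeat=len(XS))]

# Brute-force minimizer of the Bayes risk over all eight rules.
BEST_RULE = min(ALL_RULES, key=bayes_risk)

# Rule built by minimizing the posterior risk separately for each x.
POINTWISE_RULE = {x: min(ACTIONS, key=lambda a: posterior_risk(a, x))
                  for x in XS}

if __name__ == "__main__":
    print("brute force:", BEST_RULE, float(bayes_risk(BEST_RULE)))
    print("pointwise:  ", POINTWISE_RULE)
```

Both constructions give the rule (a1, a1, a2), which is d3, with Bayes risk 32, and the two expressions for B(d) agree for every rule. The brute-force search grows exponentially with the number of possible observations, whereas the pointwise construction handles one observation at a time.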
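The practical importance of this theorem is that it allows us to find the action of the Bayes rule \(d^{**}\) given the data \(X = x\), namely \(d^{**}(x)\), using just the observed data \(x\), rather than considering all possible values of \(X\). In summary, the algorithm for finding \(d^{**}(x)\) is as follows:

Step 1: Calculate the posterior distribution \(h(\theta \mid X = x)\).

Step 2: For each action \(a\), calculate the posterior risk, which is

\[ E[l(\theta, a) \mid X = x] = \int l(\theta, a)\,h(\theta \mid X = x)\,d\theta. \]

The action \(a^*\) that minimizes the posterior risk is the Bayes rule action, \(d^{**}(x) = a^*\).

Example: Consider again the engineering example. Suppose that we observe \(X = x_2 = 45\). In the notation of that example, the prior distribution is \(\pi(\theta_1) = .8\), \(\pi(\theta_2) = .2\). We first calculate the posterior distribution:

\[ h(\theta_1 \mid x_2) = \frac{f(x_2 \mid \theta_1)\,\pi(\theta_1)}{\sum_{i=1}^{2} f(x_2 \mid \theta_i)\,\pi(\theta_i)} = \frac{.3 \times .8}{.3 \times .8 + .2 \times .2} \approx .86. \]

Hence, \(h(\theta_2 \mid x_2) \approx .14\). We next calculate the posterior risk (PR) for \(a_1\) and \(a_2\), using the rounded posterior probabilities:

\[ PR(a_1) = l(\theta_1, a_1)\,h(\theta_1 \mid x_2) + l(\theta_2, a_1)\,h(\theta_2 \mid x_2) = 0 + 400 \times .14 = 56 \]

and

\[ PR(a_2) = l(\theta_1, a_2)\,h(\theta_1 \mid x_2) + l(\theta_2, a_2)\,h(\theta_2 \mid x_2) = 100 \times .86 + 0 = 86. \]

Comparing the two, we see that \(a_1\) has the smaller posterior risk and is thus the Bayes rule action, \(d^{**}(x_2) = a_1\). This agrees with the rule \(d_3\) found earlier, which also takes action \(a_1\) when \(x_2\) is observed.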
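The two-step algorithm can be written as a small function of the observed value. This is a minimal sketch under the assumptions of the engineering example (two states, two actions, prior (.8, .2)); the names `posterior`, `posterior_risks`, and `bayes_action` are illustrative and do not come from the notes.

```python
from fractions import Fraction as F

ACTIONS = ("a1", "a2")

# Loss, sampling distribution, and prior from the engineering example.
LOSS = {("theta1", "a1"): 0, ("theta2", "a1"): 400,
        ("theta1", "a2"): 100, ("theta2", "a2"): 0}
LIK = {"theta1": {"x1": F(6, 10), "x2": F(3, 10), "x3": F(1, 10)},
       "theta2": {"x1": F(1, 10), "x2": F(2, 10), "x3": F(7, 10)}}
PRIOR = {"theta1": F(8, 10), "theta2": F(2, 10)}


def posterior(x, prior=PRIOR):
    """Step 1: posterior h(theta | x) by Bayes' theorem."""
    joint = {t: LIK[t][x] * prior[t] for t in prior}
    marginal = sum(joint.values())
    return {t: w / marginal for t, w in joint.items()}


def posterior_risks(x, prior=PRIOR):
    """Step 2: posterior expected loss of each action given X = x."""
    h = posterior(x, prior)
    return {a: sum(LOSS[(t, a)] * h[t] for t in h) for a in ACTIONS}


def bayes_action(x, prior=PRIOR):
    """Action with the smallest posterior risk, i.e. the Bayes rule at x."""
    pr = posterior_risks(x, prior)
    return min(pr, key=pr.get)


if __name__ == "__main__":
    h = posterior("x2")
    print({t: round(float(p), 3) for t, p in h.items()})
    print({a: round(float(r), 2) for a, r in posterior_risks("x2").items()})
    print("Bayes action at x2:", bayes_action("x2"))
```

Without intermediate rounding, the posterior probability of \(\theta_1\) given \(x_2\) is 6/7 and the posterior risks are 400/7 (about 57.1) and 600/7 (about 85.7); the values 56 and 86 in the text come from rounding the posterior to two decimals, and the conclusion that \(a_1\) is preferred is unchanged. Applying `bayes_action` to x1, x2, and x3 in turn yields a1, a1, a2.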