Lecture 3: Decision-theoretic components

Inferential approaches within statistics

Descriptive/Explorative
• Tables, diagrams, sample statistics
• No quantification of the uncertainty

Explanatory
• Application of statistical models to data
• Estimation and interpretation of parameters

Predictive
• Application of statistical models to data
• Estimation and tuning (learning) of parameters for the prediction of new cases

Decisive
• Main objective is to make decisions under uncertainty
• Estimation (and prediction) with statistical models is made by maximising the expected utility (or minimising the expected loss)

Frequentist or Bayesian?
• Statistical decision theory is not per definition limited to a specific paradigm (frequentist vs. Bayesian)
• However, evolving from the explanatory or predictive approach to a decisive approach is very often coupled with the application of Bayesian statistical thinking
• Two main principles:
  • Maximise the expected utility ⇔ minimise the expected loss – expected with respect to the posterior distribution of the state of nature (Bayesian)
  • The minimax principle (non-Bayesian)

Recall the Bayesian framework:
Sample: x = (x1, …, xn)
Prior density: p(θ)
Probability distribution of “data”: f(x | θ)  [pdf or pmf]
Likelihood: L(θ | x)  [= Π f(xi | θ) for an i.i.d. sample, or f(x | θ) in general]
Posterior density: q(θ | x)

Relations through Bayes’ theorem:

q(θ | x) = L(θ | x) · p(θ) / ∫ L(λ | x) · p(λ) dλ
         = (Π f(xi | θ)) · p(θ) / ∫ (Π f(xi | λ)) · p(λ) dλ
         = f(x | θ) · p(θ) / f(x),  where f(x) = ∫ f(x | λ) · p(λ) dλ
         ∝ L(θ | x) · p(θ)

Decision-theoretic elements
1. One of a number of decisions (or actions) should be chosen.
2. State of nature: a number of states are possible – there can be infinitely many. Usually represented by θ.
3. The consequence of taking a particular action given a certain state of nature is known (for all combinations of states and actions).
4. For each state of nature, the relative desirability of each of the possible actions can be quantified.
5.
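The Bayes-theorem relation above can be checked numerically for a finite set of candidate parameter values. The following is a minimal sketch (the function name, the exponential model and the numbers are illustrative choices, not from the slides): the posterior pmf is the prior times the likelihood, renormalised.

```python
# Discrete-prior Bayes update: q(theta | x) is proportional to L(theta | x) * p(theta).
import math

def posterior(thetas, prior, x, pdf):
    """Posterior pmf over a finite set of candidate parameter values."""
    # Likelihood of the whole i.i.d. sample: L(theta | x) = prod_i f(x_i | theta)
    lik = [math.prod(pdf(xi, th) for xi in x) for th in thetas]
    unnorm = [l * p for l, p in zip(lik, prior)]
    norm = sum(unnorm)            # f(x) = sum over lambda of L(lambda | x) * p(lambda)
    return [u / norm for u in unnorm]

# Illustration: exponential data with mean theta (the model used in the TV example later)
exp_pdf = lambda xi, th: math.exp(-xi / th) / th
q = posterior([6, 12, 24], [0.2, 0.3, 0.5], [10.0, 14.0], exp_pdf)
```

Note that the normalising constant cancels from any comparison between parameter values, which is why the proportionality form of Bayes’ theorem is often all that is needed.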
Prior information for the different states of nature may be available: a prior distribution of θ.
6. Data may be available, usually represented by x. Data can be used to update the knowledge about the relative desirability of (each of) the different actions.

Classical approach
True state of nature: θ – unknown. The Bayesian description of this uncertainty is the prior p(θ).
Data: x – an observation of X, whose pdf (or pmf) depends on θ (data is thus assumed to be available).
Decision rule: δ
Action: δ(x) – the decision rule becomes an action when applied to given data x.
Loss function: LS(θ, δ(x)) – measures the loss from taking action δ(x) when θ holds.
Risk function:

R(θ, δ) = ∫ LS(θ, δ(x)) · L(θ | x) dx = EX(LS(θ, δ(X)))

where L(θ | x) = f(x | θ) is the likelihood, i.e. the risk is the expected loss with respect to the variation in x. Note that the risk is a function of the decision rule, not of the action.

Minimax decision rule: a procedure δ* is a minimax decision rule if

max_θ R(θ, δ*) = min_δ max_θ R(θ, δ)

i.e. θ is taken to be its “worst” possible value, and under that value the decision rule that gives the lowest possible risk is chosen. The minimax rule uses no prior information about θ; thus it is not a Bayesian rule.

Example
Suppose you are about to decide whether to buy or rent a new TV to have for two years = 24 months.
δ1 = “Buy the TV”
δ2 = “Rent the TV”
Now assume θ is the mean time until the TV breaks down for the first time, and let θ take one of three possible values: 6, 12 and 24 months.
The cost of the TV is $500 if you buy it and $30 per month if you rent it. If the TV breaks down after 12 months you will have to replace it – at the same cost as the original purchase if you bought it. If you rented it you will get a new TV at no cost, provided you continue your contract.
Let X be the time in months until the TV breaks down and assume this variable is exponentially distributed with mean θ.
A loss function for an ownership of at most 24 months may be defined as

LS(θ, δ1(X)) = 500 + 500 · 1{X − 12}
LS(θ, δ2(X)) = 30 · 24 = 720

where 1{y} = 1 if y ≥ 0 and 1{y} = 0 if y < 0, i.e. an extra $500 is lost if the TV breaks down after 12 months and has to be replaced.

Then

R(θ, δ1) = E(500 + 500 · 1{X − 12}) = 500 + 500 · ∫_12^∞ (1/θ) e^(−x/θ) dx = 500 · (1 + e^(−12/θ))
R(θ, δ2) = 720

Now compare the risks for the three possible values of θ:

R(6, δ1) ≈ 568,  R(12, δ1) ≈ 684,  R(24, δ1) ≈ 803

Clearly the risk for the first rule increases with θ, while the risk for the second is constant. In searching for the minimax rule we therefore focus on the largest possible value of θ, and there δ2 has the smaller risk (720 < 803). Hence δ2 is the minimax decision rule.

Bayes decision rule
Bayes risk:

B(δ) = ∫_{θ∈Θ} R(θ, δ) · p(θ) dθ

i.e. θ is integrated out with respect to its prior distribution. Note! The integral is a sum when p(θ) is a pmf.
A Bayes rule is a procedure that minimises the Bayes risk:

δB = argmin_δ ∫_{θ∈Θ} R(θ, δ) · p(θ) dθ

Note! This concerns the decision rule, not a specific action.

Example cont.
Assume the three possible values of θ (6, 12 and 24) have the prior probabilities 0.2, 0.3 and 0.5 respectively:

p(θ) = 0.2 for θ = 6, 0.3 for θ = 12, 0.5 for θ = 24, and 0 otherwise  (a pmf)

Then

B(δ1) = 500 · [(1 + e^(−12/6)) · 0.2 + (1 + e^(−12/12)) · 0.3 + (1 + e^(−12/24)) · 0.5] ≈ 720.4
B(δ2) = 720 (does not depend on θ)

Thus the Bayes risk is – by a very small margin – minimised by δ2, so here δ2 is also the Bayes decision rule. Note how close the two Bayes risks are: a prior putting more weight on small values of θ, where R(θ, δ1) is low, would instead make δ1 the Bayes rule.

Bayesian decision theory
A decision problem is defined in terms of
1. A set of possible decisions (or actions) D = {d1, d2, …} – often referred to as the decision space.
2. A set of states of nature (or uncertain events). A particular state of nature is denoted θ, and the set of all possible states is denoted Θ.
3. A set of consequences C = {c1, c2, …}. A particular consequence in C is a function of the decision d and the state of nature θ: c(d, θ). Hence C should contain as many consequences as there are combinations of decisions and states of nature.
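The TV example can be verified numerically. The sketch below (names such as risk, "buy" and "rent" are my own) uses the risk function R(θ, δ1) = 500 · (1 + e^(−12/θ)) derived from the loss definition, compares worst-case risks for the minimax rule, and weights the risks by the prior for the Bayes risks.

```python
# Numerical check of the TV example: risk values, minimax rule and Bayes risks.
import math

thetas = [6, 12, 24]
prior = {6: 0.2, 12: 0.3, 24: 0.5}

def risk(theta, rule):
    if rule == "buy":            # delta_1: $500, plus $500 replacement if X >= 12
        return 500 * (1 + math.exp(-12 / theta))
    return 30 * 24               # delta_2: rent at $30/month for 24 months

# Minimax: compare the worst-case (over theta) risk of each rule
worst = {rule: max(risk(th, rule) for th in thetas) for rule in ("buy", "rent")}
minimax_rule = min(worst, key=worst.get)

# Bayes risk: B(delta) = sum over theta of R(theta, delta) * p(theta)
bayes = {rule: sum(risk(th, rule) * prior[th] for th in thetas)
         for rule in ("buy", "rent")}
```

Running this gives worst-case risks of about 803 for buying versus 720 for renting (renting is minimax), and Bayes risks of about 720.4 versus 720, so under this prior the two rules are nearly equivalent in the Bayes sense.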
The triple (D, Θ, C) describes the structure of the decision problem.

Utility
The decision maker is assumed to have an order of preference for the different consequences:
ci ≺ cj means that consequence cj is preferred to consequence ci
ci ~ cj means that ci and cj are equally preferred
ci ≼ cj means that ci is not preferred to cj

Example
Assume that when the temperature is above 25 °C and you have decided to wear long trousers and a long-sleeved shirt, you will as a consequence feel unusually hot:
c1 = c(d = “longs”, θ > 25 °C)
Assume that when the temperature is below 15 °C and you have decided to wear shorts and a t-shirt, you will as a consequence feel unusually cold:
c2 = c(d = “shorts”, θ < 15 °C)
Your preference order would then be one of c1 ≺ c2, c2 ≺ c1 and c1 ~ c2.

The preference order, or relative desirability, of different consequences is measured by the utility of each consequence. A utility function describes the utilities for all combinations of decision d and state of nature θ:

U(d, θ) = U(c) with c = c(d, θ)

The utility does not have to be a “positive reward”, and when comparing non-desirable consequences it is common to speak of loss rather than utility (see coming slides).

Example cont.
Assume that even if you feel unusually hot you can still do what you are supposed to do, e.g. go to work and earn one day’s salary (8 hours). Assume that if you feel unusually cold you will have to change clothes, which somewhat hinders what you are supposed to do, e.g. you lose one hour of paid salary. Then

U(“longs”, θ > 25 °C) = 8,  U(“shorts”, θ < 15 °C) = 7

Now, when the state of nature is unknown, decisions are made under uncertainty. This is the reason for “statistical decision theory”.
A decision maker can have an order of preference for two consequences, e.g. c1 ≺ c2, but since the consequence depends on the unknown state of nature θ, it is not possible to make a decision solely on the preference order.
The probabilities of the corresponding states of nature must also be taken into account. Hence, measuring the relative desirability goes back to the underlying probability distribution of the state of nature.

The expected utility of a decision d is obtained by integrating the utility function against the probability distribution of θ, with pdf (or pmf) g(θ):

U(d, g) = Eg(U(d, θ)) = ∫θ U(d, θ) · g(θ) dθ

When data are not taken into account, g(θ) is the prior pdf/pmf p(θ):

U(d, p) = ∫θ U(d, θ) · p(θ) dθ

When data x are taken into account, g(θ) is the posterior pdf/pmf q(θ | x):

U(d, q, x) = ∫θ U(d, θ) · q(θ | x) dθ

Now, for any triple of consequences (c, c1, c2) such that c1 ≼ c2 and c1 ≼ c ≼ c2 – i.e. c2 is preferred to c1, c1 is not preferred to c, and c is not preferred to c2 – it can be shown that there exists a unique number α ∈ [0, 1] such that

c ~ α c1 + (1 − α) c2

i.e. c and a convex combination of c1 and c2 are equally preferable. This in turn is because preferences over consequences can be expressed as preferences over probability distributions. When c ~ αc1 + (1 − α)c2 it can be shown that

U(c) = αU(c1) + (1 − α)U(c2)

For a particular state of nature θ, let c1 be the worst consequence and c2 the best consequence. Normalise – without loss of generality – the utility function U(c) such that U(c1) = 0 and U(c2) = 1. For a particular decision d such that c1 ≼ c(d, θ) ≼ c2 it is then possible to find α such that c(d, θ) is equivalent to a hypothetical gamble with consequences c1 and c2, where Pr(c1 | d, θ) = α and Pr(c2 | d, θ) = 1 − α. Hence

U(d, θ) = α · U(c1) + (1 − α) · U(c2) = 1 − α

since U(c1) = 0 and U(c2) = 1. This means that U(d, θ) can be seen as the probability of obtaining the best consequence.
Pr(Best consequence | d, θ) ∝ U(d, θ)
⇒ Pr(Best consequence | d) ∝ ∫θ U(d, θ) · g(θ) dθ = U(d, g)

Hence, the optimal decision is the one that maximises the expected utility under the probability distribution that governs the state of nature:

d(optimal, g) = argmax_{d∈D} U(d, g)

⇒ The Bayes decision is

d(B) = argmax_{d∈D} U(d, p)  when no data are used
d(B) = argmax_{d∈D} U(d, q, x)  when data x are used

Example
Assume you are choosing between fixing the interest rate of your mortgage loan for one year and keeping the floating interest rate for this period. Say the floating rate is currently 4 % and the fixed rate is 5 %. The floating rate may however increase during the period, and we may approximately assume that with probability g1 = 0.10 the average floating rate will be 7 %, with probability g2 = 0.20 it will be 6 %, and with probability g3 = 0.70 it will stay at 4 %.
Let d1 = “Fix the interest rate” and d2 = “Keep the floating interest rate”, and let θ = the average floating rate for the coming period.

U(d1, θ) = 4 − 5 = −1 for θ = 4;  6 − 5 = 1 for θ = 6;  7 − 5 = 2 for θ = 7
U(d2, θ) = 5 − 4 = 1 for θ = 4;  5 − 6 = −1 for θ = 6;  5 − 7 = −2 for θ = 7

⇒ U(d1, g) = (−1) · 0.7 + 1 · 0.2 + 2 · 0.1 = −0.3
U(d2, g) = 1 · 0.7 + (−1) · 0.2 + (−2) · 0.1 = 0.3
⇒ d(B) = d2

Loss function
When all consequences are non-desirable it is common to describe the decision problem in terms of losses rather than utilities. The loss function in Bayesian decision theory is defined as

LS(d, θ) = max_{a∈D} U(a, θ) − U(d, θ)

Then the Bayes action with the use of data can be written

d(B) = argmax_{d∈D} ∫θ U(d, θ) · q(θ | x) dθ
     = argmax_{d∈D} ∫θ [max_{a∈D} U(a, θ) − LS(d, θ)] · q(θ | x) dθ
     = argmin_{d∈D} ∫θ LS(d, θ) · q(θ | x) dθ = argmin_{d∈D} LS(d, q, x)

i.e. the action that minimises the expected posterior loss.
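The mortgage example is a direct expected-utility maximisation over a three-point distribution, and can be checked in a few lines (labels "fix" and "float" are my own):

```python
# Numerical check of the mortgage example: expected utility of each decision.
rates = [4, 6, 7]                       # possible average floating rates (%)
g = {4: 0.70, 6: 0.20, 7: 0.10}         # their probabilities g3, g2, g1

U = {
    "fix":   {4: 4 - 5, 6: 6 - 5, 7: 7 - 5},   # d1: pay fixed 5 %, saving theta - 5
    "float": {4: 5 - 4, 6: 5 - 6, 7: 5 - 7},   # d2: pay floating theta, saving 5 - theta
}

expected = {d: sum(U[d][th] * g[th] for th in rates) for d in U}
bayes_decision = max(expected, key=expected.get)
```

The expected utilities come out as −0.3 for fixing and 0.3 for floating, so the Bayes decision is d2, keeping the floating rate – the high probability of the rate staying at 4 % dominates.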
Example
A person asking for medical care has symptoms that may be connected with two different diseases, A and B. The symptoms could also be temporary and disappear within a reasonable time.
For A there is a therapy that cures the disease if it is present and hence removes the symptoms. If the disease is not present, the treatment leaves the symptoms at the same intensity.
For B there is a therapy that generally reduces the intensity of the symptoms by 10 %, regardless of whether B is present or not. If B is present the reduction is instead 40 %.
Assume that A is present with probability 0.3 and that B is present with probability 0.4. Assume further that A and B cannot be present at the same time, so the probability that the symptoms are just temporary is 0.3.
What is the Bayes decision in this case: treatment for A, treatment for B, or no treatment?
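The slides leave the answer as an exercise. Below is a sketch of one possible formalisation, under two modelling assumptions that are not stated in the slides: the loss is the expected long-run symptom intensity on a 0–1 scale, and temporary symptoms eventually disappear whichever decision is taken.

```python
# One possible setup of the medical example as expected-loss minimisation.
# Loss model and handling of "temporary" are assumptions (see lead-in).
p = {"A": 0.3, "B": 0.4, "temporary": 0.3}

# Remaining long-run symptom intensity for each (decision, state) pair.
loss = {
    "treat_A": {"A": 0.0, "B": 1.0, "temporary": 0.0},  # cures A if present
    "treat_B": {"A": 0.9, "B": 0.6, "temporary": 0.0},  # -40 % if B, else -10 %
    "none":    {"A": 1.0, "B": 1.0, "temporary": 0.0},  # temporary ones vanish
}

expected_loss = {d: sum(loss[d][s] * p[s] for s in p) for d in loss}
bayes_decision = min(expected_loss, key=expected_loss.get)
```

Under these particular assumptions the expected losses are 0.4, 0.51 and 0.7 for treating A, treating B and doing nothing, making treatment for A the Bayes decision; a different loss model (e.g. one that also prices the treatments, or weights short-term discomfort) could change the ranking.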