Lecture 3: Decision-theoretic components

Inferential approaches within statistics
• Descriptive/explorative: tables, diagrams, sample statistics. No quantification of the uncertainty.
• Explanatory: application of statistical models to data; estimation and interpretation of parameters.
• Predictive: application of statistical models to data; estimation and tuning (learning) of parameters for the prediction of new cases.
• Decisive: the main objective is to make decisions under uncertainty. Estimation (and prediction) with statistical models is made by maximising the expected utility (or minimising the expected loss).
Frequentist or Bayesian?
• Statistical decision theory is not by definition limited to a specific paradigm (frequentist vs. Bayesian).
• However, moving from the explanatory or predictive approach to a decisive approach is very often coupled with the application of Bayesian statistical thinking.
• Two main principles:
  – Maximise the expected utility ⇔ minimise the expected loss, where the expectation is taken with respect to the posterior distribution of the state of nature (Bayesian)
  – The minimax principle (non-Bayesian)
Recall the Bayesian framework:
Prior density: p(θ)
Probability distribution of the data: f(x|θ)  [pdf or pmf]
Sample: x = (x1, …, xn)
Likelihood: L(θ|x)  [= ∏ f(xi|θ) for independent observations, or f(x|θ) in general]
Posterior density: q(θ|x)
Relations through Bayes’ theorem:
q(θ|x) = f(x|θ)·p(θ) / f(x)
       = L(θ|x)·p(θ) / ∫ L(λ|x)·p(λ) dλ
       = (∏ f(xi|θ))·p(θ) / ∫ (∏ f(xi|λ))·p(λ) dλ
       ∝ L(θ|x)·p(θ)
Decision-theoretic elements
1. One of a number of decisions (or actions) should be chosen
2. State of nature: A number of states possible – can be an infinite
number. Usually represented by θ
3. The consequence of taking a particular action given a certain state
of nature is known (for all combinations of states and actions)
4. For each state of nature the relative desirability of each of the
different actions possible can be quantified
5. Prior information for the different states of nature may be
available: Prior distribution of θ
6. Data may be available. Usually represented by x. Can be used to
update the knowledge about the relative desirability of (each of)
the different actions
Classical approach
True state of nature: θ – unknown. The Bayesian description of this uncertainty is the prior p(θ).
Data: x – an observation of X, whose pdf (or pmf) depends on θ (data are thus assumed to be available).
Decision rule: δ – the decision rule becomes an action when applied to given data x.
Action: δ(x)
Loss function: LS(θ, δ(x)) – measures the loss from taking action δ(x) when θ holds.
Risk function:
R(θ, δ) = ∫x LS(θ, δ(x))·f(x|θ) dx = EX(LS(θ, δ(X)))
i.e. the expected loss with respect to the variation in x; note that f(x|θ) = L(θ|x) is the likelihood. The risk is a function of the decision rule (and not of the action).
Minimax decision rule:
A procedure δ* is a minimax decision rule if
maxθ R(θ, δ*) = minδ maxθ R(θ, δ)
i.e. θ is taken to be the “worst” possible value, and under that value the decision rule that gives the lowest possible risk is chosen.
The minimax rule uses no prior information about θ; thus it is not a Bayesian rule.
Example
Suppose you are about to decide whether to buy or rent a new TV to have for two years (24 months).
δ1 = “Buy the TV”
δ2 = “Rent the TV”
Now assume θ is the mean time until the TV breaks down for the first time, and let θ take one of three possible values: 6, 12 and 24 months.
The cost of the TV is $500 if you buy it and $30 per month if you rent it.
If you bought the TV and it breaks down after 12 months, you will have to replace it at the same cost. If you rented it, you will get a new TV at no cost provided you proceed with your contract.
Let X be the time in months until the TV breaks down and assume X is exponentially distributed with mean θ.
A loss function for an ownership of at most 24 months may be defined as
LS(θ, δ1(X)) = 500 + 500·1{X − 12}
LS(θ, δ2(X)) = 30·24 = 720
where 1{y} = 0 for y < 0 and 1{y} = 1 for y ≥ 0.
Then
R(θ, δ1) = E(500 + 500·1{X − 12}) = 500 + 500·∫₁₂^∞ θ⁻¹·e^(−x/θ) dx = 500·(1 + e^(−12/θ))
R(θ, δ2) = 720
Now compare the risks for the three possible values of θ.
Clearly the risk for the first rule increases with θ, while the risk for the second is constant. In searching for the minimax rule we therefore focus on the largest possible value of θ: there R(24, δ1) = 500·(1 + e^(−12/24)) ≈ 803 > 720, so δ2 has the smallest risk.
δ2 is the minimax decision rule.
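The minimax comparison can be reproduced in a few lines; the following sketch only uses the two risk functions derived above:

```python
from math import exp

thetas = [6, 12, 24]

# Risk functions from the TV example:
# R(theta, delta1) = 500 * (1 + exp(-12/theta))  (buy),
# R(theta, delta2) = 30 * 24 = 720               (rent, constant in theta).
risk = {
    "delta1 (buy)": {t: 500 * (1 + exp(-12 / t)) for t in thetas},
    "delta2 (rent)": {t: 720.0 for t in thetas},
}

# Minimax: take each rule's worst-case (largest) risk over theta,
# then choose the rule whose worst case is smallest.
worst_case = {rule: max(r.values()) for rule, r in risk.items()}
minimax_rule = min(worst_case, key=worst_case.get)
```

The worst case for buying is 500·(1 + e^(−1/2)) ≈ 803 at θ = 24, so renting (constant risk 720) comes out as the minimax rule.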
Bayes decision rule
Bayes risk:
B(δ) = ∫θ∈Θ R(θ, δ)·p(θ) dθ
i.e. θ is integrated out with respect to its prior distribution.
Note! The integral becomes a sum when p(θ) is a pmf.
A Bayes rule is a procedure that minimises the Bayes risk:
δB = arg minδ ∫θ∈Θ R(θ, δ)·p(θ) dθ
Note! This concerns the decision rule, not a specific action.
Example cont.
Assume the three possible values of θ (6, 12 and 24) have the prior probabilities 0.2, 0.3 and 0.5 respectively, i.e. the prior pmf is
p(θ) = 0.2 for θ = 6;  0.3 for θ = 12;  0.5 for θ = 24;  0 otherwise
Then
B(δ1) = 500·(1 + e^(−12/6))·0.2 + 500·(1 + e^(−12/12))·0.3 + 500·(1 + e^(−12/24))·0.5 ≈ 720.4
B(δ2) = 720 (does not depend on θ)
Thus the Bayes risk is (narrowly) minimised by δ2, and therefore δ2 is the Bayes decision rule under this prior.
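The Bayes-risk comparison can be checked numerically; this sketch reuses the risk function R(θ, δ1) = 500·(1 + e^(−12/θ)) derived earlier in the TV example:

```python
from math import exp

# Prior pmf over the mean lifetime theta (months).
prior = {6: 0.2, 12: 0.3, 24: 0.5}

def risk_buy(theta):
    # R(theta, delta1) = 500 * (1 + exp(-12/theta)) from the example.
    return 500 * (1 + exp(-12 / theta))

# Bayes risk: the risk averaged over the prior (a sum for a pmf).
B_buy = sum(risk_buy(t) * p for t, p in prior.items())
B_rent = 720.0  # the rent risk is constant, so its Bayes risk is 720

bayes_rule = "buy" if B_buy < B_rent else "rent"
```

Here B_buy ≈ 720.4, so renting narrowly minimises the Bayes risk under this prior, while the margin is far larger in the minimax comparison.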
Bayesian decision theory
A decision problem is defined in terms of
1. A set of possible decisions (or actions) D = {d1, d2, …}, often referred to as the decision space.
2. A set of states of nature (or uncertain events). A particular state of nature is denoted θ, and the set of all possible states is denoted Θ.
3. A set of consequences C = {c1, c2, …}. A particular consequence in C is a function of the decision d and the state of nature θ: c(d, θ).
Hence C should contain as many consequences as there are combinations of decisions and states of nature.
The triple (D, Θ, C) describes the structure of the decision problem.
Utility
The decision maker is assumed to have an order of preference for the different consequences:
ci ≺ cj means that consequence cj is preferred to consequence ci
ci ~ cj means that ci and cj are equally preferred
ci ≾ cj means that ci is not preferred to cj
Example
Assume that when the temperature is above 25 °C and you have decided to wear long trousers and a long-sleeved shirt, you will as a consequence feel unusually hot:
c1 = c(d = “longs”, θ > 25 °C)
Assume that when the temperature is below 15 °C and you have decided to wear shorts and a t-shirt, you will as a consequence feel unusually cold:
c2 = c(d = “shorts”, θ < 15 °C)
Your preference order would be one of c1 ≺ c2, c2 ≺ c1 and c1 ~ c2.
The preference order, or relative desirability, of different consequences is measured by the utility of each consequence.
A utility function describes the utilities for all combinations of decision d and state of nature θ:
U(d, θ) = U(c) with c = c(d, θ)
The utility does not have to be a “positive reward”; when comparing non-desirable consequences it is common to speak of loss instead of utility (see the coming slides).
Example cont.
Assume that even if you feel unusually hot you can still do what you are supposed to do, e.g. go to work and earn one day’s salary (8 hours).
Assume that if you feel unusually cold you will have to change clothes, which hinders what you are supposed to do a bit, e.g. you lose an hour of paid salary.
U(“longs”, θ > 25 °C) = 8, U(“shorts”, θ < 15 °C) = 7
Now, when the state of nature is unknown, decisions are made under uncertainty. This is the motivation for statistical decision theory.
A decision maker can have an order of preference for two consequences:
c1 ≺ c2
…but since the consequence depends on the state of nature θ, which is unknown, it is not possible to make a decision solely on the preference order. The probabilities of the corresponding states of nature must also be taken into account.
Hence, measuring the relative desirability goes back to the underlying probability distribution of the state of nature.
The expected utility of a decision d is obtained by integrating the utility function against the probability distribution of θ, given by its probability density (or mass) function g(θ):
U(d, g) = Eg(U(d, θ)) = ∫θ U(d, θ)·g(θ) dθ
When data are not taken into account, g(θ) is the prior pdf/pmf p(θ):
U(d, p) = ∫θ U(d, θ)·p(θ) dθ
When data x are taken into account, g(θ) is the posterior pdf/pmf q(θ|x):
U(d, q, x) = ∫θ U(d, θ)·q(θ|x) dθ
Now, for any triple of consequences (c, c1, c2) such that
c1 ≺ c2 and c1 ≾ c ≾ c2
i.e. c2 is preferred to c1, c1 is not preferred to c, and c is not preferred to c2,
it can be shown that there exists a unique number α ∈ [0, 1] such that
c ~ α·c1 + (1 − α)·c2
i.e. c and a convex combination of c1 and c2 are equally preferable.
This in turn is because preferences over consequences can be expressed in terms of preferences over probability distributions.
When c ~ α·c1 + (1 − α)·c2 it can be shown that
U(c) = α·U(c1) + (1 − α)·U(c2)
For a particular state of nature θ, let c1 be the worst consequence and c2 the best consequence, and normalise the utility function – without loss of generality – so that
U(c1) = 0 and U(c2) = 1
For a particular decision d such that
c1 ≾ c(d, θ) ≾ c2
it is then possible to find α such that c(d, θ) is equivalent to a hypothetical gamble with consequences c1 and c2, where Pr(c1 | d, θ) = α and Pr(c2 | d, θ) = 1 − α.
Hence
U(d, θ) = α·U(c1) + (1 − α)·U(c2) = α·0 + (1 − α)·1 = 1 − α
This means that U(d, θ) can be seen as the probability of obtaining the best consequence:
Pr(Best consequence | d, θ) ∝ U(d, θ)
⇒ Pr(Best consequence | d) ∝ ∫θ U(d, θ)·g(θ) dθ = U(d, g)
Hence, the optimal decision is the one that maximises the expected utility under the probability distribution that governs the state of nature:
dg(optimal) = arg maxd∈D U(d, g)
⇒ The Bayes decision is
d(B) = arg maxd∈D U(d, p) when no data are used
d(B) = arg maxd∈D U(d, q, x) when data x are used
Example
Assume you are choosing between fixing the interest rate of your mortgage loan for one year and keeping the floating interest rate for that period.
Say that the floating rate at the moment is 4 % and the fixed rate is 5 %. The floating rate may however increase during the period, and we may approximately assume that with probability g1 = 0.10 the average floating rate will be 7 %, with probability g2 = 0.20 it will be 6 %, and with probability g3 = 0.70 it will stay at 4 %.
Let d1 = “Fix the interest rate” and d2 = “Keep the floating interest rate”, and let θ = the average floating rate for the coming period. Then
U(d1, θ) = 4 − 5 = −1 for θ = 4;  6 − 5 = 1 for θ = 6;  7 − 5 = 2 for θ = 7
U(d2, θ) = 5 − 4 = 1 for θ = 4;  5 − 6 = −1 for θ = 6;  5 − 7 = −2 for θ = 7
⇒
U(d1, g) = (−1)·0.7 + 1·0.2 + 2·0.1 = −0.3
U(d2, g) = 1·0.7 + (−1)·0.2 + (−2)·0.1 = 0.3
⇒ d(B) = d2
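The expected utilities above can be verified directly; this sketch uses nothing beyond the slide’s numbers:

```python
# Probabilities for the average floating rate theta (in %).
g = {4: 0.70, 6: 0.20, 7: 0.10}

# Utilities U(d, theta) from the mortgage example.
U = {
    "d1 (fix)": {4: -1, 6: 1, 7: 2},
    "d2 (float)": {4: 1, 6: -1, 7: -2},
}

# Expected utility: U(d, g) = sum over theta of U(d, theta) * g(theta).
expected_utility = {d: sum(u[t] * g[t] for t in g) for d, u in U.items()}

# Bayes decision: the decision that maximises the expected utility.
bayes_decision = max(expected_utility, key=expected_utility.get)
```

This reproduces U(d1, g) = −0.3 and U(d2, g) = 0.3, so keeping the floating rate is the Bayes decision.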
Loss function
When the utilities are all non-desirable it is common to describe the decision problem in terms of losses rather than utilities. The loss function in Bayesian decision theory is defined as
LS(d, θ) = maxa∈D U(a, θ) − U(d, θ)
Then the Bayes decision with the use of data can be written
d(B) = arg maxd∈D ∫θ U(d, θ)·q(θ|x) dθ
     = arg maxd∈D ∫θ [maxa∈D U(a, θ) − LS(d, θ)]·q(θ|x) dθ
     = arg mind∈D ∫θ LS(d, θ)·q(θ|x) dθ = arg mind∈D LS(d, q, x)
i.e. the decision that minimises the expected posterior loss.
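To illustrate the equivalence, here is a sketch reusing the mortgage example’s utilities (which are not part of this slide) and showing that maximising expected utility and minimising expected loss select the same decision:

```python
g = {4: 0.70, 6: 0.20, 7: 0.10}       # distribution of theta
U = {"fix": {4: -1, 6: 1, 7: 2},      # utilities from the
     "float": {4: 1, 6: -1, 7: -2}}   # mortgage example

# LS(d, theta) = max_a U(a, theta) - U(d, theta): the utility lost
# relative to the best possible decision for that theta.
LS = {d: {t: max(U[a][t] for a in U) - U[d][t] for t in g} for d in U}

expected_utility = {d: sum(U[d][t] * g[t] for t in g) for d in U}
expected_loss = {d: sum(LS[d][t] * g[t] for t in g) for d in U}

# Both criteria pick the same decision (here: keep the floating rate).
assert (max(expected_utility, key=expected_utility.get)
        == min(expected_loss, key=expected_loss.get) == "float")
```

The equality holds because LS differs from −U only by a term, maxa U(a, θ), that does not depend on d.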
Example
A person asking for medical care has symptoms that may be connected with two different diseases, A and B. The symptoms could also be temporary and disappear within a reasonable time.
For A there is a therapy that cures the disease if it is present and hence removes the symptoms. If the disease is not present, the treatment leaves the symptoms at the same intensity.
For B there is a therapy that generally “reduces” the intensity of the symptoms by 10 %, regardless of whether B is present or not. If B is present, the reduction is 40 %.
Assume that A is present with probability 0.3 and that B is present with probability 0.4. Assume further that A and B cannot be present at the same time, so that the probability of the symptoms being just temporary is 0.3.
What is the Bayes decision in this case: treatment for A, treatment for B, or no treatment?
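The question is left open above; one possible analysis can be sketched as follows. The utilities below are an assumption (expected percentage reduction of the symptoms, ignoring treatment costs and side effects, and assuming temporary symptoms vanish on their own whatever is decided), so a different utility assignment may well give a different Bayes decision:

```python
# Prior over the states of nature.
prior = {"A": 0.3, "B": 0.4, "temporary": 0.3}

# Assumed utilities: percentage reduction of the symptoms for each
# decision/state combination (hypothetical numbers, not from the slide).
U = {
    "treat A": {"A": 100, "B": 0, "temporary": 100},
    "treat B": {"A": 10, "B": 40, "temporary": 100},
    "no treatment": {"A": 0, "B": 0, "temporary": 100},
}

expected_utility = {d: sum(u[s] * prior[s] for s in prior)
                    for d, u in U.items()}
bayes_decision = max(expected_utility, key=expected_utility.get)
```

Under these assumptions the expected reductions are 60, 49 and 30 respectively, so treatment for A would be the Bayes decision.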