Statistics 512 Notes 22: Wrap up of Sufficiency, Most Powerful Tests

Statistics 512 Notes 25: Decision Theory
Decision Theoretic Approach to Statistics: Views statistics
as a mathematical theory for making decisions in the face
of uncertainty.
The Decision Theory Paradigm:
The decision maker chooses an action \(a\) from a set \(A\) of all
possible actions based on the observation of a random
variable, or data, \(X = (X_1, \ldots, X_n)\), which has a probability
distribution that depends on a parameter \(\theta\) called the state
of nature. The set of all possible values of \(\theta\) is denoted by
\(\Theta\). The decision is made by a statistical decision function,
\(d\), which maps the sample space (the set of all possible
data values) into the action space \(A\). Denoting the data by
\(X\), the action is random and is given by \(a = d(X)\).
By taking the action \(a = d(X)\), the decision maker incurs a
loss \(l(\theta, d(X))\), which depends on both \(\theta\) and \(d(X)\). The
comparison of different decision functions is based on the
risk function, which is the expected loss,
\[ R(\theta, d) = E_\theta[l(\theta, d(X))] . \]
Here, the expectation is taken with respect to the
probability distribution of \(X\), which depends on \(\theta\). Note
that the risk function depends on the true state of nature, \(\theta\),
and on the decision function, \(d\). Decision theory is
concerned with methods of determining “good” decision
functions, that is, decision functions that have small risk.
Examples:
(1) Sampling inspection: A manufacturer produces a lot
consisting of \(N\) items, \(n\) of which are sampled randomly
and determined to be either defective or nondefective. Let
\(p\) denote the proportion of the \(N\) items that are defective.
Let \(X_i = 0\) or \(1\) according to whether the \(i\)th item is
nondefective or defective and let \(X = (X_1, \ldots, X_n)\) denote
the sample. Suppose that the lot is sold for a price, $M,
with a guarantee that if the proportion of defective items
exceeds \(p_0\), the manufacturer will pay a penalty, $P, to the
buyer. For any lot, the manufacturer has two possible
actions: either sell the lot or junk it at a cost, $C. The
action space is therefore
\[ A = \{\text{sell}, \text{junk}\} . \]
The data are \(X = (X_1, \ldots, X_n)\) and the state of nature is
\(\theta = p\). The loss function depends on the action \(d(X)\) and
state of nature as shown in the following table:
| State of Nature | Sell | Junk |
|---|---|---|
| \(p \le p_0\) | -$M | $C |
| \(p > p_0\) | $P | $C |

Here a profit is expressed as a negative loss. Note that the
decision rule depends on \(X = (X_1, \ldots, X_n)\), which is
random; the risk function is the expected loss, and the
expectation is computed with respect to the probability
distribution of \(X = (X_1, \ldots, X_n)\). This distribution depends
on \(p\).
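As a rough numerical sketch of this risk function, the Python code below evaluates \(R(p, d)\) for one family of rules: sell the lot when the sample contains at most c defectives, and junk it otherwise. The threshold rule, the function names, and every numeric value (N, n, c, p0, M, P, C) are assumptions chosen for illustration; none of them come from the notes. Because the n items are sampled without replacement, the number of defectives in the sample is taken to be hypergeometric.

```python
from math import comb

def hypergeom_pmf(k, N, D, n):
    """P(exactly k defectives in a sample of n drawn without replacement
    from a lot of N items, D of which are defective)."""
    if k < 0 or k > n or k > D or n - k > N - D:
        return 0.0
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)

def risk(D, N, n, c, p0, M, P, C):
    """Expected loss of the hypothetical rule 'sell if at most c sampled
    items are defective' when the lot holds D defectives (p = D / N)."""
    p = D / N
    prob_sell = sum(hypergeom_pmf(k, N, D, n) for k in range(c + 1))
    loss_if_sold = -M if p <= p0 else P   # loss table: -$M or $P
    return loss_if_sold * prob_sell + C * (1.0 - prob_sell)

# Illustrative values only (not from the notes).
N, n, c, p0, M, P, C = 100, 10, 1, 0.10, 1000, 500, 200
for D in (0, 5, 10, 20, 40):
    print(f"p = {D / N:.2f}  risk = {risk(D, N, n, c, p0, M, P, C):8.2f}")
```

Changing the threshold c gives a different decision function and therefore a different risk curve in p, which is what the comparison of decision rules is based on.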
(2) Classification: On the basis of several physiological
measurements, \(X\), a decision must be made concerning
whether a patient has suffered a myocardial infarction (MI)
and should be admitted to intensive care. Here
\(A = \{\text{admit}, \text{do not admit}\}\) and \(\Theta = \{\text{MI}, \text{no MI}\}\). The probability
distribution of \(X\) depends on \(\theta\), perhaps in a complicated
way. Some elements of the loss function may be difficult
to specify; the economic costs of admission can be
calculated, but the cost of not admitting when in fact the
patient has suffered a myocardial infarction is more
subjective. To make this problem more realistic, the action
space could be expanded to include actions such as “send
home” and “hospitalize for further observation.”
(3) Estimation: Suppose that we want to estimate some
function \(v(\theta)\) on the basis of a sample \(X = (X_1, \ldots, X_n)\),
where the distribution of the \(X_i\) depends on \(\theta\). Here
\(d(X)\) is an estimator of \(v(\theta)\). The quadratic loss function
\[ l(\theta, d(X)) = [v(\theta) - d(X)]^2 \]
is often used. The risk function is then
\[ R(\theta, d) = E_\theta\{[v(\theta) - d(X)]^2\} , \]
which is the familiar mean squared error. Note that, here
again, the expectation is taken with respect to the
distribution of \(X = (X_1, \ldots, X_n)\), which depends on \(\theta\).
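To make the mean squared error concrete, here is a small Python sketch under assumptions that are not part of the notes: the \(X_i\) are taken to be independent Bernoulli(p) variables, \(v(\theta) = p\), and two hypothetical estimators are compared, the sample mean S/n and a shrunken version (S + 1)/(n + 2), where S is the number of ones. The risk is computed exactly by summing over the binomial distribution of S.

```python
from math import comb

def mse_risk(estimator, p, n):
    """Exact risk R(p, d) = E_p[(p - d(S))^2], where S ~ Binomial(n, p)
    is the number of ones in the sample (an assumed model)."""
    return sum(
        comb(n, s) * p**s * (1 - p)**(n - s) * (p - estimator(s, n))**2
        for s in range(n + 1)
    )

def sample_mean(s, n):
    return s / n

def shrunk_mean(s, n):
    # Hypothetical alternative estimator, used only for comparison.
    return (s + 1) / (n + 2)

n = 10
for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p = {p:.1f}  mean: {mse_risk(sample_mean, p, n):.5f}"
          f"  shrunk: {mse_risk(shrunk_mean, p, n):.5f}")
```

For the sample mean the sum reproduces the closed form p(1 - p)/n. With n = 10 the shrunken estimator has the smaller risk at p = 0.5 and the larger risk at p = 0, so which estimator looks better depends on the unknown p.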
Bayes Rules and Minimax Rules
Decision theory is concerned with choosing a “good”
decision function, that is, one that has a small risk
\[ R(\theta, d) = E_\theta[l(\theta, d(X))] . \]
We have to face the difficulty that \(R\) depends on \(\theta\), which
is not known. For example, there might be two decision
rules, \(d_1\) and \(d_2\), and two values of \(\theta\), \(\theta_1\) and \(\theta_2\), such that
\[ R(\theta_1, d_1) < R(\theta_1, d_2) \]
but
\[ R(\theta_2, d_1) > R(\theta_2, d_2) . \]
Thus \(d_1\) is better if the state of nature is \(\theta_1\), but \(d_2\) is better
if the state of nature is \(\theta_2\). The two most widely used
methods for confronting this difficulty are to use either a
minimax rule or a Bayes rule.
The minimax method proceeds as follows. First, for a
given decision function \(d\), consider the worst that the risk
could be:
\[ \max_{\theta \in \Theta} [R(\theta, d)] . \]
Then choose a decision function, \(d^*\), that minimizes this
maximum risk:
\[ \min_d \max_{\theta \in \Theta} [R(\theta, d)] . \]
Such a decision rule is called a minimax rule.
The weakness of the minimax rule is intuitively apparent.
It is a very conservative procedure that places all its
emphasis on guarding against the worst possible case. In
fact, this worst case might not be very likely to occur. To
make this idea more precise, we can assign a probability
distribution to the state of nature \(\theta\); this distribution is
called the prior distribution of \(\theta\) and denoted by \(\pi(\theta)\).
Given such a prior distribution, we can calculate the Bayes
risk of a decision function \(d\):
\[ B(d) = E_{\pi(\theta)}[R(\theta, d)] , \]
where the expectation is taken with respect to the
distribution \(\pi(\theta)\). The Bayes risk is the average of the risk
function with respect to the prior distribution of \(\theta\). A
function \(d^{**}\) that minimizes the Bayes risk is called a Bayes
rule.
Example: As part of the foundation of a building, a steel
section is to be driven down to a firm stratum below
ground. The engineer has two choices (actions):
\(a_1\): select a 40-ft section
\(a_2\): select a 50-ft section
There are two possible states of nature:
\(\theta_1\): depth of firm stratum is 40 ft
\(\theta_2\): depth of firm stratum is 50 ft
If the 40-ft section is incorrectly chosen, an additional
length of steel must be spliced on at a cost of $400. If the
50-ft section is incorrectly chosen, 10 ft of steel must be
scrapped at a cost of $100. The loss function is therefore
represented in the following table:
1
2
0
$400
a1
a2
$100
0
A depth sounding is taken by means of a sonic test.
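Suppose that the measured depth, \(X\), has three possible
values, \(x_1 = 40\), \(x_2 = 45\), and \(x_3 = 50\), and that the
probability distribution of \(X\) depends on \(\theta\) as follows:

| \(x\) | \(\theta_1\) | \(\theta_2\) |
|---|---|---|
| \(x_1\) | .6 | .1 |
| \(x_2\) | .3 | .2 |
| \(x_3\) | .1 | .7 |
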
We will consider the following four decision rules:
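
| | \(x_1\) | \(x_2\) | \(x_3\) |
|---|---|---|---|
| \(d_1\) | \(a_1\) | \(a_1\) | \(a_1\) |
| \(d_2\) | \(a_1\) | \(a_2\) | \(a_2\) |
| \(d_3\) | \(a_1\) | \(a_1\) | \(a_2\) |
| \(d_4\) | \(a_2\) | \(a_2\) | \(a_2\) |
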
We will first find the minimax rule. To do so, we need to
compute the risk of each of the decision functions in the
case where \(\theta = \theta_1\) and in the case where \(\theta = \theta_2\). To do such
computations for \(\theta = \theta_1\), each risk function is computed as
\[ R(\theta_1, d_i) = E_{\theta_1}[l(\theta_1, d_i(X))] = \sum_{j=1}^{3} l(\theta_1, d_i(x_j)) \, P(X = x_j \mid \theta = \theta_1) . \]
We thus have
\[ R(\theta_1, d_1) = 0 \times .6 + 0 \times .3 + 0 \times .1 = 0 \]
\[ R(\theta_1, d_2) = 0 \times .6 + 100 \times .3 + 100 \times .1 = 40 \]
\[ R(\theta_1, d_3) = 0 \times .6 + 0 \times .3 + 100 \times .1 = 10 \]
\[ R(\theta_1, d_4) = 100 \times .6 + 100 \times .3 + 100 \times .1 = 100 \]
Similarly, in the case where \(\theta = \theta_2\), we have
\[ R(\theta_2, d_1) = 400 \]
\[ R(\theta_2, d_2) = 40 \]
\[ R(\theta_2, d_3) = 120 \]
\[ R(\theta_2, d_4) = 0 \]
To find the minimax rule, we note that the maximum
values of the risk of \(d_1\), \(d_2\), \(d_3\), and \(d_4\) are 400, 40, 120, and
100, respectively. Thus, \(d_2\) is the minimax rule.
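The risk table and the minimax choice can be checked with a few lines of Python. This is only a sketch: the dictionary layout and the names `loss`, `prob`, `rules`, and `risk` are illustrative choices, while the numbers are the losses, probabilities, and decision rules given in the example.

```python
# Loss l(theta, a) in dollars, from the loss table of the example.
loss = {("theta1", "a1"): 0, ("theta1", "a2"): 100,
        ("theta2", "a1"): 400, ("theta2", "a2"): 0}

# P(X = x_j | theta) for x_1 = 40, x_2 = 45, x_3 = 50.
prob = {"theta1": [0.6, 0.3, 0.1], "theta2": [0.1, 0.2, 0.7]}

# Each rule lists the action taken at x_1, x_2, x_3.
rules = {"d1": ["a1", "a1", "a1"], "d2": ["a1", "a2", "a2"],
         "d3": ["a1", "a1", "a2"], "d4": ["a2", "a2", "a2"]}

def risk(theta, rule):
    """R(theta, d) = sum_j l(theta, d(x_j)) * P(X = x_j | theta)."""
    return sum(loss[(theta, a)] * p for a, p in zip(rule, prob[theta]))

risks = {name: {theta: risk(theta, rule) for theta in prob}
         for name, rule in rules.items()}
max_risk = {name: max(r.values()) for name, r in risks.items()}
minimax_rule = min(max_risk, key=max_risk.get)

for name in rules:
    print(name, risks[name], "max:", max_risk[name])
print("minimax rule:", minimax_rule)
```

Up to floating-point rounding, the printed maxima are 400, 40, 120, and 100, and the rule with the smallest maximum is d2.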
We now consider computation of a Bayes rule. Suppose
that on the basis of previous experience and from large-scale
maps, we take as the prior distribution \(\pi(\theta_1) = .8\) and
\(\pi(\theta_2) = .2\). Using this prior distribution and the risk
functions computed above, we find for each decision
function its Bayes risk
\[ B(d) = E_{\pi(\theta)}[R(\theta, d)] = R(\theta_1, d)\pi(\theta_1) + R(\theta_2, d)\pi(\theta_2) . \]
Thus, we have
\[ B(d_1) = 0 \times .8 + 400 \times .2 = 80 \]
\[ B(d_2) = 40 \times .8 + 40 \times .2 = 40 \]
\[ B(d_3) = 10 \times .8 + 120 \times .2 = 32 \]
\[ B(d_4) = 100 \times .8 + 0 \times .2 = 80 \]
Comparing these numbers, we see that \(d_3\) is the Bayes rule
(among these four rules). Note that this Bayes rule is less
conservative than the minimax rule in that it chooses action
\(a_1\) (40-ft length) based on observation \(x_2\) (45-ft sounding).
That is because the prior distribution for this Bayes rule
puts more weight on \(\theta_1\). If the prior distribution were
changed sufficiently, the Bayes rule would change.
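The dependence of the Bayes rule on the prior can be illustrated with a short Python sketch. The risk values are those computed in the example; the function names and the second prior, (.5, .5), are assumptions introduced only to show a case where the choice changes.

```python
# (R(theta1, d), R(theta2, d)) for the four rules, from the example.
risks = {"d1": (0, 400), "d2": (40, 40), "d3": (10, 120), "d4": (100, 0)}

def bayes_risks(prior):
    """B(d) = R(theta1, d) * pi(theta1) + R(theta2, d) * pi(theta2)."""
    return {name: r1 * prior[0] + r2 * prior[1]
            for name, (r1, r2) in risks.items()}

def bayes_rule(prior):
    """Rule with the smallest Bayes risk among the four candidates."""
    b = bayes_risks(prior)
    return min(b, key=b.get)

for prior in [(0.8, 0.2), (0.5, 0.5)]:   # second prior is hypothetical
    print(prior, bayes_risks(prior), "->", bayes_rule(prior))
```

With the prior (.8, .2) the smallest Bayes risk is 32, attained by d3. With the hypothetical prior (.5, .5) the Bayes risks are 200, 40, 65, and 50, so d2 becomes the best of the four rules.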
Posterior Analysis
We now develop a method for finding the Bayes rule. The
Bayes risk for a prior distribution \(\pi(\theta)\) is the expected loss
of a decision rule \(d(X)\) when \(X\) is generated from the
following probability model:
First, the state of nature is generated according to the prior
distribution \(\pi(\theta)\).
Then, the data \(X\) is generated according to the distribution
\(f(X; \theta)\), which we will denote by \(f(X \mid \theta)\).
Under this probability model (call it the Bayes model), the
marginal distribution of \(X\) is (for the continuous case)
\[ f_X(x) = \int f(x \mid \theta)\pi(\theta) \, d\theta . \]
Applying Bayes' rule, the conditional distribution of
\(\theta\) given \(X = x\) is
\[ h(\theta \mid x) = \frac{f_{X,\theta}(x, \theta)}{f_X(x)} = \frac{f(x \mid \theta)\pi(\theta)}{\int f(x \mid \theta)\pi(\theta) \, d\theta} . \]
The conditional distribution \(h(\theta \mid x)\) is called the posterior
distribution of \(\theta\) given \(X = x\). The words prior and posterior
derive from the facts that
\(\pi(\theta)\) is specified before (prior to) observing \(X\) and
\(h(\theta \mid x)\) is calculated after (posterior to) observing \(X = x\).
We will discuss the interpretation of prior
and posterior distributions in more detail later.
Suppose that we have observed \(X = x\). We define the
posterior risk of an action \(a = d(x)\) as the expected loss,
where the expectation is taken with respect to the posterior
distribution of \(\theta\). For continuous random variables, we
have
\[ E_{h(\theta \mid X = x)}[l(\theta, d(x))] = \int l(\theta, d(x)) \, h(\theta \mid X = x) \, d\theta . \]
Theorem: Suppose there is a function \(d_0(x)\) that minimizes
the posterior risk. Then \(d_0(x)\) is a Bayes rule.
Proof: We will prove this for the continuous case. The discrete
case is proved analogously. The Bayes risk of a decision
function \(d\) is
\[
\begin{aligned}
B(d) &= E_{\pi(\theta)}[R(\theta, d)] \\
&= E_{\pi(\theta)}[E_X(l(\theta, d(X)) \mid \theta)] \\
&= \int \left[ \int l(\theta, d(x)) f(x \mid \theta) \, dx \right] \pi(\theta) \, d\theta \\
&= \int\!\!\int l(\theta, d(x)) f_{X,\theta}(x, \theta) \, dx \, d\theta \\
&= \int \left[ \int l(\theta, d(x)) h(\theta \mid x) \, d\theta \right] f_X(x) \, dx .
\end{aligned}
\]
(We have used the relations
\(f(x \mid \theta)\pi(\theta) = f_{X,\theta}(x, \theta) = f_X(x) h(\theta \mid x)\).)
Now the inner integral is the posterior risk and since
\(f_X(x)\) is nonnegative, \(B(d)\) can be minimized by choosing
\(d(x) = d_0(x)\).
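The theorem can be checked numerically on the engineering example, where everything is discrete. The Python sketch below is an illustration, not part of the proof: it enumerates all 2^3 = 8 possible decision rules (not only the four considered earlier), finds the one with the smallest Bayes risk, and compares it with the rule obtained by minimizing the posterior risk separately at each observation. The variable and function names are illustrative, and the posterior risk is left unnormalized because dividing by \(f_X(x)\) does not change which action is smallest.

```python
from itertools import product

actions = ["a1", "a2"]
prior = [0.8, 0.2]                       # pi(theta_1), pi(theta_2)
likelihood = [[0.6, 0.3, 0.1],           # P(X = x_j | theta_1)
              [0.1, 0.2, 0.7]]           # P(X = x_j | theta_2)
loss = {"a1": [0, 400], "a2": [100, 0]}  # loss[a][i] = l(theta_i, a)

def bayes_risk(rule):
    """B(d) = sum_i sum_j l(theta_i, d(x_j)) f(x_j | theta_i) pi(theta_i)."""
    return sum(prior[i] * likelihood[i][j] * loss[rule[j]][i]
               for i in range(2) for j in range(3))

# Route 1: brute force over every rule (a choice of action at x_1, x_2, x_3).
all_rules = list(product(actions, repeat=3))
best_by_enumeration = min(all_rules, key=bayes_risk)

# Route 2: minimize the (unnormalized) posterior risk at each x_j.
def best_action(j):
    return min(actions,
               key=lambda a: sum(loss[a][i] * likelihood[i][j] * prior[i]
                                 for i in range(2)))

best_by_posterior = tuple(best_action(j) for j in range(3))

print(best_by_enumeration, bayes_risk(best_by_enumeration))
print(best_by_posterior)
```

Both routes give the rule (a1, a1, a2), which is d3 from the example, with Bayes risk 32. The second route only ever looks at one observation at a time.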

The practical importance of this theorem is that it allows us
to use just the observed data, \(x\), rather than considering all
possible values of \(X\), to find the action for the Bayes rule
\(d^{**}\) given the data \(X = x\), \(d^{**}(x)\). In summary, the
algorithm for finding \(d^{**}(x)\) is as follows:
Step 1: Calculate the posterior distribution \(h(\theta \mid X = x)\).
Step 2: For each action \(a\), calculate the posterior risk,
which is
\[ E[l(\theta, a) \mid X = x] = \int l(\theta, a) \, h(\theta \mid X = x) \, d\theta . \]
The action \(a^*\) that minimizes the posterior risk is the Bayes
rule action \(d^{**}(x) = a^*\).
Example: Consider again the engineering example.
Suppose that we observe \(X = x_2 = 45\). In the notation of
that example, the prior distribution is \(\pi(\theta_1) = .8\), \(\pi(\theta_2) = .2\).
We first calculate the posterior distribution:
\[ h(\theta_1 \mid x_2) = \frac{f(x_2 \mid \theta_1)\pi(\theta_1)}{\sum_{i=1}^{2} f(x_2 \mid \theta_i)\pi(\theta_i)} = \frac{.3 \times .8}{.3 \times .8 + .2 \times .2} \approx .86 \]
Hence,
\[ h(\theta_2 \mid x_2) \approx .14 \]
We next calculate the posterior risk (PR) for \(a_1\) and \(a_2\):
\[ PR(a_1) = l(\theta_1, a_1)h(\theta_1 \mid x_2) + l(\theta_2, a_1)h(\theta_2 \mid x_2) = 0 + 400 \times .14 = 56 \]
and
\[ PR(a_2) = l(\theta_1, a_2)h(\theta_1 \mid x_2) + l(\theta_2, a_2)h(\theta_2 \mid x_2) = 100 \times .86 + 0 = 86 \]
Comparing the two, we see that \(a_1\) has the smaller posterior
risk and is thus the Bayes rule action for \(x_2\).
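The two steps of the calculation can be reproduced in Python with exact fractions, as a sketch. The names are illustrative; the numbers are the prior, the probabilities of \(x_2\), and the losses from the engineering example.

```python
from fractions import Fraction as F

prior = {"theta1": F(8, 10), "theta2": F(2, 10)}
lik_x2 = {"theta1": F(3, 10), "theta2": F(2, 10)}   # P(X = x2 | theta)
loss = {("theta1", "a1"): 0, ("theta2", "a1"): 400,
        ("theta1", "a2"): 100, ("theta2", "a2"): 0}

# Step 1: posterior distribution h(theta | x2).
marginal = sum(lik_x2[t] * prior[t] for t in prior)            # f_X(x2)
posterior = {t: lik_x2[t] * prior[t] / marginal for t in prior}

# Step 2: posterior risk of each action, then pick the smallest.
post_risk = {a: sum(loss[(t, a)] * posterior[t] for t in prior)
             for a in ("a1", "a2")}
bayes_action = min(post_risk, key=post_risk.get)

print(posterior)      # theta1: 6/7, theta2: 1/7
print(post_risk)      # a1: 400/7, a2: 600/7
print(bayes_action)   # a1
```

The exact posterior risks are 400/7 (about 57.1) and 600/7 (about 85.7). The values 56 and 86 in the hand calculation come from rounding the posterior probabilities to two decimals first; the comparison between the two actions is the same either way.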