Statistics 512 Notes 26: Decision Theory (Continued)
Posterior Analysis
We now develop a method for finding the Bayes rule. The Bayes risk for a prior distribution $\pi(\theta)$ is the expected loss of a decision rule $d(X)$ when $X$ is generated from the following probability model:
First, the state of nature $\theta$ is generated according to the prior distribution $\pi(\theta)$.
Then, the data $X$ are generated according to the distribution $f(X; \theta)$, which we will denote by $f(X \mid \theta)$.
Under this probability model (call it the Bayes model), the marginal distribution of $X$ is (for the continuous case)
$$f_X(x) = \int f(x \mid \theta)\,\pi(\theta)\, d\theta$$
Applying Bayes rule, the conditional distribution of $\theta$ given $X$ is
$$h(\theta \mid x) = \frac{f_{X,\Theta}(x, \theta)}{f_X(x)} = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\, d\theta}$$
The conditional distribution $h(\theta \mid x)$ is called the posterior distribution of $\theta$ given $X = x$. The words prior and posterior derive from the facts that $\pi(\theta)$ is specified before (prior to) observing $X$ and $h(\theta \mid x)$ is calculated after (posterior to) observing $X = x$.
We will say more about the interpretation of prior and posterior distributions later.
Suppose that we have observed X  x . We define the
posterior risk of an action a  d ( x ) as the expected loss,
where the expectation is taken with respect to the posterior
distribution of  . For continuous random variables, we
have
$$E_{h(\theta \mid X=x)}\big[l(\theta, d(x))\big] = \int l(\theta, d(x))\, h(\theta \mid X = x)\, d\theta$$
Theorem: Suppose there is a function $d_0(x)$ that minimizes the posterior risk. Then $d_0(x)$ is a Bayes rule.
Proof: We will prove this for the continuous case. The discrete case is proved analogously. The Bayes risk of a decision function $d$ is
$$\begin{aligned}
B(d) &= E_{\pi(\theta)}\big[R(\theta, d)\big] \\
&= E_{\pi(\theta)}\Big[E_X\big(l(\theta, d(X)) \mid \theta\big)\Big] \\
&= \int \left[\int l(\theta, d(x))\, f(x \mid \theta)\, dx\right] \pi(\theta)\, d\theta \\
&= \iint l(\theta, d(x))\, f_{X,\Theta}(x, \theta)\, dx\, d\theta \\
&= \int \left[\int l(\theta, d(x))\, h(\theta \mid x)\, d\theta\right] f_X(x)\, dx
\end{aligned}$$
(We have used the relations $f(x \mid \theta)\,\pi(\theta) = f_{X,\Theta}(x, \theta) = f_X(x)\, h(\theta \mid x)$.)
Now the inner integral is the posterior risk, and since $f_X(x)$ is nonnegative, $B(d)$ is minimized by choosing $d(x) = d_0(x)$.

The practical importance of this theorem is that it allows us to use just the observed data, $x$, rather than considering all possible values of $X$, to find the action taken by the Bayes rule $d^{**}$ given the data $X = x$, namely $d^{**}(x)$. In summary, the algorithm for finding $d^{**}(x)$ is as follows:
Step 1: Calculate the posterior distribution $h(\theta \mid X = x)$.
Step 2: For each action $a$, calculate the posterior risk, which is
$$E[l(\theta, a) \mid X = x] = \int l(\theta, a)\, h(\theta \mid x)\, d\theta$$
The action $a^*$ that minimizes the posterior risk is the Bayes rule action $d^{**}(x) = a^*$.
Example: Consider again the engineering example. Suppose that we observe $X = x_2 = 45$. In the notation of that example, the prior distribution is $\pi(\theta_1) = .8$, $\pi(\theta_2) = .2$. We first calculate the posterior distribution:
$$h(\theta_1 \mid x_2) = \frac{f(x_2 \mid \theta_1)\,\pi(\theta_1)}{\sum_{i=1}^{2} f(x_2 \mid \theta_i)\,\pi(\theta_i)} = \frac{.3 \times .8}{.3 \times .8 + .2 \times .2} \approx .86$$
Hence,
$$h(\theta_2 \mid x_2) \approx .14$$
We next calculate the posterior risk (PR) for $a_1$ and $a_2$:
$$PR(a_1) = l(\theta_1, a_1)\, h(\theta_1 \mid x_2) + l(\theta_2, a_1)\, h(\theta_2 \mid x_2) = 0 + 400 \times .14 = 56$$
and
$$PR(a_2) = l(\theta_1, a_2)\, h(\theta_1 \mid x_2) + l(\theta_2, a_2)\, h(\theta_2 \mid x_2) = 100 \times .86 + 0 = 86$$
Comparing the two, we see that $a_1$ has the smaller posterior risk and is thus the Bayes rule action at $x_2$.
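For readers who want to check these numbers, here is a minimal Python sketch of the two-step algorithm applied to this example; the dictionary layout and variable names are my own, while the values (priors .8/.2, likelihoods .3/.2 at $x_2$, and the 0/400/100/0 losses) are the ones used above. Without rounding the posterior to .86/.14, the posterior risks come out as roughly 57.1 and 85.7, leading to the same choice of $a_1$.

```python
# Bayes rule via posterior risk minimization for the engineering example above.
prior = {"theta1": 0.8, "theta2": 0.2}            # pi(theta)
lik_x2 = {"theta1": 0.3, "theta2": 0.2}           # f(x2 | theta)
loss = {("theta1", "a1"): 0,   ("theta2", "a1"): 400,
        ("theta1", "a2"): 100, ("theta2", "a2"): 0}

# Step 1: posterior h(theta | x2).
marginal = sum(lik_x2[t] * prior[t] for t in prior)
posterior = {t: lik_x2[t] * prior[t] / marginal for t in prior}

# Step 2: posterior risk of each action; the minimizer is the Bayes action.
post_risk = {a: sum(loss[(t, a)] * posterior[t] for t in prior) for a in ("a1", "a2")}
bayes_action = min(post_risk, key=post_risk.get)

print(posterior)     # {'theta1': 0.857..., 'theta2': 0.142...}
print(post_risk)     # {'a1': 57.1..., 'a2': 85.7...}
print(bayes_action)  # a1
```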
Decision Theory for Point Estimation
Estimation theory can be cast in a decision theoretic
framework. The action space is the same as the parameter
space and the decision functions are point estimates
$d(X) = \hat\theta$. In this framework, a loss function may be
specified, and one of the strengths of the theory is that
general loss functions are allowed and thus the purpose of
estimation is made explicit. But the case of squared error
loss is especially tractable. For squared error loss, the risk is
$$R(\theta, d) = E_\theta\Big[\big(\theta - d(X_1, \ldots, X_n)\big)^2\Big] = E_\theta\big[(\theta - \hat\theta)^2\big]$$
Suppose that a Bayesian approach is taken and the prior distribution on $\theta$ is $\pi(\theta)$. Then, from the above theorem, the Bayes rule for squared error loss can be found by minimizing the posterior risk, which is $E[(\theta - \hat\theta)^2 \mid X = x]$. We have
$$E[(\theta - \hat\theta)^2 \mid X = x] = \mathrm{Var}(\theta - \hat\theta \mid X = x) + \big(E[\theta - \hat\theta \mid X = x]\big)^2 = \mathrm{Var}(\theta \mid X = x) + \big(E[\theta \mid X = x] - \hat\theta\big)^2$$
The first term of this last expression does not depend on $\hat\theta$, and the second term is minimized by $\hat\theta = E[\theta \mid X = x]$. Thus, the Bayes rule for squared error loss is the mean of the posterior distribution of $\theta$,
$$\hat\theta = \int \theta\, h(\theta \mid x)\, d\theta.$$
Example: A (possibly) biased coin is thrown once, and we
want to estimate the probability of the coin landing heads
on future tosses based on this single toss. Suppose that we
have no idea how biased the coin is; to reflect this state of
knowledge, we use a uniform prior distribution on $\theta$:
$$g(\theta) = 1, \qquad 0 \le \theta \le 1$$
Let $X = 1$ if a head appears, and let $X = 0$ if a tail appears. The distribution of $X$ given $\theta$ is
$$f(x \mid \theta) = \begin{cases} \theta, & x = 1 \\ 1 - \theta, & x = 0 \end{cases}$$
The posterior distribution is
$$h(\theta \mid x) = \frac{f(x \mid \theta) \cdot 1}{\int_0^1 f(x \mid \theta)\, d\theta}$$
In particular,
$$h(\theta \mid X = 1) = \frac{\theta}{\int_0^1 \theta\, d\theta} = 2\theta$$
$$h(\theta \mid X = 0) = \frac{1 - \theta}{\int_0^1 (1 - \theta)\, d\theta} = 2(1 - \theta)$$
Suppose that $X = 1$. The Bayes estimate of $\theta$ is the mean of the posterior distribution $h(\theta \mid X = 1)$, which is
$$\hat\theta = \int_0^1 \theta\,(2\theta)\, d\theta = \frac{2}{3}$$
The Bayes estimate in the case where $X = 0$ is $1/3$.
Note that these estimates differ from the classical
maximum likelihood estimates, which are 0 and 1.
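As a sanity check on the 2/3 and 1/3 values, the following sketch approximates the posterior mean on a grid; it assumes numpy is available, and the grid size is an arbitrary choice of mine.

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 10001)
prior = np.ones_like(theta)                     # uniform prior g(theta) = 1 on [0, 1]

for x in (1, 0):
    lik = theta if x == 1 else 1.0 - theta      # f(x | theta) for the single toss
    post = lik * prior
    post /= np.trapz(post, theta)               # normalize to get h(theta | x)
    bayes_est = np.trapz(theta * post, theta)   # posterior mean = Bayes estimate
    print(x, round(bayes_est, 4))               # 1 -> 0.6667, 0 -> 0.3333
```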
Comparison of the risk functions of the Bayes estimate and
the MLE:
The risk function of the above Bayes estimate is
$$\begin{aligned}
E_\theta\Big[\big(\hat\theta_{Bayes} - \theta\big)^2\Big] &= \Big(\tfrac{1}{3} - \theta\Big)^2 P_\theta(X = 0) + \Big(\tfrac{2}{3} - \theta\Big)^2 P_\theta(X = 1) \\
&= \Big(\tfrac{1}{3} - \theta\Big)^2 (1 - \theta) + \Big(\tfrac{2}{3} - \theta\Big)^2 \theta \\
&= \tfrac{1}{9} - \tfrac{1}{3}\theta + \tfrac{1}{3}\theta^2
\end{aligned}$$
The risk function of the MLE $\hat\theta_{MLE} = X$ is
$$\begin{aligned}
E_\theta\Big[\big(\hat\theta_{MLE} - \theta\big)^2\Big] &= (0 - \theta)^2 P_\theta(X = 0) + (1 - \theta)^2 P_\theta(X = 1) \\
&= \theta^2 (1 - \theta) + (1 - \theta)^2 \theta \\
&= \theta(1 - \theta)
\end{aligned}$$
The following graph shows the risk function of the Bayes
estimate and the MLE – the Bayes estimate is the solid line
and the MLE is the dashed line. The Bayes estimate has
smaller risk than that of the MLE over most of the range of
[0,1] but neither estimator dominates the other.
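The figure itself is not reproduced in these notes; a short sketch along the following lines (assuming matplotlib is available) regenerates it from the two risk functions derived above.

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0.0, 1.0, 200)
risk_bayes = theta**2 / 3 - theta / 3 + 1 / 9   # risk of the Bayes estimate
risk_mle = theta * (1 - theta)                  # risk of the MLE

plt.plot(theta, risk_bayes, "-", label="Bayes estimate")
plt.plot(theta, risk_mle, "--", label="MLE")
plt.xlabel(r"$\theta$")
plt.ylabel("risk")
plt.legend()
plt.show()
```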
Admissibility
Minimax estimators and Bayes estimators are “good
estimators” in the sense that their risk functions have
certain good properties; minimax estimators minimize the
worst case risk and Bayes estimators minimize a weighted
average of the risk. It is also useful to characterize bad
estimators.
Definition: An estimator $\hat\theta$ is inadmissible if there exists another estimator $\hat\theta'$ that dominates $\hat\theta$, meaning that
$$R(\theta, \hat\theta') \le R(\theta, \hat\theta) \text{ for all } \theta$$
and
$$R(\theta, \hat\theta') < R(\theta, \hat\theta) \text{ for at least one } \theta.$$
If there is no estimator $\hat\theta'$ that dominates $\hat\theta$, then $\hat\theta$ is admissible.
Example: Let $X$ be a sample of size one from a $N(\theta, 1)$ distribution and consider estimating $\theta$ with squared error loss.
Let $\hat\theta(X) = 2X$. Then
$$R(\theta, \hat\theta) = E_\theta\big[(2X - \theta)^2\big] = \mathrm{Var}_\theta(2X) + \big(E_\theta[2X] - \theta\big)^2 = 4 + \theta^2$$
The risk function of the MLE $\hat\theta_{MLE}(X) = X$ is
$$R(\theta, \hat\theta_{MLE}) = E_\theta\big[(X - \theta)^2\big] = \mathrm{Var}_\theta(X) + \big(E_\theta[X] - \theta\big)^2 = 1$$
Thus, the MLE dominates $\hat\theta(X) = 2X$, and $\hat\theta(X) = 2X$ is inadmissible.
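A quick Monte Carlo check of these two risk functions ($4 + \theta^2$ for $2X$ versus $1$ for $X$); the simulation setup, seed, and sample size here are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

for theta in (-2.0, 0.0, 1.0, 3.0):
    x = rng.normal(theta, 1.0, size=200_000)   # 200,000 replications of a single N(theta, 1) draw
    risk_2x = np.mean((2 * x - theta) ** 2)    # estimates R(theta, 2X) = 4 + theta^2
    risk_x = np.mean((x - theta) ** 2)         # estimates R(theta, X) = 1
    print(theta, round(risk_2x, 2), round(risk_x, 2))
```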
Consider another estimator, $\hat\theta(X) = 3$. We will show that $\hat\theta$ is admissible. Suppose not. Then there exists a different estimator $\hat\theta'$ with smaller risk. In particular, $R(3, \hat\theta') \le R(3, \hat\theta) = 0$. Hence,
$$0 = R(3, \hat\theta') = \int \big(\hat\theta'(x) - 3\big)^2 \frac{1}{\sqrt{2\pi}} \exp\!\big(-(x - 3)^2 / 2\big)\, dx.$$
Thus, $\hat\theta'(X) = 3$, and there is no estimator that dominates $\hat\theta$. Even though $\hat\theta$ is admissible, it is clearly a bad estimator.
In general, it is very hard to know whether a particular
estimate is admissible, since one would have to check that
it was not strictly dominated by any other estimate in order
to show that it was admissible. The following theorem
states that any Bayes estimate is admissible.
Theorem (Complete Class Theorem): Suppose that one of
the following two assumptions holds:
(1) $\Theta$ is discrete and $d^*$ is a Bayes rule with respect to a prior probability mass function $\pi(\theta)$ such that $\pi(\theta) > 0$ for all $\theta \in \Theta$.
(2) $\Theta$ is an interval and $d^*$ is a Bayes rule with respect to a prior density function $\pi(\theta)$ such that $\pi(\theta) > 0$ for all $\theta \in \Theta$, and $R(\theta, d)$ is a continuous function of $\theta$ for all $d$.
Then $d^*$ is admissible.
Proof: We will prove the theorem under assumption (2). The proof is by contradiction. Suppose that $d^*$ is inadmissible. There is then another estimate, $d$, such that $R(\theta, d) \le R(\theta, d^*)$ for all $\theta$, with strict inequality for some $\theta$, say $\theta_0$. Since $R(\theta, d^*) - R(\theta, d)$ is a continuous function of $\theta$, there is an $\varepsilon > 0$ and an interval of width $2h$ around $\theta_0$ such that
$$R(\theta, d^*) - R(\theta, d) \ge \varepsilon \quad \text{for } \theta_0 - h \le \theta \le \theta_0 + h.$$
Then,
$$B(d^*) - B(d) = \int \big[R(\theta, d^*) - R(\theta, d)\big]\,\pi(\theta)\, d\theta \ge \int_{\theta_0 - h}^{\theta_0 + h} \big[R(\theta, d^*) - R(\theta, d)\big]\,\pi(\theta)\, d\theta \ge \varepsilon \int_{\theta_0 - h}^{\theta_0 + h} \pi(\theta)\, d\theta > 0.$$
But this contradicts the fact that $d^*$ is a Bayes rule, because a Bayes rule has the property that
$$B(d^*) - B(d) = \int \big[R(\theta, d^*) - R(\theta, d)\big]\,\pi(\theta)\, d\theta \le 0.$$
The proof is complete.

The theorem can be regarded as both a positive and
negative result. It is positive in that it identifies a certain
class of estimates as being admissible, in particular, any
Bayes estimate. It is negative in that there are apparently
so many admissible estimates – one for every prior
distribution that satisfies the hypotheses of the theorem –
and some of these might make little sense (like $\hat\theta(X) = 3$
for the normal distribution above).