Bayes

Chapter 1 Basic Concepts
Thomas Bayes (1702-1761): two articles from his pen were published posthumously in 1764 by his friend Richard Price.
Laplace (1774): stated the theorem on inverse probability in general form.
Jeffreys (1939): rediscovered Laplace's work.
Example 1:
$y_i,\ i = 1, 2, \ldots, n$: the lifetimes of batteries.
Assume $y_i \sim N(\mu, \sigma^2)$. Then,
$$p(y \mid \mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right], \qquad y = (y_1, \ldots, y_n)^t.$$
To obtain information about the values of $\mu$ and $\sigma^2$, two methods are available:
(a) Sampling theory (frequentist): $\mu$ and $\sigma^2$ are the hypothetical true values. We can use
- point estimation: finding some statistics $\hat\mu(y)$ and $\hat\sigma^2(y)$ to estimate $\mu$ and $\sigma^2$, for example,
$$\hat\mu(y) = \bar y = \frac{\sum_{i=1}^n y_i}{n}, \qquad \hat\sigma^2(y) = \frac{\sum_{i=1}^n (y_i - \bar y)^2}{n-1}.$$
- interval estimation: finding interval estimates $\left(\hat\mu_1(y),\ \hat\mu_2(y)\right)$ and $\left(\hat\sigma_1^2(y),\ \hat\sigma_2^2(y)\right)$ for $\mu$ and $\sigma^2$; for example, the interval estimate for $\mu$,
$$\left(\bar y - z_{\frac\alpha2}\,\frac{s}{\sqrt n},\ \ \bar y + z_{\frac\alpha2}\,\frac{s}{\sqrt n}\right), \qquad P\left(|Z| \le z_{\frac\alpha2}\right) = 1 - \alpha,\ \ Z \sim N(0, 1).$$
(b) Bayesian approach:
Introduce a prior density $\pi(\mu, \sigma^2)$ for $\mu$ and $\sigma^2$. Then, after some manipulation, the posterior density (the conditional density given $y$), $f(\mu, \sigma^2 \mid y)$, can be obtained. Based on the posterior density, inferences about $\mu$ and $\sigma^2$ can be made.
Example 2:
$X \sim b(10, p)$: the number of wins for a gambler in 10 bets, where $p$ is the probability of winning. Then,
$$f(x \mid p) = \binom{10}{x} p^x (1-p)^{10-x}, \qquad x = 0, 1, 2, \ldots, 10.$$
(a) Sampling theory (frequentist):
To estimate the parameter $p$, we can employ the maximum likelihood principle. That is, we find the estimate $\hat p$ that maximizes the likelihood function
$$l(p \mid x) = f(x \mid p) = \binom{10}{x} p^x (1-p)^{10-x}.$$
For example, when $x = 10$,
$$l(p \mid x) = l(p \mid 10) = \binom{10}{10} p^{10} (1-p)^{0} = p^{10}.$$
Thus, $\hat p = 1$. This is a sensible estimate: since the gambler won all the time, the sensible estimate of the probability of winning should be 1. On the other hand, when $x = 0$,
$$l(p \mid x) = l(p \mid 0) = \binom{10}{0} p^{0} (1-p)^{10} = (1-p)^{10}.$$
Thus, $\hat p = 0$. Since the gambler lost all the time, the sensible estimate of the probability of winning should be 0. In general, when $x = n$,
$$\hat p = \frac{n}{10}, \qquad n = 0, 1, \ldots, 10,$$
maximizes the likelihood function.
(b) Bayesian approach:
$\pi(p)$: prior density for $p$, i.e., prior beliefs in terms of probabilities of various possible values of $p$ being true.
Let
$$\pi(p) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, p^{a-1} (1-p)^{b-1} = \text{Beta}(a, b).$$
Thus, if we know the gambler is a professional gambler, then we can use the following beta density function,
$$\pi(p) = 2p = \text{Beta}(2, 1),$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(2,1) prior density over $p \in [0, 1]$.]
Since a professional gambler is likely to win, higher probability is assigned to large values of $p$.
If we know the gambler is a gambler with bad luck, then we can use the following beta density function,
$$\pi(p) = 2(1-p) = \text{Beta}(1, 2),$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(1,2) prior density over $p \in [0, 1]$.]
Since a gambler with bad luck is likely to lose, higher probability is assigned to small values of $p$.
If we feel the winning probability is more likely to be around 0.5, then we can use the following beta density function,
$$\pi(p) = 6p(1-p) = \text{Beta}(2, 2),$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(2,2) prior density over $p \in [0, 1]$.]
If we don't have any information about the gambler, then we can use the following beta density function,
$$\pi(p) = 1 = \text{Beta}(1, 1),$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(1,1) (uniform) prior density over $p \in [0, 1]$.]
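As a quick illustration (a sketch, not part of the original notes), the four priors above can be drawn with `scipy.stats.beta`; the parameter pairs are exactly the ones used in this example.

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

# The four priors used in this example.
p = np.linspace(0, 1, 200)
priors = {"Beta(2,1): professional": (2, 1),
          "Beta(1,2): bad luck": (1, 2),
          "Beta(2,2): around 0.5": (2, 2),
          "Beta(1,1): no information": (1, 1)}

for label, (a, b) in priors.items():
    plt.plot(p, beta.pdf(p, a, b), label=label)  # prior density pi(p)
plt.xlabel("p"); plt.ylabel("prior density"); plt.legend()
plt.show()
```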
The posterior density of $p$ given $x$ (the conditional density of $p$ given $x$) is
$$f(p \mid x) = \frac{f(x, p)}{f(x)} = \frac{\text{joint density of } x \text{ and } p}{\text{marginal density of } x} = \frac{f(x \mid p)\,\pi(p)}{f(x)} \propto f(x \mid p)\,\pi(p) = l(p \mid x)\,\pi(p).$$
Thus, the posterior density of $p$ given $x$ is
$$f(p \mid x) \propto \pi(p)\,l(p \mid x) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,p^{a-1}(1-p)^{b-1}\binom{10}{x} p^{x}(1-p)^{10-x} = c(a, b, x)\,p^{x+a-1}(1-p)^{b+10-x-1}.$$
In fact,
$$f(p \mid x) = \frac{\Gamma(a+b+10)}{\Gamma(x+a)\,\Gamma(b+10-x)}\,p^{x+a-1}(1-p)^{b+10-x-1} = \text{Beta}(x+a,\ b+10-x).$$
Then, we can use some statistic based on the posterior density, for example, the posterior mean
$$E(p \mid x) = \int_0^1 p\,f(p \mid x)\,dp = \frac{x+a}{a+b+10}.$$
When $x = n$,
$$\hat p = E(p \mid n) = \frac{a+n}{a+b+10},$$
which is different from the maximum likelihood estimate $\frac{n}{10}$.
Note:
$$f(p \mid x) \propto \pi(p)\,l(p \mid x),$$
i.e., (the original information about $p$) $\times$ (the information from the data) $=$ (the new information about $p$ given the data).
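A minimal numerical sketch of this conjugate update (the observed value x = 9 below is an assumption chosen for illustration):

```python
from scipy.stats import beta

# Beta(a, b) prior combined with x wins in 10 bets gives Beta(x+a, b+10-x).
a, b, n_bets = 2, 1, 10   # professional-gambler prior from the example
x = 9                     # hypothetical observed number of wins

posterior = beta(x + a, b + n_bets - x)
mle = x / n_bets                                  # n/10
post_mean = (x + a) / (a + b + n_bets)            # (x+a)/(a+b+10)

print(f"MLE: {mle:.3f}, posterior mean: {post_mean:.3f}")
print(f"check against scipy: {posterior.mean():.3f}")
```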
Properties of Bayesian Analysis:
1. Precise assumptions lead to precise consequent inferences.
2. Bayesian analysis automatically makes use of all the information from the data.
3. Unacceptable inferences must come from inappropriate assumptions, not from inadequacies of the inferential system.
4. Awkward problems encountered in sampling theory do not arise.
5. Bayesian inference provides a satisfactory way of explicitly introducing and keeping track of assumptions about prior knowledge or ignorance.
1.1 Introduction
Goal: statistical decision theory is concerned with making decisions in the presence of statistical knowledge which sheds light on some of the uncertainties involved in the decision problem.
3 Types of Information:
1. Sample information: the information from observations.
2. Decision information: the information about the possible consequences of the decisions, for example, the loss due to a wrong decision.
3. Prior information: the information about the parameter.
1.2 Basic Elements
$\theta$: parameter.
$\Theta$: parameter space consisting of all possible values of $\theta$.
$a$: decisions or actions (or some statistic used to estimate $\theta$).
$\mathcal{A}$: the set of all possible actions.
$L(\theta, a): \Theta \times \mathcal{A} \to R$: loss function.
$L(\theta_1, a_1)$: the loss when the parameter value is $\theta_1$ and the action $a_1$ is taken.
$X = (X_1, X_2, \ldots, X_n)$: $X_1, \ldots, X_n$ are independent observations from a common distribution.
$\mathcal{X}$: sample space (all the possible values of $X$; usually $\mathcal{X}$ will be a subset of $R^n$).
$$P_\theta(X \in A) = \int_A dF^X(x \mid \theta) = \begin{cases}\displaystyle\int_A f(x_1, \ldots, x_n \mid \theta)\,dx_1 \cdots dx_n & \text{(continuous)}\\[2mm] \displaystyle\sum_{x \in A} f(x_1, \ldots, x_n \mid \theta) & \text{(discrete)}\end{cases}$$
where $F^X(x \mid \theta)$ is the cumulative distribution of $X$.
$$E_\theta[h(X)] = \int_{\mathcal{X}} h(x)\,dF^X(x \mid \theta) = \begin{cases}\displaystyle\int_{\mathcal{X}} h(x)\,f(x \mid \theta)\,dx & \text{(continuous)}\\[2mm] \displaystyle\sum_{x} h(x)\,f(x \mid \theta) & \text{(discrete)}\end{cases}$$
Example 2 (continued):
Let $A = \{1, 3, 5, 7, 9\}$. Then,
$$P_p(X \in A) = \sum_{x \in \{1,3,5,7,9\}}\binom{10}{x}p^x(1-p)^{10-x} = \binom{10}{1}p(1-p)^9 + \binom{10}{3}p^3(1-p)^7 + \binom{10}{5}p^5(1-p)^5 + \binom{10}{7}p^7(1-p)^3 + \binom{10}{9}p^9(1-p).$$
Also, let $a_1 = \frac{X}{10}$ (the estimate of $p$) and
$$h(X) = L\left(p, \frac{X}{10}\right) = \frac{X}{10} - p \quad \text{(loss function)}.$$
Then,
$$E_p[h(X)] = E_p\left[L\left(p, \frac{X}{10}\right)\right] = E_p\left(\frac{X}{10} - p\right) = \frac{10p}{10} - p = 0.$$
Example 3:
Let $X \sim \text{Beta}(\theta, 1)$ and $h(x) = x^2$. Then,
$$E_\theta[h(X)] = \int h(x)\,f(x \mid \theta)\,dx = \int_0^1 x^2\,\theta x^{\theta-1}\,dx = \left.\frac{\theta\,x^{\theta+2}}{\theta+2}\right|_0^1 = \frac{\theta}{\theta+2}.$$
Example 4:
$a_1$: sell the stock.
$a_2$: keep the stock.
$\theta_1$: stock price will go down.
$\theta_2$: stock price will go up.
Let
$$L(\theta_1, a_1) = -500,\quad L(\theta_1, a_2) = 300,\quad L(\theta_2, a_1) = 1000,\quad L(\theta_2, a_2) = -300.$$
The above loss function can be summarized by

                a_1       a_2
   theta_1     -500       300
   theta_2     1000      -300

Note that there is no sample information from an associated statistical experiment in this example. We call such a problem a no-data problem.
1.3 Expected Loss, Decision Rules, and Risk
Motivation:
In the previous section, we introduced the loss of making a decision (taking an action). In this section, we consider the "expected" loss of making a decision. Two types of expected loss are considered:
- Bayesian expected loss
- Frequentist risk
(a) Bayesian Expected Loss:
Definition:
The Bayesian expected loss of an action $a$ is
$$\rho(\pi, a) = E^\pi L(\theta, a) = \int_\Theta L(\theta, a)\,dF^\pi(\theta) = \int_\Theta L(\theta, a)\,\pi(\theta)\,d\theta,$$
where $\pi(\theta)$ and $F^\pi(\theta)$ are the prior density and cumulative distribution of $\theta$, respectively.
Example 4 (continued):
Let $\pi(\theta_1) = 0.99,\ \pi(\theta_2) = 0.01$. Then,
$$\rho(\pi, a_1) = E^\pi L(\theta, a_1) = \pi(\theta_1)L(\theta_1, a_1) + \pi(\theta_2)L(\theta_2, a_1) = 0.99 \times (-500) + 0.01 \times 1000 = -485$$
and
$$\rho(\pi, a_2) = E^\pi L(\theta, a_2) = \pi(\theta_1)L(\theta_1, a_2) + \pi(\theta_2)L(\theta_2, a_2) = 0.99 \times 300 + 0.01 \times (-300) = 294.$$
(b) Frequentist Risk:
Definition:
A (nonrandomized) decision rule $\delta(X)$ is a function from $\mathcal{X}$ into $\mathcal{A}$. If $X = x_0$ is observed, then $\delta(x_0)$ is the action that will be taken. Two decision rules, $\delta_1$ and $\delta_2$, are considered equivalent if
$$P_\theta(\delta_1(X) = \delta_2(X)) = 1, \quad \text{for every } \theta.$$
Definition:
The risk function of a decision rule $\delta(X)$ is defined by
$$R(\theta, \delta) = E_\theta L(\theta, \delta(X)) = \int_{\mathcal{X}} L(\theta, \delta(x))\,dF^X(x \mid \theta) = \int_{\mathcal{X}} L(\theta, \delta(x))\,f(x \mid \theta)\,dx.$$
Definition:
If
$$R(\theta, \delta_1) \le R(\theta, \delta_2), \quad \text{for all } \theta \in \Theta,$$
with strict inequality for some $\theta$, then the decision rule $\delta_1$ is R-better than the decision rule $\delta_2$. A decision rule is admissible if there exists no R-better decision rule. On the other hand, a decision rule is inadmissible if there does exist an R-better decision rule.
Note:
A rule $\delta_1$ is R-equivalent to $\delta_2$ if $R(\theta, \delta_1) = R(\theta, \delta_2)$ for all $\theta \in \Theta$.
Example 4 (continued):
$$R(\theta_1, a_1) = L(\theta_1, a_1) = -500 < 300 = L(\theta_1, a_2) = R(\theta_1, a_2)$$
and
$$R(\theta_2, a_1) = L(\theta_2, a_1) = 1000 > -300 = L(\theta_2, a_2) = R(\theta_2, a_2).$$
Therefore, both $a_1$ and $a_2$ are admissible.
Example 5:
Let
$$X \sim N(\theta, 1),\quad L(\theta, a) = (\theta - a)^2,\quad \delta_1(X) = X,\quad \delta_2(X) = \frac{X}{2}.$$
Note that $E_\theta(X) = \theta$ and $\text{Var}_\theta(X) = 1$. Then,
$$R(\theta, \delta_1) = E_\theta L(\theta, \delta_1(X)) = E_\theta(\theta - X)^2 = \text{Var}_\theta(X) = 1$$
and
$$R(\theta, \delta_2) = E_\theta L(\theta, \delta_2(X)) = E_\theta\left(\theta - \frac{X}{2}\right)^2 = E_\theta\left(\frac{\theta - X}{2} + \frac{\theta}{2}\right)^2 = \frac{E_\theta(X - \theta)^2}{4} + \frac{\theta^2}{4} = \frac14 + \frac{\theta^2}{4},$$
where the cross term vanishes since $E_\theta(X - \theta) = 0$.
Definition:
The Bayes risk of a decision rule $\delta$ with respect to a prior distribution $\pi$ on $\Theta$ is defined as
$$r(\pi, \delta) = E^\pi R(\theta, \delta) = \int_\Theta R(\theta, \delta)\,\pi(\theta)\,d\theta.$$
Example 5 (continued):
Let
$$\theta \sim N(0, \tau^2), \qquad \pi(\theta) = \frac{1}{\sqrt{2\pi}\,\tau}\,e^{-\frac{\theta^2}{2\tau^2}}.$$
Then,
$$r(\pi, \delta_1) = E^\pi R(\theta, \delta_1) = E^\pi(1) = 1$$
and
$$r(\pi, \delta_2) = E^\pi R(\theta, \delta_2) = E^\pi\left(\frac14 + \frac{\theta^2}{4}\right) = \frac14 + \frac{E^\pi\theta^2}{4} = \frac14 + \frac{\text{Var}(\theta)}{4} = \frac14 + \frac{\tau^2}{4}.$$
1.4 Decision Principles
The principles used to select a sensible decision are:
(a) Conditional Bayes Decision Principle
(b) Frequentist Decision Principle
(a) Conditional Bayes Decision Principle:
Choose an action $a \in \mathcal{A}$ minimizing $\rho(\pi, a)$. Such an $a$ will be called a Bayes action and will be denoted $a^\pi$.
Example 4 (continued):
Let $\mathcal{A} = \{a_1, a_2\},\ \pi(\theta_1) = 0.99,\ \pi(\theta_2) = 0.01$. Thus,
$$\rho(\pi, a_1) = -485, \qquad \rho(\pi, a_2) = 294.$$
Therefore, $a^\pi = a_1$.
(b) Frequentist Decision Principle:
The 3 most important frequentist decision principles:
- Bayes risk principle
- Minimax principle
- Invariance principle
(1) Bayes Risk Principle:
Let $D$ be the class of decision rules. Then, for $\delta_1, \delta_2 \in D$, a decision rule $\delta_1$ is preferred to a rule $\delta_2$ based on the Bayes risk principle if
$$r(\pi, \delta_1) < r(\pi, \delta_2).$$
A decision rule minimizing $r(\pi, \delta)$ among all decision rules in class $D$ is called a Bayes rule and will be denoted $\delta^\pi$. The quantity
$$r(\pi) = r(\pi, \delta^\pi)$$
is called the Bayes risk for $\pi$.
Example 5 (continued):
$$X \sim N(\theta, 1), \quad \theta \sim N(0, \tau^2), \quad D = \{\delta_c(x) = cx : c \text{ is any constant}\}.$$
Let $\delta_c(X) = cX$. Then,
$$R(\theta, \delta_c) = E_\theta(\theta - cX)^2 = E_\theta\left[c(X - \theta) + (c-1)\theta\right]^2 = c^2\,\text{Var}_\theta(X) + (c-1)^2\theta^2 = c^2 + (c-1)^2\theta^2$$
and
$$r(\pi, \delta_c) = E^\pi R(\theta, \delta_c) = E^\pi\left[c^2 + (c-1)^2\theta^2\right] = c^2 + (c-1)^2 E^\pi\theta^2 = c^2 + (c-1)^2\tau^2.$$
Note that $r(\pi, \delta_c)$ is a function of $c$; it attains its minimum at
$$c = \frac{\tau^2}{1 + \tau^2}$$
($f(c) = r(\pi, \delta_c) = c^2 + (c-1)^2\tau^2$, and $f'(c) = 2c + 2(c-1)\tau^2 = 0 \Rightarrow c = \frac{\tau^2}{1+\tau^2}$).
Thus,
$$\delta^\pi(X) = \frac{\tau^2}{1 + \tau^2}\,X$$
is the Bayes estimator. In addition,
$$r(\pi) = r(\pi, \delta^\pi) = \left(\frac{\tau^2}{1+\tau^2}\right)^2 + \left(\frac{\tau^2}{1+\tau^2} - 1\right)^2\tau^2 = \frac{\tau^4 + \tau^2}{(1+\tau^2)^2} = \frac{\tau^2(1+\tau^2)}{(1+\tau^2)^2} = \frac{\tau^2}{1+\tau^2}.$$
Example 4 (continued):
Let $D = \{a_1, a_2\},\ \pi(\theta_1) = 0.99,\ \pi(\theta_2) = 0.01$. Then,
$$r(\pi, a_1) = E^\pi R(\theta, a_1) = E^\pi L(\theta, a_1) = -485$$
and
$$r(\pi, a_2) = E^\pi R(\theta, a_2) = E^\pi L(\theta, a_2) = 294.$$
Thus, $a_1$ is the Bayes rule.
Note:
In a no-data problem, the Bayes risk (frequentist principle) is equivalent to the Bayesian expected loss (conditional Bayes principle). Further, the Bayes risk principle will give the same answer as the conditional Bayes decision principle.
Definition:
Let $X = (X_1, X_2, \ldots, X_n)$ have the probability distribution function (or probability density function) $f(x \mid \theta) = f(x_1, x_2, \ldots, x_n \mid \theta)$ with prior density and prior cumulative distribution $\pi(\theta)$ and $F^\pi(\theta)$, respectively. Then, the marginal density or distribution of $X = (X_1, X_2, \ldots, X_n)$ is
$$m(x) = m(x_1, x_2, \ldots, x_n) = \int_\Theta f(x \mid \theta)\,dF^\pi(\theta) = \begin{cases}\displaystyle\int_\Theta f(x \mid \theta)\,\pi(\theta)\,d\theta & \text{(continuous)}\\[2mm] \displaystyle\sum_\theta f(x \mid \theta)\,\pi(\theta) & \text{(discrete)}\end{cases}$$
The posterior density or distribution of $\theta$ given $x$ is
$$f(\theta \mid x) = f(\theta \mid x_1, x_2, \ldots, x_n) = \frac{\pi(\theta)\,f(x \mid \theta)}{m(x)}.$$
The posterior expectation of $g(\theta)$ given $x$ is
$$E^{f(\theta \mid x)}[g(\theta)] = \begin{cases}\displaystyle\int_\Theta g(\theta)\,f(\theta \mid x)\,d\theta = \frac{\int_\Theta g(\theta)\,\pi(\theta)\,f(x \mid \theta)\,d\theta}{m(x)} & \text{(continuous)}\\[2mm] \displaystyle\sum_\theta g(\theta)\,f(\theta \mid x) = \frac{\sum_\theta g(\theta)\,\pi(\theta)\,f(x \mid \theta)}{m(x)} & \text{(discrete)}\end{cases}$$
Very Important Result:
Let $X = (X_1, X_2, \ldots, X_n)$ have the probability distribution function (or probability density function) $f(x \mid \theta) = f(x_1, x_2, \ldots, x_n \mid \theta)$ with prior density and prior cumulative distribution $\pi(\theta)$ and $F^\pi(\theta)$, respectively. Suppose the following two assumptions hold:
(a) There exists an estimator $\delta_0$ with finite Bayes risk.
(b) For almost all $x$, there exists a value $\delta^\pi(x)$ minimizing
$$\rho(\pi, \delta(x)) = E^{f(\theta \mid x)} L(\theta, \delta(x)) = \begin{cases}\displaystyle\int_\Theta L(\theta, \delta(x))\,f(\theta \mid x)\,d\theta\\[2mm] \displaystyle\sum_\theta L(\theta, \delta(x))\,f(\theta \mid x)\end{cases}$$
Then:
(a) if $L(\theta, a) = (a - g(\theta))^2$, then
$$\delta^\pi(x) = E^{f(\theta \mid x)}[g(\theta)] = \begin{cases}\displaystyle\int_\Theta g(\theta)\,f(\theta \mid x)\,d\theta\\[2mm] \displaystyle\sum_\theta g(\theta)\,f(\theta \mid x)\end{cases}$$
and, more generally, if $L(\theta, a) = w(\theta)(a - g(\theta))^2$, then
$$\delta^\pi(x) = \frac{E^{f(\theta \mid x)}[w(\theta)\,g(\theta)]}{E^{f(\theta \mid x)}[w(\theta)]} = \frac{\int_\Theta w(\theta)\,g(\theta)\,f(\theta \mid x)\,d\theta}{\int_\Theta w(\theta)\,f(\theta \mid x)\,d\theta} \quad \text{(or the corresponding sums)};$$
(b) if $L(\theta, a) = |a - \theta|$, then $\delta^\pi(x)$ is the median of the posterior density or distribution $f(\theta \mid x)$ of $\theta$ given $x$. Further, if
$$L(\theta, a) = \begin{cases}k_0(\theta - a), & \theta - a \ge 0\\ k_1(a - \theta), & \theta - a < 0\end{cases}$$
then $\delta^\pi(x)$ is the $\frac{k_0}{k_0 + k_1}$ percentile of the posterior density or distribution $f(\theta \mid x)$ of $\theta$ given $x$;
(c) if
$$L(\theta, a) = \begin{cases}0, & |a - \theta| \le c\\ 1, & |a - \theta| > c\end{cases}$$
then $\delta^\pi(x)$ is the midpoint of the interval $I$ of length $2c$ which maximizes
$$P(\theta \in I \mid x) = \begin{cases}\displaystyle\int_I f(\theta \mid x)\,d\theta\\[2mm] \displaystyle\sum_{\theta \in I} f(\theta \mid x)\end{cases}$$
[Outline of proof]
(a)
$$\rho(\pi, a) = E^{f(\theta \mid x)} L(\theta, a) = E^{f(\theta \mid x)}\left[w(\theta)(a - g(\theta))^2\right] = E^{f(\theta \mid x)}\left[w(\theta)\left(a^2 - 2a\,g(\theta) + g^2(\theta)\right)\right]$$
$$= E^{f(\theta \mid x)}[w(\theta)]\,a^2 - 2E^{f(\theta \mid x)}[g(\theta)\,w(\theta)]\,a + E^{f(\theta \mid x)}[g^2(\theta)\,w(\theta)].$$
Thus,
$$\frac{\partial\rho(\pi, a)}{\partial a} = 2E^{f(\theta \mid x)}[w(\theta)]\,a - 2E^{f(\theta \mid x)}[g(\theta)\,w(\theta)] = 0 \;\Rightarrow\; a = \frac{E^{f(\theta \mid x)}[g(\theta)\,w(\theta)]}{E^{f(\theta \mid x)}[w(\theta)]}.$$
(b)
Without loss of generality, assume $m$ is the median of $f(\theta \mid x)$. We want to prove
$$\rho(\pi, m) - \rho(\pi, a) = E^{f(\theta \mid x)}\left[L(\theta, m) - L(\theta, a)\right] \le 0, \quad \text{for } a > m.$$
Since, for $a > m$,
$$L(\theta, m) - L(\theta, a) = |m - \theta| - |a - \theta| = \begin{cases}m - a, & \theta \le m\\ 2\theta - (m + a) \;(\le a - m), & m < \theta \le a\\ a - m, & \theta > a\end{cases}$$
then
$$E^{f(\theta \mid x)}\left[L(\theta, m) - L(\theta, a)\right] \le (m - a)\,P(\theta \le m \mid x) + (a - m)\,P(\theta > m \mid x) \le \frac{m - a}{2} + \frac{a - m}{2} = 0,$$
where the last inequality uses $P(\theta \le m \mid x) \ge \frac12$ together with $m - a < 0$. The case $a < m$ is symmetric.
[Intuition of the above proof:]
Suppose $a_1 < a_2 < a_3$. We want to find a point $c$ such that $\sum_{i=1}^3 |c - a_i|$ achieves its minimum. When $c = a_2$,
$$\sum_{i=1}^3 |c - a_i| = (a_2 - a_1) + 0 + (a_3 - a_2) = a_3 - a_1.$$
When $c = c_1$ with $a_1 \le c_1 < a_2$,
$$\sum_{i=1}^3 |c_1 - a_i| = (c_1 - a_1) + (a_2 - c_1) + (a_3 - c_1) = (a_3 - a_1) + (a_2 - c_1) > a_3 - a_1.$$
When $c = c_2$ with $a_2 < c_2 \le a_3$,
$$\sum_{i=1}^3 |c_2 - a_i| = (c_2 - a_1) + (c_2 - a_2) + (a_3 - c_2) = (a_3 - a_1) + (c_2 - a_2) > a_3 - a_1.$$
The cases $c < a_1$ and $c > a_3$ only make the sum larger still. Therefore, at $c = a_2$ (the median of $a_1, a_2, a_3$), $\sum_{i=1}^3 |c - a_i|$ achieves its minimum.
(2) Minimax Principle:
A decision rule $\delta_1$ is preferred to a rule $\delta_2$ based on the minimax principle if
$$\sup_{\theta \in \Theta} R(\theta, \delta_1) < \sup_{\theta \in \Theta} R(\theta, \delta_2).$$
A decision rule $\delta^M$ minimizing $\sup_{\theta \in \Theta} R(\theta, \delta)$ among all decision rules in class $D$ is called a minimax decision rule, i.e.,
$$\sup_{\theta \in \Theta} R(\theta, \delta^M) = \inf_{\delta \in D}\,\sup_{\theta \in \Theta} R(\theta, \delta).$$
Example 5 (continued):
$D = \{\delta_c(x) = cx : c \text{ is any constant}\}$ and $R(\theta, \delta_c) = c^2 + (c-1)^2\theta^2$. Thus,
$$\sup_\theta R(\theta, \delta_c) = \sup_\theta\left[c^2 + (1-c)^2\theta^2\right] = \begin{cases}1, & c = 1\\ \infty, & c \ne 1\end{cases}$$
Therefore, $\delta^M = \delta_1(X) = X$ is the minimax decision rule.
Example 4 (continued):
$D = \{a_1, a_2\}$. Then,
$$\sup_\theta R(\theta, a_1) = \sup_\theta L(\theta, a_1) = 1000 \qquad\text{and}\qquad \sup_\theta R(\theta, a_2) = \sup_\theta L(\theta, a_2) = 300.$$
Thus, $\delta^M = a_2$.
(3) Invariance Principle:
If two problems have identical formal structures (i.e., the same sample space, parameter space, density, and loss function), the same decision rule should be used in each problem, based on the invariance principle.
Example 6:
$X$: the decay time of a certain atomic particle (in seconds). Let $X$ be exponentially distributed with mean $\theta$:
$$f(x \mid \theta) = \frac1\theta\,e^{-\frac x\theta}, \qquad 0 < x < \infty.$$
Suppose we want to estimate the mean $\theta$. A sensible loss function is
$$L(\theta, a) = \frac{(\theta - a)^2}{\theta^2} = \left(1 - \frac a\theta\right)^2.$$
Suppose
$Y$: the decay time of the same atomic particle (in minutes).
Then,
$$Y = \frac{X}{60}, \qquad f(y \mid \eta) = \frac1\eta\,e^{-\frac y\eta}, \quad 0 < y < \infty, \quad \eta = \frac{\theta}{60}.$$
Thus,
$$L(\theta, a) = \left(1 - \frac a\theta\right)^2 = \left(1 - \frac{a/60}{\theta/60}\right)^2 = \left(1 - \frac{a^*}{\eta}\right)^2 = L(\eta, a^*),$$
where $a^* = \frac{a}{60}$. Let
$\delta(X)$: the decision rule used to estimate $\theta$,
$\delta^*(Y)$: the decision rule used to estimate $\eta$.
Since the two problems have identical formal structure, the invariance principle requires $\delta^* = \delta$ and
$$\delta^*(Y) = \delta\left(\frac{X}{60}\right) = \frac{\delta(X)}{60} \qquad \left(\delta(X) = 60\,\delta\left(\frac{X}{60}\right)\right).$$
The above argument holds for any transformation of the form $Y = cX,\ c \in R^+$, based on the invariance principle. Then,
$$\delta(X) = \frac1c\,\delta(cX) \ \text{for all } c > 0 \;\Rightarrow\; \delta(X) = X\,\delta(1) = KX, \qquad K = \delta(1) \ \left(\text{take } c = \frac1X\right).$$
Thus, $\delta(X) = KX$ is the decision rule based on the invariance principle.
1.5 Foundations
Several fundamental principles are discussed in this section:
(a) Misuse of Classical Inference Procedures
(b) Frequentist Perspective
(c) Conditional Perspective
(d) Likelihood Principle
(a) Misuse of Classical Inference Procedures:
Example 7:
Let $X_1, X_2, \ldots, X_n \sim N(\theta, 1)$. In the classical inference problem
$$H_0: \theta = 0 \ \text{v.s.}\ H_1: \theta \ne 0,$$
the rejection rule at $\alpha = 0.05$ is
$$\sqrt n\,|\bar x| \ge 1.96, \qquad \bar x = \frac{\sum_{i=1}^n x_i}{n}.$$
Assume the true mean is $\theta = 10^{-10}$ and $n = 10^{24}$, so that $\bar X \sim N(10^{-10}, 10^{-24})$. Suppose $\bar x = 10^{-11}$; then
$$\sqrt n\,|\bar x| = 10^{12} \times 10^{-11} = 10 \ge 1.96$$
and we reject $H_0$. Intuitively, $\bar x = 10^{-11}$ seems to strongly indicate that $H_0$ should be true. However, for a large sample size, even when $\bar x$ is very close to 0, the classical inference method still indicates rejection of $H_0$. The above result seems to contradict intuition.
Note: it might be more sensible to test, for example,
$$H_0: |\theta| \le 10^{-3} \ \text{v.s.}\ H_1: |\theta| > 10^{-3}.$$
Example 8:
Let $X_1, X_2, \ldots, X_{100} \sim N(\theta, 1)$. In the classical inference problem
$$H_0: \theta \le 0 \ \text{v.s.}\ H_1: \theta > 0,$$
the rejection rule at $\alpha = 0.05$ is
$$\sqrt n\,\bar x = 10\,\bar x \ge 1.645, \qquad \bar x = \frac{\sum_{i=1}^{100} x_i}{100}.$$
If $\bar x = 0.164$, then $10\,\bar x = 1.64 < 1.645$ and we do not reject $H_0$. However, had we instead chosen $\alpha = 0.051$, then since
$$p\text{-value} = P(\bar X \ge \bar x) = P(Z \ge 1.64) = 0.0505 < 0.051,$$
we would reject $H_0$.
(b) Frequentist Perspective:
Example 9:
Let $X_1, X_2, \ldots, X_{100} \sim N(\theta, 1)$. In the classical inference problem
$$H_0: \theta = 0 \ \text{v.s.}\ H_1: \theta = 1,$$
the rejection rule at $\alpha = 0.05$ is
$$\sqrt n\,\bar x = 10\,\bar x \ge 1.645, \qquad \bar x = \frac{\sum_{i=1}^{100} x_i}{100}.$$
By employing the above rejection rule, about 5% of all rejections of the null hypothesis will actually be in error when $H_0$ is true. However, suppose the parameter values $\theta = 0$ and $\theta = 1$ occur equally often in repetitive use of the test, so the chance of $H_0$ being true is 0.5. Correctly speaking, the 5% error rate is then only correct for 50% of the repetitive uses. That is, one cannot make a useful statement about the actual error rate incurred in repetitive use without knowing $R(\theta, \delta)$ for all $\theta$.
(c) Conditional Perspective:
Example 10:
Frequentist viewpoint:
$X_1, X_2$ are independent with identical distribution,
$$P(X_i = \theta - 1) = P(X_i = \theta + 1) = \frac12.$$
Then,
$$\delta(X_1, X_2) = \begin{cases}\dfrac{X_1 + X_2}{2}, & X_1 \ne X_2\\[1mm] X_1 - 1, & X_1 = X_2\end{cases}$$
can be used to estimate $\theta$. In addition,
$$P_\theta(\delta(X_1, X_2) = \theta) = P(X_1 = \theta - 1, X_2 = \theta + 1 \ \text{or}\ X_1 = \theta + 1, X_2 = \theta - 1) + P(X_1 = \theta + 1, X_2 = \theta + 1) = 2 \times \frac12 \times \frac12 + \frac12 \times \frac12 = 0.75.$$
Thus, a frequentist claims a 75% confidence procedure.
Conditional viewpoint:
Given $X_1 \ne X_2$, $\delta(X_1, X_2)$ is 100% certain to estimate $\theta$ correctly, i.e., $P(\delta(X_1, X_2) = \theta \mid X_1 \ne X_2) = 1$.
Given $X_1 = X_2$, $\delta(X_1, X_2)$ is 50% certain to estimate $\theta$ correctly, i.e., $P(\delta(X_1, X_2) = \theta \mid X_1 = X_2) = 0.5$.
Example 11:

   x                                      1         2         3
   f(x | theta = 0)                     0.005     0.005     0.99
   f(x | theta = 1)                     0.0051    0.9849    0.01
   f(x | theta = 1) / f(x | theta = 0)  1.02      196.98    0.01

$X = 1$: some index (today) indicating the stock (tomorrow) will neither go up nor go down.
$X = 2$: some index (today) indicating the stock (tomorrow) will go up.
$X = 3$: some index (today) indicating the stock (tomorrow) will go down.
$\theta = 0$: the stock (tomorrow) will go down.
$\theta = 1$: the stock (tomorrow) will go up.
Frequentist viewpoint:
To test
$$H_0: \theta = 0 \ \text{v.s.}\ H_1: \theta = 1$$
by the most powerful test with $\alpha = 0.01$, we reject $H_0$ when $X \in \{1, 2\}$, since
$$\alpha = P(X \in \{1, 2\} \mid \theta = 0) = 0.005 + 0.005 = 0.01.$$
Thus, when $X = 1$, we reject $H_0$ and conclude the stock will go up. This conclusion might not be very convincing since the index does not indicate the rise of the stock.
Conditional viewpoint:
When $X = 1$,
$$\frac{f(1 \mid \theta = 1)}{f(1 \mid \theta = 0)} = 1.02.$$
Thus, $f(1 \mid \theta = 1)$ and $f(1 \mid \theta = 0)$ are very close to each other. Therefore, based on the conditional viewpoint, there is only about a 50% chance that the stock will go up tomorrow.
Example 12:
Suppose there are two laboratories, one in Kaohsiung and the other in Taichung. We flip a coin to decide the laboratory at which we will perform an experiment:
Head: Kaohsiung;  Tail: Taichung.
Assume the coin comes up tails. Then, the laboratory in Taichung should be used.
Question: do we need to perform another experiment in Kaohsiung in order to develop the report?
Frequentist viewpoint: we have to call for averaging over all possible data, including the data that would have been obtained in Kaohsiung.
Conditional viewpoint: we don't need to perform another experiment in Kaohsiung. We can make statistical inference based on the data we have now.
The Weak Conditionality Principle:
Two experiments $E_1$ or $E_2$ can be performed to draw information about $\theta$. Then, the actual information about $\theta$ should depend only on the experiment $E_j$ ($j = 1$ or $2$) that is actually performed.
(d) Likelihood Principle:
Definition:
For observed data $x$, the function $l(\theta) = f(x \mid \theta)$, considered as a function of $\theta$, is called the likelihood function.
Likelihood Principle:
All relevant experimental information is contained in the likelihood function for the observed $x$. Two likelihood functions contain the same information about $\theta$ if they are proportional to each other.
Example 13:
$\theta$: the probability that a coin comes up heads.
Suppose we want to know if the coin is fair, i.e.,
$$H_0: \theta = \frac12 \ \text{v.s.}\ H_1: \theta > \frac12,$$
with $\alpha = 0.05$. We flip the coin in a series of trials, obtaining 9 heads and 3 tails. Let
$X$: the number of heads.
Two likelihood functions can be used. They are:
1. Binomial:
$$X \sim B(12, \theta), \qquad l_1(\theta) = f_1(x \mid \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}.$$
In this example,
$$l_1(\theta) = f_1(9 \mid \theta) = \binom{12}{9}\theta^9(1-\theta)^3 = 220\,\theta^9(1-\theta)^3.$$
2. Negative Binomial:
$$X \sim NB(n, \theta), \qquad l_2(\theta) = f_2(x \mid \theta) = \binom{n+x-1}{x}\theta^x(1-\theta)^n.$$
In this example, we flip the coin until 3 tails come up. Therefore, $n = 3,\ x = 9$ and
$$l_2(\theta) = f_2(9 \mid \theta) = \binom{11}{9}\theta^9(1-\theta)^3 = 55\,\theta^9(1-\theta)^3.$$
By the likelihood principle, $l_1(\theta)$ and $l_2(\theta)$ contain the same information. Thus, intuitively, the same conclusion should be reached based on the two likelihood functions. However, classical statistical inference reaches different conclusions from the two likelihoods, which is bizarre from the likelihood point of view.
1. Binomial:
The rejection rule is $X \ge c$, where $c$ is some constant. Thus, in this example,
$$p\text{-value} = P\left(X \ge 9 \,\Big|\, \theta = \frac12\right) = f_1\left(9 \,\Big|\, \tfrac12\right) + f_1\left(10 \,\Big|\, \tfrac12\right) + f_1\left(11 \,\Big|\, \tfrac12\right) + f_1\left(12 \,\Big|\, \tfrac12\right) = 0.0730 > 0.05.$$
Thus, we do not reject $H_0$ and conclude the coin is fair.
2. Negative Binomial:
The rejection rule is $X \ge c$, where $c$ is some constant. Thus, in this example,
$$p\text{-value} = P\left(X \ge 9 \,\Big|\, \theta = \frac12\right) = f_2\left(9 \,\Big|\, \tfrac12\right) + f_2\left(10 \,\Big|\, \tfrac12\right) + \cdots = 0.0327 < 0.05.$$
Thus, we reject $H_0$ and conclude the coin is not fair.
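The two p-values can be reproduced with `scipy.stats` (a sketch; the notes' original figures, 0.075 and 0.0325, appear to be slight roundings of these values):

```python
from scipy.stats import binom, nbinom

# Binomial experiment: 12 tosses, X = number of heads, observed x = 9.
p_binom = binom.sf(8, 12, 0.5)     # P(X >= 9 | theta = 1/2) = 0.0730
# Negative binomial: toss until 3 tails; X = number of heads, observed x = 9.
p_nbinom = nbinom.sf(8, 3, 0.5)    # P(X >= 9 | theta = 1/2) = 0.0327

print(round(p_binom, 4), round(p_nbinom, 4))
```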
Note:
The "robust" Bayesian paradigm, which takes into account uncertainty in the prior, is the fundamentally correct paradigm.
Chapter 2 Bayesian Analysis
2.1 Posterior Distribution
I. Discrete case:
Motivating example 1:
$\theta_0$: no AIDS; $\theta_1$: AIDS; $\pi(\theta_0) = 0.99,\ \pi(\theta_1) = 0.01$.
Let
$$X = \begin{cases}1: & \text{positive (indicates AIDS)}\\ 0: & \text{negative (indicates no AIDS)}\end{cases}$$
From past experience and records, we know
$$f(0 \mid \theta_0) = 0.97,\quad f(1 \mid \theta_0) = 0.03,\quad f(0 \mid \theta_1) = 0.02,\quad f(1 \mid \theta_1) = 0.98.$$
Objective: find $f(\theta_0 \mid 1)$, where
$f(\theta_0 \mid 1)$: the conditional probability that a patient really has no AIDS given that the test indicates AIDS.
Note: high $f(\theta_0 \mid 0)$ and $f(\theta_1 \mid 1)$ imply the test is accurate in judging AIDS infection.
Bayes’s Theorem (two events):
P( A | B) 
P( A  B)
P( A) P( B | A)

P( B)
P( A) P( B | A)  P( A c ) P( B | A c )
[Derivation of Bayes’s theorem (two events)]:
A
Ac
B
B∩A B∩Ac
We want to know P( A | B) 
P( B  A)
P( B) . Since
P( B  A)  P( A) P( B | A) ,
and
P( B)  P( B  A)  P( B  Ac )  P( A) P( B | A)  P( Ac ) P( B | Ac ) ,
thus,
P( A | B) 
P( B  A)
P( B  A)
P( A) P( B | A)


P( B)
P( B  A)  P( B  Ac ) P( A) P( B | A)  P( Ac ) P( B | Ac )
Similarly, if the parameter  taking two values  0 and  1 and the
data X taking values at c1 , c2 ,, cn , , then let
   0   A ,   1  Ac
34
and
X  ck   B .
0
X= ck
(X= c k )
1
(X= c k )
∩ 1
 0
Since
P(X  ck      0 )  P(   0 ) P(X  ck  |    0 )
   0  f ck |  0 
,
and
P(X  ck )  P(X  ck     0 )  P(X  ck     1 )
 P(   0 ) P(X  ck  |    0 )  P(   1 ) P(X  ck  |    1 )
   0  f ck |  0     1  f ck |  1 
,
thus,
f  0 | c k   P (   0  | X  ck ) 


P (X  c k     0 )
P (X  ck )
P (X  ck     0 )
P (X  c k     0 )  P (X  c k     1 )
  0  f ck |  0 
  0  f ck |  0     1  f ck |  1 
Motivating example 1 (continue):
By Bayes’s theorem,
f  0 | 1  P (   0  | X  1) 
  0  f 1 |  0 
  0  f 1 |  0     1  f 1 |  1 
0.99  0.03
0.99  0.03  0.01  0.98
 0.7519

 A patient with test positive still has high probability (0.7519) of no
AIDS.
35
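The posterior probability above is a one-liner to verify (all numbers are from the example):

```python
# Bayes's theorem for the AIDS screening example.
prior_no, prior_yes = 0.99, 0.01        # pi(theta_0), pi(theta_1)
lik_pos_no, lik_pos_yes = 0.03, 0.98    # f(1 | theta_0), f(1 | theta_1)

post_no = (prior_no * lik_pos_no) / (prior_no * lik_pos_no
                                     + prior_yes * lik_pos_yes)
print(round(post_no, 4))                # 0.7519
```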
Motivating example 2:
$\theta_1$: the finances of the company are good.
$\theta_2$: the finances of the company are O.K.
$\theta_3$: the finances of the company are bad.
$X = 1$: good finance assessment for the company.
$X = 2$: O.K. finance assessment for the company.
$X = 3$: bad finance assessment for the company.
From past records, we know
$$\pi(\theta_1) = 0.5,\quad \pi(\theta_2) = 0.2,\quad \pi(\theta_3) = 0.3,$$
$$f(1 \mid \theta_1) = 0.9,\quad f(1 \mid \theta_2) = 0.05,\quad f(1 \mid \theta_3) = 0.05.$$
That is, we know the chances of the different financial situations of the company and the conditional probabilities of the different assessments given the financial situation; for example, $f(1 \mid \theta_1) = 0.9$ indicates that 90% of the company's good finance years have been predicted correctly by the finance assessment.
Our objective is to obtain the probability $f(\theta_1 \mid 1)$, i.e., the conditional probability that the finances of the company will be good in the coming year given a good finance assessment for the company this year.
Bayes’s Theorem (general):
Let A1 , A2 ,, An be mutually exclusive events and
A1  A2   An  S ,
then
36
P( Ai | B) 
P( Ai  B)
P( B)
.............. 
,
P( Ai ) P( B | Ai )
P( A1 ) P( B | A1 )  P( A2 ) P( B | A2 )    P( An ) P( B | An )
i  1,2,  , n .
[Derivation of Bayes’s theorem (general)]:
A1
A2
B∩A1
B∩A2
B∩A1
B∩A2
B∩A1
B∩A2
………………..
B
……………
An
B∩An
B∩An
B∩An
Since
P( B  Ai )  P( Ai ) P( B | Ai ) ,
and
P( B)  P( B  A1 )  P( B  A2 )   P( B  An )
.......  P( A1 ) P( B | A1 )  P( A2 ) P( B | A2 )    P( An ) P( B | An )
,
thus,
P( Ai | B) 
P( B  Ai )
P( Ai ) P( B | Ai )

P( B)
P( A1 ) P( B | A1 )  P( A2 ) P( B | A2 )    P( An ) P( B | An )
Similarly, if the parameter  taking n values 1 , 2 , , n and
the data X taking values at c1 , c2 ,, cn , , then let
37
  1  A1 ,    2   A2 ,,    n   An , and X
 ck   B .
Then,
f  i | c k   P(   i  | X  c k ) 


P(X  c k      i )
P(X  c k )
P(   i ) P(X  c k  |    i )
P(   1 ) P(X  c k  |    1 )    P(   n ) P(X  c k  |    n )
  i  f c k |  i 
  1  f c k |  1     2  f c k |  2       n  f c k |  n 
  2
  1
X  ck 
B1
………………..
X  ck 
  n
X  ck 
B∩A1
B∩A2
  12    2
B∩A
B∩An
B∩A
 n   n 
……………
Example 2:
 ( 1 ) f (1 | 1 )
 ( 1 ) f (1 | 1 )   ( 2 ) f (1 |  2 )   ( 3 ) f (1 |  3 )
0.5 * 0.9
............. 
 0.95
0.5 * 0.9  0.2 * 0.05  0.3 * 0.05
f  1 | 1 
 A company with good finance assessment has very high
probability (0.95) of good finance situation in the coming year.
II. Continuous case:
$h(x, \theta) = \pi(\theta)\,f(x \mid \theta)$: the joint density.
$$m(x) = \int_\Theta f(x \mid \theta)\,\pi(\theta)\,d\theta = \int_\Theta h(x, \theta)\,d\theta: \ \text{the marginal density.}$$
Thus, the posterior is
$$f(\theta \mid x) = \frac{h(x, \theta)}{m(x)} = \frac{\pi(\theta)\,f(x \mid \theta)}{m(x)} \propto \pi(\theta)\,f(x \mid \theta) = \pi(\theta)\,l(\theta \mid x).$$
Example 1:
Let
$$X_i \sim \text{Poisson}(\theta),\ i = 1, 2, \ldots, n, \qquad \theta \sim \text{gamma}(\alpha, \beta),\quad \pi(\theta) = \frac{\theta^{\alpha-1}\,e^{-\theta/\beta}}{\Gamma(\alpha)\,\beta^\alpha}.$$
Then,
$$f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n \frac{\theta^{x_i}\,e^{-\theta}}{x_i!} = \frac{\theta^{\sum_{i=1}^n x_i}\,e^{-n\theta}}{\prod_{i=1}^n x_i!}.$$
Thus,
$$f(\theta \mid x_1, \ldots, x_n) \propto \theta^{\sum_{i=1}^n x_i}\,e^{-n\theta}\cdot\theta^{\alpha-1}\,e^{-\theta/\beta} = \theta^{\sum_{i=1}^n x_i + \alpha - 1}\,e^{-\left(n + \frac1\beta\right)\theta} = \text{gamma}\left(\sum_{i=1}^n x_i + \alpha,\ \left(n + \frac1\beta\right)^{-1}\right).$$
Note: the MLE (maximum likelihood estimate) based on $f(x_1, \ldots, x_n \mid \theta) = l(\theta \mid x_1, \ldots, x_n)$ is
$$\bar x = \frac{\sum_{i=1}^n x_i}{n},$$
while the posterior mean (Bayes estimate under the squared error loss function) is
$$\frac{\sum_{i=1}^n x_i + \alpha}{n + \frac1\beta}.$$
The posterior mean incorporates the information obtained from the data (sample mean $\bar x$) with the information obtained from the prior (prior mean $\alpha\beta$).
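A small sketch of this conjugate update (the simulated data and prior hyperparameters below are assumptions chosen for illustration):

```python
import numpy as np
from scipy.stats import gamma, poisson

rng = np.random.default_rng(0)
alpha, beta_ = 2.0, 1.5                  # gamma(alpha, beta) prior, beta = scale
x = poisson.rvs(3.0, size=20, random_state=rng)   # hypothetical counts

# Conjugate update: posterior is gamma(sum(x) + alpha, scale 1/(n + 1/beta)).
shape_post = x.sum() + alpha
scale_post = 1.0 / (len(x) + 1.0 / beta_)

print("MLE (sample mean):", x.mean())
print("posterior mean:", shape_post * scale_post)   # (sum x + alpha)/(n + 1/beta)
print("95% interval:", gamma.ppf([0.025, 0.975], shape_post, scale=scale_post))
```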
Example 2:
Recall:
$$X \sim b(n, p), \qquad f(x \mid p) = \binom{n}{x}p^x(1-p)^{n-x}$$
and
$$\pi(p) \sim \text{beta}(a, b), \qquad \pi(p) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,p^{a-1}(1-p)^{b-1}.$$
Then,
$$f(p \mid x) \sim \text{beta}(x + a,\ n - x + b).$$
Extension:
$$(X_1, \ldots, X_p) \sim \text{mul}(n, \theta_1, \ldots, \theta_p), \qquad \theta = (\theta_1, \ldots, \theta_p),$$
$$f(x_1, \ldots, x_p \mid \theta) = \frac{n!}{x_1!\cdots x_p!\left(n - \sum_{i=1}^p x_i\right)!}\,\theta_1^{x_1}\cdots\theta_p^{x_p}\left(1 - \sum_{i=1}^p \theta_i\right)^{n - \sum_{i=1}^p x_i}$$
and
$$\pi(\theta) \sim \text{Dirichlet}(a_1, \ldots, a_p, a_{p+1}), \qquad \pi(\theta) = \frac{\Gamma\left(\sum_{i=1}^{p+1} a_i\right)}{\prod_{i=1}^{p+1}\Gamma(a_i)}\,\theta_1^{a_1-1}\cdots\theta_p^{a_p-1}\left(1 - \sum_{i=1}^p \theta_i\right)^{a_{p+1}-1}.$$
Then,
$$f(\theta \mid x_1, \ldots, x_p) \propto f(x_1, \ldots, x_p \mid \theta)\,\pi(\theta) \propto \theta_1^{x_1+a_1-1}\cdots\theta_p^{x_p+a_p-1}\left(1 - \sum_{i=1}^p\theta_i\right)^{n - \sum_{i=1}^p x_i + a_{p+1} - 1}$$
$$\sim \text{Dirichlet}\left(x_1 + a_1, \ldots, x_p + a_p,\ n - \sum_{i=1}^p x_i + a_{p+1}\right).$$
Note: the mean of $X_i$ is $n\theta_i$, and thus the MLE (maximum likelihood estimate) for $\theta_i$ based on $f(x_1, \ldots, x_p \mid \theta) = l(\theta \mid x_1, \ldots, x_p)$ is $\frac{x_i}{n}$, while the posterior mean (Bayes estimate under the squared error loss function) is
$$\frac{x_i + a_i}{\sum_{k=1}^p (x_k + a_k) + \left(n - \sum_{k=1}^p x_k + a_{p+1}\right)} = \frac{x_i + a_i}{n + \sum_{k=1}^{p+1} a_k}.$$
The posterior mean incorporates the information obtained from the data (MLE $\frac{x_i}{n}$) with the information obtained from the prior (prior mean $\frac{a_i}{\sum_{k=1}^{p+1} a_k}$).
Example 3:
$$X_1, \ldots, X_n \sim N(\theta, 1), \qquad f(x_1, \ldots, x_n \mid \theta) = \left(\frac{1}{\sqrt{2\pi}}\right)^n e^{-\frac{\sum_{i=1}^n (x_i - \theta)^2}{2}},$$
and $\pi(\theta) \sim N(\mu, 1)$. Then,
$$f(\theta \mid x_1, \ldots, x_n) \propto f(x_1, \ldots, x_n \mid \theta)\,\pi(\theta) \propto \exp\left\{-\frac12\left[\sum_{i=1}^n(x_i - \theta)^2 + (\theta - \mu)^2\right]\right\}$$
$$\propto \exp\left\{-\frac12\left[(n+1)\theta^2 - 2\theta\left(\sum_{i=1}^n x_i + \mu\right)\right]\right\} \propto \exp\left\{-\frac{n+1}{2}\left(\theta - \frac{\sum_{i=1}^n x_i + \mu}{n+1}\right)^2\right\}$$
$$\sim N\left(\frac{\sum_{i=1}^n x_i + \mu}{n+1},\ \frac{1}{n+1}\right).$$
Note: the MLE (maximum likelihood estimate) based on $f(x_1, \ldots, x_n \mid \theta) = l(\theta \mid x_1, \ldots, x_n)$ is
$$\bar x = \frac{\sum_{i=1}^n x_i}{n},$$
while the posterior mean (Bayes estimate under the squared error loss function) is
$$\frac{\sum_{i=1}^n x_i + \mu}{n+1} = \left(\frac{n}{n+1}\right)\bar x + \left(\frac{1}{n+1}\right)\mu.$$
The posterior mean incorporates the information obtained from the data (sample mean $\bar x$) with the information obtained from the prior (prior mean $\mu$). The variance of $\bar x$ is $1/n$ and the variance of the prior is 1. Intuitively, the variation of the prior is $n$ times that of $\bar x$. Therefore, we put more weight on the more stable estimate.
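A compact numerical sketch of this normal-normal posterior (the data below are simulated, an assumption for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu_prior = 0.0                        # prior N(mu, 1)
x = rng.normal(2.0, 1.0, size=30)     # hypothetical data with true theta = 2
n = len(x)

post_mean = (x.sum() + mu_prior) / (n + 1)    # (sum x + mu)/(n+1)
post_var = 1.0 / (n + 1)

print("MLE:", x.mean())
print("posterior mean:", post_mean)           # shrinks x-bar toward mu
print("central 95% interval:",
      norm.ppf([0.025, 0.975], post_mean, np.sqrt(post_var)))
```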
A Useful Result:
Let $T(X_1, X_2, \ldots, X_n)$ be a sufficient statistic for the parameter $\theta$ with density $g(t \mid \theta)$. If $T(x_1, x_2, \ldots, x_n) = t$, then
$$f(\theta \mid x_1, \ldots, x_n) = f(\theta \mid t) = \frac{\pi(\theta)\,g(t \mid \theta)}{m(t)},$$
where $m(t)$ is the marginal density of $T(X_1, X_2, \ldots, X_n)$.
Example 3 (continued):
$$f(x_1, \ldots, x_n \mid \theta) = \left(\frac{1}{\sqrt{2\pi}}\right)^n e^{-\frac{\sum_{i=1}^n(x_i - \theta)^2}{2}} = \left(\frac{1}{\sqrt{2\pi}}\right)^n e^{-\frac{n\theta^2 - 2\theta\sum_{i=1}^n x_i + \sum_{i=1}^n x_i^2}{2}} = \left[\left(\frac{1}{\sqrt{2\pi}}\right)^n e^{-\frac{\sum_{i=1}^n x_i^2}{2}}\right] e^{n\bar x\theta - \frac{n\theta^2}{2}}.$$
By the factorization theorem, $t = \bar x$ is a sufficient statistic. Since $\bar X \sim N(\theta, 1/n)$ and $\pi(\theta) \sim N(\mu, 1)$, the above useful result gives
$$f(\theta \mid x_1, \ldots, x_n) = f(\theta \mid \bar x) \sim N\left(\frac{\bar x + \frac{\mu}{n}}{1 + \frac1n},\ \frac{\frac1n}{1 + \frac1n}\right) = N\left(\frac{\sum_{i=1}^n x_i + \mu}{n+1},\ \frac{1}{n+1}\right),$$
the same posterior as before.
Definition (Conjugate family):
Let $\mathcal{F}$ denote the class of density functions $f(x \mid \theta)$. A class $P$ of prior distributions is said to be a conjugate family for $\mathcal{F}$ if $f(\theta \mid x)$ is in the class $P$ for all $f \in \mathcal{F}$ and all $\pi \in P$.
A Useful Result:
Let
$$f(x \mid \theta) = h(x)\,e^{\theta x - \psi(\theta)}.$$
If the prior of $\theta$ is $\pi(\theta) = k(\mu, \lambda)\,e^{\mu\theta - \lambda\psi(\theta)}$, then the posterior is
$$f(\theta \mid x) = k(\mu + x,\ \lambda + 1)\,e^{(\mu + x)\theta - (\lambda + 1)\psi(\theta)}.$$
[Proof:]
$$f(\theta \mid x) \propto \pi(\theta)\,f(x \mid \theta) = k(\mu, \lambda)\,e^{\mu\theta - \lambda\psi(\theta)}\,h(x)\,e^{\theta x - \psi(\theta)} \propto e^{(\mu + x)\theta - (\lambda + 1)\psi(\theta)}.$$
Since
$$\int_\Theta k(\mu, \lambda)\,e^{\mu\theta - \lambda\psi(\theta)}\,d\theta = 1 \;\Rightarrow\; k(\mu, \lambda) = \frac{1}{\int_\Theta e^{\mu\theta - \lambda\psi(\theta)}\,d\theta},$$
thus
$$k(\mu + x,\ \lambda + 1) = \frac{1}{\int_\Theta e^{(\mu + x)\theta - (\lambda + 1)\psi(\theta)}\,d\theta}.$$
Therefore,
$$f(\theta \mid x) = \frac{e^{(\mu + x)\theta - (\lambda + 1)\psi(\theta)}}{\int_\Theta e^{(\mu + x)\theta - (\lambda + 1)\psi(\theta)}\,d\theta} = k(\mu + x,\ \lambda + 1)\,e^{(\mu + x)\theta - (\lambda + 1)\psi(\theta)}.$$
Note:
Some commonly used conjugate families are the following: Normal-normal, Poisson-gamma, Normal-gamma, Binomial-beta, Multinomial-Dirichlet, Negative binomial-beta, Gamma-gamma.
2.2 Bayesian Inference
For Bayesian analysis, $f(\theta \mid x)$, the posterior distribution, plays an important role in the statistical inferential procedure. Some Bayesians suggest that inference should ideally consist of simply reporting the entire posterior distribution $f(\theta \mid x)$ (perhaps for a non-informative prior). However, some standard uses of the posterior are still helpful!!
Some statistical inference problems are:
I. Estimation (point estimate), estimation error
II. Interval estimate
III. Hypothesis testing
IV. Predictive inference
I. Estimation (point estimate)
(a) Estimation
The generalized maximum likelihood estimate of $\theta$ is the $\hat\theta$ which maximizes $f(\theta \mid x)$; $\hat\theta$ is the most likely value of $\theta$ given the prior and the sample $X$.
Example 3 (continued):
$$X_1, \ldots, X_n \sim N(\theta, 1), \qquad f(x_1, \ldots, x_n \mid \theta) = \left(\frac{1}{\sqrt{2\pi}}\right)^n e^{-\frac{\sum_{i=1}^n(x_i - \theta)^2}{2}},$$
and $\pi(\theta) \sim N(\mu, 1)$. Then,
$$f(\theta \mid x_1, \ldots, x_n) \sim N\left(\frac{\sum_{i=1}^n x_i + \mu}{n+1},\ \frac{1}{n+1}\right),$$
so
$$\hat\theta = \frac{\sum_{i=1}^n x_i + \mu}{n+1} = \frac{\bar x + \frac{\mu}{n}}{1 + \frac1n}$$
is the posterior mode (also the posterior mean).
Other commonly used Bayesian estimates of $\theta$ include the posterior mean and the posterior median. In the normal example, posterior mode = posterior mean = posterior median.
Note:
The mean and median of the posterior are frequently better estimates of $\theta$ than the mode. It is worthwhile to calculate and compare all three in a Bayesian study.
Example 4:
$$X \sim N(\theta, 1),\ \theta \ge 0; \qquad \pi(\theta) = 1,\ \theta \ge 0.$$
Then,
$$f(\theta \mid x) \propto \pi(\theta)\,f(x \mid \theta) = I_{[0,\infty)}(\theta)\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{(x - \theta)^2}{2}}.$$
Thus,
$$f(\theta \mid x) = \frac{I_{[0,\infty)}(\theta)\,e^{-\frac{(\theta - x)^2}{2}}}{\int_0^\infty e^{-\frac{(\theta - x)^2}{2}}\,d\theta}.$$
Further, the posterior mean is
$$E^{f(\theta \mid x)}[\theta] = \frac{\int_0^\infty \theta\,e^{-\frac{(\theta - x)^2}{2}}\,d\theta}{\int_0^\infty e^{-\frac{(\theta - x)^2}{2}}\,d\theta} = \frac{\int_{-x}^\infty (u + x)\,e^{-\frac{u^2}{2}}\,du}{\int_{-x}^\infty e^{-\frac{u^2}{2}}\,du} \qquad (u = \theta - x)$$
$$= x + \frac{e^{-\frac{x^2}{2}}}{\int_{-x}^\infty e^{-\frac{u^2}{2}}\,du} = x + \frac{\frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}}{1 - \Phi(-x)} > 0,$$
where $\Phi$ is the standard normal cumulative distribution function.
Note:
The classical MLE in this example is $x$, which is negative when $x < 0$ even though $\theta \ge 0$. The classical MLE might result in a senseless conclusion!!
(b) Estimation error
The posterior variance of an estimate $\delta(x)$ is
$$V_\delta^\pi(x) = E^{f(\theta \mid x)}\left[(\theta - \delta(x))^2\right].$$
The posterior variance is defined as
$$V^\pi(x) = E^{f(\theta \mid x)}\left[(\theta - \mu^\pi(x))^2\right],$$
where $\mu^\pi(x) = E^{f(\theta \mid x)}[\theta]$ is the posterior mean.
Note:
$$V_\delta^\pi(x) = V^\pi(x) + \left(\mu^\pi(x) - \delta(x)\right)^2.$$
Example 3 (continued):
$$f(\theta \mid x_1, \ldots, x_n) \sim N\left(\frac{\sum_{i=1}^n x_i + \mu}{n+1},\ \frac{1}{n+1}\right).$$
Then, the posterior mean is
$$\mu^\pi(x_1, \ldots, x_n) = \frac{\sum_{i=1}^n x_i + \mu}{n+1}$$
and the posterior variance is
$$V^\pi(x_1, \ldots, x_n) = \frac{1}{n+1}.$$
Suppose the classical MLE $\delta(x_1, \ldots, x_n) = \bar x$ is used; then
$$V_\delta^\pi(x_1, \ldots, x_n) = V^\pi(x_1, \ldots, x_n) + \left(\mu^\pi(x_1, \ldots, x_n) - \bar x\right)^2 = \frac{1}{n+1} + \left(\frac{\sum_{i=1}^n x_i + \mu}{n+1} - \frac{\sum_{i=1}^n x_i}{n}\right)^2 = \frac{1}{n+1} + \left(\frac{\mu - \bar x}{n+1}\right)^2.$$
Example 5:
$$X \sim N(\theta, \sigma^2), \qquad \pi(\theta) = 1.$$
Then,
$$f(\theta \mid x) \sim N(x, \sigma^2).$$
Thus, the posterior mean = the posterior mode = the posterior median = $x$ = the classical MLE.
Note:
The Bayesian analysis based on a non-informative prior is often formally the same as the usual classical maximum likelihood analysis.
Example 4 (continued):
The posterior density is
$$f(\theta \mid x) = \frac{I_{[0,\infty)}(\theta)\,e^{-\frac{(\theta - x)^2}{2}}}{\int_0^\infty e^{-\frac{(\theta - x)^2}{2}}\,d\theta}$$
and the posterior mean is
$$\mu^\pi(x) = E^{f(\theta \mid x)}[\theta] = x + \psi(x), \qquad \psi(x) = \frac{\frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}}{1 - \Phi(-x)}.$$
If $\delta(x) = x$, then
$$V_\delta^\pi(x) = V^\pi(x) + \left(\mu^\pi(x) - \delta(x)\right)^2 = V^\pi(x) + \psi^2(x).$$
Therefore, $V^\pi(x) = V_\delta^\pi(x) - \psi^2(x) = 1 - x\,\psi(x) - \psi^2(x)$, since
$$V_\delta^\pi(x) = E^{f(\theta \mid x)}\left[(\theta - x)^2\right] = \frac{\int_0^\infty (\theta - x)^2\,e^{-\frac{(\theta - x)^2}{2}}\,d\theta}{\int_0^\infty e^{-\frac{(\theta - x)^2}{2}}\,d\theta} = \frac{\int_{-x}^\infty u^2\,e^{-\frac{u^2}{2}}\,du}{\int_{-x}^\infty e^{-\frac{u^2}{2}}\,du} = 1 - \frac{x\,e^{-\frac{x^2}{2}}}{\int_{-x}^\infty e^{-\frac{u^2}{2}}\,du} = 1 - x\,\psi(x),$$
where integration by parts ($u^2\,e^{-u^2/2}\,du = -u\,d(e^{-u^2/2})$) gives
$$\int_{-x}^\infty u^2\,e^{-\frac{u^2}{2}}\,du = \left[-u\,e^{-\frac{u^2}{2}}\right]_{-x}^\infty + \int_{-x}^\infty e^{-\frac{u^2}{2}}\,du = -x\,e^{-\frac{x^2}{2}} + \int_{-x}^\infty e^{-\frac{u^2}{2}}\,du.$$
(c) Multivariate estimation
Let $\theta = (\theta_1, \ldots, \theta_p)^t$ be a $p$-dimensional parameter. Then, the posterior mean is
$$\mu^\pi(x) = \left(\mu_1^\pi(x), \mu_2^\pi(x), \ldots, \mu_p^\pi(x)\right)^t = \left(E^{f(\theta \mid x)}[\theta_1],\ E^{f(\theta \mid x)}[\theta_2],\ \ldots,\ E^{f(\theta \mid x)}[\theta_p]\right)^t$$
and the posterior covariance matrix is
$$V^\pi(x) = E^{f(\theta \mid x)}\left[(\theta - \mu^\pi(x))(\theta - \mu^\pi(x))^t\right].$$
Further, the posterior covariance matrix of an estimate $\delta(x)$ is
$$V_\delta^\pi(x) = E^{f(\theta \mid x)}\left[(\theta - \delta(x))(\theta - \delta(x))^t\right] = V^\pi(x) + \left(\mu^\pi(x) - \delta(x)\right)\left(\mu^\pi(x) - \delta(x)\right)^t.$$
II. Credible sets (interval estimates)
(a) Highest posterior density (HPD)
A $100(1-\alpha)\%$ credible set $C$ for $\theta$ satisfies
$$P(C \mid x) = \begin{cases}\displaystyle\int_C f(\theta \mid x)\,d\theta & \text{(continuous)}\\[2mm] \displaystyle\sum_{\theta \in C} f(\theta \mid x) & \text{(discrete)}\end{cases} \;\ge\; 1 - \alpha.$$
A $100(1-\alpha)\%$ HPD (highest posterior density) credible set $C$ for $\theta$ is
$$C_H = \{\theta \in \Theta : f(\theta \mid x) \ge k(\alpha)\},$$
where $k(\alpha)$ is the largest constant such that $P(C_H \mid x) \ge 1 - \alpha$.
Example 3 (continued):
$$f(\theta \mid x_1, \ldots, x_n) \sim N\left(\frac{\sum_{i=1}^n x_i + \mu}{n+1},\ \frac{1}{n+1}\right).$$
Therefore, a $100(1-\alpha)\%$ HPD credible set for $\theta$ is
$$C_H = \left(\frac{\sum_{i=1}^n x_i + \mu}{n+1} - z_{1-\frac\alpha2}\sqrt{\frac{1}{n+1}},\ \ \frac{\sum_{i=1}^n x_i + \mu}{n+1} + z_{1-\frac\alpha2}\sqrt{\frac{1}{n+1}}\right).$$
Example 5 (continued):
With $\sigma^2 = 1$: $X \sim N(\theta, 1),\ \pi(\theta) = 1$. Then, $f(\theta \mid x) \sim N(x, 1)$. Thus, a $100(1-\alpha)\%$ HPD credible set for $\theta$ is
$$C_H = \left(x - z_{1-\frac\alpha2},\ x + z_{1-\frac\alpha2}\right).$$
Note that the above HPD credible set is the same as the $100(1-\alpha)\%$ confidence interval based on classical analysis.
Note:
The classical confidence interval and the Bayesian credible set for $\theta$ are frequently the same.
To find the Bayesian credible set, two approaches can be used:
(I) using some numerical algorithm;
(II) using asymptotic results: approximating $f(\theta \mid x)$ by $N(\mu^\pi(x), V^\pi(x))$; then a $100(1-\alpha)\%$ HPD credible set for $\theta$ is
$$C_H = \left(\mu^\pi(x) - z_{1-\frac\alpha2}\sqrt{V^\pi(x)},\ \ \mu^\pi(x) + z_{1-\frac\alpha2}\sqrt{V^\pi(x)}\right).$$
For the multivariate case $\theta = (\theta_1, \ldots, \theta_p)^t$, approximate $f(\theta \mid x)$ by $N(\mu^\pi(x), V^\pi(x))$, where
$$\mu^\pi(x) = \left(\mu_1^\pi(x), \ldots, \mu_p^\pi(x)\right)^t \quad\text{and}\quad V^\pi(x) = E^{f(\theta \mid x)}\left[(\theta - \mu^\pi(x))(\theta - \mu^\pi(x))^t\right].$$
Then,
$$(\theta - \mu^\pi(x))^t\,[V^\pi(x)]^{-1}\,(\theta - \mu^\pi(x)) \;\dot\sim\; \chi_p^2.$$
Thus, a $100(1-\alpha)\%$ HPD credible set for $\theta$ is
$$C_H = \left\{\theta : (\theta - \mu^\pi(x))^t\,[V^\pi(x)]^{-1}\,(\theta - \mu^\pi(x)) \le \chi^2_{p, 1-\alpha}\right\},$$
where $\chi^2_{p, 1-\alpha}$ is the $100(1-\alpha)$ percentile of the chi-square distribution with $p$ degrees of freedom.
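A sketch of the normal-approximation HPD interval in approach (II), using the exact Example 3 posterior (the data are assumed values for illustration):

```python
import numpy as np
from scipy.stats import norm

# Example 3 posterior: N((sum x + mu)/(n+1), 1/(n+1)).
x = np.array([1.8, 2.3, 1.2, 2.9, 2.1])   # hypothetical observations
mu_prior, alpha_level = 0.0, 0.05
n = len(x)

post_mean = (x.sum() + mu_prior) / (n + 1)
post_sd = np.sqrt(1.0 / (n + 1))

z = norm.ppf(1 - alpha_level / 2)          # z_{1-alpha/2}
print("95% HPD:", (post_mean - z * post_sd, post_mean + z * post_sd))
```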
Example 6:
$$f(x \mid \theta) = e^{-(x - \theta)}\,I_{(\theta, \infty)}(x), \qquad \Theta = (0, \infty), \qquad \pi(\theta) = \frac{1}{(1+\theta)^2}.$$
Then,
$$f(\theta \mid x) \propto \pi(\theta)\,f(x \mid \theta) = \frac{e^{-(x - \theta)}}{(1+\theta)^2}\,I_{(0, x)}(\theta) \propto \frac{e^\theta}{(1+\theta)^2}\,I_{(0, x)}(\theta).$$
[Figure: sketch of the posterior density of $\theta$ on $(0, x)$, with the HPD set running from a point $b(\alpha)$ to $x$.]
The $100(1-\alpha)\%$ HPD credible set for $\theta$ is
$$C_H = (b(\alpha),\ x).$$
Let $\eta = e^\theta$ ($\theta = \log\eta$). Then,
$$\pi_\eta(\eta) = \pi_\theta(\log\eta)\left|\frac{d\theta}{d\eta}\right| = \frac{1}{(1+\log\eta)^2}\cdot\frac1\eta.$$
Further,
$$f(x \mid \eta) = e^{-(x - \log\eta)}\,I_{(\log\eta, \infty)}(x) = \eta\,e^{-x}\,I_{(\log\eta, \infty)}(x)$$
and
$$f(\eta \mid x) \propto \pi_\eta(\eta)\,f(x \mid \eta) = \frac{\eta\,e^{-x}}{\eta\,(1+\log\eta)^2}\,I_{(1, e^x)}(\eta) \propto \frac{1}{(1+\log\eta)^2}\,I_{(1, e^x)}(\eta).$$
[Figure: sketch of the posterior density of $\eta$ on $(1, e^x)$, decreasing from $\eta = 1$, with the HPD set running from 1 to a point $d(\alpha)$.]
The $100(1-\alpha)\%$ HPD credible set for $\eta$ is
$$C_H = (1,\ d(\alpha)).$$
Although $\theta = \log\eta$,
$$\left(\log 1,\ \log d(\alpha)\right) = \left(0,\ \log d(\alpha)\right) \ne (b(\alpha),\ x).$$
That is, the transformation of the $100(1-\alpha)\%$ HPD credible set for $\eta$ is not equivalent to the $100(1-\alpha)\%$ HPD credible set for $\theta$.
(b) General approach to finding an "optimal" credible set
A $100(1-\alpha)\%$ "optimal" credible set $C$ for $\theta$ has minimum size with respect to $S(C)$, where
$$S(C) = \int_C l(\theta)\,d\theta,$$
and $l(\theta)$ is some specific size function. For example, when $l(\theta) = 1$,
$$S(C) = \int_C d\theta = \text{the length of an interval},$$
and the resulting $100(1-\alpha)\%$ "optimal" credible set is the $100(1-\alpha)\%$ HPD credible set!!
III. Hypothesis testing
(a) Bayes factor
To test $H_0: \theta \in \Theta_0$ v.s. $H_1: \theta \in \Theta_1$, we need to calculate
$$\alpha_0 = P(\Theta_0 \mid x), \qquad \alpha_1 = P(\Theta_1 \mid x).$$
Then,
$$\alpha_0 < \alpha_1 \Rightarrow \text{reject } H_0, \qquad \alpha_0 > \alpha_1 \Rightarrow \text{do not reject } H_0.$$
The conceptual advantage is that $\alpha_0$ and $\alpha_1$ are the actual (subjective) probabilities of the hypotheses based on the information from the data and prior opinions.
Example 3 (continued):
$$f(\theta \mid x_1, \ldots, x_n) \sim N\left(\frac{\sum_{i=1}^n x_i + \mu}{n+1},\ \frac{1}{n+1}\right).$$
Suppose
$$n = 9, \qquad \sum_{i=1}^9 x_i = 110, \qquad \mu = 10.$$
Then, $f(\theta \mid x_1, \ldots, x_9) \sim N(12, 0.1)$.
To test $H_0: \theta \le 10$ v.s. $H_1: \theta > 10$,
$$\alpha_0 = P(\theta \le 10 \mid x) = P\left(\frac{\theta - 12}{\sqrt{0.1}} \le \frac{10 - 12}{\sqrt{0.1}} \,\Big|\, x\right) = P(Z \le -6.3245) \approx 0$$
and
$$\alpha_1 = P(\theta > 10 \mid x) \approx 1.$$
Example 7:
$$X \sim \text{Binomial}(n, p), \qquad \pi(p) = 1.$$
Then,
$$f(p \mid x) \propto p^x(1-p)^{n-x},$$
that is, $f(p \mid x) \sim \text{Beta}(x+1,\ n-x+1)$. To test
$$H_0: p \le \frac12 \ \text{v.s.}\ H_1: p > \frac12,$$
$$\alpha_0 = P\left(p \le \frac12 \,\Big|\, x\right) = \int_0^{\frac12} \frac{\Gamma(n+2)}{\Gamma(x+1)\,\Gamma(n-x+1)}\,p^x(1-p)^{n-x}\,dp,$$
an incomplete beta integral that can be evaluated numerically (or, by repeated integration by parts, as a finite binomial sum).
Definition (Bayes factor):
Denote
$$\pi_0 = P(\theta \in \Theta_0) = \int_{\Theta_0} \pi(\theta)\,d\theta, \qquad \pi_1 = P(\theta \in \Theta_1) = \int_{\Theta_1} \pi(\theta)\,d\theta.$$
Let $\frac{\alpha_0}{\alpha_1}$ be the posterior odds ratio of $H_0$ to $H_1$, and let $\frac{\pi_0}{\pi_1}$ be the prior odds ratio. Then, the Bayes factor in favor of $\Theta_0$ is defined as
$$B = \frac{\text{posterior odds ratio}}{\text{prior odds ratio}} = \frac{\alpha_0/\alpha_1}{\pi_0/\pi_1} = \frac{\alpha_0\,\pi_1}{\alpha_1\,\pi_0}.$$
[Intuition of the Bayes factor:]
Suppose
$$\Theta_0 = \{\theta_0\},\ \Theta_1 = \{\theta_1\} \quad\Rightarrow\quad H_0: \theta = \theta_0 \ \text{v.s.}\ H_1: \theta = \theta_1.$$
Then,
$$B = \frac{\alpha_0/\alpha_1}{\pi_0/\pi_1} = \frac{\dfrac{f(\theta_0 \mid x)}{f(\theta_1 \mid x)}}{\dfrac{\pi(\theta_0)}{\pi(\theta_1)}} = \frac{\dfrac{\pi(\theta_0)\,f(x \mid \theta_0)}{\pi(\theta_1)\,f(x \mid \theta_1)}}{\dfrac{\pi(\theta_0)}{\pi(\theta_1)}} = \frac{f(x \mid \theta_0)}{f(x \mid \theta_1)},$$
which is the classical likelihood ratio.
Note:
The Bayes factor $B$ can be perceived as a Bayesian likelihood ratio.
Example 3 (continued):
$f(\theta \mid x_1, \ldots, x_9) \sim N(12, 0.1)$. Since $\pi(\theta) \sim N(10, 1)$,
$$\pi_0 = \int_{-\infty}^{10} \pi(\theta)\,d\theta = \int_{-\infty}^{10} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{(\theta - 10)^2}{2}}\,d\theta = \frac12 = \pi_1.$$
The Bayes factor then is
$$B = \frac{\alpha_0/\alpha_1}{\pi_0/\pi_1} = \frac{\alpha_0}{\alpha_1} \approx \frac01 = 0.$$
Note:
The use of the p-value will sometimes give a result similar to the Bayesian approach, especially when vague prior information is used. For example, for
$$X \sim N(\theta, 1), \qquad \pi(\theta) = 1,$$
we have $f(\theta \mid x) \sim N(x, 1)$. To test $H_0: \theta \le \theta_0$ v.s. $H_1: \theta > \theta_0$,
$$\alpha_0 = P(\theta \le \theta_0 \mid x) = P(\theta - x \le \theta_0 - x \mid x) = \Phi(\theta_0 - x) = 1 - \Phi(x - \theta_0).$$
However, for the classical approach with data $x$, since $X \sim N(\theta, 1)$ we reject $H_0$ when $X \ge c$, and
$$p\text{-value} = P_{\theta_0}(X \ge x) = P(X - \theta_0 \ge x - \theta_0) = P(Z \ge x - \theta_0) = 1 - \Phi(x - \theta_0), \qquad Z \sim N(0, 1).$$
Both the p-value and $\alpha_0$ give the same value.
(b) Testing a point null hypothesis
To test $H_0: \theta = \theta_0$ v.s. $H_1: \theta \ne \theta_0$, the Bayesian answers differ radically from the classical answers. A continuous prior density is not acceptable for the above hypothesis, since any such prior will give $\theta_0$ prior (and hence posterior) probability 0.
We can choose a proper prior in the following way. Assume
$$\pi(\{\theta_0\}) = \pi_0 = P(\theta = \theta_0), \qquad \pi_1 = 1 - \pi_0 = P(\theta \ne \theta_0).$$
A proper prior is
$$\pi(\theta) = \begin{cases}\pi_0, & \theta = \theta_0\\ \pi_1\,g(\theta), & \theta \ne \theta_0\end{cases}$$
where $g(\theta)$ is some proper density function such that $\int_{\theta \ne \theta_0} g(\theta)\,d\theta = 1$. Note that
$$\int \pi(\theta)\,d\theta = \pi_0 + \pi_1\int_{\theta \ne \theta_0} g(\theta)\,d\theta = \pi_0 + \pi_1 = 1.$$
Then, the marginal density of $X$ is
$$m(x) = \int \pi(\theta)\,f(x \mid \theta)\,d\theta = \pi_0\,f(x \mid \theta_0) + (1 - \pi_0)\int_{\theta \ne \theta_0} g(\theta)\,f(x \mid \theta)\,d\theta = \pi_0\,f(x \mid \theta_0) + (1 - \pi_0)\,m_1(x),$$
where $m_1(x) = \int_{\theta \ne \theta_0} g(\theta)\,f(x \mid \theta)\,d\theta$. Thus,
$$f(\theta_0 \mid x) = \frac{\pi_0\,f(x \mid \theta_0)}{m(x)} = \frac{\pi_0\,f(x \mid \theta_0)}{\pi_0\,f(x \mid \theta_0) + (1 - \pi_0)\,m_1(x)} = \left[1 + \frac{1 - \pi_0}{\pi_0}\cdot\frac{m_1(x)}{f(x \mid \theta_0)}\right]^{-1}.$$
Therefore, the posterior odds ratio is
$$\frac{\alpha_0}{\alpha_1} = \frac{P(\{\theta_0\} \mid x)}{1 - P(\{\theta_0\} \mid x)} = \frac{\pi_0\,f(x \mid \theta_0)}{(1 - \pi_0)\,m_1(x)},$$
and the Bayes factor is
$$B = \frac{\alpha_0/\alpha_1}{\pi_0/\pi_1} = \frac{f(x \mid \theta_0)}{m_1(x)}.$$
Example 7:
Let $H_0: \theta = \theta_0$ v.s. $H_1: \theta \ne \theta_0$, with $X_1, \ldots, X_n \sim N(\theta, 1)$. Then, $\bar X \sim N\left(\theta, \frac1n\right)$. A proper choice of prior is
$$\pi(\theta) = \begin{cases}\pi_0, & \theta = \theta_0\\[1mm] (1 - \pi_0)\,\dfrac{1}{\sqrt{2\pi}}\,e^{-\frac{(\theta - \mu)^2}{2}} = (1 - \pi_0)\,N(\mu, 1), & \theta \ne \theta_0\end{cases}$$
Then,
$$m_1(\bar x) = \int_{\theta \ne \theta_0} g(\theta)\,f(\bar x \mid \theta)\,d\theta = \int \sqrt{\frac{n}{2\pi}}\,e^{-\frac{n(\bar x - \theta)^2}{2}}\cdot\frac{1}{\sqrt{2\pi}}\,e^{-\frac{(\theta - \mu)^2}{2}}\,d\theta = \frac{1}{\sqrt{2\pi\left(1 + \frac1n\right)}}\,e^{-\frac{(\bar x - \mu)^2}{2\left(1 + \frac1n\right)}} \ \sim N\left(\mu,\ 1 + \frac1n\right).$$
Then,
$$\alpha_0 = f(\theta_0 \mid \bar x) = \left[1 + \frac{1 - \pi_0}{\pi_0}\cdot\frac{m_1(\bar x)}{f(\bar x \mid \theta_0)}\right]^{-1} = \left[1 + \frac{1 - \pi_0}{\pi_0}\cdot\frac{\frac{1}{\sqrt{2\pi(1 + 1/n)}}\,e^{-\frac{(\bar x - \mu)^2}{2(1 + 1/n)}}}{\sqrt{\frac{n}{2\pi}}\,e^{-\frac{n(\bar x - \theta_0)^2}{2}}}\right]^{-1}.$$
If $\mu = \theta_0$, then, writing $z = \sqrt n\,|\bar x - \theta_0|$,
$$\alpha_0 = \left[1 + \frac{1 - \pi_0}{\pi_0}\cdot\frac{1}{\sqrt{1+n}}\,e^{\frac{z^2}{2(1 + 1/n)}}\right]^{-1}.$$
Further, if we let $\pi_0 = \frac12$, then
$$\alpha_0 = \left[1 + \frac{e^{\frac{z^2}{2(1 + 1/n)}}}{\sqrt{1+n}}\right]^{-1}.$$
The classical approach for the above hypothesis is to reject $H_0$ when $z = \sqrt n\,|\bar x - \theta_0| \ge c$. Thus, the p-value is
$$p\text{-value} = P\left(\sqrt n\,|\bar X - \theta_0| \ge \sqrt n\,|\bar x - \theta_0|\right) = P(|Z| \ge z) = 2\left(1 - \Phi(z)\right),$$
since $\sqrt n(\bar X - \theta_0) \sim N(0, 1)$ under $H_0$. We have the following comparisons for fixed $z = 1.96$:

                n=20     n=50     n=100    n=1000
   p-value      0.05     0.05     0.05     0.05
   alpha_0      0.42     0.52     0.60     0.80

Note that $p\text{-value} = 2(1 - \Phi(1.96)) = 0.05$. For $n = 20$,
$$\alpha_0 = \left[1 + \frac{e^{\frac{1.96^2}{2\left(1 + \frac1{20}\right)}}}{\sqrt{1 + 20}}\right]^{-1} = 0.42.$$
Similarly, $\alpha_0$ evaluated at the other $n$ values can be obtained.
Theorem:
For any prior of the form
$$\pi(\theta) = \begin{cases}\pi_0, & \theta = \theta_0\\ \pi_1\,g(\theta), & \theta \ne \theta_0\end{cases}$$
we have
$$\alpha_0 = f(\theta_0 \mid x) \ge \left[1 + \frac{1 - \pi_0}{\pi_0}\cdot\frac{r(x)}{f(x \mid \theta_0)}\right]^{-1},$$
where
$$r(x) = \sup_{\theta \ne \theta_0} f(x \mid \theta) \qquad \left(\text{usually } r(x) = f(x \mid \hat\theta)\right),$$
since $m_1(x) \le r(x)$. Further,
$$B = \frac{f(x \mid \theta_0)}{m_1(x)} \ge \frac{f(x \mid \theta_0)}{r(x)}.$$
Example 7 (continued):
$\bar X \sim N\left(\theta, \frac1n\right)$. Then,
$$f(\bar x \mid \theta) = \frac{1}{\sqrt{2\pi/n}}\,e^{-\frac{(\bar x - \theta)^2}{2/n}}.$$
The maximum likelihood estimate is $\hat\theta = \bar x$. Further,
$$r(\bar x) = \sup_{\theta \ne \theta_0} f(\bar x \mid \theta) = f(\bar x \mid \hat\theta) = \frac{1}{\sqrt{2\pi/n}}.$$
Therefore,
$$\alpha_0 \ge \left[1 + \frac{1 - \pi_0}{\pi_0}\cdot\frac{r(\bar x)}{f(\bar x \mid \theta_0)}\right]^{-1} = \left[1 + \frac{1 - \pi_0}{\pi_0}\,e^{\frac{n(\bar x - \theta_0)^2}{2}}\right]^{-1} = \left[1 + \frac{1 - \pi_0}{\pi_0}\,e^{\frac{z^2}{2}}\right]^{-1}.$$
Further,
$$B = \frac{f(\bar x \mid \theta_0)}{m_1(\bar x)} \ge \frac{f(\bar x \mid \theta_0)}{r(\bar x)} = e^{-\frac{z^2}{2}}.$$
When $z = 1.96$ and $\pi_0 = \frac12$ (p-value $= 0.05$),
$$\alpha_0 \ge \left[1 + e^{\frac{1.96^2}{2}}\right]^{-1} = 0.127 \qquad\text{and}\qquad B \ge e^{-\frac{1.96^2}{2}} = \frac{1}{6.83}.$$
2.3 Applications
Example 8:
$$Y_i \sim N(\theta_i, \sigma^2) = N(x_i^t\beta, \sigma^2) = N(x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p,\ \sigma^2), \qquad i = 1, \ldots, n,$$
where
$$X = \begin{bmatrix}x_{11} & x_{12} & \cdots & x_{1p}\\ x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{np}\end{bmatrix}, \qquad x_i = \begin{bmatrix}x_{i1}\\ x_{i2}\\ \vdots\\ x_{ip}\end{bmatrix}, \qquad \beta = \begin{bmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_p\end{bmatrix}.$$
Let $\pi(\beta) = c$ ($\sigma^2$ known). Then,
$$f(\beta \mid y) \propto \pi(\beta)\,f(y \mid \beta) \propto \exp\left[-\frac{\sum_{i=1}^n (y_i - x_i^t\beta)^2}{2\sigma^2}\right] = \exp\left[-\frac{(y - X\beta)^t(y - X\beta)}{2\sigma^2}\right].$$
Writing $y - X\beta = (y - X\hat\beta) + X(\hat\beta - \beta)$ with $\hat\beta = (X^tX)^{-1}X^ty$, so that $X^t(y - X\hat\beta) = 0$ and the cross term vanishes,
$$f(\beta \mid y) \propto \exp\left[-\frac{(y - X\hat\beta)^t(y - X\hat\beta) + (\beta - \hat\beta)^t X^tX\,(\beta - \hat\beta)}{2\sigma^2}\right] \propto \exp\left[-\frac{(\beta - \hat\beta)^t X^tX\,(\beta - \hat\beta)}{2\sigma^2}\right] \sim N\left(\hat\beta,\ \sigma^2(X^tX)^{-1}\right).$$
Thus,
$$\text{posterior mean} = \text{posterior mode} = \text{posterior median} = \hat\beta = (X^tX)^{-1}X^ty,$$
which is the least squares estimate. Further, the $100(1-\alpha)\%$ highest posterior density (HPD) credible set is
$$\left\{\beta : \frac{1}{\sigma^2}\,(\beta - \hat\beta)^t X^tX\,(\beta - \hat\beta) \le \chi^2_{p, 1-\alpha}\right\},$$
which is the classical $100(1-\alpha)\%$ confidence ellipsoid.
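A runnable sketch of this flat-prior regression posterior (the design matrix and responses below are simulated assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 50, 3, 1.0
X = rng.normal(size=(n, p))                 # hypothetical design matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, sigma, n)

# Flat prior on beta: posterior is N(beta_hat, sigma^2 (X^t X)^{-1}).
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                # least squares = posterior mean
post_cov = sigma**2 * XtX_inv

print("posterior mean:", beta_hat)
print("posterior std devs:", np.sqrt(np.diag(post_cov)))
```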
Example 9:
$$Y_1, Y_2, \ldots, Y_n \sim N(\mu, \sigma^2), \qquad \mu \text{ known}.$$
Let $\pi(\sigma^2) = \frac{1}{\sigma^2}$. Then,
$$f(\sigma^2 \mid y) \propto \pi(\sigma^2)\,f(y \mid \sigma^2) = \frac{1}{\sigma^2}\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n e^{-\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2}} \propto \left(\sigma^2\right)^{-\left(\frac n2 + 1\right)} e^{-\frac{n s^2}{2\sigma^2}}, \qquad s^2 = \frac{\sum_{i=1}^n (y_i - \mu)^2}{n},$$
i.e.,
$$\sigma^2 \mid y \sim \text{Inverse Gamma}\left(\frac n2,\ \frac{2}{n s^2}\right).$$
The posterior mean is
$$E(\sigma^2 \mid y) = \frac{1}{\frac{2}{n s^2}\left(\frac n2 - 1\right)} = \frac{n s^2}{n - 2} = \frac{\sum_{i=1}^n (y_i - \mu)^2}{n - 2}.$$
Since
$$\frac{1}{\sigma^2}\,\Big|\, y \;\sim\; \text{Gamma}\left(\frac n2,\ \frac{2}{n s^2}\right),$$
the $100(1-\alpha)\%$ highest posterior density (HPD) credible interval $\left(\sigma_0^2,\ \sigma_1^2\right)$ for $\sigma^2$ satisfies
1. $f(\sigma_0^2 \mid y) = f(\sigma_1^2 \mid y)$;
2. $P(\sigma_0^2 \le \sigma^2 \le \sigma_1^2 \mid y) = 1 - \alpha$.
Example 10:
$$Y_1, Y_2, \ldots, Y_n \sim N(\mu, \sigma^2), \qquad \mu, \sigma^2 \text{ unknown}.$$
Let $\pi(\mu, \sigma) = \frac1\sigma$. Then,
$$f(\mu, \sigma \mid y) \propto \pi(\mu, \sigma)\,f(y \mid \mu, \sigma) = \frac1\sigma\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n e^{-\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2}} \propto \frac{1}{\sigma^{n+1}}\,e^{-\frac{\sum_{i=1}^n (y_i - \bar y)^2 + n(\mu - \bar y)^2}{2\sigma^2}}.$$
Thus, the posterior mode is
$$\hat\mu = \bar y, \qquad \hat\sigma^2 = \frac{\sum_{i=1}^n (y_i - \bar y)^2}{n + 1},$$
obtained by taking the logarithm of the posterior density and solving
$$\frac{\partial \log f(\mu, \sigma \mid y)}{\partial \mu} = \frac{2n(\bar y - \mu)}{2\sigma^2} = 0 \;\Rightarrow\; \hat\mu = \bar y,$$
$$\frac{\partial \log f(\hat\mu, \sigma \mid y)}{\partial \sigma} = -\frac{n + 1}{\sigma} + \frac{\sum_{i=1}^n (y_i - \bar y)^2}{\sigma^3} = 0 \;\Rightarrow\; \hat\sigma^2 = \frac{\sum_{i=1}^n (y_i - \bar y)^2}{n + 1}.$$
A $100(1-\alpha)\%$ credible set for $(\mu, \sigma)$ has the contour $f(\mu, \sigma \mid y) = c$, where $c$ is some constant. Therefore, the $100(1-\alpha)\%$ credible set for $(\mu, \sigma)$ is
$$\{(\mu, \sigma) : f(\mu, \sigma \mid y) \ge c\}, \quad\text{satisfying}\quad \iint_{\{(\mu, \sigma): f(\mu, \sigma \mid y) \ge c\}} f(\mu, \sigma \mid y)\,d\mu\,d\sigma = 1 - \alpha.$$
To find $c$, the large-sample result for the log-likelihood ratio is
$$-2\log\left[\frac{f(\mu, \sigma \mid y)}{f(\hat\mu, \hat\sigma \mid y)}\right] \;\dot\sim\; \chi_2^2.$$
Thus, the $100(1-\alpha)\%$ credible set for $(\mu, \sigma)$ has the contour
$$\left\{(\mu, \sigma) : (n+1)\log\sigma + \frac{1}{2\sigma^2}\left[\sum_{i=1}^n (y_i - \bar y)^2 + n(\mu - \bar y)^2\right] = (n+1)\log\hat\sigma + \frac{n+1}{2} + \frac12\,\chi^2_{2, 1-\alpha}\right\},$$
since $-\log f(\hat\mu, \hat\sigma \mid y) = (n+1)\log\hat\sigma + \frac{n+1}{2} + \text{constant}$.
The component (marginal) distribution $f(\mu \mid y)$ is
$$f(\mu \mid y) = \int_0^\infty f(\mu, \sigma \mid y)\,d\sigma \propto \int_0^\infty \sigma^{-(n+1)}\,e^{-\frac{\sum_{i=1}^n (y_i - \bar y)^2 + n(\mu - \bar y)^2}{2\sigma^2}}\,d\sigma \propto \left[\sum_{i=1}^n (y_i - \bar y)^2 + n(\mu - \bar y)^2\right]^{-\frac n2} \propto \left[1 + \frac{n(\mu - \bar y)^2}{(n-1)s^2}\right]^{-\frac n2},$$
where
$$s^2 = \frac{\sum_{i=1}^n (y_i - \bar y)^2}{n - 1}.$$
Note that $f(\mu \mid y) \sim t\left(n-1,\ \bar y,\ \frac{s^2}{n}\right)$, the (location-scale) t distribution with location parameter $\bar y$ and scale parameter $\frac{s^2}{n}$. Further,
$$\frac{\mu - \bar y}{s/\sqrt n}\,\Big|\, y \;\sim\; t_{n-1}.$$
The posterior mean is $\bar y$, and the $100(1-\alpha)\%$ highest posterior density (HPD) credible interval for $\mu$ is
$$\left(\bar y - t_{n-1, \frac\alpha2}\,\frac{s}{\sqrt n},\ \ \bar y + t_{n-1, \frac\alpha2}\,\frac{s}{\sqrt n}\right).$$

Example 11:
$$Y_i \sim N(\theta_i, \sigma^2) = N(z_i\beta, \sigma^2), \qquad i = 1, \ldots, n; \quad \beta \text{ is a single parameter}.$$
Let $\pi(\beta, \sigma^2) = \frac{1}{\sigma^2}$. Then,
$$f(\beta, \sigma^2 \mid y) \propto \pi(\beta, \sigma^2)\,f(y \mid \beta, \sigma^2) \propto \left(\sigma^2\right)^{-\left(\frac n2 + 1\right)} e^{-\frac{\sum_{i=1}^n (y_i - z_i\beta)^2}{2\sigma^2}} = \left(\sigma^2\right)^{-\left(\frac n2 + 1\right)} e^{-\frac{\sum_{i=1}^n (y_i - z_i\hat\beta)^2 + \left(\sum_{i=1}^n z_i^2\right)(\beta - \hat\beta)^2}{2\sigma^2}},$$
where
$$\hat\beta = \frac{\sum_{i=1}^n z_i y_i}{\sum_{i=1}^n z_i^2},$$
so that $\sum_{i=1}^n z_i\left(y_i - z_i\hat\beta\right) = 0$ and the cross term vanishes.
The component (marginal) distribution $f(\sigma^2 \mid y)$ is
$$f(\sigma^2 \mid y) = \int f(\beta, \sigma^2 \mid y)\,d\beta \propto \left(\sigma^2\right)^{-\left(\frac n2 + 1\right)} e^{-\frac{\sum_{i=1}^n (y_i - z_i\hat\beta)^2}{2\sigma^2}} \int e^{-\frac{\left(\sum_{i=1}^n z_i^2\right)(\beta - \hat\beta)^2}{2\sigma^2}}\,d\beta \propto \left(\sigma^2\right)^{-\left(\frac{n-1}{2} + 1\right)} e^{-\frac{\sum_{i=1}^n (y_i - z_i\hat\beta)^2}{2\sigma^2}},$$
since the $\beta$-integral contributes a factor proportional to $\sigma$. Thus, the posterior mode of $f(\sigma^2 \mid y)$ is
$$\hat\sigma^2 = \frac{\sum_{i=1}^n \left(y_i - z_i\hat\beta\right)^2}{n + 1}.$$
Chapter 3 Prior Information
3.1 Subjective Determination of the Prior Density
Several useful approaches can be used to determine the prior density. They are:
1. the histogram approach
2. the relative likelihood approach
3. matching a given functional form
4. CDF determination
(1) The histogram approach
Divide the parameter space into intervals, determine the subjective probability of each interval, and then plot a probability histogram.
(2) The relative likelihood approach
Compare the intuitive "likelihoods" of various points in $\Theta$, and directly sketch a prior density from these determinations.
(3) Matching a given functional form
Assume that $\pi(\theta)$ is of a given functional form, and then choose the density of this given form which most closely matches prior beliefs.
(4) CDF determination
Subjectively determine several $\theta$ percentiles $z(\alpha)$, where $P(\theta \le z(\alpha)) = \alpha$, plot the points $(z(\alpha), \alpha)$, and sketch a smooth curve joining them.
Note:
The histogram and the relative likelihood approaches are the most useful.
Note:
Untrained or inexperienced people do quite poorly in determining probability distributions. Through the study of "Bayesian robustness", we can alleviate this problem.
3.2 Non-informative Priors
(a) Introduction
Non-informative prior is a prior which contains no information about
.
Example 1:
X ~ N(θ, 1), θ ∈ (−∞, ∞).
π(θ) = c, c is some constant. ◆
Laplace (1812) used π(θ) = 1. As π(θ) = c,
∫ π(θ) dθ = ∫ c dθ = ∞,
so π(θ) is not a proper density (an improper prior). In addition, the
other severe criticism of the non-informative prior is the lack of
invariance under transformation.
Example 2:
θ is the parameter we are interested in. Suppose π(θ) = 1. Let
η = e^θ, i.e., θ = log(η).
Then, the prior for η is
π*(η) = π(log η) |dθ/dη| = 1 · (1/η) = 1/η.
Therefore, the prior for η is not non-informative.
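The lack of invariance can also be seen numerically. A minimal sketch (the truncation of the improper flat prior to [0, 5] is an arbitrary device for the simulation, not part of the example):

```python
import numpy as np

rng = np.random.default_rng(7)

# Draw theta "uniformly" on a truncated range (pi(theta) = 1 is improper,
# so we truncate to [0, 5] purely for illustration) and transform.
theta = rng.uniform(0.0, 5.0, size=200_000)
eta = np.exp(theta)

# The histogram of eta is far from flat: its density behaves like 1/eta,
# so a "non-informative" (flat) prior on theta is informative about eta.
hist, edges = np.histogram(eta, bins=[1, 2, 4, 8, 16, 32, 64, 128], density=True)
print(hist)   # each value roughly halves as eta doubles, i.e. ~ 1/eta
```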
(b) Non-informative prior for location and scale problems
(i) Location problem
The density of X is of the form f_X(x) = f(x − θ) (i.e., it depends
only on x − θ). The density is then said to be a location density.
Example 3:
X ~ N(θ, σ²), with σ² known. Then, let
f(z) = (1/(√(2π) σ)) e^(−z²/(2σ²)).
The density of X is
f_X(x) = f(x − θ) = (1/(√(2π) σ)) e^(−(x−θ)²/(2σ²)). ◆
For any constant c, if Y = X + c (i.e., X = Y − c), then Y has
the density
f_Y(y) = f_X(x) |dx/dy| = f_X(y − c) = f(y − c − θ).
Further, denote η = θ + c; then Y has the density f(y − η),
which is identical to the density of X, f(x − θ), in structure.
Example 3 (continued):
X ~ N(θ, σ²), Y = X + 5. Then, Y ~ N(θ + 5, σ²) = N(η, σ²), η = θ + 5, and Y has the
density
f_Y(y) = f(y − η) = (1/(√(2π) σ)) e^(−(y−η)²/(2σ²)). ◆
Assume π(θ) and π*(η) are the non-informative priors of θ
and η = θ + c, respectively. Since the two problems are identical in
structure, the two priors should be identical: π = π*. Let A be any set. Then,
∫_A π(θ) dθ = ∫_A π*(η) dη
= P^{π*}(A) = P^π(A − c)
= ∫_{A−c} π(θ) dθ
= ∫_A π(θ − c) dθ.
Thus, π(θ) = π(θ − c) for all c. Let θ = c:
π(c) = π(0) = some constant.
It is convenient to choose the constant to be 1; that is, π(c) ≡ 1.
(ii) Scale problem
The density of X is of the form
(1/σ) f(x/σ), σ > 0. The density is
then said to be a scale density.
Example 4:
X ~ N(0, σ²). Then, let
f(z) = (1/√(2π)) e^(−z²/2).
The density of X is
(1/σ) f(x/σ) = (1/(√(2π) σ)) e^(−x²/(2σ²)). ◆
Let Y  cX  X 
f y  y   f x x  
Y
. Then, Y has the density function
c
dx
1
1  y  1  y
 x 1
  f   
f
f   ,

dy 
   c c  c     
where   c . Y has the density
to the density of X,
 y

f
 which is identical
 

 
1
 y
f
 , in structure.
  
1
Example 4 (continued):
Y = 5X, X ~ N(0, σ²). Then, Y ~ N(0, 25σ²), and the density of Y is
f_Y(y) = (1/(√(2π) · 5σ)) e^(−y²/(2 · 25σ²)) = (1/η) f(y/η), η = 5σ. ◆
Assume π(σ) and π*(η) are the non-informative priors of σ
and η = cσ, respectively; by the identical structure of the two problems,
π = π*. Let A be any set. Then,
∫_A π(σ) dσ = ∫_A π*(η) dη
= P^{π*}(A) = P^π(A/c)
= ∫_{A/c} π(σ) dσ
= ∫_A (1/c) π(σ/c) dσ.
Thus, π(σ) = (1/c) π(σ/c) for all c > 0. Let σ = c:
π(c) = (1/c) π(1).
It is convenient to set π(1) = 1. Then,
π(c) = 1/c, i.e., π(σ) = 1/σ.
Note:
π(σ) = 1/σ is an improper prior since
∫₀^∞ (1/σ) dσ = ∞.
Note:
The main difficulty in using non-informative priors is their lack of
uniqueness. For example, in the location problem, any constant prior
π(θ) = c is a possible choice: there are a variety of different choices.
(c) Non-informative priors in general settings
Let
I(θ) = −E[ ∂² log f(x | θ) / ∂θ² ]
be the expected Fisher information. Jeffreys (1961) proposed the non-informative prior
π(θ) = [I(θ)]^(1/2).
If θ = (θ_1, θ_2, …, θ_p)ᵗ is a vector of parameters, then
π(θ) = π(θ_1, θ_2, …, θ_p) = [det I(θ)]^(1/2),
where I(θ) is the p×p matrix with (i, j) entry
−E[ ∂² log f(x | θ) / ∂θ_i ∂θ_j ], i, j = 1, …, p.
Example 5:
X ~ N(θ, 1).
Then,
I(θ) = −E[ ∂²/∂θ² log( (1/√(2π)) e^(−(X−θ)²/2) ) ]
= −E[ ∂²/∂θ² ( −(X − θ)²/2 ) ]
= −E[ −1 ] = 1.
Thus, the non-informative prior is
π(θ) = [I(θ)]^(1/2) = 1.
Example 6:
X ~ N(0, σ²).
Then,
I(σ) = −E[ ∂²/∂σ² log( (1/(√(2π) σ)) e^(−X²/(2σ²)) ) ]
= −E[ ∂²/∂σ² ( −log σ − X²/(2σ²) ) ]
= −E[ 1/σ² − 3X²/σ⁴ ]
= −1/σ² + (3/σ⁴) E[X²] = −1/σ² + 3σ²/σ⁴ = 2/σ².
Then,
π(σ) = [I(σ)]^(1/2) ∝ 1/σ.
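The Fisher-information computation in Example 6 can be reproduced symbolically. A minimal sketch using sympy (not part of the original notes):

```python
import sympy as sp
from sympy.stats import Normal, E

x = sp.symbols('x', real=True)
sigma = sp.symbols('sigma', positive=True)
X = Normal('X', 0, sigma)

log_f = sp.log(sp.density(X)(x))        # log f(x | sigma)
d2 = sp.diff(log_f, sigma, 2)           # second derivative in sigma
info = sp.simplify(-E(d2.subs(x, X)))   # I(sigma) = -E[d^2 log f / d sigma^2]
print(info)                             # 2/sigma**2
print(sp.sqrt(info))                    # Jeffreys prior proportional to 1/sigma
```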
3.3 Maximum Entropy Priors
Motivation:
Since often only partial prior information is available, it is desirable to
find a prior that is as non-informative as possible subject to that information.
Definition (Entropy, discrete case):
Assume Θ = {θ_1, θ_2, …, θ_n} is discrete, and let π be a
probability distribution on Θ. The entropy of π, denoted ℰ_n(π), is
ℰ_n(π) = −Σ_i π(θ_i) log π(θ_i) = −E[log π(θ)].
Note:
If π(θ_i) = 0, then π(θ_i) log π(θ_i) is defined to be 0.
Example 7:
Θ = {0, 0.1, 0.2, …, 1}, π(θ_i) = 1/11, for all i.
Then,
ℰ_n(π) = −Σ_i (1/11) log(1/11) = log 11.
Example 8:
Θ = {θ_1, θ_2, …, θ_n}, π(θ_i) = 1/n.
Then,
ℰ_n(π) = −Σ_{i=1}^n (1/n) log(1/n) = log n. ◆
The following result can be used to find the maximum entropy prior when
some partial prior information E[g_k(θ)], k = 1, 2, …, m, is
available; for example,
g_k(θ) = θ ⇒ E[g_k(θ)] = E[θ] = μ;
g_k(θ) = (θ − μ)^k ⇒ E[g_k(θ)] = E[(θ − μ)^k];
g_k(θ) = I_{(−∞, z_k]}(θ) ⇒ E[g_k(θ)] = E[I_{(−∞, z_k]}(θ)] = P(θ ≤ z_k).
Important Result:
The prior satisfying E[g_k(θ)] = μ_k, k = 1, 2, …, m, which has
maximum entropy is
π̃(θ_i) = exp{ Σ_{k=1}^m λ_k g_k(θ_i) } / Σ_j exp{ Σ_{k=1}^m λ_k g_k(θ_j) },
where λ_1, λ_2, …, λ_m can be determined by the equations
E[g_k(θ)] = μ_k, k = 1, 2, …, m.
Example 9:
Θ = {0, 1, 2, …}, g_1(θ) = θ, E[θ] = 5.
Then,
π̃(θ) = e^(λ_1 θ) / Σ_{j=0}^∞ e^(λ_1 j) = (1 − e^(λ_1)) e^(λ_1 θ) = (1 − p) p^θ, p = e^(λ_1),
which is a geometric distribution. Hence
E[θ] = p/(1 − p) = 5 ⇒ p = 5/6, i.e., λ_1 = log(5/6).
Therefore,
π̃(θ) = (1/6)(5/6)^θ.
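A quick numerical check of Example 9 (truncating the infinite support for the computation):

```python
import numpy as np

# a minimal check of the maximum-entropy prior pi(theta) = (1/6)(5/6)^theta
theta = np.arange(2000)                 # truncate the support {0, 1, 2, ...}
p = (1 / 6) * (5 / 6) ** theta
print(p.sum())                          # ~1.0: a proper distribution
print((theta * p).sum())                # ~5.0: matches the constraint E[theta] = 5
print(-(p * np.log(p)).sum())           # its entropy
```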
Definition (Entropy, continuous case):
Assume Θ is continuous. The entropy of π is
ℰ(π) = −E^π[ log( π(θ)/π_0(θ) ) ] = −∫ π(θ) log( π(θ)/π_0(θ) ) dθ,
where π_0(θ) is the natural “non-informative” prior.
Important Result:
The prior satisfying
E^π[g_k(θ)] = ∫_Θ g_k(θ) π(θ) dθ = μ_k, k = 1, 2, …, m,
which has maximum entropy is
π̃(θ) = π_0(θ) exp{ Σ_{k=1}^m λ_k g_k(θ) } / ∫ π_0(θ) exp{ Σ_{k=1}^m λ_k g_k(θ) } dθ,
where λ_1, λ_2, …, λ_m can be determined by the equations
E[g_k(θ)] = μ_k, k = 1, 2, …, m.
Example 10:
Θ = [0, ∞) and assume θ is a location parameter. Also, let π_0(θ) = 1.
Assume we know g_1(θ) = θ and E[θ] = 3. Then,
π̃(θ) = e^(λ_1 θ) / ∫₀^∞ e^(λ_1 θ) dθ = −λ_1 e^(λ_1 θ), λ_1 < 0.
Thus,
E[θ] = −1/λ_1 = 3 ⇒ λ_1 = −1/3.
That is,
π̃(θ) = (1/3) e^(−θ/3).
Example 11:
Θ = (−∞, ∞) and assume θ is a location parameter. Also, let
π_0(θ) = 1. Assume we know g_1(θ) = θ and E[θ] = 3. Then,
π̃(θ) = e^(λ_1 θ) / ∫_{−∞}^∞ e^(λ_1 θ) dθ
does not exist, since ∫_{−∞}^∞ e^(λ_1 θ) dθ diverges for every λ_1.
Example 11 (continued):
Θ = (−∞, ∞) and assume θ is a location parameter. Also, let
π_0(θ) = 1. Assume we know
g_1(θ) = θ, E[θ] = 3; g_2(θ) = (θ − 3)², E[(θ − 3)²] = Var(θ) = 1.
Then,
π̃(θ) = e^(λ_1 θ + λ_2 (θ−3)²) / ∫_{−∞}^∞ e^(λ_1 θ + λ_2 (θ−3)²) dθ.
Completing the square in the exponent (with λ_2 < 0) shows that
π̃ ~ N( 3 − λ_1/(2λ_2), −1/(2λ_2) ).
Thus,
E[θ] = 3 − λ_1/(2λ_2) = 3 ⇒ λ_1 = 0, Var(θ) = −1/(2λ_2) = 1 ⇒ λ_2 = −1/2.
Therefore,
π̃ ~ N(3, 1).
Note:
Two difficulties arise in trying to use the maximum entropy approach
to determine a prior:
• the need to use a non-informative prior π_0 in the derivation of π̃(θ);
• π̃(θ) may not exist.
3.4 Using the Marginal Distribution
(I) Introduction
Recall that the marginal density of X is
m(x) = ∫_Θ f(x | θ) π(θ) dθ (continuous)
m(x) = Σ_Θ f(x | θ) π(θ) (discrete)
Example 12:
X ~ N(θ, σ_f²), π(θ) ~ N(μ, σ_π²).
Then, the marginal density of X is
m(x) = ∫ f(x | θ) π(θ) dθ
= ∫ (1/(√(2π) σ_f)) e^(−(x−θ)²/(2σ_f²)) · (1/(√(2π) σ_π)) e^(−(θ−μ)²/(2σ_π²)) dθ
~ N(μ, σ_π² + σ_f²). ◆
Note:
m(x) is sometimes called the predictive distribution for X.
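The marginal (predictive) distribution in Example 12 is easy to verify by simulation. A minimal sketch with hypothetical values of μ, σ_π, and σ_f:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Monte Carlo check of Example 12: simulate from the two-stage model and
# compare against the claimed marginal N(mu, sigma_pi^2 + sigma_f^2).
mu, sigma_pi, sigma_f = 2.0, 1.5, 0.5          # hypothetical values
theta = rng.normal(mu, sigma_pi, size=100_000) # draws from the prior
x = rng.normal(theta, sigma_f)                 # draws from the likelihood
print(stats.kstest(x, 'norm', args=(mu, np.hypot(sigma_pi, sigma_f))).pvalue)
# a large p-value is consistent with m(x) = N(2, 2.5)
```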
Let m^c(x) be the marginal density under the correct prior and
m^w(x) be the marginal density under a wrong prior. Then,
statistics obtained from the data should be “close” to the same
statistics based on m^c(x), not on m^w(x). For example, let
x_c be the mode of m^c(x) and x_w be the mode of
m^w(x). Intuitively, the observed data x should be around x_c,
not x_w.
To find the “correct” or “sensible” prior effectively, one could restrict
the choice of the priors to some class. Then, based on some criteria,
the best prior could be found. Several classes of priors are frequently
used.
1. Priors of a given functional form:
Γ = { π : π(θ) = g(θ | λ), λ ∈ Λ },
where Λ is some set and λ is called a hyper-parameter of the
prior.
Example 13:
Γ = { π : θ ~ N(μ, σ²), μ > 0, σ² > 0 };
(μ, σ²) is the hyper-parameter.
2. Priors of a given structural form:
Γ = { π : π(θ) = Π_{i=1}^p π_0(θ_i), π_0 is any density }, θ = (θ_1, …, θ_p)ᵗ.
Example 14:
X_1, X_2, …, X_p, X_i ~ N(θ_i, σ_f²), where σ_f² is a known constant.
Γ = { π : π(θ) = Π_{i=1}^p π_0(θ_i), π_0 ~ N(μ, σ²), μ ∈ (−∞, ∞), σ² > 0 }.
3. Priors close to an elicited prior:
Any prior “close” to a sensible elicited prior π_0 would also be reasonable.
For example, the ε-contamination class is
Γ = { π : π = (1 − ε) π_0 + ε q, q ∈ L },
where L is a class of possible “contaminations” and q is some
density function for θ.
(II) Prior selection
There are several approaches to selecting a sensible prior:
(i) the ML-II approach
(ii) the moment approach
(iii) the distance approach
(i) ML-II approach
Let Γ be a class of priors under consideration. ML-II (maximum
likelihood type II) is to find the π̂ ∈ Γ satisfying
m^{π̂}(x) = sup_{π ∈ Γ} m^π(x).
Example 15:
X_1, X_2, …, X_p, X_i ~ N(θ_i, 1), π(θ_i) ~ N(μ, τ²).
Then,
m^π(x_i) ~ N(μ, 1 + τ²).
The ML-II method is to find the μ and τ² maximizing m^π(x). Thus,
m^π(x) = Π_{i=1}^p [ 2π(1 + τ²) ]^(−1/2) exp{ −(x_i − μ)² / (2(1 + τ²)) }
= [ 2π(1 + τ²) ]^(−p/2) exp{ −Σ_{i=1}^p (x_i − x̄)² / (2(1 + τ²)) } exp{ −p(x̄ − μ)² / (2(1 + τ²)) }.
Setting
∂m^π(x)/∂μ = 0, ∂m^π(x)/∂τ² = 0
gives
μ̂ = x̄, τ̂² = Σ_{i=1}^p (x_i − x̄)²/p − 1 = s² − 1.
Since τ² ≥ 0, take τ̂² = max(0, s² − 1). Therefore,
π̂(θ) ~ N( x̄, max(0, s² − 1) ).
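The ML-II estimates of Example 15 reduce to two lines of arithmetic. A minimal sketch:

```python
import numpy as np

def ml2_normal(x):
    """ML-II hyperparameters for X_i ~ N(theta_i, 1), theta_i ~ N(mu, tau^2).

    Marginally X_i ~ N(mu, 1 + tau^2), so mu_hat = xbar and
    tau2_hat = max(0, s^2 - 1) with s^2 = sum((x - xbar)^2) / p.
    """
    x = np.asarray(x, float)
    mu_hat = x.mean()
    s2 = np.mean((x - mu_hat) ** 2)
    return mu_hat, max(0.0, s2 - 1.0)

# hypothetical observations
print(ml2_normal([0.5, 2.3, -1.1, 3.0, 1.2]))
```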
Example 16:
Γ = { π : π = (1 − ε) π_0 + ε q, q ∈ L }.
Then,
m^π(x) = ∫ f(x | θ) [ (1 − ε) π_0(θ) + ε q(θ) ] dθ
= (1 − ε) ∫ f(x | θ) π_0(θ) dθ + ε ∫ f(x | θ) q(θ) dθ
= (1 − ε) m^{π_0}(x) + ε m^q(x).
Now, the ML-II prior is found by finding the q̂ in L which maximizes
m^q(x). Suppose L is the class of all possible distributions, and let
θ̂ maximize f(x | θ). Let ⟨θ̂⟩ be the distribution with P(θ = θ̂) = 1
(all mass at θ̂). Since
m^q(x) = ∫ f(x | θ) q(θ) dθ ≤ ∫ f(x | θ̂) q(θ) dθ = f(x | θ̂) = ∫ f(x | θ) d⟨θ̂⟩(θ) = m^{⟨θ̂⟩}(x),
then
π̂(θ) = (1 − ε) π_0(θ) + ε ⟨θ̂⟩(θ).
(ii) Moment approach
Let μ_f(θ) and σ_f²(θ) be the conditional mean and
variance of X with respect to f(x | θ). Also, let μ_m and
σ_m² be the known marginal mean and variance of X with respect to
m(x). Then, the following equations can be used to obtain
moments of the prior density, such as μ_π and σ_π²:
μ_m = E^π[ μ_f(θ) ] ( = E[X] = E[E[X | θ]] )
σ_m² = E^π[ σ_f²(θ) ] + E^π[ (μ_f(θ) − μ_m)² ]
( Var(X) = E[Var(X | θ)] + E[(E[X | θ] − E[X])²] )
One special example is
μ_f(θ) = θ ⇒ μ_m = μ_π = E^π[θ];
σ_f²(θ) = σ_f² ⇒ σ_m² = σ_f² + σ_π².
Example 17:
X ~ N(θ, 1), π(θ) ~ N(μ_π, σ_π²).
Suppose we know μ_m = 1, σ_m² = 3. Since
μ_f(θ) = θ, σ_f²(θ) = 1,
then
μ_π = μ_m = 1, σ_π² = σ_m² − σ_f² = 3 − 1 = 2.
Thus,
π(θ) ~ N(1, 2)
is the appropriate prior.
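A simulation check of Example 17's moment matching (the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(8)

# With theta ~ N(1, 2) and X | theta ~ N(theta, 1), the marginal moments
# of X should match mu_m = 1 and sigma_m^2 = 3.
theta = rng.normal(1.0, np.sqrt(2.0), size=200_000)
x = rng.normal(theta, 1.0)
print(x.mean(), x.var())   # ~1.0 and ~3.0
```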
(iii) Distance approach
Let m̂(x) be the marginal density estimate of X obtained from
the data. Also, let
m^{π̂}(x) = ∫ f(x | θ) π̂(θ) dθ
be the marginal density under a candidate prior π̂. Then, we try to
find the π̂ minimizing the distance
d(m^{π̂}, m̂) = E^{m̂}[ log( m̂(x)/m^{π̂}(x) ) ]
= ∫ log( m̂(x)/m^{π̂}(x) ) m̂(x) dx (continuous)
= Σ_x log( m̂(x)/m^{π̂}(x) ) m̂(x) (discrete)
Note:
m^{π̂}(x) = m̂(x) ⇒ d(m^{π̂}, m̂) = 0.
3.5 Hierarchical Priors
A hierarchical prior is also called a multistage prior. Let the prior class be
Γ1 = { π1 : π1(θ | λ), λ ∈ Λ },
where π1(θ | λ) is of a given functional form. The second stage
is to put a prior π2(λ) on the hyper-parameter λ. The second-stage
prior π2(λ) is also called a hyper-prior.
Note:
π(θ) = ∫_Λ π1(θ | λ) π2(λ) dλ
Example 18:
X_1, X_2, …, X_n, X_i ~ N(θ_i, 1),
Γ1 = { π1 : π1(θ | μ, σ²) = Π_{i=1}^n π0(θ_i | μ, σ²), π0 ~ N(μ, σ²), μ ∈ (−∞, ∞), σ² > 0 },
π2(μ, σ²) = π2,1(μ) · π2,2(σ²)
= [ (1/√(800π)) e^(−(μ−100)²/800) ] · [ (1/(Γ(6)(0.001)⁶)) (σ²)^(−7) e^(−1000/σ²) ]
= N(100, 400) × IG(6, 0.001).
( If Z ~ IG(α, β), then f(z) = e^(−1/(βz)) / ( Γ(α) β^α z^(α+1) ). )
Example 19:
X_1, X_2, …, X_n, X_i ~ N(θ_i, 1),
Γ1 = { π1 : π1(θ | σ²) = Π_{i=1}^n π0(θ_i | σ²), π0 ~ N(100, σ²), σ² > 0 }.
The first-stage prior is
π1(θ | σ²) = (2πσ²)^(−n/2) exp{ −Σ_{i=1}^n (θ_i − 100)² / (2σ²) },
and the hyper-prior is σ² ~ IG(6, 0.001), i.e.,
π2(σ²) = (1000⁶/Γ(6)) (σ²)^(−7) e^(−1000/σ²).
Then,
π(θ) = ∫₀^∞ π1(θ | σ²) π2(σ²) dσ²
∝ ∫₀^∞ (σ²)^(−(n/2 + 7)) exp{ −[ Σ_{i=1}^n (θ_i − 100)² + 2000 ] / (2σ²) } dσ²
= Γ(6 + n/2) · [ ( Σ_{i=1}^n (θ_i − 100)² + 2000 ) / 2 ]^(−(6 + n/2))
∝ [ Σ_{i=1}^n (θ_i − 100)² + 2000 ]^(−(12 + n)/2)
∝ [ 1 + (1/12) (θ − μ)ᵗ Σ⁻¹ (θ − μ) ]^(−(12 + n)/2)
~ Multivariate-t distribution with α = 12, μ = (100, …, 100)ᵗ, Σ = (500/3) I,
since
(σ²)^(−(6 + n/2) − 1) exp{ −[ Σ_{i=1}^n (θ_i − 100)² + 2000 ] / (2σ²) }
is, up to a constant, the density of
IG( 6 + n/2, 2/[ Σ_{i=1}^n (θ_i − 100)² + 2000 ] ),
and for a multivariate-t distribution Y with parameters (α, μ, Σ), the density is
f(y) = f(y_1, …, y_n) = Γ((α + n)/2) / [ Γ(α/2) (απ)^(n/2) (det Σ)^(1/2) ] · [ 1 + (y − μ)ᵗ Σ⁻¹ (y − μ)/α ]^(−(α + n)/2).
3.6 Criticisms
I Objectivity
Box (1980): it is impossible logically to distinguish between model
assumptions and the prior distribution of the parameters.
II Misuse of prior distribution
Question:
When different reasonable priors yield substantially different
answers, can it be right to state that there is a single answer? Would
it not be better to admit that there is scientific uncertainty, with the
conclusion depending on the prior beliefs?
Answer:
In reporting conclusions from Bayesian analyses, the prior (also, data
and loss) should be reported separately, in order to allow others to
evaluate the reasonableness of the subjective inputs.
III Robustness
Question:
Slight changes in the prior distribution might cause significant
changes in the decision.
Answer:
Through “robust Bayesian” methodology and choice of “robust
priors”, concern can be reduced.
3.7 Empirical Bayes Analysis
Motivation:
For
X_i = θ_i + ε_i, ε_i ~ N(0, σ²), i = 1, 2, …, p,
the MLE is θ̂_i = X_i (higher-dimensional model).
If θ_1 = θ_2 = … = θ_p, the MLE is
θ̂ = X̄ = Σ_{i=1}^p X_i / p (lower-dimensional model).
Question to ask: which model is better (lower or higher)?
Answer: empirical Bayes analysis is particularly desirable in this
situation. The empirical Bayes method provides a “compromise” between
the model where the θ_i are completely unrelated (higher-dimensional
model) and that where all the θ_i are assumed to be equal (lower-dimensional model).
(a) Introduction
There are two types of empirical Bayes methods. One is parametric
empirical Bayes (PEB) and the other is nonparametric empirical
Bayes (NPEB).
Parametric empirical Bayes (PEB): the prior π(θ) is in some
parametric class with unknown hyperparameters.
Nonparametric empirical Bayes (NPEB): one typically assumes only that
the θ_i are i.i.d. from some prior π(θ).
Example 20:
Let
X_i ~ N(θ_i, σ²), π(θ_i) ~ N(μ, τ²),
where μ and τ² are unknown hyperparameters. ◆
Two different ways to carry out empirical Bayes analysis are
1. estimating the prior or posterior from the data first, then using π̂(θ)
or π̂(θ | x) to carry out a standard Bayes analysis;
2. finding the Bayes rule in terms of the unknown prior, and using the
data to estimate the Bayes rule.
(b) Parametric empirical Bayes for normal mean
Example 15 (continued):
X_1, X_2, …, X_p, X_i ~ N(θ_i, 1), π(θ_i) ~ N(μ, τ²).
Then,
1. m^π(x_i) ~ N(μ, 1 + τ²). The prior using the ML-II method is
π̂(θ) ~ N(μ̂, τ̂²) = N( x̄, max(0, s² − 1) ),
where μ̂ = x̄ and τ̂² = max(0, s² − 1).
2. The posterior distribution of θ_i is
f(θ_i | x_i) ~ N( (τ² x_i + μ)/(1 + τ²), τ²/(1 + τ²) ).
Then, the parameter estimate using the Bayes rule under squared-error loss
is the posterior mean
θ̂_i = (τ² x_i + μ)/(1 + τ²) = x_i − (x_i − μ)/(1 + τ²) = x_i − B(x_i − μ) = (1 − B) x_i + B μ,
where B = 1/(1 + τ²). Further, the posterior variance is
V̂_i = τ²/(1 + τ²) = 1 − B.
Note:
B → 0 ⇒ θ̂_i → x_i
B → 1 ⇒ θ̂_i → μ ◆
Morris (1983, JASA) suggested
B̂ = ( (p − 3)/(p − 1) ) · 1/(1 + τ̂²),
and then the empirical Bayes estimates are
θ̂_i^EB = (1 − B̂) x_i + B̂ μ̂ = (1 − B̂) x_i + B̂ x̄
and
V̂_i^EB = 1 − ( (p − 1)/p ) B̂ + ( 2/(p − 3) ) B̂² (x_i − x̄)²,
where μ̂ and τ̂² are the estimates obtained in step 1.
The estimated posterior distribution is
f̂(θ_i | x) ~ N( θ̂_i^EB, V̂_i^EB ).
Note:
B̂ → 0 ⇒ θ̂_i^EB → x_i
B̂ → 1 ⇒ θ̂_i^EB → x̄
The 100(1−α)% HPD credible set for θ_i is
C_i^EB = [ θ̂_i^EB − z_{1−α/2} (V̂_i^EB)^(1/2), θ̂_i^EB + z_{1−α/2} (V̂_i^EB)^(1/2) ]
and the 100(1−α)% HPD credible set for θ is
C^EB = { θ : Σ_{i=1}^p (θ_i − θ̂_i^EB)² / V̂_i^EB ≤ χ²_{p,1−α} }.
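The Morris-style empirical Bayes recipe above is straightforward to implement. A minimal sketch following the notes' formulas (the input vector is arbitrary, hypothetical data):

```python
import numpy as np

def morris_eb(x):
    """Empirical Bayes estimates for X_i ~ N(theta_i, 1), theta_i ~ N(mu, tau^2),
    following the notes' formulas (Morris 1983 style)."""
    x = np.asarray(x, float)
    p = len(x)
    xbar = x.mean()
    tau2 = max(0.0, np.mean((x - xbar) ** 2) - 1.0)       # ML-II: max(0, s^2 - 1)
    B = (p - 3) / (p - 1) / (1.0 + tau2)                  # estimated shrinkage factor
    theta = (1 - B) * x + B * xbar                        # shrink each x_i toward xbar
    V = 1 - (p - 1) / p * B + 2 / (p - 3) * B ** 2 * (x - xbar) ** 2
    return theta, V

theta_eb, V_eb = morris_eb([1.2, -0.4, 2.5, 0.3, 1.8, -1.1])
print(theta_eb)
print(V_eb)
```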
Example 21:
Y_1, Y_2, …, Y_n, Y_i ~ N(θ_i, 1),
θ_i = x_i β + ε_i = β_1 x_i1 + β_2 x_i2 + … + β_p x_ip + ε_i, ε_i ~ N(0, τ²),
where x_i = (x_i1, x_i2, …, x_ip) and β = (β_1, β_2, …, β_p)ᵗ.
Then,
1. Since π(θ_i) ~ N(x_i β, τ²),
m^π(y_i) ~ N(x_i β, 1 + τ²).
The marginal distribution is then
m^π(y) = m^π(y_1, y_2, …, y_n)
= [ 2π(1 + τ²) ]^(−n/2) exp{ −Σ_{i=1}^n (y_i − x_i β)² / (2(1 + τ²)) }
= [ 2π(1 + τ²) ]^(−n/2) exp{ −(y − Xβ)ᵗ Σ⁻¹ (y − Xβ)/2 },
where
y = (y_1, …, y_n)ᵗ, X is the n×p matrix with rows x_i, Σ = diag(1 + τ², …, 1 + τ²).
The β̂ and τ̂² maximizing m^π(y) also minimize (y − Xβ)ᵗ Σ⁻¹ (y − Xβ). Thus,
β̂ = ( Xᵗ Σ̂⁻¹ X )⁻¹ Xᵗ Σ̂⁻¹ y,
τ̂² = Σ_{i=1}^n (y_i − x_i β̂)² / n − 1,
where Σ̂ = diag(1 + τ̂², …, 1 + τ̂²). Since β̂ involves τ̂² and τ̂²
involves β̂, we have to use an iterative scheme to solve for them.
2. The posterior distribution of θ_i is
f(θ_i | y_i) ~ N( (τ² y_i + x_i β)/(1 + τ²), τ²/(1 + τ²) ).
Similar to the previous example, the empirical Bayes estimates are
θ̂_i^EB = y_i − Ĉ_i ( y_i − x_i β̂ ),
V̂_i^EB = 1 − ( (n − l̂_i − 2)/n ) Ĉ_i + ( 2/(n − l̂_i − 2) ) Ĉ_i² ( y_i − x_i β̂ )²,
where
Ĉ_i = ( (n − l̂_i − 2)/(n − l̂_i) ) · 1/(1 + τ̂²)
and
l̂_i = n [ X ( Xᵗ Σ̂⁻¹ X )⁻¹ Xᵗ ]_{ii} / (1 + τ̂²).
(c) Non-parametric empirical Bayes analysis
Example 22:
X_1, X_2, …, X_p, X_i ~ P(θ_i) (Poisson), where the θ_i are i.i.d. from a common prior π_0.
Under squared-error loss, the estimate is the posterior mean,
δ^{π_0}(x_i) = E^{f(θ_i | x_i)}[θ_i] = ∫ θ_i f(θ_i | x_i) dθ_i
= ∫ θ_i [ f(x_i | θ_i) π_0(θ_i) / m^{π_0}(x_i) ] dθ_i
= (1/m^{π_0}(x_i)) ∫ θ_i ( θ_i^{x_i} e^{−θ_i} / x_i! ) π_0(θ_i) dθ_i
= ( (x_i + 1)/m^{π_0}(x_i) ) ∫ ( θ_i^{x_i + 1} e^{−θ_i} / (x_i + 1)! ) π_0(θ_i) dθ_i
= ( (x_i + 1) m^{π_0}(x_i + 1)/m^{π_0}(x_i) ) ∫ f(θ_i | x_i + 1) dθ_i
= (x_i + 1) m^{π_0}(x_i + 1) / m^{π_0}(x_i).
Further, we can estimate the marginal distribution by the empirical
distribution
m̂^{π_0}(x) = Σ_{i=1}^p I_x(x_i) / p, I_x(x_i) = 1 if x_i = x, 0 otherwise.
Thus, the empirical Bayes estimate for θ_i is
δ̂^EB(x_i) = (x_i + 1) Σ_{j=1}^p I_{x_i + 1}(x_j) / Σ_{j=1}^p I_{x_i}(x_j).
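Robbins' nonparametric EB estimator from Example 22 takes only a few lines; the count data below are hypothetical:

```python
import numpy as np
from collections import Counter

def robbins_npeb(x):
    """Nonparametric EB (Robbins) estimates for Poisson means:
    delta(x_i) = (x_i + 1) * #{j : x_j = x_i + 1} / #{j : x_j = x_i}."""
    x = np.asarray(x, int)
    counts = Counter(x.tolist())
    return np.array([(xi + 1) * counts.get(xi + 1, 0) / counts[xi] for xi in x])

# hypothetical counts
print(robbins_npeb([0, 0, 1, 1, 1, 2, 3]))
```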
Chapter 4 Bayesian Computation (Monte
Carlo Methods)
Let y = (y_1, y_2, …, y_n) be the observed data and let
r = E[g(θ)] = ∫ g(θ) f(θ | y) dθ
be of interest. Sometimes it might be very difficult to find an explicit
form for the above integral. In such cases, the Monte Carlo method is
an alternative choice.
(I) Direct sampling
Direct sampling generates θ_1, θ_2, …, θ_N from f(θ | y)
and then uses
r̂ = Σ_{i=1}^N g(θ_i) / N
to estimate r. Note that N is usually large. The variance of r̂ is
Var(r̂) = Var(g(θ)) / N.
Thus, the standard error of r̂ is
s.e.(r̂) = [ Σ_{i=1}^N ( g(θ_i) − r̂ )² / ( N(N − 1) ) ]^(1/2).
An approximate 95% confidence interval for r is
r̂ ± 2 s.e.(r̂).
Further, the estimate of
p = P( a ≤ g(θ) ≤ b | y )
is
p̂ = #{ i : g(θ_i) ∈ [a, b] } / N.
The standard error of p̂ is
s.e.(p̂) = [ p̂(1 − p̂)/N ]^(1/2).
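A minimal direct-sampling sketch, assuming (purely for illustration) that the posterior is N(4.5, 0.2²) and g(θ) = θ²:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000
theta = rng.normal(4.5, 0.2, size=N)    # draws from the (assumed) posterior
g = theta ** 2
r_hat = g.mean()
se = g.std(ddof=1) / np.sqrt(N)
print(r_hat, (r_hat - 2 * se, r_hat + 2 * se))   # estimate and ~95% CI
```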
(II) Indirect sampling
When θ_1, θ_2, …, θ_N cannot be generated from the posterior directly,
the following sampling methods can be used:
(a) importance sampling
(b) rejection sampling
(c) the weighted bootstrap
(a) Importance sampling
Since
r = E[ g(θ) | y ] = ∫ g(θ) f(θ | y) dθ
= ∫ g(θ) π(θ) f(y | θ) dθ / ∫ π(θ) f(y | θ) dθ
= ∫ g(θ) w(θ) h(θ) dθ / ∫ w(θ) h(θ) dθ
≈ [ (1/N) Σ_{i=1}^N g(θ_i) w(θ_i) ] / [ (1/N) Σ_{i=1}^N w(θ_i) ]
= Σ_{i=1}^N g(θ_i) w(θ_i) / Σ_{i=1}^N w(θ_i),
where
w(θ) = π(θ) f(y | θ) / h(θ)
and h(θ) is a density from which data can be generated easily; it is
generally chosen to approximate the posterior density, i.e.,
h(θ) ≈ c π(θ) f(y | θ), c ∈ R.
h(θ) is called the importance function. Writing the estimate as the ratio
X̄/Ȳ, Var(X̄/Ȳ) can be estimated by
s_X²/ȳ² + x̄² s_Y²/ȳ⁴ − 2 x̄ s_XY/ȳ³,
where
x̄ = Σ_{i=1}^N x_i / N, ȳ = Σ_{i=1}^N y_i / N,
s_X² = Σ_{i=1}^N (x_i − x̄)² / (N − 1), s_Y² = Σ_{i=1}^N (y_i − ȳ)² / (N − 1),
s_XY = Σ_{i=1}^N (x_i − x̄)(y_i − ȳ) / (N − 1).
The accuracy of importance sampling can be estimated by
plugging in x_i = g(θ_i) w(θ_i) and y_i = w(θ_i).
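A minimal importance-sampling sketch for a toy posterior kernel (standard normal prior, one N(θ, 1) observation y = 1.5, importance function h = t with 3 degrees of freedom; all of these choices are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

y_obs = 1.5
kernel = lambda t: stats.norm.pdf(t) * stats.norm.pdf(y_obs, loc=t)

N = 50_000
theta = stats.t.rvs(df=3, size=N, random_state=rng)  # draws from h
w = kernel(theta) / stats.t.pdf(theta, df=3)         # importance weights
r_hat = np.sum(theta * w) / np.sum(w)                # estimate of E[theta | y]
print(r_hat)   # should be near the exact posterior mean y_obs / 2 = 0.75
```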
(b) Rejection sampling
Let h(θ) be a density from which data can be generated easily; it is
generally chosen to approximate the posterior density. In
addition, there is a finite known constant M > 0 such that
π(θ) f(y | θ) ≤ M h(θ)
for every θ. The steps for rejection sampling are:
1. Generate θ_j from h(θ).
2. Generate u_j, independent of θ_j, from U ~ Uniform(0, 1).
3. If
u_j ≤ π(θ_j) f(y | θ_j) / ( M h(θ_j) ),
accept θ_j; otherwise reject θ_j.
4. Repeat steps 1–3 until the desired sample (of accepted θ_j) θ_1, θ_2, …, θ_N
is obtained. Note that θ_1, θ_2, …, θ_N will be data generated from the
posterior density. Then,
r̂ = Σ_{i=1}^N g(θ_i) / N.
Note:
P( θ ≤ θ_j ) = P( θ ≤ θ_j | U ≤ π(θ) f(y | θ)/(M h(θ)) )
= ∫_{−∞}^{θ_j} ∫₀^{π(θ) f(y|θ)/(M h(θ))} h(θ) du dθ / ∫_{−∞}^{∞} ∫₀^{π(θ) f(y|θ)/(M h(θ))} h(θ) du dθ
= ∫_{−∞}^{θ_j} π(θ) f(y | θ) dθ / ∫ π(θ) f(y | θ) dθ.
Differentiation with respect to θ_j yields
π(θ_j) f(y | θ_j) / ∫ π(θ) f(y | θ) dθ, the posterior
density function evaluated at θ_j.
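A rejection-sampling sketch for the same toy kernel; the envelope constant M = 0.5 was chosen conservatively so that π(θ)f(y | θ) ≤ M h(θ) holds everywhere:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

y_obs = 1.5
kernel = lambda t: stats.norm.pdf(t) * stats.norm.pdf(y_obs, loc=t)
h = lambda t: stats.t.pdf(t, df=3)
M = 0.5   # large enough that kernel(t) <= M * h(t) for all t

theta = stats.t.rvs(df=3, size=200_000, random_state=rng)
u = rng.uniform(size=theta.size)
accepted = theta[u <= kernel(theta) / (M * h(theta))]
print(accepted.mean())   # near the exact posterior mean 0.75
```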
(c) Weighted bootstrap
It is very similar to the importance sampling method. The steps are as
follows:
1. Generate θ_1, θ_2, …, θ_N from h(θ).
2. Draw θ* from the discrete distribution over {θ_1, θ_2, …, θ_N} which puts
mass
q_i = w_i / Σ_{j=1}^N w_j, w_i = π(θ_i) f(y | θ_i) / h(θ_i),
at θ_i; repeating this N times gives θ*_1, θ*_2, …, θ*_N. Then,
r̂ = Σ_{i=1}^N g(θ*_i) / N.
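The same toy example via the weighted bootstrap (sampling–importance resampling):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

y_obs = 1.5
kernel = lambda t: stats.norm.pdf(t) * stats.norm.pdf(y_obs, loc=t)

N = 50_000
theta = stats.t.rvs(df=3, size=N, random_state=rng)  # draws from h = t(3)
w = kernel(theta) / stats.t.pdf(theta, df=3)
theta_star = rng.choice(theta, size=N, replace=True, p=w / w.sum())
print(theta_star.mean())   # again near 0.75
```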
(III) Markov chain Monte Carlo method
There are several Markov chain Monte Carlo methods. One of the
commonly used methods is the Metropolis–Hastings algorithm. Let θ
be generated from h(θ) ∝ π(θ) f(y | θ), which is needed only up to a
proportionality constant. Given an auxiliary function q(θ, θ') such
that q(θ, ·) is a probability density function and q(θ, θ') = q(θ', θ),
the Metropolis algorithm is as follows:
1. Draw θ* from the p.d.f. q(θ_j, ·), where θ_j is the current
state of the Markov chain.
2. Compute the odds ratio
ρ = h(θ*) / h(θ_j).
3. If ρ ≥ 1, then θ_{j+1} = θ*.
If ρ < 1, then
θ_{j+1} = θ* with probability ρ, θ_j with probability 1 − ρ.
4. Repeat steps 1–3 until the desired sample θ_1, θ_2, …, θ_N
is obtained. Note that θ_1, θ_2, …, θ_N will (approximately) be data
generated from the posterior density. Then,
r̂ = Σ_{i=1}^N g(θ_i) / N.
Note:
For the Metropolis algorithm, under mild conditions, θ_j converges in
distribution to the posterior distribution as j → ∞.
Note:
q(θ, θ') is called the candidate or proposal density. The most
commonly used q(θ, θ') is the multivariate normal distribution.
Note:
Hastings (1970) redefined
ρ = h(θ*) q(θ*, θ_j) / ( h(θ_j) q(θ_j, θ*) ),
where q(θ, θ') is not necessarily symmetric.
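A minimal Metropolis sketch for the same toy kernel, using a symmetric normal proposal (the step size 1.0 and burn-in length are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

y_obs = 1.5
log_h = lambda t: stats.norm.logpdf(t) + stats.norm.logpdf(y_obs, loc=t)

n_iter, step = 50_000, 1.0
chain = np.empty(n_iter)
theta = 0.0
for j in range(n_iter):
    proposal = theta + step * rng.normal()
    # accept with probability min(1, h(proposal)/h(theta))
    if np.log(rng.uniform()) < log_h(proposal) - log_h(theta):
        theta = proposal
    chain[j] = theta

print(chain[5_000:].mean())   # after burn-in, near 0.75
```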
Appendix: Exams
Midterm
2003. 4. 18
1. (30%)
Let X have a Binomial(1, θ) distribution, Θ = [0, 1], A = [0, ∞). The loss
function is L(θ, a) = (θ − a)². Consider decision rules of the form δ(x) = cx.
Assume the prior distribution is Beta(2, 2).
(a) Calculate the Bayesian (prior) expected loss ρ(π, a), and find the Bayes
(prior) action. (Note that this is the optimal Bayes action for the
no-data problem in which x is not observed.)
(b) Find R(θ, δ).
(c) Show that δ is inadmissible if c > 1.
(d) Find the Bayesian (posterior) expected loss E^{f(θ|x)}[L(θ, δ(x))] given
x = 1, and find the Bayesian (posterior) action.
(e) Find r(π, δ).
(f) Find the value of c which minimizes r(π, δ).
2. (15%)
The loss matrix for some business actions is given in the following table.

        a_1    a_2
θ_1      0      3
θ_2      1      2
θ_3      3      0

The prior belief is that π(θ_1) = π(θ_2) = π(θ_3) = 1/3.
(a) Determine if each action is admissible or inadmissible.
(b) What is the Bayes action?
(c) What is the minimax action?
3. (40%)
Let X_1, X_2, …, X_5 ~ Negative Binomial(1, θ) and the prior distribution be
Beta(1, 1). Find the Bayes rule (or Bayes estimate) under loss
(a) L(θ, a) = (θ − a)²
(b) L(θ, a) = |θ − a|, given x_1 = 0, x_2 = 0, x_3 = 0, x_4 = 0, x_5 = 0
(c) L(θ, a) = (θ − a)² / ( θ(1 − θ) )
(d) L(θ, a) = { θ − a, if θ ≥ a; 3(a − θ), if θ < a }, given x_1 = 0, x_2 = 0, x_3 = 0, x_4 = 0, x_5 = 0
( Z ~ Negative Binomial(m, θ): f(z | θ) = C(m + z − 1, z) θ^m (1 − θ)^z, z = 0, 1, 2, … )
4. (40%)
If X ~ Exponential(θ) and
π(θ) = ( 1/(Γ(α) β^α) ) θ^(−(α+1)) e^(−1/(βθ)) ~ Inverse Gamma(α, β):
(a) Find the posterior mean.
(b) If the loss function is L(θ, a) = (θ − a)²/θ, find the Bayes estimate.
(c) If the loss function is L(θ, a) = θ(θ − a)², find the Bayes estimate.
(d) Find the generalized maximum likelihood estimate of θ.
( If a random variable Z ~ Inverse Gamma(α, β), then
E[Z] = 1/( β(α − 1) ), Var(Z) = 1/( β²(α − 1)²(α − 2) ), α ≥ 2;
1/Z ~ Gamma(α, β). )
Quiz
2003. 05. 27
1. (60%)
Observations y_i, i = 1, …, n, are independent given two parameters β
and σ², and normally distributed with respective means βx_i and
common variance σ², where the x_i are specified explanatory variables.
(a) Suppose β and σ² are unknown and the prior is π(β, σ²) = 1/σ².
• Find the marginal posterior density of σ².
• Find the posterior mean based on the above marginal posterior.
• Find the generalized maximum likelihood estimate of σ² based on
the marginal posterior.
(b) Suppose σ² = 1 is known and the prior is π(β) = 1.
• Find the marginal posterior density of β.
• Find the generalized maximum likelihood estimate of β based on
the above marginal posterior.
• For the following data,

x_i:  1  −1  0  −1  1
y_i:  3   2  0   3  2

and the hypothesis H_0: β ≤ 0.98 vs. H_1: β > 0.98, find the posterior
probability of H_0 and the posterior odds ratio.
2. (15%)
X_1, X_2, …, X_10 ~ Poisson(θ), π(θ) = 1/θ, 0 < θ < ∞. The observed
data are
0 1 1 0 0 0 1 1 1 0
Find the approximate 90% HPD credible set, using the normal
approximation to the posterior.
3. (20%)
Find the noninformative prior for the parameters:
• y_i ~ Exponential(θ), i = 1, …, n.
• y_i ~ N(μ, σ²), i = 1, …, n; μ and σ² are the parameters of
interest.
4. (30%)
A large shipment of parts is received, out of which 10 are tested for
defects. The number of defective parts, X, is assumed to have a
Binomial(10, θ) distribution. From past shipments, it is known that θ ~ Beta(1, 10). Find
the Bayes rule (or Bayes estimate) under loss
(a) L(θ, a) = (θ − a)²
(b) L(θ, a) = |θ − a|, x = 0
(c) L(θ, a) = (θ − a)² / ( θ(1 − θ) ), x = 0
(d) L(θ, a) = { θ − a, if θ ≥ a; 2(a − θ), if θ < a }
Final
2003. 06. 19
1. (20%)
Find the noninformative prior for the parameters:
• y_i ~ Poisson(θ), i = 1, …, n.
• y_i ~ N(μ, σ²), i = 1, …, n; μ and σ² are the parameters of
interest.
2. (30%)
Suppose that X ~ N(θ, 1), π_0(θ) ~ N(0, 2.19), and
Γ = { π : π = 0.8 π_0 + 0.2 q, q ∈ L }.
(a) If L is the class of all distributions, find the ML-II prior for any given
x.
(b) If L = { q : q ~ N(0, τ²), τ² ≤ 1 }, find the ML-II prior for any
given x.
3. (30%)
(a) Assume X ~ Binomial(3, θ) with the improper prior density
π(θ) = ( 1/( θ(1 − θ) ) ) I_(0,1)(θ).
As x = 3, find the 100(1 − α)% HPD credible set.
(b) The waiting time for a bus at a given corner at a certain time of day is
known to have a Uniform(0, θ) distribution. It is desired to test
H_0: 0 < θ ≤ 10 versus H_1: θ > 10. From other similar routes, it is known
that θ has a Pareto(6, 3) distribution. If waiting times of 7, 8, 9, 10, 11,
and 12 are observed at the given corner, calculate the posterior
probability of H_0: 0 < θ ≤ 10, the posterior odds ratio, and the Bayes
factor.
4. (30%)
(a) Suppose that X_1, X_2, …, X_n ~ Negative Binomial(m, θ), and
the prior distribution is Beta(α, β). Find the posterior distribution of
θ, the posterior mean, the posterior variance, the generalized MLE, and
the posterior variance of the generalized MLE.
(b) Suppose that X = (X_1, …, X_k)ᵗ ~ Multinomial(n; p_1, …, p_k), and that
p = (p_1, …, p_k)ᵗ ~ Dirichlet(α_1, …, α_k). Find the
posterior mean and the generalized MLE.
5. (15%)
An insurance company is faced with taking one of the following 3 actions:
a_1: increase sales force by 10%; a_2: maintain present sales force; a_3:
decrease sales force by 10%.
Depending on whether the economy is good (θ_1), mediocre (θ_2), or
bad (θ_3), the company would expect to lose the following amounts of
money in each case:

        a_1    a_2    a_3
θ_1     −90    −40    −20
θ_2     −50    −40    −30
θ_3      20     10      0

(a) Determine if each action is admissible or inadmissible.
(b) The company believes that θ has the probability distribution
π(θ_1) = 0.2, π(θ_2) = 0.3, π(θ_3) = 0.5. Order the actions according to their
Bayesian expected loss (equivalent to Bayes risk, here), and state the
Bayes action.
(c) Order the actions according to the minimax principle and find the
minimax action.