
Statistics 550 Notes 3
Reading: Section 1.3
I. Background for Problem 1.1.9 in Homework 1.
The model
$$Y_i = \sum_{j=1}^{p} z_{ij}\beta_j + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2),$$
is the multiple linear regression model. We have $E(Y_i \mid z_{i1}, \ldots, z_{ip}) = \sum_{j=1}^{p} z_{ij}\beta_j$. The coefficients $\beta_j$ can be interpreted as the change in the mean of $Y$ that is associated with a one-unit change in $z_j$ when $z_1, \ldots, z_{j-1}, z_{j+1}, \ldots, z_p$ are held fixed.
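As an illustration of this interpretation, here is a minimal simulation sketch in Python: data are generated from a two-predictor linear model and fit by least squares, and the fitted coefficients recover the per-unit changes in the mean of $Y$ holding the other predictor fixed (the sample size, coefficient values, and noise level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two predictors and an intercept column (p = 3 in the notation above).
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
Z = np.column_stack([np.ones(n), z1, z2])

# True coefficients (arbitrary illustrative values) and N(0, sigma^2) errors.
beta = np.array([1.0, 2.0, -0.5])
sigma = 1.0
Y = Z @ beta + rng.normal(scale=sigma, size=n)

# Least squares fit; beta_hat[j] estimates the change in E[Y] per unit
# change in the j-th predictor, holding the other predictors fixed.
beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -0.5]
```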
Example: The 1966 Coleman Report on “Equality of
Educational Opportunity” sought to explain how student
achievement in schools was associated with the resources
of the school and the socioeconomic background of the
student, e.g.,
Y = verbal achievement score in school (6th graders)
$z_1$ = staff salaries per pupil
$z_2$ = % of students in the 6th grade of the school whose father has a white-collar occupation
$z_3$ = SES (socioeconomic status)
$z_4$ = teachers’ average verbal scores
$z_5$ = mothers’ average education
The variables $z_1, z_2, z_3, z_4, z_5$ would be collinear if one variable were a linear function of the other variables. There
was concern that this was approximately true because
socioeconomic status was highly correlated with the
resources of the school (staff salaries per pupil) prior to the
desegregation of schools.
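A minimal sketch of how approximate collinearity can be detected numerically, using a simulated design matrix in which one column is nearly a linear function of another (all values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200

z1 = rng.normal(size=n)
z3 = 0.9 * z1 + 0.01 * rng.normal(size=n)   # nearly a linear function of z1
Z = np.column_stack([np.ones(n), z1, z3])

# A very large condition number signals approximate collinearity,
# which makes the least squares coefficients unstable.
print(np.linalg.cond(Z))
```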
II. Bayesian Inference for the Normal Distribution
Suppose $X_1, \ldots, X_n$ are iid $N(\theta, \sigma^2)$ with $\sigma^2$ known, and our prior on $\theta$ is $N(\eta, b^2)$ (prior mean $\eta$, prior variance $b^2$).
The posterior distribution is proportional to
$$f(x \mid \theta)\,\pi(\theta) \propto \left[\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x_i-\theta)^2}{2\sigma^2}\right\}\right]\frac{1}{\sqrt{2\pi}\,b}\exp\left\{-\frac{(\theta-\eta)^2}{2b^2}\right\}$$
$$\propto \exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i^2 - 2n\bar{x}\theta + n\theta^2\right) - \frac{1}{2b^2}\left(\theta^2 - 2\theta\eta + \eta^2\right)\right\}$$
$$\propto \exp\left\{-\frac{1}{2}\left[\theta^2\left(\frac{n}{\sigma^2} + \frac{1}{b^2}\right) - 2\theta\left(\frac{n\bar{x}}{\sigma^2} + \frac{\eta}{b^2}\right)\right]\right\}$$
$$\propto \exp\left\{-\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{b^2}\right)\left(\theta - \frac{n\bar{x}/\sigma^2 + \eta/b^2}{n/\sigma^2 + 1/b^2}\right)^2\right\}.$$
Thus, the posterior distribution is
$$\theta \mid x \sim N\left(\frac{n\bar{x}/\sigma^2 + \eta/b^2}{n/\sigma^2 + 1/b^2},\; \frac{1}{n/\sigma^2 + 1/b^2}\right).$$
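As a numerical illustration (a minimal sketch; the data values, true mean, and prior settings below are arbitrary assumptions, not from the notes), the posterior mean and variance can be computed as:

```python
import numpy as np

def normal_posterior(x, sigma2, eta, b2):
    """Posterior N(mean, var) for theta given x_1..x_n iid N(theta, sigma2),
    sigma2 known, and prior theta ~ N(eta, b2)."""
    n = len(x)
    xbar = np.mean(x)
    precision = n / sigma2 + 1.0 / b2                 # n/sigma^2 + 1/b^2
    post_mean = (n * xbar / sigma2 + eta / b2) / precision
    post_var = 1.0 / precision
    return post_mean, post_var

# Illustrative values: data simulated with theta = 3, prior centered at 0.
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=20)
print(normal_posterior(x, sigma2=4.0, eta=0.0, b2=1.0))
```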
III. Bickel and Doksum’s perspective on Bayesian models.
(a) Bayesian models are useful as a way to generate
statistical procedures which incorporate prior information
when appropriate.
However, statistical procedures should be evaluated in a
frequentist (repeated sampling) way:
For example, for the iid Bernoulli trials example, if we use
the posterior mean with a uniform prior distribution to
estimate $p$, i.e., $\hat{p} = \dfrac{1 + \sum_{i=1}^{n} x_i}{n+2}$, then we should look at how this estimate would perform in many repetitions of the experiment when the true parameter is $p$, for various values of $p$. More to come on this frequentist perspective in Chapter 1.3.
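A minimal simulation sketch of this repeated-sampling evaluation (the choices of $n$, the number of repetitions, and the grid of $p$ values are arbitrary), comparing the uniform-prior posterior mean with the sample proportion by mean squared error:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 25, 100_000

for p in [0.1, 0.3, 0.5]:
    s = rng.binomial(n, p, size=reps)                   # successes in each repetition
    mse_mle = np.mean((s / n - p) ** 2)                 # sample proportion
    mse_bayes = np.mean(((s + 1) / (n + 2) - p) ** 2)   # uniform-prior posterior mean
    print(p, mse_mle, mse_bayes)
```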
(b) We can view the parameter as random and view the
Bayesian model as providing a joint probability distribution
on the parameter and the data.
Consider the model $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ for the probability distribution of the data $X$.
The subjective Bayesian perspective is that there is a true unknown $\theta$ and our goal is to describe our beliefs about $\theta$ after seeing the data $X$. This requires specifying a prior distribution $\pi(\theta)$ for our beliefs about $\theta$; the posterior distribution describes our beliefs about $\theta$ after seeing the data $X$.
Bickel and Doksum’s viewpoint is to see the Bayesian model $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$, where we put a prior distribution $\pi(\theta)$ on $\theta$, as providing a joint distribution for the parameter and the data $(\theta, X)$, e.g., if the data $X_1, \ldots, X_n$ are iid Bernoulli trials with probability $p$ of success and the prior distribution for $p$ is Beta($r$,$s$), then the joint probability distribution for $(p, X_1, \ldots, X_n)$ is generated by:
1. We first generate p from a Beta(r,s) distribution.
2. Conditional on $p$, we generate $X_1, \ldots, X_n$ iid Bernoulli trials with probability $p$ of success.
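A minimal sketch of this two-step generation (the values of $r$, $s$, and $n$ are arbitrary illustrative choices):

```python
import numpy as np

def draw_joint(r, s, n, rng):
    """Draw (p, X_1, ..., X_n) from the Bayesian joint distribution:
    p ~ Beta(r, s), then X_1..X_n iid Bernoulli(p) given p."""
    p = rng.beta(r, s)
    x = rng.binomial(1, p, size=n)
    return p, x

rng = np.random.default_rng(3)
p, x = draw_joint(r=2, s=2, n=10, rng=rng)
print(p, x)
```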
IV. Motivating Example for Decision Theoretic
Framework (Section 1.3)
A cofferdam protecting a construction site was designed to
withstand flows of up to 1870 cubic feet per second (cfs).
An engineer wishes to estimate the probability that the dam
will be overtopped during the upcoming year. Over the
previous 25 years, the annual maximum flood level at the dam has exceeded 1870 cfs 5 times. The engineer models the data on whether the flood level has exceeded
1870 cfs as independent Bernoulli trials with the same
probability p that the flood level will exceed 1870 cfs in
each year.
Some possible estimates of $p$ based on iid Bernoulli trials $X_1, \ldots, X_n$:
(1) $\hat{p} = \dfrac{\sum_{i=1}^{n} X_i}{n}$;
(2) $\hat{p} = \dfrac{1 + \sum_{i=1}^{n} X_i}{n+2}$, the posterior mean for a uniform prior on $p$;
(3) $\hat{p} = \dfrac{2 + \sum_{i=1}^{n} X_i}{n+4}$, the posterior mean for a Beta(2,2) prior on $p$ (called the Wilson estimate, recommended by Moore and McCabe, Introduction to the Practice of Statistics).
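For the cofferdam data ($n = 25$ years, 5 exceedances), a quick computation of the three estimates as a sketch:

```python
n, successes = 25, 5

p_hat_mle = successes / n                  # (1) sample mean of the Bernoulli trials
p_hat_uniform = (1 + successes) / (n + 2)  # (2) posterior mean, uniform prior
p_hat_wilson = (2 + successes) / (n + 4)   # (3) posterior mean, Beta(2,2) prior (Wilson)

print(p_hat_mle, p_hat_uniform, p_hat_wilson)  # 0.2, ~0.222, ~0.241
```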
How should we decide which of these estimates to use?
The answer depends in part on how errors in the estimation
of p affect us.
Example 1 of decision problem: The firm wants the engineer to provide her best “guess” of $p$, the probability of an overflow, i.e., to estimate $p$ by $\hat{p}$. The firm wants the probability of an overflow to be at most 0.05. Based on the estimate $\hat{p}$ of $p$, the engineer’s firm plans to spend an additional $f(\max(0, \hat{p} - 0.05))$ dollars to shore up the dam, where $f(0) = 0$ and $f$ is an increasing function. By spending this money, the firm will make the probability of an overflow be $\max(0, p - \max(0, \hat{p} - 0.05))$. The cost of an overflow to the firm is $C$ dollars. The expected cost to the firm of using an estimate $\hat{p}$ (for a true initial probability of overflow of $p$) is
$$f(\max(0, \hat{p} - 0.05)) + C \cdot \max(0, p - \max(0, \hat{p} - 0.05)).$$
We want to choose an estimate which provides low expected cost.
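A minimal sketch of this expected-cost calculation; the shoring-cost function $f$, the overflow cost $C$, and the probability values below are illustrative assumptions, not values from the notes:

```python
def expected_cost(p_hat, p, C, f):
    """Expected cost of acting on estimate p_hat when the true
    initial overflow probability is p."""
    spend = f(max(0.0, p_hat - 0.05))                      # money spent shoring up the dam
    residual_prob = max(0.0, p - max(0.0, p_hat - 0.05))   # overflow prob. after shoring
    return spend + C * residual_prob

# Illustrative: f linear in the excess probability, C = $1,000,000.
f = lambda x: 2_000_000 * x
print(expected_cost(p_hat=0.20, p=0.25, C=1_000_000, f=f))
print(expected_cost(p_hat=0.10, p=0.25, C=1_000_000, f=f))
```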
Example 2 of decision problem: Another decision problem besides estimating $p$ might be that the firm wants to decide whether $p \le 0.15$ or $p > 0.15$; if $p > 0.15$, the firm would like to build additional support for the dam. This is an example of a testing problem of deciding whether a parameter lies in one of two subsets that form a partition of the parameter space. The cost to the firm of making the wrong decision about whether $p \le 0.15$ or $p > 0.15$ depends on what type of error was made (deciding that $p > 0.15$ when in fact $p \le 0.15$, or deciding that $p \le 0.15$ when in fact $p > 0.15$).
The decision theoretic framework involves:
(1) clarifying the objectives of the study;
(2) pointing to what the different possible actions are;
(3) providing assessments of risk, accuracy, and reliability of statistical procedures;
(4) providing guidance in the choice of procedures for
analyzing outcomes of experiments.
History: Abraham Wald (1950, Statistical Decision
Functions) developed the foundations of the decision
theoretic framework for statistics.
V. Components of the Decision Theory Framework
(Section 1.3.1)
As in Section 1.1, we observe data X from a distribution
$P$, where we do not know the true $P$ but only know that $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$ (the statistical model).
The true parameter vector $\theta$ is sometimes called the “state
of nature.”
Action space: The action space A is the set of possible
actions, decisions or claims that we can contemplate
making after observing the data X .
For Example 1 of decision problem, the action space is the possible estimates of $p$, $\mathcal{A} = [0,1]$.
For Example 2 of decision problem, the action space is
{decide that $p \le 0.15$, decide that $p > 0.15$}.
Loss function: The loss function $l(\theta, a)$ is the loss incurred by taking the action $a$ when the true parameter vector is $\theta$.
The loss function is assumed to be nonnegative. We want
the loss to be small. The loss function can be thought of as
the negative of a utility function in economics.
Ideally, we choose the loss function based on the
economics of the decision problem as in Example 1 of
decision problem. However, more commonly, the loss
function is chosen to qualitatively reflect what we are
trying to do and to be mathematically convenient.
Commonly used loss functions for point estimates of a real-valued parameter $q(\theta)$:
Denote our estimate of $q(\theta)$ by $a$.
The most commonly used loss function is quadratic (squared error) loss: $l(\theta, a) = (q(\theta) - a)^2$.
Other choices that are less computationally convenient but that, perhaps more realistically, penalize large errors less are:
(1) absolute value loss, $l(\theta, a) = |q(\theta) - a|$;
(2) Huber’s loss functions,
$$l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k|q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k \end{cases}$$
for some constant $k$;
(3) zero-one loss function,
$$l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k \end{cases}$$
for some constant $k$.
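A minimal Python sketch of these loss functions (passing the target value $q(\theta)$ in directly; the constant $k$ and the example arguments are arbitrary):

```python
def quadratic_loss(q_theta, a):
    return (q_theta - a) ** 2

def absolute_loss(q_theta, a):
    return abs(q_theta - a)

def huber_loss(q_theta, a, k):
    # Quadratic near the target, linear for errors larger than k (no 1/2 factor,
    # matching the definition above); the two pieces agree at |error| = k.
    err = abs(q_theta - a)
    return err ** 2 if err <= k else 2 * k * err - k ** 2

def zero_one_loss(q_theta, a, k):
    # 0 if the estimate is within k of the target, 1 otherwise.
    return 0 if abs(q_theta - a) <= k else 1

print(quadratic_loss(0.2, 0.3), huber_loss(0.2, 0.9, k=0.5))
```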
Decision procedures: A decision procedure or decision rule specifies how we use the data to choose an action $a$. A decision procedure $\delta$ is a function $\delta(x)$ from the sample space of the experiment to the action space.
For Example 1 of decision problem, decision procedures include $\delta(X) = \dfrac{\sum_{i=1}^{n} X_i}{n}$ and $\delta(X) = \dfrac{1 + \sum_{i=1}^{n} X_i}{n+2}$.
Risk function: The loss of a decision procedure will vary over repetitions of the experiment because the data from the experiment $X$ is random. The risk function $R(\theta, \delta)$ is the expected loss from using the decision procedure $\delta$ when the true parameter vector is $\theta$:
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))].$$
Example: For quadratic loss in point estimation of $q(\theta)$, the risk function is the mean squared error:
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))] = E_\theta[(q(\theta) - \delta(X))^2].$$
This mean squared error can be decomposed as bias squared plus variance.
Proposition 3.1:
$$E_\theta[(q(\theta) - \delta(X))^2] = (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta\{(\delta(X) - E_\theta[\delta(X)])^2\}.$$
Proof: We have
$$E_\theta[(q(\theta) - \delta(X))^2] = E_\theta\left[\left(\{q(\theta) - E_\theta[\delta(X)]\} + \{E_\theta[\delta(X)] - \delta(X)\}\right)^2\right].$$
Expanding the square, the cross term vanishes because $E_\theta\{E_\theta[\delta(X)] - \delta(X)\} = 0$, so this equals
$$(q(\theta) - E_\theta[\delta(X)])^2 + E_\theta\left[\{E_\theta[\delta(X)] - \delta(X)\}^2\right] = \{\mathrm{Bias}[\delta(X)]\}^2 + \mathrm{Variance}[\delta(X)].$$
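A quick simulation check of this decomposition as a sketch, using a deliberately biased estimator ($\delta(X) = 0.9\bar{X}$ for $N(\theta, 1)$ data) so that both terms are visible; the values of $\theta$, $n$, and the number of repetitions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 10, 200_000

# delta(X) = 0.9 * sample mean, estimating q(theta) = theta (biased on purpose).
estimates = 0.9 * rng.normal(loc=theta, scale=1.0, size=(reps, n)).mean(axis=1)

mse = np.mean((theta - estimates) ** 2)
bias_sq = (theta - estimates.mean()) ** 2
variance = estimates.var()
print(mse, bias_sq + variance)  # the two quantities should agree closely
```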
Example: Suppose that an iid sample $X_1, \ldots, X_n$ is drawn from the uniform distribution on $[0, \theta]$, where $\theta$ is an unknown parameter and the density of $X_i$ is
$$f_X(x; \theta) = \begin{cases} \dfrac{1}{\theta} & 0 < x < \theta \\ 0 & \text{elsewhere.} \end{cases}$$
Several point estimators:
1. $W_1 = \max_i X_i$.
2. $W_2 = \left(\dfrac{n+1}{n}\right)\max_i X_i$. Note: Unlike $W_1$, $W_2$ is unbiased because
$$E(W_2) = \frac{n+1}{n}E(W_1) = \frac{n+1}{n}\cdot\frac{n}{n+1}\theta = \theta.$$
3. $W_3 = 2\bar{X}$. Note: $W_3$ is unbiased because
$$E[X] = \int_0^\theta x\,\frac{1}{\theta}\,dx = \frac{x^2}{2\theta}\Big|_0^\theta = \frac{\theta}{2}, \qquad E[W_3] = 2E[\bar{X}] = 2\cdot\frac{\theta}{2} = \theta.$$
Comparison of three estimators for the uniform example using the mean squared error criterion.

1. $W_1 = \max_i X_i$.
The sampling distribution of $W_1$ is
$$f_{W_1}(w_1) = \begin{cases} \dfrac{n w_1^{n-1}}{\theta^n} & 0 \le w_1 \le \theta \\ 0 & \text{elsewhere,} \end{cases}$$
and
$$E[W_1] = \int_0^\theta w_1 f_{W_1}(w_1)\,dw_1 = \int_0^\theta w_1 \frac{n w_1^{n-1}}{\theta^n}\,dw_1 = \frac{n w_1^{n+1}}{(n+1)\theta^n}\Big|_0^\theta = \frac{n}{n+1}\theta,$$
so
$$\mathrm{Bias}(W_1) = E[W_1] - \theta = -\frac{\theta}{n+1}.$$
To calculate $\mathrm{Var}(W_1)$, we calculate $E(W_1^2)$ and use the formula $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$:
$$E(W_1^2) = \int_0^\theta w_1^2 f_{W_1}(w_1)\,dw_1 = \int_0^\theta w_1^2 \frac{n w_1^{n-1}}{\theta^n}\,dw_1 = \frac{n w_1^{n+2}}{(n+2)\theta^n}\Big|_0^\theta = \frac{n}{n+2}\theta^2,$$
$$\mathrm{Var}(W_1) = \frac{n}{n+2}\theta^2 - \left(\frac{n}{n+1}\right)^2\theta^2 = \frac{n\theta^2}{(n+2)(n+1)^2}.$$
Thus,
$$\mathrm{MSE}(W_1) = \{\mathrm{Bias}(W_1)\}^2 + \mathrm{Var}(W_1) = \frac{\theta^2}{(n+1)^2} + \frac{n\theta^2}{(n+2)(n+1)^2} = \frac{2\theta^2}{(n+1)(n+2)}.$$
n 1
W

max i X i
2. 2
n
12
 n 


 n 1 
n 1
W

W1 .
Note 2
n
n 1
n 1 n
E
(
W
)

E
(
W
)

  ,

1
Thus,  2
n
n n 1
Bias (W2 )  0 and
n 1
 n 1 
Var (W2 )  Var (
W1 )  
 Var (W1 )
n
 n 
2
n
1
 n 1 
2



2

2
n(n  2)
 n  (n  2)(n  1)
Because W2 is unbiased,
1
MSE (W2 )  Var (W2 ) 
2
n(n  2)
2
3. $W_3 = 2\bar{X}$.
To find the mean squared error, we use the fact that if $X_1, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then $\bar{X} = \dfrac{X_1 + \cdots + X_n}{n}$ has mean $\mu$ and variance $\dfrac{\sigma^2}{n}$.
We have
$$E(X) = \int_0^\theta x\,\frac{1}{\theta}\,dx = \frac{x^2}{2\theta}\Big|_0^\theta = \frac{\theta}{2}, \qquad E(X^2) = \int_0^\theta x^2\,\frac{1}{\theta}\,dx = \frac{x^3}{3\theta}\Big|_0^\theta = \frac{\theta^2}{3},$$
$$\mathrm{Var}(X) = \frac{\theta^2}{3} - \left(\frac{\theta}{2}\right)^2 = \frac{\theta^2}{12}.$$
Thus, $E(\bar{X}) = \dfrac{\theta}{2}$, $\mathrm{Var}(\bar{X}) = \dfrac{\theta^2}{12n}$, and
$$E(W_3) = 2E(\bar{X}) = \theta, \qquad \mathrm{Var}(W_3) = 4\,\mathrm{Var}(\bar{X}) = \frac{\theta^2}{3n}.$$
$W_3$ is unbiased and has mean squared error $\dfrac{\theta^2}{3n}$.
The mean squared errors of the three estimators are the following:
$$\mathrm{MSE}(W_1) = \frac{2\theta^2}{(n+1)(n+2)}, \qquad \mathrm{MSE}(W_2) = \frac{\theta^2}{n(n+2)}, \qquad \mathrm{MSE}(W_3) = \frac{\theta^2}{3n}.$$
For $n = 1$, the three estimators have the same MSE. For $n > 1$,
$$\frac{1}{n(n+2)} < \frac{2}{(n+1)(n+2)} \le \frac{1}{3n},$$
with equality in the second inequality only when $n = 2$. So $W_2$ is best, $W_1$ is second best, and $W_3$ is the worst.
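A simulation sketch checking these MSE formulas (the values of $\theta$, $n$, and the number of repetitions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 1.0, 5, 200_000

x = rng.uniform(0, theta, size=(reps, n))
w1 = x.max(axis=1)                 # W1 = max_i X_i
w2 = (n + 1) / n * w1              # W2 = (n+1)/n * max_i X_i
w3 = 2 * x.mean(axis=1)            # W3 = 2 * Xbar

for name, w, formula in [
    ("W1", w1, 2 * theta**2 / ((n + 1) * (n + 2))),
    ("W2", w2, theta**2 / (n * (n + 2))),
    ("W3", w3, theta**2 / (3 * n)),
]:
    # Empirical MSE over the repetitions vs. the analytic formula.
    print(name, np.mean((w - theta) ** 2), formula)
```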