Notes 9 - Wharton Statistics Department

Statistics 512 Notes 9: The Monte Carlo Method
Continued
The Monte Carlo method:
Consider a function \(g(X)\) of a random vector \(X\), where \(X\) has density \(f(x)\). Consider the expected value of \(g(X)\):
$$E[g(X)] = \int g(x)\, f(x)\, dx.$$
Suppose we take an iid random sample \(X_1, \ldots, X_m\) from the density \(f(x)\). Then by the law of large numbers,
$$\frac{\sum_{i=1}^{m} g(X_i)}{m} \xrightarrow{P} E[g(X)].$$
The Monte Carlo method is to do a simulation to draw \(X_1, \ldots, X_m\) from the density \(f(x)\) and estimate \(E[g(X)]\) by
$$\hat{E}[g(X)] = \frac{\sum_{i=1}^{m} g(X_i)}{m}.$$
In a simulation, we can make m as large as we want.
The standard error of the estimate is
$$\hat{SE}[g(X)] = \frac{\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(g(X_i) - \hat{E}[g(X)]\right)^2}}{\sqrt{m}}.$$
By the Central Limit Theorem, an approximate 95%
confidence interval for E[ g ( X )] is
$$\hat{E}[g(X)] \pm 1.96\,\hat{SE}[g(X)].$$
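As a quick illustration (a sketch, not part of the original notes), here is this recipe in R for \(g(x) = x^2\) with \(X\) standard normal, where the true value is \(E[X^2] = \mathrm{Var}(X) = 1\):

```r
# Monte Carlo estimate of E[g(X)] for g(x) = x^2 with X ~ N(0,1).
# The true value is E[X^2] = Var(X) = 1.
set.seed(1)
m <- 100000
x <- rnorm(m)
g <- x^2
est <- mean(g)                          # Monte Carlo estimate of E[g(X)]
se <- sd(g)/sqrt(m)                     # standard error of the estimate
                                        # (sd uses m-1, negligibly different from m)
ci <- c(est - 1.96*se, est + 1.96*se)   # approximate 95% confidence interval
est
ci
```

With m = 100,000 the estimate should be within a few standard errors of 1.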
Example: Monte Carlo estimation of \(\pi\)
Define the unit square as a square centered at (0.5,0.5) with
sides of length 1 and the unit circle as the circle centered at
the origin with a radius of length 1. The ratio of the area of
the unit circle that lies in the first quadrant to the area of the
unit square is  / 4 .
Let \(U_1\) and \(U_2\) be iid uniform(0,1) random variables. Let \(g(U_1, U_2) = 1\) if \((U_1, U_2)\) is in the unit circle and 0 otherwise. Then \(E[g(U_1, U_2)] = \pi/4\).
Monte Carlo method: Repeat the experiment of drawing \(X = (U_1, U_2)\), with \(U_1\) and \(U_2\) iid uniform(0,1) random variables, \(m\) times and estimate \(\pi\) by
$$\hat{\pi} = 4\,\frac{\sum_{i=1}^{m} g(U_{i1}, U_{i2})}{m}.$$
An approximate 95% confidence interval for \(\pi\) is
$$\hat{\pi} \pm 1.96 \cdot 4\,\frac{\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(g(U_{i1}, U_{i2}) - \frac{1}{m}\sum_{i=1}^{m} g(U_{i1}, U_{i2})\right)^2}}{\sqrt{m}}. \quad (1)$$
Because \(g(U_1, U_2) = 0\) or 1, (1) is equivalent to
$$\hat{\pi} \pm 1.96 \cdot 4\,\sqrt{\frac{\hat{p}(1-\hat{p})}{m}},$$
where \(\hat{p} = \hat{\pi}/4\) is the proportion of points falling in the unit circle.
In R, the command runif(n) draws n iid uniform (0,1)
random variables.
Here is a function for estimating pi:
piest=function(m){
#
# Obtains the estimate of pi and an approximate 95%
# confidence interval for the simulation discussed in Example 5.8.1
#
# m is the number of simulations
#
# Draw u1, u2 iid uniform (0,1) random variables
u1=runif(m);
u2=runif(m);
cnt=rep(0,m);
# chk checks whether (u1,u2) is in the unit circle
chk=u1^2+u2^2-1;
# cnt[i]=1 if (u1[i],u2[i]) is in the unit circle
cnt[chk<0]=1;
# Estimate of pi
est=4*mean(cnt);
# Lower and upper 95% confidence interval endpoints
lci=est-1.96*4*(mean(cnt)*(1-mean(cnt))/m)^.5;
uci=est+1.96*4*(mean(cnt)*(1-mean(cnt))/m)^.5;
list(estimate=est,lci=lci,uci=uci);
}
> piest(100000)
$estimate
[1] 3.13912
$lci
[1] 3.128931
$uci
[1] 3.149309
Back to Example 5.8.5:
What is the true size of the nominal 0.05 size t-test for random samples of size 20 from the contaminated normal distribution A? We want to estimate
$$E[I\{t(X_1, \ldots, X_{20}) > 1.729\}].$$
Monte Carlo method:
$$\hat{E}[I\{t(X_1, \ldots, X_{20}) > 1.729\}] = \frac{\sum_{i=1}^{m} I\{t(x_{i,1}, \ldots, x_{i,20}) > 1.729\}}{m},$$
where \((x_{i,1}, \ldots, x_{i,20})\) is a random sample of size 20 from the contaminated normal distribution A.
[Here \(X = (X_1, \ldots, X_{20})\), \(f(x)\) is the density of a random sample of size 20 from the contaminated normal distribution A, and \(g(X) = I\{t(X_1, \ldots, X_{20}) > 1.729\}\).]
How to draw a random observation from the contaminated
normal distribution A?
(1) Draw a Bernoulli random variable B with p=0.25;
(2) If B=0, draw a random observation from the
standard normal distribution. If B=1, draw a
random observation from the normal distribution
with mean 0 and standard deviation 25.
In R, the command rnorm(n,mean=0,sd=1) draws a random
sample of size n from the normal distribution with the
specified mean and SD. The command rbinom(n,size=1,p)
draws a random sample of size n from the Bernoulli distribution with probability of success p.
R function for obtaining the Monte Carlo estimate \(\hat{E}[I\{t(X_1, \ldots, X_{20}) > 1.729\}]\):
empalphacn=function(nsims){
#
# Obtains the empirical level of the test discussed in
# Example 5.8.5
#
# nsims is the number of simulations
#
sigmac=25; # SD when observation is contaminated
probcont=.25; # Probability of contamination
alpha=.05; # Significance level for t-test
n=20; # Sample size
tc=qt(1-alpha,n-1); # Critical value for t-test
ic=0; # ic will count the number of times the t-test rejects
for(i in 1:nsims){
# Bernoulli random variables which determine whether each
# observation in the sample is from the standard normal or
# the normal with SD sigmac
b=rbinom(n,size=1,prob=probcont);
# Sample observations from the standard normal when b=0 and
# the normal with SD sigmac when b=1
samp=rnorm(n,mean=0,sd=1+b*24);
# Calculate the t-statistic for testing mu=0 based on the sample
tstat=mean(samp)/(var(samp)^.5/n^.5);
# Check if we reject the null hypothesis for the t-test
if(tstat>tc){
ic=ic+1;
}
}
# Estimated true significance level equals the proportion of
# rejections
empalp=ic/nsims;
# Standard error for the estimate of the true significance level
se=((empalp*(1-empalp))/nsims)^.5;
lci=empalp-1.96*se;
uci=empalp+1.96*se;
list(empiricalalpha=empalp,lci=lci,uci=uci);
}
> empalphacn(100000)
$empiricalalpha
[1] 0.04086
$lci
[1] 0.039633
$uci
[1] 0.042087
Based on these results the nominal 0.05 size t-test appears
to be slightly conservative when a sample of size 20 is
drawn from this contaminated normal distribution.
Generating random observations with given cdf F
Theorem 5.8.1: Suppose the random variable U has a uniform(0,1) distribution. Let F be the cdf of a random variable that is strictly increasing on some interval I, where F = 0 to the left of I and F = 1 to the right of I. Then the random variable \(X = F^{-1}(U)\) has cdf F, where \(F^{-1}(0)\) = left endpoint of I and \(F^{-1}(1)\) = right endpoint of I.
Proof: A uniform distribution on (0,1) has CDF \(F_U(u) = u\) for \(u \in (0,1)\). Using the fact that the CDF F is a strictly monotone increasing function on the interval I,
$$P[X \le x] = P[F^{-1}(U) \le x] = P[F(F^{-1}(U)) \le F(x)] = P[U \le F(x)] = F(x).$$
It is difficult to use this method to simulate random variables whose inverse CDF cannot be obtained in closed form.
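For example (a sketch, not in the original notes): the exponential(1) cdf \(F(x) = 1 - e^{-x}\) has the closed-form inverse \(F^{-1}(u) = -\log(1-u)\), so Theorem 5.8.1 simulates it directly from uniforms:

```r
# Inverse-transform sampling for the exponential(1) distribution:
# F(x) = 1 - exp(-x), so F^{-1}(u) = -log(1 - u).
set.seed(2)
u <- runif(100000)
x <- -log(1 - u)   # by Theorem 5.8.1, x has cdf F
mean(x)            # should be close to the true mean, 1
```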
Other methods for simulating a random variable:
(1) Accept-Reject Algorithm (Chapter 5.8.1)
(2) Markov chain Monte Carlo Methods (Chapter 11.4)
R commands for generating random variables
runif   -- uniform random variables
rbinom  -- binomial random variables
rnorm   -- normal random variables
rt      -- t random variables
rpois   -- Poisson random variables
rexp    -- exponential random variables
rgamma  -- gamma random variables
rbeta   -- beta random variables
rcauchy -- Cauchy random variables
rchisq  -- chi-squared random variables
rf      -- F random variables
rgeom   -- geometric random variables
rnbinom -- negative binomial random variables
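Each of these takes the number of draws as its first argument and the distribution's parameters as the remaining arguments. A small sketch (not part of the notes) checking a few sample means against their true values:

```r
set.seed(3)
y1 <- rpois(100000, lambda = 4)               # Poisson, true mean 4
y2 <- rexp(100000, rate = 2)                  # exponential, true mean 1/2
y3 <- rbeta(100000, shape1 = 2, shape2 = 2)   # beta(2,2), true mean 1/2
c(mean(y1), mean(y2), mean(y3))
```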
Bootstrap Procedures
Bootstrap standard errors
\(X_1, \ldots, X_n\) iid with CDF F and variance \(\sigma^2\).
$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{X_1 + \cdots + X_n}{n}\right) = \frac{1}{n^2}\,n\,\mathrm{Var}(X_1) = \frac{\sigma^2}{n}, \qquad SD(\bar{X}) = \frac{\sigma}{\sqrt{n}}.$$
We estimate \(SD(\bar{X})\) by \(SE(\bar{X}) = s/\sqrt{n}\), where \(s\) is the sample standard deviation.
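In R (a sketch, not in the original notes), this estimate is just sd(x)/sqrt(length(x)):

```r
# Estimate SD(Xbar) by s/sqrt(n) for a sample with sigma = 2, n = 400,
# so the true value is 2/sqrt(400) = 0.1.
set.seed(4)
x <- rnorm(400, mean = 10, sd = 2)
se_mean <- sd(x)/sqrt(length(x))
se_mean   # should be close to 0.1
```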
What about \(SD\{\mathrm{Median}(X_1, \ldots, X_n)\}\)? This SD depends in a complicated way on the distribution F of the X's. How can we approximate it?
Real World: \(F \Rightarrow X_1, \ldots, X_n \Rightarrow T_n = \mathrm{Median}(X_1, \ldots, X_n)\).
The bootstrap principle is to approximate the real world by assuming that \(F = \hat{F}_n\), where \(\hat{F}_n\) is the empirical CDF, i.e., the distribution that puts probability \(1/n\) on each of \(X_1, \ldots, X_n\). We simulate from \(\hat{F}_n\) by drawing one point at random from the original data set.
Bootstrap World:
$$\hat{F}_n \Rightarrow X_1^*, \ldots, X_n^* \Rightarrow T_n^* = \mathrm{Median}(X_1^*, \ldots, X_n^*)$$
The bootstrap estimate of \(SD\{\mathrm{Median}(X_1, \ldots, X_n)\}\) is \(SD\{\mathrm{Median}(X_1^*, \ldots, X_n^*)\}\), where \(X_1^*, \ldots, X_n^*\) are iid draws from \(\hat{F}_n\).
How to approximate \(SD\{\mathrm{Median}(X_1^*, \ldots, X_n^*)\}\)? The Monte Carlo method.
Note that the sample variance of the \(g(X_i)\)'s consistently estimates \(\mathrm{Var}[g(X)]\):
$$\frac{1}{m}\sum_{i=1}^{m}\left(g(X_i) - \frac{1}{m}\sum_{i=1}^{m} g(X_i)\right)^2 = \frac{1}{m}\sum_{i=1}^{m} g(X_i)^2 - \left(\frac{1}{m}\sum_{i=1}^{m} g(X_i)\right)^2$$
$$\xrightarrow{P} E\left[g(X)^2\right] - \left(E[g(X)]\right)^2 = \mathrm{Var}[g(X)].$$
Bootstrap Standard Error Estimation for Statistic \(T_n = g(X_1, \ldots, X_n)\):
1. Draw \(X_1^*, \ldots, X_n^*\) iid from \(\hat{F}_n\).
2. Compute \(T_n^* = g(X_1^*, \ldots, X_n^*)\).
3. Repeat steps 1 and 2 m times to get \(T_{n,1}^*, \ldots, T_{n,m}^*\).
4. Let
$$se_{boot} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(T_{n,i}^* - \frac{1}{m}\sum_{r=1}^{m} T_{n,r}^*\right)^2}.$$
The bootstrap involves two approximations:
$$SD_F(T_n) \;\underset{\text{not so small approx. error}}{\approx}\; SD_{\hat{F}_n}(T_n) \;\underset{\text{small approx. error}}{\approx}\; se_{boot}$$
R function for bootstrap estimate of SE(Median)
bootstrapmedianfunc=function(X,bootreps){
medianX=median(X);
# vector that will store the bootstrapped medians
bootmedians=rep(0,bootreps);
for(i in 1:bootreps){
# Draw a sample of size n from X with replacement and
# calculate median of sample
Xstar=sample(X,size=length(X),replace=TRUE);
bootmedians[i]=median(Xstar);
}
seboot=var(bootmedians)^.5;
list(medianX=medianX,seboot=seboot);
}
Example: In a study of the natural variability of rainfall, the
rainfall of summer storms was measured by a network of
rain gauges in southern Illinois for the year 1960.
> rainfall=c(.02,.01,.05,.21,.003,.45,.001,.01,2.13,.07,.01,.01,.001,.003,.04,.32,.19,.18,.12,.001,1.1,.24,.002,.67,.08,.003,.02,.29,.01,.003,.42,.27,.001,.001,.04,.01,1.72,.001,.14,.29,.002,.04,.05,.06,.08,1.13,.07,.002)
> median(rainfall)
[1] 0.045
> bootstrapmedianfunc(rainfall,10000)
$medianX
[1] 0.045
$seboot
[1] 0.02167736