Chapter 11 Simulation studies in statistics

Introduction
• What is a simulation study, and why do one?
• Simulations for properties of estimators
• Simulations for properties of hypothesis tests
What is a simulation study?
Simulation
• A numerical technique for conducting experiments on the computer.
• It involves random sampling from probability distributions.
Rationale (in statistics)
• Properties of statistical methods must be established so that the
methods may be used with confidence.
• Exact analytical derivations of properties are rarely possible.
• Large sample approximations to properties are often possible.
• But what happens in the finite sample size case?
• And what happens when assumptions are violated?
Usual issues
• Is an estimator biased in finite samples? Is it still consistent under
departures from assumptions? What is its sampling variance?
• How does it compare to competing estimators on the basis of bias,
precision, etc.?
• Does a procedure for constructing a confidence interval for a
parameter achieve the advertised nominal level of coverage?
Usual issues
• Does a hypothesis testing procedure attain the advertised level of
significance or size?
• If it does, what power is possible against different alternatives to
the null hypothesis? Do different test procedures deliver different
power?
How to answer these questions in the absence of analytical results?
Monte Carlo simulation approximation
• An estimator or test statistic has a true sampling distribution
under a particular set of conditions (finite sample size, true
distribution of the data, etc.)
• Ideally, we would want to know this true sampling distribution in
order to address the issues on the previous slide.
• But deriving the true sampling distribution is usually not tractable.
• Hence we use simulation to approximate the sampling distribution of an
estimator or a test statistic under a particular set of conditions.
How to approximate
A typical Monte Carlo simulation involves the following:
• Generate S independent data sets under the conditions of interest
• Compute the numerical value of the estimator/test statistic T from
each data set. Hence we have T_1, T_2, ..., T_s, ..., T_S
How to approximate
• If S is large enough, summary statistics across T_1, T_2, ..., T_s, ..., T_S
should be good approximations to the true properties of the
estimator/test statistic under the conditions of interest.
An example:
Consider an estimator for a parameter θ:
– T_s is the value of T from the s-th data set, s = 1, 2, ..., S.
– The mean of all the T_s's is an estimate of the true mean of the
sampling distribution of the estimator.
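A minimal sketch of this recipe in R is given below; gen_data() and est() are hypothetical placeholders (they are not defined in these notes) standing for the data-generating mechanism and the estimator or test statistic under study.

# Generic Monte Carlo approximation of a sampling distribution
# (gen_data() and est() are hypothetical placeholders)
S <- 1000                 # number of simulated data sets
Tvals <- numeric(S)       # storage for T_1, ..., T_S
for (s in 1:S) {
  y <- gen_data()         # generate the s-th data set under the conditions of interest
  Tvals[s] <- est(y)      # compute the estimator/test statistic from that data set
}
mean(Tvals)               # approximates the mean of the sampling distribution of T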
Simulation for properties of estimators
A simple example:
• Compare three estimators for the mean µ of a distribution based on
a random sample Y_1, Y_2, ..., Y_n
• The three estimators are:
– Sample mean, T^(1)
– Sample median, T^(2)
– Sample 10% trimmed mean, T^(3)
Simulation procedures
For a particular choice of µ, n, and true underlying distribution:
• Generate independent draws Y_1, Y_2, ..., Y_n from the distribution
• Compute T^(1), T^(2), T^(3)
• Repeat S times. Hence we have
T_1^(1), T_2^(1), ..., T_S^(1)
T_1^(2), T_2^(2), ..., T_S^(2)
T_1^(3), T_2^(3), ..., T_S^(3)
Simulation procedures
Compute, for k = 1, 2, 3:
$$\widehat{\mathrm{mean}} = \frac{1}{S}\sum_{s=1}^{S} T_s^{(k)} = \bar{T}^{(k)}$$
$$\widehat{\mathrm{bias}} = \bar{T}^{(k)} - \mu$$
$$\widehat{\mathrm{SD}} = \sqrt{\frac{1}{S-1}\sum_{s=1}^{S}\left(T_s^{(k)} - \bar{T}^{(k)}\right)^2}$$
$$\widehat{\mathrm{MSE}} = \frac{1}{S}\sum_{s=1}^{S}\left(T_s^{(k)} - \mu\right)^2 \approx \widehat{\mathrm{SD}}^{\,2} + \widehat{\mathrm{bias}}^{\,2}$$
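These summaries are easy to compute in R; a minimal sketch, assuming Ts is the vector of S simulated values of T^(k) and mu is the true parameter value:

# Monte Carlo summaries for one estimator, given Ts (a length-S vector) and mu
mc.mean <- mean(Ts)           # estimate of the mean of the sampling distribution
mc.bias <- mc.mean - mu       # estimated bias
mc.sd   <- sd(Ts)             # estimated sampling standard deviation (divisor S - 1)
mc.mse  <- mean((Ts - mu)^2)  # estimated MSE; approximately mc.sd^2 + mc.bias^2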
A simulation study
• To compare 3 estimators for location µ through a simulation study
• The 3 estimators are
– Sample Mean,
– Sample Median,
– Sample 10% trimmed mean
• Underlying distribution: N(1, 1), i.e. µ = 1 and σ = 1 as in the R code below
• Sample size: 15
• Simulation size: 1000
A simulation study: R code
> # To compare 3 estimators for location, mu, through a simulation study
> # The 3 estimators are the sample mean, median, and 10% trimmed mean.
> S=1000 # Simulation size (no. of samples)
> n=15 # Sample size
> mu=1 # Mean of the underlying normal distribution
> sd=1 # Standard deviation of the underlying normal distribution
A simulation study: R code
> meax=numeric(S) # A vector containing all the sample means
> medx=numeric(S) # A vector containing all the sample medians
> trmx=numeric(S) # A vector containing all the sample 10% trimmed means
> stdx=numeric(S) # A vector containing all the sample standard deviations
> set.seed(1234) # Set the random seed
A simulation study: R code
> for (i in 1:S){
+ x=rnorm(n,mu,sd) # Generate a random sample of size n
+ meax[i]=mean(x) # Compute the mean of the i-th sample, i.e. T_i^(1)
+ medx[i]=median(x) # Compute the median of the i-th sample, i.e. T_i^(2)
+ trmx[i]=mean(x,trim=0.1) # Compute the 10% trimmed mean of the i-th sample, i.e. T_i^(3)
+ stdx[i]=sd(x) # Compute the standard deviation of the i-th sample
+ }
A simulation study: R code
> # Compute the mean of the "S" sample means, medians, and trimmed means
> simumean=apply(cbind(meax,medx,trmx),2,mean)
> # The "2" in the second argument asks apply() to find the "means" for all the columns of the matrix given in argument 1.
> # Hence "simumean" is a 3x1 vector consisting of the averages of the "S" means, medians, and 10% trimmed means.
A simulation study: R code
> # Compute the standard deviation of the "S" sample means, medians, and 10% trimmed means
> simustd=apply(cbind(meax,medx,trmx),2,sd)
> # Compute the bias
> simubias=simumean-rep(mu,3)
> # Compute the MSE (mean squared error)
> simumse=simubias^2+simustd^2
A simulation study: R code
> ests=c("Sample mean","Sample median","Sample 10% trimmed mean") # Column headings in the output
> names=c("True value","No. of simu","MC Mean","MC Std Deviation","MC Bias","MC MSE") # Row headings in the output
> sumdat=rbind(rep(mu,3),rep(S,3),simumean,simustd,simubias,simumse)
> dimnames(sumdat)=list(names,ests)
> round(sumdat,4)
A simulation study: R code
[Output of round(sumdat,4): Monte Carlo mean, standard deviation, bias, and MSE for the three estimators]
Checking coverage probability of confidence interval
• Usual 95% confidence interval for µ based on sample mean is given
by
$$\left(\bar{Y} - t_{0.025,\,n-1}\,\frac{s}{\sqrt{n}},\;\; \bar{Y} + t_{0.025,\,n-1}\,\frac{s}{\sqrt{n}}\right)$$
• Does the interval achieve the nominal level of coverage 95%?
Checking coverage probability of confidence interval
> # Check the coverage probability of the confidence interval
> # We make use of the data obtained from the above simulation study
> t05=qt(0.975,n-1) # Get t_{0.025, n-1} (= t_{0.025, 14})
> coverage=sum((meax-t05*stdx/sqrt(n)<=mu)&(meax+t05*stdx/sqrt(n)>=mu))/S
> # The above statement is equivalent to
> # d=0
> # for(i in 1:S){
> #   d=d+((meax[i]-t05*stdx[i]/sqrt(n)<=mu)&(meax[i]+t05*stdx[i]/sqrt(n)>=mu))}
> # coverage=d/S
> coverage
[1] 0.945
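The Monte Carlo estimate of the coverage has its own simulation error: treating each interval's coverage as a Bernoulli trial with success probability 0.95, the standard error based on S = 1000 replications is

> sqrt(0.95*0.05/S) # Monte Carlo standard error of the coverage estimate, about 0.007

so an observed coverage of 0.945 is well within simulation error of the nominal 95% level.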
Simulation for properties of hypothesis tests
Example: Consider the size and power of the usual t-test for the mean
• Test H0 : µ = µ0 against H1 : µ ≠ µ0
• To evaluate whether the size/level of the test achieves the advertised α:
– Generate data under H0 : µ = µ0 and calculate the proportion of
rejections of H0
– This approximates the true probability of rejecting H0 when it is true
– The proportion should be approximately equal to α
Simulation for properties of hypothesis tests
• To evaluate power
– Generate data under some alternative µ ≠ µ0 (say, µ = µ1 )
and calculate the proportion of rejections of H0
• This approximates the true probability of rejecting H0 when the
alternative is true, i.e. the power against µ1
Checking the size of t-test: R code
> # Check the size of the t-test
> mu0=mu # For the size check the data were generated under H0, so mu0 equals mu
> ttests=(meax-mu0)/(stdx/sqrt(n)) # t statistics for all S samples
> size=sum(abs(ttests)>t05)/S # Proportion of rejections at the 5% level
> size
[1] 0.045
Checking the power of t-test: R code
> # Check the power of the t-test when the shift is d = 1*sd
> # Assume the samples were generated from N(mu1, sigma) with mu1 = 1 and sigma = 1
> # Test H0: mu = mu0 vs H1: mu = mu1 = mu0 + sigma, where mu0 = 0
> mu0=0 # Null value; the generated data have mean mu1 = mu0 + sigma
> ttests=(meax-mu0)/(stdx/sqrt(n)) # t statistics for all S samples
> power=sum(abs(ttests)>t05)/S # Proportion of rejections at the 5% level
> power
[1] 0.954
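As a check on this Monte Carlo estimate, the exact power of the one-sample t-test is available from base R's power.t.test(); with n = 15, a one-standard-deviation shift, and a two-sided 5% level it gives a power of roughly 0.95, consistent with the simulated value of 0.954.

> power.t.test(n=15, delta=1, sd=1, sig.level=0.05, type="one.sample") # exact power, about 0.95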
Simulation study in SAS
Simulation to study the size of t-test
* Generate "ns" normal random samples;
* "mcrep" (M.C. replication) is the index of the s-th random sample;
data simu;
  seed=123;
  ns=1000; n=25; mu=0; sigma=1;
  do mcrep=1 to ns;
    do i=1 to n;
      x=mu+sigma*rannor(seed); * Draw from N(mu, sigma);
      output;
    end;
  end;
  keep mcrep x;
run;
Simulation study in SAS
proc sort data=simu;
  by mcrep;
run;
* Perform the t-test of H0: mu=0 for each sample;
* Output the p-value "probt" to the SAS data set "outtest";
proc univariate data=simu noprint mu0=0;
  by mcrep;
  var x;
  output out=outtest probt=p;
run;
Simulation study in SAS
* Flag the samples with "probt" < 0.05;
data outtest;
  set outtest;
  reject=(p<0.05);
run;
* "reject" indicates whether a sample rejects H0 at the 5% level. Hence the
  mean of "reject" over the samples is the rejection rate, "rejrate";
Simulation study in SAS
proc means data=outtest noprint;
  var reject;
  output out=results mean=rejrate;
run;
* "_FREQ_" in the output data set gives the number of Monte Carlo samples;
proc print data=results;
  var _freq_ rejrate;
run;