Project #3

UNL STAT 950
Fall 2012
Complete the problems below. When R is needed for part of a problem, include all of your R
output, with the code shown inside it, and any additional information needed to explain your answer.
You may need to edit your output and code in order to make it look nice after you copy and paste it
into your Word document.
1) The purpose of this problem is to perform inferences about the ratio of variances from two
populations. Let Y1 ~ F1 and Y2 ~ F2 with Y1 being independent of Y2. Define
$t(F_1, F_2) = \sigma_1^2 / \sigma_2^2$
as the ratio of the two variances from the two populations.
a) What is the plug-in estimator $T_{plug} = t(\hat{F}_1, \hat{F}_2)$?
b) Derive the nonparametric δ-method estimate of the variance for Tplug.
c) Simulate samples from F1 and F2 using the following code:
> set.seed(8121)
> n1<-10
> n2<-10
> y1<-rnorm(n = n1, mean = 0, sd = 1) #Sample from F_1
> y2<-rnorm(n = n2, mean = 0, sd = 2) #Sample from F_2
> set1<-data.frame(y = c(y1, y2), pop = c(rep(x = 1, times = n1), rep(x = 2, times
    = n2)))
> set1
             y pop
1   1.49855323   1
2  -0.98053791   1
3   0.44460395   1
4  -0.97115284   1
5  -1.16749149   1
6   0.33838806   1
7   0.61929116   1
8   0.91287587   1
9   0.73927369   1
10  0.77852717   1
11  0.09511827   2
12  0.50533449   2
13 -2.14266837   2
14 -1.71690455   2
15 -2.14187131   2
16  1.14793945   2
17 -1.58705439   2
18 -2.13012146   2
19  0.52124358   2
20  3.05966581   2
Calculate the empirical influence values for this data using both your derivations from b) and
the empinf() function. Calculate the nonparametric δ-method estimate of the variance for Tplug.
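As a rough sketch of the hand calculation in part c), the base R code below computes the plug-in ratio and a nonparametric δ-method variance from empirical influence values. It assumes the standard influence function of a variance, $(y - \mu)^2 - \sigma^2$, combined through the chain rule for a ratio; the object names (`s1`, `s2`, `l1`, `l2`, `v.L`) are illustrative and not part of the handout, and the result should be checked against your own derivation from b).

```r
# Re-create the simulated samples from the handout
set.seed(8121)
n1 <- 10
n2 <- 10
y1 <- rnorm(n = n1, mean = 0, sd = 1)  # Sample from F_1
y2 <- rnorm(n = n2, mean = 0, sd = 2)  # Sample from F_2

# Plug-in variances use divisor n, not n - 1
s1 <- mean((y1 - mean(y1))^2)
s2 <- mean((y2 - mean(y2))^2)
t.plug <- s1 / s2

# Empirical influence values (assumed form; verify against part b)
l1 <- ((y1 - mean(y1))^2 - s1) / s2
l2 <- -(s1 / s2^2) * ((y2 - mean(y2))^2 - s2)

# Two-sample nonparametric delta-method variance estimate
v.L <- sum(l1^2) / n1^2 + sum(l2^2) / n2^2
c(t.plug = t.plug, v.L = v.L)
```

The influence values within each sample sum to zero, which is a quick sanity check before comparing them with the empinf() output.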
d) Estimate the empirical influence values using the jackknife and regression-based methods.
Calculate estimates of the variance for Tplug using these estimates of the empirical influence
values. When implementing the regression-based methods, use R = 1999 resamples, the
boot() function, and set a seed of 8111 before implementing the boot() function.
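One possible way to set up part d) with the boot package is sketched below. It assumes a stratified resampling scheme (one stratum per population) and an index-based statistic function; `calc.t.plug`, `l.jack`, and `l.reg` are hypothetical names chosen for illustration.

```r
library(boot)

# Re-create set1 as in the handout
set.seed(8121)
n1 <- 10
n2 <- 10
y1 <- rnorm(n = n1, mean = 0, sd = 1)
y2 <- rnorm(n = n2, mean = 0, sd = 2)
set1 <- data.frame(y = c(y1, y2), pop = rep(1:2, times = c(n1, n2)))

# Plug-in ratio of variances (divisor n) for a resample given by row indices i
calc.t.plug <- function(data, i) {
  d <- data[i, ]
  a <- d$y[d$pop == 1]
  b <- d$y[d$pop == 2]
  mean((a - mean(a))^2) / mean((b - mean(b))^2)
}

# Stratified bootstrap so each resample keeps n1 and n2 fixed
set.seed(8111)
boot.res <- boot(data = set1, statistic = calc.t.plug, R = 1999,
                 strata = set1$pop)

# Jackknife and regression-based empirical influence values
l.jack <- empinf(boot.out = boot.res, type = "jack")
l.reg  <- empinf(boot.out = boot.res, type = "reg")

# Variance estimates built from each set of influence values
v.jack <- var.linear(l.jack, strata = set1$pop)
v.reg  <- var.linear(l.reg, strata = set1$pop)
c(v.jack = v.jack, v.reg = v.reg)
```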
e) Suppose the statistic of interest is changed to
$$T = \frac{\frac{1}{n_1 - 1} \sum_{j=1}^{n_1} (Y_{1j} - \bar{Y}_1)^2}{\frac{1}{n_2 - 1} \sum_{j=1}^{n_2} (Y_{2j} - \bar{Y}_2)^2}$$
Use the jackknife and regression-based methods to estimate the variance of T (with same
resamples). Compare your estimates to those found in parts c) and d) and provide specific
reasons why differences or similarities occur.
f) Find the bootstrap estimate of the variance of T using R = 1999 resamples, the boot() function,
and set a seed of 8111 before implementing the boot() function. Compare the variance here to
those obtained previously.
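A minimal sketch of the bootstrap variance calculation in part f) follows. It assumes stratified resampling within each population and uses var(), which applies the n - 1 divisor that defines T; `calc.t` and `v.boot` are illustrative names.

```r
library(boot)

# Re-create set1 as in the handout
set.seed(8121)
n1 <- 10
n2 <- 10
y1 <- rnorm(n = n1, mean = 0, sd = 1)
y2 <- rnorm(n = n2, mean = 0, sd = 2)
set1 <- data.frame(y = c(y1, y2), pop = rep(1:2, times = c(n1, n2)))

# Ratio of unbiased sample variances for a resample given by row indices i
calc.t <- function(data, i) {
  d <- data[i, ]
  var(d$y[d$pop == 1]) / var(d$y[d$pop == 2])
}

set.seed(8111)
boot.res <- boot(data = set1, statistic = calc.t, R = 1999,
                 strata = set1$pop)

# Bootstrap variance estimate: sample variance of the R resampled statistics
v.boot <- var(boot.res$t[, 1])
v.boot
```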
g) In the previous parts, you saw some similarities and differences among the variance estimates.
How could we determine which estimate (if any) is correct? There are a few different
approaches to answering this question. The parts outlined below provide one approach
through using Monte Carlo simulation.
i) Simulate 500 data sets using the same settings as given at the beginning of this problem.
Display the first and last data sets simulated. Do not use a for loop here!
ii) For each simulated data set, calculate the variance estimates for T using each of the four
methods examined in this problem (this includes the nonparametric δ-method variance
estimate for Tplug). Set a seed number of 8729 right before estimating the variance with the
first data set. Display the first and last variance calculated for each of the four methods.
iii) Average the variance estimates for each method over the 500 simulated data sets.
iv) Estimate T for each of the 500 simulated data sets. Calculate the sample variance across
these estimates of T; i.e., calculate $\frac{1}{499} \sum_{b=1}^{500} (t_b - \bar{t})^2$, where $t_b$ is the estimate of T for the
bth data set and $\bar{t} = \frac{1}{500} \sum_{b=1}^{500} t_b$. The resulting sample variance is a Monte Carlo estimate
of the actual variance of T.
v) Compare the values from iii) to that calculated in iv). Which of the variance estimators is
doing a better job?
vi) Repeat this process for n1 = n2 = 100 using the same seed numbers. Do the variance
estimators improve? Explain.
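Steps i) and iv) can be sketched without a for loop by storing each simulated data set as a row of a matrix. The code below is only an outline under stated assumptions: reusing the seed 8121 is a guess at what "the same settings" means, and `numb`, `y1.sim`, `y2.sim`, and `t.sim` are hypothetical names.

```r
set.seed(8121)  # Assumption: reuse the seed from the start of the problem
numb <- 500     # Number of simulated data sets
n1 <- 10
n2 <- 10

# Each row holds one simulated sample; no explicit loop is needed
y1.sim <- matrix(rnorm(n = numb * n1, mean = 0, sd = 1), nrow = numb, ncol = n1)
y2.sim <- matrix(rnorm(n = numb * n2, mean = 0, sd = 2), nrow = numb, ncol = n2)

# First and last simulated data sets
rbind(y1.sim[1, ], y2.sim[1, ])
rbind(y1.sim[numb, ], y2.sim[numb, ])

# Estimate T for every data set, then take the sample variance across them
t.sim <- apply(y1.sim, 1, var) / apply(y2.sim, 1, var)
mc.var <- var(t.sim)
mc.var
```

The same row-wise pattern extends to step ii): write a function that takes one row index and returns the four variance estimates, then apply it over `1:numb` with sapply().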
2) This is a continuation of #3 of Project #1 (Example 10.1.22 of Casella and Berger). For simplicity
of notation, let $T = (n - 1)^{-1} \sum_{j=1}^{n} (Y_j - \bar{Y})^2$ denote the unbiased sample variance, let
$T_{bias} = n^{-1} \sum_{j=1}^{n} (Y_j - \bar{Y})^2$ denote the biased sample variance, and let the population variance be $\theta$.
a) Calculate t and tbias.
b) Using actual resamples, find the bootstrap estimate of the bias for both T and Tbias and find the
corresponding corrected statistic values. Compare bias estimates and corrected statistics to
the actual values (remember that σ² = 4). Use a seed number of 7818 with the boot() function
when taking R = 1999 resamples.
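The bias calculation in part b) might be organized as in the sketch below. The Project #1 data are not reproduced in this handout, so `y` is a simulated placeholder (25 normal draws with variance 4) and must be replaced by the real data; `calc.both`, `bias.est`, and `corrected` are illustrative names.

```r
library(boot)

# Placeholder data only: NOT the Project #1 data set
set.seed(1)
y <- rnorm(n = 25, mean = 0, sd = 2)

# Return both T (divisor n - 1) and Tbias (divisor n) for a resample
calc.both <- function(data, i) {
  d <- data[i]
  n <- length(d)
  c(var(d), var(d) * (n - 1) / n)
}

set.seed(7818)
boot.res <- boot(data = y, statistic = calc.both, R = 1999)

# Bootstrap bias estimate: mean of resampled statistics minus observed value
bias.est <- colMeans(boot.res$t) - boot.res$t0

# Bias-corrected statistics
corrected <- boot.res$t0 - bias.est
rbind(t0 = boot.res$t0, bias = bias.est, corrected = corrected)
```

Because Tbias is a fixed multiple of T for every resample, the two bias estimates differ only by the factor (n - 1)/n.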
c) Using actual resamples, find the bootstrap estimate for the bias of the bias corresponding to
both T and Tbias. Compare bias estimates and corrected statistics to the actual values. Use a
seed number of 7818 again with the boot() function when taking R = 1999 and M = 500
resamples.
d) You may be surprised here that the corrected estimates are not closer to σ². Why is this not a
cause for concern? Describe how one could evaluate whether a similar problem occurs for other
cases.
e) Use the jack.after.boot() function to create the default diagnostic plot with respect to T.
Comment on the sensitivity of the bootstrap calculations for T using the plot.
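For part e), a call along the following lines produces the default jackknife-after-bootstrap plot. As before, `y` is a simulated placeholder standing in for the Project #1 data, and `calc.t` is a hypothetical statistic function.

```r
library(boot)

# Placeholder data only: NOT the Project #1 data set
set.seed(1)
y <- rnorm(n = 25, mean = 0, sd = 2)

# Unbiased sample variance for a resample given by indices i
calc.t <- function(data, i) {
  var(data[i])
}

set.seed(7818)
boot.res <- boot(data = y, statistic = calc.t, R = 1999)

# Default diagnostic plot for the first (and only) statistic
jack.after.boot(boot.out = boot.res, index = 1)
```

Observations whose quantile lines move sharply when they are omitted are the ones to discuss when commenting on sensitivity.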