Project exam STK4600, SPRING 2015 – due May 6

1. Consider the data set “trees” in R, consisting of a sample of 31 trees with measurements of
girth (diameter in inches), height (feet) and timber volume (cubic feet). Diameter and height of
trees are easily measured, but volume is more difficult to measure. Use R to answer all
questions and include the R code in your answers.
a) Plot height vs. diameter for the 31 trees.
b) Make a histogram for height for the 31 trees and compute the mean height for the 31
trees.
c) Select a simple random sample of size 12 and estimate the mean height for the 31
trees and compute 95% confidence intervals by using
i. the sample mean
ii. the ratio estimator with x = diameter.
d) Find an estimate of the correct confidence level for the 95% confidence interval based
on the ratio estimator for sample sizes 8, 12 and 16, based on 1000 simulations.
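The computations in c) and d) can be sketched in R as follows. This is only an illustration under stated assumptions, not a model answer: the function names `mean_ci`, `ratio_ci` and `coverage` are hypothetical, the intervals use the normal quantile rather than a t quantile, and the ratio-estimator variance uses the residual form (1 - n/N) * s_e^2 / n without the (population mean of x / sample mean of x)^2 scaling that some textbooks include. The built-in `trees` data frame stores diameter in the column `Girth`.

```r
# SRS confidence interval for the population mean based on the sample mean.
mean_ci <- function(y, N, level = 0.95) {
  n <- length(y)
  se <- sqrt((1 - n / N) * var(y) / n)      # with finite population correction
  z <- qnorm(1 - (1 - level) / 2)           # assumption: normal quantile
  c(estimate = mean(y), lower = mean(y) - z * se, upper = mean(y) + z * se)
}

# SRS confidence interval for the population mean based on the ratio estimator.
# x_bar_pop is the known population mean of the auxiliary variable.
ratio_ci <- function(y, x, x_bar_pop, N, level = 0.95) {
  n <- length(y)
  r <- sum(y) / sum(x)                      # estimated ratio
  est <- r * x_bar_pop
  s2_e <- sum((y - r * x)^2) / (n - 1)      # residual variance around the ratio line
  se <- sqrt((1 - n / N) * s2_e / n)        # assumption: unscaled variance form
  z <- qnorm(1 - (1 - level) / 2)
  c(estimate = est, lower = est - z * se, upper = est + z * se)
}

# Monte Carlo estimate of the actual coverage of the ratio-based interval.
coverage <- function(n, reps = 1000, pop = trees) {
  N <- nrow(pop)
  true_mean <- mean(pop$Height)
  x_bar_pop <- mean(pop$Girth)
  hits <- replicate(reps, {
    s <- sample(N, n)                       # simple random sample without replacement
    ci <- ratio_ci(pop$Height[s], pop$Girth[s], x_bar_pop, N)
    ci[["lower"]] <= true_mean && true_mean <= ci[["upper"]]
  })
  mean(hits)                                # proportion of intervals covering the truth
}

set.seed(4600)                              # arbitrary seed, for reproducibility only
s <- sample(nrow(trees), 12)
ci_mean  <- mean_ci(trees$Height[s], nrow(trees))
ci_ratio <- ratio_ci(trees$Height[s], trees$Girth[s], mean(trees$Girth), nrow(trees))
cov_hat  <- sapply(c(n8 = 8, n12 = 12, n16 = 16), coverage)
print(ci_mean); print(ci_ratio); print(cov_hat)
```

Comparing `cov_hat` with 0.95 shows how far the nominal level is from the simulated one at each sample size.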
For the last questions we shall look at the problem of estimating volume of the trees and we
shall consider either diameter or height as the auxiliary variable when using the ratio
estimator.
e) Make plots of volume vs. diameter and volume vs. height. Choose the variable you
will use as auxiliary variable in the ratio estimator and explain why.
f) Suppose now that these trees are a simple random sample from a forest of N = 2967
trees and that the sum of the diameters for all trees in the forest is 41835 inches while
the sum of the heights is 220000 feet. Use the ratio estimator chosen in part e) to
estimate the total volume for all trees in the forest, and give a 95% confidence
interval.
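A minimal R sketch of the ratio estimate of a population total, as asked for in f). Diameter is used as the auxiliary variable purely for illustration; the choice itself is the subject of part e). The variable names are hypothetical, the interval uses the normal quantile, and the variance is the unscaled residual form, so a course text using a different variance formula will give a somewhat different interval.

```r
N   <- 2967        # number of trees in the forest (from the problem text)
t_x <- 41835       # known total diameter in inches (from the problem text)

y <- trees$Volume  # study variable, cubic feet
x <- trees$Girth   # auxiliary variable: diameter in inches (illustrative choice)
n <- length(y)

r     <- sum(y) / sum(x)                    # estimated ratio of volume to diameter
t_hat <- r * t_x                            # ratio estimate of total volume

s2_e     <- sum((y - r * x)^2) / (n - 1)    # residual variance around the ratio line
se_total <- N * sqrt((1 - n / N) * s2_e / n)

ci <- t_hat + c(-1, 1) * qnorm(0.975) * se_total   # assumption: normal-based 95% CI
print(c(estimate = t_hat, lower = ci[1], upper = ci[2]))
```

Using height instead only requires replacing `x` with `trees$Height` and `t_x` with 220000.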
2. Consider a population of 8 countries from the American continent. We want to estimate the
total number of inhabitants in this population in 2010 by taking a sample of 4 countries.
Auxiliary information is the 1980 figures. The actual population sizes (in millions) for these
two years are shown in the table below.
Country          1980 population size   2010 population size
1 – Canada               24.0                   33.8
2 – USA                 227.7                  310.2
3 – Mexico               69.3                  112.5
4 – Argentina            28.2                   41.3
5 – Brazil              121.3                  201.1
6 – Chile                11.1                   16.7
7 – Uruguay               2.9                    3.5
8 – Cuba                  9.7                   11.5
Let, for country i, Yi be the population size in 2010 and xi the population size in 1980. We shall
assume the following population model:
E(Yi) = βxi and Var(Yi) = σ²v(xi). The Yi’s are uncorrelated.
a) Define what we mean when we say that an estimator T̂ for the total T is the best linear
unbiased (BLU) estimator for T.
Assume now that the sample selected is 1, 4, 5 and 8. We shall consider three versions of the
model.
b) Assume first that v(xi) = 1 for i = 1,..,8. Compute the BLU estimate for the total number of
inhabitants for the 8 countries in 2010.
c) Let now v(xi) = xi for i = 1,..,8. Compute the BLU estimate for the total number of
inhabitants for the 8 countries in 2010 under this model.
d) Let v(xi) = xi² for i = 1,..,8. Compute the BLU estimator for the total number of inhabitants
for the 8 countries in 2010 under this model.
e) Estimate the mean increase in population size for the 8 countries using the sample mean-based
estimator. Under which model is this the BLU estimator? Derive the model-based
standard error of the estimate under the model. By standard error we mean the square root
of the estimated variance of the prediction error.
f) Compute the model-based standard errors for the three estimates in b) - d).
g) Make plots using R of x vs. y and x vs. y/x for all 8 countries.
h) Compare the three estimates in b) - d) with the true value. Try to explain how the different
estimates perform. Assuming the 8 countries are the sample, which plots in g) may be helpful in
evaluating the three models considered?
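The computations in b) - d) and f) can be sketched in R with one weighted least squares function. This is a sketch under stated assumptions, not a model answer. It relies on the standard result that, under the model above, the BLU predictor of the total is the observed sample sum plus an estimate of β times the sum of xi over the non-sampled units, where β is estimated by weighted least squares with weights 1/v(xi), and the prediction error variance is σ² times [(sum of non-sampled xi)² / (sum over the sample of xi²/v(xi)) + sum over non-sampled units of v(xi)]. The name `blu_total` and the divisor n - 1 in the estimate of σ² are assumptions.

```r
x <- c(24.0, 227.7, 69.3, 28.2, 121.3, 11.1, 2.9, 9.7)    # 1980 sizes, millions
y <- c(33.8, 310.2, 112.5, 41.3, 201.1, 16.7, 3.5, 11.5)  # 2010 sizes, millions
s <- c(1, 4, 5, 8)                                         # sampled countries

# BLU estimate of the total and its model-based standard error for a
# variance function v. Only the sampled y values are used.
blu_total <- function(v, x, y, s) {
  xs <- x[s]; ys <- y[s]; xr <- x[-s]
  w <- 1 / v(xs)                                   # WLS weights
  beta <- sum(w * xs * ys) / sum(w * xs^2)         # WLS slope through the origin
  total <- sum(ys) + beta * sum(xr)                # observed part + predicted part
  sigma2 <- sum(w * (ys - beta * xs)^2) / (length(s) - 1)   # assumption: n - 1 divisor
  se <- sqrt(sigma2 * (sum(xr)^2 / sum(w * xs^2) + sum(v(xr))))
  c(beta = beta, total = total, se = se)
}

models <- list(
  constant  = function(x) rep(1, length(x)),       # v(x) = 1
  linear    = function(x) x,                       # v(x) = x
  quadratic = function(x) x^2                      # v(x) = x^2
)

results <- t(sapply(models, blu_total, x = x, y = y, s = s))
print(results)
print(sum(y))                                      # true 2010 total, for comparison
```

With v(x) = x the slope reduces to the ratio of the sample sums, and with v(x) = x² it reduces to the mean of the sampled ratios yi/xi, which is a quick check on the output.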