Project exam STK4600, SPRING 2015 – due May 6

1. Consider the data set “trees” in R, consisting of a sample of 31 trees with measurements of girth (diameter in inches), height (feet) and timber volume (cubic feet). Diameter and height of trees are easily measured, but volume is more difficult to measure. Use R to answer all questions and include the R code in your answers.

a) Plot height vs. diameter for the 31 trees.
b) Make a histogram of height for the 31 trees and compute the mean height for the 31 trees.
c) Select a simple random sample of size 12, estimate the mean height for the 31 trees, and compute 95% confidence intervals by using
   i. the sample mean
   ii. the ratio estimator with x = diameter.
d) Estimate the actual confidence level of the 95% confidence interval based on the ratio estimator for sample sizes 8, 12 and 16, based on 1000 simulations.

For the last questions we shall look at the problem of estimating the volume of the trees, and we shall consider either diameter or height as the auxiliary variable when using the ratio estimator.

e) Make plots of volume vs. diameter and volume vs. height. Choose the variable you will use as auxiliary variable in the ratio estimator and explain why.
f) Suppose now that these trees are a simple random sample from a forest of N = 2967 trees, that the sum of the diameters for all trees in the forest is 41835 inches, and that the sum of the heights is 220000 feet. Use the ratio estimator chosen in part e) to estimate the total volume for all trees in the forest, and give a 95% confidence interval.

2. Consider a population of 8 countries from the American continent. We want to estimate the total number of inhabitants in this population in 2010 by taking a sample of 4 countries. The auxiliary information is the 1980 figures. The actual population sizes (in millions) for these two years are shown in the table below.
Country          1980 population size   2010 population size
1 – Canada               24.0                   33.8
2 – USA                 227.7                  310.2
3 – Mexico               69.3                  112.5
4 – Argentina            28.2                   41.3
5 – Brazil              121.3                  201.1
6 – Chile                11.1                   16.7
7 – Uruguay               2.9                    3.5
8 – Cuba                  9.7                   11.5

Let, for country i, $Y_i$ be the population size in 2010 and $x_i$ the population size in 1980. We shall assume the following population model: $E(Y_i) = \beta x_i$ and $\mathrm{Var}(Y_i) = \sigma^2 v(x_i)$. The $Y_i$'s are uncorrelated.

a) Define what we mean when we say that an estimator $\hat{T}$ for the total $T$ is the best linear unbiased (BLU) estimator for $T$.

Assume now that the sample selected is countries 1, 4, 5 and 8. We shall consider three versions of the model.

b) Assume first that $v(x_i) = 1$ for i = 1,...,8. Compute the BLU estimate for the total number of inhabitants for the 8 countries in 2010.
c) Let now $v(x_i) = x_i$ for i = 1,...,8. Compute the BLU estimate for the total number of inhabitants for the 8 countries in 2010 under this model.
d) Let $v(x_i) = x_i^2$ for i = 1,...,8. Compute the BLU estimate for the total number of inhabitants for the 8 countries in 2010 under this model.
e) Estimate the mean increase in population size for the 8 countries using the sample mean-based estimator. Under which model is this the BLU estimator? Derive the model-based standard error of the estimate under that model. By standard error we mean the square root of the estimated variance of the prediction error.
f) Compute the model-based standard errors for the three estimates in b) - d).
g) Make plots using R of x vs. y and x vs. y/x for all 8 countries.
h) Compare the three estimates in b) - d) with the true value. Try to explain how the different estimates perform. Assuming the 8 countries are the sample, which plots in g) may be helpful in evaluating the three models considered?
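
For parts b) to d) of question 2, a standard result for this model is that the BLU estimate of the total is the observed sample total plus an estimate of β times the sum of x over the non-sampled countries, where β is estimated by weighted least squares through the origin with weights 1/v(x_i). The following is a minimal R sketch under that assumption, not an official solution. The helper `blu_total` and the names `x`, `y`, `s` and `est` are illustrative and do not come from the exam text.

```r
# Population sizes in millions, countries 1..8 in the order of the table
x <- c(24.0, 227.7, 69.3, 28.2, 121.3, 11.1, 2.9, 9.7)    # 1980 (auxiliary)
y <- c(33.8, 310.2, 112.5, 41.3, 201.1, 16.7, 3.5, 11.5)  # 2010 (target)
s <- c(1, 4, 5, 8)                                        # sampled countries

# Hypothetical helper: BLU estimate of the total under
# E(Y_i) = beta * x_i, Var(Y_i) = sigma^2 * v(x_i), uncorrelated Y_i.
# Only the sampled y values are used; x is known for all countries.
blu_total <- function(x, y, s, v) {
  xs <- x[s]
  ys <- y[s]
  w <- 1 / v(xs)                                # weights 1 / v(x_i)
  beta_hat <- sum(w * xs * ys) / sum(w * xs^2)  # WLS slope through the origin
  sum(ys) + beta_hat * sum(x[-s])               # observed part + predicted part
}

est <- c(
  v_const  = blu_total(x, y, s, function(x) rep(1, length(x))),  # v(x) = 1
  v_linear = blu_total(x, y, s, function(x) x),                  # v(x) = x
  v_square = blu_total(x, y, s, function(x) x^2)                 # v(x) = x^2
)
print(round(est, 2))
print(sum(y))  # true 2010 total, for the comparison in h)
```

With v(x) = x the slope reduces to the ratio of the sample totals, so that case is the ordinary ratio estimator. With v(x) = x^2 it reduces to the sample mean of the ratios y/x.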