Math 58B - Introduction to Biostatistics
Jo Hardin
Spring 2016

Lab Assignment 5

Lab Goals:
1. To understand the process of creating t-intervals for both one sample mean (quantitative response) and a new individual response.
2. To understand why we use a t-multiplier in the interval instead of the z-multiplier.
3. To experience the effect of sample size on the need for a t-multiplier.
4. To be able to differentiate between a confidence interval and a prediction interval.

In class

For the first part of the lab today, we aren't following an investigation. Instead, you will create confidence intervals using R to see how well you can capture the true population parameter. Your code should mimic the simulating confidence intervals applet we've used in class previously.

The purpose of this lab is to explore standard normal distributions versus t distributions. Initially, the population is completely known. Additionally, the first part of the lab will involve a single sample mean from the known population.

Scottish Militiamen (from Chance and Rossman, ISCAM)

The data associated with this lab contain the population of chest measurements (in inches) for 5738 Scottish militiamen in the early 19th century. The observations will be considered to be the population. The data are at:

    require(mosaic)
    militiamen = read.table("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/MILITIAMEN.TXT", sep="\t", header=TRUE)
    militiamen = unlist(militiamen)

Note1: If you can't remember what an ISCAM function does, pass it the argument "?". Also, remember to look at the back of each chapter for R summaries.
    load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))
    iscamsummary("?")  # page 187

    ## Error in iscamsummary("?"): iscamsummary(x, explanatory, digits)
    ## This function calculates the five number summary, mean, and standard deviation
    ## of the quantitative variable x
    ## Optional: A second, categorical variable can also be specified
    ## and values will be calculated separately for each group.
    ## Optional: Specify the number of digits in output to be different from 3.

Note2: Remember not to print out messages and warnings. We already waste enough paper! In your R markdown code, add the following at the top of the R code chunk:

    {r message=FALSE, warning=FALSE}

Note3: Similarly, shrink your plots down so that they don't take up an entire page. At the top of the R code chunk:

    {r fig.height=5, fig.width=3, fig.align='center', message=FALSE, warning=FALSE}

1. Generate three histograms by typing the following commands into R. Describe what each histogram represents. It may help to type par(mfrow=c(3,1)) so that all three histograms are on the same page. Note that all of the histograms are on the same scale.

    hist(militiamen, xlim=c(33,48))
    hist(sample(militiamen,5), xlim=c(33,48))
    mil.smean <- c()
    for(i in 1:10000){
      mil.samp <- sample(militiamen,5)
      mil.smean[i] <- mean(mil.samp)
    }
    hist(mil.smean, xlim=c(33,48))

Explain each of these three histograms to your neighbor.

2. Compute 10000 95% confidence intervals as follows. First, repeatedly draw samples of size 5 from the population and compute the mean and standard deviation of each sample.

    mm.mean <- c()
    mm.sd <- c()
    for(i in 1:10000){
      mil.samp <- sample(militiamen,5)
      mm.mean[i] <- mean(mil.samp)
      mm.sd[i] <- sd(mil.samp)
    }

Next, use the standard normal distribution multiplier $z^*$ (1.96) to compute the lower endpoint and upper endpoint of each such interval.
    lower <- mm.mean - 1.96 * sd(militiamen)/sqrt(5)
    lower[1:10]

    ## [1] 37.80343 40.60343 37.20343 37.80343 39.00343 36.80343 38.20343
    ## [8] 38.00343 39.40343 36.00343

    upper <- mm.mean + 1.96 * sd(militiamen)/sqrt(5)
    upper[1:10]

    ## [1] 41.39657 44.19657 40.79657 41.39657 42.59657 40.39657 41.79657
    ## [8] 41.59657 42.99657 39.59657

Notice that the standard deviation of the entire population is used in the above computation. Finally, count how many of these intervals do not contain the population mean.

    sum(mean(militiamen) < lower)
    ## [1] 203

    sum(mean(militiamen) > upper)
    ## [1] 281

What is the true population average of the chest measurements? How many of the 10000 confidence intervals above captured the true mean? How many would you expect to capture the true mean? Do the confidence intervals all have the same width?

3. Assume you didn't know the population standard deviation, as is highly likely in real world applications. What would you use instead of sd(militiamen)? Hint: you've already computed it! Using that substitution, find endpoints of 10000 new confidence intervals. How many of these new confidence intervals contain the true mean? How many would you expect to contain the true mean? Do the confidence intervals all have the same width?

4. It looks like the standard normal distribution did not provide an appropriate multiplier when the population standard deviation was not known. Instead of 1.96, find the appropriate multiplier $t^*$ from the t distribution using the R function iscaminvt. Repeat the computations in question 3, continuing to assume that the population standard deviation is unknown. How many of this third batch of 10000 confidence intervals contain the true mean? How many would you expect to contain the true mean? Do the confidence intervals all have the same width?

5. Summarize the properties of the three different kinds of intervals computed using: z with known population standard deviation, z with unknown population standard deviation, and t with unknown population standard deviation. What do you think would happen if you used t with a known population standard deviation?

To turn in

Follow up from the militiamen:

1. Consider the following two statistics:

    stat1 <- (mm.mean-39.832)/(sd(militiamen)/sqrt(5))
    stat2 <- (mm.mean-39.832)/(mm.sd/sqrt(5))

Make boxplots and histograms for each of the two statistics (making 4 separate plots is fine; you might want to use the command par(mfrow=c(2,2))). Remember to use freq=F in the histogram plot so that we get the density instead of the actual count. For each of the histograms (one for stat1, one for stat2), overlay a standard normal curve using the following command directly after each of the histogram functions.

    hist(stat1, freq=F, xlim=c(-10,10))
    lines(seq(-4,4,.1), dnorm(seq(-4,4,.1),0,1))

Comment on the fit of the normal curve to the tails of each of the two histograms. Also, calculate the percent of each statistic that is above 1.96 or below -1.96. The first bit of the command is below, but you'll need to extend the code to count everything you need.

    sum(stat1 > 1.96) / length(stat1)  # convince yourself that you understand this command
    ## [1] 0.0203

2. Giving as much commentary as you think necessary (and using the results from the plots and intervals above), explain to someone who hasn't done this lab why the normal multiplier (i.e., 1.96) doesn't "work" for creating confidence intervals, but the t-multiplier (i.e., 2.77) does "work". Your answer should have something to do with the coverage rate of the confidence intervals (which is what it means to "work"). That is, you might start your discussion by saying that "A 95% confidence interval should...".
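The two multipliers quoted in question 2 (1.96 and 2.77) can be reproduced as a quick check. The following is a minimal sketch that uses base R's qnorm and qt rather than the ISCAM function iscaminvt that the lab asks for; the use of base R and the names n, z.star, and t.star are assumptions made for illustration only.

```r
# Sample size used throughout the lab (samples of 5 militiamen)
n <- 5

# 95% two-sided multipliers: the 0.975 quantile of each distribution
z.star <- qnorm(0.975)           # standard normal multiplier
t.star <- qt(0.975, df = n - 1)  # t multiplier with n - 1 = 4 degrees of freedom

# Show both multipliers rounded to three decimals
round(c(z = z.star, t = t.star), 3)
```

The t multiplier is noticeably larger than the z multiplier at this sample size, and the gap shrinks as n grows, which is worth trying with a few other values of n.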
(Moving forward) a new observation:

Consider Investigation 2.6, where the goal was to produce 95% prediction intervals for a healthy body temperature. Use the data and process above to construct 95% prediction intervals for an individual chest measurement. In order to have an interval wide enough to capture the individual variability (i.e., to capture a new person's measurement, such as their temperature in Investigation 2.6, when they walk in the door), we need both the variance of the points and the variability of $\bar{X}$.

true sd of points around $\bar{X}$ = $\sqrt{\sigma^2 + \sigma^2/n}$

estimated sd of points around $\bar{X}$ = $\sqrt{s^2 + s^2/n}$

3. Using the t-intervals from the in-class portion of the lab, what percent (long run average percent) of the observations are contained in each interval? (I've done most of the work for you, but you might have to adjust the variable names: make sure you understand what the code does!!!)

    num.below.CI <- c()
    num.above.CI <- c()
    for(i in 1:10000){
      # confirm the correct variable names for lower and upper CI bounds
      num.below.CI[i] <- sum(militiamen < lower[i])
      num.above.CI[i] <- sum(militiamen > upper[i])
    }
    num.outside.CI <- num.below.CI + num.above.CI
    favstats(num.outside.CI)

    ##   min   Q1 median   Q3  max     mean       sd     n missing
    ##  1903 1994   2652 2837 4618 2576.985 484.6505 10000       0

4. Using the sample standard deviation and the $t^*$ multiplier, construct 10000 prediction intervals. For each interval, find the number of militiamen captured. Average the capture rate to report the coverage rate of your prediction intervals. The following code might be useful.

    t <- 0.02  # this number is wrong!
    mm.mean <- c()
    mm.sd <- c()
    num.above.zPI <- c()
    num.below.zPI <- c()
    for(i in 1:10000){
      mil.samp <- sample(militiamen,5)
      mm.mean[i] <- mean(mil.samp)
      mm.sd[i] <- sd(mil.samp)
      # this is wrong!!!!
      num.above.zPI[i] <- sum(militiamen > mm.mean[i] + t*mm.sd[i])
      num.below.zPI[i] <- sum(militiamen < mm.mean[i] - t*mm.sd[i])
    }
    num.outside.zPI <- num.above.zPI + num.below.zPI
    favstats(num.outside.zPI)

    ##   min   Q1 median   Q3  max     mean       sd     n missing
    ##  4659 5738   5738 5738 5738 5538.333 406.6519 10000       0

5. In your own words, explain the conceptual difference between a confidence interval and a prediction interval.
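For reference when comparing the two interval types, the formulas implied by this lab can be written side by side. This is a sketch under assumed notation that the handout does not spell out: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $t^*_{n-1}$ the t-multiplier with $n-1$ degrees of freedom. The prediction-interval width uses the estimated sd of points around $\bar{X}$ given earlier in the handout.

```latex
\begin{align*}
\text{confidence interval for } \mu: &\quad \bar{x} \pm t^*_{n-1}\,\frac{s}{\sqrt{n}} \\
\text{prediction interval for a new observation}: &\quad \bar{x} \pm t^*_{n-1}\,\sqrt{s^2 + \frac{s^2}{n}}
\end{align*}
```

The two expressions differ only in the standard deviation term: the first shrinks toward zero as $n$ grows, while the second never drops below $s$.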