Lab Assignment 5

Math 58B - Introduction to Biostatistics
Jo Hardin
Spring 2016
Lab Goals:
1. To understand the process of creating t-intervals for both one sample mean
(quantitative response) and a new individual response.
2. To understand why we use a t-multiplier in the interval instead of the z-multiplier.
3. To experience the effect of sample size on the need for a t-multiplier.
4. To be able to differentiate a confidence interval and a prediction interval.
In class
For the first part of the lab today, we won't be following an investigation. Instead, you
will create confidence intervals using R to see how well they capture the true
population parameter. Your code should mimic the Simulating Confidence Intervals applet
we've used in class previously.
The purpose of this lab is to explore standard normal distributions versus t distributions.
Initially, the population is completely known. Additionally, the first part of the lab will
involve a single sample mean from the known population.
Scottish Militiamen (from Chance and Rossman, ISCAM). The data associated with this lab
contain the population of chest measurements (in inches) for 5738 Scottish militiamen in the
early 19th century. The observations will be considered to be the population. The data are
at:
require(mosaic)
militiamen =
read.table("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/MILITIAMEN.TXT",
sep="\t", header=TRUE)
militiamen = unlist(militiamen)
Note1: if you can't remember what an ISCAM function does, pass it the argument "?". Also,
remember to look at the back of each chapter for R summaries.
load(url("http://www.rossmanchance.com/iscam3/ISCAM.RData"))
iscamsummary("?") # page 187
## Error in iscamsummary("?"): iscamsummary(x, explanatory, digits)
## This function calculates the five number summary, mean, and standard deviation
## of the quantitative variable x
## Optional: A second, categorical variable can also be specified
## and values will be calculated separately for each group.
## Optional: Specify the number of digits in output to be different from 3.
Note2: Remember not to print out messages and warnings. We already waste enough
paper! In your R Markdown code, add the following at the top of the R code chunk:
{r message=FALSE, warning=FALSE}
Note3: Similarly, shrink your plots down so that they don't take up an entire page.
At the top of the R code chunk:
{r fig.height=5, fig.width=3, fig.align='center', message=FALSE, warning=FALSE}
1. Generate three histograms by typing the following commands into R. Describe what
each histogram represents. It may help to type par(mfrow=c(3,1)) so that all three
histograms are on the same page. Note that all of the histograms are on the same scale.
hist(militiamen, xlim=c(33,48))
hist(sample(militiamen,5), xlim=c(33,48))
mil.smean <- c()
for(i in 1:10000){
  mil.samp <- sample(militiamen,5)   # draw one sample of size 5
  mil.smean[i] <- mean(mil.samp)     # store the mean of that sample
}
hist(mil.smean, xlim=c(33,48))
Explain each of these three histograms to your neighbor.
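The loop above can also be written with replicate(), which avoids growing a vector one element at a time. The sketch below is only an illustration: it uses a hypothetical, simulated population (pop, drawn with rnorm) rather than the militiamen file, and the names pop and sample.means are not from the lab. It compares the spread of the simulated sample means with the value suggested by theory, sigma/sqrt(n).

```r
# Hypothetical stand-in population (an assumption, NOT the militiamen data)
set.seed(58)
pop <- rnorm(5738, mean = 40, sd = 2)
n <- 5

# 10000 sample means, each from a fresh sample of size n
sample.means <- replicate(10000, mean(sample(pop, n)))

# Spread of the sample means next to the theoretical value sigma / sqrt(n)
c(sd.of.means = sd(sample.means), theory = sd(pop) / sqrt(n))
```

The two numbers should be close, which is one way to see why the third histogram is so much narrower than the first.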
2. Compute 10000 95% confidence intervals as follows: First, repeatedly draw samples
of size 5 from the population and compute the mean and standard deviation of each
sample.
mm.mean <- c()
mm.sd <- c()
for(i in 1:10000){
  mil.samp <- sample(militiamen,5)
  mm.mean[i] <- mean(mil.samp)
  mm.sd[i] <- sd(mil.samp)
}
Next, use the standard normal distribution multiplier z* (1.96) to compute the lower
endpoint and upper endpoint of each such interval.
lower <- mm.mean - 1.96 * sd(militiamen)/sqrt(5)
lower[1:10]
## [1] 37.80343 40.60343 37.20343 37.80343 39.00343 36.80343 38.20343
## [8] 38.00343 39.40343 36.00343
upper <- mm.mean + 1.96 * sd(militiamen)/sqrt(5)
upper[1:10]
## [1] 41.39657 44.19657 40.79657 41.39657 42.59657 40.39657 41.79657
## [8] 41.59657 42.99657 39.59657
Notice that the standard deviation of the entire population is used in the above
computation. Finally, count how many of these intervals do not contain the population
mean.
sum(mean(militiamen) < lower)
## [1] 203
sum(mean(militiamen) > upper)
## [1] 281
What is the true population average of the chest measurements? How many of the 10000
confidence intervals above captured the true mean? How many would you expect to
capture the true mean? Do the confidence intervals all have the same width?
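As a rough sketch of how a coverage rate can be computed in one step, the code below builds z-intervals with a known population standard deviation and reports the proportion that capture the mean. It runs on a hypothetical simulated population (pop), not the militiamen data, and the names half.width and covered are illustrative choices rather than anything defined in the lab.

```r
# Hypothetical stand-in population (an assumption, NOT the militiamen data)
set.seed(58)
pop <- rnorm(5738, mean = 40, sd = 2)
mu <- mean(pop)
sigma <- sd(pop)
n <- 5

# One sample mean per simulated sample of size n
xbar <- replicate(10000, mean(sample(pop, n)))

# With sigma known, every z-interval has the same half-width
half.width <- qnorm(0.975) * sigma / sqrt(n)

# TRUE when the interval [xbar - half.width, xbar + half.width] contains mu
covered <- (xbar - half.width <= mu) & (mu <= xbar + half.width)

mean(covered)   # proportion of intervals that capture mu
```

Taking the mean of a logical vector gives the proportion of TRUE values, so mean(covered) is the simulated coverage rate and sum(!covered) is the count of misses.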
3. Assume you didn't know the population standard deviation, as is highly likely in real
world applications. What would you use instead of sd(militiamen)? Hint: you've
already computed it! Using that substitution, find endpoints of 10000 new confidence
intervals. How many of these new confidence intervals contain the true mean? How
many would you expect to contain the true mean? Do the confidence intervals all have
the same width?
4. It looks like the standard normal distribution did not provide an appropriate
multiplier when the population standard deviation was not known. Instead of 1.96,
find the appropriate multiplier t* from the t distribution using the R function
iscaminvt. Repeat the computations in 3., continuing to assume that the population
standard deviation is unknown. How many of this third batch of 10000 confidence
intervals contain the true mean? How many would you expect to contain the true
mean? Do the confidence intervals all have the same width?
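To see how the multiplier depends on sample size, here is a small sketch using base R's qt() and qnorm() quantile functions. Using qt() in place of iscaminvt is a substitution made for this illustration; the sample sizes in n are arbitrary choices.

```r
# Multipliers for 95% intervals: z* does not depend on n, t* uses df = n - 1
n <- c(5, 10, 30, 100, 1000)
z.star <- qnorm(0.975)
t.star <- qt(0.975, df = n - 1)

data.frame(n = n, t.star = round(t.star, 3), z.star = round(z.star, 3))
```

The t* column shrinks toward z* as n grows, which is why the choice of multiplier matters most for small samples such as n = 5.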
5. Summarize the properties of the three different kinds of intervals computed using: z
with known population standard deviation, z with unknown population standard
deviation, and t with unknown population standard deviation. What do you think
would happen if you used t with a known population standard deviation?
To turn in
Follow up from the militiamen:
1. Consider the following two statistics:
stat1<-(mm.mean-39.832)/(sd(militiamen)/sqrt(5))
stat2<-(mm.mean-39.832)/(mm.sd/sqrt(5))
Make boxplots and histograms for each of the two statistics (making 4 separate plots is
fine, you might want to use the command par(mfrow=c(2,2))). Remember to use freq=F
in the histogram plot so that we get the density instead of the actual count.
For each of the histograms (one for stat1, one for stat2) overlay a standard normal curve
using the following command directly after each of the histogram functions.
hist(stat1, freq=F, xlim=c(-10,10))
lines(seq(-4,4,.1),dnorm(seq(-4,4,.1),0,1))
Comment on the fit of the normal curve to the tails of each of the two histograms. Also,
calculate the percent of both statistics which are above 1.96 or below -1.96. The first bit of
the command is below, but you'll need to extend the code to count everything you need.
sum(stat1 > 1.96) / length(stat1) # convince yourself that you understand this command
## [1] 0.0203
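For a theoretical reference point to compare with your simulated percentages, the sketch below computes the two-sided tail probability beyond 1.96 under a standard normal and under a t distribution with 4 degrees of freedom. The t reference applies to a standardized mean that uses the sample sd only under the assumption that the population is roughly normal; the names p.normal and p.t are illustrative.

```r
# Two-sided tail area beyond +/- 1.96 under each reference distribution
df <- 4                          # n - 1 for samples of size 5
p.normal <- 2 * pnorm(-1.96)     # standard normal
p.t <- 2 * pt(-1.96, df = df)    # t distribution with 4 degrees of freedom

round(c(normal = p.normal, t4 = p.t), 4)
```

The normal tail area is about 5%, while the t tail area with 4 degrees of freedom is more than twice that.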
2. By giving as much commentary as you think necessary (and the results from the plots
and intervals above), explain to someone who hasn't done this lab why the normal
multiplier (i.e., 1.96) doesn't "work" for creating confidence intervals, but the
t-multiplier (i.e., 2.77) does "work". Your answer should have something to do with the
coverage rate of the confidence intervals (which is what it means to "work"). That is,
you might start your discussion by saying that "A 95% confidence interval should...".
(Moving forward) a new observation:
Consider Investigation 2.6 where the goal was to produce 95% prediction intervals for a
healthy body temperature. Use the data and process above to construct 95% prediction
intervals for an individual chest measurement.
In order to have an interval wide enough to capture the individual variability (i.e., capture a
new person's temperature when they walk in the door), we need both the variance of the
points and the variability of the sample mean $\bar{X}$.
true sd of points around $\bar{X}$ = $\sqrt{\sigma^2 + \sigma^2/n}$
estimated sd of points around $\bar{X}$ = $\sqrt{s^2 + s^2/n}$
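A minimal sketch of how the estimated standard deviation above enters an interval, for one sample only. The five values in x are hypothetical numbers made up for illustration (they are not drawn from the militiamen file), and conf.int and pred.int are illustrative names.

```r
# Hypothetical sample of five chest measurements in inches (made-up values)
x <- c(38, 40, 41, 39, 42)
n <- length(x)
xbar <- mean(x)
s <- sd(x)
t.star <- qt(0.975, df = n - 1)

se.mean <- s / sqrt(n)           # estimated sd of the sample mean
se.pred <- sqrt(s^2 + s^2 / n)   # estimated sd of a new point around the sample mean

conf.int <- xbar + c(-1, 1) * t.star * se.mean   # interval for the population mean
pred.int <- xbar + c(-1, 1) * t.star * se.pred   # interval for one new observation

rbind(conf.int, pred.int)
```

Both intervals are centered at the same sample mean; the prediction interval is wider by a factor of sqrt(n + 1) because it also accounts for the variability of an individual observation.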
3. Using the t-intervals from the in-class portion of the lab, what percent (long run
average percent) of the observations are contained in each interval? (I've done most of
the work for you, but you might have to adjust the variable names: make sure you
understand what the code does!!!)
num.below.CI <- c()
num.above.CI <- c()
for(i in 1:10000){
  # confirm the correct variable names for lower and upper CI bounds
  num.below.CI[i] <- sum(militiamen < lower[i])
  num.above.CI[i] <- sum(militiamen > upper[i])
}
num.outside.CI <- num.below.CI + num.above.CI
favstats(num.outside.CI)
##   min   Q1 median   Q3  max     mean       sd     n missing
##  1903 1994   2652 2837 4618 2576.985 484.6505 10000       0
4. Using the sample standard deviation and the t* multiplier, construct 1000 prediction
intervals. For each interval, find the number of militiamen captured. Average the
capture rate to report the coverage rate of your prediction intervals. The following
code might be useful.
t <- 0.02 # this number is wrong!
mm.mean <- c()
mm.sd <- c()
num.above.zPI <- c()
num.below.zPI <- c()
for(i in 1:10000){
  mil.samp <- sample(militiamen,5)
  mm.mean[i] <- mean(mil.samp)
  mm.sd[i] <- sd(mil.samp)
  # this is wrong!!!!
  num.above.zPI[i] <- sum(militiamen > mm.mean[i] + t*mm.sd[i])
  num.below.zPI[i] <- sum(militiamen < mm.mean[i] - t*mm.sd[i])
}
num.outside.zPI <- num.above.zPI + num.below.zPI
favstats(num.outside.zPI)
##   min   Q1 median   Q3  max     mean       sd     n missing
##  4659 5738   5738 5738 5738 5538.333 406.6519 10000       0
5. In your own words, explain the conceptual difference between a confidence interval
and a prediction interval.