Lab Assignment 8

Math 58B - Introduction to Biostatistics
Spring 2015
Jo Hardin
Lab Goals:
1. To understand the process of creating t-intervals for both one sample (quantitative
response) and two independent samples (quantitative response).
2. To understand why we use a t-multiplier in the interval instead of the z-multiplier.
3. To experience the effect of sample size on the need for a t-multiplier.
4. To connect power with confidence intervals (note: as before, we can only compute
power if we know the actual value in the alternative hypothesis, and we know that
value to be true).
In class
For the first part of the lab today, we aren't following along with an investigation. Instead, you
will create confidence intervals using R to see how well you can capture the true
population parameter. Your code should mimic the simulating confidence intervals applet
we've used in class previously.
The purpose of this lab is to explore standard normal distributions versus t distributions.
Initially, the population is completely known. Additionally, the first part of the lab will
involve a single sample mean from the known population.
Scottish Militiamen (from Chance and Rossman, ISCAM): The data associated with this lab
contain the population of chest measurements (in inches) for 5738 Scottish militiamen in the
early 19th century. The observations will be considered to be the population. The data are
at:
militiamen = read.table("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/MILITIAMEN.TXT", sep="\t", header=TRUE)
militiamen = unlist(militiamen)
Note 1: if you can't remember what an ISCAM function does, pass it the argument "?". Also,
remember to look at the back of each chapter for R summaries.
load(url("http://www.rossmanchance.com/iscam2/ISCAM.RData"))
iscamsummary("?") # page 187
## Error in iscamsummary("?"): iscamsummary(x, explanatory, digits)
## This function calculates the five number summary, mean, and standard deviation
## of the quantitative variable x
## Optional: A second, categorical variable can also be specified
## and values will be calculated separately for each group.
## Optional: Specify the number of digits in output to be different from 3.
Note 2: You should be writing your R code in a script file (in RStudio) or in a text editor
(e.g., Notepad). Ideally your file will be titled something like "rcodelab7.r" or something
equally descriptive. Then you can go back to remember what you did if need be.
.1. Generate three histograms by typing the following commands into R. Describe what each
histogram represents. It may help to type par(mfrow=c(3,1)) so that all three histograms
are on the same page. Note that all of the histograms are on the same scale.
hist(militiamen,xlim=c(33,48))
hist(sample(militiamen,5),xlim=c(33,48))
mil.smean <- c()
for(i in 1:10000){
  mil.samp <- sample(militiamen,5)
  mil.smean <- c(mil.smean,mean(mil.samp))
}
# draw this histogram once, after the loop has collected all 10000 values
hist(mil.smean,xlim=c(33,48))
Explain each of these three histograms to your neighbor.
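The loop in question 1 grows mil.smean one value at a time. A minimal sketch of the same idea using replicate() follows. The population here is a simulated stand-in (an assumption, not the militiamen data; its mean and standard deviation are chosen arbitrarily) so that the code runs without downloading anything, and pop and smean are illustrative names.

```r
# Hypothetical stand-in population: NOT the real militiamen data.
set.seed(58)
pop <- rnorm(5738, mean = 40, sd = 2)

# 10000 sample means, each computed from a random sample of size 5
smean <- replicate(10000, mean(sample(pop, 5)))

# Compare the spread of individual observations with the spread of sample means
sd(pop)            # spread of the population
sd(smean)          # spread of the sample means
sd(pop) / sqrt(5)  # theoretical standard deviation of a mean of 5 observations
```

The same three hist() calls from question 1 can be applied to pop, sample(pop, 5), and smean.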
.2. Compute 10000 95% confidence intervals as follows: First, repeatedly draw samples of
size 5 from the population and compute the mean and standard deviation of each sample.
mm.mean <- c()
mm.sd <- c()
for(i in 1:10000){
mil.samp <- sample(militiamen,5)
mm.mean <- c(mm.mean,mean(mil.samp))
mm.sd <- c(mm.sd,sd(mil.samp))
}
Next, use the standard normal distribution multiplier z* (1.96) to compute the lower
endpoint and upper endpoint of each such interval.
lower<-mm.mean - 1.96 * sd(militiamen)/sqrt(5)
lower[1:10]
## [1] 38.40343 38.20343 38.60343 36.80343 38.40343 37.00343 37.20343
## [8] 39.40343 38.80343 38.40343
upper<-mm.mean + 1.96 * sd(militiamen)/sqrt(5)
upper[1:10]
## [1] 41.99657 41.79657 42.19657 40.39657 41.99657 40.59657 40.79657
## [8] 42.99657 42.39657 41.99657
Notice that the standard deviation of the entire population is used in the above
computation. Finally, count how many of these intervals do not contain the population
mean.
sum(mean(militiamen) < lower)
## [1] 186
sum(mean(militiamen) > upper)
## [1] 284
What is the true population average of the chest measurements? How many of the 10000
confidence intervals above captured the true mean? How many would you expect to
capture the true mean? Do the confidence intervals all have the same width?
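The two counts of missed intervals can be combined into a single coverage proportion. The sketch below does this for a simulated stand-in population (an assumption, so the numbers will not match the militiamen output); pop, xbar, missed, and coverage are illustrative names.

```r
# Hypothetical stand-in population: NOT the real militiamen data.
set.seed(58)
pop <- rnorm(5738, mean = 40, sd = 2)
n <- 5

# 10000 sample means from samples of size n
xbar <- replicate(10000, mean(sample(pop, n)))

# z-intervals using the known population standard deviation
half.width <- 1.96 * sd(pop) / sqrt(n)
lower <- xbar - half.width
upper <- xbar + half.width

# An interval misses if the population mean falls below or above it
missed <- sum(mean(pop) < lower) + sum(mean(pop) > upper)
coverage <- 1 - missed / length(xbar)
coverage
```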
.3. Assume you didn't know the population standard deviation, as is highly likely in real
world applications. What would you use instead of sd(militiamen)? Hint: you've already
computed it! Using that substitution, find endpoints of 10000 new confidence intervals.
How many of these new confidence intervals contain the true mean? How many would you
expect to contain the true mean? Do the confidence intervals all have the same width?
.4. It looks like the standard normal distribution did not provide an appropriate multiplier
when the population standard deviation was not known. Instead of 1.96, find the
appropriate multiplier t* from the t distribution using the R function iscaminvt. Repeat the
computations in 3., continuing to assume that the population standard deviation is
unknown. How many of this third batch of 10000 confidence intervals contain the true
mean? How many would you expect to contain the true mean? Do the confidence intervals
all have the same width?
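If ISCAM.RData is not loaded, base R's qnorm() and qt() return the same kind of multipliers. This sketch assumes a 95% interval, so it uses the 0.975 quantile, with n - 1 degrees of freedom for a one-sample interval; the sample sizes in sizes are arbitrary illustrative values.

```r
n <- 5
z.star <- qnorm(0.975)            # standard normal multiplier, about 1.96
t.star <- qt(0.975, df = n - 1)   # t multiplier with 4 degrees of freedom
round(c(z = z.star, t = t.star), 3)

# The t multiplier shrinks toward the z multiplier as the sample size grows
sizes <- c(5, 15, 30, 100, 1000)
t.by.n <- qt(0.975, df = sizes - 1)
round(setNames(t.by.n, sizes), 3)
```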
.5. Summarize the properties of the three different kinds of intervals computed using: z
with known population standard deviation, z with unknown population standard deviation,
and t with unknown population standard deviation. What do you think would happen if
you used t with a known population standard deviation?
To turn in
Follow up from the militiamen:
.1. Consider the following two statistics:
stat1<-(mm.mean-39.832)/(2.05/sqrt(5))
stat2<-(mm.mean-39.832)/(mm.sd/sqrt(5))
Make boxplots and histograms for each of the two statistics (making 4 separate plots is
fine; you might want to use the command par(mfrow=c(2,2))). Remember to use freq=F
in the histogram plot so that we get the density instead of the actual count. Also, within the
histogram and boxplot commands, use xlim=c(-10,10) to force all 4 plots to have the same
x-axis limits.
For each of the histograms (one for stat1, one for stat2) overlay a standard normal curve
using the following command directly after each of the histogram functions:
hist(stat1, freq=F, xlim=c(-10,10))
lines(seq(-4,4,.1),dnorm(seq(-4,4,.1),0,1))
Comment on the fit of the normal curve to the tails of each of the two histograms. Also,
calculate the percent of both statistics which are above 1.96 or below -1.96. The first bit of
the command is below, but you'll need to extend the code to count everything you need.
sum(stat1 > 1.96) / length(stat1) # convince yourself that you understand this command
## [1] 0.0186
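One compact way to count both tails at once is mean(abs(x) > 1.96), since the mean of a logical vector is the proportion of TRUE values. The sketch below applies it to simulated stand-in data (an assumption; pop is not the militiamen population), so the proportions are only illustrative.

```r
# Hypothetical stand-in population: NOT the real militiamen data.
set.seed(58)
pop <- rnorm(5738, mean = 40, sd = 2)
n <- 5

# Each column of 'samples' is one random sample of size n
samples <- replicate(10000, sample(pop, n))
xbar <- colMeans(samples)
s <- apply(samples, 2, sd)

# Standardize with the population sd (stat1) or each sample's own sd (stat2)
stat1 <- (xbar - mean(pop)) / (sd(pop) / sqrt(n))
stat2 <- (xbar - mean(pop)) / (s / sqrt(n))

# Proportion of each statistic above 1.96 or below -1.96
tail1 <- mean(abs(stat1) > 1.96)
tail2 <- mean(abs(stat2) > 1.96)
c(stat1 = tail1, stat2 = tail2)
```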
.2. By giving as much commentary as you think necessary (and the results from the plots
and intervals above), explain to someone who hasn't done this lab why the normal
multiplier (i.e., 1.96) doesn't "work" for creating confidence intervals, but the 𝑡-multiplier
(i.e., 2.77) does "work". Your answer should have something to do with the coverage rate of
the confidence intervals (that is what I mean by "work"). That is, you might start your
discussion by saying that "A 95% confidence interval should...".
(Moving forward) Two independent samples:
Consider the set up in Investigation 3.9. The two files
oz17 = unlist(read.table("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/oz17.txt", sep="\t"))
oz34 = unlist(read.table("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/oz34.txt", sep="\t"))
contain (let's pretend) the two populations from which these data came. The following R
code generates 10000 random samples from each of the two populations. Notice the
sample sizes.
diff.mean = c()
oz17.sd = c()
oz34.sd = c()
for(i in 1:10000){
oz17.samp = sample(oz17,20)
oz34.samp = sample(oz34,17)
diff.mean = c(diff.mean,mean(oz17.samp) - mean(oz34.samp))
oz17.sd = c(oz17.sd,sd(oz17.samp))
oz34.sd = c(oz34.sd, sd(oz34.samp))
}
.3. What is the true difference in population means? Adding code similar to that of the
militiamen, compute 10000 95% confidence intervals (for the true difference in means)
using the two populations' true standard deviations. How many of these 10000 confidence
intervals contain the true difference in population means? How many would you expect to
contain the true difference in population means?
.4. If you didn't know the two populations' standard deviations, what would you use
instead? Compute 10000 new 95% confidence intervals using the best possible estimate of
standard error that you have and the same multiplier as above (1.96). The following code
may be useful:
diff.sd = sqrt(oz17.sd^2 / 20 + oz34.sd^2 / 17)
Count how many of your intervals contain the true difference in population means. How
does that compare with how many intervals should contain the true difference?
.5. Repeat the computations in 4., but replace 1.96 with the appropriate multiplier from the
t distribution (use iscaminvt). Count how many of your intervals contain the true
difference in population means. How does that compare with how many intervals should
contain the true difference?
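For two independent samples, the degrees of freedom for the t multiplier are not simply n - 1. Two common choices are the conservative min(n1, n2) - 1 and the Welch-Satterthwaite approximation; which one this course expects is an assumption to check against your notes. A sketch for a single pair of samples, drawn from simulated stand-in populations (popA and popB are hypothetical, not the oz17 and oz34 data):

```r
# Hypothetical stand-in populations: NOT the oz17 / oz34 data.
set.seed(58)
popA <- rnorm(2000, mean = 10, sd = 3)
popB <- rnorm(2000, mean = 8, sd = 4)
nA <- 20
nB <- 17

sampA <- sample(popA, nA)
sampB <- sample(popB, nB)

# Estimated standard error of the difference in sample means
vA <- var(sampA) / nA
vB <- var(sampB) / nB
se <- sqrt(vA + vB)

# Two candidate degrees of freedom
df.cons  <- min(nA, nB) - 1                                  # conservative choice
df.welch <- (vA + vB)^2 / (vA^2 / (nA - 1) + vB^2 / (nB - 1)) # Welch-Satterthwaite

# 95% interval for the difference in means using the Welch df
t.star <- qt(0.975, df = df.welch)
ci <- (mean(sampA) - mean(sampB)) + c(-1, 1) * t.star * se
ci
```

The Welch version should agree with t.test(sampA, sampB)$conf.int, which uses the same approximation by default; the conservative df gives a slightly larger multiplier and therefore a slightly wider interval.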
.6. Recall that a confidence interval contains all values of the parameter for which we would
not reject the null hypothesis in favor of a two sided alternative hypothesis. Consider the
null hypothesis 𝐻0 : 𝜇17 − 𝜇34 = 0. How many of your confidence intervals in 5. do not
contain zero? What is the approximate power of this hypothesis test? Explain. (Remember,
here we know the true difference in population means; that is, we know the specific
alternative hypothesis to use in computing the power. You should mention the specific
alternative hypothesis in your explanation.)
Hint: go back to the definition of power. This problem will take very little additional R code.
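Because a 95% interval that excludes zero corresponds to rejecting the null hypothesis at the 5% level, the proportion of simulated intervals that exclude zero estimates the power against the true difference. A minimal sketch with simulated stand-in populations (an assumption; popA, popB, and the conservative degrees of freedom are illustrative choices, not the lab's data or a required method):

```r
# Hypothetical stand-in populations: NOT the oz17 / oz34 data.
set.seed(58)
popA <- rnorm(2000, mean = 10, sd = 3)
popB <- rnorm(2000, mean = 8, sd = 4)
nA <- 20
nB <- 17
t.star <- qt(0.975, df = min(nA, nB) - 1)  # conservative df, an assumption

# For each simulated pair of samples, record whether the interval excludes 0
reject <- replicate(10000, {
  a <- sample(popA, nA)
  b <- sample(popB, nB)
  d <- mean(a) - mean(b)
  se <- sqrt(var(a) / nA + var(b) / nB)
  (d - t.star * se > 0) || (d + t.star * se < 0)
})

# Estimated power against the specific alternative mean(popA) - mean(popB)
power.est <- mean(reject)
power.est
```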