Homework 5

Statistics 480/580
Due: Wednesday 3/3/10
Part A.
Goals


Simulate a population using a linear model.
Use Monte Carlo simulation to:
o demonstrate the variation in slope estimates from sample to sample
o show empirically the distribution of the slope estimates
o investigate the power of the t-test for the slope
Let’s generate fake trillium flower data. Suppose the model for a population is:
$y = 7.42 + 1.16\,x + \varepsilon$
where $y$ is leaf length (cm),
$x$ is stem length (cm), and
$\varepsilon$ is distributed Normal with mean 0 and sd=2. We will use the original stem lengths from the trillium
dataset.
# Import trillium data
trillium <- read.table( "http://www.humboldt.edu/~mar13/Data.dir/trillium", header=TRUE, skip=9)
attach( trillium )
# Fake data
newleaf <- 7.42 + 1.16*stem + rnorm(582,mean=0,sd=2)
plot ( stem, newleaf)
summary( lm(newleaf ~ stem ) ) # Compare coefficient to 1.16
abline( lm( newleaf ~ stem ) )
#############
## Automate process and repeat 5 more times on original graph
# Set up a graph to lay down the lines.
plot( x=c(0,10), y=c(0,25), xlab="Stem", ylab="Leaf", type="n" )
points( stem, leaf )
for( i in 1:5) # A “for-loop” to repeat commands
{
newleaf <- 7.42 + 1.16*stem + rnorm(582,mean=0,sd=2)
abline( lm( newleaf ~ stem ) )
}
#### Now repeat process 1000 times and
#### save results for slope estimates
beta1s <- rep(NA,1000) # create vector for storage
for( i in 1:1000 )
{
newleaf <- 7.42 + 1.16*stem + rnorm(582,mean=0,sd=2)
beta1s[i] <- coef( lm(newleaf ~ stem) )[2] # slope estimate for this simulated sample
}
hist( beta1s ) # distribution of coefficient estimates
sd( beta1s ) # empirical SE of slope coefficient
# compare to the SE for the slope parameter estimate done on the original data
# summary( lm( leaf~stem, data=trillium) )
mean( beta1s ) # average slope estimate, compare to 1.16
abline( v=1.16 )
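The empirical SE returned by sd( beta1s ) can also be checked against the theoretical standard error of the slope, sigma / sqrt( sum( (x - mean(x))^2 ) ). The following is a minimal, self-contained sketch of that comparison. It assumes made-up stem lengths: the vector stem_sim, its uniform range, and the seed are hypothetical stand-ins, not the trillium data.

```r
# Minimal sketch: compare the Monte Carlo SE of the slope with its theoretical value.
# ASSUMPTION: stem_sim is a made-up stand-in for the trillium stem lengths.
set.seed(480)
n        <- 582
stem_sim <- runif(n, min = 1, max = 9)   # hypothetical stem lengths (cm)
sigma    <- 2                            # sd of the noise term in the model

# Theoretical SE of the slope: sigma / sqrt( sum (x - xbar)^2 )
se_theory <- sigma / sqrt(sum((stem_sim - mean(stem_sim))^2))

# Monte Carlo: refit the model on 1000 simulated response vectors
beta1s_sim <- replicate(1000, {
  newleaf_sim <- 7.42 + 1.16 * stem_sim + rnorm(n, mean = 0, sd = sigma)
  unname(coef(lm(newleaf_sim ~ stem_sim))[2])
})
se_empirical <- sd(beta1s_sim)

# The two values should be close, but not identical
c(theory = se_theory, empirical = se_empirical)
```

With the real data, replace stem_sim by the stem column of the trillium data frame.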
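Problem: Alter the above code to calculate a t-statistic and a p-value testing the null hypothesis that the slope is 1.16 against the
two-sided alternative that it is not equal to 1.16, for the model:
$y = \beta_0 + \beta_1 x + \varepsilon$, where $y$ is leaf length and $x$ is stem length. Repeat 1000 times, saving the t-statistic and p-value each time.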
1. Show your code.
2. Create a histogram of your t-statistics.
3. Create a histogram of your p-values.
4. What fraction of the p-values are below 0.05? (These would be Type I errors.)
5. Repeat steps 2 and 4, but instead test that the slope is 1.25.
6. Repeat step 5, but with the sd of the noise component set to 5.
Hint: In the for-loop, save the output from the lm() function to an object. One way to obtain the standard error is to use
summary(lm.object)$coef[2,2]. You need to calculate your own t-statistic because you are not testing against 0.
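As an illustration of the hint, here is a minimal sketch for a single simulated data set (not the full 1000-repetition loop). It assumes made-up stem lengths; the names stem_sim, newleaf_sim, fit, and beta1_null are hypothetical and chosen only for this example.

```r
# Minimal sketch of the hint for ONE simulated data set.
# ASSUMPTION: stem_sim is a made-up stand-in for the trillium stem lengths.
set.seed(580)
n           <- 582
stem_sim    <- runif(n, min = 1, max = 9)
newleaf_sim <- 7.42 + 1.16 * stem_sim + rnorm(n, mean = 0, sd = 2)

fit   <- lm(newleaf_sim ~ stem_sim)
coefs <- summary(fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)

beta1_null <- 1.16                   # hypothesized slope under H0
beta1_hat  <- coefs[2, "Estimate"]   # estimated slope
se_beta1   <- coefs[2, "Std. Error"] # same value as summary(fit)$coef[2,2]

# t-statistic against the hypothesized slope (not against 0)
t_stat <- (beta1_hat - beta1_null) / se_beta1

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

c(t = t_stat, p = p_value)
```

The t value and p-value printed by summary() test a slope of 0, which is why they cannot be reused here.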
Part B.
Read the relevant pages in Manly’s Randomization, Bootstrap and Monte Carlo Methods in Biology.
You will duplicate his randomization test to decide if the average male mandible length is significantly larger than the
average female mandible length. The mandible lengths are:
Males:   120, 107, 110, 116, 114, 111, 113, 117, 114, 112
Females: 110, 111, 107, 108, 110, 105, 107, 106, 111, 111
Define D0 as the difference Mean(Males) - Mean(Females) for the original data.
Use 4,999 randomizations, but first start off with a few hundred until you get your bugs worked out of your code.
For this problem you will turn in:
1. Your R code: Use reasonable variable names and use the # sign to include comments inside your code. Have a
single variable, say N, for which you can specify the number of randomizations. (Hint: Use sample( ) function.)
2. Calculate the p-value, which is the proportion of all D values that are greater than or equal to D0 (including D0). Count the
occurrence of D0 as the (N+1)th observation in both the numerator and the denominator. This p-value (called the level of
significance in Manly) should be calculated automatically in your code.
3. A histogram, like the one below, which shows the distribution of the D’s. You will need to use the hist( ), paste( ),
signif( ), title( ), and abline( ) commands. The paste( ) command is used to include the p-value and the number of
simulations in the graph; e.g., hist( …., xlab=paste(N,"simulations")). The abline( ) command is used to place a vertical line at
the observed D0 value. The signif( ) command prevents the title’s p-value from having too many digits; e.g.,
title( main=paste("p-value=",signif(pval)) ).
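The steps above can be sketched as follows. This is one possible outline under stated assumptions, not Manly's own code: the seed, the variable names other than N, and the small tolerance in the comparison are choices made only for this illustration, and the plotting calls are left out.

```r
# Minimal sketch of a one-sided randomization test for a difference in means.
males   <- c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females <- c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)

set.seed(1)   # ASSUMPTION: seed chosen only so the run is reproducible
N <- 4999     # number of randomizations

D0      <- mean(males) - mean(females)   # observed difference in means
pooled  <- c(males, females)             # all 20 lengths, labels ignored
n_males <- length(males)

D <- numeric(N)                          # storage for randomized differences
for (i in seq_len(N)) {
  shuffled <- sample(pooled)             # random reallocation to the two groups
  D[i] <- mean(shuffled[1:n_males]) - mean(shuffled[-(1:n_males)])
}

# Proportion of D values >= D0, counting D0 itself as the (N+1)th value.
# ASSUMPTION: the 1e-8 tolerance only guards against floating-point ties.
pval <- (sum(D >= D0 - 1e-8) + 1) / (N + 1)

# Labels that could be passed to hist( ) and title( )
x_label    <- paste(N, "simulations")
main_label <- paste("p-value=", signif(pval, 3))
```

Start with a smaller N, such as a few hundred, while debugging, then set N back to 4999.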