Dealing With Statistical Uncertainty – Practical Exercises
Richard Mott

This exercise explores properties of the Normal and T distributions. You will learn how to use some of the basic R functions associated with these distributions. You will also write a small R program to perform a permutation test and to compute a bootstrap confidence interval. Remember that R will provide help on any function func() by typing the command ?func. You can get help on all the functions associated with the Normal and T distributions by typing ?Normal and ?TDist.

a. Graph the density functions of the standard Normal distribution N(0,1) and the T distributions on 1, 2, 3, 4, 5, and 10 degrees of freedom (df), between the limits x = -3 and x = +3, on the same plot. At what df does the T distribution become similar to the Normal? Hint: To do this, first generate a dense sequence of x values with the command x <- seq(-3, 3, 0.01). Then use this as input to the functions dnorm() and dt(), e.g. to get the density for T10 use the command t10 <- dt(x, 10). To plot more than one graph on the same plot, use plot() for the first plot and lines() for the rest. You can specify different colours using the col="colour" parameter in lines(). (A short sketch illustrating these commands is given after part f.)

b. For the Normal distribution with mean 2 and variance 4: (i) what value of x satisfies Prob(X > x) = 0.05? (hint: use the function qnorm()) (ii) what is Prob(X < -1)?

c. Generate a random sample of 10 observations from the Normal distribution in (b) (mean 2, variance 4). (i) Calculate the sample mean and variance. (ii) Using the T distribution, calculate the 95% confidence interval for the sample mean (what df do you use?). (iii) Using a one-sample T test, test the null hypothesis H0: μ = 2.5. (iv) Now generate random samples of 100 and 10000 observations (save these samples into variables with different names, e.g. samp10, samp100, samp10000, as you will re-use them later) and compute the 95% CIs again. What do you notice about the change in width? Repeat your tests of H0: μ = 2.5.

d. Generate two samples of 20 observations each, one from N(0,1) and the other from N(0.5,1). (i) Test whether their means are different using the two-sample T test and the Wilcoxon rank sum test. (ii) Add an outlier, by picking a data point at random and setting it to an untypical value such as 5 or -10. Now repeat your T and Wilcoxon tests.

e. Download the Biochemistry.txt data set and read it into a data frame using read.table(). This data set contains biomarker measurements on about 2000 mice. Check you have read in the file correctly by printing the column names (with names()) and the first few lines of the table (with head()). Test whether there are GENDER effects present for Biochem.HDL using the formula interface to t.test(). The formula you will need is Biochem.HDL ~ GENDER.

f. Perform power calculations using the function power.t.test(). (i) What sample size is required for a two-sample T test (equal variances of 1, and equal numbers of observations in each group) to detect a difference between H0: μ = 2.5 and H1: μ = 4.5 at significance level α = 0.01 and power 0.9? (ii) Using this number of observations, what is the power if H1: μ = 4.0?
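For illustration only (not a complete solution), here is a minimal R sketch of the commands involved in parts (a) and (f); the colour choices and the variable names dfs, cols, pw and n.req are arbitrary examples:

# (a) density of N(0,1) together with T distributions on several df
x <- seq(-3, 3, 0.01)
plot(x, dnorm(x), type = "l", ylab = "density")
dfs  <- c(1, 2, 3, 4, 5, 10)
cols <- c("red", "orange", "green", "blue", "purple", "brown")   # example colours, one per df
for (i in seq_along(dfs)) lines(x, dt(x, dfs[i]), col = cols[i])

# (f i) sample size per group for a two-sample T test detecting a difference
# in means of 4.5 - 2.5 = 2, with sd = 1, alpha = 0.01 and power = 0.9
pw <- power.t.test(delta = 2, sd = 1, sig.level = 0.01, power = 0.9)
n.req <- ceiling(pw$n)
# (f ii) power with the same n when the difference under H1 is 4.0 - 2.5 = 1.5
power.t.test(n = n.req, delta = 1.5, sd = 1, sig.level = 0.01)$power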
g. Perform a permutation test on the two-sample data set in (d) above. You will need to write an R program to do this. Hint: First convert the data into a form suitable for permutation. The simplest way is to concatenate the two samples into a single vector, and to create a second vector of the same length with 1's and 2's in it corresponding to the two groups. Then write a function that will accept these vectors as arguments and compute the difference between the means of the two groups. Now write a loop which permutes either vector (using the sample(replace=FALSE) function) 1000 times, calls the function you wrote to compute the difference between the permuted groups, and increments a counter if this difference exceeds the real difference in the unpermuted data. Finally, print out the permutation p-value and compare it to that from the T test. (One possible skeleton is sketched at the end of this sheet.)

h. Now modify your code to compute the bootstrap 95% confidence interval of the difference in means between the two groups. To make a bootstrap sample, use sample(replace=TRUE). Note that you must not permute the data. You will need to record the difference in means of the two groups at each bootstrap iteration in a vector. Then, at the end, sort the vector to find the 95% CI (this corresponds to finding the bottom 2.5% and top 2.5% of the sorted values).

i. This exercise teaches you about simulation, in the context of a contingency table analysis. You will simulate data from a simple one-dimensional contingency table under the null hypothesis, compute the log-likelihood ratio and compare it to the asymptotic chi-squared distribution. (If you get stuck, look at the function contingency.table() in examples.R.)
(i) Simulate N = 200 observations from a multinomial distribution with K = 10 classes under the null hypothesis that all classes are equally likely. Hint: type ?Multinomial to find out how to do this.
(ii) Compute the log-likelihood ratio L = 2 Σ O log(O/E), where O and E are the observed and expected counts in each class and the sum is over the K classes.
(iii) Repeat this, say, 10000 times.
(iv) Compare the distribution of L to the chi-squared distribution on K-1 df, e.g. with a qqplot.
(v) Also compare the distribution of the chi-squared statistic Σ (O-E)²/E.
(vi) What happens to the asymptotics when the sample size N is smaller? (You will have to make your code handle zero counts.)
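Purely as an illustration of one possible approach (not the only correct solution), here is a minimal sketch for (g) and (h). It assumes the two samples from (d) are stored in vectors named x1 and x2 (hypothetical names), uses the absolute difference in means as the test statistic, and in (h) resamples each group with replacement separately:

# (g) permutation test for the difference in group means
# x1, x2: the two samples from (d) (example names)
y <- c(x1, x2)                                  # pooled data
g <- rep(1:2, c(length(x1), length(x2)))        # group labels

mean.diff <- function(y, g) mean(y[g == 1]) - mean(y[g == 2])

obs   <- mean.diff(y, g)                        # observed difference
nperm <- 1000
count <- 0
for (i in 1:nperm) {
  g.perm <- sample(g, replace = FALSE)          # permute the group labels
  if (abs(mean.diff(y, g.perm)) >= abs(obs)) count <- count + 1
}
cat("permutation p-value:", count / nperm, "\n")

# (h) bootstrap 95% CI for the difference in means (labels are NOT permuted)
nboot <- 1000
boot.diff <- numeric(nboot)
for (i in 1:nboot) {
  boot.diff[i] <- mean(sample(x1, replace = TRUE)) - mean(sample(x2, replace = TRUE))
}
boot.diff <- sort(boot.diff)
ci <- boot.diff[round(c(0.025, 0.975) * nboot)] # or quantile(boot.diff, c(0.025, 0.975))

And a similar sketch for (i), treating 0 * log(0) as 0 so that zero counts are handled:

# (i) simulate the log-likelihood ratio and Pearson statistic under the null
N <- 200; K <- 10; E <- N / K
nsim <- 10000
L  <- numeric(nsim)
X2 <- numeric(nsim)
for (i in 1:nsim) {
  O <- rmultinom(1, size = N, prob = rep(1/K, K))    # observed counts in the K classes
  L[i]  <- 2 * sum(ifelse(O == 0, 0, O * log(O / E)))
  X2[i] <- sum((O - E)^2 / E)
}
qqplot(qchisq(ppoints(nsim), df = K - 1), L,
       xlab = "chi-squared (K-1 df) quantiles", ylab = "simulated L")
abline(0, 1)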