CSSS 508: Intro to R                                                3/08/06

Lab 9: Testing

This lab is a summary of several common tests used in statistics. Reading material can be found in Dalgaard: Ch. 4, 7, and parts of 6.

T-Test: (one sample, two sample, paired)

T-tests assume that the data come from a normal distribution. The reference distribution depends on the sample size (the length of the vector(s) of data you're analyzing): a smaller sample is tested against a t-distribution with heavier tails. That is, when we have a small sample, we are more likely to see an average that is extreme or unrepresentative than when we have a larger sample.

One sample test:

We are testing whether the population mean of our sample, mu, is equal to some null hypothesis value, for example Ho: mu = 0. The null hypothesis, or previously/currently held belief, is that the mean of the population is zero. We have collected some data, hopefully a representative sample of our population, and we're going to test whether we have evidence that zero is incorrect. There are three possible alternative hypotheses:

Ha: mu < 0 ; Ha: mu > 0 ; Ha: mu != 0

The first two are one-sided alternatives; the third is two-sided. Let's test some small samples:

x<-rnorm(10,0,1)
t.test(x)

The default mu (hypothesized population mean) is zero, and the default alternative hypothesis is two-sided. The default Type I error rate (how often we allow ourselves to mistakenly reject a true null hypothesis) is 0.05; it is set through conf.level = 1 - error (default 0.95).

This t-test looked at the sample mean of x and found the probability, if the true mean were zero, of seeing a sample mean at least as extreme as the one observed. If this probability (the p-value) is small, our sample mean would be very unusual under the null hypothesis; since we saw it anyway, we conclude that the null hypothesis is wrong and that the true mean is not zero.

x2<-rnorm(10,1,1)
t.test(x2)
x3<-rnorm(10,2,1)
t.test(x3)
x4<-rnorm(10,3,1)
t.test(x4)

The two-sided alternative splits the allowed error across both tails; the one-sided tests put all of the error on one side. In the samples above, the true means move gradually further above zero.

t.test(x,alternative="greater")
t.test(x2,alternative="greater")
t.test(x3,alternative="greater")
t.test(x4,alternative="greater")

What happens if I change the alternative to "less"?

t.test(x,alternative="less")
t.test(x2,alternative="less")
t.test(x3,alternative="less")
t.test(x4,alternative="less")

Many t-tests you have seen in other procedures test the null hypothesis that a parameter equals zero (linear regression coefficients, etc.). Don't forget to specify the mu argument if you would like to test another value.

Two sample t-tests:

A common way to test for group differences is to test the equality of the group means. You might have two samples from two different groups (ex: control, treatment). The null hypothesis is then Ho: mu1 = mu2. The alternatives are Ha: mu1 > mu2, Ha: mu1 < mu2, Ha: mu1 != mu2.

g1<-rnorm(10,0,1)
g2<-rnorm(10,0.1,1)
g3<-rnorm(10,0.4,1)
g4<-rnorm(10,-0.4,1)

t.test(g1,g2)
t.test(g1,g3)
t.test(g1,g4,alternative="greater")
t.test(g4,g1,alternative="greater")

Order matters: in the last two tests, the alternative changes from "mean of g1 > mean of g4" to "mean of g4 > mean of g1". The t-test results include a confidence interval; if this interval contains zero, the two-sided t-test is not significant.

One choice on the two-sample test is whether to treat the variances of the two samples as equal. That is, are you assuming that the two groups have the same variance? If so, set var.equal = TRUE. If not, leave var.equal = FALSE (the default) and a weighted variance is calculated and used in the test.
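To see what this option changes, here is a small sketch reusing g1 and g3 from above (note that the equal-variance assumption actually holds here, since both samples were generated with standard deviation 1):

t.test(g1,g3)                   # default: variances not assumed equal (Welch test)
t.test(g1,g3,var.equal=TRUE)    # pooled-variance version

With similar sample variances the two results will usually be close; they diverge when the group variances (or sample sizes) differ substantially.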
A real-data example: energy expenditure data on 22 women characterized as either obese or lean. We'll download it from the class website: energy.dat

energy<-read.table("http://www.stat.washington.edu/rnugent/teaching/csss508/energy.dat")

We can split the data into the two subgroups using the split command.

energy.sub<-split(energy$expend,energy$stature)
t.test(energy.sub$obese,energy.sub$lean)

Paired t-test:

Often our two samples are two sets of measurements on the same subjects (ex: before/after). In this case, we don't actually have two populations to compare; we have two samples from the same population that we want to test for differences. Have the measurements on the subjects changed? (ex: Did they lose weight with an intervention program?) This question gets converted into a one sample t-test analysis where we analyze the differences between the two sets of measurements. If there is no change, we would expect the average difference to be zero. We use the t-test with the option paired=TRUE.

x<-rnorm(20,0,1)
y<-rnorm(20,0,1)
z<-rnorm(20,3,1)
t.test(x,y,paired=TRUE)
t.test(x,z,paired=TRUE)

Note: the vectors must be of the same length (otherwise we won't have pairs).

Real-life data: pre- and post-menstrual energy intake in 11 women: intake.dat

intake<-read.table("http://www.stat.washington.edu/rnugent/teaching/csss508/intake.dat")
attach(intake)
post-pre                        # look at the individual differences
t.test(pre,post,paired=TRUE)

Again, if we have a hypothesis about whether the differences should be positive or negative, we can use one of the one-sided alternative options (greater, less).

Pairwise T-Tests for All Groups

Often we have more than two groups. We can test for differences between two groups at a time, but we start to run into multiple comparison problems: performing many tests increases the probability that at least one will appear significant by chance, so unadjusted p-values look more significant than they should. pairwise.t.test computes tests for all possible two-group comparisons.

Recall the low birthweight data (birthwt); we had three levels of the race factor (white, black, other). Let's test for differences in age and in mother's weight among the three groups.

library(MASS)
attach(birthwt)
race<-factor(race,labels=c("white","black","other"))
pairwise.t.test(age,race)
pairwise.t.test(lwt,race)

There are two common adjustment methods. Bonferroni divides the significance level (often 0.05) by the number of tests to determine test significance; equivalently, the reported p-values have been multiplied by the number of tests, so you can compare them to the original significance level. Holm adjusts as it goes: the smallest p-value is adjusted for all n tests, the next for the n-1 remaining tests, and so on. Holm is the default.

Rank Test: (one sample, two sample)

The t-tests require the assumption that the data originally come from a normal distribution. If you are not willing to make that assumption, you can use a (nonparametric) rank test. These tests replace the data with their ordered ranks.
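To see what "replacing the data with ranks" means, here is a tiny illustration (the numbers are arbitrary):

dat<-c(2.3,-0.7,1.1,5.2)
rank(dat)                       # returns 3 1 2 4: each value's position in sorted order

Note that rank() is the function we want here, not order(); order() returns the permutation that sorts the data, which is a different thing.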
Wilcoxon signed rank test (one sample):

We still have a null hypothesis, Ho: mu = a. The hypothesized mu is subtracted from the data, giving us a vector of differences, and the absolute differences are ranked 1 through n. We then calculate the sum of the ranks of the positive differences and the sum of the ranks of the negative differences. If the hypothesized mean is correct, we would expect the differences to be fairly evenly split between positive and negative, so the sum of the positive ranks should be close to the sum of the negative ranks. The distribution of this rank sum is known; the Wilcoxon test finds how likely a sum as extreme as ours would be if the null hypothesis were true.

mu<-0
x<-rnorm(10,0,1)
diff<-x-mu
ranks<-rank(abs(diff))          # rank the absolute differences
sum(ranks[diff<0])
sum(ranks[diff>0])
wilcox.test(x)

Note: the results do not include a confidence interval or parameter estimates. Recall, this is a distribution-free test; no model is assumed. This test also has options to set an alternative or a mu.

Wilcoxon two sample rank test:

There is also a rank test for comparing two groups. Here the combined data are replaced with their ranks, and the sum of the ranks for one group is calculated. If the groups are similarly distributed, their rank sums should be close.

x<-rnorm(20,0,1)
y<-rnorm(20,0,1)
z<-rnorm(20,3,1)

data<-c(x,y)
ranks<-rank(data)
sum(ranks[1:20])
sum(ranks[21:40])
wilcox.test(x,y)

data<-c(x,z)
ranks<-rank(data)
sum(ranks[1:20])
sum(ranks[21:40])
wilcox.test(x,z)

Back to our energy data...

data<-c(energy.sub$obese,energy.sub$lean)
ranks<-rank(data)
sum(ranks[1:11])
sum(ranks[12:22])
wilcox.test(energy.sub$obese,energy.sub$lean)

Paired Wilcoxon Test:

The paired Wilcoxon test is the nonparametric analog of the paired t-test. It is just the Wilcoxon signed rank test applied to the differences between the two samples. Note again that the vectors must be of the same length.

x<-rnorm(20,0,1)
y<-rnorm(20,0,1)
z<-rnorm(20,3,1)

diff<-x-y
ranks<-rank(abs(diff))
sum(ranks[diff<0])
sum(ranks[diff>0])
wilcox.test(x,y,paired=TRUE)

diff<-x-z
ranks<-rank(abs(diff))
sum(ranks[diff<0])
sum(ranks[diff>0])
wilcox.test(x,z,paired=TRUE)

Looking again at our pre/post energy intake data:

wilcox.test(pre,post,paired=T)

Testing Equality of Variances:

In the two sample t-test, you had the option to assume that the variances of your two groups were the same. You may want to test whether this is true.

F test:

If two variances are equal, their ratio is one. Under normality, the ratio of the two sample variances has an F distribution. The F test takes as its null hypothesis that the ratio of the variances equals a number you set (default = 1) and finds the probability of seeing a ratio as extreme as the one observed. If that probability is small (e.g., < 0.05), you reject the null hypothesis that the variances of the two groups are equal.

x<-rnorm(30,0,1)
y<-rnorm(30,0,1)
z<-rnorm(30,0,5)
var(x)
var(y)
var(z)
var.test(x,y)
var.test(x,z)

We again have the same alternative options.

Let's look at the two sample t-test on the energy expenditure data. First, are the variances of the obese group and the lean group the same?

var.test(energy.sub$obese,energy.sub$lean)

We did not reject the null hypothesis that the variances are the same, so we use the t-test with var.equal = TRUE.

t.test(energy.sub$obese, energy.sub$lean, var.equal=TRUE)
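The same two-step decision can be written compactly in a script. A minimal sketch, assuming the energy.sub list from above (the 0.05 cutoff is just the usual convention):

# use the pooled-variance t-test only if the F test does not reject equal variances
eq<-var.test(energy.sub$obese,energy.sub$lean)$p.value > 0.05
t.test(energy.sub$obese,energy.sub$lean,var.equal=eq)

If you are unsure, leaving var.equal = FALSE (the default) is the safer choice, since it does not require the equal-variance assumption.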
Testing Tabular Data:

Single Proportions:

We have seen how to test a hypothesis about the population mean. Sometimes we want to test a hypothesis about a proportion of successes. That is, we have n trials with x successes and n-x failures; the proportion of successes is x/n. For example, we might ask 200 people whether or not they will vote for an initiative and then test whether the proportion of the population who would vote yes is high enough to enact the initiative.

Ho: p = po
Ha: p > po; Ha: p < po; Ha: p != po

n<-100
x<-23
prop.test(x,n,p=0.20)
prop.test(x,n,p=0.10)
prop.test(x,n,p=0.50)

The test uses a normal approximation. The defaults are a two-sided alternative and p = 0.50. Look at binom.test (an exact version) as well.

Two Proportions:

We can also test the equality of two proportions (the success probabilities of two groups). For example, do two neighborhoods in Seattle vote similarly for an initiative?

Ho: p1 = p2
Ha: p1 > p2; Ha: p1 < p2; Ha: p1 != p2

We create a vector of the numbers of successes and a vector of the numbers of trials. We ask 135 people in Greenlake a question; 37 say Yes. We ask 147 people in Ravenna the same question; 47 say Yes.

suc.vec<-c(37,47)
n.vec<-c(135,147)
prop.test(suc.vec,n.vec)

The default is the two-sided alternative.

k proportions:

We can extend this to k proportions, for example by asking the same question of different numbers of people in 5 Seattle neighborhoods. The null hypothesis is that all the proportions are equal; the alternative is that they are not all equal (at least one is different).

suc.vec<-c(37,47,25,63,42)
n.vec<-c(135,147,120,200,180)
prop.test(suc.vec,n.vec)

We can also test for an increasing or decreasing trend in the proportions.

suc.vec<-c(15,48,55,81)
n.vec<-c(130,210,150,180)
prop.trend.test(suc.vec,n.vec)

The third argument is the score argument (default: 1, 2, 3, ..., k), the score given to each of the groups.

r by c tables:

For categorical tables with two or more categories on both questions, we can use a chi-square test. We ask a group of people two categorical questions and are interested in whether there is a relationship between the two questions. The null hypothesis is that the two variables are independent; the alternative hypothesis is that the variables are dependent on each other.

A real-life example, caffeine consumption and marital status: caffmarital.dat

caff.marital<-read.table("http://www.stat.washington.edu/rnugent/teaching/csss508/caffmarital.dat")
res<-chisq.test(caff.marital)
res
names(res)

chisq.test finds a p-value using the chi-square distribution. If you would like an exact p-value, you might want to think about fisher.test (a small sketch appears at the end of this lab), but it can be very computationally intensive.

Please take a closer look at Chapter 6 for a more in-depth look at analysis of variance.
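As promised, a minimal sketch of fisher.test, reusing the Greenlake/Ravenna counts from the two-proportion example above (37 Yes out of 135, and 47 Yes out of 147):

tab<-matrix(c(37,98,47,100),nrow=2,byrow=TRUE)    # rows = neighborhoods, cols = Yes/No
fisher.test(tab)                # exact p-value for independence
chisq.test(tab)                 # compare with the chi-square approximation

For a table this size the exact test is fast; the computational burden only becomes an issue for larger tables.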