Probability: What is a p-value?

Suppose we run a clinical trial to compare two treatments, A and B. We get these values for the response for each patient.

Treatment A:   6  11  15  18  29  33  34  46  49  73  102    Mean = 37.82
Treatment B:  12  13  14  23  26  33  59  59  69  75   78    Mean = 41.91

The difference between the means is 41.91 - 37.82 = 4.09.

Before we begin the experiment, we assume that there is no difference between treatment A and treatment B. This is the null hypothesis: both treatment groups are random samples from the same parent population. We perform the experiment to gather evidence that there is a difference, that is, to reject the null hypothesis.

If the null hypothesis is true, then any difference between the means of the two samples is just the result of random sampling: the particular observations from the parent population that, by random selection, were assigned to A or B. You might think of this as though everyone in both A and B actually got the same treatment; for example, everyone actually got a placebo.

So we can ask the following question. If the null hypothesis is true (there is no real difference between the treatments), what is the probability that we would observe as big a difference between the means of A and B as we see in the clinical trial data just by chance?

One way to estimate this probability would be to do the following experiment (a short R sketch of this procedure appears at the end of this section).

1. Select a group of patients from the same population of patients as were used in the first trial.
2. Give each patient a placebo.
3. Randomly assign each patient the label "A" or "B".
4. Calculate the mean of those patients labeled "A".
5. Calculate the mean of those patients labeled "B".
6. Calculate the difference between the mean of group "A" and the mean of group "B".
7. Repeat this process 100 times, giving 100 random samples.

After completing the process for the 100 random samples, count the proportion of the random samples whose difference in means (A vs. B, in absolute value) is at least as extreme as the difference in the original clinical trial. If NONE of the 100 random samples gives a difference as extreme as the original clinical trial, then we can make the following statement.

If it is true that there is no difference between treatment A and treatment B (the null hypothesis), the probability that we would observe as large a difference between the means of A and B as we see in the clinical trial data just by chance is less than 1 in 100, or P < 0.01.

Here's an alternative way to say the same thing. If the null hypothesis is true (no difference between A and B), then the probability is P < 0.01 that we would observe as large a difference between the means of A and B (as in the clinical trial) just by chance.

The p-value is the probability of the observed result, or a more extreme result, if the null hypothesis is true.

Here's a simulation to illustrate the idea, and to show how to estimate p-values in practice. We'll start with the original clinical trial data that we showed above.

Original data
Treatment A:   6  11  15  18  29  33  34  46  49  73  102    Mean = 37.82
Treatment B:  12  13  14  23  26  33  59  59  69  75   78    Mean = 41.91

The difference between the means is 41.91 - 37.82 = 4.09.
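Before turning to the data we actually have, here is a minimal R sketch of the seven-step thought experiment described above. In a real trial we never observe the parent population, so the one used here (normal, mean 40, SD 25) and the trial size of 22 patients are invented purely for illustration.

# A hypothetical parent population of placebo responses (assumed, for illustration only).
parent = rnorm(10000, mean = 40, sd = 25)

diff.means = numeric(100)
for (trial in 1:100) {
  patients = sample(parent, 22)            # steps 1-2: select patients; all get placebo
  labels = sample(rep(c("A", "B"), 11))    # step 3: randomly label 11 "A" and 11 "B"
  meanA = mean(patients[labels == "A"])    # step 4
  meanB = mean(patients[labels == "B"])    # step 5
  diff.means[trial] = meanA - meanB        # step 6
}
# Proportion of the 100 random samples at least as extreme as the observed 4.09.
mean(abs(diff.means) >= 4.09)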
We are asking this question. If the null hypothesis is true (there is no real difference between the treatments), what is the probability that we would observe as large a difference between the means of A and B as we see in the clinical trial data just by chance?

We saw above a way to estimate this probability that used many samples taken from the parent population, where all the patients received placebo. Usually, we can't take many samples. Can we simulate that experiment using the data from the one clinical trial that we have? If the null hypothesis is true (the labels "A" and "B" don't mean anything), then we can randomly shuffle the labels "A" and "B" among the patients. So we can perform the following simulation.

1. Start with the original clinical trial data.
2. Randomly shuffle the labels "A" or "B" among all the patients.
   a. To do this, take a stack of index cards, label them "A" or "B" in the same proportion as in the original data, then shuffle them.
   b. Assign the (shuffled) labels to the patients.
   c. Now, some patients who were labeled "A" will be labeled "B", and vice versa. The total number of patients labeled "A" is unchanged, as is the total number labeled "B".
3. Calculate the mean of those patients labeled "A".
4. Calculate the mean of those patients labeled "B".
5. Calculate the difference between the mean of group "A" and the mean of group "B".
6. Repeat this process 100 times, giving 100 random shuffles.

After completing the 100 random shuffles, count the proportion of the shuffles whose difference in means (A vs. B, in absolute value) is greater than the difference in the original clinical trial.

Here's an example of what we get when we do one random shuffle of the labels for the original clinical trial.

First random shuffle of the data
Treatment A:  26  29  11  15  23  49  75  69  102   6  59    Mean = 42.18
Treatment B:  12  34  73  33  18  13  46  59  14  78  33    Mean = 37.55

The difference between A and B in the original clinical trial was 4.09. When we randomly shuffle the labels (which is legitimate if the null hypothesis is true), we get a difference of 37.55 - 42.18 = -4.64, with absolute value |-4.64| = 4.64. So when we randomly shuffle the labels, we get a bigger difference between the means of A and B than we did in the original clinical trial.

Let's do another random shuffle.

Second random shuffle of the data
Treatment A:  18  14  34  33  13  29  73  75  46  12  33    Mean = 34.55
Treatment B:  59  26  59  49  78  102  23   6  15  11  69    Mean = 45.18

This time the random shuffle gives us a difference of 45.18 - 34.55 = 10.64. Again, when we randomly shuffle the labels, we get a bigger difference between the means of A and B than we did in the original clinical trial.

By now we should be starting to suspect that the difference between the samples in the original clinical trial could be produced just by random sampling, even if the two treatments didn't differ. We want to do a large number of random shuffles (at least 100, preferably more) and count the proportion of shuffles whose difference in means (in absolute value) is greater than the difference in the original clinical trial.

Simulation of random shuffles using R

Here's some code in the R statistics language to do the simulation. You can install R for free by downloading it from the website http://cran.r-project.org/ In California, a convenient mirror site where you can download R is at Berkeley: http://cran.cnr.Berkeley.edu
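Once R is installed, we can reproduce the arithmetic for the first random shuffle shown above. This is a quick check of our own, not part of the original code:

# Responses carrying the shuffled "A" and "B" labels from the first shuffle.
shuffleA = c(26, 29, 11, 15, 23, 49, 75, 69, 102, 6, 59)
shuffleB = c(12, 34, 73, 33, 18, 13, 46, 59, 14, 78, 33)
mean(shuffleB) - mean(shuffleA)   # -4.64; absolute value 4.64, larger than the observed 4.09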
# Example of generating random permutations to see if an observed difference between
# the means of two groups is large compared to differences between randomly generated groups

# Here's the original data from the clinical trial, stored in a vector x.
# The first 11 values are for treatment A, the second 11 are for treatment B.
x = c(6,11,15,18,29,33,34,46,49,73,102,12,13,14,23,26,33,59,59,69,75,78)

# Calculate the difference between the means.
mean(x[12:22]) - mean(x[1:11])
[1] 4.090909

# Take 100 random samples with two groups each from x and calculate the difference
# between the means of the two groups. We use the R function sample() to perform
# the random shuffle (sampling without replacement). We store the differences between
# the means for the 100 samples in the variable diff.
# (Note: this name masks R's built-in diff() function, which is harmless here.)
diff = numeric(100)
for (permutation in 1:100) {
  y = sample(x, 22)
  diff[permutation] = mean(y[1:11]) - mean(y[12:22])
}

# Here are the differences between the randomly generated groups
print(diff)
(console output: the 100 differences, printed five per line; they range from -38.64 to 23.91)

# Here are the differences between the randomly generated groups, now sorted
sort(diff)
(console output: the same 100 differences in increasing order, from -38.64 to 23.91)
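A few one-line summaries of diff are useful at this point (a sketch; these lines were not part of the original session):

range(diff)             # smallest and largest of the 100 differences
sum(abs(diff) > 4.09)   # how many shuffles are more extreme than the observed 4.09
mean(abs(diff) > 4.09)  # the same count as a proportion: the estimated p-value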
In these random shuffles, the differences between the means range from -38.64 to 23.91.

# Here is a histogram showing the differences between the randomly generated groups
hist(diff, main = "Histogram of differences between the means \n for 100 randomly generated groups",
     xlab = "Difference between the means")

What percent of the 100 random shuffles have a difference between the means with absolute value greater than 4.09? For this set of 100 random shuffles, 69 percent do. So the probability that a random shuffle into two groups (A and B) will give a difference between the means with absolute value greater than 4.09 is P = 0.69.

Recall our earlier question: if the null hypothesis is true (there is no real difference between the treatments), what is the probability that we would observe as large a difference between the means of A and B as we see in the clinical trial data just by chance? The probability is P = 0.69. That is, even if there is no difference between the treatment groups, 69% of the time we would get a difference between the means more extreme than the value of 4.09 in the original data. This is what a p-value means.

Suppose instead that the difference between the means in the original clinical trial data had been 41. In our 100 random shuffles, the most extreme difference in the means was |-38.64| = 38.64. If the original difference in means had been 41, we would make the following statement. If the null hypothesis is true (there is no real difference between the treatments), the probability that we would observe as large a difference between the means of A and B as we see in the clinical trial data is less than 1/100, or p < 0.01.

Suppose instead that in our 100 random shuffles, four of the random differences (4/100) had been more extreme than the difference in our original clinical trial. Then we would make the following statement. If the null hypothesis is true (there is no real difference between the treatments), the probability that we would observe as large a difference between the means of A and B as we see in the clinical trial data is approximately 4/100, or p = 0.04.

We could get more precision by doing more random shuffles; 1,000 or 10,000 shuffles would give a more precise estimate (see the sketch below, which wraps the simulation in a function).

This randomization / permutation method for estimating p-values underlies the p-values that we get for t-tests, ANOVA, regression, and many other statistical tests. Doing permutations and random samples is computationally expensive. When statisticians were developing the t-test, ANOVA, regression, and similar methods, they didn't have computers to take random samples. So they developed analytical methods that were easy to calculate (by making certain assumptions, such as normal distributions) and that gave good approximations to the p-values you get by doing permutations and random samples.
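Here is a sketch of the shuffle simulation wrapped in a function, so the number of shuffles is easy to increase. The function name perm.test and its arguments are our own choices for illustration, not part of the original code:

# Estimate a two-sided permutation p-value for a difference in means.
# x: all responses; the first nA values belong to group A, the rest to group B.
perm.test = function(x, nA, n.perm = 10000) {
  groupB = (nA + 1):length(x)
  observed = mean(x[groupB]) - mean(x[1:nA])   # observed difference, B minus A
  diffs = numeric(n.perm)
  for (i in 1:n.perm) {
    y = sample(x)                              # one random shuffle of all responses
    diffs[i] = mean(y[groupB]) - mean(y[1:nA])
  }
  mean(abs(diffs) >= abs(observed))            # proportion at least as extreme
}
# For the clinical trial data stored in x:
perm.test(x, nA = 11)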
A simulation to illustrate p-values when the two treatment groups do not differ

Suppose that we have the following situation. We have one population, and everyone got a placebo. We perform an experiment: take two samples (as in a clinical trial) and perform a test (such as a t-test) to see if the treatment groups differ. Since there is no real difference between the two treatment groups (both got placebo), we expect to get a p-value of 0.05 or less in about 1 in 20 experiments (samples).

If we do a simulation, will we get a p-value < 0.05 in about 1 in 20 experiments?

# Let's generate 10,000 patients who receive a placebo.
placebo = rnorm(10000, 150, 20)
mean(placebo)
sd(placebo)

> mean(placebo)
[1] 150.1079
> sd(placebo)
[1] 20.16905

# Plot the distribution of the 10,000 values.
plot(density(placebo), xlim = c(50, 250), ylim = c(0, .025))

# Take two samples, each of size n = 30.
placebo.sample1 = sample(placebo, size = 30)
placebo.sample2 = sample(placebo, size = 30)

# Plot the two samples.
plot(density(placebo.sample1), xlim = c(50, 250), ylim = c(0, .025))
lines(density(placebo.sample2), lty = 2)

# Do a t-test to see if the two samples are different.
ttest.result = t.test(placebo.sample1, placebo.sample2, var.equal = TRUE)
ttest.result$p.value

> t.test(placebo.sample1, placebo.sample2, var.equal=TRUE)

        Two Sample t-test

data:  placebo.sample1 and placebo.sample2
t = -0.3914, df = 58, p-value = 0.697
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.750072   7.906978
sample estimates:
mean of x mean of y
 148.9798  150.9014

> ttest.result$p.value
[1] 0.6969725

The p-value of 0.70 gives us no evidence that the two samples differ; we cannot reject the null hypothesis.

# Repeat the sampling and t-test a few times.
placebo.sample1 = sample(placebo, size = 30)
placebo.sample2 = sample(placebo, size = 30)
ttest.result = t.test(placebo.sample1, placebo.sample2, var.equal = TRUE)
ttest.result$p.value

> ttest.result$p.value
[1] 0.4570366
> ttest.result$p.value
[1] 0.9336696
> ttest.result$p.value
[1] 0.0993568
> ttest.result$p.value
[1] 0.2260408

# We get many different p-values, just by taking a different sample.

# What is the probability that we will detect a significant difference (p < 0.05)
# if we take many samples of size n = 30? To answer that question, let's do a
# simulation of 1000 samples and t-tests, and look at the distribution of p-values.
pvalue.list = numeric(1000)
for (i in 1:1000) {
  placebo.sample1 = sample(placebo, size = 30)
  placebo.sample2 = sample(placebo, size = 30)
  pvalue.list[i] = t.test(placebo.sample1, placebo.sample2, var.equal = TRUE)$p.value
}

# Plot a histogram of the 1000 p-values.
hist(pvalue.list, xlim = c(0, 1), breaks = seq(0, 1, .05), ylim = c(0, 1000))

We have one population, and everyone got a placebo. We performed an experiment: take two samples (as in a clinical trial) and perform a test (such as a t-test) to see if the treatment groups differ. Since there is no difference between the two treatment groups (both got placebo), we expect to get a p-value of 0.05 or less in about 1 in 20 experiments (1/20 = 0.05).

# How many of the 1000 simulated samples give a p-value less than 0.05?
PLT05 = sum(pvalue.list < .05)
PLT05
[1] 46
PLT05/1000
[1] 0.046

In 1000 simulated experiments, we got 46 p-values less than 0.05, which is close to the expected number of 0.05 * 1000 = 50.
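As a closing design note, the same 1000-experiment simulation can be written more compactly with R's replicate() function. This is an equivalent idiom, not part of the original session:

pvalue.list = replicate(1000,
  t.test(sample(placebo, 30), sample(placebo, 30), var.equal = TRUE)$p.value)
mean(pvalue.list < 0.05)   # proportion of simulated experiments with p < 0.05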