The t-test, the paired t-test, and introduction to non-parametric tests
July 8, 2004

1. The t-test: for comparing means (averages)

Comparing two means
Is the difference in means that we observe between two groups more than we’d expect to see based on chance alone?
• When background noise is high, it’s difficult to tell if the two groups are different.
• When background noise is low, it’s easier to distinguish between groups.

Comparing two means
What is the sampling distribution of the difference in the means of two samples?
First we need to know: what is the distribution of a difference between two normally distributed random variables?

~N(12, 25)
Simulation of 500 averages of 30 from a normal distribution with mean 12 and standard deviation 5 (variance 25).
$SE = 5/\sqrt{30} = .91$
Most experiments will yield a mean between 10 and 14 (±2 se).

~N(8, 25)
Simulation of 500 averages of 30 from a normal distribution with mean 8 and standard deviation 5 (variance 25).
$SE = 5/\sqrt{30} = .91$
Most experiments will yield a mean between 6 and 10 (±2 se).

Distribution of the difference…
Simulation of 500 differences between means from the above distributions.
Notice that most experiments will yield a difference value between 1 and 7 (wider than the above sampling distributions!)
$SE(\text{diff}) = \sqrt{25/30 + 25/30} = 1.29$

Distribution of differences
If X and Y are independent, with $X \sim N(\mu_x, \sigma_x^2)$ and $Y \sim N(\mu_y, \sigma_y^2)$, then
$(X - Y) \sim N(\mu_x - \mu_y,\ \sigma_x^2 + \sigma_y^2)$ and $(X + Y) \sim N(\mu_x + \mu_y,\ \sigma_x^2 + \sigma_y^2)$
• Recall that averages are normally distributed if n is large enough, by the central limit theorem.
Therefore, if $\bar{X}_n$ and $\bar{Y}_m$ are the averages of n and m subjects, respectively:
$\bar{X}_n - \bar{Y}_m \sim N\left(\mu_x - \mu_y,\ \sigma_x^2/n + \sigma_y^2/m\right)$

Example
A particular IQ test is designed to have a range of 0 to 200 with a standard deviation of 10 when given to U.S. adults. You suspect that female doctors have higher IQs than male doctors.
To test this hypothesis, you take a random sample of 30 female doctors and 30 male doctors. The women score an average of 152 and the men an average of 149. What is your conclusion?

Recall steps of a hypothesis test:
1. Define your hypotheses (null, alternative)
2. Specify your null distribution
3. Do an experiment
4. Calculate the p-value of what you observed
5. Reject or fail to reject (~accept) the null hypothesis

1. Define your hypotheses (null, alternative)
H0: ♀-doctor IQ = ♂-doctor IQ; (♀ - ♂ = 0)
Ha: ♀-doctor IQ ≠ ♂-doctor IQ; (♀ - ♂ ≠ 0) [two-sided]

2. Specify your null distribution
Null hypothesis is that the difference is zero.
$\bar{F}_{30} - \bar{M}_{30} \sim N\left(0,\ 100/30 + 100/30\right)$
$s.e.(\text{diff}) = \sqrt{100/30 + 100/30} = 2.58$

3. Do an experiment
Observed difference in our experiment = 3.0 IQ points

4. Calculate the p-value of what you observed
Z = 3/2.58 = 1.16
(FROM SAS):
    data _null_;
      x = (1 - probnorm(1.16)) * 2;
      put x;
    run;
    0.2460488061
Two-sided test! Both tails are possible, so we must double the area from one tail (two-sided p-value).

5. Reject or fail to reject (~accept) the null hypothesis
Not enough evidence to reject at the .05 significance level (.24 > .05).

Complication 1…
The harsh reality is, we hardly ever know the true standard deviation a priori. If we knew that much, we probably wouldn’t need to run an experiment!
In most cases, we must use the sample standard deviation as a stand-in for the truth.
However, by estimating the population standard deviation we are adding more uncertainty to our experiment.
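The two-sample Z-test from the doctor IQ example above can also be checked outside SAS. What follows is a minimal Python sketch, not part of the original slides: it assumes Python 3.8+ (for `statistics.NormalDist`), and the function name `two_sample_z_test` is hypothetical. It carries the unrounded Z through the calculation, so the p-value can differ in the third decimal from the SAS output above, which used the rounded value 1.16.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(mean_x, mean_y, sigma, n, m):
    """Two-sided Z-test for a difference in means when the true
    standard deviation sigma is known and shared by both groups."""
    se_diff = sqrt(sigma**2 / n + sigma**2 / m)  # s.e. of the difference
    z = (mean_x - mean_y) / se_diff              # null difference is 0
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))  # double one tail
    return se_diff, z, p_two_sided


# Doctor IQ example: women average 152, men 149, sigma = 10, 30 per group
se_diff, z, p = two_sample_z_test(152, 149, sigma=10, n=30, m=30)
print(f"s.e.(diff) = {se_diff:.2f}, Z = {z:.2f}, two-sided p = {p:.3f}")
```

The same function covers unequal group sizes through `n` and `m`; it does not apply once the standard deviation has to be estimated from the sample.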
The null distribution is slightly wider than a normal curve…called a “t-distribution.”

Recall: sample variance and standard deviation
The variance of a population: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The variance of a sample: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
The standard deviation of a sample: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Example: calculation of sample standard deviation
Systolic blood pressures: 104, 114, 120, 148, 130, 132, 143, 152, 133, 124
Mean = 1300/10 = 130
Sample standard deviation =
$\sqrt{\frac{(104-130)^2 + (114-130)^2 + (120-130)^2 + (148-130)^2 + 0 + (132-130)^2 + (143-130)^2 + (152-130)^2 + (133-130)^2 + (124-130)^2}{10-1}} = \sqrt{\frac{2058}{9}} = \sqrt{228.7} \approx 15$
Estimated standard error of the mean = $15/\sqrt{10} \approx 4.7$

The “t” probability density function
$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$
Where:
$\nu$ is the degrees of freedom
$\Gamma$ (gamma) is the Gamma function
$\pi$ is the constant Pi (3.14...)

The “t” distributions
The t distribution depends on the degrees of freedom.
Degrees of freedom here = number of observations used to calculate the standard deviation (n) minus the number of sample means (1 or 2) used in the calculation of the sample standard deviation.

The “t” distributions
The t distribution is just a slightly flattened version of the normal curve.
The t distribution is actually a family of distributions that comes closer and closer to the normal probability distribution as the degrees of freedom increase.
With n>30, the t distribution is approximately normal.
The t-function in SAS is: probt(t-statistic, df), where df is the degrees of freedom.

Example
A one-sample test when the standard deviation is unknown (one-sample t-test)

Example: One sample t-test
A British sleep researcher claims that the British sleep an average of 6.0 hours a night. If you ask 30 Brits how many hours they sleep per night and your sample average is 6.9 hours with a sample standard deviation of 3.0, do you think the researcher was mistaken in his claim?

One sample t-test
1. Specify hypotheses:
H0: average hours = 6.0
Ha: average hours ≠ 6.0 [two-sided]

One sample t-test
2. Specify null distribution.
The null distribution here follows a t-distribution with 29 (= n-1) “degrees of freedom” (the higher the number of degrees of freedom, the more the t-distribution looks like a normal curve).
$\bar{X}_{30} \sim T_{29}$, centered at 6.0 with standard error $3.0/\sqrt{30} = 0.55$

One sample t-test
3. Observed data = 6.9 hours with a sample standard deviation of 3.0

One sample t-test
4. USE SAS TO CALCULATE p-value:
$T_{29} = \frac{6.9 - 6.0}{0.55} = 1.64$
    data _null_;
      pval = 1 - probt(1.64, 29);
      put pval;
    run;
    0.0559046876
For a two-sided test, multiply by 2: p-value = .11
This is just slightly higher than the answer from the Z-test (Z=1.64), which yields a two-sided p-value of .10. The difference reflects the diminished certainty due to estimating the standard deviation.

One sample t-test
5. .11 > .05; do not reject the null at a significance level of .05

Example: two-sample t-test
In 1980, some researchers reported that “men have more mathematical ability than women” as evidenced by the 1979 SATs, where a sample of 30 random male adolescents had a mean score ± 1 standard deviation of 436±77 and 30 random female adolescents scored lower: 416±81 (genders were similar in educational backgrounds, socioeconomic status, and age). Do you agree with the authors’ conclusions?

Two-sample t-test
1. Define your hypotheses (null, alternative)
H0: ♂-♀ math SAT = 0
Ha: ♂-♀ math SAT ≠ 0 [two-sided]

Two-sample t-test
2. Specify your null distribution:
F and M have approximately equal standard deviations/variances, so make a “pooled” estimate of variance.
$s_p^2 = \frac{(n-1)s_m^2 + (m-1)s_f^2}{n+m-2} = \frac{(29)77^2 + (29)81^2}{58} = 6245$
$\bar{M}_{30} - \bar{F}_{30} \sim T_{58}\left(0,\ 6245/30 + 6245/30\right)$
$s.e.(\text{diff}) = \sqrt{6245/30 + 6245/30} = 20.4$

Two-sample t-test
3. Observed difference in our experiment = 20 points

Two-sample t-test
4. Calculate the p-value of what you observed
$T_{58} = \frac{20 - 0}{20.4} = .98$
    data _null_;
      pval = (1 - probt(.98, 58)) * 2;
      put pval;
    run;
    0.3311563454
5. Do not reject null! No evidence that men are better in math ;)

Example 2
Example: Rosenthal, R. and Jacobson, L.
(1966) Teachers’ expectancies: Determinants of pupils’ I.Q. gains. Psychological Reports, 19, 115-118.

The Experiment (note: exact numbers have been altered)
Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90).
Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as “academic bloomers” (n=18).
BUT: the children on the teachers’ lists had actually been randomly assigned to the list.
At the end of the year, the same I.Q. test was readministered.

The results
Children who had been randomly assigned to the “top-20 percent” list had a mean I.Q. increase of 12.2 points (sd=2.0), whereas children in the control group had an increase of only 8.2 points (sd=2.5).
Is this a statistically significant difference? Give a confidence interval for this difference.

1. Hypotheses
H0: mean change (“gifted”) - mean change (control) = 0
Ha: mean change (“gifted”) - mean change (control) ≠ 0

2. Null distribution
Null distribution of the difference of two means:
$s_p^2 = \frac{(17)2.0^2 + (71)2.5^2}{88} = 5.81$
$\text{“gifted”} - \text{control} \sim T_{88}\left(0,\ 5.81/18 + 5.81/72\right)$
$s.e.(\text{diff}) = \sqrt{5.81/18 + 5.81/72} = .64$

3. Empirical data
Observed difference in our experiment = 12.2 - 8.2 = 4.0

4. P-value
A t-curve with 88 df has slightly wider cutoffs for 95% area (t=1.99) than a normal curve (Z=1.96).
$t_{88} = \frac{12.2 - 8.2}{.64} = \frac{4}{.64} = 6.25$
p-value < .0001

5. Reject null!
Conclusion: I.Q. scores can bias expectancies in the teachers’ minds and cause them to unintentionally treat “bright” students differently from those seen as less bright.

Confidence interval (more information!!)
95% CI for the difference: 4.0 ± 1.99(.64) = (2.7, 5.3), using the t=1.99 cutoff for 88 df rather than Z=1.96.

2. The paired T-test

The Paired T-test
Paired data: either the same person on different occasions or pairs of people who are more similar to each other than to individuals from other pairs (husband-wife pairs, twin pairs, matched cases and controls, etc.)
For example, the paired t-test evaluates whether an observed change in mean (before vs. after) represents a true improvement (or decrease).
Null hypothesis: difference (after-before) = 0

Did the control group in the previous experiment improve at all during the year?
$t_{71} = \frac{8.2 - 0}{2.5/\sqrt{72}} = \frac{8.2}{.29} \approx 28$
p-value < .0001

Summary
                                               One sample (or paired sample)   Two samples
True standard deviation is known               One-sample Z-test               Two-sample Z-test
Standard deviation is estimated by the sample  One-sample t-test               Two-sample t-test:
                                                                               • equal variances are pooled
                                                                               • unequal variances (unpooled)

Non-parametric tests
t-tests require your outcome variable to be normally distributed (or close enough).
Non-parametric tests are based on RANKS instead of means and standard deviations (=“population parameters”).

Example: non-parametric tests
10 dieters following the Atkins diet vs. 10 dieters following Jenny Craig
Hypothetical RESULTS:
Atkins group loses an average of 34.5 lbs.
J. Craig group loses an average of 18.5 lbs.
Conclusion: Atkins is better?

Example: non-parametric tests
BUT, take a closer look at the individual data…
Atkins, change in weight (lbs): +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
J. Craig, change in weight (lbs): -8, -10, -12, -16, -18, -20, -21, -24, -26, -30

[Figure: histograms of weight change for the Jenny Craig group and the Atkins group; x-axis: Weight Change, y-axis: Percent]

t-test doesn’t work…
Comparing the mean weight loss of the two groups is not appropriate here.
The distributions do not appear to be normally distributed.
Moreover, there is an extreme outlier (this outlier influences the mean a great deal).
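To see how strongly the single extreme value drives the comparison of means, the hypothetical diet data above can be summarized with both the mean and the median. This is an illustrative Python sketch (standard library only), not part of the original slides; the variable names are my own.

```python
from statistics import mean, median

# Hypothetical weight changes (lbs) from the diet example
atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

for name, changes in [("Atkins", atkins), ("Jenny Craig", jenny_craig)]:
    print(f"{name}: mean = {mean(changes):.1f}, median = {median(changes):.1f}")

# Dropping the one extreme value shows how much it pulls the Atkins mean
atkins_trimmed = [c for c in atkins if c != -300]
print(f"Atkins without the outlier: mean = {mean(atkins_trimmed):.1f}")
```

The means reproduce the 34.5 lb and 18.5 lb average losses quoted in the example, while the medians point the other way, which is the motivation for comparing ranks rather than means.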
Statistical tests to compare ranks:
The Wilcoxon Mann-Whitney test is the analogue of the two-sample t-test.

Wilcoxon Mann-Whitney test
RANK the values, 1 being the least weight loss and 20 being the most weight loss.
Atkins
  values: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
  ranks:  1, 2, 3, 4, 5, 6, 9, 11, 12, 20
J. Craig
  values: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
  ranks:  7, 8, 10, 13, 14, 15, 16, 17, 18, 19

Wilcoxon Mann-Whitney test
Sum of Atkins ranks: 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
Sum of Jenny Craig’s ranks: 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
Jenny Craig clearly ranked higher!
P-value (from computer) = .018
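The ranking and rank sums above can be reproduced with a short Python sketch (standard library only, Python 3.8+). It is an illustration rather than the procedure used for the slides. The p-value step is an assumption on my part: it uses the usual normal approximation to the rank-sum distribution with a continuity correction. The slides do not say how the computer-reported .018 was obtained, so the approximate value should land nearby but need not match exactly.

```python
from math import sqrt
from statistics import NormalDist

atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

# Rank 1 = least weight loss (largest change), rank 20 = most weight loss.
# No two values are tied here, so no tie handling is needed.
pooled = sorted(atkins + jenny_craig, reverse=True)
rank = {value: i + 1 for i, value in enumerate(pooled)}

atkins_sum = sum(rank[v] for v in atkins)
jenny_sum = sum(rank[v] for v in jenny_craig)
print(f"Atkins rank sum = {atkins_sum}, Jenny Craig rank sum = {jenny_sum}")

# Assumed method: normal approximation to the null distribution of the
# Atkins rank sum, with a 0.5 continuity correction toward its expected value
n, m = len(atkins), len(jenny_craig)
expected = n * (n + m + 1) / 2
sd = sqrt(n * m * (n + m + 1) / 12)
z = (abs(atkins_sum - expected) - 0.5) / sd
p_two_sided = 2 * (1 - NormalDist().cdf(z))
print(f"Z = {z:.2f}, approximate two-sided p = {p_two_sided:.3f}")
```

With only 10 observations per group, statistical software may instead report an exact p-value, which is one reason a hand approximation can differ slightly.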