Psych 5500/6500
Assumptions Underlying the t Test for Independent Means
Fall, 2008

Assumptions

We will look at four assumptions. The first three are mathematical in nature:

1. Independence of scores.
2. Both populations are normally distributed.
3. The two populations have the same variance.

And we will add one more:

4. The mean is meaningful.

Assumption of the Independence of Scores

This is a critical assumption, in that violating it has severe consequences for the validity of the analysis. To discuss it, let's set up a simple experiment:

Group 1: Fred, Ethel
Group 2: Sally, George

The assumption of independence for this t test is that Fred's score is independent of Ethel's (within-group independence) and is also independent of Sally's and George's (between-group independence). Technically, 'independence' means that how far Fred's score falls above or below the mean of his group cannot be predicted by knowing the same about Ethel's, Sally's, or George's scores.

Assumption of Normality

This t test is based upon the assumption that both populations are normally distributed. The effects of failing to meet this assumption are somewhat different here than they were for the t test of a single group mean, because the sampling distribution of interest now, the one whose shape will be influenced by whether the assumptions are true or not, has changed:

t test for a single group: the sampling distribution of Ȳ
t test for two independent groups: the sampling distribution of Ȳ1 − Ȳ2

Determining Normality

In a previous lecture we examined ways of determining whether or not a population is normally distributed based upon a sample. Those techniques apply here: for each group, we consider whether or not the population from which it was sampled is normally distributed.

Effects of Non-normality

Boneau, C. A. (1960). The effects of violations of assumptions underlying the t test. Psychological Bulletin, 57, 49-64.

Deviations from normality in the populations are not serious if the two populations are of the same shape, or nearly so. There seems to be little problem if both populations are symmetrical. If they are skewed, then there is a problem if they differ in skewness or have different variances, as this will cause the sampling distribution to be skewed. This problem lessens in all but the most extreme cases as the size of the groups approaches 30.

There are limitations to Boneau's approach. He examined only two-tailed tests, and as we have seen in a previous lecture, the deviations of alpha from .05 due to skewness are much more severe with one-tailed tests. Also, Boneau didn't examine the effect of violations of assumptions on the power of the experiments. Looking over the recent literature, it appears that most authors advocate that when N is relatively small and the data are clearly non-normal in a way that involves skewness, you consider either 1) normalizing the data through transformations (covered soon) or 2) using a test less affected by violations of normality (e.g., a nonparametric test, also covered soon).

Assumption of Equal Variances

The t test for independent groups assumes that both populations have the same variance, σ²1 = σ²2 (homogeneity of variance), rather than different variances, σ²1 ≠ σ²2 (heterogeneity of variance).

Detecting Heterogeneity

The easiest way to detect heterogeneity is simply to compare the variances (or standard deviations) of the two groups. Of course, any difference in the variances could just be due to chance (i.e., the population variances are the same but the sample variances are not).
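If you want the same quick per-group check outside SPSS, here is a minimal sketch in Python (assuming NumPy and SciPy are available; the Shapiro-Wilk test stands in here for the normality checks covered in the earlier lecture, and the scores are the example data used in the graphing section below):

```python
# Quick per-group checks of the normality and equal-variance
# assumptions: an unbiased variance estimate (ddof=1, i.e. the
# N-1 denominator) and a Shapiro-Wilk normality test per group.
import numpy as np
from scipy import stats

control = np.array([5, 6, 4, 1, 7, 9, 5, 13])
treatment = np.array([7, 9, 6, 11, 10, 13, 12, 6, 6])

for name, g in [("control", control), ("treatment", treatment)]:
    w, p = stats.shapiro(g)  # small p suggests non-normality
    print(f"{name}: est. variance = {np.var(g, ddof=1):.2f}, "
          f"Shapiro-Wilk W = {w:.3f} (p = {p:.3f})")
```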
There are also statistical tests that can be used to determine from the data in the two groups whether or not the population variances differ:

H0: σ²1 = σ²2
Ha: σ²1 ≠ σ²2

Tests for Heterogeneity

Two of the popular tests are the F ratio and Levene's F. We will cover these tests later in the semester, when we have enough information to understand how they work (SPSS automatically provides Levene's test when you do a t test). The tests have the familiar problem that we can 'reject H0' (prove the variances differ) but we can't 'accept H0' (prove the variances are the same). There is also the related issue of low power when the N's are small and hypersensitivity to differences in variance when the N's are very large.

Graphing Options

There are at least a couple of good graphing options for seeing how much the variances of the two groups differ:

1. Scatter plots
2. Box plots

We covered both of these in the lecture on normality. We will turn to them again, but this time we will have them show both groups on the same graph, which makes it easier to see whether the two groups differ in terms of variability of scores. The data we will be graphing are presented below. To understand these graphic techniques, we will first look at the control group by itself and then look at how both groups can be displayed on the same graph.

Data

Control Group:    5  6  4  1  7  9  5  13
Treatment Group:  7  9  6  11  10  13  12  6  6

Scatter Plot

First we will look at a scatter plot of the data in the control group. I particularly like this type of graph when the N of the samples is rather small. The disadvantage of the way SPSS does this type of graph is that when a score occurs more than once, the circles representing that score overlap, and you can't tell from the graph how many times the score occurred. For example, in the graph below it is easy to see the spread of scores, but the fact that a score of '5' occurred twice is concealed.

[Scatter plot of the control group: Y (2.5 to 12.5) by GroupName.]

We can see in the scatter plot above that the scores were more or less grouped around the center value, with one rather low score and one rather high score. Next we have SPSS plot both groups on the same scatter plot. Now we can see the spread of both groups; notice that the mean of the treatment group looks to be higher than the mean of the control group, and that the treatment group might have less variance than the control group.

[Scatter plot of both groups: Y (2.5 to 12.5) by GroupName (control, treatment).]

Box Plot

Now we will look at box plots. Again we will start by looking at just the control group and then look at displaying both groups on the same graph. I have decided to repeat a couple of slides from when we covered box plots before, to remind you how they work.

Elements of the Box Plot

The diagram below shows the elements of a box plot. The plot divides the scores into quartiles (i.e., the lowest 25% of the scores, the second 25%, the third 25%, and the highest 25%). Usually the lower whisker marks the lowest score and the upper whisker marks the highest score, but SPSS does something slightly different. The height of the box (the distance from the 25th percentile to the 75th percentile) is called the 'interquartile range'; to simplify the following description, we will call this one 'step'. SPSS draws the upper boundary at the highest score that is not more than 1.5 steps above the box. Any point above that is marked as either an 'outlier' (if it is between 1.5 and 3 steps above the box) or as an 'extreme score' (if it is more than 3 steps above the box). The same thing is done for the lower boundary.

[Diagram: elements of a box plot.]
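SPSS draws these graphs for you; as a rough stand-in outside SPSS, here is a minimal matplotlib sketch that draws both kinds of plot for the two groups (the layout is my own approximation, not an exact reproduction of the SPSS output):

```python
# Approximate the SPSS graphs in this section: a per-group scatter
# plot and side-by-side box plots of the same data.
import matplotlib.pyplot as plt

control = [5, 6, 4, 1, 7, 9, 5, 13]
treatment = [7, 9, 6, 11, 10, 13, 12, 6, 6]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# Scatter plot: one column of open circles per group. Duplicate
# scores overlap, just as in SPSS (the two 5's in control merge).
ax1.scatter([1] * len(control), control, facecolors="none", edgecolors="blue")
ax1.scatter([2] * len(treatment), treatment, facecolors="none", edgecolors="blue")
ax1.set_xlim(0.5, 2.5)
ax1.set_xticks([1, 2])
ax1.set_xticklabels(["control", "treatment"])
ax1.set_ylabel("Y")

# Box plots: whis=1.5 matches the SPSS rule described above
# (whiskers reach the most extreme score within 1.5 'steps').
ax2.boxplot([control, treatment], whis=1.5)
ax2.set_xticks([1, 2])
ax2.set_xticklabels(["control", "treatment"])
ax2.set_ylabel("Y")

plt.tight_layout()
plt.show()
```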
The box plot of the control group in our example is shown below. There are no outliers or extreme scores, so the upper and lower boundaries represent the highest and lowest scores. Note that we can see that the sample is slightly positively skewed (the spread of scores above the median is greater than the spread of scores below the median).

[Box plot of the control group: Y (2.5 to 12.5) by GroupName.]

Two Groups

We can ask SPSS to display the box plots for both groups on the same graph (shown below). We can see that the medians of the two groups are quite different (and the means probably will be too), that the groups differ in how spread out their scores are, and we can see to what degree each group is skewed. Note that the bottom whisker is missing from the treatment group; this is because the lowest score is also the 25th percentile (look back at the data: there are not many scores, and the lowest three scores are all 6's, making '6' both the lowest score and the 25th percentile).

[Box plots of both groups: Y (2.5 to 12.5) by GroupName (control, treatment).]

What to do About Heterogeneity

Fortunately, Monte Carlo studies have shown that the assumption of homogeneity can be violated with little effect on the validity of the t test so long as the two groups have the same or very similar sizes (i.e., when N1 ≈ N2). If the N's differ greatly, however, then heterogeneity can seriously affect the validity of the t test. We will take a look at what to do in that case.

Student's t test & Welch's t test

The standard t test, the one we have already covered, is sometimes known as Student's t test. It is based upon the assumption of homogeneity of variances. A t test that does not depend upon that assumption is Welch's t test, sometimes known as the t test for unequal variances.

Standard Error: Student's t

A difference between the two t tests is evident in their respective formulas for computing the standard error. We begin by repeating the formulas from Student's t:

$$\text{est.}\sigma^2_{pooled} = \frac{(N_1 - 1)\,\text{est.}\sigma^2_1 + (N_2 - 1)\,\text{est.}\sigma^2_2}{N_1 + N_2 - 2}$$

then

$$\text{est.}\sigma_{\bar{Y}_1 - \bar{Y}_2} = \sqrt{\frac{\text{est.}\sigma^2_p}{N_1} + \frac{\text{est.}\sigma^2_p}{N_2}} = \sqrt{\text{est.}\sigma^2_p \left(\frac{1}{N_1} + \frac{1}{N_2}\right)}$$

Standard Error: Welch's t

Welch's t doesn't assume that the two estimates are of the same variance, and thus doesn't pool them.

Standard error, Welch's t:

$$\text{est.}\sigma_{\bar{Y}_1 - \bar{Y}_2} = \sqrt{\frac{\text{est.}\sigma^2_1}{N_1} + \frac{\text{est.}\sigma^2_2}{N_2}}$$

Standard error, Student's t:

$$\text{est.}\sigma_{\bar{Y}_1 - \bar{Y}_2} = \sqrt{\frac{\text{est.}\sigma^2_p}{N_1} + \frac{\text{est.}\sigma^2_p}{N_2}}$$

Welch's t formula (t')

$$t'_{obt} = \frac{(\bar{Y}_1 - \bar{Y}_2) - \mu_{\bar{Y}_1 - \bar{Y}_2}}{\text{est.}\sigma_{\bar{Y}_1 - \bar{Y}_2}} = \frac{(\bar{Y}_1 - \bar{Y}_2) - \mu_{\bar{Y}_1 - \bar{Y}_2}}{\sqrt{\dfrac{\text{est.}\sigma^2_1}{N_1} + \dfrac{\text{est.}\sigma^2_2}{N_2}}}$$

When N1 = N2, the standard error comes out to be the same in Student's t and in Welch's t', and thus the obtained t is the same as well.

Welch's t Degrees of Freedom

The new standard error, however, leads to a different degrees of freedom formula. When the N's are equal and the est.σ²'s are equal, this formula gives the same df as Student's t, but when the N's differ or the est.σ²'s differ, it leads to a smaller df than Student's t (and the Welch df is often not a whole number):

$$df = \frac{\left(\dfrac{u}{N_1} + \dfrac{1}{N_2}\right)^2}{\dfrac{u^2}{N_1^2 (N_1 - 1)} + \dfrac{1}{N_2^2 (N_2 - 1)}}, \quad \text{where } u = \frac{\text{est.}\sigma^2_1}{\text{est.}\sigma^2_2}$$
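To make these formulas concrete, here is a minimal sketch of both computations in Python (assuming NumPy and SciPy; the function and variable names are mine). Run on the example data from the graphing section, it should agree with what SPSS reports for the two versions of the test:

```python
# Sketch of the formulas above: Student's pooled standard error and
# df versus Welch's unpooled standard error and df.
import numpy as np
from scipy import stats

def students_t(g1, g2):
    n1, n2 = len(g1), len(g2)
    v1, v2 = np.var(g1, ddof=1), np.var(g2, ddof=1)   # est. variances
    v_pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = np.sqrt(v_pooled / n1 + v_pooled / n2)
    t = (np.mean(g1) - np.mean(g2)) / se              # H0: mu1 - mu2 = 0
    return t, n1 + n2 - 2

def welchs_t(g1, g2):
    n1, n2 = len(g1), len(g2)
    v1, v2 = np.var(g1, ddof=1), np.var(g2, ddof=1)
    se = np.sqrt(v1 / n1 + v2 / n2)                   # no pooling
    t = (np.mean(g1) - np.mean(g2)) / se
    u = v1 / v2                                       # as in the df formula
    df = (u / n1 + 1 / n2) ** 2 / (
        u ** 2 / (n1 ** 2 * (n1 - 1)) + 1 / (n2 ** 2 * (n2 - 1)))
    return t, df

control = [5, 6, 4, 1, 7, 9, 5, 13]
treatment = [7, 9, 6, 11, 10, 13, 12, 6, 6]
for label, (t, df) in [("Student", students_t(control, treatment)),
                       ("Welch  ", welchs_t(control, treatment))]:
    p = 2 * stats.t.sf(abs(t), df)                    # two-tailed p
    print(f"{label}: t = {t:.3f}, df = {df:.2f}, p = {p:.4f}")
```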
Type I Error Rate (Monte Carlo)

From Coombs et al. (1996), as adapted and cited by Ruxton (2006):

 N1   N2   σ1   σ2   Student's t   Welch's t'
 11   11    1    1       .052          .051
 11   11    4    1       .064          .054
 11   21    1    1       .052          .051
 11   21    4    1       .155          .051
 11   21    1    4       .012          .046
 25   25    1    1       .049          .049
 25   25    4    1       .052          .048

Note that Student's t stays near the nominal .05 when the N's are equal; with unequal N's and unequal variances its Type I error rate is badly inflated when the larger variance goes with the smaller group (.155) and badly deflated when it goes with the larger group (.012), while Welch's t' stays near .05 throughout.

Which to Use?

1. If σ²1 = σ²2, then you may use either Student's or Welch's t test (but they may differ in power).
2. If σ²1 ≠ σ²2 and N1 = N2, then you may use either Student's or Welch's t test (but they may differ in power).
3. If σ²1 ≠ σ²2 and N1 ≠ N2 (differing by more than a little), then you have to use Welch's.

The challenge is that we don't know the values of σ²1 and σ²2; we only have the population variance estimates from the samples (groups), which will almost always differ at least a little.

Strategies

In my review of the literature I have run across two suggested strategies:

1. Use the more familiar Student's t test unless there is sufficient evidence that the two population variances differ, in which case use Welch's t'.
2. Always use Welch's t'; then you don't have to worry about violating the assumption of homogeneity of variance.

Strategy 1

With this strategy you usually first do a Levene's test to see if there is sufficient evidence that the population variances are not equal. If Levene's test is statistically significant (i.e., its p ≤ .05), then you have proof that the variances are different, and so you use Welch's t'; otherwise use Student's t. The logical problems associated with this (possible lack of power or hypersensitivity, and the inability to prove H0 is true) were discussed earlier.

There is also a problem associated with using one statistical test (e.g., Levene's) to determine which other statistical test is appropriate to use (e.g., Student's vs. Welch's). When tests are linked in that way, the issue of the overall probability of making an error in the analysis becomes complex. See Ruxton (2006) for references if you would like to know more.

Strategy 2

The second strategy is to simply always use Welch's t', as it doesn't depend upon the assumption that the variances are equal.

Power Considerations

Playing around with SPSS, this is what I found (keeping the difference between the two group means the same in all cases):

1. When N1 = N2 and est.σ²1 = est.σ²2, both t tests had the same p value.
2. When N1 = N2 and est.σ²1 ≠ est.σ²2, Student's t had the lower p value.
3. When N1 ≠ N2 and est.σ²1 = est.σ²2, Student's t had the lower p value.
4. When N1 ≠ N2 and est.σ²1 ≠ est.σ²2, Welch's t' had the lower p value.

Selecting a Strategy

It is a complicated choice. The advantage of Strategy 2 is that you never have to worry about whether the variances are the same or not. But, for example, if you have greatly different N's and the population variances are indeed the same, then Student's t would be appropriate and would have more power. Whichever strategy you choose, the choice must be made a priori; you can't look at the analysis and then select the approach that had the lower p value.
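For completeness, both tests are also available outside SPSS in SciPy's ttest_ind, where equal_var=True gives Student's t and equal_var=False gives Welch's t'. A quick sketch with the example data (keeping in mind that the choice between them must still be made a priori):

```python
# Student's t versus Welch's t' on the example data via SciPy.
from scipy import stats

control = [5, 6, 4, 1, 7, 9, 5, 13]
treatment = [7, 9, 6, 11, 10, 13, 12, 6, 6]

student = stats.ttest_ind(control, treatment, equal_var=True)   # Student's t
welch = stats.ttest_ind(control, treatment, equal_var=False)    # Welch's t'
print("Student's t:", student)
print("Welch's t': ", welch)
```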
Final Assumption: The Mean is Meaningful

A quote from one of my mentors (a behaviorist involved in single-subject research): "I study the behavior of individuals, you study the behavior of means." I'd like to end on a bit of a philosophical note. There is some reason to question whether or not the study of means is meaningful. There are decades' (a century's?) worth of precedent for saying that the mean is a meaningful thing to study, so I'll take the time only to present the case against it.

Santa Claus

Point one: the mean is exactly as real as Santa Claus; they are both important cultural ideas that influence our behavior, but neither really exists. There are no means out there in the territory; they exist only in our models. Counterpoint: if we take anthropologist Gregory Bateson's definition of ideas, then we would say that ideas are far more important in models of nature than are such 'real' things as physical forces and objects.

Individual Differences

Point two: the focus on means takes us away from something much more important, individual differences. To explain this point I will pick on studies that examine gender differences. We could look at, for example, a study that shows that males have better spatial abilities than females. The point here is that the statement I just made is fundamentally inaccurate in a very important way: I should have said that the mean score of males is higher than the mean score of females. We often forget to include that information in our statements, but its presence is crucial.

The t test for independent means may lead you to say that the mean of the females is lower than the mean of the males, but look at the hypothetical graph: many, many females have spatial ability scores that are above the mean score of the males, and many, many males have scores that are below the mean of the females. The mean is a huge generalization, and like all generalizations it is useful in that it makes the world much simpler, but at the expense of hiding all the specific cases in which it does not apply.

Counterpoint: analyses that focus on differences between means might limit us to just understanding the behavior of means. Multiple regression (a topic we cover in the section on the Model Comparison Approach), however, provides an approach for focusing more on individual differences.

Recommended Reading

Boneau, C. A. (1960). The effects of violations of assumptions underlying the t test. Psychological Bulletin, 57, 49-64.

Legendre, P., & Borcard, D. Appendix: t-test with Welch correction, in Statistical comparison of univariate tests of homogeneity of variances. http://biol10.biol.umontreal.ca/BIO2041e/Correction_Welch.pdf

Ruxton, G. D. (2006). The unequal variance t-test is an underused alternative to Student's t-test and the Mann–Whitney U test. Behavioral Ecology, 17, 688-690.

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350-362.