Update to Research Methods Class – R. G. Bias – 11/25/2009

Seven things:

1 – I hope you all have a great Thanksgiving holiday. And I hope that when someone sitting around the table says something like "Hey, didja read where tryptophan causes people to be stupid?", you a) start asking yourself questions like "How did they operationalize 'stupid'?", and b) are polite to your dinner hosts, and choose carefully whether or not it is worth it to embarrass the poor soul who may not be so lucky as you to have learned how to be a critical consumer of research.

2 – Nov. 18th lecture – further payback. OK, so I felt really bad that my sloppiness cost us last week. The worked χ² example in Hinton, p. 243 (in my edition – it's the "worked example" in the section on "Chi-square as a 'goodness of fit' test"), is a crisper one. A bit later Hinton shows what I was trying to show when I was talking about "marginal values" being used to calculate expected values. On my p. 247 he shows the formula:

    The expected value of a cell = (row total × column total) / overall total

Here's an example I got off the Internets (http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/introduction_1.html):

The Chi-Square Test

Introduction

One of the most common and useful ways to look at information about the social world is in the format of a table. Say, for example, we want to know whether boys or girls get into trouble more often in school. There are many ways we might show information related to this question, but perhaps the most frequent and easiest to comprehend method is in a table.

             Got in Trouble    No Trouble    Total
    Boys           46              71         117
    Girls          37              83         120
    Total          83             154         237

The above example is relatively straightforward in that we can fairly quickly tell that more boys than girls got into trouble in school. Calculating percentages, we find that 39 percent of boys got into trouble (46 boys got in trouble out of 117 total boys = 39%), as compared with 31 percent of girls (37 girls got in trouble out of 120 total girls = 31%). However, to re-frame the issue, what if we wanted to test the hypothesis that boys get in trouble more often than girls in school? These figures are a good start to examining that hypothesis; however, the figures in the table are only descriptive. To examine the hypothesis, we need to employ a statistical test, the chi-square test.

About the Chi-Square Test

Generally speaking, the chi-square test is a statistical test used to examine differences with categorical variables. There are a number of features of the social world we characterize through categorical variables – religion, political preference, etc. To examine hypotheses using such variables, use the chi-square test. The chi-square test is used in two similar but distinct circumstances:

a. for estimating how closely an observed distribution matches an expected distribution – we'll refer to this as the goodness-of-fit test
b. for estimating whether two random variables are independent.

The Goodness-of-Fit Test

One of the more interesting goodness-of-fit applications of the chi-square test is to examine issues of fairness and cheating in games of chance, such as cards, dice, and roulette. Since such games usually involve wagering, there is significant incentive for people to try to rig the games, and allegations of missing cards, "loaded" dice, and "sticky" roulette wheels are all too common. So how can the goodness-of-fit test be used to examine cheating in gambling? It is easier to describe the process through an example.
Take the example of dice. Most dice used in wagering have six sides, with each side having a value of one, two, three, four, five, or six. If the die being used is fair, then the chance of any particular number coming up is the same: 1 in 6. However, if the die is loaded, then certain numbers will have a greater likelihood of appearing, while others will have a lower likelihood.

One night at the Tunisian Nights Casino, renowned gambler Jeremy Turner (a.k.a. The Missouri Master) is having a fantastic night at the craps table. In two hours of playing, he's racked up $30,000 in winnings and is showing no sign of stopping. Crowds are gathering around him to watch his streak – and The Missouri Master is telling anyone within earshot that his good luck is due to the fact that he's using the casino's lucky pair of "bruiser dice," so named because one is black and the other blue. Unbeknownst to Turner, however, a casino statistician has been quietly watching his rolls and marking down the values of each roll, noting the values of the black and blue dice separately. After 60 rolls, the statistician has become convinced that the blue die is loaded.

    Value on Blue Die    Observed Frequency    Expected Frequency
           1                     16                    10
           2                      5                    10
           3                      9                    10
           4                      7                    10
           5                      6                    10
           6                     17                    10
         Total                   60                    60

At first glance, this table would appear to be strong evidence that the blue die was, indeed, loaded. There are more 1's and 6's than expected, and fewer of the other numbers. However, it's possible that such differences occurred by chance. The chi-square statistic can be used to estimate the likelihood that the values observed on the blue die occurred by chance.

The key idea of the chi-square test is a comparison of observed and expected values. How many of something were expected and how many were observed in some process? In this case, we would expect 10 of each number to have appeared, and the values we actually observed are in the Observed Frequency column. With these sets of figures, we calculate the chi-square statistic as follows: [their formula was an image and didn't survive my cut-and-paste. RGB] Using this formula with the values in the table above gives us a value of 13.6.

[Randolph here. First off, I don't know why their formula is so complicated. We'll just use the one we find in Hinton, p. 242 – χ² = ∑ (O – E)²/E. Let's do these calculations. So for a die value of "1," O – E = 6. Square it to get 36. Divide by E (which is 10) to get 3.6. Now we have to do this for all 6 cells:

    1 – (6)²/10 = 3.6
    2 – (-5)²/10 = 2.5
    3 – (-1)²/10 = .1
    4 – (-3)²/10 = .9
    5 – (-4)²/10 = 1.6
    6 – (7)²/10 = 4.9

Now sum 'em all up: 3.6 + 2.5 + .1 + .9 + 1.6 + 4.9 = 13.6. Woo hoo, they got it right. If you are reading this and you do NOT know where I got those figures [e.g., (7)²/10], email me immediately, and we'll "discuss." Or call if you wish. I now return you to your originally scheduled stolen Chi-square discussion.]

Lastly, to determine the significance level we need to know the "degrees of freedom." In the case of the chi-square goodness-of-fit test, the number of degrees of freedom is equal to the number of terms used in calculating chi-square minus one. There were six terms in the chi-square for this problem – therefore, the number of degrees of freedom is five.

We then compare the value calculated in the formula above to a standard set of tables. The value returned from the table is 1.8%. We interpret this as meaning that if the die was fair (or not loaded), then the chance of getting a χ² statistic as large or larger than the one calculated above is only 1.8%. In other words, there's only a very slim chance that these rolls came from a fair die. The Missouri Master is in serious trouble.
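[Randolph again. If you'd like to let a computer check this arithmetic – strictly optional, I have NOT taught you any programming – here's a minimal sketch in Python. The scipy library is my assumption; any stats package will do the same job.]

    # Goodness-of-fit check for the blue die (a sketch; assumes Python with scipy).
    observed = [16, 5, 9, 7, 6, 17]    # rolls showing 1, 2, 3, 4, 5, 6
    expected = [10.0] * 6              # fair die: 60 rolls / 6 sides

    # Hinton's formula: chi-square = sum of (O - E)^2 / E over the cells
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    print(chi_sq)                      # 13.6, matching the hand calculation above

    # scipy gives the statistic and the p-value (df = 6 - 1 = 5) directly
    from scipy.stats import chisquare
    stat, p = chisquare(observed, f_exp=expected)
    print(stat, p)                     # 13.6, p about 0.018 -- the 1.8% from the table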
Recap

To recap the steps used in calculating a goodness-of-fit test with chi-square:

1. Establish hypotheses.
2. Calculate the chi-square statistic. Doing so requires knowing:
   o The number of observations
   o Expected values
   o Observed values
3. Assess the significance level. Doing so requires knowing the number of degrees of freedom.
4. Finally, decide whether to accept or reject the null hypothesis.

Testing Independence

The other primary use of the chi-square test is to examine whether two variables are independent or not. What does it mean to be independent, in this sense? It means that the two factors are not related. Typically in social science research, we're interested in finding factors that are related – education and income, occupation and prestige, age and voting behavior. In this case, the chi-square can be used to assess whether two variables are independent or not.

More generally, we say that variable Y is "not correlated with" or "independent of" the variable X if more of one is not associated with more of another. If two categorical variables are correlated, their values tend to move together, either in the same direction or in opposite directions.

Example

Return to the example discussed at the introduction to chi-square, in which we want to know whether boys or girls get into trouble more often in school. Below is the table documenting the number of boys and girls who got into trouble in school:

             Got in Trouble    No Trouble    Total
    Boys           46              71         117
    Girls          37              83         120
    Total          83             154         237

To examine statistically whether boys got in trouble in school more often, we need to frame the question in terms of hypotheses.

1. Establish Hypotheses

As in the goodness-of-fit chi-square test, the first step of the chi-square test for independence is to establish hypotheses. The null hypothesis is that the two variables are independent – or, in this particular case, that the likelihood of getting in trouble is the same for boys and girls. The alternative hypothesis to be tested is that the likelihood of getting in trouble is not the same for boys and girls.

Cautionary Note

It is important to keep in mind that the chi-square test only tests whether two variables are independent. It cannot address questions of which is greater or less. Using the chi-square test, we cannot evaluate directly the hypothesis that boys get in trouble more than girls; rather, the test (strictly speaking) can only test whether the two variables are independent or not.

2. Calculate the expected value for each cell of the table

As with the goodness-of-fit example described earlier, the key idea of the chi-square test for independence is a comparison of observed and expected values. How many of something were expected and how many were observed in some process? In the case of tabular data, however, we usually do not know what the distribution should look like (as we did with rolls of dice). Rather, in this use of the chi-square test, expected values are calculated based on the row and column totals from the table. The expected value for each cell of the table can be calculated using the following formula:

    Expected value of a cell = (row total × column total) / overall total

For example, in the table comparing the number of boys and girls in trouble, the expected count for the number of boys who got in trouble is:

    (117 × 83) / 237 = 40.97

The first step, then, in calculating the chi-square statistic in a test for independence is generating the expected value for each cell of the table.
Presented in the table below are the expected values (in parentheses and italics) for each cell:

             Got in Trouble    No Trouble     Total
    Boys      46 (40.97)       71 (76.02)      117
    Girls     37 (42.03)       83 (77.97)      120
    Total         83              154          237

This is Randolph. Let me jump in here and check their work. So (117 × 83)/237 = 9711/237 = 40.97. Check. (117 × 154)/237 = 76.02. Check. (120 × 83)/237 = 42.03. Check. (I learned to round to the nearest even number, so it woulda been 42.02, but close enough.) (120 × 154)/237 = 77.97. Excellent.

NOTE: As I was trying to say last Wednesday, what this accomplishes is, basically, to say, "OK, given that a total of 83 of the 237 kids in our sample got in trouble – i.e., 35.02% of them – if there was NO effect of gender, then we would EXPECT 35.02% of the (117) boys, i.e., 40.97 of 'em, to get in trouble, and 35.02% of the (120) girls, or 42.03 of them." So, we expected about 41 boys to be in trouble (given the marginal values) and actually it was 46. We expected about 42 girls to have gotten in trouble, and we found 37. Is this going to yield a significant effect of gender? I'll bet not.

3. Calculate Chi-square statistic

With these sets of figures, we calculate the chi-square statistic as follows: [another formula image that didn't survive the cut-and-paste – it's our same χ² = ∑ (O – E)²/E, summed over the four cells. RGB] In the example above, we get a chi-square statistic equal to: [their worked calculation was an image, too; carrying it out gives χ² = (46 – 40.97)²/40.97 + (71 – 76.02)²/76.02 + (37 – 42.03)²/42.03 + (83 – 77.97)²/77.97 ≈ 1.87. RGB] That's what I got. Yikes – up in the table they found "76.02" – c'mon, this class can't abide any more sloppiness!

4. Assess significance level

Lastly, to determine the significance level we need to know the "degrees of freedom." In the case of the chi-square test of independence, the number of degrees of freedom is equal to the number of columns in the table minus one, multiplied by the number of rows in the table minus one. [Note, they had a cut-and-paste problem on the web site – I've replaced their conclusion with my own. RGB.] In this table, there were two rows and two columns. Therefore, the number of degrees of freedom is:

    df = (# rows – 1) × (# columns – 1) = 1 × 1 = 1.

The table value (critical value) for chi-square with 1 df at α = .05 is 3.84. Our calculated χ² of about 1.87 falls short of that, so we cannot reject the null hypothesis (we thought not!!), and we conclude that we have no data to suggest that (within the constraints of our experimental design) the likelihood of getting in trouble is affected by gender.

Recap

To recap the steps used in calculating a chi-square test of independence:

1. Establish hypotheses.
2. Calculate expected values for each cell of the table.
3. Calculate the chi-square statistic. Doing so requires knowing:
   a. The number of observations
   b. Observed values
4. Assess the significance level. Doing so requires knowing the number of degrees of freedom.
5. Finally, decide whether to accept or reject the null hypothesis.
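[And one more machine check, this time for the test of independence. Again, this is a sketch under my assumption of Python with scipy, not anything I'll test you on; the hand-rolled loop shows exactly where the expected values come from.]

    # Chi-square test of independence for the boys/girls table
    # (a sketch; assumes Python with scipy).
    table = [[46, 71],   # boys:  got in trouble, no trouble
             [37, 83]]   # girls: got in trouble, no trouble

    row_totals = [sum(row) for row in table]          # 117, 120
    col_totals = [sum(col) for col in zip(*table)]    # 83, 154
    overall = sum(row_totals)                         # 237

    # expected cell = (row total * column total) / overall total
    chi_sq = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / overall
            chi_sq += (table[i][j] - e) ** 2 / e
    print(chi_sq)    # about 1.87 -- well under the 3.84 critical value for 1 df

    # scipy's one-call version (correction=False matches the hand method)
    from scipy.stats import chi2_contingency
    stat, p, df, expected = chi2_contingency(table, correction=False)
    print(stat, p, df)    # about 1.87, p about 0.17, df = 1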
3 – One thing that I didn't say in class, and indeed I can't find it in either textbook (I know it's in there, somewhere), is that all t test scores must be positive. (You could've chosen to do X2 – X1 instead of X1 – X2.) So, if you end up with a negative numerator in the calculation of your t value, just make it positive.

4 – One thing we didn't talk about in class but that I want you to know about is "effect size." Read Hinton (pp. 96–97 in my edition) first, then read S, Z, and Z, pp. 242–243, and then pp. 411–413. No, I won't ask you to calculate one. Just know that Cohen's d is a measure of effect size, and that .2 is small, .5 is medium, and .8 is large. (Yikes – be careful. This is "Cohen's d," not to be confused with the "d" scores, the difference scores, that we'll attend to in a minute.)

5 – Post hoc tests. Someone in class asked me about post hoc tests. Geoff asked me, online, whether they could use a Tukey test to perform pairwise comparisons after finding a significant ANOVA for their independent variable with three levels. The answer is "yes," and the beginning of Ch. 12 in Hinton describes it pretty well. No, this won't be on the final.

6 – Formulae. Thanks tons to Daniel for these. This is how they'll appear on the final (unless someone unearths a typo between now and then). The only t test formula you'll have to use is the one for related groups or the single-sample one – NOT the one for independent groups.

    Mean (population and sample)
    Semi-interquartile range
    Standard deviation (population and sample)
    Standard error of the mean
    z-score
    One-sample t test
    Two-sample t test (independent groups)
    Two-sample t test (related groups)
    Confidence interval
    F-score (ANOVA)

7 – Sample questions/problems:

In a controlled experiment to test several treatments for a difference in mean response variable value, the following partial ANOVA table is produced. Complete the table and give the elements of a test of the claim that the treatments produce different means.

    Source            Sum of Squares    DF    Mean Square    F Value
    Between groups         57.08          3
    Within groups          77.55         15
    Total                 134.63         18

Answer:

    Source            Sum of Squares    DF    Mean Square    F Value
    Between groups         57.08          3      19.03         3.68
    Within groups          77.55         15       5.17
    Total                 134.63         18

The null hypothesis states that there was no effect of the independent variable – the four (see where I got that? If not, tell me) groups did not vary in their scores (the dependent variable). The alternative hypothesis is that there IS some effect, the means DO differ – put another way, the four sample groups represent different underlying population distributions. Going into the F table, with alpha = .05, the critical value of F with 3 and 15 df is 3.29. Our observed (calculated) F exceeds the table value, so we reject the null hypothesis and conclude that there IS an effect of "groups," i.e., an effect of our Independent Variable.
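[If you want the machine to complete that table for you, here's a minimal Python sketch – the mean squares are just SS/df, and F is their ratio. The scipy call at the end, for the critical value, is my assumption; you can just as well read 3.29 off the F table. RGB]

    # Completing the partial ANOVA table (a sketch; assumes Python with scipy).
    ss_between, df_between = 57.08, 3
    ss_within,  df_within  = 77.55, 15

    ms_between = ss_between / df_between    # 19.03  (mean square = SS / df)
    ms_within  = ss_within / df_within      # 5.17
    f_value    = ms_between / ms_within     # 3.68
    print(ms_between, ms_within, f_value)

    # Critical value of F(3, 15) at alpha = .05
    from scipy.stats import f
    print(f.ppf(0.95, df_between, df_within))    # about 3.29; 3.68 > 3.29, so reject H0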
Another problem: a one-way analysis of variance test. [NB: "One-way" here does NOT mean "one-tailed." It just means we are testing just one IV.] [Just read this one – don't work it out.]

A psychologist is studying the effectiveness of three methods of reducing smoking. He wants to determine whether the mean reduction in the number of cigarettes smoked daily differs from one method to another among men patients. Sixteen men are included in the experiment. Each smoked 60 cigarettes a day before treatment. Four randomly chosen members of the group pursue method I; four pursue method II; and four pursue method III. [Note: What goes unspoken here is that four pursue NO method – this is the control group. I don't know why these four people are left out of the analysis; below we see just 2 between-group df, so there were just 3 groups tested. They shoulda done a one-way ANOVA with four groups.] The results are as follows:

    Method I      50  51  51  52
    Method II     41  40  39  40
    Method III    49  47  45  47

Use a one-way analysis of variance to test whether the mean reduction in the number of cigarettes smoked daily is equal for the three methods. (Let the significance level equal .05.)

SOLUTION: The mean reduction for the first method is 51; for the second method it is 40; and for the third method it is 47. The mean for all methods combined is 46. Thus, the between-group sum of squares equals 4[(51 – 46)² + (40 – 46)² + (47 – 46)²], or 248. The within-group sum of squares equals (50 – 51)² + (51 – 51)² + (51 – 51)² + (52 – 51)² + (41 – 40)² + (40 – 40)² + (39 – 40)² + (40 – 40)² + (49 – 47)² + (47 – 47)² + (45 – 47)² + (47 – 47)² = 12. Thus, the analysis-of-variance table is:

    Source of variation    Sum of squares    Degrees of freedom    Mean square    F
    Between groups               248                  2                124         93
    Within groups                 12                  9               1.33
    Total                        260                 11

Since there are 2 and 9 degrees of freedom, F.05 = 4.26. Since the observed value of F far exceeds this amount, the psychologist should reject the null hypothesis that the mean reduction in the number of cigarettes smoked daily is the same for the three methods.

Source: Mansfield, Edwin. Basic Statistics with Applications. New York: W. W. Norton & Co., 1986, p. 424. Submitted by Clarke Iakovakis.
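[Here, too, a few lines of Python will reproduce the whole solution. The by-hand lines match the sums of squares above; f_oneway is scipy's one-way ANOVA, and again the library is my assumption. RGB]

    # One-way ANOVA on the smoking data (a sketch; assumes Python with scipy).
    method1 = [50, 51, 51, 52]
    method2 = [41, 40, 39, 40]
    method3 = [49, 47, 45, 47]
    groups = [method1, method2, method3]

    # Sums of squares, matching the hand solution above
    grand_mean = sum(sum(g) for g in groups) / 12.0                   # 46
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)                                 # 248
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups for x in g)                       # 12
    print(ss_between, ss_within)

    # scipy does the whole test in one call
    from scipy.stats import f_oneway
    f_value, p = f_oneway(method1, method2, method3)
    print(f_value, p)    # F = 93, p far below .05 -- reject the null hypothesis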
One last one – a t test example. Full-on. Full out. The full Monty. Full-contact t tests!

Let's say I randomly sample 7 students from a, oh, say, Research Methods class, at the 13-week point in the semester. I give them all a final exam. Their scores are 73, 100, 93, 86, 67, 80, and 89. Then I gave the same 7 students a 15-page document including various notes, sample problems, and an attempted recovery from a poor lecture. (Heh.) After reading this the students took another, similar (matched for difficulty) final. On this their scores were 83, 96, 95, 95, 82, 80, and 92. (In fact, worried about how well the two tests are matched, I'm going to give 4 of the test subjects test A first, and then test B after they have access to the study sheet. The other 3 subjects will take test B first, and then test A after they have access to the study sheet. That is, I will [almost perfectly, given the odd number of subjects] counterbalance for the particular test taken.)

Did the 15-page document influence performance on a Research Methods final? My independent variable is the 15-page study document, with two levels (present and not present). My dependent variable is score on a Research Methods final. This is a within-subjects (i.e., repeated measures) experimental design; each test participant served in each group, i.e., saw both levels of the treatment (i.e., they DIDN'T have the study sheet, and then they did). I am going to use an inferential statistic, t, to test the null hypothesis that the 15-page document did NOT influence scores on the Research Methods final.

H0: µ1 = µ2, where µ1 and µ2 are the population means of Research Methods students without and with the aid of the study document.
Ha: µ1 ≠ µ2

    Test participant    Score on 1st final (X1)    Score on 2nd final, after having the benefit of the study sheet (X2)
          1                      73                          83
          2                     100                          96
          3                      93                          95
          4                      86                          95
          5                      67                          82
          6                      80                          80
          7                      89                          92

First, let's eyeball this. Hmm, 7 test subjects: 1 did worse after the study sheet, 1 did the same, 5 did better, some of them quite a bit better. I'm thinking this is going to be close. (Seven subjects is a small sample, and thus requires a pretty hefty t score to exceed the critical value.)

Just for grins, let's treat this as two independent groups. Here's the (somewhat daunting) formula: [the formula itself was an image that didn't survive – it's the two-sample t test (independent groups) from the formula list in item 6. RGB]

So. Numerator is easy. The mean of the "without study sheet" group is 588/7 = 84. The mean of the "with the study sheet" group is 623/7 = 89. So, the people scored on the average 5 points higher when they took the test after having had access to the study sheet. So (switching to M from "X bar" 'cause I can't enter "X bar" – Daniel, maybe if you gave me all the component parts of the formulae . . .), M1 – M2 = -5. But as I said above, just make it positive – the numerator is 5. That was easy. Now for the denominator. Here's a cut-and-paste from an xls spreadsheet that might help us:

    X1      (X1)²     X2      (X2)²
     73      5329      83      6889
    100     10000      96      9216
     93      8649      95      9025
     86      7396      95      9025
     67      4489      82      6724
     80      6400      80      6400
     89      7921      92      8464
    588     50184     623     55743
     84 (M)            89 (M)

OK, I'll do this by hand and scan it in. [The scan didn't make it into this text version; the denominator works out to about 5.09, so t = 5/5.09. RGB] So, t (df = 12) = .98. The critical value for a two-tailed t with α = .05 and df = 12 is 2.179. So we fail to reject our null hypothesis and infer that we have no data to support the notion that this study sheet influenced performance on a Research Methods final. (Dang, don't you wish you hadn't spent so much time on it?!)

"But," you say, "given that you said this was a repeated measures design, why can't we just use the two-sample t test for related groups?" Great idea. So now we gotta calculate those "d" scores ("difference," "delta," Δ). (To make 'em positive, I'm gonna take X2 – X1.)

Two-sample t test (related groups). Spreadsheet help, please:

    Test participant     X1      X2     d (X2 – X1)     d²
          1              73      83         10         100
          2             100      96         -4          16
          3              93      95          2           4
          4              86      95          9          81
          5              67      82         15         225
          6              80      80          0           0
          7              89      92          3           9
        Total           588     623         35         435
        M                84      89          5

And hand calculations, using the SIMPLER t test formula: [again the scan is missing; the formula is t = M of the d scores divided by (S of the d scores / √N) = 5/(6.58/2.64) = 5/2.49. RGB] So, t (df = 6 – now that we have related groups!) = 2.01. The critical value for a two-tailed t with α = .05 and df = 6 is 2.447. So, we cannot reject the null hypothesis, just as before.

Two more things:

1 – It seems as though this latter analysis was CLOSER to significant. (Purists would say that being "almost significant" is like being "kinda pregnant" – it either is or is not significant. But still . . . .) I THINK this is because variance that is grouped in with the "error variance" in an independent groups t test gets factored out in a related groups test. That is, we pretended that those pairs of scores were totally independent, and so there would be some variability among all 14 scores. But in fact there was some dependence among pairs of scores – each pair came from the same person. And so there was less error variance.

2 – OK, watch this. Just under the aforementioned "Full Monty" rubric, let's do a confidence interval on these d scores. OK, we know that M is 5. We can find the value of t in the table – t for 6 df is 2.447. SE is S divided by the square root of N (i.e., the square root of 7, i.e., 2.64). OK, so all we need is S. Another scanned file: [also missing; it shows S, the standard deviation of the d scores, coming out to about 6.58. RGB] OK, now we can calculate the confidence interval.

    CI = 5 +/- (2.447)(6.58/2.64) = 5 +/- (2.447)(2.49) = 5 +/- 6.09 = (-1.09, 11.09)

That is, we are 95% certain that the true population mean for the difference scores lies between -1.09 and 11.09. So what? What does this tell you? (I have NOT told you this in class, but it is in the texts.) What this tells us is, ZERO is within this interval!! That is, the 95% confidence interval contains zero – the true underlying mean for the distribution of difference scores, given our experimental conditions, MIGHT WELL be zero. Thus we see again that we should NOT reject the null hypothesis. If the CI did NOT contain zero then we would reject the null hypothesis.
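[And, to close the loop on the Full Monty: here's one Python sketch that reproduces the independent-groups t, the related-groups t, and the confidence interval on the d scores, all in one go. As before, scipy is my assumption. RGB]

    # Both t tests and the CI on the d scores (a sketch; assumes Python with scipy).
    from math import sqrt
    from scipy import stats

    no_sheet   = [73, 100, 93, 86, 67, 80, 89]    # X1: scores without the study sheet
    with_sheet = [83, 96, 95, 95, 82, 80, 92]     # X2: scores with the study sheet

    # (Wrongly) treating the two sets of scores as independent groups:
    t_ind, p_ind = stats.ttest_ind(no_sheet, with_sheet)
    print(abs(t_ind), p_ind)    # t = .98 on 12 df; critical value 2.179 -- not significant

    # (Correctly) treating them as related groups:
    t_rel, p_rel = stats.ttest_rel(with_sheet, no_sheet)
    print(t_rel, p_rel)         # t = 2.01 on 6 df; critical value 2.447 -- still not significant

    # 95% confidence interval on the d scores
    d = [b - a for a, b in zip(no_sheet, with_sheet)]        # 10, -4, 2, 9, 15, 0, 3
    n = len(d)
    mean_d = sum(d) / n                                      # 5
    s = sqrt(sum((x - mean_d) ** 2 for x in d) / (n - 1))    # about 6.58
    t_crit = stats.t.ppf(0.975, n - 1)                       # 2.447 for 6 df
    half = t_crit * s / sqrt(n)                              # about 6.09
    print(mean_d - half, mean_d + half)    # about (-1.09, 11.09); contains zero -> do not reject H0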
But THEN you say, "But Dr. Bias – I'm not so sure this question lends itself to a repeated measures design. I mean, if nothing else, there was a time confound – everyone took the second test after he/she took the first one (time being linear, as it is, and time travel being so, uh, unlikely). (Note I didn't say 'impossible' – you've taught us not to accept the null hypothesis, so all we can say is that there has been no compelling evidence to date of time travel.) But more, maybe there was something about taking the FIRST test that influenced performance on the second test, in addition to whatever effects the study sheet might have had. Either practice, or fatigue? That is, aren't there likely to be order effects here, in your design?"

To which I say, "Gawd, I love it when you critically consider research design."

OK, it's Thanksgiving. Study hard.

Randolph.