Update to Research Methods Class – R. G. Bias – 11/25/2009
Seven things:
1 – I hope you all have a great Thanksgiving holiday. And I hope that when someone
sitting around the table says something like “Hey, didja read where tryptophan causes
people to be stupid?,” that you a) start asking yourself questions like “How did they
operationalize ‘stupid’?,” and b) are polite to your dinner hosts, and choose carefully
whether or not it is worth it to embarrass the poor soul who may not be so lucky as you to
have learned how to be a critical consumer of research.
2 – Nov. 18th lecture – further payback. OK, so I felt really bad that my sloppiness
cost us last week. The worked χ² example in Hinton, p. 243 (in my edition – it's the
"worked example" in the section on chi-square as a "goodness of fit" test) is a crisper
one. A bit later Hinton shows what I was trying to show when I was talking about
“marginal values” being used to calculate expected values. On my p. 247 he shows the
formula:
The expected value of a cell = (row total × column total) / overall total
Here’s an example I got off the Internets
(http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/introduction_1.html):
The Chi-Square Test
Introduction
One of the most common and useful ways to look at information about the social world is in
the format of a table. Say, for example, we want to know whether boys or girls get into
trouble more often in school. There are many ways we might show information related to this
question, but perhaps the most frequent and easiest to comprehend method is in a table.
            Got in Trouble   No Trouble   Total
Boys              46             71        117
Girls             37             83        120
Total             83            154        237
The above example is relatively straightforward in that we can fairly quickly tell that more
boys than girls got into trouble in school. Calculating percentages, we find that 39 percent of
boys got into trouble (46 boys got in trouble out of 117 total boys = 39%), as compared with
31 percent of girls (37 girls got in trouble out of 120 total girls = 31%). However, to re-frame
the issue, what if we wanted to test the hypothesis that boys get in trouble more often than
girls in school? These figures are a good start to examining that hypothesis; however, the
figures in the table are only descriptive. To examine the hypothesis, we need to employ a
statistical test, the chi-square test.
About the Chi-Square Test
Generally speaking, the chi-square test is a statistical test used to examine differences with
categorical variables. There are a number of features of the social world we characterize
through categorical variables - religion, political preference, etc. To examine hypotheses
using such variables, use the chi-square test.
The chi-square test is used in two similar but distinct circumstances:
a. for estimating how closely an observed distribution matches an
expected distribution - we'll refer to this as the goodness-of-fit test
b. for estimating whether two random variables are independent.
The Goodness-of-Fit Test
One of the more interesting goodness-of-fit applications of the chi-square test is to examine
issues of fairness and cheating in games of chance, such as cards, dice, and roulette. Since
such games usually involve wagering, there is significant incentive for people to try to rig the
games and allegations of missing cards, "loaded" dice, and "sticky" roulette wheels are all
too common.
So how can the goodness-of-fit test be used to examine cheating in gambling? It is easier to
describe the process through an example. Take the example of dice. Most dice used in
wagering have six sides, with each side having a value of one, two, three, four, five, or six. If
the die being used is fair, then the chance of any particular number coming up is the same: 1
in 6. However, if the die is loaded, then certain numbers will have a greater likelihood of
appearing, while others will have a lower likelihood.
One night at the Tunisian Nights Casino, renowned gambler Jeremy Turner (a.k.a. The
Missouri Master) is having a fantastic night at the craps table. In two hours of playing, he's
racked up $30,000 in winnings and is showing no sign of stopping. Crowds are gathering
around him to watch his streak - and The Missouri Master is telling anyone within earshot
that his good luck is due to the fact that he's using the casino's lucky pair of "bruiser dice," so
named because one is black and the other blue.
Unbeknownst to Turner, however, a casino statistician has been quietly watching his rolls
and marking down the values of each roll, noting the values of the black and blue dice
separately. After 60 rolls, the statistician has become convinced that the blue die is loaded.
Value on Blue Die   Observed Frequency   Expected Frequency
1                          16                   10
2                           5                   10
3                           9                   10
4                           7                   10
5                           6                   10
6                          17                   10
Total                      60                   60
At first glance, this table would appear to be strong evidence that the blue die was, indeed,
loaded. There are more 1's and 6's than expected, and fewer of the other numbers than expected.
However, it's possible that such differences occurred by chance. The chi-square statistic can
be used to estimate the likelihood that the values observed on the blue die occurred by
chance.
The key idea of the chi-square test is a comparison of observed and expected values. How
many of something were expected and how many were observed in some process? In this
case, we would expect 10 of each number to have appeared, and we observed the actual
counts in the observed-frequency column of the table above.
With these sets of figures, we calculate the chi-square statistic as follows:
χ² = ∑ (Oi – Ei)²/Ei, summing the (observed – expected)²/expected terms over all the categories.
Using this formula with the values in the table above gives us a value of 13.6.
[Randolph here. First off, I don't know why their formula is so complicated. We'll just use the
one we find in Hinton, p. 242 – ∑ (O – E)²/E.
Let's do these calculations. So for a die value of "1," O – E = 6. Square it to get 36. Divide by
E (which is 10) to get 3.6. Now we have to do this for all 6 cells:
1 – (6)²/10 = 3.6
2 – (–5)²/10 = 2.5
3 – (–1)²/10 = .1
4 – (–3)²/10 = .9
5 – (–4)²/10 = 1.6
6 – (7)²/10 = 4.9
Now sum 'em all up: 3.6 + 2.5 + .1 + .9 + 1.6 + 4.9 = 13.6. Woo hoo, they got it right. If you
are reading this and you do NOT know where I got those figures [e.g., (7)²/10], email me
immediately, and we'll "discuss." Or call if you wish. I now return you to your originally
scheduled, stolen chi-square discussion.]
Lastly, to determine the significance level we need to know the "degrees of freedom." In the
case of the chi-square goodness-of-fit test, the number of degrees of freedom is equal to the
number of terms used in calculating chi-square minus one. There were six terms in the
chi-square for this problem – therefore, the number of degrees of freedom is five.
We then compare the value calculated in the formula above to a standard set of tables. The
value returned from the table is 1.8%. We interpret this as meaning that if the die was fair (or
not loaded), then the chance of getting a χ2 statistic as large or larger than the one
calculated above is only 1.8%. In other words, there's only a very slim chance that these rolls
came from a fair die. The Missouri Master is in serious trouble.
Recap
To recap the steps used in calculating a goodness-of-fit test with chi-square:
1. Establish hypotheses.
2. Calculate chi-square statistic. Doing so requires knowing:
   o The number of observations
   o Expected values
   o Observed values
3. Assess significance level. Doing so requires knowing the number of
degrees of freedom.
4. Finally, decide whether to accept or reject the null hypothesis.
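[Randolph here again. If you happen to have Python with the SciPy library around, you can check this whole goodness-of-fit calculation in a few lines. This is just a sketch for the curious – the variable names are mine, and no, software is not on the final.]

# Goodness-of-fit test for the blue die, using SciPy
from scipy.stats import chisquare

observed = [16, 5, 9, 7, 6, 17]       # counts of 1..6 in 60 rolls
expected = [10, 10, 10, 10, 10, 10]   # a fair die: 60 rolls / 6 sides

# chisquare() computes the sum of (O - E)^2 / E and the associated p-value
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(stat, p)   # 13.6, and p of about 0.018 -- the 1.8% mentioned above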
Testing Independence
The other primary use of the chi-square test is to examine whether two variables are
independent or not. What does it mean to be independent, in this sense? It means that the
two factors are not related. Typically in social science research, we're interested in finding
factors that are related - education and income, occupation and prestige, age and voting
behavior. In this case, the chi-square can be used to assess whether two variables are
independent or not.
More generally, we say that variable Y is "not correlated with" or "independent of" the
variable X if more of one is not associated with more of another. If two categorical variables
are correlated, their values tend to move together, either in the same direction or in
opposite directions.
Example
Return to the example discussed at the introduction to chi-square, in which we want to know
whether boys or girls get into trouble more often in school. Below is the table documenting
the percentage of boys and girls who got into trouble in school:
            Got in Trouble   No Trouble   Total
Boys              46             71        117
Girls             37             83        120
Total             83            154        237
To examine statistically whether boys got in trouble in school more often, we need to frame
the question in terms of hypotheses.
1. Establish Hypotheses
As in the goodness-of-fit chi-square test, the first step of the chi-square test for
independence is to establish hypotheses. The null hypothesis is that the two variables are
independent – or, in this particular case, that the likelihood of getting in trouble is the same for
boys and girls. The alternative hypothesis to be tested is that the likelihood of getting in
trouble is not the same for boys and girls.
Cautionary Note
It is important to keep in mind that the chi-square test only tests whether two variables are
independent. It cannot address questions of which is greater or less. Using the chi-square test,
we cannot evaluate directly the hypothesis that boys get in trouble more than girls; rather, the
test (strictly speaking) can only test whether the two variables are independent or not.
2. Calculate the expected value for each cell of the table
As with the goodness-of-fit example described earlier, the key idea of the chi-square test for
independence is a comparison of observed and expected values. How many of something
were expected and how many were observed in some process? In the case of tabular data,
however, we usually do not know what the distribution should look like (as we did with rolls of
dice). Rather, in this use of the chi-square test, expected values are calculated based on the
row and column totals from the table.
The expected value for each cell of the table can be calculated using the following formula:
expected value of a cell = (row total × column total) / overall total
For example, in the table comparing the percentage of boys and girls in trouble, the expected
count for the number of boys who got in trouble is:
(117 × 83) / 237 = 40.97
The first step, then, in calculating the chi-square statistic in a test for independence is
generating the expected value for each cell of the table. Presented in the table below are the
expected values (in parentheses and italics) for each cell:
            Got in Trouble   No Trouble   Total
Boys         46 (40.97)       71 (76.02)   117
Girls        37 (42.03)       83 (77.97)   120
Total            83              154       237
This is Randolph. Let me jump in here and check their work. So (117 X 83)/237 = 9711/237 = 40.97. Check. (117 X 154)/237 = 76.02. Check. (120 X 83)/237 = 42.03. Check. (I learned to round to the nearest even number, so it woulda been 42.02, but close enough.) (120 X 154)/237 = 77.97. Excellent. NOTE: as I was trying to say, last Wednesday, what this accomplishes is, basically, to say, "OK, given that a total of 83 of the 237 kids in our sample got in trouble – i.e., 35.02% of them – if there was NO effect of gender, then we would EXPECT there to have been 35.02% of the (117) boys, i.e., 40.97 of 'em getting in trouble, and 35.02% of the (120) girls, or 42.0 of them." So, we expected about 41 boys to be in trouble (given the marginal values) and actually it was 46. We expected about 42 girls to have gotten in trouble, and we found 37. Is this going to yield a significant effect of gender? I'll bet not.
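[And if you want to check all four expected values at once, here's a tiny Python sketch of that marginal arithmetic – my own illustration, not something you'll need on the final.]

# Expected count for each cell = (row total x column total) / overall total
row_totals = [117, 120]   # boys, girls
col_totals = [83, 154]    # got in trouble, no trouble
overall = 237

for r in row_totals:
    print([round(r * c / overall, 2) for c in col_totals])
# prints [40.97, 76.03] and [42.03, 77.97]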
3. Calculate Chi-square statistic
With these sets of figures, we calculate the chi-square statistic as follows:
χ² = ∑ (O – E)²/E
In the example above, we get a chi-square statistic equal to:
(46 – 40.97)²/40.97 + (71 – 76.03)²/76.03 + (37 – 42.03)²/42.03 + (83 – 77.97)²/77.97 ≈ 1.87
That's what I got. (Yikes – up in the table they found "76.02" – c'mon, this class can't abide
any more sloppiness!)
4. Assess significance level
Lastly, to determine the significance level we need to know the "degrees of freedom." In the
case of the chi-square test of independence, the number of degrees of freedom is equal to
(the number of columns in the table minus one) multiplied by (the number of rows in the
table minus one).
[Note, they had a cut-and-paste problem on the web site – I’ve replaced their conclusion with
my own. RGB.]
In this table, there were two rows and two columns. Therefore, the number of degrees of
freedom is: df = (# rows – 1) X (# columns -1) = 1 X 1 = 1. Table value (critical value) for
Chi-square with 1 df is 3.84. Thus we cannot reject the null hypothesis (we thought not!!),
and we conclude that we have no data to suggest that (within the constraints of our
experimental design) the likelihood of getting in trouble is affected by gender.
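[Randolph once more. SciPy will also run the whole test of independence in one call, expected values and all, if you ever want to check your hand work – a sketch only, with my own variable names.]

# Chi-square test of independence on the 2x2 boys/girls table
from scipy.stats import chi2_contingency

table = [[46, 71],    # boys: got in trouble, no trouble
         [37, 83]]    # girls: got in trouble, no trouble

# correction=False gives the plain chi-square we computed by hand
stat, p, df, expected = chi2_contingency(table, correction=False)
print(stat, df, p)   # about 1.87 with 1 df, p about 0.17 -- not significant
print(expected)      # the expected counts: about 40.97, 76.03, 42.03, 77.97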
Recap
To recap the steps used in calculating a chi-square test of independence:
1. Establish hypotheses
2. Calculate expected values for each cell of the table.
3. Calculate chi-square statistic. Doing so requires knowing:
a. The number of observations
b. Observed values
4. Assess significance level. Doing so requires knowing the number of
degrees of freedom.
5. Finally, decide whether to accept or reject the null hypothesis.
3 – One thing that I didn’t say in class, and indeed I can’t find it in either textbook (I
know it’s in there, somewhere) is that all t test scores must be positive. (You could’ve
chosen to do X2 – X1 instead of X1 – X2.) So, if you end up with a negative numerator in
the calculation of your t value, just make it positive.
4 – One thing we didn’t talk about in class but that I want you to know about was “effect
size.” Read Hinton (pp. 96 – 97 in my edition) first, then read S, Z, and Z, pp. 242-243,
and then pp. 411-413. No, I won’t ask you to calculate one. Just know that Cohen’s d is
a measure of effect size, and that .2 is small, .5 is medium, and .8 is large. (Yikes – be
careful. This is “Cohen’s d,” not to be confused with the “d” scores, the difference
scores, that we’ll attend to in a minute.)
5 – Post hoc tests. Someone in class asked me about post hoc tests. Geoff asked me,
online, whether they could use a Tukey test to perform pairwise comparisons after finding a
significant ANOVA for their independent variable with three levels. The answer is
"yes," and the beginning of Ch. 12 in Hinton describes it pretty well. No, this won't be
on the final.
6 – Formulae
Thanks tons, to Daniel, for these. This is how they’ll appear on the final (unless someone
unearths a typo between now and then). The only t test formulae you'll have to use are the
ones for related groups and for a single sample – NOT the one for independent groups.
Mean (population and sample)
Semi-interquartile range
Standard deviation (population and sample)
Standard error of the mean
z-score
One-sample t test
Two-sample t test (independent groups)
Two-sample t test (related groups)
Confidence interval
F-score (ANOVA)
7 – Sample questions-problems:
In a controlled experiment to test several treatments for a difference in mean
response variable value, the following partial ANOVA table is produced. Complete
the table and give the elements of a test of the claim that the treatments produce
different means.
Source            Sum of Squares   DF   Mean Square   F Value
Between groups        57.08         3
Within groups         77.55        15
Total                134.63        18
Answer

Source            Sum of Squares   DF   Mean Square   F Value
Between groups        57.08         3      19.03        3.68
Within groups         77.55        15       5.17
Total                134.63        18
The null hypothesis states that there was no effect of the independent variable –
the four (see where I got that? If not, tell me.) groups did not vary in their scores
(the dependent variable). The alternative hypothesis is that there IS some effect,
the means DO differ -- put another way, the four sample groups represent
different underlying population distributions. Going into the F table, with alpha =
.05, the critical value of F with 3 and 15 df is 3.29. Our observed (calculated) F
exceeds the table value, so we reject the null hypothesis and conclude that there
IS an effect of “groups,” i.e., an effect of our Independent Variable.
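[Here's what that table-completion arithmetic looks like in Python, if it helps – each mean square is just SS/df, F is the ratio of the mean squares, and SciPy can supply the critical value. A sketch, not an exam requirement.]

from scipy.stats import f

ss_between, df_between = 57.08, 3
ss_within, df_within = 77.55, 15

ms_between = ss_between / df_between        # 19.03
ms_within = ss_within / df_within           # 5.17
F = ms_between / ms_within                  # 3.68

critical = f.ppf(0.95, df_between, df_within)   # 3.29 at alpha = .05
print(F, critical, F > critical)   # F exceeds the critical value: reject H0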
Another problem:
One-way analysis of variance test:
[NB: “One-way” here does NOT mean “one-tailed.” It just means we are testing just
one IV.]
[Just read this one – don’t work it out.]
A psychologist is studying the effectiveness of three methods of reducing smoking. He
wants to determine whether the mean reduction in the number of cigarettes smoked daily
differs from one method to another among men patients. Sixteen men are included in the
experiment. Each smoked 60 cigarettes a day before treatment. Four randomly chosen
members of the group pursue method I; four pursue method II; and four pursue method
III. [Note: What goes unspoken here is that four pursue NO method – this is the control
group. I don’t know why these four people are left out of the analysis; below we see just
2 between-group df, so there were just 3 groups tested. They shoulda done a one-way
ANOVA with four groups.] The results are as follows:
Method I   Method II   Method III
   50          41          49
   51          40          47
   51          39          45
   52          40          47
Use a one-way analysis of variance to test whether the mean reduction in the number of
cigarettes smoked daily is equal for the three methods. (Let the significance level equal
.05.)
SOLUTION: The mean reduction for the first method is 51; for the second method it is
40; and for the third method it is 47. The mean for all methods combined is 46. Thus, the
between-group sum of squares equals 4 × [(51 – 46)² + (40 – 46)² + (47 – 46)²], or 248. The
within-group sum of squares equals (50 – 51)² + (51 – 51)² + (51 – 51)² + (52 – 51)² + (41 – 40)²
+ (40 – 40)² + (39 – 40)² + (40 – 40)² + (49 – 47)² + (47 – 47)² + (45 – 47)² + (47 – 47)² = 12. Thus,
the analysis-of-variance table is
Source of variation   Sum of squares   Degrees of freedom   Mean square    F
Between groups             248                 2                124        93
Within groups               12                 9               1.33
Total                      260                11
Since there are 2 and 9 degrees of freedom, F.05 = 4.26. Since the observed value of F far
exceeds this amount, the psychologist should reject the null hypothesis that the mean
reduction in the number of cigarettes smoked daily is the same for the three methods.
Source: Mansfield, Edwin. Basic Statistics with Applications. New York: W. W. Norton & Co.,
1986, p. 424.
Submitted by Clarke Iakovakis
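[If you'd like to verify the whole smoking example by machine, SciPy's one-way ANOVA does it in one call – again, just an optional sketch with my own variable names.]

# One-way ANOVA on the three smoking-reduction methods
from scipy.stats import f_oneway

method1 = [50, 51, 51, 52]
method2 = [41, 40, 39, 40]
method3 = [49, 47, 45, 47]

F, p = f_oneway(method1, method2, method3)
print(F, p)   # F = 93, p far below .05 -- reject the null hypothesis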
One last one – a t test example. Full-on. Full out. The full Monty. Full-contact t tests!
Let’s say I randomly sample 7 students from a, oh, say, Research Methods class, at the
13-week point in the semester. I give them all a final exam. Their scores are 73, 100, 93,
86, 67, 80, and 89. Then I give the same 7 students a 15-page document including
various notes, sample problems, and an attempted recovery from a poor lecture. (Heh.)
After reading this the students took another, similar (matched for difficulty) final. On
this their scores were 83, 96, 95, 95, 82, 80, and 92. (In fact, worried about how well the
two tests are matched, I’m going to give 4 of the test subjects test A first, and then test B
after they have access to the study sheet. The other 3 subjects will take test B first, and
then test A after they have access to the study sheet. That is, I will [almost perfectly,
given the odd number of subjects] counterbalance for the particular test taken.)
Did the 15-page document influence performance on a Research Methods final?
My independent variable is the 15-page study document, with two levels (present and not
present). My dependent variable is score on a Research Methods final. This is a
within-subjects (i.e., repeated measures) experimental design; each test participant served in
each group, i.e., saw both levels of the treatment (i.e., they DIDN’T have the study sheet,
and then they did).
I am going to use an inferential statistic, t, to test the null hypothesis that the 15-page
document did NOT influence scores on the Research Methods final.
H0: µ1 = µ2, where µ1 and µ2 are the population means of Research Methods students
without and with the aid of the study document.
Ha: µ1≠µ2
Test          Score on 1st   Score on 2nd final (after having
participant   final (X1)     the benefit of the study sheet) (X2)
1                  73                  83
2                 100                  96
3                  93                  95
4                  86                  95
5                  67                  82
6                  80                  80
7                  89                  92
First, let’s eyeball this. Hmm, 7 test subjects, 1 did worse after the study sheet, 1 did the
same, 5 did better, some of them quite a bit better. I'm thinking this is going to be close.
(Seven subjects is a small sample, and thus requires a pretty hefty t score to exceed the
critical value.)
Just for grins, let's treat this as two independent groups. Here's the (somewhat daunting)
formula – the independent-groups t, with a pooled variance built from the two groups' sums
of squares:
t = (M1 – M2) / √{ [(∑X1² – (∑X1)²/n1) + (∑X2² – (∑X2)²/n2)] / (n1 + n2 – 2) × (1/n1 + 1/n2) }
So. Numerator is easy. The mean of the “without study sheet” group is 588/7 = 84. The
mean of the “with the study sheet” group is 623/7 = 89. So, the people scored on the
average 5 points higher when they took the test after having had access to the study sheet.
So, (switching to M from “X bar” ‘cause I can’t enter “X bar” – Daniel, maybe if you
gave me all the component parts of the formulae . . . ) M1 – M2 = -5. But as I said above,
just make it positive – the numerator is 5. That was easy. Now for the denominator.
Here's a cut-and-paste from an xls spreadsheet that might help us:

 X1    (X1)²     X2    (X2)²
 73     5329     83     6889
100    10000     96     9216
 93     8649     95     9025
 86     7396     95     9025
 67     4489     82     6724
 80     6400     80     6400
 89     7921     92     8464
588    50184    623    55743
 84              89

OK, I'll do this by hand and scan it in.
So, t (df=12) = .98. Critical value for a two-tailed t with α = .05 and df = 12 is 2.179.
So we fail to reject our null hypothesis and infer that we have no data to support the notion
that this study sheet influenced performance on a Research Methods final. (Dang, don’t
you wish you hadn’t spent so much time on it?!)
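[For the skeptics: here's a SciPy check of that independent-groups t – my own sketch, optional as always.]

# Two-sample t test, treating the two sets of scores as independent groups
from scipy.stats import ttest_ind

without_sheet = [73, 100, 93, 86, 67, 80, 89]
with_sheet = [83, 96, 95, 95, 82, 80, 92]

t, p = ttest_ind(with_sheet, without_sheet)
print(t, p)   # t about 0.98 (12 df), p about 0.35 -- not significant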
“But,” you say, “given that you said this was a repeated measures design, why can’t we just
use the two-sample t test for related groups?” Great idea. So now we gotta calculate those
“d” scores.
Two-sample t test (related groups)

Test          Score on 1st    Score on 2nd final (after     d ("difference," "delta," Δ)
participant   final (X1)      having the benefit of the     (To make 'em positive, I'm
                              study sheet) (X2)             gonna take X2 – X1)
1                  73                  83                        10
2                 100                  96                        -4
3                  93                  95                         2
4                  86                  95                         9
5                  67                  82                        15
6                  80                  80                         0
7                  89                  92                         3
Total             588                 623
M                  84                  89
Spreadsheet help, please.

 X1     X2    d (X2 – X1)    d²
 73     83        10        100
100     96        -4         16
 93     95         2          4
 86     95         9         81
 67     82        15        225
 80     80         0          0
 89     92         3          9
588    623        35        435
 84     89         5
And hand calculations, using the SIMPLER t test formula for related groups – the mean of
the d scores divided by their standard error, t = Md / (Sd / √N):
So, t (df=6 – now that we have related groups!) = 2.01. The critical value for a two-tailed
t with α = .05 and df = 6 is 2.447. So, we cannot reject the null hypothesis, just as before.
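[And the SciPy check for the related-groups version – note it pairs the scores up, just as we did with the d scores. A sketch, with my variable names.]

# Paired (related-groups) t test on the same scores
from scipy.stats import ttest_rel

without_sheet = [73, 100, 93, 86, 67, 80, 89]
with_sheet = [83, 96, 95, 95, 82, 80, 92]

t, p = ttest_rel(with_sheet, without_sheet)
print(t, p)   # t about 2.01 (6 df), p about 0.09 -- closer, but still not significant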
Two more things:
1 – It seems as though this latter analysis was CLOSER to significant. (Purists would say
that being “almost significant” is like being “kinda pregnant” – it either is or is not
significant. But still . . . .) I THINK this is because variance that is grouped in with the
“error variance” in an independent groups t test gets factored out in a related groups test.
That is, we pretended that those pairs of scores were totally independent, and so there
would be some variability among all 14 scores. But in fact there was some dependence
among pairs of scores – each pair came from the same person. And so there was less
error variance.
2 – OK, watch this. Just under the aforementioned “Full Monty” rubric, let’s do a
confidence interval on these d scores.
OK, we know that M is 5. We can find the value of t in the table – t for 6 df is 2.447. SE
is S divided by the square root of N (i.e., the square root of 7, i.e., 2.64). OK, so all we
need is S.
Another scanned file (the upshot: S = √[(∑d² – (∑d)²/N)/(N – 1)] = √[(435 – 175)/6] = √43.33 = 6.58):
OK, now we can calculate the confidence interval.
CI = 5 +/- (2.447)(6.58/2.64) = 5 +/- (2.447)(2.49) = 5 +/- 6.09 = (-1.09, 11.09).
That is, we are 95% certain that the true population mean for the difference scores lies
between -1.09 and 11.09. So what? What does this tell you? (I have NOT told you this
in class, but it is in the texts.) What this tells us is, ZERO is within this interval!! That
is, the 95% confidence interval contains zero – the true underlying mean for the
distribution of difference scores, given our experimental conditions, MIGHT WELL be
zero. Thus we see again that we should NOT reject the null hypothesis. If the CI did
NOT contain zero then we would reject the null hypothesis.
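[One last optional Python sketch, pulling the whole confidence-interval calculation on the d scores together:]

import math
from scipy.stats import t as t_dist

d = [10, -4, 2, 9, 15, 0, 3]                 # the difference scores (X2 - X1)
n = len(d)
m = sum(d) / n                               # 5
s = math.sqrt(sum((x - m) ** 2 for x in d) / (n - 1))   # 6.58
se = s / math.sqrt(n)                        # 2.49
t_crit = t_dist.ppf(0.975, n - 1)            # 2.447 for 6 df, two-tailed

print(m - t_crit * se, m + t_crit * se)      # about -1.09 to 11.09; zero is inside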
But THEN you say, “But Dr. Bias – I’m not so sure this question lends itself to a repeated
measures design. I mean, if nothing else, there was a time confound – everyone took the
second test after he/she took the first one (time being linear, as it is, and time travel being
so, uh, unlikely). (Note I didn’t say ‘impossible’ – you’ve taught us not to accept the null
hypothesis, so all we can say is that there has been no compelling evidence to date of
time travel.) But more, maybe there was something about taking the FIRST test that
influenced performance on the second test, in addition to whatever effects the study sheet
might have had. Either practice, or fatigue? That is, aren’t there likely to be order effects
here, in your design?”
To which I say, “Gawd I love it when you critically consider research design.”
OK, it’s Thanksgiving. Study hard. Randolph.