Psych 626 Data Analysis & NHST – Dr. Mascolo
What Happens When We Don’t Know Population Parameters?
The Probability doc took us from Dice, Poker, & the California Lottery to Percentile Ranks – first for an individual score Y & then for a group’s average score Ȳ.
Now I’ve made 2 important points recently: 1) the Central Limit Theorem allows us to derive Sampling
Distribution Parameters based upon Population Distribution Parameters and 2) we almost never know
the Population Distribution Parameters. Isn’t that like giving a starving person a can of beans but no
can opener? Or like giving you the combination for a safe containing $1 million but not telling you
where the safe is?
Let’s focus on one Sampling Distribution Parameter -- the standard deviation (i.e., the “standard error of the mean”). The Central Limit Theorem tells us:

σ_Ȳ = σ / √N

but we do not know the value of the numerator (σ), so we cannot derive the standard error (σ_Ȳ). So we can only, well, estimate it – and all we have to go on are our very own data. That brings
up a 3rd point I’ve made recently: First and foremost, the purpose of statistical data analysis is to
estimate population parameters.
So as an example, in the formula for a standard score:

z = (Ȳ - µ_Ȳ) / σ_Ȳ

We could try simply inserting our sample standard deviation (s) into the denominator:

z = (Ȳ - µ_Ȳ) / s_Ȳ   (where s_Ȳ = s / √N)
The numerator doesn’t change because our sample mean is an unbiased estimate of the population
mean. However, the denominator does change because the standard deviation is not unbiased – it
underestimates the population standard deviation. We try to compensate for this underestimate by
“tweaking” the formula for the sample variance -- s² -- (& therefore the standard deviation -- s) – that
is, we use “N-1” in the denominator rather than simply N. Using a slightly smaller denominator in the
formula for the sample variance will slightly inflate the result – again, trying to compensate for its
underestimation of the population variance.
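If you want to see that bias for yourself, here’s a minimal simulation sketch (Python with NumPy – that’s an assumption about your setup, and the population values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 100, 15          # hypothetical population parameters
N = 5                        # small sample size, where the bias is most visible

biased, unbiased = [], []
for _ in range(50_000):
    sample = rng.normal(mu, sigma, N)
    dev2 = (sample - sample.mean()) ** 2
    biased.append(dev2.sum() / N)          # divide by N
    unbiased.append(dev2.sum() / (N - 1))  # divide by N - 1

print("True variance:          ", sigma ** 2)        # 225
print("Average of N version:   ", np.mean(biased))   # noticeably below 225
print("Average of N-1 version: ", np.mean(unbiased)) # close to 225
```

With N = 5, the divide-by-N version averages out to about 4/5 of the true variance (roughly 180 instead of 225), while the N-1 version lands close to 225.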
This trick does not completely erase the underestimation problem, so we write the formula with s instead of σ in the denominator -- an admission that we are starting with an estimate of the population standard deviation – so this is not simply a direct derivation from the Central Limit Theorem.
There’s a cost to this imperfect and biased solution – the Central Limit Theorem’s 4th point is no longer true – the sampling distribution built with an estimate of the population standard deviation is not a normal distribution – in particular, it’s too short in the head and too fat in the tails. So, we really cannot begin the formula with “z =” (a z-score can be calculated for a score in any data set -- it does not magically transform that data set into a normal distribution)… still, calling it z would be misleading.
This class would end at this point -- if not for an employee at the Guinness Brewery in Ireland:
William Gosset “did the math” and determined exactly how a distribution using s instead
of σ in the denominator differs from the normal distribution. Gosset published his work
under the pseudonym Student (there are competing explanations of this), and his
mathematical result is called the Student t Distribution.
So now we have the final formula with all its intellectual honesty:

t = (Ȳ - µ_Ȳ) / s_Ȳ
In class we’ll see how the t table in your text’s appendix works – it’ll depend upon the sample size N, or more correctly, the degrees of freedom, which equals N-1 (that’s right, just like the adjusted denominator we used to calculate s²).
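To see those fat tails concretely, here’s a small sketch (Python with scipy – again an assumption about your toolkit) comparing the two-tailed .05 cutoff of the t distribution at several df to the normal distribution’s familiar 1.96:

```python
from scipy import stats

# Two-tailed alpha = .05 cutoff for the standard normal distribution
z_crit = stats.norm.ppf(0.975)          # about 1.96

for df in (3, 6, 12, 30, 120):
    t_crit = stats.t.ppf(0.975, df)     # cutoff for the t distribution at this df
    print(f"df = {df:>3}:  t_crit = {t_crit:.3f}   vs   z_crit = {z_crit:.3f}")
```

You should see t cutoffs of roughly 3.18, 2.45, 2.18, 2.04, and 1.98 – always larger than 1.96 (the fat tails push the cutoff outward), and shrinking toward 1.96 as df grows.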
Data Analysis and Hypothesis Testing
Your text refers to 2 hypotheses, and I have detailed the models that provide the basis for deciding
between them. That is, we “fit” each model to our data (because science is empirical) and then
compare the errors -- the “residuals”. The Null Hypothesis (Ho) – is represented by the Null Model,
and the Research Hypothesis (H1) is represented by the Full Model.
Here are 2 ways to see how the Null Model is simpler than the Full Model:
Logical: the Null Model says all the participants come from the same group while the Full Model says
the participants must be separated into 2 groups according to a “Predictor Variable”
Statistical: Null Model has only 1 parameter to estimate, while the Full Model has 2 parameters to
estimate
So all things being equal, the Null Model is favored because it is simpler compared to the Full Model.
This preference for simplicity has been stated in different ways, for example, the Law of Parsimony,
Morgan’s Canon, Occam’s Razor. A scientific theory should not be more complicated than necessary
to explain the data. Statistically, the Null Model requires a single parameter estimate: µ. However,
the Full Model requires two parameter estimates: µ1 & µ2.
Thus, the Null Model enjoys preferential treatment and is discarded in favor of the Full Model only
when the data strongly warrant the move to the more complicated model. What would constitute such
evidence? The evidence would be that the Full Model more accurately fits the data by significantly decreasing errors (residuals).
So we state the hypotheses and models this way:
Ho: µ1 = µ2        Null Model:  Y = µ + e
H1: µ1 ≠ µ2        Full Model:  Y1 = µ1 + e
                                Y2 = µ2 + e
Remember those little e’s are error -- and that’s what we want to compare between models.
O.K. -- now time to “fit the models” by estimating these parameters: the Null Model’s µ and the Full
Model’s µ1 & µ2.
In math, estimates are designated with the symbol ‘ or ^ -- in our case, Y’ or Ŷ (let’s use Y’ -- it’s a whole lot easier for me to format). How can we possibly hope to estimate a parameter representing an entire population? Well, necessity is the mother of invention, and all we have available to us comes from our actual data, so we use our sample means:
For the Full Model, µ1 is estimated as Ȳ1, and µ2 is estimated as Ȳ2. That is, our sample means for each group in our 2-group study.
For the Null Model, we only have to estimate the overall mean -- as though we just have one big set of data -- not 2 different groups. In our course we assume that all groups in a study have equal size (N), so we can calculate the overall mean of our data by averaging the 2 group means:

Ȳ = (Ȳ1 + Ȳ2) / 2     (no subscript = no individual groups – the different levels of the Predictor Variable are ignored)
O.K., this may all sound theoretical, so what follows is a concrete example of all this -- and it
introduces 2 very fundamental statistics used for data analysis.
The following example is fashioned after Lockhart (1999), in my estimation the most thorough &
intelligible explanation of basic data analysis – including its historical development – ever written.
Let’s say we have a set of 8 scores from the Interpersonal Effective Survey (IES): 4 from a Treatment
group (T), 4 from a Control group (C). We’ve been using Y to represent the Response Variable (in
this case, the IES scores), so we’ll use X to represent the Predictor Variable (in this case, Treatment
versus Control):
X:    T    C    C    T    C    T    T    C
Y:   12    7    9   14   11   14   16   13
Now the Null Model simplifies the analysis by ignoring the Predictor Variable X. So for example, to calculate the mean for these data, simply add up all the scores – like one group of 8 scores – we ignore the fact that 4 of the scores (the Treatment group) are different from the other 4 (the Control group). The 8 scores total 96, so the mean of the group is:

Ȳ = ΣY / N = 96 / 8 = 12
So if we use 12 (the overall mean) as our estimate (Y’), we can calculate how much each of the 8
scores deviates from this estimate:
X:        T    C    C    T    C    T    T    C
Y:       12    7    9   14   11   14   16   13
Ȳ:       12   12   12   12   12   12   12   12
Y - Ȳ:    0   -5   -3   +2   -1   +2   +4   +1     -- These (Y - Ȳ)’s are what we mean by e’s
Take a moment to add these e’s up -- they sum to zero -- hopefully this is not a surprise. In fact, we
are going to continue our analysis by calculating the Sum of Squares (SS) for each model -- the same
calculations you learned for the previous Section Exam. So, we continue by squaring each of these
deviations and then adding them up:
Σ(Y - Ȳ)² = (0)² + (-5)² + (-3)² + (+2)² + (-1)² + (+2)² + (+4)² + (+1)² = 60 = SStotal
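Here’s the same Null Model arithmetic as a quick check in Python (just the 8 IES scores from the table above, with the groups ignored):

```python
scores = [12, 7, 9, 14, 11, 14, 16, 13]    # all 8 IES scores, groups ignored

grand_mean = sum(scores) / len(scores)      # 96 / 8 = 12
errors = [y - grand_mean for y in scores]   # the e's; they sum to zero
ss_total = sum(e ** 2 for e in errors)      # 60

print(grand_mean, sum(errors), ss_total)    # 12.0  0.0  60.0
```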
So this Sum of Squares represents the sum of squared deviations from the overall mean of the data -- as though we just had 1 group of 8 scores instead of 2 groups of 4 scores. This SS represents the
amount of error (e’s) associated with the Null Model, which ignores the Predictor Variable. This is our
baseline amount of error -- the amount associated with the simpler model, and it is called the SStotal.
Now, our central question in data analysis is whether we can significantly reduce error when we use
instead the Full Model, so we have to calculate a SS associated with that model. Remember this
model takes into account the Predictor Variable -- that is why it is more complex -- so we cannot
adopt this model unless we can show it is worth the added complexity.
The SS associated with the Full Model is called the SSe -- and we calculate it the same way we
calculate any SS, except in this case the inclusion of the Predictor Variable demands we calculate a
SS1 for the treatment group and a separate SS2 for the control group:
For the treatment group calculations, we use the mean for the first group, which equals 14:
X:         T    T    T    T
Y:        12   14   14   16
Ȳ1:       14   14   14   14
Y - Ȳ1:   -2    0    0   +2

Σ(Y1 - Ȳ1)² = (-2)² + (0)² + (0)² + (+2)² = 8 ….. This is SS1
For the control group calculations, we use the mean for the second group, which equals 10:
X:         C    C    C    C
Y:         7    9   11   13
Ȳ2:       10   10   10   10
Y - Ȳ2:   -3   -1   +1   +3

Σ(Y2 - Ȳ2)² = (-3)² + (-1)² + (+1)² + (+3)² = Σe² = 20 ….. This is SS2
Our last step is to add these 2 SS’s together to get the SS associated with the Full Model -- if this
seems too good to be true, it kind of is because we’re supposed to meet a certain criterion called the
“Assumption of Homogeneity of Variance” -- I’ll explain that later in class.
So for now, we just laze along and calculate SSe = SS1 + SS2 = 8 + 20 = 28
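Here’s the Full Model side of the ledger in the same quick-check style (the example scores, now split by group):

```python
treatment = [12, 14, 14, 16]   # Treatment group IES scores
control   = [7, 9, 11, 13]     # Control group IES scores

def ss(group):
    """Sum of squared deviations from the group's own mean."""
    m = sum(group) / len(group)
    return sum((y - m) ** 2 for y in group)

ss1 = ss(treatment)        # 8
ss2 = ss(control)          # 20
ss_e = ss1 + ss2           # 28 -- the error left over under the Full Model
print(ss1, ss2, ss_e)
```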
Let’s review:
•  When we used the simpler (Null) model we had this error measure: SStotal = 60
•  When we used the more complex (Full) model we had this error measure: SSe = 28
So even though the Full Model costs us the added complexity of having to take into account the Predictor Variable (and so having to estimate 2 parameters instead of just 1), it also provides a better “fit” for the data, that is, a decrease in error.
In fact, there’s a measure for this better fit (decreased error); it’s called the “SSmodel”: it is a measure of the reduction in error when we go from the Null Model to the Full Model:
SSmodel = SStotal - SSe.  In our example, SSmodel = 60 - 28 = 32
And this leads to our 1st tool for data analysis: the Coefficient of Determination or R². It is simply the reduction in error in proportion to the total error we started with. That is:
R² = SSmodel / SStotal.  In our example, R² = 32/60 = .53
As a ratio, it is interpreted as “the proportion of variability in the Response Variable that is explained
by the Predictor Variable”. The higher it is, the more tempted we are to abandon the simpler Null
Model in favor of the more complex Full Model – but there’s no objective standard or cut-off that says
“out with the Null, in with the Full”.
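A tiny sketch that pulls the two error measures together and reproduces SSmodel and R² for the example:

```python
ss_total = 60    # error under the Null Model (calculated above)
ss_e = 28        # error under the Full Model (calculated above)

ss_model = ss_total - ss_e          # 32 -- the error we eliminated
r_squared = ss_model / ss_total     # 32 / 60 = 0.53...

print(f"SSmodel = {ss_model},  R^2 = {r_squared:.2f}")
```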
And this leads to our 2nd tool for data analysis: the Magnitude of Effect or Cohen’s d. It measures
how strongly the PV effects changes in the RV in the statistical sense. Cohen is not pretending that
he is measuring a causal connection between the PV and the RV – that depends on your study’s
design -- his statistic is calculated regardless of whether the PV was manipulated or randomly
assigned. In fact, other than knowing the levels of measurement of your PV and RV, a statistician
doesn’t really need to know about your study at all. Again, internal validity depends upon your
research design, not your data analysis.
Cohen’s d is expressed in terms of population parameters:

d = (µ1 - µ2) / σ
However, just like the story of William Gosset and the t distribution, we almost never know population
parameters, so the calculation relies on estimates. Once again, sample means, variances, and
standard deviations provide the estimates of the corresponding population parameters. And so the
formula for Cohen’s d that we actually calculate is:
d′ = (Ȳ1 - Ȳ2) / √MSe *

So in my example, d′ = (Ȳ1 - Ȳ2) / √MSe = (14 - 10) / 2.16 = 1.85
Like R2, there’s no objective standard or cut-off for rejecting Ho. Neither is designed to test the Null
Hypothesis. However, Cohen does offer these guidelines for interpreting the Effect Size (ES):
d’       Effect Size
0.8      Large
0.5      Moderate
0.2      Small
Note: I’m not detailing error terms (the denominator) in our class -- MS stands for a “Mean Square” – another name for a variance (s²) – and is calculated as SSe / df. In a 2-group study df = (n1 - 1) + (n2 – 1). You lose 1 df for each parameter estimate -- another example of how the Full Model is less parsimonious than the Null Model.
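Here’s the effect-size arithmetic for my example as a short Python check; note that I am taking the denominator to be the square root of MSe (i.e., a standard deviation), which is what the 2.16 above works out to:

```python
from math import sqrt

mean_t, mean_c = 14, 10    # group means from the example
ss_e = 28                  # SS1 + SS2 from the example
df = (4 - 1) + (4 - 1)     # lose 1 df per parameter estimate -> 6

ms_e = ss_e / df                            # Mean Square error = 28 / 6 (a variance)
d_prime = (mean_t - mean_c) / sqrt(ms_e)    # 4 / 2.16 = 1.85

print(round(sqrt(ms_e), 2), round(d_prime, 2))   # 2.16  1.85
```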
Null Hypothesis Statistical Testing (NHST):
A statistic that is designed to test the Null Hypothesis is the t-test, which is calculated this way:
tobs = (Ȳ1 - Ȳ2) / s_(Ȳ1-Ȳ2) *
This t calculation looks like Cohen’s d – and like Cohen’s d it is based upon our data, so it is dubbed “tobs” (“t observed” – meaning based upon our data).
What do we make of this statistic we calculated based upon our data? The answer is based on the
first part of this document -- recall that we put the “burden of proof” on the Research Hypothesis
because its Full Model is more complex than the Null Model associated with the Null Hypothesis (Ho).
In other words, we assume Ho is true unless we have sufficient evidence to reject it.
So we want to know whether our tobs is rare enough so that we may question our assumption that Ho
is true. In fact, if our tobs is really rare, we may actually decide that our data provide sufficient evidence
to reject Ho.
To determine how rare tobs is, we construct a theoretical Sampling Distribution using the Central Limit
Theorem and the Law of Large Numbers – in my Probability document I introduced the Sampling
Distribution of the Mean. To determine how rare tobs is, we instead construct a theoretical Sampling
Distribution of t. In particular, it is constructed with the assumption that Ho is true, and so the mean of
this sampling distribution is 0.
Why? If Ho is true, then µ1 = µ2, which is the same as saying µ1 - µ2 = 0. The numerator of tobs is Ȳ1 - Ȳ2, so if Ho is true, then Ȳ1 - Ȳ2 will tend to be close to 0, and so tobs will tend to be close to 0.
Now, the extreme tails of this distribution represent extremely improbable values of tobs if Ho is true –
not impossible, but extremely improbable – so improbable that if tobs is large enough to fall in one of
these extreme tails, we abandon our assumption that Ho is true and decide instead to reject Ho. For
a t-test, these extreme tails constitute a “rejection region” for Ho.
This rejection region, called alpha (α), is an objective & quantitative criterion for rejecting Ho, but it is
also arbitrary. In Psychology, the very strong tradition arbitrarily sets α no larger than 5% (like with
Gosset, there’s a whole backstory here).
Before modern computer technology, researchers conducting a t-test consulted a Statistical Table
(based on Gosset’s work) to determine the value of t that left the most extreme 5% of the Sampling
Distribution. This value is called tcrit because it serves as the cut-off point for statistical significance.
That is, in order to be considered statistically significant, tobs must meet or exceed tcrit. So this is the
decision rule:
If tobs ≥ tcrit, then tobs falls in the rejection region -- Reject Ho
If tobs < tcrit, then tobs does not fall in the rejection region -- Accept Ho
Note: Again, I’m not detailing error terms (the denominator) in our class -- s_(Ȳ1-Ȳ2) is the estimated standard error for the difference between means. The formula for t uses s_Ȳ for one group and s_(Ȳ1-Ȳ2) for two groups.
Put another way, tobs is significant if its probability is even less than α (e.g., .05). So the decision rule
can be restated this way:
If tobs ≥ tcrit, then p < .05 -- Reject Ho
If tobs < tcrit, then p > .05 -- Accept Ho
Statistical Tables were deleted from this edition of your text – I’ve attached a Table of t-values at the
end of this document. The table requires 3 specifics: the alpha level, the degrees of freedom, & whether our test is one- or two-tailed (we will always be using a two-tailed test – I’ll explain in lecture).
For a two-group study, df = (n1 – 1) + (n2 – 1). That is, we lose 1 df per group (more specifically, 1 df
per parameter estimate).
Returning to my example, we are conducting a two-tailed test with 6 degrees of freedom and alpha set at .05 – the Table says tcrit = 2.447.
Now we calculate the t statistic:

tobs = (Ȳ1 - Ȳ2) / s_(Ȳ1-Ȳ2) = (14 - 10) / 1.528 = 2.619
So, tobs = 2.619, which exceeds the tcrit of 2.447, and so we reject H0. Reporting this in the Results
section of an APA journal would look like this: “t(6) = 2.619, p < .05”
Modern statistical software will calculate your tobs and determine its exact p-value. For example,
having already announced your alpha level is .05, your audience reads “t(6) = 2.619, p = .0396” and
knows you were able to reject the Null.
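If you’d like to reproduce the whole test with software, here’s a sketch using scipy’s independent-groups t-test (scipy itself is an assumption about your setup, not something from your text):

```python
from scipy import stats

treatment = [12, 14, 14, 16]
control   = [7, 9, 11, 13]

# Independent-groups t-test, equal variances assumed (as in our hand calculation)
t_obs, p_value = stats.ttest_ind(treatment, control)

df = len(treatment) + len(control) - 2         # 6
t_crit = stats.t.ppf(1 - 0.05 / 2, df)         # two-tailed cutoff, about 2.447

print(f"t({df}) = {t_obs:.3f}, p = {p_value:.4f}, t_crit = {t_crit:.3f}")
# roughly: t(6) = 2.619, p = .04 (the .0396 quoted above), t_crit = 2.447
```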
So really – what is α? Remember the underlying assumption of our Sampling Distribution is that Ho is true, but α is the portion of the distribution we use as a justification for rejecting Ho. So α is the portion of the distribution (i.e., the probability) that we reject Ho when we shouldn’t – that is, when Ho is actually true.
Earlier I stated that the very strong tradition in Psychology arbitrarily sets α no larger than 5%. Now I
can be more specific about this tradition: a research finding is considered statistically significant only
if the probability of incorrectly rejecting Ho is less than .05.
NHST Errors
All the above is summarized and extended in this 4-square Truth Table:
                                     Reality
                            Ho True                     Ho False
Researcher   Reject Ho      Type I Error                Correct
Decision                    P(Type I) = α               P(Correctly Reject Ho) = Power
             Accept Ho      Correct                     Type II Error
                                                        P(Type II) = β
So let me ask you – which type of error is worse – Type I Error or Type II?
That’s right – you can’t answer --- you need some context – a scenario so you could figure out what
each type of error entails and then figure out the consequences of each.
So for example, let’s say a pharmaceutical company is testing a new drug that might be effective in
treating advanced stage Melanoma – which is quite likely fatal. Now add to this situation: the new
drug is quite dangerous. So patients are randomly assigned into 2 groups – 1) treatment and 2)
control, and their survival rates are tracked.
So the Null Hypothesis (Ho) says the 2 groups will not differ in survival rate because the new drug
doesn’t work:
Ho: µ1 = µ2 (Drug is Ineffective)
The Research Hypothesis (H1) says the 2 groups will differ in survival rate because the new drug
does work:
H1: µ1 ≠ µ2 (Drug is Effective)
Now we can specify the consequences of each type of error:
Type I Error – Ho is incorrectly rejected (Drug is called Effective but is actually Ineffective)
Type II Error – Ho is incorrectly accepted (Drug is called Ineffective but is actually Effective)
Which is worse? It’s true the drug is dangerous, but it’s also true the disease is fatal. Myself, I’d
rather risk a Type I Error than a Type II Error. With Type I Error, an Ineffective drug is given to
terminally ill patients, but with a Type II Error, an Effective drug is withheld from terminally ill patients,
which I think is much worse for the patients.
Now, does our answer change if the study is testing Safety rather than Effectiveness? Let’s say we’ve
already established the drug is Effective – but now we are measuring dangerous side effects.
So here’s how this plays out:
The Null Hypothesis (Ho) says the 2 groups will not differ in side effects because the new drug is
Safe:
Ho: µ1 = µ2 (Drug is Safe)
The Research Hypothesis (H1) says the 2 groups will differ in side effects because the new drug is
Dangerous:
H1: µ1 ≠ µ2 (Drug is Dangerous)
Now we can specify the consequences of each type of error:
Type I Error – Ho is incorrectly rejected (Drug is called Dangerous but really is Safe)
Type II Error – Ho is incorrectly accepted (Drug is called Safe but really is Dangerous)
Which is worse? Remember, the drug is Effective, and the disease is fatal. Myself, I’m going to
switch from my earlier decision; I’d rather risk a Type II Error than a Type I Error. With a Type II
Error, a Dangerous drug is given to terminally ill patients, but with a Type I Error, a Safe drug is
withheld from terminally ill patients.
Another scenario: how would our analyses change if we were instead testing a new wrinkle cream
that might also discolor the skin?
By the way, this kind of analysis is not limited just to scientific data analysis. It also applies to
decisions that clinicians make. Here’s a table that parallels the one above:
                                     Reality
                            Disorder Absent             Disorder Present
Clinician    DX Yes         False Positive              Sensitivity
Decision     DX No          Specificity                 False Negative
So the clinical equivalent of a Type I Error is a False Positive – like deciding a depressed patient
needs to be hospitalized, but in truth he only has suicidal ideation, not intention. And the clinical
equivalent of a Type II Error is a False Negative – like deciding a depressed inpatient can be given a
weekend pass because his depression is lifting, but in truth his suicide risk is dangerously high for
that same reason.
Problems/Limitations of NHST & Proposed Solutions
I’m returning to the example from my introduction to NHST and the t-test – 8 subjects, 4 in Treatment,
4 in Control – here’s a summary:
So in my tiny example:

tobs = (Ȳ1 - Ȳ2) / s_(Ȳ1-Ȳ2) = (14 - 10) / 1.528 = 2.619
We began with the assumption that the Null Hypothesis was true, and we constructed a Sampling
Distribution of t based on that assumption. Now, the extreme tails of this distribution are values of t
that are very improbable given the Null Hypothesis – not impossible, but really unlikely – like less than
5% (.05). So we collected our data and calculated tobs = 2.619, which was large enough to land in that most extreme .05. How did we know? A t table “in the back of the book” shows that a tcrit of 2.447 is the cutoff for
the most extreme .05 when df = 6, and so we decided that our initial assumption (Ho is True) should
be rejected (Ho is False). When you write this up for publication in the APA journal, your audience
reads “t(6) = 2.619, p < .05” – the (6) being your degrees of freedom (8 subjects – 2 parameters that
have to be estimated).
NHST Comes Under Fire
There have been grumblings for several decades about psychology’s overreliance on NHST. I’ll explain specifics in class, but here’s a brief outline:
1. NHST has dominated psychology – to the virtual exclusion of any other form of analysis.
2. Scientists have become dependent on NHST – to the point of inappropriately extending its use (e.g., establishing replicability/reliability statistically instead of repeating the study).
3. Scientists have lapsed into mindless, robotic use of NHST, perhaps beguiled by the availability of easy-to-use software programs.
4. Scientists have forgotten how to interpret their statistical findings – including foundational concepts like p value, Type II Error, and Power.
5. Scientists have exploited the Achilles’ heel of NHST – it can be used to disguise findings that are statistically significant but practically trivial.
One proposed solution: Confidence Intervals
We have been using sample statistics to estimate parameters – like when we want to calculate Cohen’s d, but it contains population parameters: d = (µ1 - µ2) / σ, which we don’t know but can only estimate with our sample statistics: d′ = (Ȳ1 - Ȳ2) / √MSe
These sample statistics are being used as Point Estimates – our best single guess – but a “shot in the
dark” is pretty limited. Political surveys report Point Estimates but also a Margin of Error. Combined,
these constitute a Confidence Interval, where Confidence = Probability that the interval includes the
true value of the population parameter being estimated -- in this case, µ1 - µ2.
Now Ho says these 2 population parameters are equal, so the difference between them is zero. That
is, µ1 - µ2 = 0. So we can test Ho by building a 95% Confidence Interval based upon our data and
then looking to see whether 0 is included – whether 0 lands somewhere between the Lower Limit and
Upper Limit of the interval. If it does, then we are forced to accept Ho – after all, we are 95% certain
that our interval includes the true value of µ1 - µ2, and if 0 is included, then 0 is a plausible value, and
so Ho must be retained. If instead 0 lands below the Lower Limit or above the Upper Limit – so not in
the interval – then 0 is an implausible value for the true difference between means, and so the Null
Hypothesis can instead be rejected.
OK, so how do we use our data to build this Confidence Interval? We start with the basic structure of
any Confidence Interval – a best single guess (Point Estimate) and then we add and subtract a
margin of error (Error Estimate):
C.I. = Point Estimate +/- Error Estimate
Now the Point Estimate of µ1 - µ2 is of course based on the only thing we have available to us -- our sample means: Ȳ1 - Ȳ2 -- which served as the numerator of our tobs calculation.
The Error Estimate is also based upon values from our t-test: tcrit multiplied by the same standard
error which served as the denominator of our tobs calculation. So we end up with:
CI95 = (Ȳ1 - Ȳ2) ± tcrit × s_(Ȳ1-Ȳ2)
The tcrit in the Confidence Interval is the same as the tcrit used in the t-test above because a t-test with
alpha = .05 is equivalent to a 95% Confidence Interval.
For my example:

CI95 = (14 - 10) ± 2.447 × 1.528

so

CI95 = 4 ± 3.74

and finally

CI95 = [+0.3, +7.7]
So adding and subtracting the error term (and rounding the results) gives us a Confidence Interval
with a Lower Limit (LL) of +0.3 and an Upper Limit (UL) of +7.7
The Confidence Interval improves upon the best single estimate (point estimate) of the difference
between 2 means by specifying an interval bounded by lower and upper limits and, even better, by
specifying a probability that the interval between these 2 boundaries includes the true difference
between two population parameters -- µ1 - µ2. In my tiny example, the interval clearly does not include
0, so my conclusion is the same as with the t-test: reject Ho.
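Here’s the interval rebuilt in a few lines of Python (scipy is used only to look up tcrit; otherwise this mirrors the hand calculation above):

```python
from math import sqrt
from scipy import stats

treatment = [12, 14, 14, 16]
control   = [7, 9, 11, 13]

def ss(group):
    """Sum of squared deviations from the group's own mean."""
    m = sum(group) / len(group)
    return sum((y - m) ** 2 for y in group)

n1, n2 = len(treatment), len(control)
mean_diff = sum(treatment) / n1 - sum(control) / n2        # point estimate: 4.0

ms_e = (ss(treatment) + ss(control)) / (n1 - 1 + n2 - 1)   # 28 / 6
se_diff = sqrt(ms_e * (1 / n1 + 1 / n2))                   # about 1.528
t_crit = stats.t.ppf(0.975, n1 + n2 - 2)                   # about 2.447

lower = mean_diff - t_crit * se_diff
upper = mean_diff + t_crit * se_diff
print(f"95% CI [{lower:.1f}, {upper:.1f}]")                # roughly [0.3, 7.7]
```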
Confidence Intervals illustrate the fundamental application of all Statistics and Data Analysis:
estimating population parameters. They can be calculated to estimate a population mean or
variance, a difference between two population means or a ratio of two variances. These last two are
examples of how Confidence Intervals can be used for NHST – like a t-test or Analysis of Variance
(ANOVA – next document).
So why are Confidence Intervals at least a partial solution to the NHST criticisms listed above? Like a
t-test, a Confidence Interval can be used as an NHST, but Confidence Intervals are also like R2 and
Cohen’s d in providing information about Effect Size.
So a Confidence Interval is preferable to a t-test because it is more informative – more transparent.
So researchers may be able to reject Ho and claim to have demonstrated a statistically significant
finding because their calculated Confidence Interval does not include 0, but the Confidence Interval
may also expose that finding as trivial.
Here’s an example: researchers hope to show that Cognitive Behavior Therapy (CBT) is more
effective in treating depression if 12 sessions of Yoga are added to the treatment regimen. They
randomly assign depressed patients to be treated with either CBT alone or CBT plus Yoga; the
response (outcome) variable is Beck Depression Inventory score. So here are the results:
CBT                  CBT + Yoga
Ȳ1 = 12.8            Ȳ2 = 12.6
SS1 = 1.64           SS2 = 3.38
n1 = 25              n2 = 25

SStotal = 5.52    MSe = 0.10458    s_(Ȳ1-Ȳ2) = 0.09147
tcrit = 2.021 with α = .05 and df = 48
tobs = 2.19
So the Results Section in an APA journal would read: “t(48) = 2.19, p < .05” In other words, Ho is rejected, so the results are statistically significant. However, if the results of a Confidence Interval were added, the statement would be: “t(48) = 2.19, p < .05, 95% CI [0.02, 0.39]” The Results would still show Ho is rejected but would also show the results came very close to including 0 and so failing to reject Ho.
Whether or not Ho is rejected, the Confidence Interval shows that at best the difference between treatments is only about a third of a BDI point. So these results show a great deal of precision (very small LL – UL range) but negligible Effect Size -- it seems clear there is no real clinical significance in these results.
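We only have summary statistics for this example, but that’s enough to rebuild the t and the interval. A sketch (Python; small rounding differences from the values reported above are expected):

```python
from math import sqrt

# Summary statistics from the CBT vs CBT + Yoga example above
mean1, mean2 = 12.8, 12.6      # group means on the BDI
ss1, ss2 = 1.64, 3.38          # within-group Sums of Squares
n1, n2 = 25, 25

df = (n1 - 1) + (n2 - 1)                    # 48
ms_e = (ss1 + ss2) / df                     # about 0.1046
se_diff = sqrt(ms_e * (1 / n1 + 1 / n2))    # about 0.0915

t_obs = (mean1 - mean2) / se_diff           # about 2.19
t_crit = 2.021                              # tcrit used in the handout (alpha = .05, two-tailed)
lower = (mean1 - mean2) - t_crit * se_diff
upper = (mean1 - mean2) + t_crit * se_diff

print(f"t({df}) = {t_obs:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
# roughly: t(48) = 2.19, 95% CI [0.02, 0.38] -- close to the [0.02, 0.39] reported above
```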