
Chapter 9
Two-Sample Inference
Slide set to accompany "Statistics Using Technology" by Kathryn Kozak (Slides by David H Straayer) is
licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at
http://www.tacomacc.edu/home/dstraayer/published/Statistics/Book/StatisticsUsingTechnology112314b.pdf.
What’s the big deal?
• Comparing: we sometimes forget that
comparing is the single most important reason
for using numbers in the first place.
• Does it work? Does this affect that?
To pair or not to pair?
• In paired inference each measurement matches
a measurement in the other sample.
• There are always exactly the same number of
measurements in each data set – and each
measurement is linked (paired) up with just one
measurement in the other list.
• The reason we care about pairing is this: if they
are paired, we can treat them as single-sample
statistics on the differences.
• Mostly, this chapter will focus on independent
(un-paired) samples.
Section 9.1 Two Proportions
• Hypothesis testing
• Confidence intervals on the difference
• Standard error: $\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$
• (though, in a hypothesis test, we may assume
that 𝑝1 = 𝑝2 and it follows that 𝑞1 = 𝑞2 )
Hypothesis Test for Two Population Proportions
2-PropZTest
1. Random variables and parameters
x1, x2, p1, p2
2. Hypotheses & α
H0: p1 = p2 (that is, their difference is zero)
H1: p1 {<, >, or ≠} p2
3. Assumptions
a. Independent s.r.s.’s
b. Binomial conditions: samples ≪ population
c. Successes & failures all > 5 (in both samples)
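4. Pooled proportion (because we’re assuming
they are the same): $\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$, $\bar{q} = 1 - \bar{p}$
std. err. $= \sqrt{\frac{\bar{p}\,\bar{q}}{n_1} + \frac{\bar{p}\,\bar{q}}{n_2}}$
$z = \frac{\hat{p}_1 - \hat{p}_2}{\text{std. err.}}$, where $\hat{p}_1 = x_1/n_1$ and $\hat{p}_2 = x_2/n_2$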
p-value = area of left, right, or both tails,
depending on the alternate hypothesis.
That is, the p-value is the probability of getting
results this extreme, assuming H0.
5. Conclusion. As always, if p-value < α, reject
the null hypothesis in favor of the alternate.
There is sufficient evidence to support the
alternate hypothesis.
Otherwise, there is not enough evidence to
support the alternate hypothesis at the
stated level α.
6. Interpretation: what does this conclusion
imply in the context of the problem?
Confidence Interval (2-PropZint)
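x1, x2, p1, p2 as in the hypothesis test, with $\hat{p}_i = x_i/n_i$ and $\hat{q}_i = 1 - \hat{p}_i$.
C.I. = point estimate ± margin of error
point estimate = $\hat{p}_1 - \hat{p}_2$
margin of error = $z_c \cdot$ standard error
standard error $= \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$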
Example: Cheating Husbands
Do husbands cheat on their wives more often than
wives cheat on their husbands ("Statistics brain,"
2013)? Suppose you take a group of 1000 randomly
selected husbands and find that 231 had cheated
on their wives. Suppose that in a group of 1200
randomly selected wives, 176 had cheated on their
husbands. Do the data show that the proportion
of husbands who cheat on their wives is greater than
the proportion of wives who cheat on their
husbands? Test at the 5% level.
Conclusion: We have very
strong evidence
(p = 1.97 × 10⁻⁷) that the
proportion of husbands
cheating on their wives is more
than the proportion of wives
cheating on their husbands.
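As a rough check on the test above, the pooled two-proportion z-test can be sketched in plain Python. The function name `two_prop_z_test` and its arguments are illustrative choices (not calculator syntax); the counts are the ones from the example, and the normal tail area is computed with `math.erfc`.

```python
import math

def two_prop_z_test(x1, n1, x2, n2):
    """Right-tailed pooled two-proportion z-test (H1: p1 > p2).

    Hypothetical helper for illustration; returns (z, p_value).
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled proportion: under H0 both samples share a single p.
    p_bar = (x1 + x2) / (n1 + n2)
    q_bar = 1 - p_bar
    std_err = math.sqrt(p_bar * q_bar / n1 + p_bar * q_bar / n2)
    z = (p1_hat - p2_hat) / std_err
    # Area in the right tail of the standard normal curve.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Husbands: 231 of 1000; wives: 176 of 1200 (counts from the example).
z, p_value = two_prop_z_test(231, 1000, 176, 1200)
print(f"z = {z:.2f}, p-value = {p_value:.2e}")
```

With these counts z comes out a little above 5 and the right-tail area is on the order of 10⁻⁷, in line with the p-value quoted in the conclusion.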
Estimate difference in cheating rates
Real World Interpretation: The proportion of
husbands who cheat is anywhere from 5.13 to
11.73 percentage points higher than the proportion
of wives who cheat. Since this interval doesn’t
include zero, we can conclude guys are worse.
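The interval above can be approximated with a short Python sketch. The slide does not state the confidence level, so 95% is an assumption here, and `two_prop_z_interval` is a hypothetical helper name. Unlike the hypothesis test, the standard error is not pooled, because a confidence interval does not assume the two proportions are equal.

```python
import math
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 (illustrative helper).

    The default confidence level of 0.95 is an assumption.
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    q1_hat, q2_hat = 1 - p1_hat, 1 - p2_hat
    # Unpooled standard error: no assumption that p1 = p2.
    std_err = math.sqrt(p1_hat * q1_hat / n1 + p2_hat * q2_hat / n2)
    # Critical value z_c leaving (1 - confidence) / 2 in each tail.
    z_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_c * std_err
    point = p1_hat - p2_hat
    return point - margin, point + margin

low, high = two_prop_z_interval(231, 1000, 176, 1200)
print(f"({low:.4f}, {high:.4f})")
```

At the assumed 95% level the endpoints land at roughly 0.051 and 0.117, which agrees with the interval quoted above up to rounding.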
Section 9.2 Paired Samples for Two Means
• Make sure you can differentiate between
matched (paired or dependent) samples and
independent samples.
• Potential shortcut: if the lists are of different
lengths, it’s a sure bet they are independent.
If they’re the same length, does the first item in
the first list “go with” the first item in the
second list in some important way? If so, they
are matched.
Shortcut for matched pairs
• Don’t treat them as a two sample problem at
all!
• Just create a new list of differences, and treat
that list as a single-sample statistics problem
(hypothesis or C.I., means or proportions)
• On the T.I.: L1 – L2 STO> L3, and rock-and-roll
with L3
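The same shortcut can be sketched in Python: subtract the lists element by element, then run ordinary one-sample statistics on the differences. The data below are made up for illustration, and `paired_t_statistic` is a hypothetical helper; it stops at the t statistic and degrees of freedom and leaves the p-value to a t table or calculator.

```python
import math
from statistics import mean, stdev

# Hypothetical paired measurements (e.g. before and after on the same subjects).
l1 = [12, 15, 11, 14, 13]
l2 = [10, 14, 9, 13, 10]

def paired_t_statistic(first, second):
    """Return (mean difference, t, d.f.) for H0: mean difference = 0."""
    if len(first) != len(second):
        raise ValueError("paired lists must have the same length")
    # Same idea as L1 - L2 -> L3 on the calculator.
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    d_bar = mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    std_err = stdev(diffs) / math.sqrt(n)
    return d_bar, d_bar / std_err, n - 1

d_bar, t, df = paired_t_statistic(l1, l2)
print(f"mean difference = {d_bar:.2f}, t = {t:.2f}, d.f. = {df}")
```

The length check mirrors the earlier shortcut: lists of different lengths cannot be paired.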
Section 9.3 Independent Samples for Two Means
• This section looks at how to analyze two
independent samples. As with all other hypothesis
tests and confidence intervals, the process is the
same, though the formulas and assumptions are
different. What sets the independent t-test apart
from the earlier tests is that there are two
different formulas, and which one you use depends
on whether a particular assumption (equal
population variances) is met.
Hypothesis Test for Independent t-Test
1. Variables & parameters: x1, x2, μ1 & μ2
2. Hypotheses & α
H0: μ1 = μ2 (that is, their difference is zero)
H1: μ1 {<, >, or ≠} μ2
3. Assumptions:
a. Independent s.r.s.’s (try for = sizes)
b. Normally distributed or sample size ≥ 30
c. We’re going to skip Ms. Kozak’s pooled standard
deviation as an advanced topic, and assume that the
standard deviations may not be equal. This is more
conservative.
4. Sample statistic, standard error, test statistic,
degrees of freedom and p-value
Sample statistics: $\bar{x}_1$, $\bar{x}_2$, $s_1$, $s_2$
standard error $= \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
$t = \frac{\bar{x}_1 - \bar{x}_2}{\text{standard error}}$ (assume $\mu_1 = \mu_2$)
d.f. = (whew! Massive calculation best left to
technology. Fortunately, most software
gets it right for you.)
p-value = area of left, right, or both tails,
depending on the alternate hypothesis.
That is, the p-value is the probability of getting
results this extreme, assuming H0.
5. Conclusion. As always, if p-value < α, reject
the null hypothesis in favor of the alternate.
There is sufficient evidence to support the
alternate hypothesis.
Otherwise, there is not enough evidence to
support the alternate hypothesis at the
stated level α.
6. Interpretation: what does this conclusion
imply in the context of the problem?
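For the curious, the "massive" degrees-of-freedom calculation in step 4 is commonly done with the Welch–Satterthwaite approximation when the standard deviations are not pooled. The sketch below assumes that formula; `welch_t` is a hypothetical helper and the summary statistics are made-up numbers, not data from this chapter. The p-value is left to technology, as the slide suggests.

```python
import math

def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Unpooled two-sample t statistic and approximate degrees of freedom.

    Uses the Welch-Satterthwaite formula for d.f. (an assumption about
    what the software does; check your tool's documentation).
    """
    v1 = s1 ** 2 / n1          # variance of the first sample mean
    v2 = s2 ** 2 / n2          # variance of the second sample mean
    std_err = math.sqrt(v1 + v2)
    t = (mean1 - mean2) / std_err   # H0: mu1 = mu2, so nothing is subtracted
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical summary statistics for two independent samples.
t, df = welch_t(50.0, 10.0, 25, 44.0, 8.0, 16)
print(f"t = {t:.3f}, d.f. = {df:.2f}")
```

Note that the degrees of freedom are usually not a whole number, which is one reason the calculation is best left to software.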
Example test for 2 means
The cholesterol level of patients who had heart
attacks was measured two days after the heart
attack. The researchers want to see if patients who
have heart attacks have higher cholesterol levels
than healthy people, so they also measured the
cholesterol level of healthy adults who show no
signs of heart disease ("Cholesterol levels after,"
2013). Do the data show that people who have
had heart attacks have higher cholesterol levels
than people who have not had heart attacks? Test
at the 1% level.
Cholesterol Level of Heart Attack Patients (mg/dL, n = 28):
270, 236, 210, 142, 280, 272, 160, 220, 226, 242, 186,
266, 206, 318, 294, 282, 234, 224, 276, 282, 360, 310,
280, 278, 288, 288, 244, 236
Cholesterol Level of Healthy Individuals (mg/dL, n = 30):
196, 232, 200, 242, 206, 178, 184, 198, 160, 182, 182,
198, 182, 238, 198, 188, 166, 204, 182, 178, 212, 164,
230, 186, 162, 182, 218, 170, 200, 176
Note: the Pooled question on the calculator is
for whether you are using the pooled standard
deviation or not. In this example, the pooled
standard deviation was not used since we are
not assuming the variances are equal. That is
why the answer to the question is No.
5. Conclusion: reject H0 in favor of H1 (p-value < α)
6. This is strong evidence (p = 2.4 × 10⁻⁷) that
patients who have had heart attacks have a higher
cholesterol level on average than healthy individuals.
Confidence Interval for μ1 − μ2
• Same data as previous example. Find a 99%
confidence interval for the mean difference in
cholesterol levels between heart attack
patients and healthy individuals.
• Conclusion: We can be 99% confident that the
interval (32.66, 85.72) contains the true
difference in means.
• Interpretation: The mean cholesterol level for
patients who had heart attacks is anywhere from
32.66 mg/dL to 85.72 mg/dL more than the mean
cholesterol level for healthy individuals. (Though do
realize that many of the assumptions are not valid, so
this interpretation may be invalid.) Since this
interval doesn't contain zero, this suggests that
heart attack patients had more serum cholesterol.