Chi-Square Goodness of Fit Test

DEFINITIONS
Qualitative variables are those which classify the units into
categories. The categories may or may not have a natural ordering to
them. Qualitative variables are also called categorical variables.
Quantitative variables have numerical values that are measurements
(length, weight, and so on) or counts (of how many). Arithmetic
operations on such numerical values do have meaning.
Analysis of Count Data
Three tests
• If we have qualitative data on just one variable, a test of
goodness-of-fit is used to assess if the qualitative data “fit” or are
consistent with a particular discrete model for the percentages in each
category. The null hypothesis would state the hypothesized discrete
model.
• A test of homogeneity is used to assess if two or more populations
are homogeneous or alike with respect to the distribution for some
categorical variable. The null hypothesis is that the distributions are
the same across the two or more populations.
• A test of independence determines if two qualitative variables are
related or not for a given population. The null hypothesis is that the
two variables are independent, that there is no apparent association.
Big Idea for Chi-Square Tests
1. The data consist of observed counts, that is, how many of the
items or subjects fall into each category.
2. We will compute expected counts under H0, that is, the counts that
we would expect to see for each category if the corresponding null
hypothesis were true.
3. We will compare the observed and expected counts to each other
via a test statistic that will be a measure of how close the
observed counts are to the expected counts under H0. So if this
“distance” is large, we have some support for rejecting H0.
The test statistic that is computed for all three tests is called a
chi-square test statistic.
THE CHI-SQUARE STATISTIC
Chi-Square Test Statistic: $X^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$
DEFINITIONS
The observed counts are the data, the number of observations that
fall into each category or cell.
The expected counts are the number of observations that would be
expected to fall into each category or cell if the null hypothesis being
tested were true.
The chi-square test statistic measures the distance between the
observed and expected counts across all cells and is computed as:
$$X^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$
Think About It
Could you get an $X^2$ statistic that is negative?
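The statistic defined above can be sketched as a short function. This is only an illustration: the function name `chi_square_statistic` and the counts below are hypothetical and do not come from any study in these notes.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of cells")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts, chosen only to illustrate the arithmetic.
observed = [18, 22, 20]
expected = [20.0, 20.0, 20.0]

x2 = chi_square_statistic(observed, expected)
print(x2)
```

Each cell contributes its own term, so a cell whose observed count is far from its expected count dominates the total.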
The Chi-Square Distribution
[Figure: Various chi-square distributions for df = 1, df = 4, and df = 10; horizontal axis $X^2$ from 0 to 20.]
Properties of the Chi-Square Distribution
$\chi^2(df)$
• The distribution is not symmetric and is skewed to the right.
• The values are non-negative.
• There is a different chi-square distribution for different degrees of
freedom.
• The mean of the chi-square distribution is equal to its degrees of
freedom and is located to the right of the mode.
• The variance of the chi-square distribution is 2(df).
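The mean and variance properties listed above can be checked informally by simulation. The sketch below relies on a standard fact that these notes do not state: a chi-square variable with df degrees of freedom can be generated as the sum of df squared independent standard normal values. The function name `simulate_chi_square`, the seed, and the number of draws are arbitrary choices made for illustration.

```python
import random
import statistics


def simulate_chi_square(df, n_draws, seed=0):
    """Draw n_draws values, each the sum of df squared standard normals.

    Assumption (not stated in the notes): such a sum follows a chi-square
    distribution with df degrees of freedom.
    """
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(n_draws)]


df = 4
draws = simulate_chi_square(df, n_draws=20000)

sample_mean = statistics.mean(draws)
sample_var = statistics.variance(draws)

print(f"smallest draw: {min(draws):.3f} (never negative)")
print(f"sample mean:     {sample_mean:.2f} (theory: {df})")
print(f"sample variance: {sample_var:.2f} (theory: {2 * df})")
```

The simulated values will not match the theoretical mean and variance exactly, but with many draws they should land close to df and 2(df).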
The 95th percentile of a chi-square distribution with three degrees of
freedom is 7.81 and is denoted by $\chi^2_{0.95}(3) = 7.81$.
[Figure: Chi-square distribution with 3 degrees of freedom; the area to the left of $\chi^2_{0.95}(3) = 7.81$ is 0.95 and the area to the right is 0.05.]
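The percentile quoted above can be checked numerically without a table. The sketch below uses only Python's standard library; the helper name `chi2_sf` is hypothetical, and it handles whole-number degrees of freedom only, using the standard recurrence for the upper-tail area of a chi-square distribution.

```python
import math


def chi2_sf(x, df):
    """Area to the right of x under a chi-square curve with integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 1:
        a = 0.5
        q = math.erfc(math.sqrt(h))   # exact upper tail for df = 1
    else:
        a = 1.0
        q = math.exp(-h)              # exact upper tail for df = 2
    # Each pass raises the degrees of freedom by 2.
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1.0
    return q


area_right = chi2_sf(7.81, df=3)
area_left = 1.0 - area_right
print(f"area to the right of 7.81: {area_right:.3f}")
print(f"area to the left of 7.81:  {area_left:.3f}")
```

The area to the right comes out very close to 0.05, which agrees with 7.81 being the 95th percentile for three degrees of freedom.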
Example
Working with the Chi-Square Distribution
In one study, researchers wanted to assess whether having a pet
increased the length of survival for coronary heart disease patients.
Approximately 94.3% of the patients with a pet survived for one year,
while only 71.8% of those without a pet survived for one year. From a
descriptive standpoint there seemed to be an advantage to having a
pet. Is this difference of 22.5 percentage points significant? Is there
a significant relationship between pet status and survival status?
(a) State the appropriate hypotheses in words.
H0: There is no association between having a pet and survival for coronary heart
disease patients.
H1: Having a pet increases the survival for coronary heart disease patients.
(b) Suppose that the observed chi-square test statistic value is 8.85.
We want to measure the chance of getting a value of 8.85 or larger
under the null hypothesis. The distribution for $X^2$ under H0 is a
chi-square distribution with one degree of freedom. Using your
calculator, find the corresponding p-value.
[Figure: Chi-square distribution with 1 degree of freedom; the p-value is the area to the right of 8.85.]
(c) Using a 5% significance level, what is the decision? State the
conclusion in the context of the problem.
Since the p-value for the test of no association is well below 0.05, we
would reject H0. The data support that there is a statistically
significant relationship between pet status and survival status.
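If a calculator with a chi-square function is not at hand, the p-value in part (b) can be sketched in a few lines of Python. With one degree of freedom, the area to the right of x equals erfc(sqrt(x / 2)), a standard identity that these notes do not state. The function name `chi2_sf_df1` is hypothetical.

```python
import math


def chi2_sf_df1(x):
    """Area to the right of x for a chi-square distribution with df = 1 only."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


x2_obs = 8.85   # observed test statistic from part (b)
alpha = 0.05    # significance level from part (c)

p_value = chi2_sf_df1(x2_obs)
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"p-value = {p_value:.4f}")
print(decision)
```

The p-value comes out near 0.003, far below 0.05, which matches the decision stated in part (c).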
Let's Do It! Youth and Sports
The Trends & Tudes Newsletter produced by Harris Interactive
presented the model given below for the responses to the question:
“Have you ever participated in organized youth sports outside of
school?”
Response                            Percent
1. Yes, currently participate       29%
2. Yes, participated in the past    39%
3. No, have never participated      32%
Suppose a survey of young people aged 8 to 18 attending schools in
Ann Arbor, Michigan gave the following responses to the sport
participation question (n = 200):
Response   Observed Counts   Expected Counts
1          82                ___
2          64                ___
3          54                ___
Total      200               200
Do the data indicate that youth from Ann Arbor have a different
distribution of sport participation as compared to the national model?
(a) State the null hypothesis. (Hint: p1 is the proportion of youth in
Ann Arbor stating that they are currently participating in organized
youth sports outside of school, so based on the model being
tested p1 = 0.29.)
H0: p1 = 0.29, p2 = ___, p3 = ___.
(b) Compute the expected counts and enter them in the previous
table.
(c) Compute the observed test statistic.
$X^2_{OBS} = \sum_{\text{all cells}} \frac{(O - E)^2}{E} =$ ___
(d) Find the p-value.
(e) State your decision and conclusion using $\alpha = 5\%$.
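One way to check your answers to parts (b) through (e) is the sketch below. It takes the model percentages and observed counts from the tables above, and it assumes the usual goodness-of-fit degrees of freedom (number of categories minus 1), which these notes do not state explicitly. With two degrees of freedom the area to the right of x is exactly exp(-x / 2), so no table is needed. Variable names are illustrative.

```python
import math

model = [0.29, 0.39, 0.32]   # hypothesized proportions under H0
observed = [82, 64, 54]      # Ann Arbor counts for responses 1, 2, 3

n = sum(observed)
expected = [n * p for p in model]   # expected count = n * hypothesized proportion

x2_obs = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1              # assumed: categories minus 1
p_value = math.exp(-x2_obs / 2.0)   # exact right-tail area, valid for df = 2 only

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print("expected counts:", [round(e, 1) for e in expected])
print(f"X2_OBS = {x2_obs:.2f}, df = {df}, p-value = {p_value:.4f}")
print(decision)
```

Try the computation by hand first; the code is meant for comparison, not as a replacement for working through the steps.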
Let's Do It!
According to USA Today (May 7, 1991), here is how sports team
members say athletes do as role models for children:
Response:   Excellent   Good   Fair   Poor
Percent:    16%         38%    41%    5%
A poll of 350 adults within a community was taken and the following
data were obtained:
Response:             Excellent   Good   Fair   Poor
Observed Responses:   44          145    133    28
Expected Number:      56          ___    ___    ___
We wish to determine if the data support the conjecture that the
community adults have the same idea about athletes as role models as
do sports team members.
a. What is the appropriate null hypothesis of interest here?
H0: ______________________________________________
b. Carry out the appropriate test at the 5% significance level.
i. Compute the remaining expected counts and write them in the
above table.
ii. Compute the appropriate test statistic and report (bounds for)
the p-value.
iii. State your decision.
Chi-Square as a Test of Independence
Example High Blood Pressure
Many studies have suggested that there is a link between high blood pressure
and heart attacks. In one study, white male subjects aged 35 to 64 were
classified according to whether their systolic blood pressure was low (less than
140 millimeters of mercury) or high (140 or higher) and then followed for five
years to determine whether or not they suffered from a heart attack during the
five years. The data are summarized in the following table:
                   Heart Attack?
Blood Pressure     Yes          No
Low                21 (    )    2655 (    )
High               55 (    )    3283 (    )
a) We wish to assess if these data support the hypothesis that heart attack
status is dependent on blood pressure level. State the hypotheses to be
tested (in the context of this scenario).
H0:
H1:
b) Compute the expected counts under H0.
Proportion of low blood pressure subjects who had a heart attack: $21/2676 = 0.0078$ or 0.78%.
Proportion of high blood pressure subjects who had a heart attack: $55/3338 = 0.0165$ or 1.65%.
Overall, 76 of the 6014 subjects, or 1.26%, had a heart attack.
Expected number of low blood pressure subjects with heart attack
$\approx 0.0126 \times 2676$, or more precisely $\frac{76 \times 2676}{6014} = 33.82$
Expected count $= \frac{(\text{column total})(\text{row total})}{\text{overall total}}$
c) Under H0, the distribution of $X^2$ is approximately a chi-square distribution
with df = (r - 1)(c - 1), where r and c are the numbers of rows and columns in the table.
What are the degrees of freedom for our example?
d) Compute the test statistic $X^2$.
$$X^2_{OBS} = \frac{(21 - 33.82)^2}{33.82} + \frac{(2655 - 2642.18)^2}{2642.18} + \frac{(55 - 42.18)^2}{42.18} + \frac{(3283 - 3295.82)^2}{3295.82} = 8.87$$
e) Give the decision and conclusion using a 5% significance level.
[Figure: Chi-square distribution with df = 1; p-value = area to the right of 8.87 = 0.0029.]
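The whole blood pressure example, from expected counts through the p-value, can be sketched in Python as follows. The variable names are illustrative, and the p-value line uses the identity for one degree of freedom (area to the right of x equals erfc(sqrt(x / 2))), so it applies to a 2 x 2 table only.

```python
import math

# Rows: low, high blood pressure. Columns: heart attack yes, no.
table = [[21, 2655],
         [55, 3283]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected count = (row total)(column total) / (overall total)
expected = [[r * c / n for c in col_totals] for r in row_totals]

x2_obs = sum(
    (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(table))
    for j in range(len(table[0]))
)

df = (len(table) - 1) * (len(table[0]) - 1)
p_value = math.erfc(math.sqrt(x2_obs / 2.0))   # valid because df = 1 here

print("expected counts:", [[round(e, 2) for e in row] for row in expected])
print(f"X2_OBS = {x2_obs:.2f}, df = {df}, p-value = {p_value:.4f}")
```

Because the code keeps the expected counts unrounded, its statistic can differ in the second decimal place from the hand calculation, which rounds each expected count to two decimals first. The p-value and the decision are unaffected.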
Let's Do It! Hodgkin’s disease and Tonsillectomies
A study investigated whether any relationship exists between Hodgkin's disease
and tonsillectomies. The counts below are based on a random sample of 85
patients suffering from Hodgkin's disease and who had a sibling of the same sex
who was free of the disease and whose age was within 5 years of the patient's
age.
                               Sibling
Patient             No Tonsillectomy   Tonsillectomy
No Tonsillectomy    37 (    )          7 (    )
Tonsillectomy       15 (    )          26 (    )
(a) What is the appropriate hypothesis of interest?
(b) Find the expected counts for each category.
(c) Carry out a test at the 5% significance level, and state your conclusion.
Homework will be posted on my website.