MATH 2275 Chi-Square Tests Tutorial

```Chi-Square Tests Tutorial
Question 1
Among students taking a statistics course, 52 came from families with 3 children. For these three-children families, the numbers with 0, 1, 2 and 3 female children are 5, 17, 24 and 6 respectively.
Test the hypothesis that the number of female children in a three-children family has a
Binomial(3, 0.6) distribution. Use a significance level of α = 0.05. State your null and alternative
hypotheses clearly.
Let X = number of female children in a 3-child family.
The hypotheses to be tested are:
H0 : X ~ Binomial(n = 3, p = 0.6)
H1 : X does not have a Binomial(n = 3, p = 0.6) distribution
Under H0, the cell probabilities are p_i0 = C(n, x) p^x (1 − p)^(n−x), where C(n, x) is the binomial
coefficient, n = 3 and p = 0.6.

X = x    Observed number   Cell probability under H0,               Expected number of families,
         of families       p_i0 = C(3, x) (0.6)^x (0.4)^(3−x)       n × p_i0
0        5                 C(3, 0) (0.6)^0 (0.4)^3 = 0.064          52 × 0.064 = 3.328
1        17                C(3, 1) (0.6)^1 (0.4)^2 = 0.288          52 × 0.288 = 14.976
2        24                C(3, 2) (0.6)^2 (0.4)^1 = 0.432          52 × 0.432 = 22.464
3        6                 C(3, 3) (0.6)^3 (0.4)^0 = 0.216          52 × 0.216 = 11.232
Total    n = 52            1                                        52
Note that the cell for (X = 0) has an expected count that is less than 5. The chi-squared
approximation is suitable when the expected count for each cell is at least 5.
So merge the cells for (X = 0) and (X = 1) to ensure that all cells have an expected count of at
least 5.
Merged table, with p_i0 = C(3, x) (0.6)^x (0.4)^(3−x):

X = x     Observed number   Cell probability under H0,  Expected number of      (Expected − Observed)² / Expected
          of families       p_i0                        families, n × p_i0
0 or 1    5 + 17 = 22       0.064 + 0.288 = 0.352       52 × 0.352 = 18.304     (18.304 − 22)² / 18.304 = 0.7463
2         24                0.432                       52 × 0.432 = 22.464     (22.464 − 24)² / 22.464 = 0.1050
3         6                 0.216                       52 × 0.216 = 11.232     (11.232 − 6)² / 11.232 = 2.437
Total     n = 52            1                           52                      3.288

χ²_obs = Σ over all cells of (Expected − Observed)² / Expected
       = 0.7463 + 0.1050 + 2.437
       = 3.288
Degrees of freedom of the chi-squared statistic = k − m − 1
where k = number of cells = 3
{Note that the table now has 3 cells, since we merged the cells for (X = 0) and (X = 1)}
and m = number of parameters estimated from sample data = 0
So: degrees of freedom = 3 − 0 − 1 = 2
The test is: Reject H0 if χ²_obs ≥ χ²_{α,(2)}
α = 0.05. From tables, χ²_{α,(2)} = χ²_{0.05,(2)} = 5.991
χ²_obs = 3.288 < 5.991 → Do not reject H0 at the α = 0.05 level of significance. The sample data do not
provide sufficient evidence that the number of female children in a three-children family departs from a
Binomial(n = 3, p = 0.6) distribution.
Question 2
An insurance company has collected data on claim frequency over a period of 365 days.
Number of claims per day    0     1      2
Observed number of days     50    290    25
Apply a suitable test with α = 0.001 to evaluate the hypothesis that the number of claims in one
day follows a Poisson distribution. State your null and alternative hypotheses clearly.
We must estimate λ, the parameter of the Poisson distribution, from the sample data.
λ is the population mean number of claims in one day, so we should estimate it by the mean
number of claims per day in the sample.
Estimate of λ = Sample mean number of claims per day
              = Total number of claims / Total number of days
              = [(0 × 50) + (1 × 290) + (2 × 25)] / (50 + 290 + 25)
              = 340 / 365
              = 0.9315
Let X = number of claims per day.
The hypotheses to be tested are:
H0 : X ~ Poisson(λ = 0.9315)
H1 : X does not have a Poisson(λ = 0.9315) distribution
Note that we must add a column to the table of observed and expected counts for X ≥ 3, to make
the cell probabilities sum to 1.
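Under H0, the cell probabilities are p_i0 = e^(−λ) λ^x / x! for x = 0, 1, 2, and
P(X ≥ 3) = 1 − Σ_{x=0}^{2} e^(−λ) λ^x / x!, with λ = 0.9315.

X = x    Observed number   Cell probability under H0,                Expected number of days,    (Expected − Observed)² / Expected
         of days           p_i0 = e^(−λ) λ^x / x!                    n × p_i0
0        50                e^(−0.9315) (0.9315)^0 / 0! = 0.3940      365 × 0.3940 = 143.795      (143.795 − 50)² / 143.795 = 61.181
1        290               e^(−0.9315) (0.9315)^1 / 1! = 0.3670      365 × 0.3670 = 133.946      (133.946 − 290)² / 133.946 = 181.810
2        25                e^(−0.9315) (0.9315)^2 / 2! = 0.1709      365 × 0.1709 = 62.386       (62.386 − 25)² / 62.386 = 22.404
≥ 3      0                 1 − P(X ≤ 2) = 0.06814                    365 × 0.06814 = 24.873      (24.873 − 0)² / 24.873 = 24.873
Total    n = 365           1                                         365                         290.268

The expected counts are computed from the unrounded cell probabilities, so they differ slightly from
365 times the rounded values shown.

χ²_obs = Σ over all cells of (Expected − Observed)² / Expected
       = 61.181 + 181.810 + 22.404 + 24.873
       = 290.268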
Degrees of freedom of the chi-squared statistic = k − m − 1
where k = number of cells = 4
and m = number of parameters estimated from sample data = 1
So: degrees of freedom = 4 − 1 − 1 = 2
The test is: Reject H0 if χ²_obs ≥ χ²_{α,(2)}
α = 0.001. From tables, χ²_{α,(2)} = χ²_{0.001,(2)} = 13.815
χ²_obs = 290.268 ≥ 13.815 → Reject H0 at the α = 0.001 level of significance. The sample data provide
strong evidence that the number of claims per day does not follow a Poisson(λ = 0.9315) distribution.
Question 3
Sixty captured specimens of sharks are classified according to species (Great White, Tiger or
Hammerhead) and the presence or absence of bacterial skin infections. The table below shows the
number of sharks observed in each category.
                Great White Sharks    Tiger Sharks    Hammerhead Sharks
No Infection    14                    9               17
Infection       6                     6               8
A marine biologist wants to investigate if there is a relationship between the attributes of species
and presence of infection.
a) What is the appropriate test for this situation (ANOVA, goodness-of-fit, test of independence
or test of homogeneity)? Explain briefly.
The appropriate test for this situation is a test of independence. The marine biologist wants to
investigate if there is a relationship between the attributes of species and presence of infection,
i.e. if these attributes are dependent or independent.
One way of recognizing if this situation requires a test of independence or a test of homogeneity
is by looking at the sampling method. If h samples of pre-determined sizes were taken from h
populations, and each individual in each of the h samples is classified into exactly one of k
categories, then this usually requires a test of homogeneity.
If, however, one sample is taken from one population, and each individual in the sample is
classified according to two attributes, of which there are h categories of the first attribute and k
categories of the second attribute, then this usually requires a test of independence.
b) Write down the hypotheses being tested by the test statistic.
H0 : The attributes of species and presence of infection are independent.
H1 : The attributes of species and presence of infection are not independent.
OR
H0 : The rows and columns are independent.
H1 : The rows and columns are not independent.
OR
H0 : p_ij = p_i. × p_.j for all i = 1 … h, j = 1 … k
H1 : At least one p_ij ≠ p_i. × p_.j
where p_ij = population proportion of individuals in the (i, j)th cell, i.e. in the ith category of the
first attribute and the jth category of the second attribute
p_i. = population proportion of individuals in the ith row, i.e. in the ith category of the first
attribute
p_.j = population proportion of individuals in the jth column, i.e. in the jth category of the
second attribute
c) Derive an expression for the expected count e_ij in cell (i, j) under the assumption of the null
hypothesis.
Using the last set of hypotheses above:
H0 : p_ij = p_i. × p_.j for all i = 1 … h, j = 1 … k
H1 : At least one p_ij ≠ p_i. × p_.j
Let the total number of observations be n.
Estimate p_i. , the population proportion of individuals in the ith row, by the sample proportion of
individuals in the ith row:
p̂_i. = (Number of individuals in the ith row) / (Total number of individuals in the sample) = n_i. / n
Estimate p_.j, the population proportion of individuals in the jth column, by the sample proportion
of individuals in the jth column:
p̂_.j = (Number of individuals in the jth column) / (Total number of individuals in the sample) = n_.j / n
Under the assumption of the null hypothesis, the expected count in cell (i, j) is e_ij = n p_ij = n × p_i. × p_.j.
So an estimate of the expected count in cell (i, j) is
e_ij = n × p̂_i. × p̂_.j = n × (n_i. / n) × (n_.j / n) = n_i. n_.j / n
d) Find the value of the test statistic and the number of degrees of freedom.
Observed count, n_ij                    Great White Sharks    Tiger Sharks    Hammerhead Sharks    Row Totals, n_i.
No Infection                            14                    9               17                   40
Infection                               6                     6               8                    20
Column Totals, n_.j                     20                    15              25                   n = 60

Expected count, e_ij = n_i. n_.j / n    Great White Sharks    Tiger Sharks    Hammerhead Sharks    Row Totals (Check)
No Infection                            13.333                10              16.667               40
Infection                               6.667                 5               8.333                20
Column Totals (Check)                   20                    15              25                   n = 60

(Expected − Observed)² / Expected       Great White Sharks    Tiger Sharks    Hammerhead Sharks
No Infection                            0.0333                0.1             0.00667
Infection                               0.0667                0.2             0.0133

χ²_obs = Σ over all cells of (Expected − Observed)² / Expected = 0.42
Degrees of freedom = (h − 1)(k − 1) = (2 − 1)(3 − 1) = 1 × 2 = 2, where h = 2 is the number of rows
(infection status) and k = 3 is the number of columns (species).
e) Find the critical value(s) for this test, using a significance level of α = 0.05.
From tables, the critical value is χ²_{α,(2)} = χ²_{0.05,(2)} = 5.991
f) What do you conclude from this test?
The test is: Reject H0 if χ²_obs ≥ χ²_{α,(2)}
Since χ²_obs = 0.42 < χ²_{0.05,(2)} = 5.991, do not reject H0 . The sample data do not provide evidence of a relationship
between the attributes of species and presence of infection.
Question 4
A researcher wishes to test whether the proportion of university students who own cars is the same
in three different faculties of the University of the West Indies, St. Augustine. She randomly
selects 150 students from each of the three faculties and records the number that own cars. The
results are shown below:
                          Own a car    Don't own a car
Science and Technology    36           114
Engineering               23           127
Social Sciences           18           132

For these data, Σ (O − E)² / E = 8.116. Note that since you are given this value, you do not have to
manually calculate expected counts or the chi-squared test statistic.
a) What is the appropriate test for this situation (ANOVA, goodness-of-fit, test of independence
or test of homogeneity)? Explain briefly.
The appropriate test for this situation is a test of homogeneity. The researcher wants to investigate
whether the proportion of university students who own cars is the same in three different faculties, i.e.
whether these three populations of students are homogeneous with respect to car ownership.
Notice that h = 3 samples of pre-determined sizes were taken from h = 3 populations, and each
individual in each of the h = 3 samples is classified into exactly one of k = 2 categories (own a
car / don't own a car). This type of sampling usually requires a test of homogeneity.
b) What is the distribution of the test statistic? How many degrees of freedom does it have?
Under H0, the test statistic has approximately a chi-squared distribution. Number of rows in the
contingency table, h = 3. Number of columns in the contingency table, k = 2.
Degrees of freedom = (h − 1) × (k − 1) = 2 × 1 = 2
c) Write down the hypotheses being tested by the test statistic.
H0 : The proportion of university students who own cars is the same in all three faculties.
H1 : The proportion of university students who own cars in at least one faculty is different from
the other faculties.
OR
H0 : The distribution of car ownership among students is the same in all three faculties.
H1 : The distribution of car ownership among students in at least one faculty is different from the
other faculties.
d) Find the critical value(s) for this test, using a significance level of α = 0.10.
From tables, the critical value is χ²_{α,(2)} = χ²_{0.10,(2)} = 4.605
e) What do you conclude from this test?
The chi-squared statistic, χ²_obs = Σ (O − E)² / E = 8.116
Since χ²_obs = 8.116 ≥ χ²_{0.10,(2)} = 4.605, reject H0 . The sample data provide evidence that the proportion of university
students who own cars in at least one faculty is different from the other faculties.
```
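As a cross-check on Question 1, the minimal Python sketch below recomputes the binomial cell probabilities, merges the cells for X = 0 and X = 1, and evaluates the chi-squared statistic using only the standard library. The variable names are illustrative and not part of the tutorial. The closed-form critical value −2 ln α is an assumption that holds only because this test has exactly 2 degrees of freedom: a chi-squared distribution with 2 degrees of freedom is an exponential distribution with mean 2, so P(χ² ≥ x) = exp(−x/2).

```python
from math import comb, exp, log

# Observed numbers of families with 0, 1, 2 and 3 female children (Question 1).
observed = [5, 17, 24, 6]
n_families = sum(observed)           # 52
n_children, p_female = 3, 0.6        # parameters specified by H0

# Cell probabilities and expected counts under H0.
probs = [comb(n_children, x) * p_female**x * (1 - p_female)**(n_children - x)
         for x in range(n_children + 1)]
expected = [n_families * q for q in probs]

# The cell for X = 0 has an expected count below 5, so merge it with X = 1.
observed_merged = [observed[0] + observed[1]] + observed[2:]
expected_merged = [expected[0] + expected[1]] + expected[2:]

chi2_obs = sum((e - o) ** 2 / e for o, e in zip(observed_merged, expected_merged))
df = len(observed_merged) - 0 - 1    # k - m - 1, with m = 0 estimated parameters

# Assumption: these two formulas are valid only when df == 2.
alpha = 0.05
critical_value = -2 * log(alpha)
p_value = exp(-chi2_obs / 2)

print(f"chi2_obs = {chi2_obs:.3f}, df = {df}")
print(f"critical value = {critical_value:.3f}, p-value = {p_value:.3f}")
print("Reject H0" if chi2_obs >= critical_value else "Do not reject H0")
```

The statistic and critical value should agree with the 3.288 and 5.991 obtained in the worked solution. For other degrees of freedom the critical value would have to come from tables or a statistics library.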
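The Poisson goodness-of-fit calculation in Question 2 can be sketched the same way. The code below estimates λ from the sample, adds the tail cell for X ≥ 3, and computes the statistic with one degree of freedom removed for the estimated parameter. The names are hypothetical, and the critical value again relies on the assumption that the test has exactly 2 degrees of freedom. Because the sketch uses the unrounded estimate 340/365 rather than 0.9315, intermediate values may differ from the worked solution in the last decimal place.

```python
from math import exp, factorial, log

# Observed numbers of days with 0, 1 and 2 claims (Question 2).
observed = [50, 290, 25]
n_days = sum(observed)                                          # 365
lam_hat = sum(x * o for x, o in enumerate(observed)) / n_days   # 340 / 365

# Poisson probabilities for X = 0, 1, 2, plus a tail cell for X >= 3
# so that the cell probabilities sum to 1.
probs = [exp(-lam_hat) * lam_hat**x / factorial(x) for x in range(3)]
probs.append(1 - sum(probs))
observed_cells = observed + [0]       # no days with 3 or more claims
expected = [n_days * q for q in probs]

chi2_obs = sum((e - o) ** 2 / e for o, e in zip(observed_cells, expected))
df = len(observed_cells) - 1 - 1      # k - m - 1, with m = 1 estimated parameter

# Assumption: valid only when df == 2, where P(chi2 >= x) = exp(-x / 2).
alpha = 0.001
critical_value = -2 * log(alpha)

print(f"lambda estimate = {lam_hat:.4f}")
print(f"chi2_obs = {chi2_obs:.3f}, df = {df}, critical value = {critical_value:.3f}")
print("Reject H0" if chi2_obs >= critical_value else "Do not reject H0")
```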
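The expected-count formula derived in Question 3(c), e_ij = n_i. n_.j / n, is also what produces the value 8.116 quoted in Question 4, since the test of independence and the test of homogeneity use the same arithmetic. The sketch below applies one hypothetical helper, `chi2_statistic`, to both tables. The helper name and the row order of the car-ownership table (faculties in the order listed in the question) are assumptions made for illustration, and the critical value formula is valid only because both tests have 2 degrees of freedom.

```python
from math import log

def chi2_statistic(table):
    """Return (chi2, df, expected) for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count in cell (i, j): e_ij = n_i. * n_.j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((e - o) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected

# Question 3: rows are No Infection / Infection, columns are the three species.
sharks = [[14, 9, 17], [6, 6, 8]]
# Question 4: rows are the three faculties, columns are Own / Don't own a car.
cars = [[36, 114], [23, 127], [18, 132]]

for name, table, alpha in [("sharks", sharks, 0.05), ("cars", cars, 0.10)]:
    chi2, df, _ = chi2_statistic(table)
    critical_value = -2 * log(alpha)   # assumption: valid only when df == 2
    decision = "Reject H0" if chi2 >= critical_value else "Do not reject H0"
    print(f"{name}: chi2 = {chi2:.3f}, df = {df}, "
          f"critical value = {critical_value:.3f} -> {decision}")
```

Transposing either table leaves the statistic and the degrees of freedom unchanged, which is why the choice of rows versus columns does not affect the conclusions.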