1 Analysis of Categorical Data

advertisement
1
Analysis of Categorical Data
The techniques presented in previous sections cover the analysis of numerical data and variables. Up to now it is left unanswered how to analyze categorical variables and data.
1.1
Two-way tables
How to present data for categorical variables?
Remember when doing a data description for a categorical variable we choose to do this with
a relative frequency table. But when we have data on two categorical variables and we want
to illustrate how the two variables depend on each other we use two-way tables.
Two Way Tables:
Data resulting from observations made on two categorical variables can be easily summarized
in a two way table.
Example:
Suppose we are interested in the rate of sprouted seeds in two different kinds of water (rainwater, muddy water). The two categorical variables are Sprouted (yes or no) and water type
(rainwater and muddy water).
Suppose rainwater, muddy water, and tap water were used to water 100 seeds each. Then they
were checked and noted how many of those seeds sprouted.
The result can then be easily presented in the following table:
rainwater
muddy water
tap water
Total
sprouted
yes no
64 36
74 26
60 40
138 62
total
100
100
100
200
A natural question to ask at this point would be if the choice of water has an effect on the
probability for a seed to sprout.
Multiple Comparison
The question in the example could be answered by comparing each type of water with each of
the remaining, i.e. one would have to make the following comparisons
muddy – rain
muddy – tap
rain – tap.
But it would be better to first find if there is a difference at all between the probabilities. To
find out we will introduce the χ2 -test for homogeneity.
1.2
χ2 Test of Homogeneity and Independence in a Two-Way-Table
In this section two different situations and questions concerning two categorical variables will
be covered. Both will lead to the same test. So, let us first introduce the two different types
of questions and then introduce the test routine to answer these questions.
1
1. Comparing the distribution of a categorical variable in two or more populations.
We will be looking at two or more samples from different populations and test if the distributions in all populations are the same.
The null hypothesis to be tested in this kind of problem is that the distributions are homogeneous, they are equal for all populations.
H0 : The probabilities for sprouting and nonsprouting is the same for both methods
Ha : The probabilities for sprouting and nonsprouting depend on the treatment
2. Testing two categorical variables for independence
When you study data that involves two variables, one important consideration is the relationship between the two variables. Does the proportion of measurements in the various categories
for factor 1 depend on which category of factor 2 is being observed?
Example: A survey was conducted to evaluate the effectiveness of a new flu vaccine that had
been administered in a small community. It consists of a two–shot sequence in two weeks.
A survey of 1000 residents the following spring provided the following information:
Flu
No Flu
Total
No vaccine One Shot Two Shots
24
9
13
289
100
565
313
109
578
Total
46
954
1000
The question to be answered is, if the flu shot had an impact on the incidence of the flu. Or
we could ask if the incidence of the flu is independent from the vaccine.
In order to answer this question, a χ2 test can be conducted, testing the following hypotheses:
H0 : No relationship between treatment and incidence of flu
Ha : Incidence of flu depends on amount of flu treatment
The two questions lead to the exact same χ2 test.
The χ2 distribution
The χ2 – distribution is neither a normal nor a t-distribution. Table D gives upper tail areas
for different degrees of freedom.
2
The χ2 Test for Homogeneity and Independence
Given are a categorical variable with R categories and one categorical variable with C categories.
Hypotheses:
H0 : The two categorical variables are independent
Ha : H0 is not true.
Assumption: The sample size is large. The sample size is considered large enough as
long as every count is at least 1, and not more than 20% of counts are less than 5.
Test statistic:
Compute for every cell of the two–way table the expected frequency
E = expectedfrequency =
and then
χ20 =
(row total)(column total)
Sample size
(O − E)2
E
cells
X
all
where O is the observed count.
P-value: P (χ2 > χ20 ) the upper tail area for a χ2 distribution with (R − 1)(C − 1) df,
found in Appendix Table D.
Decision: As usual.
Context: As usual.
Continue Flu Example:
3
1. Hypotheses H0 : The flu is independent of the vaccine status
versus
Ha : H0 is not true.
Let’s perform this test at a significance level of α = 0.05.
2. Assumptions are easily met the sample sizes are all greater than 5.
3. Test Statistic
The two-way table gives the observed frequencies O.
Next calculate the expected cell counts E for each cell:
Flu/no vaccine:
Flu/one shot:
Flu/two shots:
No Flu/no vaccine:
No Flu/one shot:
No Flu/two shots:
(46·313)/1000=14.40
(46·109)/1000=5.01
(46·578)/1000=26.59
(954·313)/1000=298.60
(954·109)/1000=103.99
(954·578)/1000=551.41
Put the expected cell counts into the table:
Flu
No Flu
Total
No vaccine One Shot Two Shots
24
9
13
14.40
5.01
26.59
289
100
565
298.60
103.99
551.41
313
109
578
Total
46
954
1000
Now find for each cell (O − E)2 /E and add these fractions:
χ2 = 6.404 + 3.169 + 6.944 + 0.309 + 0.153 + 0.335 = 17.313 and df=(3-1)(2-1)=2
4. P-value
Table D provides us for df = 2 with P-value<0.0005.
5. Decision Since the P-value is less than α reject the null hypothesis.
Context: At significance level of 5% the data do provide sufficient evidence that the
probability of getting the flu is not the same for all three vaccination groups.
The nature of the relationship has still to be explored:
For example estimate the following probabilities
P (flu=yes|vaccine=0) estimate 24/313=0.0767
P (flu=yes|vaccine=1) estimate 9/109=0.0826
P (flu=yes|vaccine=2) estimate 13/578=0.0225
From this we find that the estimated probability to get the flu given a person had 2 shots is
much less than the estimated probabilities to get the flu given that a person had no or only
one shot.
4
Example:
Some time ago there was a report in the news that an AIDS vaccine tested in Thailand didn’t
show any effect.
The data quoted in the news is presented in the two–way table below (including the expected
cell counts):
HIV+
HIVTotal
Placebo Vaccine
105
106
105.5
105.5
1168
1167
1167.5
1167.5
1273
1273
Total
211
2335
2546
χ2 = 0.00237 + 0.00237 + 0.000214 + 0.000214 = 0.005168 and df=(2-1)(2-1)=1
Conduct a test of homogeneity :
1. Hypotheses: H0 : The probability to get AIDS, is the same for the vaccinated and the
placebo group versus Ha : H0 is not true.
α = 0.05
2. The sample sizes are all large enough.
3. Test Statistic:
See calculations above: χ2 = 0.0052, with 1 df,
4. P-value
From Table D we find P-value>0.25
5. Decision: Do not reject H0 since the p-value is greater than α.
6. Context: At significance level α = 0.05 the data do not provide sufficient evidence that
the HIV infection–rate was impacted by the vaccine.
1.3
χ2 Goodness of fit Test
In this section the χ2 test for comparing the relative frequency distribution from a sample with
a given probability distribution is introduced.
Example: A company filling grass seed bags wants to evaluate their filling machine. The
following distribution is advertised on their bags, where K1-K5 are different kinds of grass
seeds:
kind of seeds proportion
K1
0.5
K2
0.25
K3
0.15
K4
0.05
K5
0.05
5
The company wants to check if the seed distribution in the bags fits the advertised distribution.
They take a sample of size 1000 and find the following summarized data:
kind of seeds
K1
K2
K3
K4
K5
count
480
233
160
63
64
In order to check if the label is a truthful description of the contents of the seed bags, we want
to compare the claimed distribution with the sample data.
Notation: for a given categorical random variable
k
p1
p2
= number of categories
= true probability to fall in category 1
= true probability to fall in category 2
..
.
pk
= true probability to fall in category k
In order to compare the observed frequencies with the hypothesized distribution we study
the χ2 goodness–of–fit statistic.
O1 = observed cell count for category 1
O2 = observed cell count for category 2
..
.
Ok = observed cell count for category k
These observed counts shall be compared with the expected count, if the claim is true (the
label is true).
If we examine an event that occurs with probability p, then in a sample of size n we would
expect to see this event about n · p times. The χ2 statistic is based on this fact.
For a given hypothesized population distribution p01 , . . . , p0k (in the example, this is the label)
define:
E1 = expected cell count for category 1 = n · p01
E2 = expected cell count for category 2 = n · p02
..
.
Ek = expected cell count for category k = n · p0k
The χ2 statistic is then:
χ2 =
k
X
(Oi − Ei )2
Ei
i=1
If p01 , . . . , p0k is the true distribution, then the χ2 statistic is χ2 – distributed with df=k − 1
degrees of freedom,
6
Continue Example:
With p01 = 0.5, p02 = 0.25, p03 = 0.15, p04 = 0.05, p05 = 0.05 we get:
kind of seeds
K1
K2
K3
K4
K5
p0i
0.5
0.25
0.15
0.05
0.05
Oi
480
233
160
63
64
n = 1000
Ei = np0i
500
250
150
50
50
(0i − Ei )2 /Ei
(480-500)2 /500=0.8
(233-250)2 /250=1.156
(160-150)2 /150=0.666
(63-50)2 /50=3.38
(64-50)2 /50=3.92
And χ20 = 0.8 + 1.156 + 0.666 + 3.38 + 3.92 = 9.9226. This number now has to be evaluated as
coming from a χ2 distribution with 5-1=4 df.
The χ2 Goodness–of–fit Test
Given a distribution p01 , . . . p0k
Hypotheses: H0 : p1 = p01 , . . . , pk = p0k versus Ha : H0 is not true.
Assumption: The sample size is large. The sample size is large enough for the χ2 test
to be appropriate as long as every observed count is at least 5.
Test statistic:
(Oi − Ei )2
Ei
categories
X
χ20 =
all
with df = k − 1
P-value: P (χ2 > χ20 ) the uppertail area for a χ2 distribution with k − 1 df, found in
Appendix Table D.
Decision:
Is the P-value ≤ α , then reject H0
Is the P-value > α , then do not reject H0
Continue Example:
1. Hypotheses: H0 : p1 = 0.5, p2 = 0.25, p3 = 0.15, p4 = 0.05, p5 = 0.05 versus Ha : H0 is
not true.
2. Assumptions: The observed counts are all large enough.
3. Test Statistic:
The test statistic is χ20 = 9.9226 with df=4 (see calculations above).
4. P-value: From Table D for df=4 find that 9.9922 falls between 9.49 and 11.14, with upper
tail probabilities 0.050 and 0.025.
So we conclude 0.025 <P-value< 0.05.
5. Decision: Since the p-value≤ 0.05 = α, we reject H0 at significance level of α = 0.05.
6. Interpretation: At significance level of 5% we find sufficient evidence in the sample that
the label is not giving a true description of the contents of the seed bag.
7
Download