Lecture 9

Categorical Data
Categorical Data Analysis
• To identify any association between two categorical variables.
Example:
1,073 subjects of both genders were recruited for a study
where the onset of severe chest pain is recorded for each
subject.
Variables:
- Onset of severe chest pain (+ve / –ve)
- Gender (male / female)
Chi-square tests
• Commonly denoted as χ²
• Useful in testing for independence between categorical
variables (e.g. genetic association between cases / controls)
\[
\chi^2 = \sum_{i=1}^{K} \frac{(|O_i - E_i|)^2}{E_i}
\]
Comparison of observed, against what is expected under
the null hypothesis.
Assumptions
• Sufficiently large data in each cell in the cross-tabulation
table.
Small Cell Counts
• In general, require
(a) Smallest expected count is 1 or more
(b) At least 80% of the cells have an expected count of 5 or more
• Yates' Continuity Correction
Provides a better approximation of the test statistic when the data is
dichotomous (2 x 2)
\[
\chi^2 = \sum_{i=1}^{K} \frac{(|O_i - E_i| - 0.5)^2}{E_i}
\]
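A minimal sketch of the two statistics above and of the rule-of-thumb check on expected counts, in plain Python. The function names and the toy counts are illustrative assumptions and do not come from the lecture data.

```python
def chi_square_stat(observed, expected, yates=False):
    """Sum of (|O - E| - c)^2 / E over all cells, with c = 0.5 if yates else 0."""
    c = 0.5 if yates else 0.0
    return sum((abs(o - e) - c) ** 2 / e for o, e in zip(observed, expected))


def expected_counts_ok(expected):
    """Rule of thumb: smallest expected count >= 1 and at least 80% of cells >= 5."""
    share_at_least_5 = sum(e >= 5 for e in expected) / len(expected)
    return min(expected) >= 1 and share_at_least_5 >= 0.8


# Toy 2 x 2 table, flattened cell by cell (made-up numbers for illustration only).
toy_observed = [12, 8, 8, 12]
toy_expected = [10, 10, 10, 10]

plain = chi_square_stat(toy_observed, toy_expected)
corrected = chi_square_stat(toy_observed, toy_expected, yates=True)
print(plain, corrected, expected_counts_ok(toy_expected))
```

The continuity correction shrinks every |O - E| by 0.5 before squaring, so the corrected statistic is never larger than the uncorrected one.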
Goodness-of-fit test
• Null hypothesis of a hypothesized distribution for the
data.
• Expected frequencies calculated under the
hypothesized distribution.
For example:
The number of outbreaks of flu epidemics is charted over the
period 1500 to 1931, and the number of outbreaks each year is
tabulated. The variable of interest counts the number of
outbreaks occurring in each year of that 432 year period. E.g.
there were 223 years with no flu outbreaks.
• Hypotheses:
H0: Data follows a Poisson distribution with mean 0.692
H1: Data does not follow a Poisson distribution with mean
0.692
Note: Mean 0.692 is obtained from the sample mean.
Sample mean
= (0 x 223 + 1 x 142 + 2 x 48 + 3 x 15 + 4 x 4 + 5 x 0) / 432
= 0.692
Expected frequency for X = 0
= 432 x P(X = 0), where X ~ Poisson(0.692)
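The expected frequencies can be sketched in a few lines of Python using the counts from the sample-mean calculation. Treating the last category as exactly 5 outbreaks (rather than "5 or more") is an assumption made here for simplicity, and the variable names are illustrative.

```python
import math

# Number of years with 0, 1, 2, 3, 4, 5 outbreaks, as used in the sample mean.
observed = {0: 223, 1: 142, 2: 48, 3: 15, 4: 4, 5: 0}
n_years = sum(observed.values())

# Sample mean number of outbreaks per year.
sample_mean = sum(k * count for k, count in observed.items()) / n_years


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Expected number of years with exactly k outbreaks under H0.
expected = {k: n_years * poisson_pmf(k, sample_mean) for k in observed}

for k in observed:
    print(k, observed[k], round(expected[k], 2))
```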
Test Statistic
\[
\chi^2 = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{6-1}
\]
with df = (6 – 1).
This yields a p-value of 0.99, indicating that there is no evidence against
the null hypothesis, so we do not reject it.
Test of independence
Most common usage of the Pearson’s chi-square test.
H0: The two categorical variables are independent
H1: The two categorical variables are associated (i.e. not
independent)
Under the independence assumption, if outcome A is
independent of outcome B, then
P(A and B happen jointly) = P(A happens) x P(B happens)
Calculating expected frequencies
P(Chest pain +) = 83/1073        P(Males)   = 520/1073
P(Chest pain -) = 990/1073       P(Females) = 553/1073
P(Males with chest pain +) = 83/1073 x 520/1073 = 0.0375
Expected(Males with chest pain +) = 1073 x P(Males with chest pain +)
                                  = 1073 x 0.0375
                                  = 40.224 (using the unrounded probability)
Observed(Males with chest pain +) = 46
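Test of independence
• Expected frequencies calculated by: \( E_{ij} = \frac{R_i C_j}{n} \), where R_i is the i-th row total, C_j is the j-th column total and n is the overall total
• Degrees of freedom = (r – 1) x (c – 1)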
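As a small sketch, the expected counts for the chest pain by gender table can be computed directly from the margins given above. The dictionary layout and names are illustrative choices.

```python
# Margins of the chest pain by gender table, as given in the lecture.
row_totals = {"chest pain +": 83, "chest pain -": 990}
col_totals = {"male": 520, "female": 553}
n = 1073

# E_ij = R_i * C_j / n for every cell of the cross-tabulation.
expected = {
    (r, c): row_totals[r] * col_totals[c] / n
    for r in row_totals
    for c in col_totals
}

# Degrees of freedom = (rows - 1) x (columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

for cell, value in expected.items():
    print(cell, round(value, 3))
```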
Chi-square test

Chi-Square Tests
                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                              (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             1.744 b   1    .187
Continuity Correction a        1.456     1    .228
Likelihood Ratio               1.745     1    .186
Fisher's Exact Test                                         .209         .114
Linear-by-Linear Association   1.743     1    .187
N of Valid Cases               1073
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is
40.22.
Looking at the validity of the assumption of sufficiently large sample
sizes!
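The Pearson and continuity-corrected rows of this output can be reproduced with a short sketch. Only the cell count 46 and the margins are stated in the lecture; the other three cells below are derived from those margins. The helper names are illustrative. In practice a library routine such as `scipy.stats.chi2_contingency` would normally be used instead of hand-written code.

```python
import math

# Rows: chest pain (+, -); columns: (male, female).
# 46 is given; the remaining cells follow from the margins 83, 520 and 553.
observed = [[46, 83 - 46],
            [520 - 46, 553 - (83 - 46)]]


def pearson_chi_square(table, yates=False):
    """Pearson chi-square for a two-way table, optionally with Yates' correction."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    correction = 0.5 if yates else 0.0
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            stat += (abs(o - e) - correction) ** 2 / e
    return stat


def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


pearson = pearson_chi_square(observed)
corrected = pearson_chi_square(observed, yates=True)
print(round(pearson, 3), round(p_value_df1(pearson), 3))
print(round(corrected, 3), round(p_value_df1(corrected), 3))
```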
Quantification of the effect
• The χ² test identifies whether there is a significant association
between the two categorical variables.
• But does not quantify the strength and direction of the
association.
• Need odds ratio to do this.
• Odds ratio defines “how many times more likely” it is to be
in one category compared to the other:
Example: For the previous example on severe chest pain,
males are about 1.4 times more likely to experience severe
chest pains than females.
Always know what is the outcome/event of
interest, and what is the baseline reference!
Otherwise OR can be interpreted both ways!
Odds ratio and relative risk
               Pos. outcome   Neg. outcome
Exposure (+)   a              b
Exposure (-)   c              d

\[
\mathrm{OR} = \frac{a/c}{b/d} = \frac{a/b}{c/d} = \frac{ad}{bc}
\]
Calculation of odds ratio is pretty straightforward.
- Use the product of the leading diagonal divided by the product of the
antidiagonal.
\[
\mathrm{RR} = \frac{a/(a+b)}{c/(c+d)} \neq \frac{a/(a+c)}{b/(b+d)}
\]
Relative risk is trickier, though, since
it is not symmetric! While it is commonly
used interchangeably with OR, the
interpretation and calculation are very
different!
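A brief sketch of both measures on the chest pain example, treating male as Exposure (+) and severe chest pain as the positive outcome. That labelling is an assumption for illustration, and the cells other than 46 are derived from the margins given earlier. Swapping the roles of rows and columns leaves the odds ratio unchanged but changes the relative risk.

```python
def odds_ratio(a, b, c, d):
    """OR = ad / bc for the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """RR = [a / (a + b)] / [c / (c + d)]: risk in row 1 over risk in row 2."""
    return (a / (a + b)) / (c / (c + d))


# Rows: male, female; columns: chest pain +, chest pain -.
a, b, c, d = 46, 474, 37, 516

or_value = odds_ratio(a, b, c, d)
rr_value = relative_risk(a, b, c, d)

# Transposed table: rows and columns swap roles (b and c trade places).
or_transposed = odds_ratio(a, c, b, d)
rr_transposed = relative_risk(a, c, b, d)

print(round(or_value, 3), round(or_transposed, 3))
print(round(rr_value, 3), round(rr_transposed, 3))
```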
Exegesis on epidemiology
Case-Control Study
• Compare affected and unaffected individuals
• Usually retrospective in nature
• Temporal sequence cannot be established (timing for the onset
of the disease)
• No information on population incidence of the disease
Odds ratio is the right metric here!
Cohort Study
• Usually random sampling of subjects within the population
• Prospective, retrospective or both
• Long follow-up; loss to follow-up
• Costly to conduct
• Temporal sequence can be established
• Provides information on population incidence of the disease
Relative risk is the appropriate metric here!
Confidence interval of odds ratio
• Not straightforward to obtain confidence intervals of
odds ratio (due to complexity in obtaining the
variance)
• Straightforward to obtain the variance of the
logarithm of odds ratio.
Varlog( OR )
ˆ1
1 1 1 1p
 log
 
 Var
a b c 1 d p
ˆ


  pˆ 2

  Var log 
1 
  1  pˆ 2
• Odds ratio is always reported together with the p-values
(obtained from Pearson’s Chi-square test),
and the corresponding confidence intervals.
Case study on smoking and lung cancer
              Ca (+ve)   Ca (-ve)
Smoking (+)   1,301      1,205
Smoking (-)   56         152

Odds and Odds Ratio
Odds Ratio (OR) = (1301/56)/(1205/152) = 2.93
Pearson’s Chi-square = 47.985, on df = 1
p-value ≈ 0
Var[log(OR)] = 1/1301 + 1/56 + 1/1205 + 1/152 = 0.026
95% Confidence interval = exp(log(2.93) ± 1.96 x √0.026) = (2.14, 4.02)
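The odds ratio and its confidence interval for this case study can be checked with a few lines of Python. This is a sketch of the log-scale interval described above; the variable names are illustrative.

```python
import math

# Smoking and lung cancer table from the case study.
a, b = 1301, 1205   # smokers: Ca (+ve), Ca (-ve)
c, d = 56, 152      # non-smokers: Ca (+ve), Ca (-ve)

odds_ratio = (a / c) / (b / d)

# Variance of log(OR) is the sum of reciprocal cell counts.
var_log_or = 1 / a + 1 / b + 1 / c + 1 / d

# 95% interval on the log scale, then back-transformed with exp().
half_width = 1.96 * math.sqrt(var_log_or)
ci_low = math.exp(math.log(odds_ratio) - half_width)
ci_high = math.exp(math.log(odds_ratio) + half_width)

print(round(odds_ratio, 2), round(var_log_or, 3))
print(round(ci_low, 2), round(ci_high, 2))
```

Because the interval is built on the log scale, it is not symmetric around the odds ratio itself.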

Beyond 2 x 2 tables
severe chest pain lasting 30 min or more * RACE Crosstabulation
                                          RACE
severe chest pain lasting
30 min or more                  Chinese   Malay     Indian    Total
yes     Count                   47        14        22        83
        % within RACE           6.2%      9.0%      13.5%     7.7%
no      Count                   707       142       141       990
        % within RACE           93.8%     91.0%     86.5%     92.3%
Total   Count                   754       156       163       1073
        % within RACE           100.0%    100.0%    100.0%    100.0%

Chi-Square Tests
                               Value      df   Asymp. Sig.
                                               (2-sided)
Pearson Chi-Square             10.300 a   2    .006
Likelihood Ratio               9.170      2    .010
Linear-by-Linear Association   10.156     1    .001
N of Valid Cases               1073
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 12.07.
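The same Pearson statistic extends directly to a table with more than two columns. The sketch below recomputes it for the chest pain by race counts; it relies on the fact that for df = 2 the chi-square upper-tail probability has the closed form exp(-x/2). Variable names are illustrative.

```python
import math

# Rows: chest pain yes / no; columns: Chinese, Malay, Indian.
observed = [[47, 14, 22],
            [707, 142, 141]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected counts under independence: E_ij = R_i * C_j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form valid only for df = 2.
p_value = math.exp(-chi2 / 2)
min_expected = min(min(row) for row in expected)

print(round(chi2, 3), df, round(p_value, 3), round(min_expected, 2))
```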
Nominal or ordinal
For categorical variables with two possible outcomes:
- Does not matter whether the variable is nominal or ordinal
For categorical variables with more than 2 outcomes:
- Important to note whether the variable is nominal or ordinal
- Test to use is very different, and thus conclusion reached can
be very different.
Example: Consider the same dataset on severe chest pain,
suppose we have the smoking status of every individual,
classified into:
- Non-smoker
- Daily smoker
- Excessive smoker
Smoking intensity
Chi-square test for trend
severe chest pain lasting 30 min or more * smoking status Crosstabulation
                                              smoking status
severe chest pain lasting
30 min or more                      non-smoker   daily smoker   Ex-smoker   Total
yes     Count                       53           19             9           81
        % within smoking status     6.7%         9.8%           13.2%       7.7%
no      Count                       736          174            59          969
        % within smoking status     93.3%        90.2%          86.8%       92.3%
Total   Count                       789          193            68          1050
        % within smoking status     100.0%       100.0%         100.0%      100.0%

Chi-Square Tests
                               Value     df   Asymp. Sig.
                                              (2-sided)
Pearson Chi-Square             5.243 a   2    .073
Likelihood Ratio               4.724     2    .094
Linear-by-Linear Association   5.236     1    .022
N of Valid Cases               1050
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 5.25.

OR_smoker = 1.52 (0.88, 2.63), p = 0.180
OR_ex-smoker = 2.11 (1.00, 4.51), p = 0.081
with non-smoker as reference category.
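A sketch of how the category-specific odds ratios and 95% confidence intervals can be computed against the non-smoker reference, using the counts from the crosstabulation. The p-values quoted above are not recomputed here, and the function and variable names are illustrative.

```python
import math

# (chest pain yes, chest pain no) counts per smoking status.
counts = {
    "non-smoker": (53, 736),
    "daily smoker": (19, 174),
    "ex-smoker": (9, 59),
}
reference = "non-smoker"


def odds_ratio_ci(group, ref=reference, z=1.96):
    """Odds ratio of `group` versus `ref` with a log-scale confidence interval."""
    a, b = counts[group]
    c, d = counts[ref]
    or_value = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(or_value) - z * se_log_or)
    high = math.exp(math.log(or_value) + z * se_log_or)
    return or_value, low, high


for group in counts:
    if group != reference:
        print(group, [round(x, 2) for x in odds_ratio_ci(group)])
```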
Linear-by-linear association
Adopts a correlational approach by calculating the Pearson
correlation coefficient between the rows and the columns,
allowing for ordinal outcomes in either.
Using the severe chest pain by smoking status crosstabulation from the
chi-square test for trend:
Recode rows as: yes = 0, no = 1.
Recode columns as:
non-smoker = 0, daily smoker = 1, ex-smoker = 2
Linear-by-linear association
Consider the test statistic:
T = (N – 1) r² ~ Chi-square(1)
= (1050 – 1) x (-0.0706)²
= 5.2356
where r is the Pearson correlation between the recoded rows and columns:
Pearson Correlation = -0.0706
This agrees with the Linear-by-Linear Association row of the Chi-Square Tests
output for the smoking status table (Value 5.236, df 1, Asymp. Sig. .022).
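The linear-by-linear statistic can be reproduced from the recoded scores with the standard library alone. This is a sketch under the coding stated in the lecture (yes = 0, no = 1; non-smoker = 0, daily smoker = 1, ex-smoker = 2); the variable names are illustrative.

```python
import math

# Cell counts keyed by (row score, column score).
counts = {
    (0, 0): 53, (0, 1): 19, (0, 2): 9,      # chest pain: yes
    (1, 0): 736, (1, 1): 174, (1, 2): 59,   # chest pain: no
}

n = sum(counts.values())
mean_row = sum(r * k for (r, _), k in counts.items()) / n
mean_col = sum(c * k for (_, c), k in counts.items()) / n

# Weighted sums of squares and cross-products over all subjects.
s_rc = sum((r - mean_row) * (c - mean_col) * k for (r, c), k in counts.items())
s_rr = sum((r - mean_row) ** 2 * k for (r, _), k in counts.items())
s_cc = sum((c - mean_col) ** 2 * k for (_, c), k in counts.items())

r_pearson = s_rc / math.sqrt(s_rr * s_cc)
t_stat = (n - 1) * r_pearson ** 2

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(t_stat / 2))

print(round(r_pearson, 4), round(t_stat, 3), round(p_value, 3))
```

The sign of r depends on the coding: with yes = 0 and no = 1, a negative correlation means the proportion with chest pain rises as the smoking score increases.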
Nominal vs. Ordinal
Importance of recognising the kind of variables we have in order
to identify the right test!
For the smoking status table (N = 1050):
- Treating smoking status as nominal: Pearson Chi-Square = 5.243, df = 2, p = .073
- Treating smoking status as ordinal: Linear-by-Linear Association = 5.236, df = 1, p = .022
Procedure for Categorical Data Analysis
• Summarise data using cross-tabulation tables, with
percentages
• Recognise whether any of the variables are ordinal
• Perform a chi-square test of independence to test for
association between the two categorical variables, or the
linear-by-linear test if at least one of the two variables
is ordinal
• Check the validity of the assumption on the sample size
• Quantify any significant association using odds ratios
• Always report odds ratios with corresponding 95%
confidence interval
Categorical Data Analysis in SPSS
Example: Let’s consider the lung cancer and smoking example:
              Ca (+ve)   Ca (-ve)
Smoking (+)   1,301      1,205
Smoking (-)   56         152
1. Establish the relationship between the onset of lung cancer and smoking
status. Quantify this relationship if it is statistically significant.
Data entry
Slightly counter-intuitively, the event of interest and the outcome of
interest should be coded as 0, and the baseline reference outcome/event
coded as 1.
Define what 0 and 1 correspond to under “Values”; the 0s and 1s are then
displayed as the labels you specified.
Percentages are much easier and more meaningful to interpret than
absolute numbers!
Highly significant, P < 0.001
Odds ratio of getting lung cancer with corresponding 95% CI, with
non-smoker as baseline
Relative risk of getting lung cancer with corresponding 95% CI, with
non-smoker as baseline
Students should be able to
• understand the use of a chi-square test for testing
independence between two categorical outcomes
• understand the assumptions on sample sizes for the use of a
chi-square test
• know how to quantify the association using odds ratio/relative
risk, with corresponding 95% confidence intervals
• differentiate between the tests to be used for nominal
categorical and ordinal categorical variables.
• perform the appropriate analyses in SPSS and RExcel