The Analysis of Categorical Data

Categorical variables
• When both predictor and response variables are
categorical:
• Presence/absence
• Colors
• The data in such a study are counts (frequencies) of
observations in each category
Analysis
• Data: a single categorical predictor variable
  Analysis: organized as two-way contingency tables, and tested with the chi-square test or G-test
• Data: multiple predictor variables (or complex models)
  Analysis: organized as multiway contingency tables, and analyzed using either log-linear models or classification trees
Two way Contingency Tables
• Analysis of contingency tables is done correctly
only on the raw counts, not on the percentages,
proportions, or relative frequencies of the data
Wildebeest carcasses from the Serengeti
(Sinclair and Arcese 1995)
Variables
• Sex (males / females)
• Cause of death (predation / other)
• Bone marrow type:
1. Solid white fatty (healthy)
2. Opaque gelatinous
3. Translucent gelatinous
Data
Sex     Marrow   Predation
Male    SWF      Yes
Male    OG       Yes
Male    TG       Yes
…       …        …
Brief format
SEX      MARROW   DEATH   COUNT
FEMALE   SWF      PRED    26
MALE     SWF      PRED    14
FEMALE   OG       PRED    32
MALE     OG       PRED    43
FEMALE   TG       PRED    8
MALE     TG       PRED    10
FEMALE   SWF      NPRED   6
MALE     SWF      NPRED   7
FEMALE   OG       NPRED   26
MALE     OG       NPRED   12
FEMALE   TG       NPRED   16
MALE     TG       NPRED   26
Contingency table
Sex * Death Crosstabulation
Sex \ Death   NPRED   PRED   Total
FEMALE        48      66     114
MALE          45      67     112
Total         93      133    226
Contingency table
Sex * Marrow Crosstabulation
Sex \ Marrow   OG    SWF   TG   Total
FEMALE         58    32    24   114
MALE           55    21    36   112
Total          113   53    60   226
Contingency table
Death * Marrow Crosstabulation
Death \ Marrow   OG    SWF   TG   Total
NPRED            38    13    42   93
PRED             75    40    18   133
Total            113   53    60   226
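The three cross-tabulations above can be rebuilt from the brief-format counts by summing over the variable that is left out. The following is a minimal sketch using only the Python standard library; the names `records`, `FIELDS` and `crosstab` are illustrative and do not come from the slides.

```python
from collections import defaultdict

# Brief-format records: (sex, marrow, death, count), copied from the slides.
records = [
    ("FEMALE", "SWF", "PRED", 26), ("MALE", "SWF", "PRED", 14),
    ("FEMALE", "OG", "PRED", 32), ("MALE", "OG", "PRED", 43),
    ("FEMALE", "TG", "PRED", 8), ("MALE", "TG", "PRED", 10),
    ("FEMALE", "SWF", "NPRED", 6), ("MALE", "SWF", "NPRED", 7),
    ("FEMALE", "OG", "NPRED", 26), ("MALE", "OG", "NPRED", 12),
    ("FEMALE", "TG", "NPRED", 16), ("MALE", "TG", "NPRED", 26),
]

# Position of each variable inside a record (hypothetical helper mapping).
FIELDS = {"sex": 0, "marrow": 1, "death": 2}

def crosstab(rows, row_var, col_var):
    """Sum the raw counts over every variable except row_var and col_var."""
    table = defaultdict(int)
    for rec in rows:
        table[(rec[FIELDS[row_var]], rec[FIELDS[col_var]])] += rec[3]
    return dict(table)

sex_death = crosstab(records, "sex", "death")
sex_marrow = crosstab(records, "sex", "marrow")
death_marrow = crosstab(records, "death", "marrow")
total = sum(rec[3] for rec in records)

print(sex_death)
print(sex_marrow)
print(death_marrow)
print("N =", total)
```

Working from raw counts in this way keeps the analysis on frequencies rather than on percentages or proportions.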
Are the variables independent?
We want to know, for example, whether males are more
likely to die by predation than females…
• Our null hypothesis is that the predictor and
response variables are not associated with each
other i.e. the two variables are independent of each
other and the observed degree of association is not
stronger than we would expect by chance or random
sampling
Calculating the expected values
• The expected value is the total number of observations (N) times the probability of an individual being both male and dead by predation:
$$\hat{Y}_{\text{male, dead by predation}} = N \times P(\text{male} \cap \text{dead by predation})$$
The probability of two independent events
$$P(\text{male} \cap \text{dead by predation}) = P(\text{male}) \times P(\text{dead by predation})$$
Because we have no information other than the data, we estimate the probabilities of each of the right-hand terms of the equation from the marginal totals…
Contingency table
Sex * Death expected values
Sex \ Death   NPRED    PRED     Total   P
FEMALE        46.91    67.09    114     0.5044
MALE          46.09    65.91    112     0.4956
Total         93       133      N=226
P             0.4115   0.5885

$$\hat{Y}_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{\text{sample size}}$$
$$\hat{Y}_{\text{female, not predated}} = N \times P(\text{female} \cap \text{not predated})$$
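As a quick check of the expected values, the sketch below estimates the marginal probabilities from the Sex * Death counts and multiplies them by N. The variable names are illustrative only.

```python
# Observed Sex * Death counts (rows: sex, columns: cause of death).
observed = {
    "FEMALE": {"NPRED": 48, "PRED": 66},
    "MALE": {"NPRED": 45, "PRED": 67},
}

n = sum(sum(row.values()) for row in observed.values())
row_totals = {sex: sum(row.values()) for sex, row in observed.items()}
col_totals = {}
for row in observed.values():
    for death, count in row.items():
        col_totals[death] = col_totals.get(death, 0) + count

# Marginal probabilities estimated from the data.
p_row = {sex: t / n for sex, t in row_totals.items()}
p_col = {death: t / n for death, t in col_totals.items()}

# Expected count under independence: N * P(row) * P(column),
# which is the same as row total * column total / N.
expected = {
    sex: {death: n * p_row[sex] * p_col[death] for death in col_totals}
    for sex in row_totals
}

for sex, row in expected.items():
    print(sex, {death: round(value, 2) for death, value in row.items()})
```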
Testing the hypothesis:
Pearson’s Chi-square test
$$X^2_{\text{Pearson}} = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = 0.0866, \quad P = 0.7685$$
$$X^2_{\text{Yates}} = \sum_{\text{all cells}} \frac{(|\text{Observed} - \text{Expected}| - 0.5)^2}{\text{Expected}} = 0.0253, \quad P = 0.8736$$
The degrees of freedom
$$df = (\text{number of rows} - 1) \times (\text{number of columns} - 1) = 1$$
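The two statistics and the degrees of freedom can be recomputed from the Sex * Death counts. This is a minimal sketch in plain Python, not the software used for the slides; the variable names are illustrative.

```python
observed = [[48, 66], [45, 67]]  # rows: FEMALE, MALE; columns: NPRED, PRED

n = sum(map(sum, observed))
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Flatten to (observed, expected) pairs, one per cell.
cells = [
    (o, e)
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
]

# Pearson's statistic: squared deviations scaled by the expected count.
pearson = sum((o - e) ** 2 / e for o, e in cells)

# Yates' continuity correction: shrink each absolute deviation by 0.5.
yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)

df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(f"Pearson X2 = {pearson:.4f}, Yates X2 = {yates:.4f}, df = {df}")
```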
Calculating the P-value
• We find the probability of obtaining a
value of Χ2 as large or larger than
0.0866 relative to a Χ2 distribution with
1 degree of freedom
• P = 0.769
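For 1 degree of freedom the upper-tail probability has a closed form: a Χ2 variable with 1 df is the square of a standard normal variable, so the tail is `erfc(sqrt(x / 2))`. The sketch below relies on that identity; the function name `chi2_sf_1df` is illustrative, and a statistics library would normally be used for other degrees of freedom.

```python
import math

def chi2_sf_1df(x):
    """P(chi-square with 1 df >= x), via the standard normal tail."""
    return math.erfc(math.sqrt(x / 2.0))

p_pearson = chi2_sf_1df(0.0866)  # Pearson statistic from the slides
p_yates = chi2_sf_1df(0.0253)    # Yates-corrected statistic from the slides

print(f"P (Pearson) = {p_pearson:.4f}")
print(f"P (Yates)   = {p_yates:.4f}")
```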
[Figure: standardized residuals for sex (female, male) by cause of death (predator, non predator); legend bins: <-4, -4:-2, -2:0, 0:2, 2:4, >4]
An alternative
• The likelihood ratio test: It compares observed
values with the distribution of expected values
based on the multinomial probability distribution
$$G = 2 \sum_{\text{all cells}} \text{Observed} \times \ln\left(\frac{\text{Observed}}{\text{Expected}}\right) = 0.0866$$
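The G statistic for the Sex * Death table can be checked in the same way as Pearson's statistic. This is a minimal sketch that assumes every observed count is greater than zero, since the logarithm of a zero count is undefined.

```python
import math

observed = [[48, 66], [45, 67]]  # rows: FEMALE, MALE; columns: NPRED, PRED

n = sum(map(sum, observed))
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# G = 2 * sum over cells of Observed * ln(Observed / Expected)
g = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        g += 2.0 * obs * math.log(obs / exp)

print(f"G = {g:.4f}")
```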
Two way contingency tables
• Sex * Death Crosstabulation:
$X^2_{\text{Pearson}} = 0.087$, d.f. = 1, P = 0.769
$G = 0.087$, d.f. = 1, P = 0.769
• Sex * Marrow Crosstabulation:
$X^2_{\text{Pearson}} = 4.745$, d.f. = 2, P = 0.093
$G = 4.778$, d.f. = 2, P = 0.092
• Marrow * Death Crosstabulation:
$X^2_{\text{Pearson}} = 29.308$, d.f. = 2, P < 0.001
$G = 29.520$, d.f. = 2, P < 0.001
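All three comparisons can be reproduced with one helper. The sketch below is illustrative: `independence_tests` and `chi2_upper_tail` are hypothetical names, and the tail function only covers 1 and 2 degrees of freedom, where simple closed forms exist.

```python
import math

def independence_tests(table):
    """Return (Pearson X2, G, df) for a two-way table of raw counts."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    x2 = g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            x2 += (obs - exp) ** 2 / exp
            g += 2.0 * obs * math.log(obs / exp)  # assumes obs > 0
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, g, df

def chi2_upper_tail(x, df):
    """Closed forms for 1 and 2 df only, which is all these tables need."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 is handled in this sketch")

tables = {
    "Sex * Death": [[48, 66], [45, 67]],             # rows: FEMALE, MALE
    "Sex * Marrow": [[58, 32, 24], [55, 21, 36]],    # columns: OG, SWF, TG
    "Marrow * Death": [[38, 13, 42], [75, 40, 18]],  # rows: NPRED, PRED
}

results = {}
for name, table in tables.items():
    x2, g, df = independence_tests(table)
    results[name] = (x2, g, df, chi2_upper_tail(x2, df), chi2_upper_tail(g, df))
    print(f"{name}: X2 = {x2:.3f}, G = {g:.3f}, df = {df}, "
          f"P(X2) = {results[name][3]:.3f}, P(G) = {results[name][4]:.3f}")
```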
Which test to choose?
Model   Rows/Columns      Sample size   Test
I       Not fixed         small         G-test, with corrections
II      Fixed/not fixed   small         G-test, with corrections
I       Not fixed         large         G-test, Chi-square test
II      Fixed/not fixed   large         G-test, Chi-square test
III     Fixed                           Fisher exact test
Log-linear models
Multi-way Contingency Tables
Multiple two-way tables
Females
Death \ Marrow   OG   SWF   TG   Total
PRED             32   26    8    66
NPRED            26   6     16   48
Total            58   32    24   114
Males
Death \ Marrow   OG   SWF   TG   Total
PRED             43   14    10   67
NPRED            12   7     26   45
Total            55   21    36   112
Log-linear models
• They treat the cell frequencies as counts
distributed as a Poisson random variable
• The expected cell frequencies are modeled
against the variables using the log-link and
Poisson error term
• They are fit and parameters estimated using
maximum likelihood techniques
Log-linear models
• Do not distinguish response and
predictor variables: all the variables
are considered equally as response
variables
However
• A logit model with categorical variables
can be analyzed as a log-linear model
Two way tables
• For a two way table (I by J) we can fit two log-linear models
• The first is a saturated (full) model
• $\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$
• $f_{ij}$ is the expected frequency in cell ij
• $\lambda_i^X$ is the effect of category i of variable X
• $\lambda_j^Y$ is the effect of category j of variable Y
• $\lambda_{ij}^{XY}$ is the effect of any interaction between X and Y
• This model fits the observed frequencies perfectly!
Note
• The effect does not imply any causality,
just the influence of a variable or
interaction between variables on the log of
the expected number of observations in a
cell…
Two way tables
• The second log-linear model represents
independence of the two variables (X and Y)
and is a reduced model:
• $\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y$
• The interpretation of this model is that the log of
the expected frequency in any cell is a function
of the mean of the log of all the expected
frequencies plus the effect of variable x and the
effect of variable y. This is an additive linear
model with no interactions between the two
variables
Interpretation
• The parameters of the log-linear models are the
effects of a particular category of each variable
on the expected frequencies:
• i.e. a larger λ means that the expected
frequencies will be larger for that category of the variable.
• These parameters are also deviations from the
mean of the log of all expected frequencies.
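To make the additive structure concrete, the sketch below computes the constant and the row and column effects for the Sex * Death table under independence. It assumes the sum-to-zero parameterization described above, with effects taken as deviations from the mean of the log expected frequencies; other software may use a different coding such as a reference category. The names `constant`, `lambda_x` and `lambda_y` are illustrative.

```python
import math

observed = [[48, 66], [45, 67]]  # Sex (rows, X) by Death (columns, Y)

n = sum(map(sum, observed))
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Log of the expected frequencies under independence.
log_f = [[math.log(r * c / n) for c in col_totals] for r in row_totals]
n_rows, n_cols = len(log_f), len(log_f[0])

# Constant: mean of the log of all expected frequencies.
constant = sum(map(sum, log_f)) / (n_rows * n_cols)

# Effects: deviations of row and column means from the constant.
lambda_x = [sum(row) / n_cols - constant for row in log_f]
lambda_y = [sum(col) / n_rows - constant for col in zip(*log_f)]

# The additive model reproduces every log expected frequency exactly.
max_error = max(
    abs(log_f[i][j] - (constant + lambda_x[i] + lambda_y[j]))
    for i in range(n_rows)
    for j in range(n_cols)
)

print("constant =", round(constant, 4))
print("lambda_x =", [round(v, 4) for v in lambda_x])
print("lambda_y =", [round(v, 4) for v in lambda_y])
print("largest reconstruction error =", max_error)
```

The effects within each variable sum to zero, so the larger category (for example PRED for cause of death) receives the positive λ.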
Null hypothesis of independence
• The Ho is that the sampling or experimental units
come from a population of units in which the two
variables (rows and columns) are independent
of each other in terms of the cell frequencies
• It is also a test that $\lambda_{ij}^{XY} = 0$:
• There is NO interaction between the two variables
Test
• We can test this Ho by comparing the fit of the
model without this term to the saturated model
that includes this term
• We determine the fit of each model by
calculating the expected frequencies under
each model, comparing the observed and
expected frequencies, and calculating the log-likelihood of each model
Test
• We then compare the fit of the two models with the
likelihood ratio test statistic ∆
• However, the sampling distribution of this ratio (∆)
is not well known, so instead we calculate the G2
statistic
• $G^2 = -2 \log \Delta$
• G2 follows a Χ2 distribution for reasonable sample
sizes and can be generalized to
• $G^2 = -2(\text{log-likelihood of reduced model} - \text{log-likelihood of full model})$
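The relationship between G2 and the two log-likelihoods can be illustrated with the Sex * Death table. The sketch assumes Poisson-distributed cell counts, takes the saturated model's fitted values to be the observed counts, and uses the independence expected values for the reduced model. `poisson_loglik` is a hypothetical helper name.

```python
import math

def poisson_loglik(observed, fitted):
    """Poisson log-likelihood of the observed counts given fitted means."""
    return sum(
        o * math.log(mu) - mu - math.lgamma(o + 1)
        for o, mu in zip(observed, fitted)
    )

# Cells in the order: female NPRED, female PRED, male NPRED, male PRED.
observed = [48, 66, 45, 67]
row_totals = [114, 114, 112, 112]
col_totals = [93, 133, 93, 133]
n = 226

reduced = [r * c / n for r, c in zip(row_totals, col_totals)]  # independence
full = [float(o) for o in observed]                            # saturated

ll_reduced = poisson_loglik(observed, reduced)
ll_full = poisson_loglik(observed, full)
g2 = -2.0 * (ll_reduced - ll_full)

print(f"log-likelihood (reduced) = {ll_reduced:.4f}")
print(f"log-likelihood (full)    = {ll_full:.4f}")
print(f"G2 = {g2:.4f}")
```

The result agrees with the G statistic computed directly from observed and expected counts, because the terms that do not depend on the model cancel in the difference.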
Degrees of freedom
• The calculated G2 is compared to a Χ2
distribution with (I-1)(J-1) df.
• This df, (I-1)(J-1), is the difference between the df
for the full model (IJ-1) and the df for the reduced
model [(I-1)+(J-1)]
Akaike information criterion
$$AIC = -2 \log L(\hat{\theta} \mid \text{data}) + 2K$$
Hirotugu Akaike
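The full model
$$\log f_{ijk} = C + \lambda_i^{\text{death}} + \lambda_j^{\text{sex}} + \lambda_k^{\text{marrow}} + \lambda_{ij}^{\text{death} \times \text{sex}} + \lambda_{ik}^{\text{death} \times \text{marrow}} + \lambda_{jk}^{\text{sex} \times \text{marrow}} + \lambda_{ijk}^{\text{death} \times \text{sex} \times \text{marrow}}$$
$$AIC = G^2_{\text{particular model}} - 2\, df_{\text{particular model}}$$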
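As an illustration of G2, df and AIC for one particular model, the sketch below fits the simplest three-way model, D+S+M (complete independence). For this model the maximum likelihood fitted values have a closed form: N times the product of the three marginal proportions. Models with interaction terms generally need an iterative fit, which is not shown here. The variable names are illustrative.

```python
import math

# Observed counts keyed by (death, sex, marrow), from the brief-format data.
observed = {
    ("PRED", "FEMALE", "SWF"): 26, ("PRED", "MALE", "SWF"): 14,
    ("PRED", "FEMALE", "OG"): 32, ("PRED", "MALE", "OG"): 43,
    ("PRED", "FEMALE", "TG"): 8, ("PRED", "MALE", "TG"): 10,
    ("NPRED", "FEMALE", "SWF"): 6, ("NPRED", "MALE", "SWF"): 7,
    ("NPRED", "FEMALE", "OG"): 26, ("NPRED", "MALE", "OG"): 12,
    ("NPRED", "FEMALE", "TG"): 16, ("NPRED", "MALE", "TG"): 26,
}

def one_way_totals(table, axis):
    """Marginal totals for a single variable (0: death, 1: sex, 2: marrow)."""
    totals = {}
    for cell, count in table.items():
        totals[cell[axis]] = totals.get(cell[axis], 0) + count
    return totals

n = sum(observed.values())
death_tot = one_way_totals(observed, 0)
sex_tot = one_way_totals(observed, 1)
marrow_tot = one_way_totals(observed, 2)

# Fitted value under D+S+M: N * P(death) * P(sex) * P(marrow).
fitted = {
    (d, s, m): death_tot[d] * sex_tot[s] * marrow_tot[m] / n ** 2
    for (d, s, m) in observed
}

g2 = 2.0 * sum(o * math.log(o / fitted[cell]) for cell, o in observed.items())

# df = number of cells minus number of fitted parameters
# (constant + 1 for death + 1 for sex + 2 for marrow).
n_params = 1 + (len(death_tot) - 1) + (len(sex_tot) - 1) + (len(marrow_tot) - 1)
df = len(observed) - n_params
aic = g2 - 2 * df

print(f"G2 = {g2:.2f}, df = {df}, AIC = {aic:.2f}")
```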
Complete table
#   Model                  G2      df   P       AIC
1   D+S+M                  42.76   7    0.001   28.76
2   D*S                    42.68   6    0.001   30.68
3   D*M                    13.24   5    0.021   3.24
4   S*M                    37.98   5    0.001   27.98
5   D*S+D*M                13.16   4    0.01    5.16
6   D*S+S*M                37.89   4    0.001   29.89
7   D*M+S*M                8.46    3    0.037   2.46
8   D*S+D*M+S*M            7.19    2    0.027   3.19
9   Saturated full model   0       0
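The AIC column follows from the G2 and df columns, and the models can then be ranked. The sketch below uses the rounded table values and leaves out the saturated model, for which the table reports no AIC.

```python
# (model, G2, df) for models 1 to 8, copied from the complete table.
models = [
    ("D+S+M", 42.76, 7),
    ("D*S", 42.68, 6),
    ("D*M", 13.24, 5),
    ("S*M", 37.98, 5),
    ("D*S+D*M", 13.16, 4),
    ("D*S+S*M", 37.89, 4),
    ("D*M+S*M", 8.46, 3),
    ("D*S+D*M+S*M", 7.19, 2),
]

# AIC = G2 - 2 * df, rounded to the precision used in the table.
aic = {name: round(g2 - 2 * df, 2) for name, g2, df in models}

# Smaller AIC indicates a better trade-off between fit and complexity.
ranking = sorted(aic, key=aic.get)
best = ranking[0]

for name in ranking:
    print(f"{name:14s} AIC = {aic[name]:.2f}")
```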
Two way interactions
(marginal independence)
Term    Models compared   G2        Difference in G2      d.f.    P
D+S+M   reference         42.76
D*S     1 vs 2            42.6759   42.76-42.68=0.084     7-6=1   0.769
D*M     1 vs 3            13.24     42.76-13.24=29.520    7-5=2   <0.001
S*M     1 vs 4            37.98     42.76-37.98=4.778     7-5=2   0.092
Three way interaction
• Death*Sex*Marrow
• Models compared 8 vs 9
• G2 = 7.19
• df = 2
• P = 0.027
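Model 8 contains all two-way terms but no three-way term, and its fitted values have no closed form. One standard way to obtain them is iterative proportional fitting (IPF), which repeatedly rescales the fitted table to match each observed two-way margin. The sketch below is an assumption about how such a fit could be done, not the method used for the slides; `margin` and `fit_ipf` are hypothetical names.

```python
import math

# Observed counts keyed by (death, sex, marrow), from the brief-format data.
observed = {
    ("PRED", "FEMALE", "SWF"): 26, ("PRED", "MALE", "SWF"): 14,
    ("PRED", "FEMALE", "OG"): 32, ("PRED", "MALE", "OG"): 43,
    ("PRED", "FEMALE", "TG"): 8, ("PRED", "MALE", "TG"): 10,
    ("NPRED", "FEMALE", "SWF"): 6, ("NPRED", "MALE", "SWF"): 7,
    ("NPRED", "FEMALE", "OG"): 26, ("NPRED", "MALE", "OG"): 12,
    ("NPRED", "FEMALE", "TG"): 16, ("NPRED", "MALE", "TG"): 26,
}

def margin(table, axes):
    """Sum a table over all variables except those listed in axes."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + value
    return out

def fit_ipf(table, margins, n_iter=200):
    """Iterative proportional fitting to the given observed margins."""
    fitted = {cell: 1.0 for cell in table}
    targets = [margin(table, axes) for axes in margins]
    for _ in range(n_iter):
        for axes, target in zip(margins, targets):
            current = margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                fitted[cell] *= target[key] / current[key]
    return fitted

# Model 8 (D*S + D*M + S*M): fit all three two-way margins.
two_way_margins = [(0, 1), (0, 2), (1, 2)]
fitted = fit_ipf(observed, two_way_margins)

# G2 against the saturated model tests the three-way interaction.
g2 = 2.0 * sum(o * math.log(o / fitted[cell]) for cell, o in observed.items())
df = (2 - 1) * (2 - 1) * (3 - 1)
p_value = math.exp(-g2 / 2.0)  # chi-square upper tail for 2 df

print(f"G2 = {g2:.2f}, df = {df}, P = {p_value:.3f}")
```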
Conditional independence
Term   Models compared   G2      df   P
D*S    7 vs 8            1.28    1    0.259
D*M    6 vs 8            30.71   2    0.001
S*M    5 vs 8            5.97    2    0.051
Death and marrow have a partial association
Conditional independence
Females
Death \ Marrow   OG   SWF   TG   Total
PRED             32   26    8    66
NPRED            26   6     16   48
Total            58   32    24   114
Males
Death \ Marrow   OG   SWF   TG   Total
PRED             43   14    10   67
NPRED            12   7     26   45
Total            55   21    36   112

$$\hat{\theta}_{XY(k)} = \frac{n_{11k}\, n_{22k}}{n_{12k}\, n_{21k}}$$
$$ASE(\log \hat{\theta}_{XY(k)}) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$$
$$CI = e^{\log \hat{\theta}_{XY(k)} \pm z_{0.95} \times ASE(\log \hat{\theta}_{XY(k)})}$$
Experimental Design and Data Analysis for Biologists. Gerry P. Quinn (Monash University), Michael J. Keough (University of Melbourne)
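$$\hat{\theta}^{\text{SWF vs OG}}_{\text{female}} = \frac{26 \times 26}{32 \times 6} = 3.521$$
$$\hat{\theta}^{\text{SWF vs OG}}_{\text{male}} = \frac{14 \times 12}{43 \times 7} = 0.558$$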
Comparison   Males   95 % CI       Females   95 % CI
TG vs OG     0.107   0.041-0.283   0.406     0.150-1.097
TG vs SWF    0.192   0.060-0.616   0.115     0.034-0.395
SWF vs OG    0.558   0.184-1.693   3.521     1.261-9.836
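The odds ratios and their intervals can be recomputed from the Death by Marrow counts for each sex. The sketch below assumes a critical value of 1.96 for the 95% interval, which reproduces the tabulated limits to within rounding. `odds_ratio_ci` is an illustrative name.

```python
import math

# Counts per sex and marrow type as (PRED, NPRED), from the two tables above.
counts = {
    "Females": {"OG": (32, 26), "SWF": (26, 6), "TG": (8, 16)},
    "Males": {"OG": (43, 12), "SWF": (14, 7), "TG": (10, 26)},
}

Z = 1.96  # assumed critical value for a 95% confidence interval

def odds_ratio_ci(first, second):
    """Odds ratio of predation for marrow type `first` versus `second`."""
    n11, n21 = first    # PRED and NPRED counts for the first marrow type
    n12, n22 = second   # PRED and NPRED counts for the second marrow type
    theta = (n11 * n22) / (n12 * n21)
    ase = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower = math.exp(math.log(theta) - Z * ase)
    upper = math.exp(math.log(theta) + Z * ase)
    return theta, lower, upper

comparisons = [("TG", "OG"), ("TG", "SWF"), ("SWF", "OG")]
results = {
    (sex, a, b): odds_ratio_ci(table[a], table[b])
    for sex, table in counts.items()
    for a, b in comparisons
}

for (sex, a, b), (theta, lower, upper) in results.items():
    print(f"{sex:8s} {a} vs {b}: {theta:.3f} ({lower:.3f}-{upper:.3f})")
```

An interval that excludes 1 (for example SWF vs OG in females) suggests that the odds of dying by predation differ between the two marrow types for that sex.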
[Figure: odds ratios for females (F) and males (M) for the comparisons SWF vs OG, TG vs SWF and TG vs OG; panel titles: Frequentist, Bayesian]
Complete independence
• Models compared 1 vs 8
• G2 = 35.57
• df = 5
• P < 0.001
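Each of the comparisons reported in these slides is a difference in G2 and df between two nested models from the complete table. The sketch below recomputes a few of them from the rounded table values, so results can differ from the slides in the last digit. The upper-tail function uses the standard closed-form series for integer degrees of freedom; `chi2_sf` and `compare` are illustrative names.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x).
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 / math.pi) * math.exp(-half) * total)

# (G2, df) per model number, copied from the complete table.
models = {
    1: (42.76, 7), 2: (42.68, 6), 3: (13.24, 5), 4: (37.98, 5),
    5: (13.16, 4), 6: (37.89, 4), 7: (8.46, 3), 8: (7.19, 2), 9: (0.0, 0),
}

def compare(reduced, full):
    """Difference in G2 and df between two nested models, with its P-value."""
    delta_g2 = models[reduced][0] - models[full][0]
    delta_df = models[reduced][1] - models[full][1]
    return delta_g2, delta_df, chi2_sf(delta_g2, delta_df)

for label, pair in [("Complete independence", (1, 8)),
                    ("Three way interaction", (8, 9)),
                    ("S*M given D", (5, 8)),
                    ("D*M marginal", (1, 3))]:
    g2, df, p = compare(*pair)
    print(f"{label}: G2 = {g2:.2f}, df = {df}, P = {p:.4g}")
```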
Warning
• Always fit a saturated model first,
containing all the variables of interest and
all the interactions involving the (potential)
nuisance variables. Only delete from the
model the interactions that involve the
variables of interest.