Analysis of Categorical Data
Nick Jackson
University of Southern California
Department of Psychology
10/11/2013
Overview
▪ Data Types
▪ Contingency Tables
▪ Logit Models
◦ Binomial
◦ Ordinal
◦ Nominal
Things not covered (but still fit into the topic)

▪ Matched pairs/repeated measures
◦ McNemar’s Chi-Square
▪ Reliability
◦ Cohen’s Kappa
◦ ROC
▪ Poisson (Count) models
▪ Categorical SEM
◦ Tetrachoric Correlation
▪ Bernoulli Trials
Data Types (Levels of Measurement)
Discrete/Categorical/Qualitative
▪ Nominal/Multinomial:
◦ Properties: Values arbitrary (no magnitude); No direction (no ordering)
◦ Example: Race: 1=AA, 2=Ca, 3=As
◦ Measures: Mode, relative frequency
▪ Rank Order/Ordinal:
◦ Properties: Values semi-arbitrary (no magnitude?); Have direction (ordering)
◦ Example: Likert Scales (LICK-URT): 1-5, Strongly Disagree to Strongly Agree
◦ Measures: Mode, relative frequency, median, Mean?
▪ Binary/Dichotomous/Binomial:
◦ Properties: 2 Levels; Special case of Ordinal or Multinomial
◦ Examples: Gender (Multinomial), Disease (Y/N)
◦ Measures: Mode, relative frequency, Mean?
Continuous/Quantitative
Code 1.1
Contingency Tables
▪ Often called Two-way tables or Cross-Tab
▪ Have dimensions I x J
▪ Can be used to test hypotheses of association between categorical variables

2 X 3 Table
                Age Groups
Gender      <40 Years   40-50 Years   >50 Years
Female      25          68            63
Male        240         223           201
Contingency Tables: Test of Independence

Chi-Square Test of Independence (χ2)
◦ Calculate χ2
◦ Determine DF: (I-1) * (J-1)
◦ Compare to χ2 critical value for given DF.
2 X 3 Table
                Age Groups
Gender      <40 Years   40-50 Years   >50 Years   Total
Female      25          68            63          R1=156
Male        240         223           201         R2=664
Total       C1=265      C2=291        C3=264      N=820

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}, \qquad E_{i,j} = \frac{R_i \cdot C_j}{N}$$

Where: O_i = Observed Freq
E_i = Expected Freq
n = number of cells in table
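The calculation can be sketched in a few lines of Python, using only the cell counts from the 2 X 3 table. The variable names are illustrative, and the closed-form p-value shortcut applies only because this table has 2 degrees of freedom; the statistic comes out near 23.4.

```python
import math

# Observed counts from the 2 x 3 table: rows = gender, columns = age group.
observed = [
    [25, 68, 63],     # Female: <40, 40-50, >50
    [240, 223, 201],  # Male:   <40, 40-50, >50
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# Expected count under independence: E_ij = R_i * C_j / N
expected = [[r * c / n_total for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over all cells of (O - E)^2 / E
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 2 only, the chi-square upper-tail probability is exp(-x / 2).
p_value = math.exp(-chi2 / 2) if df == 2 else None

print("row totals:", row_totals, "column totals:", col_totals, "N:", n_total)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.2g}")
```

Summing the column totals is also a quick consistency check on the margins: they must add up to N.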
Code 1.2
Contingency Tables: Test of Independence

▪ Pearson Chi-Square Test of Independence (χ2)
◦ H0: No Association
◦ HA: Association….where, how?
▪ χ2(df = 2) = 23.39, p < 0.001
▪ Not appropriate when an Expected (Ei) cell freq < 5
◦ Use Fisher’s Exact Test
Contingency Tables

2x2
                          Disorder (Outcome)
Risk Factor/Exposure      Yes      No       Total
Yes                       a        b        a+b
No                        c        d        c+d
Total                     a+c      b+d      a+b+c+d
Contingency Tables: Measures of Association

                 Depression
Alcohol Use   Yes      No       Total
Yes           a=25     b=10     35
No            c=20     d=45     65
Total         45       55       100

Probability:
Depression given Alcohol Use
$$P(D \mid A) = \frac{a}{a+b} = \frac{25}{35} = 0.714$$
Depression given NO Alcohol Use
$$P(D \mid \bar{A}) = \frac{c}{c+d} = \frac{20}{65} = 0.308$$

Odds:
Depression given Alcohol Use
$$Odds(D \mid A) = \frac{P(D \mid A)}{1 - P(D \mid A)} = \frac{0.714}{1 - 0.714} = 2.5$$
Depression given NO Alcohol Use
$$Odds(D \mid \bar{A}) = \frac{P(D \mid \bar{A})}{1 - P(D \mid \bar{A})} = \frac{0.308}{1 - 0.308} = 0.44$$

Contrasting Probability:
$$Relative\ Risk\ (RR) = \frac{P(D \mid A)}{P(D \mid \bar{A})} = \frac{0.714}{0.308} = 2.32$$
Individuals who used alcohol were 2.32 times as likely to have depression as those who did not use alcohol.

Contrasting Odds:
$$Odds\ Ratio\ (OR) = \frac{Odds(D \mid A)}{Odds(D \mid \bar{A})} = \frac{2.5}{0.44} = 5.62$$
The odds of depression were 5.62 times as high in alcohol users as in nonusers.
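As a check on the arithmetic, here is a minimal Python sketch that derives every quantity on this slide straight from the four cell counts. Variable names are illustrative. Working from unrounded counts gives RR ≈ 2.32 and OR = 5.625, so hand calculations that use rounded intermediate values can differ in the last digit.

```python
# Cell counts from the 2 x 2 table
# (rows: alcohol use yes/no; columns: depression yes/no).
a, b = 25, 10   # alcohol users: depressed, not depressed
c, d = 20, 45   # nonusers:      depressed, not depressed

# Conditional probabilities of depression
p_exposed = a / (a + b)
p_unexposed = c / (c + d)

# Odds are p / (1 - p), which reduces to a / b and c / d
odds_exposed = p_exposed / (1 - p_exposed)
odds_unexposed = p_unexposed / (1 - p_unexposed)

# Contrasts between the two groups
relative_risk = p_exposed / p_unexposed
odds_ratio = odds_exposed / odds_unexposed   # same as (a * d) / (b * c)

print(f"P(D|A) = {p_exposed:.3f}, P(D|not A) = {p_unexposed:.3f}")
print(f"Odds(D|A) = {odds_exposed:.2f}, Odds(D|not A) = {odds_unexposed:.2f}")
print(f"RR = {relative_risk:.3f}, OR = {odds_ratio:.3f}")
```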
Why Odds Ratios?

                 Depression
Alcohol Use   Yes      No        Total
Yes           a=25     b=10*i    (25 + 10*i)
No            c=20     d=45*i    (20 + 45*i)
Total         45       55*i      (45 + 55*i)
i=1 to 45

[Figure: OR and RR (y-axis, "OR / RR", 2 to 6) plotted against Overall Probability of Depression (x-axis, 0 to .5)]
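The pattern in this figure can be reproduced numerically. The sketch below (names are illustrative) scales the non-depressed cells by i exactly as in the table and recomputes both measures. The OR stays fixed, while the RR starts well below it and climbs toward it as depression becomes rarer.

```python
# Depressed cells stay fixed; only the non-depressed cells grow with i.
a, c = 25, 20

rows = []
for i in range(1, 46):               # i = 1 to 45, as in the table
    b, d = 10 * i, 45 * i
    overall_p = (a + c) / (a + b + c + d)   # overall probability of depression
    rr = (a / (a + b)) / (c / (c + d))      # relative risk
    odds_ratio = (a * d) / (b * c)          # the factor i cancels
    rows.append((i, overall_p, rr, odds_ratio))

# Print every 11th row: i = 1, 12, 23, 34, 45
for i, overall_p, rr, odds_ratio in rows[::11]:
    print(f"i={i:2d}  P(depression)={overall_p:.3f}  RR={rr:.3f}  OR={odds_ratio:.3f}")
```

Because the OR does not depend on how common the outcome is, it is the same whichever way the sample was drawn, which is one reason it is the default effect measure in logistic models.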
The Generalized Linear Model

▪ General Linear Model (LM)
◦ Continuous Outcomes (DV)
◦ Linear Regression, t-test, Pearson correlation, ANOVA, ANCOVA
▪ Generalized Linear Model (GLM)
◦ John Nelder and Robert Wedderburn
◦ Maximum Likelihood Estimation
◦ Continuous, Categorical, and Count outcomes.
◦ Distribution Family and Link Functions
  ▪ Error distributions that are not normal
Logistic Regression
▪ “This is the most important model for categorical response data” –Agresti (Categorical Data Analysis, 2nd Ed.)
▪ Binary Response
▪ Predicting Probability (related to the Probit model)
▪ Assume (the usual):
◦ Independence
◦ NOT Homoscedasticity or Normal Errors
◦ Linearity (in the Log Odds)
◦ Also….adequate cell sizes.
Logistic Regression

▪ The Model
◦ $Y = \pi(x) = \frac{e^{\alpha + \beta_1 x_1}}{1 + e^{\alpha + \beta_1 x_1}}$
  ▪ In terms of probability of success π(x)
◦ $\mathrm{logit}[\pi(x)] = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \alpha + \beta_1 x_1$
  ▪ In terms of Logits (Log Odds)
  ▪ Logit transform gives us a linear equation
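To see the two forms of the model side by side, here is a small Python sketch. The coefficients alpha and beta1 are hypothetical values chosen only for illustration; they are not estimates from this deck. The probability traces an S-curve in x, while its logit is a straight line in x.

```python
import math

def inv_logit(eta):
    """Probability form of the model: e^eta / (1 + e^eta)."""
    return math.exp(eta) / (1 + math.exp(eta))

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

# Hypothetical coefficients, for illustration only.
alpha, beta1 = -2.0, 0.5

for x1 in (0, 2, 4, 6, 8):
    eta = alpha + beta1 * x1      # linear predictor (the logit)
    pi = inv_logit(eta)           # probability of success
    print(f"x1={x1}  logit={eta:+.2f}  pi(x)={pi:.3f}  logit(pi)={logit(pi):+.2f}")
```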
Code 2.1
Logistic Regression: Example
▪ The Output as Logits
◦ Logits: H0: β=0

Y=Depressed      Coef    SE      Z       P        CI
α (_constant)    -1.51   0.091   -16.7   <0.001   -1.69, -1.34

                 Freq.   Percent
Not Depressed    672     81.95
Depressed        148     18.05

Conversion to Probability:
$$\frac{e^{\beta}}{1 + e^{\beta}} = \frac{e^{-1.51}}{1 + e^{-1.51}} = 0.1805$$
What does H0: β=0 mean?
$$\frac{e^{\beta}}{1 + e^{\beta}} = \frac{e^{0}}{1 + e^{0}} = 0.5$$
Conversion to Odds
$$e^{\beta} = e^{-1.51} = 0.22$$
Also=0.1805/0.8195=0.22
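For a model with only a constant, the maximum-likelihood intercept is the log of the sample odds, so this output can be reproduced from the frequency table alone. The sketch below does that in Python; it assumes the usual large-sample (Wald) standard error for a log odds, sqrt(1/n1 + 1/n0), and a 1.96 multiplier for the 95% interval.

```python
import math

n_depressed, n_not_depressed = 148, 672
n_total = n_depressed + n_not_depressed

p = n_depressed / n_total                 # sample proportion depressed
odds = n_depressed / n_not_depressed      # sample odds of depression
alpha = math.log(odds)                    # intercept on the logit scale

# Assumed Wald standard error of a log odds, and the resulting z and 95% CI
se = math.sqrt(1 / n_depressed + 1 / n_not_depressed)
z = alpha / se
ci = (alpha - 1.96 * se, alpha + 1.96 * se)

print(f"proportion = {p:.4f}, odds = {odds:.3f}")
print(f"alpha = {alpha:.2f}, SE = {se:.3f}, z = {z:.1f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"back to probability: {math.exp(alpha) / (1 + math.exp(alpha)):.4f}")
```

Testing H0: β=0 for the constant therefore asks whether the proportion depressed differs from 0.5, which is rarely the question of interest.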
Code 2.2
Logistic Regression: Example

▪ The Output as ORs
◦ Odds Ratios: H0: OR = e^β = 1 (equivalent to β=0 on the logit scale)

Y=Depressed      OR      SE      Z       P        CI
α (_constant)    0.220   0.020   -16.7   <0.001   0.184, 0.263

                 Freq.   Percent
Not Depressed    672     81.95
Depressed        148     18.05

◦ Conversion to Probability:
$$\frac{OR}{1 + OR} = \frac{0.220}{1 + 0.220} = 0.1805$$
◦ Conversion to Logit (log odds!)
  ▪ Ln(OR) = logit
  ▪ Ln(0.220)=-1.51
Code 2.3
Logistic Regression: Example
Logistic Regression w/ Single Continuous Predictor:
◦ $\log\left(\frac{\pi(depressed)}{1 - \pi(depressed)}\right) = \alpha + \beta(age)$
AS LOGITS:

Y=Depressed      Coef    SE      Z       P        CI
α (_constant)    -2.24   0.489   -4.58   <0.001   -3.20, -1.28
β (age)          0.013   0.009   1.52    0.127    -0.004, 0.030

Interpretation:
A 1 unit increase in age results in a 0.013 increase in the log-odds of depression.
Hmmmm….I have no concept of what a log-odds is. Interpret as something else.
Logit > 0 so as age increases the risk of depression increases.
OR=e^0.013 = 1.013
For a 1 unit increase in age, the odds of depression are multiplied by 1.013.
We could also say: For a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR-1)*100 = % change].
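The conversions used in this interpretation are one-liners. The Python sketch below applies them to the age coefficient and its confidence limits from the table; the 10-year contrast at the end is an added illustration, not something reported in the output.

```python
import math

beta_age = 0.013                  # coefficient on the logit scale
ci_low, ci_high = -0.004, 0.030   # its confidence limits

odds_ratio = math.exp(beta_age)            # OR for a 1-unit increase in age
pct_change = (odds_ratio - 1) * 100        # percent change in the odds
or_ci = (math.exp(ci_low), math.exp(ci_high))

# Coefficients add on the logit scale, so a 10-unit contrast is exp(10 * beta).
or_10_units = math.exp(10 * beta_age)

print(f"OR = {odds_ratio:.3f} ({pct_change:.1f}% change in odds per unit)")
print(f"CI on the OR scale: ({or_ci[0]:.3f}, {or_ci[1]:.3f})")
print(f"OR for a 10-unit increase in age: {or_10_units:.3f}")
```

The confidence interval on the OR scale contains 1, which matches the non-significant p-value for age.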
Logistic Regression: GOF
• Overall Model Likelihood-Ratio Chi-Square
• Omnibus test for the model
• Overall model fit?
• Relative to other models
• Compares specified model with Null model (no predictors)
• Χ2 = -2*(LL0-LL1), DF = K, the number of predictor parameters estimated beyond the Null model
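A minimal Python sketch of the likelihood-ratio test follows. The null log-likelihood is computed from the 148/820 depression frequencies shown earlier in the deck. The fitted-model log-likelihood ll1 is a hypothetical value, back-solved so the statistic equals the 2.47 the deck reports for the age model. The p-value shortcut via erfc is valid only for 1 degree of freedom.

```python
import math

def binomial_null_loglik(n_events, n_total):
    """Log-likelihood of the intercept-only (Null) model for a binary outcome."""
    p = n_events / n_total
    return n_events * math.log(p) + (n_total - n_events) * math.log(1 - p)

def lr_chi2(ll_null, ll_model):
    """Likelihood-ratio statistic: -2 * (LL0 - LL1)."""
    return -2 * (ll_null - ll_model)

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square with exactly 1 df."""
    return math.erfc(math.sqrt(x / 2))

ll0 = binomial_null_loglik(148, 820)

# Hypothetical LL1 for a one-predictor model (illustration only),
# chosen so that the statistic equals 2.47.
ll1 = ll0 + 1.235

stat = lr_chi2(ll0, ll1)
p_value = chi2_sf_df1(stat)

print(f"LL0 = {ll0:.2f}, LL1 = {ll1:.2f}")
print(f"LR chi2 (df=1) = {stat:.2f}, p = {p_value:.3f}")
```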
Code 2.4
Logistic Regression: GOF
(Summary Measures)
▪ Pseudo-R2
◦ Not the same meaning as linear regression.
◦ There are many of them (Cox and Snell/McFadden)
◦ Only comparable within nested models of the same outcome.
▪ Hosmer-Lemeshow
◦ Models with Continuous Predictors
◦ Do the observed outcomes agree with the model's predicted probabilities? (χ2 test)
◦ H0: Good Fit for Data, so we want p>0.05
◦ Order the predicted probabilities, group them (g=10) by quantiles, Chi-Square of Group x Outcome (observed vs expected counts). DF=g-2
◦ Conservative (rarely rejects the null)
▪ Pearson Chi-Square
◦ Models with categorical predictors
◦ Similar to Hosmer-Lemeshow
▪ ROC-Area Under the Curve
◦ Predictive accuracy/Classification
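The Hosmer-Lemeshow grouping steps can be sketched in plain Python as below. This is one common form of the statistic, not necessarily identical to what a given stats package computes (tie handling and group boundaries vary). The demo data are synthetic, generated with a fixed seed purely for illustration. The p-value helper uses a closed form that holds only for even degrees of freedom, which covers the default g=10.

```python
import math
import random

def chi2_sf_even(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def hosmer_lemeshow(y, p_hat, g=10):
    """Sort by predicted probability, split into g groups, compare O and E."""
    pairs = sorted(zip(p_hat, y))
    n = len(pairs)
    stat = 0.0
    for k in range(g):
        group = pairs[k * n // g:(k + 1) * n // g]
        n_k = len(group)
        observed = sum(yi for _, yi in group)     # observed events
        expected = sum(pi for pi, _ in group)     # expected events
        mean_p = expected / n_k
        stat += (observed - expected) ** 2 / (n_k * mean_p * (1 - mean_p))
    df = g - 2
    return stat, df, chi2_sf_even(stat, df)

# Synthetic, well-calibrated data (illustration only).
rng = random.Random(0)
p_hat = [0.1 + 0.6 * (i + 0.5) / 500 for i in range(500)]
y = [1 if rng.random() < p else 0 for p in p_hat]
stat, df, p_value = hosmer_lemeshow(y, p_hat, g=10)
print(f"H-L chi2 = {stat:.2f}, df = {df}, p = {p_value:.3f}")

# Tail probability for a chi2 of 7.12 on 8 df.
p_reported = chi2_sf_even(7.12, 8)
print(f"P(chi2_8 > 7.12) = {p_reported:.4f}")
```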
Code 2.5
Logistic Regression: GOF
(Diagnostic Measures)
▪ Outliers in Y (Outcome)
◦ Pearson Residuals
  ▪ Square root of the contribution to the Pearson χ2
◦ Deviance Residuals
  ▪ Square root of the contribution to the likelihood-ratio test statistic of a saturated model vs fitted model.
▪ Outliers in X (Predictors)
◦ Leverage (Hat Matrix/Projection Matrix)
  ▪ Maps the influence of observed on fitted values
▪ Influential Observations
◦ Pregibon’s Delta-Beta influence statistic
◦ Similar to Cook’s-D in linear regression
▪ Detecting Problems
◦ Residuals vs Predictors
◦ Leverage Vs Residuals
◦ Boxplot of Delta-Beta
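For a single binary observation, the two residuals listed under Outliers in Y have simple closed forms. The sketch below assumes ungrouped data (one row per person) and uses hypothetical outcome and fitted-probability pairs; leverage and Delta-Beta need the full design matrix and are not shown.

```python
import math

def pearson_residual(y, p):
    """(observed - fitted) / binomial SD; its square is the Pearson chi2 contribution."""
    return (y - p) / math.sqrt(p * (1 - p))

def deviance_residual(y, p):
    """Signed square root of the observation's contribution to the deviance."""
    loglik = y * math.log(p) + (1 - y) * math.log(1 - p)
    return math.copysign(math.sqrt(-2 * loglik), y - p)

# Hypothetical (outcome, fitted probability) pairs, for illustration only.
cases = [(1, 0.90), (1, 0.18), (0, 0.18), (0, 0.05), (1, 0.02)]

for y, p in cases:
    print(f"y={y}  p={p:.2f}  Pearson={pearson_residual(y, p):+.2f}  "
          f"deviance={deviance_residual(y, p):+.2f}")
```

The last case (an event the model gave a 2% chance) shows why Pearson residuals flag badly predicted outcomes so sharply: the denominator shrinks as the fitted probability approaches 0 or 1.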
Logistic Regression: GOF
$$\log\left(\frac{\pi(depressed)}{1 - \pi(depressed)}\right) = \alpha + \beta_1(age)$$
L-R χ2 (df=1): 2.47, p=0.1162
McFadden’s R2: 0.0030
H-L GOF:
Number of Groups: 10
H-L Chi2: 7.12
DF: 8
P: 0.5233

Y=Depressed      Coef    SE      Z       P        CI
α (_constant)    -2.24   0.489   -4.58   <0.001   -3.20, -1.28
β (age)          0.013   0.009   1.52    0.127    -0.004, 0.030
Code 2.6
Logistic Regression: Diagnostics

▪ Linearity in the Log-Odds
◦ Use a lowess (loess) plot
◦ Depressed vs Age
[Figure: Lowess smoother, logit transformed smooth. y-axis: Depressed (Logit), -3 to 1; x-axis: age, 20 to 80; bandwidth = .8]
Code 2.7
Logistic Regression: Example
Logistic Regression w/ Single Categorical Predictor:
◦ $\log\left(\frac{\pi(depressed)}{1 - \pi(depressed)}\right) = \alpha + \beta_1(gender)$
AS OR:

Y=Depressed      OR      SE      Z       P        CI
α (_constant)    0.545   0.091   -3.63   <0.001   0.392, 0.756
β (male)         0.299   0.060   -5.99   <0.001   0.202, 0.444

Interpretation:
The odds of depression for males are 0.299 times the odds for females.
We could also say: The odds of depression are (1-0.299=.701) 70.1% lower in males compared to females.
Or…why not just make males the reference so the OR is greater than 1. Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34, so females have 3.34 times the odds of males.
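These conversions can be checked with a few lines of Python. The sketch assumes, as the table layout implies, that females are the reference group, so the exponentiated constant is the odds of depression among females. The predicted probabilities at the end are derived quantities, not figures reported in the output.

```python
odds_female = 0.545   # exponentiated constant: odds for the reference group
or_male = 0.299       # odds ratio, males vs females

# Odds for males are the reference odds times the odds ratio.
odds_male = odds_female * or_male

# Convert each group's odds back to a probability: odds / (1 + odds)
p_female = odds_female / (1 + odds_female)
p_male = odds_male / (1 + odds_male)

pct_lower = (1 - or_male) * 100      # percent reduction in odds for males
or_female_vs_male = 1 / or_male      # same effect with males as the reference

print(f"odds: female = {odds_female:.3f}, male = {odds_male:.3f}")
print(f"probability: female = {p_female:.3f}, male = {p_male:.3f}")
print(f"odds are {pct_lower:.1f}% lower in males; "
      f"OR (female vs male) = {or_female_vs_male:.2f}")
```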
Ordinal Logistic Regression
▪ Also called Ordered Logistic or Proportional Odds Model
▪ Extension of Binary Logistic Model
▪ >2 Ordered responses
▪ New Assumption!
◦ Proportional Odds
  ▪ BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
  ▪ The predictor’s effect on the outcome is the same across levels of the outcome.
  ▪ Bmi3grp (1 vs 2,3) = B(age)
  ▪ Bmi3grp (1,2 vs 3) = B(age)
Ordinal Logistic Regression

▪ The Model
◦ A latent variable model (Y*)
◦ j= number of levels-1
◦ $Y^{*} = \mathrm{logit}(p_1 + p_2 + \dots + p_j) = \ln\left(\frac{p_1 + p_2 + \dots + p_j}{1 - p_1 - p_2 - \dots - p_j}\right) = \alpha_j^{*} + \beta x$
◦ From the equation we can see that the odds ratio is assumed to be independent of the category j
Code 3.1
Ordinal Logistic Regression Example
AS LOGITS:

Y=bmi3grp           Coef     SE       Z       P        CI
β1 (age)            -0.026   0.006    -4.15   <0.001   -0.038, -0.014
β2 (blood_press)    0.012    0.005    2.48    0.013    0.002, 0.021
Threshold1/cut1     -0.696   0.6678                    -2.004, 0.613
Threshold2/cut2     0.773    0.6680                    -0.536, 2.082

For a 1 unit increase in Blood Pressure there is a 0.012 increase in the log-odds of being in a higher bmi category.

AS OR:

Y=bmi3grp           OR       SE       Z       P        CI
β1 (age)            0.974    0.006    -4.15   <0.001   0.962, 0.986
β2 (blood_press)    1.012    0.005    2.48    0.013    1.002, 1.022
Threshold1/cut1     -0.696   0.6678                    -2.004, 0.613
Threshold2/cut2     0.773    0.6680                    -0.536, 2.082

For a 1 unit increase in Blood Pressure the odds of being in a higher bmi category are multiplied by 1.012.
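To make the cutpoints concrete, the Python sketch below turns the estimates in the table into predicted category probabilities. It assumes the cutpoint parameterization P(Y ≤ j) = logistic(cut_j − xβ), which is consistent with a positive coefficient meaning a higher bmi category. Other software may use a different sign convention. The age and blood pressure values are hypothetical inputs.

```python
import math

def logistic(z):
    return 1 / (1 + math.exp(-z))

# Estimates from the logit table.
COEF = {"age": -0.026, "blood_press": 0.012}
CUTS = (-0.696, 0.773)

def category_probs(age, blood_press):
    """P(bmi3grp = 1), P(= 2), P(= 3), assuming P(Y <= j) = logistic(cut_j - xb)."""
    xb = COEF["age"] * age + COEF["blood_press"] * blood_press
    cumulative = [logistic(cut - xb) for cut in CUTS] + [1.0]
    return [cumulative[0]] + [
        cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))
    ]

def odds_above(age, blood_press, j):
    """Odds of being in a category above j (j = 1 or 2)."""
    p_above = 1 - sum(category_probs(age, blood_press)[:j])
    return p_above / (1 - p_above)

# Hypothetical person: age 50, blood pressure 120.
probs = category_probs(50, 120)
print("P(normal, overweight, obese) =", [round(p, 3) for p in probs])

# Proportional odds: a 1-unit rise in blood pressure multiplies the odds of a
# higher category by the same factor at both cutpoints.
for j in (1, 2):
    ratio = odds_above(50, 121, j) / odds_above(50, 120, j)
    print(f"cut {j}: odds ratio per unit of blood pressure = {ratio:.3f}")
```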
Code 3.2
Ordinal Logistic Regression: GOF

▪ Assessing Proportional Odds Assumptions
◦ Brant Test of Parallel Regression
  ▪ H0: Proportional Odds, thus want p >0.05
  ▪ Tests each predictor separately and overall
◦ Score Test of Parallel Regression
  ▪ H0: Proportional Odds, thus want p >0.05
◦ Approx Likelihood-ratio test
  ▪ H0: Proportional Odds, thus want p >0.05
Code 3.3
Ordinal Logistic Regression: GOF
▪ Pseudo R2
▪ Diagnostic Measures
◦ Performed on the j-1 binomial logistic regressions
Multinomial Logistic Regression
▪ Also called multinomial logit/polytomous logistic regression.
▪ Same assumptions as the binary logistic model
▪ >2 non-ordered responses
◦ Or you’ve failed to meet the proportional odds assumption of the Ordinal Logistic model
Multinomial Logistic Regression

▪ The Model
◦ j= levels for the outcome
◦ J=reference level
◦ $\pi_j(x) = P(Y = j \mid x)$ where x is a fixed setting of an explanatory variable
◦ $\mathrm{logit}[\pi_j(x)] = \ln\left(\frac{\pi_j(x)}{\pi_J(x)}\right) = \alpha_j + \beta_{j1} x_1 + \dots + \beta_{jp} x_p$
◦ Notice how it appears we are estimating a Relative Risk and not an Odds Ratio. It’s actually an OR.
◦ Similar to conducting separate binary logistic models, but with better type 1 error control
Code 4.1
Multinomial Logistic Regression
Example
Does degree of supernatural belief indicate a religious preference?
AS OR:

Y=religion (ref=Catholic(1))    OR      SE      Z       P        CI
Protestant (2)
  β (supernatural)              1.126   0.090   1.47    0.141    0.961, 1.317
  α (_constant)                 1.219   0.097   2.49    0.013    1.043, 1.425
Evangelical (3)
  β (supernatural)              1.218   0.117   2.06    0.039    1.010, 1.469
  α (_constant)                 0.619   0.059   -5.02   <0.001   0.512, 0.746

For a 1 unit increase in supernatural belief, there is a 21.8% increase [(OR-1)*100 = % change] in the odds of being an Evangelical compared to Catholic.
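Because each equation is a log odds against the reference category, predicted probabilities for all three groups come from normalizing the relative odds. The Python sketch below uses the exponentiated estimates from the table. It assumes the exponentiated constants are the odds relative to Catholic at a supernatural score of 0, and the scores passed in are hypothetical.

```python
# Exponentiated estimates from the table (reference category: Catholic).
MODEL = {
    "Protestant":  {"const": 1.219, "supernatural": 1.126},
    "Evangelical": {"const": 0.619, "supernatural": 1.218},
}

def religion_probs(supernatural):
    """Predicted probability of each category at a given supernatural score."""
    # Odds relative to Catholic: exp(alpha_j + beta_j * x) = const_j * OR_j ** x
    relative = {"Catholic": 1.0}
    for name, est in MODEL.items():
        relative[name] = est["const"] * est["supernatural"] ** supernatural
    total = sum(relative.values())
    return {name: value / total for name, value in relative.items()}

# Hypothetical scores 0 and 1, to show what a 1-unit increase does.
p0 = religion_probs(0)
p1 = religion_probs(1)
for label, probs in (("score 0", p0), ("score 1", p1)):
    print(label, {name: round(p, 3) for name, p in probs.items()})

# The Evangelical-vs-Catholic odds change by exactly the OR per unit.
ratio = (p1["Evangelical"] / p1["Catholic"]) / (p0["Evangelical"] / p0["Catholic"])
print(f"odds multiplier = {ratio:.3f} ({(ratio - 1) * 100:.1f}% increase in odds)")
```

The probabilities themselves do not change by 21.8%; only the odds relative to the reference category do, which is why the interpretation is stated in terms of odds.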
Multinomial Logistic Regression
GOF

▪ Limited GOF tests.
◦ Look at LR Chi-square and compare nested models.
◦ “Essentially, all models are wrong, but some are useful” –George E.P. Box
▪ Pseudo R2
▪ Similar to Ordinal
◦ Perform tests on the j-1 binomial logistic regressions
Resources
“Categorical Data Analysis” by Alan Agresti
UCLA Stat Computing:
http://www.ats.ucla.edu/stat/