Analysis of Categorical Data
Nick Jackson
University of Southern California, Department of Psychology
10/11/2013

Overview
◦ Data Types
◦ Contingency Tables
◦ Logit Models
  - Binomial
  - Ordinal
  - Nominal

Things not covered (but still fit into the topic)
◦ Matched pairs/repeated measures: McNemar's Chi-Square
◦ Reliability: Cohen's Kappa, ROC
◦ Poisson (Count) models
◦ Categorical SEM: Tetrachoric Correlation
◦ Bernoulli Trials

Data Types (Levels of Measurement)
Discrete/Categorical/Qualitative vs. Continuous/Quantitative
◦ Nominal/Multinomial
  - Properties: values arbitrary (no magnitude), no direction (no ordering)
  - Example: Race: 1=AA, 2=Ca, 3=As
  - Measures: mode, relative frequency
◦ Rank Order/Ordinal
  - Properties: values semi-arbitrary (no magnitude?), have direction (ordering)
  - Example: Likert scales (pronounced "LICK-urt"): 1-5, Strongly Disagree to Strongly Agree
  - Measures: mode, relative frequency, median, mean?
◦ Binary/Dichotomous/Binomial
  - Properties: 2 levels; a special case of ordinal or multinomial
  - Examples: Gender (multinomial), Disease (Y/N)
  - Measures: mode, relative frequency, mean?

Contingency Tables (Code 1.1)
◦ Often called two-way tables or cross-tabs
◦ Have dimensions I x J
◦ Can be used to test hypotheses of association between categorical variables

2 x 3 table: Gender by Age Group

            <40 Years   40-50 Years   >50 Years   Total
  Female         25           68           63     R1=156
  Male          240          223          201     R2=664
  Total      C1=265       C2=291       C3=264      N=820

Contingency Tables: Test of Independence (Code 1.2)
◦ Pearson Chi-Square Test of Independence ($\chi^2$)
  - H0: no association; HA: association... where, how?
  - Calculate $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$, with $E_{i,j} = \frac{R_i \cdot C_j}{N}$,
    where $O_i$ = observed frequency, $E_i$ = expected frequency, and n = the number of cells in the table
  - Determine df = (I-1)(J-1) and compare to the $\chi^2$ critical value for that df
◦ For the table above: $\chi^2(df=2) = 23.39$, p < 0.001
◦ Not appropriate when any expected cell frequency ($E_i$) is < 5; use Fisher's Exact Test instead

Contingency Tables: 2 x 2

                         Disorder (Outcome)
  Risk Factor/Exposure     Yes     No      Total
  Yes                       a       b       a+b
  No                        c       d       c+d
  Total                    a+c     b+d    a+b+c+d

Contingency Tables: Measures of Association
Depression by Alcohol Use:

               Depressed   Not Depressed   Total
  Alcohol         a=25          b=10         35
  No Alcohol      c=20          d=45         65
  Total             45            55        100

Probability:
◦ Depression given alcohol use: $P(D|A) = \frac{a}{a+b} = \frac{25}{35} = 0.714$
◦ Depression given NO alcohol use: $P(D|\bar{A}) = \frac{c}{c+d} = \frac{20}{65} = 0.308$

Odds:
◦ Depression given alcohol use: $Odds(D|A) = \frac{P(D|A)}{1 - P(D|A)} = \frac{0.714}{1 - 0.714} = 2.5$
◦ Depression given NO alcohol use: $Odds(D|\bar{A}) = \frac{0.308}{1 - 0.308} = 0.44$

Contrasting probability:
◦ Relative Risk: $RR = \frac{P(D|A)}{P(D|\bar{A})} = \frac{0.714}{0.308} = 2.32$
◦ Individuals who used alcohol were 2.32 times more likely to have depression than those who did not use alcohol.

Contrasting odds:
◦ Odds Ratio: $OR = \frac{Odds(D|A)}{Odds(D|\bar{A})} = \frac{2.5}{0.44} = 5.62$
◦ The odds of depression were 5.62 times greater in alcohol users compared to nonusers.

Why Odds Ratios?
Scale the Not-Depressed column of the 2 x 2 by i = 1 to 45, so depression becomes rarer as i grows:

               Depressed   Not Depressed    Total
  Alcohol         a=25         b=10*i      25+10*i
  No Alcohol      c=20         d=45*i      20+45*i
  Total             45           55*i      45+55*i

[Figure: OR and RR plotted against the overall probability of depression (0 to 0.5). The OR holds at 5.62 regardless of prevalence, while the RR falls toward 2.3 as depression becomes common; OR ≈ RR only when the outcome is rare.]
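The companion files referenced above (Code 1.1, Code 1.2) are not reproduced in the deck. As a stand-in, here is a minimal sketch of the same computations in Python (scipy/numpy assumed in place of whatever package the course actually used):

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# 2x3 gender-by-age-group table from the slides
table = np.array([[25, 68, 63],
                  [240, 223, 201]])
chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 2), dof, p)   # ~23.39 on df=2, p < 0.001
print(expected)                 # E_ij = R_i * C_j / N; check all are >= 5

# 2x2 depression-by-alcohol table
a, b, c, d = 25, 10, 20, 45
p_exp, p_unexp = a / (a + b), c / (c + d)   # 0.714 and 0.308
rr = p_exp / p_unexp                        # relative risk, ~2.32
odds_ratio = (a * d) / (b * c)              # odds ratio, ~5.62
print(rr, odds_ratio)

# With any expected frequency < 5, fall back to Fisher's exact test
print(fisher_exact([[a, b], [c, d]]))
```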
The Generalized Linear Model
◦ General Linear Model (LM)
  - Continuous outcomes (DV)
  - Linear regression, t-test, Pearson correlation, ANOVA, ANCOVA
◦ Generalized Linear Model (GLM)
  - John Nelder and Robert Wedderburn
  - Maximum likelihood estimation
  - Continuous, categorical, and count outcomes
  - Distribution family and link functions: error distributions that are not normal

Logistic Regression
◦ "This is the most important model for categorical response data" –Agresti (Categorical Data Analysis, 2nd Ed.)
◦ Binary response; predicts probability (related to the probit model)
◦ Assume (the usual):
  - Independence
  - NOT homoscedasticity or normal errors
  - Linearity (in the log odds)
  - Also... adequate cell sizes

Logistic Regression: The Model
◦ In terms of the probability of success $\pi(x)$: $Y = \pi(x) = \frac{e^{\alpha + \beta_1 x_1}}{1 + e^{\alpha + \beta_1 x_1}}$
◦ In terms of logits (log odds): $logit(\pi(x)) = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \alpha + \beta_1 x_1$
◦ The logit transform gives us a linear equation

Logistic Regression: Example (Code 2.1)
The output as logits (H0: β = 0):

  Y=Depressed      Freq.   Percent
  Not Depressed      672     81.95
  Depressed          148     18.05

  Y=Depressed      Coef      SE       Z        P       CI
  α (_constant)   -1.51    0.091   -16.7   <0.001   -1.69, -1.34

◦ Conversion to probability: $\frac{e^{\beta}}{1 + e^{\beta}} = \frac{e^{-1.51}}{1 + e^{-1.51}} = 0.1805$
◦ Conversion to odds: $e^{\beta} = e^{-1.51} = 0.22$ (also 0.1805/0.8195 = 0.22)
◦ What does H0: β = 0 mean? $\frac{e^{0}}{1 + e^{0}} = 0.5$

Logistic Regression: Example (Code 2.2)
The output as odds ratios (H0: OR = 1):

  Y=Depressed       OR      SE       Z        P       CI
  α (_constant)   0.220   0.020   -16.7   <0.001   0.184, 0.263

◦ Conversion to probability: $\frac{OR}{1 + OR} = \frac{0.220}{1 + 0.220} = 0.1805$
◦ Conversion to logit (log odds!): ln(OR) = logit; ln(0.220) = -1.51

Logistic Regression: Example (Code 2.3)
Logistic regression with a single continuous predictor:
◦ $\ln\left(\frac{\pi(depressed)}{1 - \pi(depressed)}\right) = \alpha + \beta(age)$

AS LOGITS:

  Y=Depressed      Coef      SE       Z        P       CI
  α (_constant)   -2.24    0.489   -4.58   <0.001   -3.20, -1.28
  β (age)         0.013    0.009    1.52    0.127   -0.004, 0.030

Interpretation:
◦ A 1-unit increase in age results in a 0.013 increase in the log-odds of depression.
◦ Hmmmm... I have no concept of what a log-odds is. Interpret it as something else.
◦ The coefficient is > 0, so as age increases the risk of depression increases.
◦ OR = $e^{0.013}$ = 1.013: for a 1-unit increase in age, the odds of depression are multiplied by 1.013.
◦ We could also say: for a 1-unit increase in age there is a 1.3% increase in the odds of depression [(OR - 1) * 100 = % change].

Logistic Regression: GOF
◦ Overall model likelihood-ratio chi-square
  - Omnibus test for the model; overall model fit relative to other models
  - Compares the specified model with the null model (no predictors)
  - $\chi^2 = -2(LL_0 - LL_1)$, df = K parameters estimated

Logistic Regression: GOF, Summary Measures (Code 2.4)
◦ Pseudo-R2
  - Not the same meaning as in linear regression
  - There are many of them (Cox and Snell, McFadden)
  - Only comparable within nested models of the same outcome
◦ Hosmer-Lemeshow
  - For models with continuous predictors; is the model a better fit than the null model?
  - H0: good fit for the data, so we want p > 0.05
  - Order the predicted probabilities, group them by quantiles (g = 10), then a chi-square of group x outcome; df = g - 2
  - Conservative (rarely rejects the null)
◦ Pearson chi-square
  - For models with categorical predictors; similar to Hosmer-Lemeshow
◦ ROC area under the curve
  - Predictive accuracy/classification
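Again, the deck's companion files (Code 2.1-2.4) are not shown. Below is a minimal sketch of fitting the age model and pulling the GOF summaries above, using Python/statsmodels on data simulated from the slide's coefficients; the dataset and variable names are illustrative assumptions, not the course data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate n=820 observations from the slide's model: logit(p) = -2.24 + 0.013*age
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, 820)
p = 1 / (1 + np.exp(-(-2.24 + 0.013 * age)))
df = pd.DataFrame({"age": age, "depressed": rng.binomial(1, p)})

fit = smf.logit("depressed ~ age", data=df).fit(disp=False)
print(fit.params)                # coefficients on the logit (log-odds) scale
print(np.exp(fit.params))        # exponentiated: odds ratios
print(fit.llr, fit.llr_pvalue)   # LR chi-square vs. the null (intercept-only) model
print(fit.prsquared)             # McFadden's pseudo-R2
# Hosmer-Lemeshow is not built into statsmodels: it would be computed by hand
# (decile groups of predicted probability, chi-square of group x outcome, df = g - 2).
```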
Logistic Regression: GOF, Diagnostic Measures (Code 2.5)
◦ Outliers in Y (outcome)
  - Pearson residuals: square root of the contribution to the Pearson $\chi^2$
  - Deviance residuals: square root of the contribution to the likelihood-ratio test statistic of a saturated vs. fitted model
◦ Outliers in X (predictors)
  - Leverage (hat matrix/projection matrix): maps the influence of observed values on fitted values
◦ Influential observations
  - Pregibon's delta-beta influence statistic; similar to Cook's D in linear regression
◦ Detecting problems
  - Residuals vs. predictors
  - Leverage vs. residuals
  - Boxplot of delta-beta

Logistic Regression: GOF
$\ln\left(\frac{\pi(depressed)}{1 - \pi(depressed)}\right) = \alpha + \beta_1(age)$
◦ L-R $\chi^2$ (df=1): 2.47, p = 0.1162
◦ McFadden's R2: 0.0030
◦ H-L GOF: number of groups = 10, H-L $\chi^2$ = 7.12, df = 8, p = 0.5233

  Y=Depressed      Coef      SE       Z        P       CI
  α (_constant)   -2.24    0.489   -4.58   <0.001   -3.20, -1.28
  β (age)         0.013    0.009    1.52    0.127   -0.004, 0.030

Logistic Regression: Diagnostics (Code 2.6)
◦ Linearity in the log-odds: use a lowess (loess) plot of depressed vs. age
[Figure: lowess smoother, logit-transformed smooth of Depressed (logit) vs. age; bandwidth = 0.8]

Logistic Regression: Example (Code 2.7)
Logistic regression with a single categorical predictor:
◦ $\ln\left(\frac{\pi(depressed)}{1 - \pi(depressed)}\right) = \alpha + \beta_1(gender)$

AS OR:

  Y=Depressed       OR      SE       Z        P       CI
  α (_constant)   0.545   0.091   -3.63   <0.001   0.392, 0.756
  β (male)        0.299   0.060   -5.99   <0.001   0.202, 0.444

Interpretation:
◦ The odds of depression for males are 0.299 times the odds for females.
◦ We could also say: the odds of depression are (1 - 0.299 = 0.701) 70.1% lower in males compared to females.
◦ Or... why not just make males the reference category so the OR is greater than 1? Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34.

Ordinal Logistic Regression
◦ Also called ordered logistic or the proportional odds model
◦ An extension of the binary logistic model to >2 ordered responses
  - Example: BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
◦ New assumption! Proportional odds: the predictor's effect on the outcome is the same across levels of the outcome
  - β(age) for bmi3grp (1 vs. 2,3) = β(age) for bmi3grp (1,2 vs. 3)

Ordinal Logistic Regression: The Model
◦ A latent variable model (Y*), with j = number of levels - 1 cumulative splits:
  $logit(p_1 + p_2 + \dots + p_j) = \ln\left(\frac{p_1 + p_2 + \dots + p_j}{1 - p_1 - p_2 - \dots - p_j}\right) = \alpha_j + \beta x$
◦ From the equation we can see that the odds ratio is assumed to be independent of the category j

Ordinal Logistic Regression: Example (Code 3.1)

AS LOGITS:

  Y=bmi3grp           Coef      SE       Z        P       CI
  β1 (age)           -0.026   0.006   -4.15   <0.001   -0.038, -0.014
  β2 (blood_press)    0.012   0.005    2.48    0.013    0.002, 0.021
  Threshold1/cut1    -0.696   0.6678                    -2.004, 0.613
  Threshold2/cut2     0.773   0.6680                    -0.536, 2.082

◦ For a 1-unit increase in blood pressure there is a 0.012 increase in the log-odds of being in a higher BMI category.

AS OR:

  Y=bmi3grp            OR      SE       Z        P       CI
  β1 (age)           0.974   0.006   -4.15   <0.001   0.962, 0.986
  β2 (blood_press)   1.012   0.005    2.48    0.013   1.002, 1.022
  Threshold1/cut1   -0.696   0.6678                   -2.004, 0.613
  Threshold2/cut2    0.773   0.6680                   -0.536, 2.082

◦ For a 1-unit increase in blood pressure the odds of being in a higher BMI category are 1.012 times greater.

Ordinal Logistic Regression: GOF (Code 3.2)
Assessing the proportional odds assumption:
◦ Brant test of parallel regression
  - H0: proportional odds, thus we want p > 0.05
  - Tests each predictor separately and overall
◦ Score test of parallel regression: H0: proportional odds, thus we want p > 0.05
◦ Approximate likelihood-ratio test: H0: proportional odds, thus we want p > 0.05

Ordinal Logistic Regression: GOF (Code 3.3)
◦ Pseudo-R2
◦ Diagnostic measures: performed on the j - 1 binomial logistic regressions
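Code 3.1-3.3 are likewise not included in the deck. Here is a sketch of the proportional-odds model using statsmodels' OrderedModel, fit to data simulated from the slide's slopes and cutpoints (the variables and the simulation are assumptions, not the course data). Note the Brant test has no built-in statsmodels equivalent; checking proportional odds there means fitting the j - 1 binary logits and comparing slopes by hand:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated stand-in: bmi3grp (1=normal, 2=overweight, 3=obese) from age and BP,
# using the slopes (-0.026, 0.012) and cuts (-0.696, 0.773) reported on the slide
rng = np.random.default_rng(1)
n = 2000
age = rng.uniform(20, 80, n)
bp = rng.normal(120, 15, n)
ystar = -0.026 * age + 0.012 * bp + rng.logistic(size=n)  # latent variable Y*
bmi3grp = np.where(ystar <= -0.696, 1, np.where(ystar <= 0.773, 2, 3))

exog = pd.DataFrame({"age": age, "bp": bp})
endog = pd.Series(pd.Categorical(bmi3grp, categories=[1, 2, 3], ordered=True))
res = OrderedModel(endog, exog, distr="logit").fit(method="bfgs", disp=False)
print(res.summary())                       # slopes plus the two thresholds/cuts
print(np.exp(res.params[["age", "bp"]]))   # slopes as ORs, ~0.974 and ~1.012
```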
Multinomial Logistic Regression
◦ Also called multinomial logit or polytomous logistic regression
◦ Same assumptions as the binary logistic model
◦ >2 non-ordered responses
  - Or you've failed to meet the proportional (parallel) odds assumption of the ordinal logistic model

Multinomial Logistic Regression: The Model
◦ j = a level of the outcome; J = the reference level
◦ $\pi_j(x) = P(Y = j \mid x)$, where x is a fixed setting of an explanatory variable
◦ $logit(\pi_j(x)) = \ln\left(\frac{\pi_j(x)}{\pi_J(x)}\right) = \alpha_j + \beta_{j1} x_1 + \dots + \beta_{jp} x_p$
◦ Notice how it appears we are estimating a relative risk and not an odds ratio. It's actually an OR.
◦ Similar to conducting separate binary logistic models, but with better Type I error control

Multinomial Logistic Regression: Example (Code 4.1)
Does degree of supernatural belief indicate a religious preference?

AS OR (Y = religion, ref = Catholic (1)):

  Protestant (2)        OR      SE       Z        P       CI
    β (supernatural)  1.126   0.090    1.47    0.141   0.961, 1.317
    α (_constant)     1.219   0.097    2.49    0.013   1.043, 1.425

  Evangelical (3)       OR      SE       Z        P       CI
    β (supernatural)  1.218   0.117    2.06    0.039   1.010, 1.469
    α (_constant)     0.619   0.059   -5.02   <0.001   0.512, 0.746

◦ For a 1-unit increase in supernatural belief, there is a ((OR - 1) * 100 = % change) 21.8% increase in the odds of being Evangelical rather than Catholic. (A code sketch follows the Resources slide.)

Multinomial Logistic Regression: GOF
◦ Limited GOF tests
  - Look at the LR chi-square and compare nested models
  - "Essentially, all models are wrong, but some are useful" –George E.P. Box
◦ Pseudo-R2
◦ Similar to ordinal: perform tests on the j - 1 binomial logistic regressions

Resources
◦ "Categorical Data Analysis" by Alan Agresti
◦ UCLA Stat Computing: http://www.ats.ucla.edu/stat/
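Finally, the multinomial example referenced above as Code 4.1 (not shown in the deck). A sketch using statsmodels' MNLogit on data simulated from the slide's odds ratios; religion and supernatural here are simulated stand-ins for the survey variables:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# religion: 1=Catholic (reference), 2=Protestant, 3=Evangelical
rng = np.random.default_rng(2)
n = 2000
supernatural = rng.normal(0.0, 1.0, n)
# Baseline-category logits vs. Catholic; intercepts/slopes are logs of the slide ORs
lp2 = np.log(1.219) + np.log(1.126) * supernatural   # Protestant vs. Catholic
lp3 = np.log(0.619) + np.log(1.218) * supernatural   # Evangelical vs. Catholic
denom = 1 + np.exp(lp2) + np.exp(lp3)
probs = np.column_stack([1 / denom, np.exp(lp2) / denom, np.exp(lp3) / denom])
religion = np.array([rng.choice([1, 2, 3], p=row) for row in probs])

X = sm.add_constant(pd.DataFrame({"supernatural": supernatural}))
fit = sm.MNLogit(religion, X).fit(disp=False)
print(fit.summary())        # one equation per non-reference category
print(np.exp(fit.params))   # odds ratios vs. Catholic (the lowest-coded category)
```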