Analysis of Categorical Data
Nick Jackson
University of Southern California, Department of Psychology
10/11/2013

Overview
◦ Data Types
◦ Contingency Tables
◦ Logit Models
  - Binomial
  - Ordinal
  - Nominal

Things not covered (but still fit into the topic)
◦ Matched pairs/repeated measures: McNemar's Chi-Square
◦ Reliability: Cohen's Kappa, ROC
◦ Poisson (Count) models
◦ Categorical SEM: Tetrachoric Correlation
◦ Bernoulli Trials

Data Types (Levels of Measurement)
Discrete/Categorical/Qualitative vs. Continuous/Quantitative
◦ Nominal/Multinomial
  - Properties: values arbitrary (no magnitude), no direction (no ordering)
  - Example: Race: 1=AA, 2=Ca, 3=As
  - Measures: mode, relative frequency
◦ Rank Order/Ordinal
  - Properties: values semi-arbitrary (no magnitude?), have direction (ordering)
  - Example: Likert scales (pronounced "LICK-urt"): 1-5, Strongly Disagree to Strongly Agree
  - Measures: mode, relative frequency, median, mean?
◦ Binary/Dichotomous/Binomial
  - Properties: 2 levels; a special case of ordinal or multinomial
  - Examples: Gender (multinomial), Disease (Y/N)
  - Measures: mode, relative frequency, mean?

Contingency Tables (Code 1.1)
◦ Often called two-way tables or cross-tabs
◦ Have dimensions I x J
◦ Can be used to test hypotheses of association between categorical variables

2 x 3 table: Gender by Age Group

            <40 Years   40-50 Years   >50 Years   Total
  Female         25           68           63     R1=156
  Male          240          223          201     R2=664
  Total      C1=265       C2=291       C3=264      N=820

Contingency Tables: Test of Independence (Code 1.2)
◦ Pearson Chi-Square Test of Independence ($\chi^2$)
  - H0: no association; HA: association... where, how?
  - Calculate $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$, with $E_{i,j} = \frac{R_i \cdot C_j}{N}$,
    where $O_i$ = observed frequency, $E_i$ = expected frequency, and n = the number of cells in the table
  - Determine df = (I-1)(J-1) and compare to the $\chi^2$ critical value for that df
◦ For the table above: $\chi^2(df=2) = 23.39$, p < 0.001
◦ Not appropriate when any expected cell frequency ($E_i$) is < 5; use Fisher's Exact Test instead

Contingency Tables: 2 x 2

                         Disorder (Outcome)
  Risk Factor/Exposure     Yes     No      Total
  Yes                       a       b       a+b
  No                        c       d       c+d
  Total                    a+c     b+d    a+b+c+d

Contingency Tables: Measures of Association
Depression by Alcohol Use:

               Depressed   Not Depressed   Total
  Alcohol         a=25          b=10         35
  No Alcohol      c=20          d=45         65
  Total             45            55        100

Probability:
◦ Depression given alcohol use: $P(D|A) = \frac{a}{a+b} = \frac{25}{35} = 0.714$
◦ Depression given NO alcohol use: $P(D|\bar{A}) = \frac{c}{c+d} = \frac{20}{65} = 0.308$

Odds:
◦ Depression given alcohol use: $Odds(D|A) = \frac{P(D|A)}{1 - P(D|A)} = \frac{0.714}{1 - 0.714} = 2.5$
◦ Depression given NO alcohol use: $Odds(D|\bar{A}) = \frac{0.308}{1 - 0.308} = 0.44$

Contrasting probability:
◦ Relative Risk: $RR = \frac{P(D|A)}{P(D|\bar{A})} = \frac{0.714}{0.308} = 2.32$
◦ Individuals who used alcohol were 2.32 times more likely to have depression than those who did not use alcohol.

Contrasting odds:
◦ Odds Ratio: $OR = \frac{Odds(D|A)}{Odds(D|\bar{A})} = \frac{2.5}{0.44} = 5.62$
◦ The odds of depression were 5.62 times greater in alcohol users compared to nonusers.

Why Odds Ratios?
Scale the Not-Depressed column of the 2 x 2 by i = 1 to 45, so depression becomes rarer as i grows:

               Depressed   Not Depressed    Total
  Alcohol         a=25         b=10*i      25+10*i
  No Alcohol      c=20         d=45*i      20+45*i
  Total             45           55*i      45+55*i

[Figure: OR and RR plotted against the overall probability of depression (0 to 0.5). The OR holds at 5.62 regardless of prevalence, while the RR falls toward 2.3 as depression becomes common; OR ≈ RR only when the outcome is rare.]
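The companion files referenced above (Code 1.1, Code 1.2) are not reproduced in the deck. As a stand-in, here is a minimal sketch of the same computations in Python (scipy/numpy assumed in place of whatever package the course actually used):

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# 2x3 gender-by-age-group table from the slides
table = np.array([[25, 68, 63],
                  [240, 223, 201]])
chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 2), dof, p)   # ~23.39 on df=2, p < 0.001
print(expected)                 # E_ij = R_i * C_j / N; check all are >= 5

# 2x2 depression-by-alcohol table
a, b, c, d = 25, 10, 20, 45
p_exp, p_unexp = a / (a + b), c / (c + d)   # 0.714 and 0.308
rr = p_exp / p_unexp                        # relative risk, ~2.32
odds_ratio = (a * d) / (b * c)              # odds ratio, ~5.62
print(rr, odds_ratio)

# With any expected frequency < 5, fall back to Fisher's exact test
print(fisher_exact([[a, b], [c, d]]))
```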
The Generalized Linear Model
◦ General Linear Model (LM)
  - Continuous outcomes (DV)
  - Linear regression, t-test, Pearson correlation, ANOVA, ANCOVA
◦ Generalized Linear Model (GLM)
  - John Nelder and Robert Wedderburn
  - Maximum likelihood estimation
  - Continuous, categorical, and count outcomes
  - Distribution family and link functions: error distributions that are not normal

Logistic Regression
◦ "This is the most important model for categorical response data" –Agresti (Categorical Data Analysis, 2nd Ed.)
◦ Binary response; predicts probability (related to the probit model)
◦ Assume (the usual):
  - Independence
  - NOT homoscedasticity or normal errors
  - Linearity (in the log odds)
  - Also... adequate cell sizes

Logistic Regression: The Model
◦ In terms of the probability of success $\pi(x)$: $Y = \pi(x) = \frac{e^{\alpha + \beta_1 x_1}}{1 + e^{\alpha + \beta_1 x_1}}$
◦ In terms of logits (log odds): $logit(\pi(x)) = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \alpha + \beta_1 x_1$
◦ The logit transform gives us a linear equation

Logistic Regression: Example (Code 2.1)
The output as logits (H0: β = 0):

  Y=Depressed      Freq.   Percent
  Not Depressed      672     81.95
  Depressed          148     18.05

  Y=Depressed      Coef      SE       Z        P       CI
  α (_constant)   -1.51    0.091   -16.7   <0.001   -1.69, -1.34

◦ Conversion to probability: $\frac{e^{\beta}}{1 + e^{\beta}} = \frac{e^{-1.51}}{1 + e^{-1.51}} = 0.1805$
◦ Conversion to odds: $e^{\beta} = e^{-1.51} = 0.22$ (also 0.1805/0.8195 = 0.22)
◦ What does H0: β = 0 mean? $\frac{e^{0}}{1 + e^{0}} = 0.5$

Logistic Regression: Example (Code 2.2)
The output as odds ratios (H0: OR = 1):

  Y=Depressed       OR      SE       Z        P       CI
  α (_constant)   0.220   0.020   -16.7   <0.001   0.184, 0.263

◦ Conversion to probability: $\frac{OR}{1 + OR} = \frac{0.220}{1 + 0.220} = 0.1805$
◦ Conversion to logit (log odds!): ln(OR) = logit; ln(0.220) = -1.51

Logistic Regression: Example (Code 2.3)
Logistic regression with a single continuous predictor:
◦ $\ln\left(\frac{\pi(depressed)}{1 - \pi(depressed)}\right) = \alpha + \beta(age)$

AS LOGITS:

  Y=Depressed      Coef      SE       Z        P       CI
  α (_constant)   -2.24    0.489   -4.58   <0.001   -3.20, -1.28
  β (age)         0.013    0.009    1.52    0.127   -0.004, 0.030

Interpretation:
◦ A 1-unit increase in age results in a 0.013 increase in the log-odds of depression.
◦ Hmmmm... I have no concept of what a log-odds is. Interpret it as something else.
◦ The coefficient is > 0, so as age increases the risk of depression increases.
◦ OR = $e^{0.013}$ = 1.013: for a 1-unit increase in age, the odds of depression are multiplied by 1.013.
◦ We could also say: for a 1-unit increase in age there is a 1.3% increase in the odds of depression [(OR - 1) * 100 = % change].

Logistic Regression: GOF
◦ Overall model likelihood-ratio chi-square
  - Omnibus test for the model; overall model fit relative to other models
  - Compares the specified model with the null model (no predictors)
  - $\chi^2 = -2(LL_0 - LL_1)$, df = K parameters estimated

Logistic Regression: GOF, Summary Measures (Code 2.4)
◦ Pseudo-R2
  - Not the same meaning as in linear regression
  - There are many of them (Cox and Snell, McFadden)
  - Only comparable within nested models of the same outcome
◦ Hosmer-Lemeshow
  - For models with continuous predictors; is the model a better fit than the null model?
  - H0: good fit for the data, so we want p > 0.05
  - Order the predicted probabilities, group them by quantiles (g = 10), then a chi-square of group x outcome; df = g - 2
  - Conservative (rarely rejects the null)
◦ Pearson chi-square
  - For models with categorical predictors; similar to Hosmer-Lemeshow
◦ ROC area under the curve
  - Predictive accuracy/classification
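Again, the deck's companion files (Code 2.1-2.4) are not shown. Below is a minimal sketch of fitting the age model and pulling the GOF summaries above, using Python/statsmodels on data simulated from the slide's coefficients; the dataset and variable names are illustrative assumptions, not the course data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate n=820 observations from the slide's model: logit(p) = -2.24 + 0.013*age
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, 820)
p = 1 / (1 + np.exp(-(-2.24 + 0.013 * age)))
df = pd.DataFrame({"age": age, "depressed": rng.binomial(1, p)})

fit = smf.logit("depressed ~ age", data=df).fit(disp=False)
print(fit.params)                # coefficients on the logit (log-odds) scale
print(np.exp(fit.params))        # exponentiated: odds ratios
print(fit.llr, fit.llr_pvalue)   # LR chi-square vs. the null (intercept-only) model
print(fit.prsquared)             # McFadden's pseudo-R2
# Hosmer-Lemeshow is not built into statsmodels: it would be computed by hand
# (decile groups of predicted probability, chi-square of group x outcome, df = g - 2).
```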
Logistic Regression: GOF, Diagnostic Measures (Code 2.5)
◦ Outliers in Y (outcome)
  - Pearson residuals: square root of the contribution to the Pearson $\chi^2$
  - Deviance residuals: square root of the contribution to the likelihood-ratio test statistic of a saturated vs. fitted model
◦ Outliers in X (predictors)
  - Leverage (hat matrix/projection matrix): maps the influence of observed values on fitted values
◦ Influential observations
  - Pregibon's delta-beta influence statistic; similar to Cook's D in linear regression
◦ Detecting problems
  - Residuals vs. predictors
  - Leverage vs. residuals
  - Boxplot of delta-beta

Logistic Regression: GOF
$\ln\left(\frac{\pi(depressed)}{1 - \pi(depressed)}\right) = \alpha + \beta_1(age)$
◦ L-R $\chi^2$ (df=1): 2.47, p = 0.1162
◦ McFadden's R2: 0.0030
◦ H-L GOF: number of groups = 10, H-L $\chi^2$ = 7.12, df = 8, p = 0.5233

  Y=Depressed      Coef      SE       Z        P       CI
  α (_constant)   -2.24    0.489   -4.58   <0.001   -3.20, -1.28
  β (age)         0.013    0.009    1.52    0.127   -0.004, 0.030

Logistic Regression: Diagnostics (Code 2.6)
◦ Linearity in the log-odds: use a lowess (loess) plot of depressed vs. age
[Figure: lowess smoother, logit-transformed smooth of Depressed (logit) vs. age; bandwidth = 0.8]

Logistic Regression: Example (Code 2.7)
Logistic regression with a single categorical predictor:
◦ $\ln\left(\frac{\pi(depressed)}{1 - \pi(depressed)}\right) = \alpha + \beta_1(gender)$

AS OR:

  Y=Depressed       OR      SE       Z        P       CI
  α (_constant)   0.545   0.091   -3.63   <0.001   0.392, 0.756
  β (male)        0.299   0.060   -5.99   <0.001   0.202, 0.444

Interpretation:
◦ The odds of depression for males are 0.299 times the odds for females.
◦ We could also say: the odds of depression are (1 - 0.299 = 0.701) 70.1% lower in males compared to females.
◦ Or... why not just make males the reference category so the OR is greater than 1? Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34.

Ordinal Logistic Regression
◦ Also called ordered logistic or the proportional odds model
◦ An extension of the binary logistic model to >2 ordered responses
  - Example: BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
◦ New assumption! Proportional odds: the predictor's effect on the outcome is the same across levels of the outcome
  - β(age) for bmi3grp (1 vs. 2,3) = β(age) for bmi3grp (1,2 vs. 3)

Ordinal Logistic Regression: The Model
◦ A latent variable model (Y*), with j = number of levels - 1 cumulative splits:
  $logit(p_1 + p_2 + \dots + p_j) = \ln\left(\frac{p_1 + p_2 + \dots + p_j}{1 - p_1 - p_2 - \dots - p_j}\right) = \alpha_j + \beta x$
◦ From the equation we can see that the odds ratio is assumed to be independent of the category j

Ordinal Logistic Regression: Example (Code 3.1)

AS LOGITS:

  Y=bmi3grp           Coef      SE       Z        P       CI
  β1 (age)           -0.026   0.006   -4.15   <0.001   -0.038, -0.014
  β2 (blood_press)    0.012   0.005    2.48    0.013    0.002, 0.021
  Threshold1/cut1    -0.696   0.6678                    -2.004, 0.613
  Threshold2/cut2     0.773   0.6680                    -0.536, 2.082

◦ For a 1-unit increase in blood pressure there is a 0.012 increase in the log-odds of being in a higher BMI category.

AS OR:

  Y=bmi3grp            OR      SE       Z        P       CI
  β1 (age)           0.974   0.006   -4.15   <0.001   0.962, 0.986
  β2 (blood_press)   1.012   0.005    2.48    0.013   1.002, 1.022
  Threshold1/cut1   -0.696   0.6678                   -2.004, 0.613
  Threshold2/cut2    0.773   0.6680                   -0.536, 2.082

◦ For a 1-unit increase in blood pressure the odds of being in a higher BMI category are 1.012 times greater.

Ordinal Logistic Regression: GOF (Code 3.2)
Assessing the proportional odds assumption:
◦ Brant test of parallel regression
  - H0: proportional odds, thus we want p > 0.05
  - Tests each predictor separately and overall
◦ Score test of parallel regression: H0: proportional odds, thus we want p > 0.05
◦ Approximate likelihood-ratio test: H0: proportional odds, thus we want p > 0.05

Ordinal Logistic Regression: GOF (Code 3.3)
◦ Pseudo-R2
◦ Diagnostic measures: performed on the j - 1 binomial logistic regressions
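Code 3.1-3.3 are likewise not included in the deck. Here is a sketch of the proportional-odds model using statsmodels' OrderedModel, fit to data simulated from the slide's slopes and cutpoints (the variables and the simulation are assumptions, not the course data). Note the Brant test has no built-in statsmodels equivalent; checking proportional odds there means fitting the j - 1 binary logits and comparing slopes by hand:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated stand-in: bmi3grp (1=normal, 2=overweight, 3=obese) from age and BP,
# using the slopes (-0.026, 0.012) and cuts (-0.696, 0.773) reported on the slide
rng = np.random.default_rng(1)
n = 2000
age = rng.uniform(20, 80, n)
bp = rng.normal(120, 15, n)
ystar = -0.026 * age + 0.012 * bp + rng.logistic(size=n)  # latent variable Y*
bmi3grp = np.where(ystar <= -0.696, 1, np.where(ystar <= 0.773, 2, 3))

exog = pd.DataFrame({"age": age, "bp": bp})
endog = pd.Series(pd.Categorical(bmi3grp, categories=[1, 2, 3], ordered=True))
res = OrderedModel(endog, exog, distr="logit").fit(method="bfgs", disp=False)
print(res.summary())                       # slopes plus the two thresholds/cuts
print(np.exp(res.params[["age", "bp"]]))   # slopes as ORs, ~0.974 and ~1.012
```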
Multinomial Logistic Regression
◦ Also called multinomial logit or polytomous logistic regression
◦ Same assumptions as the binary logistic model
◦ >2 non-ordered responses
  - Or you've failed to meet the proportional (parallel) odds assumption of the ordinal logistic model

Multinomial Logistic Regression: The Model
◦ j = a level of the outcome; J = the reference level
◦ $\pi_j(x) = P(Y = j \mid x)$, where x is a fixed setting of an explanatory variable
◦ $logit(\pi_j(x)) = \ln\left(\frac{\pi_j(x)}{\pi_J(x)}\right) = \alpha_j + \beta_{j1} x_1 + \dots + \beta_{jp} x_p$
◦ Notice how it appears we are estimating a relative risk and not an odds ratio. It's actually an OR.
◦ Similar to conducting separate binary logistic models, but with better Type I error control

Multinomial Logistic Regression: Example (Code 4.1)
Does degree of supernatural belief indicate a religious preference?

AS OR (Y = religion, ref = Catholic (1)):

  Protestant (2)        OR      SE       Z        P       CI
    β (supernatural)  1.126   0.090    1.47    0.141   0.961, 1.317
    α (_constant)     1.219   0.097    2.49    0.013   1.043, 1.425

  Evangelical (3)       OR      SE       Z        P       CI
    β (supernatural)  1.218   0.117    2.06    0.039   1.010, 1.469
    α (_constant)     0.619   0.059   -5.02   <0.001   0.512, 0.746

◦ For a 1-unit increase in supernatural belief, there is a ((OR - 1) * 100 = % change) 21.8% increase in the odds of being Evangelical rather than Catholic. (A code sketch follows the Resources slide.)

Multinomial Logistic Regression: GOF
◦ Limited GOF tests
  - Look at the LR chi-square and compare nested models
  - "Essentially, all models are wrong, but some are useful" –George E.P. Box
◦ Pseudo-R2
◦ Similar to ordinal: perform tests on the j - 1 binomial logistic regressions

Resources
◦ "Categorical Data Analysis" by Alan Agresti
◦ UCLA Stat Computing: http://www.ats.ucla.edu/stat/
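Finally, the multinomial example referenced above as Code 4.1 (not shown in the deck). A sketch using statsmodels' MNLogit on data simulated from the slide's odds ratios; religion and supernatural here are simulated stand-ins for the survey variables:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# religion: 1=Catholic (reference), 2=Protestant, 3=Evangelical
rng = np.random.default_rng(2)
n = 2000
supernatural = rng.normal(0.0, 1.0, n)
# Baseline-category logits vs. Catholic; intercepts/slopes are logs of the slide ORs
lp2 = np.log(1.219) + np.log(1.126) * supernatural   # Protestant vs. Catholic
lp3 = np.log(0.619) + np.log(1.218) * supernatural   # Evangelical vs. Catholic
denom = 1 + np.exp(lp2) + np.exp(lp3)
probs = np.column_stack([1 / denom, np.exp(lp2) / denom, np.exp(lp3) / denom])
religion = np.array([rng.choice([1, 2, 3], p=row) for row in probs])

X = sm.add_constant(pd.DataFrame({"supernatural": supernatural}))
fit = sm.MNLogit(religion, X).fit(disp=False)
print(fit.summary())        # one equation per non-reference category
print(np.exp(fit.params))   # odds ratios vs. Catholic (the lowest-coded category)
```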