– Logistic regression is an alternative to multiple regression
– Used to predict an outcome variable that is a categorical dichotomy from a set of categorical or continuous predictor variables
– Used because a dichotomous categorical outcome variable violates the assumption of linearity in ordinary (linear) regression
– Logistic regression emphasizes the probability of a particular outcome for each case
– The dependent variable must be non-metric/categorical (nominal or ordinal scaled)
– The independent variables can be any combination of metric and non-metric variables
– Logistic regression requires fewer assumptions than discriminant analysis
– It does not require assumptions of multivariate normality or homogeneity of variance-covariance matrices

Kinnear, P. R. (2011)

– The outcome variable (Ŷ) is the probability of having one outcome or another, based on the best linear combination of predictors, estimated by maximum likelihood
– The probability of Y is calculated with the following formula:

$$P(Y) = \hat{Y} = \frac{e^{u}}{1 + e^{u}} \qquad \text{(Formula 1)}$$

where
$$u = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$$
e = the base of natural logarithms
p = number of predictor variables

– With one predictor variable, the formula will be:

$$P(Y) = \hat{Y}_i = \frac{e^{b_0 + b_1 X_1}}{1 + e^{b_0 + b_1 X_1}}$$

– With multiple predictor variables (p), the formula will be:

$$P(Y) = \hat{Y}_i = \frac{e^{b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p}}{1 + e^{b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p}}$$

– The resulting value (a probability) ranges between 0 and 1
:: A value close to 0 means Y is very unlikely to occur
:: A value close to 1 means Y is very likely to occur

– Can the outcome be predicted from a set of predictor variables?
– Which predictor variables predict the outcome?
– How strong is the relationship between the outcome and the predictor variables?

Assessing the Model
Assessing the Predictor
Relationship between Predictors - Outcome
Odds Ratio
Classification of Cases

– Use the observed and predicted values of the outcome to assess the fit of the model.
– The statistic used to measure the fit of the model is called the log-likelihood:

$$\text{Log-likelihood} = \sum_{i=1}^{N} \left[ Y_i \ln(\hat{Y}_i) + (1 - Y_i) \ln(1 - \hat{Y}_i) \right] \qquad \text{(Formula 2)}$$

– The log-likelihood is the sum of the log probabilities associated with the predicted and actual outcomes
– This log-likelihood statistic is comparable to the residual sum of squares (SSE) in multiple regression
– The log-likelihood is calculated for two different models (bigger and smaller)
– The two models are compared by computing the difference in their log-likelihoods, which is tested as a chi-square (χ²):

$$\chi^2 = 2\left[ LL(B) - LL(0) \right] \qquad \text{(Formula 3)}$$

– LL(B) is the log-likelihood for the bigger model, which includes all the predictors
– LL(0) is the log-likelihood for the smaller model, which includes only the intercept
– Degrees of freedom (df) = k_B – k_0, where k is the number of parameters in each model

Assessing the Predictor

– Test the null hypothesis that β_i = 0
– Test the individual contribution of each predictor variable using the Wald statistic
– The Wald statistic is comparable to the t-test in multiple regression
– The Wald statistic is the squared ratio of the unstandardized logistic coefficient to its standard error
– The Wald statistic and its corresponding p value (probability level) are part of the SPSS output in the "Variables in the Equation" table

Relationship between Predictors - Outcome

– A number of statistics can be used as measures of association between predictors and outcome
– The measures include:
1. R-statistic
2. Cox and Snell R²
3. Nagelkerke R²
4. Hosmer and Lemeshow's R²

– The R-statistic is comparable to the multiple correlation coefficient
– Formula:

$$R = \pm\sqrt{\frac{\text{Wald} - (2 \times df)}{-2LL(0)}} \qquad \text{(Formula 4)}$$

– The R-statistic ranges between -1 and +1
– A positive value: as the predictor increases, the likelihood of the outcome occurring increases; a negative value means the reverse

– R²_CS (Cox and Snell) is comparable to R² in multiple regression
– The value is displayed in SPSS Logistic Regression
– Formula:

$$R^2_{CS} = 1 - e^{-\frac{2}{n}\left[ LL(B) - LL(0) \right]} \qquad \text{(Formula 5)}$$

– However, the value of R²_CS never reaches its theoretical maximum of 1

– Nagelkerke suggested an amendment to the earlier R²_CS
– The value is displayed in SPSS Logistic Regression
– Formula:

$$R^2_{N} = \frac{R^2_{CS}}{1 - e^{\frac{2\,LL(0)}{n}}} \qquad \text{(Formula 6)}$$

– Formula to calculate R²_L (Hosmer and Lemeshow); the numerator is the model chi-square from Formula 3:

$$R^2_{L} = \frac{\chi^2}{-2LL(0)} = \frac{-2LL(0) - \left(-2LL(B)\right)}{-2LL(0)} \qquad \text{(Formula 7)}$$

Odds Ratio

– The odds ratio is an indicator of the change in odds resulting from a unit change in the predictor
– The odds ratio is the increase (or decrease, if the ratio is less than 1) in the odds of being in one outcome category when the predictor increases by one unit
– It is similar to the b-coefficient but is easier to interpret (it does not involve a logarithmic transformation)
– The odds of an event occurring are defined as the probability of the event occurring divided by the probability of the event not occurring:

$$\text{Odds} = \frac{P(\text{event})}{P(\text{no event})} \qquad \text{(Formula 8)}$$

– The coefficients (b) are the natural logs of the odds ratios, so the odds ratio can be calculated using the following formula:

$$\text{Odds ratio} = e^{b} \qquad \text{(Formula 9)}$$

– Odds ratio > 1: Predictor ↑, probability of outcome occurring ↑
– Odds ratio < 1: Predictor ↑, probability of outcome occurring ↓

Example
Dummy variable: Gender (1 = Male, 0 = Female)
– If odds ratio = 1.25: 1.25 – 1.0 = .25, so Males have 25% higher odds than Females
– If odds ratio = .8: .8 – 1.0 = -.20, so the odds for Males are 20% lower than for Females

Classification of Cases

– One method of assessing the success of a model is to evaluate its ability to predict the outcome correctly
– The cut-off value for classification is .50
– A case is assigned to category "1" if the model predicts an outcome probability greater than .5, i.e.
:: Y = 1 if Ŷ > .5
:: Y = 0 if Ŷ < .5
– SPSS provides:
1. Percentage of category "1" cases correctly classified
2. Percentage of category "0" cases correctly classified
3. Overall percentage

1. Enter
– All variables entered simultaneously
2. Sequential/Hierarchical
– Variables entered in blocks
– Blocks should be based on past research or the theory being tested
3. Stepwise
– Variables entered on the basis of statistical criteria (relative contribution to predicting the outcome)
– Should be employed only for exploratory analysis

(From Tabachnick)
The following data set includes three variables:
1. FALL: 0 - Not falling, 1 - Falling
2. DIFFICULTY: Rated on a 1 to 3 scale
3. SEASON: 1 - autumn, 2 - winter, 3 - spring

Data set:

Fall   Difficulty   Season
1      3            1
1      1            1
0      1            3
1      2            3
1      3            2
0      2            2
0      1            2
1      3            1
1      2            3
1      2            1
0      2            2
0      2            3
1      3            2
1      2            2
0      3            1

Data: Logistic Regression Tabachnick SKI

1. Model Fit

$$\text{Prob}(\text{Fall}) = \hat{Y}_i = \frac{e^{-1.776 + (1.010)(\text{DIFF}) + (0.927)(\text{SEAS1}) + (-0.418)(\text{SEAS2})}}{1 + e^{-1.776 + (1.010)(\text{DIFF}) + (0.927)(\text{SEAS1}) + (-0.418)(\text{SEAS2})}} \qquad \text{(Formula 1)}$$

$$\text{Log-likelihood} = \sum_{i=1}^{N} \left[ Y_i \ln(\hat{Y}_i) + (1 - Y_i) \ln(1 - \hat{Y}_i) \right] \qquad \text{(Formula 2)}$$

$$\chi^2 = 2\left[ LL(B) - LL(0) \right] \qquad \text{(Formula 3)}$$

Excel Computation

2. Significance of Predictors and Odds Ratio

$$\text{Wald} = \left( \frac{b}{SE(b)} \right)^2$$

Excel Computation

3. Relationship between Predictors and Outcome

$$R^2_{CS} = 1 - e^{-\frac{2}{n}\left[ LL(B) - LL(0) \right]} \qquad \text{(Formula 5)}$$

$$R^2_{N} = \frac{R^2_{CS}}{1 - e^{\frac{2\,LL(0)}{n}}} \qquad \text{(Formula 6)}$$

4. Classification of Cases

Table 1: Logistic Regression Analysis of Falling on a Ski Run as a Function of Difficulty of Run and Season

Variables     B        Wald Test   p      Odds ratio
Constant      -1.776   0.88        .347   .169
Difficulty    1.010    1.27        .259   2.747
Season(1)     .927     0.34        .560   2.527
Season(2)     -.418    0.09        .763   .658

Note: R² = .165 (Cox & Snell), .223 (Nagelkerke). Model χ²(3) = 2.710, p = .439
May also want to report the CI for each odds ratio

Data: Logistic Regression PERFORM (Adapted from Andy Field)

Variable    Label                      Value
PERFORM     Performance in Subject     1 No, 2 Yes
INTEREST    Interest in the Subject    1 No, 2 Yes
AGE         Age in years

Andy Field (2005). Discovering Statistics Using SPSS. 2nd Edition. London: Sage Publications Ltd

Table 2: Logistic Regression Analysis of Performance as a Function of Interest and Age

Variables     B        Wald Test   p      Odds ratio
Constant
Interest
Age

Note: R² = ___ (Cox & Snell), ___ (Nagelkerke). Model χ²(_) = ___, p = ___

(From Tabachnick)

Variable      Label                                Value
WorkStatus    Work status                          1 Working, 2 Housewives
Children      Presence of children                 1 No, 2 Yes
Control       Locus of control
AttMar        Attitudes toward current marriage
AttHouse      Attitudes toward housework
AttRole       Attitudes toward role of women
Age           Age group
Educ          Years of education

Data: Logistic Regression Tabachnick WORK STATUS

Table 3: Logistic Regression Analysis of Work Status as a Function of Attitudinal Variables

Variables                          B        Wald Test   p      Odds ratio
Constant
Locus of control
Attitude towards marital status
Attitude towards role of women
Attitude towards housework

Note: R² = ___ (Cox & Snell), ___ (Nagelkerke). Model χ²(_) = ___, p = ___

Table : Logistic Regression Analysis of Work Status as a Function of Attitudinal Variables and Children

Variables                          B        Wald Test   p      Odds ratio
Constant
Presence of children
Locus of control
Attitude towards marital status
Attitude towards role of women
Attitude towards housework

Note: R² = ___ (Cox & Snell), ___ (Nagelkerke). Model χ²(_) = ___, p = ___
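
The hand ("Excel") computations for the ski example can also be traced in code. The Python sketch below is not part of the original slides. It takes the 15 ski cases and the B coefficients reported in Table 1 as given (nothing is re-estimated) and evaluates Formulas 1 to 3, 5 to 7 and 9, the Wald statistic, and the .50 classification rule. Three assumptions are made here: Season(1) and Season(2) are indicator dummies with spring as the reference category; Formula 7 is read as the model chi-square divided by -2LL(0); and standard errors are taken from the inverse of the information matrix X'WX evaluated at the reported coefficients. All variable names are illustrative.

```python
import numpy as np

# Ski data from the slides: FALL (0/1), DIFFICULTY (1-3), SEASON (1-3).
fall = np.array([1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0])
difficulty = np.array([3, 1, 1, 2, 3, 2, 1, 3, 2, 2, 2, 2, 3, 2, 3])
season = np.array([1, 1, 3, 3, 2, 2, 2, 1, 3, 1, 2, 3, 2, 2, 1])
n = len(fall)

# Assumption: indicator coding with spring (3) as the reference category.
X = np.column_stack([
    np.ones(n),                   # constant
    difficulty,                   # DIFF
    (season == 1).astype(float),  # SEAS1
    (season == 2).astype(float),  # SEAS2
])

# Coefficients copied from Table 1 (rounded; not re-estimated here).
b = np.array([-1.776, 1.010, 0.927, -0.418])


def log_likelihood(y, p):
    """Formula 2: sum of Y*ln(p) + (1 - Y)*ln(1 - p) over all cases."""
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


u = X @ b
p_hat = np.exp(u) / (1 + np.exp(u))            # Formula 1
ll_b = log_likelihood(fall, p_hat)             # bigger model
# The intercept-only model predicts the observed proportion of falls.
ll_0 = log_likelihood(fall, np.full(n, fall.mean()))

chi2 = 2 * (ll_b - ll_0)                       # Formula 3
r2_cs = 1 - np.exp(-(2 / n) * (ll_b - ll_0))   # Formula 5
r2_n = r2_cs / (1 - np.exp(2 * ll_0 / n))      # Formula 6
r2_l = chi2 / (-2 * ll_0)                      # Formula 7 (assumed reading)

# Assumption: SE(b) from the inverse information matrix (X'WX)^-1,
# with W = diag(p_hat * (1 - p_hat)).
w = p_hat * (1 - p_hat)
se = np.sqrt(np.diag(np.linalg.inv(X.T @ (X * w[:, None]))))
wald = (b / se) ** 2                           # squared ratio b / SE(b)

odds_ratio = np.exp(b)                         # Formula 9

# Classification with the .50 cut-off.
predicted = (p_hat > 0.5).astype(int)
overall_correct = float(np.mean(predicted == fall))

print("LL(B), LL(0):", round(ll_b, 3), round(ll_0, 3))
print("Model chi-square:", round(chi2, 3))
print("Cox & Snell, Nagelkerke, Hosmer-Lemeshow R2:",
      round(r2_cs, 3), round(r2_n, 3), round(r2_l, 3))
print("Wald:", np.round(wald, 2))
print("Odds ratios:", np.round(odds_ratio, 3))
print("Overall % correct:", round(100 * overall_correct, 1))
```

With these inputs the model chi-square comes out at about 2.71 and the Cox and Snell R² at about .165, in line with the note under Table 1, and 12 of the 15 cases are classified correctly. Small differences in the last decimal are expected because the coefficients are rounded to three places.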