Advanced Models and Methods in Behavioral Research
• Chris Snijders
• c.c.p.snijders@gmail.com
• 3 ects
• http://www.chrissnijders.com/ammbr (= study guide)
• literature: Field book + separate course material
• laptop exam (+ assignments)
ToDo (if not done yet): enroll in 0a611

The methods package
• MMBR (6 ects)
  - Blumberg: questions, reliability, validity, research design
  - Field: SPSS: factor analysis, multiple regression, ANcOVA, sample size, etc.
• AMMBR (3 ects)
  - Field (1 chapter): logistic regression
  - literature through website: conjoint analysis, multi-level regression

Models and methods: topics
• t-test, Cronbach's alpha, etc.
• multiple regression, analysis of (co)variance and factor analysis
• logistic regression
• conjoint analysis / repeated measures
  - Stata next to SPSS
  - “Finding new questions”
  - Some data collection
In the background: “now you should be able to deal with data on your own”

Methods in brief (1)
• Logistic regression: target Y, predictors Xi. Y is a binary variable (0/1).
  - Why not just multiple regression?
  - Interpretation is more difficult
  - Goodness of fit is non-standard
  - ... (and it is a chapter in Field)

Methods in brief (2)
• Conjoint analysis
  Example offer:
  - 10 Euro p/m
  - 2 years fixed
  - free phone
  - ...
  How attractive is this offer to you?
  Underlying assumption: for each user, the "utility" of an offer can be written as
  $U(x_1, x_2, \ldots, x_n) = c_0 + c_1 x_1 + \cdots + c_n x_n$

Conjoint analysis as an “in between method”
Between: Which phone do you like and why? What would your favorite phone be?
And: Let’s keep track of what people buy.
We have: conjoint analysis.

Local Master Thesis example: Fiber to the home (Roel Schuring)
  Speed:          really fast
  Price:          sort of high
  Installation:   free!
  Your neighbors: are in!
How attractive is this to you?

Coming up with new ideas (3)
“More research is necessary.” But on what?
YOU: come up with sensible new ideas, given previous research

Stata next to SPSS
• It’s just better (faster, better written, more possibilities, better programmable …)
• Multi-level regression is much easier than in SPSS
• It’s good to be exposed to more than just a single statistics package (your knowledge should not be based on “where to click” arguments)
• More stable
• BTW: supports OSX as well… (anybody?)

Every advantage has a disadvantage
• Output less “polished”
• It takes some extra work to get you started
• The Logistic Regression chapter in the Field book uses SPSS (but still readable for the larger part)
• (and it’s not campus software, but subfaculty software)
• Installation …

If on Windows, try downloading
• www.chrissnijders.com/ammbr/TUeStata12-zip.exe

Logistic Regression Analysis
That is: your Y variable is 0/1. Now what?

The main points
1. Why do we have to know and sometimes use logistic regression?
2. What is the underlying model? What is maximum likelihood estimation?
3. Logistics of logistic regression analysis
   1. Estimate coefficients
   2. Assess model fit
   3. Interpret coefficients
   4. Check residuals
4. An SPSS example

Suppose we have 100 observations with information about an individual's age and whether or not this individual had some kind of heart disease (CHD)

  ID    age   CHD
  1     20    0
  2     23    0
  3     24    0
  4     25    1
  …
  98    64    0
  99    65    1
  100   69    1

A graphic representation of the data
[Figure: CHD (vertical axis) plotted against Age (horizontal axis)]

Let’s just try regression analysis
pr(CHD|age) = -.54 + .022*Age

... linear regression is not a suitable model for probabilities
pr(CHD|age) = -.54 + .0218107*Age

[Figure: for 8 age groups, the probability (proportion) of having a heart disease]
A nonlinear model is probably better here. Something like this:

This is the logistic regression model
$\Pr(Y \mid X) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$

Predicted probabilities are always between 0 and 1
In $\Pr(Y \mid X) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$, the exponent $b_0 + b_1 X_1$ is similar to classic regression analysis.

Side note: this is similar to MMBR …
Suppose Y is a percentage (so between 0 and 1). Then consider
$Y = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$
…which will ensure that the estimated Y will vary between 0 and 1, and after some rearranging this is the same as
$\ln\frac{Y}{1 - Y} = b_0 + b_1 X_1$

… (continued)
And one “solution” might be:
- Change all Y values that are 0 to 0.001
- Change all Y values that are 1 to 0.999
Now run regression on log(Y/(1-Y)) …
… but that really is sort of higgledy-piggledy …

Logistics of logistic regression
1. How do we estimate the coefficients?
2. How do we assess model fit?
3. How do we interpret coefficients?
4. How do we check regression assumptions?
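To make the contrast between the straight line and the logistic curve concrete, here is a minimal Python sketch. It uses the linear fit from the slides, pr(CHD|age) = -.54 + .022*Age, and a logistic curve with b0 = -5.3 and b1 = .11, the rounded values reported in the SPSS output further on in these slides. The function names and the ages tried are assumptions made only for this illustration.

```python
import math

def linear_prob(age, a=-0.54, b=0.022):
    """Linear 'probability' model from the slides; nothing keeps it inside 0..1."""
    return a + b * age

def logistic_prob(age, b0=-5.3, b1=0.11):
    """Logistic model 1 / (1 + exp(-(b0 + b1*age))); always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))

def logit(p):
    """Log odds ln(p / (1 - p)); applying it to logistic_prob gives back b0 + b1*age."""
    return math.log(p / (1.0 - p))

if __name__ == "__main__":
    # Print age, linear prediction and logistic prediction side by side.
    for age in (20, 45, 70, 80):
        print(age, round(linear_prob(age), 3), round(logistic_prob(age), 3))
```

For young and old ages the linear model predicts "probabilities" below 0 or above 1, while the logistic curve stays inside the unit interval; taking the logit of the logistic prediction recovers the linear expression b0 + b1*age.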
Kinds of estimation in regression
• Ordinary Least Squares (we fit a line through a cloud of dots)
• Maximum likelihood (we find the parameter values under which the observed data are most likely)
We never bothered to consider maximum likelihood in standard multiple regression, because you can show that there the two methods lead to exactly the same estimator (in MR, that is; normally they differ).
Actually, maximum likelihood has superior statistical properties (efficiency, consistency, invariance, …)

Maximum likelihood estimation
• The method of maximum likelihood yields values for the unknown parameters that maximize the probability of obtaining the observed set of data
  $\Pr(Y \mid X) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$
  The unknown parameters are $b_0$ and $b_1$.
• First we have to construct the “likelihood function” (the probability of obtaining the observed set of data). Assuming that observations are independent:
  Likelihood = pr(obs1) * pr(obs2) * pr(obs3) * … * pr(obsn)

Log-likelihood
• For technical reasons the likelihood is transformed into the log-likelihood (then you just maximize the sum of the logged probabilities)
  LL = ln[pr(obs1)] + ln[pr(obs2)] + ln[pr(obs3)] + … + ln[pr(obsn)]

Some subtleties
• In OLS, we did not need stochastic assumptions to be able to calculate a best-fitting line (we need them only for the estimates of the confidence intervals).
  With maximum likelihood estimation we need stochastic assumptions from the start (and let us not be bothered at this point by how the confidence intervals are calculated in maximum likelihood)

Note: optimizing log-likelihoods is difficult
• It’s iterative (“searching the landscape”)
  - it might not converge
  - it might converge to the wrong answer

Nasty implication: extreme cases should be left out (some handwaving here)

SPSS output. Estimation of coefficients: SPSS results
$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + .11 X_1)}}$

Variables in the Equation
                       B        S.E.    Wald     df   Sig.   Exp(B)
Step 1(a)   age        0.111    0.024   21.254   1    .000   1.117
            Constant   -5.309   1.134   21.935   1    .000   0.005
a. Variable(s) entered on step 1: age.

This function fits best: other values of b0 and b1 give worse results (that is, other values have a smaller likelihood value)

Illustration 1: suppose we chose .05X instead of .11X
$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + .05 X_1)}}$

Illustration 2: suppose we chose .40X instead of .11X
$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + .40 X_1)}}$

Logistics of logistic regression
• Estimate the coefficients (and their conf.int.)
• Assess model fit
  - Between model comparisons
  - Pseudo R2 (similar to multiple regression)
  - Predictive accuracy
• Interpret coefficients
• Check regression assumptions

Model fit: comparisons between models
The log-likelihood ratio test statistic can be used to test the fit of a model
$\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$
Here New is the full model and baseline is the reduced model. The test statistic has a chi-square distribution, with degrees of freedom equal to the number of parameters added in the full model.
NOTE: This is sort of similar to the variance decomposition tables you see in MR!
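The claim that other slopes "have a smaller likelihood value" can be checked by hand with a small Python sketch. It is only an illustration under a strong assumption: it uses just the seven (age, CHD) rows that are visible on the data slide (IDs 1-4 and 98-100), not the full set of 100 observations, so the log-likelihoods it prints will not match the SPSS output. The function and variable names are made up for this example.

```python
import math

# The seven (age, CHD) rows visible on the data slide (IDs 1-4 and 98-100).
# ASSUMPTION: the other 93 rows are not shown, so this is a toy subset only.
VISIBLE_ROWS = [(20, 0), (23, 0), (24, 0), (25, 1), (64, 0), (65, 1), (69, 1)]

def predicted_prob(age, b0, b1):
    """Logistic model: Pr(CHD = 1 | age) = 1 / (1 + exp(-(b0 + b1*age)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))

def log_likelihood(rows, b0, b1):
    """LL = sum of ln pr(obs_i): ln(p) if CHD = 1, ln(1 - p) if CHD = 0."""
    total = 0.0
    for age, chd in rows:
        p = predicted_prob(age, b0, b1)
        total += math.log(p) if chd == 1 else math.log(1.0 - p)
    return total

if __name__ == "__main__":
    # Compare the slope from the SPSS output (.11) with the two illustration slopes.
    for b1 in (0.05, 0.11, 0.40):
        print(b1, round(log_likelihood(VISIBLE_ROWS, -5.3, b1), 2))
```

Even on this small subset, the log-likelihood with slope .11 is higher (less negative) than with .05 or .40. A real maximum likelihood routine searches over b0 and b1 together instead of trying a few values by hand.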
Between model comparisons: the likelihood ratio test
$\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$
full model:    $P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$
reduced model: $P(Y) = \frac{1}{1 + e^{-b_0}}$
The model including only an intercept is often called the empty model. SPSS uses this model as a default.

Between model comparison: SPSS output
$\chi^2 = [-2LL(\text{baseline})] - [-2LL(\text{New})]$

Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     29.310       1    .000
         Block    29.310       1    .000
         Model    29.310       1    .000
This is the test statistic, and its associated significance.

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      107.353(a)          .254                   .341
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Overall model fit: pseudo R2
$R^2_{LOGIT} = 1 - \frac{-2LL(\text{Model})}{-2LL(\text{Empty})}$
- LL(Model): log-likelihood of the model that you want to test
- LL(Empty): log-likelihood of the model before any predictors were entered
Just like in multiple regression, pseudo R2 ranges from 0.0 to 1.0
- Cox and Snell
  • cannot theoretically reach 1
- Nagelkerke
  • adjusted so that it can reach 1
NOTE: R2 in logistic regression tends to be (even) smaller than in multiple regression

Overall model fit: Classification table

Classification Table(a)
                      Predicted chd
Observed chd          0      1      Percentage correct
0                     45     12     78.9
1                     14     29     67.4
Overall percentage                  74.0
a. The cut value is .500

• We predict 74% correctly
• 14 cases had a CHD while according to our model this shouldn't have happened
At the cut value of .500, 12 cases didn’t have a CHD while according to our model this should have happened.

Logistics of logistic regression
• Estimate the coefficients
• Assess model fit
• Interpret coefficients
  - Direction
  - Significance
  - Magnitude
• Check regression assumptions

The Odds Ratio
We had:
$p(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \cdots + b_n X_n)}} = \frac{e^{b_0 + b_1 X_1 + \cdots + b_n X_n}}{1 + e^{b_0 + b_1 X_1 + \cdots + b_n X_n}}$
And after some rearranging we can get
$\frac{p(Y)}{1 - p(Y)} = e^{b_0 + b_1 X_1 + \cdots + b_n X_n}$

Magnitude of association: Percentage change in odds
$\text{Odds}_i = \frac{\text{prob}_{\text{event}}}{1 - \text{prob}_{\text{event}}}$

  Probability   Odds
  25%           0.33
  50%           1
  75%           3

Interpreting coefficients: direction
• original b reflects changes in logit: b > 0 implies a positive relationship
  $\text{logit} = \ln\frac{p(y)}{1 - p(y)} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$
• exponentiated b reflects the “changes in odds”: exp(b) > 1 implies a positive relationship

Interpreting coefficients: magnitude
• The slope coefficient (b) is interpreted as the rate of change in the "log odds" as X changes … not very useful.
  $\text{logit} = \ln\frac{p(y)}{1 - p(y)} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$
• exp(b) is the effect of the independent variable on the odds, more useful for calculating the size of an effect
  $\text{Odds} = \frac{p(y)}{1 - p(y)} = e^{b_0} \, e^{b_1 x_1} \, e^{b_2 x_2} \cdots e^{b_n x_n}$

Magnitude of association

Variables in the Equation
                       B        S.E.    Wald     df   Sig.   Exp(B)
Step 1(a)   age        0.111    0.024   21.254   1    .000   1.117
            Constant   -5.309   1.134   21.935   1    .000   0.005
a. Variable(s) entered on step 1: age.

• For the age variable:
  - Percentage change in odds = (exponentiated coefficient - 1) * 100 = (1.117 - 1) * 100 = 12%, or “the odds times 1.117”
  - A one unit increase in age will result in a 12% increase in the odds that the person will have a CHD
  - So if a person is one year older, the odds that (s)he will have CHD are 12% higher

Another way to get an idea of the size of effects: calculating predicted probabilities
$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + .11 X_1)}}$
For somebody of 20 years old, the predicted probability is .04
For somebody of 70 years old, the predicted probability is .92

But this gets more complicated when you have more than a single X-variable
$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + .11 X_1 + 1 \cdot X_2)}}$
(see blackboard)
Conclusion: if you consider the effect of a variable on the predicted probability, the size of the effect of X1 depends on the value of X2! (yuck!)

Testing significance of coefficients
• In linear regression analysis this statistic is used to test significance:
  $t = \frac{b}{SE_b}$ (estimate divided by the standard error of the estimate; t-distribution)
• In logistic regression something similar exists:
  $\text{Wald} = \frac{b}{SE_b}$
• However, when b is large, the standard error tends to become inflated, hence the statistic is underestimated (Type II errors are more likely)
Note: This is not the Wald statistic SPSS presents!!!

Interpreting coefficients: significance
• SPSS presents
  $\text{Wald} = \frac{b^2}{SE_b^2}$
• While Andy Field thinks SPSS presents this (at least in the 2nd version of the book):
  $\text{Wald} = \frac{b}{SE_b}$

Logistics of logistic regression
• Estimate the coefficients
• Assess model fit
• Interpret coefficients
• Check regression assumptions

Checking assumptions
• Influential data points & Residuals
  - Follow Samantha's tips
• Hosmer & Lemeshow
  - Divides sample in subgroups
  - Checks whether there are differences between observed and predicted between subgroups
  - Test should not be significant; if it is, that indicates lack of fit

Hosmer & Lemeshow
The test divides the sample in subgroups and checks whether the difference between observed and predicted is about equal in these groups.
The test should not be significant (indicating no difference).

Examining residuals in logistic regression
1. Isolate points for which the model fits poorly
2.
Isolate influential data points

Residual statistics: Field’s rules of thumb

Logistic regression
• Y = 0/1
• Multiple regression (or ANcOVA) is not right
• You consider either the odds or the log(odds)
• It is estimated through “maximum likelihood”
• Interpretation is a bit more complicated than normal
• Assumption testing is a bit more concrete than in multiple regression

Make sure to
• Enroll in studyweb (0a611)
• Read the Field chapter on logistic regression
• Go through the slides as well
• Bring your laptop next time: we’ll go through a logistic regression in Stata

Advanced Methods and Models in Behavioral Research, 2008/2009

Illustration with SPSS (without the outlier part)
• Penalty kicks data, variables:
  - Scored: outcome variable (0 = penalty missed, 1 = penalty scored)
  - Pswq: degree to which a player worries
  - Previous: percentage of penalties scored by a particular player in their career

SPSS OUTPUT: Logistic Regression

Case Processing Summary
Unweighted Cases(a)                          N    Percent
Selected Cases     Included in Analysis      75   100.0
                   Missing Cases             0    0.0
                   Total                     75   100.0
Unselected Cases                             0    0.0
Total                                        75   100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
Missed Penalty    0
Scored Penalty    1

These tables tell you something about the number of observations and missings.

Block 0: Beginning Block
This table is based on the empty model, i.e. only the constant in the model:
$P(Y) = \frac{1}{1 + e^{-b_0}}$

Classification Table(a,b)
                                    Predicted Result of Penalty Kick
Step 0   Observed Result            Missed Penalty   Scored Penalty   Percentage Correct
         Missed Penalty             0                35               0.0
         Scored Penalty             0                40               100.0
         Overall Percentage                                           53.3
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                     B       S.E.    Wald    df   Sig.   Exp(B)
Step 0   Constant    0.134   0.231   0.333   1    .564   1.143

Variables not in the Equation
                                 Score    df   Sig.
Step 0   Variables   previous    34.109   1    .000
                     pswq        34.193   1    .000
         Overall Statistics      41.558   2    .000
These variables will be entered in the model later on.

Block is useful to check significance of individual coefficients, see Field.

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     54.977       2    .000
         Block    54.977       2    .000
         Model    54.977       2    .000
This is the test statistic $\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      48.662(a)           .520                   .694
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.
- The -2 Log likelihood belongs to the new model; dividing it by -2 gives LL(New).
- Note: Nagelkerke is larger than Cox & Snell.

Block 1: Method = Enter (Continued)

Classification Table(a)
                                    Predicted Result of Penalty Kick
Step 1   Observed Result            Missed Penalty   Scored Penalty   Percentage Correct
         Missed Penalty             30               5                85.7
         Scored Penalty             7                33               82.5
         Overall Percentage                                           84.0
a. The cut value is .500
Predictive accuracy has improved (was 53%).

Variables in the Equation
                       B        S.E.    Wald    df   Sig.   Exp(B)
Step 1(a)   previous   0.065    0.022   8.609   1    .003   1.067
            pswq       -0.230   0.080   8.309   1    .004   0.794
            Constant   1.280    1.670   0.588   1    .443   3.598
a. Variable(s) entered on step 1: previous, pswq.
- B: estimates
- S.E.: standard error of the estimates
- Sig.: significance based on the Wald statistic
- Exp(B): change in odds

How is the classification table constructed?
In the Step 1 classification table above, the off-diagonal cells (5 and 7) are the numbers of cases not predicted correctly.
The predicted probability follows from the estimated coefficients:
$\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}$
Each case is then classified as 1 (scored) if its predicted probability is above the cut value of .500 and as 0 (missed) otherwise:

  pswq   previous   scored   predicted prob.   predicted
  18     56         1        .68               1
  17     35         1        .41               0
  20     45         0        .40               0
  10     42         0        .85               1

Counting the cases by observed and predicted outcome gives the classification table.
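The construction of the classification table can be retraced with a short Python sketch. It uses the rounded coefficients from the Step 1 "Variables in the Equation" table (1.28, 0.065, -0.230) and the four example cases listed on the slide. The function names are assumptions made for this illustration, as is the choice to classify a probability of exactly .5 as "scored". Because the coefficients are rounded, the probabilities can differ from the slide values in the second decimal.

```python
import math

# Rounded coefficients from the Step 1 "Variables in the Equation" table.
B_CONSTANT, B_PREVIOUS, B_PSWQ = 1.28, 0.065, -0.230
CUT_VALUE = 0.5  # cut value reported under the classification table

# The four example cases from the slide: (pswq, previous, scored).
CASES = [(18, 56, 1), (17, 35, 1), (20, 45, 0), (10, 42, 0)]

def predicted_prob(pswq, previous):
    """P(scored) = 1 / (1 + exp(-(b0 + b_previous*previous + b_pswq*pswq)))."""
    z = B_CONSTANT + B_PREVIOUS * previous + B_PSWQ * pswq
    return 1.0 / (1.0 + math.exp(-z))

def classify(prob, cut=CUT_VALUE):
    """Predict 1 (scored) at or above the cut value, otherwise 0 (missed).
    ASSUMPTION: ties at exactly the cut value are counted as 1."""
    return 1 if prob >= cut else 0

def classification_table(cases):
    """Return counts[observed][predicted] as a 2x2 nested list."""
    counts = [[0, 0], [0, 0]]
    for pswq, previous, scored in cases:
        counts[scored][classify(predicted_prob(pswq, previous))] += 1
    return counts

def percent_correct(counts):
    """Overall percentage correct: diagonal cells divided by all cases."""
    total = sum(sum(row) for row in counts)
    return 100.0 * (counts[0][0] + counts[1][1]) / total

if __name__ == "__main__":
    for pswq, previous, scored in CASES:
        p = predicted_prob(pswq, previous)
        print(pswq, previous, scored, round(p, 2), classify(p))
    print(classification_table(CASES))
```

Applying `percent_correct` to the full table from the SPSS output, `[[30, 5], [7, 33]]`, reproduces the overall percentage of 84.0 reported there.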