Multiple Regression 1
Sociology 8811
Copyright © 2007 by Evan Schofer. Do not copy or distribute without permission.

Announcements
• None!

The Multiple Regression Model
• Regression model for K independent variables:
Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_K X_{Ki} + e_i

Multiple Regression Slopes
• Let's look more closely at the slope formulas:
b_1 = \frac{r_{YX_1} - r_{YX_2} r_{X_1 X_2}}{1 - r_{X_1 X_2}^2} \cdot \frac{s_Y}{s_{X_1}}   versus   b_{YX} = r_{YX} \frac{s_Y}{s_X}
• What happens to b1 if X1 and X2 are totally uncorrelated?
• Answer: The formula reduces to the bivariate slope
• What if X1 and X2 are correlated with each other AND X2 is more correlated with Y than X1?
• Answer: b1 gets smaller (compared to the bivariate slope)

Regression Slopes
• So, if two variables (X1, X2) are correlated and both predict Y:
• The X variable that is more correlated with Y will have a larger slope in the multivariate regression
– The slope of the less-correlated variable will shrink
• Thus, the slope for each variable is adjusted for how well the other variable predicts Y
– It is the slope "controlling" for the other variables.

Multiple Regression Slopes
• One last thing to keep in mind…
b_1 = \frac{r_{YX_1} - r_{YX_2} r_{X_1 X_2}}{1 - r_{X_1 X_2}^2} \cdot \frac{s_Y}{s_{X_1}}   versus   b_{YX} = r_{YX} \frac{s_Y}{s_X}
• What happens to b1 if X1 and X2 are almost perfectly correlated?
• Answer: The denominator approaches zero
• The slope "blows up", approaching infinity
• Highly correlated independent variables can cause trouble for regression models… watch out

Interpreting Results
• (Over)simplified rules for interpretation
– Assumes good sample, measures, models, etc.
• Multivariate regression with two variables: A, B
• If the slopes of A and B are the same as in the bivariate models, then each has an independent effect
• If A remains large and B shrinks to zero, we typically conclude that the effect of B was spurious, or operates through A
• If both A and B shrink a little, each has an effect, but some overlap or mediation is occurring

Interpreting Multivariate Results
• Things to watch out for:
• 1. Remember: Correlation is not causation
– The ability to "control" for many variables can help detect spurious relationships… but it isn't perfect.
– Be aware that other (omitted) variables may be affecting your model. Don't over-interpret results.
• 2. Reverse causality
– Many sociological processes involve bi-directional causality. Regression slopes (and correlations) do not identify which variable "causes" the other.
• Ex: self-esteem and test scores.

Standardized Regression Coefficients
• Regression slopes reflect the units of the independent variables
• Question: How do you compare how "strong" the effects of two variables are if they have totally different units?
• Example: Education, family wealth, job prestige
– Education measured in years, b = 2.5
– Family wealth measured on a 1–5 scale, b = .18
– Which is a "bigger" effect? The units aren't comparable!
• Answer: Create "standardized" coefficients

Standardized Regression Coefficients
• Standardized coefficients
– Also called "Betas" or "Beta weights"
– Symbol: the Greek letter beta, often written with an asterisk: b*
– Equivalent to Z-scoring (standardizing) all variables (dependent and independent) before doing the regression
• Formula of the coefficient for Xj:
b^*_j = b_j \frac{s_{X_j}}{s_Y}
• Result: The unit is standard deviations
• Betas indicate the effect of a 1 standard deviation change in Xj on Y

Standardized Regression Coefficients
• Ex: Education, family income, and job prestige:
Coefficients (Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE, 1970)
                                     B       Std. Error   Beta    t        Sig.
(Constant)                           8.977   1.629                5.512    .000
HIGHEST YEAR OF SCHOOL COMPLETED     2.487   .111         .520    22.403   .000
RS FAMILY INCOME WHEN 16 YRS OLD     .178    .394         .011    .453     .651
• An increase of 1 standard deviation in education results in a .52 standard deviation increase in job prestige
• What is the interpretation of the "family income" beta?
• Betas give you a sense of which variables "matter most" (a sketch of computing betas follows)
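The lecture works in SPSS; purely as an illustration, here is a minimal Python/statsmodels sketch of the beta computation. The DataFrame df and the column names (prestige, educ, incom16) are hypothetical stand-ins for GSS-style data, not the lecture's actual file.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data frame with GSS-style columns (not the lecture's actual file):
# df = pd.read_csv("gss_subset.csv")  # columns: prestige, educ, incom16

def standardized_betas(df, y, xs):
    """Fit OLS on raw variables, then convert slopes to beta weights: b* = b * sX / sY."""
    X = sm.add_constant(df[xs])
    model = sm.OLS(df[y], X, missing="drop").fit()
    betas = {x: model.params[x] * df[x].std() / df[y].std() for x in xs}
    return model, betas

def standardized_betas_via_zscores(df, y, xs):
    """Equivalent route: z-score everything first; the raw slopes are then the betas."""
    cols = [y] + xs
    z = (df[cols] - df[cols].mean()) / df[cols].std()
    return sm.OLS(z[y], sm.add_constant(z[xs]), missing="drop").fit().params[xs]

# Usage (assuming df exists):
# model, betas = standardized_betas(df, "prestige", ["educ", "incom16"])
# print(model.params)  # unstandardized b's (compare ~2.49 for education in the slide)
# print(betas)         # beta weights (compare ~.52 for education in the slide)
```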
R-Square in Multiple Regression
• Multivariate R-square is much like the bivariate version:
R^2 = \frac{SS_{REGRESSION}}{SS_{TOTAL}}
• But SS_regression is based on the multivariate regression
• The addition of new variables results in better prediction of Y, less error (e), and a higher R-square.

R-Square in Multiple Regression
• Example:
Model Summary (Predictors: (Constant), INCOM16, EDUC)
R       R Square   Adjusted R Square   Std. Error of the Estimate
.522    .272       .271                12.41
• An R-square of .272 indicates that education and parents' wealth explain 27% of the variance in job prestige
• "Adjusted R-square" is a more conservative, more accurate measure in multiple regression
– Generally, you should report Adjusted R-square.

Dummy Variables
• Question: How can we incorporate nominal variables (e.g., race, gender) into regression?
• Option 1: Analyze each sub-group separately
– Generates a different slope and constant for each group
• Option 2: Dummy variables
– A "dummy" is a dichotomous variable coded to indicate the presence or absence of something
– Absence is coded as zero, presence as 1.

Dummy Variables
• Strategy: Create a separate dummy variable for each nominal category
• Ex: Gender – make female & male variables
– DFEMALE: coded as 1 for all women, zero for men
– DMALE: coded as 1 for all men, zero for women
• Next: Include all but one of the dummy variables in the multiple regression model
• If there are two dummies, include 1; if there are 5 dummies, include 4.

Dummy Variables
• Question: Why can't you include DFEMALE and DMALE in the same regression model?
• Answer: They are perfectly (negatively) correlated: r = -1
– Result: The regression model "blows up"
• For any set of nominal categories, a full set of dummies contains redundant information
– DMALE and DFEMALE contain the same information
– Dropping one removes the redundant information.

Dummy Variables: Interpretation
• Consider the following regression equation:
Y_i = a + b_1 INCOME_i + b_2 DFEMALE_i + e_i
• Question: What if the case is a male?
• Answer: DFEMALE is 0, so the entire term becomes zero.
– Result: Males are modeled using the familiar regression model: a + b1X + e.

Dummy Variables: Interpretation
• Consider the same regression equation:
Y_i = a + b_1 INCOME_i + b_2 DFEMALE_i + e_i
• Question: What if the case is a female?
• Answer: DFEMALE is 1, so b2(1) stays in the equation (and is added to the constant)
– Result: Females are modeled using a different regression line: (a + b2) + b1X + e
– Thus, the coefficient b2 reflects the difference in the constant for women.

Dummy Variables: Interpretation
• Remember, a different constant generates a different line, either higher or lower
– Variable: DFEMALE (women = 1, men = 0)
– A positive coefficient (b) indicates that women are consistently higher than men on the dependent variable
– A negative coefficient indicates that women are lower
• Example: If the DFEMALE coefficient = 1.2:
– "Women are, on average, 1.2 points higher than men."

Dummy Variables: Interpretation
• Visually: Women = blue, Men = red
[Figure: scatterplot of HAPPY (0–10) against INCOME (0–100,000), showing the overall regression line and separate, parallel lines for men and women]
• Note: The lines for men and women have the same slope… but one is higher and the other is lower. The constant differs!
• If women = 1 and men = 0: The constant (a) reflects men only; the dummy coefficient (b) reflects the increase for women (relative to men). A sketch of fitting such a model follows.
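To make the intercept-shift interpretation concrete, here is a minimal Python/statsmodels sketch. The toy values and column names (happy, income, sex, dfemale) are made up purely for illustration; they are not the lecture's data.

```python
import pandas as pd
import statsmodels.api as sm

# Made-up toy data, purely illustrative (not the lecture's file)
df = pd.DataFrame({
    "happy":  [5, 6, 7, 4, 8, 6, 7, 5],
    "income": [20000, 35000, 60000, 15000, 80000, 40000, 55000, 25000],
    "sex":    ["male", "female", "female", "male", "female", "male", "female", "male"],
})

# Code the dummy: absence = 0, presence = 1 (include only DFEMALE, omit DMALE)
df["dfemale"] = (df["sex"] == "female").astype(int)

X = sm.add_constant(df[["income", "dfemale"]])
fit = sm.OLS(df["happy"], X).fit()
print(fit.params)

# Interpretation, following the slides:
#   const            -> constant for men (dfemale = 0)
#   const + dfemale  -> constant for women (same income slope, line shifted up or down)
```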
Dummy Variables
• What if you want to compare more than 2 groups?
• Example: Race
– Coded 1 = white, 2 = black, 3 = other (like the GSS)
• Make 3 dummy variables:
– "DWHITE" is 1 for whites, 0 for everyone else
– "DBLACK" is 1 for African Americans, 0 for everyone else
– "DOTHER" is 1 for "others", 0 for everyone else
• Then, include two of the three variables in the multiple regression model.

Dummy Variables: Interpretation
• Ex: Job prestige
Coefficients (Dependent Variable: PRESTIGE)
              B           Std. Error   Beta    t        Sig.
(Constant)    9.666       1.672                5.780    .000
EDUC          2.476       .111         .517    22.271   .000
INCOM16       6.282E-02   .397         .004    .158     .874
DBLACK        -2.666      1.117        -.055   -2.388   .017
DOTHER        1.114       1.777        .014    .627     .531
• The negative coefficient for DBLACK indicates a lower level of job prestige compared to whites
– The t- and p-values indicate whether the difference is significant.

Dummy Variables: Interpretation
• Comments:
• 1. Dummy coefficients shouldn't be called slopes
– Referring to the "slope" of gender doesn't make sense
– Rather, it is the difference in the constant (or "level")
• 2. The contrast is always with the nominal category that was left out of the equation
– If DFEMALE is included, the contrast is with males
– If DBLACK and DOTHER are included, the coefficients reflect differences in the constant compared to whites.

Interaction Terms
• Question: What if you suspect that a variable has a totally different slope for two different subgroups in your data?
• Example: Income and happiness
– Perhaps men are more materialistic – an extra dollar increases their happiness a lot
– If women are less materialistic, each dollar has a smaller effect on happiness (compared to men)
• The issue isn't that men are "more" or "less" than women
– Rather, the slope of a variable (income) differs across groups

Interaction Terms
• Again, the issue isn't that men are "more" or "less" than women
– Rather, the slope coefficient (for income) differs across groups
• We want to specify a different regression line for each group
– We want lines with different slopes, not parallel lines that are higher or lower.

Interaction Terms
• Visually: Women = blue, Men = red
[Figure: scatterplot of HAPPY (0–10) against INCOME (0–100,000), showing the overall regression line and lines with different slopes for men and women]
• Note: Here, the slope for men and women differs. The effect of income on happiness (X1 on Y) varies with gender (X2). This is called an "interaction effect".

Interaction Terms
• Examples of interaction:
– The effect of education on income may interact with the type of school attended (public vs. private)
• Private schooling has a bigger effect on income
– The effect of aspirations on educational attainment interacts with poverty
• Aspirations matter less if you don't have money to pay for college
• Question: Can you think of examples of two variables that might interact?
• Either from your final project? Or anything else?

Interaction Terms
• Interaction effects: Differences in the relationship (slope) between two variables across categories of a third variable
• Option #1: Analyze each group separately
– Look for a different sized slope in each group
• Option #2: Multiply the two variables of interest (DFEMALE, INCOME) to create a new variable
– Called: DFEMALE*INCOME
– Add that variable to the multiple regression model (see the sketch below).
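A minimal sketch of Option #2, again in hypothetical Python/statsmodels terms; the df, happy, income, and dfemale names carry over from the earlier toy example and are not the lecture's actual variables.

```python
import statsmodels.formula.api as smf

def fit_dummy_interaction(df):
    """Fit happy ~ income + dfemale + dfemale*income on a DataFrame with those
    (hypothetical) columns, building the product term both ways."""
    # By hand: multiply the dummy and the continuous variable
    df = df.assign(dfem_inc=df["dfemale"] * df["income"])
    by_hand = smf.ols("happy ~ income + dfemale + dfem_inc", data=df).fit()
    # Formula shortcut: 'dfemale * income' expands to both main effects plus the product
    via_formula = smf.ols("happy ~ dfemale * income", data=df).fit()
    return by_hand, via_formula

# Usage (with the toy df from the previous sketch):
# by_hand, via_formula = fit_dummy_interaction(df)
# by_hand.params["income"]    -> slope of income for men (dfemale = 0)
# by_hand.params["dfem_inc"]  -> difference in the income slope for women
```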
Interaction Terms
• Consider the following regression equation:
Y_i = a + b_1 INCOME_i + b_2 (DFEM*INC)_i + e_i
• Question: What if the case is male?
• Answer: DFEMALE is 0, so the b2(DFEM*INC) term drops out of the equation
– Result: Males are modeled using the ordinary regression equation: a + b1X + e.

Interaction Terms
• Consider the same regression equation:
Y_i = a + b_1 INCOME_i + b_2 (DFEM*INC)_i + e_i
• Question: What if the case is female?
• Answer: DFEMALE is 1, so b2(DFEM*INC) becomes b2*INCOME, which is added to b1
– Result: Females are modeled using a different regression line: a + (b1 + b2)X + e
– Thus, the coefficient b2 reflects the difference in the slope of INCOME for women.

Interpreting Interaction Terms
• Interpreting interaction terms:
• A positive b for DFEMALE*INCOME indicates that the slope for income is higher for women than for men
– A negative coefficient indicates that the slope is lower
– The size of the coefficient indicates the actual difference in slope
• Example: DFEMALE*INCOME. Observed b's:
– INCOME: b = .5
– DFEMALE*INCOME: b = -.2
• Interpretation: The slope is .5 for men, .3 for women.

Interpreting Interaction Terms
• Example: Interaction of race and education affecting job prestige:
Coefficients (Dependent Variable: PRESTIGE)
              B           Std. Error   Beta    t        Sig.
(Constant)    8.855       1.744                5.076    .000
EDUC          2.541       .118         .531    21.563   .000
INCOM16       6.636E-02   .396         .004    .167     .867
DBLACK        4.293       4.193        .088    1.024    .306
BL_EDUC       -.576       .332         -.149   -1.735   .083
• DBLACK*EDUC (BL_EDUC) has a negative effect (nearly significant). The coefficient of -.576 indicates that the slope of education on job prestige is .576 points lower for Blacks than for non-Blacks.

Continuous Interaction Terms
• Two continuous variables can also interact
• Example: The effect of education and income on happiness
– Perhaps highly educated people are less materialistic
– As education increases, the slope between income and happiness would decrease
• Simply multiply education and income to create the interaction term "EDUCATION*INCOME"
• And add it to the model.

Interpreting Interaction Terms
• How do you interpret continuous variable interactions?
• Example: EDUCATION*INCOME: coefficient = 2.0
• Answer: For each unit change in education, the slope of income vs. happiness increases by 2
– Note: the coefficient is symmetrical: for each unit change in income, the education slope increases by 2
• Dummy interactions effectively estimate 2 slopes: one for each group
• Continuous interactions result in many slopes: each value of education yields a different income slope (and vice versa).

Interpreting Interaction Terms
• Interaction terms alter the interpretation of the "main effect" coefficients
• Including EDUC*INCOME changes the interpretation of EDUC and of INCOME
• See Allison, pp. 166–9
– Specifically, the coefficient for EDUC represents the slope of EDUC when INCOME = 0
• Likewise, INCOME shows the slope when EDUC = 0
– Thus, the main effects are like "baseline" slopes
• And the interaction coefficient shows how the slope grows (or shrinks) for a given unit change in the other variable (see the sketch below).
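The arithmetic behind a continuous-by-continuous interaction can be written out directly. This sketch assumes the same hypothetical Python/statsmodels setup, with made-up column names happy, educ, and income; it is not the lecture's model, just an illustration of how the income slope depends on the level of education.

```python
import statsmodels.formula.api as smf

def income_slope_at(fit, educ_level):
    """For a model happy ~ educ * income, the marginal slope of income is
    b_income + b_interaction * educ (the 'main effect' is the slope at educ = 0)."""
    return fit.params["income"] + fit.params["educ:income"] * educ_level

# Usage (assuming a DataFrame df with hypothetical columns happy, educ, income):
# fit = smf.ols("happy ~ educ * income", data=df).fit()
# income_slope_at(fit, 12)  # income slope for someone with 12 years of education
# income_slope_at(fit, 16)  # income slope at 16 years of education
```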
Dummy Interactions
• It is also possible to construct interaction terms based on two dummy variables
– Instead of a "slope" interaction, dummy interactions show differences in constants
• The constant (not the slope) differs across values of a third variable
– Example: The effect of race on school success varies by gender
• African Americans do less well in school, but the difference is much larger for black males.

Dummy Interactions
• The strategy for a dummy interaction is the same: multiply both variables
– Example: Multiply DBLACK and DMALE to create DBLACK*DMALE
• Then, include all 3 variables in the model
– The effect of DBLACK*DMALE reflects the difference in the constant (level) for black males, compared to white males and black females
• You would observe a negative coefficient, indicating that black males fare worse in school than black females or white males.

Interaction Terms: Remarks
• 1. If you make an interaction, you should also include the component variables in the model:
– A model with DFEMALE*INCOME should also include DFEMALE and INCOME
• There are rare exceptions. But when in doubt, include them
• 2. Sometimes an interaction term is highly correlated with its components
• That can cause problems (multicollinearity – which we'll discuss more soon)

Interaction Terms: Remarks
• 3. Make sure you have enough cases in each group for your interaction terms (see the sketch below)
– Interaction terms involve estimating slopes for subgroups (e.g., black females vs. black males)
• If there are hardly any black females in the dataset, you can have problems
• 4. "Three-way" interactions are also possible!
• An interaction effect that varies across categories of yet another variable
– Ex: The DMALE*DBLACK interaction may vary across class
• They are mainly used in experimental research settings with large sample sizes… but they are possible.
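A last minimal sketch, in the same hypothetical Python/statsmodels terms, combining Remarks 1 and 3: check the subgroup cell sizes before fitting a dummy-by-dummy interaction. The 0/1 columns dblack and dmale and the outcome grades are illustrative names, not variables from the lecture.

```python
import pandas as pd
import statsmodels.formula.api as smf

def check_and_fit_dummy_interaction(df):
    """Check subgroup cell sizes, then fit a dummy-by-dummy interaction.
    Assumes hypothetical 0/1 columns dblack and dmale and an outcome 'grades'."""
    # Remark 3: make sure each combination of the two dummies has enough cases
    print(pd.crosstab(df["dblack"], df["dmale"]))
    # Remark 1: include both component dummies along with the product term
    return smf.ols("grades ~ dblack + dmale + dblack:dmale", data=df).fit()

# Usage (assuming df exists):
# fit = check_and_fit_dummy_interaction(df)
# fit.params["dblack:dmale"]  -> additional shift in the constant for black males,
#                                beyond what the two main effects alone would predict
```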