SPS 580 Lecture 6 Controls Z-P multiple regression interaction

I. LANGUAGE FOR INTERPRETING SLOPES

[Four charts of neighborhood pessimism by household income, one for each combination of X and Y coding.]

DAS: X (0,1) Y (int)   Slope = -.293
   Score on Neighborhood Pessimism Scale
      0 Below median income   .68
      1 Above median income   .39
   Higher income people score, on average, .29 points lower on neighborhood pessimism than lower income people.

DAS: X (int 0,3) Y (int)   Slope = -.152
   [Line chart: Impact of Household Income on Neighborhood Pessimism Score, observed and predicted, by income quarter]
   With each unit increase in income quartile, the neighborhood pessimism score drops by .15 points.

DAS: X (int 0,3) Y (0,1)   Slope = -.091
   Percent pessimistic
      0 Lowest income quarter   48%
      1 Second qtr              36%
      2 Third qtr               30%
      3 Top qtr                 20%
   With each unit increase in income quartile, the percent pessimistic drops by 9 points.

DAS: X (0,1) Y (0,1)   Slope = -.176
   Any negative perception of neighborhood
      0 Below median income   42%
      1 Above median income   25%
   Higher income people are 18% less likely to be pessimistic about their neighborhood than lower income people.

II. INTRODUCING . . . CONTROL VARIABLES

A control variable enters the picture when the theory/idea says there is another factor that explains the X → Y relationship. For example . . .

   The reason higher income people are 18% less likely to be pessimistic about their neighborhood is that higher income people live in places where there is less fear of crime, and fear of crime causes pessimism about the neighborhood.

   Income causes Fear of Crime, which causes Neighborhood Pessimism
   X1 → X2 → Y

If the control variable is measured in the same survey, then there are statistical procedures to find out whether that control variable explains the X → Y relationship.

III. GETTING A CONTROL VARIABLE . . .
Fear of Crime

   crimnbr Amount Of Neighborhood Crime
      1 A Lot            11%
      2 Some             27%
      3 Only A Little    56%
      4 None              6%
      8 Do Not Know       1%
                        100%

   victim1 Likelihood Resp Will Be A Crime Victim
      1 High              7%
      2 Moderate         23%
      3 Low              29%
      4 Near Zero        23%
      5 Zero             15%
      8 Don't Know        3%
                        100%

   [Bar charts of the crimnbr and victim1 distributions]

2 variables available on the data set, asked same year, etc.
Will code each (0,1) and create a scale (0,2).
Need to pick a cut point for each (0,1) coding that results in a NICE scale.

   Is there a policy relevant group? . . . The policy relevant group is usually the extreme end, in this case "a lot" or "none." Coding it this way would result in bad skew. There is not a good reason to do this.

   The (low, high) coding should match the language of the theory . . . Fear causes Pessimism, so 0 = low fear, 1 = high fear.

   The coding of "don't knows" should match the language of the theory: 1 = high fear, 0 = other, DK.

   The (0,1) coding should mean the same thing for each variable in the scale . . . if crimnbr (1) = a lot + some, then victim1 (1) = high + moderate.

   What coding produces minimal skew (larger variance)? (1+2 = 1) (3-8 = 0)

crimscale Fear of Crime Scale . . . PRETTY NICE scale
      0     50%
      1     31%
      2     19%
           100%

And it makes a GORGEOUS dichotomy
   crimscaleDICHOT
      0     50%
      1     50%
           100%

IV. TESTING THE IMPACT OF A CONTROL VARIABLE

A. When you control for a variable, it means you hold it constant.

B. So if you want to look at the causal impact of Income (X1) on Neighborhood Pessimism (Y), controlling for Fear of Crime (X2), it means you need to separate the survey sample into two groups (low fear, high fear) and look at the causal impact of Income (X1) on Neighborhood Pessimism within each of these two groups.

C.
Like so many other things in life, this is pretty easy to do with crosstabs . . .

   CROSSTABS
      Layer 1 = Fear of crime (X2)
      Row variable = income (X1)
      Column variable = pessimism (Y)

   income50pct above or below median * nbhdscaleDICHOT bad vs other * crimscaleDICHOT nbrhd crime + victimization likelihood Crosstabulation
   % within income50pct above or below median

   crimscaleDICHOT   income50pct          .00 all good neutral dk   1.00 any bad perception   Total
   .00 Low fear      .00 below median             73%                       27%               100%
                     1.00 above median            84%                       16%               100%
   1.00 high fear    .00 below median             46%                       54%               100%
                     1.00 above median            62%                       38%               100%

   PQ version (two groups, hold constant the level of fear):

   Three way crosstabulation: Income by Neighborhood Pessimism, Controlling for Fear of Neighborhood

   FEAR           INCOME            Percent pessimism = 1
   0 Low Fear     0 Below median           27%
                  1 Above median           16%
   1 High Fear    0 Below median           54%
                  1 Above median           38%

   Causal impact controlling for Fear
      B(Income → Pessimism) = -11% when Fear = Low
      B(Income → Pessimism) = -17% when Fear = High

V. HOW TO DETERMINE WHETHER THE CONTROL VARIABLE "EXPLAINS" THE ORIGINAL X1 → Y RELATIONSHIP

A. The average conditional difference shows the amount of the X1 → Y relationship that remains when the explanatory variable X2 is controlled

   Three way crosstabulation: Income by Neighborhood Pessimism, Controlling for Fear of Neighborhood

   FEAR           INCOME            Percent pessimism = 1    Conditional difference
   0 Low Fear     0 Below median           27%
                  1 Above median           16%                      -11%
   1 High Fear    0 Below median           54%
                  1 Above median           38%                      -17%
   Weighted average of conditional differences =                   -14%

B. Question: does Fear of Crime explain the relationship between Income and Pessimism?
   a) Total Bivariate Relationship = Zero Order effect (difference, slope) = -.18 (because zero variables are controlled)
   b) Direct effect = Partial (difference, slope) = -.14
   c) Amount explained by third variable = -.04
      i. Intervening effect . . . if X1 causes X2 (X1 → X2)
      ii. Spurious effect . . . if X2 causes X1 (X2 → X1)
      iii. We'll talk about Causal Order among X variables next week

C. Answer: Somewhat. Fear of crime explains 22% of the original relationship. Controlling fear of crime, there is still a direct effect of income on pessimism of -.14, which means that controlling for fear of crime, higher income people are 14% less likely than low income people to be pessimistic about their neighborhood.

D. PQ way to report significance of the partial slopes

   Impact on Neighborhood Pessimism
   Predictor              Slope     T-test     Significant?
   Income (0,1)           -.138     -11.3      yes
   Fear of Crime (0,1)     .231     19.008     yes

E. PQ way to summarize the impact of controlling a third variable

   Impact of Household Income on Neighborhood Pessimism
   Zero order                             -.18    100%
   Partial (Direct Effect)                -.14     78%
   Intervening effect of Fear of Crime    -.04     22%

VI. SIGN ME UP . . . HOW DO I GET THE AVERAGE OF THE CONDITIONAL DIFFERENCES (aka THE PARTIAL, or THE DIRECT EFFECT)?

A. It would be nice if you could just add up the conditionals and get the simple average by dividing by however many conditionals there are (in this case there are two conditionals because fear is (0,1), but there could be more conditionals if X2 had 3+ categories)

B. But Nooooooo . . . the partial is a WEIGHTED AVERAGE of conditionals
   1. (THE NEXT COMMENT IS FOR EXTRA CREDIT, SKIP IT IF YOU ARE HAVING TROUBLE IN THIS CLASS)
   2. The weights depend on the variance of the difference in each conditional table

      PARTIAL = Sum of (weight * conditional difference)

C. So let's just have PASW calculate it for us . . .
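For readers who want to see the weighting in VI.B at work before handing the job to PASW, here is a minimal Python sketch. It rests on assumptions: the cell counts are hypothetical (the lecture gives only percentages), and the weights used, group size times the variance of income within each fear group, are the standard result for a regression with a 0/1 control, which may or may not be exactly what the extra-credit comment has in mind. With these rounded inputs the high-fear difference comes out at -16 points rather than the -17 in the lecture table, presumably because the lecture works from unrounded percentages.

```python
# Hypothetical cell counts (NOT from the lecture), chosen to reproduce the
# rounded percentages 27%, 16%, 54%, 38%.
# Key: (fear, income) -> (number of cases, number pessimistic)
cells = {
    (0, 0): (100, 27),
    (0, 1): (300, 48),
    (1, 0): (200, 108),
    (1, 1): (200, 76),
}

def conditional_pieces(cells):
    """Return (conditional difference, weight) for each fear group."""
    pieces = []
    for fear in (0, 1):
        n0, bad0 = cells[(fear, 0)]
        n1, bad1 = cells[(fear, 1)]
        diff = bad1 / n1 - bad0 / n0            # above-median minus below-median
        share_high = n1 / (n0 + n1)             # share of the group above median
        weight = (n0 + n1) * share_high * (1 - share_high)  # n * variance of income
        pieces.append((diff, weight))
    return pieces

def weighted_average(pieces):
    """PARTIAL = sum(weight * difference) / sum(weight)."""
    total = sum(w for _, w in pieces)
    return sum(d * w for d, w in pieces) / total

def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def partial_slope(cells):
    """Income slope in y = a + b1*income + b2*fear, via the normal equations."""
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for (fear, income), (n, bad) in cells.items():
        x = (1.0, float(income), float(fear))
        for i in range(3):
            xty[i] += x[i] * bad                # only pessimistic cases have y = 1
            for j in range(3):
                xtx[i][j] += x[i] * x[j] * n
    return solve(xtx, xty)[1]

pieces = conditional_pieces(cells)
avg = weighted_average(pieces)
slope = partial_slope(cells)
print(f"weighted average of conditionals = {avg:.4f}")
print(f"regression slope for income      = {slope:.4f}")
```

The two printed numbers agree, which is the point of VI.B: the regression's partial slope is the weighted average of the conditional differences, not the simple average.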
   ANALYZE / REGRESSION / LINEAR
      Dependent: Neighborhood Pessimism (Y)
      Independent(s): Income (X1), Fear of Crime (X2)   (two independent variables)
      OPTIONS: Exclude cases pairwise

   Coefficients(a)
                                                             Unstandardized          Standardized
   Model 1                                                   B        Std. Error     Beta        t          Sig.
   (Constant)                                                .284     .011                       25.768     .000
   income50pct above or below median                         -.138    .012           -.146       -11.308    .000
   crimscaleDICHOT nbrhd crime + victimization likelihood    .231     .012           .246        19.008     .000
   a. Dependent Variable: nbhdscaleDICHOT bad vs other

VII. This is a multiple regression (more than one X variable)

The slope for X1 is the partial (direct) effect
   a. This is where the -.14 comes from
   b. It is the impact of X1 → Y for a regression model that also includes X2

If the partial is NOT statistically significant, then it could = 0 and the control variable is said to fully explain the original X1 → Y relationship.

That didn't happen here . . . significance test . . . (partial/SE) = t-test = -11, p < .05

We already saw that 78% of the original relationship remains, i.e., is not explained by X2, and now we also learn that the partial is statistically significant.

VIII. PREDICTED AVERAGE SCORES FOR Y

A. The regression equation predicts the average on Y as a function of scores on two X variables

   Predicted average on Y = a + B1 * (x1) + B2 * (x2)
   Predicted average on Y = .284 - .138 * (x1) + .231 * (x2)

B. The prediction is an equation for two lines on a graph . . .

   [Line graph: Predicted % Pessimistic About the Neighborhood, by income, one line per fear group]

                    0 Below median income    1 Above median
   1 High Fear              51%                   38%
   0 Low Fear               28%                   15%
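The four points on the predicted graph come from plugging the (0,1) codes for income and fear into the regression equation. A minimal Python sketch of that arithmetic, using the coefficients from the PASW output (the function and constant names are illustrative only):

```python
# Coefficients copied from the PASW Coefficients table in the lecture.
INTERCEPT, B_INCOME, B_FEAR = 0.284, -0.138, 0.231

def predicted_pessimism(income, fear):
    """Predicted share pessimistic for 0/1 income and 0/1 fear codes."""
    return INTERCEPT + B_INCOME * income + B_FEAR * fear

for fear, label in ((1, "High fear"), (0, "Low fear")):
    below = predicted_pessimism(0, fear)
    above = predicted_pessimism(1, fear)
    print(f"{label}: below median {below:.3f}, above median {above:.3f}, "
          f"difference {above - below:.3f}")
```

Up to rounding of the published coefficients, the results match the 51%, 38%, 28% and 15% on the graph. The difference between above-median and below-median is the same -.138 in both fear groups, because the equation has a single income slope.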
   In the graph of predicted percentages:
   • one line shows the linear relationship between income and pessimism among those with high fear
   • the other line shows the linear relationship among those with low fear
   • the slope (impact of X1 → Y) is the same for each of these lines because the partial is the weighted average of the conditional differences and is assumed to be the same within each condition

IX. STATISTICAL INTERACTION

A. In almost every analysis, however, the actual slope is not going to be the same in each condition. You can find out how different they are by graphing observed data:

   [Line graph: Observed % Pessimistic About the Neighborhood, by income, one line per fear group]

                    0 Below median income    1 Above median
   1 High Fear              54%                   38%
   0 Low Fear               27%                   16%

   The slope is a little steeper among those with High Fear (-17%) than it is for those with Low Fear (-11%).

   It is OK if the slopes are A LITTLE different because the regression program is a robot that treats them as separate estimates of the partial and assumes the weighted average of the two is the best overall estimate of the partial.

   80% of the time this is what happens: the slopes are A LITTLE different, no worry.

B. Which means that 20% of the time the slopes are A LOT different. Let's imagine that the control variable is Place of Residence and the theory is:

   The reason higher income people are less likely to be pessimistic about their neighborhood is that higher income people are more likely to live in the suburbs, and suburban residents are generally less pessimistic about their neighborhood . . .
   Income (X1) causes Place of Residence (X2), which causes Neighborhood Pessimism (Y)

   And let's imagine the observed data look like this:

   [Line graph: Observed % Pessimistic About the Neighborhood, by income, one line per place of residence]

                    0 Below median income    1 Above median
   1 Chicago                54%                   38%
   0 Suburbs                16%                   16%

   Slope for Chicago = -17%
   Slope for Suburbs = 0%

   The regression robot will calculate the partial slope as the weighted average of the conditional slopes . . . i.e., about -8%

   But the predicted average scores for Y will always be pretty far off. The regression equation would understate the income difference in the city and overstate the income difference in the suburbs.

C. When the conditional slopes are A LOT different from each other it is called a statistical INTERACTION. When an interaction is present:
   1. The partial slope calculated by the regression program is WRONG
   2. The regression equation is WRONG
   3. The predictions from the regression equation don't fit the data very well

Q1: How can you tell if you have a statistical INTERACTION?
   A1: Graph the observed data and see if the lines are parallel
   A2: Make a table that compares observed with predicted average on Y:

   Observed and Expected Scores: Income, Fear and Neighborhood Pessimism

   FEAR           INCOME            Observed Pessimism    Predicted Pessimism    Residual O-E
   0 Low Fear     0 Below median          27%                    28%                 -1%
                  1 Above median          16%                    15%                  2%
   1 High Fear    0 Below median          54%                    51%                  3%
                  1 Above median          38%                    38%                  0%

   Large residuals show where they disagree.

Q2: What do you do if you have a statistical INTERACTION?
   A1: Right now: note the problem and proceed
   A2: In a couple of weeks: test the INTERACTION TERM and include it in the equation if the t-test is significant (TBA)

X. ANOTHER EXAMPLE
   X1 (interval 0,3)   X2 (interval 0,2)   Y (interval 0,3)

A. THEORY: Income causes Fear of Crime, which causes Neighborhood Pessimism

B. EXPLAIN THE VARIABLES . . .
   DESCRIPTIVES

   Descriptive Statistics
                                                       N        Minimum    Maximum    Mean
   incomeQUARTER quarter                               31954    0          3          1.5038
   nbhdscale                                           6112     0          3          0.5239
   crimscale nbrhd crime + victimization likelihood    9143     0          2          0.6883
   Valid N (listwise)                                  5574

   Descriptives, range, mean; bar charts to show how NICE they are

C. ZERO ORDER RELATIONSHIP TO BE EXPLAINED

   Table of means (not shown)
   Graph to explore curvilinearity

   [Bar chart: Neighborhood pessimism by income quarter]
      0 Lowest $ qtr    0.78
      1 Second $ qtr    0.57
      2 Third $ qtr     0.46
      3 Top qtr         0.31

   Slope = -.152   T = 15   p < .05
   Equation: Y = .752 - .152 (X1)

D. INTRODUCE CONTROL VARIABLE

   ANALYZE / COMPARE MEANS / MEANS
      Dependent: Y
      Independent List: X2, Next, X1
      Options: Mean
      CONTINUE, OK

   Table of Means: Neighborhood Pessimism Scale Score

                       0 Lowest $ qtr    1 Second $ qtr    2 Third $ qtr    3 Top $ qtr
   2 High Fear              1.39              1.09              1.07             .79
   1 Moderate Fear           .77               .59               .56             .46
   0 Low Fear                .41               .34               .25             .18

E. REPORT THE RESULTS

   1. Plot the means, carefully label everything. Explore interaction.

      [Line graph: Neighborhood Pessimism by income quarter, one line per fear level, plotting the table of means above]

      Conditional slopes
         2 High Fear        -.18
         1 Moderate Fear    -.09
         0 Low Fear         -.08
      a little different

      Be sure to ask me how to do a regression in Excel to solve for the conditional slopes

   2. Report Direct effect, significance . . .
      Partial = -.101   T = -10   p < .05

   3. Report the regression equation
      Y = .422 - .101 * (X1) + .369 * (X2)

   4. Make a table of predicted means OR a graph of the predicted means, use it to talk through the findings from the multiple regression

      Predicted Pessimism Scale Score

                       0 Lowest $ qtr    1 Second $ qtr    2 Third $ qtr    3 Top $ qtr
   2 High Fear             1.160             1.059              .958             .857
   1 Moderate Fear          .791              .690              .589             .488
   0 Low Fear               .422              .321              .220             .119

   5. Make a table that summarizes the impact of the control variable

      Impact of Household Income on Neighborhood Pessimism
      Zero order                             -.15    100%
      Partial (Direct Effect)                -.10     66%
      Intervening effect of Fear of Crime    -.05     34%

      Explain the impact of the control variable . . . i.e., controlling for Fear explains 34% of the zero order relationship between income and neighborhood pessimism

ASSIGNMENT 6:

Part 1: Calculate a regression slope in Excel. In the Excel File WEEK 6 SUPPORT MATERIALS there is a spreadsheet called ASSGT 6 part 1 which shows the results of a recent survey of SPS graduates who studied hard and did well in SPS 580. The X variable is the number of years since graduation, the Y variable is the average salary.
   a. Write a seven-word poem in the space provided. When you are satisfied with the poem, freeze the Y variable.
   b. Use Excel and your brain to fill in the boxes: what is the XY slope, mean(X), mean(Y) and the intercept
   c. Use the slope and intercept to fill in the predicted average(Y) as f(X)
   d. Make a graph of the observed average(Y) and the predicted avg(Y) as f(X)
   e. All you need to turn in is the graph, with two sentences max commenting on the results.

FROM THIS POINT ON follow guidelines for writing reports and rules for PQ exhibits

Part 2: Develop an X1 → Y theory in a population of interest. Test the impact of an intervening variable X2.

1. Choose/calculate/recode X1 (interval) and Y (interval)
   a. Don't go beyond 5 categories for X1 to keep the graphs tidy
   b. Y can have any number of categories
   c. Explain the variables in English, use bar charts to show they are NICE
   d. Explain the zero order results

2.
Choose/calculate/recode the intervening variable X2, i.e., the theory is that X1 causes X2 and X2 causes Y
   a. X2 can be dichotomous or interval
   b. If interval, don't go beyond 4 categories in order to keep the graphs tidy

3. Explain the impact of the control variable, following the 5 steps in the lecture, and providing the PQ documentation that goes along with those steps.
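The last steps of the lecture's reporting sequence (regression equation, predicted means, summary of the control variable's impact) are plain arithmetic once the slopes are known. A minimal Python sketch using the numbers from the lecture's second example; the function names are illustrative and are not part of PASW or the course materials:

```python
def decompose(zero_order, partial):
    """Split a zero-order slope into the direct part and the part explained by X2."""
    explained = zero_order - partial
    return {
        "direct": partial,
        "explained": explained,
        "pct_direct": 100 * partial / zero_order,
        "pct_explained": 100 * explained / zero_order,
    }

def predicted_mean(income_qtr, fear, a=0.422, b1=-0.101, b2=0.369):
    """Predicted pessimism score from Y = a + b1*X1 + b2*X2 (lecture example X)."""
    return a + b1 * income_qtr + b2 * fear

# Impact summary: zero-order slope -.152, partial slope -.101.
summary = decompose(zero_order=-0.152, partial=-0.101)
print(f"direct effect {summary['direct']:.3f} ({summary['pct_direct']:.0f}%)")
print(f"explained by X2 {summary['explained']:.3f} ({summary['pct_explained']:.0f}%)")

# Table of predicted means: rows are fear levels 2, 1, 0; columns are income quarters 0-3.
for fear in (2, 1, 0):
    row = [predicted_mean(q, fear) for q in range(4)]
    print(fear, "  ".join(f"{value:.3f}" for value in row))
```

The percentages round to the 66% and 34% reported in the lecture's summary table, and the grid reproduces its predicted means. For your own analysis, substitute the zero-order slope, partial slope and equation from your own output.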