Lecture 6: Multiple regression, interaction

SPS 580 Lecture 6 Controls Z-P multiple regression interaction
I.
LANGUAGE FOR INTERPRETING SLOPES
Chart 1: Score on Neighborhood Pessimism Scale, by income
DAS: X (0,1)  Y(int)
0 Below median income   .68
1 Above median income   .39
Slope = -.293
Higher income people score, on average, .29 points lower on neighborhood pessimism than lower income people.

Chart 2: Impact of Household Income on Neighborhood Pessimism Score (observed and predicted lines)
DAS: X (int 0,3)  Y(int)
X axis: 0 Lowest Income Quarter, 1 Second qtr, 2 Third qtr, 3 Top qtr
Slope = -.152
With each unit increase in income quartile, the neighborhood pessimism score drops by .15 points.

Chart 3: Percent pessimistic, by income quartile
DAS: X (int 0,3)  Y(0,1)
0 Lowest Income Quarter   48%
1 Second qtr              36%
2 Third qtr               30%
3 Top qtr                 20%
Slope = -.091
With each unit increase in income quartile, the percent pessimistic about the neighborhood drops by 9 points.

Chart 4: Any negative perception of neighborhood, by income
DAS: X (0,1)  Y(0,1)
0 Below median income   42%
1 Above median income   25%
Slope = -.176
Higher income people are 18% less likely to be pessimistic about their neighborhood than lower income people.
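As a rough check on where a slope like -.091 comes from, the sketch below fits a least-squares line through the four percent-pessimistic values read off the income-quartile chart (48%, 36%, 30%, 20%). It assumes each quartile counts equally, which is a simplification: the survey software fits the line to individual cases, so its result can differ in the third decimal. The helper name slope_through_means is made up for this illustration.

```python
def slope_through_means(xs, ys):
    """Least-squares slope and intercept through (x, y) points, each weighted equally."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Income quartile codes (0-3) and the share pessimistic in each quartile, from the chart.
quartile = [0, 1, 2, 3]
share_pessimistic = [0.48, 0.36, 0.30, 0.20]

# ASSUMPTION: equal weight per quartile (the lecture's slope uses individual cases).
slope, intercept = slope_through_means(quartile, share_pessimistic)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The equal-weight slope comes out at about -.09, in line with the slope reported for that chart.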
II.
INTRODUCING . . . CONTROL VARIABLES
A control variable enters the picture when the theory/idea says there is another factor that
explains the X → Y relationship. For example . . .
The reason higher income people are 18% less likely to be pessimistic about their
neighborhood is that higher income people live in places where there is less fear of
crime, and fear of crime causes pessimism about the neighborhood.
Income causes Fear of Crime, which causes Neighborhood Pessimism
X1 → X2 → Y
If the control variable is measured in the same survey, then there are statistical procedures to find
out whether that control variable explains the X → Y relationship.
III.
GETTING A CONTROL VARIABLE . . . Fear of Crime
crimnbr  Amount Of Neighborhood Crime
1 A Lot            11%
2 Some             27%
3 Only A Little    56%
4 None              6%
8 Do Not Know       1%
Total             100%

victim1  Likelihood Resp Will Be A Crime Victim
1 High              7%
2 Moderate         23%
3 Low              29%
4 Near Zero        23%
5 Zero             15%
8 Don't Know        3%
Total             100%

[Bar charts: crimnbr Amount Of Neighborhood Crime; victim1 Likelihood Resp Will Be A Crime Victim. Same percentages as the tables above.]

- 2 variables available on the data set, asked same year, etc.
- will code each (0,1) and create a scale (0,2)
- need to pick a cut point for each (0,1) coding that results in a NICE Scale
- Is there a policy relevant group? . . . the policy relevant group is usually the extreme end – in this case “a lot” or “none.” Coding it this way would result in bad skew. There is not a good reason to do this.
- The (low, high) coding should match the language of the theory . . . Fear causes Pessimism, so 0 = low fear, 1 = high fear
- The coding of “don’t knows” should match the language of the theory: 1 = high fear, 0 = other, DK
- The (0,1) coding should mean the same thing for each variable in the scale . . . if crimnbr (1) = a lot + some then victim1 (1) = high + moderate
- What coding produces minimal skew (larger variance)? (1+2 = 1) (3-8 = 0)
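The coding rules above can be sketched in a few lines. This is an illustration only: the respondents are made up, the function name recode_high_fear is hypothetical, and the rule for the dichotomy (any score of 1 or 2 counts as high fear) is an assumption inferred from the 50%/50% split reported for crimscaleDICHOT.

```python
def recode_high_fear(code):
    """Answer codes 1 and 2 become 1 (high fear); codes 3 through 8, including don't know, become 0."""
    return 1 if code in (1, 2) else 0

# Hypothetical respondents (NOT survey data): (crimnbr, victim1) answer codes.
respondents = [(1, 2), (3, 4), (2, 5), (8, 3), (4, 1), (3, 8)]

# Scale (0,2): add the two (0,1) items together.
crimscale = [recode_high_fear(c) + recode_high_fear(v) for c, v in respondents]

# ASSUMPTION: the dichotomy splits the scale at 0 versus 1 or 2.
crimscale_dichot = [1 if score >= 1 else 0 for score in crimscale]

print(crimscale)
print(crimscale_dichot)
```

Both items use the same cut point (codes 1 and 2 versus everything else), so a 1 means the same thing on each variable in the scale.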
crimscale  Fear of Crime Scale (PRETTY NICE scale)
0        50%
1        31%
2        19%
Total   100%
[Bar chart: Fear of Crime Scale, 0 = 50%, 1 = 31%, 2 = 19%]

And it makes a GORGEOUS dichotomy:
crimscaleDICHOT
0        50%
1        50%
Total   100%
IV.
TESTING THE IMPACT OF A CONTROL VARIABLE
A. When you control for a variable, it means you hold it constant.
B. So if you want to look at the causal impact of Income (X1) on Neighborhood Pessimism
(Y), controlling for Fear of Crime (X2), it means you need to separate the survey sample
into two groups (low fear, high fear) and look at the causal impact of Income (X1) on
Neighborhood Pessimism within each of these two groups.
C. Like so many other things in life, this is pretty easy to do with crosstabs . . .
CROSSTABS
Layer 1 = Fear of crime (X2)
Row variable = income (X1)    Column variable = pessimism (Y)

income50pct above or below median * nbhdscaleDICHOT bad vs other * crimscaleDICHOT nbrhd crime + victimization likelihood Crosstabulation
% within income50pct above or below median

crimscaleDICHOT   income50pct          nbhdscaleDICHOT bad vs other
                                       .00 all good neutral dk   1.00 any bad perception   Total
.00 Low fear      .00 below median     73%                       27%                       100%
                  1.00 above median    84%                       16%                       100%
1.00 high fear    .00 below median     46%                       54%                       100%
                  1.00 above median    62%                       38%                       100%
PQ version
Three way crosstabulation: Income by Neighborhood Pessimism, Controlling for Fear of Neighborhood
FEAR          INCOME           Percent pessimism = 1
0 Low Fear    0 Below median   27%
              1 Above median   16%
1 High Fear   0 Below median   54%
              1 Above median   38%
Two groups, hold constant the level of fear.
Causal impact controlling for Fear:
B(Income → Pessimism) = -11% when Fear = Low
B(Income → Pessimism) = -17% when Fear = High
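A conditional difference is just the above-median percent minus the below-median percent inside one fear group. The sketch below computes it from the rounded percentages in the three-way table. Because the inputs are rounded, the high-fear difference comes out as -16 rather than the -17% shown in the lecture, which presumably was computed from unrounded percentages.

```python
# Percent pessimistic (Y = 1) from the three-way table, keyed by (fear, income).
pct_pessimistic = {
    (0, 0): 27, (0, 1): 16,   # low fear: below / above median income
    (1, 0): 54, (1, 1): 38,   # high fear: below / above median income
}

# Hold fear constant, then compare above-median with below-median income.
conditional_diff = {
    fear: pct_pessimistic[(fear, 1)] - pct_pessimistic[(fear, 0)]
    for fear in (0, 1)
}

for fear, diff in conditional_diff.items():
    print(f"fear = {fear}: conditional difference = {diff} points")
```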
V.
HOW TO DETERMINE WHETHER THE CONTROL VARIABLE “EXPLAINS”
THE ORIGINAL X1 → Y RELATIONSHIP
A. The average conditional difference shows the amount of the X1 → Y relationship
that remains when the explanatory variable X2 is controlled
Three way crosstabulation: Income by Neighborhood Pessimism, Controlling for Fear of Neighborhood
FEAR          INCOME           Percent pessimism = 1   Conditional Differences
0 Low Fear    0 Below median   27%
              1 Above median   16%                     -11%
1 High Fear   0 Below median   54%
              1 Above median   38%                     -17%
Weighted average of conditional differences = -14%
B. Question: does Fear of Crime explain the relationship between Income and
Pessimism?
a) Total Bivariate Relationship = Zero Order effect (difference, slope) = -.18
(Because zero variables are controlled)
b) Direct effect = Partial (difference, slope) = -.14
c) Amount explained by third variable = -.04
i. Intervening effect . . . if X1 causes X2 (X1 → X2)
ii. Spurious effect . . . . . . if X2 causes X1 (X2 → X1)
iii. We’ll talk about Causal Order among X variables next week
C. Answer: Somewhat. Fear of crime explains 22% of the original relationship.
Controlling for fear of crime, there is still a direct effect of income on pessimism of -.14,
which means that, controlling for fear of crime, higher income people are 14% less
likely than low income people to be pessimistic about their neighborhood.
D. PQ way to report significance of the partial slopes
Impact on Neighborhood Pessimism
Predictor              Slope    T-test    significant?
Income (0,1)           -.138    -11.3     yes
Fear of Crime (0,1)     .231     19.008   yes
E. PQ way to summarize the impact of controlling a third variable
Impact of Household Income on Neighborhood Pessimism
Zero order                            -.18   100%
Partial (Direct Effect)               -.14    78%
Intervening effect of Fear of Crime   -.04    22%
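The arithmetic behind the summary table can be written out as a short sketch. It uses the unrounded slopes from the lecture (zero order -.176, partial -.138); the variable names are chosen for this illustration only.

```python
zero_order = -0.176   # slope of Income on Pessimism with nothing controlled
partial = -0.138      # slope of Income with Fear of Crime also in the model

# Whatever part of the zero order slope does not survive the control is "explained".
explained = zero_order - partial
share_explained = explained / zero_order
share_direct = partial / zero_order

print(f"explained by Fear of Crime: {explained:.3f} ({share_explained:.0%})")
print(f"direct effect:              {partial:.3f} ({share_direct:.0%})")
```

The two shares always add up to 100% of the zero order relationship.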
VI.
SIGN ME UP . . . HOW DO I GET THE AVERAGE OF THE CONDITIONAL
DIFFERENCES (aka THE PARTIAL, or THE DIRECT EFFECT)?
A. It would be nice if you could just add up the conditionals and get the simple average by
dividing by however many conditionals there are (in this case there are two
conditionals because fear is (0,1), but there could be more conditionals if X2 had 3+
categories).
B. But Nooooooo . . . the partial is a WEIGHTED AVERAGE of conditionals
1. (THE NEXT COMMENT IS FOR EXTRA CREDIT, SKIP IT IF YOU ARE
HAVING TROUBLE IN THIS CLASS)
2. The weights depend on the variance of the difference in each conditional
table: PARTIAL = Sum of (weight * conditional difference)
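For the extra-credit crowd, here is a minimal sketch of the weighted average. It assumes the usual least-squares weighting, where each conditional table's weight is its number of cases times the variance of income inside that table (so a conditional difference estimated less precisely counts for less). The cell counts are hypothetical, chosen only to reproduce the 27%, 16%, 54% and 38% cells; they are NOT the survey's counts, so the resulting partial will not equal the lecture's -.138. The helper ols_two_predictors is written for this illustration.

```python
def ols_two_predictors(x1, x2, y):
    """Intercept and slopes for y = a + b1*x1 + b2*x2 via the two-predictor normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# HYPOTHETICAL cells: (fear, income) -> (number of cases, number pessimistic).
cells = {(0, 0): (300, 81), (0, 1): (700, 112), (1, 0): (500, 270), (1, 1): (300, 114)}

# Expand the cells into one row per (made-up) respondent.
income, fear, pessimism = [], [], []
for (f, i), (n, k) in cells.items():
    income += [i] * n
    fear += [f] * n
    pessimism += [1] * k + [0] * (n - k)

_, partial, _ = ols_two_predictors(income, fear, pessimism)

# Weighted average of the conditional differences, one per fear group.
weighted_sum, weight_total = 0.0, 0.0
for f in (0, 1):
    n0, k0 = cells[(f, 0)]
    n1, k1 = cells[(f, 1)]
    conditional_diff = k1 / n1 - k0 / n0
    share_above = n1 / (n0 + n1)
    weight = (n0 + n1) * share_above * (1 - share_above)  # cases * variance of income
    weighted_sum += weight * conditional_diff
    weight_total += weight
weighted_average = weighted_sum / weight_total

print(f"regression partial = {partial:.4f}, weighted average = {weighted_average:.4f}")
```

With these made-up counts the two numbers agree, which is the point: the regression slope for X1 is the weighted average of the conditionals, not the simple average.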
C. So let’s just have PASW calculate it for us . . .
ANALYZE / REGRESSION / LINEAR
Dependent: Neighborhood Pessimism (Y)
Independent(s): Income (X1), Fear of Crime (X2)   (two independent variables)
OPTIONS: Exclude cases pairwise

Coefficients(a)
                                                      Unstandardized Coefficients   Standardized Coefficients
Model 1                                               B        Std. Error           Beta       t          Sig.
(Constant)                                            .284     .011                            25.768     .000
income50pct above or below median                    -.138     .012                 -.146     -11.308     .000
crimscaleDICHOT nbrhd crime + victimization
likelihood                                            .231     .012                  .246      19.008     .000
a. Dependent Variable: nbhdscaleDICHOT bad vs other
VII.
This is a multiple regression (more than one X variable)
- The slope for X1 is the partial (direct) effect
a. This is where the -.14 comes from
b. It is the impact of X1 → Y for a regression model that also includes X2
- If the partial is NOT statistically significant, then it could = 0 and the control variable
is said to fully explain the original X1 → Y relationship
- That didn’t happen here . . . significance test . . . (partial/SE) = t-test = -11, p < .05
- We already saw that 78% of the original relationship remains – i.e., is not explained
by X2 – and now we also learn that the partial is statistically significant
VIII. PREDICTED AVERAGE SCORES FOR Y
A. The regression equation predicts the average on Y as a function of scores on two X
variables
Predicted average on Y = a + B1 * (x1) + B2 * (x2)
Predicted average on Y = .284 -.138 * (x1) + .231 * (x2)
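Plugging the four combinations of income and fear into the equation gives the four predicted averages. The sketch below does just that; the function name predicted_pessimism is made up for illustration, and the coefficients are the ones in the equation above.

```python
def predicted_pessimism(income, fear, a=0.284, b1=-0.138, b2=0.231):
    """Predicted average on Y for income (0,1) and fear (0,1)."""
    return a + b1 * income + b2 * fear

# One prediction per (fear, income) combination.
predictions = {
    (fear, income): predicted_pessimism(income, fear)
    for fear in (0, 1)
    for income in (0, 1)
}

for (fear, income), value in predictions.items():
    print(f"fear = {fear}, income = {income}: predicted = {value:.3f}")
```

The four values are .284, .146, .515 and .377. Within each fear level the above-median prediction sits exactly .138 below the below-median one, because the equation has a single slope for income.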
B. The prediction is an equation for two lines on a graph . . .
[Chart: Predicted % Pessimistic About the Neighborhood, by income, one line per fear level]
              0 Below median income   1 Above median
1 High Fear   51%                     38%
0 Low Fear    28%                     15%
- one line shows the linear relationship between income and pessimism among those with high fear
- the other line shows the linear relationship among those with low fear
- the slope (impact of X1 → Y) is the same for each of these lines because the partial is the weighted average of the conditional differences and is assumed to be the same within each condition
IX.
STATISTICAL INTERACTION
A. In almost every analysis, however, the actual slope is not going to be the same in
each condition. You can find out how different they are by graphing observed data:
[Chart: Observed % Pessimistic About the Neighborhood, by income, one line per fear level]
              0 Below median income   1 Above median
1 High Fear   54%                     38%
0 Low Fear    27%                     16%
The slope is a little steeper among those with High Fear (-17%) than it is for those with Low Fear (-11%).
It is OK if the slopes are A LITTLE different because the regression program is a robot that treats them as separate estimates of the partial and assumes the weighted average of the two is the best overall estimate of the partial. 80% of the time this is what happens: the slopes are A LITTLE different, no worry.
B. Which means that 20% of the time the slopes are A LOT different.
Let’s imagine that the control variable is Place of Residence and the theory is: The reason
higher income people are less likely to be pessimistic about their neighborhood is because
higher income people are more likely to live in the suburbs and suburban residents are
generally less pessimistic about their neighborhood . . .
Income (X1) causes Place of Residence (X2), which causes Neighborhood Pessimism (Y)
And let’s imagine the observed data look like this:
[Chart: Observed % Pessimistic About the Neighborhood, by income, one line per place of residence]
            0 Below median income   1 Above median
1 Chicago   54%                     38%
0 Suburbs   16%                     16%
Slope for Chicago = -17%
Slope for Suburbs = 0%
The regression robot will calculate the partial slope as the weighted average of the
conditional slopes . . . i.e., about -8%
But the predicted average scores for Y will always be pretty far off.
The regression equation would understate the income difference in the city and
overstate the income difference in the suburbs.
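To see how far off the predictions get, here is a sketch built on the imagined Chicago/Suburbs percentages. It rests on one loud assumption that the lecture does not state: all four cells hold the same number of cases. Under that assumption income and place are uncorrelated, so each partial slope is a simple average of differences and the intercept follows from the grand mean.

```python
# Imagined percent pessimistic, keyed by (place, income):
# place 0 = Suburbs, 1 = Chicago; income 0 = below median, 1 = above median.
observed = {(0, 0): 16, (0, 1): 16, (1, 0): 54, (1, 1): 38}

# ASSUMPTION: equal cell sizes, so the partials are simple averages of differences.
b_income = sum(observed[(p, 1)] - observed[(p, 0)] for p in (0, 1)) / 2
b_place = sum(observed[(1, i)] - observed[(0, i)] for i in (0, 1)) / 2
grand_mean = sum(observed.values()) / 4
intercept = grand_mean - b_income * 0.5 - b_place * 0.5  # both X means are 0.5

predicted = {(p, i): intercept + b_income * i + b_place * p for (p, i) in observed}
residual = {cell: observed[cell] - predicted[cell] for cell in observed}

print(f"partial slope for income = {b_income:.0f} points")
for cell in observed:
    print(cell, "observed", observed[cell], "predicted", predicted[cell], "residual", residual[cell])
```

The single income slope of -8 is forced onto both places: the predicted drop in Chicago (50 to 42) is smaller than the observed drop (54 to 38), and the suburbs are predicted to drop (20 to 12) when the observed line is flat.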
C. When the conditional slopes are A LOT different from each other it is called a
statistical INTERACTION. When an interaction is present:
1. The partial slope calculated by the regression program is WRONG
2. The regression equation is WRONG
3. The predictions from the regression equation don’t fit the data very well
Q1: How can you tell if you have a statistical INTERACTION?
A1: Graph the observed data and see if the lines are parallel
A2: Make a table that compares observed with predicted average on Y:
Observed and Expected Scores: Income, Fear and Neighborhood Pessimism
FEAR          INCOME           Observed Pessimism   Predicted Pessimism   Residual (O-E)
0 Low Fear    0 Below median   27%                  28%                   -1%
              1 Above median   16%                  15%                    2%
1 High Fear   0 Below median   54%                  51%                    3%
              1 Above median   38%                  38%                    0%
residuals show where they disagree, large
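The residual column is observed minus predicted, cell by cell. The sketch below recomputes it from the rounded percentages in the table. With rounded inputs the low-fear, above-median residual comes out as 1 rather than the 2% shown, which presumably reflects unrounded values in the original calculation.

```python
cells = [
    "Low fear, below median",
    "Low fear, above median",
    "High fear, below median",
    "High fear, above median",
]
observed = [27, 16, 54, 38]    # observed percent pessimistic
predicted = [28, 15, 51, 38]   # predicted by the regression equation

# Residual = observed minus expected (O-E).
residuals = [o - e for o, e in zip(observed, predicted)]

# The cell where the equation and the data disagree the most.
worst_cell, worst_residual = max(zip(cells, residuals), key=lambda pair: abs(pair[1]))

print(residuals)
print(f"largest disagreement: {worst_cell} ({worst_residual} points)")
```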
Q2: What do you do if you have a statistical INTERACTION?
A1: Right now – note the problem and proceed
A2: In a couple of weeks -- test the INTERACTION TERM and include it in the
equation if the t-test is significant (TBA)
X.
ANOTHER EXAMPLE: X1 (interval 0,3), X2 (interval 0,2), Y (interval 0,3)
A. THEORY: Income causes Fear of Crime, which causes Neighborhood Pessimism
B. EXPLAIN THE VARIABLES . . . DESCRIPTIVES
Descriptive Statistics
                                                   N       Minimum   Maximum   Mean
incomeQUARTER quarter                              31954   0         3         1.5038
nbhdscale                                          6112    0         3         0.5239
crimscale nbrhd crime + victimization likelihood   9143    0         2         0.6883
Valid N (listwise)                                 5574
- Descriptives, range, mean
- bar charts to show how NICE they are
C. ZERO ORDER RELATIONSHIP TO BE EXPLAINED
Table of means (not shown)
[Chart: Neighborhood pessimism by income quarter; graph to explore curvilinearity]
0 Lowest $ qtr   0.78
1 Second $ qtr   0.57
2 Third $ qtr    0.46
3 Top qtr        0.31
Slope = -.152   T = 15   p < .05
Equation: Y = .752 - .152 (X1)
D. INTRODUCE CONTROL VARIABLE
ANALYZE / COMPARE MEANS / MEANS
Dependent: Y
Independent List: X2
Next: X1
Options: Mean
CONTINUE, OK

Table of Means: Neighborhood Pessimism Scale Score
                  0 Lowest $ qtr   1 Second $ qtr   2 Third $ qtr   3 Top $ qtr
2 High Fear       1.39             1.09             1.07            .79
1 Moderate Fear   .77              .59              .56             .46
0 Low Fear        .41              .34              .25             .18
E. REPORT THE RESULTS
1. Plot the means carefully, label everything
[Chart: Neighborhood Pessimism by income quarter, one line per fear level]
                  0 Lowest $ qtr   1 Second $ qtr   2 Third $ qtr   3 Top $ qtr   Conditional slope
2 High Fear       1.39             1.09             1.07            .79           -.18
1 Moderate Fear   .77              .59              .56             .46           -.09
0 Low Fear        .41              .34              .25             .18           -.08
Explore interaction: the conditional slopes are a little different.
Be sure to ask me how to do a regression in Excel to solve for the conditional slopes.
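One way to solve for the conditional slopes outside Excel is to fit a least-squares line through the four means at each fear level. The sketch below does that under the assumption that each income quarter counts equally; the helper name ls_slope is made up for this illustration. With equal weights the Moderate Fear slope comes out near -.10 rather than the -.09 reported, a small gap that may come from rounding of the means or from weighting by cases.

```python
def ls_slope(xs, ys):
    """Least-squares slope through (x, y) points, each weighted equally."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

income_quarter = [0, 1, 2, 3]

# Mean pessimism score in each income quarter, by fear level (from the table of means).
means_by_fear = {
    "2 High Fear": [1.39, 1.09, 1.07, 0.79],
    "1 Moderate Fear": [0.77, 0.59, 0.56, 0.46],
    "0 Low Fear": [0.41, 0.34, 0.25, 0.18],
}

# ASSUMPTION: equal weight per income quarter within each fear level.
conditional_slopes = {
    fear: ls_slope(income_quarter, means) for fear, means in means_by_fear.items()
}

for fear, slope in conditional_slopes.items():
    print(f"{fear}: conditional slope = {slope:.3f}")
```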
2. Report Direct effect, significance . . . Partial = -.101 T = -10 p < .05
3. Report the regression equation Y = .422 - .101 * (X1) + .369 * (X2)
4. Make a table of predicted means OR a graph of the predicted means, use it to talk through the findings from the multiple regression
Predicted Pessimism Scale Score
                  0 Lowest $ qtr   1 Second $ qtr   2 Third $ qtr   3 Top $ qtr
2 High Fear       1.160            1.059            .958            .857
1 Moderate Fear   .791             .690             .589            .488
0 Low Fear        .422             .321             .220            .119
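The twelve predicted means can be generated straight from the regression equation Y = .422 - .101 * (X1) + .369 * (X2). The sketch below is one way to build the grid; the function name predicted_score is hypothetical.

```python
def predicted_score(income_qtr, fear, a=0.422, b1=-0.101, b2=0.369):
    """Predicted pessimism scale score for income quarter (0-3) and fear level (0-2)."""
    return a + b1 * income_qtr + b2 * fear

# One row per fear level (high to low), one column per income quarter.
grid = {
    fear: [round(predicted_score(q, fear), 3) for q in range(4)]
    for fear in (2, 1, 0)
}

for fear, row in grid.items():
    print(f"fear = {fear}: " + "  ".join(f"{value:.3f}" for value in row))
```

Every row drops by .101 per income quarter and every column rises by .369 per fear level, which is why the predicted lines are parallel.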
5. Make a table that summarizes the impact of the control variable
Impact of Household Income on Neighborhood Pessimism
Zero order                            -.15   100%
Partial (Direct Effect)               -.10    66%
Intervening effect of Fear of Crime   -.05    34%
Explain the impact of the control variable . . . i.e., controlling for Fear explains 34% of the zero order relationship between income and neighborhood pessimism
ASSIGNMENT 6:
Part 1: Calculate a regression slope in Excel.
In the Excel File WEEK 6 SUPPORT MATERIALS there is a spreadsheet called ASSGT 6 part
1 which shows the results of a recent survey of SPS graduates who studied hard and did well in
SPS 580. The X variable is the number of years since graduation, the Y variable is the average
salary.
a. Write a seven-word poem in the space provided. When you are satisfied with the poem,
freeze the Y variable.
b. Use Excel and your brain to fill in the boxes: what is the XY slope, mean(X), mean(Y) and
the intercept
c. Use the slope and intercept to fill in the predicted average(Y) as f(X)
d. Make a graph of the observed average(Y) and the predicted avg(Y) as f(X)
e. All you need to turn in is the graph, with two sentences max commenting on the results.
FROM THIS POINT ON follow guidelines for writing reports and rules for PQ exhibits
Part 2: Develop an X1 → Y theory in a population of interest. Test the impact of an intervening
variable X2.
1. Choose/calculate/recode X1 (interval) and Y (interval)
a. Don’t go beyond 5 categories for X1 to keep the graphs tidy
b. Y can have any number of categories
c. Explain the variables in English, use bar charts to show they are NICE
d. Explain the zero order results
2. Choose/calculate/recode the intervening variable X2 – i.e., the theory is that X1 causes X2
and X2 causes Y
a. X2 can be dichotomous or interval
b. If interval, don’t go beyond 4 categories in order to keep the graphs tidy
3. Explain the impact of the control variable, following the 5 steps in the lecture, and providing
the PQ documentation that goes along with those steps.