Slides for Week 2

Logic of Multivariate Analysis
Multiple Regression

Why multivariate analysis?

Nothing happens by a single cause. If it did:
 it would imply perfect determinism
 it would imply perfect/divine measurement
 it would be impossible to separate cause from effect (where does the cause end and the effect begin?)

Social reality is notoriously multi-causal, even more so than many physical, chemical, or biological processes.
 People are not just objects but also subjects of causal processes – reflexivity, agency, framing, etc. (Some of these are hard to capture in statistical models.)
Mill's criteria for causation:
#1. Empirical Association
#2. Appropriate Time Order
#3. Non-Spuriousness (Excluding Other Forms of Causation)




Mill tells us that even a single causal relationship cannot be established without multivariate analysis (see #3).
Suppose we suspect that X causes Y: Y = f(X, e).
Suppose we establish that X is related to Y (#1) and that X precedes Y (#2).
But what if both X and Y are the result of Z, a third variable?




E.g., Academic Performance = f(Poverty, e)

If that were true, redistributing income should improve academic achievement.
But maybe both are the result of parents' education (a confounding factor).

[Path diagrams: (1) Poverty → Academic Performance, a negative effect, with error term e. (2) Parents' Education → Poverty (–) and Parents' Education → Academic Performance (+), with error terms e1 and e2; the Poverty–Academic Performance association is then spurious.]

Eliminating or “controlling for” other, confounding factors (Z)

Experiments – the treatment (X) is introduced by the researcher:

1. Physical control
 Excluding factors by physical design – physical control of the Zs

2. Randomization
 Random assignment to treatment and control groups – the Zs are controlled by randomization

Observational research – no manipulation by the researcher:

3. Quasi-experiments
 "Found" experiments – choosing cases that are "minimal pairs": the same on most confounding factors (Zs) but different in the treatment (X)

4. Statistical manipulation
 Removing the effect of Z from the relationship between Y and X
 Organizing the data into groups homogeneous in the control variable Z and looking at the relationship between treatment X and response Y within each group
 If Y still moves together with X, it cannot be because both are moved by Z: Z is constant. If Z were the cause of Y and Z is constant, Y would have to be constant too.
 Residualizing X on Z and then residualizing Y on Z. That leaves us with the part of X and the part of Y that are unrelated to Z. If the two residualized variables still move together, that cannot be because they are moved by Z. (See the sketch below.)
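
A minimal sketch of the residualizing strategy in Stata, using the built-in auto dataset (price, mpg, and weight are illustrative stand-ins for Y, X, and Z, not course data):

* Residualizing (the Frisch-Waugh-Lovell logic)
sysuse auto, clear

* The multiple regression: the mpg coefficient controls for weight
regress price mpg weight

* Residualize Y on Z, then X on Z
quietly regress price weight
predict e_yz, resid
quietly regress mpg weight
predict e_xz, resid

* Regressing the Y residuals on the X residuals reproduces the
* multiple-regression coefficient on mpg
regress e_yz e_xz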


Remember: in a regression the error is always unrelated to the independent variable(s).

Residualizing (we "take out," or eliminate, Z from both Y and X):

Yi = a' + b'Zi + eyzi        (residualize Y on Z)
Xi = a'' + b''Zi + exzi      (residualize X on Z)
eyzi = a* + b*exzi + ei      (regress the Y residuals on the X residuals)

Compare this with the multiple regression Yi = a + b1Xi + b2Zi + ei:

a* = 0
b* = b1
Conditional effect of X on Y controlling for Z, by the temporal position of Z vis-à-vis X:

Antecedent variable (Z precedes both X and Y):
 No change: Z is not a factor.
 Zero or statistically not significant: spurious association – X is not a factor (Z is their common cause).
 Weaker but statistically significant: X is a factor, but some of its original effect is spurious.
 Stronger than the unconditional effect: suppression.
 Uneven among the categories of Z: statistical interaction (X works differently depending on the values of Z).

Intervening variable (Z precedes Y but not X):
 No change: Z is not a factor.
 Zero or statistically not significant: explanation or chain relationship – X is a factor, but only through Z (X has no direct or independent effect).
 Weaker but statistically significant: X is a factor, and it affects Y both through Z and directly (or through other variables missing from the model).
 Stronger than the unconditional effect: suppression.
 Uneven among the categories of Z: statistical interaction (X works differently depending on the values of Z).
Yi = a + b1Xi + b2Zi + ei

or

Yi = a + b1X1i + b2X2i + ei

To obtain a, b1, and b2, we first calculate β*1 and β*2 from the standardized regression.
Then we transform them into their metric equivalents.
Finally, we obtain a with the help of the means of Y, X1, and X2: a = mean(Y) – b1·mean(X1) – b2·mean(X2).




ZYi = β*1ZX1i + β*2ZX2i + ei

We multiply each side by ZX1i:

ZX1iZYi = β*1ZX1iZX1i + β*2ZX1iZX2i + ZX1iei

We sum across all cases and divide by n:

ΣZX1iZYi/n = β*1ΣZX1iZX1i/n + β*2ΣZX1iZX2i/n + ΣZX1iei/n

The left side is rYX1; ΣZX1iZX1i/n = 1 (the variance of a Z score); ΣZX1iZX2i/n = rX2X1; and ΣZX1iei/n = 0 because the error is unrelated to the independent variables. We get our first normal equation (for the correlation between Y and X1):

1.  rYX1 = β*1 + β*2rX2X1

This gives us an expression for β*1:

β*1 = rYX1 – β*2rX2X1

We multiply each side by ZX2i and repeat what we did. We get our second normal equation (for the correlation between Y and X2):

2.  rYX2 = β*1rX2X1 + β*2

Plugging in for β*1:

rYX2 = (rYX1 – β*2rX2X1)rX1X2 + β*2

Both standardized coefficients can thus be expressed in terms of the three correlations among Y, X1, and X2:

β*2 = (rYX2 – rYX1rX2X1)/(1 – rX2X1²)

β*1 = (rYX1 – rYX2rX2X1)/(1 – rX2X1²)
We multiply each standardized coefficient by the ratio of the standard deviation of the dependent variable to the standard deviation of the independent variable to which it belongs:

b1 = (SY/SX1)β*1
b2 = (SY/SX2)β*2
Take the two normal equations:

rYX1 = β*1 + β*2rX2X1
rYX2 = β*1rX2X1 + β*2

What do we learn from the normal equations?
If either β*2 = 0 or rX1X2 = 0, the unconditional effect does not change once we control for X2.

We get suppression only if β*2 ≠ 0 and rX1X2 ≠ 0, and they are of opposite signs if the unconditional effect is positive, or of the same sign if the unconditional effect is negative.
The correlation (unconditional effect) of X1 or X2 with Y can be decomposed into two parts. Take X1:

 the direct (or net) effect of X1 on Y (β*1), controlling for X2,
 and the product of the direct (or net) effect of X2 on Y (β*2) and the correlation between X1 and X2 (rX1X2), the measure of multicollinearity between the two independent variables.
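
A minimal sketch of this whole procedure in Stata, using the built-in auto dataset (price, mpg, and weight are illustrative stand-ins for Y, X1, and X2):

* Recover b1, b2, and a from the three correlations, the standard
* deviations, and the means; compare with -regress-
sysuse auto, clear
quietly correlate price mpg weight
matrix C = r(C)
scalar ryx1  = C[2,1]
scalar ryx2  = C[3,1]
scalar rx1x2 = C[3,2]

* Standardized coefficients from the normal equations
scalar bs1 = (ryx1 - ryx2*rx1x2)/(1 - rx1x2^2)
scalar bs2 = (ryx2 - ryx1*rx1x2)/(1 - rx1x2^2)

* Metric equivalents, then the intercept from the means
quietly summarize price
scalar sy = r(sd)
scalar my = r(mean)
quietly summarize mpg
scalar b1  = bs1*sy/r(sd)
scalar mx1 = r(mean)
quietly summarize weight
scalar b2  = bs2*sy/r(sd)
scalar mx2 = r(mean)
scalar a = my - b1*mx1 - b2*mx2

display "a = " a "   b1 = " b1 "   b2 = " b2
regress price mpg weight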

http://www.miabella-llc.com/demo.html


(AP = Academic Performance, P = Poverty, PE = Parents' Education)

Bivariate model:    AP = f(P, e1)        ZAP = β*'1ZP + e1
Multivariate model: AP = f(P, PE, e)     ZAP = β*1ZP + β*2ZPE + e

[Path diagrams: (1) Poverty → Academic Performance with coefficient β*'1 and error e1. (2) Poverty → Academic Performance (β*1) and Parents' Education → Academic Performance (β*2), with error e.]












. correlate AVG_ED API13 MEALS, means
(obs=10173)

    Variable |      Mean   Std. Dev.        Min        Max
-------------+----------------------------------------------
      AVG_ED |  2.781778     .758739          1          5
       API13 |   784.182    102.2096        311        999
       MEALS |  58.57338     27.9053          0        100

             |  AVG_ED    API13    MEALS
-------------+----------------------------
      AVG_ED |  1.0000
       API13 |  0.6706   1.0000
       MEALS | -0.8178  -0.4743   1.0000
. regress API13 AVG_ED MEALS, beta

      Source |       SS        df       MS             Number of obs =   10173
-------------+------------------------------          F(  2, 10170) = 4441.76
       Model |    49544993     2  24772496.5          Prob > F      =  0.0000
    Residual |  56719871.2 10170  5577.17514          R-squared     =  0.4662
-------------+------------------------------          Adj R-squared =  0.4661
       Total |   106264864 10172  10446.8014          Root MSE      =   74.68

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      AVG_ED |   114.9596   1.695597    67.80   0.000                  .853387
       MEALS |   .8187537   .0461029    17.76   0.000                 .2235364
       _cons |   416.4326   7.135849    58.36   0.000                        .
------------------------------------------------------------------------------
. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of API13

         chi2(1)     =  1332.01
         Prob > chi2 =   0.0000

[Residuals-versus-fitted plot: residuals (about -400 to 400) against fitted values (about 600 to 1000).]
Bivariate: rYX1 = β*'1 = -.4743 (Poverty → Academic Performance, error e1)
Multivariate: β*1 = .2235364 (Poverty → Academic Performance), β*2 = .853387 (Parents' Education → Academic Performance), rX1X2 = -.8178, error e

rYX1 = β*1 + β*2rX2X1
rYX1 = .2235364 + .853387 × (-.8178) = .2235364 – .69789989 = -.4743

The second term (-.69789989) is the spurious indirect effect.

Bivariate: rYX2 = β*'2 = .6706 (Parents' Education → Academic Performance, error e1)
Multivariate: β*1 = .2235364 (Poverty), β*2 = .853387 (Parents' Education), rX1X2 = -.8178, error e

rYX2 = β*2 + β*1rX2X1
rYX2 = .853387 + .2235364 × (-.8178) = .853387 – .18280807 = .6706

The second term (-.18280807) is the indirect effect.
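
The two decompositions can be checked with simple arithmetic in Stata (the numbers come from the output above):

display .2235364 + .853387*(-.8178)     // -.4743 = r(API13, MEALS)
display .853387 + .2235364*(-.8178)     //  .6706 = r(API13, AVG_ED)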
Venn diagram

R-square = unique contribution of X1 + unique contribution of X2 + common contribution of X1 and X2

[Venn diagram: overlapping circles for Y, X1, and X2; the overlaps with Y represent the unique and common contributions.]

Multicollinearity
The unique contributions are small and statistically non-significant, yet R-square is large because the common contribution is large.

[Venn diagram: X1 and X2 overlap heavily; each overlaps Y mostly in the shared region.]

Comparing theories
How much does a theory add to an already existing one?
Calculating the contribution of a set of variables to R²:

F(K2–K1),(N–K2–1) = [(R2² – R1²)/(K2 – K1)] / [(1 – R2²)/(N – K2 – 1)]

where R1² is the fit of the reduced/smaller model and R2² is the fit of the full/complete model,
K1 is the number of independent variables in the reduced model and K2 is the number of independent variables in the complete model,
and N is the sample size.

Warning: You have to make sure you use the exact same cases for each model!


Adding a new independent variable will always improve the fit, even if it is unrelated to the dependent variable.
We have to consider the parsimony of the model (the number of independent variables) relative to the sample size.

For N = 2, a simple regression will always fit perfectly.
General rule: N–1 independent variables will always produce an R-squared of 1, no matter what those variables are.
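
A minimal illustration in Stata with simulated data (the seed is arbitrary): two purely random predictors fit N = 3 cases perfectly.

* With N = 3 and N-1 = 2 predictors, R-squared is 1 by construction
clear
set obs 3
set seed 42
generate y  = runiform()
generate x1 = runiform()
generate x2 = runiform()
regress y x1 x2         // reports R-squared = 1.0000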

Adjusted R-square:

R²adj = R² – [K(1 – R²)] / (N – K – 1)
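
A quick check of the formula against Stata's own adjusted R-squared, using the built-in auto data (K = 2 predictors here):

sysuse auto, clear
quietly regress price mpg weight
display e(r2) - 2*(1 - e(r2))/(e(N) - 2 - 1)    // the formula, K = 2
display e(r2_a)                                  // Stata's stored value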


Yi = a + b1X1i + b2X2i + .... + bkXki + ei

If we standardize Y, X1 … Xk, turning them into Z scores, we can re-write the equation as

ZYi = β*1ZX1i + β*2ZX2i + … + β*kZXki + ei

To find the coefficients we have to write out k normal equations, one for each correlation between an independent variable and the dependent variable:

rYX1 = β*1 + β*2rX1X2 + ….. + β*krX1Xk
rYX2 = β*1rX1X2 + β*2 + ….. + β*krX2Xk
……………….
rYXk = β*1rX1Xk + β*2rX2Xk + ….. + β*k

and solve the k equations for the k unknowns (β*1, β*2 …. β*k). (A matrix sketch follows below.)
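
In matrix form the normal equations read Rxx·β* = rxy, so β* = Rxx⁻¹·rxy. A minimal sketch in Stata, again with the illustrative auto data:

* Solve the k normal equations for the standardized coefficients
sysuse auto, clear
quietly correlate price mpg weight length
matrix C = r(C)
matrix Rxx = C[2..4, 2..4]      // correlations among the X's
matrix rxy = C[2..4, 1]         // correlations of each X with Y
matrix bstar = invsym(Rxx) * rxy
matrix list bstar               // matches: regress price mpg weight length, beta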
. correlate API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR
(obs=10082)

             |   API13   MEALS  AVG_ED    P_EL  P_GATE    EMER    DMOB  PCT_AA  PCT_AI  PCT_AS
-------------+----------------------------------------------------------------------------------
       API13 |  1.0000
       MEALS | -0.4876  1.0000
      AVG_ED |  0.6736 -0.8232  1.0000
        P_EL | -0.3039  0.6149 -0.6526  1.0000
      P_GATE |  0.2827 -0.1631  0.2126 -0.1564  1.0000
        EMER | -0.0987  0.0197 -0.0407 -0.0211 -0.0541  1.0000
        DMOB |  0.5413 -0.0693  0.2123  0.0231  0.2198 -0.0487  1.0000
      PCT_AA | -0.2215  0.1625 -0.1057 -0.0718  0.0334  0.1380 -0.1306  1.0000
      PCT_AI | -0.1388  0.0461 -0.0246 -0.1510 -0.0812  0.0180 -0.1138 -0.0684  1.0000
      PCT_AS |  0.3813 -0.3031  0.3946 -0.0954  0.2321 -0.0247  0.1620 -0.0475 -0.0902  1.0000
      PCT_FI |  0.1646 -0.1221  0.1687 -0.0526  0.1281  0.0007  0.1203  0.0578 -0.0788  0.2485
      PCT_HI | -0.4301  0.6923 -0.8007  0.7143 -0.1296 -0.0192 -0.0193 -0.0911 -0.1834 -0.3733
      PCT_PI | -0.0598  0.0533 -0.0228  0.0286  0.0091  0.0315 -0.0202  0.2195 -0.0311  0.0748
      PCT_MR |  0.1468 -0.3714  0.3933 -0.3322  0.0052  0.0102 -0.0928 -0.0053  0.0667  0.0904

             |  PCT_FI  PCT_HI  PCT_PI  PCT_MR
-------------+----------------------------------
      PCT_FI |  1.0000
      PCT_HI | -0.1488  1.0000
      PCT_PI |  0.2769 -0.0763  1.0000
      PCT_MR |  0.0928 -0.4700  0.0611  1.0000
API13 Academic Performance Index 2013
MEALS Percent Free or Reduced Price Meal Eligible
AVG_ED Average Parent Education Level (1-5)
P_EL Percent English Learner
P_GATE Percent in Gifted And Talented Education Program
EMER Percent Teachers with Emergency Credentials
DMOB Percent Students Enrolled in District w/o a 30-Day Gap in Enrollment
PCT_AA Percent African American
PCT_AI Percent American Indian or Alaska Native
PCT_AS Percent Asian
PCT_FI Percent Filipino
PCT_HI Percent Hispanic or Latino
PCT_PI Percent Native Hawaiian or Pacific Islander
PCT_MR Percent Mixed Race
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<6, beta

      Source |       SS        df       MS             Number of obs =   10082
-------------+------------------------------          F(  6, 10075) = 2947.08
       Model |  65503313.6     6  10917218.9          Prob > F      =  0.0000
    Residual |  37321960.3 10075  3704.41293          R-squared     =  0.6370
-------------+------------------------------          Adj R-squared =  0.6368
       Total |   102825274 10081  10199.9081          Root MSE      =  60.864

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |   .1843877   .0394747     4.67   0.000                 .0508435
      AVG_ED |   92.81476   1.575453    58.91   0.000                 .6976283
        P_EL |   .6984374   .0469403    14.88   0.000                 .1225343
      P_GATE |   .8179836   .0666113    12.28   0.000                 .0769699
        EMER |  -1.095043   .1424199    -7.69   0.000                 -.046344
        DMOB |   4.715438   .0817277    57.70   0.000                 .3746754
       _cons |   52.79082   8.491632     6.22   0.000                        .
------------------------------------------------------------------------------
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6, beta

      Source |       SS        df       MS             Number of obs =   10082
-------------+------------------------------          F( 13, 10068) = 1488.01
       Model |    67627352    13     5202104          Prob > F      =  0.0000
    Residual |  35197921.9 10068  3496.01926          R-squared     =  0.6577
-------------+------------------------------          Adj R-squared =  0.6572
       Total |   102825274 10081  10199.9081          Root MSE      =  59.127

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |    .370891   .0395857     9.37   0.000                 .1022703
      AVG_ED |   89.51041   1.851184    48.35   0.000                 .6727917
        P_EL |   .2773577   .0526058     5.27   0.000                 .0486598
      P_GATE |   .7084009   .0664352    10.66   0.000                 .0666584
        EMER |  -.7563048   .1396315    -5.42   0.000                 -.032008
        DMOB |   4.398746   .0817144    53.83   0.000                  .349512
      PCT_AA |  -1.096513   .0651923   -16.82   0.000                -.1112841
      PCT_AI |  -1.731408   .1560803   -11.09   0.000                -.0718944
      PCT_AS |   .5951273   .0585275    10.17   0.000                 .0715228
      PCT_FI |   .2598189   .1650952     1.57   0.116                 .0099543
      PCT_HI |   .0231088   .0445723     0.52   0.604                 .0066676
      PCT_PI |  -2.745531   .6295791    -4.36   0.000                -.0274142
      PCT_MR |  -.8061266   .1838885    -4.38   0.000                -.0295927
       _cons |   96.52733   9.305661    10.37   0.000                        .
------------------------------------------------------------------------------
F(K2–K1),(N–K2–1) = [(R2² – R1²)/(K2 – K1)] / [(1 – R2²)/(N – K2 – 1)]

F7,10068 = [(.6577 – .6370)/(13 – 6)] / [(1 – .6577)/(10082 – 13 – 1)] = (.0207/7)/(.3423/10068) ≈ 86.98

well above any conventional critical value, so the seven added variables significantly improve the fit.
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR, vce(hc3) beta

Linear regression                                      Number of obs =   10082
                                                       F( 13, 10068) = 1439.49
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.6577
                                                       Root MSE      =  59.127

------------------------------------------------------------------------------
             | Robust HC3
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |    .370891   .0576739     6.43   0.000                 .1022703
      AVG_ED |   89.51041   2.651275    33.76   0.000                 .6727917
        P_EL |   .2773577   .0646176     4.29   0.000                 .0486598
      P_GATE |   .7084009   .0624278    11.35   0.000                 .0666584
        EMER |  -.7563048   .2248352    -3.36   0.001                 -.032008
        DMOB |   4.398746   .1645831    26.73   0.000                  .349512
      PCT_AA |  -1.096513   .0799674   -13.71   0.000                -.1112841
      PCT_AI |  -1.731408   .2257328    -7.67   0.000                -.0718944
      PCT_AS |   .5951273   .0492148    12.09   0.000                 .0715228
      PCT_FI |   .2598189   .1343712     1.93   0.053                 .0099543
      PCT_HI |   .0231088   .0511823     0.45   0.652                 .0066676
      PCT_PI |  -2.745531   .7471198    -3.67   0.000                -.0274142
      PCT_MR |  -.8061266   .2485255    -3.24   0.001                -.0295927
       _cons |   96.52733   16.89459     5.71   0.000                        .
------------------------------------------------------------------------------
Notice that the coefficients, the betas, the R-squared, and the Root MSE are unchanged. The standard errors are different, and so are the t values; therefore the P values change as well.
Look at PCT_FI: it is now almost significant at the .05 level (P = 0.053), while on the previous slide its P value was .116.
[Residuals-versus-fitted plot (residuals about -400 to 600, fitted values about 200 to 1000), with the largest outliers labeled.]

GOOD ONES (largest positive residuals)
   Residual   Name                                         Tested/Enrolled
   506.0523   Muir Charter                                 78/78
   488.5563   SIATech                                      65/66
   342.7693   Escuela Popular/Center for Training and      88/91
   280.2587   YouthBuild Charter School of California      78/78
   246.7804   Oakland Charter Academy                      238/238
   232.4897   Oakland Charter High                         146/146
   230.0739   Opportunities For Learning - Baldwin Par     1434/1442

BAD ONES (largest negative residuals)
   Residual   Name                                         Tested/Enrolled
  -399.4998   Sierra Vista High (SD)                       14/15
  -342.2773   Baden High (Continuation)                    73/73
  -336.5667   Dover Bridge to Success                      84/88
  -322.1879   Millennium High Alternative                  43/49
  -318.0444   Aurora High (Continuation)                   128/131
  -315.5069   Sunrise (Special Education)                  34/34
  -311.1326   Nueva Vista High                             20/28
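
A sketch of how such a list can be produced (SNAME, the school-name variable, is an assumption; substitute whatever identifier the dataset actually uses):

* Rank schools by their residuals from the fitted model
quietly regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI ///
    PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6
predict r, resid
gsort -r
list SNAME r in 1/7      // largest positive residuals (the "good ones")
gsort r
list SNAME r in 1/7      // largest negative residuals (the "bad ones")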
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6 [aweight = TESTED], beta
(sum of wgt is 9.0302e+06)

      Source |       SS        df       MS             Number of obs =   10082
-------------+------------------------------          F( 13, 10068) = 2324.54
       Model |  41089704.2    13  3160746.48          Prob > F      =  0.0000
    Residual |  13689769.3 10068  1359.73076          R-squared     =  0.7501
-------------+------------------------------          Adj R-squared =  0.7498
       Total |  54779473.6 10081   5433.9325          Root MSE      =  36.875

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |   .2401007    .032364     7.42   0.000                 .0828479
      AVG_ED |   83.84621   1.444873    58.03   0.000                 .8044588
        P_EL |   .1605591   .0405248     3.96   0.000                 .0306712
      P_GATE |   .2649964   .0443791     5.97   0.000                 .0317522
        EMER |  -1.527603   .1503635   -10.16   0.000                -.0513386
        DMOB |   3.414537   .0834016    40.94   0.000                 .2212861
      PCT_AA |  -1.275241   .0583403   -21.86   0.000                -.1301146
      PCT_AI |   -1.96138   .2143326    -9.15   0.000                -.0499468
      PCT_AS |   .4787539   .0368303    13.00   0.000                  .082836
      PCT_FI |  -.0272983   .1113346    -0.25   0.806                -.0013581
      PCT_HI |   .0440935   .0351466     1.25   0.210                 .0158328
      PCT_PI |  -2.464109   .5116525    -4.82   0.000                -.0271533
      PCT_MR |  -.5071886   .1678521    -3.02   0.003                -.0187953
       _cons |   220.2237   9.318893    23.63   0.000                        .
------------------------------------------------------------------------------
Characteristics of OLS if the sample is a probability sample:

 Unbiased: E(b) = β – the mean sample value is the population value
 Efficient: Var(b) is at a minimum – the sample values are as close to each other as possible
 Consistent: lim(n→∞) Pr(|bn – β| < δ) = 1 for any δ > 0 – as the sample size (n) approaches infinity, the sample value converges on the population value
If the following assumptions are met:

The model is
 complete
 linear
 additive

Variables are
 measured on an interval or ratio scale
 measured without error

The regression error term
 is normally distributed
 has an expected value of 0
 is independent across observations
 is homoscedastic (constant variance)
 is unrelated to the predictors
 in a system of interrelated equations, the errors are unrelated to each other