Logic of Multivariate Analysis

Multiple Regression

Why multivariate analysis?

Nothing happens by a single cause. If it did:
  - it would imply perfect determinism
  - it would imply perfect/divine measurement
  - it would be impossible to separate cause from effect (where does effect start and where does cause end?)

Social reality is notoriously multi-causal, even more so than certain physical/chemical/biological processes.

People are not just objects but also subjects of causal processes: reflexivity, agency, framing etc. (Some of these are hard to capture in statistical models.)

Criteria for establishing a causal relationship:
  #1. Empirical Association
  #2. Appropriate Time Order
  #3. Non-Spuriousness (Excluding other Forms of Causation)

Mill tells us that even individual causal relationships cannot be established without multivariate analysis (#3).

Suppose we suspect X causes Y: $Y = f(X, e)$

Suppose we establish that X is related to Y (#1) and X precedes Y (#2). But what if both X and Y are the result of Z, a third variable?

E.g. Academic Performance = f(Poverty, e)

If that were true, redistributing income should help academic achievement. But maybe both are the result of parents' education (a confounding factor).

[Diagram 1: Poverty -> Academic Performance (negative effect), with error e.]
[Diagram 2: Parents' Education -> Poverty (negative effect, error e1) and Parents' Education -> Academic Performance (positive effect, error e2); no direct path from Poverty to Academic Performance.]

Eliminating or "controlling for" other, confounding factors (Z)

Experiments: treatment (X) is introduced by the researcher
  1. Physical control: excluding factors by physical design (physical control of Zs)
  2. Randomization: random assignment to treatment and control (randomized control of Zs)

Observational research: no manipulation by the researcher
  3. Quasi-experiments: found experiments. Choice of cases that are "minimum pairs": they are the same on most confounding factors (Zs) but they are different in the treatment (X)
  4. Statistical manipulation: removing the effect of Z from the relationship between Y and X
     - Organizing data into groups homogeneous by the control variable Z and looking at the relationship between treatment X and response Y within each group. If Y still moves together with X, it cannot be because they are moved by Z: Z is constant. If Z is the cause of Y and Z is constant, Y must be constant too.
     - Residualizing X on Z, then residualizing Y on Z. That leaves us with the part of X and the part of Y that are unrelated to Z. If the two residualized variables still move together, that cannot be because they are moved by Z.

Remember: in a regression the error is always unrelated to the independent variable(s).

Residualizing (we 'take out,' 'eliminate' Z from both Y and X)

  $Y_i = a' + b' Z_i + e_{yz,i}$
  $X_i = a'' + b'' Z_i + e_{xz,i}$
  $e_{yz,i} = a^* + b^* e_{xz,i} + e_i$

Compare with the multiple regression

  $Y_i = a + b_1 X_i + b_2 Z_i + e_i$

  $a^* = 0$ and $b^* = b_1$

Conditional Effect of X on Y Controlling for Z
(the interpretation depends on the temporal position of Z vis-à-vis X)

  No change
    - Antecedent variable (Z precedes both X and Y): Z is not a factor
    - Intervening variable (Z precedes Y but not X): Z is not a factor

  Zero or statistically not significant
    - Antecedent variable: spurious association. X is not a factor (Z is their common cause)
    - Intervening variable: explanation or chain relationship. X is a factor but only through Z (X does not have a direct or independent effect)

  Weaker but statistically significant
    - Antecedent variable: X is a factor but some of its original effect is spurious
    - Intervening variable: X is a factor and it affects Y both through Z and directly (or through other variables missing from the model)

  Stronger than the unconditional effect
    - Antecedent variable: suppression
    - Intervening variable: suppression

  Uneven among the categories of Z
    - Antecedent variable: statistical interaction (X works differently depending on the values of Z)
    - Intervening variable: statistical interaction (X works differently depending on the values of Z)

The model:

  $Y_i = a + b_1 X_i + b_2 Z_i + e_i$   or   $Y_i = a + b_1 X_{1i} + b_2 X_{2i} + e_i$

To obtain $a$, $b_1$, and $b_2$ we first calculate $\beta^*_1$ and $\beta^*_2$ from the standardized regression. Then we transform them into their metric equivalents. Finally we obtain $a$ with the help of the means of Y, X1 and X2.
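The residualizing argument above ($a^* = 0$, $b^* = b_1$) can be checked numerically. What follows is a minimal Python sketch, not part of the original slides: the toy data and the helper names `simple_ols` and `partial_slope` are made up for illustration, and the two-predictor slope is computed from centered sums of squares and cross-products.

```python
def mean(v):
    return sum(v) / len(v)

def simple_ols(y, x):
    """Regress y on x; return (intercept, slope, residuals)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return intercept, slope, resid

def partial_slope(y, x, z):
    """b1 in Y = a + b1*X + b2*Z + e, from centered sums of squares and cross-products."""
    mx, my, mz = mean(x), mean(y), mean(z)
    sxx = sum((xi - mx) ** 2 for xi in x)
    szz = sum((zi - mz) ** 2 for zi in z)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    szy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)

# Hypothetical toy data (not from the slides)
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 4, 3, 6, 5, 8, 9]
y = [3, 5, 4, 8, 7, 10, 12, 11]

_, _, e_yz = simple_ols(y, z)               # Y with Z taken out
_, _, e_xz = simple_ols(x, z)               # X with Z taken out
a_star, b_star, _ = simple_ols(e_yz, e_xz)  # residual-on-residual regression
b1 = partial_slope(y, x, z)                 # slope of X in the multiple regression

print("a* =", a_star)
print("b* =", b_star, " b1 =", b1)
```

Up to floating-point error, the intercept of the residual-on-residual regression is zero and its slope matches the slope of X in the regression that includes Z.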
The standardized regression:

  $Z_{Y_i} = \beta^*_1 Z_{X_{1i}} + \beta^*_2 Z_{X_{2i}} + e_i$

1. We multiply each side by $Z_{X_{1i}}$:

  $Z_{X_{1i}} Z_{Y_i} = \beta^*_1 Z_{X_{1i}} Z_{X_{1i}} + \beta^*_2 Z_{X_{1i}} Z_{X_{2i}} + Z_{X_{1i}} e_i$

We sum across all cases and divide by n:

  $\frac{\sum Z_{Y_i} Z_{X_{1i}}}{n} = \beta^*_1 \frac{\sum Z_{X_{1i}} Z_{X_{1i}}}{n} + \beta^*_2 \frac{\sum Z_{X_{1i}} Z_{X_{2i}}}{n} + \frac{\sum Z_{X_{1i}} e_i}{n}$

The mean product of two sets of Z scores is their correlation; the correlation of X1 with itself is 1 and its correlation with the error is 0. We get our first normal equation (for the correlation between Y and X1):

  $r_{YX_1} = \beta^*_1 + \beta^*_2 r_{X_2X_1}$

We get an expression for $\beta^*_1$:

  $\beta^*_1 = r_{YX_1} - \beta^*_2 r_{X_2X_1}$

2. We multiply each side by $Z_{X_{2i}}$ and repeat what we did. We get our second normal equation (for the correlation between Y and X2):

  $r_{YX_2} = \beta^*_1 r_{X_2X_1} + \beta^*_2$

Plugging in for $\beta^*_1$:

  $r_{YX_2} = (r_{YX_1} - \beta^*_2 r_{X_2X_1}) r_{X_2X_1} + \beta^*_2$

  $\beta^*_2 = \frac{r_{YX_2} - r_{YX_1} r_{X_2X_1}}{1 - r^2_{X_2X_1}}$

  $\beta^*_1 = \frac{r_{YX_1} - r_{YX_2} r_{X_2X_1}}{1 - r^2_{X_2X_1}}$

Both standardized coefficients can be expressed in terms of the three correlations among Y, X1 and X2.

To get the metric coefficients, we multiply each standardized coefficient by the ratio of the standard deviation of the dependent variable to the standard deviation of the independent variable to which it belongs:

  $b_1 = \beta^*_1 \frac{S_Y}{S_{X_1}}$        $b_2 = \beta^*_2 \frac{S_Y}{S_{X_2}}$

Take the two normal equations:

  $r_{YX_1} = \beta^*_1 + \beta^*_2 r_{X_2X_1}$
  $r_{YX_2} = \beta^*_1 r_{X_2X_1} + \beta^*_2$

What do we learn from the normal equations?

If either $\beta^*_2 = 0$ or $r_{X_1X_2} = 0$, the unconditional effect does not change once we control for X2.

We get suppression only if $\beta^*_2 \neq 0$ and $r_{X_1X_2} \neq 0$, and the two are of opposite signs if the unconditional effect is positive and of the same sign if the unconditional effect is negative.

The correlation (unconditional effect) of X1 or X2 on Y can be decomposed into two parts. Take X1:
  - the direct (or net) effect of X1 on Y ($\beta^*_1$), controlling for X2, and
  - something else that is the product of the direct (or net) effect of X2 on Y ($\beta^*_2$) and the correlation between X1 and X2 ($r_{X_1X_2}$), the measure of multicollinearity between the two independent variables.

http://www.miabella-llc.com/demo.html

Example: Academic Performance (AP), Poverty (P), Parents' Education (PE)

  AP = f(P, e1)          $Z_{AP} = \beta^{*\prime}_1 Z_P + e_1$
  AP = f(P, PE, e)       $Z_{AP} = \beta^*_1 Z_P + \beta^*_2 Z_{PE} + e$

[Diagram 1: Poverty -> Academic Performance with coefficient $\beta^{*\prime}_1$ and error e1.]
[Diagram 2: Poverty -> Academic Performance ($\beta^*_1$) and Parents' Education -> Academic Performance ($\beta^*_2$), with error e.]

. correlate AVG_ED API13 MEALS, means
(obs=10173)

    Variable |       Mean   Std. Dev.        Min        Max
-------------+----------------------------------------------
      AVG_ED |   2.781778    .758739          1          5
       API13 |    784.182   102.2096        311        999
       MEALS |   58.57338    27.9053          0        100

             |   AVG_ED    API13    MEALS
-------------+---------------------------
      AVG_ED |   1.0000
       API13 |   0.6706   1.0000
       MEALS |  -0.8178  -0.4743   1.0000

. regress API13 AVG_ED MEALS, beta

      Source |       SS       df       MS              Number of obs =   10173
-------------+------------------------------           F(  2, 10170) = 4441.76
       Model |    49544993     2  24772496.5           Prob > F      =  0.0000
    Residual |  56719871.2 10170  5577.17514           R-squared     =  0.4662
-------------+------------------------------           Adj R-squared =  0.4661
       Total |   106264864 10172  10446.8014           Root MSE      =   74.68

       API13 |      Coef.  Std. Err.        t   P>|t|       Beta
-------------+--------------------------------------------------
      AVG_ED |   114.9596   1.695597    67.80   0.000    .853387
       MEALS |   .8187537   .0461029    17.76   0.000   .2235364
       _cons |   416.4326   7.135849    58.36   0.000          .

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
    Ho: Constant variance
    Variables: fitted values of API13

    chi2(1)      =  1332.01
    Prob > chi2  =   0.0000

[Figure: residuals plotted against fitted values.]

Decomposing the correlation between Poverty (X1 = MEALS) and Academic Performance

  Unconditional effect: $r_{YX_1} = \beta^{*\prime}_1 = -.4743$
  Controlling for Parents' Education: $\beta^*_1 = .2235364$, $\beta^*_2 = .853387$, $r_{X_1X_2} = -.8178$

  $r_{YX_1} = \beta^*_1 + \beta^*_2 r_{X_2X_1} = .2235364 + .853387 \times (-.8178) = .2235364 - .69789989 = -.4743$

The second term (-.69789989) is the spurious indirect effect.

Decomposing the correlation between Parents' Education (X2 = AVG_ED) and Academic Performance

  Unconditional effect: $r_{YX_2} = \beta^{*\prime}_2 = .6706$

  $r_{YX_2} = \beta^*_2 + \beta^*_1 r_{X_2X_1} = .853387 + .2235364 \times (-.8178) = .853387 - .18280807 = .6706$

The second term (-.18280807) is the indirect effect.

Venn diagram

R-square = unique contribution by X1 + unique contribution by X2 + common contribution by both X1 and X2

[Venn diagram: overlapping circles for y, x1 and x2.]

Multicollinearity

Unique contributions are small and statistically non-significant, yet R-square is large because the common contribution is large.

[Venn diagram: y, x1 and x2, with x1 and x2 overlapping heavily.]

Comparing theories

How much does a theory add to an already existing one? Calculating the contribution of a set of variables:

  $F_{(K_2 - K_1),\,(N - K_2 - 1)} = \frac{(R^2_2 - R^2_1)/(K_2 - K_1)}{(1 - R^2_2)/(N - K_2 - 1)}$

where $R^2_1$ is the fit of the reduced/smaller model, $R^2_2$ is the fit of the full/complete model, $K_1$ is the number of independent variables in the reduced model, $K_2$ is the number of independent variables in the complete model, and $N$ is the sample size.

Warning: You have to make sure you use the exact same cases for each model!

Adding a new independent variable will never lower R-square and will almost always raise it, even if the variable is unrelated to the dependent variable in the population. We have to consider the parsimony (number of independent variables) of the model relative to the sample size.

For N=2, a simple regression will always have a perfect fit.

General rule: N-1 independent variables will always result in an R-square of 1 no matter what those variables are (as long as they are not perfectly collinear).

Adjusted R-square

  $R^2_{adj} = R^2 - \frac{K(1 - R^2)}{N - K - 1}$

The general case with k independent variables:

  $Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + e_i$

If we standardize Y, X1 ... Xk, turning them into Z scores, we can re-write the equation as

  $Z_{Y_i} = \beta^*_1 Z_{X_{1i}} + \beta^*_2 Z_{X_{2i}} + \dots + \beta^*_k Z_{X_{ki}} + e_i$

To find the coefficients we have to write out k normal equations, one for the correlation between each independent variable and the dependent variable:

  $r_{YX_1} = \beta^*_1 + \beta^*_2 r_{X_1X_2} + \dots + \beta^*_k r_{X_1X_k}$
  $r_{YX_2} = \beta^*_1 r_{X_1X_2} + \beta^*_2 + \dots + \beta^*_k r_{X_2X_k}$
  ...
  $r_{YX_k} = \beta^*_1 r_{X_1X_k} + \beta^*_2 r_{X_2X_k} + \dots + \beta^*_k$

and solve k equations for k unknowns ($\beta^*_1, \beta^*_2, \dots, \beta^*_k$).
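Solving the k normal equations is a linear-algebra problem: the matrix of correlations among the predictors times the vector of standardized coefficients equals the vector of correlations with Y. Below is a minimal Python sketch under stated assumptions: `solve_linear` is an illustrative helper (plain Gauss-Jordan elimination, not anything Stata exposes), and the inputs are the rounded correlations, means and standard deviations from the two-predictor API13 example (X1 = MEALS, X2 = AVG_ED). The same function works for any k.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Correlations among the predictors and with Y (API13), as printed by correlate
r_xx = [[1.0, -0.8178],
        [-0.8178, 1.0]]
r_yx = [-0.4743, 0.6706]

# Step 1: standardized coefficients from the normal equations
betas = solve_linear(r_xx, r_yx)

# Step 2: metric coefficients, b_j = beta_j * S_Y / S_Xj
sd_y, sd_x = 102.2096, [27.9053, 0.758739]
b = [beta * sd_y / s for beta, s in zip(betas, sd_x)]

# Step 3: intercept from the means, a = mean(Y) - sum of b_j * mean(X_j)
mean_y, mean_x = 784.182, [58.57338, 2.781778]
a = mean_y - sum(bj * mj for bj, mj in zip(b, mean_x))

print("betas:", betas)
print("b:", b)
print("a:", a)
```

Because the inputs are rounded to four decimals, the results land close to, but not exactly on, the Beta, Coef. and _cons values in the regress output.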
. correlate API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR
(obs=10082)

             |    API13    MEALS   AVG_ED     P_EL   P_GATE     EMER     DMOB   PCT_AA   PCT_AI   PCT_AS
-------------+------------------------------------------------------------------------------------------
       API13 |   1.0000
       MEALS |  -0.4876   1.0000
      AVG_ED |   0.6736  -0.8232   1.0000
        P_EL |  -0.3039   0.6149  -0.6526   1.0000
      P_GATE |   0.2827  -0.1631   0.2126  -0.1564   1.0000
        EMER |  -0.0987   0.0197  -0.0407  -0.0211  -0.0541   1.0000
        DMOB |   0.5413  -0.0693   0.2123   0.0231   0.2198  -0.0487   1.0000
      PCT_AA |  -0.2215   0.1625  -0.1057  -0.0718   0.0334   0.1380  -0.1306   1.0000
      PCT_AI |  -0.1388   0.0461  -0.0246  -0.1510  -0.0812   0.0180  -0.1138  -0.0684   1.0000
      PCT_AS |   0.3813  -0.3031   0.3946  -0.0954   0.2321  -0.0247   0.1620  -0.0475  -0.0902   1.0000
      PCT_FI |   0.1646  -0.1221   0.1687  -0.0526   0.1281   0.0007   0.1203   0.0578  -0.0788   0.2485
      PCT_HI |  -0.4301   0.6923  -0.8007   0.7143  -0.1296  -0.0192  -0.0193  -0.0911  -0.1834  -0.3733
      PCT_PI |  -0.0598   0.0533  -0.0228   0.0286   0.0091   0.0315  -0.0202   0.2195  -0.0311   0.0748
      PCT_MR |   0.1468  -0.3714   0.3933  -0.3322   0.0052   0.0102  -0.0928  -0.0053   0.0667   0.0904

             |   PCT_FI   PCT_HI   PCT_PI   PCT_MR
-------------+------------------------------------
      PCT_FI |   1.0000
      PCT_HI |  -0.1488   1.0000
      PCT_PI |   0.2769  -0.0763   1.0000
      PCT_MR |   0.0928  -0.4700   0.0611   1.0000

Variables:
  API13    Academic Performance Index 2013
  MEALS    Percent Free or Reduced Price Meal Eligible
  AVG_ED   Average Parent Education Level (1-5)
  P_EL     Percent English Learner
  P_GATE   Percent in Gifted And Talented Education Program
  EMER     Percent Teachers with Emergency Credentials
  DMOB     Percent Students Enrolled in District w/o 30 Gap in Enrollment
  PCT_AA   Percent African American
  PCT_AI   Percent American Indian or Alaska Native
  PCT_AS   Percent Asian
  PCT_FI   Percent Filipino
  PCT_HI   Percent Hispanic or Latino
  PCT_PI   Percent Native Hawaiian or Pacific Islander
  PCT_MR   Percent Mixed Race
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<6, beta

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F(  6, 10075) = 2947.08
       Model |  65503313.6     6  10917218.9           Prob > F      =  0.0000
    Residual |  37321960.3 10075  3704.41293           R-squared     =  0.6370
-------------+------------------------------           Adj R-squared =  0.6368
       Total |   102825274 10081  10199.9081           Root MSE      =  60.864

       API13 |      Coef.  Std. Err.        t   P>|t|       Beta
-------------+--------------------------------------------------
       MEALS |   .1843877   .0394747     4.67   0.000   .0508435
      AVG_ED |   92.81476   1.575453    58.91   0.000   .6976283
        P_EL |   .6984374   .0469403    14.88   0.000   .1225343
      P_GATE |   .8179836   .0666113    12.28   0.000   .0769699
        EMER |  -1.095043   .1424199    -7.69   0.000   -.046344
        DMOB |   4.715438   .0817277    57.70   0.000   .3746754
       _cons |   52.79082   8.491632     6.22   0.000          .

. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6, beta

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F( 13, 10068) = 1488.01
       Model |    67627352    13     5202104           Prob > F      =  0.0000
    Residual |  35197921.9 10068  3496.01926           R-squared     =  0.6577
-------------+------------------------------           Adj R-squared =  0.6572
       Total |   102825274 10081  10199.9081           Root MSE      =  59.127

       API13 |      Coef.  Std. Err.        t   P>|t|       Beta
-------------+--------------------------------------------------
       MEALS |    .370891   .0395857     9.37   0.000   .1022703
      AVG_ED |   89.51041   1.851184    48.35   0.000   .6727917
        P_EL |   .2773577   .0526058     5.27   0.000   .0486598
      P_GATE |   .7084009   .0664352    10.66   0.000   .0666584
        EMER |  -.7563048   .1396315    -5.42   0.000   -.032008
        DMOB |   4.398746   .0817144    53.83   0.000    .349512
      PCT_AA |  -1.096513   .0651923   -16.82   0.000  -.1112841
      PCT_AI |  -1.731408   .1560803   -11.09   0.000  -.0718944
      PCT_AS |   .5951273   .0585275    10.17   0.000   .0715228
      PCT_FI |   .2598189   .1650952     1.57   0.116   .0099543
      PCT_HI |   .0231088   .0445723     0.52   0.604   .0066676
      PCT_PI |  -2.745531   .6295791    -4.36   0.000  -.0274142
      PCT_MR |  -.8061266   .1838885    -4.38   0.000  -.0295927
       _cons |   96.52733   9.305661    10.37   0.000          .

Does the set of seven ethnic composition variables add to the fit? With N = 10082, K1 = 6 and K2 = 13:

  $F_{(K_2 - K_1),\,(N - K_2 - 1)} = \frac{(R^2_2 - R^2_1)/(K_2 - K_1)}{(1 - R^2_2)/(N - K_2 - 1)}$

  $F_{7,\,10068} = \frac{(.6577 - .6370)/(13 - 6)}{(1 - .6577)/(10082 - 13 - 1)} = \frac{.0207/7}{.3423/10068} \approx 86.98$

. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR, vce(hc3) beta

Linear regression                                      Number of obs =   10082
                                                       F( 13, 10068) = 1439.49
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.6577
                                                       Root MSE      =  59.127

             |             Robust HC3
       API13 |      Coef.  Std. Err.        t   P>|t|       Beta
-------------+--------------------------------------------------
       MEALS |    .370891   .0576739     6.43   0.000   .1022703
      AVG_ED |   89.51041   2.651275    33.76   0.000   .6727917
        P_EL |   .2773577   .0646176     4.29   0.000   .0486598
      P_GATE |   .7084009   .0624278    11.35   0.000   .0666584
        EMER |  -.7563048   .2248352    -3.36   0.001   -.032008
        DMOB |   4.398746   .1645831    26.73   0.000    .349512
      PCT_AA |  -1.096513   .0799674   -13.71   0.000  -.1112841
      PCT_AI |  -1.731408   .2257328    -7.67   0.000  -.0718944
      PCT_AS |   .5951273   .0492148    12.09   0.000   .0715228
      PCT_FI |   .2598189   .1343712     1.93   0.053   .0099543
      PCT_HI |   .0231088   .0511823     0.45   0.652   .0066676
      PCT_PI |  -2.745531   .7471198    -3.67   0.000  -.0274142
      PCT_MR |  -.8061266   .2485255    -3.24   0.001  -.0295927
       _cons |   96.52733   16.89459     5.71   0.000          .

Notice that the coefficients, the betas, the R-squared and the Root MSE are unchanged. The Std. Err.s are different, and so are the t values, and therefore the P values also change. Look at PCT_FI: with robust standard errors it is almost significant at the .05 level (P = .053). In the previous, non-robust output its P value is .116.

[Figure: residuals plotted against fitted values. The schools with the largest positive and negative residuals are listed below.]

GOOD ONES (largest positive residuals)

    Residual   Name                                        Tested/Enrolled
    506.0523   Muir Charter                                78/78
    488.5563   SIATech                                     65/66
    342.7693   Escuela Popular/Center for Training and     88/91
    280.2587   YouthBuild Charter School of California     78/78
    246.7804   Oakland Charter Academy                     238/238
    232.4897   Oakland Charter High                        146/146
    230.0739   Opportunities For Learning - Baldwin Par    1434/1442

BAD ONES (largest negative residuals)

    Residual   Name                                        Tested/Enrolled
   -399.4998   Sierra Vista High (SD)                      14/15
   -342.2773   Baden High (Continuation)                   73/73
   -336.5667   Dover Bridge to Success                     84/88
   -322.1879   Millennium High Alternative                 43/49
   -318.0444   Aurora High (Continuation)                  128/131
   -315.5069   Sunrise (Special Education)                 34/34
   -311.1326   Nueva Vista High                            20/28
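The incremental F test and the adjusted R-square defined earlier are plain arithmetic on quantities that Stata already reports. The following is a minimal Python sketch (the function names are illustrative, not from the slides) that plugs in the R-squared values, the predictor counts and N = 10082 printed in the headers of the two nested regressions above.

```python
def incremental_f(r2_full, r2_reduced, k_full, k_reduced, n):
    """F statistic for the variables added to a nested model, with its degrees of freedom."""
    df1 = k_full - k_reduced
    df2 = n - k_full - 1
    f_stat = ((r2_full - r2_reduced) / df1) / ((1 - r2_full) / df2)
    return f_stat, df1, df2

def adjusted_r2(r2, k, n):
    """Adjusted R-square: R2 - K(1 - R2) / (N - K - 1)."""
    return r2 - k * (1 - r2) / (n - k - 1)

# Values copied from the two regress outputs (6 and 13 predictors, same 10082 cases)
f_stat, df1, df2 = incremental_f(0.6577, 0.6370, 13, 6, 10082)
adj_reduced = adjusted_r2(0.6370, 6, 10082)
adj_full = adjusted_r2(0.6577, 13, 10082)

print("F(%d, %d) = %.2f" % (df1, df2, f_stat))
print("adjusted R2, reduced model: %.4f" % adj_reduced)
print("adjusted R2, full model:    %.4f" % adj_full)
```

With these inputs F comes out at roughly 87 on 7 and 10068 degrees of freedom, and the adjusted values agree with Stata's Adj R-squared lines up to rounding of the reported R-squared.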
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6 [aweight = TESTED], beta
(sum of wgt is 9.0302e+06)

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F( 13, 10068) = 2324.54
       Model |  41089704.2    13  3160746.48           Prob > F      =  0.0000
    Residual |  13689769.3 10068  1359.73076           R-squared     =  0.7501
-------------+------------------------------           Adj R-squared =  0.7498
       Total |  54779473.6 10081   5433.9325           Root MSE      =  36.875

       API13 |      Coef.  Std. Err.        t   P>|t|       Beta
-------------+--------------------------------------------------
       MEALS |   .2401007    .032364     7.42   0.000   .0828479
      AVG_ED |   83.84621   1.444873    58.03   0.000   .8044588
        P_EL |   .1605591   .0405248     3.96   0.000   .0306712
      P_GATE |   .2649964   .0443791     5.97   0.000   .0317522
        EMER |  -1.527603   .1503635   -10.16   0.000  -.0513386
        DMOB |   3.414537   .0834016    40.94   0.000   .2212861
      PCT_AA |  -1.275241   .0583403   -21.86   0.000  -.1301146
      PCT_AI |   -1.96138   .2143326    -9.15   0.000  -.0499468
      PCT_AS |   .4787539   .0368303    13.00   0.000    .082836
      PCT_FI |  -.0272983   .1113346    -0.25   0.806  -.0013581
      PCT_HI |   .0440935   .0351466     1.25   0.210   .0158328
      PCT_PI |  -2.464109   .5116525    -4.82   0.000  -.0271533
      PCT_MR |  -.5071886   .1678521    -3.02   0.003  -.0187953
       _cons |   220.2237   9.318893    23.63   0.000          .

Characteristics of OLS if the sample is a probability sample

  Unbiased:    $E(b) = \beta$
               the mean sample value is the population value
  Efficient:   $\min \sigma_b$
               the sample values are as close to each other as possible
  Consistent:  $\lim_{n \to \infty} \Pr(|b - \beta| < \varepsilon) = 1$
               as sample size (n) approaches infinity, the sample value converges on the population value

If the following assumptions are met:

  The model is
    - complete
    - linear
    - additive

  Variables are
    - measured at an interval or ratio scale
    - measured without error

  The regression error term
    - is normally distributed
    - has an expected value of 0
    - errors are independent
    - homoscedasticity
    - predictors are unrelated to error
    - in a system of interrelated equations the errors are unrelated to each other
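Unbiasedness and consistency can be illustrated with a small simulation. This is a hedged Python sketch, not part of the original slides: the data-generating process (true intercept 1, true slope 2, uniform X, normal errors), the sample sizes and the helper names are all assumptions chosen for illustration, and the assumptions listed above hold by construction.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope of the simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def simulate_slopes(n, reps, rng, true_a=1.0, true_b=2.0):
    """Draw `reps` samples of size n from Y = a + b*X + e and return the OLS slopes."""
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(0, 10) for _ in range(n)]
        y = [true_a + true_b * xi + rng.gauss(0, 1) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes

rng = random.Random(0)                # fixed seed so the run is reproducible
small = simulate_slopes(25, 400, rng)   # many small samples
large = simulate_slopes(400, 400, rng)  # many large samples

# Unbiased: the average estimate sits near the true slope at either sample size.
print("mean slope, n=25: ", statistics.mean(small))
print("mean slope, n=400:", statistics.mean(large))
# Consistent: the estimates cluster more tightly around the true slope as n grows.
print("sd of slopes, n=25: ", statistics.stdev(small))
print("sd of slopes, n=400:", statistics.stdev(large))
```

The simulation says nothing about what happens when the assumptions fail; dropping a predictor that is correlated with X from the data-generating equation, for example, would break the "predictors are unrelated to error" condition and shift the average estimate away from the true slope.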