中級社會統計 Lecture 14: Regression Diagnostics (迴歸診斷) ©Ming-chi Chen 社會統計 Page.1

Where can regression analysis go wrong?
• With statistical software making estimation so easy, multiple regression is often misused and abused.
• The problems usually stem from not understanding the assumptions behind multiple regression and the pitfalls they imply.
• Most problems arise when making causal inferences.
• For models used purely for prediction, the problems are less serious.
• The following draws on Paul Allison (1999).

Where can regression analysis go wrong?
• Model specification errors
  – Are important independent variables left out of the model?
  – Are irrelevant independent variables included?
• Does the dependent variable affect any of the independent variables?
• How well are the independent variables measured?

Where can regression analysis go wrong?
• Is the sample large enough to detect important effects?
• Is the sample so large that trivial effects are statistically significant?
• Do some variables mediate the effects of other variables?
• Are some independent variables too highly correlated?
• Is the sample biased?

14.1.1 模型設定錯誤-遺漏 Model Specification Errors: Omission

Problems of misspecifying the independent variables
• Why include a given independent variable in the regression equation?
  – To estimate the effect of this IV on the DV.
  – To control for this IV.
• Researchers often forget to include important control variables.
• What makes a control variable important?
  – Does it have a causal effect on the DV?
  – Is it correlated with the IVs we mainly care about?
• Why correlation? If the variable is unrelated to the other IVs, there is no need to control for it; we control in order to isolate the net relationship.

Problems of misspecification: omission
• Consequences of omitting an important IV:
• The regression coefficients are biased, either too high or too low.
• Without the control, the relationship between IV and DV may be spurious.

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Consequences of variable misspecification. The true model is either simple, $Y = \beta_1 + \beta_2 X_2 + u$, or multiple, $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$; the fitted model is either $\hat{Y} = b_1 + b_2 X_2$ or $\hat{Y} = b_1 + b_2 X_2 + b_3 X_3$. In this sequence and the next we will investigate the consequences of misspecifying the regression model in terms of explanatory variables. To keep the analysis simple, we will assume that there are only two possibilities: either Y depends only on X2, or it depends on both X2 and X3.
If Y depends only on X2 and we fit the simple regression, we will not encounter any problems, assuming of course that the regression model assumptions are valid. Likewise we will not encounter any problems if Y depends on both X2 and X3 and we fit the multiple regression. The four cases:

True model $Y = \beta_1 + \beta_2 X_2 + u$, fitted $\hat{Y} = b_1 + b_2 X_2$: correct specification, no problems.
True model $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$, fitted $\hat{Y} = b_1 + b_2 X_2 + b_3 X_3$: correct specification, no problems.
True model $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$, fitted $\hat{Y} = b_1 + b_2 X_2$: coefficients are biased (in general), and standard errors are invalid.

In this sequence we will examine the consequences of fitting a simple regression when the true model is multiple. In the next one we will do the opposite and examine the consequences of fitting a multiple regression when the true model is simple.
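The biased case can be made concrete with a small simulation. This is a stdlib-Python sketch, not part of the original slides; the coefficient values (true $\beta_2 = 2$, $\beta_3 = 3$) and the 0.5 dependence of X3 on X2 are invented for illustration:

```python
import random

random.seed(42)

def slope(x, y):
    """OLS slope of y on x: sum of deviation cross-products over sum of squares."""
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

# True model: Y = 1 + 2*X2 + 3*X3 + u, with X3 positively correlated with X2.
n = 5000
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [0.5 * a + random.gauss(0, 1) for a in x2]
y = [1 + 2 * a + 3 * b + random.gauss(0, 1) for a, b in zip(x2, x3)]

# Fit the misspecified simple regression of Y on X2 alone.
b2 = slope(x2, y)
print(round(b2, 2))  # roughly 2 + 3*0.5 = 3.5, far from the true beta2 = 2
```

Omitting X3 pushes the estimate of $\beta_2$ up by about $\beta_3$ times the slope of X3 on X2, which is exactly the bias term derived below.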
The omission of a relevant explanatory variable causes the regression coefficients to be biased and the standard errors to be invalid.

We will now derive the expression for the bias mathematically. The true model is $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$ and the fitted model is $\hat{Y} = b_1 + b_2 X_2$. It is convenient to start by deriving an expression for the deviation of $Y_i$ about its sample mean, in terms of the deviations of X2, X3, and u about their sample means:

$Y_i - \bar{Y} = \beta_2 (X_{2i} - \bar{X}_2) + \beta_3 (X_{3i} - \bar{X}_3) + (u_i - \bar{u})$

Although Y really depends on X3 as well as X2, we make a mistake and regress Y on X2 only. The slope coefficient is therefore

$b_2 = \dfrac{\sum (X_{2i} - \bar{X}_2)(Y_i - \bar{Y})}{\sum (X_{2i} - \bar{X}_2)^2}$

We substitute for the Y deviations and simplify:

$b_2 = \beta_2 + \beta_3 \dfrac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2} + \dfrac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2}$

Hence we have demonstrated that b2 has three components: the true value, a term involving X3, and an error term.
To investigate biasedness or unbiasedness, we take the expected value of b2. The first two terms are unaffected because they contain no random components, so we focus on the expectation of the error term:

$E(b_2) = \beta_2 + \beta_3 \dfrac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2} + E\!\left[\dfrac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2}\right]$

X2 is nonstochastic, so the denominator of the error term is nonstochastic and may be taken outside the expectation:

$E\!\left[\dfrac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2}\right] = \dfrac{1}{\sum (X_{2i} - \bar{X}_2)^2}\, E\!\left[\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})\right]$

In the numerator the expectation of a sum is equal to the sum of the expectations (first expected value rule), and in each product the factor involving X2 may be taken out of the expectation because X2 is nonstochastic:

$= \dfrac{1}{\sum (X_{2i} - \bar{X}_2)^2} \sum (X_{2i} - \bar{X}_2)\, E(u_i - \bar{u}) = 0$

By Assumption A.3, the expected value of u is 0. It follows that the expected value of the sample mean of u is also 0. Hence the expected value of the error term is 0.
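With the regressors held fixed, this expectation result can be checked numerically by redrawing only the disturbance term. A stdlib-Python sketch (all parameter values invented for illustration):

```python
import random

random.seed(7)

def slope(x, y):
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

# Nonstochastic regressors: drawn once, then held fixed, as in the derivation.
n, reps = 200, 2000
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [0.6 * a + random.gauss(0, 1) for a in x2]

beta2, beta3 = 1.5, 2.0
bias = beta3 * slope(x2, x3)  # beta3 * sum(x2 dev * x3 dev) / sum(x2 dev squared)

# Average b2 over many redraws of u only; X2 and X3 stay fixed throughout.
b2_mean = sum(
    slope(x2, [beta2 * a + beta3 * b + random.gauss(0, 1) for a, b in zip(x2, x3)])
    for _ in range(reps)
) / reps

print(round(b2_mean - (beta2 + bias), 3))  # close to zero
```

The average of b2 across replications settles on $\beta_2$ plus the bias term, confirming that the error term averages out to zero while the X3 term does not.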
Thus we have shown that the expected value of b2 is equal to the true value plus a bias term:

$E(b_2) = \beta_2 + \beta_3 \dfrac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2}$

Note: the bias of an estimator is defined as the difference between its expected value and the true value of the parameter being estimated. As a consequence of the misspecification, the standard errors, t tests and F test are invalid.

We will illustrate the bias using an educational attainment model. To keep the analysis simple, we will assume that in the true model, $S = \beta_1 + \beta_2\, ASVABC + \beta_3\, SM + u$, S (教育年數, years of schooling) depends only on ASVABC (測驗分數, a cognitive test score) and SM (母親教育年數, mother's years of schooling). The output below shows the corresponding regression using EAEF Data Set 21.

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------
We will run the regression a second time, omitting SM. Before we do this, we will try to predict the direction of the bias in the coefficient of ASVABC:

$E(b_2) = \beta_2 + \beta_3 \dfrac{\sum (ASVABC_i - \overline{ASVABC})(SM_i - \overline{SM})}{\sum (ASVABC_i - \overline{ASVABC})^2}$

It is reasonable to suppose, as a matter of common sense, that $\beta_3$ is positive.
This assumption is strongly supported by the fact that its estimate in the multiple regression is positive and highly significant.

. cor SM ASVABC
(obs=540)

        |     SM  ASVABC
--------+----------------
     SM |  1.0000
 ASVABC |  0.4202  1.0000

The correlation between ASVABC and SM is positive (0.4202), so the numerator of the bias term must be positive. The denominator is automatically positive, since it is a sum of squares and there is some variation in ASVABC. Hence the bias should be positive.

. reg S ASVABC

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  274.19
       Model |  1081.97059     1  1081.97059           Prob > F      =  0.0000
    Residual |  2123.01275   538  3.94612035           R-squared     =  0.3376
-------------+------------------------------           Adj R-squared =  0.3364
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9865

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf.
Interval]
-------------+----------------------------------------------------------------
      ASVABC |    .148084   .0089431    16.56   0.000     .1305165    .1656516
       _cons |   6.066225   .4672261    12.98   0.000     5.148413    6.984036
------------------------------------------------------------------------------

Here is the regression omitting SM. As you can see, the coefficient of ASVABC (.1481) is indeed higher than in the multiple regression (.1328). Part of the difference may be due to pure chance, but part is attributable to the bias.

. reg S SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   80.93
       Model |  419.086251     1  419.086251           Prob > F      =  0.0000
    Residual |  2785.89708   538  5.17824736           R-squared     =  0.1308
-------------+------------------------------           Adj R-squared =  0.1291
       Total |  3204.98333   539  5.94616574           Root MSE      =  2.2756

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf.
Interval]
-------------+----------------------------------------------------------------
          SM |   .3130793   .0348012     9.00   0.000     .2447165    .3814422
       _cons |   10.04688   .4147121    24.23   0.000     9.232226    10.86153
------------------------------------------------------------------------------

Here is the regression omitting ASVABC instead of SM. We would expect b3 to be upward biased:

$E(b_3) = \beta_3 + \beta_2 \dfrac{\sum (SM_i - \overline{SM})(ASVABC_i - \overline{ASVABC})}{\sum (SM_i - \overline{SM})^2}$

We anticipate that $\beta_2$ is positive, and we know that both the numerator and the denominator of the other factor in the bias expression are positive. In this case the bias is quite dramatic: the coefficient of SM (.3131) has more than doubled relative to the multiple regression (.1235). The effect is bigger here because the variation in SM is much smaller than that in ASVABC, while $\beta_2$ and $\beta_3$ are similar in size, judging by their estimates.
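The "more than doubled" coefficient is not an accident of this data set: in OLS, the simple-regression slope always equals the multiple-regression slope plus the omitted variable's coefficient times the slope of the omitted variable on the included one. A stdlib-Python sketch (all numbers invented, not the EAEF data) verifies this as an exact identity:

```python
import random

random.seed(3)

def slope(x, y):
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

def ols2(y, x2, x3):
    """Slopes (b2, b3) from regressing y on x2 and x3 with an intercept."""
    n = len(y)
    m2, m3, my = sum(x2) / n, sum(x3) / n, sum(y) / n
    d2 = [a - m2 for a in x2]
    d3 = [a - m3 for a in x3]
    dy = [a - my for a in y]
    s22 = sum(a * a for a in d2)
    s33 = sum(a * a for a in d3)
    s23 = sum(a * b for a, b in zip(d2, d3))
    s2y = sum(a * b for a, b in zip(d2, dy))
    s3y = sum(a * b for a, b in zip(d3, dy))
    det = s22 * s33 - s23 ** 2
    return (s33 * s2y - s23 * s3y) / det, (s22 * s3y - s23 * s2y) / det

n = 300
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [0.4 * a + random.gauss(0, 1) for a in x2]
y = [2 + 1.0 * a + 0.8 * b + random.gauss(0, 1) for a, b in zip(x2, x3)]

b2, b3 = ols2(y, x2, x3)       # multiple regression
b2_simple = slope(x2, y)       # x3 omitted
# Exact algebraic identity, not an approximation:
print(abs(b2_simple - (b2 + b3 * slope(x2, x3))) < 1e-10)
```

In sample terms, $b_2^{simple} = b_2 + b_3\, b_{X_3 X_2}$ holds exactly; the expectation result derived earlier is this identity averaged over the disturbances.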
Finally, we will investigate how R² behaves when a variable is omitted. In the multiple regression of S on ASVABC and SM, R² is 0.35 (0.3543). In the simple regression of S on ASVABC, R² is 0.34 (0.3376), and in the simple regression of S on SM it is 0.13 (0.1308).
Does this imply that ASVABC explains 34% of the variance in S and SM 13%? No, because the multiple regression reveals that their joint explanatory power is 0.35, not 0.47. In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its apparent explanatory power. Similarly, in the third regression, SM is partly acting as a proxy for ASVABC, again inflating its apparent explanatory power.
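That the simple-regression R² values need not add up to the multiple-regression R² is easy to reproduce. A stdlib-Python sketch with invented data (positively correlated regressors, like ASVABC and SM):

```python
import random

random.seed(11)

def r2(y, fitted):
    """R-squared: 1 - SSR/SST."""
    ybar = sum(y) / len(y)
    ssr = sum((a - b) ** 2 for a, b in zip(y, fitted))
    sst = sum((a - ybar) ** 2 for a in y)
    return 1 - ssr / sst

def fit1(y, x):
    """Fitted values from a simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return [my + b * (a - mx) for a in x]

def fit2(y, x2, x3):
    """Fitted values from regressing y on x2 and x3 (with intercept)."""
    n = len(y)
    m2, m3, my = sum(x2) / n, sum(x3) / n, sum(y) / n
    d2 = [a - m2 for a in x2]
    d3 = [a - m3 for a in x3]
    dy = [a - my for a in y]
    s22 = sum(a * a for a in d2)
    s33 = sum(a * a for a in d3)
    s23 = sum(a * b for a, b in zip(d2, d3))
    s2y = sum(a * b for a, b in zip(d2, dy))
    s3y = sum(a * b for a, b in zip(d3, dy))
    det = s22 * s33 - s23 ** 2
    b2 = (s33 * s2y - s23 * s3y) / det
    b3 = (s22 * s3y - s23 * s2y) / det
    b1 = my - b2 * m2 - b3 * m3
    return [b1 + b2 * a + b3 * b for a, b in zip(x2, x3)]

n = 2000
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [0.7 * a + random.gauss(0, 1) for a in x2]   # positively correlated pair
y = [1 + a + b + random.gauss(0, 2) for a, b in zip(x2, x3)]

r2_joint = r2(y, fit2(y, x2, x3))
r2_a, r2_b = r2(y, fit1(y, x2)), r2(y, fit1(y, x3))
# Each simple regression "borrows" explanatory power from its correlated
# partner, so the two R-squared values together overstate the joint one.
print(round(r2_a + r2_b, 2), ">", round(r2_joint, 2))
```

With positively correlated regressors, the sum of the simple R² values exceeds the joint R², just as 0.34 + 0.13 exceeds 0.35 in the EAEF example.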
However, it is also possible for omitted variable bias to lead to a reduction in the apparent explanatory power of a variable. This will be demonstrated using a simple earnings function model, supposing the logarithm of hourly earnings (LGEARN) to depend on years of schooling (S) and work experience (EXP):

$LGEARN = \beta_1 + \beta_2 S + \beta_3 EXP + u$

. reg LGEARN S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  100.86
       Model |  50.9842581     2   25.492129           Prob > F      =  0.0000
    Residual |  135.723385   537  .252743734           R-squared     =  0.2731
-------------+------------------------------           Adj R-squared =  0.2704
       Total |  186.707643   539   .34639637           Root MSE      =  .50274

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1235911   .0090989    13.58   0.000     .1057173     .141465
         EXP |   .0350826   .0050046     7.01   0.000     .0252515    .0449137
       _cons |   .5093196   .1663823     3.06   0.002     .1824796    .8361596
------------------------------------------------------------------------------

. cor S EXP
(obs=540)

        |      S     EXP
--------+----------------
      S |  1.0000
    EXP | -0.2179  1.0000
If we omit EXP from the regression, the coefficient of S should be subject to a downward bias:

$E(b_2) = \beta_2 + \beta_3 \dfrac{\sum (S_i - \bar{S})(EXP_i - \overline{EXP})}{\sum (S_i - \bar{S})^2}$

$\beta_3$ is likely to be positive. The numerator of the other factor in the bias term is negative, since S and EXP are negatively correlated (−0.2179), and the denominator is positive. For the same reasons, the coefficient of EXP in a simple regression of LGEARN on EXP should also be downward biased:

$E(b_3) = \beta_3 + \beta_2 \dfrac{\sum (EXP_i - \overline{EXP})(S_i - \bar{S})}{\sum (EXP_i - \overline{EXP})^2}$

. reg LGEARN S EXP

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf.
Interval]
-------------+----------------------------------------------------------------
           S |   .1235911   .0090989    13.58   0.000     .1057173     .141465
         EXP |   .0350826   .0050046     7.01   0.000     .0252515    .0449137
       _cons |   .5093196   .1663823     3.06   0.002     .1824796    .8361596
------------------------------------------------------------------------------

. reg LGEARN S

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1096934   .0092691    11.83   0.000     .0914853    .1279014
       _cons |   1.292241   .1287252    10.04   0.000     1.039376    1.545107
------------------------------------------------------------------------------

. reg LGEARN EXP

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |   .0202708   .0056564     3.58   0.000     .0091595     .031382
       _cons |    2.44941   .0988233    24.79   0.000     2.255284    2.643537
------------------------------------------------------------------------------

As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions (.1097 versus .1236 for S, and .0203 versus .0351 for EXP).
A comparison of R² for the three regressions (0.2731 for the multiple regression, 0.2065 for LGEARN on S, and 0.0233 for LGEARN on EXP) shows that the sum of R² in the simple regressions is actually less than R² in the multiple regression. This is because the apparent explanatory power of S in the second regression has been undermined by the downward bias in its coefficient. The same is true for the apparent explanatory power of EXP in the third equation.
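The downward-bias pattern with negatively correlated regressors reproduces easily in simulation. A stdlib-Python sketch (invented S-like and EXP-like variables, both with true coefficient 1.0):

```python
import random

random.seed(13)

def slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# True model with NEGATIVELY correlated regressors, as with S and EXP.
n = 4000
s = [random.gauss(0, 1) for _ in range(n)]
exp_ = [-0.3 * a + random.gauss(0, 1) for a in s]    # corr(s, exp_) < 0
y = [1 + 1.0 * a + 1.0 * b + random.gauss(0, 2) for a, b in zip(s, exp_)]

b_s = slope(s, y)      # exp_ omitted: biased down, toward 1 - 0.3 = 0.7
b_e = slope(exp_, y)   # s omitted: also biased down
print(round(b_s, 2), round(b_e, 2))  # both below the true value 1.0
```

Because each simple-regression slope is pulled toward zero, each simple R² understates the variable's contribution, and their sum falls short of the multiple-regression R², matching the pattern in the earnings example.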
Problems of misspecification: omission (scalar notation)

• The correct regression model is Y = α + βX + γZ + ε.
• Omitting the important explanatory variable Z, we fit Y = α′ + β′X + ε′.

Let $\hat{\beta}'$ be the estimated coefficient from the misspecified equation (lower-case letters denote deviations from sample means). Then $\hat{\beta}'$ is no longer an unbiased estimator of β:

$E(\hat{\beta}') = \beta + \gamma \dfrac{\sum xz}{\sum x^2}$

If γ (the relation of Z to Y) and $\sum xz$ (the relation of X to Z) have the same sign, $\hat{\beta}'$ has a positive bias; if they have opposite signs, the bias is negative.

Decomposition into direct and indirect effects:

$\hat{\beta}' = \hat{\beta} + \hat{\gamma}\, \dfrac{\sum xz}{\sum x^2} = \hat{\beta} + \hat{\gamma}\, b_{ZX}$

That is, $\hat{\beta}'$ equals the coefficient of X on Y controlling for Z ($\hat{\beta}$) plus the effect of Z on Y ($\hat{\gamma}$) times the effect of X on Z ($b_{ZX}$). $\hat{\beta}'$ is the total effect; $\hat{\beta}$ is the direct effect of X on Y; $\hat{\gamma}\, b_{ZX}$ is the indirect effect of X on Y through Z.

Path diagram: X → Y directly (direct effect $\hat{\beta}$); X → Z with slope $b_{ZX}$, and Z → Y with slope $\hat{\gamma}$ (indirect effect $b_{ZX}\hat{\gamma}$).

If Z is unrelated to Y (γ = 0) or Z is unrelated to X ($\sum xz = 0$), then $\hat{\beta}' = \hat{\beta}$ and $E(\hat{\beta}') = \beta$. If Z affects both Y and X (γ ≠ 0 and $\sum xz \ne 0$) and the regression omits Z, the estimated coefficient of X absorbs the influence running from X through Z to Y, and the true direct effect of X on Y cannot be estimated.

The residual variance estimated from the model that omits Z,

$S^2_{Y|X} = \dfrac{\sum (Y - \hat{\alpha}' - \hat{\beta}'X)^2}{n-2}$,

is no longer an unbiased estimator of the population variance σ². It can be shown that

$E(S^2_{Y|X}) = \sigma^2 + \dfrac{\gamma^2 \sum z^2}{n-2}$

so $S^2_{Y|X}$ is a positively biased estimator of σ² unless γ = 0, and inferential statistics will be in error. Even when $\sum xz = 0$, so that $\hat{\beta}'$ is unbiased, $S^2_{Y|X}$ is still biased (unless γ = 0), and statistical inference goes wrong.

As for the variances of the estimators,

$V(\hat{\beta}') = \dfrac{\sigma^2}{\sum x^2} \le \dfrac{\sigma^2}{\sum x^2 (1-r_{XZ}^2)} = V(\hat{\beta})$

σ² is usually unknown, so $S^2_{\hat{\beta}'}$ and $S^2_{\hat{\beta}}$ must be used for inference, and their relative size cannot be determined:

$S^2_{\hat{\beta}'} = \dfrac{S^2_{Y|X}}{\sum x^2}, \qquad S^2_{\hat{\beta}} = \dfrac{S^2_{Y|XZ}}{\sum x^2 (1-r_{XZ}^2)}$

Although $\dfrac{1}{\sum x^2} \le \dfrac{1}{\sum x^2 (1-r_{XZ}^2)}$, $S^2_{Y|XZ}$ may be smaller than $S^2_{Y|X}$ (by the expectations above), so which estimated variance is larger cannot be known. Moreover $\hat{\beta}'$ and $\hat{\beta}$ differ in value, so the t ratios $\hat{\beta}'/S_{\hat{\beta}'}$ and $\hat{\beta}/S_{\hat{\beta}}$ differ, and $\hat{\beta}'/S_{\hat{\beta}'}$ is a wrong t value that leads to wrong statistical inference.

14.1.2 模型設定錯誤-加入不相關變數 Model Specification Errors: Including an Irrelevant IV

VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

In this sequence we will investigate the consequences of including an irrelevant variable in a regression model.
The effects are different from those of omitted-variable misspecification. Summary of the four cases:

True model $Y = \beta_1 + \beta_2 X_2 + u$:
  – Fitted $\hat{Y} = b_1 + b_2 X_2$: correct specification, no problems.
  – Fitted $\hat{Y} = b_1 + b_2 X_2 + b_3 X_3$: coefficients are unbiased (in general) but inefficient; standard errors are valid (in general) but needlessly large.
True model $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$:
  – Fitted $\hat{Y} = b_1 + b_2 X_2$: coefficients are biased (in general); standard errors are invalid.
  – Fitted $\hat{Y} = b_1 + b_2 X_2 + b_3 X_3$: correct specification, no problems.

These results can be demonstrated quickly. Rewrite the true model adding X3 as an explanatory variable with a coefficient of 0:

$Y = \beta_1 + \beta_2 X_2 + 0 \cdot X_3 + u$

Now the true model and the fitted model coincide. Hence b2 will be an unbiased estimator of $\beta_2$ and b3 will be an unbiased estimator of 0.
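The unbiasedness claim can be checked with a quick simulation. A stdlib-Python sketch (all numbers invented): y truly depends only on x2, but we also include an irrelevant x3 that is correlated with x2.

```python
import random

random.seed(21)

def ols2_slopes(y, x2, x3):
    """Slopes from regressing y on x2 and x3 with an intercept."""
    n = len(y)
    m2, m3, my = sum(x2) / n, sum(x3) / n, sum(y) / n
    d2 = [a - m2 for a in x2]
    d3 = [a - m3 for a in x3]
    dy = [a - my for a in y]
    s22 = sum(a * a for a in d2)
    s33 = sum(a * a for a in d3)
    s23 = sum(a * b for a, b in zip(d2, d3))
    s2y = sum(a * b for a, b in zip(d2, dy))
    s3y = sum(a * b for a, b in zip(d3, dy))
    det = s22 * s33 - s23 ** 2
    return (s33 * s2y - s23 * s3y) / det, (s22 * s3y - s23 * s2y) / det

# Fixed regressors; x3 is IRRELEVANT (true coefficient 0) but correlated with x2.
n, reps = 150, 1500
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [0.8 * a + random.gauss(0, 1) for a in x2]

mean_b2 = mean_b3 = 0.0
for _ in range(reps):
    y = [1 + 2 * a + random.gauss(0, 1) for a in x2]   # y does not depend on x3
    b2, b3 = ols2_slopes(y, x2, x3)
    mean_b2 += b2 / reps
    mean_b3 += b3 / reps

print(round(mean_b2, 2), round(mean_b3, 2))  # near the true values 2 and 0
```

Averaged over replications, b2 stays centred on its true value and b3 on zero, so including the irrelevant variable introduces no bias; the cost, shown next, is in precision.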
Including an irrelevant independent variable

• The correct regression model is Y = α + βX + ε.
• Including the irrelevant explanatory variable Z, we fit Y = α* + β*X + γ*Z + ε*.

If $\hat{\beta}^*$ is estimated from the wrong model, it is still an unbiased estimator of β. Sketch of the proof (lower-case letters denote deviations from sample means): the multiple-regression estimator is

$\hat{\beta}^* = \dfrac{\sum z^2 \sum xy - \sum xz \sum zy}{\sum x^2 \sum z^2 - (\sum xz)^2}$

Substituting $y = \beta x + (\varepsilon - \bar{\varepsilon})$ from the true model, the numerator becomes

$\beta\left[\sum z^2 \sum x^2 - (\sum xz)^2\right] + \left[\sum z^2 \sum x\varepsilon' - \sum xz \sum z\varepsilon'\right]$

where $\varepsilon' = \varepsilon - \bar{\varepsilon}$. The second bracket has expectation zero, so dividing by the denominator gives $E(\hat{\beta}^*) = \beta$: unbiased.

Although $\hat{\beta}^*$ is unbiased, including the irrelevant variable Z inflates its estimated variance:

$S^2_{\hat{\beta}^*} = \dfrac{S^2_{Y|XZ}}{\sum x^2 (1-r_{XZ}^2)} \ge \dfrac{S^2_{Y|X}}{\sum x^2} = S^2_{\hat{\beta}}$

The t value for $\hat{\beta}^*$ is therefore underestimated, and we may wrongly conclude that X has no effect on Y — a statistical testing error that affects our conclusions.

VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

. reg LGFDHO LGEXP LGSIZE

      Source |       SS       df       MS              Number of obs =     868
-------------+------------------------------           F(  2,   865) =  460.92
       Model |  138.776549     2  69.3882747           Prob > F      =  0.0000
    Residual |  130.219231   865  .150542464           R-squared     =  0.5159
-------------+------------------------------           Adj R-squared =  0.5148
       Total |  268.995781   867  .310260416           Root MSE      =    .388

------------------------------------------------------------------------------
      LGFDHO |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       LGEXP |   .2866813   .0226824   12.639   0.000     .2421622    .3312003
      LGSIZE |   .4854698   .0255476   19.003   0.000     .4353272    .5356124
       _cons |   4.720269   .2209996   21.359   0.000     4.286511    5.154027
------------------------------------------------------------------------------

The analysis will be illustrated using a regression of LGFDHO, the logarithm of annual household expenditure on food eaten at home, on LGEXP, the logarithm of total annual household expenditure, and LGSIZE, the logarithm of the number of persons in the household.
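The inefficiency result above (unbiased coefficients but inflated sampling variance, and hence understated t values) can also be checked by simulation. A stdlib-Python sketch with invented numbers, comparing the spread of b2 across replications with and without an irrelevant, collinear regressor:

```python
import random
import statistics

random.seed(5)

def slope1(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def slope2_first(y, x2, x3):
    """Coefficient of x2 when y is regressed on x2 and x3."""
    n = len(y)
    m2, m3, my = sum(x2) / n, sum(x3) / n, sum(y) / n
    d2 = [a - m2 for a in x2]
    d3 = [a - m3 for a in x3]
    dy = [a - my for a in y]
    s22 = sum(a * a for a in d2)
    s33 = sum(a * a for a in d3)
    s23 = sum(a * b for a, b in zip(d2, d3))
    s2y = sum(a * b for a, b in zip(d2, dy))
    s3y = sum(a * b for a, b in zip(d3, dy))
    return (s33 * s2y - s23 * s3y) / (s22 * s33 - s23 ** 2)

n, reps = 100, 1500
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [0.9 * a + random.gauss(0, 0.5) for a in x2]   # strongly correlated with x2

simple, with_z = [], []
for _ in range(reps):
    y = [1 + 2 * a + random.gauss(0, 1) for a in x2]  # x3 is truly irrelevant
    simple.append(slope1(x2, y))
    with_z.append(slope2_first(y, x2, x3))

# Both estimators are centred on 2, but adding the irrelevant, collinear x3
# inflates the sampling variability of b2 by roughly 1/sqrt(1 - r_23^2).
ratio = statistics.pstdev(with_z) / statistics.pstdev(simple)
print(round(ratio, 2))
```

The standard error reported by the needlessly large model is valid but bigger than it had to be, which is exactly why the t value is pushed down.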
The source of the data was the 1995 US Consumer Expenditure Survey; the sample size was 868.

Now add LGHOUS, the logarithm of annual expenditure on housing services:

. reg LGFDHO LGEXP LGSIZE LGHOUS

      Source |       SS       df       MS              Number of obs =     868
-------------+------------------------------           F(  3,   864) =  307.22
       Model |  138.841976     3  46.2806586           Prob > F      =  0.0000
    Residual |  130.153805   864  .150640978           R-squared     =  0.5161
-------------+------------------------------           Adj R-squared =  0.5145
       Total |  268.995781   867  .310260416           Root MSE      =  .38812

------------------------------------------------------------------------------
      LGFDHO |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       LGEXP |   .2673552   .0370782    7.211   0.000     .1945813     .340129
      LGSIZE |   .4868228   .0256383   18.988   0.000     .4365021    .5371434
      LGHOUS |   .0229611   .0348408    0.659   0.510    -.0454214    .0913436
       _cons |   4.708772   .2217592   21.234   0.000     4.273522    5.144022
------------------------------------------------------------------------------
It is safe to assume that LGHOUS is an irrelevant variable and, not surprisingly, its coefficient is not significantly different from zero.

It is, however, highly correlated with LGEXP (correlation coefficient 0.81) and, to a lesser extent, with LGSIZE (correlation coefficient 0.33):

. cor LGHOUS LGEXP LGSIZE
(obs=869)

             |   LGHOUS    LGEXP   LGSIZE
-------------+---------------------------
      LGHOUS |   1.0000
       LGEXP |   0.8137   1.0000
      LGSIZE |   0.3256   0.4491   1.0000

Compare the two fitted regressions:

. reg LGFDHO LGEXP LGSIZE

------------------------------------------------------------------------------
      LGFDHO |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       LGEXP |   .2866813   .0226824   12.639   0.000     .2421622    .3312003
      LGSIZE |   .4854698   .0255476   19.003   0.000     .4353272    .5356124
       _cons |   4.720269   .2209996   21.359   0.000     4.286511    5.154027
------------------------------------------------------------------------------
. reg LGFDHO LGEXP LGSIZE LGHOUS

------------------------------------------------------------------------------
      LGFDHO |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       LGEXP |   .2673552   .0370782    7.211   0.000     .1945813     .340129
      LGSIZE |   .4868228   .0256383   18.988   0.000     .4365021    .5371434
      LGHOUS |   .0229611   .0348408    0.659   0.510    -.0454214    .0913436
       _cons |   4.708772   .2217592   21.234   0.000     4.273522    5.144022
------------------------------------------------------------------------------

The inclusion of LGHOUS does not cause the coefficients of LGEXP and LGSIZE to be biased. But it does increase their standard errors, particularly that of LGEXP (.0227 to .0371), as you would expect, reflecting the loss of efficiency.

14.2 Multicollinearity and other problems
Does the dependent variable affect any of the independent variables?
Reverse causation

• This is a question of causal direction: reverse causation.
• Consequences:
  – Every coefficient in the regression model may be biased.
  – It is hard to design a study that will adequately solve this problem.
• Temporal ordering helps us clarify the causal direction, as does common sense, but a great deal of uncertainty remains.

Multicollinearity

• Multicollinearity is a high degree of linear association among two or more independent variables.
• Example: in studying how housework hours depend on personal income and education, income and years of education are themselves likely to be linearly related.
• It is then hard to separate the effect of each IV on the DV, because the IVs move together: when one IV changes, the other changes with it.
• Two cases: perfect multicollinearity and near multicollinearity.

Perfect multicollinearity

• Any one IV can be written as an exact linear combination of the other IVs. For two independent variables X1 and X2, for example, X1 = a + bX2.
• OLS then has no solution. For the model Y = α + βX + γZ + ε, the least squares estimator is

  β̂ = (Σz²Σxy − ΣxzΣzy) / (Σx²Σz² − (Σxz)²) = (Σz²Σxy − ΣxzΣzy) / (Σx²Σz²(1 − r²XZ)).

• If X and Z are perfectly collinear, r²XZ = 1 and the denominator is 0, so no solution exists.
• Remedy: write Z as a linear function of X and substitute it into the original equation.

Near multicollinearity

• The least squares estimates become inflated and error-prone.
• The estimators' variances grow, so reliability is low: the estimated regression coefficients are still unbiased, but they are inefficient.
• Statistical inference goes wrong: it becomes very hard to reject the null hypothesis (so the estimates can at least be regarded as conservative).
• The regression coefficients are very sensitive to the sample: a small change in the sample can change the coefficients substantially.
• The same least squares formula as above applies, but OLS cannot pin down the individual coefficients accurately.

When X and Z are nearly collinear:
1. r²XZ is large and close to 1, and β̂ inflates (is overestimated).
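The "denominator equals zero" point can be seen concretely. This is a Python/NumPy sketch with made-up data: when Z is an exact linear function of X, the denominator Σx²Σz² − (Σxz)² vanishes (up to floating-point rounding) and the design matrix loses a rank, so OLS has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
z = 2.0 + 3.0 * x                 # Z is an exact linear function of X

xc, zc = x - x.mean(), z - z.mean()
# denominator of the OLS formula: sum x^2 * sum z^2 - (sum xz)^2
denom = (xc @ xc) * (zc @ zc) - (xc @ zc) ** 2
print(abs(denom) < 1e-6 * (xc @ xc) * (zc @ zc))  # True: zero up to rounding

X = np.column_stack([np.ones(n), x, z])
print(np.linalg.matrix_rank(X))  # 2, not 3: X'X is singular, no unique OLS solution
```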
2. The variance

   V(β̂) = σ²Σz² / [Σx²Σz²(1 − r²XZ)] = σ² / [Σx²(1 − r²XZ)]

   also grows, meaning that β̂ is very unstable.
3. In testing, the t values become insignificant, so the null hypothesis cannot be rejected and hypotheses are hard to verify.

Diagnosing near multicollinearity

• The F value is significant but the individual t values are not: the regression model as a whole has joint explanatory power, yet the influence of each individual IV on the DV cannot be exhibited separately.
• Compute the correlations among the IVs. If each r²xixj < R² (where R² is the coefficient of determination of the original regression equation), one can generally conclude that multicollinearity is not present.
• Use the variance inflation factor (VIF): VIFi = 1/(1 − R²i), where R²i is the coefficient of determination from regressing the i-th explanatory variable on the other IVs, also called the auxiliary R². If VIFi ≥ 10 (Paul Allison advocates the stricter cutoff 2.5), there is a multicollinearity problem.

Stata & multicollinearity
• Stata can compute VIFs, but the regression must be run first.
• The DV is again housework hours from the ROC year-85 (1996) Social Change Survey; the IVs are income and years of education.
• [Stata VIF output on the slide.] The VIFs are small, so the multicollinearity problem is not serious here.

Data and multicollinearity: the following kinds of data are especially prone to multicollinearity
• time-series data
• panel studies
• aggregate-level data: individual-level differences cancel each other out at the aggregate level.

How to handle multicollinearity?
• There is no simple solution.
• Use prior information about the IVs or about relations among the regression coefficients, substituting it into the equation.
• Drop the less important IVs, but watch out for specification error.
• Enlarge the sample.

Homoscedasticity
• We normally assume that the variance of the disturbance term in the regression equation is a fixed constant: V(εi) = σi² = σ².
• This condition often fails: consumption spending varies more among high-income earners than among low-income earners.
• Its failure is called heteroscedasticity.

Heteroscedasticity
• The variance of the disturbance term is not a fixed constant: V(εi) = σi² ≠ σ².
• It commonly arises in cross-sectional data.
• The OLS estimator is still unbiased, but it is no longer the best linear unbiased estimator (BLUE).

Checking for heteroscedasticity: the White test
1. Estimate the regression by OLS, Ŷ = α̂ + β̂X + γ̂Z, and compute the estimated residuals ei = Yi − Ŷi.
2. Estimate the auxiliary regression e² = a0 + a1X + a2X² + a3Z + a4Z² + a5XZ and obtain its coefficient of determination R².
3. Compute nR² (n = sample size). Under the null this statistic is chi-square distributed with P degrees of freedom, where P is the number of IVs in the e² regression; the test is right-tailed.
   If nR² > χ²P, reject H0: the variance is homoscedastic. If nR² ≤ χ²P, retain H0.

Income and consumption (p. 389)
• Enter the data by hand in Stata's data editor.
• Stata has no built-in White test; what it offers is a different test, Cook–Weisberg (hettest). As before, run the regression first.
• [Stata output on the slide.] The null hypothesis is rejected; equivalently, the variance is heteroscedastic.

White test in Stata
• Help > Search > Search All > enter "white test" and click through; the user-written command is whitetst (help whitetst).
• [Stata output on the slide.] Again the null hypothesis is rejected: the variance is heteroscedastic.
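The three White-test steps can also be carried out by hand. The Python/NumPy sketch below is mine, not the output of Stata's whitetst: the data are simulated, the helper `ols_resid_r2` is an invented name, and the 5% critical value 11.07 for χ² with 5 df is hard-coded. Because the simulated error spread grows with X, the test should reject homoscedasticity.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(1, 10, size=n)
z = rng.uniform(1, 10, size=n)
# heteroscedastic errors: the spread grows with X, as with income and consumption
y = 2.0 + 1.5 * x + 0.5 * z + rng.normal(scale=0.5 * x)

def ols_resid_r2(M, y):
    """Residuals and R^2 of an OLS fit (M already includes the constant)."""
    b = np.linalg.lstsq(M, y, rcond=None)[0]
    e = y - M @ b
    return e, 1 - (e @ e) / np.sum((y - y.mean()) ** 2)

# step 1: fit the original regression and keep the residuals
e, _ = ols_resid_r2(np.column_stack([np.ones(n), x, z]), y)

# step 2: regress e^2 on X, X^2, Z, Z^2 and XZ (P = 5 regressors)
W = np.column_stack([np.ones(n), x, x ** 2, z, z ** 2, x * z])
_, r2_aux = ols_resid_r2(W, e ** 2)

# step 3: n * R^2 is chi-square with P = 5 df under H0;
# the 5% critical value of chi2(5) is about 11.07
stat = n * r2_aux
print(stat > 11.07)  # True: homoscedasticity is rejected for these data
```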
Judging from plots
• [Residual plots on the slides.] The larger X is, the larger the residuals: the spread grows with X.

Consequences of heteroscedasticity

• Inefficiency: the least squares estimates no longer have minimum standard errors, which means you can do better using alternative methods. OLS is not optimal under heteroscedasticity because it gives equal weight to all observations when, in fact, observations with larger disturbances contain less information than observations with smaller ones. Weighted least squares instead gives greater weight to the observations with smaller variance.

• Biased standard errors: the standard errors reported by regression programs are only estimates of the true standard errors, which cannot be observed directly. Under heteroscedasticity these standard error estimates can be seriously biased, which in turn biases the test statistics and confidence intervals. This is more serious, because it leads to incorrect conclusions.
  – It is easier to use robust standard errors here. They do not change the coefficient estimates and therefore do not solve the inefficiency problem, but at least the test statistics will give us reasonably accurate p values, and they require fewer assumptions.

Dealing with heteroscedasticity

• Besides weighted least squares and robust regression, another common correction for heteroscedasticity is to transform the DV, for example by taking logs or square roots. These are called variance stabilizing transformations, but they also fundamentally change the relationship between the IVs and the DV, making the regression coefficients hard to interpret.
• A better approach is to use generalized linear models (GLM). In the income-and-consumption example above, instead of assuming normally distributed residuals we can fit the model with a gamma distribution: a gamma GLM.
• In any case, heteroscedasticity has to be pretty severe before it leads to serious bias in the standard errors.
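A sketch of how robust (White / HC0) standard errors work, written in Python/NumPy rather than via Stata's robust option. The data are simulated and this is the plain HC0 sandwich with no small-sample correction: the coefficient estimate is untouched, only the variance estimate changes.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(1, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.2 * x ** 2)  # heteroscedastic errors

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

XtX_inv = np.linalg.inv(X.T @ X)
# conventional OLS variance: s^2 (X'X)^{-1}, which assumes one common sigma^2
s2 = (e @ e) / (n - 2)
se_ols = np.sqrt(s2 * XtX_inv[1, 1])

# White (HC0) sandwich: (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}
meat = (X * (e ** 2)[:, None]).T @ X
se_robust = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])

print(round(float(b[1]), 2))  # the slope estimate is the same either way
print(se_robust > se_ols)     # True here: the conventional SE understates uncertainty
```

Because the error variance grows with x, the high-variance observations also have high leverage on the slope, so the conventional formula understates the slope's true sampling variability and the sandwich estimate comes out larger.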
The influence of outliers

• OLS is easily influenced by outliers, especially when the sample is small.
• Many statistical techniques are available for examining each observation's influence on the regression model. The main question: if we deleted a given observation, how would the model's parameters change?
• An observation's influence depends on two conditions:
  – how far its observed Y falls from the predicted value;
  – how extreme it is on the IVs (how far from the mean).

Studentized residuals
• The first method starts from the residuals: the larger the residual, the farther the observation lies from the overall trend. How far? Apply a standardizing transformation; the result is called the studentized residual.
• Be careful with any absolute value > 2.5.
• In Stata, run the regression first (using the earlier income-and-consumption example), then: predict newvar, rstudent (you choose the new variable's name).

Hat values (leverage)
• How far an observation's value on an IV lies from that IV's mean.
• The larger the hat value, the larger the observation's weight in computing the predicted value Ŷ, and the greater its leverage.
• The hat values average p/n, where p is the number of parameters in the model; hat values shrink as the sample grows.
• A hat value > 3p/n indicates high leverage (in the example, the cutoff is (3 × 2)/20).
• In Stata: predict newvar, hat

DFFITS
• Two commonly used diagnostic statistics are DFFITS and DFBETAS.
• DFFITS measures the observation's influence on model fit when it is removed, that is, how much the predicted value for each observation would change.
• In Stata: predict newvar, dfits

DFBETAS
• The change in a regression coefficient when the observation is removed, divided by the standard error estimated from the adjusted data set.
• A value > 1 indicates that the observation has a major influence.
• In Stata: predict newvar, dfbeta(IV), selecting each independent variable in turn. The dfbeta command, by contrast, needs no IV specified. Here there is only one IV, so there is only one DF value; note how Stata names the new variable.

Rerunning the regression without the outliers
• reg consum income if abs(DFincome) < 1
• New regression equation: consum = 23527.91 + 0.62·income
• Old equation (outliers not deleted): consum = 30948.47 + 0.54·income
• R-squared also rises (0.9589 to 0.9785).
• A plot of income against the residuals suggests a curvilinear relationship between them.

14.3 Nonlinear relationships: transforming variables

Nonlinear relationships
• Strictly speaking, in the multiple regression y = A + B1x1 + B2x2 + … + Bkxk + U, "linear" refers to the regression coefficients B: each coefficient can be multiplied by some number and the results summed.
• The independent variables x need not enter linearly: following ordinary mathematical rules, we can transform them (logs, square roots, higher powers). Apart from possible difficulties of interpretation, this causes no other serious problems.

Reasons for nonlinear relationships
• Our theory may posit a nonlinear relation between the IV and the DV, for example economic development and political unrest: Hibbs (1973) argues that political unrest increases as economic development moves from low to middle levels, but declines as development moves from middle to high levels.
• A scatterplot of the IV against the DV may show that the relationship is curvilinear rather than linear.

Nonlinearity: the quadratic regression model
• First put a squared term into the regression model.
• Always include the first-order (main effect) term;
never include the squared term alone. In a polynomial regression equation, keep all the lower-order terms in the model.

[Stata output on the slide.] Both the first-order and the squared terms are significant, so there is clearly a quadratic relationship. The estimated equation is:

  consum = 14364.27 + 0.8377·income − 0.00000108·income²

Reading the quadratic regression model
• This model uses the data with no observations deleted.
• At first, consumption increases with income.
• But the rate of increase slows (the negative squared term); past the turning point of the quadratic curve (its maximum/minimum), additional income actually suppresses consumption.
• The slope of the tangent changes with position, by the formula slope = β1 + 2β2X.
• Consumption peaks at income = −β1/(2β2) = −0.8377/(2 × (−0.00000108)) ≈ 387,827; beyond that point, consumption falls as income rises.
• R² = 0.9839, larger than the linear regression's R² = 0.9589.

Nonlinear models: other transformations of the IVs
• Sometimes other mathematical transformations of an IV are useful; consider income and consumption, or years of education and income.
• Income rises with years of education, but by smaller and smaller increments, and with no turning point: the relationship never reverses.
• In that situation we can take the logarithm of years of education, although such a transformation is not easy to interpret.
• Even then, approximating with a quadratic regression beats a linear one.

Simulating the log relationship
• We can generate a matching series in Excel to simulate the phenomenon: 500 values from 0.01 to 5.00 in increments of 0.01. Enter 0.01 in the first cell; in B1 enter =ln(a1), press Enter, and copy down to B500.
• Save the data as a tab-delimited text file and import it into Stata.
• The log relationship fitted as linear: R² = 0.77, the whole model is significant, and the regression coefficient (0.59) is also significant.
• The same data fitted with a quadratic model: R² = 0.91, the model is significant, and all coefficients are significant: v2 = −1.68 + 1.56·v1 − 0.19·v1².
• Polynomial transformations are admissible for interval-scale variables; most other functional transformations require ratio-scale variables.

Nonlinearity: transforming the dependent variable
• Sometimes we transform the DV itself to analyze the relationship between the IVs and the DV.
• For example, exponential regression is equivalent to a log transformation, usually the natural logarithm, with base e ≈ 2.71828:
  μ = E(Y) = αβ^X, that is, ln μ = ln α + (ln β)X = α′ + β′X.
• The curve falls when β < 1 and rises when β > 1.
• Transforming the DV changes the relationship between the IVs and the DV; after the transformation, that relationship is no longer suited to linear analysis.
• This requires a generalized linear model (GLM), and OLS can be viewed as one member of the GLM family.
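The turning-point arithmetic for the quadratic consumption equation can be checked directly. This tiny Python sketch uses the rounded coefficients as printed on the slide, so its result differs slightly from the slide's 387,827:

```python
b1, b2 = 0.8377, -0.00000108  # rounded coefficients from the consumption example

# vertex of the parabola: the slope beta1 + 2*beta2*X is zero at X = -beta1/(2*beta2)
turning_point = -b1 / (2 * b2)
print(round(turning_point))   # about 388,000 with these rounded inputs

def slope(x):
    """Tangent slope of b1*x + b2*x^2 at income x."""
    return b1 + 2 * b2 * x

print(slope(100000) > 0, slope(500000) < 0)  # rising before the vertex, falling after
```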