Regression Forced March
17.871
Spring 2006

Regression quantifies how one variable can be described in terms of another

Black Elected Officials Example I
[Figure: scatterplot of beo against bpop]

Stop a second: What is the correlation between beo & bpop?
.72, .82, .92?
[Figure: scatterplot of beo against bpop]

The Linear Relationship between Two Variables
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

The Linear Relationship between African American Population & Black Legislators
[Figure: beo against bpop with fitted values; $\beta_0 = -1.31$, $\beta_1 = 0.359$]

How did we get that line?
1. Pick a representative value of $Y_i$
2. Decompose $Y_i$ into two parts
3. Label the points
[Figure: beo against bpop with fitted values; one point labeled $Y_i$, its fitted value $\hat{Y}_i$, and the vertical gap $Y_i - \hat{Y}_i = \varepsilon_i$ ("residual")]
$$Y_i = (\beta_0 + \beta_1 X_i) + \varepsilon_i$$

Stop a moment: What is $\varepsilon_i$?
• Vagueness of theory
• Poor proxies (i.e., measurement error)
• Wrong functional form
• See Utts & Heckard discussion about the difference between deterministic relationships and statistical relationships

The Method of Least Squares
Pick $\beta_0$ and $\beta_1$ to minimize
$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
or
$$\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$
[Figure: beo against bpop with fitted values; $Y_i$, $\hat{Y}_i$, and $Y_i - \hat{Y}_i = \varepsilon_i$ labeled]

Solve for
$$\frac{\partial \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2}{\partial \beta_1} = 0$$
$$\beta_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \quad \text{or} \quad \beta_1 = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}$$
(Utts & Heckard, p. 164)

Solve for
$$\frac{\partial \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2}{\partial \beta_0} = 0$$
$$\beta_0 = \bar{Y} - \beta_1 \bar{X}$$
Note that if you rearrange:
$$\bar{Y} = \beta_0 + \beta_1 \bar{X}$$
(Utts & Heckard, p. 164)

$$\bar{Y} = \beta_0 + \beta_1 \bar{X}$$
[Figure: beo against bpop with fitted values]

About the Functional Form
• Linear in the variables vs. linear in the parameters
– $Y = a + bX + e$ (linear in both)
– $Y = a + bX + cX^2 + e$ (linear in parms.)
– $Y = a + X^b + e$ (linear in variables)
– $Y = a + \ln X^b / Z^c + e$ (linear in neither)
• Utts & Heckard, pp. 174-175

[Figure: Black Elected Officials; leg against pop, with two sets of fitted values]

Log transformations
• $Y = a + bX + e$
– $b = dY/dX$, or b = the unit change in Y given a unit change in X
– Typical case
• $Y = a + b \ln X + e$
– $b = dY/(dX/X)$, or b = the unit change in Y given a % change in X
– Cases where there's a natural limit on growth
• $\ln Y = a + bX + e$
– $b = (dY/Y)/dX$, or b = the % change in Y given a unit change in X
– Exponential growth
• $\ln Y = a + b \ln X + e$
– $b = (dY/Y)/(dX/X)$, or b = the % change in Y given a % change in X (elasticity)
– Economic production

How "good" is the fitted line?
[Figure: three panels against bpop, each with fitted values: smally, beo, and bigy]

Judging results
• Substantive interpretation of coefficients
• Technical judgment of regression
– Judgment of coefficients
– Judgment of overall fit

Determining Goodness of Fit I
• Coefficients
– Standard error of a coefficient
– t-statistic: coeff./s.e.

Standard error of the regression picture
[Figure: beo against bpop with fitted values; $Y_i$, $\hat{Y}_i$, and $Y_i - \hat{Y}_i = \varepsilon_i$ labeled]

Determining Goodness of Fit
• Standard error of the regression or standard error of estimate (Root mean square error in STATA)
$$\text{s.e.e.} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\text{d.f.}}}, \qquad \text{d.f.} = n - 2$$

$R^2$ picture
[Figure: beo against bpop with fitted values and a horizontal line at $\bar{Y}$; for one point the distances $(Y_i - \bar{Y})$, $(\hat{Y}_i - \bar{Y})$, and $(Y_i - \hat{Y}_i)$ are labeled]

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 \quad \text{"total sum of squares"}$$
$$\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \quad \text{"regression sum of squares"}$$
$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \quad \text{"residual sum of squares"}$$

Determining Goodness of Fit
• R-squared
$$r^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
or percent variance "explained"
"coefficient of determination"

Return to Black Elected Officials Example
. reg beo bpop

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  1,    39) =  202.56
       Model |   351.26542     1   351.26542           Prob > F      =  0.0000
    Residual |  67.6326195    39  1.73416973           R-squared     =  0.8385
-------------+------------------------------           Adj R-squared =  0.8344
       Total |  418.898039    40   10.472451           Root MSE      =  1.3169

------------------------------------------------------------------------------
         beo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bpop |   .3584751   .0251876    14.23   0.000     .3075284    .4094219
       _cons |  -1.314892   .3277508    -4.01   0.000    -1.977831   -.6519535
------------------------------------------------------------------------------

Residuals
$$e_i = Y_i - B_0 - B_1 X_i$$
[Figure: beo against bpop with fitted values; points AL and IL labeled]

One important numerical property of residuals
• The sum of the residuals is zero.

Regression Commands in STATA
• reg depvar indvars
• predict newvar
• predict newvar, resid

Why It's Called Regression
[Figure: Height of Sons against Height of Fathers]

Some Regressions

Temperature and Latitude
[Figure: JanTemp against latitude, points labeled by city]

. reg jantemp latitude

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =   49.34
       Model |  3250.72219     1  3250.72219           Prob > F      =  0.0000
    Residual |  1185.82781    18  65.8793228           R-squared     =  0.7327
-------------+------------------------------           Adj R-squared =  0.7179
       Total |     4436.55    19  233.502632           Root MSE      =  8.1166

------------------------------------------------------------------------------
     jantemp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    latitude |  -2.341428   .3333232    -7.02   0.000    -3.041714   -1.641142
       _cons |   125.5072   12.77915     9.82   0.000     98.65921    152.3552
------------------------------------------------------------------------------

. predict py
(option xb assumed; fitted values)

. predict ry,resid

[Figure: JanTemp and fitted values against latitude, points labeled by city]

. gsort -ry

. list city jantemp py ry

     +-------------------------------------------------+
     |           city   jantemp         py          ry |
     |-------------------------------------------------|
  1. |     PortlandOR        40    17.8015     22.1985 |
  2. | SanFranciscoCA        49   36.53293    12.46707 |
  3. |   LosAngelesCA        58   45.89864    12.10136 |
  4. |      PhoenixAZ        54   48.24007    5.759929 |
  5. |      NewYorkNY        32   29.50864    2.491357 |
     |-------------------------------------------------|
  6. |        MiamiFL        67   64.63007     2.36993 |
  7. |       BostonMA        29   27.16722    1.832785 |
  8. |      NorfolkVA        39   38.87436     .125643 |
  9. |    BaltimoreMD        32    34.1915     -2.1915 |
 10. |     SyracuseNY        22   24.82579   -2.825786 |
     |-------------------------------------------------|
 11. |       MobileAL        50   52.92293   -2.922928 |
 12. |   WashingtonDC        31    34.1915     -3.1915 |
 13. |      MemphisTN        40   43.55721   -3.557214 |
 14. |    ClevelandOH        25   29.50864   -4.508643 |
 15. |       DallasTX        43   48.24007   -5.240071 |
     |-------------------------------------------------|
 16. |      HoustonTX        50   55.26435   -5.264356 |
 17. |   KansasCityMO        28    34.1915     -6.1915 |
 18. |   PittsburghPA        25   31.85007   -6.850072 |
 19. |  MinneapolisMN        12   20.14293   -8.142929 |
 20. |       DuluthMN         7   15.46007   -8.460073 |
     +-------------------------------------------------+

Bush Vote and Southern Baptists
[Figure: Bush Pct 2004 and fitted values against Southern Baptist %, points labeled by state]

. reg bush sbc_mpct

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   11.83
       Model |  .069183833     1  .069183833           Prob > F      =  0.0012
    Residual |  .280630922    48  .005846478           R-squared     =  0.1978
-------------+------------------------------           Adj R-squared =  0.1811
       Total |  .349814756    49  .007139077           Root MSE      =  .07646

------------------------------------------------------------------------------
        bush |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    sbc_mpct |    .196814   .0572138     3.44   0.001     .0817779    .3118501
       _cons |   .4931758   .0155007    31.82   0.000     .4620095     .524342
------------------------------------------------------------------------------

[Figure: Bush Pct 2004 and fitted values against Southern Baptist %, points labeled by state]

Weight by State Population

. reg bush sbc_mpct [aw=votes]
(sum of wgt is 1.2207e+08)

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   40.18
       Model |  .118925068     1  .118925068           Prob > F      =  0.0000
    Residual |  .142084951    48  .002960103           R-squared     =  0.4556
-------------+------------------------------           Adj R-squared =  0.4443
       Total |  .261010018    49  .005326735           Root MSE      =  .05441

------------------------------------------------------------------------------
        bush |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    sbc_mpct |    .261779   .0413001     6.34   0.000     .1787395    .3448185
       _cons |   .4563507   .0112155    40.69   0.000     .4338004    .4789011
------------------------------------------------------------------------------

[Figure: Bush Pct 2004 against Southern Baptist %, with two sets of fitted values]

Midterm loss & pres'l popularity
[Figure: midterm loss against Gallup approval rating (Nov.), points labeled by year, 1938 to 2002]

. reg loss gallup

      Source |       SS       df       MS              Number of obs =      17
-------------+------------------------------           F(  1,    15) =    5.70
       Model |  2493.96962     1  2493.96962           Prob > F      =  0.0306
    Residual |  6564.50097    15  437.633398           R-squared     =  0.2753
-------------+------------------------------           Adj R-squared =  0.2270
       Total |  9058.47059    16  566.154412           Root MSE      =   20.92

------------------------------------------------------------------------------
        loss |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gallup |   1.283411     .53762     2.39   0.031     .1375011    2.429321
       _cons |  -96.59926   29.25347    -3.30   0.005    -158.9516   -34.24697
------------------------------------------------------------------------------

[Figure: loss and fitted values against Gallup approval rating (Nov.), points labeled by year]

. reg loss gallup if year>1948

      Source |       SS       df       MS              Number of obs =      14
-------------+------------------------------           F(  1,    12) =   17.53
       Model |  3332.58872     1  3332.58872           Prob > F      =  0.0013
    Residual |  2280.83985    12  190.069988           R-squared     =  0.5937
-------------+------------------------------           Adj R-squared =  0.5598
       Total |  5613.42857    13  431.802198           Root MSE      =  13.787

------------------------------------------------------------------------------
        loss |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gallup |    1.96812   .4700211     4.19   0.001     .9440315    2.992208
       _cons |  -127.4281   25.54753    -4.99   0.000    -183.0914   -71.76486
------------------------------------------------------------------------------

[Figure: loss against Gallup approval rating (Nov.), with two sets of fitted values, points labeled by year]
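
Checking the fit statistics by hand
The goodness-of-fit quantities defined earlier (total, regression, and residual sums of squares, R-squared, and the standard error of the regression) can be recomputed from the January temperature listing. The sketch below is not part of the original Stata session: it is a Python illustration that copies the jantemp and py columns from the list output, and the variable names (rows, tss, ess, rss, r2, rmse) are chosen here for illustration only. Because py is printed to about seven digits, the results should agree with the Stata table only up to rounding.

```python
import math

# (city, jantemp, py) copied from the `list city jantemp py ry` output above.
rows = [
    ("PortlandOR", 40, 17.8015), ("SanFranciscoCA", 49, 36.53293),
    ("LosAngelesCA", 58, 45.89864), ("PhoenixAZ", 54, 48.24007),
    ("NewYorkNY", 32, 29.50864), ("MiamiFL", 67, 64.63007),
    ("BostonMA", 29, 27.16722), ("NorfolkVA", 39, 38.87436),
    ("BaltimoreMD", 32, 34.1915), ("SyracuseNY", 22, 24.82579),
    ("MobileAL", 50, 52.92293), ("WashingtonDC", 31, 34.1915),
    ("MemphisTN", 40, 43.55721), ("ClevelandOH", 25, 29.50864),
    ("DallasTX", 43, 48.24007), ("HoustonTX", 50, 55.26435),
    ("KansasCityMO", 28, 34.1915), ("PittsburghPA", 25, 31.85007),
    ("MinneapolisMN", 12, 20.14293), ("DuluthMN", 7, 15.46007),
]

y = [r[1] for r in rows]       # observed January temperature
yhat = [r[2] for r in rows]    # fitted values from `predict py`
n = len(y)
ybar = sum(y) / n

# Residuals, as `predict ry, resid` would produce them.
resid = [yi - fi for yi, fi in zip(y, yhat)]

tss = sum((yi - ybar) ** 2 for yi in y)      # total sum of squares
ess = sum((fi - ybar) ** 2 for fi in yhat)   # regression sum of squares
rss = sum(e ** 2 for e in resid)             # residual sum of squares

r2 = ess / tss                   # R-squared as defined on the slides
rmse = math.sqrt(rss / (n - 2))  # s.e.e., with d.f. = n - 2

# Compare with the `reg jantemp latitude` table:
# Total SS 4436.55, Residual SS 1185.82781, R-squared 0.7327, Root MSE 8.1166.
print(f"TSS={tss:.2f}  ESS={ess:.2f}  RSS={rss:.2f}")
print(f"R-squared={r2:.4f}  Root MSE={rmse:.4f}")
print(f"sum of residuals={sum(resid):.6f}")
```

The last printed line illustrates the numerical property noted earlier: the residuals sum to zero, apart from rounding in the printed fitted values.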