Exercise 3
Third day in PC lab

Wrap-up from day 2: Multivariate Regression Model

Production function and multicollinearity
• use <path:\Y> cobb.dta
• describe
• Cobb-Douglas production function: Q = e^β1 · L^β2 · K^β3 · e^ε ⇔ ln(Q) = β1 + β2·ln(L) + β3·ln(K) + ε
• generate logarithms of the variables
• estimate: regress lnQ lnL lnK
• interpret the coefficients
• the R2 is high, but none of the variables is significant
• corr lnK lnL gives a correlation of 0.98 (extreme)
• corr K L is similarly high (this is usually the case, i.e. taking logarithms does not change the correlation much)
• re-estimate the model on subsamples (regress … if t>10, regress … if t<25): the estimates differ widely, a typical symptom of collinearity
• individual regressions of lnQ on lnK and of lnQ on lnL give very significant coefficients; these are much larger in magnitude, however, and are definitely biased (each variable also captures the effect of the other)
• collect more data? (this will probably not help: these are not primary data, so we cannot, except by waiting another five or ten years)
• we can instead introduce non-sample information if we know that the production process exhibits constant returns to scale (the sum of the production elasticities is one)
• constraint define 1 lnK + lnL = 1
• cnsreg lnQ lnK lnL, constraint(1) (the command for constrained regression)
• the parameters are still not significant, but this is not crucial: if we know that constant returns to scale hold, this is a good estimate, since we know that both variables do play a significant role
• note that no R2 is displayed under constraints, because SST = SSR + SSE no longer holds
• constraint drop 1 (or constraint drop _all) once you do not need the constraints anymore
• just to show the estimation without a constant term, run the regression again: regress lnQ lnK lnL, nocons
• this type of constraint also changes the estimates quite a bit; but look at the R2, which should not be reported, because it is wrong
• the constrained-regression sequence is consolidated in the sketch below
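A minimal sketch of the whole sequence, using only the commands introduced above (variable names as generated there):

    * Cobb-Douglas estimation with non-sample information (sketch)
    gen lnQ = ln(Q)
    gen lnL = ln(L)
    gen lnK = ln(K)
    regress lnQ lnL lnK                 // unrestricted: high R2, insignificant t-values
    constraint define 1 lnL + lnK = 1   // constant returns to scale
    cnsreg lnQ lnL lnK, constraint(1)   // restricted least squares (no R2 reported)
    constraint drop _all                // clean up once the constraint is no longer needed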
Beer data (Regression specification error test)
• use <path\> beer.dta
• the demand for a good depends on its own price, the prices of substitutes, the prices of other goods, and income
• we have data on the average per capita beer demand
• describe
• summarize and look at the data and their variation
• general model: q = f(pb, pl, pr, m)
• what are the expected signs?
• graph matrix q pb pl pr m (to see the whole matrix of two-way scatter plots)
• but you can easily be fooled, because elasticities are ceteris paribus, whereas the graphs do not control for changes in the other variables
• linear regression: regress q pb pl pr m
• note, for instance, that the marginal effect of income is positive, although the scatter plot rather indicated a negative relationship
• the R2 is relatively high, but is the model correctly specified in terms of variables and functional form?
• predict yroof
• RESET test: what do we have to do?
• gen yroof2=yroof^2 and gen yroof3=yroof^3
• regress q pb pl pr m yroof2 (the t-value on yroof2 is significant)
• regress q pb pl pr m yroof2 yroof3 (although the individual t-values are not significant, the joint F test is: test yroof2 yroof3)
• we have to reject H0 that there is no misspecification
• ovtest (after regressing the original model) runs Ramsey's RESET test, also called the "omitted variables test", although we know it picks up functional form problems as well; it likewise leads to a rejection of H0
• since we do not know which variables might be missing, there is possibly a problem with the functional form
• let us try the log-log model (which is popular)
• generate logs of all variables
• regress lnq lnpb lnpl lnpr lnm
• interpret the variables
• ovtest now shows no misspecification anymore; often the results are not that clear-cut
• note that R2 would not have been a good indicator of correct or incorrect model specification (it is almost the same as in the previous linear model)

Dummy variables
• use <path\> Soybean.dta
• describe
• soybean area; is there a clear time trend?
• regress Area t / predict yroof / scatter yroof Area t, c(l)
• additional information: in 1997 RR technology was introduced
• generate intercept and slope (interaction) dummies:
• gen RR=0
• replace RR=1 if Year>1996
• gen RR_t=RR*t
• if the dummies are not included, the estimator is biased (because we would assume that the parameters are the same before and after the RR introduction when in fact they are not)
• regress Area t RR RR_t
• predict yhat
• the estimated model is
Area = β1 + β2·t + γ1·RR + γ2·RR_t
i.e. Area = β1 + β2·t if RR = 0
and Area = (β1 + γ1) + (β2 + γ2)·t if RR = 1
• scatter yhat Area Year, c(l)
• this is nice to show in a simple regression model with only one regressor; a consolidated sketch of the setup follows below
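The sketch referenced above; all names (Year, t, Area, RR, RR_t) are the handout's, only the dummy generation is condensed into a single logical expression:

    * intercept and slope dummies for the RR break in 1997 (sketch)
    gen RR = (Year > 1996)    // intercept dummy: 1 from 1997 on, 0 before
    gen RR_t = RR*t           // slope (interaction) dummy
    regress Area t RR RR_t    // separate intercept and trend for each regime
    predict yhat
    scatter yhat Area Year, c(l)

gen RR = (Year > 1996) produces the same dummy as the gen/replace pair above in one step.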
House prices
• intercept and slope (interaction) dummies and their interpretation
• use <path\> house price data
• describe
• hedonic house price equation: price = f(size, college, college_size, age, pool, fireplace)
• regress price size age college pool fireplace college_size
• interpret the slope and dummy variables, the coefficients, R2 etc.

Existence of qualitative effects? (Restricted vs unrestricted model or, alternatively, Chow test)
• use <path\> investment.dta
• describe
• investment function: invest = β1 + β2·value + β3·capital
• but the data over 20 years are pooled for two different electrical companies (dummy variable D: 1 = Westinghouse, 0 = General Electric)
• it is nice to use all 40 observations, but is this appropriate?
• null hypothesis H0: the functions for both firms are identical, so the data can be pooled
• generate the set of dummies:
• gen D_value=D*value
• gen D_capital=D*capital

1) Restricted vs unrestricted model
• recall that the F-statistic is F = [(SSE_R − SSE_U)/J] / [SSE_U/(T − K)]
• estimate the restricted model: regress invest value capital
• note down the sum of squared residuals (SS Residual) = SSE_R
• regress the unrestricted model: regress invest value capital D D_value D_capital
• note down its sum of squared residuals = SSE_U
• construct the F-statistic: F = [(16563.00 − 14989.82)/3] / [14989.82/(40 − 6)] = 1.1894
• the α = .05 critical value Fc = 2.8826 comes from the F(3,34) distribution; since F < Fc, we cannot reject the null hypothesis that the investment functions for General Electric and Westinghouse are identical
• therefore, pooling the data (without dummies) seems appropriate
• we can simplify the analysis by asking STATA to carry out the F test for joint significance of the coefficients on D, D_value and D_capital
• regress the unrestricted model: regress invest value capital D D_value D_capital
• test D D_value D_capital
• it is no coincidence that this F statistic is exactly equal to the one we calculated by hand before; the p-value of 0.3284 indicates that we cannot reject H0

2) Chow test
• alternatively, the two data sets can be estimated separately and SSE1 and SSE2 noted down; the same F statistic results because SSE_U = SSE1 + SSE2
• then F = [(SSE_R − SSE_U)/(DF_R − DF_U)] / [SSE_U/DF_U] with DF_R = 40 − 3 = 37 and DF_U = (20 − 3) + (20 − 3) = 34
• therefore the numerator df = 3 and the denominator df = 34
• this results in exactly the same test statistic as above
• the advantage is that no dummy and interaction variables have to be generated (especially convenient in longer equations with many variables)
• regress invest value capital if D==1
• regress invest value capital if D==0 (add up the two residual SS; a scripted version is sketched below)
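The bookkeeping can also be left to Stata's stored results instead of noting the SSEs down by hand. A sketch, with J = 3 and DF_U = 34 taken from above; the scalar names are illustrative:

    * Chow test via stored results (sketch)
    quietly regress invest value capital           // restricted (pooled) model
    scalar sse_r = e(rss)
    quietly regress invest value capital if D==0   // General Electric
    scalar sse_1 = e(rss)
    quietly regress invest value capital if D==1   // Westinghouse
    scalar sse_2 = e(rss)
    scalar F = ((sse_r - (sse_1 + sse_2))/3) / ((sse_1 + sse_2)/34)
    display "F = " F ",  5% critical value = " invFtail(3, 34, .05)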
• the Chow test (and the F test in general) assumes that the MR assumptions hold for all observations, in particular homoskedasticity (equal error variances in both groups); otherwise the test fails
• since there is likely to be heteroskedasticity in this model, we have to use an alternative to the F test, the so-called Wald test; without going into the details, the Wald test is similar to the F test, but it uses the correct variance-covariance matrix (White estimator)
• regress invest value capital D D_value D_capital, robust
• test D D_value D_capital
• the p-value of 0.067 still tells us at the 5% level that pooling the data is okay, but the case is not as clear as before

Nonlinear models
• use <path\> EAF_technology.dta
• describe
• over the last three decades, the traditional technology for steel making, involving blast and oxygen furnaces and the use of iron ore, has been replaced by the newer electric arc furnace (EAF) technology, which utilizes scrap steel; predictions about how fast this new technology is being adopted have implications for the suppliers of iron ore (mining companies) and of scrap steel
• the data contain the share of EAF technology in the US steel industry for the years 1970 to 1997
• look at the data: the share in 1970 and the share in 1997
• logistic growth curve: y_t = α / (1 + exp(−β − δ·t)) + e_t
• α is the maximum adoption rate or saturation point (which we do not know), δ controls the speed of adoption, and β determines how far below saturation the share is at time 0; −β/δ is the point of inflection (half saturation)
• the relationship cannot be estimated by OLS, but by nonlinear least squares; a little programming is required
• in the head menu: select "Window", choose "Do-file Editor", "New Do-file", and enter:

    program nllogistic                           // the name we give the program; it has to start with nl
        version 8.0
        if "`1'" == "?" {                        // always the program start, putting the parameters into macro S_1
            global S_1 "A B D"                   // declare the parameters to be estimated
            global A = 0                         // give them initial values to start the iteration
            global B = 0
            global D = 0
            exit                                 // exit the parameter definition
        }                                        // do not forget the bracket
        replace `1' = $A/(1 + exp(-$B - $D*t))   // the function to be estimated (note the dollar signs)
    end

• save your file under the name "logistic.do" (for another function, just replace the parameter definitions, the initial values and the functional form)
• the program has to be put into memory before it can be used by the next command; in the head menu: select "File", choose "Do…" and select your file logistic.do
• nl logistic eaf_share (only the dependent variable has to be specified, because the regressors are defined in the program)
• predict yhat
• scatter yhat t, c(l)
• be careful with the R2; other software does not report it at all
• the starting values can be changed, e.g. nl logistic eaf_share, init(A=-2) or init(A=2, B=6, …)
• program drop nllogistic (when you no longer need it)

Heteroskedasticity

Detecting heteroskedasticity
• use <path\> foodexp.dta
• regress foodexp income
• predict residual, residuals
• predict yhatlin
• scatter residual income
• scatter yhatlin foodexp income, c(l)

GQ Test by hand
• gen inc_group=1 if income>712 (the median)
• replace inc_group=2 if missing(inc_group)
• regress foodexp income if inc_group==1 (note down the SSE, which gives the variance estimate var1)
• regress foodexp income if inc_group==2 (note down the SSE, which gives the variance estimate var2)
• calculate GQ = var1/var2 (a scripted version is sketched below)
• the disadvantage: you need to know with respect to which variable the heteroskedasticity occurs; this is obvious in the simple regression model, but not always in multiple regression
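The sketch referenced above, using stored results; since the sample is split at the median, both groups have the same degrees of freedom, so the ratio of the variance estimates equals the ratio of the SSEs. The scalar names are illustrative:

    * GQ test via stored results (sketch; split at the median income of 712)
    quietly regress foodexp income if income > 712
    scalar sse1 = e(rss)
    scalar df1 = e(df_r)
    quietly regress foodexp income if income <= 712
    scalar sse2 = e(rss)
    scalar df2 = e(df_r)
    scalar GQ = (sse1/df1) / (sse2/df2)
    display "GQ = " GQ ",  5% critical value = " invFtail(df1, df2, .05)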
• an alternative: the GQ test itself does not seem to be included in STATA; instead, the Cook-Weisberg test (the same as the Breusch-Pagan test) is available, which tests whether t = 0 in σ² = s²·exp(x·t)
• if t = 0, the variance of the error term is homoskedastic (it does not depend on x)
• the STATA command is: hettest x (in our case hettest income)
• if no variable is specified (just hettest) in a multiple regression model, the fitted values are used

Heteroskedastic partition
• use <path\> wheat.dta
• describe
• 26 observations on wheat supply, price and time (an Australian wheat-growing district)
• q = f(price, technology, weather); t is a proxy for technology; there are no weather data, so weather is part of e
• q = β1 + β2·p + β3·t + e
• new wheat varieties with lower susceptibility to weather were introduced after year 13
• split the sample into two sub-samples:
• generate group=0 if t<14
• replace group=1 if missing(group)
• STATA simplifies the process by providing a maximum-likelihood estimation (one could do everything by hand using OLS in several steps, but software packages are there to make life easier; one only needs to understand what the software is actually doing), so here is the "shortcut" first:
• xtgls q p t, i(group) panel(hetero)
• xt marks the panel-data commands (time-series cross-section); i() defines the panel, which in our case is the first versus the second 13 years; panel(hetero) indicates that the two panels exhibit heteroskedasticity (unequal variances)
• regress the simple OLS model and perform the Breusch-Pagan (Cook-Weisberg) test: hettest group (H0 is rejected); one could also do hettest t (in this case with the same conclusion but a lower test statistic); hence, if very specific non-sample information is available, it should be included in order to reach more reliable conclusions
• perform the Goldfeld-Quandt test by running two OLS regressions:
• regress q p t if group==0
• regress q p t if group==1
• divide: var0/var1 = 11.11 > F(10,10) = 2.98 at the 5% level; the df are (T1 − K, T2 − K)
• predict residual, residuals
• scatter residual t
• alternative to the "shortcut": GLS by hand (σ² for group 0 (t = 1 to 13) is 6416.40762, for group 1 it is 577.585656)
• gen cons_trans=1/sqrt(6416.40762) if group==0
• replace cons_trans=1/sqrt(577.585656) if group==1
• gen q_trans=q/sqrt(6416.40762) if group==0
• replace q_trans=q/sqrt(577.585656) if group==1
• gen p_trans=p/sqrt(6416.40762) if group==0
• replace p_trans=p/sqrt(577.585656) if group==1
• gen t_trans=t/sqrt(6416.40762) if group==0
• replace t_trans=t/sqrt(577.585656) if group==1
• regress q_trans cons_trans p_trans t_trans, nocons
• the result will differ slightly from the xtgls output due to rounding errors
• however, if we only know that there is heteroskedasticity but have no idea where it comes from, then using White's estimator (robust) may still be the best choice:
• use <path\> foodexp.dta
• describe
• to show how the White estimator works, we use the food expenditure data
• compare the two models: regress foodexp income and regress foodexp income, robust
• the estimated coefficients (and the R2) are the same, but the standard errors and t-statistics differ

GLS for proportional heteroskedasticity
• here we use weighted least squares (the weight is the reciprocal of x); i.e., instead of minimizing the usual sum of squared errors, we minimize the sum of squared transformed errors
• regress foodexp income [aweight=1/income]
• aweights, or analytic weights, are weights that are inversely proportional to the variance of an observation; i.e., the variance of the j-th observation is assumed to be var(y_j) = var(e_j) = σ²/aweight_j; since we have assumed var(e_t) = σ²·x_t, the correct weight is 1/x_t
• the same can be done by hand by building the transformed model
y*_t = β1·x*_t1 + β2·x*_t2 + e*_t with y*_t = y_t/√x_t, x*_t1 = 1/√x_t, x*_t2 = x_t/√x_t, e*_t = e_t/√x_t
• transform all variables, generate a new one for x_t1, and estimate the transformed model with the constant suppressed (…, nocons); this leads to the same results; only be careful when interpreting the R2, because of the nocons option (see the sketch below)
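A minimal sketch of the transformation by hand for the food expenditure model, assuming var(e_t) = σ²·income_t as above; the starred variable names are illustrative:

    * WLS by hand for the food expenditure model (sketch)
    gen ystar  = foodexp/sqrt(income)    // y*_t = y_t/sqrt(x_t)
    gen x1star = 1/sqrt(income)          // transformed constant term
    gen x2star = sqrt(income)            // x_t/sqrt(x_t) = sqrt(x_t)
    regress ystar x1star x2star, nocons  // reproduces regress foodexp income [aweight=1/income]

The coefficient on x1star estimates β1 and the one on x2star estimates β2; the reported R2 is not comparable to the untransformed model because of the nocons option.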