Econometrics II
Seppo Pynnönen
Department of Mathematics and Statistics, University of Vaasa, Finland
January 12 – February 25, 2016

Part II: Panel Data
As of Feb 2, 2016

1 Panel Data
    Pooling independent cross section across time
    Fixed effects model
        Two-period panel data analysis
        More than two time periods
        Fixed effects method
        Dummy variable regression
        Fixed effects or first differencing?
        Balanced and unbalanced panels
    Random effects models
    Random effects or fixed effects
    Hausman specification test
    Policy analysis with panel data
    Dynamic Panel Models

Data sets that combine time series and cross-sectional data are common in economics.

Independently pooled cross section:
Data are obtained by randomly sampling a large population at different points in time (e.g., yearly).
This allows one to investigate the effect of time, e.g., whether relationships have changed.
It typically raises only minor statistical complications.
Important feature: the data set consists of independently sampled observations.

A panel data set (longitudinal data) is a sample in which the same individuals, families, firms, cities, ... are followed across time.
E.g., OECD statistics contain numerous series observed yearly for several countries. Similarly, time series data on several firms, industries, etc., are data of this type.

Pooling independent cross section across time

Example 1
Women's fertility over time: Data from the General Social Survey contain samples collected in even years from 1972 to 1984.
The model explains the total number of children born to a woman.
The data are available on the course web site (password protected).

    * read data
    . insheet using "fertil1.csv", comma clear
    * describe data
    . des

    Contains data
      obs:         1,129
     vars:            14
     size:        24,838 (99.9% of memory free)
    ------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    ------------------------------------------------------------
    year            byte   %8.0g
    educ            byte   %8.0g
    meduc           byte   %8.0g
    feduc           byte   %8.0g
    age             byte   %8.0g
    kids            byte   %8.0g
    black           byte   %8.0g
    east            byte   %8.0g
    northcen        byte   %8.0g
    west            byte   %8.0g
    farm            byte   %8.0g
    othrural        byte   %8.0g
    town            byte   %8.0g
    smcity          byte   %8.0g

    . tabstat kids, statistics( mean count ) by(year) columns(statistics)

    Summary for variables: kids
         by categories of: year

        year |      mean         N
    ---------+--------------------
          72 |       3.0       156
          74 |       3.2       173
          76 |       2.8       152
          78 |       2.8       143
          80 |       2.8       142
          82 |       2.4       186
          84 |       2.2       177
    ---------+--------------------
       Total |       2.7      1129
    ------------------------------

[Figure: Number of children per woman. Vertical axis: N of children; horizontal axis: Year.]

It is obvious that the fertility rate has declined over the years.

The analysis can be elaborated substantially with regression analysis.
After controlling for other factors (education, age, etc.), what has happened to the fertility rate?
Build a regression with year dummies: y74 for 1974, ..., y84 for 1984. Year 1972 is the base year.

    . reg kids educ age age2 black east northcen west farm ///
          y74 y76 y78 y80 y82 y84

          Source |       SS       df       MS              Number of obs =    1129
    -------------+------------------------------           F( 14,  1114) =   11.51
           Model |  389.777313    14  27.8412367           Prob > F      =  0.0000
        Residual |  2695.73199  1114  2.41986713           R-squared     =  0.1263
    -------------+------------------------------           Adj R-squared =  0.1153
           Total |   3085.5093  1128  2.73538059           Root MSE      =  1.5556

    -----------------------------------------------------
            kids |      Coef.   Std. Err.      t    P>|t|
    -------------+---------------------------------------
            educ |  -.1242409   .0181486    -6.85   0.000
             age |   .5381453   .1384005     3.89   0.000
            age2 |  -.0058679   .0015645    -3.75   0.000
           black |   1.083783   .1734035     6.25   0.000
            east |   .2276015   .1312518     1.73   0.083
        northcen |   .3713906   .1199679     3.10   0.002
            west |   .2188689   .1663522     1.32   0.189
            farm |  -.0918808    .122027    -0.75   0.452
             y74 |   .2586277   .1727165     1.50   0.135
             y76 |  -.1012358   .1787317    -0.57   0.571
             y78 |  -.0671507   .1814491    -0.37   0.711
             y80 |  -.0751199   .1827069    -0.41   0.681
             y82 |  -.5323518   .1723385    -3.09   0.002
             y84 |  -.5383952    .174472    -3.09   0.002
           _cons |  -7.894707    3.05159    -2.59   0.010
    -----------------------------------------------------

There is a sharp drop in fertility in the early 1980s (the other year effects are not statistically significant).
E.g., the coefficient on y82 indicates that, holding other factors fixed (educ, age, and others), there were about 53 fewer children per 100 women in 1982 than in 1972.
In particular, since education is controlled for, this decline is separate from the decline due to the increase in education.
Women with more education have fewer children (the coefficient −0.124 is highly statistically significant, with t = −6.85 and p-value < 0.0005).
Other things equal, a woman with a college education (four more years of education) is predicted to have 4 × 0.124 = 0.496 fewer children than a woman with only a high school education, i.e., about 50 fewer children per 100 women.

In summary, problems involving pooled cross section data (independent samples) can be analyzed using dummy variables.

Fixed effects model

Two-period panel data analysis

From each individual (people, firms, schools, cities, countries, etc.) data are collected at two time points, t = 1 and t = 2.
In ordinary regression, one major source of bias is omitted (important) variables.
For example, if the true model is

$$ y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + u_i, \tag{1} $$

but we estimate

$$ y_i = \beta_0 + \beta_1 x_i + v_i, \tag{2} $$

where

$$ v_i = \beta_2 z_i + u_i, \tag{3} $$

then the bias in the OLS estimator $\hat\beta_1$ from model (2) is

$$ E[\hat\beta_1] - \beta_1 = \beta_2 \, \frac{\sum_{i=1}^{n} (x_i - \bar{x}) z_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \tag{4} $$

which can be substantial if x and z are correlated and $\beta_2$ is large.
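The bias formula (4) can be checked numerically. The following is a minimal sketch, not part of the original lecture material: the parameter values and the five data points are hypothetical, and the errors u_i are set to zero so that the identity holds exactly rather than only in expectation.

```python
def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


# Hypothetical parameter values and data (assumptions for illustration only).
beta0, beta1, beta2 = 1.0, 2.0, 3.0
x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [0.5, 1.5, 1.0, 2.5, 3.0]   # omitted variable, correlated with x
u = [0.0] * len(x)              # errors switched off so that (4) holds exactly

# True model (1): y depends on both x and z.
y = [beta0 + beta1 * xi + beta2 * zi + ui for xi, zi, ui in zip(x, z, u)]

# Short regression (2): y on x only, z omitted.
b1_short = ols_slope(x, y)

# Bias predicted by formula (4).
x_bar = sum(x) / len(x)
bias_formula = (beta2 * sum((xi - x_bar) * zi for xi, zi in zip(x, z))
                / sum((xi - x_bar) ** 2 for xi in x))

print(f"slope from short regression: {b1_short:.3f}")
print(f"true beta1 + bias from (4):  {beta1 + bias_formula:.3f}")
```

With these numbers both printed values agree: the short regression overstates the slope by exactly the amount given in (4).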
The problem is that we usually do not know whether important variables are missing from our model!
Panel data make it possible to eliminate the omitted variable bias in certain cases.
Suppose that we have the following situation in terms of model (1):

$$ y_{it} = \beta_0 + \beta_1 x_{it} + \beta_2 z_i + u_{it}, \tag{5} $$

where i refers to individual i and t to time point t.
Thus, we have panel data where data are collected from each individual i at different time points t (in the two-period case, t = 1, 2).
Note that in (5) $z_i$ has no time index, which implies that variable z is time invariant (or at least changes very slowly over time).

Suppose we have from each of the n individuals observations on $y_{it}$ and $x_{it}$ at time points t = 1 and t = 2, thus altogether 2n observations.
However, we do not observe $z_i$.
Suppose further that we allow the intercept to differ between the time points, so that (5) can be written as

$$ y_{it} = \beta_0 + \delta_0 D_t + \beta_1 x_{it} + \beta_2 z_i + u_{it}, \tag{6} $$

where $D_t = 0$ for t = 1 and $D_t = 1$ for t = 2 (time dummy).

Then, taking differences $\Delta y_i = y_{i2} - y_{i1}$, the model in (6) becomes

$$ \Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta u_i, \tag{7} $$

i.e., the (unobserved) omitted variable disappears, and the OLS estimator of the slope parameter $\beta_1$ is unbiased (provided $\Delta u_i$ is uncorrelated with $\Delta x_i$).

The above generalizes immediately: if we denote

$$ a_i = z_i'\gamma = \gamma_1 z_{i1} + \gamma_2 z_{i2} + \cdots + \gamma_q z_{iq} \tag{8} $$

and extend (6) to

$$ y_{it} = \beta_0 + \delta_0 D_t + \beta_1 x_{it} + a_i + u_{it}, \tag{9} $$

taking differences again reduces to the estimation model (7).
The above model is called the fixed effects (FE) model, in which $a_i$ is fixed over the time periods ($a_i$ can be a random variable and can be correlated with the explanatory variable $x_{it}$).
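The way differencing removes $a_i$ in (9) can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration (four hypothetical units, made-up parameter values, errors set to zero); the point is only that the pooled regression picks up the correlation between $a_i$ and $x_{it}$, while the first-differenced regression (7) does not.

```python
def ols(x, y):
    """Intercept and slope from a simple OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((p - x_bar) * (q - y_bar) for p, q in zip(x, y))
             / sum((p - x_bar) ** 2 for p in x))
    return y_bar - slope * x_bar, slope


# Hypothetical two-period panel (illustrative values, not from the lecture).
beta0, delta0, beta1 = 1.0, 0.5, 2.0
a = [0.0, 2.0, 4.0, 6.0]     # unobserved effect, increasing with x
x1 = [1.0, 2.0, 3.0, 4.0]    # x in period t = 1
x2 = [2.0, 4.0, 3.0, 7.0]    # x in period t = 2

# Model (9) with u_it = 0: D_t = 0 in period 1, D_t = 1 in period 2.
y1 = [beta0 + beta1 * x + ai for x, ai in zip(x1, a)]
y2 = [beta0 + delta0 + beta1 * x + ai for x, ai in zip(x2, a)]

# Pooled OLS of y on x, ignoring a_i (and the time dummy).
_, pooled_slope = ols(x1 + x2, y1 + y2)

# First differences as in (7): a_i drops out.
dx = [p2 - p1 for p1, p2 in zip(x1, x2)]
dy = [q2 - q1 for q1, q2 in zip(y1, y2)]
fd_intercept, fd_slope = ols(dx, dy)

print(f"pooled OLS slope: {pooled_slope:.3f}")
print(f"FD slope:         {fd_slope:.3f}  (true beta1 = {beta1})")
print(f"FD intercept:     {fd_intercept:.3f}  (true delta0 = {delta0})")
```

In this constructed example the pooled slope is well above 2, whereas the FD regression returns $\beta_1$ and $\delta_0$ exactly because the errors were switched off.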
If $a_i$ is not correlated with the other explanatory variables, the model is called the random effects (RE) model. It is estimated with different techniques, which are supposed to yield more efficient estimators of the β-parameters than the fixed effects methods (which are basically OLS methods). We will return to the RE model later.

In the FE case, the estimators of the regression parameters obtained by applying OLS to the first-differenced equation are called first-differenced estimators (FD estimators).
We will deal with other fixed effects estimators later.

In summary: Differencing eliminates all unobserved time-invariant factors from the model.
A major pitfall is that differencing also wipes out observed time-invariant variables (like gender) from the model! FE cannot be used if we want to estimate these effects, or in cases where the explanatory variables change very slowly across time (the difference is nearly zero).

In many cases the FD method is useful, however. The following example highlights the biasing effect of unobserved factors and how panel estimation with the simple FD method likely solves the problem.

Example 2
Data set crime2.xls (Wooldridge) contains data on crime and unemployment rates for 46 US cities for 1982 (t = 1) and 1987 (t = 2).
Running a simple cross section regression of crmrte on unem using only the 1987 data yields

    . regress crmrte unem if year==87

          Source |       SS       df       MS              Number of obs =      46
    -------------+------------------------------           F(  1,    44) =    1.48
           Model |  1775.90928     1  1775.90928           Prob > F      =  0.2297
        Residual |  52674.6428    44  1197.15097           R-squared     =  0.0326
    -------------+------------------------------           Adj R-squared =  0.0106
           Total |  54450.5521    45  1210.01227           Root MSE      =    34.6

    -----------------------------------------------------
          crmrte |      Coef.   Std. Err.      t    P>|t|
    -------------+---------------------------------------
            unem |  -4.161134   3.416456    -1.22   0.230
           _cons |   128.3781   20.75663     6.18   0.000
    -----------------------------------------------------

The coefficient on unem is negative, −4.16! However, it is not statistically significant.
The estimate likely suffers from an omitted variables problem (age distribution, gender distribution, education levels, ...). Most of these can be expected to be fairly stable across time.
Thus, panel data techniques may be helpful.

Before proceeding to the panel data estimation, let us see what happens if we simply pool the two years and estimate

$$ crmrte = \beta_0 + \delta_0 D87 + \beta_1 unem + u, \tag{10} $$

where D87 is the year 1987 dummy.

    . regress crmrte d87 unem

          Source |       SS       df       MS              Number of obs =      92
    -------------+------------------------------           F(  2,    89) =    0.55
           Model |  989.717314     2  494.858657           Prob > F      =  0.5788
        Residual |  80055.7864    89  899.503218           R-squared     =  0.0122
    -------------+------------------------------           Adj R-squared = -0.0100
           Total |  81045.5037    91  890.609931           Root MSE      =  29.992

    -----------------------------------------------------
          crmrte |      Coef.   Std. Err.      t    P>|t|
    -------------+---------------------------------------
             d87 |   7.940413   7.975324     1.00   0.322
            unem |   .4265461   1.188279     0.36   0.720
           _cons |   93.42026   12.73947     7.33   0.000
    -----------------------------------------------------

The situation does not change much qualitatively: the coefficient on unem is still not statistically significant.

Stata, for example, has very sophisticated panel data procedures. We discuss some of them later.
The FD method can be applied with the regress routine after first declaring the data to be panel data with the xtset command (Menu: Statistics > Longitudinal/panel data > Setup and utilities > Declare data set to be panel data).
In EViews: Proc > Structure/Resize Current Page..., and follow the instructions.
In SAS:

    proc panel data = crime2;
      id state year;
      model crmrte = unem;
    run;

Before applying proc panel, the data must be sorted with proc sort.
Whichever software is used, identifiers for the individuals (in particular) are needed to indicate the multiple measurements on an individual.

After the panel structure has been declared to the program, the model

$$ \Delta crmrte = \delta_0 + \beta_1 \Delta unem + \Delta u \tag{11} $$

can be estimated with the FD method, e.g., in Stata as follows (d.crmrte means crmrte87 − crmrte82):

    . reg d.crmrte d.unem

          Source |       SS       df       MS              Number of obs =      46
    -------------+------------------------------           F(  1,    44) =    6.38
           Model |  2566.43056     1  2566.43056           Prob > F      =  0.0152
        Residual |  17689.5426    44  402.035059           R-squared     =  0.1267
    -------------+------------------------------           Adj R-squared =  0.1069
           Total |  20255.9732    45  450.132737           Root MSE      =  20.051

    -----------------------------------------------------
        D.crmrte |      Coef.   Std. Err.      t    P>|t|
    -------------+---------------------------------------
            unem |
             D1. |   2.217996   .8778657     2.53   0.015
                 |
           _cons |   15.40219   4.702116     3.28   0.002
    -----------------------------------------------------

In EViews, after the data have been reshaped to panel data, the FD estimation can be carried out using Quick > Estimate Equation... to open the Equation Estimation window, where entering d(crmrte) c d(unem) gives results similar to the above.

The coefficient estimate $\hat\beta_1 \approx 2.22$ is now statistically significant (p = 0.015) and of the expected sign.
The model predicts that a one percentage point increase in the unemployment rate increases the crime rate by about 2.2 crimes per 1,000 people.
The constant term indicates that even if the change in the unemployment rate were zero, the crime rate generally increased during the period from 1982 to 1987 by about 15.4 crimes per 1,000 people.
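For reference, the FD estimates can be summarized in the usual equation form, with standard errors in parentheses. All numbers are rounded from the Stata output of `reg d.crmrte d.unem`; the second line is a purely hypothetical illustration of how the fitted equation is used, for a city whose unemployment rate fell by two percentage points.

```latex
\widehat{\Delta crmrte} = \underset{(4.70)}{15.40} + \underset{(0.88)}{2.22}\,\Delta unem,
\qquad n = 46,\quad R^2 = 0.127

% Hypothetical illustration: \Delta unem = -2
\widehat{\Delta crmrte} = 15.40 + 2.22 \times (-2) = 10.96
```

In this illustration the predicted crime rate still rises by about 11 crimes per 1,000 people, because the common time effect captured by the constant outweighs the effect of the lower unemployment rate.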
Note that the time dummy coefficient $\delta_0$ in (11) captures all unobserved time effects that are common to all cross-sectional units.
That is, we can consider $\delta_0$ to represent

$$ \delta_0 = z_t'\delta = \delta_1 z_{1t} + \delta_2 z_{2t} + \cdots + \delta_p z_{pt}, $$

where the $z_t$'s are common trend components affecting all individual crime rates with the same intensity.

More than two time periods

Differencing can be used with more than two time periods to carry out fixed effects estimation.
As an example, consider a three-period model

$$ y_{it} = \delta_1 + \delta_2 D2_t + \delta_3 D3_t + \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it} \tag{12} $$

for t = 1, 2, 3, where $D2_t = 1$ for period t = 2 and zero otherwise, and $D3_t = 1$ for t = 3 and zero otherwise.
Differencing yields

$$ \Delta y_{it} = \delta_2 \Delta D2_t + \delta_3 \Delta D3_t + \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it}, \tag{13} $$

t = 2, 3.
Note: For t = 2, $\Delta D2_t = 1$ and $\Delta D3_t = 0 = D3_t$; for t = 3, $\Delta D2_t = -1$ and $\Delta D3_t = 1 = D3_t$.
Again, the model is simple to estimate with OLS.

Remark 1
The model in (13) is usually reparametrized into the equivalent form

$$ \Delta y_{it} = \alpha_0 + \alpha_3 D3_t + \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it}, \tag{14} $$

where $\alpha_0 = \delta_2$ and $\alpha_3 = \delta_3 - 2\delta_2$.
This generalizes to T time periods, with time dummies $D3_t, \ldots, DT_t$ in the differenced equation:

$$ \Delta y_{it} = \alpha_0 + \alpha_3 D3_t + \cdots + \alpha_T DT_t + \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it}. \tag{15} $$

Fixed effects method

An alternative method, which in certain cases works better than the FD method, is called the fixed effects method.
Consider the simple model

$$ y_{it} = \beta_1 x_{it} + a_i + u_{it}, \tag{16} $$

i = 1, ..., n, t = 1, ..., T.
Thus there are altogether n × T observations.
Define the means over the T time periods

$$ \bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}, \qquad \bar{x}_i = \frac{1}{T}\sum_{t=1}^{T} x_{it}, \qquad \bar{u}_i = \frac{1}{T}\sum_{t=1}^{T} u_{it}. \tag{17} $$

Then

$$ \bar{y}_i = \beta_1 \bar{x}_i + a_i + \bar{u}_i. \tag{18} $$

Note that

$$ \frac{1}{T}\sum_{t=1}^{T} a_i = \frac{1}{T}\, T a_i = a_i. $$

Thus, subtracting (18) from (16) eliminates $a_i$ and gives

$$ y_{it} - \bar{y}_i = \beta_1 (x_{it} - \bar{x}_i) + (u_{it} - \bar{u}_i) \tag{19} $$

or

$$ \dot{y}_{it} = \beta_1 \dot{x}_{it} + \dot{u}_{it}, \tag{20} $$

where, e.g., $\dot{y}_{it} = y_{it} - \bar{y}_i$ is the time-demeaned data on y.
This transformation is also called the within transformation, and the OLS estimators of the regression parameters obtained from (20) are called fixed effects estimators or within estimators.

In the two-period case the FD method and the FE method lead to identical results.

Remark 2
The slope coefficient $\beta_1$ estimated from (18) is called the between estimator; $v_i = a_i + \bar{u}_i$ is the error term. The estimator is biased, however, if the unobserved component $a_i$ is correlated with x.

Remark 3
When estimating the unobserved effects model by the fixed effects (FE) method, it is unfortunately not clear how the goodness-of-fit R-squared should be computed. Stata reports three different R-squared values: within, between, and overall.
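The within transformation (19)–(20) and the between estimator of Remark 2 can be sketched in a few lines. The panel below is hypothetical (three units, three periods, made-up values), the errors are set to zero, and the between regression is run with an intercept, which is an assumption of this sketch rather than something stated in the text above.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical balanced panel: n = 3 units, T = 3 periods (illustrative values).
beta1 = 2.0
a = [1.0, 3.0, 8.0]                      # unobserved effects, correlated with x
x = [[1.0, 2.0, 4.0],
     [2.0, 5.0, 3.0],
     [6.0, 7.0, 9.0]]
# Model (16) with u_it = 0.
y = [[beta1 * x_it + a_i for x_it in x_i] for x_i, a_i in zip(x, a)]

# Unit means over time, as in (17).
x_bar = [mean(x_i) for x_i in x]
y_bar = [mean(y_i) for y_i in y]

# Within transformation (19)-(20): subtract each unit's own time mean.
x_dot = [x_it - xb for x_i, xb in zip(x, x_bar) for x_it in x_i]
y_dot = [y_it - yb for y_i, yb in zip(y, y_bar) for y_it in y_i]

# Within (FE) estimator: OLS through the origin on the demeaned data.
within = sum(p * q for p, q in zip(x_dot, y_dot)) / sum(p * p for p in x_dot)

# Between estimator: OLS of unit means of y on unit means of x, cf. (18).
xm, ym = mean(x_bar), mean(y_bar)
between = (sum((xb - xm) * (yb - ym) for xb, yb in zip(x_bar, y_bar))
           / sum((xb - xm) ** 2 for xb in x_bar))

print(f"within estimator:  {within:.3f}  (true beta1 = {beta1})")
print(f"between estimator: {between:.3f}")
```

Because $a_i$ rises with the unit means of x in this constructed example, the between estimator is pushed well above 2, while the within estimator recovers $\beta_1$.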
Remark 4
Usually a full set of year dummies (i.e., year dummies for all years but the first) is included in FE estimation to capture time variation. However, the effect of any variable whose change across time is constant then cannot be estimated (an example of such a variable is experience measured in years; experience increases by one every year).

Remark 5
Although time-invariant variables cannot be included by themselves in an FE model, their interactions with year dummies can. For example, in a wage equation the interactions (year dummy) × (education) measure the change in the return to education over time.

Dummy variable regression

Yet another method is to introduce dummy variables for the cross-section units (N − 1 dummy variables) and (possibly) for the periods (T − 1 dummies).
If N and T are large, this is not very practical.
It gives the same estimates of the regression coefficients as the time-demeaning method, and the standard errors and major statistics are the same.

Example 3
Papke (1994), Journal of Public Economics 54, 37–49, studied the effect of the Indiana enterprise zone program on unemployment in the years 1980–1988 (Wooldridge's database, file: ezunem.xls).
Six zones were designated in 1984 and four more in 1985. Twelve cities did not receive a zone (control group).
An evaluation model of the policy is

$$ \log(uclms_{it}) = \theta_t + \beta_1 D_{it} + a_i + u_{it}, \tag{21} $$

where $\theta_t$ is a time-varying intercept, $uclms_{it}$ is the number of unemployment claims during year t in city i, and $D_{it} = 1$ if city i had a zone in year t and zero otherwise.

First-difference estimates for $\beta_1$:

    . reg d.luclms d82 d83 d84 d85 d86 d87 d88 d.ez

          Source |       SS       df       MS              Number of obs =     176
    -------------+------------------------------           F(  8,   167) =   34.50
           Model |  12.8826331     8  1.61032914           Prob > F      =  0.0000
        Residual |  7.79583815   167  .046681666           R-squared     =  0.6230
    -------------+------------------------------           Adj R-squared =  0.6049
           Total |  20.6784713   175  .118162693           Root MSE      =  .21606

    -----------------------------------------------------
        D.luclms |      Coef.   Std. Err.      t    P>|t|
    -------------+---------------------------------------
    (year dummy estimates omitted)
              ez |
             D1. |  -.1818775   .0781862    -2.33   0.021
           _cons |  -.3216319    .046064    -6.98   0.000
    -----------------------------------------------------

The estimate $\hat\beta_1 = -.182$ indicates that the presence of an EZ causes about a 16.6% fall in unemployment claims ($e^{-.182} - 1 = -.166$), which is both economically and statistically significant (t-value −2.33).

Fixed effects estimation results:

    . xtreg luclms d82 d83 d84 d85 d86 d87 d88 ez, fe

    R-sq:  within  = 0.8148                    F(8,168)   =   92.36
           between = 0.0002                    Prob > F   =  0.0000
           overall = 0.3415
    corr(u_i, Xb)  = -0.0040

    -----------------------------------------------------
          luclms |      Coef.   Std. Err.      t    P>|t|
    -------------+---------------------------------------
    (year dummy estimates omitted)
              ez |  -.1044148    .059753    -1.75   0.082
           _cons |   11.53358   .0325925   353.87   0.000
    -------------+---------------------------------------
         sigma_u |  .55551522
         sigma_e |  .21619434
             rho |  .86846297   (fraction of variance due to u_i)
    -----------------------------------------------------
    F test that all u_i=0:  F(21, 168) = 59.31    Prob > F = 0.0000

Dummy variable regression:

    . reg luclms d82 d83 d84 d85 d86 d87 d88 ///
          c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 ///
          c14 c15 c16 c17 c18 c19 c20 c21 c22 ez

          Source |       SS       df       MS              Number of obs =     198
    -------------+------------------------------           F( 29,   168) =   68.35
           Model |  92.6439601    29  3.19461932           Prob > F      =  0.0000
        Residual |  7.85231887   168  .046739993           R-squared     =  0.9219
    -------------+------------------------------           Adj R-squared =  0.9084
           Total |  100.496279   197  .510133396           Root MSE      =  .21619

    -----------------------------------------------------
          luclms |      Coef.   Std. Err.      t    P>|t|
    -------------+---------------------------------------
    (dummy variable results omitted)
              ez |  -.1044148    .059753    -1.75   0.082
           _cons |   11.51534   .0799536   144.03   0.000
    -----------------------------------------------------

The results show that the FE and dummy variable regression (DVRM) results for the ez coefficient (estimate and standard error) are exactly the same.
Using the FE results, the coefficient −0.104 implies about a 10 percent drop in unemployment claims due to the program ($e^{-.104} - 1 \approx -.099$; the log approximation gives 10.4 percent).
The estimate is statistically significant at the 5% level in one-tailed testing but not in two-tailed testing (two-sided p-value 0.082).

Fixed effects or first differencing?

If the number of periods is 2 (T = 2), FE and FD give identical results.
When T ≥ 3, FE and FD are not the same.
Both are unbiased under assumptions FE.1–FE.4:

FE.1 For each i, the model is $y_{it} = \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}$, t = 1, ..., T.
FE.2 We have a random sample from the cross section.
FE.3 Each explanatory variable changes over time, and the explanatory variables are not perfectly collinear.
FE.4 $E[u_{it} \mid X_i, a_i] = 0$ for all time periods ($X_i$ stands for all explanatory variables).
FE.5 $\mathrm{var}[u_{it} \mid X_i, a_i] = \sigma_u^2$ for all t = 1, ..., T.
FE.6 $\mathrm{cov}[u_{it}, u_{is}] = 0$ for all $t \neq s$.
FE.7 $u_{it} \mid X_i, a_i \sim NID(0, \sigma_u^2)$.

Both are consistent under assumptions FE.1–FE.4 for fixed T as n → ∞.

If $u_{it}$ is serially uncorrelated, FE is more efficient than FD (because of this, FE is more popular).
If $u_{it}$ is (highly) serially correlated, $\Delta u_{it}$ may be less serially correlated, which may favor FD over FE.
However, T is typically rather small, so that serial correlation is difficult to detect.
In sum, there are no clear-cut guidelines for choosing between the two. Thus, good advice is to try both and, if there is a big difference, to try to determine why they differ.

Balanced and unbalanced panels

A data set is called a balanced panel if the same number of time series observations is available for each cross-section unit. That is, T is the same for all individuals. The total number of observations in a balanced panel is nT.
All the above examples are balanced panel data sets.
If some cross-section units have missing observations, so that for individual i there are $T_i$ time period observations available, i = 1, ..., n, with $T_i \neq T_j$ for some i and j, we call the data set an unbalanced panel. The total number of observations in an unbalanced panel is $T_1 + \cdots + T_n$.
In most cases unbalanced panels do not cause major problems for fixed effects estimation.
Modern software packages make appropriate adjustments to the estimation results.

Random effects models

Consider the simple unobserved effects model

$$ y_{it} = \beta_0 + \beta_1 x_{it} + a_i + u_{it}, \tag{22} $$

i = 1, ..., n, t = 1, ..., T.
Typically, time dummies are also included in (22).
Using FD or FE eliminates the unobserved component $a_i$.
However, if $a_i$ is uncorrelated with $x_{it}$, random effects (RE) estimation can lead to more efficient estimation of the regression parameters.
Generally, we call the model in equation (22) the random effects model if $a_i$ is uncorrelated with all explanatory variables, i.e.,

$$ \mathrm{cov}[x_{it}, a_i] = 0, \quad t = 1, \ldots, T. \tag{23} $$

How to estimate $\beta_1$ efficiently?
If (23) holds, $\beta_1$ can be estimated consistently from a single cross section.
Obviously this discards a lot of useful information.

If the data set is simply pooled and the error term is denoted $v_{it} = a_i + u_{it}$, we have the regression

$$ y_{it} = \beta_0 + \beta_1 x_{it} + v_{it}. \tag{24} $$

Then

$$ \mathrm{corr}[v_{it}, v_{is}] = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_u^2} \tag{25} $$

for $t \neq s$, where $\sigma_a^2 = \mathrm{var}[a_i]$ and $\sigma_u^2 = \mathrm{var}[u_{it}]$.
That is, the error terms $v_{it}$ are (positively) autocorrelated, which biases the standard errors of the OLS $\hat\beta_1$.

If $\sigma_a^2$ and $\sigma_u^2$ were known, optimal estimators (BLUE) would be obtained by generalized least squares (GLS), which in this case reduces to estimating the regression slope coefficients from the quasi-demeaned equation

$$ y_{it} - \lambda \bar{y}_i = \beta_0 (1 - \lambda) + \beta_1 (x_{it} - \lambda \bar{x}_i) + (v_{it} - \lambda \bar{v}_i), \tag{26} $$

where

$$ \lambda = 1 - \left( \frac{\sigma_u^2}{\sigma_u^2 + T \sigma_a^2} \right)^{1/2}. \tag{27} $$

In practice $\sigma_u^2$ and $\sigma_a^2$ are unknown, but they can be estimated.

One method is to estimate (24) from the pooled data set, use the OLS residuals $\hat{v}_{it}$ to estimate $\sigma_a^2$ and $\sigma_u^2$, and plug the estimates into (27).
The resulting GLS estimators of the regression slope coefficients are called random effects estimators (RE estimators).
Under the random effects assumptions (see footnote 2) the estimators are consistent, but not unbiased. They are also asymptotically normal as n → ∞ for fixed T.
However, with small n and large T, the properties of the RE estimator are largely unknown.

Footnote 2: The ideal random effects assumptions include FE.1, FE.2, FE.4–FE.6. FE.3 is replaced with RE.3: There are no perfect linear relationships among the explanatory variables.
RE.4: In addition of FE.4, E[ai |Xi ] = 0. Seppo Pynnönen Econometrics II Panel Data Random effects models It is notable that λ = 1 results in (26) results to the pooled regression and FE obtained with λ = 0. RE estimation is available in modern statistical packages with different options. Example 4 Data set wagepan.xls (Wooldridge): n = 545, T = 8. Is there a wage premium in belonging to labor union? log(wageit ) = β0 + β1 educit + β3 exprit + β4 expr2it +β5 marriedit + β6 unionit + ai + uit Year dummies for 1980–1987 are included. It is notable that with inclusion of full set of year dummies implies that one cannot estimate with the FE method effects that change a constant amount over time. Experience (exper) is such a variable. Seppo Pynnönen Econometrics II Panel Data Random effects models ------------------------------------------lwage | Pooled Random Fixed | OLS Effects Effects --------+---------------------------------educ | .0989945 .0906150 .. | (.0046227) (.0105807) exper | .0861696 .1027934 .. | (.0101415) (.0153853) exper2 | -.0027349 -.0046859 -.0051855 | (.0007099) (.0006896) (.0007044) married | .1230113 .0678821 .0466804 | (.0155714) (.0167369) (.0183104) union | .1685243 .1031103 .0800019 | (.0170652) (.0178388) (.0193103) ------------------------------------------- It is notable that OLS standard errors tend to be smaller than in the RE or FE cases. OLS standard errors underestimate the true standard errors. OLS coefficient estimates also suffer from the omitted variable problem accounted in panel estimation. Stata estimate of the correlation in (25) is .464. Seppo Pynnönen Econometrics II Panel Data Random effects or fixed effects 1 Panel Data Pooling independent cross section across time Fixed effects model Two-period panel data analysis More than two time periods Fixed effects method Dummy variable regression Fixed effects or first differencing? 

Random effects or fixed effects

FE is widely considered preferable because it allows correlation between $a_i$ and the x variables. Provided that the common effects aggregated into $a_i$ are not correlated with the x variables, an obvious advantage of RE is that it also allows estimation of the effects of factors that do not change over time (like education in the above example). Typically, however, the condition that the common effect $a_i$ is uncorrelated with the regressors (x variables) should be considered more an exception than a rule, which favors FE.

Hausman specification test

Hausman (1978) devised a test for the orthogonality of the common effects ($a_i$) and the regressors. The test compares the fixed effects (OLS) and random effects (GLS) estimates using the Wald testing approach.

The basic idea of the test relies on the fact that under the null hypothesis of orthogonality both OLS and GLS are consistent, while under the alternative hypothesis GLS is not consistent. Thus, under the null hypothesis the OLS and GLS estimates should not differ much from each other. The Wald statistic measures this difference.
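
For reference, the Wald-type statistic behind this comparison is usually written as below. The notation is an assumption rather than the slides' own: $\hat\beta_{FE}$ and $\hat\beta_{RE}$ collect the $k$ coefficients that both methods can estimate, and $\widehat{\operatorname{Var}}$ denotes the estimated covariance matrices. The variance of the difference reduces to a difference of variances because the RE estimator is efficient under the null hypothesis.

```latex
H = (\hat\beta_{FE} - \hat\beta_{RE})'
    \left[\widehat{\operatorname{Var}}(\hat\beta_{FE}) - \widehat{\operatorname{Var}}(\hat\beta_{RE})\right]^{-1}
    (\hat\beta_{FE} - \hat\beta_{RE})
    \;\overset{a}{\sim}\; \chi^2_k
    \quad \text{under } H_0 .
```

Large values of $H$ indicate that the two sets of estimates differ systematically, which speaks against orthogonality.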
In Stata, performing the Hausman test requires that both the OLS (fixed effects) and GLS (random effects) regression results are stored so that they are available to the postestimation hausman command.

Example 5. Applying the Hausman test in Stata to the case of Example 4 yields:

* Estimate the fixed effects model
xtreg lwage y81 y82 y83 y84 y85 y86 y87 exper2 married union, fe
* Store the results as "hfixed"
estimates store hfixed
* Estimate the random effects model
xtreg lwage y81 y82 y83 y84 y85 y86 y87 educ exper exper2 married union, re
* Store the results as "hrandom"
estimates store hrandom
* Hausman test
hausman hfixed hrandom

                ---- Coefficients ----
        |     (b)          (B)         (b-B)      sqrt(diag(V_b-V_B))
        |    hfixed      hrandom     Difference          S.E.
--------+------------------------------------------------------------
    y81 |   .1511912    .0427498     .1084414              .
    y82 |   .2529709     .035577     .2173939              .
    y83 |   .3544437    .0270943     .3273494              .
    y84 |   .4901148     .052207     .4379078              .
    y85 |   .6174822    .0690524     .5484299              .
    y86 |   .7654965    .1053229     .6601736              .
    y87 |   .9250249    .1505464     .7744785              .
 exper2 |  -.0051855   -.0046859    -.0004996        .000144
married |   .0466804    .0678821    -.0212017       .0074261
  union |   .0800019    .1031103    -.0231085       .0073935
---------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from xtreg
B = inconsistent under Ha, efficient under Ho; obtained from xtreg

Test: Ho: difference in coefficients not systematic

    chi2(10)  = (b-B)'[(V_b-V_B)^(-1)](b-B) = 26.77
    Prob>chi2 = 0.0028
    (V_b-V_B is not positive definite)

The test rejects the orthogonality condition. Thus, FE should be used.
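
The reported tail probability can be checked directly from the test statistic. The sketch below is an illustration and not part of the Stata output: the helper name is hypothetical, and it relies on the closed-form upper tail of the chi-square distribution that exists only for even degrees of freedom, which covers the chi2(10) case here.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k the tail equals exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only handles positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j   # builds (x/2)^j / j! recursively
        total += term
    return math.exp(-half) * total

# Statistic and degrees of freedom as reported by the hausman command
p_value = chi2_sf_even_df(26.77, 10)
print(f"Prob > chi2 = {p_value:.4f}")
```

The result agrees with the Prob>chi2 value shown in the output above, so the rejection at conventional significance levels follows from the statistic 26.77 alone.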
In EViews, the Hausman test is obtained by first estimating the model as a random effects model and then selecting View > Fixed/Random Effects Testing > Correlated Random Effects - Hausman Test.

Policy analysis with panel data

Panel data are useful for policy analysis, in particular program evaluation.

Example 6. Continue Example 1.2, where the effect of a training program on worker productivity was evaluated. The data include three years: 1987, 1988, and 1989. The training program was first implemented in 1988. We focus on the years 1987 (no program) and 1988 (program implemented) to see whether the program benefits firms. The panel model is

$$\log(\text{scrap}_{it}) = \beta_0 + \delta_0\, y88 + \beta_1\,\text{grant}_{it} + a_i + u_{it}, \tag{28}$$

where $y88$ is the year 1988 dummy (= 1 for year 1988 and = 0 otherwise) and $a_i$ includes the unobserved firm effects (worker skill, etc.).

Ignoring the panel structure, the OLS results suggest no improvement.

Dependent Variable: LOG(SCRAP)
Method: Panel Least Squares
Sample: 1 471 IF YEAR < 1989
Periods included: 2
Cross-sections included: 54
Total panel (balanced) observations: 108
===========================================================
Variable     Coefficient   Std. Error   t-Statistic   Prob.
-----------------------------------------------------------
C               0.523144     0.159783      3.274086  0.0014
GRANT          -0.058018     0.380949     -0.152299  0.8792
-----------------------------------------------------------
R-squared             0.000219
Adjusted R-squared   -0.009213
S.E. of regression    1.507393
F-statistic           0.023195
Prob(F-statistic)     0.879241
===========================================================

The coefficient for grant is not statistically significant, suggesting that the program does not help in reducing the scrap rate.

Accounting for the possible firm effects, and also including the year dummy to account for a possible time effect, yields

===========================================================
Variable     Coefficient   Std. Error   t-Statistic   Prob.
-----------------------------------------------------------
C               0.568716     0.048603     11.70126   0.0000
GRANT          -0.317058     0.163875     -1.934753  0.0585
-----------------------------------------------------------
Effects Specification
Cross-section fixed (dummy variables)
Period fixed (dummy variables)
-----------------------------------------------------------
R-squared             0.964308
Adjusted R-squared    0.926556
S.E. of regression    0.406642
F-statistic          25.54364
Prob(F-statistic)     0.000000
===========================================================

The estimated coefficient for grant is negative and close to statistically significant in two-sided testing. In one-sided testing against the alternative $H_1: \beta_1 < 0$ (the program improves productivity), it is significant at the 5% level with p-value 0.0265.

Dynamic Panel Models

Many economic relationships are dynamic. These may be characterized by the presence of lagged dependent variables

$$y_{it} = \delta y_{i,t-1} + x_{it}'\beta + v_{it}, \tag{29}$$

where

$$v_{it} = a_i + u_{it} \tag{30}$$

with $a_i \sim \text{iid}(0, \sigma_a^2)$ and $u_{it} \sim \text{iid}(0, \sigma_u^2)$ independent of each other, $i = 1, \ldots, n$, $t = 1, \ldots, T$.
Alternatively, the one-way error component model $v_{it} = a_i + u_{it}$ in (30) can be replaced by a two-way specification such that

$$v_{it} = a_i + b_t + u_{it}, \tag{31}$$

where all the components are again assumed independent.

After first differencing (29) under the one-way specification (30), $a_i$ drops out and we have

$$\Delta y_{it} = \delta\,\Delta y_{i,t-1} + \Delta x_{it}'\beta + \Delta u_{it}. \tag{32}$$

The regressor $\Delta y_{i,t-1} = y_{i,t-1} - y_{i,t-2}$ contains the lagged term $y_{i,t-1}$, which is correlated with $u_{i,t-1}$, a component of $\Delta u_{it} = u_{it} - u_{i,t-1}$. This causes problems in estimation.

Once regressor variables are correlated with the error term, OLS and GLS estimators become inconsistent. A typical solution to the problem is to apply some kind of instrumental variable estimation. These are least squares (LS) or other types of methods in which instrumental variables are used to remove the inconsistency caused by the correlation between the error term and the regressors.

A variable is suitable as an instrumental variable if it is not correlated with the error term but is correlated with the regressors. Thus, those regressors that are not correlated with the error term can also be used as instruments.

Example 7. 2SLS (two-stage least squares). Consider a standard regression model

$$y_i = x_i'\beta + u_i, \tag{33}$$

where $x_i$ is a $k$-vector of regressors (including the constant term) with $\operatorname{cov}[x_i, u_i] \neq 0$, $i = 1, \ldots, n$.

Suppose we have $m \geq k$ additional variables in $z_i$ (an $m$-vector) such that $\operatorname{cov}[z_i, u_i] = 0$ but $\operatorname{cov}[z_i, x_i] \neq 0$.

The 2SLS solution to the problem is to first (first stage) use OLS to regress the x-variables on the z-variables. In the second stage, replace the original regressors $x_i$ by the predicted variables $\hat{x}_i$ from the first stage, and estimate $\beta$ from the regression

$$y_i = \hat{x}_i'\beta + u_i. \tag{34}$$

The resulting estimator

$$\hat\beta_{2SLS} = (\hat{X}'\hat{X})^{-1}\hat{X}'y \tag{35}$$

is called the 2SLS estimator of $\beta$.
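
The two stages can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions, not the course's own code: the data-generating process, the sample size, and all function names are hypothetical, and there is only one endogenous regressor and one instrument besides the constant, so every regression reduces to a ratio of sample covariances.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divisor n; the divisor cancels in every ratio below)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def ols_simple(x, y):
    """OLS of y on a constant and one regressor x; returns (intercept, slope)."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope

def two_stage_least_squares(y, x, z):
    """2SLS with one endogenous regressor x and one instrument z."""
    # First stage: regress x on the instrument and form fitted values x_hat.
    g0, g1 = ols_simple(z, x)
    x_hat = [g0 + g1 * zi for zi in z]
    # Second stage: regress y on the fitted values, as in equation (34).
    return ols_simple(x_hat, y)

# Hypothetical data: x depends on the error u, z does not; true slope is 2.
random.seed(0)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
x = [zi + 0.8 * ui + 0.5 * random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]

_, b_ols = ols_simple(x, y)
_, b_2sls = two_stage_least_squares(y, x, z)
b_iv = cov(z, y) / cov(z, x)   # covariance-ratio form for one instrument

print("OLS slope :", round(b_ols, 3))
print("2SLS slope:", round(b_2sls, 3))
print("IV ratio  :", round(b_iv, 3))
```

In this design the OLS slope is pulled away from the true value 2 because x is correlated with u, while the 2SLS slope stays close to it. With a single instrument the 2SLS slope coincides with the ratio cov(z, y) / cov(z, x) up to rounding error.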
In particular, if $m = k$ then (35) becomes

$$\hat\beta_{IV} = (Z'X)^{-1}Z'y, \tag{36}$$

which is called the Instrumental Variable estimator of $\beta$.

Example 8. (Data: http://eu.wiley.com/college/baltagi/ > Student companion site > datasets)

Demand for cigarettes in 46 US States [annual data, 1963–1992]. The estimated equation is

$$c_{it} = \alpha + \beta_1 c_{i,t-1} + \beta_2 p_{it} + \beta_3 y_{it} + \beta_4 pn_{it} + v_{it}, \tag{37}$$

$$v_{it} = a_i + b_t + u_{it}, \tag{38}$$

where $a_i$ and $b_t$ are fixed effects, $u_{it} \sim \text{NID}(0, \sigma_u^2)$, and all the observable variables are in logarithms:

$c_{it}$ = real per capita sales of cigarettes by persons of smoking age (14 and older)
$p_{it}$ = real average retail price of a pack of cigarettes
$y_{it}$ = real per capita disposable income
$pn_{it}$ = the minimum real price of cigarettes in any neighboring state (proxy for the casual smuggling effect across state borders)

$c_{i,t-1}$ is very likely correlated with $u_{it}$.

For reference purposes, estimating with panel OLS (average of within group regressions with time dummies) yields

Fixed-effects (within) regression          Number of obs      =    1334
Group variable: state                      Number of groups   =      46

R-sq:  within  = 0.9283                    Obs per group: min =      29
       between = 0.9859                                   avg =    29.0
       overall = 0.9657                                   max =      29

                                           F(32,1256)         =  508.07
corr(u_i, Xb)  = 0.4743                    Prob > F           =  0.0000

-----------------------------------------------------
          lc |     Coef.   Std. Err.       t    P>|t|
-------------+---------------------------------------
          lc |
         L1. |  .8302514   .0126242    65.77    0.000
             |
          lp | -.2916822   .0230847   -12.64    0.000
          ly |  .1068698   .0233417     4.58    0.000
         lpn |  .0354559     .02656     1.33    0.182
       _cons |  .8204374   .2228775     3.68    0.000
-------------+---------------------------------------
     sigma_u |  .02738301
     sigma_e |  .03504776
         rho |  .37905103   (fraction of variance due to u_i)
-----------------------------------------------------
F test that all u_i=0:  F(45, 1256) = 4.52    Prob > F = 0.0000

Several methods have been proposed for estimation when there is potential correlation between the error term and (some) regressors. GMM (Generalized Method of Moments) estimation has recently gained much popularity, in particular when there are non-linear moment restrictions.

Stata's xtdpd procedure produces the Arellano-Bond or the Arellano-Bover/Blundell-Bond estimator, both of which are GMM estimators in which the instruments are defined in a particular way (the idea will be discussed in the classroom).

xtdpd l(0/1).lc lp ly lpn y66-y92, div(lp ly lpn y66-y92) dgmmiv(lc)

Dynamic panel-data estimation              Number of obs      =    1334
Group variable: state                      Number of groups   =      46
Time variable: year
                                           Obs per group: min =      29
                                                          avg =      29
                                                          max =      29

Number of instruments =    437             Wald chi2(31)      = 13273.45
                                           Prob > chi2        =   0.0000
One-step results
-----------------------------------------------------
          lc |     Coef.   Std. Err.       z    P>|z|
-------------+---------------------------------------
          lc |
         L1. |  .8201729   .0161446    50.80    0.000
             |
          lp | -.3607549   .0311244   -11.59    0.000
          ly |  .1871102   .0334027     5.60    0.000
         lpn | -.0215713   .0399233    -0.54    0.589
-----------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/.).lc
        Standard: D.lp D.ly D.lpn D.y66 D.y67 D.y68 D.y69 D.y70 D.y71
                  D.y72 D.y73 D.y74 D.y75 D.y76 D.y77 D.y78 D.y79 D.y80
                  D.y81 D.y82 D.y83 D.y84 D.y85 D.y86 D.y87 D.y88 D.y89
                  D.y90 D.y91 D.y92
Instruments for level equation
        Standard: _cons

Test for the orthogonality conditions of the instruments:

Sargan test of overidentifying restrictions
        H0: overidentifying restrictions are valid

        chi2(405)    =  561.5047
        Prob > chi2  =    0.0000

The orthogonality conditions are rejected. The reason may be that the errors are MA(1), which implies that the GMM instruments $(lc_{t-2}, \ldots)$ are correlated with the error term. One can try to fix this by letting the GMM instruments start from lag $t - 3$ with the option ... dgmmiv(lc, lagrange(3 .)). Doing this improved the situation slightly but still led to rejection of the orthogonality conditions.