PANEL DATA

Data sets that combine time series and cross sections are called longitudinal or panel
data sets. Panel data sets are more oriented towards cross-section analysis – they are
wide but typically short (in terms of observations over time). Heterogeneity across
units is central to the analysis of panel data. The basic framework is a
regression of the form:
Yit = Xitβ + Ziπ + εit
(1)
X has k columns and does not include a constant term. The heterogeneity or
individual effect is Ziπ, where Z contains a constant term and a set of individual or
group specific variables, such as gender, location, etc. We will consider two cases:
Fixed Effects If Zi is unobserved but correlated with Xit, then OLS estimators of β are
biased. However, in this case αi = Ziπ embodies all the observable effects and
specifies an estimable equation. This takes αi to be a group specific constant term.
Random Effects If the unobserved heterogeneity, however formulated, can be assumed
to be uncorrelated with Xit, then:
Yit = Xitβ + E[Ziπ] + { Ziπ - E[Ziπ] } + εit
(2)
= Xitβ + α + ui + εit
(3)
This random effects approach specifies that ui is a group specific random element
which although random is constant for that group throughout the time period.
FIXED EFFECTS
This assumes that differences across units of observation can be captured by
differences in the constant term. Each αi is estimated:
Yi = Xiβ + iαi + εi
(4)
where Yi is a Tx1 column of observations on individual (group) i over T time
periods; with m groups the total sample size is mT. Here i is the Tx1 unit vector
{1,1,….,1}. Collecting the groups together, with D the matrix of group dummy
variables, we get:
Y = Xβ + Dα + ε
(5)
The model is usually referred to as the least squares dummy variable (LSDV) model.
This is a classical regression model and no new methodology or tests are needed to
analyse it. In effect we simply regress Y on X plus a dummy variable for each group.
Of course if the number of groups is very large this presents computational
problems. To tackle this we can rewrite the regression in terms of deviations from
group means.
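The LSDV regression in (5) can be sketched with a few lines of numpy. Everything below is illustrative – the group sizes, coefficient values, and variable names are made up for the example rather than taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m, T = 4, 10                      # 4 groups, each observed over 10 periods
n = m * T
group = np.repeat(np.arange(m), T)

alpha = np.array([1.0, -0.5, 2.0, 0.3])   # group specific constant terms
beta = 1.5
x = rng.normal(size=n)
y = alpha[group] + beta * x + 0.1 * rng.normal(size=n)

# LSDV: regress y on x plus one dummy per group (no overall constant)
D = (group[:, None] == np.arange(m)[None, :]).astype(float)
Z = np.column_stack([x, D])
coefs, *_ = np.linalg.lstsq(Z, y, rcond=None)
beta_hat, alpha_hat = coefs[0], coefs[1:]
```

With only four groups the dummy-variable matrix D is trivial; the computational problem mentioned above arises when m runs into the thousands, as in the firm-level example later.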
The Within and Between Groups Estimators
We can formulate a pooled regression in three ways:
(i) The original formulation:
Yit = Xitβ + αi + εit
(6)
(ii) Deviations from Group Means
Yit - MYi= (Xit - MXi)β + εit - Mεi
(7)
Where MYi denotes the mean of the observations for the i’th group across the T
observations, etc.
NOTE: The R2 related to (7) is known as the within groups R2.
(iii) In terms of the group means:
MYi = MXiβ + αi + Mεi
(8)
NOTE: The R2 related to (8) is known as the between groups R2.
All three are classical regression models and in principle could be estimated by OLS.
Because we have a large number of dummy variables to put in, we proceed by
estimating (7), obtaining an estimate of β. Now rearrange (8) and assume Mεi = 0:
αi = MYi - MXiβ
(9)
That is, we can get estimates of αi by subtracting from the mean of Y for
each group the means of X multiplied by the estimated β:
α̂i = MYi - MXiβ̂
(10)
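The within-groups route – estimate β from the demeaned regression (7), then recover each group effect from (9) – can be sketched as follows. The data are synthetic and the numbers illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, T = 5, 8
group = np.repeat(np.arange(m), T)
alpha = rng.normal(size=m)        # true group effects
beta = 2.0
x = rng.normal(size=m * T)
y = alpha[group] + beta * x + 0.1 * rng.normal(size=m * T)

# group means MY_i and MX_i
my = np.array([y[group == i].mean() for i in range(m)])
mx = np.array([x[group == i].mean() for i in range(m)])

# (7): within-groups regression on deviations from group means
y_w = y - my[group]
x_w = x - mx[group]
beta_w = (x_w @ y_w) / (x_w @ x_w)

# (9)/(10): recover the group effects from the group means
alpha_hat = my - mx * beta_w
```

No dummy variables are ever constructed, which is the point: the within transformation gives the same β̂ as LSDV at a fraction of the computational cost.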
Unbalanced Panels and Fixed Effects
If we have missing data we have what is called an unbalanced panel. The required
modifications are relatively simple. With a balanced panel the sample size is n = mT.
With an unbalanced panel it is n = ΣTi. Hence instead of calculating group means on
the basis of a common T we have a specific sample size Ti for each group.
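The only change needed in code is to divide each group total by its own Ti. A minimal illustration with made-up numbers:

```python
import numpy as np

# an unbalanced panel: group i is observed T_i times
group = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])   # T_0=3, T_1=2, T_2=4
y = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 1.0, 1.0, 2.0, 4.0])

T_i = np.bincount(group)                   # group-specific sample sizes
my = np.bincount(group, weights=y) / T_i   # group mean over each group's T_i obs
n = T_i.sum()                              # total sample size n = sum of T_i
```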
RANDOM EFFECTS
The fixed effects model allows the unobserved individual effects to be correlated with
the included variables; the differences between units are then modelled as shifts in
the constant term. If the individual effects are uncorrelated with the regressors then
the random effects model is appropriate. The gain from this approach is that it
substantially reduces the number of parameters to be estimated. The cost is the
possibility of inconsistent estimates should the assumption be inappropriate. For
random effects we reformulate the basic model:
Yit = Xitβ + (α + ui) + εit
(11)
There is now a single constant term which is the mean of the unobserved
heterogeneity, E(Ziπ). ui is the random heterogeneity specific to the i’th observation
and is constant throughout time. For example in an analysis of firms it is the factors
which we cannot measure which are still specific to that firm. We define:
ηit = ui + εit
and ηi = [ηi1, ηi2,……….., ηiT]'
This gives us what is called an “error components model”. For this:
E[ηit2 |X] = σε2 + σu2
E[ηit ηis |X] = σu2 t ≠s
E[ηit ηjs |X] = 0 for all t and s, i ≠ j
Then:
Σ = E[ηi ηi' |X] =

    [ σε2 + σu2      σu2       ……      σu2       ]
    [ σu2            σε2 + σu2 ……      σu2       ]
    [ ……             ……        ……      ……        ]
    [ σu2            σu2       ……      σε2 + σu2 ]
Since observations on units i and j are independent, the disturbance covariance
matrix for the full nT observations is block diagonal:

    [ Σ  0  ……  0 ]
Ω = [ 0  Σ  ……  0 ]  = In ⊗ Σ   (Kronecker multiplication)
    [ …………………… ]
    [ 0  0  ……  Σ ]
We then estimate the model using GLS and the standard formula:
β̂ = (X'Ω-1X)-1X'Ω-1Y
(12)
It can be shown that the GLS estimator is, like the OLS estimator, a matrix weighted
average of the within and between units estimators. The inefficiency of OLS follows
from an inefficient weighting of the two sets of estimates: OLS places too much
weight on the between units variation.
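Equation (12) can be computed directly once Ω is built from its blocks via the Kronecker product. This is a sketch on simulated data with assumed variance components (in practice these must themselves be estimated, as described next):

```python
import numpy as np

rng = np.random.default_rng(2)
n_groups, T, k = 30, 5, 2
sigma_e2, sigma_u2 = 1.0, 0.5     # assumed (known) variance components

# error-components block for one group: sigma_e^2 * I_T + sigma_u^2 * J_T
Sigma = sigma_e2 * np.eye(T) + sigma_u2 * np.ones((T, T))
Omega = np.kron(np.eye(n_groups), Sigma)   # block diagonal: groups independent

beta = np.array([1.0, -2.0])
X = rng.normal(size=(n_groups * T, k))
u = rng.normal(scale=np.sqrt(sigma_u2), size=n_groups)
eps = rng.normal(scale=np.sqrt(sigma_e2), size=n_groups * T)
y = X @ beta + np.repeat(u, T) + eps

# GLS, equation (12): beta = (X' Omega^-1 X)^-1 X' Omega^-1 y
Oi = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)
```

Inverting the full nT×nT matrix Ω is wasteful; production implementations exploit the block structure (each Σ-block has a closed-form inverse), but the brute-force version above makes the formula transparent.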
As usual we have the problem that we do not know the variances and covariances
which comprise Σ. We begin by estimating (7):
Yit - MYi = (Xit - MXi)β + εit - Mεi
(7)
Using this we can get an estimate of σε2. It remains to estimate σu2. Return to the
original model in (3):
Yit = Xitβ + α + ui + εit
(3)
This is a classical least squares model in which OLS is consistent and unbiased,
although inefficient. The probability limit of the mean squared residuals from this
regression equals:
σε2 + σu2
Estimate this, subtract the estimate of σε2, and we have our estimate of σu2.
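This two-step estimation of the variance components can be sketched as follows (simulated data, illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(3)
m, T = 200, 5
group = np.repeat(np.arange(m), T)
sigma_e, sigma_u = 1.0, 0.7       # true values: sigma_e^2 = 1.0, sigma_u^2 = 0.49
x = rng.normal(size=m * T)
u = rng.normal(scale=sigma_u, size=m)
y = 1.0 + 2.0 * x + u[group] + rng.normal(scale=sigma_e, size=m * T)

# within regression (7): residual variance consistently estimates sigma_e^2
mx = np.bincount(group, weights=x) / T
my = np.bincount(group, weights=y) / T
xw, yw = x - mx[group], y - my[group]
b_w = (xw @ yw) / (xw @ xw)
e_w = yw - xw * b_w
s2_e = (e_w @ e_w) / (m * (T - 1) - 1)

# pooled OLS on (3): residual variance estimates sigma_e^2 + sigma_u^2
X = np.column_stack([np.ones(m * T), x])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e_ols = y - X @ b_ols
s2_total = (e_ols @ e_ols) / (m * T - 2)

s2_u = s2_total - s2_e            # estimate of sigma_u^2 by subtraction
```

In small samples the subtraction can go negative, in which case σu2 is conventionally set to zero.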
Testing for Random effects
Breusch and Pagan (1980) have devised a Lagrange multiplier test for the random
effects model based on the OLS residuals. For:
H0: σu2 = 0
H1: σu2 ≠ 0
the statistic is:
LM = nT/[2(T-1)] [ Σi(Tei.)2 / Σi Σt eit2 - 1 ]2
where ei. is the mean residual for unit (e.g. firm or country) i, so that Tei. = Σt eit,
and the denominator is the sum of squared residuals from the least squares
regression. This is Greene's form of the statistic (with slightly changed
terminology); note the outer squared term, which is easily dropped in error –
this has been checked and the squared form is correct. LM is distributed as χ2
with one degree of freedom; if it exceeds the critical value we conclude OLS is
inappropriate and random effects is preferable.
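The LM statistic is straightforward to compute from the pooled OLS residuals. A sketch on simulated data with a genuine random effect, where the test should reject H0 (the data and function name are illustrative):

```python
import numpy as np

def breusch_pagan_lm(e, group, T):
    """LM = nT/[2(T-1)] * [ sum_i (sum_t e_it)^2 / sum_i sum_t e_it^2 - 1 ]^2."""
    n = len(np.unique(group))                 # n = number of groups
    group_sums = np.bincount(group, weights=e)  # T * e_i. for each group
    ratio = (group_sums ** 2).sum() / (e ** 2).sum()
    return (n * T) / (2 * (T - 1)) * (ratio - 1) ** 2

rng = np.random.default_rng(4)
n, T = 100, 6
group = np.repeat(np.arange(n), T)
x = rng.normal(size=n * T)

# simulate data with a genuine random effect, so H0: sigma_u^2 = 0 is false
u = rng.normal(scale=1.0, size=n)
y = 1.0 + 2.0 * x + u[group] + rng.normal(size=n * T)

# pooled OLS residuals
X = np.column_stack([np.ones(n * T), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
lm = breusch_pagan_lm(e, group, T)
# compare with the chi-squared(1) 5% critical value of 3.84
```

When the individual effect is real, the group sums of residuals are systematically large relative to the total residual variation, driving the ratio above one and LM far past the critical value.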
Hausman’s test for the Random Effects Model
The specification test devised by Hausman is used to test whether the random
effects are independent of the right hand side variables; it is a general test for
comparing any two estimators. Under the null hypothesis of no correlation between
the right hand side variables and the random effects (the assumption underlying
random effects), both fixed effects and random effects are consistent estimators
of β, but fixed effects is inefficient. Under the alternative (the fixed effects
assumption), fixed effects is consistent but random effects is not.
The test is based on the following Wald statistic:
W = [ FE - RE] -1[ FE - RE]
where
Var[ FE - RE] = Var[ FE] - Var[RE] = 
W is distributed as χ2 with (K-1) degrees of freedom, where K is the number of
parameters in the model. If W is greater than the critical value obtained from the
table then we reject the null hypothesis that both estimators are consistent, i.e. of
“no correlation between the right hand side variables and the ‘random effects’”, in
which case the fixed effects model is better. The intuition behind the test is
relatively simple: if both estimates are consistent then β̂FE - β̂RE should not be too
great, i.e. the two should be close together. [β̂FE - β̂RE]'[β̂FE - β̂RE] would be
equivalent to summing the squares of the differences between the two sets of
estimators; the greater this is, the less likely the null hypothesis is to be valid. The
insertion of Ψ̂-1 effectively weights these differences in inverse proportion to the
variance Var[β̂FE - β̂RE]. If this variance is large then the measure downplays the
difference between β̂FE and β̂RE; on the other hand if this variance is small then
any difference between β̂FE and β̂RE is given substantial weight.
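Given the two coefficient vectors and their covariance matrices, the Wald statistic is one line of linear algebra. The numbers below are entirely hypothetical, chosen only so the variance difference is positive definite:

```python
import numpy as np

def hausman(b_fe, b_re, V_fe, V_re):
    """Wald statistic W = d' (V_fe - V_re)^-1 d, with d = b_fe - b_re."""
    d = b_fe - b_re
    return d @ np.linalg.solve(V_fe - V_re, d)

# hypothetical estimates from the two models (illustrative numbers only)
b_fe = np.array([0.55, 0.40])
b_re = np.array([0.50, 0.45])
V_fe = np.array([[0.004, 0.001], [0.001, 0.003]])      # FE: larger variances
V_re = np.array([[0.002, 0.0005], [0.0005, 0.0015]])   # RE: more efficient
W = hausman(b_fe, b_re, V_fe, V_re)
# compare W with the chi-squared critical value for the appropriate d.f.
```

For these made-up inputs W is about 4.1, below the χ2(2) 5% critical value of 5.99, so the two estimators would be judged close enough to retain random effects.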
In terms of which is more appropriate, I tend on intuitive grounds to favour the
random effects model in most cases. Take the firm example below: most of the
variation in, for example, capital stock is between firms rather than within firms
over time. To use fixed effects in this case would be to lose a great deal of power
from the capital stock variable. The same is true of distance from London. This
worry becomes much less if we change the unit of analysis from e.g. the firm to the
region or industry the firm is operating within – but that is then very easy to do
within conventional regression analysis by the use of dummy variables. The key
question is whether the unobservable heterogeneity is correlated with the right hand
side variables. If yes, then fixed effects has a case, as anything else will induce bias
– i.e. some of the impact due to the unobservable heterogeneity will be wrongly
ascribed to the firm specific variables such as capital stock. But in solving this
problem we lose a lot of explanatory power from the regressions, as argued above
with capital stock. This raises the question of why we expect the unobservable
characteristics to be linked with the other variables: if, e.g., the heterogeneity is
due to the quality of the entrepreneur, does this have a linkage with capital stock?
EXAMPLES
Example 1:
The dependent variable is [the log of] gross value added for firms. The database is
of firms operating in the UK; the number of firms in this data set is 49,027 and the
panel is unbalanced. The Wald statistic has 35 degrees of freedom, as K, the number
of parameters being estimated, is 36. The critical value from the χ2 table with 35
degrees of freedom is 57.34. We can see that the Wald value is massively greater
than this and hence we would reject the random effects model and accept the fixed
effects model. There are R2 figures relating to ‘within groups’ and ‘between
groups’. The within groups R2 is the explanatory power of the right hand side
variables in explaining changes in GVA for individual firms; this is relatively low
compared to the between groups R2. This is not surprising, as the bulk of the
differences in GVA come from differences in capital stock and size of labour force
between firms. The output is from a STATA program.
Random-effects GLS regression                   Number of obs      =     69349
Group variable (i): dlink_ref22                 Number of groups   =     49027

R-sq:  within  = 0.0254                         Obs per group: min =         1
       between = 0.8312                                        avg =       1.4
       overall = 0.8395                                        max =         5

Random effects u_i ~ Gaussian                   Wald chi2(35)      = 244114.96
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000
------------------------------------------------------------------------------
      lgvafc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lemp | .5558439 .0040942 135.76 0.000 .5478195 .5638683 Log of Employed Labour force
lcap | .4022117 .002785 144.42 0.000 .3967532 .4076702 Log of capital stock
lpropnoqua~l | -.0506918 .0092815 -5.46 0.000 -.0688833 -.0325004 Log of proportion with no qualifications
lprophighq~l | .0625173 .0086047 7.27 0.000 .0456524 .0793822 Log of proportion with high qualifications
private | .3906842 .0198204 19.71 0.000 .3518369 .4295314 Private sector firm
uk | .0878014 .0082682 10.62 0.000 .0715961 .1040068 UK multinational
usa | .2033561 .0206409 9.85 0.000 .1629006 .2438116 US multinational
llunit | -.027734 .0059808 -4.64 0.000 -.0394562 -.0160119 Log of number of plants
mfd1 | -.1846409 .0191904 -9.62 0.000 -.2222535 -.1470284 Multi-region multi-plant firm dummy
nw1 | -.0646788 .0270416 -2.39 0.017 -.1176794 -.0116782 REGIONAL DUMMIES
yorks1 | -.1105119 .0276669 -3.99 0.000 -.164738 -.0562858
ne1 | -.0278356 .0386535 -0.72 0.471 -.103595 .0479238
wmids1 | -.1370831 .0217464 -6.30 0.000 -.1797053 -.0944609
wales1 | -.0999943 .0455664 -2.19 0.028 -.1893029 -.0106858
beds1 | -.1495485 .0214604 -6.97 0.000 -.19161 -.107487
sw1 | -.111522 .0249593 -4.47 0.000 -.1604412 -.0626028
emids1 | -.1523312 .0224785 -6.78 0.000 -.1963883 -.1082742
east1 | -.1660794 .0222715 -7.46 0.000 -.2097307 -.1224282
se1 | -.1366562 .0163141 -8.38 0.000 -.1686313 -.1046812
timenmfd1 | -.0009853 .0001045 -9.43 0.000 -.0011902 -.0007805 Time from London
miningdum | -.3116256 .1813403 -1.72 0.086 -.6670462 .0437949 INDUSTRY DUMMIES
manufactur~m | -.354197 .1377291 -2.57 0.010 -.6241411 -.0842528
powerdum | .0229343 .1653576 0.14 0.890 -.3011605 .3470292
constructi~m | -.0548109 .138434 -0.40 0.692 -.3261366 .2165148
wholesaler~m | -.1450245 .1376412 -1.05 0.292 -.4147963 .1247473
cateringdum | -1.151165 .1382849 -8.32 0.000 -1.422199 -.8801321
transportdum | -.5101581 .1383287 -3.69 0.000 -.7812774 -.2390388
realestate~m | -.2839032 .1375022 -2.06 0.039 -.5534027 -.0144038
educationdum | -.5616366 .139509 -4.03 0.000 -.8350691 -.288204
socialwork~m | -.6742467 .1385968 -4.86 0.000 -.9458914 -.402602
communitydum | -.7754511 .1382927 -5.61 0.000 -1.0465 -.5044024
year2 | .0236857 .0097134 2.44 0.015 .0046478 .0427236 YEAR DUMMIES
year3 | .0075121 .0095382 0.79 0.431 -.0111825 .0262066
year4 | .0635178 .0086933 7.31 0.000 .0464793 .0805563
year5 | .1018157 .0088747 11.47 0.000 .0844216 .1192097
_cons | Included but not reported
-------------+----------------------------------------------------------------
     sigma_u | .73642326
sigma_e | .47167868
rho | .7090994 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Example 2:
This is from LIMDEP. We are regressing growth on lagged growth, aid per capita
lagged one period (APCYL1), positive aid volatility (APCYRS1P), negative aid
volatility (APCYRS1N) and a time trend. The groups are countries, of which there
are about 66. We have data over time back to about 1961, but it is an unbalanced
panel as we do not have full observations for all countries. We begin with the OLS
equation without group dummy variables. There then follows least squares with
group dummy variables (fixed effects). Finally we have the random effects model.
Prior to this being printed out we have test results for the Lagrange multiplier test
and also the Hausman test. The test statistics are:

| Lagrange Multiplier Test vs. Model (3) = 29.62   |
| ( 1 df, prob value = .000000)                    |
| (High values of LM favor FEM/REM over CR model.) |
| Fixed vs. Random Effects (Hausman)     = 66.54   |
| ( 5 df, prob value = .000000)                    |

The value of 29.62 suggests that FE/RE are appropriate. The Hausman test suggests
that of the two, fixed effects is appropriate. Turning to the fixed effects printout we
can see that lagged growth impacts on current growth, as does lagged aid. With
respect to aid volatility, positive volatility (roughly, unexpected upward shifts in
aid) has a positive impact on growth, but negative volatility has no discernible
adverse effect. Worryingly, there is a negative time trend.
+-----------------------------------------------------------------------+
| OLS Without Group Dummy Variables                                     |
| Ordinary least squares regression    Weighting variable = none        |
| Dep. var. = GROWTH   Mean = 3.676811935, S.D. = 5.599609225           |
| Model size: Observations = 2017, Parameters = 6, Deg.Fr. = 2011       |
| Residuals: Sum of squares = 59973.73230, Std.Dev. = 5.46103           |
| Fit: R-squared = .051243, Adjusted R-squared = .04888                 |
| Model test: F[ 5, 2011] = 21.72, Prob value = .00000                  |
| Diagnostic: Log-L = -6283.1290, Restricted(b=0) Log-L = -6336.1784    |
|             LogAmemiyaPrCrt. = 3.398, Akaike Info. Crt. = 6.236       |
| Panel Data Analysis of GROWTH [ONE way]                               |
|         Unconditional ANOVA (No regressors)                           |
| Source      Variation   Deg. Free.   Mean Square                      |
| Between       4350.44         62.        70.1683                      |
| Residual     58862.5        1954.        30.1241                      |
| Total        63212.9        2016.        31.3556                      |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
 GROWTHL1   .1471741628      .21153013E-01    6.958   .0000    3.6459885
 APCYL1    -.1928354626E-02  .18453840E-01    -.104   .9168    7.8342771
 APCYRS1P   .1781995479      .52169562E-01    3.416   .0006    1.1362761
 APCYRS1N   .3243939601E-01  .57797666E-01     .561   .5746   -1.1312713
 TREND     -.6280410686E-01  .11491455E-01   -5.465   .0000   24.482895
 Constant   4.527164152      .32255659       14.035   .0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
+-----------------------------------------------------------------------+
| Least Squares with Group Dummy Variables                              |
| Ordinary least squares regression    Weighting variable = none        |
| Dep. var. = GROWTH   Mean = 3.676811935, S.D. = 5.599609225           |
| Model size: Observations = 2017, Parameters = 68, Deg.Fr. = 1949      |
| Residuals: Sum of squares = 55490.25413, Std.Dev. = 5.33584           |
| Fit: R-squared = .122169, Adjusted R-squared = .09199                 |
| Model test: F[ 67, 1949] = 4.05, Prob value = .00000                  |
| Diagnostic: Log-L = -6204.7693, Restricted(b=0) Log-L = -6336.1784    |
|             LogAmemiyaPrCrt. = 3.382, Akaike Info. Crt. = 6.220       |
| Estd. Autocorrelation of e(i,t) = .026811                             |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
 GROWTHL1   .8840072682E-01  .21389867E-01    4.133   .0000    3.6459885
 APCYL1     .7405384230E-01  .27813228E-01    2.663   .0078    7.8342771
 APCYRS1P   .1319179295      .57626292E-01    2.289   .0221    1.1362761
 APCYRS1N  -.1050606242      .63889238E-01   -1.644   .1001   -1.1312713
 TREND     -.1006714311      .12309034E-01   -8.179   .0000   24.482895
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
+------------------------------------------------------------------------+
|              Test Statistics for the Classical Model                   |
|                                                                        |
|   Model                  Log-Likelihood  Sum of Squares   R-squared    |
| (1) Constant term only     -6336.17838   .6321293692D+05  .0000000     |
| (2) Group effects only     -6264.26752   .5886250009D+05  .0688219     |
| (3) X - variables only     -6283.12895   .5997373230D+05  .0512427     |
| (4) X and group effects    -6204.76924   .5549025413D+05  .1221693     |
|                                                                        |
|                        Hypothesis Tests                                |
|             Likelihood Ratio Test         F Tests                      |
|             Chi-squared d.f. Prob.    F      num. denom. Prob value    |
| (2) vs (1)    143.822    62  .00000    2.329   62  1954   .00000       |
| (3) vs (1)    106.099     5  .00000   21.723    5  2011   .00000       |
| (4) vs (1)    262.818    67  .00000    4.048   67  1949   .00000       |
| (4) vs (2)    118.997     5  .00000   23.689    5  1949   .00000       |
| (4) vs (3)    156.719    62  .00000    2.540   62  1949   .00000       |
+------------------------------------------------------------------------+
+--------------------------------------------------+
| Random Effects Model: v(i,t) = e(i,t) + u(i)     |
| Estimates: Var[e]        = .284711D+02           |
|            Var[u]        = .135170D+01           |
|            Corr[v(i,t),v(i,s)] = .045324         |
| Lagrange Multiplier Test vs. Model (3) = 29.62   |
| ( 1 df, prob value = .000000)                    |
| (High values of LM favor FEM/REM over CR model.) |
| Fixed vs. Random Effects (Hausman)     = 66.54   |
| ( 5 df, prob value = .000000)                    |
| (High (low) values of H favor FEM (REM).)        |
| Reestimated using GLS coefficients:              |
| Estimates: Var[e]        = .285879D+02           |
|            Var[u]        = .212761D+01           |
|            Sum of Squares  .602189D+05           |
+--------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
 GROWTHL1   .1135835519      .21085842E-01    5.387   .0000    3.6459885
 APCYL1     .1991765749E-01  .21725640E-01     .917   .3593    7.8342771
 APCYRS1P   .1733946894      .53760692E-01    3.225   .0013    1.1362761
 APCYRS1N  -.2436091344E-01  .59905717E-01    -.407   .6843   -1.1312713
 TREND     -.7883073300E-01  .11664972E-01   -6.758   .0000   24.482895
 Constant   4.895058779      .36527924       13.401   .0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)