Panel Data and Methods

Panel data consist of a cross-section of “individuals” for which there are repeated observations
over time. Individuals can be any cross-section unit of analysis, such as states, dyads, or survey
respondents. Panel data sets are typically divided into long panels, which have many
measurement occasions relative to the size of the cross-section, and short panels, which have
many individuals in the cross-section relative to the number of repeated measurement occasions,
or “waves.” In general, the methods associated with the terms “panel data analysis” or
“longitudinal analysis” focus on short panels, while methods under the “time-series cross-section” umbrella focus more on analyzing long panels.
The key advantage of panel data is that such data offer the opportunity to better evaluate
causal propositions than strictly cross-sectional data. Whereas cross-sectional data only allow the
researcher to observe covariances, panel data further allow the researcher to observe whether a
change in an input precedes a change in the outcome. In other words, since panel data consist of
the same individuals over time, the analyst can observe a shift in responses as a reaction to an
input. One example would be an evaluation of whether a state’s present behavior responds to the
prior behavior of its neighbors. Another example might be using a survey panel, such as those
often incorporated into the American National Election Studies, to assess how partisan strength
influences campaign interest over time.
Issues of Panel Data and Common Remedies
Panel data have a number of features that can pose challenges to analysts. These issues include:
unit effects, serial correlation, heteroscedasticity, and contemporaneous correlation. Panels also
have special problems of missingness. The remainder of this entry focuses on these issues and
some remedies for each.
Unit Effects
Whenever individuals’ mean responses differ, unit effects are present in the data. Unit effects can
pose serious problems for inference as failure to account for them in some way can produce bias
in estimates akin to omitted variable bias. If the mean response varies cross-sectionally via
unobserved unique means, but this difference is not modeled (and thereby left in the error term),
then any cross-sectionally varying covariate will correlate with the error term. Such a situation
produces endogeneity bias in the model’s coefficients.
In the econometric tradition, two approaches are widely used to handle unit effects. One
is the fixed effects model, typically estimated with least squares dummy variables (LSDV). This
approach estimates the desired model using OLS, including dummy variables for each
individual, save a reference individual. This approach has the advantage of being
computationally simple and accounting for a known source of variance in the model
specification. However, individual dummies are perfectly collinear with any variable which
varies only cross-sectionally. Hence, LSDV precludes the inclusion of time-invariant variables in
a model. An alternative that does allow time-invariant covariates is a GLS model with a
compound symmetry covariance structure, known as a random effects model. This model
recognizes that repeated observations will covary, so the estimator accounts for this structure by
including a term that forces all repeated observations to correlate at a constant level with each
other.
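The collinearity problem can be made concrete with a small sketch (pure Python; the data and variable names are invented for illustration): any time-invariant column is an exact linear combination of the unit dummies.

```python
# Sketch: a time-invariant covariate is perfectly collinear with unit dummies.
# Three individuals ("A", "B", "C"), two waves each; region_score never
# varies within an individual.
ids = ["A", "A", "B", "B", "C", "C"]
region_score = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0]  # time-invariant

units = sorted(set(ids))
dummies = {u: [1.0 if i == u else 0.0 for i in ids] for u in units}

# Because the covariate is constant within each unit, it is an exact linear
# combination of the dummies: weight each unit's dummy by that unit's value.
weights = {u: region_score[ids.index(u)] for u in units}
reconstructed = [sum(weights[u] * dummies[u][row] for u in units)
                 for row in range(len(ids))]

print(reconstructed == region_score)  # perfectly collinear -> True
```

This is why an LSDV design matrix containing both the dummies and the time-invariant covariate has no unique least squares solution.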
It should be noted that the terms fixed and random effects have multiple meanings.
Econometricians typically call a LSDV model a fixed effects model and a GLS-compound
symmetry model a random effects model. These terms take a different meaning when analyzing
data from the view of hierarchical modeling. Specifically, a fixed effect refers to any model
quantity estimated in the fitting of a model (i.e., obtained via least squares or maximum
likelihood), while a random effect refers to any parameter that is unique to the individual but can
be predicted separately. Mixed effects models contain both fixed and random effects. Confusion
can arise because a “random effects model” is a special case of a mixed effects model. For
example, the general form of a linear mixed effects model is:
Y_ij = X_ij'β + Z_ij'b_i + e_ij
where Y_ij is the response value for individual i at time j, X_ij is the vector of all covariate values for individual i at time j, β is a vector of fixed effects (coefficients that apply to all individuals), Z_ij is a subset of X_ij that may include any time-varying covariate or a constant, b_i is a vector of random effects for individual i, and e_ij is the error term for individual i at time j. One special case
would be a model in which there is only a random intercept, which becomes:
Y_ij = X_ij'β + b_i + e_ij
Each b_i is not estimated directly in the fitting of the model, but can be predicted using empirical
Bayes techniques. By decomposing the unexplained variance into b_i and e_ij, which are
independent of each other, the model successfully accounts for differences in the mean responses
for individuals and the necessary correlation among observations. Hence, the random effects
model is seen as a special case of the more general mixed effects model.
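The shrinkage behind empirical Bayes prediction of a random intercept can be sketched as follows, assuming for simplicity that there are no covariates and that the variance components σ_b² and σ_e² are known (in practice they are estimated from the data):

```python
# Sketch: empirical Bayes prediction of a random intercept b_i, assuming
# known variance components and no covariates. The predictor shrinks each
# individual's mean deviation toward zero by the factor
# sigma_b^2 / (sigma_b^2 + sigma_e^2 / n_i).
def eb_intercept(y_i, grand_mean, sigma_b2, sigma_e2):
    n_i = len(y_i)
    dev = sum(y_i) / n_i - grand_mean  # raw deviation of unit i's mean
    shrink = sigma_b2 / (sigma_b2 + sigma_e2 / n_i)
    return shrink * dev

# A unit observed many times is shrunk less than one observed few times.
b_many = eb_intercept([2.0] * 20, grand_mean=0.0, sigma_b2=1.0, sigma_e2=1.0)
b_few = eb_intercept([2.0] * 2, grand_mean=0.0, sigma_b2=1.0, sigma_e2=1.0)
print(b_many, b_few)  # the 20-wave prediction sits closer to 2.0
```

Because the shrinkage factor grows with n_i, individuals observed more often borrow less strength from the overall mean.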
There are several practical considerations when deciding how to control for unit effects in
a longitudinal model. Again, fixed effects models cannot include time-invariant covariates.
Further, when the number of individuals is large, especially relative to the number of waves, then
estimating a LSDV model is inefficient. An alternative fixed effects estimator to LSDV is the
within estimator, wherein the outcome variable and the covariates are all rescaled as deviations
from an individual’s mean of that variable. The within estimator avoids the inefficiency of
estimating unique intercepts for each individual and yields the same coefficient estimates as
LSDV; however, just like LSDV, it cannot accommodate time-invariant covariates.
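A minimal sketch of the within estimator, using one covariate and invented noise-free data, illustrates both points at once: pooled OLS is biased when unit intercepts correlate with the covariate, while demeaning within individuals recovers the true slope.

```python
# Sketch: the within (fixed effects) estimator with one covariate. Data are
# constructed so that unit intercepts correlate with x; pooled OLS is then
# biased, while within-individual demeaning recovers the true slope (2.0).
BETA = 2.0
panel = []  # (unit, x, y) with y = a_i + BETA * x and no noise
for unit, (a_i, xs_i) in enumerate([(10.0, [1.0, 2.0]),
                                    (0.0, [5.0, 6.0]),
                                    (-10.0, [9.0, 10.0])]):
    for x in xs_i:
        panel.append((unit, x, a_i + BETA * x))

def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

pooled = slope([x for _, x, _ in panel], [y for _, _, y in panel])

# Within transformation: subtract each unit's own means from x and y.
x_dm, y_dm = [], []
for u in {u for u, _, _ in panel}:
    rows = [(x, y) for uu, x, y in panel if uu == u]
    mx = sum(x for x, _ in rows) / len(rows)
    my = sum(y for _, y in rows) / len(rows)
    x_dm += [x - mx for x, _ in rows]
    y_dm += [y - my for _, y in rows]
within = slope(x_dm, y_dm)
print(within, pooled)  # within recovers 2.0; pooled is badly biased
```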
The random effects model allows for time-invariant covariates and avoids the
inefficiency that can emerge from LSDV. Hence, with especially short panels or any
model for which the effects of time-invariant covariates are to be estimated, random effects
models are probably the most practical option. However, this model assumes that unit effects are
independent of all covariates. If the unit effects are correlated with any of the input variables,
then the random effects model is biased and inconsistent. Whether independence of the unit
effects is a fair assumption can be evaluated with a Hausman test, under which the null
hypothesis is that the unit effects are independent of the covariates, implying that a random
effects model is consistent. Rejection of this null hypothesis implies that the random effects model has problems
of endogeneity bias. As a final, practical point on random effects models: GLS models such as
this require the analyst to specify how the errors of the model are correlated. However, the true
correlation between the errors of individuals’ repeated measurements is unknown, so feasible
GLS must be used. Feasible GLS is estimated with a multi-step procedure whereby residuals of
an initial model are used to estimate the correlation of errors, which is then inserted into a GLS
estimator. For instance, the Cochrane-Orcutt FGLS estimator repeats this process until the
estimate of the error correlation converges. All of this suggests that analysts must
carefully weigh the structure of their data and the goals of their model when choosing how best
to handle unit effects.
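The iterative FGLS logic can be sketched for a single covariate; this is an illustrative from-scratch version of the Cochrane-Orcutt idea, not any particular package's implementation.

```python
import random

# Sketch of Cochrane-Orcutt FGLS for one covariate: fit OLS, estimate the
# AR(1) error correlation rho from the residuals, quasi-difference the data,
# and iterate until rho stabilizes.
def ols(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b  # intercept, slope

def cochrane_orcutt(xs, ys, iters=30):
    a, b = ols(xs, ys)
    rho = 0.0
    for _ in range(iters):
        resid = [y - a - b * x for x, y in zip(xs, ys)]
        rho = (sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
               / sum(e * e for e in resid))
        # Quasi-difference with the current rho (the first wave is dropped).
        xd = [xs[t] - rho * xs[t - 1] for t in range(1, len(xs))]
        yd = [ys[t] - rho * ys[t - 1] for t in range(1, len(ys))]
        a, b = ols(xd, yd)
        a /= 1.0 - rho  # recover the intercept on the original scale
    return b, rho

# Simulate y = 1 + 1.5x with AR(1) errors (true rho = 0.8).
random.seed(0)
e, xs, ys = 0.0, [], []
for t in range(200):
    e = 0.8 * e + random.uniform(-0.5, 0.5)
    xs.append(t / 10.0)
    ys.append(1.0 + 1.5 * xs[-1] + e)
b_hat, rho_hat = cochrane_orcutt(xs, ys)
print(b_hat, rho_hat)  # slope near 1.5, rho near 0.8
```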
Serial Correlation, Heteroscedasticity & Contemporaneous Correlation
Serial correlation refers to correlation among the errors of repeated observations on the same
individual. In general, this correlation tends to be large and positive but diminishes as the time
between measurements increases. Serial correlation violates the OLS assumption of uncorrelated
errors. The solutions to this problem resemble the fixes for unit effects. One solution is to include
a lagged response as a covariate, as this term often accounts for serial correlation and makes the
remaining errors independent. Lagged outcome variables are more commonly used for long
panels because the first wave of observations cannot be modeled with this approach, which is
more costly when repeated observations are scarce. (It should be noted that while many argue a
lagged response most effectively accounts for unit effects and serial correlation, others maintain
that an endogeneity bias can occur if the lagged term does not filter all of the serial correlation.)
Another solution is to estimate a GLS model that includes a covariance pattern matrix, which
estimates the covariances between each pair of time waves: the matrix may be unstructured or
defined by a clear pattern, such as first-order autoregressive. Lastly, mixed effects models
produce correlation matrices based on the variances and covariances of the random effects. Thus,
a pattern of correlation also can be captured by random effects. It should be noted that in
particularly short panels (for instance, three waves), serial correlation can be hard to account for
with any of these methods: A lagged dependent variable costs one wave of data, covariance
pattern matrices more complex than a simple random effects model can be difficult to estimate,
and very short panels do not allow for many random parameters.
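The two covariance structures just mentioned can be written down directly; a sketch for four waves (the value of ρ is arbitrary):

```python
# Sketch: two common working correlation structures for T = 4 waves. Under
# compound symmetry every pair of waves correlates at the same level; under
# first-order autoregression the correlation decays as rho^|t - s|.
T, rho = 4, 0.6

compound_symmetry = [[1.0 if t == s else rho for s in range(T)]
                     for t in range(T)]
ar1 = [[rho ** abs(t - s) for s in range(T)] for t in range(T)]

for row in ar1:
    print([round(v, 3) for v in row])
# Adjacent waves correlate at 0.6, waves two apart at 0.36, and so on.
```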
The methods for covariance patterns and random effects also can be incorporated into the
generalized linear model framework, which means that remedies for unit effects and serial
correlation also can be used for limited dependent variables (such as counts or binary outcomes).
Marginal models, estimated with generalized estimating equations (GEE), require the analyst to
specify how repeated observations are associated and thereby resemble the covariance pattern
GLS model for continuous outcomes. Generalized linear mixed effects models incorporate random
effects into the specification and account for the correlation of repeated observations through the
random effects.
Heteroscedasticity can be present in panel data if the unmodeled variance in outcomes
differs from one individual to the next. This problem can be addressed through a GLS estimator
which allows for unique variances among individuals, in addition to the correlation pattern.
Contemporaneous correlation arises when individuals have similar errors at particular times. This
may arise because some time-dependent factor is simultaneously influencing all individuals. In
the presence of contemporaneous correlation, the variance of the coefficient estimates is larger
than it would be if the Gauss-Markov assumptions held. Regular standard errors do not account
for this inefficiency, however. Rather, panel-corrected standard errors better account for the
larger error variance, thereby making statistical inference on coefficient estimates less prone to
Type I errors.
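The building block of panel-corrected standard errors, the estimated contemporaneous covariance of units' errors, can be sketched from residuals. The residual values below are invented; the full Beck-Katz correction then sandwiches this estimate inside the usual OLS variance formula.

```python
# Sketch: estimating the contemporaneous covariance of two units' errors from
# residuals. resid[g][t] holds unit g's residual at time t (toy numbers,
# purely illustrative).
resid = [
    [0.5, -0.2, 0.4, -0.1],  # unit 1
    [0.6, -0.3, 0.2, -0.2],  # unit 2: errors move with unit 1's
]
T = len(resid[0])

def contemp_cov(e_g, e_h):
    return sum(a * b for a, b in zip(e_g, e_h)) / T

sigma = [[contemp_cov(resid[g], resid[h]) for h in range(2)]
         for g in range(2)]
print(sigma[0][1] > 0)  # units' errors are contemporaneously correlated -> True
```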
Missing Data
With panel data, a key concern is that measuring each individual at every wave of observations
may not be possible. One reason for this may be censoring that arises from the structure of the
study. For example, if different individuals were recruited to participate in a study with staggered
start times, but the study ended simultaneously for all, then late joiners would have fewer
repeated observations. In this situation, the non-observance of later waves for late joiners would
be missing completely at random (MCAR), as qualities of the individuals had no bearing on how
often they were observed. In this case, as with any panel data with observations missing
completely at random, the data could be analyzed by complete-case analysis (analyzing only
cases for which all waves are observed) or available-data analysis (i.e., methods that do not
require response vectors of equal length).
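The two strategies can be sketched on a toy ragged panel (values invented; None marks an unobserved wave):

```python
# Sketch: complete-case versus available-data summaries of a ragged panel.
panel = {
    "A": [4.0, 5.0, 6.0],
    "B": [2.0, 3.0, None],  # wave 3 unobserved for this individual
    "C": [1.0, 2.0, 3.0],
}

# Complete-case analysis: keep only individuals observed at every wave.
complete = {k: v for k, v in panel.items() if None not in v}

# Available-data analysis: use every observed value, wave by wave.
def wave_mean(t):
    obs = [v[t] for v in panel.values() if v[t] is not None]
    return sum(obs) / len(obs)

waves = len(next(iter(panel.values())))
available_means = [wave_mean(t) for t in range(waves)]
print(sorted(complete), available_means)
```

Under MCAR both approaches are unbiased; available-data analysis simply uses more of the observed information.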
Another, more serious cause of missing data in panels is attrition (also called dropout or
panel mortality). Individuals who are part of the study may choose not to participate after a few
observations, or the researcher may lose track of individuals and be unable to reach them for
further study. Dropout specifically refers to the situation where, once an individual goes
unobserved in one wave, the individual is not observed in any future wave. Whereas data that are
missing due to censoring are nearly always MCAR, data missing due to dropout may not fit this
criterion. If the data are at least missing at random (MAR, meaning that the probability of
missingness is conditional only on observable information), then imputation methods can yield
unbiased estimates of model quantities. Many studies impute missing values from dropout by
setting all missing values of a response equal to the last observed value. This method assumes,
however, that the response would not have changed after dropout, which is usually unrealistic.
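The mechanics of last-observation-carried-forward are simple, which is part of its appeal despite the unrealistic assumption just noted; a sketch:

```python
# Sketch: last-observation-carried-forward (LOCF) imputation. Each missing
# value (None) is replaced with the most recent observed value.
def locf(series):
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # stays None until the first observed value
    return filled

print(locf([3.0, 4.0, None, None]))  # -> [3.0, 4.0, 4.0, 4.0]
```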
A better alternative is multiple random imputation. One technique for multiple random
imputation is to model the probability of missingness, match each missing observation to
observed cases with similar probabilities of being missing, and randomly draw several of
these donors to impute the missing value. A
second technique for multiple random imputation is to model the value of an observation at a
particular time with the observed data and impute a value for the missing observation that is
computed with known information about the subject plus a random disturbance. For individuals
who later return to the study, observations at later waves—as well as early waves—should be
used to impute missing middle values.
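The first technique can be sketched as a hot-deck draw from donors with similar missingness probabilities; the probabilities below are hypothetical, pre-computed values standing in for the output of an actual missingness model.

```python
import random

# Sketch: multiple random imputation by matching on the probability of
# missingness. Each observed case carries a (hypothetical) estimated
# probability of being missing; a missing value is imputed several times by
# drawing from the closest-matching donors.
observed = [  # (response value, estimated probability of being missing)
    (2.1, 0.10), (2.4, 0.30), (3.0, 0.32), (3.3, 0.35), (4.0, 0.70),
]

def impute(p_missing, n_draws=3, pool_size=3, seed=1):
    # Keep the pool_size donors whose missingness probability is closest.
    donors = sorted(observed, key=lambda d: abs(d[1] - p_missing))[:pool_size]
    rng = random.Random(seed)
    return [rng.choice(donors)[0] for _ in range(n_draws)]

draws = impute(p_missing=0.33)
print(draws)  # each draw comes from donors with probabilities near 0.33
```

Drawing several values rather than one preserves imputation uncertainty across the multiple completed data sets.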
As a final consideration for attrition, a researcher may choose to refresh the sample by
adding new observations toward the end of the study, these new observations being called a
refreshment sample. Though this strategy does not directly remedy the problem of uneven
panels, it does prevent the sample size from shrinking too much when constructing an overall
response profile. Further, refreshment allows the researcher to diagnose the severity of panel
effects. This can be done by comparing variable means from the refreshment samples to the
means of those still in the panel at a given wave to see how dropout is influencing the makeup of
the sample. Further, the process of being part of a panel study may influence an individual’s
response over time, a process called “panel conditioning.” Refreshment samples allow for the
possibility of adjusting for panel effects through techniques such as “fractional pooling” or “two-stage auxiliary instrumental variables.”
James E. Monogan III
See also: Generalized Least Squares, General Linear Models, Multiple Random Imputation,
Regression, Time Series Analysis, Time-Series Cross-Section
Further Readings
Allison, P. D. (1994). Using Panel Data to Estimate the Effects of Events. Sociological Methods
& Research, 23, 174-199.
Bartels, L. M. (1999). Panel Effects in the American National Election Studies. Political
Analysis, 8, 1-20.
Beck, N. & Katz, J. N. (1995). What to Do (and Not to Do) with Time-Series Cross-Section
Data. American Political Science Review, 89, 634-647.
Cameron, A. C., & Trivedi, P. K. (2005). Section V of Microeconometrics: Methods and
Applications. New York: Cambridge University Press.
Fitzmaurice, G., Laird, N., & Ware, J. (2004). Applied Longitudinal Analysis. Hoboken, NJ:
Wiley.
Stimson, J. A. (1985). Regression in Time and Space: A Statistical Essay. American Journal of
Political Science, 29, 914-947.