Lecture 6

Panel data: following the same individuals over time. This differs from regular repeated cross-sectional data, where a new sample is drawn in each period.

Year dummies
Different intercept for each period, e.g. for students this year and students next year. This makes the intercept dependent on time.
Deflate monetary values to take out inflation effects.

Panel data
Dimensions: N individuals observed over T periods.
T: long/short panel (relative to the data, e.g. days can be long when the unit of measurement is seconds).
N: narrow/wide panel.
Microeconomic panels are typically wide and short.
Macroeconomic panels are typically narrow and long.
Unbalanced panel: the number of time series observations differs across individuals (e.g. households).
Balanced panel: the same number of time series observations for each individual (e.g. countries).
Both can be used for estimation, but check whether dropout may involve self-selection.

Why use panel data?
Account for endogeneity bias: account for unobserved (time-constant) individual heterogeneity.
Study dynamic factors, e.g. health in the time after a hospital visit, or whether a hospital visit gets people back on their previous health trajectory.

Stata: xtset ID YEAR declares the panel, with ID as the individual identifier (i = 1, ..., N) and YEAR as the time variable (t = 1, ..., T).

Pooled model
$y_{it} = \beta_0 + \beta_1 x_{1it} + \beta_2 x_{2it} + u_{it}$
Subscript i denotes the ith individual and t denotes the tth time period.
With N individuals and T periods, a balanced panel has N × T observations in total.

Why is it not a good idea to simply ignore the panel dimension and perform OLS?
In OLS we assume a random sample, which implies that the errors of different observations are uncorrelated.
However, when the same individuals are observed over time, values will often depend a lot on previous values, so the errors of one individual are correlated over time.

Cluster-robust standard errors
Allow for some correlation over time within an individual.
Consequences of the autocorrelation and heteroskedasticity inherent in panel data: least-squares estimators are still consistent, but the standard errors are incorrect (typically too small).
Use panel-robust standard errors (= cluster-robust standard errors).
Stata: add the option vce(cluster id) after the regression command.
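To illustrate the vce(cluster id) option, here is a minimal sketch on simulated data. All variable names (id, year, y, x1, x2, a, xbar) and the numbers are assumptions made up for illustration. They are not from the lecture data. Run it as a do-file so the // comments are accepted.

```stata
* Simulated balanced panel: 200 hypothetical individuals, 5 years each
clear
set seed 12345
set obs 200
generate id = _n
generate a = rnormal()                  // time-constant individual effect, ends up in the error
generate xbar = rnormal()               // persistent part of x1, independent of a
expand 5                                // 5 copies of each individual
bysort id: generate year = 2000 + _n
generate x1 = xbar + rnormal()
generate x2 = rnormal()
generate y = 1 + 0.5*x1 - 0.3*x2 + a + rnormal()

xtset id year                           // declare panel structure
regress y x1 x2                         // pooled OLS, default standard errors
regress y x1 x2, vce(cluster id)        // same model, cluster-robust standard errors
```

The two regress commands give identical coefficients. Only the standard errors, and hence the t-statistics and confidence intervals, change.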
Clustering on id takes care of potential serial correlation at the ID level.
Depending on the data structure, you may also want to cluster on city or region.
Clustering at a larger level automatically also allows for correlation within clusters at lower levels. E.g. clustering at the region level also allows for clustering at the individual level.
Clustering typically increases the standard errors, so they become more conservative. There is thus a trade-off. Choose the cluster at the level at which the policy being researched applies.

Having panel data allows us to address endogeneity issues:
> individual fixed effects: constant over time but varying over individuals
> time fixed effects: vary over time but constant across individuals

Fixed-effect model
Assume data for t = 1, 2 and a dummy t2 which is 1 for t = 2 and 0 otherwise.
$y_{it} = \beta_0 + \delta_0 \, t2_t + \beta_1 x_{it} + a_i + u_{it}$
Here $a_i$ is the individual fixed effect. Taking the difference between the two periods removes it: $\Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta u_i$.

Differences in differences
Estimating the effect of a natural experiment. E.g. reg health hospital_visits
How to assess the causal effect of going to the hospital on health $y_i$ for individual i?
Ideally, compare what happened when they went with what would have happened if they hadn't gone.
Mimic this potential outcome by taking people who did and people who didn't go to the hospital.
Measure the treatment and control group before and after the reform.
We cannot simply look at the same individuals over time, since we never observe their other (counterfactual) state.
We also cannot compare the two groups after treatment only, because there might very well be selection.
But we can compare the changes in health between these groups.
E.g. the treatment group increased health by e, the control group increased health by f. Then the treatment effect is e - f.
Assumption: in the absence of going to the hospital, the treatment group would have developed in the same way as the control group (parallel trends).
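The e - f comparison can also be obtained from a regression with a group dummy, a period dummy and their interaction. The following is a minimal sketch on simulated data. The variable names (id, treated, post, health, a) and the numbers are hypothetical and not from the lecture. Run it as a do-file so the // comments are accepted.

```stata
* Simulated two-period data: 250 hypothetical treated and 250 control individuals
clear
set seed 2024
set obs 500
generate id = _n
generate treated = (id <= 250)          // 1 = went to the hospital (assumed treatment group)
generate a = rnormal() + 0.5*treated    // groups differ in levels (selection)
expand 2                                // two periods per individual
bysort id: generate post = (_n == 2)    // 0 = before, 1 = after
* common time trend of 1 for both groups, simulated treatment effect of 2
generate health = 10 + a + 1*post + 2*treated*post + rnormal()

* the coefficient on 1.treated#1.post is the difference-in-differences estimate
regress health i.treated##i.post, vce(cluster id)
```

In this regression the coefficient on post corresponds to f (the change in the control group), and the interaction coefficient corresponds to e - f. The level difference between the groups is absorbed by the coefficient on treated, which is why selection in levels does not bias the estimate as long as parallel trends holds.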