SC968 Panel data methods for sociologists
Lectures 3 - 4.5
Fixed and random effects models

Overview
- Types of variables: time-invariant, time-varying and trend
- Between- and within-individual variation
- Concept of individual heterogeneity
- Within and between estimators
- The implementation of fixed and random effects models in STATA
- Statistical properties of fixed and random effects models
- Choosing between fixed and random effects: the Hausman test
- Estimating coefficients on time-invariant variables in FE
- Thinking about specification

Types of variable
- Those which vary between individuals but hardly ever over time
    - Sex
    - Ethnicity
    - Parents' social class when you were 14
    - The type of primary school you attended (once you've become an adult)
- Those which vary over time, but not between individuals
    - The retail price index
    - National unemployment rates
    - Age, in a cohort study
- Those which vary both over time and between individuals
    - Income
    - Health
    - Psychological wellbeing
    - Number of children you have
    - Marital status
- Trend variables: vary between individuals and over time, but in highly predictable ways
    - Age
    - Year

Between- and within-individual variation
If you have a sample with repeated observations on the same individuals, there are two sources of variance within the sample:
- The fact that individuals are systematically different from one another (between-individual variation)
- The fact that individuals' behaviour varies between observations (within-individual variation)

Not nearly as scary as it looks:

\[ T = \sum_i \sum_j (x_{ij} - \bar{x})^2 \]
\[ W = \sum_i \sum_j (x_{ij} - \bar{x}_i)^2 \]
\[ B = \sum_i \sum_j (\bar{x}_i - \bar{x})^2 = \sum_i n_i (\bar{x}_i - \bar{x})^2 \]
\[ SD = \sqrt{T/(N-1)} \]

where i denotes individuals, j denotes years, x_{ij} is a person-year observation, \bar{x} is the whole-sample mean, and \bar{x}_i is the mean of the observations for person i.
- Total variation (T) is the sum, over all individuals and years, of the squared difference between each observation of x and the whole-sample mean.
- Within variation (W) is the sum of squared deviations of each observation from that individual's own mean.
- Between variation (B) is the sum of squared differences between individual means and the whole-sample mean.

xtsum in STATA
Similar to the ordinary "sum" command. The example is restricted to a balanced sample (nwaves == 15).

. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit

. xtsum female partner age ue_sick LIKERT wave if nwaves == 15

Variable         |      Mean   Std. Dev.        Min        Max |    Observations
-----------------+---------------------------------------------+----------------
female   overall |  .5397574   .4984321          0          1 |     N =   16324
         between |             .4989059          0          1 |     n =    1237
         within  |                    0   .5397574   .5397574 | T-bar = 13.1964
partner  overall |  .6892954   .4627963          0          1 |     N =   16292
         between |             .4217842          0          1 |     n =    1234
         within  |              .243531   -.244038   1.622629 | T-bar = 13.2026
age      overall |  40.03349   19.74332          0         98 |     N =   19410
         between |             19.27238        6.4   90.93333 |     n =    1294
         within  |              4.31763   31.30015   54.30015 |     T =      15
ue_sick  overall |  .0672924   .2505353          0          1 |     N =   16302
         between |             .1738938          0          1 |     n =    1237
         within  |             .1852756   -.866041   1.000626 | T-bar = 13.1787
LIKERT   overall |  11.26167   5.344825          0         36 |     N =   15661
         between |             3.609665          0   29.69231 |     n =    1225
         within  |             4.030974  -6.738331   35.12834 | T-bar = 12.7845
wave     overall |         8   4.320605          1         15 |     N =   19410
         between |                    0          8          8 |     n =    1294
         within  |             4.320605          1         15 |     T =      15

- female: all variation is "between"
- partner: most variation is "between", because it's fairly rare to switch between having and not having a partner
- wave: all variation is "within", because this is a balanced sample

More on xtsum...
Reading the columns of the table above:
- N: observations with non-missing values of the variable
- n: number of individuals
- T-bar: average number of time-points
- "between" rows: min and max refer to the individual means \bar{x}_i
- "within" rows: min and max refer to individuals' deviations from their own averages, with the global average added back in

The xttab command
For simplicity, jbstat values of missing, maternity leave, government training and other are omitted.

. xttab jbstat if nwaves == 15 & jbstat >= 1 & jbstat != 5 & jbstat <= 8

                  Overall             Between            Within
   jbstat |    Freq.  Percent      Freq.  Percent        Percent
----------+-----------------------------------------------------
 self-emp |     1388     8.66        228    18.45          42.72
 employed |     8982    56.03        974    78.80          68.27
 unemploy |      539     3.36        274    22.17          17.51
  retired |     2687    16.76        314    25.40          58.49
 family c |     1159     7.23        292    23.62          28.97
 ft studt |      718     4.48        271    21.93          42.93
 lt sick, |      558     3.48        105     8.50          39.08
----------+-----------------------------------------------------
    Total |    16031   100.00       2458   198.87          50.28
                               (n = 1236)

- Overall: the pooled sample, broken down by person-years
- Between Freq.: the number of people who spent any time in this state
- Within Percent: of those who spent any time in this state, the proportion of their time (on average) they spent in it

Individual heterogeneity
A very simple concept: people are different!
In social science, when we talk about heterogeneity, we are really talking about unobservable (or unobserved) heterogeneity.
- Observed heterogeneity: differences in education levels, or parental background, or anything else that we can measure and control for in regressions
- Unobserved heterogeneity: anything which is fundamentally unmeasurable, or which is rather poorly measured, or which does not happen to be measured in the particular data set we are using

Example: the relationship between employment and children
- We know that women who have more children are less likely to go out to work, and if they do go out to work they work fewer hours. [A priori, this isn't obvious: women with children do face a higher opportunity cost of work, but one could also argue that they need more money.]
- What is the causal relationship: does having lots of children cause women to do less paid work?
- Or are women who have lots of children a fundamentally different type, with different sets of preferences?

Unobserved heterogeneity

\[ y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i3}\beta_3 + \dots + x_{iK}\beta_K + u_i + \varepsilon_i \]

- Extend the OLS equation we used in Week 1, breaking the error term down into two components: one representing the unobservable characteristics of the person (u_i), and the other representing genuine "error" (\varepsilon_i).
- In cross-sectional analysis, there is no way of distinguishing between the two.
- But in panel data analysis, we have repeated observations, and this allows us to distinguish between them.
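A minimal simulation sketch of this point, not part of the original lecture. It assumes a hypothetical data-generating process in which the person effect u is correlated with x by construction; every variable name and parameter value below is illustrative. It compares pooled OLS with the within (fixed effects) estimator that the following slides introduce.

```stata
* Illustrative simulation (hypothetical data, not the survey sample used in the lecture)
clear
set seed 968
* 500 individuals, each with an unobserved person effect u
set obs 500
gen pid = _n
gen u = rnormal()
* 5 waves per person; u is copied to every wave
expand 5
bysort pid: gen wave = _n
* x is correlated with u by construction; the true slope on x is 2
gen x = 0.8*u + rnormal()
gen y = 1 + 2*x + u + rnormal()
xtset pid wave
* pooled OLS leaves u in the error term
regress y x
* the within estimator removes u by subtracting person means
xtreg y x, fe
```

Under these assumptions one would expect the pooled OLS slope to sit noticeably above 2, because part of the effect of u is attributed to x, while the fixed-effects slope should be close to 2.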
Within and between estimators

\[ y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it} \]

- u_i: individual-specific, fixed over time
- \varepsilon_{it}: varies over time; the usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)

Taking the mean of all observations for person i gives the "between" estimator:

\[ \bar{y}_i = \alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i \]

Subtracting gives the "within" estimator, or "fixed effects":

\[ (y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)\beta + (\varepsilon_{it} - \bar{\varepsilon}_i) \]

And finally, the random effects estimator is a weighted average of the within and between estimators:

\[ (y_{it} - \theta\bar{y}_i) = (1-\theta)\alpha + (x_{it} - \theta\bar{x}_i)\beta + \{(1-\theta)u_i + (\varepsilon_{it} - \theta\bar{\varepsilon}_i)\} \]

θ measures the weight given to between-group variation, and is derived from the variances of u_i and \varepsilon_{it}.

Fixed effects (within estimator)

\[ y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it} \]
\[ (y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)\beta + (\varepsilon_{it} - \bar{\varepsilon}_i) \]

- Ignores between-group variation, so it's an inefficient estimator
- However, few assumptions are required, so FE is generally consistent and unbiased
- Disadvantage: can't estimate the effects of any time-invariant variables
- Also called
    - the least squares dummy variable (LSDV) model
    - the analysis of covariance (CV) model

Between estimator

\[ y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it} \]
\[ \bar{y}_i = \alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i \]

- Not much used (except to calculate the θ parameter for random effects, but STATA does this, not you!)
- It's inefficient compared to random effects: it doesn't use as much information as is available in the data (it only uses means)
- Assumption required: that u_i is uncorrelated with x_i
    - Easy to see why: if they were correlated, how could one decide how much of the variation in y to attribute to the x's (via the betas) as opposed to the correlation?
- Can't estimate effects of variables whose mean does not vary across individuals
    - Age in a cohort study
    - Macro-level variables

Random effects estimator

\[ y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it} \]
\[ (y_{it} - \theta\bar{y}_i) = (1-\theta)\alpha + (x_{it} - \theta\bar{x}_i)\beta + \{(1-\theta)u_i + (\varepsilon_{it} - \theta\bar{\varepsilon}_i)\} \]

- Weighted average of within and between models
- Assumption required: that u_i is uncorrelated with x_i
    - Rather heroic assumption: think of examples
    - Will see a test for this later
- Uses both within- and between-group variation, so makes best use of the data and is efficient
- But unless the assumption holds that u_i is uncorrelated with x_i, it is inconsistent
- AKA one-way error components model, variance components model, GLS estimator (STATA also allows ML random effects)

Consistency versus efficiency
[Figure: sampling distributions of the two estimators around the "true" value of the betas. One is labelled "inconsistent but efficient" (random effects), the other "consistent but inefficient" (fixed effects). Random effects clearly does worse here.]
[Figure: the same comparison, labelled "Random effects" and "Fixed effects". Here, arguably, random effects does a better job of getting close to the "true" coefficient.]

Testing between FE and RE
Hausman test
- Hypothesis H0: u_i is uncorrelated with x_i
- Hypothesis H1: u_i is correlated with x_i
- Fixed effects is consistent under both H0 and H1
- Random effects is efficient, and consistent under H0 (but inconsistent under H1)

Example from last week:

. quietly xtreg LIKERT female ue_sick partner age age2 badh, fe
. estimates store fixed
. quietly xtreg LIKERT female ue_sick partner age age2 badh, re
. hausman fixed .

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     fixed          .          Difference          S.E.
-------------+----------------------------------------------------------------
     ue_sick |    1.951485     2.045302       -.0938175        .0572845
     partner |    -.298668    -.1947691       -.1038989        .0677693
         age |    .1141748     .1058038         .008371        .0157531
        age2 |   -.0011833    -.0011062       -.0000771        .0001624
   badhealth |    1.230831     1.433115       -.2022848        .0187202

        b = consistent under Ho and Ha; obtained from xtreg
        B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic
           chi2(5) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 123.96
           Prob>chi2 = 0.0000

- Sex does not appear in the table, because female is dropped from the FE model
- Random effects is rejected (inconsistent) in favour of fixed effects (consistent but inefficient)

HOWEVER
Big disciplinary divide
- Economists swear by the Hausman test and rarely report random effects
- Other disciplines (eg psychology) consider other factors such as explanatory power

Estimating FE in STATA

. xtreg LIKERT female ue_sick partner age age2 badh, fe

Fixed-effects (within) regression          Number of obs      =   24204
Group variable: pid                        Number of groups   =    3317

R-sq:  within  = 0.0501                    Obs per group: min =       1
       between = 0.1906                                   avg =     7.3
       overall = 0.1285                                   max =      14

                                           F(5,20882)         =  220.44
corr(u_i, Xb)  = 0.1561                    Prob > F           =  0.0000

      LIKERT |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
      female | (dropped)
     ue_sick |  1.951485   .1394164    14.00   0.000    1.678218   2.224752
     partner |  -.298668    .118635    -2.52   0.012   -.5312018  -.0661342
         age |  .1141748   .0214403     5.33   0.000    .0721501   .1561994
        age2 | -.0011833   .0002209    -5.36   0.000   -.0016163  -.0007503
   badhealth |  1.230831   .0428556    28.72   0.000     1.14683   1.314831
       _cons |  6.252975   .4932977    12.68   0.000    5.286073   7.219877
-------------+---------------------------------------------------------------
     sigma_u | 3.9934565
     sigma_e | 4.0525618
         rho | .49265449   (fraction of variance due to u_i)

F test that all u_i=0:     F(3316, 20882) =     4.56     Prob > F = 0.0000

- The R-sq values are "R-square-like" statistics
- The age profile peaks at age 48
- "u" and "e" are the two parts of the error term
- Talk about xtmixed

Between regression:
Not much used, but useful to compare coefficients with fixed effects

. xtreg LIKERT female ue_sick partner age age2 badh, be

Between regression (regression on group means)   Number of obs      =   24204
Group variable: pid                              Number of groups   =    3317

R-sq:  within  = 0.0480                          Obs per group: min =       1
       between = 0.2322                                         avg =     7.3
       overall = 0.1482                                         max =      14

                                                 F(6,3310)          =  166.80
sd(u_i + avg(e_i.)) = 3.833357                   Prob > F           =  0.0000

      LIKERT |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
      female |  1.476659   .1350226    10.94   0.000    1.211923   1.741395
     ue_sick |  2.038192    .312191     6.53   0.000    1.426085   2.650299
     partner | -.0101941   .1777423    -0.06   0.954     -.35869   .3383019
         age |  .0827335   .0219026     3.78   0.000    .0397895   .1256775
        age2 | -.0009489   .0002263    -4.19   0.000   -.0013927  -.0005052
   badhealth |  2.275832   .0926521    24.56   0.000    2.094171   2.457493
       _cons |  3.953941   .4430909     8.92   0.000    3.085181   4.822701

- The coefficient on "partner" was negative and significant in the FE model; here it is close to zero
- In FE, the "partner" coefficient really measures the events of gaining or losing a partner

Random effects regression

. xtreg LIKERT female ue_sick partner age age2 badh, re theta

Random-effects GLS regression              Number of obs      =   24204
Group variable: pid                        Number of groups   =    3317

R-sq:  within  = 0.0500                    Obs per group: min =       1
       between = 0.2239                                   avg =     7.3
       overall = 0.1471                                   max =      14

Random effects u_i ~ Gaussian              Wald chi2(6)       = 2013.32
corr(u_i, X)   = 0 (assumed)               Prob > chi2        =  0.0000

------------------- theta --------------------
   min       5%    median      95%       max
0.1986   0.1986    0.5482   0.6629    0.6629

      LIKERT |     Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
      female |  1.493431   .1259931    11.85   0.000    1.246489   1.740373
     ue_sick |  2.045302   .1271039    16.09   0.000    1.796183   2.294422
     partner | -.1947691   .0973734    -2.00   0.045   -.3856175  -.0039207
         age |  .1058038    .014544     7.27   0.000    .0772981   .1343094
        age2 | -.0011062   .0001498    -7.39   0.000   -.0013998  -.0008126
   badhealth |  1.433115   .0385506    37.17   0.000    1.357558   1.508673
       _cons |  5.181864   .3137662    16.52   0.000    4.566894   5.796835
-------------+---------------------------------------------------------------
     sigma_u | 3.0248563
     sigma_e | 4.0525618
         rho |  .3577895   (fraction of variance due to u_i)

- Option "theta" gives a summary of the weights

And what about OLS?
- OLS simply treats within- and between-group variation as the same
- Pools data across waves

. reg LIKERT female ue_sick partner age age2 badh

      Source |       SS       df       MS           Number of obs =   24204
-------------+------------------------------        F(  6, 24197) =  706.54
       Model |  103583.505     6  17263.9175        Prob > F      =  0.0000
    Residual |  591239.694 24197  24.4344214        R-squared     =  0.1491
-------------+------------------------------        Adj R-squared =  0.1489
       Total |  694823.199 24203  28.7081436        Root MSE      =  4.9431

      LIKERT |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
      female |  1.409466   .0640651    22.00   0.000    1.283895   1.535038
     ue_sick |  2.031815   .1240757    16.38   0.000    1.788619   2.275011
     partner | -.0751296   .0769271    -0.98   0.329   -.2259116   .0756524
         age |  .0983746   .0103316     9.52   0.000     .078124   .1186252
        age2 | -.0010613   .0001049   -10.12   0.000    -.001267  -.0008557
   badhealth |  1.841796   .0357165    51.57   0.000    1.771789   1.911802
       _cons |  4.450393   .2212733    20.11   0.000    4.016684   4.884102

Comparing models
- Compare coefficients between models: reasonably similar, with differences in the "partner" and "badhealth" coefficients
- R-squareds are similar
- Within and between estimators maximise the within and between R-squared respectively

             |       FE           RE           BE          OLS
-------------+--------------------------------------------------
Female       |  (dropped)     1.49 ***     1.47 ***     1.41 ***
Ue_sick      |   1.95 ***     2.05 ***     2.04 ***     2.03 ***
Partner      |  -0.30 **     -0.19 **     -0.01        -0.08
Age          |   0.11 ***     0.11 ***     0.08 ***     0.10 ***
Age-2        |  -0.00 ***    -0.00 ***    -0.00 ***    -0.00 ***
Badhealth    |   1.23 ***     1.43 ***     2.28 ***     1.84 ***
Cons         |   6.25 ***     5.18 ***     3.95 ***     4.45 ***
-------------+--------------------------------------------------
Within R2    |   0.050        0.050        0.048
Between R2   |   0.191        0.224        0.232
Overall R2   |   0.129        0.147        0.148        0.149

Test whether pooling data is valid

\[ y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it} \]

- If the u_i do not vary between individuals, they can be treated as part of α and OLS is fine
- Breusch-Pagan Lagrange multiplier test
    - H0: variance of u_i = 0
    - H1: variance of u_i not equal to zero
- If H0 is not rejected, you can pool the data and use OLS
- Post-estimation test after random effects

. quietly xtreg LIKERT female ue_sick partner age age2 badh, re
. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects

        LIKERT[pid,t] = Xb + u[pid] + e[pid,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                  LIKERT |   28.70814       5.357998
                       e |   16.42326       4.052562
                       u |   9.149756       3.024856

        Test:   Var(u) = 0
                              chi2(1) = 10816.48
                          Prob > chi2 =   0.0000

Thinking about the within and between estimators.....

\[ \bar{y}_i = \alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i \]
\[ (y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)\beta + (\varepsilon_{it} - \bar{\varepsilon}_i) \]

- Both the between and FE models are written with the same coefficient vector β, but there is no reason why they should be the same
- Between: β_j measures the difference in y associated with a one-unit difference in the average value of variable x_j between individuals. Essentially a cross-sectional concept
- Within: β_j measures the difference associated with a one-unit increase in variable x_j at the individual level. Essentially a longitudinal concept
- Random effects, as a weighted average of the two, constrains both βs to be the same
- Excellent article at http://www.stata.com/support/faqs/stat/xt.html
- And lots more at http://www.stata.com/support/faqs/stat/#models

Examples
Example 1: Consider estimating a wage equation, and including a set of regional dummies, with S-E the omitted group.
- Wages in (eg) the N-W are lower, so the estimated between coefficient on N-W will be negative.
- However, in the within regression, we observe the effects of people moving to the N-W. Presumably they wouldn't move without a reasonable incentive.
- So, the estimated within coefficient may even be positive, or at least it's likely to be a lot less negative.
Example 2: Estimate the relationship between family income and children's educational outcomes
- The between-group estimates measure how well the children of richer families do, relative to the children of poorer families. We know this estimate is likely to be large and significant.
- The within-group estimates measure how children's outcomes change as their own family's income changes. This coefficient may well be much smaller.
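One way to see a within and a between coefficient side by side is to split a regressor into its person mean and the deviation from that mean, then enter both in a single regression. The sketch below does this for partner, reusing the variable names from the earlier examples. The names partner_mean and partner_dev are hypothetical, and this decomposition is offered as an illustration rather than as a method used in the lecture.

```stata
* Sketch: separate the within and between components of "partner"
* (partner_mean and partner_dev are illustrative names, not in the data set)
xtset pid wave
* person-specific mean, computed over all waves where partner is observed;
* this may differ slightly from the mean over the estimation sample
bysort pid: egen partner_mean = mean(partner)
gen partner_dev = partner - partner_mean
* coefficient on partner_dev is a within-type (longitudinal) effect;
* coefficient on partner_mean is a between-type (cross-sectional) effect
xtreg LIKERT partner_dev partner_mean, re
* Wald test of whether the two coefficients are equal
test partner_dev = partner_mean
```

If the test rejects equality, that is a sign that a plain random effects model, which forces a single coefficient on partner, is too restrictive for this variable.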
Thinking in terms of slopes and intercepts
- Cross-sectional methods on data pooled across waves
    - Assume betas are identical between individuals
    - Intercepts also identical between individuals
- Fixed effects
    - Assume betas are identical between individuals
    - Allow intercepts to vary between individuals, though an individual's intercept is constant over time
- Random effects
    - Assume betas are identical between individuals [and within and between betas are identical]
    - Allow intercepts to vary between individuals, and within individuals over time
- More on this next week!

Possible combinations of slopes and intercepts with panel data

Constant slopes, constant intercept: the OLS model
\[ y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij} \]

Constant slopes, varying intercepts: the random effects model
\[ y_{ij} = \beta_{0i} + \beta_1 x_{ij} + \varepsilon_{ij} \]
\[ y_{ij} = (\beta_0 + b_{0i}) + \beta_1 x_{ij} + \varepsilon_{ij} \]
\[ y_{ij} = \beta_0 + \beta_1 x_{ij} + u_i + \varepsilon_{ij} \]

Varying slopes, constant intercept: unlikely to occur
\[ y_{ij} = \beta_0 + \beta_{1i} x_{ij} + \varepsilon_{ij} \]
\[ y_{ij} = \beta_0 + (\beta_1 + b_i) x_{ij} + \varepsilon_{ij} \]

Varying slopes, varying intercepts: the random coefficients model (a separate regression for each individual)
\[ y_{ij} = \beta_{0i} + \beta_{1i} x_{ij} + \varepsilon_{ij} \]
\[ y_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_i) x_{ij} + \varepsilon_{ij} \]
\[ y_{ij} = \beta_0 + \beta_{1i} x_{ij} + u_i + \varepsilon_{ij} \]

FE and time-invariant variables
Reformulating the regression equation to distinguish between time-varying and time-invariant variables:

\[ y_{it} = \alpha + x_{it}\beta + z_i\gamma + u_i + \varepsilon_{it} \]

- x_{it}: time-varying variables, eg income, health
- z_i: time-invariant variables, eg sex, race
- u_i: individual-specific fixed effect
- \varepsilon_{it}: residual

Inconveniently, fixed effects washes out the z's, so does not produce estimates of γ.
But there is a way! It requires the z's to be uncorrelated with the u's.

Coefficients on time-invariant variables
- Run FE in the normal way
- Use the estimates to predict the residuals
- Use the between estimator to regress the residuals on the time-invariant variables
- Done!
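Why the two-step procedure can work, as a sketch under the assumptions already stated (z uncorrelated with u) plus the assumption that the FE estimate is close to the true β. The model is y_it = α + x_it β + z_i γ + u_i + ε_it, and the symbol r_it for the FE residual is introduced here purely for illustration.

```latex
\begin{aligned}
r_{it} &= y_{it} - x_{it}\hat{\beta}_{FE}
        \;\approx\; \alpha + z_i\gamma + u_i + \varepsilon_{it} \\
\bar{r}_i &\approx \alpha + z_i\gamma + (u_i + \bar{\varepsilon}_i)
\end{aligned}
```

The between regression of the person-mean residual on z_i treats u_i + \bar{\varepsilon}_i as its error term, so it recovers γ only if z_i is uncorrelated with u_i. A constant shift in the residual affects only the intercept of that regression, not the estimate of γ.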
Only use this if RE is rejected: otherwise, RE provides the best estimates of all coefficients.

Going back to the previous example,

. quietly xtreg LIKERT female ue_sick partner age age2 badh, fe
. predict FE_RESID, ue
(13352 missing values generated)
. xtreg FE_RESID female, be

Between regression (regression on group means)   Number of obs      =   24204
Group variable: pid                              Number of groups   =    3317

R-sq:  within  = 0.0000                          Obs per group: min =       1
       between = 0.0400                                         avg =     7.3
       overall = 0.0212                                         max =      14

                                                 F(1,3315)          =  138.24
sd(u_i + avg(e_i.)) = 3.913298                   Prob > F           =  0.0000

    FE_RESID |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
      female |  1.599518   .1360426    11.76   0.000    1.332782   1.866254
       _cons | -.7288892   .0984186    -7.41   0.000   -.9218564  -.5359219

From previous slide...

             |       FE           RE           BE          OLS
-------------+--------------------------------------------------
Female       |  (dropped)     1.49 ***     1.47 ***     1.41 ***
Ue_sick      |   1.95 ***     2.05 ***     2.04 ***     2.03 ***
Partner      |  -0.30 **     -0.19 **     -0.01        -0.08
Age          |   0.11 ***     0.11 ***     0.08 ***     0.10 ***
Age-2        |  -0.00 ***    -0.00 ***    -0.00 ***    -0.00 ***
Badhealth    |   1.23 ***     1.43 ***     2.28 ***     1.84 ***
Cons         |   6.25 ***     5.18 ***     3.95 ***     4.45 ***
-------------+--------------------------------------------------
Within R2    |   0.050        0.050        0.048
Between R2   |   0.191        0.224        0.232
Overall R2   |   0.129        0.147        0.148        0.149

Our estimate of 1.60 for the coefficient on "female" is slightly higher than, but definitely in the same ball-park as, those produced by the other methods.

Improving specification
Recall our problem with the "partner" coefficient
- OLS and between estimates show no significant relationship between partnership status and LIKERT scores
- FE and (to a lesser extent) RE show a significant negative relationship
- FE estimates the coefficient on the deviation from the individual's mean, so it is likely to reflect moving in together (which makes you temporarily happy) and splitting up (which makes you temporarily sad)
- Investigate this by including variables to capture these events

Partner coefficient from the earlier comparison: FE -0.30 **, RE -0.19 **, BE -0.01, OLS -0.08

Generate variables reflecting changes

. sort pid wave
. gen get_pnr = (partner == 1 & partner[_n-1] == 0) if pid == pid[_n-1] & wave == wave[_n-1] + 1
(5078 missing values generated)
. gen lose_pnr = (partner == 0 & partner[_n-1] == 1) if pid == pid[_n-1] & wave == wave[_n-1] + 1
(5078 missing values generated)

Note: we will lose some observations, because the new variables are missing wherever the previous wave is not observed.

Fixed effects

. xtreg LIKERT partner get_pnr lose_pnr female ue_sick age age2 badh, fe

Fixed-effects (within) regression          Number of obs      =   21264
Group variable: pid                        Number of groups   =    2764

R-sq:  within  = 0.0574                    Obs per group: min =       1
       between = 0.1839                                   avg =     7.7
       overall = 0.1333                                   max =      13

                                           F(7,18493)         =  160.80
corr(u_i, Xb)  = 0.1460                    Prob > F           =  0.0000

      LIKERT |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
     partner |  .3186429    .143112     2.23   0.026    .0381301   .5991557
     get_pnr | -.0793952   .2116739    -0.38   0.708   -.4942956   .3355053
    lose_pnr |   2.64016   .2371252    11.13   0.000    2.175372   3.104947
      female | (dropped)
     ue_sick |  1.894659   .1530311    12.38   0.000    1.594704   2.194614
         age |  .0734274   .0240822     3.05   0.002    .0262241   .1206308
        age2 | -.0008799   .0002464    -3.57   0.000   -.0013629  -.0003969
   badhealth |  1.284593    .045967    27.95   0.000    1.194494   1.374693
       _cons |  6.796602   .5570247    12.20   0.000    5.704782   7.888422
-------------+---------------------------------------------------------------
     sigma_u | 3.7857335
     sigma_e |  4.030519
         rho | .46871319   (fraction of variance due to u_i)

F test that all u_i=0:     F(2763, 18493) =     4.83     Prob > F = 0.0000

The coefficient on having a partner is now slightly positive; getting a partner is insignificant; losing a partner is now large and positive.

Random effects
. xtreg LIKERT partner get_pnr lose_pnr female ue_sick age age2 badh, re

Random-effects GLS regression              Number of obs      =   21264
Group variable: pid                        Number of groups   =    2764

R-sq:  within  = 0.0571                    Obs per group: min =       1
       between = 0.2213                                   avg =     7.7
       overall = 0.1545                                   max =      13

Random effects u_i ~ Gaussian              Wald chi2(8)       = 1922.41
corr(u_i, X)   = 0 (assumed)               Prob > chi2        =  0.0000

      LIKERT |     Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
     partner |   .281375    .113251     2.48   0.013    .0594072   .5033428
     get_pnr | -.0897335    .204547    -0.44   0.661   -.4906382   .3111713
    lose_pnr |   2.76626   .2284331    12.11   0.000    2.318539    3.21398
      female |  1.450748   .1324675    10.95   0.000    1.191116   1.710379
     ue_sick |  1.892352   .1388821    13.63   0.000    1.620148   2.164556
         age |  .0719139   .0159222     4.52   0.000    .0407069   .1031209
        age2 | -.0007748   .0001621    -4.78   0.000   -.0010926   -.000457
   badhealth |  1.470353   .0414036    35.51   0.000    1.389203   1.551502
       _cons |  5.457217   .3436851    15.88   0.000    4.783606   6.130827
-------------+---------------------------------------------------------------
     sigma_u | 2.9604042
     sigma_e |  4.030519
         rho |  .3504325   (fraction of variance due to u_i)

- The partner, get_pnr and lose_pnr coefficients are similar to those from FE
- rho is the proportion of total residual variance attributable to the u's (c.f. random slopes models later)

Collating the coefficients:

New specification
             |       FE           RE           BE          OLS
-------------+--------------------------------------------------
Partner      |   0.32 **      0.28 **      0.29         0.17 **
Get partner  |  -0.08        -0.09        -2.85 **     -0.10
Lose partner |   2.64 ***     2.77 ***     7.17 ***     3.19 ***

Original specification
             |       FE           RE           BE          OLS
-------------+--------------------------------------------------
Partner      |  -0.30 **     -0.19 **     -0.01        -0.08

Hausman test again
Have we cleaned up the specification sufficiently that the Hausman test will now fail to reject random effects?

. quietly xtreg LIKERT partner get_pnr lose_pnr female ue_sick age age2 badh, fe
. estimates store fixed
. quietly xtreg LIKERT partner get_pnr lose_pnr female ue_sick age age2 badh, re
. hausman fixed .

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     fixed          .          Difference          S.E.
-------------+----------------------------------------------------------------
     partner |    .3186429      .281375        .0372679        .0874944
     get_pnr |   -.0793952    -.0897335        .0103383        .0544645
    lose_pnr |     2.64016      2.76626       -.1260999        .0636136
     ue_sick |    1.894659     1.892352        .0023072        .0642673
         age |    .0734274     .0719139        .0015135        .0180675
        age2 |   -.0008799    -.0007748       -.0001051        .0001855
   badhealth |    1.284593     1.470353       -.1857594        .0199676

        b = consistent under Ho and Ha; obtained from xtreg
        B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic
           chi2(7) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 116.04
           Prob>chi2 = 0.0000

No! Random effects is still rejected, although the chi-squared statistic is smaller now (116.04) than previously (123.96).

Thinking about time
- Under FE, including "wave" or "year" as a continuous variable is not very useful, since it is treated as the deviation from the individual's mean
- We may not want to treat time as a linear trend (for example, if we are looking for a cut point related to social policy)
- Also, wave is very much correlated with individuals' ages
- Can do FE or RE including time periods as dummies
- May be referred to as "two-way fixed effects"
- Generate each dummy variable separately, or....

* create one 0/1 dummy per wave: W1, W2, ..., W15
local i = 1
while `i' <= 15 {
    gen byte W`i' = (wave == `i')
    local i = `i' + 1
}

Time variables are insignificant here (as we would expect)

. xtreg LIKERT partner get_pnr lose_pnr female ue_sick age age2 badh W*, fe

Fixed-effects (within) regression          Number of obs      =   21264
Group variable: pid                        Number of groups   =    2764

R-sq:  within  = 0.0580                    Obs per group: min =       1
       between = 0.1811                                   avg =     7.7
       overall = 0.1323                                   max =      13

                                           F(19,18481)        =   59.92
corr(u_i, Xb)  = 0.1423                    Prob > F           =  0.0000

      LIKERT |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
     partner |  .3193454   .1431496     2.23   0.026     .038759   .5999317
     get_pnr |  -.072553   .2117186    -0.34   0.732    -.487541   .3424349
    lose_pnr |  2.648729   .2372293    11.17   0.000    2.183737    3.11372
      female | (dropped)
     ue_sick |  1.894834   .1531005    12.38   0.000    1.594743   2.194925
         age |   .071427   .1200867     0.59   0.552   -.1639541   .3068081
        age2 | -.0008821   .0002464    -3.58   0.000   -.0013651  -.0003991
   badhealth |  1.282999   .0460178    27.88   0.000      1.1928   1.373199
          W2 | -.0140737   1.540443    -0.01   0.993   -3.033485   3.005338
          W3 | -.0554759   1.422781    -0.04   0.969   -2.844257   2.733306
          W4 |  .1273198   1.303812     0.10   0.922   -2.428272   2.682911
          W5 | -.0761569   1.185396    -0.06   0.949   -2.399643   2.247329
          W6 |  .0865111    1.07344     0.08   0.936    -2.01753   2.190553
          W7 | -.0104289   .9562925    -0.01   0.991   -1.884851   1.863993
          W8 | -.1120629   .8400402    -0.13   0.894   -1.758619   1.534493
          W9 | (dropped)
         W10 |  .2739767   .6086295     0.45   0.653   -.9189933   1.466947
         W11 |  .0881723   .4963143     0.18   0.859   -.8846495   1.060994
         W12 | -.0358824    .385874    -0.09   0.926   -.7922312   .7204663
         W13 | -.0671728    .279283    -0.24   0.810   -.6145932   .4802477
         W14 |  .0610156   .1898793     0.32   0.748   -.3111654   .4331966
         W15 | (dropped)
       _cons |  6.873039   6.064719     1.13   0.257    -5.01437   18.76045
-------------+---------------------------------------------------------------
     sigma_u | 3.7904487
     sigma_e | 4.0304244
         rho | .46934486   (fraction of variance due to u_i)

F test that all u_i=0:     F(2763, 18481) =     4.83     Prob > F = 0.0000

Extending panel data models to discrete dependent variables
Panel data extensions to logit and probit models

Recap from Week 1:
- These models cover discrete (categorical) outcomes, eg psychological morbidity; whether one has a job. Think of other examples.
- Outcome variable is always 0 or 1. Estimate:

\[ \Pr(Y = 1) = F(X, \beta) \]
\[ \Pr(Y = 0) = 1 - F(X, \beta) \]

- OLS (linear probability model) would set F(X, β) = X'β + ε
- Inappropriate because:
    - Heteroscedasticity: the outcome variable is always 0 or 1, so ε only takes the value -x'β or 1-x'β
    - More seriously, one cannot constrain estimated probabilities to lie between 0 and 1

Extension of logit and probit to panel data:
- We won't do the maths!
- But essentially, STATA maximises a likelihood function derived from the panel data specification
- Both random effects and fixed effects
- Random effects is SLOW!!

First, generate the categorical variable indicating psychological morbidity

. gen byte PM = (hlghq2 > 2) if hlghq2 >= 0 & hlghq2 != .

Fixed effects estimates: xtlogit (clogit)
xtlogit PM partner get_pnr lose_pnr female ue_sick age age2 badh, fe note: multiple positive outcomes within groups encountered. note: 1221 groups (6462 obs) dropped because of all positive or all negative outcomes. note: female omitted because of no within-group variance. Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = = = = -5844.5165 -5829.2179 -5829.2122 -5829.2122 Conditional fixed-effects logistic regression Group variable: pid Log likelihood Coef. partner get_pnr lose_pnr ue_sick age age2 badhealth .0960128 .0368568 1.231475 .7533968 -.03383 .0000894 .5386858 = = 14802 1543 Obs per group: min = avg = max = 2 9.6 13 LR chi2(7) Prob > chi2 = -5829.2122 PM Number of obs Number of groups Std. Err. .0917139 .13587 .1469964 .0970111 .0162808 .0001715 .0298361 z 1.05 0.27 8.38 7.77 -2.08 0.52 18.05 P>|z| 0.295 0.786 0.000 0.000 0.038 0.602 0.000 = = 517.04 0.0000 [95% Conf. Interval] -.0837432 -.2294436 .9433672 .5632586 -.0657398 -.0002468 .4802081 .2757688 .3031572 1.519583 .9435351 -.0019203 .0004256 .5971636 Is losing a partner necessarily causing the psychological morbidity? Losing a partner, being unemployed or sick, and being in bad health are associated with psychological morbidity Negative in age throughout the human life span Adding some more variables: We know that women sometimes suffer from post-natal depression. Try total number of children, and children aged 0-2 Total number of children is insignificant, but children 0-2 is significant. . xtlogit PM partner get_pnr lose_pnr female ue_sick age age2 badh nch02, fe note: multiple positive outcomes within groups encountered. note: 1221 groups (6462 obs) dropped because of all positive or all negative outcomes. note: female omitted because of no within-group variance. 
Iteration 0:   log likelihood = -5839.5118
Iteration 1:   log likelihood = -5824.2036
Iteration 2:   log likelihood = -5824.1975
Iteration 3:   log likelihood = -5824.1975

Conditional fixed-effects logistic regression   Number of obs      =     14802
Group variable: pid                             Number of groups   =      1543

                                                Obs per group: min =         2
                                                               avg =       9.6
                                                               max =        13

                                                LR chi2(8)         =    527.07
Log likelihood  = -5824.1975                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
          PM |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     partner |   .0470255   .0931317     0.50   0.614    -.1355092    .2295603
     get_pnr |   .0679186   .1363361     0.50   0.618    -.1992952    .3351324
    lose_pnr |   1.217756   .1472094     8.27   0.000     .9292311    1.506282
     ue_sick |    .749727   .0970536     7.72   0.000     .5595054    .9399487
         age |  -.0295734   .0163456    -1.81   0.070    -.0616102    .0024635
        age2 |   .0000582   .0001719     0.34   0.735    -.0002787    .0003951
   badhealth |    .537545   .0298374    18.02   0.000     .4790647    .5960253
       nch02 |    .249448   .0785737     3.17   0.001     .0954464    .4034497
------------------------------------------------------------------------------

Next step?
Yes, we should separate men and women.

sort female
by female: xtlogit PM partner get_pnr lose_pnr female ue_sick age age2 badh nch02, fe

Men
------------------------------------------------------------------------------
          PM |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     partner |  -.0262595    .151735    -0.17   0.863    -.3236547    .2711357
     get_pnr |   .2042066   .2165868     0.94   0.346    -.2202957    .6287089
    lose_pnr |   1.335693   .2314295     5.77   0.000     .8820997    1.789287
     ue_sick |   .9009421   .1397474     6.45   0.000     .6270421    1.174842
         age |   .0141781   .0265837     0.53   0.594     -.037925    .0662812
        age2 |  -.0004864   .0002804    -1.73   0.083    -.0010359    .0000632
   badhealth |   .5628403    .047939    11.74   0.000     .4688817     .656799
       nch02 |   .0458965   .1268808     0.36   0.718    -.2027854    .2945784
------------------------------------------------------------------------------

Women
------------------------------------------------------------------------------
          PM |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     partner |   .0930161   .1181743     0.79   0.431    -.1386013    .3246336
     get_pnr |  -.0122303   .1751243    -0.07   0.944    -.3554676    .3310069
    lose_pnr |    1.13012   .1901842     5.94   0.000     .7573657    1.502874
     ue_sick |   .6032882   .1357316     4.44   0.000     .3372591    .8693174
         age |  -.0570441   .0208069    -2.74   0.006    -.0978248   -.0162633
        age2 |   .0004039   .0002185     1.85   0.065    -.0000245    .0008322
   badhealth |   .5222259   .0382135    13.67   0.000     .4473288     .597123
       nch02 |   .3840788   .1011092     3.80   0.000     .1859084    .5822493
------------------------------------------------------------------------------

The relationship between PM and young children is confined to women.
Any other gender differences?

Back to random effects

Random-effects logistic regression              Number of obs      =     21264
Group variable: pid                             Number of groups   =      2764

Random effects u_i ~ Gaussian                   Obs per group: min =         1
                                                               avg =       7.7
                                                               max =        13

                                                Wald chi2(9)       =    959.52
Log likelihood  = -10377.058                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
          PM |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     partner |   .0565392   .0695474     0.81   0.416    -.0797712    .1928496
     get_pnr |    .032454   .1320281     0.25   0.806    -.2263163    .2912244
    lose_pnr |   1.309734   .1389371     9.43   0.000     1.037422    1.582046
      female |    .686486   .0712769     9.63   0.000     .5467859    .8261862
     ue_sick |   .7131287   .0839162     8.50   0.000     .5486559    .8776015
         age |   -.013065   .0094552    -1.38   0.167    -.0315968    .0054667
        age2 |   .0000337   .0000961     0.35   0.726    -.0001546    .0002221
   badhealth |   .6613526   .0261188    25.32   0.000     .6101607    .7125446
       nch02 |   .2653162   .0743185     3.57   0.000     .1196546    .4109779
       _cons |  -2.871645   .2033651   -14.12   0.000    -3.270233   -2.473057
-------------+----------------------------------------------------------------
    /lnsig2u |   .6496376   .0571876                       .537552    .7617232
-------------+----------------------------------------------------------------
     sigma_u |    1.38378   .0395675                      1.308362    1.463545
         rho |   .3679062    .013299                      .3422473    .3943355
------------------------------------------------------------------------------
Likelihood-ratio test of rho=0: chibar2(01) =  2038.50 Prob >= chibar2 = 0.000

Estimates are VERY similar to FE

Testing between FE and RE

quietly xtlogit PM partner get_pnr lose_pnr female ue_sick age age2 badh nch02, fe
estimates store fixed
quietly xtlogit PM partner get_pnr lose_pnr female ue_sick age age2 badh nch02, re
hausman fixed .

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     fixed          .          Difference          S.E.
-------------+----------------------------------------------------------------
     partner |   .0470255     .0565392       -.0095137        .0619409
     get_pnr |   .0679186      .032454        .0354646        .0340015
    lose_pnr |   1.217756     1.309734       -.0919776        .0486529
     ue_sick |    .749727     .7131287        .0365983        .0487594
         age |  -.0295734     -.013065       -.0165083        .0133334
        age2 |   .0000582     .0000337        .0000245        .0001425
   badhealth |    .537545     .6613526       -.1238076         .014425
       nch02 |    .249448     .2653162       -.0158682        .0255066
------------------------------------------------------------------------------
                  b = consistent under Ho and Ha; obtained from xtlogit
   B = inconsistent under Ha, efficient under Ho; obtained from xtlogit

    Test:  Ho:  difference in coefficients not systematic

                  chi2(8) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =      149.76
                Prob>chi2 =      0.0000

Random effects is rejected again.

Random effects probit

No fixed effects command is available, as there does not exist a sufficient statistic allowing the fixed effects to be conditioned out of the likelihood.

Random-effects probit regression                Number of obs      =     21264
Group variable: pid                             Number of groups   =      2764

Random effects u_i ~ Gaussian                   Obs per group: min =         1
                                                               avg =       7.7
                                                               max =        13

                                                Wald chi2(9)       =    995.53
Log likelihood  = -10370.501                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
          PM |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     partner |   .0334017   .0399311     0.84   0.403    -.0448618    .1116651
     get_pnr |   .0183513   .0757428     0.24   0.809    -.1301019    .1668045
    lose_pnr |   .7646656   .0800772     9.55   0.000     .6077173     .921614
      female |   .3924276   .0407552     9.63   0.000     .3125488    .4723063
     ue_sick |   .4189777    .048681     8.61   0.000     .3235648    .5143906
         age |  -.0077306   .0054309    -1.42   0.155     -.018375    .0029138
        age2 |   .0000201   .0000552     0.36   0.715     -.000088    .0001283
   badhealth |   .3825895   .0149317    25.62   0.000     .3533239    .4118551
       nch02 |   .1530239   .0431233     3.55   0.000     .0685039     .237544
       _cons |  -1.657895   .1165019   -14.23   0.000    -1.886235   -1.429556
-------------+----------------------------------------------------------------
    /lnsig2u |  -.4475525   .0552927                     -.5559243   -.3391807
-------------+----------------------------------------------------------------
     sigma_u |    .799494   .0221031                      .7573255    .8440105
         rho |   .3899428   .0131534                       .364491    .4160085
------------------------------------------------------------------------------
Likelihood-ratio test of rho=0: chibar2(01) =  2056.20 Prob >= chibar2 = 0.000

Why aren’t the sets of coefficients more similar?
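One way to approach this question is to look at the ratio of each random-effects logit coefficient to its probit counterpart. Logit and probit fix the scale of the error term differently (the standard logistic distribution has variance pi^2/3, the standard normal has variance 1), so the two sets of coefficients are expected to differ by a roughly constant factor rather than to match. The sketch below is an illustration only and was not part of the original Stata session: it is written in Python, the coefficients are typed in by hand from the two outputs above, and the names logit, probit and ratios are made up for the example.

```python
import math

# Coefficients copied by hand from the random-effects logit and probit
# outputs above (only the clearly significant terms are compared).
logit = {
    "lose_pnr": 1.309734,
    "female": 0.686486,
    "ue_sick": 0.7131287,
    "badhealth": 0.6613526,
    "nch02": 0.2653162,
    "_cons": -2.871645,
}
probit = {
    "lose_pnr": 0.7646656,
    "female": 0.3924276,
    "ue_sick": 0.4189777,
    "badhealth": 0.3825895,
    "nch02": 0.1530239,
    "_cons": -1.657895,
}

# Ratio of logit to probit coefficient, variable by variable.
ratios = {name: logit[name] / probit[name] for name in logit}

for name, ratio in ratios.items():
    print(f"{name:10s} logit/probit = {ratio:.2f}")

# For reference: the standard deviation of a standard logistic error term.
print(f"pi/sqrt(3) = {math.pi / math.sqrt(3):.2f}")
```

If the ratios come out close to one another, the two models are telling much the same story on different scales, and only the units of the coefficients differ.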
                  Logit           Probit
Partner           0.057           0.033
Get partner       0.032           0.018
Lose partner      1.310 ***       0.765 ***
Female            0.686 ***       0.392 ***
UE/sick           0.713 ***       0.419 ***
Age              -0.013          -0.008
Age-squared       0.000           0.000
Bad health        0.661 ***       0.383 ***
Kids 0-2          0.265 ***       0.153 ***
Cons             -2.872 ***      -1.658 ***

Remember the conversion scale from Week 1…

Other models

Random-effects Tobit
No fixed-effects specification is available.
This is a potential problem if random effects would be rejected, and it is not possible to use the Hausman test to check, since the test relies on being able to estimate a fixed-effects model.

Random Coefficients Models (AKA multilevel models)