SC968 Panel data methods for sociologists
Lecture 1, part 1: A review of concepts for regression modelling (or: things you should know already)

Overview
- Models: OLS, logit and probit, mathematically and practically
- Interpretation of results, measures of fit and regression diagnostics
- Model specification
- Post-estimation commands
- STATA competence

Ordinary Least Squares (OLS)

  y_i = x_i1 β_1 + x_i2 β_2 + x_i3 β_3 + ... + x_iK β_K + ε_i

- y_i: value of the dependent variable for individual i (the LHS variable)
- x_i1: value of explanatory variable 1 for person i; β_1 is the coefficient on variable 1
- An intercept (constant) is included among the regressors
- ε_i: residual (disturbance, error term)
- K: total number of explanatory variables (RHS variables, or regressors)

Examples:
- y_i = mental health; x1 = sex, x2 = age, x3 = marital status, x4 = employment status, x5 = physical health
- y_i = hourly pay; x1 = sex, x2 = age, x3 = education, x4 = job tenure, x5 = industry, x6 = region

In vector form: y_i = x_i'β + ε_i, where x_i = (x_i1, x_i2, ..., x_iK)' is the vector of explanatory variables and β = (β_1, β_2, ..., β_K)' is the vector of coefficients. In matrix form, stacking all N observations: y = Xβ + ε, where y and ε are N×1 and X is N×K. Note: you will often see x'β written as xβ.

OLS
- Also called "linear regression"
- Assumes the dependent variable is a linear combination of the explanatory variables, plus a disturbance
- "Least squares": the β's are estimated so as to minimise the sum of the squared ε's:

  min Σ_i (ε_i)² = min ε'ε, which gives b = (X'X)⁻¹ X'y

Assumptions
- Residuals have zero mean: E(ε_i) = 0
- It follows that the ε's and X's are uncorrelated:
  E(ε_i | X_i) = 0, and hence E(ε_i X_i) = 0
  - Violated if a regressor is endogenous (eg, number of children in female labour supply models)
  - Cure by (eg) Instrumental Variables
- Homoscedasticity: all ε's have the same variance: Var(ε_i) = σ²
  - Classic example of violation: food consumption and income
  - Cure by using weighted least squares
- Nonautocorrelation: ε's uncorrelated with each other: E(ε_i ε_j) = 0 for i ≠ j
  - Eg, data sets where the same individual appears multiple times
  - Adjust standard errors: the 'cluster' option in STATA
- Disturbances are iid (normally distributed, zero mean, constant variance)

When is OLS appropriate?
- When you have a continuous dependent variable
- When the assumptions are not obviously violated
- Eg, you would use it to estimate regressions for height, but not for whether a person has a university degree
- As a first step in research, to get ball-park estimates: we will use it a lot for this purpose

Worked examples
- Coefficients, p-values, t-statistics
- Measures of fit (R-squared, adjusted R-squared)
- Thinking about specification
- Post-estimation commands
- Regression diagnostics

A note on the data: all examples (in lectures and practicals) are drawn from a 20% sample of the British Household Panel Survey (BHPS). More about the data later!

Summarize monthly earned income:

. sum incm if age >= 17 & age <= 64, d

                            incm
    Percentiles:   1%     43        50%   1073.088
                   5%    156        75%   1690
                  10%    268.6667   90%   2471.848
                  25%    615.3333   95%   3061.355
                                    99%   5003.849
    Smallest: 1, 1.25, 2, 2.416667    Largest: 9207.083, 9333.333, 10000, 10000
    Obs = 16696    Sum of Wgt. = 16696
    Mean = 1282.831    Std. Dev. = 1008.308
    Variance = 1016685    Skewness = 2.19295    Kurtosis = 11.94321

First worked example
- Monthly labour income, for people whose labour income is >= £1
- For illustrative purposes only. Not an example of good practice.

. do "C:\DOCUME~1\maria\LOCALS~1\Temp\STD03000000.tmp"
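Before turning to the Stata output, the least-squares formula b = (X'X)⁻¹X'y can be checked numerically. A minimal sketch with simulated data (the variable names and true coefficients are invented for illustration; with real data you would use the canned `regress` command):

```python
import numpy as np

# Simulate y = 2 + 3*x1 - 1*x2 + noise, then recover the coefficients
# with the closed-form OLS estimator b = (X'X)^{-1} X'y.
rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])   # first column is the constant
y = X @ np.array([2.0, 3.0, -1.0]) + rng.normal(scale=0.1, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)       # solve, rather than invert, for stability
print(b)   # close to [2, 3, -1]
```

This is the core of what `regress` computes (plus standard errors, fit statistics and diagnostics).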
. reg incm female age age2 partner ed_sec ed_deg mth_int if age >= 17 & age <= 64

      Source |       SS        df        MS          Number of obs =   16458
       Model |  4.8145e+09      7   687785597        F(7, 16450)   =  957.92
    Residual |  1.1811e+10  16450   718000.667       Prob > F      =  0.0000
       Total |  1.6626e+10  16457   1010245.5        R-squared     =  0.2896
                                                     Adj R-squared =  0.2893
                                                     Root MSE      =  847.35

        incm |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
      female |  -594.9641    13.26812   -44.84   0.000    -620.9711   -568.9571
         age |   101.0994    3.859657    26.19   0.000     93.53401    108.6647
        age2 |  -1.155281    .0479992   -24.07   0.000    -1.249364   -1.061197
     partner |   155.7992    16.62703     9.37   0.000     123.2085      188.39
      ed_sec |   380.5032    14.36582    26.49   0.000     352.3446    408.6618
      ed_deg |   1076.674    20.54526    52.40   0.000     1036.403    1116.945
     mth_int |  -5.059072    4.036446    -1.25   0.210    -12.97094      2.8528
       _cons |   -819.931    78.80064   -10.41   0.000    -974.3888   -665.4732

Annotations on the output:
- The upper panel is the analysis of variance (ANOVA) table: MS = SS/df
- R-squared = Model SS / Total SS
- Root MSE = sqrt(residual MS)
- t-stat = coefficient / standard error
- The F test asks whether all coefficients except the constant are jointly zero
- The confidence interval is the coefficient plus or minus 1.96 standard errors

What do the results tell us? (same output as above)
- All coefficients except month of interview are significant
- 29% of variation explained
- Being female reduces income by nearly £600 per month
- Income goes up with age and then down
- 16458 observations... oops, this is from panel data, so there are repeated observations on individuals. Add ,cluster(pid) as an option

. reg incm female age age2 partner ed_sec ed_deg mth_int if age >= 17 & age <= 64, cluster(pid)

    Linear regression                       Number of obs =  16458
                                            F(7, 2465)    =  135.26
                                            Prob > F      =  0.0000
                                            R-squared     =  0.2896
                                            Root MSE      =  847.35
    (Std. Err. adjusted for 2466 clusters in pid)

        incm |      Coef.   Robust Std. Err.     t    P>|t|    [95% Conf. Interval]
      female |  -594.9641    31.81172   -18.70   0.000    -657.3445   -532.5836
         age |   101.0994    7.323088    13.81   0.000     86.73932    115.4594
        age2 |  -1.155281    .0933813   -12.37   0.000    -1.338395   -.9721666
     partner |   155.7992    30.87227     5.05   0.000     95.26099    216.3375
      ed_sec |   380.5032    30.36746    12.53   0.000     320.9549    440.0516
      ed_deg |   1076.674    64.45131    16.71   0.000     950.2898    1203.058
     mth_int |  -5.059072    4.126102    -1.23   0.220    -13.15006    3.031912
       _cons |   -819.931    132.8455    -6.17   0.000    -1080.431   -559.4306

- Coefficients, R-squared etc are unchanged from the previous specification
- But the standard errors are adjusted: standard errors are larger, t-statistics lower

Let's get rid of the "month" variable:

. reg incm female age age2 partner ed_sec ed_deg if age >= 17 & age <= 64, cluster(pid)

    Linear regression                       Number of obs =  16460
                                            F(6, 2466)    =  156.78
                                            Prob > F      =  0.0000
                                            R-squared     =  0.2895
                                            Root MSE      =  847.33
    (Std. Err. adjusted for 2467 clusters in pid)

        incm |      Coef.   Robust Std. Err.     t    P>|t|    [95% Conf. Interval]
      female |  -594.8596    31.80682   -18.70   0.000    -657.2304   -532.4887
         age |   100.9827    7.325995    13.78   0.000       86.617    115.3485
        age2 |  -1.153834    .0934155   -12.35   0.000    -1.337015   -.9706534
     partner |   155.5618    30.87778     5.04   0.000     95.01275    216.1109
      ed_sec |   381.0247    30.36183    12.55   0.000     321.4874     440.562
      ed_deg |   1076.837    64.44019    16.71   0.000     950.4745    1203.199
       _cons |  -866.2836    125.9787    -6.88   0.000    -1113.319   -619.2486

Think about the female coefficient a bit more. Could it be to do with women working shorter hours? Control for weekly hours of work:

. reg incm female age age2 partner ed_sec ed_deg hrsm if age >= 17 & age <= 64, cluster(pid)

    Linear regression                       Number of obs =  13998
                                            F(7, 2262)    =  247.67
                                            Prob > F      =  0.0000
                                            R-squared     =  0.4580
                                            Root MSE      =  690.95
    (Std. Err. adjusted for 2263 clusters in pid)

        incm |      Coef.   Robust Std. Err.     t    P>|t|    [95% Conf. Interval]
      female |  -314.6874    34.32954    -9.17   0.000    -382.0081   -247.3667
         age |   79.55289    6.372918    12.48   0.000     67.05551    92.05027
        age2 |   -.873335    .0817518   -10.68   0.000    -1.033651   -.7130186
     partner |   148.0265    26.07885     5.68   0.000     96.88551    199.1675
      ed_sec |     340.68    26.67171    12.77   0.000     288.3764    392.9835
      ed_deg |   996.7434    59.88369    16.64   0.000     879.3107    1114.176
        hrsm |   5.654682    .2467777    22.91   0.000     5.170747    6.138616
       _cons |  -1495.805    111.8223   -13.38   0.000     -1715.09    -1276.52

Is the coefficient on hours of work reasonable? £5.65 for every additional hour worked: certainly in the right ball park.

Looking at the two specifications together (without hours: R-squared = 0.2895, N = 16460; with hours: R-squared = 0.4580, N = 13998; coefficient tables as above):
- R-squared jumps from 29% to 46%
- The coefficient on female goes from -595 to -315
- Almost half the effect of gender is explained by women's shorter hours of work
- Age, partner and education coefficients are also reduced in magnitude, for similar reasons
- The number of observations falls from 16460 to 13998: missing data on hours

Interesting post-estimation activities (based on the hours specification above)

What age does income peak? The model includes a quadratic in age:
  Income = β0 + β1*age + β2*age2 + (other terms)
  d(Income)/d(age) = β1 + 2*β2*age
The derivative is zero when age = -β1/(2*β2) = -79.553/(2 × (-0.873)) = 45.5

Is the effect of university qualifications statistically different from the effect of secondary education?

. test ed_sec = ed_deg
  ( 1)  ed_sec - ed_deg = 0
        F(1, 2262) = 110.75
        Prob > F   = 0.0000
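The income-peak calculation can be verified by hand (a quick sketch; the coefficients are copied from the hours regression):

```python
# Income = ... + b_age*age + b_age2*age^2, so d(income)/d(age) = b_age + 2*b_age2*age,
# which is zero at age = -b_age / (2 * b_age2).
b_age, b_age2 = 79.55289, -0.873335   # coefficients from the regression output
peak_age = -b_age / (2 * b_age2)
print(round(peak_age, 1))   # 45.5
```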
A closer look at the "couple" coefficient:

. bysort female: reg incm age age2 partner ed_sec ed_deg hrsm if age >= 17 & age <= 64, cluster(pid)

-> female = 0
    Linear regression                       Number of obs =  6776
                                            F(6, 1095)    =  115.53
                                            Prob > F      =  0.0000
                                            R-squared     =  0.3452
                                            Root MSE      =  787.93
    (Std. Err. adjusted for 1096 clusters in pid)

        incm |      Coef.   Robust Std. Err.     t    P>|t|    [95% Conf. Interval]
         age |   113.3119    10.56356    10.73   0.000      92.5848     134.039
        age2 |  -1.257366    .1316253    -9.55   0.000    -1.515633      -.9991
     partner |    213.351     46.9817     4.54   0.000     121.1667    305.5354
      ed_sec |   356.7472    41.91151     8.51   0.000     274.5113    438.9832
      ed_deg |   1082.255    89.21241    12.13   0.000     907.2087    1257.302
        hrsm |   3.930412    .3784925    10.38   0.000      3.18776    4.673065
       _cons |  -1907.107    175.5681   -10.86   0.000    -2251.595    -1562.62

-> female = 1
    Linear regression                       Number of obs =  7222
                                            F(6, 1166)    =  125.30
                                            Prob > F      =  0.0000
                                            R-squared     =  0.4830
                                            Root MSE      =  564.65
    (Std. Err. adjusted for 1167 clusters in pid)

        incm |      Coef.   Robust Std. Err.     t    P>|t|    [95% Conf. Interval]
         age |   56.20989    7.327688     7.67   0.000     41.83296    70.58682
        age2 |  -.6229372    .0961605    -6.48   0.000    -.8116041   -.4342702
     partner |   84.15365    29.27677     2.87   0.004      26.7126    141.5947
      ed_sec |   277.2823    31.66175     8.76   0.000     215.1619    339.4026
      ed_deg |   819.3002    73.74637    11.11   0.000     674.6098    963.9906
        hrsm |   6.806946    .3051015    22.31   0.000     6.208337    7.405556
       _cons |  -1382.844    133.0607   -10.39   0.000    -1643.909   -1121.779

- Men benefit much more than women from being in a couple
- Other coefficients also differ between men and women, but with the current specification we can't test whether the differences are significant

Logit and Probit
- Developed for discrete (categorical) dependent variables
- Eg, psychological morbidity, whether one has a job... think of other examples
- The outcome variable is always 0 or 1
- Estimate: Pr(Y=1) = F(X, β), so Pr(Y=0) = 1 - F(X, β)
- OLS (the linear probability model) would set F(X, β) = X'β + ε. Inappropriate because:
  - Heteroscedasticity: the outcome variable is always 0 or 1, so ε only takes the value -x'β or 1 - x'β
  - More seriously, one cannot constrain the estimated probabilities to lie between 0 and 1

Logit and Probit
We are looking for a function which lies between 0 and 1:
- Cumulative normal distribution: the Probit model

    Pr(Y=1) = ∫ from -∞ to x'β of φ(t) dt = Φ(x'β)

- Logistic distribution: the Logit (logistic) model

    Pr(Y=1) = Λ(x'β) = e^(x'β) / (1 + e^(x'β))

They are very similar! Note how both CDFs lie between 0 and 1 (vertical axis).
[Figure: probit and logit CDFs, from http://www.gseis.ucla.edu/courses/ed231c/notes3/probit1.html]

Maximum likelihood estimation
- The likelihood function is the product of Pr(y=1) = F(x'β) for all observations where y=1, and Pr(y=0) = 1 - F(x'β) for all observations where y=0 (think of the probability of flipping exactly four heads and two tails, with six coins)
- Writing S for the set of observations with y=1 and w_j for the weights, the log likelihood is

    ln L = Σ_{j∈S} w_j ln F(x_j'β) + Σ_{j∉S} w_j ln[1 - F(x_j'β)]

- Estimated using an iterative procedure:
  - STATA chooses starting values for the β's
  - Computes the slope of the likelihood function at these values
  - Adjusts the β's accordingly
  - Stops when the slope of the likelihood function is ≈ 0
  - Can take time!

Let's look at whether a person works:

. tab jbstat, m

Out of 37,552 person-wave observations on current economic activity: self-employed 2,204 (5.87%); employed 14,702 (39.15%); unemployed 1,120 (2.98%); retired 4,726 (12.59%); maternity leave 320 (0.85%); family care 1,964 (5.23%); ft student/school 1,394 (3.71%); lt sick/disabled 1,057 (2.81%); gvt training scheme 67 (0.18%); other 124 (0.33%); missing (.) 9,793 (26.08%); plus 81 observations with missing/wild or not-answered codes.

. gen byte work = (jbstat == 1 | jbstat == 2) if jbstat >= 1 & jbstat != .
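The log-likelihood sum that the iterative procedure maximises can be sketched in a few lines. A toy illustration (this is the objective function only, not the estimation routine):

```python
import math

def bernoulli_loglik(y, p):
    """Sum ln(p_i) over observations with y_i = 1 and ln(1 - p_i) over y_i = 0."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))

# Four "heads" and two "tails", each with fitted probability 2/3:
ll = bernoulli_loglik([1, 1, 1, 1, 0, 0], [2 / 3] * 6)
print(round(ll, 4))   # -3.8191
```

In a logit, p_i would be Λ(x_i'β), and the iterations adjust β to make this sum as large as possible.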
Logit regression: whether the person has a job (all the iterations are shown).

. logit work female age age2 badhealth partner ed_sec ed_deg nkids if age >= 22 & age <= 60, cluster(pid)

    Iteration 0: log pseudolikelihood = -9174.0313
    Iteration 1: log pseudolikelihood = -7909.9067
    Iteration 2: log pseudolikelihood = -7838.4288
    Iteration 3: log pseudolikelihood = -7838.2372
    Iteration 4: log pseudolikelihood = -7838.2372

    Logistic regression                     Number of obs =  17268
                                            Wald chi2(8)  =  613.59
                                            Prob > chi2   =  0.0000
    Log pseudolikelihood = -7838.2372       Pseudo R2     =  0.1456
    (Std. Err. adjusted for 2430 clusters in pid)

        work |      Coef.   Robust Std. Err.     z    P>|z|    [95% Conf. Interval]
      female |  -.8001156     .090802    -8.81   0.000    -.9780842    -.622147
         age |   .3578242    .0282831    12.65   0.000     .3023904    .4132581
        age2 |  -.0046546    .0003504   -13.28   0.000    -.0053414   -.0039679
   badhealth |  -.5213826    .0404068   -12.90   0.000    -.6005785   -.4421868
     partner |   .4681257    .0943383     4.96   0.000     .2832261    .6530253
      ed_sec |    .602653    .0900282     6.69   0.000     .4262009    .7791051
      ed_deg |   .8734892    .1468462     5.95   0.000      .585676    1.161302
       nkids |   -.477714    .0391116   -12.21   0.000    -.5543714   -.4010567
       _cons |  -3.666352    .5216913    -7.03   0.000    -4.688848   -2.643856

Annotations on the output:
- The model chi-squared statistic is 2 × (LL of this model - LL of the null model)
- Pseudo R2: interpret like R-squared, but it is computed differently
- From these coefficients we can tell whether the estimated effects are positive or negative, and whether they are significant, and something about effect sizes; but it is difficult to draw inferences about magnitudes from the raw coefficients

Comparing logit and probit

                 Logit    Probit   Probit * 1.6
    female      -0.800    -0.455     -0.728
    age          0.358     0.206      0.330
    age2        -0.005    -0.003     -0.004
    badhealth   -0.521    -0.300     -0.479
    partner      0.468     0.284      0.455
    ed_sec       0.603     0.343      0.548
    ed_deg       0.873     0.476      0.762
    nkids       -0.478    -0.275     -0.441
    _cons       -3.666    -2.112     -3.380

- Scaling factor proposed by Amemiya (1981): multiply probit coefficients by 1.6 to get an approximation to the logit coefficients
- Other authors have suggested a factor of 1.8
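Two arithmetic checks on the estimates above (a sketch; the numbers are copied from the logit output): McFadden's pseudo R² is 1 − lnL_full/lnL_null, and the 1.6 scaling works because the logistic CDF is close to a normal CDF with standard deviation about 1.6.

```python
import math

# Pseudo R2 from the iteration log: the null LL is iteration 0, the final LL iteration 4.
ll_null, ll_full = -9174.0313, -7838.2372
pseudo_r2 = 1 - ll_full / ll_null
print(round(pseudo_r2, 4))        # 0.1456, as reported

# Why probit coefficients times ~1.6 approximate logit coefficients:
logistic_cdf = lambda t: 1.0 / (1.0 + math.exp(-t))
normal_cdf = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
gap = max(abs(logistic_cdf(t) - normal_cdf(t / 1.6))
          for t in (i / 10 - 4 for i in range(81)))
print(gap < 0.02)                 # True: the two curves never differ by more than ~0.02
```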
Marginal effects
- After logit or probit estimation, type "mfx" into the command line
- Calculates the marginal effect of each RHS variable on the dependent variable:
  - the slope of the probability function, for continuous variables
  - the effect of a change from 0 to 1, for a dummy variable
- Can also calculate elasticities
- By default, mfx calculates marginal effects at the means of the explanatory variables; it can also calculate them at the medians, or at specified points

. mfx

    Marginal effects after logit
      y = Pr(work) (predict) = .81734048

    variable |     dy/dx   Std. Err.      z    P>|z|   [  95% C.I.  ]          X
     female* | -.1182405      .013     -9.09   0.000  -.143723 -.092758   .525712
         age |  .0534214      .004     13.35   0.000   .045578  .061265   39.8687
        age2 | -.0006949     .00005   -13.90   0.000  -.000793 -.000597   1705.71
    badhea~h | -.0778398     .00633   -12.29   0.000  -.090249 -.065431   2.12746
    partner* |  .0755794     .01644     4.60   0.000   .043352  .107806   .770848
     ed_sec* |  .0866619     .01255     6.91   0.000   .062063  .111261   .398077
     ed_deg* |   .105861     .01407     7.52   0.000   .078281  .133441   .134063
       nkids | -.0713203      .0059   -12.08   0.000  -.082891 -.059749   .732221

    (*) dy/dx is for discrete change of dummy variable from 0 to 1

Marginal effects compared:

                 Logit    Probit      OLS
    female*     -0.118    -0.122   -0.114
    age          0.053     0.056    0.057
    age2        -0.001    -0.001   -0.001
    badhea~h    -0.078    -0.081   -0.086
    partner*     0.076     0.082    0.075
    ed_sec*      0.087     0.090    0.094
    ed_deg*      0.106     0.109    0.118
    nkids       -0.071    -0.075   -0.077
    Constant                       -0.045

- Logit and probit marginal effects are very similar indeed
- OLS (the linear probability model) is actually not too bad

Odds ratios
- Only an option with logit
- Type "or" after the comma, as an option
- Reports odds ratios: that is, how many times more (or less) likely the outcome becomes
  - if the variable is 1 rather than 0, in the case of a dichotomous variable
  - for each unit increase of the variable, in the case of a continuous variable
- Results > 1 show an increased probability, results < 1 a decrease

. logit work female age age2 badhealth partner ed_sec ed_deg nkids if age >= 22 & age <= 60, cluster(pid) or

    (Iteration log identical to the previous logit: log pseudolikelihood from -9174.0313 to -7838.2372)

    Logistic regression                     Number of obs =  17268
                                            Wald chi2(8)  =  613.59
                                            Prob > chi2   =  0.0000
    Log pseudolikelihood = -7838.2372       Pseudo R2     =  0.1456
    (Std. Err. adjusted for 2430 clusters in pid)

        work | Odds Ratio   Robust Std. Err.     z    P>|z|    [95% Conf. Interval]
      female |    .449277    .0407952    -8.81   0.000     .3760308    .5367907
         age |   1.430214    .0404509    12.65   0.000     1.353089    1.511735
        age2 |   .9953562    .0003488   -13.28   0.000     .9946728      .99604
   badhealth |   .5936991    .0239895   -12.90   0.000     .5484942    .6426296
     partner |   1.596998     .150658     4.96   0.000     1.327405    1.921345
      ed_sec |   1.826959    .1644779     6.69   0.000     1.531428    2.179521
      ed_deg |   2.395254    .3517339     5.95   0.000     1.796205    3.194091
       nkids |   .6201995     .024257   -12.21   0.000     .5744333    .6696121
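Both kinds of post-estimation output can be reproduced by hand from the logit coefficients (a sketch; the numbers are copied from the outputs above). For a continuous variable, the marginal effect at the means is the logistic density p(1−p) times the coefficient; an odds ratio is simply exp(coefficient).

```python
import math

# Marginal effect of a continuous variable at the means: p*(1-p) times the
# coefficient, where mfx reported p = Pr(work) = .81734048 at the means.
p_hat = 0.81734048
b_age = 0.3578242                 # logit coefficient on age
dydx_age = p_hat * (1 - p_hat) * b_age
print(round(dydx_age, 4))         # 0.0534, matching the mfx output

# An odds ratio is exp of the logit coefficient, eg for female:
b_female = -0.8001156
or_female = math.exp(b_female)
print(round(or_female, 4))        # 0.4493, matching the `, or` output
```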
Other post-estimation commands: the likelihood ratio test, "lrtest"
- Adding an extra variable to the RHS always increases the likelihood. But does it add "enough" to the likelihood?
- The LR test compares L0 and L1 (L_restricted and L_unrestricted) and calculates a chi-squared statistic with d.f. equal to the number of variables you are dropping. The null hypothesis is the restricted specification.
- Only works on nested models, ie where the RHS variables in one model are a subset of the RHS variables in the other.

How to do it:
- Run the full model; type "estimates store NAME"
- Run a smaller model; type "estimates store ANOTHERNAME" ... and so on for as many models as you like
- Type "lrtest NAME ANOTHERNAME"

Be careful:
- Sample sizes must be the same for both models
- This won't happen if the dropped variable is missing for some observations
- Solve the problem by running the biggest model first and restricting the smaller models with e(sample)

LR test - example. Add regional variables, then decide which ones to keep. (Similar, but not identical, to the previous regression examples.)

. do "C:\DOCUME~1\maria\LOCALS~1\Temp\STD03000000.tmp"

. logit work age age2 badhealth partner ed_sec ed_deg nkids r_* if age >= 21 & age <= 60 & wave == 15

    Iteration 0: log likelihood = -548.06325
    Iteration 1: log likelihood = -480.90757
    Iteration 2: log likelihood = -477.04783
    Iteration 3: log likelihood = -477.02974
    Iteration 4: log likelihood = -477.02974

    Logistic regression                     Number of obs =   1066
                                            LR chi2(14)   =  142.07
                                            Prob > chi2   =  0.0000
    Log likelihood = -477.02974             Pseudo R2     =  0.1296

        work |      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
         age |   .3093037    .0599295     5.16   0.000     .1918441    .4267633
        age2 |  -.0039101     .000729    -5.36   0.000    -.0053389   -.0024814
   badhealth |  -.5337105    .0893498    -5.97   0.000     -.708833    -.358588
     partner |    .244436    .1984392     1.23   0.218    -.1444977    .6333698
      ed_sec |   .7737744    .1749884     4.42   0.000     .4308035    1.116745
      ed_deg |   1.356818    .2734631     4.96   0.000     .8208405    1.892796
       nkids |  -.3589658    .0813107    -4.41   0.000    -.5183318   -.1995998
       r_lon |  -.5363941    .3247594    -1.65   0.099    -1.172911    .1001226
       r_mid |  -.3796683    .2807851    -1.35   0.176     -.929997    .1706604
        r_sw |  -.7379424      .34484    -2.14   0.032    -1.413816   -.0620684
        r_nw |  -.6369382    .3140179    -2.03   0.043    -1.252402   -.0214744
       r_nth |  -.6270993    .2940544    -2.13   0.033    -1.203435   -.0507633
       r_wls |  -.4251579    .3862621    -1.10   0.271    -1.182218    .3319019
       r_sco |  -1.183256    .3413128    -3.47   0.001    -1.852216   -.5142949
       _cons |  -3.042685    1.133771    -2.68   0.007    -5.264836   -.8205342

It looks as though Scotland might stay, also possibly SW, NW and N.

. estimates store ALL
. quietly logit work age age2 badhealth partner ed_sec ed_deg nkids if e(sample)
. estimates store DROPREG
. quietly logit work age age2 badhealth partner ed_sec ed_deg nkids r_sco r_sw r_nw r_nth if e(sample)
. estimates store KEEP4
. quietly logit work age age2 badhealth partner ed_sec ed_deg nkids r_sco if e(sample)
. estimates store KEEPSCOT

LR test - example

. lrtest ALL DROPREG
    Likelihood-ratio test                    LR chi2(7)  =  14.19
    (Assumption: DROPREG nested in ALL)      Prob > chi2 =  0.0479   -> REJECT nested specification

. lrtest ALL KEEP4
    Likelihood-ratio test                    LR chi2(3)  =   3.34
    (Assumption: KEEP4 nested in ALL)        Prob > chi2 =  0.3422   -> DON'T REJECT nested spec

. lrtest ALL KEEPSCOT
    Likelihood-ratio test                    LR chi2(6)  =   7.60
    (Assumption: KEEPSCOT nested in ALL)     Prob > chi2 =  0.2689

. lrtest KEEP4 KEEPSCOT
    Likelihood-ratio test                    LR chi2(3)  =   4.26
    (Assumption: KEEPSCOT nested in KEEP4)   Prob > chi2 =  0.2347

. lrtest KEEPSCOT DROPREG
    Likelihood-ratio test                    LR chi2(1)  =   6.59
    (Assumption: DROPREG nested in KEEPSCOT) Prob > chi2 =  0.0102

Conclusions:
- Reject dropping all regional variables, against keeping the full set
- Don't reject dropping all but 4, over keeping the full set
- Don't reject dropping all but Scotland, over keeping the full set
- Don't reject dropping all but Scotland, over dropping all but 4
- [And just to check: DO reject dropping all regional variables, against dropping all but Scotland]

Again, this specification is illustrative only. This is not an example of a "finished" labour supply model! How could one improve the model?
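The reject/don't-reject calls above come from comparing each statistic with a chi-squared critical value whose degrees of freedom equal the number of dropped variables. A sketch for the ALL vs DROPREG test (14.07 is the standard chi-squared(7) 5% critical value; the restricted log likelihood is implied, not shown on the slide):

```python
# LR statistic = 2 * (lnL_unrestricted - lnL_restricted), here dropping 7 regional dummies.
ll_all = -477.02974                  # full model, from the output above
lr_stat = 14.19                      # reported LR chi2(7)
ll_dropreg = ll_all - lr_stat / 2    # implied restricted log likelihood

# Reject the restricted model at 5% if the statistic exceeds the critical value.
crit_5pct_df7 = 14.07
print(lr_stat > crit_5pct_df7)   # True: reject dropping all regional variables
```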
Model specification
- Theoretical considerations, empirical considerations
- Parsimony
- Stepwise regression techniques
- Regression diagnostics
- Interpreting results; spotting "unreasonable" results

Other models
Other models to be aware of, but not covered on this course:
- Ordered models (ologit, oprobit) for ordered outcomes
  - Eg, levels of education; number of children; excellent, good, fair or poor health
- Multinomial models (mlogit, mprobit) for multiple outcomes with no obvious ordering
  - Eg, working in the public, private or voluntary sector; choice of nursery, childminder or playgroup for pre-school care
- Heckman selection model, for modelling two-stage procedures
  - Eg, earnings, conditional on having a job at all: having a job is modelled as a probit, earnings are modelled as OLS
  - Used particularly for women's earnings
- Tobit model for censored or truncated data; typically, for data where there are lots of zeros
  - Eg, expenditure on rarely-purchased items such as cars; children's weights, in an experiment where the scales broke and gave a minimum reading of 10kg

Competence in STATA
You will get the best results from this course if you already know how to use STATA competently. Check you know how to:
- Get data into STATA (use and using commands)
- Manipulate data (merge, append, rename, drop, save)
- Describe your data (describe, tabulate, table)
- Create new variables (gen, egen)
- Work with subsets of data (if, in, by)
- Do basic regressions (regress, logit, probit)
- Run sessions interactively and in batch mode
- Organise your datasets and do-files so you can find them again
If you can't do these, upgrade your knowledge ASAP!
You could enroll in STATA NetCourse 101:
- Costs $110; the ESRC might pay
- Courses run regularly: www.stata.com

SC968 Panel data methods for sociologists
Lecture 1, part 2: Introducing Longitudinal Data

Overview
- Cross-sectional and longitudinal data
- Types of longitudinal data
- Types of analysis possible with panel data
- Data management: merging, appending, long and wide forms
- Simple models using longitudinal data

Cross-sectional and longitudinal data
First, draw the distinction between macro- and micro-level data:
- Micro level: firms, individuals
- Macro level: local authorities, travel-to-work areas, countries, commodity prices
Both may exist in cross-sectional or longitudinal forms. We are interested in micro-level data, but macro-level variables are often used in conjunction with micro-data.

Cross-sectional data
- Contains information collected at a given point in time (more strictly, during a given time window)
- Eg, the Workplace Industrial Relations Survey (WIRS), the General Household Survey (GHS)
- Many cross-sectional surveys are repeated annually, but on different individuals

Longitudinal data
- Contains repeated observations on the same subjects

Types of longitudinal data
- Time-series data: repeated observations at regular intervals; eg, commodity prices, exchange rates
- "Panel" surveys: repeated interviews, usually at annual intervals, sometimes two-yearly; eg, BHPS, ECHP, PSID, SOEP. Panels are more expensive to collect.
- UK cohort studies: repeated interviews at irregular intervals; NCDS (1958), BCS70 (1970), MCS (2000)
- Some surveys have both cross-sectional and panel elements: the LFS and EU-SILC both have a "rolling panel" element

Other sources of longitudinal data
- Retrospective data (eg, work or relationship history)
- Linkage with external data (eg, tax or benefit records), particularly in Scandinavia
- These may be present in both cross-sectional and longitudinal data sets

Analysis with longitudinal data
- The "snapshot" versus the "movie"
- Essentially, longitudinal data allow us to observe how events evolve
- Study "flows" as well as "stocks"

Example: unemployment
- Cross-sectional analysis shows a steady 5% unemployment rate
- Does this mean that everyone is unemployed one year out of five? That 5% of people are unemployed all the time? Or something in between?
- Very different implications for equality, social policy, etc

The BHPS
- Interviews about 10,000 adults in about 6,000 households
- Interviews repeated annually; people are followed when they move
- People join the sample if they move in with a sample member
- Household-level information is collected from the "head of household"
- Individual-level information is collected from people aged 17+
- Young people aged 11-16 fill in a youth questionnaire
- The BHPS is being upgraded to Understanding Society, a much larger and wider-ranging survey; the BHPS sample is being retained as part of the US sample
- The data set used for this course is a 20% sample of the BHPS, with selected variables

The BHPS
Several files each year, containing different information:
- hhsamp: information on sample households
- hhresp: household-level information on households that actually responded
- indall: info on all individuals in responding households
- indresp: info on respondents to the main questionnaire (adults)
- egoalt: file showing the relationship of household members to one another
- income: incomes
All files are prefixed with a letter indicating the year, and all variables within each file are also prefixed with this letter: 1991: a; 1992: b...
and so on, so far up to p Work histories, net income files And others with occasional modules, eg life histories in wave 2 bjobhist blifemst bmarriag bcohabit bchildnt Some BHPS files 768.1k 10.7M 1626.3k 330.6k 1066.4k 541.3k 303.8k aindall.dta aindresp.dta ahhresp.dta ahhsamp.dta aincome.dta aegoalt.dta ajobhist.dta 635.3k 978.2k 11.0M 1499.7k 257.1k 1073.0k 546.5k 237.8k bindsamp.dta bindall.dta bindresp.dta bhhresp.dta bhhsamp.dta bincome.dta begoalt.dta bjobhist.dta 23.5k 284.4k 34.3k 766.4k 272.4k bchildad.dta bchildnt.dta bcohabit.dta blifemst.dta bmarriag.dta 624.3k 975.6k 11.0M 1539.0k 287.4k 1008.9k 542.2k 237.8k 1675.0k Extra modules in Wave 2 cindsamp.dta cindall.dta cindresp.dta chhresp.dta chhsamp.dta cincome.dta cegoalt.dta cjobhist.dta clifejob.dta 616.7k 943.7k 11.2M 1508.9k 301.9k 1019.7k 531.8k 245.0k 129.0k dindsamp.dta dindall.dta dindresp.dta dhhresp.dta dhhsamp.dta dincome.dta degoalt.dta djobhist.dta dyouth.dta 4977.3k 1027.7k xwaveid.dta xwlsten.dta Following sample members Youth module introduced 1994 Cross-wave identifiers Person and household identifiers BHPS (along with other panels such as ECHP, SOEP, ECHP) is a household survey – so everyone living in sample households becomes a member Need identifiers to 1. 2. Associate the same individual with him- or herself in different waves Link members of same household with each other in the same wave - the HID identifier Note: no such thing as a longitudinal household! Household composition changes, household location changes….. HID is a cross-sectional concept only! What it looks like: 4 waves of data, sorted by pid and wave. . list pid wave hgsex age jbstat mastat in 1/30, clean 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 
pid 10019057 10019057 10019057 10019057 10028005 10028005 10028005 10028005 10042571 10042571 10042571 10051538 10051538 10051538 10051538 10051562 10051562 10051562 10051562 10059377 10059377 10059377 10059377 10064966 10064966 10064966 10064966 10076166 10076166 10076166 wave 1 2 3 4 1 2 3 4 1 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 hgsex female female female female male male male male male male male female female female female female female female female female female female female male male male male female female female age 59 60 61 62 30 31 32 33 59 60 62 22 23 24 25 4 5 6 7 46 47 48 49 70 70 71 72 77 78 79 jbstat retired retired retired retired employed employed employed employed unemploy lt sick, retired unemploy family c unemploy family c . . . . employed employed employed self-emp retired retired retired retired retired retired retired mastat never ma never ma never ma never ma never ma never ma never ma never ma never ma never ma never ma never ma never ma never ma never ma . . . . never ma never ma never ma never ma widowed widowed widowed widowed widowed widowed widowed Observations in rows, variables in columns. Blue stripes show where one individual ends & another begins Not present at 2nd wave A child, so no data on job or marital status (Can also use ,nol option) . list pid wave hgsex age jbstat mastat in 1/30, clean nol 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 
pid 10019057 10019057 10019057 10019057 10028005 10028005 10028005 10028005 10042571 10042571 10042571 10051538 10051538 10051538 10051538 10051562 10051562 10051562 10051562 10059377 10059377 10059377 10059377 10064966 10064966 10064966 10064966 10076166 10076166 10076166 wave 1 2 3 4 1 2 3 4 1 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 hgsex 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 age 59 60 61 62 30 31 32 33 59 60 62 22 23 24 25 4 5 6 7 46 47 48 49 70 70 71 72 77 78 79 jbstat 4 4 4 4 2 2 2 2 3 8 4 3 6 3 6 . . . . 2 2 2 1 4 4 4 4 4 4 4 mastat 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 . . . . 6 6 6 6 3 3 3 3 3 3 3 Joining data sets together 1. 1. 2. 2. 3. 3. 4. 4. 5. 5. 6. 6. 7. 7. 8. 8. 9. 9. 10. 10. 11. 11. 12. 12. 13. 13. 14. 14. 15. 15. 16. 16. 17. 17. 18. 18. 19. 19. 20. 20. 21. 21. 22. 22. 23. 23. 24. 24. 25. 25. 26. 26. 27. 27. 28. 28. 29. 29. 30. 30. 31. 31. 32. 32. 33. 33. 34. 34. 35. 35. 36. 36. 37. 37. 38. 38. 39. 39. 40. 40. 41. 41. 42. 42. 43. 43. 44. 44. 45. 45. 46. 46. 47. 47. 48. 48. 49. 49. 50. 50. 
[listing of the first 50 rows after joining data sets: the original individuals plus further pids (10081763, 10081798, 10091831, 10091866, 10091904), now carrying additional variables hlghq1 (GHQ score) and hlstat (self-rated health), shown as missing or "proxy respondent" where not collected]
Whether appending or merging
• The data set you are using at the time is called the "master" data
• The data set you want to merge it with is called the "using" data
• Make sure you can identify observations properly beforehand
• Make sure you can identify observations uniquely afterwards

Adding extra observations: the "append" command
• Use this command to add more observations
• Relatively easy
• Check first that you are really adding observations you don't already have (or, if you are adding duplicates, that you really want to do this)
• Syntax: append using using_data
• STATA simply sticks the "using" data on the end of the "master" data, re-ordering the variables if necessary
• If the using data contain variables not present in the master data, STATA sets the values of these variables to missing in the using data (and vice versa if the master data contain variables not present in the using data)

Adding extra variables: the "merge" command
• Merging is more complicated
• Use "merge" to add more variables to a data set

Master data: age.dta        Using data: sex.dta
pid    wave  age             pid    wave  sex
28005  1     30              19057  1     female
19057  1     59              19057  3     female
28005  2     31              28005  1     male
19057  3     61              28005  2     male
19057  4     62              28005  4     male
28005  4     33              42571  1     male
                             42571  3     male

First, make sure both data sets are sorted the same way:

use sex.dta
sort pid wave
save, replace
use age.dta
sort pid wave

After sorting, the master data age.dta are:

pid    wave  age
19057  1     59
19057  3     61
19057  4     62
28005  1     30
28005  2     31
28005  4     33

Notice that the two data sets don't contain the same observations.

• merge 1:1 pid wave using sex
(the 1:1, new in STATA this year, shows you are expecting one "using" observation for each "master" observation)
The result:

pid    wave  age  sex      _merge
19057  1     59   female   3
19057  3     61   female   3
19057  4     62   .        1
28005  1     30   male     3
28005  2     31   male     3
28005  4     33   male     3
42571  1     .    male     2
42571  3     .    male     2

STATA creates a variable called _merge after merging:
• 1: observation in master but not using data
• 2: observation in using but not master data
• 3: observation in both data sets
Options are available for discarding some observations (see the manual).

More on merging
• The previous example showed one-to-one merging: not every observation was in both data sets, but every observation in the master data was matched with at most one observation in the using data, and vice versa.
• Many-to-one merging:
  - Household-level data sets contain only one observation per household (usually <1 per person)
  - Regional data (eg, regional unemployment data), usually one observation per region
  - Sample syntax: merge m:1 hid wave using hhinc_data

Person-level master data:       Household-level using data:
hid    pid    age                hid    h/h income
1604   19057  59                 1604   780
2341   28005  30                 2341   1501
3569   42571  59                 3569   268
4301   51538  22                 4301   394
4301   51562   4                 4956   1601
4956   59377  46                 5421   225
5421   64966  70                 6363   411
6363   76166  77                 6827   743
6827   81763  71
6827   81798  72

After merging, household income is copied to every person in the household:

hid    pid    age  h/h income
1604   19057  59   780
2341   28005  30   1501
3569   42571  59   268
4301   51538  22   394
4301   51562   4   394
4956   59377  46   1601
5421   64966  70   225
6363   76166  77   411
6827   81763  71   743
6827   81798  72   743

• One-to-many merging:
  - Job and relationship files contain one observation per episode (potentially >1 per person)
  - Income files contain one observation per source of income (potentially >1 per person)
  - Sample syntax: merge 1:m pid wave using births_data
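For readers who also work in Python, the same append and merge logic can be sketched in pandas (not part of this course; the mini data frames below are made-up versions of age.dta and sex.dta, and pandas' `indicator` option plays the role of STATA's _merge variable):

```python
import pandas as pd

# Hypothetical mini versions of age.dta and sex.dta
age = pd.DataFrame({"pid":  [19057, 19057, 19057, 28005, 28005, 28005],
                    "wave": [1, 3, 4, 1, 2, 4],
                    "age":  [59, 61, 62, 30, 31, 33]})
sex = pd.DataFrame({"pid":  [19057, 19057, 28005, 28005, 28005, 42571, 42571],
                    "wave": [1, 3, 1, 2, 4, 1, 3],
                    "sex":  ["female", "female", "male", "male", "male", "male", "male"]})

# "append using" ~ pd.concat: stack observations; unmatched columns become NaN
stacked = pd.concat([age, sex], ignore_index=True)

# "merge 1:1 pid wave using sex" ~ pd.merge with validate="1:1";
# indicator=True adds a "_merge" column (left_only / right_only / both),
# analogous to STATA's _merge codes 1 / 2 / 3
merged = pd.merge(age, sex, on=["pid", "wave"], how="outer",
                  validate="1:1", indicator=True)
```

As in STATA, checking the `_merge` column after every merge is the quickest way to spot observations that failed to match.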
Long and wide forms

The data we have here are in "long" form: one row for each person/wave combination (as in the listing a few slides back).

Wide form
• However, it's also possible to put longitudinal data into "wide" form: one observation per person, with different variables relating to different years of data
• Sex doesn't change [usually], so one variable suffices; age1 is age at wave 1, and so on

     pid        sex1     age1  age2  age3  age4
1.   10019057   female   59    60    61    62
2.   10028005   male     30    31    32    33
3.   10042571   male     59    .     60    62
4.   10051538   female   22    23    24    25
5.   10051562   .         4     5     6     7
6.   10059377   female   46    47    48    49
7.   10064966   male     70    70    71    72
8.   10076166   female   77    78    79    79

The reshape command
• Switching from long to wide: reshape wide [stubnames], i(id) j(year)
  In the BHPS, this becomes: reshape wide [stubnames], i(pid) j(wave)
• What are stub names? A list of the variables which vary between years; variables like sex or ethnicity would not normally be included in this list
• Switching from wide to long is exactly the opposite: reshape long [stubnames], i(pid) j(wave)
• Lots more info and examples in the STATA manual

Simple models using longitudinal data
• Auto-regressive and time-lagged models
• Models of change

But first: the GHQ
• We use this for lots of analysis in the lectures and practical sessions
• General Health Questionnaire: there are different versions; the BHPS carries the GHQ-12, with 12 questions.
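The long-to-wide and wide-to-long reshapes have direct pandas analogues, `pivot` and `pd.wide_to_long` (a sketch with made-up data; pandas is not used in this course):

```python
import pandas as pd

# Long form: one row per person/wave (hypothetical mini data set)
long = pd.DataFrame({"pid":  [19057, 19057, 28005, 28005],
                     "wave": [1, 2, 1, 2],
                     "age":  [59, 60, 30, 31]})

# "reshape wide age, i(pid) j(wave)" ~ pivot: age becomes age1, age2, ...
wide = long.pivot(index="pid", columns="wave", values="age")
wide.columns = [f"age{w}" for w in wide.columns]
wide = wide.reset_index()

# "reshape long age, i(pid) j(wave)" ~ pd.wide_to_long, back again
back = (pd.wide_to_long(wide, stubnames="age", i="pid", j="wave")
          .reset_index())
```

Here "age" is the stub name, exactly as in STATA's [stubnames] list.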
Have you recently:
• been able to concentrate on whatever you're doing?
• lost much sleep over worry?
• felt that you were playing a useful part in things?
• felt capable of making decisions about things?
• felt constantly under strain?
• felt you couldn't overcome your difficulties?
• been able to enjoy your normal day-to-day activities?
• been able to face up to problems?
• been feeling unhappy or depressed?
• been losing confidence in yourself?
• been thinking of yourself as a worthless person?
• been feeling reasonably happy, all things considered?

Each question is answered on a 4-point scale:
not at all / no more than usual / rather more / much more

GHQ (ghq) 1: Likert (HLGHQ1 in the BHPS)

[frequency table of HLGHQ1, the sum of the 12 item scores, range 0-36: of 27,759 observations, 2.10% are "missing or wild" and 4.33% "proxy respondent"; scores are concentrated between 6 and 13 (median 10), with a long upper tail to 36]

• HLGHQ1 in the BHPS: sum of scores, the Likert scale
• We recode the <0 values to missing, and rename the variable LIKERT
• We treat it as a continuous variable
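The Likert-score construction can be sketched in pandas; this is only an illustration, assuming the 12 items are coded 1-4 with negative codes for missing/proxy (the item names ghq1-ghq12 and the coding are our own assumptions, not the BHPS variable names):

```python
import pandas as pd

# Hypothetical respondent-level data: 12 GHQ items, each coded 1-4;
# negative codes stand in for missing / proxy-respondent values
df = pd.DataFrame({f"ghq{k}": [1, 4, -9] for k in range(1, 13)})

items = [f"ghq{k}" for k in range(1, 13)]
# Recode negative values to missing, shift 1-4 down to 0-3, and sum -> 0-36
recoded = df[items].where(df[items] > 0) - 1
# min_count=12 makes the total missing if any single item is missing
df["LIKERT"] = recoded.sum(axis=1, min_count=12)
```

A respondent answering "not at all" throughout scores 0; "much more" throughout scores 36; any missing item makes the whole score missing, mirroring the recode-to-missing step above.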
GHQ (ghq) 2: caseness (HLGHQ2 in the BHPS)

[frequency table of HLGHQ2, range 0-12: of 27,759 observations, 2.10% "missing or wild" and 4.33% "proxy respondent"; 47.63% score 0, 12.86% score 1, and the percentages fall away steadily to 1.38% scoring 12]

• HLGHQ2: the caseness scale; recodes answers 3-4 on each item as 1, and adds up
• Scores above 2 are used to indicate psychological morbidity

Time-lagged models

Start with a simple OLS model. The Likert score is a measure of psychological wellbeing derived from a battery of questions.

. reg LIKERT age age2 female ue_sick partner if age >= 18

              Coef.      Std. Err.      t     P>|t|
age          .0797637    .0111716     7.14   0.000
age2        -.0007342    .0001119    -6.56   0.000
female       1.593608    .0661958    24.07   0.000
ue_sick      3.562843    .1249977    28.50   0.000
partner     -.044241     .0788756    -0.56   0.575
_cons        8.298458    .2374816    34.94   0.000

Number of obs = 25108; F(5, 25102) = 278.23; Prob > F = 0.0000
R-squared = 0.0525; Adj R-squared = 0.0523; Root MSE = 5.2156

Generate the lagged variable:

. sort pid wave
. capture drop LIKERT_lag
. gen LIKERT_lag = LIKERT[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
(14738 missing values generated)

. * check:
. list pid wave LIKERT LIKERT_lag in 1/30, clean

[listing: within each pid, LIKERT_lag equals the previous wave's LIKERT; it is missing at each person's first wave and wherever the previous wave's value is missing]
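The lagged-variable step has a direct pandas analogue, groupby().shift(); this sketch with made-up data also reproduces the wave == wave[_n-1] + 1 guard against gaps between waves (pandas shown only for comparison, not part of the course):

```python
import pandas as pd

# Hypothetical person-wave data, sorted by pid and wave; pid 1 skips wave 3
df = pd.DataFrame({"pid":    [1, 1, 1, 2, 2],
                   "wave":   [1, 2, 4, 1, 2],
                   "LIKERT": [7, 12, 10, 8, 9]})
df = df.sort_values(["pid", "wave"])

# Lag within person: shift LIKERT down one row inside each pid...
df["LIKERT_lag"] = df.groupby("pid")["LIKERT"].shift(1)

# ...but, like the wave == wave[_n-1] + 1 condition, only keep the lag
# when the previous row really is the immediately preceding wave
gap = df["wave"] - df.groupby("pid")["wave"].shift(1)
df.loc[gap != 1, "LIKERT_lag"] = float("nan")
```

The lag is missing at each person's first row and across the wave-3 gap, exactly as in the STATA check listing.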
NB: the 1/30 here is just so the listing will fit on the page. You should check many more observations than this!

OLS, with lagged dependent variable:

. reg LIKERT LIKERT_lag age age2 female ue_sick partner if age >= 18

              Coef.      Std. Err.      t     P>|t|
LIKERT_lag   .4752892    .0060424    78.66   0.000
age          .0272471    .0108394     2.51   0.012
age2        -.0002391    .0001079    -2.22   0.027
female       .8414271    .0638746    13.17   0.000
ue_sick      2.128451    .1222784    17.41   0.000
partner      .0967488    .0759926     1.27   0.203
_cons        4.593374    .2365749    19.42   0.000

Number of obs = 21463; F(6, 21456) = 1285.40; Prob > F = 0.0000
R-squared = 0.2644; Adj R-squared = 0.2642; Root MSE = 4.5987

• R-squared rockets from 5% to 26%
• Big and very significant coefficient on the lagged variable
• Coefficient on "ue_sick" falls from 3.6 to 2.1
• Also possible to include lagged explanatory variables

Models of change

Start with the OLS model [simplified, but imagine more variables]:
    y_i = α + β x_i + ... + ε_i
Write a separate model for each year (the second suffix denotes the year):
    y_i1 = α + β x_i1 + ... + ε_i1
    y_i2 = α + β x_i2 + ... + ε_i2
Subtract the 1st from the 2nd model (the intercept drops out):
    (y_i2 − y_i1) = β (x_i2 − x_i1) + ... + (ε_i2 − ε_i1)
Or, expressed in terms of changes:
    Δy_i = β Δx_i + ... + Δε_i

Generate the difference variables:

capture drop dif*
sort pid wave
gen dif_LIKERT  = LIKERT  - LIKERT[_n-1]  if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_age     = age     - age[_n-1]     if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_age2    = age2    - age2[_n-1]    if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_female  = female  - female[_n-1]  if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_ue_sick = ue_sick - ue_sick[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_partner = partner - partner[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1

Check you understand why dif_female will [very nearly] always be zero.

Check for sensible results!

. list pid wave age dif_age in 1/30, clean

[listing: dif_age is missing at each person's first wave and is usually 1, but is sometimes 0 or 2 because of interview timing]

More checking:

. list pid wave LIKERT dif_LIKERT in 1/30, clean

[listing: dif_LIKERT equals the change in LIKERT from the previous wave, and is missing at each person's first wave and wherever either wave's LIKERT is missing]
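The dif_* construction above can be mirrored in pandas with groupby().diff() (a sketch with made-up data; with gaps between waves you would also apply the wave == wave[_n-1] + 1 check, as in the lagged-variable sketch):

```python
import pandas as pd

# Hypothetical person-wave data, sorted by pid and wave, no wave gaps
df = pd.DataFrame({"pid":     [1, 1, 1, 2, 2],
                   "wave":    [1, 2, 3, 1, 2],
                   "LIKERT":  [7, 12, 10, 8, 9],
                   "ue_sick": [0, 1, 1, 0, 0]})
df = df.sort_values(["pid", "wave"])

# First differences within person, as in gen dif_X = X - X[_n-1]
for var in ["LIKERT", "ue_sick"]:
    df[f"dif_{var}"] = df.groupby("pid")[var].diff()
```

Each person's first wave gets a missing difference, just as the if-condition in the STATA code leaves the first observation per pid missing.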
Obvious problems
• Interview timing means dif_age is sometimes 0 or 2 rather than 1: a difference of 100% in the age difference variable
• Most differences are zero
• Moving into unemployment or partnership is given equal and opposite weighting to moving out; no real reason why this should be the case
• There are MUCH better ways to use these data! Nevertheless, let's proceed.

Results

For comparison, the levels regression again:

. reg LIKERT age age2 female ue_sick partner if age >= 18

              Coef.      Std. Err.      t     P>|t|
age          .0797637    .0111716     7.14   0.000
age2        -.0007342    .0001119    -6.56   0.000
female       1.593608    .0661958    24.07   0.000
ue_sick      3.562843    .1249977    28.50   0.000
partner     -.044241     .0788756    -0.56   0.575
_cons        8.298458    .2374816    34.94   0.000

Number of obs = 25108; R-squared = 0.0525; Root MSE = 5.2156

And the differenced regression:

. reg dif_LIKERT dif_age - dif_partner if age >= 18

              Coef.      Std. Err.      t     P>|t|
dif_age      .3757715    .1227943     3.06   0.002
dif_age2    -6.24e-07    .0000212    -0.03   0.977
dif_ue_sick  1.857857    .149281     12.45   0.000
dif_partner -.5948999    .1780776    -3.34   0.001
_cons       -.3104285    .1364378    -2.28   0.023

Number of obs = 21461; F(4, 21456) = 44.58; Prob > F = 0.0000
R-squared = 0.0082; Adj R-squared = 0.0081; Root MSE = 5.3264

• The coefficient on the age increase is roughly equal and opposite to the constant
• Female drops out (its difference is always zero)
• Coefficients on sick and partner remain significant