Multilevel Event History Models with Applications to the Analysis of Recurrent Employment Transitions
Fiona Steele

Outline
• The discrete-time approach
• Multilevel models and examples for:
  – Recurrent events
  – Multiple states
• Handling large datasets
• Examples of other applications
• Estimation/software

Why use discrete-time methods?
• Event times are often measured in discrete time units, e.g. months or years.
• Straightforward to allow and test for non-proportional hazards.
• We can use familiar models for discrete response data. For more complex data structures and processes, we can use existing estimation procedures for multilevel models.

Restructuring data for a discrete-time analysis: Individual-based file
E.g. records for 2 individuals

INDIVIDUAL (j)   DURATION ($t_j$)   EVENT   AGE ($x_j$)
1                5                  0       20
2                3                  1       35

EVENT = 1 if event observed (uncensored)
      = 0 if censored.

Restructuring data for a discrete-time analysis: Person-period file

j   t   $y_{tj}$   $x_j$
1   1   0          20
1   2   0          20
1   3   0          20
1   4   0          20
1   5   0          20
2   1   0          35
2   2   0          35
2   3   1          35

$y_{tj}$ = 1 if event occurs to individual j in interval t
         = 0 if event does not occur in interval t

Discrete-time hazard function
Denote by $p_{tj}$ the probability that individual j has an event during interval t, given no event before the start of t:

$p_{tj} = \Pr(y_{tj} = 1 \mid y_{t-1,j} = y_{t-2,j} = \dots = y_{1j} = 0)$

In the person-period file a record exists for interval t only if individual j has had no earlier event, so for these records the hazard can be written simply as $\Pr(y_{tj} = 1)$.
$p_{tj}$ is a discrete-time approximation to the continuous-time hazard function. Call $p_{tj}$ the discrete-time hazard.

A simple discrete-time logit model
We can fit a logit regression model of the form:

$\operatorname{logit}(p_{tj}) = \alpha^T z_{tj} + \beta^T x_{tj}$

The covariates $x_{tj}$ can be constant over time or time-varying.
$z_{tj}$ is a vector of functions of time (e.g. polynomials or dummy variables) and $\alpha^T z_{tj}$ is the logit of the baseline hazard function.
Other link functions are possible, e.g. complementary log-log (clog-log) or probit.

Recurrent events
• Analyse duration of periods of continuous exposure (episodes), e.g. employment episodes, birth intervals, partnerships
• There may be unobserved individual-specific (i.e.
time-invariant) factors which affect the probability of an event for all of an individual’s episodes – referred to as unobserved heterogeneity or frailty

Hierarchical data structure
Repeated events lead to a two-level hierarchical structure
Level 2: Individuals
Level 1: Episodes

2-level model for recurrent events

$\operatorname{logit}(p_{tij}) = \alpha^T z_{tij} + \beta^T x_{tij} + u_j$

$p_{tij}$ is the probability of an event in time interval t during episode i of individual j
$x_{tij}$ are covariates which might be time-varying or defined at the episode or individual level
$u_j$ is a random effect representing unobserved characteristics of individual j – unobserved heterogeneity or frailty
Assume $u_j \sim N(0, \sigma_u^2)$

Example: women’s employment
• Duration of non-employment spells; event is (re)entry into employment
• Data are a subsample from the British Household Panel Study: 1401 women, 2290 episodes and 15314 person-year records
• Employment, birth and union histories collected retrospectively at wave 2. These were linked to subsequent panel data to form continuous histories
• Focus on effects of duration non-employed and time-varying indicators of number and age of children, but also adjust for age and characteristics of previous job (if any)

Unobserved individual heterogeneity
• Estimated standard deviation of woman-level random effect is 0.65 (se=0.09) – significant variation between women in log-odds of entering employment due to unmeasured time-invariant characteristics
• Failure to account for unobserved heterogeneity (UH) leads to overstatement of negative duration effects and understatement of positive duration effects
• After accounting for UH, effects of time-varying covariates (e.g. duration and number/age children) are subject-specific, i.e.
within-woman effects

Duration effects before and after allowing for unobserved heterogeneity

Duration non-employed (ref is < 1 yr)   Before    After
[1,2) years                             -0.788*   -0.648*
[2,3)                                   -1.133*   -0.927*
[3,4)                                   -1.489*   -1.225*
[4,5)                                   -1.372*   -1.077*
[5,6)                                   -1.256*   -0.930*
[6,7)                                   -1.353*   -1.025*
[7,8)                                   -1.575*   -1.248*
[8,9)                                   -1.688*   -1.350*
9+ years                                -2.130*   -1.765*
* p<0.05

Estimates from multilevel logit model of entry into employment

Child indicator                          Est.      (SE)
Imminent birth (within 1 year)           -0.836*   (0.124)
No. children age <=5 years (ref = 0)
  1 child                                -0.204*   (0.096)
  2                                      -0.356*   (0.142)
No. children age > 5 years (ref = 0)
  1 child                                 0.244*   (0.117)
  2                                       0.428*   (0.115)
* p<0.05

Modelling transitions between multiple states
An individual may pass through various ‘states’, e.g. employment and non-employment.
Suppose there are 2 states, and denote by $p_{stij}$ the probability of a transition from state s.

$\operatorname{logit}(p_{stij}) = \alpha_s^T z_{stij} + \beta_s^T x_{stij} + u_{sj}, \quad s = 1, 2$

where $(u_{1j}, u_{2j})$ ~ bivariate normal
Note: Generalises to multinomial logit for > 2 states

Multiple states: data structure (1)
Start with an episode-based file, e.g.

j   i   $State_{ij}$   $t_{ij}$   $EVENT_{ij}$   $Age_{ij}$
1   1   E              3          1              16
1   2   NE             2          0              19

States are employment (E) and non-employment (NE)
Notes: (i) t in years; (ii) $EVENT_{ij}$ = 1 if uncensored, 0 if censored; (iii) age, in years, at start of episode.
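A minimal sketch of how the episode-based file above could be expanded into one record per year at risk, with state dummies and state-by-age interactions, as in the discrete-time format on the next slide. It uses only the Python standard library; the function name `expand_episodes` and the dictionary keys are illustrative assumptions, not part of any package mentioned in these slides.

```python
def expand_episodes(episodes):
    """Expand an episode-based file into one record per year at risk.

    Each episode is a dict with keys j, i, state ('E' or 'NE'),
    t (duration in years), event (1 = uncensored, 0 = censored) and age.
    These key names are hypothetical, chosen for illustration only.
    """
    records = []
    for ep in episodes:
        e_dummy = 1 if ep["state"] == "E" else 0
        ne_dummy = 1 - e_dummy
        for t in range(1, ep["t"] + 1):
            # The response is 1 only in the last interval of an uncensored episode.
            y = 1 if (t == ep["t"] and ep["event"] == 1) else 0
            records.append({
                "j": ep["j"], "i": ep["i"], "t": t, "y": y,
                "E": e_dummy, "NE": ne_dummy,
                "E_x_age": e_dummy * ep["age"],
                "NE_x_age": ne_dummy * ep["age"],
            })
    return records


# The two episodes from the episode-based file above.
episodes = [
    {"j": 1, "i": 1, "state": "E", "t": 3, "event": 1, "age": 16},
    {"j": 1, "i": 2, "state": "NE", "t": 2, "event": 0, "age": 19},
]

person_period = expand_episodes(episodes)
for row in person_period:
    print(row)
```

The two episodes yield five records: three for the employment episode (with the event in year 3) and two for the censored non-employment episode. Interacting each covariate with the state dummies is what allows its coefficient to differ by state in a single stacked model.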
Multiple states: data structure (2)
Convert to discrete-time format:

t   $y_{tij}$   $E_{ij}$   $NE_{ij}$   $E_{ij} \times Age_{ij}$   $NE_{ij} \times Age_{ij}$
1   0           1          0           16                         0
2   0           1          0           16                         0
3   1           1          0           16                         0
1   0           0          1           0                          19
2   0           0          1           0                          19

$E_{ij}$ dummy for Employment, $NE_{ij}$ dummy for Non-Employment

Example: transitions between employment and non-employment
• corr($u_{1j}$, $u_{2j}$) = 0.58, se = 0.13, so large positive residual correlation between E→NE and NE→E
  – Women with high (low) chance of entering E tend to have a high (low) chance of leaving E
  – Positive correlation arises from two sub-groups: short spells of E and NE, and longer spells of both types
• BUT little impact on estimates for child indicators on (re)entry into employment

Handling large datasets
• Although the discrete-time approach is flexible, a drawback is that the analysis file can be very large. This is a particular problem when we wish to fit complex models with multiple correlated random effects.
• Two possible approaches:
  – Group time intervals
  – More efficient algorithms, e.g. reparameterisation in MCMC estimation (Browne et al. 2009)

Grouped time intervals
Suppose we analyse 6-month rather than monthly intervals.
Need to allow for different lengths of exposure time. In any 6-month interval, some will have the event or be censored after the 1st month while others will be exposed for the full 6 months.
Denote by $n_{tij}$ the exposure time in grouped interval t.
Estimate a binomial logit model with response $y_{tij}$ and denominator $n_{tij}$.
Note: intervals do not need to be the same width.

Example of grouped time intervals
Suppose an individual is observed to have an event during the 17th month, and we wish to group durations into 6-month intervals (t).

j   i   t   $n_{tij}$   $y_{tij}$   $y^*_{tij}$
1   1   1   6           0           0
1   1   2   6           0           0
1   1   3   5           1           0.2

In this table $y^*_{tij} = y_{tij} / n_{tij}$, the response expressed as a proportion of the exposure time.

Implications of aggregation
• Need to assume that hazard function is constant within the grouped intervals.
• Need to fix values of time-varying covariates within intervals, e.g. value at start.
• In practice, aggregation has little impact on the estimated baseline hazard or the effects of episode/individual-level covariates, but the impact on coefficients of time-varying covariates can be substantial.

Examples of other applications
• Hospital admissions: length of stay or duration between admissions
  – Repeated episodes nested within patients if multiple admissions
  – Hospital and GP effects using cross-classified multilevel model (GPs refer to multiple hospitals, and hospitals take patients from multiple GPs)
• Area effects on mortality or fertility
  – Repeated birth intervals (for fertility) for individuals nested within areas

Area effects on mortality: alternative approaches
• As in the employment example, set up a person-period file with multiple records per person, e.g. Kravdal (2006)
• Define a single binary response for each person and include number of years of exposure as an offset in a Poisson regression, e.g. Tarkiainen et al. (2009). Could also treat as binomial response (as for grouped time intervals).
• If the covariates are few and categorical, apply Poisson regression to aggregate data (1 record for each combination of t and covariate values)

Area effects on mortality: Multilevel Poisson modelling of aggregate data (1)
• Suppose we want to estimate the effect of age, sex and area characteristics on individual mortality risk
• Suppose we group age into four 5-year age categories. Then for each area define 8 cells, one for each age-sex combination
• For area j denote by $y_{ij}$ the observed number of deaths for age-sex cell i
• Denote the total population at risk of mortality in cell i of area j by $n_{ij}$, or might use the expected number of deaths $E_{ij}$

Area effects on mortality: Multilevel Poisson modelling of aggregate data (2)
• Analyse ($y_{ij}$, $n_{ij}$) using a 2-level Poisson model
• Define age and sex dummies characterising cells and include these and area-level variables as predictors
• Application to cancer mortality: Langford and Day (2001) – No.
deaths for small areas (i) within regions (j) within EC nations (k). Covariates at regional level
• Application to teenage conception: Diamond et al. (2002) – No. conceptions for age-year cell (i) within electoral wards (j). Deprivation indicators at ward level

Software
• Recurrent events and multiple states. Any software for multilevel binary responses
• Binomial models for grouped intervals. GLLAMM, MLwiN, WinBUGS
• Simultaneous equations models for correlated processes. aML, GLLAMM, MLwiN, Sabre, WinBUGS. aML is the most general (mixed response types at different levels)

References
Browne, W. J., Steele, F., Golalizadeh, M. and Green, M. (2009) The use of simple reparameterisations in MCMC estimation of multilevel models with applications to discrete-time survival models. Journal of the Royal Statistical Society, Series A, 172: 579-598.
Diamond, I., Clements, S., Stone, N. and Ingham, R. (2002) Spatial variation in teenage conceptions in south and west England. Journal of the Royal Statistical Society, Series A, 162: 273-289.
Goldstein, H., Pan, H. and Bynner, J. (2004) A flexible procedure for analysing longitudinal event histories using a multilevel model. Understanding Statistics, 3: 85-99.
Kravdal, Ø. (2006) Does place matter for cancer survival in Norway? A multilevel analysis of the importance of hospital affiliation and municipality socio-economic resources. Health and Place, 12: 527-537.
Langford, I. H. and Day, R. J. (2001) Poisson Regression. In A. H. Leyland and H. Goldstein (eds) Multilevel Modelling of Health Statistics. London: Wiley. Chapter 4.
Steele, F., Goldstein, H. and Browne, W. (2004) A general multistate competing risks model for event history data, with an application to a study of contraceptive use dynamics. Statistical Modelling, 4: 145-159.
Steele, F. (2011) Multilevel discrete-time event history models with applications to the analysis of recurrent employment transitions (with discussion). Australian and New Zealand Journal of Statistics (to appear).
Tarkiainen, L., Martikainen, P., Laaksonen, M. and Leyland, A. H. (2009) Comparing the effects of neighbourhood characteristics on all-cause mortality using two hierarchical areal units in the capital region of Helsinki. Health and Place, 16: 409-412.

See also downloadable materials:
http://www.cmm.bris.ac.uk/MLwiN/tech-support/workshops/materials/models.shtml
http://www.cmm.bris.ac.uk/MLwiN/tech-support/workshops/materials/eha.shtml
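
Supplementary sketch: the grouped time intervals example earlier in these slides (an event in the 17th month, grouped into 6-month intervals) could be reproduced along the following lines. This is only an illustration under stated assumptions: the function name `group_intervals` is hypothetical, all grouped intervals are assumed to have the same width (the slides note that this is not required), and the last column is taken to be the response divided by the exposure time.

```python
def group_intervals(duration, event, width=6):
    """Collapse a duration in fine time units into grouped intervals.

    duration: number of fine units (e.g. months) at risk, counting the
              unit in which the event or censoring occurred.
    event:    1 if the episode ended in an event, 0 if censored.
    width:    number of fine units per grouped interval (assumed equal).
    Returns a list of (t, n, y, y_star) tuples, one per grouped interval.
    """
    rows = []
    remaining = duration
    t = 0
    while remaining > 0:
        t += 1
        n = min(width, remaining)  # exposure time in grouped interval t
        remaining -= n
        # The event can only fall in the last grouped interval.
        y = 1 if (remaining == 0 and event == 1) else 0
        rows.append((t, n, y, y / n))  # y / n is the response as a proportion
    return rows


rows = group_intervals(duration=17, event=1, width=6)
for row in rows:
    print(row)
```

Each tuple supplies the response and denominator for one binomial record, so an episode of 17 monthly records is reduced to 3 grouped records.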