Constructing Data for Event Histories: data formats and introductory analyses

Talk prepared for Workshop 2a: Event History Analysis, of the ESRC seminar series “Analysing Longitudinal Data – Bridging the gap between methodology and sociological research”.

Paul Lambert
Cardiff University School of Social Sciences
lambertp@cardiff.ac.uk
13 June 2003

Files for use with this paper can be downloaded from: http://www.cf.ac.uk/socsi/main/lambertp/downloads.html

1. Data Formats

1.1 The State Space

Event history data is data on the duration of ‘episodes’ within a ‘state space’. Its fundamental components are the identification of state membership, and information on the length of time spent within episodes of each state. Additional structures in event history datasets are all built onto these basic building blocks.

Figure 1: Illustration of episodes within a state space: lifetime work histories for 3 respondents born in 1935
[Figure: one panel per respondent (Person 1, Person 2, Person 3); vertical axis: state space (FT work, PT work, Not in work); horizontal axis: time, 1950 to 2000.]

1.2 Forms of State Spaces

The simplest event history data is ‘single state single episode’: it concerns only the duration spent in one state type, for a single episode per respondent. Classic examples of survival data have this form, for instance the time from manufacture until a metal component breaks down (time to failure).

More complex datasets can involve multiple states and/or multiple episodes per data subject. Such data is more commonly found in social science applications. Figure 1 illustrates a ‘multi-state multi-episode’ data structure, where respondents are distinguished into 3 different worklife states, and several different episodes can describe their full life histories. Table 1 describes a selection of event history data formats, to which we return in section 4.
Table 1: Description of selected event history data formats

1) Single state single episode
   One episode of a certain type recorded for each subject.
   Example: First post-school occupation until it ends (for any reason).

2) Single episode competing risks
   One episode of a certain type recorded for each subject, but with the type of ending event classified into more than one state of interest.
   Example: First marriage duration, comparing whether it ends in separation, divorce, widowhood, etc.

3) Multi-episode single state
   More than one episode recorded for (at least some) subjects, but always of the same type, and ending in the same way.
   Example (relatively rare in social sciences): Prison sentence duration until release.

4) Multi-state multi-episode
   More than one episode in more than one state for subjects.
   Example (very common): Life history in economic activity states from leaving school to retirement.

5) Time varying covariates (to any of above)
   Extension to any of the above structures whereby the values of explanatory covariates are not fixed over time, but change during the course of some episodes.

For many users, one of the harder tasks in undertaking event history analysis lies in making the transition from original data sources to the construction of ‘neat’ data files in some of the formats described above. In the main examples following this talk, we work on pre-prepared neat files of three example data format types. However, associated with those exercises is additional SPSS syntax which illustrates how the extracted files were produced from the two data sources used. More interested users might like to check and replicate that data construction.

1.3 Rectangular data files

Event history data can be stored in two alternative rectangular data file formats which allow for statistical analysis. More common are continuous time datasets (also referred to as ‘spell files’ or ‘event oriented files’).
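The spell-file idea can be sketched in a few lines of code. The snippet below is a minimal illustration in Python (the workshop materials themselves use SPSS), holding one record per episode with illustrative values. The class name `Spell`, the field names, and the helper `time_in_state` are all hypothetical. The rule that a destination state equal to the origin state marks a right-censored episode is an assumption made for this sketch, not something the data sources prescribe.

```python
from collections import defaultdict
from dataclasses import dataclass

# State codes: full-time work, part-time work, not in work.
STATE_LABELS = {1: "FT", 2: "PT", 3: "NW"}


@dataclass(frozen=True)
class Spell:
    """One row of a spell file: a single episode for a single person."""
    person: int
    start: int        # time at which the episode begins
    end: int          # time at which the episode ends (or observation stops)
    origin: int       # state occupied during the episode
    destination: int  # state entered when the episode ends

    @property
    def duration(self) -> int:
        return self.end - self.start

    @property
    def censored(self) -> bool:
        # ASSUMPTION: destination == origin flags a right-censored episode.
        return self.destination == self.origin


# Illustrative values only.
spells = [
    Spell(1, 1, 158, 1, 3),
    Spell(1, 158, 170, 3, 3),
    Spell(2, 1, 22, 3, 1),
    Spell(2, 22, 106, 1, 3),
    Spell(2, 106, 149, 3, 2),
    Spell(2, 149, 170, 2, 2),
    Spell(3, 1, 10, 1, 2),
]


def time_in_state(records):
    """Total observed time for each (person, state label) pair."""
    totals = defaultdict(int)
    for s in records:
        totals[(s.person, STATE_LABELS[s.origin])] += s.duration
    return dict(totals)


if __name__ == "__main__":
    for s in spells:
        status = "censored" if s.censored else "event"
        print(s.person, STATE_LABELS[s.origin], s.duration, status)
```

Because each record carries its own start and end times, durations and censoring indicators can be derived rather than stored, which helps keep the file internally consistent.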
In a continuous time dataset each case represents a single episode, with the information on the nature of the episode and its duration that is central to the analysis.

Table 2: Illustration of a continuous time (multi-state multi-episode) dataset

Case  Person  Start time  End time  Duration  Origin state  Destination state  {Other vars, person/state}
1     1       1           158       157       1 (FT)        3 (NW)
2     1       158         170       12        3 (NW)        3 (NW)
3     2       1           22        21        3 (NW)        1 (FT)
4     2       22          106       84        1 (FT)        3 (NW)
5     2       106         149       43        3 (NW)        2 (PT)
6     2       149         170       21        2 (PT)        2 (PT)
7     3       1           10        9         1 (FT)        2 (PT)
.     .       .           .         .         .             .                  .

Alternatively, discrete time datasets can be used, in which the sequence of events is partitioned into discrete time units. Each case in the data file then represents a time spell for a certain subject within a certain state. Additional variables can indicate person level, state level, and period level characteristics; thus an advantage of discrete time datasets is that they can more easily handle information on time-varying covariates.

The discrete time units can cover whatever duration the available information allows. However, error is introduced if a transition between states does not occur exactly at the boundary between discrete time units; the state recorded for a discrete time unit is usually the main state occupied during that unit. In worklife history analyses, for example, monthly discrete time units are usually thought adequate to capture most state changes reasonably accurately. Annual panel surveys, however, are an example of a yearly discrete time dataset where the longer length of the discrete time period is likely to call into question any assumptions about the duration of spells.

Table 3: Illustration of a discrete time (multi-state multi-episode) dataset

Case  Person  Discrete time  Approx real time  State  End of state  {Other person, state, or time unit level variables}
1     1       1              5                 1 FT   0
2     1       2              20                1 FT   0
3     1       3              35                1 FT   0
4     1       4              50                1 FT   0
5     1       5              65                1 FT   0
6     1       6              80                1 FT   0
7     1       7              95                1 FT   0
8     1       8              110               1 FT   0
9     1       9              125               1 FT   0
10    1       10             140               1 FT   1
11    1       11             155               3 NW   0
12    1       12             170               3 NW   1
13    2       1              5                 3 NW   0
14    2       2              20                3 NW   1
15    2       3              35                1 FT   0
16    2       4              50                1 FT   1
.     .       .              .                 .      .

It is possible to translate continuous time data into a discrete time dataset using the data management facilities of modern statistical packages. The reverse translation, whilst possible, is less satisfactory, because the discrete time format only approximates continuous time.

2. Software for Dealing with Event History Data

As event history data can be expressed in a rectangular data file, it can in principle be read into almost any statistical analysis package. However, the functions involved in conducting an event history analysis are more specific. Historically, fewer statistical packages had functionality for these types of analysis, and specialist packages and programmes were developed. More recently, however, it has become routine for major analysis packages to incorporate event history data functions. Some selective comments are presented in Table 4, though other potentially relevant packages are not covered.

Table 4: Selected software and comments for data handling and statistical analysis of event history data structures

Multi-purpose packages

SPSS      Basic descriptives, Kaplan-Meier, and Cox regression models. Example applications eg in Blossfeld et al (1989).
STATA     Wide range of relevant functions, with extensive user contributions through ‘STB’ systems; see Cleves et al (2002); Rabe-Hesketh and Everitt (2002).
SAS       Wide range of relevant functions. Of the ‘big three’ social statistics packages, both STATA and SAS have many more analytical routines than SPSS does. Some example commands in Allison (1984); Blossfeld et al (1989).
GLIM      Historically widely used in the UK for both social and physical science event history applications. See Aitkin et al (1989: chpt6); Allison (1984).
Splus/R   Difficult language to learn, but one which is used by statisticians for ‘cutting edge’ survival data analysis. See for example Venables and Ripley (1999).
MLwiN     Popular analytical package for dealing with hierarchically nested data structures. Allows extension to event history analysis, eg Goldstein et al (1998).
lEM       Powerful, though pithy, freeware illustrating event history applications through log-linear modelling formats; see Vermunt (1997): http://www.kub.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html

Specialist packages

TDA       Freely available from http://www.stat.ruhr-uni-bochum.de/tda.html ; a simple and intuitive package, thoroughly illustrated in Blossfeld and Rohwer (2002).
SABRE     Extension facility for GLIM, specifically for analysis of binary recurrent events (Barry et al 1990).

3. Data collection

In any discussion of event history analysis, it is important to remember that the data collection methods used to obtain the state-space histories are also distinctive. In the social sciences, the most common technique involves retrospective questioning of a respondent about the relevant parts of their life history. Alternatively, longitudinal panel and cohort studies, though still utilising (short term) retrospective accounts for precise details, can involve repeated contacts over time with the same respondents, and thus collect event history information progressively through the lifecourse.

Such methods of data collection, however, tend to introduce two features which distinguish applications in social science research. Traditionally, statistical methods of event history analysis have been developed with regard to duration data on physical and medical science observations. These often feature very accurate measurements of the timing of clearly defined events, along with analysis in terms of only a limited number of (purportedly accurately measured) other relevant variables.
By contrast, social science applications are characterised by wider scope for inaccuracies of duration measurements (as well as other variable and sampling uncertainties). There also tends to be a greater need to link multiple pieces of information from alternative sources onto the appropriate duration data records (see examples below).

The first potential variable inaccuracy concerns the reliability of retrospective recall data. Certainly a number of techniques have been developed to maximise the accuracy of retrospective records (eg Taris 2000:p8-12). However, methodological studies regularly suggest that respondents’ longitudinal retrospective accounts are subject to high levels of error (eg Dex & McCulloch 1997, Elias 1997, Jacobs 2002, Solga 2001). Moreover, the circumstances of missing data errors and representative sampling are more complex with retrospective data which is, for instance, only available for the sample of survivors (cf Tuma 1994).

Secondly, as highlighted by Tuma (1994), the centrality of a clearly defined state space to an event history analysis often encourages undesirable simplifications in event history data constructions. This occurs at the level of the state space specification, where it is analytically and cognitively easier to keep the state space categories very simple, whereas a better substantive account, and often different subsequent results, might be obtained from more complex specifications. The same issues also arise in the specification of other variables which are to be related to the durations analysed. In particular, the available analytical methods for dealing with time-varying covariates also encourage simpler variable specifications than might otherwise be preferred [1].
[1] A related point, which particularly concerns event history studies of longer durations, is that the most appropriate categorisations of state space levels, or of other time varying covariates, may not be stable over time. For instance, the most attractive occupational categorisation in 2003 may not be the best one to apply to 1963, though both periods may feature in the same study.

In the examples below a series of illustrations of event history data structures are presented from two large scale government funded datasets, the British Household Panel Survey (BHPS) and the Family and Working Lives Survey (FWLS) [2]. The datasets also serve to illustrate examples of data collection, although it might be noted that those event history analyses which are found in the social sciences tend more often to come from specially constructed and limited availability longitudinal records, rather than from such general purpose surveys as the BHPS and FWLS.

FWLS

The FWLS was a one-off cross-sectional survey conducted in 1994, sampling some 11,000 adults (and also their partners). The sample, which included inflated ‘boost’ subsamples from four pre-identified ethnic minority groups, were interviewed cross-sectionally about their current circumstances, then given further interviews in order to elicit information on their adult life histories of employment and of general life circumstances.

Appendix 1 illustrates the questioning format used to obtain the retrospective information. The first page shows the diary style ‘events matrix’ life history record which is used to produce the ‘events matrix’ life history data file. The second shows the series of questions on job events, which were repeated for each recorded job, and which formed the basis of the FWLS’s ‘jobs grid’ data record. To use the FWLS life histories, then, it is usually desirable to link information between the cross-sectional data and the ‘events matrix’ and ‘jobs grid’ resources.
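This kind of linkage is a one-to-many file match: each person-level record is copied onto every episode record that shares the same person identifier. The sketch below illustrates the idea in Python, purely for exposition (the workshop exercises use SPSS syntax). The function `attach_person_data`, the field names (`person`, `sex`, `ethnic_group`, `job_no`, `duration`) and all values are hypothetical, and are not actual FWLS variable names or data.

```python
def attach_person_data(episodes, persons, key="person"):
    """Copy person-level fields onto every episode record with the same key."""
    lookup = {p[key]: p for p in persons}
    linked = []
    for ep in episodes:
        # Episodes with no matching person record are kept, without extra fields.
        person_fields = lookup.get(ep[key], {})
        # Episode fields take precedence if a name appears in both files.
        linked.append({**person_fields, **ep})
    return linked


# Hypothetical person-level (cross-sectional) records.
persons = [
    {"person": 1, "sex": "F", "ethnic_group": "A"},
    {"person": 2, "sex": "M", "ethnic_group": "B"},
]

# Hypothetical episode-level records, several per person.
episodes = [
    {"person": 1, "job_no": 1, "duration": 157},
    {"person": 1, "job_no": 2, "duration": 12},
    {"person": 2, "job_no": 1, "duration": 84},
    {"person": 3, "job_no": 1, "duration": 9},
]

linked = attach_person_data(episodes, persons)

if __name__ == "__main__":
    for row in linked:
        print(row)
```

Keeping unmatched episodes, rather than dropping them, is a design choice made here so that the number of episode records is unchanged by the match; whether that is appropriate depends on the analysis.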
In fact, due to software and resource limitations at the time of the survey collection, the FWLS is available from the UK Data Archive only in a minority package format (TDA). However, if translated into a more widely used format (here we use SPSS), it is quickly possible to access the datasets and undertake preliminary analyses. SPSS syntax for a series of relevant commands is provided in the workshop exercises of section 4.

BHPS

The BHPS is an ongoing panel survey where, each year, the same respondents are contacted and information on their current circumstances, and descriptions of basic demographic and economic experiences over the last year, are recorded. In addition, at certain stages of the sampling, retrospective longitudinal records have been collected which describe employment and life history circumstances since school leaving age (if the first BHPS panel contact came when the respondent was already in their adult life). Between these records, then, the BHPS offers extensive life history information for its respondents.

The multiple data sources make the construction of complete BHPS life history records relatively complex. However, one research project has concentrated specifically on this issue and has produced, with periodic updates and rereleases, supplementary BHPS data files of combined work-life history records (Halpin 2002). The diagram below, a copy from Halpin (1998:p68), illustrates how the BHPS records are combined in these files. Ultimately, a general file listing continuous life histories in employment, ‘ljemp’, is produced, although some details from other BHPS components are not preserved in it and users may choose to access them separately.

[2] These datasets were accessed through the UK Data Archive at the University of Essex.

4. Workshop sessions: Illustrations of example event history records with selected analyses

USE THE SPSS SYNTAX FILES (ATTACHED TO THIS FILE; ALSO AVAILABLE ON MACHINES) TO OPEN THE APPROPRIATE DATASETS AND CONDUCT SOME ILLUSTRATIVE ANALYSES. EXPERIMENT WITH VARIATIONS ON THE COMMANDS GIVEN AND DATASETS USED.

**It will be necessary to alter the paths at the top of the syntax files to point to an appropriate directory on the machine that you are using**

4.1) Single-state single episode

Example: From the BHPS lifetime work history file, the duration of the first lifetime full time job (or self-employed job). Work through the syntax file workedegs_41_ssse.sps .

** A) Look at the average first job lengths and number of censored cases.
** B) Do first job lengths vary by occupational type?
** C) Some example life tables and Kaplan-Meier estimates.
** D) File matching: link other individual level information.
** E) Explore some gender differences in first job durations.
** Extension: try to follow the syntax used to create the simplified data file used above, found in workedegs_41_mkdat1.sps .

4.2) Single episode competing risks

Example: From the FWLS jobs grid file, only those employed at age 30, by next employment status event (either unemployment; not working; still working and job changes towards advantage; still working and job changes towards disadvantage; or right-censoring). Work through the syntax file workedegs_42_secr.sps .

** A) Look at the average job-at-30 lengths after age 30, and the number of censored cases.
** B) Look at competing risks (destinations); match in ethnicity data and compare destination types by ethnicity.
** C) Compare durations by competing risks, split by gender or ethnicity.
** Extension: try to follow the syntax used to create the simplified data file used above, found in workedegs_42_mkdat.sps

4.3) Multi-state multi-episode

Example: From the FWLS events matrix, the cohabitation status histories of all respondents (always either cohabiting, married, divorced, separated, single). Work through the syntax file workedegs_43_msme.sps .

** A) Look at the distributions of cohabitation types and durations.
** B) Look at relations between marital status states and ethnic group and gender. Observe that even simple ‘MSME’ datasets become complicated...
** Extension: note that the syntax used to create the simplified data file is found in workedegs_43_mkdat.sps

References:

Aitkin M, Anderson D, Francis B, Hinde J. 1989. Statistical Modelling in GLIM. Oxford: Clarendon Press
Allison PD. 1984. Event History Analysis: Regression for Longitudinal Event Data. Beverly Hills: Sage
Barry J, Francis B, Davies R. 1990. Software for the Analysis of Binary Recurrent Events: A guide for users. Lancaster: Centre for Applied Statistics, Lancaster University
Blossfeld H-P, Hamerle A, Mayer KU. 1989. Event History Analysis. Hillsdale, New Jersey: Lawrence Erlbaum Associates
Blossfeld H-P, Rohwer G. 2002. Techniques of Event History Modelling: New Approaches to Causal Analysis, 2nd Edition. Mahwah, NJ: Lawrence Erlbaum Associates
Cleves M, Gould WW, Gutierrez R. 2002. An Introduction to Survival Analysis Using Stata. College Station, Texas: Stata Press. 290 pp.
Dex S, McCulloch A. 1997. The Reliability of Retrospective Unemployment History Data. Colchester: Working Paper 97-17 of the Institute for Social and Economic Research, University of Essex
Elias P. 1997. Who Forgot They Were Unemployed? Colchester: Working Paper 97-19 of the Institute for Social and Economic Research, University of Essex
Goldstein H, Rasbash J, Plewis I, Draper D, Browne W, et al. 1998. A user's guide to MLwiN. London: Multilevel Models Project, Institute of Education, University of London
Halpin B. 1998. Unified BHPS work-life histories: Combining multiple sources into a user-friendly format. Bulletin de Methodologie Sociologique 60: 34-79
Halpin B. 2002. British Household Panel Survey Combined Work-Life History Data, 1990-1999 [computer file]. 3rd ed, Economic and Social Research Council Research Centre on Micro-Social Change, University of Essex, Institute for Social and Economic Research; distributed by The Data Archive, University of Essex, Colchester
Jacobs S. 2002. Reliability and Recall of Unemployment Events Using Retrospective Data. Work, Employment and Society 16: 537-48
Rabe-Hesketh S, Everitt B. 2002. A Handbook of Statistical Analyses using Stata. Second edition. London: Chapman & Hall / CRC. 168 pp.
Solga H. 2001. Longitudinal surveys and the study of occupational mobility: Panel and retrospective design in comparison. Quality & Quantity 35: 291-309
Taris TW. 2000. A Primer in Longitudinal Data Analysis. London: Sage
Tuma N. 1994. Event History Analysis. In Analysing Social and Political Change: A casebook of methods, ed. A Dale, RB Davies. London: Sage
Venables WN, Ripley BD. 1999. Modern Applied Statistics With S-PLUS. New York: Springer-Verlag
Vermunt JK. 1997. lEM: A general program for the analysis of categorical data. Tilburg, Netherlands: Tilburg University