Constructing Data for Event Histories: data formats and
introductory analyses
Talk prepared for Workshop 2a : Event History Analysis, of the ESRC seminar series
“Analysing Longitudinal Data – Bridging the gap between methodology and
sociological research”.
Paul Lambert
Cardiff University School of Social Sciences
lambertp@cardiff.ac.uk
13 June 2003
Files for use with this paper can be downloaded from:
http://www.cf.ac.uk/socsi/main/lambertp/downloads.html
1. Data Formats
1.1 The State Space
Event history data is data collected on the duration of ‘episodes’ within a ‘state
space’. Its fundamental components are the identification of state membership, and
information on the length of time within episodes of each state. Additional structures
to event history datasets are all built onto these basic building blocks.
Figure 1: Illustration of episodes within a state space: Lifetime work histories
for 3 respondents born in 1935
[Figure not reproduced: for each of Persons 1, 2 and 3, a timeline of episodes in the
states FT work, PT work and Not in work, plotted against time from 1950 to 2000.]
1.2 Forms of State Spaces
The simplest event history data is ‘single state single episode’ : it concerns only the
duration spent in one state type for a single episode per respondent. Classic examples
of survival data have this form, for instance, the time from manufacture that it takes
before a metal component breaks down (time to failure). More complex datasets can
involve multiple states and/or multiple episodes per data subject. Such data is more
commonly found in social science applications. The diagram above illustrates a
‘multi-state multi-episode’ data structure, where respondents are distinguished into 3
different worklife states, and several different episodes can describe their full life
histories. Table 1 below describes a selection of event history data formats, which we
later return to in section 4 below.
Table 1 : Description of selected event history data formats
1) Single State Single episode
One episode of a certain type recorded for each subject. Example: First post-school occupation until it
ends (for any reason)
2) Single episode competing risks
One episode of a certain type recorded for each subject, but type of ending events classified into more
than one state of interest. Example: First marriage duration, comparing if it ends in separation,
divorce, widowhood, etc.
3) Multi-episode single state
More than one episode recorded for (at least some) subjects, but always the same type, and ending in
the same way. Example (relatively rare in social sciences): Prison sentence duration until release
4) Multi-state multi-episode
More than one episode in more than one state for subjects. Example (very common): Life history in
economic activity states since leaving school to retirement.
5) Time varying covariates (to any of above)
Extension to any of above structures whereby properties of explanatory covariates are not fixed over
time, but change in value during the duration of some episodes.
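One common way of storing a time-varying covariate, often called episode splitting, is to cut an episode into sub-episodes at the moment the covariate changes, so that each record carries a single covariate value. The sketch below is illustrative only: the function name, the argument names and the 0/1 coding of the event indicator are assumptions, not part of the datasets discussed in this paper.

```python
def split_episode(start, end, ended, change_time, before, after):
    """Split one episode at the time a covariate changes value.

    ended is 1 if the episode finished with an observed event, 0 if censored.
    Returns a list of (start, end, ended, covariate_value) records.
    """
    if not (start < change_time < end):
        # The covariate is constant over the whole episode.
        value = before if change_time >= end else after
        return [(start, end, ended, value)]
    return [
        (start, change_time, 0, before),   # first part never ends in the event
        (change_time, end, ended, after),  # second part keeps the original ending
    ]


# Hypothetical example: a job lasting from time 0 to 84 that ends in an event,
# with a covariate switching from 0 to 1 at time 30.
EXAMPLE = split_episode(0, 84, 1, 30, 0, 1)
```

Only the final sub-episode can carry the event, because the earlier part is by construction followed by further time in the same state.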
For many users, one of the harder tasks in undertaking event history analysis lies in
making the transition from original data sources, to the construction of ‘neat’ data
files in some of the formats described above. In the main examples following this talk,
we work on pre-prepared neat files of three example data format types. However,
associated with those exercises is additional SPSS syntax which illustrates how the
extracted files were produced from the two data sources used. More interested users
might like to check and replicate that data construction.
1.3 Rectangular data files
Event history data can be stored in two alternative rectangular data file formats which
allow for statistical analysis. More common are continuous time datasets (also
referred to as ‘spell files’ or ‘event oriented files’). Here each case represents a single
episode, and records the nature of the episode (its origin and destination states) and its
duration.
Table 2: Illustration of a continuous time (multi-state multi-episode) dataset
Case  Person  Start time  End time  Duration  Origin state  Destination state  {Other vars, person/state}
1     1       1           158       157       1 (FT)        3 (NW)
2     1       158         170       12        3 (NW)        3 (NW)
3     2       1           22        21        3 (NW)        1 (FT)
4     2       22          106       84        1 (FT)        3 (NW)
5     2       106         149       43        3 (NW)        2 (PT)
6     2       149         170       21        2 (PT)        2 (PT)
7     3       1           10        9         1 (FT)        2 (PT)
.     .       .           .         .         .             .                  .
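As a minimal sketch of the spell-file layout in Table 2, each row can be held as one record and the duration derived from the start and end times. The field names are hypothetical, and the rule that a record whose destination state equals its origin state is treated as right-censored is an assumption made here for illustration; the table itself does not say how censoring is coded.

```python
from collections import namedtuple

Spell = namedtuple("Spell", "case person start end origin dest")

# Rows of Table 2 (state codes: 1 = FT, 2 = PT, 3 = NW).
SPELLS = [
    Spell(1, 1, 1, 158, 1, 3),
    Spell(2, 1, 158, 170, 3, 3),
    Spell(3, 2, 1, 22, 3, 1),
    Spell(4, 2, 22, 106, 1, 3),
    Spell(5, 2, 106, 149, 3, 2),
    Spell(6, 2, 149, 170, 2, 2),
    Spell(7, 3, 1, 10, 1, 2),
]


def duration(spell):
    """Length of the episode in the dataset's time units."""
    return spell.end - spell.start


def is_censored(spell):
    # Assumption: an unchanged state marks an episode still in progress
    # when observation stopped.
    return spell.dest == spell.origin


durations = [duration(s) for s in SPELLS]
censored_cases = [s.case for s in SPELLS if is_censored(s)]
```

The derived durations agree with the Duration column of Table 2, and under the stated assumption cases 2 and 6 are the censored episodes.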
Equally, discrete time datasets can be used, whereby the sequence of events is
partitioned into distinct time units. Each case in the data file then represents one
time unit for a certain subject within a certain state. Additional variables can indicate
person level, state level, and period level characteristics; thus an advantage of
discrete time datasets is that they can more easily handle information on time-varying
covariates. The discrete time units can cover whatever duration the available
information allows. However, error is introduced if a transition between states
does not occur exactly at the boundary between discrete time units; the state recorded
for a unit is usually the main state occupied during that unit. In worklife
history analyses, for example, monthly discrete time units are usually thought adequate
to capture most state changes reasonably accurately. Annual panel surveys, by contrast,
are an example of a yearly discrete time dataset, where the longer length of the
discrete time period is likely to call into question any assumptions about the duration
of spells.
Table 3: Illustration of a discrete time (multi-state multi-episode) dataset
Case  Person  Discrete time  Approx real time  State  End of state  {Other person, state, or time unit level variables}
1     1       1              5                 1 FT   0
2     1       2              20                1 FT   0
3     1       3              35                1 FT   0
4     1       4              50                1 FT   0
5     1       5              65                1 FT   0
6     1       6              80                1 FT   0
7     1       7              95                1 FT   0
8     1       8              110               1 FT   0
9     1       9              125               1 FT   0
10    1       10             140               1 FT   1
11    1       11             155               3 NW   0
12    1       12             170               3 NW   1
13    2       1              5                 3 NW   0
14    2       2              20                3 NW   1
15    2       3              35                1 FT   0
16    2       4              50                1 FT   1
.     .       .              .                 .      .
It is possible to translate continuous time data into a discrete time dataset using data
management facilities within modern statistical packages. The reverse translation,
whilst possible, is less satisfactory, because the discrete time format only
approximates continuous time.
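The translation from continuous to discrete time can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used for the workshop files: the unit width of 15, the function name, and the rule that ties go to the earlier spell are all hypothetical choices. Each unit is given the state that occupies most of it, and the last unit of each run of identical states is flagged.

```python
def to_discrete(spells, width):
    """Convert one person's spells into discrete time records.

    spells: ordered, contiguous list of (start, end, state).
    Returns a list of (unit, main_state, end_of_state) tuples.
    """
    first, last = spells[0][0], spells[-1][1]
    rows = []
    unit, lo = 1, first
    while lo < last:
        hi = min(lo + width, last)
        # Time spent in each state inside the unit [lo, hi).
        overlap = {}
        for start, end, state in spells:
            shared = min(end, hi) - max(start, lo)
            if shared > 0:
                overlap[state] = overlap.get(state, 0) + shared
        main = max(overlap, key=overlap.get)  # ties go to the earlier spell
        rows.append([unit, main, 0])
        unit, lo = unit + 1, hi
    # Flag the last unit of each run of the same state. The final unit is
    # flagged too, so censoring is not distinguished from a real transition.
    for i, row in enumerate(rows):
        if i == len(rows) - 1 or rows[i + 1][1] != row[1]:
            row[2] = 1
    return [tuple(r) for r in rows]


# Person 2 from Table 2 (state codes: 1 = FT, 2 = PT, 3 = NW).
PERSON_2 = [(1, 22, 3), (22, 106, 1), (106, 149, 3), (149, 170, 2)]
rows = to_discrete(PERSON_2, 15)
```

With these inputs the move from NW to FT at time 22 is recorded as if it happened at the start of the second unit (time 16), which is the kind of approximation error that makes the reverse translation less satisfactory. The output is not intended to reproduce Table 3 exactly.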
2. Software for Dealing with Event History Data
As event history data can be expressed in a rectangular data file, it can in principle be
read in almost any statistical analysis package. However, the functions involved in
conducting an event history data analysis are more specific. Historically, fewer
statistical packages had functionality for these types of analysis, and specialist
packages and programmes were developed. More recently, however, it has become
routine for major analysis packages to incorporate event history data functions. Some
selective comments are presented in Table 4, though other potentially relevant
packages are not covered.
Table 4: Selected software and comments for data handling and statistical
analysis of event history data structures
Multi-purpose packages
SPSS: Basic descriptives, Kaplan-Meier, and Cox regression models. Example
applications eg in Blossfeld et al (1989).
STATA: Wide range of relevant functions, with extensive user contributions through ‘STB’
systems, see Cleves et al (2002); Rabe-Hesketh and Everitt (2002).
SAS: Wide range of relevant functions. Of the ‘big three’ social statistics packages,
both STATA and SAS have many more analytical routines than SPSS does. Some
example commands in Allison (1984); Blossfeld et al (1989).
GLIM: Historically widely used in the UK for both social and physical science event
history applications. See Aitkin et al (1989: ch. 6); Allison (1984).
Splus/R: Difficult language to learn, but one which is used by statisticians for ‘cutting edge’
survival data analysis. See for example Venables and Ripley (1999).
MLwiN: Popular analytical package for dealing with hierarchically nested data structures.
Allows extension to event history analysis, eg Goldstein et al (1998).
lEM: Powerful, though pithy, freeware illustrating event history applications through
log-linear modelling formats, see Vermunt (1997):
http://www.kub.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html
Specialist packages
TDA: Freely available from http://www.stat.ruhr-uni-bochum.de/tda.html , a simple and
intuitive package thoroughly illustrated in Blossfeld and Rohwer (2002).
SABRE: Extension facility for GLIM, specifically for analysis of binary recurrent events
(Barry et al 1990).
3. Data collection
In any discussion of event history analysis, it is important to remember that the data
collection methods used to obtain the state-space histories are also distinctive. In the
social sciences, the most common technique involves retrospective questioning of a
respondent about the relevant parts of their life history. Alternatively, longitudinal
panel and cohort studies, though still utilising (short term) retrospective accounts for
precise details, can involve repeated contacts over time with the same respondents,
and thus collect event history information progressively through the lifecourse.
Such methods of data collection, however, tend to introduce two features which
distinguish applications in social science research. Traditionally, statistical methods of
event history analysis have been developed with regard to duration data on physical
and medical science observations. These often feature very accurate measurements of
the timing of clearly defined events, along with analysis in terms of only a limited
number of (purportedly accurately measured) other relevant variables. By contrast,
social science applications are characterised by wider scope for inaccuracies of
duration measurements (as well as other variable and sampling uncertainties). There
also tends to be a greater need to link multiple pieces of information from alternative
sources onto the appropriate duration data records (see examples below).
The first potential variable inaccuracy concerns the reliability of retrospective recall
data. Certainly a number of techniques have been developed to maximise the accuracy
of retrospective records (eg Taris 2000:p8-12). However methodological studies
regularly suggest that respondents’ longitudinal retrospective accounts are subject to
high levels of error (eg Dex & McCulloch 1997, Elias 1997, Jacobs 2002, Solga
2001). Moreover, the circumstances of missing data errors and representative
sampling are more complex with retrospective data which is, for instance, only
available for the sample of survivors (cf Tuma 1994).
Secondly, as highlighted by Tuma (1994), the centrality of a clearly defined state
space to an event history analysis often encourages undesirable simplifications in
event history data constructions. This occurs at the level of the state space
specification, where it is analytically and cognitively easier to keep the state space
categories very simple, whereas a better substantive account, and often different
subsequent results, might be obtained from more complex specifications. Also, the
same issues arise in the specification of other variables which are to be related to the
durations analysed. In particular, the available analytical methods for dealing with
time-varying covariates also encourage simpler variable specifications than might
otherwise be preferred [1].
[1] A related point, which particularly concerns event history studies of longer durations, is that the most
appropriate categorisations of state space levels, or of other time varying covariates, may not be stable
over time. For instance, the most attractive occupational categorisation in 2003 may not be the best
one to apply to 1963, though both periods may feature in the same study.
In the examples below a series of illustrations of event history data structures are
presented from two large scale government funded datasets, the British Household
Panel Survey (BHPS) and the Family and Working Lives Survey (FWLS) [2]. The
datasets also serve to illustrate examples of data collection, although it might be noted
that those event history analyses which are found in the social sciences tend more
often to come from specially constructed and limited availability longitudinal records,
rather than such general purpose surveys as the BHPS and FWLS.
FWLS
The FWLS was a one-off cross-sectional survey conducted in 1994, sampling some
11,000 adults (and also their partners). The sample, which included inflated ‘boost’
subsamples from four pre-identified ethnic minority groups, was interviewed
cross-sectionally about their current circumstances, then given further interviews in order to
elicit information on their adult life histories of employment and of general life
circumstances. Appendix 1 illustrates the questioning format used to obtain the
retrospective information. The first page shows the diary style ‘events matrix’ life
history record which is used to produce the ‘events matrix’ life history data file. The
second shows the series of questions on job events, which were repeated for each
recorded job, which formed the basis of the FWLS’s ‘jobs grid’ data record.
To use the FWLS life histories, then, it is usually desirable to link information
between the cross-sectional data and the ‘events matrix’ and ‘jobs grid’ resources. In
fact, due to software and resource limitations at the time of the survey collection, the
FWLS is available from the UK Data Archive only in a minority package format
(TDA). However, if translated into a more widely used format (here we use SPSS), it is
quickly possible to access the datasets and undertake preliminary analyses. SPSS
syntax for a series of relevant commands is provided in the workshop exercises of
section 4.
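The linkage between person-level (cross-sectional) information and episode-level records is a one-to-many match on a person identifier. A minimal sketch follows; the variable names (`pid`, `sex`, `ethnic`) and values are hypothetical and are not the FWLS variable names, and the workshop exercises themselves use SPSS syntax rather than this code.

```python
def link_person_info(episodes, persons):
    """Attach person-level variables to every episode of that person.

    episodes: list of dicts, each with a 'pid' key.
    persons:  dict mapping pid to a dict of person-level variables.
    Episodes with no matching person are kept without the extra variables.
    """
    linked = []
    for episode in episodes:
        row = dict(episode)                      # copy, leave the input untouched
        row.update(persons.get(episode["pid"], {}))
        linked.append(row)
    return linked


# Hypothetical example data.
EPISODES = [
    {"pid": 1, "duration": 157},
    {"pid": 1, "duration": 12},
    {"pid": 3, "duration": 9},
]
PERSONS = {
    1: {"sex": "F", "ethnic": "group A"},
    2: {"sex": "M", "ethnic": "group B"},
}
linked = link_person_info(EPISODES, PERSONS)
```

Person-level values are repeated on each of a person's episodes, which is what allows durations to be compared across groups in the later exercises.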
BHPS
The BHPS is an ongoing panel survey where, each year, the same respondents are
contacted and information on their current circumstances, and descriptions of basic
demographic and economic experiences over the last year, are recorded. In addition,
at certain stages of the sampling, retrospective longitudinal records have been
collected which describe employment and life history circumstances since school
leaving age (if the first BHPS panel contact came when the respondent was already in
their adult life). Between these records then the BHPS offers extensive life history
information for its respondents. However, the multiple data sources make the
construction of complete BHPS life history records relatively complex. One
research project has concentrated specifically on this issue and has produced, with
periodic updates and rereleases, supplementary BHPS data files of combined work-life
history records (Halpin 2002).
[2] These datasets were accessed through the UK Data Archive at the University of Essex.
The diagram below, a copy from Halpin (1998:p68), illustrates how the BHPS records
are combined in these files. Ultimately, a general file listing continuous life histories
in employment, ‘ljemp’, is produced, although some details from other BHPS
components are not preserved in it and users may choose to access them separately.
4. Workshop sessions: Illustrations of example event history records with
selected analyses
USE THE SPSS SYNTAX FILES (ATTACHED TO THIS FILE; ALSO
AVAILABLE ON MACHINES) TO OPEN THE APPROPRIATE DATASETS AND
CONDUCT SOME ILLUSTRATIVE ANALYSES. EXPERIMENT WITH
VARIATIONS ON THE COMMANDS GIVEN AND DATASETS USED.
**It will be necessary to alter the paths at the top of the syntax files to point to an
appropriate directory on the machine that you are using**
4.1) Single-state single episode
Example: From the BHPS lifetime work history file, the duration of the first lifetime
full time job (or self-employed job).
Work through the syntax file workedegs_41_ssse.sps .
** A) Look at the average first job lengths and number of censored cases.
** B) Do first job lengths vary by occupational type?
** C) Some example Life tables and Kaplan-Meier estimates.
** D) File matching: link other individual level information.
** E) Explore some gender differences in first job durations.
** Extension: try to follow the syntax used to create the simplified data file
used above, found in workedegs_41_mkdat1.sps .
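For readers who want to see what a Kaplan-Meier estimate computes for single-state single-episode data, the product-limit calculation can be sketched in a few lines. This is an illustration, not the SPSS routine used in the syntax file; the function name and the example durations are hypothetical.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate of the survivor function.

    events[i] is 1 if episode i ended in an observed event, 0 if censored.
    Returns [(t, S(t))] for each distinct time at which an event occurred.
    """
    pairs = sorted(zip(durations, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        ended = 0    # observed events at time t
        leaving = 0  # all episodes (events and censored) finishing at time t
        while i < len(pairs) and pairs[i][0] == t:
            ended += pairs[i][1]
            leaving += 1
            i += 1
        if ended:
            survival *= 1 - ended / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Hypothetical first-job lengths in months, with two censored episodes.
DURATIONS = [6, 6, 12, 18, 24, 24]
EVENTS = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(DURATIONS, EVENTS)
```

Censored episodes contribute to the number at risk up to their censoring time but never reduce the survivor estimate themselves, which is why the number of censored cases in step A matters for interpreting the curves.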
4.2) Single episode competing risks
Example: From the FWLS jobs grid file, only those employed at age 30, by next
employment status event (either unemployment; not working; still working and job
changes towards advantage; still working and job changes towards disadvantage; or
right-censoring).
Work through the syntax file workedegs_42_secr.sps .
** A) Look at the average job-at-30 lengths after age 30, number of censored cases.
** B) Look at competing risks (destinations); Match in ethnicity data and compare
destination types by ethnicity.
** C) Compare durations by competing risks, split by gender or ethnicity.
** Extension: try to follow the syntax used to create the simplified data file
used above, found in workedegs_42_mkdat.sps
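A first descriptive look at competing risks data amounts to counting episodes by destination and comparing durations across destinations. The sketch below shows that tabulation under stated assumptions: the destination labels and durations are invented for illustration and do not come from the FWLS.

```python
from collections import defaultdict


def summarise_by_destination(episodes):
    """Count episodes and average their durations for each destination.

    episodes: iterable of (duration, destination) pairs.
    Returns {destination: (number of episodes, mean duration)}.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for length, destination in episodes:
        totals[destination][0] += 1
        totals[destination][1] += length
    return {d: (n, total / n) for d, (n, total) in totals.items()}


# Hypothetical job-at-30 episodes: (months after age 30, how the job ended).
EPISODES = [
    (14, "unemployment"),
    (30, "job change: advantage"),
    (22, "job change: advantage"),
    (40, "censored"),
    (8, "not working"),
    (10, "unemployment"),
]
summary = summarise_by_destination(EPISODES)
```

Mean durations for censored episodes are lower bounds on the true durations, so such a table is a description of the data rather than an estimate of the underlying risks.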
4.3) Multi-state multi-episode
Example: From the FWLS events matrix, the cohabitation status histories of all
respondents (always either cohabiting, married, divorced, separated, single).
Work through the syntax file workedegs_43_msme.sps .
** A) Look at the distributions of cohabitation types and durations.
** B) Look at relations between marital status states and ethnic group and gender.
Observe that even simple ‘MSME’ datasets become complicated...
** Extension: note that the syntax used to create the simplified data file
is found in workedegs_43_mkdat.sps
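One simple summary of multi-state multi-episode data is a table of how often each origin state is followed by each destination state. A minimal sketch is given below; the person identifiers and state sequences are hypothetical, and the function is an illustration rather than part of the workshop syntax.

```python
from collections import Counter


def transition_counts(histories):
    """Count origin-to-destination transitions across all respondents.

    histories: dict mapping a person id to that person's ordered list of
    episode states.
    """
    counts = Counter()
    for states in histories.values():
        # Pair each episode with the one that follows it.
        for origin, destination in zip(states, states[1:]):
            counts[(origin, destination)] += 1
    return counts


# Hypothetical cohabitation status histories.
HISTORIES = {
    1: ["single", "cohabiting", "married"],
    2: ["single", "married", "divorced", "married"],
    3: ["single"],
}
counts = transition_counts(HISTORIES)
```

Even with five states there are twenty possible transitions between different states, which gives a sense of how quickly such datasets become complicated.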
References:
Aitkin M, Anderson D, Francis B, Hinde J. 1989. Statistical Modelling in GLIM.
Oxford: Clarendon Press
Allison PD. 1984. Event History Analysis : Regression for Longitudinal Event Data.
Beverly Hills: Sage
Barry J, Francis B, Davies R. 1990. Software for the Analysis of Binary Recurrent
Events : A guide for users. Lancaster: Centre for Applied Statistics, Lancaster
University
Blossfeld H-P, Hamerle A, Mayer KU. 1989. Event History Analysis. Hillsdale, New
Jersey: Lawrence Erlbaum Associates
Blossfeld H-P, Rohwer G. 2002. Techniques of Event History Modelling: New
Approaches to Causal Analysis, 2nd Edition. Mahwah, NJ: Lawrence Erlbaum
Associates
Cleves M, Gould WW, Gutierrez R. 2002. An Introduction to Survival Analysis Using
Stata. College Station, Texas: Stata Press. 290 pp.
Dex S, McCulloch A. 1997. The Reliability of Retrospective Unemployment History
Data. Colchester: Working Paper 97-17 of the Institute for Social and
Economic Research, University of Essex
Elias P. 1997. Who Forgot They Were Unemployed? Colchester: Working Paper
97-19 of the Institute for Social and Economic Research, University of Essex
Goldstein H, Rasbash J, Plewis I, Draper D, Browne W, et al. 1998. A user's guide to
MLwiN. London: Multilevel Models Project, Institute of Education, University
of London
Halpin B. 1998. Unified BHPS work-life histories : Combining multiple sources into
a user-friendly format. Bulletin de Methodologie Sociologique 60: 34-79
Halpin B. 2002. British Household Panel Survey Combined Work-Life History Data,
1990-1999 [computer file]. 3rd ed, Economic and Social Research Council
Research Centre on Micro-Social Change, University of Essex, Institute for
Social and Economic Research; distributed by The Data Archive, University
of Essex, Colchester
Jacobs S. 2002. Reliability and Recall of Unemployment Events Using Retrospective
Data. Work, Employment and Society 16: 537-48
Rabe-Hesketh S, Everitt B. 2002. A Handbook of Statistical Analyses using Stata.
Second edition. London: Chapman & Hall / CRC. 168 pp.
Solga H. 2001. Longitudinal surveys and the study of occupational mobility: Panel
and retrospective design in comparison. Quality & Quantity 35: 291-309
Taris TW. 2000. A Primer in Longitudinal Data Analysis. London: Sage
Tuma N. 1994. Event History Analysis. In Analysing Social and Political Change : A
casebook of methods, ed. A Dale, RB Davies. London: Sage
Venables WN, Ripley BD. 1999. Modern Applied Statistics With S-PLUS. New York:
Springer-Verlag
Vermunt JK. 1997. lEM : A general program for the analysis of categorical data.
Tilburg, Netherlands: Tilburg University