Managing and Analyzing Longitudinal Data

COPAFS Quarterly Meeting
June 1, 2012
Patricia Ruggles
Catherine Ruggles
Longitudinal Data are Hard to Use
Longitudinal databases tend to be very complex
Creating analysis files typically involves major data
Complex documentation and record linkage issues: searching and
understanding variable lists, record structures, and other features
requires patience and persistence
Files are often hierarchical as well as linked across time periods;
variables need to be moved across record types, new variables
need to be created involving more than one record type, etc.
Longitudinal analyses involve complex relationships
across records and variables and therefore can be
conceptually difficult to plan and carry out
Results: Under-use and Misuse
Analysts shy away from using large longitudinal data
sets such as SIPP because understanding and
restructuring the data is frustrating, expensive and
When such datasets are used, it is often for cross-sectional
rather than longitudinal analyses (e.g., topical modules in SIPP)
or to compare two points in time, rather than to examine
patterns of activity over time
As a result: under-use, funding difficulties, low return
on our investment in data collection and preparation
Longitudinal Analysis Steps
Step 1: Understanding the Data
Explore metadata and data and choose
appropriate variables
Step 2: Preparing Data for Analysis
Recode and create variables as necessary
Step 3: Performing Analyses
Perform cross-sectional and longitudinal
analyses as desired
Step 1: Understanding the Data
Many longitudinal datasets are very large and not necessarily
well documented
For example: The 2008 SIPP has 48 months of data on just under
120,000 unique individuals, and contains more than 1000 variables
Documentation exists in many places, but it can be hard to link
specific variables to the appropriate questions in the
questionnaire, and to understand issues such as the universe to
which each variable applies
A key need for longitudinal data users, therefore, is a better way of
exploring the available data and linking it to the appropriate documentation
Orlin has made the ability to search and understand both data
and metadata a key feature of our system
Let’s do a quick tour of the data and metadata exploration system
The Welcome Page
Variable List for SIPP
Exploring SIPP Metadata and Data
To see the available variables, click on the person-month
record type in the metadata tab on the Welcome Page
There are over 1000 variables—one of the things that
makes SIPP hard to use!
To find a specific variable, type its name or any other
identifying information in the search box
This brings up all variables meeting the search criteria;
e.g., typing employment will bring up the 39 variables relating to
employment, along with their labels and codes
To select a specific variable, click on it
This will show its codes, frequencies, and summary statistics
Also, hyperlinks to related variables and to all citations for
this variable in questionnaires, code books, and user guide
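As a rough illustration of the kind of keyword search described above, here is a minimal Python sketch. It is not the Orlin System's implementation; the variable names and labels are made up for the example and are not actual SIPP metadata.

```python
# Hypothetical metadata entries (illustrative names and labels only).
VARIABLES = [
    {"name": "EMP_STATUS", "label": "Employment status recode"},
    {"name": "EARN_TOTAL", "label": "Total person earnings this month"},
    {"name": "HH_SIZE", "label": "Number of persons in household"},
]


def search_variables(variables, term):
    """Return entries whose name or label contains the term, ignoring case."""
    needle = term.lower()
    return [
        v for v in variables
        if needle in v["name"].lower() or needle in v["label"].lower()
    ]


matches = search_variables(VARIABLES, "employment")
```

A real system would also search codes, questionnaire text, and user guide citations; this sketch only covers names and labels.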
Variable Search Results:
Employment Status Recode Variable
Viewing the Data
In addition to hyperlinks to other metadata, the
metadata are linked directly to the data
For example—clicking on the number of cases with a
specific code value in the frequency table will bring
up all the case records with that value
Users can choose which variables on those records they
wish to inspect, using a drop down check list
This aids in debugging, understanding complex variable
Finding the Information You Need
The search and hyper-linking features of the Orlin
System address the first of the difficulties in working
with SIPP discussed earlier in our presentation
Many users give up before they even get to longitudinal
analysis, because it can be so hard to find the right
variable and its associated documentation
SIPP documentation is still a bit patchy, but by hyperlinking all existing documentation for every variable the
Orlin System makes it much easier to understand exactly
what the variable means
The system also includes a global search function, which
allows users to search across all aspects of the system
for any specific phrase or term
Step 2: Preparing Data for Analysis
Longitudinal data require substantial manipulation and
recoding before analysis, even after finding the right variables
Creating usable data extracts that preserve necessary
information on relationships between units of analysis and their
individual components can be complex even in cross-sectional analyses
Adding a time dimension means moving information across
both record types and points in time
Sample attrition, the addition of special supplements,
inconsistencies in responses across waves of the survey, and
weighting problems pose additional difficulties
Users need help in understanding and dealing with
these issues
The Longitudinal Unit of Analysis
Longitudinal surveys such as SIPP, the Health and Retirement
Study, etc. typically contain data on several potential units of
analysis or record types, such as households, persons,
welfare units, medical records, etc.
For most types of longitudinal analysis, only units that are
unchanging over time can be usefully linked across time
For example, households can’t be linked over time because they
change too much from period to period
For most demographic surveys the person-month (or person-year)
record is the basic longitudinal unit—simply a string of linked
records across time for each person
Information from associated units or record types must then be
linked to the longitudinal unit at the appropriate point in time
Restructuring Longitudinal Data
Creating the necessary links is very difficult using sequential data
processing packages such as SAS
The process will require several steps, each of which means a
new pass through the data set
For example, to track each person’s household income in each
month of the survey using SAS:
1. Find the correct household for person 1 this month
2. Create a summary variable for household income that month
3. Attach that variable to the person-record for that month
4. Repeat for next month for person 1
5. After creating household income variables for each month for
person 1, repeat for persons 2 – 50,000
This gets old fast, especially because it has to be repeated for
many variables—age of head, welfare recipiency—and for many
record types—subfamilies, welfare units, etc.
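For contrast with the step-by-step procedure above, the following Python sketch shows how the same household income variable can be built with a keyed lookup once records are indexed by household and month. The record layout and field names (person, month, household, earnings) are assumptions for illustration; this is neither SIPP's actual layout nor the Orlin System's code.

```python
from collections import defaultdict

# Hypothetical person-month records; person 1 changes household in month 2.
person_months = [
    {"person": 1, "month": 1, "household": "A", "earnings": 1500},
    {"person": 2, "month": 1, "household": "A", "earnings": 900},
    {"person": 1, "month": 2, "household": "B", "earnings": 1600},
    {"person": 2, "month": 2, "household": "A", "earnings": 900},
]


def attach_household_income(records):
    """Add hh_income: summed earnings of everyone in the household that month."""
    totals = defaultdict(int)
    for rec in records:
        totals[(rec["household"], rec["month"])] += rec["earnings"]
    # Each person-month record looks up its own household for that month.
    return [
        dict(rec, hh_income=totals[(rec["household"], rec["month"])])
        for rec in records
    ]


linked = attach_household_income(person_months)
```

Because the lookup key includes the month, a person who moves between households picks up the income of whichever household they belong to at that point in time.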
The Orlin Approach to Restructuring Data
The Orlin System uses database technology to keep
track of variables and their linkages across both
record types and time
This greatly simplifies the process of transforming
variables as needed, creating new variables, and
making sure that all variables are useable
appropriately in longitudinal analyses
This also simplifies the process of recoding variables
and performing other data transformations that are
typically needed in both cross-sectional and
longitudinal analyses
SIPP Data Structure in the Orlin System
We will use the 2008 SIPP panel to illustrate how the
Orlin restructuring system works.
The basic record type is the person-month record,
which is the series of all of the months of data for a
specific person.
We have also created records for each unique
person, family or household that ever appears in the panel.
Records are stored in a database system that
understands their linkages, which makes it easy to
create variables that draw on data from different
record types or different points in time.
Preparing Data for Analysis
Finding the right variables is only the first step
Even in cross-sectional analyses, variables may need to be
recoded for a specific analysis—for example, by collapsing
the number of codes
Sometimes new variables need to be created by combining
information from two or more existing variables—for
example, using income and family size to calculate
equivalent income across different families
Sometimes information on other people must be used in
conjunction with variables on the person-month record—for
example, to identify workers with pre-school children
All of these examples require data transformations
and the creation of new variables
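The three kinds of transformation listed above can be sketched in a few lines of Python. Everything here is an illustrative assumption: the age cut points, the square-root equivalence scale, the under-5 definition of pre-school, and the field names are not taken from SIPP or the Orlin System.

```python
import math


def collapse_age(age):
    """Recode exact age into three broad groups (cut points are illustrative)."""
    if age < 18:
        return "child"
    if age < 65:
        return "working-age"
    return "older"


def equivalent_income(family_income, family_size):
    """Adjust income for family size with a square-root scale (assumed choice)."""
    return family_income / math.sqrt(family_size)


def flag_workers_with_preschoolers(records, preschool_age=5):
    """Flag workers whose family contains any child under preschool_age."""
    families = {r["family"] for r in records if r["age"] < preschool_age}
    return [r["any_work"] == 1 and r["family"] in families for r in records]


# Hypothetical records for a single month.
records = [
    {"person": 1, "family": "F1", "age": 34, "any_work": 1},
    {"person": 2, "family": "F1", "age": 3, "any_work": 0},
    {"person": 3, "family": "F2", "age": 41, "any_work": 1},
    {"person": 4, "family": "F2", "age": 12, "any_work": 0},
]
flags = flag_workers_with_preschoolers(records)
```

The last function is the one that needs information on other people: it flags any worker in a family with a young child, which is a simplification of identifying the child's own parents.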
Data Transformations
The Orlin System allows intelligent data transformations
because records are linked internally in a database, and the
system understands those links
Transformations such as recodes and the calculation of new
variables require two steps in the Orlin System:
First, the new variable is defined, using the system’s templates
Second, when a satisfactory definition has been created, it is run
on the data to actually create the variable
New variables can be created using either a small sample of
about 35,000 person-month records, or the full sample, which
includes about 2.6 million records.
The small sample runs in the foreground and takes up to 5 mins.
The full sample runs in the background and takes considerably
longer, depending on the complexity of the transformation.
Creating a New Variable Definition
The first step in transforming data is to define the
new variable you want to create
Second step: Run the new definition on the data to
create the new variable
Orlin automatically tracks every change, every new
definition, and all output
Template for Variable Definition
Example: Run Variable Creation for ANY_WORK
Audit Trail
Complex Transformations
A particular strength of the Orlin System is its ability
to handle complex data transformations, such as
creating variables that use data from different record
types and/or different months
Example: creating AVERAGE_EARNINGS for an
individual across all months of the panel
Create new variable definition as before, specifying new
variable name and source variable (TPEARN)
Select create a complex variable
Select “average” under function type
Select sample and run
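Outside the templates, the AVERAGE_EARNINGS calculation amounts to averaging TPEARN over each person's months and writing the result back to every record for that person. The Python sketch below assumes a simple list-of-dicts layout with hypothetical values; it is not the Orlin System's implementation.

```python
from collections import defaultdict

# Hypothetical person-month records carrying the source variable TPEARN.
records = [
    {"person": 1, "month": 1, "TPEARN": 1000},
    {"person": 1, "month": 2, "TPEARN": 1200},
    {"person": 1, "month": 3, "TPEARN": 800},
    {"person": 2, "month": 1, "TPEARN": 0},
    {"person": 2, "month": 2, "TPEARN": 500},
]


def add_average_earnings(records):
    """Attach each person's mean TPEARN across all observed months."""
    by_person = defaultdict(list)
    for rec in records:
        by_person[rec["person"]].append(rec["TPEARN"])
    averages = {p: sum(vals) / len(vals) for p, vals in by_person.items()}
    return [dict(rec, AVERAGE_EARNINGS=averages[rec["person"]]) for rec in records]


with_average = add_average_earnings(records)
```

This averages over observed months only, so a person with missing months is averaged over fewer records; whether that is appropriate depends on the analysis.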
Example: Complex Data Transformations
Step 3: Performing Analyses
After transforming our data as needed, we are ready to
analyze them
To analyze data using the Orlin System, press the
Analyze button on the home page button bar
Specific analyses such as crosstabs, regressions, and duration
analyses can be performed by clicking on the appropriate
A template will appear asking for the information needed for the
requested analysis: for example, for a regression, the type of
regression, the dependent variable, and the independent variables
Analyses use the R statistical system
Results of data transformations can also be exported for
analysis in statistical packages such as SAS, SPSS and Stata
Example: Regression Results
Longitudinal Analyses
In addition to standard cross-sectional analyses, the
Orlin System allows various types of time-related analyses
In particular, it can perform two main types of
longitudinal analysis:
Analysis of transitions—changes in state such as moving
from employment to unemployment—and the relationship
of such changes to other variables or other changes
Analysis of spells—periods of time over which a changed
state persists, such as a spell of unemployment—and the
effects of other variables on the duration of such spells
Defining Transition Variables
Clicking on the Create Transition Variable button in the
transform area brings up a template that allows the user
to define the specific state change of interest
Example: STOP_WORK
This variable is defined as a change from the status of working to
the status of not working
It uses the ANY_WORK variable we previously defined
The user can choose to identify either the last month worked or
the first month not working, by choosing to compare to the
previous or following month
The variable uses the time variable SEQUENCE, which is simply
the sequence number of the month (e.g., 32 for the 32nd month in
the panel)
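The logic of a transition variable like STOP_WORK can be sketched in Python as follows. The function, its compare_to parameter, and the record layout are hypothetical stand-ins for the template options; only the names ANY_WORK, SEQUENCE, and STOP_WORK come from the slides.

```python
def stop_work(records, compare_to="previous"):
    """Return {(person, SEQUENCE): 0/1} flagging a working -> not working change.

    compare_to="previous" flags the first month not working;
    compare_to="following" flags the last month worked.
    """
    if compare_to not in ("previous", "following"):
        raise ValueError("compare_to must be 'previous' or 'following'")
    status = {(r["person"], r["SEQUENCE"]): r["ANY_WORK"] for r in records}
    flags = {}
    for (person, seq), work in status.items():
        if compare_to == "previous":
            before = status.get((person, seq - 1))  # None if month is missing
            flags[(person, seq)] = int(before == 1 and work == 0)
        else:
            after = status.get((person, seq + 1))
            flags[(person, seq)] = int(work == 1 and after == 0)
    return flags


# Hypothetical history: works months 1-2, stops in months 3-4, returns in 5.
history = [
    {"person": 1, "SEQUENCE": s, "ANY_WORK": w}
    for s, w in enumerate([1, 1, 0, 0, 1], start=1)
]
first_month_out = stop_work(history, "previous")
last_month_in = stop_work(history, "following")
```

Months with no adjacent record (the first or last month of the panel, or a gap) are never flagged here, since no comparison is possible.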
Create a Transition Variable: STOP_WORK
Defining Spells
A spell is a period of time defined by two
transitions—into the state of interest (such as
unemployment), and out of the state
A spell may occur even if only one transition is observed
—if for example someone becomes unemployed but the
panel ends before the unemployment spell does
Such a spell would be right-censored: no ending can be observed
Spells can also be left-censored—an ending is observed,
but no beginning
Statistical techniques exist to analyze spell
durations, accounting for censoring
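To make the censoring definitions concrete, here is a minimal Python sketch that turns one person's monthly states into spell records. It assumes the person is observed in every month of the panel and treats a spell touching the first or last observed month as censored; the field names are hypothetical and do not reflect the Orlin System's spell record layout.

```python
def build_spells(states, in_state=0):
    """Build spell records from monthly states listed in month order.

    A spell starting in month 1 is left-censored (no beginning observed);
    a spell still open in the final month is right-censored.
    """
    spells = []
    start = None
    for month, value in enumerate(states, start=1):
        if value == in_state and start is None:
            start = month  # transition into the state (or already in it)
        elif value != in_state and start is not None:
            spells.append({
                "start": start, "end": month - 1, "duration": month - start,
                "left_censored": start == 1, "right_censored": False,
            })
            start = None
    if start is not None:  # panel ended before the spell did
        last = len(states)
        spells.append({
            "start": start, "end": last, "duration": last - start + 1,
            "left_censored": start == 1, "right_censored": True,
        })
    return spells


# 0 = not working, 1 = working, over a hypothetical 7-month panel.
spells = build_spells([0, 1, 1, 0, 0, 1, 0])
```

In this example the first spell is left-censored, the second is fully observed, and the third is right-censored.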
Duration Analysis
Standard duration analyses essentially calculate the
proportion of all those observed in a spell at a given
point in time who exit the spell at that point—in other
words, the “hazard” of leaving the spell
Analyses can take into account the effects of various
independent variables on predicted durations
The Orlin System allows a variety of different models
to be explored
All of these duration models operate on the spell record
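The hazard described above can be computed directly from spell durations. The Python sketch below is a simple discrete-time, life-table style estimate under stated assumptions: left-censored spells have already been excluded, and a right-censored spell counts as at risk through its last observed month but never as an exit. It is not the specific set of models offered by the Orlin System.

```python
def discrete_hazard(spells):
    """Estimate the hazard of exit at each duration.

    spells: list of (duration_in_months, right_censored) pairs.
    hazard[t] = exits at duration t / spells lasting at least t months.
    """
    if not spells:
        return {}
    hazards = {}
    longest = max(duration for duration, _ in spells)
    for t in range(1, longest + 1):
        at_risk = sum(1 for duration, _ in spells if duration >= t)
        exits = sum(
            1 for duration, censored in spells
            if duration == t and not censored
        )
        hazards[t] = exits / at_risk
    return hazards


# Hypothetical spells: (duration, right_censored).
hazards = discrete_hazard([(1, False), (2, False), (2, True), (3, False), (4, True)])
```

Models that bring in independent variables, such as proportional hazards regressions, build on the same at-risk and exit counts but are beyond this sketch.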
Create a Spell Record
Spell Record Variables
Example: Spell Records
Analyzing Spell Records
The basic spell record includes only basics relating
to the spell itself
To analyze durations in conjunction with anything
else, therefore, the independent variables of interest
have to be moved to the spell record
This can be done using the create variable definition
screen, choosing the option to move a variable
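Moving a variable to the spell record is essentially a keyed lookup. The Python sketch below copies a variable's value in the spell's first month onto the spell record; taking the value at spell start is an assumed convention, and the field names are hypothetical.

```python
def move_to_spell(spells, person_months, variable):
    """Copy a person-month variable, as of each spell's start month, to the spell."""
    lookup = {(r["person"], r["month"]): r[variable] for r in person_months}
    return [
        dict(s, **{variable: lookup.get((s["person"], s["start"]))})
        for s in spells
    ]


# Hypothetical inputs.
person_months = [
    {"person": 1, "month": 3, "age": 29},
    {"person": 1, "month": 4, "age": 30},
    {"person": 1, "month": 5, "age": 30},
]
spells = [{"person": 1, "start": 4, "end": 5, "duration": 2}]
spells_with_age = move_to_spell(spells, person_months, "age")
```

Other conventions, such as the value in the month before the spell begins or an average over the spell, would change only which key is looked up.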
Duration Analysis
Duration Analysis: Results
Analyzing longitudinal datasets requires three steps:
Finding the appropriate information
Restructuring it for longitudinal analysis
Performing the analysis and examining the results
All of these are hard to do using analysis packages such as
SAS, Stata or SPSS
The goal of the Orlin System is to simplify all three steps
We link and provide search capabilities across data and metadata
We use database technology to keep track of both data and
metadata, cross-sectionally and over time
We provide easy-to-use templates to guide the analyst through the
entire process
If you are interested in learning more or becoming a beta user,
see our website,, or contact us
Thank You!