The Case of the NPHS in Canada.

advertisement
Functional Databases for
Longitudinal Analyses
and Tips of the Trade:
The Case of the NPHS in Canada.
Amélie Quesnel-Vallée
McGill University
Émilie Renahy
University of Toronto
Data matrix structures:
“wide” and “long” formats
• Wide format
• Long format
Source: http://www.ats.ucla.edu/stat/stata/modules/reshapel.htm
Preparing data for longitudinal analyses
• One basic, common variable naming rule for
reshaping from wide to long
• Marker for time of data collection (cycle,
calendar year, etc) is:
– a numerical stub,
– at the end of the variable name
– Ex: VARNAME2012
The National Population Health Survey
• “In the fall of 1991, the National Health Information
Council recommended that an ongoing national survey
of population health be conducted.”
– Motivated by “economic and fiscal pressures on the health
care systems and the requirement for information with
which to improve the health status of the population in
Canada.”
• In 1992, Statistics Canada received funding to carry out
the NPHS
• It is composed of three components: the Households,
the Health Institutions, and the North components.
Source: http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=3225&lang=en&db=imdb&adm=8&dis=2
The Longitudinal Household
Component of the NPHS
• Biennial, from 1994/95-2010/11 (9 cycles)
• n=17,276 for the longitudinal household
component (69.7% response rate in cycle 9)
• Multistage, stratified random sampling, designed
to ensure adequate representation across major
urban centers, smaller towns, and rural areas in
all provinces.
• People living in Native reserves, military bases,
institutions, and some remote areas of Ontario
and Québec were excluded.
Source: http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=3225&lang=en&db=imdb&adm=8&dis=2
Preparing NPHS data for
longitudinal analyses
• NPHS variable naming rules: xxxCYCLEzzzz, where
– xxx refers to the questionnaire section
– CYCLE refers to the data collection cycle
– zzzz refers to the specific question
• Two idiosyncratic challenges:
– Location: cycle is positioned in the middle
– Identifier: One digit, either a number or a letter,
depending on the period of data collection
• From 1994 to 2002, numbers are used (4, 6, 8, 0, or 2 respectively)
• From 2004-2010, letters (A-D) are used because numbers would
not have provided unique cycle identifiers
Solution: Development of a SAS macro
• Two options:
– User-specific list of variables: Recommended!
– Full data matrix: Time consuming and prone to
errors with time-invariant variables in long format
• Available in both official languages
• To be made available to RDC users across
Canada
Using the package, easy as 1, 2, 3
1. Read important comments and warning
For instance, if the variable was not measured in a given cycle, the
macro will create a variable with all missing values
2. Replace all XXX by the relevant info.
Hint: use the 'Find' option (Ctrl+F) to find them all!
3. Run the macro in SAS: Select all (Ctrl+A) then click on
the menu Run \ Submit (or F3 button).
-> Three pairs of wide and long format datasets will be
created, allowing the use of any statistical software:
• 2 SAS dataset
• 2 Comma Separated Values (.cvs)
• 2 Tab Delimited File (.txt)
WARNING!
It is the researcher's responsibility to verify:
1. Whether the question was asked in all cycles
2. Whether the response categories were the same
across all cycles
To this end, consult the NPHS documentation.
Summarizing longitudinal information
• Using egen in Stata on a wide matrix
– anycount: Count the number of events (e.g. poor
health) experienced by a respondent over time
– anymatch: Detect presence or absence of event
over a time period
– concat: Creates a summary “trajectory” of events
for an individual over a time period.
Source: http://www.stata.com/help.cgi?egen
WARNING
• Missing values are often turned into “0” in
egen
• Always declare missing values on created
variables
Row* commands in egen
• rowmiss: Gives the number of missing values in
varlist for each observation (row).
• rownonmiss: Gives the number of nonmissing
values in varlist for each observation (row) -- this
is the value used by rowmean() for the
denominator in the mean calculation.
• rowmean, rowmedian, rowmax, rowmin:
Respectively creates the (row) means, medians,
max and min of the variables in varlist, ignoring
missing values.
Acknowledgements
Download