Functional Databases for Longitudinal Analyses and Tips of the Trade: The Case of the NPHS in Canada

Amélie Quesnel-Vallée, McGill University
Émilie Renahy, University of Toronto

Data matrix structures: “wide” and “long” formats
• Wide format
• Long format
Source: http://www.ats.ucla.edu/stat/stata/modules/reshapel.htm

Preparing data for longitudinal analyses
• One basic, common variable naming rule for reshaping from wide to long
• Marker for time of data collection (cycle, calendar year, etc.) is:
  – a numerical stub,
  – at the end of the variable name
  – Ex: VARNAME2012

The National Population Health Survey
• “In the fall of 1991, the National Health Information Council recommended that an ongoing national survey of population health be conducted.”
  – Motivated by “economic and fiscal pressures on the health care systems and the requirement for information with which to improve the health status of the population in Canada.”
• In 1992, Statistics Canada received funding to carry out the NPHS
• It is composed of three components: the Households, the Health Institutions, and the North components.
Source: http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=3225&lang=en&db=imdb&adm=8&dis=2

The Longitudinal Household Component of the NPHS
• Biennial, from 1994/95 to 2010/11 (9 cycles)
• n=17,276 for the longitudinal household component (69.7% response rate in cycle 9)
• Multistage, stratified random sampling, designed to ensure adequate representation across major urban centers, smaller towns, and rural areas in all provinces.
• People living in Native reserves, military bases, institutions, and some remote areas of Ontario and Québec were excluded.
Source: http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=3225&lang=en&db=imdb&adm=8&dis=2

Preparing NPHS data for longitudinal analyses
• NPHS variable naming rules: xxxCYCLEzzzz, where
  – xxx refers to the questionnaire section
  – CYCLE refers to the data collection cycle
  – zzzz refers to the specific question
• Two idiosyncratic challenges:
  – Location: the cycle marker sits in the middle of the variable name, not at the end
  – Identifier: a single character, either a digit or a letter, depending on the period of data collection
    • From 1994 to 2002, digits are used (4, 6, 8, 0, or 2 respectively)
    • From 2004 to 2010, letters (A to D) are used because digits would no longer have provided unique cycle identifiers (4 would denote both 1994 and 2004)

Solution: Development of a SAS macro
• Two options:
  – User-specific list of variables: Recommended!
  – Full data matrix: Time consuming and prone to errors with time-invariant variables in long format
• Available in both official languages
• To be made available to RDC users across Canada

Using the package, easy as 1, 2, 3
1. Read the important comments and warnings. For instance, if a variable was not measured in a given cycle, the macro will create a variable with all missing values for that cycle.
2. Replace all XXX by the relevant info. Hint: use the 'Find' option (Ctrl+F) to find them all!
3. Run the macro in SAS: Select all (Ctrl+A), then click on the menu Run \ Submit (or F3 button).
-> Three pairs of wide and long format datasets will be created, allowing the use of any statistical software:
• 2 SAS datasets
• 2 comma-separated values files (.csv)
• 2 tab-delimited files (.txt)

WARNING! It is the researcher's responsibility to verify:
1. Whether the question was asked in all cycles
2. Whether the response categories were the same across all cycles
To this end, consult the NPHS documentation.

Summarizing longitudinal information
• Using egen in Stata on a wide matrix
  – anycount: Count the number of events (e.g.
poor health) experienced by a respondent over time
  – anymatch: Detect the presence or absence of an event over a time period
  – concat: Create a summary “trajectory” of events for an individual over a time period
Source: http://www.stata.com/help.cgi?egen

WARNING
• Missing values are often turned into “0” in egen
• Always declare missing values on created variables

Row* commands in egen
• rowmiss: Gives the number of missing values in varlist for each observation (row).
• rownonmiss: Gives the number of nonmissing values in varlist for each observation (row); this is the value used by rowmean() for the denominator in the mean calculation.
• rowmean, rowmedian, rowmax, rowmin: Create, respectively, the (row) mean, median, maximum, and minimum of the variables in varlist, ignoring missing values.

Acknowledgements
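Supplementary sketch for the NPHS naming rules (xxxCYCLEzzzz): the renaming step that moves the cycle marker to the end of the variable name as a numeric stub can be illustrated in a few lines of Python. This is not the SAS macro itself. It assumes a three-character section prefix, and it assumes that the letters A to D map in order to 2004, 2006, 2008, and 2010. The variable names HLT4_GEN and HLTA_GEN and the helper function names are hypothetical.

```python
# Assumed mapping from the one-character cycle code to the survey year.
# Digits come from the slides; the letter-to-year order is an assumption.
CYCLE_TO_YEAR = {
    "4": 1994, "6": 1996, "8": 1998, "0": 2000, "2": 2002,
    "A": 2004, "B": 2006, "C": 2008, "D": 2010,
}


def split_nphs_name(name, section_len=3):
    """Split 'xxxCzzzz' into (stub without the cycle code, survey year)."""
    if len(name) <= section_len:
        raise ValueError(f"{name!r} is too short to contain a cycle code")
    cycle = name[section_len].upper()
    if cycle not in CYCLE_TO_YEAR:
        raise ValueError(f"{name!r} has no recognised cycle code")
    stub = name[:section_len] + name[section_len + 1:]
    return stub, CYCLE_TO_YEAR[cycle]


def rename_for_reshape(name, section_len=3):
    """Return the name with the year as a numeric stub at the end."""
    stub, year = split_nphs_name(name, section_len)
    return f"{stub}{year}"


def wide_to_long(wide_rows, id_var="ID"):
    """Turn one row per respondent into one row per respondent and year."""
    long_rows = []
    for row in wide_rows:
        by_year = {}
        for name, value in row.items():
            if name == id_var:
                continue
            stub, year = split_nphs_name(name)
            by_year.setdefault(year, {})[stub] = value
        for year in sorted(by_year):
            long_rows.append({id_var: row[id_var], "year": year, **by_year[year]})
    return long_rows


# Hypothetical wide data: one self-rated health item in 1994 and in 2004.
wide = [
    {"ID": 1, "HLT4_GEN": 3, "HLTA_GEN": 2},
    {"ID": 2, "HLT4_GEN": 1, "HLTA_GEN": None},  # None marks a missing value
]
long_rows = wide_to_long(wide)

print(rename_for_reshape("HLT4_GEN"))  # HLT_GEN1994
print(rename_for_reshape("HLTA_GEN"))  # HLT_GEN2004
for r in long_rows:
    print(r)
```

Missing values are carried through as None, not dropped, so a question that was not answered (or not asked) in a cycle remains visible in the long file. Whether the question and its response categories are comparable across cycles still has to be checked against the NPHS documentation.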
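Supplementary sketch for the egen summaries and the missing-value warning: the logic of counting events, detecting any event, building a trajectory string, and computing row statistics across cycles can be written out in plain Python. These functions only imitate the behaviour described on the slides and are not Stata's implementation. Missing values are represented by None, and the choice to return None when a respondent has no observed cycle at all is an assumption made to follow the advice to declare missing values on created variables.

```python
from statistics import mean


def rowmiss(values):
    """Number of missing values (None) in the row."""
    return sum(v is None for v in values)


def rownonmiss(values):
    """Number of nonmissing values in the row."""
    return len(values) - rowmiss(values)


def rowmean(values):
    """Mean over nonmissing values; None if nothing was observed."""
    observed = [v for v in values if v is not None]
    return mean(observed) if observed else None


def anycount(values, targets):
    """Count observed cycles whose value is in targets.

    Returns None (not 0) when every cycle is missing, so that
    'never observed' is not confused with 'never had the event'.
    """
    if rownonmiss(values) == 0:
        return None
    return sum(v in targets for v in values if v is not None)


def anymatch(values, targets):
    """1 if the event occurred in any observed cycle, 0 if not, None if unobserved."""
    count = anycount(values, targets)
    return None if count is None else int(count > 0)


def trajectory(values, missing="."):
    """Concatenate the values over time into one string, marking gaps."""
    return "".join(missing if v is None else str(v) for v in values)


# Hypothetical respondent: poor health (1 = yes, 0 = no) over five cycles.
poor_health = [0, 1, None, 1, 0]
never_observed = [None] * 5

print(rowmiss(poor_health), rownonmiss(poor_health))  # 1 4
print(rowmean(poor_health))                           # 0.5
print(anycount(poor_health, {1}))                     # 2
print(anymatch(poor_health, {1}))                     # 1
print(trajectory(poor_health))                        # 01.10
print(anycount(never_observed, {1}))                  # None
```

For partially observed respondents the count covers observed cycles only, so it is worth keeping rowmiss alongside any count or mean to see how many cycles each summary actually rests on.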