Class 1 - Katherine Keyes

Analysis of Complex Survey Data
Katherine M. Keyes
Kmk2104@columbia.edu
Purpose of this class
• Teach you how to analyze complex survey data
using SUDAAN
• Provide you with the tools to:
– 1) find datasets that fit your research interests;
– 2) download and manage those datasets;
– 3) do your own analyses
Structure of the class
• 1:00-2:00 Lecture
• 2:00-3:30 Guided exercise
• 3:30-3:45 Break
• 3:45-5:00 Independent research project
Today’s schedule
• Introduction to each other
• Key concepts in complex surveys
• Introduction to the NHANES
– Focus on describing the complexities in sample and design
weights
• PREPARING AN ANALYTIC DATASET
– Locate variables
– Download data files
– Append and merge datasets
– Clean and recode data
– Format and label variables
– Save datasets
Who am I?
Who are you?
What is ‘complex survey data’
• Complex survey data usually refers to sample
designs in which respondents have been
sampled in a way that is multi-stage, stratified,
unequally weighted, and/or clustered.
• Because of these design elements, the sample
is no longer a simple random sample, which
violates the assumptions of basic large-sample
statistics
What is ‘complex survey data’
• Because of this, we need to take into account
the design elements when estimating
standard errors.
Two types of weights commonly used
• SAMPLE WEIGHTS: adjust for oversampling of certain groups that are
typically hard to reach (e.g., young people) and for informative nonresponse
• DESIGN WEIGHTS: adjust the standard errors for the nonrandom
probability of selection into the sample
• TAKE HOME MESSAGE:
• Sample weights affect the ESTIMATES and not the STANDARD
ERRORS
• Design weights affect the STANDARD ERRORS and not the
ESTIMATES
• We need SUDAAN to incorporate the design weights.
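A minimal Python sketch of the first take-home point (Python is used here only for illustration; the class itself uses SUDAAN). The toy values and weights are hypothetical: four respondents from an oversampled group carry small weights, and two other respondents each represent many more people.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    total_weight = sum(weights)
    return sum(w * y for w, y in zip(weights, values)) / total_weight

# Hypothetical toy sample (not NHANES data).
# First four respondents: oversampled group, small sample weights.
# Last two respondents: each represents many more people.
values = [1, 1, 1, 1, 0, 0]
weights = [10, 10, 10, 10, 80, 80]

unweighted = sum(values) / len(values)    # ignores the oversampling
weighted = weighted_mean(values, weights)  # corrects for it

print(f"unweighted estimate: {unweighted:.3f}")
print(f"weighted estimate:   {weighted:.3f}")
```

The unweighted estimate overstates the prevalence because the oversampled group is overrepresented in the raw sample; the sample weights restore each respondent's share of the population.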
Design weights: what are they
• Strata: larger geographic unit
• Primary Sampling Units (PSUs): generally
single counties or groups of small counties
• Households
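To illustrate how strata and PSUs enter a standard error, here is a rough Python sketch of one common approach: Taylor linearization of a weighted mean, treating PSUs as sampled with replacement within strata. It is a sketch under those assumptions, not a description of what SUDAAN does internally, and the function name and toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def linearized_se_of_mean(y, w, strata, psu):
    """Return (weighted mean, linearized SE) using strata and PSU identifiers.

    Assumes PSUs are sampled with replacement within strata and that
    every stratum contains at least two PSUs.
    """
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Sum each respondent's linearized score up to the PSU level.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j in zip(y, w, strata, psu):
        psu_totals[h][j] += wi * (yi - mean) / total_w

    # Between-PSU variability within each stratum, summed over strata.
    variance = 0.0
    for h, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError(f"stratum {h} has fewer than two PSUs")
        z = list(totals.values())
        z_bar = sum(z) / n_h
        variance += n_h / (n_h - 1) * sum((zj - z_bar) ** 2 for zj in z)
    return mean, sqrt(variance)

# Hypothetical toy data: 2 strata, 2 PSUs per stratum, equal weights.
y = [1, 0, 1, 1, 0, 0, 1, 0]
w = [1, 1, 1, 1, 1, 1, 1, 1]
strata = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]

mean, se = linearized_se_of_mean(y, w, strata, psu)
print(f"mean = {mean:.3f}, design-based SE = {se:.4f}")
```

Note that the strata and PSU identifiers change only the standard error: the mean is the same weighted mean one would get without them.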
Introduction to the data we will be
using in this class
• National Health and Nutrition Examination
Survey
• “A program of studies designed to assess the
health and nutritional status of adults and
children in the United States. The survey is
unique in that it combines interviews with
physical examinations.”
• http://www.cdc.gov/nchs/nhanes.htm
Introduction to the data we will be
using in this class
Years        Survey name             Age range
1959-1962    NHES I                  18-79
1963-1965    NHES II                 12-17
1966-1970    NHES III                12-17
1971-1975    NHANES I                1-74
1976-1980    NHANES II               1-74
1989-1991    NHANES III Phase I      1-74
1991-1994    NHANES III Phase II     1-74
1999-2000    NHANES 99-00            0-75
2001-2002    NHANES 01-02            0-75
2003-2004    NHANES 03-04            0-75
2005-2006    NHANES 05-06            0-75
2007-2008    NHANES 07-08            0-75
2009-2010    NHANES 09-10            0-75
Domains of inquiry in the NHANES
• Demographic background
• Housing characteristics
• Smoking
• Consumer behavior
• Income
• Food security
• Tracking and tracing
• Acculturation
• Arthritis
• Audiometry
• Blood pressure
• Cardiovascular disease
• Dermatology
• Diabetes
• Dietary screener
• Dietary behavior
• Early childhood
• Health insurance
• Hospital utilization and access to care
• Immunization
• Kidney conditions
• Occupation
• Oral health
• Osteoporosis
Domains of inquiry in the NHANES
• Physical activity and physical
fitness
• Physical functioning
• Respiratory Health and
Disease
• Sleep disorders
• Weight history
• Reproductive health
• Illegal drug use
• Depression
• Alcohol use
• Pesticide use
• Bowel health
Physical exam includes measures of:
• Arthritis
• Audiometry
• Bone density (DXA)
• Anthropometry
• Oral Glucose Tolerance Test
• Oral Health
• Physician’s Exam
• Respiratory Health
Laboratory components include
measures of:
• Venipuncture
• Urine collection
• Bone mineral status
markers
• Diabetes profile
• Infectious disease
profile
• Oral HPV
• C-reactive protein
• Thyroid profile
• Standard biochemical
profile
• Kidney disease profile
• Pregnancy test
• Prostate Specific Antigen
• Nutritional biochemistries
and hematologies
• STD profile
• Blood lipids
• Environmental health
profile
DNA
• Blood samples for DNA purification were collected
from participants aged 20 years or older in survey
years 1999-2002 and 2007-2008.
• These are restricted access data
Landmark findings and public health
results
• High blood lead levels
– Lead out of gasoline
• Low folate levels
– Mandatory food fortification
• Rising levels of obesity
– Public health action plan
• Racial/ethnic disparities in Hepatitis B
– Universal vaccination of all infants and children
NHANES not for you?
• The concepts we will discuss apply to many other
publicly available datasets, and you are encouraged
to use these data for your in-class project if your
research questions are not covered in the NHANES
• Where can I find other publicly available datasets?
– ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/
SAMPLE WEIGHTING
IN THE NHANES
Design weights: variable names
• Strata: SDMVSTRA
• PSU: SDMVPSU
Sample weights in the NHANES
• If only data from the interviewed sample is used, then the
appropriate SAS variable is:
– WTINT2YR
• If data from the medical examination is used, then the appropriate
SAS variable is:
– WTMEC2YR
• Some data are only collected on sub-samples of NHANES
participants. These data are generally not publicly available or are
only released a few years after the main interview data. If you are
using data on a subsample of NHANES participants, appropriate
subsample weights must be used and they are included on any
data file where relevant.
Combining NHANES samples
• For NHANES 1999-2000, SDMVSTRA is
numbered 1 to 13; for NHANES 2001-2002
SDMVSTRA is numbered 14-28; for NHANES
2003-2004 SDMVSTRA is numbered 29-43;
etc.
• Therefore, two year NHANES cycles can be
combined without any recoding of this
variable
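A small Python sketch of why no recoding is needed: because the stratum numbers of different cycles do not overlap, stacked records keep distinct strata. The rows below are hypothetical stand-ins for NHANES records, and `strata_overlap` is an illustrative helper, not part of any NHANES tool.

```python
# Hypothetical rows standing in for records from two 2-year cycles.
cycle_1999_2000 = [
    {"cycle": "1999-2000", "SDMVSTRA": 5, "SDMVPSU": 1},
    {"cycle": "1999-2000", "SDMVSTRA": 13, "SDMVPSU": 2},
]
cycle_2001_2002 = [
    {"cycle": "2001-2002", "SDMVSTRA": 14, "SDMVPSU": 1},
    {"cycle": "2001-2002", "SDMVSTRA": 28, "SDMVPSU": 2},
]

def strata_overlap(rows_a, rows_b):
    """Return the set of stratum codes that appear in both sets of rows."""
    strata_a = {row["SDMVSTRA"] for row in rows_a}
    strata_b = {row["SDMVSTRA"] for row in rows_b}
    return strata_a & strata_b

overlap = strata_overlap(cycle_1999_2000, cycle_2001_2002)
if overlap:
    raise ValueError(f"stratum codes shared across cycles: {sorted(overlap)}")

# Safe to append: each stratum code still identifies a single stratum.
combined = cycle_1999_2000 + cycle_2001_2002
print(len(combined), "records,", len({r["SDMVSTRA"] for r in combined}), "strata")
```

A check like this is cheap insurance when appending files, since shared stratum codes would silently merge unrelated strata in the variance calculation.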
Combining NHANES samples:
1999-2006
• For the 1999-2002 and 2003-2006 survey
periods, Mexican Americans were
oversampled but non-Mexican American
Hispanics were not oversampled.
• Therefore, estimates for Hispanics that are not
Mexican Americans are generally unreliable
and should not be analyzed
• Further, estimates for ‘all Hispanics’ should
not be calculated
Combining NHANES samples:
2007-2008, 2009-2010
• The sample design of NHANES 2007-2010 is
different than the sample designs for earlier
cycles.
• Adolescents were no longer oversampled
• Non-Mexican American Hispanics were
oversampled, allowing for estimates of “all
Hispanics” (but smaller subgroups remain
unreliable).
Summary: combining samples
• The NHANES sample designs for the periods 1999-2002 and
2003-2006 were similar, such that combining data cycles
within these periods does not present any analytic issues.
• When combining with the 2007-2008 data, however, data
users should not create estimates for total Hispanics for the
2005-2008 data period.
• For non-Hispanic white, non-Hispanic black, and Mexican
American sample domains, rescaling the sample weights to
create four-year weights should be sufficient
• But users should check estimates carefully to see if the four
year estimates and sampling errors are consistent with
each set of 2 year estimates.
Reweighting the data when combining
samples
• When combining two or more 2-year cycles of the continuous
NHANES, the user must calculate new sample weights before
beginning any analysis of the data.
• A set of four year weights has already been created for the 1999-2002 data (e.g., for the MEC sample it’s WTMEC4YR).
• For four year estimates for 2001-2004, one can create a new
variable for a four year weight by assigning ½ of the 2 year weight
for 2001-2002 if the person was sampled in 2001-2002 or assigning
½ of the 2 year weight for 2003-2004 if the person was sampled in
2003-2004.
• For an estimate for the 6 years of 1999-2004, a 6-year weight
variable can be created by assigning 2/3 of the 4 year weight for
1999-2002 if the person was sampled in 1999-2002 or
assigning 1/3 of the 2 year weight for 2003-2004 if the person was
sampled in 2003-2004.
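The rescaling rules above can be sketched in Python as follows. The function names and the cycle labels are hypothetical; only WTMEC2YR and WTMEC4YR come from the slides, and the example weight values are made up.

```python
def four_year_weight_2001_2004(wtmec2yr):
    """4-year weight for 2001-2004: half of the person's 2-year weight."""
    return wtmec2yr / 2

def six_year_weight_1999_2004(cycle, wtmec4yr=None, wtmec2yr=None):
    """6-year weight for 1999-2004.

    cycle is a hypothetical label for the 2-year cycle in which the
    person was sampled: "1999-2000", "2001-2002", or "2003-2004".
    """
    if cycle in ("1999-2000", "2001-2002"):
        if wtmec4yr is None:
            raise ValueError("1999-2002 records need the 4-year weight")
        return 2 * wtmec4yr / 3   # 2/3 of WTMEC4YR
    if cycle == "2003-2004":
        if wtmec2yr is None:
            raise ValueError("2003-2004 records need the 2-year weight")
        return wtmec2yr / 3       # 1/3 of WTMEC2YR
    raise ValueError(f"cycle {cycle!r} is outside 1999-2004")

# Made-up weight values, for illustration only.
print(four_year_weight_2001_2004(24000.0))
print(six_year_weight_1999_2004("2001-2002", wtmec4yr=30000.0))
print(six_year_weight_1999_2004("2003-2004", wtmec2yr=27000.0))
```

The fractions reflect each period's share of the combined years: the 1999-2002 records cover four of the six years and the 2003-2004 records cover two, so the rescaled weights together still sum to roughly one population rather than two or three.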
LAB #1:
PREPARING AN ANALYTIC DATASET
Open the Word document “Lab 1:
Preparing an analytic dataset”