Analysis of Large Database - Georgetown

Analysis of Large
“Population-based” Databases
for Clinical Research
John Kwagyan, PhD
jkwagyan@howard.edu
Design, Biostatistics & Population Studies
Georgetown-Howard Universities Center for Clinical and Translational Science
1
…………
…………
That we are in the midst of crisis is now well understood.
Our nation is at war,………….
Our economy is badly weakened, ………..
Homes have been lost; jobs shed; businesses shuttered.
Our health care is too costly; our schools fail too many;..
These are the indicators of crisis, subject to data
and statistics.
…………
…………
…………
Pres. Barack Obama (Inaugural speech)
Sequence of Steps in a Research Project
• Conceptualization
• Planning/Design
• Execution
- Data Collection & Processing
- Data Analysis
• Interpretation
• Reporting
- Abstracts, Presentation, Publication
3
Outline
• Types, Uses & Opportunities
• National & Institutional Databases
• Access
• Analysis & Statistical Issues
4
Types, Uses & Opportunities
5
Types of Large Databases
• (Health) Survey Databases
- NHANES
• (Health) Administrative Databases
- HCUP
- Discharge & Mortality Databases
- Specialty Databases, e.g. stroke
• Clinical Trials
- AASK, ALLHAT
6
Uses of Large Databases
• Secondary Analysis
~ publications
• Pilot Data for grant proposals
• Power Exploration
• Hypothesis Generation & Testing
• Estimate of Summary Statistics
-prevalence, incidence, mortality, etc
7
Advantages of Using Large Databases
• Large sample
• Fast and (in some cases) easily accessible
• Provide population estimates
• Can test trends over time
• Observational, cross-sectional, longitudinal
8
Limitations & Challenges
• Non-Experimental: (Survey & Administrative)
• Most are cross sectional
• May require special skills
-special statistical techniques & software usage
• Statistical Issues to address
• May involve lengthy bureaucracy
- Written request or proposal
- IRB approval
• May involve fees & travel
9
Funding Opportunities
Secondary Analysis
R03, R21 mechanisms
• Obtain data collected by the parent study or by ancillary
studies to prepare a scientific manuscript for publication on a
topic (aims) that has not yet been addressed.
• Receive limited preliminary study data summaries to prepare
a proposal for funding of secondary analyses of data.
• Obtain specimens (e.g. blood, urine, imaging scans) for new
assays or analyses to be conducted using an outside funding
source.
10
11
12
Nces.ed.org/nationsreportcard/researchcenter/funding.asp
Funding Opportunities
13
National Databases
14
National Health & Nutrition Examination
Survey (NHANES): www.cdc.org/nchs/nhanes.html
• Population: Adults & children
• Method: Face-to-face interview, physical exams
• Content: Anthropometry, respiratory disease,
chronic & infectious disease, mental health & cognitive
functioning, reproductive history & sexual behavior
• Data: N ~ 5,000/yr since 1999; initiated in 1960
• Notes: Supplemental food survey, online tutorial
15
National Health Interview Survey
(NHIS) : www.cdc.org/nchs/nhis.html
• Population: Households (families), adults & children
• Method: Face-to-face interview, physical exams
• Content: Health conditions & behaviors, access to & use
of health services, genetic testing
• Data: N ~ 35,000 households (~87,500 persons); initiated in
1957
• Notes: Data used widely by the DHHS to monitor trends in
illness and disability and to track progress toward achieving
national health objectives.
16
Surveillance, Epidemiology, and End Results
(SEER): http://seer.cancer.org
• Population: Children to adults
• Method: Data collected from cancer registries that cover
~28% of the US population; follow-up with individual cases
until death
• Content: Cancer incidence, prevalence, and survival data;
limited demographics (age, race/ethnicity, region)
• Data: Cancer cases in registries, >6 million cases
• Notes: Specialized software (SEER*Stat or SEER*Prep),
downloaded from the website, is needed for analysis; a user
agreement must be signed to obtain the data.
17
Healthcare Cost & Utilization Project (HCUP)
http://www.ahrq.org/data/hcup
• Population : All ages
• Method: A family of healthcare databases and tools
• Content: Databases enable research on a broad range of
health policy issues, including cost and quality of health
services, medical practice patterns, access to health care
programs, and outcomes of treatments.
• Data: Cancer cases in registries,
• Notes: Databases are available for purchase through a
central distributor
18
African American Study of Kidney Disease &
Hypertension (AASK): www.niddkrepository.org/
• Population: Adult African Americans, 18-70 years
• Method: Participants followed for 2 years to measure the
long-term effects of blood pressure control in patients with
kidney disease attributed to high blood pressure.
• Content: BP, markers of kidney function
• Data: N = 1,094
• Notes: Largest and longest study of chronic kidney disease
in African Americans
19
CDC Wonder
wonder.cdc.org
• Wide-ranging Online Health related Datasets for
Epidemiologic Research
• Each data set can be queried using a series of
menus
• Provides an online tool for retrieving and
analyzing data
20
CDC Wonder
21
Institutional (GHUCCTS)
Databases
22
Institutional (GHUCCTS) Databases
• Obesity Project - HU
• Family Genetics Study of Prostate Cancer-HU
• HIV in DC – HU
• Memory Disorder Study - HU
• Spinal Cord Disease Database - MRI
• Stroke Database - MRI/NRH
• Brain Injury Database- MRI/NRH
• National Capital Spinal Cord Injury Model System – MRI/NRH
• Strong Heart Study- MRI
• The VA Decision Support System Database (DSS) – VA
• ……..
• ………
23
Access/Retrieval
24
Data Access/Retrieval
• May require special request or proposals
- aims, etc
-preparation of detailed analysis plans
• Understand the database structure
• Extraction of requisite data for specific objectives
• Application of appropriate linkage techniques for
multiple data sources
• Process & Storage
25
Database Structure
• Relational Structure: (1-to-1)
represented by a table of rows & columns
~ attributes are listed in columns
ID, AGE, GENDER, …
~ rows are identified by unique identifiers
• Hierarchical (Nested) Structure: (1-to-many)
allows for multiplicity of attributes while preserving
relationships
26
Data Structure

RELATIONAL
PID   Age   gender   disease_status
100   45    Male     0
101   56    female   1
102   67    female   0

WARD   PID
1      100
1      101
1      102
2      100
2      101
2      102

HIERARCHICAL/NESTED
WARD   FAMID   PID
1      1       100
1      1       101
1      2       100
1      2       101
2      1       100
2      1       101
2      2       100
2      2       101
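A minimal Python sketch of how the two layouts might be held in memory, assuming the example values shown on this slide. The helper name group_by_ward_and_family is hypothetical. The point it illustrates: in the nested layout PID is unique only within a family, so the full key is (WARD, FAMID, PID).

```python
# Relational layout: one row per person, keyed by a unique identifier (PID).
persons = {
    100: {"age": 45, "gender": "male", "disease_status": 0},
    101: {"age": 56, "gender": "female", "disease_status": 1},
    102: {"age": 67, "gender": "female", "disease_status": 0},
}

# Hierarchical (nested) layout: persons within families within wards,
# stored as flat (WARD, FAMID, PID) rows in the same order as the slide.
nested_rows = [
    (ward, famid, pid)
    for ward in (1, 2)
    for famid in (1, 2)
    for pid in (100, 101)
]


def group_by_ward_and_family(rows):
    """Rebuild the ward -> family -> [persons] tree from flat rows."""
    tree = {}
    for ward, famid, pid in rows:
        tree.setdefault(ward, {}).setdefault(famid, []).append(pid)
    return tree


tree = group_by_ward_and_family(nested_rows)
for ward, families in tree.items():
    for famid, pids in families.items():
        print(f"ward {ward}, family {famid}: persons {pids}")
```

Keeping the nesting explicit in this way is what later allows clustering to be accounted for in the analysis.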
27
Data Analysis Methods
28
Types of Data Endpoints
• Continuous Data
- BP, BMI, TC, LDL, HDL, Blood Sugar
• Categorical Data
- Hypertension, Obese, Dyslipidemia, Diabetes
• Count Data
0, 1, 2, 3
• Survival (Time-to-Event) Data
- time-to-cardiac event, time-to-death
29
Partition Data Into Subsets
Core partitioning ~ arises naturally
• Race
• Gender
• Age Group
• Geographic Region
Time partitioning
• 2000-2010
• 1995-2000; 2000-05
30
Descriptive Analysis
By Partition
• Measures of Central Tendency
- Mean, median, mode, etc.
• Rates
- Prevalence, incidence, survival, mortality
• Variability
- SD, range, IQR
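As a rough illustration of descriptive analysis by partition, the sketch below groups records by gender and reports central tendency, variability, and a prevalence rate for each subset. The records and the function name describe_by_partition are made up for the example; they do not come from any of the databases listed earlier.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (gender, systolic BP in mmHg, hypertensive 0/1).
records = [
    ("female", 118, 0), ("female", 142, 1), ("female", 126, 0),
    ("male", 135, 0), ("male", 150, 1), ("male", 128, 0), ("male", 144, 1),
]


def describe_by_partition(rows):
    """Summarize blood pressure and hypertension prevalence per gender."""
    groups = defaultdict(list)
    for gender, sbp, htn in rows:
        groups[gender].append((sbp, htn))

    summary = {}
    for gender, values in groups.items():
        sbps = [sbp for sbp, _ in values]
        summary[gender] = {
            "n": len(values),
            "mean_sbp": statistics.mean(sbps),
            "median_sbp": statistics.median(sbps),
            "sd_sbp": statistics.stdev(sbps),
            "range_sbp": max(sbps) - min(sbps),
            "prevalence_htn": sum(htn for _, htn in values) / len(values),
        }
    return summary


summary = describe_by_partition(records)
for gender, stats in summary.items():
    print(gender, stats)
```

These are unweighted summaries; for survey data the statistical weights discussed later would normally be applied.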
31
Visualization Methods
Exploratory Analysis
Apply visualization methods by subsets
• Charts
• Scatter plot matrix
~ continuous measures
• Trellis plot
~ all measures
32
33
Trellis Plot
34
Inference
Statistical Tests
The method used depends on
1. Outcome measure
Univariate
Multivariate
2. Study design
35
Continuous Data
Parametric Tests
• Paired t-test ~ noncomparative open-label studies
(pre-post studies)
• Two-sample t-test ~ comparative studies
(e.g. parallel-group designs)
• ANOVA (F-test) ~ comparing multiple groups
(e.g. parallel-group designs, factorial designs)
Non-Parametric Equivalents (in the same order)
• Wilcoxon Signed-Rank Test
• Wilcoxon Rank-Sum Test
• Kruskal-Wallis Test
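For orientation, here is a small sketch of the test statistics behind the paired and two-sample t-tests, written with the Python standard library only. It assumes equal variances for the two-sample case (the classical pooled form) and returns the statistic and degrees of freedom; p-values would come from a t distribution table or a statistics package. The function names are illustrative.

```python
import math
import statistics


def paired_t(before, after):
    """Paired t statistic and degrees of freedom for pre-post measurements."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / (nx + ny - 2)
    standard_error = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / standard_error, nx + ny - 2


# Hypothetical systolic BP values for two parallel groups.
treated = [128, 131, 125, 136, 130]
control = [138, 142, 135, 140, 145]
t_stat, df = two_sample_t(treated, control)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```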
36
Categorical Data
What is the question?
Compare rates:
prevalence, incidence, mortality!
• Chi-square Test
• McNemar Test (pre-post designs)
• Mantel-Haenszel test - heterogeneity
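A minimal sketch of the chi-square test for comparing a rate between two groups, assuming a 2x2 table of counts and no continuity correction. With one degree of freedom the p-value can be written exactly with the complementary error function, so no external library is needed. The counts in the example are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Rows are the two groups; columns are outcome present / absent.
    Returns the statistic and its p-value on 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 10 of 30 with the outcome in group 1, 20 of 30 in group 2.
chi2, p = chi_square_2x2(10, 20, 20, 10)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```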
37
Survival Data
Question?
Compare survival rates!
Survival curves, hazard ratios
• Kaplan-Meier Estimator
• Log-Rank Test
• Likelihood Ratio Test
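To make the Kaplan-Meier estimator concrete, the sketch below computes the survival curve from right-censored follow-up data: at each event time the running survival probability is multiplied by (1 - deaths / number at risk). The function name and the toy data are illustrative only.

```python
def kaplan_meier(times, events):
    """Return [(event_time, survival_probability)] for right-censored data.

    times  -- follow-up time for each subject
    events -- 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after t.
        at_risk -= removed
    return curve


# Hypothetical follow-up times (months); 0 marks a censored subject.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
for t, s in curve:
    print(f"S({t}) = {s:.3f}")
```

Censored subjects contribute to the number at risk up to their censoring time but never produce a step in the curve.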
38
Regression Methods
• used when it is necessary to adjust for
different covariate/confounding effects
Cholesterol level ~ gender, age, diet
39
Regression Methods
• Continuous Data
~ Linear Regression Models
• Categorical Data
~ Logistic Regression Models
~ Conditional Logistic Regression Models
• Survival Data
~ Proportional Hazards (Cox) Regression
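For reference, the three model families above can be written in their usual textbook forms, with covariates X_1, ..., X_p (for example gender, age, diet) and coefficients beta. The notation is an assumption of this sketch and does not appear in the slides.

```latex
\begin{align*}
\text{Linear:} \quad
  & E[Y \mid X] = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \\
\text{Logistic:} \quad
  & \log\frac{\Pr(Y = 1 \mid X)}{1 - \Pr(Y = 1 \mid X)}
    = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \\
\text{Proportional hazards:} \quad
  & h(t \mid X) = h_0(t)\,\exp(\beta_1 X_1 + \dots + \beta_p X_p)
\end{align*}
```

In each case a coefficient describes the effect of one covariate with the others held fixed, which is what "adjusting" for confounders means here.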
40
Multi-Level Models
Hierarchical (Nested) Models
• Multilevel Regression
• Mixed Effect Models
• Nested Models
-GEE
-Proc Nested
• Bayesian Approaches
41
Multivariate Methods
Used to analyze multiple outcomes jointly
Univariate: TC ~ gender, age, diet
Multivariate: [HDL, LDL, TG] ~ gender, age, diet
Risk factors: gender, age, diet
42
Multivariate Methods
• MANOVA
• Discriminant Analysis
• Factor Analysis
• Cluster Analysis
• Principal Component Analysis
43
Statistical Issues
44
Statistical Issues
• Sampling error
• Missing data
• High likelihood of finding a significant
difference due to chance alone
• Potential for biased results is substantial
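The chance-finding problem can be quantified with a short calculation. Assuming independent tests at the same significance level, the probability of at least one false positive grows quickly with the number of tests. The Bonferroni threshold shown here is one common remedy; it is not named in the slides and is included only as an illustration.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / n_tests


for k in (1, 5, 20, 100):
    print(
        f"{k:>3} tests: P(at least one false positive) = "
        f"{familywise_error_rate(k):.3f}, "
        f"Bonferroni threshold = {bonferroni_threshold(k):.4f}"
    )
```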
45
Recommendations for Health Survey
Data
Use ~
• Statistical weights
• Stratification
• Clustering
• Variance estimation
46
Use of Statistical Weights
• The statistical weight of a sampled person is the
number of people in the population that the person
represents.
• If the sampling rate is 1/1000
- Each sampled person represents 1000 people
- Each sampled person would have a sample weight of 1000
• Weights derived from
- selection probabilities
- response rates
- post-stratification adjustments (e.g. gender, education, etc.)
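The weighting idea above can be sketched in a few lines: a base weight is the inverse of the selection probability, and a weighted prevalence is the weighted share of people with the outcome. The data are hypothetical, and the sketch ignores the response-rate and post-stratification adjustments that real survey weights also include.

```python
def sample_weight(selection_probability):
    """Base weight: number of people in the population one respondent represents."""
    return 1 / selection_probability


def weighted_prevalence(outcomes, weights):
    """Weighted proportion of respondents with the outcome (coded 0/1)."""
    return sum(w * y for y, w in zip(outcomes, weights)) / sum(weights)


# Hypothetical respondents sampled at different rates.
outcomes = [1, 0, 0, 1]
weights = [1000, 1000, 3000, 5000]

print("weight at a 1/1000 sampling rate:", round(sample_weight(1 / 1000)))
print("unweighted prevalence:", sum(outcomes) / len(outcomes))
print("weighted prevalence:", weighted_prevalence(outcomes, weights))
```

In this toy example the unweighted and weighted estimates differ because respondents with the outcome represent more of the population.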
47
Stratification
• Population divided before sampling into disjoint,
exhaustive groups (strata)
- Members termed primary sampling units (PSUs)
- Independent samples are taken in each stratum
• Strata formed by similar demographic areas
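As a simple illustration of how stratum-level samples are combined, the sketch below computes a stratified mean: each stratum's sample mean is weighted by that stratum's share of the population. The population sizes and values are invented, and the function name stratified_mean is hypothetical.

```python
import statistics


def stratified_mean(strata):
    """Combine stratum sample means, weighting each by its population share.

    strata -- list of (population_size, sample_values) pairs, one per stratum
    """
    total = sum(size for size, _ in strata)
    return sum(size / total * statistics.mean(values) for size, values in strata)


# Hypothetical strata: a large stratum with low values, a small one with high values.
strata = [
    (8000, [1, 3]),
    (2000, [10, 10]),
]
print("stratified mean:", stratified_mean(strata))
```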
48
Clustering
Hierarchical (Nested) Data
• Persons residing in a small area (cluster) may have
similar characteristics
• Responses of subjects in clusters may be correlated
• Dependence between subjects inflates the
variance of estimates
• Correlation must be accounted for in the analysis
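One standard way to gauge how much clustering inflates variance is the design effect, DEFF = 1 + (m - 1) x ICC, where m is the cluster size and ICC is the intra-cluster correlation. This formula assumes equal-sized clusters and is offered as an illustration; it is not taken from the slides.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from sampling clusters of equal size."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n / design_effect(cluster_size, icc)


# Hypothetical design: 1000 subjects in clusters of 20, ICC of 0.05.
deff = design_effect(20, 0.05)
n_eff = effective_sample_size(1000, 20, 0.05)
print(f"design effect = {deff:.2f}, effective sample size = {n_eff:.0f}")
```

Even a small intra-cluster correlation can roughly halve the effective sample size when clusters are moderately large.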
49
Variance Estimation
Use appropriate variance estimation methods:
• Linearization: uses a Taylor series expansion to
estimate the variance of non-linear estimators.
Default method for most statistical programs.
• Replication methods: calculate a parameter
estimate for each replicate and combine these to
estimate the variance (jackknife, etc.).
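The replication idea can be shown with a delete-one jackknife. The sketch assumes a simple random sample of individual observations; survey software applies the same idea by dropping one PSU at a time within strata and reweighting, which this example does not attempt. The function name is illustrative.

```python
import statistics


def jackknife_variance(values, estimator=statistics.mean):
    """Delete-one jackknife variance of an estimator.

    Each replicate leaves out one observation; the spread of the replicate
    estimates around their average is scaled by (n - 1) / n.
    """
    n = len(values)
    replicates = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    center = statistics.mean(replicates)
    return (n - 1) / n * sum((r - center) ** 2 for r in replicates)


# Hypothetical measurements.
values = [2, 4, 6, 8]
print("jackknife variance of the mean:", jackknife_variance(values))
print("textbook s^2 / n:", statistics.variance(values) / len(values))
```

For the sample mean the jackknife reproduces the usual s^2 / n, which is a handy check before applying it to non-linear estimators such as ratios.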
50
Summary
• Fast and easily accessible
• Provides several uses and opportunities
• Large databases will continue to provide important
findings for clinical research
• Be mindful of statistical issues
• Use weighting, clustering or stratification when
appropriate
51
Thank you
52