Analysis of Large “Population-based” Databases for Clinical Research

John Kwagyan, PhD
jkwagyan@howard.edu
Design, Biostatistics & Population Studies
Georgetown-Howard Center Clinical Translational Science

… That we are in the midst of crisis is now well understood. Our nation is at war, … Our economy is badly weakened, … Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; … These are the indicators of crisis, subject to data and statistics. …
Pres. Barack Obama (Inaugural speech)

Sequence of Steps in a Research Project
• Conceptualization
• Planning/Design
• Execution
  - Data Collection & Processing
  - Data Analysis
• Interpretation
• Reporting
  - Abstracts, Presentation, Publication

Outline
• Types, Uses & Opportunities
• National & Institutional Databases
• Access
• Analysis & Statistical Issues

Types, Uses & Opportunities

Types of Large Databases
• (Health) Survey Databases
  - NHANES
• (Health) Administrative Databases
  - HCUP
  - Discharge & Mortality Databases
  - Specialty Databases, e.g. stroke
• Clinical trials
  - AASK, ALLHAT

Uses of Large Databases
• Secondary Analysis ~ publications
• Pilot Data for grant proposals
• Power Exploration
• Hypothesis Generation & Testing
• Estimation of Summary Statistics
  - prevalence, incidence, mortality, etc.

Advantages of Using Large Databases
• Large sample
• Fast & easily accessible (some)
• Provide population estimates
• Can test trends over time
• Observational, cross-sectional, longitudinal

Limitations & Challenges
• Non-experimental (Survey & Administrative)
• Most are cross-sectional
• May require special skills
  - special statistical techniques & software usage
• Statistical issues to address
• May involve long bureaucracy
  - Written request or proposal
  - IRB approval
• May involve a fee & travel

Funding Opportunities
Secondary Analysis: R03, R21 mechanisms
• Obtain data collected by the parent study or by ancillary studies to prepare a scientific manuscript for publication on a topic (aims) that has not yet been addressed.
• Receive limited preliminary study data summaries to prepare a proposal for funding of secondary analyses of data.
• Obtain specimens (e.g. blood, urine, imaging scans) for new assays or analyses to be conducted using an outside funding source.

Funding Opportunities
Nces.ed.org/nationsreportcard/researchcenter/funding.asp

National Databases

National Health & Nutrition Examination Survey (NHANES): www.cdc.org/nchs/nhanes.html
• Population: Adults & Children
• Method: Face-to-face interview, physical exams
• Content: Anthropometry, respiratory disease, chronic & infectious disease, mental health & cognitive functioning, reproductive history & sexual behavior
• Data: N ~5,000/yr since 1999; initiated in 1960
• Notes: Supplemental food survey, online tutorial

National Health Interview Survey (NHIS): www.cdc.org/nchs/nhis.html
• Population: Households (families), Adults & Children
• Method: Face-to-face household interview
• Content: Health conditions & behaviors, access to & use of health services, genetic testing
• Data: N ~35,000 households (~87,500 persons); initiated in 1957
• Notes: Data used widely by the DHHS to monitor trends in illness and disability and to track progress toward achieving national health objectives.

Surveillance, Epidemiology, and End Results (SEER): http://seer.cancer.org
• Population: Children to Adults
• Method: Data collected from cancer registries that cover ~28% of the US population; individual cases are followed up until death
• Content: Cancer incidence, prevalence, and survival data; limited demographics (age, race/ethnicity, region)
• Data: Cancer cases in registries, >6 million cases
• Notes: Specialized software (SEER*Stat or SEER*Prep), downloaded from the website, is needed for analysis; a user agreement must be signed to obtain the data.

Healthcare Cost & Utilization Project (HCUP): http://www.ahrq.org/data/hcup
• Population: All ages
• Method: A family of healthcare databases and tools
• Content: Databases enable research on a broad range of health policy issues, including cost and quality of health services, medical practice patterns, access to health care programs, and outcomes of treatments.
• Data: Administrative hospital discharge data
• Notes: Databases are available for purchase through a central distributor

African American Study of Kidney Disease & Hypertension (AASK): www.niddkrepository.org/
• Population: Adult African Americans, 18-70 years
• Method: Participants followed for 2 years to measure the long-term effects of blood pressure control in patients with kidney disease attributed to high blood pressure.
• Content: BP, markers of kidney function
• Data: N = 1,094
• Notes: Largest and longest study of chronic kidney disease in African Americans

CDC Wonder: wonder.cdc.org
• Wide-ranging Online Data for Epidemiologic Research
• Each data set can be queried using a series of menus
• Provides an online tool for retrieving and analyzing data

Institutional (GHUCCTS) Databases
• Obesity Project - HU
• Family Genetics Study of Prostate Cancer - HU
• HIV in DC - HU
• Memory Disorder Study - HU
• Spinal Cord Disease Database - MRI
• Stroke Database - MRI/NRH
• Brain Injury Database - MRI/NRH
• National Capital Spinal Cord Injury Model System - MRI/NRH
• Strong Heart Study - MRI
• The VA Decision Support System Database (DSS) - VA
• …
• …

Access/Retrieval

Data Access/Retrieval
• May require special requests or proposals
  - aims, etc.
  - preparation of detailed analysis plans
• Understand the database structure
• Extraction of requisite data for specific objectives
• Application of appropriate linkage techniques for multiple data sources
• Processing & Storage

Database Structure
• Relational Structure (1-to-1): represented by a table of rows & columns
  ~ attributes are listed in columns: ID, AGE, GENDER, …
  ~ unique identifiers
• Hierarchical (Nested) Structure (1-to-many): allows for multiplicity of attributes while preserving relationships

Data Structure

RELATIONAL
PID   Age   gender   disease_status
100   45    Male     0
101   56    Female   1
102   67    Female   0

WARD   PID
1      100
1      101
1      102
2      100
2      101
2      102

HIERARCHICAL/NESTED
WARD   FAMID   PID
1      1       100
1      1       101
1      2       100
1      2       101
2      1       100
2      1       101
2      2       100
2      2       101

Data Analysis Methods

Types of Data Endpoints
• Continuous Data
  - BP, BMI, TC, LDL, HDL, Blood Sugar
• Categorical Data
  - Hypertension, Obesity, Dyslipidemia, Diabetes
• Count Data
  - 0, 1, 2, 3
• Survival (Time-to-Event) Data
  - time-to-cardiac event, time-to-death

Partition Data Into Subsets
Core partitioning ~ arises naturally
• Race
• Gender
• Age Group
• Geographic Region
Time partitioning
• 2000-2010
• 1995-2000; 2000-05

Descriptive Analysis By Partition
• Measures of Central Tendency: Mean, Median, Mode, etc.
• Rates: Prevalence, Incidence, Survival, Mortality
• Variability: SD, range, IQR

Visualization Methods
Exploratory Analysis: apply visualization methods by subsets
• Charts
• Scatter plot matrix ~ continuous measures
• Trellis plot ~ all measures

Trellis Plot (figure)

Inference
Statistical Tests
The method used depends on
1. Outcome measure
   - Univariate
   - Multivariate
2. Study design

Continuous Data
Parametric Tests
• Paired t-test ~ noncomparative open-label studies (pre-post studies)
• Two-sample t-test ~ comparative studies (e.g. parallel-group designs)
• ANOVA (F-test) ~ comparing multiple groups (e.g. parallel-group designs, factorial designs)
Non-Parametric Equivalents
• Wilcoxon Signed Rank Test
• Wilcoxon Rank Sum Test
• Kruskal-Wallis Test

Categorical Data
What is the question? Compare rates: prevalence, incidence, mortality!
• Chi-square Test
• McNemar Test (pre-post designs)
• Mantel-Haenszel Test ~ heterogeneity

Survival Data
Question? Compare survival rates!
Survival curves, hazard ratios
• Kaplan-Meier Estimator
• Log-Rank Test
• Likelihood Ratio Test

Regression Methods
• Used when it is necessary to adjust for different covariate/confounding effects
  Cholesterol level ~ gender, age, diet

Regression Methods
• Continuous Data
  ~ Linear Regression Models
• Categorical Data
  ~ Logistic Regression Models
  ~ Conditional Regression Models
• Survival Data
  ~ Proportional Hazards Regression

Multi-Level Models
Hierarchical (Nested) Models
• Multilevel Regression
• Mixed Effect Models
• Nested Models
  - GEE
  - Proc Nested
• Bayesian Approaches

Multivariate Methods
Used to analyze multiple outcomes jointly
  Risk factors: gender, age, diet
  Univariate:   TC ~ gender, age, diet
  Multivariate: [HDL, LDL, TG] ~ gender, age, diet

Multivariate Methods
• MANOVA
• Discriminant Analysis
• Factor Analysis
• Cluster Analysis
• Principal Component Analysis

Statistical Issues
• Sampling error
• Missing data
• High likelihood of finding a significant difference due to chance alone
• Potential for biased results is substantial

Recommendations for Health Survey Data
Use ~
• Statistical weights
• Stratification
• Clustering
• Variance Estimation

Use of Statistical Weights
• The statistical weight of a sampled person is the number of people in the population that the person represents.
• If the sampling rate is 1/1000
  - Each sampled person represents 1000 people
  - Each sampled person would have a sample weight of 1000
• Weights are derived from
  - selection probabilities
  - response rates
  - post-stratification adjustments (e.g. gender, education, etc.)

Stratification
• Population divided before sampling into disjoint, exhaustive groups (strata)
  - Members termed primary sampling units (PSUs)
  - Independent samples are taken in each stratum
• Strata formed by similar demographic areas

Clustering
Hierarchical (Nested) Data
• Persons residing in a small area (cluster) may have similar characteristics
• Responses of subjects in clusters may be correlated
• Dependence between subjects leads to inflated variance
• Correlation must be accounted for in the analysis

Variance Estimation
Use appropriate variance estimation methods:
• Linearization: uses a Taylor series expansion to estimate the variance of non-linear estimators
  - Default method for most stats programs
• Replication methods: calculate a different parameter estimate for each replicate and combine these to estimate the variance
  - Jackknife, etc.

Summary
• Fast and easily accessible
• Provide several uses and opportunities
• Large databases will continue to provide important findings for clinical research
• Be mindful of statistical issues
• Use weighting, clustering or stratification when appropriate

Thank you
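
Worked example: weights and variance estimation

The slides on statistical weights, stratification, clustering and variance estimation can be illustrated with a small Python sketch. It is not taken from the talk: the records are made-up toy values, the function names (base_weight, weighted_mean, jackknife_variance) are hypothetical, and the replication method shown is one common choice, the delete-one-PSU jackknife within strata. Real surveys such as NHANES publish their own weights and design variables, and dedicated survey software should be preferred in practice.

```python
from collections import defaultdict
from math import sqrt

# Toy records (assumed for illustration): (stratum, psu, weight, outcome).
# outcome is 1 if the person has the condition, else 0.
SAMPLE = [
    ("A", 1, 1000, 1), ("A", 1, 1000, 0),
    ("A", 2, 1000, 0), ("A", 2, 1000, 0),
    ("B", 1, 500, 1), ("B", 1, 500, 1),
    ("B", 2, 500, 0), ("B", 2, 500, 1),
]


def base_weight(selection_probability):
    """Number of people one sampled person represents: 1 / P(selected)."""
    return 1.0 / selection_probability


def weighted_mean(records):
    """Weighted prevalence: sum(w * y) / sum(w)."""
    total_weight = sum(w for _, _, w, _ in records)
    return sum(w * y for _, _, w, y in records) / total_weight


def jackknife_variance(records):
    """Delete-one-PSU jackknife variance of the weighted mean, within strata."""
    full_estimate = weighted_mean(records)

    # Collect the distinct PSUs that belong to each stratum.
    psus_by_stratum = defaultdict(set)
    for stratum, psu, _, _ in records:
        psus_by_stratum[stratum].add(psu)

    variance = 0.0
    for stratum, psus in psus_by_stratum.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        rescale = n_h / (n_h - 1)  # remaining PSUs stand in for the dropped one
        squared_deviations = 0.0
        for dropped in psus:
            replicate = []
            for s, p, w, y in records:
                if s == stratum and p == dropped:
                    continue  # leave this PSU out of the replicate
                if s == stratum:
                    w = w * rescale
                replicate.append((s, p, w, y))
            squared_deviations += (weighted_mean(replicate) - full_estimate) ** 2
        variance += (n_h - 1) / n_h * squared_deviations
    return variance


if __name__ == "__main__":
    unweighted = sum(y for _, _, _, y in SAMPLE) / len(SAMPLE)
    estimate = weighted_mean(SAMPLE)
    standard_error = sqrt(jackknife_variance(SAMPLE))
    print("weight for sampling rate 1/1000:", base_weight(1 / 1000))
    print("unweighted prevalence:", round(unweighted, 4))
    print("weighted prevalence:", round(estimate, 4))
    print("jackknife standard error:", round(standard_error, 4))
```

The weighted and unweighted prevalences differ because stratum A is sampled at a lower rate and therefore carries larger weights. Treating the eight records as a simple random sample would ignore both the weights and the within-PSU correlation described on the Clustering slide.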