Herding Ponies: How big data methods facilitate collaborative analytics

Changes in Outcomes Research
• New monikers…
  • Patient Centered Outcomes Research
  • Health Services Research
  • Comparative Effectiveness Research
  • Safety and Surveillance
• Changes in funding agencies
  • PCORI - AHRQ
  • FDA – CMS
  • NIH
• Changes in research models
  • More multi-site studies
  • Larger “center-based” studies
  • Greater interest in Patient Generated Data
  • Greater interest in EHR-based data
  • Less interest in claims

Collaboration Frameworks From Other Disciplines
• Open Science Grid (OSG): physics, nanotechnology, structural biology
  • 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010
• LIGO: physics/astrophysics
  • Established practices and metadata standards
  • 1 PB data in last science run, distributed worldwide
• ESGF
  • 1.2 PB climate data delivered to 23,000 users; 600+ pubs
• Collage – Executable papers: computer science

“Why hasn’t Outcomes Research adopted collaborative methods used in physics, climate science, and genomics?”
- Everyone in data-driven research

Adapting to Collaborative Science
1. Healthcare data are not collected for research
   • Not standardized
   • Not complete
2. Privacy protection has legal and ethical implications
3. Data is an asset
4. Data sharing is not incentivized or supported by journals, funding agencies, or the business of healthcare
   • Obtaining consent is expensive
   • Data hoarding is rewarded and conservative

Are Federated Research Networks the solution?
In federated models, data are not centralized. AHRQ and PCORI have invested heavily in this approach.
5. Each data holder independently assumes responsibility for “data wrangling” and standardization
6. Requires distributed analysis as opposed to traditional central data pooling and analysis
   • If data are simply used to independently estimate one model per site, the value added for causal inference is similar to that of a meta-analysis
7. Requires greater levels of coordination of governance, standards, software, and policies
8. High barriers to entry – what is the ROI?

Federated Meta-Analysis vs. Distributed Analysis
Meta-analysis
• One independently estimated model for each node in the network
• Not iterative
Distributed Analysis
• One jointly estimated model using data from all sites
• Typically iterative
• Leverages computational power of the entire network

[Figure: Parallel Meta-analysis (Independently Estimated Results). A query portal sends an analysis program to Data Site 1 (100 patients) and Data Site 2 (50 patients). Each site returns results from a model fit to its own patients only.]

[Figure: Parallel Distributed Analytics (Jointly Estimated Results). A query portal and aggregator sends an iterative analysis program to Data Site 1 (100 patients) and Data Site 2 (50 patients). Each site returns intermediate statistics, which are combined into a converged estimate: one model fit to 150 patients.]

What does this have to do with “big data?”
Two (of 8) barriers to collaborative data science are solved with “Big Data” methods:
• Privacy protection has legal and ethical implications
• If data are simply used to independently estimate one model per site, the value added for causal inference is similar to that of a meta-analysis
Bonus – specialized software or hardware like SAS and CMS repositories can be replaced with parallelized systems

Parallel Evolution of Distributed Computing and Federated Research Networks
[Timeline figure with axis marks at 1993, 1998, 2003, 2008, and 2013. Events shown:]
• AHRQ Distributed Research Network projects launched
• caGrid
• HMO Research Network adopts standard model
• First map-reduce paper from Google
• Cluster computing
• "The Grid" published
• Statistical Query Model introduced
• FDA MiniSentinel launched
• Peer-to-peer networks (Napster)
• PCORnet launched
• R-volution for Hadoop
• Apache Spark project launched
• MADlib in-database analytics
• Amazon EC2
• GLORE published

“Big Data” Analytics vs. Outcomes Research Analytics
Feature | “Big Data” in Distributed Environments | Outcomes Research in Federated Research Networks
Analysis questions | Patterns; predictions; classification | Causal inference; predictions; hypothesis testing
Data distribution | Data can be randomly distributed across processors by a master | Data are non-randomly anchored to sites
# Nodes on network | 100s or more | 10s
Data governance constraints between network nodes | Typically none or low | Typically very high
Data set size | Very large | Relatively small
Query distribution platforms | Apache Spark; Hadoop Map-Reduce; Apache Pig | SHRINE; PopMedNet; TRIAD
Common analytic platforms | R-Volution/R-Hadoop; Apache Mahout; Spark Machine Learning Lib; Spark GraphX Lib | R; SAS; Stata
Size of developer community | 1000s | Dozens

“Big-Data” Methods are Incidentally “Privacy Preserving”
Feature | Clinical Research Rationale | “Big Data” Rationale
Federation in the form of multiple networked nodes or processing cores | Multiple independently operating data partners | Inefficient to rely on a single very powerful processor or specialized hardware
Distributed computation across networked nodes (instead of central pooling of data) | Transferring patient-level data incurs re-identification risks | Inefficient to transfer large data sets across the network

Distributed Computing Frameworks
• Grid Computing Architectures
• Statistical Query Oracle
  • Mostly an academic effort
• Hadoop
  • Open-source implementation of Google's map-reduce model
  • Hundreds of developers
  • 591 active projects and organizations
• Apache Spark
  • Berkeley Computer Science answer to Hadoop
  • Most rapidly growing user base
  • 99 active projects and organizations

Collaboration Frameworks In Outcomes Research
• SHRINE for i2b2
• PopMedNet for MiniSentinel, PCORnet
• TRIAD for caGrid, SAFTINet DRN

What distributed methods in the standard biostats toolbox are already supported in “Big Data” vs. Clinical Frameworks?
Algorithm/Method, with support marked (X) across three columns: Apache Spark Libraries; Map-Reduce MultiCore or RHadoop; Federated Clinical Research Networks
• Linear regression (weighted): X X
• Logistic regression: X X
• Cox Proportional Hazard: X
• Naïve Bayes: X X
• Gaussian Discriminative Analysis: X X X
• Neural Network Backpropagation
• Matrix Factorization: X X
• PCA: * X
• ICA: * X
• Support Vector Machine: X X
• Generalized Linear Models
• K-means: X X
• Expectation Maximization: X
• Random Forest Classifier: X X

No Longer a Technical Challenge
We have the tools we need to overcome privacy and liability concerns.
Now we “only” need to change culture.

Moving Collaborative Outcomes Science Forward
• Policies (aka incentives)
  • Payer-driven incentives for better data hygiene and standardization
  • Payer incentives for sharing
  • Funding agency incentives for collaborative data management vs. data hoarding
  • Journal incentives
  • HIPAA clarification
• Infrastructure
  • As a community, adopt existing easy-to-use, flexible platforms for sharing code and data
  • Link clinical data and patient device infrastructure to research infrastructure
• Culture
  • Clinician demand
  • Patient demand
  • Tenure and promotion transformation
  • Replace “not invented here syndrome” with collective credit and shared efficiencies
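
Appendix sketch: jointly estimated model from intermediate statistics

The contrast drawn earlier between federated meta-analysis and distributed analysis can be illustrated with a small sketch. It is not the GLORE implementation or any network's actual software. It assumes a two-parameter logistic regression (intercept plus one covariate), simulated data for two hypothetical sites of 100 and 50 patients, and illustrative function names (`simulate_site`, `site_statistics`, `fit_distributed`). Each site returns only its gradient and information matrix; the aggregator sums them and iterates Newton-Raphson steps until the estimate converges.

```python
import math
import random


def simulate_site(n_patients, seed, true_beta=(-0.5, 1.0)):
    """Simulate one site's patient-level data (hypothetical, illustration only)."""
    rng = random.Random(seed)
    rows, outcomes = [], []
    for _ in range(n_patients):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(true_beta[0] + true_beta[1] * x)))
        rows.append((1.0, x))  # intercept term and one covariate
        outcomes.append(1 if rng.random() < p else 0)
    return rows, outcomes


def site_statistics(rows, outcomes, beta):
    """Runs at a site: returns aggregate statistics only, never patient rows."""
    grad = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(rows, outcomes):
        p = 1.0 / (1.0 + math.exp(-(beta[0] * x[0] + beta[1] * x[1])))
        for j in range(2):
            grad[j] += (y - p) * x[j]
            for k in range(2):
                info[j][k] += p * (1.0 - p) * x[j] * x[k]
    return grad, info


def fit_distributed(sites, tol=1e-10, max_iter=50):
    """Aggregator: sum site statistics, take Newton-Raphson steps to convergence."""
    beta = [0.0, 0.0]
    for _ in range(max_iter):
        grad = [0.0, 0.0]
        info = [[0.0, 0.0], [0.0, 0.0]]
        for rows, outcomes in sites:
            g, h = site_statistics(rows, outcomes, beta)
            for j in range(2):
                grad[j] += g[j]
                for k in range(2):
                    info[j][k] += h[j][k]
        # Solve the 2x2 system info * step = grad by Cramer's rule.
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        step = [
            (info[1][1] * grad[0] - info[0][1] * grad[1]) / det,
            (info[0][0] * grad[1] - info[1][0] * grad[0]) / det,
        ]
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(s) for s in step) < tol:
            break
    return beta


site1 = simulate_site(100, seed=1)  # Data Site 1: 100 patients
site2 = simulate_site(50, seed=2)   # Data Site 2: 50 patients

# Distributed analysis: one jointly estimated model from both sites.
joint = fit_distributed([site1, site2])
# Reference: the same model fit to centrally pooled data (150 patients).
pooled = fit_distributed([(site1[0] + site2[0], site1[1] + site2[1])])
# Meta-analysis starting point: one independently estimated model per site.
per_site = [fit_distributed([site1]), fit_distributed([site2])]

print("joint   :", joint)
print("pooled  :", pooled)
print("per site:", per_site)
```

Because the gradient and information matrix are sums over patients, adding the site-level statistics gives the same Newton-Raphson updates as pooling the data, so the joint estimate should agree with the pooled fit up to floating-point rounding. The per-site fits, by contrast, would still need to be combined after the fact, as in a meta-analysis.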