Biomedical Big Data Initiative (BD2K)
  Vivien Bonazzi Ph.D.
  Program Director: Computational Biology (NHGRI)
  Co-Chair: Software Methods & Systems (BD2K)

Myriad Data Types
  Genomic
  Other ‘Omic
  Imaging
  Phenotypic
  Exposure
  Clinical

Data and Informatics Working Group
  acd.od.nih.gov/diwg.htm

What Are the Big Problems to Solve?
  1. Locating the data
  2. Getting access to the data
  3. Extending policies and practices for data sharing
  4. Organizing, managing, and processing biomedical Big Data
  5. Developing new methods for analyzing biomedical Big Data
  6. Training researchers who can use biomedical Big Data effectively

Overarching Strategy and Goals
  Two initiatives being proposed to overcome roadblocks:
  Big Data to Knowledge (BD2K): enable the biomedical research enterprise to maximize the value of biomedical data
  InfrastructurePlus: create an adaptive environment at NIH to sustain world-class biomedical research

Big Data to Knowledge (BD2K): Overview
  Major trans-NIH initiative addressing an NIH imperative and key roadblock
  Aims to be catalytic and synergistic
  Overarching goal: By the end of this decade, enable a quantum leap in the ability of the biomedical research enterprise to maximize the value of the growing volume and complexity of biomedical data

BD2K: Four Programmatic Areas
  I. Facilitating Broad Use of Biomedical Big Data
  II. Developing and Disseminating Analysis Methods and Software for Biomedical Big Data
  III. Enhancing Training for Biomedical Big Data
  IV. Establishing Centers of Excellence for Biomedical Big Data

Area 1: Data Sharing & Access
  Facilitating usage and sharing of biomedical big data
  New Policies to Encourage Data & Software Sharing
  Index of Research Datasets to Facilitate Data Location & Citation
  Community-based Development of Data & Metadata Standards
  A. Policies to Facilitate Data Sharing
  B. Data Catalog: Data Discovery, Citation, Links to Literature
  C. Frameworks for Community-Based Solutions to Developing Data Standards
  D. Enabling Research Use of Clinical Data

Area 2: Software and Systems Development
  Development of analysis methods and software
  Software to Meet Needs of the Biomedical Research Community
  Facilitating Data Analysis: Access to Large-scale Computing
  Dynamic Community Engagement of Users and Developers
  A. Grants for software development
  B. Software Registry: Making biomedical software findable and citable
  C. Cloud computing: Facilitating Data Analysis
  D. Dynamic Social Engagement via social media

Area 2: Software and Systems Development
  Software Grants
  Current and emerging needs for using, managing, and analyzing the larger and more complex data sets inherent to biomedical Big Data:
  Compression/Reduction
  Visualization
  Provenance
  Data Wrangling

Area 2: Software and Systems Development
  Big Data needs Big Computing
  Cloud Computing
    Leveraging the cloud
    Storing and analyzing huge data sets
    Collaborative environment
    Developing appropriate policies for use of controlled access data in the cloud (dbGaP)
    Developing working relationships with major cloud providers: AWS, Google, Microsoft (Azure)
  HPC
    More exploration with supercomputing facilities

Area 3: Training
  Enhancing computational training
  Increase Number of Computationally Skilled Trainees
  Strengthen the Quantitative Skills of All Researchers
  Enhance NIH Review and Program Oversight

Area 4: Centers
  Establishing centers of excellence
  Collaborative environments & technologies
  Data integration
  Analysis & modeling methods
  Computer science & statistical approaches
  A. Investigator-initiated Centers
  B. NIH-specified Centers

Big Data to Knowledge (BD2K)
  bd2k.nih.gov

Biomedical Research as Part of the Digital Enterprise
  Philip E. Bourne Ph.D.
  Associate Director for Data Science
  National Institutes of Health

Myriad Data Types
  Genomic
  Other ‘Omic
  Imaging
  Phenotypic
  Exposure
  Clinical

Components of The Academic Digital Enterprise
  Consists of digital assets
    E.g. datasets, papers, software, lab notes
  Each asset is uniquely identified and has provenance, including access control
    E.g. publishing simply involves changing the access control
  Digital assets are interoperable across the enterprise

Let’s Break Down the Silos
  New policies, regulations (e.g. data sharing)
  Economic drivers
  The promise of shared data

The NIH is Starting to Think About the Digital Enterprise
  Big Data to Knowledge (BD2K)
  bd2k.nih.gov
  This is great, but BD2K is just a start. What will the end product look like?
  To get to that end point we have to consider the complete research lifecycle

The Research Life Cycle will Persist
  IDEAS - HYPOTHESES - EXPERIMENTS - DATA - ANALYSIS - COMPREHENSION - DISSEMINATION

Tools and Resources Will Continue To Be Developed
  Authoring Tools
  Lab Notebooks
  Data Capture
  Analysis Tools
  Software
  Scholarly Communication
  Visualization

Those Elements of the Research Life Cycle will Become More Interconnected Around a Common Framework

New/Extended Support Structures Will Emerge
  Commercial & Public Tools
  Discipline-Based Metadata Standards
  Community Portals
  Git-like Resources By Discipline
  Training
  Institutional Repositories
  Commercial Repositories
  Data Journals
  New Reward Systems

bonazziv@nih.gov
Thank You
Questions?