Advanced Molecular Detection
Duncan MacCannell, PhD
Georgia Tech / CDC Collaborates
March 12th, 2014
National Center for Emerging and Zoonotic Infectious Diseases
Office of the Director

Roche 454 PTP plate, Ion Torrent 314, Pacific BioSciences SMRTcells (x 3)
Devices and brand names provided for illustrative purposes only. Their use does not imply endorsement by CDC or HHS.

Data Acquisition/Analysis Challenges
VOLUME OF RAW DATA
For PulseNet USA alone:
• >70,000 samples/year x 2 to 3 GB raw sequence + 5-10 GB intermediate
• ~0.9 petabytes of raw data/year
• Transmission and storage?
• Is better data compression the answer?
• Distributed processing and extraction?
• Is full WGS the right approach for large-scale surveillance?
Any solution must balance the advantages of WGS with the costs of implementation.

Library, Data, Info.
Input: DNA/RNA
  Source: Genomic, Amplicon, Whole sample
NGS
  Workflow: Platforms, Chemistry, Perf. char., Labor/TaT, Cost
Bioinformatics
  Workflow: Hardware/software, Specialized skillsets, Algorithms/pipelines, Pathogen databases
Data analysis/interpret/Integration/visualization
  Host/vector/pathogen/environment …

Increasingly Universal Workflows
A Moving Target
• Established sequencing workflows for a wide range of pathogens.
• Rapidly evolving technology space.
• Changing hardware and COTS/OSS capabilities.
• Lots of choice, but lack of consistent standards.
• BIG DATA.
• New workforce and skillsets are required.
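The PulseNet USA volume estimate given earlier can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope calculation: it assumes exactly 70,000 samples per year (the slide says ">70,000") and decimal units (1 PB = 1,000,000 GB). The function and constant names are illustrative and do not come from the original slides.

```python
# Back-of-envelope check of the PulseNet USA data volume estimate.
# Assumptions (not stated on the slide): exactly 70,000 samples/year
# and decimal units, i.e. 1 PB = 1,000,000 GB.

SAMPLES_PER_YEAR = 70_000      # slide: ">70,000 samples/year"
RAW_GB = (2, 3)                # raw sequence per sample, GB (low, high)
INTERMEDIATE_GB = (5, 10)      # intermediate files per sample, GB (low, high)
GB_PER_PB = 1_000_000          # decimal petabyte (assumption)


def annual_volume_pb(per_sample_gb: float, samples: int = SAMPLES_PER_YEAR) -> float:
    """Yearly volume in petabytes for a given per-sample footprint in GB."""
    return per_sample_gb * samples / GB_PER_PB


def estimate() -> dict:
    """Return (low, high) yearly volumes in PB, with and without intermediates."""
    raw_low, raw_high = RAW_GB
    mid_low, mid_high = INTERMEDIATE_GB
    return {
        "raw_only_pb": (
            annual_volume_pb(raw_low),
            annual_volume_pb(raw_high),
        ),
        "raw_plus_intermediate_pb": (
            annual_volume_pb(raw_low + mid_low),
            annual_volume_pb(raw_high + mid_high),
        ),
    }


if __name__ == "__main__":
    for label, (low, high) in estimate().items():
        print(f"{label}: {low:.2f} to {high:.2f} PB/year")
```

Under these assumptions, the ~0.9 PB/year figure corresponds to the upper end of the range when intermediate files are counted together with raw sequence (13 GB per sample). Raw sequence alone would come to roughly 0.14 to 0.21 PB/year.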
Sample intake, Conversion, Prep/staging, Library prep, Extraction, Sequencing
File hashes/versioning, QA/QC, Reporting, Validated methods/databases, Skills/proficiency, Process logging/audit, Standards, Security
Pathogen- and application-specific, CLIA-compliant assays

ACAATTTGTGCATAACATGTGGACAGTTTTAATCACATGTGGGTAAATAGTTGTCCACATTTGCTTTTTT
TGTCGAAAACCCTATCTCATATACAAACGACGTTTTTAGGTTTTAAAATACGTTTCGTATAAATATACAT
TTTATATTTATTAGGTTGTACATTTGTTGCGCAACCTTATTCTTTTACCATCTTAGTAAAGGAGGGACAC
CTTTGGAAAATATCTCTGATTTATGGAATAGTGCCTTAAAAGAATTAGAAAAAAAGGTAAGCAAGCCTAG
TTATGAAACATGGTTAAAATCAACAACGGCTCATAACTTGAAGAAAGACGTATTAACGATTACAGCTCCA
AATGAATTTGCTCGTGACTGGCTAGAATCTCATTACTCAGAACTTATTTCGGAAACACTATACGATTTAA
CAGGGGCAAAATTAGCAATTCGCTTTATTATTCCCCAAAGTCAATCGGAAGAGGACATTGATCTTCCTCC
AGTTAAGCGGAATCCAGCACAAGATGATTCAGCTCATTTACCACAGAGCATGTTAAATCCAAAATATACA
TTTGATACATTTGTTATCGGCTCTGGTAACCGTTTTGCCCATGCAGCTTCATTAGCTGTAGCCGAGGCGC
CAGCTAAAGCGTATAATCCACTCTTTATTTATGGGGGAGTTGGGCTTGGAAAGACGCATTTAATGCACGC
AATTGGTCATTATGTAATTGAACATAATCCAAATGCAAAAGTTGTATATTTATCATCAGAAAAATTCACG
AATGAATTTATTAACTCTATTCGTGATAATAAAGCTGTTGATTTTCGTAATAAATATCGCAACGTAGATG

Output: Information From Sequence Data
Comparative Genomics
• HR strain typing/subtyping
• Cluster identification
• Molecular evolution
Genotypic characterization
• Virulence, AR, signatures
• Functional annotation
• Diagnostic dev/validation
Metagenomics
• Pathogen identification/discovery
• Culture-independent diagnostics
• Microbial ecology/diversity
….

Objective, “Future-Proof” Data
• Intrinsic quality metrics.
• Ability to back-test retrospective sequence data in silico for genes/markers identified at a future date.
MANY RESULTS POSSIBLE FROM A SINGLE DATASET!

WGS and Pathogen Genomics: Advantages
It’s universal…
• DNA/RNA sequencing workflows and approaches can be applied to a wide range of pathogenic organisms.
It’s fundamental…
• Genomics is a cornerstone for other “omic” approaches.
• Sequence databases are a starting point for assay development/validation.
It’s objective…
• Sequence-based methods avoid the subjectivity of phenotypic or fragment-based approaches.
• Volume of data provides internal controls.
It’s (relatively) future proof…
• Comprehensive sequencing captures the features you know about, and those you don’t.
• Quality may change, but the sequence will not. This makes it possible to back-test future approaches/targets on the data you collect today.

WGS and Genomic Epidemiology: Limitations
It lacks standardization…
• WGS is a rapidly evolving technology space, both in terms of sequencing and analytics.
• Standards and mechanisms for data/metadata analysis, storage and exchange remain under active debate and development.
Comprehensive databases are still being built…
• Without a useful baseline understanding of pathogen features/diversity, interpretation may be limited.
• Need curated and comprehensive epi-linked reference databases.
Many analyses require specialized bioinformatics infrastructure and staff.
• Bioinformaticists, DBAs, programmers, system administrators, etc.
• Technical and computational complexity of tasks can vary widely.
• Data management, retention and release. Storage. LIMS.

Advanced Molecular Detection
Proposed $30M FY2014 budget request to:
1. Improve pathogen identification and detection
   Outcome: Rapid progress toward modernizing PulseNet and other critical lab-based surveillance systems
2. Adapt new diagnostics to meet evolving public health needs
   Outcome: Enhance CDC’s ability to detect outbreaks early, develop new tests during outbreaks, and better characterize infectious disease threats
3. Help states meet future reference testing needs in a coordinated manner
   Outcome: More effective and better integrated outbreak response activities
4. Implement enhanced, sustainable, and integrated laboratory information systems
   Outcome: Labs inside and outside CDC can share information quickly and seamlessly, including with other CDC databases, such as MicrobeNet and PulseNet
5. Develop prediction, modeling, and early recognition tools
   Outcome: Better equipped to prevent, detect & respond to infectious diseases.

[Diagram: EPI, NGS, BIOINFO, AMD]

AMD Initiative: Strategic Investments (1)
Scientific Infrastructure: Critical laboratory and bioinformatics infrastructure at CDC, state/local PHL, and key overseas laboratories.
• Sequencers, mass-spec, other instrumentation, reagents.
• High performance computing, workstations.
• Data storage, networking; data integration, knowledge management.
• Service contracts, software licensing, etc.

AMD Initiative: Strategic Investments (2)
Workforce development:
• Training for CDC and PHL staff (bioinformatics, genomics, -omics)
• New or re-tooled fellowship programs (bioinformatics, genomics)
• Recruitment of new staff and skillsets (bioinformaticians, data scientists, lab specialists, …)

AMD Initiative: Strategic Investments (3)
Consortia, partnerships and alignment of efforts
• Academic institutions
• State, Federal (NIH, FDA, DHS, DoD, DoE/National Laboratories)
• Non-Profit/NGO
• International community
• Commercial/For-Profit
• Clinical laboratories
Pilot projects with state/local and other partners.
• Outbreak detection, investigation and response
• Leverage existing laboratory-based surveillance systems

Challenges and Opportunities for CDC/GT
Training and workforce development.
• Development of wet bench and bioinformatics curriculum for public health audiences.
• Scientific exchanges. Fellowship programs.
• MOOC-style coursework and training modules for PHL.
Bioinformatic challenges.
• Analysis and visualization of complex structured and unstructured data.
• Epi/lab integration. Dashboards/decision support.
• Development and standardization of deployable, CLIA-compatible bioinformatics workflows.
• Fieldable/portable systems.
• Machine learning and other approaches for genotypic prediction of complex microbial phenotypes (e.g., antimicrobial resistance)
• Approaches to address CIDT, e.g., accelerated metagenomic classification, lab/bioinformatics approaches for complex sample matrices.
• Tools for rapid assay design and validation from HTS data
• Hardware-accelerated algorithms, scalable HPC (+NoSQL/Hadoop)
• …

Questions and Discussion

For more information please contact Centers for Disease Control and Prevention
1600 Clifton Road NE, Atlanta, GA 30333
Telephone: 1-800-CDC-INFO (232-4636)/TTY: 1-888-232-6348
Visit: | Contact CDC at: 1-800-CDC-INFO or
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.