Leonardo Mariño-Ramírez, PhD
NCBI / NLM / NIH
BIOL 7210 A – Computational Genomics
2/18/2015

The $1,000 genome is here!
http://www.illumina.com/systems/hiseq-x-sequencing-system.ilmn

Bioinformatics bottleneck

Bioinformatics challenges
• Methods: How do I analyze my data using procedures for various data types?
• Infrastructure: Where do I process my data? Large-scale compute accessibility; installing and maintaining software
• Standards: How do I ensure my results are useful? Common, shared formats using community-developed software and tools

High throughput sequencing map
http://omicsmaps.com/

The case for cloud computing in genome informatics
http://genomebiology.com/2010/11/5/207

The National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

The NCBI microbial annotation pipeline
1. Ab initio prediction of coding sequences: GeneMark and Glimmer
Standalone Tools
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html
2. Automated annotation: NCBI Prokaryotic Genome Automatic Annotation Pipeline
RPS-BLAST, BLASTX, TBLASTN
http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html

The NCBI microbial annotation pipeline
http://www.ncbi.nlm.nih.gov/genome/annotation_prok/process/

Other genomic resources – Protein Clusters

Genome Annotation Checks (complete genomes)
• Why do we need to perform checks?
– Garbage in, garbage out
• We want to provide a tool that will check the annotation of a genome for anomalies that need to be examined further – a measure of genome annotation
• Functions in conjunction with existing tools built into Sequin and checks made by GenBank staff during the submission process

Genome Annotation Checks (complete genomes)
• Takes input genomic file (ASN.1 format)
• Nucleotide sequence extracted
• tRNAscan used to search for missing tRNAs
• BLAST search against all RefSeq proteins from complete genomes (E < 10^-6)
• RPS-BLAST against all Conserved Domain profiles (E < 10^-2)

Genome Annotation Checks (complete genomes)
Current submission
1. Potential frameshifts
2. RNA-CDS overlaps
3. CDS-CDS overlap
4. RNA-RNA overlap
5. Missing tRNAs (complete)
6. Missing rRNA (5S, 16S, 23S)
7. Truncated proteins (partial domain overlaps)

1. Potential frameshifts
Two or more adjacent genes encoding proteins that hit the same subject in BLAST results
[Figure: Protein1, Protein2, Protein3, and Protein4 on a 5’ to 3’ sequence, with a common BLAST hit spanning two adjacent proteins]

2. CDS-RNA overlap
RNAs completely overlapping (+/- strand) CDS and vice versa
[Figure: Protein1 and RNA2, each drawn 5’ to 3’]

3. CDS-CDS overlap
CDS completely overlapping (+/- strand) CDS
[Figure: Protein1 and Protein2, each drawn 5’ to 3’]

Use in RefSeq
• Missing or absent structural ribosomal RNAs (5S, 16S, 23S) were detected in all complete prokaryotic genomes
• Internal ribosomal RNA database is used for BLAST searches
• High-scoring potential rRNA is aligned against internal db
• Analyzed for missing rRNAs, strand mismatches, and length mismatches
• Currently added semiautomatically (automatically in the future)

Data Exchange
EcoCyc – publications, protein interactions
EcoGene – publications, gene locations, gene names, verified N-termini
PseudoCAP – publications
REBASE – publications, protein names
BRC – preliminary data
KEGG – pathways, ortholog groups

How has the genome changed?
• More complex genome structures (chromosomes, organelles, plasmids)
• Genome sequencing – NextGen sequencing
• More complex genome assembly (chromosomes, scaffolds, contigs)
• Genome-scale projects (transcriptome, exome, epigenomics, proteomics)
• Multi-isolate genome sequencing (1001 Arabidopsis, 1000 human genomes)
• Metagenomes
• Now useful for drug development

New resources at NCBI

New genomic resources at NCBI

Why do we need new databases?
[Figure: Taxonomy, BioProject, Genome, Nucleotide, BioSample, Assembly]

BioProject, Genome, Assembly
• BioProject is an administrative object (defined by goal, target, funding, collaboration)
• Genome is a biological object defining an organism at the molecular level
• Genome assembly is a complex data structure that defines the structure, relative position (scaffold), and chromosome placement of DNA sequences originating from a single sample

What is a genome project?
• A genome project is a scientific endeavor that ultimately aims to determine the complete genome sequence of an organism, and …
• Aims to annotate protein-coding genes and other important genome-encoded features, and …
• Aims to understand the biology, physiology, and evolution of the organism.

Genome Project -> BioProject
[Figure: Random survey, Targeted sequencing, Metagenome, Population genomics, Variant discovery, Genome sequencing, Assembly, Epigenomics, Proteomics, Ecosystem genomics, Transcriptome sequencing, Annotation]

BioProject data model
[Figure: Target, Scope, Objective; scope values Mono-isolate, Multi-isolate, Multi-species, Environmental; Capture; Material (DNA, RNA, Protein); Method (sequencing, array, proteomics)]

Why do we need a database of genome assemblies?
• We are in a period of extraordinary growth in genomics data.
• To get the full benefit from all this data, it is important that users can integrate data from different sources.
Integration only works if users know whether or not the different data were reported in the same coordinate system.

TB H37Rv: Sanger vs. Broad
Broad assembly (NC_018143)
Sanger assembly (NC_000962)

Mycobacterium genomes at NCBI

Mycobacterium tuberculosis genomes

Mycobacterium tuberculosis overview

Mycobacterium tuberculosis genome annotation

Mycobacterium tuberculosis H37Rv

Mycobacterium tuberculosis H37Rv browser
From the Gene record

Mycobacterium tuberculosis H37Rv GenePlot

BioProject, BioSample, Genome, Assembly, Nucleotide
[Figure: hierarchy relating Genome, single-isolate and multi-isolate BioProjects, SRA, BioSamples, Assemblies, and Nucleotide records]

NCBI genome submission dataflow
[Figure: BioProject metadata, Common Submission Interface, BioSample, Sequence data, SRA, GenBank, Contigs, Genome Collection]

Virtual machines in cloud environments
Running the pipeline happens on the local machine, while the heavy lifting is done on the cloud/cluster

CloVR is a Virtual Machine
Virtual Machine Pipelines:
• CloVR-16S
• CloVR-Search
• CloVR-Microbe
• CloVR-Metagenomics
Angiuoli et al. (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics

CloVR Architecture

Galaxy on the cloud
• Get Galaxy without the data or usage limitations.
• Combine with Cloud BioLinux to have access to MANY tools.
• Create an analysis cluster in minutes.
• Use autoscaling to get good performance at low cost.
http://wiki.g2.bx.psu.edu/Admin/Cloud

Deploying Galaxy cluster on AWS
1. 2. 3. 4.
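
Returning to the genome annotation checks listed earlier (potential frameshifts and the complete-overlap checks between CDS and RNA features), the logic could be sketched in Python roughly as follows. This is not NCBI's implementation: the `Feature` record, the function names, the subject IDs, and the same-strand requirement for frameshift candidates are all assumptions made for the example, and real input would come from the ASN.1 annotation and BLAST results rather than hand-written records.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Feature:
    """Hypothetical, simplified annotated feature (1-based, inclusive coordinates)."""
    name: str
    kind: str                       # "CDS" or "RNA"
    start: int
    end: int
    strand: str                     # "+" or "-"
    best_hit: Optional[str] = None  # subject ID of the best BLAST hit, if any


def contains(outer: Feature, inner: Feature) -> bool:
    """True if `inner` lies entirely within `outer`, on either strand."""
    return outer.start <= inner.start and inner.end <= outer.end


def complete_overlaps(features: List[Feature]) -> List[Tuple[str, str, str]]:
    """Report pairs where one feature completely overlaps another.

    Each report is (label, outer_name, inner_name), where label is
    "CDS-CDS", "CDS-RNA", or "RNA-RNA".
    """
    reports = []
    for a, b in combinations(features, 2):
        if contains(a, b):
            outer, inner = a, b
        elif contains(b, a):
            outer, inner = b, a
        else:
            continue
        label = "-".join(sorted((a.kind, b.kind)))
        reports.append((label, outer.name, inner.name))
    return reports


def potential_frameshifts(features: List[Feature]) -> List[Tuple[str, str, str]]:
    """Flag adjacent CDSs whose best BLAST hits share the same subject.

    Assumption: both CDSs must be on the same strand to count as a candidate.
    """
    cds = sorted((f for f in features if f.kind == "CDS"), key=lambda f: f.start)
    flagged = []
    for left, right in zip(cds, cds[1:]):
        same_hit = left.best_hit is not None and left.best_hit == right.best_hit
        if same_hit and left.strand == right.strand:
            flagged.append((left.name, right.name, left.best_hit))
    return flagged


# Toy data; names echo the slide diagrams, coordinates and hits are made up.
features = [
    Feature("Protein1", "CDS", 100, 400, "+", best_hit="subjectA"),
    Feature("Protein2", "CDS", 410, 700, "+", best_hit="subjectA"),
    Feature("Protein3", "CDS", 900, 1500, "-", best_hit="subjectB"),
    Feature("RNA2", "RNA", 1000, 1080, "+"),
]

print(complete_overlaps(features))
print(potential_frameshifts(features))
```

With the toy features above, the overlap check reports RNA2 as lying entirely inside Protein3, and the frameshift check flags Protein1 and Protein2 because both hit subjectA.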