Leonardo Mariño-Ramírez, PhD
NCBI / NLM / NIH
BIOL 7210 A - Computational Genomics
2/15/2012

Bioinformatics bottleneck
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Bioinformatics challenges
• Methods: How do I analyze my data using procedures for various data types?
• Infrastructure: Where do I process my data? Large-scale compute accessibility; installing and maintaining software
• Standards: How do I ensure my results are useful? Common, shared formats using community-developed software and tools

High throughput sequencing map
http://pathogenomics.bham.ac.uk/hts/

The case for cloud computing in genome informatics
http://genomebiology.com/2010/11/5/207

The National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
- Establish public databases
- Research in computational biology
- Develop software tools for sequence analysis
- Disseminate biomedical information

Genome Projects in Entrez
Genome sequencing projects statistics (2/13/2012)

Organism      Complete  Draft assembly  In progress  Total
Prokaryotes       1117             966          595   2678
  Archaea          100               5           48    153
  Bacteria        1017             961          547   2525
Eukaryotes          36             319          294    649
Total             1153            1285          889   3327

http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html

The NCBI microbial annotation pipeline
1. Ab initio prediction of coding sequences: GeneMark and Glimmer
   Standalone Tools: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html
2. Automated annotation: NCBI Prokaryotic Genome Automatic Annotation Pipeline (RPS-BLAST, BLASTX, TBLASTN)
   http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html

Protein Annotation Tools - I
Entrez Protein Clusters
http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters

Protein Annotation Tools - II
Entrez Protein Clusters analytic tools
• multiple sequence alignment
• phylogenetic trees
• ProtMap

Translating the genome into useful discoveries
Genome assembly and annotation
Identification of conserved biochemical pathways
• Conserved domain identification (RPS-BLAST, HMMER)
• COGs
• Phylogenetic Profile Comparisons

Genome Annotation Checks (complete genomes)
• Why do we need to perform checks? Garbage in, garbage out.
• We want to provide a tool that will check the annotation of a genome for anomalies that need to be examined further: a measure of genome annotation
• Functions in conjunction with existing tools built into Sequin and checks made by GenBank staff during the submission process

• Takes input genomic file (ASN.1 format)
• Nucleotide sequence extracted
• tRNAscan-SE used to search for missing tRNAs
• BLAST search against all RefSeq proteins from complete genomes (E < 10^-6)
• RPS-BLAST against all Conserved Domain profiles (E < 10^-2)

Current submission
1. Potential frameshifts
2. RNA-CDS overlaps
3. CDS-CDS overlap
4. RNA-RNA overlap
5. missing tRNAs (complete)
6. missing rRNA (5S, 16S, 23S)
7. truncated proteins (partial domain overlaps)

1. Potential frameshifts
Two or more adjacent genes encoding proteins that hit the same subject from BLAST results
[Diagram: Protein1, Protein2, Protein3, Protein4 drawn 5' to 3'; a common BLAST hit spanning both proteins]

2. CDS-RNA Overlap
RNAs completely overlapping (+/- strand) CDS and vice versa
[Diagram: Protein1 and RNA2, each drawn 5' to 3']

3. CDS-CDS Overlap
CDS completely overlapping (+/- strand) CDS
[Diagram: Protein1 and Protein2, each drawn 5' to 3']

Planned (near future)
1. truncated proteins (missing part of a domain)
   comparison to Conserved Domain database (curated domains: SMART, PFAM, COG, PRK)
   - currently aligned portion of domain must cover > 90% of protein
   - protein only covers 80% of domain
   - reported as 'potentially truncated protein'
   [Diagram: protein aligned against domain]
2. partial overlaps: CDS-CDS, CDS-RNA, RNA-RNA
3. missing tRNAs

Use in RefSeq
• Missing or absent structural ribosomal RNAs were detected in all complete prokaryotic genomes (5S, 16S, 23S)
• Internal ribosomal RNA database is used for BLAST searches
• High scoring potential rRNA is aligned against internal db
• Analyzed for missing, strand mismatches, length mismatches
• Currently added semiautomatically (automatically in the future)

Virtual machines in cloud environments
Running the pipeline happens on the local machine, while the heavy lifting is done on the cloud/cluster

CloVR is a Virtual Machine
Virtual Machine Pipelines: CloVR-16S, CloVR-Search, CloVR-Microbe, CloVR-Metagenomics
Angiuoli, et al. (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics

CloVR Architecture

CloVR Protocols - Search
BLAST: Basic Local Alignment Search Tool

CloVR Protocols - 16S rRNA
• Multiple Sequence Alignment (MSA)
• OTU clustering
• Alpha/beta diversity analysis
Software: QIIME, RDP classifier, custom R scripts, others (RDP = Ribosomal Database Project)

CloVR Protocols - Microbe
• Assembly
• Prediction: genes, tRNA, rRNA
• Sequence/Functional Annotation
Software/Data: BLAST, Celera Assembler, tRNAscan-SE, RNAmmer, BLASTX, COG, HMMER, PFAM, TIGRFAM

CloVR Protocols - Metagenomics

Case study: Microbe protocol
Microbe Protocol: 250K 454 Reads (8 kb PE)
• 160 clusters
• 1280 CPUs total (avg 8 CPUs/cluster)
• Distributed BLAST was run on dynamically allocated clusters
• About 4.5 hours
• $108/hr

Galaxy on the cloud
• Get Galaxy without the data or usage limitations.
• Combine with Cloud BioLinux to have access to MANY tools.
• Create an analysis cluster in minutes.
• Use autoscaling to get good performance at low cost.
http://wiki.g2.bx.psu.edu/Admin/Cloud

Deploying Galaxy cluster on AWS
1. 2. 3. 4.
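
Supplementary sketch for the Genome Annotation Checks slides: the complete-overlap checks (CDS-CDS, CDS-RNA, RNA-RNA) and the potential-frameshift check can be illustrated with a few lines of Python. This is only a minimal sketch, not the NCBI implementation. The `Feature` class, its field names, the `blast_subject` field, and the example coordinates are all hypothetical, and "adjacent" is assumed here to mean neighbouring CDS features when sorted by start coordinate. Strand is ignored because the slides count overlaps on either strand.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Feature:
    """Hypothetical annotated feature; coordinates are 1-based and inclusive."""
    name: str
    kind: str                            # "CDS" or "RNA"
    start: int
    end: int
    blast_subject: Optional[str] = None  # best BLAST subject (assumed precomputed)


def contains(outer: Feature, inner: Feature) -> bool:
    """True if `inner` lies completely within `outer` (strand is ignored)."""
    return outer.start <= inner.start and inner.end <= outer.end


def complete_overlaps(features: List[Feature]) -> List[Tuple[str, str, str]]:
    """Report (kinds, container, contained) for every complete overlap."""
    found = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if contains(a, b):
                found.append((f"{a.kind}-{b.kind}", a.name, b.name))
            elif contains(b, a):
                found.append((f"{b.kind}-{a.kind}", b.name, a.name))
    return found


def potential_frameshifts(features: List[Feature]) -> List[Tuple[str, str, str]]:
    """Flag neighbouring CDS pairs whose proteins hit the same BLAST subject."""
    cds = sorted((f for f in features if f.kind == "CDS"), key=lambda f: f.start)
    flagged = []
    for left, right in zip(cds, cds[1:]):
        if left.blast_subject is not None and left.blast_subject == right.blast_subject:
            flagged.append((left.name, right.name, left.blast_subject))
    return flagged


# Invented example data, loosely following the names used in the slide diagrams.
features = [
    Feature("Protein1", "CDS", 100, 900, "subjA"),
    Feature("Protein2", "CDS", 905, 1500, "subjA"),
    Feature("Protein3", "CDS", 1600, 2400, "subjB"),
    Feature("RNA2", "RNA", 1700, 1800),
    Feature("Protein4", "CDS", 2500, 3000, "subjC"),
]

print(complete_overlaps(features))      # RNA2 lies entirely inside Protein3
print(potential_frameshifts(features))  # Protein1 and Protein2 share a subject
```

In this toy input, Protein1 and Protein2 are flagged as a potential frameshift because both hit the same subject, and RNA2 is reported as completely overlapped by Protein3. A real check would also need the BLAST E-value cutoff from the slides and a rule for how far apart "adjacent" genes may be, neither of which is modelled here.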