Leonardo Mariño-Ramírez

Leonardo Mariño-Ramírez, PhD
NCBI / NLM / NIH
BIOL 7210 A – Computational Genomics
2/15/2012
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Bioinformatics bottleneck
Bioinformatics challenges
• Methods: How do I analyze my data using procedures for various data types?
• Infrastructure: Where do I process my data? Large-scale compute accessibility; installing and maintaining software.
• Standards: How do I ensure my results are useful? Common, shared formats using community-developed software and tools.
High throughput sequencing map
http://pathogenomics.bham.ac.uk/hts/
The case for cloud computing in genome informatics
http://genomebiology.com/2010/11/5/207
The National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Genome Projects in Entrez
Genome sequencing projects statistics
Organism       Complete   Draft assembly   In progress   Total
Prokaryotes        1117              966           595    2678
  Archaea           100                5            48     153
  Bacteria         1017              961           547    2525
Eukaryotes           36              319           294     649
Total              1153             1285           889    3327
As of 2/13/2012
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
The NCBI microbial annotation pipeline
1. Ab initio prediction of coding sequences:
GeneMark and Glimmer Standalone Tools
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html
2. Automated annotation:
NCBI Prokaryotic Genome Automatic Annotation Pipeline
RPS-BLAST, BLASTX, TBLASTN
http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html
Protein Annotation Tools - I
Entrez Protein Clusters http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters
Protein Annotation Tools - II
Entrez Protein Clusters
analytic tools
multiple sequence alignment
phylogenetic trees
ProtMap
Translating the genome into useful discoveries
• Genome assembly and annotation
• Identification of conserved biochemical pathways
  – Conserved domain identification (RPS-BLAST, HMMER)
  – COGs
  – Phylogenetic profile comparisons
Genome Annotation Checks (complete genomes)
• Why do we need to perform checks? Garbage in, garbage out.
• We want to provide a tool that will check the annotation of
a genome for anomalies that need to be examined further –
a measure of genome annotation
• Functions in conjunction with existing tools built into Sequin
and checks made by GenBank staff during the submission
process
Genome Annotation Checks (complete genomes)
• Takes input genomic file (ASN.1 format)
• Nucleotide sequence extracted
• tRNAscan used to search for missing tRNAs
• BLAST search against all RefSeq proteins from complete genomes (E < 10^-6)
• RPS-BLAST against all Conserved Domain profiles (E < 10^-2)
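The two E-value cutoffs above can be illustrated with a small filter over BLAST tabular output. This is only a sketch, not the NCBI implementation: it assumes hits were written in the default 12-column tabular format (`-outfmt 6`), and the function name `filter_hits` and the example rows are hypothetical.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_hits(handle, max_evalue):
    """Yield hits (as dicts) whose E-value is strictly below max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        if hit["evalue"] < max_evalue:
            yield hit

# Hypothetical example rows: one strong hit, one weak hit.
example = (
    "gene1\trefA\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-50\t550\n"
    "gene2\trefB\t35.0\t80\t52\t0\t1\t80\t10\t89\t0.003\t40\n"
)
protein_hits = list(filter_hits(io.StringIO(example), 1e-6))  # protein BLAST cutoff
domain_hits = list(filter_hits(io.StringIO(example), 1e-2))   # RPS-BLAST cutoff
```

With the stricter protein cutoff only `gene1` survives, while both rows pass the looser domain cutoff.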
Genome Annotation Checks (complete genomes)
Current submission
1. Potential frameshifts
2. RNA-CDS overlaps
3. CDS-CDS overlap
4. RNA-RNA overlap
5. missing tRNAs (complete)
6. missing rRNA (5S, 16S, 23S)
7. truncated proteins (partial domain overlaps)
1. Potential frameshifts
two or more adjacent genes encoding proteins
that hit the same subject from BLAST results
[Figure: Protein1, Protein2, Protein3, Protein4 drawn 5’ to 3’; common BLAST hit spanning both proteins]
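A minimal sketch of the frameshift heuristic, assuming the genes are already sorted in genomic order and that each gene's BLAST subjects have been collected into a set. The function name, the input layout, and the subject identifiers are hypothetical; the actual tool may apply further criteria.

```python
def frameshift_candidates(genes, hits):
    """Flag adjacent genes whose proteins hit the same BLAST subject.

    genes: list of gene IDs in genomic order.
    hits:  dict mapping gene ID -> set of subject IDs from BLAST.
    Returns a list of (gene_a, gene_b, shared_subjects) tuples.
    """
    candidates = []
    for a, b in zip(genes, genes[1:]):  # walk over adjacent pairs
        shared = hits.get(a, set()) & hits.get(b, set())
        if shared:
            candidates.append((a, b, sorted(shared)))
    return candidates

genes = ["Protein1", "Protein2", "Protein3", "Protein4"]
hits = {
    "Protein1": {"subjX"},
    "Protein2": {"subjX", "subjY"},
    "Protein3": {"subjZ"},
}
flagged = frameshift_candidates(genes, hits)
```

Here `Protein1` and `Protein2` are flagged because both hit `subjX`; genes with no hits are simply skipped.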
2. CDS-RNA Overlap
RNAs completely overlapping (+/- strand) CDS
and vice versa
[Figure: Protein1 and RNA2, each drawn 5’ to 3’, completely overlapping]
3. CDS-CDS Overlap
CDS completely overlapping (+/- strand) CDS
[Figure: Protein1 and Protein2, each drawn 5’ to 3’, completely overlapping]
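The complete-overlap checks (CDS-RNA and CDS-CDS) amount to an interval containment test that ignores strand. The sketch below assumes 1-based inclusive coordinates; the `Feature` tuple and the example coordinates are hypothetical.

```python
from collections import namedtuple

# Hypothetical feature record; coordinates are 1-based and inclusive.
Feature = namedtuple("Feature", "name kind start end strand")

def contains(outer, inner):
    """True if inner lies entirely within outer (strand is ignored)."""
    return outer.start <= inner.start and inner.end <= outer.end

def complete_overlaps(features):
    """Return (name_a, name_b, 'kindA-kindB') for every fully nested pair."""
    found = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if contains(a, b) or contains(b, a):
                found.append((a.name, b.name, f"{a.kind}-{b.kind}"))
    return found

features = [
    Feature("Protein1", "CDS", 100, 900, "+"),
    Feature("RNA2", "RNA", 300, 400, "-"),      # nested inside Protein1
    Feature("Protein2", "CDS", 850, 1200, "+"),  # only partially overlaps Protein1
]
overlaps = complete_overlaps(features)
```

Only the nested `Protein1`/`RNA2` pair is reported; the partial overlap between the two CDS features is not, which matches the distinction between the current checks and the planned partial-overlap checks. The pairwise loop is quadratic, which is acceptable for a sketch but would likely be replaced by a sorted sweep for whole genomes.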
Planned (near future)
1. truncated proteins (missing part of a domain)
comparison to Conserved Domain database (curated domains: SMART, PFAM, COG, PRK)
- currently, the aligned portion of the domain must cover > 90% of the protein
- the protein covers only 80% of the domain
- reported as 'potentially truncated protein'
[Figure: protein aligned to domain]
2. partial overlaps, CDS-CDS, CDS-RNA, RNA-RNA
3. missing tRNAs
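The truncated-protein rule can be sketched as two coverage fractions computed from one protein-to-domain alignment. The slide does not say whether "80% of domain" is an upper bound or an example value, so the sketch assumes a protein is flagged when the alignment covers more than 90% of the protein but at most 80% of the domain. The function names and coordinates are hypothetical.

```python
def coverage(start, end, total_length):
    """Fraction of a sequence covered by a 1-based inclusive alignment range."""
    return (end - start + 1) / total_length

def is_potentially_truncated(prot_range, prot_len, dom_range, dom_len,
                             min_protein_cov=0.90, max_domain_cov=0.80):
    """Assumed rule: nearly the whole protein aligns, but only part of the domain."""
    prot_cov = coverage(*prot_range, prot_len)
    dom_cov = coverage(*dom_range, dom_len)
    return prot_cov > min_protein_cov and dom_cov <= max_domain_cov

# A 100-residue protein aligned over residues 3-98 (96% of the protein).
flagged = is_potentially_truncated((3, 98), 100, (1, 150), 200)  # 75% of domain
intact = is_potentially_truncated((3, 98), 100, (1, 190), 200)   # 95% of domain
```

The first case is flagged because the protein matches only three quarters of the domain; the second is not.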
Use in RefSeq
• Missing or absent structural ribosomal RNAs were detected in all complete
prokaryotic genomes (5S, 16S, 23S)
• Internal ribosomal RNA database is used for BLAST searches
• High scoring potential rRNA is aligned against internal db
• Analyzed for missing, strand mismatches, length mismatches
• Currently added semiautomatically (automatically in the future)
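The rRNA analysis above (missing, strand mismatch, length mismatch) could be sketched as a comparison between annotated rRNAs and the candidates found by searching the genome against the rRNA database. The data layout, the function name `check_rrnas`, and the 10% length tolerance are all assumptions made for illustration; the slides do not specify them.

```python
EXPECTED = ("5S", "16S", "23S")

def check_rrnas(annotated, candidates, length_tolerance=0.10):
    """Compare annotated rRNAs with candidates found by database search.

    annotated, candidates: dict mapping rRNA type -> (strand, length).
    length_tolerance is a hypothetical relative cutoff.
    Returns dict mapping rRNA type -> list of problems (only types with problems).
    """
    report = {}
    for rrna in EXPECTED:
        if rrna not in candidates:
            continue  # nothing found in the genome to compare against
        cand_strand, cand_len = candidates[rrna]
        if rrna not in annotated:
            report[rrna] = ["missing"]
            continue
        ann_strand, ann_len = annotated[rrna]
        problems = []
        if ann_strand != cand_strand:
            problems.append("strand mismatch")
        if abs(ann_len - cand_len) / cand_len > length_tolerance:
            problems.append("length mismatch")
        if problems:
            report[rrna] = problems
    return report

annotated = {"16S": ("+", 1540), "23S": ("-", 2000)}
candidates = {"5S": ("+", 120), "16S": ("+", 1542), "23S": ("+", 2900)}
report = check_rrnas(annotated, candidates)
```

In this example the 5S rRNA is reported as missing, the 16S annotation passes, and the 23S annotation is flagged for both strand and length.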
Virtual machines in cloud environments
Running the pipeline happens on the local machine,
while the heavy lifting is done on the cloud/cluster
CloVR is a Virtual Machine
Pipelines:
CloVR-16S
CloVR-Search
CloVR-Microbe
CloVR-Metagenomics
Angiuoli, et al. (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics
CloVR Architecture
CloVR Protocols - Search
• BLAST – Basic Local Alignment Search Tool
CloVR Protocols – 16S rRNA
• Multiple Sequence Alignment (MSA)
• OTU clustering
• Alpha/beta diversity analysis
• Software: QIIME, RDP classifier, custom R scripts, others
• RDP = Ribosomal Database Project
CloVR Protocols – Microbe
• Assembly
• Prediction: genes, tRNA, rRNA
• Sequence/Functional Annotation
• Software/Data: BLAST, Celera Assembler, tRNAscan-SE, RNAmmer, BLASTX, COG, HMMER, PFAM, TIGRFAMs
CloVR Protocols – Metagenomics
Case study: Microbe protocol
• 160 clusters
• 1280 CPUs total (avg 8 CPUs/cluster)
• Distributed BLAST was run on dynamically allocated clusters
• About 4.5 hours
• $108/hr
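The case-study figures can be turned into a rough cost estimate. This is back-of-the-envelope arithmetic only: it assumes the $108/hr rate is the aggregate price for all 160 clusters and that every cluster ran for the full 4.5 hours, neither of which the slide states explicitly.

```python
clusters = 160
cpus_total = 1280
hours = 4.5            # "about 4.5 hours"
rate_per_hour = 108.0  # assumed aggregate rate in USD for all clusters

cpus_per_cluster = cpus_total / clusters     # average CPUs per cluster
total_cost = hours * rate_per_hour           # estimated cost of the run
cpu_hours = cpus_total * hours               # total CPU time consumed
cost_per_cpu_hour = total_cost / cpu_hours   # effective price per CPU-hour

print(f"{cpus_per_cluster:.0f} CPUs/cluster, ${total_cost:.2f} total, "
      f"{cpu_hours:.0f} CPU-hours, ${cost_per_cpu_hour:.4f}/CPU-hour")
```

Under these assumptions the run costs about $486 for 5760 CPU-hours, or roughly 8 cents per CPU-hour.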
Microbe Protocol 250K 454 Reads (8kb PE)
Galaxy on the cloud
• Get Galaxy without the data or usage limitations.
• Combine with Cloud BioLinux to have access to MANY tools.
• Create an analysis cluster in minutes.
• Use autoscaling to get good performance at low cost.
http://wiki.g2.bx.psu.edu/Admin/Cloud
Deploying Galaxy cluster on AWS