The Human Genome Project

Leonardo Mariño-Ramírez, PhD
NCBI / NLM / NIH
BIOL 7210 A – Computational Genomics
2/18/2015
The $1,000 genome is here!
http://www.illumina.com/systems/hiseq-x-sequencing-system.ilmn
Bioinformatics bottleneck
Bioinformatics challenges
• Methods: How do I analyze my data using procedures
for various data types?
• Infrastructure: Where do I process my data? Large-scale
compute accessibility, installing and maintaining
software
• Standards: How do I ensure my results are useful?
Common, shared formats using community developed
software and tools
High throughput sequencing map
http://omicsmaps.com/
The case for cloud computing in genome informatics
http://genomebiology.com/2010/11/5/207
The National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
The NCBI microbial annotation pipeline
1. Ab initio prediction of coding sequences:
GeneMark and Glimmer Standalone Tools
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html
2. Automated annotation:
NCBI Prokaryotic Genome Automatic Annotation Pipeline
RPS-BLAST, BLASTX, TBLASTN
http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html
The NCBI microbial annotation pipeline
http://www.ncbi.nlm.nih.gov/genome/annotation_prok/process/
Other genomic resources – Protein Clusters
Genome Annotation Checks (complete genomes)
• Why do we need to perform checks? Garbage in, garbage
out.
• We want to provide a tool that will check the annotation of
a genome for anomalies that need to be examined further –
a measure of genome annotation
• Functions in conjunction with existing tools built into Sequin
and checks made by GenBank staff during the submission
process
Genome Annotation Checks (complete genomes)
• Takes input genomic file (ASN.1 format)
• Nucleotide sequence extracted
• tRNAscan used to search for missing tRNAs
• BLAST search against all RefSeq proteins from complete genomes (E < 10^-6)
• RPS-BLAST against all Conserved Domain profiles (E < 10^-2)
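The two searches above use different E-value cutoffs. A minimal sketch of applying them, assuming the hits are available as 12-column tabular BLAST output (the default `-outfmt 6` layout, where the E-value is the 11th column). The slides do not say which output format the pipeline itself uses; the function name, dictionary keys, and example rows are hypothetical.

```python
import csv
import io

# Cutoffs taken from the slide; the dictionary keys are illustrative names only.
EVALUE_CUTOFFS = {"blast_refseq": 1e-6, "rpsblast_cdd": 1e-2}

def filter_hits(tabular_text, cutoff):
    """Keep (query, subject, evalue) for rows whose E-value is below the cutoff."""
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue = float(row[10])  # 11th column in the default tabular layout
        if evalue < cutoff:
            kept.append((row[0], row[1], evalue))
    return kept

# Two made-up hits: one strong, one weak.
example = (
    "geneA\tsubj1\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t540\n"
    "geneB\tsubj2\t35.0\t80\t52\t0\t1\t80\t5\t84\t0.003\t38\n"
)
strong_hits = filter_hits(example, EVALUE_CUTOFFS["blast_refseq"])
domain_hits = filter_hits(example, EVALUE_CUTOFFS["rpsblast_cdd"])
```

With these example rows, only `geneA` passes the stricter protein cutoff, while both rows pass the looser domain cutoff.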
Genome Annotation Checks (complete genomes)
Current submission
1. Potential frameshifts
2. RNA-CDS overlaps
3. CDS-CDS overlap
4. RNA-RNA overlap
5. missing tRNAs (complete)
6. missing rRNA (5S, 16S, 23S)
7. truncated proteins (partial domain overlaps)
1. Potential frameshifts
Two or more adjacent genes encoding proteins that hit the same subject in the BLAST results
[Figure: Protein1, Protein2, Protein3, and Protein4 drawn 5’ to 3’, with a common BLAST hit spanning both proteins]
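The frameshift check above can be sketched as a scan over genes in coordinate order, grouping neighbours whose best BLAST hit is the same subject. This is an illustration, not the NCBI implementation: the `Gene` record, the use of a single best hit per gene, and the same-strand requirement are all assumptions.

```python
from collections import namedtuple

# Hypothetical record: best_hit is the subject ID of the top BLAST hit, or None.
Gene = namedtuple("Gene", "name start end strand best_hit")

def potential_frameshifts(genes):
    """Return runs of adjacent same-strand genes that share one BLAST subject."""
    ordered = sorted(genes, key=lambda g: g.start)
    runs, current = [], []
    for gene in ordered:
        same_subject = (
            current
            and gene.best_hit is not None
            and gene.best_hit == current[-1].best_hit
            and gene.strand == current[-1].strand
        )
        if same_subject:
            current.append(gene)
        else:
            if len(current) > 1:
                runs.append([g.name for g in current])
            current = [gene]
    if len(current) > 1:
        runs.append([g.name for g in current])
    return runs

genes = [
    Gene("Protein1", 1, 300, "+", "subjX"),
    Gene("Protein2", 310, 600, "+", "subjX"),
    Gene("Protein3", 700, 900, "+", "subjY"),
    Gene("Protein4", 1000, 1300, "+", None),
]
flagged = potential_frameshifts(genes)
```

Here `Protein1` and `Protein2` are flagged together because they are adjacent and share the made-up subject `subjX`.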
2. CDS-RNA Overlap
RNAs completely overlapping CDSs (on either strand), and vice versa
[Figure: Protein1 and RNA2, each drawn 5’ to 3’]
3. CDS-CDS Overlap
CDSs completely overlapping other CDSs (on either strand)
[Figure: Protein1 and Protein2, each drawn 5’ to 3’]
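The complete-overlap checks (CDS-RNA, CDS-CDS, and by extension RNA-RNA) reduce to an interval-containment test that ignores strand. A minimal sketch, assuming 1-based inclusive coordinates; the `Feature` record and feature names are hypothetical, and the pairwise loop is written for clarity rather than speed.

```python
from collections import namedtuple
from itertools import combinations

# Hypothetical feature record; kind is "CDS" or "RNA".
Feature = namedtuple("Feature", "name kind start end strand")

def contains(outer, inner):
    """True if inner lies entirely within outer (strand is ignored)."""
    return outer.start <= inner.start and inner.end <= outer.end

def complete_overlaps(features):
    """Return (category, name_a, name_b) for every pair with complete overlap."""
    found = []
    for a, b in combinations(features, 2):
        if contains(a, b) or contains(b, a):
            category = "-".join(sorted([a.kind, b.kind]))  # e.g. "CDS-RNA"
            found.append((category, a.name, b.name))
    return found

features = [
    Feature("cds_a", "CDS", 100, 900, "+"),
    Feature("rna_b", "RNA", 300, 380, "-"),   # inside cds_a, opposite strand
    Feature("cds_c", "CDS", 950, 1500, "+"),
    Feature("cds_d", "CDS", 1000, 1200, "-"), # inside cds_c, opposite strand
]
overlaps_found = complete_overlaps(features)
```

For a whole genome, sorting features by start coordinate and sweeping once would avoid comparing every pair.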
Use in RefSeq
• Missing or absent structural ribosomal RNAs were detected in all complete
prokaryotic genomes (5S, 16S, 23S)
• Internal ribosomal RNA database is used for BLAST searches
• High scoring potential rRNA is aligned against internal db
• Analyzed for missing, strand mismatches, length mismatches
• Currently added semiautomatically – (automatically in the future)
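The rRNA analysis above (missing, strand mismatches, length mismatches) can be sketched by comparing annotated rRNAs with candidate rRNA regions found by a database search. This is a hedged illustration only: the `Rrna` record, the 10% length tolerance, and the rule of pairing an annotation with any overlapping candidate of the same kind are assumptions, not details given in the slides.

```python
from collections import namedtuple

# Hypothetical record for both annotated rRNAs and search-derived candidates.
Rrna = namedtuple("Rrna", "kind start end strand")
REQUIRED = ("5S", "16S", "23S")  # structural rRNAs named on the slide

def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end

def check_rrnas(annotated, candidates, length_tolerance=0.1):
    """Report missing rRNA kinds plus strand and length disagreements."""
    problems = []
    for kind in REQUIRED:
        if not any(r.kind == kind for r in annotated):
            problems.append((kind, "missing"))
    for cand in candidates:
        for r in annotated:
            if r.kind != cand.kind or not overlaps(r, cand):
                continue
            if r.strand != cand.strand:
                problems.append((r.kind, "strand mismatch"))
            ann_len = r.end - r.start + 1
            cand_len = cand.end - cand.start + 1
            if abs(ann_len - cand_len) > length_tolerance * cand_len:
                problems.append((r.kind, "length mismatch"))
    return problems

annotated = [Rrna("16S", 1000, 2500, "+"), Rrna("23S", 3000, 4000, "-")]
candidates = [
    Rrna("16S", 1000, 2530, "+"),
    Rrna("23S", 3000, 5900, "+"),
    Rrna("5S", 6000, 6119, "+"),
]
rrna_problems = check_rrnas(annotated, candidates)
```

In this made-up example the 5S rRNA is reported as missing, and the 23S annotation disagrees with its candidate in both strand and length.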
Data Exchange
EcoCyc – publications – protein interactions
EcoGene – publications – gene locations, gene names,
verified N-termini
PseudoCAP – publications
REBASE – publications, protein names
BRC – preliminary data
KEGG – pathways, ortholog groups
How has the genome changed?
• More complex genome structures (chromosomes, organelles, plasmids)
• Genome sequencing: NextGen sequencing
• More complex genome assembly (chromosomes, scaffolds, contigs)
• Genome-scale projects (transcriptome, exome, epigenomics, proteomics)
• Multi-isolate genome sequencing (1001 Arabidopsis, 1000 human genomes)
• Meta-genomes
• Now useful for drug development
New resources at NCBI
New genomic resources at NCBI
Why do we need new databases?
Taxonomy
BioProject
Genome
Nucleotide
BioSample
Assembly
BioProject, Genome, Assembly
• BioProject is an administrative object (defined by goal, target, funding,
collaboration)
• Genome is a biological object defining an organism at the molecular level
• Genome assembly is a complex data structure that defines the structure,
relative position (scaffold) and chromosome placement of DNA sequences
originating from a single sample
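The distinction between the three objects can be made concrete with a small data model. This is a simplified sketch, not NCBI's schema: the class and field names are hypothetical, chosen to mirror the wording of the bullets above (goal, target, funding, collaboration; an assembly tied to a single sample).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sequence:
    accession: str
    role: str    # "chromosome", "scaffold", or "contig"
    length: int

@dataclass
class Assembly:
    """Sequences and their placement, all derived from one sample."""
    name: str
    biosample: str
    sequences: List[Sequence] = field(default_factory=list)

    def total_length(self) -> int:
        return sum(s.length for s in self.sequences)

@dataclass
class Genome:
    """Biological object: one organism, possibly with several assemblies."""
    organism: str
    assemblies: List[Assembly] = field(default_factory=list)

@dataclass
class BioProject:
    """Administrative object: describes the effort, not the sequences."""
    goal: str
    target: str
    funding: Optional[str] = None
    collaboration: Optional[str] = None

# Made-up identifiers and lengths, for illustration only.
asm = Assembly("example_asm_v1", "SAMPLE_1", [
    Sequence("SEQ_1", "chromosome", 4000000),
    Sequence("SEQ_2", "contig", 50000),
])
genome = Genome("Example organism", [asm])
project = BioProject(goal="genome sequencing", target="Example organism")
```

Keeping `BioProject` free of sequence data reflects the point of the slide: a project can exist, and be described, before any assembly does.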
What is a Genome project?
• A genome project is a scientific endeavor that ultimately aims to determine the complete genome sequence of an organism and …
• aims to annotate protein-coding genes and other important genome-encoded features and …
• aims to understand the biology, physiology, and evolution of the organism.
Genome Project -> BioProject
Random survey
Targeted sequencing
Metagenome
Population genomics
Variant Discovery
Genome sequencing
Assembly
Epigenomics
Proteomics
Ecosystem genomics
Transcriptome sequencing
Annotation
BioProject data model
[Diagram: Target, Scope, Objective, Capture, Material, Method. Scope values: Mono-isolate, Multi-isolate, Multi-species, Environmental. Material values: DNA, RNA, Protein. Method values: sequencing, array, proteomics.]
Why do we need a database of
genome assemblies?
• We are in a period of extraordinary growth in genomics
data.
• To get the full benefit from all this data, it is important
that users can integrate data from different sources.
Integration only works if users know whether or not the
different data were reported in the same coordinate
system.
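One practical consequence is that a tool should refuse to merge feature sets unless they name the same underlying sequence record. A minimal sketch, assuming each data set carries the accession it was reported against; the `Track` record and feature tuples are hypothetical, and the two accessions are the ones these slides use for the TB H37Rv Sanger and Broad assemblies.

```python
from collections import namedtuple

# Hypothetical container: features are (start, end, name) tuples.
Track = namedtuple("Track", "label accession features")

def integrate(track_a, track_b):
    """Merge two feature lists only if they share a coordinate system."""
    if track_a.accession != track_b.accession:
        raise ValueError(
            f"{track_a.label} uses {track_a.accession} but "
            f"{track_b.label} uses {track_b.accession}; "
            "coordinates are not comparable"
        )
    return sorted(track_a.features + track_b.features)

# Feature coordinates below are made up for illustration.
genes = Track("genes", "NC_000962", [(100, 400, "geneA")])
variants = Track("variants", "NC_000962", [(250, 250, "snp1")])
other = Track("other genes", "NC_018143", [(100, 400, "geneA")])

merged = integrate(genes, variants)
try:
    integrate(genes, other)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Comparing accessions is only a first check; a real workflow would also compare sequence versions.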
Broad assembly (NC_018143)
TB H37Rv Sanger vs. Broad
Sanger assembly (NC_000962)
Mycobacterium genomes at NCBI
Mycobacterium tuberculosis genomes
Mycobacterium tuberculosis overview
Mycobacterium tuberculosis genome annotation
Mycobacterium tuberculosis H37Rv
Mycobacterium tuberculosis H37Rv browser
From the Gene record
Mycobacterium tuberculosis H37Rv GenePlot
BioProject, BioSample, Genome, Assembly, Nucleotide
[Diagram: relationships among Genome, BioProject (single isolate and multi isolate), BioSample, SRA, Assembly, and Nucleotide records]
NCBI genome submission dataflow
[Diagram elements: BioProject, metadata, Common Submission Interface, BioSample, sequence data, SRA, GenBank, Contigs, Genome Collection]
Virtual machines in cloud environments
Running the pipeline happens on the local machine,
while the heavy lifting is done on the cloud/cluster
CloVR is a Virtual Machine
Pipelines:
CloVR-16S
CloVR-Search
CloVR-Microbe
CloVR-Metagenomics
Angiuoli, et al. (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics
CloVR Architecture
Galaxy on the cloud
• Get Galaxy without the data or usage limitations.
• Combine with Cloud BioLinux to have access to MANY tools.
• Create an analysis cluster in minutes.
• Use autoscaling to get good performance at low cost.
http://wiki.g2.bx.psu.edu/Admin/Cloud
Deploying Galaxy cluster on AWS