Bioinformatics in the CDC Biotechnology Core Facility Branch

Computational Lab
Scott Sammons
Kevin Tang
Chandni Desai
Sequencing Lab
Mike Frace
Missy Olsen-Rasmussen
Marina Khristova
Lori Rowe
Genome Sequencing Lab
sequencing platforms – current and upcoming
AB 3730XL
Roche 454 Titanium +
Pacific Biosciences SMRT sequencer
Illumina GA IIx
Ion Torrent Personal Genome Machine
Building 23 Server Room – Main Aisle
High Performance Computing Cluster (Aspen)
• What is it?
– 35 compute nodes, each with 12 processor cores, 48 GB of memory, and 2 Tesla 2050 GPU cards
– Currently in the final stages of development in preparation for code freeze and C&A
• What can it do today?
– 25 cluster applications are currently enabled for our phase-one deployment, including MATLAB, Geneious, BEAST, BLAST, and PacBio
– Collaboration with NCI via IAA will GPU-enable scientific applications even further
• How fast is it?
– For example, a BLAST job that takes over 60 hours to complete on our old cluster takes 2 hours on the new cluster*
*Not GPU-optimized code
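As a quick check of what the per-node figures add up to, the sketch below multiplies them out and computes the BLAST speedup quoted on the slide. The variable names are illustrative, and the speedup is a lower bound because the old-cluster time is given only as "over 60 hours".

```python
# Per-node figures and timings as stated on the slide; names are illustrative.
NODES = 35
CORES_PER_NODE = 12
MEM_GB_PER_NODE = 48
GPUS_PER_NODE = 2

total_cores = NODES * CORES_PER_NODE
total_mem_gb = NODES * MEM_GB_PER_NODE
total_gpus = NODES * GPUS_PER_NODE

# "Over 60 hours" on the old cluster versus 2 hours on the new one.
old_hours, new_hours = 60, 2
min_speedup = old_hours / new_hours

print(f"{total_cores} cores, {total_mem_gb} GB RAM, {total_gpus} GPUs")
print(f"BLAST speedup: at least {min_speedup:.0f}x")
```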
Isilon
• What is it?
– High-speed, scalable, and redundant Network Attached Storage
– Currently in the process of being integrated with applications
– Connected to both the CDC network and the Aspen HPC cluster utilizing InfiniBand
• What can it do today?
– It provides user workspace for end users and HPC applications
– Solves the problem of being out of disk space on individual servers
• What are we doing with it?
– Data warehouse for all scientific equipment
– Central network share for all scientific users
– Integrating directly with ITSO’s Active Directory forest
Private Cloud
• What is it?
– Support science through front-end and back-end services
– Implementation of virtualized infrastructure
– Currently in the process of being deployed
• What can it do today?
– Provide test environments for scientific projects
– Lay the foundation for hardware consolidation and migration
• What are we doing with it?
– Standardize platforms
– Centralize management
– Support ongoing growth within the scientific computing community while enabling science
Scientific Computing Infrastructure
The Server Room
• 2 Linux High Performance Computing Clusters (~40 nodes each)
• 1 Genomics Cluster
• 4 Solaris Servers
• 12 Stand-Alone Linux Servers
• 1 Stand-Alone Database Server
• 5 Stand-Alone Windows Servers
• Virtualized Cluster with 15 VMs
• 3 NAS Devices
• 2 Tape Libraries
• 2 Dedicated IP Subnets
• One C&A addressing all legacy production hardware (NCEZID), with several in process for systems currently under development (NCIRD)
INFLUENZA
GSL sequencing 2011
NCIRD
Haemophilus influenzae
Legionella pneumophila
Legionella spp.
Mycoplasma pneumoniae
Water cooling tower metagenomics
Respiratory filter metagenomics
Bat metagenomics
NCEZID
Vibrio cholerae
Vibrio spp.
Cyclospora
Bacillus anthracis
Listeria
Yersinia pestis
Brucella spp.
Klebsiella pneumoniae
Junin virus
Rift Valley Fever virus
Lujo virus
Marburg virus
CCHF virus
Lassa Fever virus
Clinical sample metagenomics
Tick metagenomics
Soil metagenomics
CGH
Guinea worm
Taenia solium
Angiostrongylus
Sequencing: extended PCR
Position of E-PCR overlapping amplicons
[Figure: overlapping E-PCR amplicons A1 to A18, running from End-L to End-R, aligned to the HindIII map (fragment labels D, P, O, C, E, R, K, H, M, L, I, F, N, Q, A, S, J, B, G)]
Primers designed using VAR-BSH and VAC-CPN sequences
Primers target genes involved in reproduction & host response
Sequence sample: primers 40 sites, 1 enz. RFLP ~120 sites
PCR uses minimal DNA amounts, often no need to grow virus
PCR uses the high-fidelity Expand Long Template Taq & Pwo enzyme mix (Roche)
First Pass Assembly: Seqmerge
[Figure: fold redundancy across the first-pass assembly; axis ticks at 4, 8, 12, 16]
Sequencing Assembly: Phred/Phrap/Consed
Gene Prediction
• Heuristic algorithm to assign quality scores to ORFs (from 1 to 100)
• Quality scores are based on a number of factors, including
– Gene predictions (Glimmer, GeneMark, getorf)
– Primary sequence homology to known genes (BLAST)
– Presence of a predicted promoter (MEME/MAST)
– Size of the predicted ORF
– Presence of transcription termination signals
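One way such a heuristic could combine the listed factors is a weighted sum clamped to the 1 to 100 range. The sketch below is only an illustration: the weights, the length thresholds, and the `OrfEvidence` fields are assumptions, not the scoring scheme actually used.

```python
from dataclasses import dataclass


@dataclass
class OrfEvidence:
    predictors_agreeing: int  # how many of the 3 gene finders called this ORF
    has_blast_hit: bool       # homology to a known gene
    has_promoter: bool        # predicted promoter upstream
    length_nt: int            # ORF size in nucleotides
    has_terminator: bool      # transcription termination signal present


# Hypothetical weights chosen to sum to 100; not taken from the slides.
WEIGHTS = {"predictors": 30, "blast": 30, "promoter": 15, "size": 15, "terminator": 10}


def score_orf(ev, min_len=150, full_len=900):
    """Return a quality score from 1 to 100 for one candidate ORF."""
    # Size contributes linearly between min_len and full_len (assumed cutoffs).
    size_frac = (ev.length_nt - min_len) / (full_len - min_len)
    size_frac = min(max(size_frac, 0.0), 1.0)
    raw = (
        WEIGHTS["predictors"] * ev.predictors_agreeing / 3
        + WEIGHTS["blast"] * ev.has_blast_hit
        + WEIGHTS["promoter"] * ev.has_promoter
        + WEIGHTS["size"] * size_frac
        + WEIGHTS["terminator"] * ev.has_terminator
    )
    return max(1, min(100, round(raw)))


strong = OrfEvidence(3, True, True, 1200, True)
weak = OrfEvidence(0, False, False, 100, False)
print(score_orf(strong), score_orf(weak))
```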
Visualizing Gene Predictions and Differences
ORFs of CPXVs from 4 different clades
[Figure: ORF maps with the ITR regions and crm-D labeled]
45 Smallpox Strains
A. West African: intermediate, CFR ~10%
B. American alastrim: minor, CFR <1%
C. Asian: major, CFR ~5–35%
C-1. Non-West African African: intermediate, CFR ~10%
C-2. Non-West African African: minor, CFR <1%
Unrooted tree: phylogenetic relationships of the ORF encoding the hemagglutinin protein
[Figure: unrooted tree with tips labeled Taterapox, Camelpox, Variola, Ectromelia, Monkeypox, Vaccinia, and Cowpox clades I, II, III (CPXV91_ger3), and IV (CPXV90_ger2); other tip labels: AF375135, L22579, AY902256, AY603355, AF377885, AF375086, VACLS1, Z99045, AY902297, AF375102]
Next-Gen Diagnostic Sequencing Applications
Shotgun / Paired-End Sequencing: random shearing of DNA gives even sequence coverage over the entire genome.
‘Massively parallel’ sequencing not only increases throughput, it also provides sequences of potentially millions of individual molecules (instant cloning). Sequencing a PCR reaction allows a detailed search for low-abundance quasispecies or mutations that may signal growing drug or vaccine resistance, a process called ultra-deep or amplicon sequencing.
Example: clinical case of poxvirus infection with samples exhibiting a reduced sensitivity to an antiviral drug.
Complex clinical, laboratory, or environmental samples can be sequenced to provide a diagnostic ‘snapshot’ of the resident organisms, an approach called metagenomic sequencing.
Examples: tissue culture, soil
Shotgun / Paired-End Sequencing
De novo Assembly
• Newbler
• CLCBio
• Mira
• Geneious
• Velvet
• Celera
Reference Mapping
• Newbler
• CLCBio
• Mosaik
• Mira
• Geneious
• BWA
Genome Assembly Visualization
Amplicon (deep) sequencing project
Li, Damon - NCEZID/DVRD/PRB
• Clinical case of progressive vaccinia infection from smallpox vaccination of an immunocompromised patient
• The pox antiviral ST-246 was administered; it targets the pox gene F13L, which encodes a major envelope protein that mediates production of extracellular virus
• Oral ST-246 was given daily and the vaccination site was sampled over a 3-week period
A region of gene F13L was amplified from clinical samples, deep sequenced, and compared to the smallpox vaccine reference sequence (Acambis 2000)
[Figure: F13L amplicon variants by time point]
Control swab prior to ST-246
2 weeks after ST-246: T>A at 943, C>T at 869
3 weeks after ST-246: C>T at 869, T>A at 943
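The core calculation behind a deep-sequencing comparison like this is the fraction of reads carrying a non-reference base at each position. The sketch below assumes reads that are already aligned and padded to the reference length; the function name and the toy sequences are illustrative and are not the clinical data.

```python
from collections import Counter


def variant_frequencies(aligned_reads, reference, pos):
    """Return ({alt_base: fraction}, depth) at a 0-based reference position.

    Assumes each read is aligned to the reference and padded to its length;
    any character other than A, C, G, T (gap, N) is ignored.
    """
    counts = Counter(read[pos] for read in aligned_reads if read[pos] in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return {}, 0
    ref_base = reference[pos]
    alts = {base: n / depth for base, n in counts.items() if base != ref_base}
    return alts, depth


# Toy example: 2 of 10 reads carry a C>T change at index 3.
reference = "GATCCA"
reads = ["GATCCA"] * 8 + ["GATTCA"] * 2
freqs, depth = variant_frequencies(reads, reference, pos=3)
print(freqs, depth)
```

With thousands of reads per amplicon, the same count can reveal a resistant subpopulation well before it dominates the sample.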
What is Metagenomics?
• The genomic study of DNA from uncultured microorganisms, generally from environmental samples
• Related
– Metatranscriptomics
– Metaproteomics
Sample Coverage
Rarefaction Curves
[Figure: rarefaction curves, one per sample]
Wooley JC, Godzik A, Friedberg I (2010) A Primer on Metagenomics. PLoS Comput Biol 6(2).
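A rarefaction curve plots how many distinct taxa have been seen as more reads are sampled; a curve that flattens suggests the sample has been covered. The sketch below is a simplified version that walks a single random ordering of per-read taxon labels. The function name and toy labels are assumptions, and averaging over repeated shuffles would give a smoother curve.

```python
import random


def rarefaction_curve(taxon_labels, step, seed=0):
    """Return [(reads_sampled, distinct_taxa_seen), ...] for one random ordering."""
    rng = random.Random(seed)
    shuffled = list(taxon_labels)
    rng.shuffle(shuffled)
    seen = set()
    curve = []
    for i, label in enumerate(shuffled, start=1):
        seen.add(label)
        # Record a point every `step` reads and at the very end.
        if i % step == 0 or i == len(shuffled):
            curve.append((i, len(seen)))
    return curve


# Toy community: 100 reads assigned to 3 taxa.
labels = ["taxon_a"] * 50 + ["taxon_b"] * 30 + ["taxon_c"] * 20
curve = rarefaction_curve(labels, step=10)
for n_reads, n_taxa in curve:
    print(n_reads, n_taxa)
```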
Classification Techniques
• Supervised Taxonomic Classification
– Homology-based: database searching by similarity (BLAST, Smith-Waterman); BLAST and BLASTX against GenBank or specialized databases such as NCBI-ENV-NT and NCBI-ENV-NR
– Composition-based: N-mer frequency; Markov models and Support Vector Machines (SVM), which need a training set
• Unsupervised Taxonomic Classification
– Clustering methods: SOM (self-organizing maps), PCA (principal component analysis)
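Composition-based methods start from an N-mer frequency vector for each read or contig, which can then be fed to a trained classifier or a clustering method. The sketch below computes such a vector; the function name and the default of k=4 are illustrative choices.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized counts for all 4**k DNA k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing N or other symbols are excluded from the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


vec = kmer_frequencies("ACGTACGT", k=2)
print(len(vec), vec[1])  # vec[1] is the frequency of "AC"
```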
Viral Metagenomic Pipeline
(Wash U scripts implemented at CDC)
• Sample Collection → DNA → Library Construction → Sequencing
• Basecalling → Vector Trimming → Assembly → Contigs, Reads
• Remove redundant sequences → Unique sequences
• Mask repetitive and low-complexity sequences → Good sequences
• BLASTN against Human Genome (e ≤ 1e-10) → Non-human sequences
• Non-human sequences searched with BLASTX vs nr, BLASTN vs nt, and BLASTN vs GB-viral
• Report Generation, Display in MEGAN, inspect top hits
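Two of the filtering steps in this pipeline can be sketched in a few lines: dropping redundant sequences, and discarding reads with a human BLASTN hit at e ≤ 1e-10. This is not the Wash U code. The function names are hypothetical, and the parser assumes BLAST tabular output (`-outfmt 6`), where the query ID is column 1 and the e-value is column 11.

```python
def remove_redundant(reads):
    """Keep the first read ID seen for each distinct sequence."""
    seen = set()
    unique = {}
    for read_id, seq in reads.items():
        if seq not in seen:
            seen.add(seq)
            unique[read_id] = seq
    return unique


def human_hit_ids(blast_tab_lines, max_evalue=1e-10):
    """Collect query IDs with at least one hit at or below the e-value cutoff."""
    hits = set()
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:  # column 11 = e-value
            hits.add(fields[0])              # column 1 = query ID
    return hits


def non_human(reads, blast_tab_lines, max_evalue=1e-10):
    """Return the reads that have no qualifying human hit."""
    human = human_hit_ids(blast_tab_lines, max_evalue)
    return {rid: seq for rid, seq in reads.items() if rid not in human}


# Toy data: r2 duplicates r1; r3 has a strong human hit; r4 has a weak one.
reads = {"r1": "ACGT", "r2": "ACGT", "r3": "GGCC", "r4": "TTAA"}
blast = [
    "r3\tchr1\t99.0\t100\t1\t0\t1\t100\t500\t599\t1e-50\t180",
    "r4\tchr2\t85.0\t30\t4\t0\t1\t30\t10\t39\t0.5\t30",
]
unique = remove_redundant(reads)
kept = non_human(unique, blast)
print(sorted(kept))
```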
Software for Taxonomic Classification
• MEGAN – GUI for classification based on BLAST searches
• CARMA – web-based classification using the Pfam database and HMMER alignment of protein families
• MG-RAST – classification system utilizing protein-encoding databases and several ribosomal DBs; can analyze user-provided datasets; web use only
• Geneious – commercial product
• NextGENe – commercial product
• Phymm, PhymmBL – composition-based classification system
Software for Comparative Metagenomics
• MEGAN – can display two metagenome populations on the same phylogenetic tree; uses a BLAST file as input
• STAMP – calculates statistical differences between sets of metagenomes
• XIPE-TOTEC – performs pairwise comparisons of every metagenome in the two sets and creates a distance matrix, which is then used for clustering and PCA to calculate statistical values of relatedness
Megan
Ugandan Outbreak Samples
• 4 patients
– Total RNA from patient sera
• 2 samples per 454 run
– ~565,000 reads/sample, avg length = 235 nt
• Sequences were screened for random library amplification primers and low quality
• Assembled each run de novo using the 454 gsAssembler
• Performed a BLASTX database search using the assembled contigs (overnight)
• Visualized the BLAST output using MEGAN
MEGAN (MetaGenomeANalyzer)
Ugandan Outbreak - results
• Run 1 – 5 contigs (out of 2463 > 100 nt) matched YF virus, covering 98% of the genome (10,441 of 10,823 bp)
• Mapped each sample from Run 1 using an Ethiopian YF virus as reference. 3229 individual reads from Sample 1 identified as YF.
• Run 2 – no YF reads found
Phylogenetic analysis of yellow fever virus
sequences
Laura McMullan (DHPP/VSPB)
Comparative Metagenomics – current work
• One 454 run
• Two samples
– Sample 1 – ~578,000 reads, avg read length 438 bases
– Sample 2 – ~550,000 reads, avg read length 425 bases
• Total number of bases sequenced – ~488,000,000
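The total follows from multiplying read count by average read length for each sample. Because the slide's counts and lengths are themselves rounded, the sketch below reproduces the reported figure only approximately.

```python
# Read counts and mean lengths as reported on the slide (both rounded).
samples = {
    "Sample 1": {"reads": 578_000, "mean_len": 438},
    "Sample 2": {"reads": 550_000, "mean_len": 425},
}

bases = {name: s["reads"] * s["mean_len"] for name, s in samples.items()}
total_bases = sum(bases.values())

for name, n in bases.items():
    print(f"{name}: ~{n / 1e6:.0f} Mb")
print(f"Total: ~{total_bases / 1e6:.0f} Mb")
```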
Sample 1 – Rarefaction Curve
Sample 1 Taxa tree (collapsed at the Order
level)
Comparison of Sample 1 and 2
Bioinformatics Tools
• Bioinformatics Packages
– EMBOSS
– BioInquiry
• General Tools
– Java/BioJava
– Perl/BioPerl
– BLAST Suite
– BioEdit
– GFFtoPS
• Genome Comparison/Alignment Tools
– Mavid
– Mauve
– Clustal
– Muscle
• Gene Prediction
– Glimmer
– GeneMark
• Assembly/Mapping Tools
– 454 Suite
– Mosaik Tools
– Mummer
– CLC Bio
– BWA
– Velvet
– AHA (PacBio)
• Functional Annotation
– Manatee
• Phylogenetics
– Paup
– Phylip
– MrBayes
– Beauti/Beast
– MEGA
– DnaSP
• Metagenomics
– MEGAN
– Galaxy
– Carma
• In-House
– WAMS
– POCs/VOCs
Challenges
Data Management – image files are large (1 run ~25 GB), and moving these files around the network is slow
Assembly/Mapping Software – some are provided with the instrument, but additional methods and algorithms are needed
Finishing Tools – gap filling, primer design
Visualization Tools – tools to graphically display contigs on a reference sequence as well as multiple genome alignments
Generic Robust Annotation Tools – researchers need tools to intelligently choose predicted ORFs as genes, assign function, and submit to GenBank
What are the weaknesses of current next-gen sequencers?
Complicated and time consuming library preparation
Requires micrograms of DNA to begin
3 days to prepare library
Requires amplification of library
Low copy number polymorphisms may be missed
Emulsion PCR is an inefficient, time consuming, oily mess
Potential to introduce PCR bias into sample
Instruments require repetitive sequential ‘flows’ of reagents
Repetitive flows of nucleotides, blocking/unblocking chemistry, washing out
reaction byproducts all slow synthesis and hinder read-length
Consumes liters of reagents ($)
Repetitive flows and imaging extend sequence runs to days (or weeks)
Pacific Biosciences SMRT sequencer
(single-molecule sequencer)
Ion Torrent Personal Genome Machine
(solid-state sequencer)
Nanopore sequencing
Pacific Biosciences SMRT sequencer
Sponsor: Influenza Research Agenda
Pacific Biosciences SMRT Technology
Individual ZMW (zero-mode waveguide) with attached polymerase and DNA strand
Laser excitation/detection volume: the functional volume (red) is in zeptoliters (zL)
SMRTcell = 160,000 ZMW
SMRTcell array = 1.5 million ZMW
Nucleotide incorporation is a real-time data movie
[Figure labels: glass, ~50 nm, 100 ms]
Pacific Biosciences Advantages
• Read lengths of 1,000 – 10,000 bases
• No reagent ‘flows’ = 10-fold increase in sequencing speed
• Substitute reverse transcriptase for polymerase and sequence RNA directly
• Bacterial genomes sequenced in hours
• A sequence run costs $99 and takes 15 minutes to complete
454 Sequencing
• DNA Library Prep
• emPCR Amplification
• Sequencing
• Data Analysis
454 Sequencing: DNA Prep
• Nebulization
– DNA is sheared with high-pressure nitrogen to create fragments ~300-800 bases long
• Repair Ends
– double-stranded pieces are purified, blunt-ended, and phosphorylated
• Adaptor Ligation
– two different adaptors, A and B, are ligated to the fragments
– each adaptor is 44 bases long: 20-base PCR primer, 20-base sequencing primer, 4-base key
– the B adaptor contains a biotin tag for immobilization
– ligation forms 4 different products: A-A, A-B, B-A, B-B
• Fragment Immobilization
– fragments are immobilized on streptavidin-coated magnetic beads; A-A fragments carry no biotin, so they do not bind and are washed away
• Single-strand Isolation
– bound fragments are denatured, and the released strands (containing both an A and a B tag) form a single-stranded template DNA library
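The selection logic of the library prep can be modeled by enumerating the four ligation products and applying the two filters described above: binding requires a biotinylated B adaptor, and the final library keeps only strands carrying both an A and a B tag. The function names are illustrative, and the second filter simply encodes the slide's description rather than the underlying bead chemistry.

```python
from itertools import product

ADAPTORS = ("A", "B")
BIOTINYLATED = {"B"}  # per the slide, only the B adaptor carries biotin


def ligation_products():
    """All adaptor combinations on the two ends of a fragment."""
    return ["-".join(pair) for pair in product(ADAPTORS, repeat=2)]


def binds_streptavidin(fragment):
    """A fragment binds the beads if at least one end is biotinylated."""
    return any(end in BIOTINYLATED for end in fragment.split("-"))


def in_final_library(fragment):
    """Released template strands carry both an A and a B tag."""
    return set(fragment.split("-")) == {"A", "B"}


products = ligation_products()
bound = [f for f in products if binds_streptavidin(f)]
library = [f for f in bound if in_final_library(f)]
print(products, bound, library)
```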
454 Sequencing: emulsionPCR
Emulsion-based clonal PCR
• Annealing
– fragments are annealed to primer-tagged “catcher” beads
– conditions are optimized to anneal a single strand to a single bead
• Distribution in a water-oil emulsion
– the captured DNA and beads, along with amplification reagents, are placed in a water-oil mixture
– each bead is captured in a “bubble” that forms its own small “micro-reactor”
– the emulsion is thermocycled, creating millions of copies of a single clonal fragment in each “micro-reactor”
– the beads are cleaned up and the DNA is denatured
454 Sequencing: Sequencing by Synthesis
• Bead Preparation – the sequencing primer is annealed, and polymerase and cofactors are added
• Bead Deposition – beads are layered on a picotiter plate (wells are 44 μm), then enzyme beads and packing beads are added
454 Sequencing: Sequencing by Synthesis (cont.)
• Sequencing
– enzyme beads contain sulfurylase and luciferase; packing beads help keep reaction beads in position
– a fluidics system delivers sequencing reagents, flowing the nucleotides one at a time in a specific order across the wells
454 Sequencing: Sequencing by Synthesis (cont.)
• Sequencing
– if a nucleotide is incorporated, a pyrophosphate is released, which is converted to ATP by the sulfurylase
– luciferase uses the ATP to oxidize luciferin, producing oxyluciferin and light
– the light emission is measured with a CCD camera
– light intensity indicates nucleotide incorporation and scales with the number of bases added in that flow
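Because each flow delivers a single nucleotide and the light signal grows with the number of bases incorporated, a read can be reconstructed from the per-flow intensities. The sketch below is a deliberately simple base caller. It assumes signals are normalized so that 1.0 means one incorporated base and that the flows cycle in the order T, A, C, G; both are assumptions for illustration, and real base callers also correct for signal decay and phasing.

```python
def call_bases(flow_signals, flow_order="TACG"):
    """Convert normalized per-flow light signals into a base sequence.

    Each signal is rounded to the nearest whole number of incorporations,
    so a signal near 2.0 is called as a two-base homopolymer.
    """
    called = []
    for i, signal in enumerate(flow_signals):
        base = flow_order[i % len(flow_order)]
        n_bases = max(0, int(signal + 0.5))  # round half up, never negative
        called.append(base * n_bases)
    return "".join(called)


# Flows: T=1.02, A=0.05, C=0.98, G=2.10, T=0.00, A=1.00
read = call_bases([1.02, 0.05, 0.98, 2.10, 0.00, 1.00])
print(read)
```

Rounding becomes less reliable as the true homopolymer length grows, which is why long homopolymer runs are the characteristic error mode of this chemistry.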
454 Sequencing: Sequencing by Synthesis (cont.)
• Characteristics
– flow of the four nucleotides is repeated for one hundred cycles, resulting in an average read length of 300-500 bases
– system averages ~1,000,000 high-quality wells
– therefore, a typical run yields over 400 million high-quality bases
454 Sequencing:
Paired End Protocol