Next-Generation Sequencing: Lecture I

Next-Generation Sequencing:
Data, Methods, Analysis and Implications
Lecture I
January 31, 2011
Peter J. Park
Center for Biomedical Informatics
Harvard Medical School
Children’s Hospital Boston/Brigham and Women’s Hospital
Breakthroughs in Technology
• An essential tool in the molecular biology toolkit is the ability
to read the base sequence of DNA molecules
• Rapid DNA sequencing in the 1970s
– Sanger
– Gilbert and Maxam
• Microarrays in late 1990s and 2000s
– cDNA arrays
– oligonucleotide arrays
Microarrays
oligonucleotide arrays
cDNA arrays
Images courtesy of Bioteach
Microarrays
• Extremely successful
• Popular applications: Gene expression profiling, DNA copy
number (comparative genomic hybridization), SNPs,
microRNAs, ChIP-chip (tiling arrays), splicing (exon arrays)
Disadvantages:
• One must know the sequences to design the array
• Even if one knows the sequences, one cannot fit all of them in
a small number of arrays
• High noise level due to cross-hybridization, non-linearity, etc.
What is Next-Generation Sequencing?
• One can sequence hundreds of millions
of short sequences (35bp-100bp) in a
single run
• Illumina/Solexa GA II / HiSeq 2000
• Life Technologies/Applied
Biosystems SOLiD
• Roche/454 FLX, Titanium
• Helicos
• Pacific Biosciences
• CompleteGenomics
Illumina Genome Analyzer
• 1 “flow cell” = 8 “lanes”
• 1 lane = ~10-30 million “reads”
• ~5-20 million “mapped reads”
• 36bp, 50bp, 75bp, 100bp
• Single-end (SE) or Paired-ends (PE)
• 1 lane: $800-$2000
• Single-end or paired-ends
• Multiplexing
Illumina: Sequencing-by-synthesis
Multiplexing
• We may not need to generate so many reads per sample
• Multiplexing: Pool samples into a single lane of a flow cell
• Add a short “index” to tag libraries
• Current Illumina multiplexing kit
– six-base oligos
– currently 12 unique tags; 12 tags x 8 lanes = 96 samples/run
• Easy in theory but has not been easy in practice
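The index-based pooling described above can be sketched in a few lines. This is only an illustration, not Illumina's demultiplexing software: the two six-base tags, the sample names, and the `max_mismatches` parameter are hypothetical, and reads are assumed to arrive as (index, sequence) pairs.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(reads, index_to_sample, max_mismatches=0):
    """Group (index, sequence) pairs by sample.

    A read is assigned only if exactly one known tag lies within
    max_mismatches of its index; otherwise it goes to 'undetermined'.
    """
    bins = defaultdict(list)
    for index, seq in reads:
        hits = [sample for tag, sample in index_to_sample.items()
                if len(tag) == len(index) and hamming(tag, index) <= max_mismatches]
        sample = hits[0] if len(hits) == 1 else "undetermined"
        bins[sample].append(seq)
    return dict(bins)

# Hypothetical six-base tags and toy reads, for illustration only.
index_to_sample = {"ATCACG": "sample_A", "CGATGT": "sample_B"}
reads = [("ATCACG", "AAAA"), ("CGATGT", "CCCC"),
         ("ATCACC", "GGGG"), ("NNNNNN", "TTTT")]

strict = demultiplex(reads, index_to_sample)
tolerant = demultiplex(reads, index_to_sample, max_mismatches=1)
print(strict)
print(tolerant)
```

Allowing one mismatch rescues reads whose index contains a sequencing error, at the cost of requiring tags that differ from each other at several positions.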
Leading Platforms
With 3730s, ~60Mb per year
Specifications as of summer 2008
              454          Solexa/Illumina      SOLiD (ABI)
Bp per run    400 Mb       2-3 Gb               3-6 Gb
Read length   250-400 bp   35-50 (70-100) bp    35-50 bp
Run time      10 hr        2.5 days             5 days
Download      20 min       27 hr (44 min)       ~1 day
Analysis      2-5 hr       2 days               2-3 days
Files         20-50 Gb     1T                   1T
Latest Platforms: Illumina HiSeq
• ~1 billion clusters
• 30x coverage of two human genomes
in a single run
• ~10K per sample?
• 1 x 35bp: ~1.5 days, ~30Gb
• 2 x 50bp: ~4 days, 75-100Gb
• 2 x 100bp: ~8 days, 150-200Gb
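As a rough check on the yield figures above, coverage is simply total sequenced bases divided by genome size. The sketch below assumes a haploid human genome of about 3.1 Gb (an assumed round number, not stated on the slide) and ignores unmapped or duplicate reads.

```python
GENOME_SIZE_BP = 3.1e9  # assumption: approximate haploid human genome size

def yield_for_coverage(coverage, n_genomes=1, genome_size=GENOME_SIZE_BP):
    """Total bases needed to reach a mean coverage over n_genomes genomes."""
    return coverage * n_genomes * genome_size

def coverage_from_reads(n_reads, read_length, genome_size=GENOME_SIZE_BP):
    """Mean coverage obtained from n_reads reads of a given length."""
    return n_reads * read_length / genome_size

needed_gb = yield_for_coverage(30, n_genomes=2) / 1e9
print(f"30x of two human genomes needs about {needed_gb:.0f} Gb")

lane_cov = coverage_from_reads(20e6, 50)
print(f"20 million 50bp reads give about {lane_cov:.2f}x coverage")
```

Under these assumptions, two genomes at 30x fall inside the 150-200Gb range quoted for a 2 x 100bp run.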
SOLiD 5500xl:
• With microbeads or nanobeads
• 20-45 Gb/day
• 12 lanes
• Similar run times as HiSeq
• Up to 180-300Gb per run
Rapid Decrease in Cost
• The Human Genome
Project: 13 years and $3
billion.
• Sequencing of the Watson
Genome by 454 in 2007:
$2 million
• Illumina: eight days at a
cost of about $10,000.
• ~10^4 reduction in 5 yrs
• Claims: a genome in 15
minutes for $1000?
Source: The Economist
ABI SOLiD (Seq by Oligo Ligation/Detection)
• Clonal bead library via emulsion PCR
• The actual base detection is no longer done by the
polymerase-driven incorporation of labeled dideoxy
terminators.
• SOLiD uses a mixture of labeled oligonucleotides and queries
the input strand with ligase.
• Each base is interrogated twice
– built-in error checking capability that distinguishes between
measurement errors and true polymorphisms
– detection of more complicated variations
SOLiD Technology
Ligation-based
chemistry with dibase
labelled probes
• Oligos:
– Positions 1-2 (from 3’ side): one of 16 dinucleotides
– Positions 3-5: degenerate (Ns)
– Positions 6-5’: degenerate and holds one of four fluorescent dyes
• 5-7 ligation reactions are followed by a reset cycle
• Next a new initial primer is used that is N-1 in length
Working in “Colorspace”
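A minimal sketch of two-base ("color space") encoding, assuming the commonly described scheme in which each color is the XOR of 2-bit codes for adjacent bases (A=0, C=1, G=2, T=3). The slide does not spell out the color assignment, so treat the mapping and the function names here as assumptions.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit base codes
BITS_TO_BASE = "ACGT"

def encode_colorspace(seq):
    """Return the first base plus one color (0-3) per adjacent base pair."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode_colorspace(first_base, colors):
    """Rebuild the base sequence from the known first base and the colors."""
    bases = [first_base]
    for c in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ c])
    return "".join(bases)

first, colors = encode_colorspace("ATGGC")
print(first, colors)
print(decode_colorspace(first, colors))

# A true single-base change (ATGGC -> ATCGC) alters two adjacent colors,
# whereas a single miscalled color would alter only one.
_, snp_colors = encode_colorspace("ATCGC")
changed = [i for i, (x, y) in enumerate(zip(colors, snp_colors)) if x != y]
print(changed)
```

Because every base contributes to two colors, an isolated color change is more likely a measurement error than a polymorphism, which is the error-checking property mentioned for SOLiD.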
Helicos
• True Single Molecule Sequencing™
• No amplification
• Very easy sample prep
• Sept 2009:
– Nature Biotechnology: ‘Single-molecule sequencing of an
individual human genome’
– 24-70bp reads, 28x coverage
• Measuring a small amount of
DNA (3-6ng) is difficult
• Alignment is tricky
Pacific Biosciences
• Single Molecule Real Time (SMRT) technology
• Each chip with waveguides – a 100-nm hole to watch DNA
polymerase perform sequencing by synthesis; phospholinked
nucleotides labeled with colored fluorophore are introduced
• Long reads, short run times, high quality
• 1000-1200bp reads (5% 3-5K), fast and low cost per run
Eid et al, Science, 2009
Ion Torrent Personal Genome Machine
• When a nucleotide is incorporated into
a strand of DNA by a polymerase, a
hydrogen ion is released
• A high-density array of wells (using
semiconductor technology) with each
well holding a different DNA template.
Beneath the well is an ion-sensitive
layer and a sensor
• Sequentially floods the chip with one
nucleotide after another
• If a match, a hydrogen ion is released
and the change in the pH of the
solution is detected
• 10Mb of “high-quality” sequence
• Runs in ~2 hours
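The flow-by-flow detection described for this platform can be illustrated with a toy simulation. It is a simplified sketch: the flow order "TACG" is an assumption, the signal is taken to be exactly proportional to the number of bases incorporated, and the input string is the strand being synthesized (in practice that strand is the complement of the template).

```python
def simulate_flows(synthesized, flow_order="TACG", n_flows=8):
    """Return one (nucleotide, signal) pair per flow.

    The signal is the number of bases incorporated in that flow: 0 when the
    flowed nucleotide does not match the next base, and more than 1 for a
    homopolymer run.
    """
    pos = 0
    signals = []
    for i in range(n_flows):
        nuc = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(synthesized) and synthesized[pos] == nuc:
            run += 1
            pos += 1
        signals.append((nuc, run))
    return signals

def call_bases(signals):
    """Turn per-flow signals back into a base sequence."""
    return "".join(nuc * n for nuc, n in signals)

signals = simulate_flows("TTAGC")
print(signals)
print(call_bases(signals))
```

The toy model also shows why homopolymers are a concern for flow-based chemistries: the run length has to be inferred from signal intensity in a single flow.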
Access to Platforms
• As in most new technologies, getting good data from a
sequencer initially is not trivial
• This is especially the case if you have only one or two
machines
• The situation has improved dramatically in the past 2-3 years
as the technology has become more stable
• The cost ($500K-$800K) is still prohibitive for most universities
• NIH funded many machines through their “large
instrumentation” program
• The big genome centers have substantial advantage in
technology development
• Future landscape?
Stock Prices
Data Analysis
Problems with NGS data
• Reads are short
– difficult to assemble/map repetitive regions
• Not all sequences are equally likely to be sequenced
– GC content
– fragment length
• Amplification bias
• Sequencing errors
– especially toward the end
• Variable quality/turn-around
Are Short Reads Useful?
• But a big problem with repetitive regions!
Francesco Ferrari
Error Rate
• Error rate is high in the first 1-2 bases of a read
• It increases exponentially toward the end of the read
Wang et al, Nature 456: 470, 2008
Kircher et al. Genome Biology 2009
Quality Score
• Each base position in a sequence comes with a “quality
score”.
• This measures the probability that a base is called incorrectly,
by a phred-like algorithm similar to that originally developed
for Sanger sequencing experiments.
• The quality score of a given base, Q, is defined by
Q = -10*log10(e), where e is the estimated probability of the base
call being wrong.
• A quality score of 20 represents an error rate of 1 in 100, with
a corresponding call accuracy of 99%.
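The quality-score definition above translates directly into code. The sketch below converts between Q and error probability and decodes an ASCII-encoded quality string; the `offset` parameter is an assumption about the file encoding (33 is used in Sanger-style FASTQ, 64 in some Illumina pipeline versions).

```python
import math

def phred_from_error(e):
    """Q = -10 * log10(e) for an error probability e."""
    return -10 * math.log10(e)

def error_from_phred(q):
    """Invert the definition: e = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def decode_quality_string(qual, offset=33):
    """Convert an ASCII quality string to integer scores.

    The offset depends on the file format and is an assumption here.
    """
    return [ord(c) - offset for c in qual]

print(error_from_phred(20))
print(phred_from_error(0.001))
print(decode_quality_string("I5"))
print(decode_quality_string("hT", offset=64))
```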
Quality Scores
• 100-bp reads
• 40 is the highest, 0 is lowest
[Figure: inter-laboratory variation in quality score along 100 bp reads; panels: Illumina (internal), a large genome sequencing center]
• Data from the 1000 genomes project
• Different samples but same population
• Consistent across many samples
Francesco Ferrari
Data Generation Pipeline
Image Processing
Base-calling
Genome Alignment
Data format
• qseq.txt file
Data Management
• Raw data are large; to be kept for ~6 months?
• Processed data (e.g., BAM files) are manageable for most
people: ~1GB for 20 million reads (50bp)
• Alignment is not a big issue for most investigators
• More of an issue for a facility: HiSeq recommends 32 CPU
cores, 4 GB RAM each
• Whole-genome sequencing:
– A 30X coverage genome pair (tumor/normal): ~500GB
– 50 genome pairs: ~25TB
• “Why can’t I get a 1TB drive at Costco for $100?”
• We generally want high-performance, replicated storage
• At HMS, ~$700/TB/year; non-redundant storage: $200/TB
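The storage figures above follow from simple arithmetic, sketched here with the slide's own numbers (500GB per tumor/normal pair, $700/TB/year replicated, $200/TB non-redundant). Decimal units (1 TB = 1000 GB) are an assumption.

```python
def storage_tb(n_pairs, gb_per_pair=500):
    """Total storage in TB for n tumor/normal genome pairs (1 TB = 1000 GB)."""
    return n_pairs * gb_per_pair / 1000

def annual_cost_usd(tb, usd_per_tb_year=700):
    """Yearly cost of keeping tb terabytes on managed storage."""
    return tb * usd_per_tb_year

total_tb = storage_tb(50)
print(f"50 genome pairs: {total_tb:.0f} TB")
print(f"Replicated storage: ${annual_cost_usd(total_tb):,.0f} per year")
print(f"Non-redundant storage: ${annual_cost_usd(total_tb, 200):,.0f}")
```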
How To Transfer Data
• It is difficult to download data via http or ftp
• A commercial software/protocol is becoming popular (Aspera
“next-generation file transport”)
• This can give 400-800Mbps
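To put the quoted transfer rates in perspective, the sketch below estimates how long one 30X tumor/normal pair (~500GB, a figure from the data-management discussion) would take to move. It assumes the full rate is sustained and that GB and Mbps are decimal units.

```python
def transfer_hours(size_gb, rate_mbps):
    """Hours needed to move size_gb gigabytes at rate_mbps megabits/second."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (rate_mbps * 1e6)
    return seconds / 3600

slow = transfer_hours(500, 400)
fast = transfer_hours(500, 800)
print(f"500 GB at 400 Mbps: {slow:.1f} h")
print(f"500 GB at 800 Mbps: {fast:.1f} h")
```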
Genome Alignment
• One can specify how many mismatches are to be tolerated
• This can also be quantified by accounting for quality scores
• A typical criterion might be 1-2 mismatches for 36bp reads
• From the raw sequences, ~50-80% of the reads are typically
aligned to the genome
– sequencing errors
– multiple matches in the genome
– deviations from the reference genome (SNPs, insertions, etc)
– problems with the aligner
• This % of mapped reads is a good measure of data quality
• Often need to normalize using an “alignability map”
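The mismatch criterion above can be made concrete with a brute-force scan. This is an illustration of the acceptance rule only, not how production aligners search; the function names and the quality-weighted score (sum of base qualities at mismatching positions) are assumptions chosen for clarity.

```python
def mismatch_positions(read, ref_window):
    """Indices at which the read differs from an equal-length reference window."""
    return [i for i, (r, g) in enumerate(zip(read, ref_window)) if r != g]

def naive_align(read, reference, max_mismatches=2):
    """Scan every offset and return (position, n_mismatches) for accepted hits."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mm = mismatch_positions(read, reference[pos:pos + len(read)])
        if len(mm) <= max_mismatches:
            hits.append((pos, len(mm)))
    return hits

def mismatch_quality_sum(read, ref_window, quals):
    """Quality-aware penalty: mismatches at low-quality bases cost less."""
    return sum(quals[i] for i in mismatch_positions(read, ref_window))

reference = "ACGTACGTTAGC"
hits = naive_align("ACGTT", reference, max_mismatches=1)
print(hits)
penalty = mismatch_quality_sum("ACGTT", "ACGTA", [40, 40, 40, 40, 10])
print(penalty)
```

A read with more than one accepted hit illustrates the "multiple matches" problem: an aligner must either pick the best-scoring position or report the read as ambiguous.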
Genome Alignment
• Dynamic programming can be used to find the local
alignments between a text T and a pattern P in O(|T||P|) time
• The genome is too big for this approach
• How to find an exact match?
– Sort all 36mers in the reference genome
– Search the sorted list in log(N) steps
• The genome must be ‘indexed’
• A BWT (Burrows-Wheeler Transformation) index for the
human genome occupies just around 1 GB
• Exact matches are too stringent, so heuristic approaches are
needed
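The "sort all k-mers, then binary-search" idea for exact matching can be sketched as follows. It uses a toy genome and k = 4 rather than 36, holds everything in memory, and handles only reads of exactly length k; those simplifications are assumptions for illustration.

```python
from bisect import bisect_left

def build_sorted_kmer_index(genome, k):
    """Sorted list of (k-mer, position) for every position in the genome."""
    return sorted((genome[i:i + k], i) for i in range(len(genome) - k + 1))

def find_exact(index, read):
    """All genome positions where the read (of length k) occurs exactly.

    bisect_left locates the first entry in O(log N) comparisons; the loop
    then collects the run of identical k-mers.
    """
    i = bisect_left(index, (read, -1))
    positions = []
    while i < len(index) and index[i][0] == read:
        positions.append(index[i][1])
        i += 1
    return positions

genome = "ACGTACGTTAGC"
index = build_sorted_kmer_index(genome, 4)
print(find_exact(index, "ACGT"))
print(find_exact(index, "TTAG"))
print(find_exact(index, "GGGG"))
```

Storing every k-mer explicitly costs roughly k times the genome size, which is one motivation for compressed indexes such as the BWT.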
Popular Aligners
• Generate a sorted list of genomic oligomers or a hash table
– eland
– MAQ
• Burrows-Wheeler Transformation
– Bowtie
– BWA
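For the Burrows-Wheeler family, exact matching is typically done by backward search over an FM-index. The sketch below is a textbook-style toy, not the Bowtie or BWA implementation: it builds the suffix array by naive sorting and stores full occurrence tables, which real tools avoid through sampling and compression.

```python
def bwt(text):
    """Return the BWT of text + '$' and the suffix array used to build it."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa), sa

def fm_index(bwt_str):
    """C[c]: count of characters smaller than c; occ[c][i]: count of c in bwt_str[:i]."""
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt_str) + 1) for ch in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ

def backward_search(pattern, C, occ, n):
    """Half-open suffix-array interval [lo, hi) of suffixes starting with pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi

def find(pattern, text):
    """Sorted positions of exact occurrences of pattern in text."""
    b, sa = bwt(text)
    C, occ = fm_index(b)
    lo, hi = backward_search(pattern, C, occ, len(b))
    return sorted(sa[i] for i in range(lo, hi))

print(find("ACGT", "ACGTACGTTAGC"))
print(find("TTAG", "ACGTACGTTAGC"))
```

Each pattern character costs a constant number of table lookups, so search time depends on the read length rather than the genome size.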
Program   Algorithm       Long reads   Gapped   PairedEnds   Use of quality info
Bowtie    FM-index        No           No       Yes          Yes
BWA       FM-index        Yes          Yes      Yes          No
MAQ       hashing reads   No           Yes      Yes          Yes
Mosaik    hashing ref.    Yes          Yes      Yes          No
Name: Description
BLAT: BLAST-Like Alignment Tool. Can handle one mismatch in initial alignment step.
Bowtie: Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome; 1.3 GB memory footprint for human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.
BWA: Uses a Burrows-Wheeler transform to create an index of the genome. It's a bit slower than bowtie but allows indels in alignment.
ELAND: Implemented by Illumina. Includes ungapped alignment with a finite read length.
GMAP and GSNAP: Robust, fast, short-read alignment. GMAP: singleton reads; GSNAP: paired reads. Useful for digital gene expression, SNP and indel genotyping.
MAQ: Ungapped alignment that takes into account quality scores for each base.
MOSAIK: Fast gapped aligner and reference-guided assembler. Aligns reads using a banded Smith-Waterman algorithm seeded by results from a k-mer hashing scheme. Supports reads ranging in size from very short to very long.
RazerS: No read length limit. Hamming or edit distance mapping with configurable error rates. Configurable and predictable sensitivity (runtime/sensitivity tradeoff). Supports paired-end read mapping.
SHRiMP: Indexes the reads instead of the reference genome. Uses masks to generate possible keys. Can map ABI SOLiD color space reads.
SLIDER: Slider is an application for the Illumina Sequence Analyzer output that uses the "probability" files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences.
SOAP: Robust with a small (1-3) number of gaps and mismatches. Speed improvement over BLAT, uses a 12 letter hash table. Now SOAP2 is much faster than the first version.
SOCS: For ABI SOLiD technologies. Significant increase in time to map reads with mismatches (or color errors). Uses an iterative version of the Rabin-Karp string search algorithm.
SSAHA: Fast for a small number of variants.
Taipan: de-novo Assembler for Illumina reads
based on http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
Applications of NGS
• If you build it, they will come!
• Whole-genome sequencing
• de novo genome assembly (much harder with shorter reads)
• Variant detection (mutations, SNPs, indels, copy number)
• Targeted resequencing (e.g., exons)
• ChIP-seq
– Protein-DNA binding, histone modifications, nucleosomes
• Expression profiling:
– RNA-seq – splicing variants
– Digital expression profiling
• Small RNA sequencing
• and many more.
Number of Publications in Pubmed
Kahvejian, Quackenbush, Thompson, Nature Biotech 26:1125, 2008