Class 11 – 2nd “next generation” seq. method
Review other sequencing methods
Sanger
Ion torrent
New method
Begin to consider uses of DNA seq.
Old method – Sanger
chain termination (Nobel Prize):
clone target DNA in bac.
to get ~10^11 copies
needed for 4 seq rxns:
DNA template + primer
+ pol + dNTP + ddATP
(or ddCTP etc., each in
separate tube);
ddNTP’s lack 3’OH,
incorporate normally
but can’t be extended;
run gel w/4 lanes; bands
in G lane show size of
frags. ending in G, etc.
Di-deoxy NTPs
lack 3’OH group
They are incorporated
normally, but
next base can’t be
chemically attached
because it attaches
thru 3’ O
missing OH
More elegant later method:
label each ddNTP with a diff. colored fluor
run electrophoresis products in single lane
camera records color of products as they
run off the bottom of the gel
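The gel read-out can be sketched in a few lines. This is a toy model, not a description of any real instrument: it assumes every possible chain-terminated product is present, and the function names are made up for illustration. Each "lane" holds the lengths of products ending in one base, and reading bands from shortest to longest recovers the sequence of the newly made strand.

```python
def ddntp_lanes(new_strand):
    """For each base, list the lengths of chain-terminated products ending in that base."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Read bands from shortest to longest across all four lanes."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = ddntp_lanes("GATTACA")
print(lanes["A"])       # lengths of the products that end in A
print(read_gel(lanes))  # sequence recovered from band order
```

With four colored fluors in a single lane the same logic applies: the color takes the place of the lane label.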
Ion Torrent Method
Part A: produce ~10^7 copies of individual DNA fragments
on µm-sized beads because sequencing method
requires multiple identical target molecules/bead
Part B: read sequence by primer extension synthesis,
1 base at a time, detecting pH change when dNTPs
are incorporated in individual wells containing single
beads, using array of ion-sensitive field effect transistors
(ISFETs)
Part A - method to put many copies of single short piece
of DNA on micron-size bead; diff. DNAs on diff. beads
Shear target DNA; select pieces ~200 bp in length (how?)
Ligate forked adapter oligos to ends of sheared DNA
[Figure: sheared fragment with forked adapters; arm labels F and R’ at each end]
Note this allows all pieces to be amplified with
oligos F and R (R = the reverse complement of R’; R ≠ F)
(without fork, F and F’ would be at 5’ and 3’ ends and
their annealing on single templates would impede pcr)
Make water-in-oil emulsion containing:
1) pcr reagents to amplify DNA using primers F and R
2) hydrophilic micron-size beads with lots of oligos F
attached via their 5’ ends
3) bead and DNA concentration adjusted such that
~1 DNA fragment and 1 bead/water droplet
Each droplet acts like test
tube to isolate individ. DNA
species. Because many
copies of F are on each
bead, many product strands
(~10^7) starting with F get
attached to each bead
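Why aim for ~1 fragment per droplet? A rough sketch, assuming (this is not stated on the slides) that fragments land in droplets independently, so the number per droplet is Poisson distributed. The function names are illustrative only.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events when the average is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def droplet_loading(mean_fragments):
    """Fractions of droplets with 0, exactly 1, or more than 1 DNA fragment."""
    empty = poisson_pmf(0, mean_fragments)
    single = poisson_pmf(1, mean_fragments)
    return {"empty": empty, "single": single, "multiple": 1.0 - empty - single}


for mean in (0.1, 0.5, 1.0):
    print(mean, droplet_loading(mean))
```

Under this assumption a lower mean gives fewer mixed (multi-fragment) droplets but more empty ones, which is one reason beads carrying DNA need to be enriched before loading the wells.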
Break emulsion with soap, spin down beads,
melt off non-covalently attached strand, spin
down beads - most now have single-stranded
DNA starting with F and ending with R’
Centrifuge enriched beads into wells just big enough to
hold a single bead
Part B: to get sequence, add primer R, DNA pol and a single
dNTP, e.g. dATP; if T is next base on template, A will be incorporated, generating ~10^7 H+ ions as dATP -> dAMP + PPi + H+
If T is not
the next base,
no H+ will
be produced
http://upload.wikimedia.org/wikipedia/commons/1/10/DNTP_nucleotide_incorporation_reaction.svg
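A minimal sketch of the flow-by-flow read-out for one well. Assumptions: the flow order "TACG" is arbitrary, the signal is taken as exactly proportional to the number of bases added in that flow (so a run of identical bases gives a proportionally larger pH signal), and there is no noise. `new_strand` is the strand being synthesized, i.e. the complement of the template.

```python
def flow_signals(new_strand, flow_order="TACG", n_cycles=10):
    """Return (dNTP flowed, number of bases incorporated) for each flow."""
    signals = []
    pos = 0
    for flow in range(n_cycles * len(flow_order)):
        base = flow_order[flow % len(flow_order)]
        count = 0
        # The polymerase keeps adding this dNTP while it matches the next position.
        while pos < len(new_strand) and new_strand[pos] == base:
            count += 1
            pos += 1
        signals.append((base, count))
        if pos == len(new_strand):
            break
    return signals


def call_bases(signals):
    """Convert per-flow incorporation counts back into a sequence."""
    return "".join(base * count for base, count in signals)


signals = flow_signals("TTAG")
print(signals)
print(call_bases(signals))
```

A flow with count 0 corresponds to the "no H+ produced" case.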
Key ideas and innovations in Illumina Method
Biochemistry
• “bridging” pcr to get array of ~10^8 DNA spots on
glass slide, each containing ~10^4 copies of an
individual ~200 bp DNA species in ~1 µm area
• sequencing by synthesis, 1 base at a time, using
dNTPs with removable fluors and 3’ blocking groups
• reading ~35b from both ends of each DNA species to
get seq that should be known distance apart in ref. seq.
Image analysis – automated collection and analysis of ~10^6
microscope images/run
Informatics – mapping short seq. runs to genome
First challenge – how to assemble multiple copies
of individual templates on solid surface where
sequencing will be done
[Figure: forked adapter on a sheared fragment; labels A, A’, B, B’]
Shear genomic DNA (nebulizer) into segments
~200 – 2000 bp
“Blunt” ends w/ DNA pol
Ligate “forked” adapter oligonucleotide
Pcr w/ oligos complementary to adapter seq forked ends
A, B -> at 5’-ends of alternate strands of all fragments
Substrate = glass flow cell, 8 channels ~100 µm height,
thin layer of polyacrylamide applied in each channel
Polyacrylamide contains bromo… (BRAPA) which covalently
links to phospho-thioate group on 5’ end of new primers
3’ ~20 bases of attached primers match those of oligo A or B
used to pcr the genomic fragments, so melted
amplified genomic fragments anneal to the attached
primers. Primer ext. w/ DNA pol makes copy of 1 strand
of particular genomic fragment at some spot on surface
Next challenge – make multiple copies of each fragment
in small region on substrate surface (to have enough
copies to get a strong sequencing signal)
Now melt off template
Newly synthesized strand anneals at its
3’-end to nearby, 5’-attached oligo A
[Figure: attached strand bridging between surface oligos A and B; labels A, A’, B, B’]
Repetition grows thicket of both strands of particular
genomic fragment in small spot on surface (“bridging” pcr);
note all strands are covalently attached via 5’ ends
For unexplained reason they do this surface pcr by repeated
cycles of chemical rather than thermal denaturation
Image of DNA fragments on
surface after bridging pcr;
each fragment is labeled
(during sequencing) with
1 of 4 differently colored
fluors by method
explained below
Each spot = “polony” or
“cluster” of many copies of
single DNA fragment
Spot diameters ~1 µm; each spot contains ~10^4 strands;
-> primers ~10 nm apart; areal density c/w initial conc.
of annealed genomic fragments ~3 pM
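The strand spacing can be checked with simple geometry. A sketch, assuming a circular spot ~1 µm in diameter holding ~10^4 strands spread evenly (the even spacing is an idealization):

```python
import math


def mean_spacing_nm(spot_diameter_um, n_strands):
    """Average center-to-center spacing if strands evenly tile a circular spot."""
    radius_nm = spot_diameter_um * 1000 / 2
    area_nm2 = math.pi * radius_nm**2
    return math.sqrt(area_nm2 / n_strands)


print(round(mean_spacing_nm(1.0, 1e4), 1))  # spacing in nm
```

The result is just under 10 nm, in line with the ~10 nm figure.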
Next challenge – how to make surface pcr’d DNA
single-stranded to serve as sequencing template
Clever method – cut one strand of DNA at chemically
sensitive site (*) engineered in oligo B, then melt off
non-coval. attached DNA, add free primer B that anneals
to distal (3’) end of attached template, extend B w/pol
How to make the single-strand cut?
Put diol modified base in attached oligo B; diol can be
chemically cleaved by periodate
How to sequence other end of template?
[Figure: attached oligos B (with diol) and A; (ii) = second round of bridging pcr]
After sequencing 1st strand, melt off primer-ext. product,
perform another cycle of bridging pcr (ii), make single-stranded
cut in attached oligo A, melt off oligo A
extension product, seq. w/ soluble primer A
Note you need a new way to make ss cut in oligo A so you
can make the A and B cuts separately; here are 2 ways:
Synthesize oligo A with uracil U instead of T at given
position; enzyme uracil glycosylase removes uracil (not
normally in DNA); heat or high pH then breaks A strand
at site of removed U
Alternative: put oxoG in place of one G in oligo A; enzyme
Fpg glycosylase removes abnormal oxoG; heat or
high pH then breaks A strand at site of removed G
Novel use of enzymes that remove abnormal bases
(repair mutations in vivo) plus ability to insert abnormal
bases during oligo synthesis makes this possible
Additional complication: any free 3’ ends on DNA
on surface might “fold-back” and serve as
primer for competing sequence rxn
They block this by enzymatically adding
nucleotide w/blocked 3’OH group
to all DNAs before adding seq.
primer
How is sequence read biochemically?
They synthesized novel nucleotides!
base
T modified with fluor
sugar
3’ azide group N3 blocks extension
A, C and G similarly modified but with diff. colored fluors;
only one base is added at a time due to 3’ blocking group
Treatment with TCEP
removes fluor and
3’ blocking group,
which allows next
nucleotide to be added
and its color detected,
(prev. fluor is removed)
Amazing that bulky, unnatural chemical groups left attached
do not inhibit polymerase, or mess up base-pairing
They say they had to engineer (mutate) DNA polymerase to
get it to incorporate these modified bases efficiently
This is another innovative step!
Repeated cycles of flowing in polymerase plus 4 modified
nucleotides (1 of which gets incorporated in given spot),
washing, taking picture, treating with TCEP -> sequence
Picture taken at step n during
sequencing run; all strands in
a given cluster label with A, C,
G or T depending on sequence
at nth base in template strand.
How does spot density compare
to ion torrent?
Image analysis technology and innovations
Note they use TIRF microscopy to reduce background,
only see fluors within < 1 µm of surface
Why “custom” objective?
How big is typical microscope field of view (FOV)
at 60x magnification? Imagine FOV expanded
60x in each direction and mapped to 3x3mm CCD
How many images would they need to cover ~10 cm^2 flow
cell surface?
How long would it take to collect these images serially
if they have to move slide 1 FOV between images?
Their “custom” lens gives them ?? (0.1 mm)^2 FOV
How many sets of images do they need (1 for each base
addition)? How long does it take to collect data
for 1 run? ~week
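One way to work the field-of-view questions, as a sketch. It takes the FOV side as CCD size divided by magnification (3 mm / 60 = 0.05 mm), assumes square fields tiled with no overlap, and reads the custom-lens FOV as 0.1 mm on a side; all of these are simplifying assumptions.

```python
def fields_needed(surface_cm2, fov_side_mm):
    """Number of non-overlapping square fields needed to tile the surface."""
    surface_mm2 = surface_cm2 * 100.0  # 1 cm^2 = 100 mm^2
    return surface_mm2 / fov_side_mm**2


standard_fov_mm = 3.0 / 60  # 3 mm CCD imaged at 60x magnification
print(round(fields_needed(10, standard_fov_mm)))  # standard 60x objective
print(round(fields_needed(10, 0.1)))              # assumed 0.1 mm custom FOV
```

Doubling the FOV side cuts the number of fields, and so the stage moves per sequencing cycle, by a factor of 4.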
Do they need to align the spots in images of the same FOV
taken hours apart? Automated spot alignment program
Cross-talk of different fluors – they need to adjust image
intensities to correct for “red” fluorescence of “green”
fluor, etc to get best estimate of which dNTP was incorp.
If base extension or deblocking is not complete
for all strands in cluster, different nucleotides
will be incorporated at subsequent steps, purity
of fluorescence signal will erode (phasing prob.)
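A crude way to see why phasing limits read length. Assume (hypothetically) that each strand completes both extension and deblocking with the same fixed probability at every cycle and that a strand that slips never gets back in step; the 99% value below is illustrative, not a measured number.

```python
def in_phase_fraction(step_efficiency, n_cycles):
    """Fraction of strands in a cluster still in step after n_cycles."""
    return step_efficiency**n_cycles


for cycles in (10, 35, 100):
    print(cycles, round(in_phase_fraction(0.99, cycles), 3))
```

Even a small per-cycle failure rate compounds, so the signal from the correct base shrinks steadily relative to the out-of-phase background.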
Quality control measures used to decide when base calling
is unreliable; e.g. purity filter: intensity of 1 base must be
> .6 sum of it plus next brightest base in 1st 12 positions
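The purity filter can be sketched as follows. Each cycle is represented as four channel intensities (after cross-talk correction). This version requires every one of the first 12 cycles to pass, which is an assumption; the production pipeline may be more lenient.

```python
def purity(intensities):
    """Brightest / (brightest + second brightest) for one cycle's four channels."""
    top, second = sorted(intensities, reverse=True)[:2]
    if top + second == 0:
        return 0.0  # no signal at all counts as impure
    return top / (top + second)


def passes_purity_filter(cluster_intensities, threshold=0.6, n_cycles=12):
    """True if every one of the first n_cycles has purity above the threshold."""
    return all(purity(cycle) > threshold for cycle in cluster_intensities[:n_cycles])


clean = [[900, 50, 30, 20]] * 12
mixed = [[900, 50, 30, 20]] * 11 + [[500, 450, 10, 10]]
print(passes_purity_filter(clean), passes_purity_filter(mixed))
```

A cluster that is really two overlapping clusters tends to show two bright channels at many cycles and is rejected.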
# errors determined by sequencing DNA with known seq.
[Figure: # errors/35 bases]
Even with QC criteria to select good reads, get only ~35 b reliable seq.!
How does Illumina method differ from Sanger / ion torrent?

                     Sanger                 ion torrent            Illumina
How to get many      clone in bacteria      emulsion pcr           bridging pcr
copies of template                          -> beads               on glass surface
seq rxn/             dye-labeled ddNTP      normal dNTPs,          reversibly 3’-blocked
biochemistry         chain terminators      1 at a time            dye-labeled dNTPs,
                                                                   4 at a time
read out             gel electrophoresis    ISFET detect.          sequential photos
                     of labeled DNA;        H+ released by         see order of base
                     size => pos.           base incorp.           addition in
                                            in each well           each cluster
seq. length          ~1000                  ~100                   ~35
Informatics – mapping short seq. reads to genome
2 programs used to look for matches betw. the ~35b
end seq. they obtain for a cluster and ref seq.
ELAND – finds all seq. in reference that match first 32
bases of cluster seq, allowing up to 2 mismatches
but no gaps; then sees which of these best match
cluster seq at any bases beyond 32
MAQ – more sophisticated in allowing gaps betw. ref.
and cluster seq., so picks up more matches with
small “indels”, but potentially more errors
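The ELAND-style rule (match the first 32 bases with up to 2 mismatches and no gaps, then rank candidates using the remaining bases) can be illustrated with a naive scan. This is not ELAND's actual algorithm, which would need indexing to be fast on a 3 Gb genome; the function names are made up for illustration.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_matches(read, reference, seed_len=32, max_mismatches=2):
    """All reference positions where the read's seed matches without gaps."""
    seed = read[:seed_len]
    hits = []
    for start in range(len(reference) - len(seed) + 1):
        mismatches = hamming(seed, reference[start:start + len(seed)])
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def best_hit(read, reference, seed_len=32, max_mismatches=2):
    """Among seed hits, pick the one with fewest mismatches over the full read."""
    hits = seed_matches(read, reference, seed_len, max_mismatches)

    def total_mismatches(hit):
        start, _ = hit
        window = reference[start:start + len(read)]
        # Bases running off the end of the reference count as mismatches.
        return hamming(read, window) + (len(read) - len(window))

    return min(hits, key=total_mismatches, default=None)


reference = "TTTTTTTTTT" + "ACGTTGCAAGCTTAGGCTAACCGGTTAACGTAGCTAGGATCC" + "GGGGGGGGGG"
read = list(reference[10:45])
read[5] = "A"  # introduce one mismatch (e.g. a SNP or a sequencing error)
print(best_hit("".join(read), reference))  # (start position, seed mismatches)
```

Allowing gaps, as MAQ does, needs an alignment step rather than a position-by-position comparison.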
If genome seq. were random, what length seq. would
be unique (unlikely to occur more than once)?
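A back-of-the-envelope answer, assuming a random genome with equal base frequencies and ignoring the second strand: a given sequence of length L is expected to occur about G / 4^L times by chance in a genome of G bases.

```python
import math


def min_unique_length(genome_size_bp):
    """Smallest L for which 4**L exceeds the genome size."""
    return math.ceil(math.log(genome_size_bp, 4))


def expected_chance_matches(genome_size_bp, length):
    """Expected number of chance occurrences of one specific sequence."""
    return genome_size_bp / 4**length


genome = 3e9
print(min_unique_length(genome))
for length in (16, 20, 35):
    print(length, expected_chance_matches(genome, length))
```

By this estimate 35 bases is far more than enough for a random genome, which is why repeated sequences, not chance matches, are the real problem.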
Complication: some sequences >35b occur many times
“selfish” genes have replicated and re-inserted
in different positions in the genome, e.g.
short interspersed nuclear elements (SINES, Alu)
~300 bp; ~10^6 copies (~10% of genome)
long interspersed nuclear elements (LINES)
~6000 bp; ~10^5 copies (~20% of genome)
Two features help assignment of 35b reads to
correct position in genome
they know the paired end read should map
to other DNA strand about 200 bp away
in reference sequence
each region of DNA is read many times, so
they can just map consensus sequence
for any segment
Tests of quality
How uniformly does their data cover the ref. seq.?
If some DNA segments don’t amplify well (? due to
high GC content) they might be absent in their seq.
If cluster seq. is random sample of ref seq., Poisson dist.
predicts how many times, n, a ref. seq. base should
appear in cluster seq.
p_n = e^{-m} m^n / n!
where m = aver. # times a base is sequenced:
m = 130 Gb of cluster seq / 3 Gb per genome = 43
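The Poisson prediction is easy to evaluate numerically. A sketch using the slide's numbers (130 Gb of sequence over a 3 Gb genome); the log-space form is only there to avoid overflow in n! for large n.

```python
import math


def poisson_pmf(n, m):
    """p_n = e^{-m} m^n / n!, computed in log space."""
    return math.exp(-m + n * math.log(m) - math.lgamma(n + 1))


mean_depth = 130e9 / 3e9  # average number of times each base is sequenced

print(round(mean_depth, 1))
print(poisson_pmf(0, mean_depth))                          # chance a base is never read
print(sum(poisson_pmf(n, mean_depth) for n in range(10)))  # chance of fewer than 10 reads
```

If sampling were truly random, essentially no base would be missed at this depth, so bases with zero or very low coverage point to systematic bias rather than bad luck.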
Fig. 2 Take every 50th base of ref seq.; how many times is an
overlapping frag. found in a cluster seq. mapped to the
ref seq.? Make a histogram of the # of such
bases found n times in the cluster seq data set. For
interest, consider separately bases that don’t occur in
repetitive elements like SINES and LINES (unique only)
The dist. is pretty close
to Poisson (only slt. more
samples in tails), so the
method seems to sample
pretty randomly
Does GC content affect how often a region is sampled?
Plot # times a particular base
is sequenced (in the data set)
as function of GC content of
seq. in which it occurs.
Only cluster sequences
with most extreme GC
contents were sequenced less
than the average ~40 times
So what? If a seq. (with extreme GC content) is undersampled, you might get only the maternal or paternal
copy (allele) in the seq data set and so miss finding a
polymorphism (false negative)
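To put a number on this, assume each read covering a heterozygous site comes from the maternal or paternal copy with equal probability, independently of the other reads (a simplification; real SNP callers also require several reads per allele).

```python
def prob_one_allele_only(depth):
    """Chance that all reads at a heterozygous site come from the same allele."""
    if depth < 1:
        return 1.0  # no reads at all: the polymorphism is certainly missed
    return 0.5 ** (depth - 1)  # 2 alleles * (1/2)**depth


for depth in (2, 5, 10, 40):
    print(depth, prob_one_allele_only(depth))
```

At the average ~40x depth this risk is negligible, but it rises quickly in regions sampled only a few times.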
Next evaluation – compare how often SNPs are identified
in the seq. vs. SNP hybridization assays (“GT, genotyping”)
Note this company makes SNP hybrid. assay, so it is working
hard on technology that may replace its current platform!
Using ELAND program:
std version of hybrid.
assay (GT) w/.5M SNPs
latest version of hybrid.
assay w/ >3M SNPs
<1% discordant calls
most often the array
assay (GT) finds a
SNP missed by seq.
Same table, using MAQ program, seq. does slt. better,
but in general GT and seq. have similar fail-to-detect rates
Their new, favorite set of
SNPs with least ambiguity
Most GT failures-to-detect are due to person carrying so variant a seq.
that it fails to hybridize to anything on the chip
Most seq. failures-to-detect are due to low sampling rate of one allele
But seq. picked up ~1M new SNPs in this person!
Why?
Std SNP panels selected for SNPs that occur fairly
frequently in population
This individual of African ancestry - ?underrepresented
in std SNP panel
Maybe most of us carry lots of “private” SNPs
that are very rare in the population
How can you get information about structural changes
larger than 35bases from 35base long reads?
Use info from paired end reads!
Idea – label ends of genomic DNA segments w/ biotin
nucleotides (B) using DNA pol
circularize DNA segments (ligate diluted sample)
re-shear DNA; purify biotinylated DNA; make clusters
as before and read seq of ends of junction frags.
Now sequence at opposite ends of small frags comes from
genomic DNA regions separated by length of circularized
fragments; also, orientation wrt each other is flipped
If you can map both end sequences to genome, you can find
deletions (end seq. further apart in ref. seq. than
circularized fragment length), insertions (end seq. closer
together in ref. seq. than circ. frag. length), inversions
(orientation reversed)
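The decision rule for one mapped read pair can be written out directly. A sketch: `tolerance` (the allowed spread in fragment length) and the function name are hypothetical, and a real caller would require several independent pairs supporting the same event.

```python
def classify_pair(ref_distance, expected_distance, orientation_as_expected, tolerance):
    """Classify one read pair by where and how its two ends map in the reference."""
    if not orientation_as_expected:
        return "inversion"
    if ref_distance > expected_distance + tolerance:
        return "deletion"   # ends further apart in ref. than in the sample
    if ref_distance < expected_distance - tolerance:
        return "insertion"  # ends closer together in ref. than in the sample
    return "concordant"


# Ends map ~2 kb apart in the reference but the fragment is only ~0.5 kb long.
print(classify_pair(2000, 500, True, tolerance=100))
```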
They identified 1000s of >50bp deletions, many of which
were known selfish DNA elements present in reference seq.
but not in the seq. of the person whose DNA they analyzed
90% of these are SINES present in
reference but not in this individual
60% are LINES
They also found 2345 insertions
How many are in
coding sequences?
How many are
homozygous?
Map of a region containing an inversion flanked by
2 small deletions. What do symbols represent?
Note ~2kb region of ref
seq. with no read pairs
(green)
“short insert” pairs
flanking this region
(orange) map to sites
~2kb apart in ref. but
~.5kb in this sample
(i.e. span deletion)
Last level of complexity – bio-medical interpretation
of seq. information
Example - variability greater in certain areas of genome
e.g. parts of X chromosome - why?
Potentially medically relevant findings – your DNA is likely
similar!
26,140 SNPs in protein-coding regions
5,361 encode non-conservative amino acid changes
153 encode premature terminations
“many of which are expected to affect protein function”
excerpt of Table 9
Summary - Impressive accomplishment!
Innovations in many fields – all needed for useful product
molecular biology: bridging pcr to get ~10^4 copies
of individual fragments arrayed on surface,
nicking tricks to convert pcr products to ss for
sequencing and getting the complementary ss for
sequencing,
new dNTPs with reversibly blocked 3’ ends and chemically removable fluors, to seq. 1 base at a time
engineered DNA pols that use these new dNTPs
photonics, data acquisition, informatics …
Lots of detail -> fuller explanation than ion-torrent
Major challenges remaining
quantifying errors
methods for resequencing variants for confirmation
identifying structural variants larger than the pieces
of dna sequenced – e.g. deletions, insertions,
duplications, inversions
speeding up (parallelization of) data acquisition
interpretation – clinical significance of variants;
implications for human biology
Some key ideas you should take away from today:
How they get array of spots, each with
many copies of a DNA to sequence
How they get sequence, 1 base at a time,
using reversible dye terminator chem.
How they get information about structural
variants larger than the 35 bp runs
(paired end reads)
How over sequencing (fold-coverage) helps
How they evaluate seq. accuracy
What kind of mutational load are we all likely
to carry in our DNA
Next topic - Why sequence? Clinical issues assoc. w/seq.
1. basic biology
determine amino acid seq. of proteins
learn role of non-coding seq.
study evolutionary relationships
2. medical applications
inherited disease (e.g. CF) diagnosis, prenatal dx,
disease risk prediction, dis. mechanisms
drug sensitivity (“personalized med”)
sequence variants assoc w/disease but not causal
cancer mutations – identify drugs to use/not use
diagnose microbial infections
3. non-medical applications
e.g. plant engineering, forensics
Problems/challenges with interpreting seq. info.
accuracy – even error rate 0.001% ->
~3*10^4 errors in 3*10^9 bp human genome seq.
how to re-check possible mutations
predictive value – how reliable are clinical assoc.?
how useful if you can’t (for now at least)
change outcome (Alzheimer’s)
will it lead to unnecessary additional testing?
cancer – how rapidly do cancers develop
new mutations to become resistant to rx
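The accuracy point in the list above is simple arithmetic, sketched here so other error rates can be tried. It assumes the error rate is uniform per base, which real data will not be.

```python
def expected_errors(error_rate_percent, genome_bp):
    """Expected number of wrong base calls across a genome."""
    return error_rate_percent / 100.0 * genome_bp


for rate in (0.1, 0.001, 0.00001):
    print(rate, round(expected_errors(rate, 3e9)))
```

Since these errors are mixed in with the true variants, candidate mutations need independent re-checking.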