Class 11 - 2nd “next generation” seq. method

Review other sequencing methods
• Sanger
• Ion Torrent
New method
Begin to consider uses of DNA seq.

Old method - Sanger chain termination (Nobel)
Clone target DNA in bac. to get the ~10^11 copies needed for 4 seq. rxns:
DNA template + primer + pol + dNTPs + ddATP (or ddCTP etc., each in a separate tube)
ddNTPs lack the 3’ OH; they are incorporated normally but can’t be extended
Run gel w/ 4 lanes; bands in G lane show sizes of frags. ending in G, etc.

Dideoxy NTPs lack the 3’ OH group
They are incorporated normally, but the next base can’t be chemically attached because it attaches through the 3’ O of the missing OH

More elegant later method:
• label each ddNTP with a diff. colored fluor
• run electrophoresis of products in a single lane
• camera records color of products as they run off the bottom of the gel

Ion Torrent Method
Part A: produce ~10^7 copies of individual DNA fragments on micron-sized beads, because the sequencing method requires multiple identical target molecules per bead
Part B: read sequence by primer extension synthesis, 1 base at a time, detecting the pH change when dNTPs are incorporated in individual wells containing single beads, using an array of ion-sensitive field effect transistors (ISFETs)

Part A - method to put many copies of a single short piece of DNA on a micron-size bead; diff. DNAs on diff. beads
Shear target DNA; select pieces ~200 bp in length (how?)
Ligate forked adapter oligos to ends of sheared DNA
[Diagram: forked adapters on fragment ends, strand ends labeled F and R’]
Note this allows all pieces to be amplified with oligos F and R (R is the reverse complement of R’ and differs from F)
(Without the fork, F and F’ would be at the 5’ and 3’ ends of each strand, and their annealing within single templates would impede pcr)

Make water-in-oil emulsion containing:
1) pcr reagents to amplify DNA using primers F and R
2) hydrophilic micron-size beads with lots of oligos F attached via their 5’ ends
3) bead and DNA concentrations adjusted such that there is ~1 DNA fragment and 1 bead per water droplet
Each droplet acts like a test tube to isolate an individ. DNA species.
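Rough illustration of item 3) above (why the concentrations need adjusting). This is only a sketch: it assumes fragments and beads land in droplets independently and follow Poisson statistics, which the slides do not state; the function names and the mean values are made up for the example.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k items in a droplet when the average is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def droplet_fractions(mean_dna: float, mean_beads: float) -> dict:
    """Fractions of droplets in a few categories (assumed independent Poisson loading)."""
    p_no_dna = poisson_pmf(0, mean_dna)
    p_one_dna = poisson_pmf(1, mean_dna)
    p_one_bead = poisson_pmf(1, mean_beads)
    return {
        # exactly 1 fragment and exactly 1 bead: gives a clonal bead
        "usable": p_one_dna * p_one_bead,
        # no fragment at all: any bead in this droplet stays empty
        "no_dna": p_no_dna,
        # 2 or more fragments: a bead here would carry mixed templates
        "multi_dna": 1.0 - p_no_dna - p_one_dna,
    }


if __name__ == "__main__":
    for mean in (0.1, 0.5, 1.0):
        print(mean, droplet_fractions(mean_dna=mean, mean_beads=mean))
```

Under these assumptions, lowering the average number of fragments per droplet reduces mixed-template droplets but leaves more beads empty, which is presumably why beads are enriched before loading into wells.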
Because many copies of F are on each bead, many product strands (~10^7) starting with F get attached to each bead
Break emulsion with soap, spin down beads, melt off the non-covalently attached strand, spin down beads again; most beads now have single-stranded DNA starting with F and ending with R’
Centrifuge enriched beads into wells just big enough to hold a single bead

Part B: to get sequence, add primer R, DNA pol and a single dNTP, e.g. dATP
If T is the next base on the template, A will be incorporated, generating ~10^7 H+ ions (dATP -> incorporated dAMP + PPi + H+)
If T is not the next base, no H+ will be produced
http://upload.wikimedia.org/wikipedia/commons/1/10/DNTP_nucleotide_incorporation_reaction.svg

Key ideas and innovations in Illumina Method
Biochemistry
• “bridging” pcr to get array of ~10^8 DNA spots on glass slide, each containing ~10^4 copies of an individual ~200 bp DNA species in ~1 µm area
• sequencing by synthesis, 1 base at a time, using dNTPs with removable fluors and 3’ blocking groups
• reading ~35 b from both ends of each DNA species to get seqs. that should be a known distance apart in ref. seq.
Image analysis - automated collection and analysis of ~10^6 microscope images/run
Informatics - mapping short seq. runs to genome

First challenge - how to assemble multiple copies of individual templates on the solid surface where sequencing will be done
[Diagram: fragments with forked adapter ends, strands labeled A, A’, B, B’]
Shear genomic DNA (nebulizer) into segments ~200 - 2000 bp
“Blunt” ends w/ DNA pol
Ligate “forked” adapter oligonucleotide
Pcr w/ oligos complementary to adapter seq. (forked ends) -> A, B at 5’ ends of alternate strands of all fragments

Substrate = glass flow cell, 8 channels ~100 µm height, thin layer of polyacrylamide applied in each channel
Polyacrylamide contains bromo… (BRAPA), which covalently links to the phosphorothioate group on the 5’ end of new primers
The 3’ ~20 bases of the attached primers match those of oligo A or B used to pcr the genomic fragments, so melted amplified genomic fragments anneal to the attached primers
Primer ext. w/ DNA pol makes a copy of 1 strand of a particular genomic fragment at some spot on the surface

Next challenge - make multiple copies of each fragment in a small region of the substrate surface (to have enough copies to get a strong sequencing signal)
Now melt off template
Newly synthesized strand anneals at its 3’ end to a nearby, 5’-attached oligo A
[Diagram: bridged strands on surface, labeled A, A’, B, B’]
Repetition grows a thicket of both strands of a particular genomic fragment in a small spot on the surface = “bridging” pcr; note all strands are covalently attached via their 5’ ends
For an unexplained reason they do this surface pcr by repeated cycles of chemical rather than thermal denaturation

Image of DNA fragments on surface after bridging pcr; each fragment is labeled (during sequencing) with 1 of 4 differently colored fluors by method explained below
Each spot = “polony” or “cluster” of many copies of a single DNA fragment
Spot diameters ~1 µm; each spot contains ~10^4 strands -> primers ~10 nm apart; areal density c/w initial conc. of annealed genomic fragments ~3 pM

Next challenge - how to make surface pcr’d DNA single-stranded to serve as sequencing template
Clever method - cut one strand of DNA at a chemically sensitive site (*) engineered in oligo B, then melt off non-coval. attached DNA, add free primer B that anneals to distal (3’) end of attached template, extend B w/ pol
How to make the single-strand cut? Put a diol-modified base in attached oligo B; the diol can be chemically cleaved by periodate

How to sequence other end of template?
[Diagram: attached oligos A and B, diol site in B, step (ii)]
After sequencing the 1st strand, melt off primer-ext. product, perform another cycle of bridging pcr (ii), make a single-stranded cut in attached oligo A, melt off the oligo A extension product, seq. w/ soluble primer A
Note you need a new way to make the ss cut in oligo A so you can make the A and B cuts separately; here are 2 ways:
• Synthesize oligo A with uracil (U) instead of T at a given position; enzyme uracil glycosylase removes uracil (not normally in DNA); heat or high pH then breaks the A strand at site of removed U
• Alternative: put oxoG in place of one G in oligo A; enzyme Fpg glycosylase removes the abnormal oxoG; heat or high pH then breaks the A strand at site of removed G
Novel use of enzymes that remove abnormal bases (repair mutations in vivo), plus ability to insert abnormal bases during oligo synthesis, makes this possible

Additional complication: any free 3’ ends on DNA on surface might “fold back” and serve as primer for a competing sequencing rxn
They block this by enzymatically adding a nucleotide w/ blocked 3’ OH group to all DNAs before adding seq. primer

How is sequence read biochemically? They synthesized novel nucleotides!
[Diagram: modified nucleotide; base T modified with fluor; sugar 3’ azide group (N3) blocks extension]
A, C and G similarly modified but with diff. colored fluors; only one base is added at a time due to 3’ blocking group
Treatment with TCEP removes the fluor and the 3’ blocking group, which allows the next nucleotide to be added and its color detected (prev. fluor is removed)
Amazing that bulky, unnatural chemical groups left attached do not inhibit polymerase or mess up base-pairing
They say they had to engineer (mutate) DNA polymerase to get it to incorporate these modified bases efficiently
This is another innovative step!
Repeated cycles of flowing in polymerase plus 4 modified nucleotides (1 of which gets incorporated in a given spot), washing, taking picture, treating with TCEP -> sequence

Picture taken at step n during sequencing run; all strands in a given cluster are labeled with A, C, G or T depending on the sequence at the nth base of the template strand.
How does spot density compare to Ion Torrent?
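The repeated cycle described above (add 4 blocked, labeled nucleotides; image; TCEP) can be sketched for a single cluster as follows. This is an idealized toy, assuming every strand in the cluster extends and deblocks completely at each cycle; the color assignments and function names are hypothetical, not taken from the slides.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Hypothetical colors; the slides only say the 4 fluors differ in color.
FLUOR_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}


def sequence_by_synthesis(template: str, n_cycles: int) -> list[tuple[str, str]]:
    """Idealized reversible-terminator cycles for one cluster.

    `template` lists the template bases in the order the polymerase
    encounters them, starting just past the sequencing primer.
    """
    calls = []
    for template_base in template[:n_cycles]:
        # Only the complementary blocked dNTP is incorporated; its 3' block
        # prevents a second addition in the same cycle.
        incorporated = COMPLEMENT[template_base]
        # "Take picture": the cluster shows the color of the incorporated base.
        calls.append((incorporated, FLUOR_COLOR[incorporated]))
        # TCEP step: fluor and 3' block are removed, so the next cycle can
        # add exactly one more base (nothing to model in this ideal case).
    return calls


def read_sequence(calls: list[tuple[str, str]]) -> str:
    """Join per-cycle base calls into the read."""
    return "".join(base for base, _color in calls)


if __name__ == "__main__":
    calls = sequence_by_synthesis("TACGGA", n_cycles=4)
    print(read_sequence(calls))
```

The read is the complement of the template in the order read, one base per image.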
Image analysis technology and innovations
Note they use TIRF microscopy to reduce background; only see fluors within < 1 µm of surface
Why “custom” objective?
How big is a typical microscope field of view (FOV) at 60x magnification? Imagine FOV expanded 60x in each direction and mapped to a 3x3 mm CCD
How many images would they need to cover the ~10 cm^2 flow cell surface?
How long would it take to collect these images serially if they have to move the slide 1 FOV between images?
Their “custom” lens gives them ?? (0.1 mm)^2 FOV
How many sets of images do they need (1 for each base addition)?
How long does it take to collect data for 1 run? ~week

Do they need to align the spots in images of the same FOV taken hours apart? Automated spot alignment program
Cross-talk of different fluors - they need to adjust image intensities to correct for “red” fluorescence of “green” fluor, etc. to get the best estimate of which dNTP was incorp.
If base extension or deblocking is not complete for all strands in a cluster, different nucleotides will be incorporated at subsequent steps and the purity of the fluorescence signal will erode (phasing prob.)
Quality control measures are used to decide when base calling is unreliable; e.g. purity filter: intensity of 1 base must be > .6 of the sum of it plus the next brightest base in the 1st 12 positions

# errors determined by sequencing DNA with known seq.
[Figure: # errors/35 bases]
Even with QC criteria to select good reads, get only ~35 b of reliable seq.!

How does Illumina method differ from Sanger / Ion Torrent?
How to get many copies of template
• Sanger: clone in bacteria
• Ion Torrent: emulsion pcr -> beads
• Illumina: bridging pcr on glass surface
Seq. rxn / biochemistry
• Sanger: dye-labeled ddNTP chain terminators
• Ion Torrent: normal dNTPs, 1 at a time
• Illumina: reversibly 3’-blocked dye-labeled dNTPs, 4 at a time
Read out
• Sanger: gel electrophoresis of labeled DNA; size => pos.
• Ion Torrent: ISFET detects H+ released on base incorp. in each well
• Illumina: sequential photos show order of base addition in each cluster
Seq. length
• Sanger: ~1000
• Ion Torrent: ~100
• Illumina: ~35

Informatics - mapping short seq. reads to genome
2 programs used to look for matches betw. the ~35 b end seqs. they obtain for a cluster and ref. seq.
ELAND - finds all seqs. in reference that match the first 32 bases of cluster seq., allowing up to 2 mismatches but no gaps; then sees which of these best match the cluster seq. at any bases beyond 32
MAQ - more sophisticated in allowing gaps betw. ref. and cluster seq., so picks up more matches with small “indels”, but potentially more errors

If genome seq. were random, what length seq. would be unique (unlikely to occur more than once)?
Complication: some sequences > 35 b occur many times
“Selfish” genes have replicated and re-inserted in different positions in the genome, e.g.
• short interspersed nuclear elements (SINEs, Alu): ~300 bp; ~10^6 copies (~10% of genome)
• long interspersed nuclear elements (LINEs): ~6000 bp; ~10^5 copies (~20% of genome)
Two features help assignment of 35 b reads to the correct position in genome:
• they know the paired end read should map to the other DNA strand about 200 bp away in the reference sequence
• each region of DNA is read many times, so they can just map the consensus sequence for any segment

Tests of quality
How uniformly does their data cover the ref. seq.? If some DNA segments don’t amplify well (? due to high GC content) they might be absent in their seq.
If cluster seq. is a random sample of ref. seq., the Poisson dist. predicts how many times, n, a ref. seq. base should appear in the cluster seq.:
$p_n = e^{-m} m^n / n!$, where m = aver. # times a base is sequenced
m = 130 Gb of cluster seq. / 3 Gb per genome = 43
Fig. 2: Take every 50th base of ref. seq.; how many times is an overlapping frag. found in a cluster seq. mapped to the ref. seq.? Make a histogram of the # of such bases found n times in the cluster seq. data set. For interest, consider separately bases that don’t occur in repetitive elements like SINEs and LINEs (unique only)
The dist. is pretty close to Poisson (only slt. more samples in tails), so the method seems to sample pretty randomly

Does GC content affect how often a region is sampled?
Plot # times a particular base is sequenced in the data set as a function of the GC content of the seq. in which it occurs
Only cluster sequences with the most extreme GC contents were sequenced less than the average ~40 times
So what? If a seq. (with extreme GC content) is undersampled, you might get only the maternal or paternal copy (allele) in the seq. data set and so miss finding a polymorphism (false negative)

Next evaluation - compare how often SNPs are identified by seq. vs. SNP hybridization assays (“GT, genotyping”)
Note this company makes a SNP hybrid. assay, so it is working hard on technology that may replace its current platform!
[Table, using ELAND program]
• std version of hybrid. assay (GT) w/ .5M SNPs
• latest version of hybrid. assay w/ >3M SNPs
• <1% discordant calls
• most often the array assay (GT) finds a SNP missed by seq.
[Same table, using MAQ program] seq. does slt. better, but in general GT and seq. have similar fail-to-detect rates
Their new, favorite set of SNPs with least ambiguity
Most GT failures-to-detect are due to the person carrying so variant a seq. that it fails to hybridize to anything on the chip
Most seq. failures-to-detect are due to low sampling rate of one allele

But seq. picked up ~1M new SNPs in this person! Why?
Std SNP panels were selected for SNPs that occur fairly frequently in the population
This individual is of African ancestry - ?underrepresented in std SNP panel
Maybe most of us carry lots of “private” SNPs that are very rare in the population

How can you get information about structural changes larger than 35 bases from 35-base-long reads? Use info from paired end reads!
Idea:
• label ends of genomic DNA segments w/ biotin nucleotides (B) using DNA pol
• circularize DNA segments (ligate diluted sample)
• re-shear DNA; purify biotinylated DNA
• make clusters as before and read seq. of ends of junction frags.
Now the sequences at opposite ends of the small frags. come from genomic DNA regions separated by the length of the circularized fragments; also, their orientation wrt each other is flipped
If you can map both end sequences to the genome, you can find:
• deletions (end seqs. further apart in ref. seq. than circularized fragment length)
• insertions (end seqs. closer together in ref. seq. than circ. frag. length)
• inversions (orientation reversed)

They identified 1000s of >50 bp deletions, many of which were known selfish DNA elements present in the reference seq. but not in the seq. of the person whose DNA they analyzed
90% of these are SINEs present in reference but not in this individual
60% are LINEs
They also found 2345 insertions
How many are in coding sequences? How many are homozygous?

Map of a region containing an inversion flanked by 2 small deletions. What do symbols represent?
Note ~2 kb region of ref. seq. with no read pairs (green); “short insert” pairs flanking this region (orange) map to sites ~2 kb apart in ref. but ~.5 kb apart in this sample (i.e. they span the deletion)

Last level of complexity - bio-medical interpretation of seq. information
Example - variability greater in certain areas of genome, e.g. parts of X chromosome - why?

Potentially medically relevant findings - your DNA is likely similar!
• 26,140 SNPs in protein-coding regions
• 5,361 encode non-conservative amino acid changes
• 153 encode premature terminations
“many of which are expected to affect protein function”
(excerpt of Table 9)

Summary - Impressive accomplishment!
Innovations in many fields - all needed for a useful product
• molecular biology: bridging pcr to get ~10^4 copies of individual fragments arrayed on surface; nicking tricks to convert pcr products to ss for sequencing and to get the complementary ss for sequencing; new dNTPs with reversibly blocked 3’ ends and chemically removable fluors, to seq. 1 base at a time; engineered DNA pols that use these new dNTPs
• photonics, data acquisition, informatics …
Lots of detail -> fuller explanation than Ion Torrent

Major challenges remaining
• quantifying errors
• methods for resequencing variants for confirmation
• identifying structural variants larger than the pieces of DNA sequenced - e.g. deletions, insertions, duplications, inversions
• speeding up (parallelization of) data acquisition
• interpretation - clinical significance of variants; implications for human biology

Some key ideas you should take away from today:
• How they get an array of spots, each with many copies of a DNA to sequence
• How they get sequence, 1 base at a time, using reversible dye terminator chem.
• How they get information about structural variants larger than the 35 bp runs (paired end reads)
• How over-sequencing (fold-coverage) helps
• How they evaluate seq. accuracy
• What kind of mutational load we are all likely to carry in our DNA

Next topic - Why sequence? Clinical issues assoc. w/ seq.
1. basic biology
• determine amino acid seq. of proteins
• learn role of non-coding seq.
• study evolutionary relationships
2. medical applications
• inherited disease (e.g. CF): diagnosis, prenatal dx, disease risk prediction, dis. mechanisms
• drug sensitivity (“personalized med”)
• sequence variants assoc. w/ disease but not causal
• cancer mutations - identify drugs to use/not use
• diagnose microbial infections
3. non-medical applications, e.g. plant engineering, forensics

Problems/challenges with interpreting seq. info.
• accuracy - even an error rate of 0.001% -> ~3*10^4 errors in 3*10^9 bp human genome seq.
• how to re-check possible mutations
• predictive value - how reliable are clinical assoc.?
• how useful if you can’t (for now at least) change outcome (Alzheimer’s)
• will it lead to unnecessary additional testing?
• cancer - how rapidly do cancers develop new mutations to become resistant to rx
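Worked arithmetic for the accuracy point in the list above, as a small sketch (function name is made up; it assumes each base is called once, with errors independent of position):

```python
def expected_errors(error_rate_percent: float, genome_bp: float) -> float:
    """Expected number of miscalled bases for a given per-base error rate (in %)."""
    return error_rate_percent / 100.0 * genome_bp


if __name__ == "__main__":
    n = expected_errors(error_rate_percent=0.001, genome_bp=3e9)
    print(f"{n:.0f} expected errors")
```

An error rate of 0.001% is 10^-5 per base, so 3*10^9 bp gives about 3*10^4 wrong bases, i.e. the 10^4 order of magnitude quoted above. This is why candidate mutations need to be re-checked.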