From Bauhaus to Bio-House
NATURE | VOL 422 | 24 APRIL 2003
|www.nature.com/nature
Past
Present
& Future
Of Genomics
Technologies
Professor of Computer Science,
Mathematics and Cell Biology
¦
Courant Institute, NYU School of Medicine, Tata Institute of
Fundamental Research, and Mt. Sinai School of Medicine
Where we collect three important tools from biotechnology: scissors, glues and copiers…
• Type II Restriction Enzyme
– Biochemicals capable of cutting the double-stranded DNA by breaking two
-O-P-O bridges on each backbone
• Restriction Site:
– Corresponds to specific short sequences: EcoRI GAATTC
– Naturally occurring protein in bacteria…Defends the bacterium from invading viral DNA…Bacterium produces another enzyme that methylates the restriction sites of its own DNA
Hae III 5’…GG | CC …3’
3’…CC | GG …5’
EcoR I 5’…G | AATT C …3’
3’…C TTAA | G…5’
Pst I 5’…C TGCA | G …3’
3’…G | ACGT C …5’
Hpa I 5’…GTT | AAC …3’
3’…CAA | TTG …5’
• DNA Ligase
– Cellular Enzyme: Joins two strands of
DNA molecules by repairing phosphodiester bonds
– T4 DNA Ligase (E. coli infected with bacteriophage T4)
• Hybridization
– Hydrogen bonding between two complementary single stranded DNA fragments, or an RNA fragment and a complementary single stranded DNA fragment… results in a double stranded DNA or a DNA-RNA fragment
• DNA Amplification:
– Main Ingredients: Insert (the
DNA segment to be amplified), Vector (a cloning vector that combines with an insert to create a replicon), Host Organism
(usually bacteria).
• PCR (Polymerase
Chain Reaction):
• Main Ingredients:
Primers, Catalysts,
Templates, and the dNTPs.
“ All science is either physics or stamp collecting.”
“For Mike’s sake,
Soddy, don’t call it transmutation.
They’ll have our heads off as alchemists.”
Rutherford, winner of
1908 Nobel prize for chemistry for cataloging alpha and beta particles…
• Two Extremes:
– Indexing : For each character ‘b’ in the genome, make a list of each position where it occurs.
– Shotgunning : For each long sentence in the genome, select it with low probability (o(lgn/n)), and then read it reasonably accurately.
• The Middle way:
– Indexed-Shotgun : For each short word in the genome, select it with high probability (o(1)), and then measure its position and read it reasonably accurately.
• Where is the middle???
• Physical Mapping & Sequencing:
– Map:
• assign physical locations to important markers (e.g., restriction sites or hybridization probes).
– Sequence:
• align short sequence reads to the markers (map-based sequence assembly) or
• align long sequence reads to each other (shotgun assembly)
Array Mapping
Optical Mapping
Sequencing
4
2
6
5
3
1
5
6
3
4
1
2 p p p
Distance ¼ 3/6 = 0.5
• A one dimensional “Buffon’s needle problem.”
• Take two points on a line, and drop unit-length needles of some color.
• The probability that the two points will have different colors monotonically increases with the distance between these two points
– as distance increases from 0 to 1;
– attains a fixed value for all distances konger than 1.
• One can generalize by considering
– More than two points…P points.
– Dropping a small set of bichromatic needles…
cX coverage subsample
M cX coverage subsample
• Probes are “points”
• BACs are “needles”
• Hybridization on an array simulates
“dropping the bichromatic needles”
High Coverage
BAC Library cX coverage subsample cX coverage subsample
• A set of P points: {x
1
, x
2
, …, x
µ [0,G] with pdf f(x) = 1/G i.i.d. for all x
2
[0,G]
P
}
• Distance d i,j
= d(|x i
–x j
|),
“measured” between two arbitrary points x i and x j
= x.
• Given O(P 2 ) distances infer positions.
0
m A
J 0
J 0
• Given a P £ P positive symmetric real-valued matrix D of “measured distances”.
– The entry d i,j
» f(d |x).
• Choose an embedding of the points:
– {x’
1
, x
2
, …, x’
P
} ½ [0,G],
– which maximizes a likelihood function
–
1 · i, j · f(|x’ i
– x’ j
| | d i,j
)
Minimizing a Quadratic Cost Function
P
1 d
1,2 d
1,4 d
1,3
P
2 d
2,3 d
2,4 d
3,4
P
4
P
3
P
1
P
2
P
3
Mass-less Balls connected with springs of different stiffness…
P
4
Join
• Consider measured distances of length L’ · q
L; Examine these distances in increasing order.
– q
2
(0,1) to be determined by the
Chernoff bounds
• Initially, every probe is a singleton contig.
• Two operations: Join and Adjust either combines smaller contigs or improve an existing contig.
Adjust
• Join and adjust locally minimizes the
“log-likelihood cost function”
– Local minimum of a weighted sum-of-square error function
Schizosaccharomyces pombe
3 chromosomes (5.5, 4.4, 2.4 Mbp)
1224 probes ( ~1 probe every 10 kb)
3072 BAC clones (~166 kb per clone) (40X cvg)
(1224! = 6.47 x 10 3249 potential maps)
Normalized log
2
probe ratios: Pool 75 data
Red probes are hits (in pool) Blue probes are nulls(not in pool)
0
-1
-2
-3
3
2
1
0 200 400 600 800
Probes as they occur in genome order
1000 1200
1200
35X Library Coverage: EM processed log ratios
1224 probes 24 BACs per pool 106 pools
800
400
0
0 200 400 600 800 1000 1200
60
40
20
0
0
Hamming distance vs. Physical distance
Probe 111
Probe 79
Probe 85
40 80
Probes in genome order
Probe 101
Probe 95
120
Optical Approaches are Inherently Noisy!
• Since many biological macromolecules are smaller than the Raleigh limit, the optical approaches involve attaching single fluorescent probes to specific macromolecules.
• Controlling Noise:
– Magnitude of Stoke-shift
– Steric hinderance
– Absorption cross-section
– Point spread function (PSF)
– Image Processing
1. Capture and immobilize whole genomes as massive collections of single DNA molecules
Cells gently lysed to extract genomic
DNA
DNA captured in parallel arrays of long single
DNA molecules using microfluidic device
Genomic DNA, captured as single DNA molecules produced by random breakage of intact chromosomes
2. Interrogate with restriction endonucleases
3. Maintain order of restriction fragments in each molecule
Digestion reveals 6-nucleotide cleavage sites as ”gaps”
4. Determine size of fragments
5. GENTIG
Robust Bayesian Map
Assembler to make wholegenome restriction map
Single DNA molecule on Optical
Chip after digestion, staining
• Image analysis software measures size and order of restriction fragments
• Overlapping single molecule maps are aligned to produce a map assembly covering an entire chromosome
Overlapping single molecule maps are aligned to produce a map assembly covering an entire chromosome
Various combinations of error sources lead to NP-hard Problems
(Single Molecule Restriction Map)
D j s
1j s
2j s
3j s
M,j
D R j s R
M,j s R
3j s R
2j s R
1j
(Single Molecule Restriction Map)
Where we design the experiments to generate easy instances of a difficult problem…
-
+
-
The probability of successfully computing the correct restriction map as a function of the number of cuts in the map and number of molecules used in creating the map…
Where we rely on statistical models of the error sources to map correctly…
Robustness of Optical Mapping Algorithm
• BAC Clones with 6-cutters
– Average Clone size = 160 Kb; Average Fragment Size = 4
Kb, & Average Number of Cutsites = 40.
• Parameters:
– Digestion rate can be as low as 10%
– Orientation of DNA need not be known.
– 40% foreign DNA
– 85% DNA partially broken
– Relative sizing error up to 30%
– 30% spurious randomly located cuts…
Algorithm Design and
Analysis jointly with
T.S.Anantharaman
P r(H | D) =
P r(D | H) £ P r(H)/
P r(D)
Gradient search for good parameters
H
1
H
2
H
3
H
4
Local gradient optimization
H
1
• “From a gene’s point of view, reshuffling is a great restorative…
• “The Y, in its solitary state disapproves of such laxity. Apart from small parts near each tip which line up with a shared section of the X, it stands aloof from the great DNA swap. Its genes, such as they are, remain in purdah as the generations succeed. As a result, each Y is a genetic republic, insulated from the outside world.
Like most closed societies it becomes both selfish and wasteful. Every lineage evolves an identity of its own which, quite often, collapses under the weight of its own inborn weaknesses.
• “Celibacy has ruined man’s chromosome.”
– Steve Jones, Y: The descent of Men, 2002.
Where we map large genomes…
• Malaria Parasite
• Deadliest of all the human
Malaria parasites:
– P. falciparum
– P. vivax
– P.malariae
– P. ovale
• Responsible for
1.5-2.7 million deaths in 1997.
Gentig Maps:
• A. Gap-free consensus BamHI &
NheI maps for all 14 chromosomes.
• B.BamHI map
• C. NheI map
• D.NheI map of
Chromosome 3 displayed by ConVEx
E. coli
•
•
•
•
•
E. coli
P. falciparum ¦ D. radiodurans ¦ Y. Pestis
Rhodobacter sphaeroides ¦ Shigella flexneri ¦ Salmonella enterica
Aspergillus fumigatus
…
P. falciparum
• The automated Gentig system is routinely used
– to map a microbe genome quickly & effortlessly
– by a scientist with no quantitative or computational training.
Y.Pestis
•Transformation from Hamiltonian
Path Problem restricted to cubic graphs.
j j j j j j
3 4 5 9 12 13 14
( 3
12
x j
1
, , 3
x j
1
, , 12 kx
j
1
, kx j
1
4
, 13 x j
2
x
, j
2
, 4
kx j
2
, , 13
, 5 kx j
2
x j
3
,
, 14
, 5 x j
3
,
kx
, 14 j
3
,
kx j
3
)
Choose p= 3/4 & k = M x j
1
1 k
1
1
2 M
j
1 x j
2
1 k
1
1
2 M
j
2 x j
3
1 k
1
1
2 M
j
3
“You should never bet against anything in science at odds of more than about 10 12 to 1.“
• Comparing Two
Genomic Restriction
Maps: Given two maps A and B , we say that they overlap, if –
1. k or more of the restriction fragments align positionally (subject to sizing error)
2. Number of unmatched fragments in either prefix is bounded by r
Effect of Partial Digestion
• Parameters:
– Partial digestion probability, p
– Relative sizing error, b
– # Restriction fragments, n
– Overlap threshold ratio, q
• m = n p = Expected # detected restriction fragments.
• Controlling False Negative:
K
5 np 4 q /2 and r
= k
1
/p 4 , k
1
¼ 2
If in fact the clones A and B overlap then we will detect it with a probability, at least
(1-exp(-k
1
)) (1 – exp(-n p 4 q /8))
• Relation among the error parameters:
3 b n p /4
5 k
5 n p 4 q /2
) p
=
(3 b /2 q ) 1/3
• Parameter choice for shotgun-mapping.
Make the partial digestion probability rather high (close to 1) or the relative sizing error as low … for instance by using a rare cutter.
1
0.9
0.8
0.7
Contour Plot as a Function of Sizing Error
(x-axis) and Digestion Rate (y-axis)
0.6
Sizing Error a
500 1000 1500 2000 2500 3000
. 1000 2000 3000
• The calculation is for the human genome, G =3,300 mb.
• The average molecule length =
5 mb, with an overlap of 1 mb
• The average restriction fragment length = 25 kb
• For a sizing error of 3 kb, the required digestion rate is ~94%
• If the sizing error is reduced to
2 kb, the required digestion rate drops to ~ 88%…
• If the sizing error is reduced to
1 kb, the required digestion rate drops to ~ 80%…
• Miss restriction site rate =
1 -
• Gaussian sizing var = s
2
L i
• False restriction sites rate =
P d l f
• Small fragments missing rate =
P v
L
• Multiple chromosomes mixed together.
D
1
D
2
D
M f(H|D
1
, L , D m
= c f(H) f(D
1
, L , D m
|H)
Conditional Probability sum of alignments right of I,J
I
H
AR
D
J
N m
1
I
1 J
0
AR UL
f
K
|
I
K
I
K
Pr Alignments with a fragment spanning
H
K
, H
K
1
1
1
I
K m
J
0
AL UR
1 N m
1 J
0
AR UL m
J
1
(
K
K
N
N ?1: 0)
M
1
I
1 P K 1 Q J 1
AL FM
AL FA
,
1
AR
K
1
AR
• M. Waterman et al. (2005)
– Draft map based on pairwise alignments.
– Draft Map refinement using HMM, analogous to sequence alignment HMMs.
Replacing Gentig:
Faster & More Accurate
Candida Albicans
• The left end of chromsome-1 of the common fungus
Candida Albicans (being sequenced by Stanford).
• You can clearly see 3 polymorphisms:
– ( A ) Fragment 2 is of size
41.19kb (top) vs 38.73kb
(bottom).
– ( B ) The 3rd fragment of size
7.76kb is missing from the top haplotype.
– ( C )The large fragment in the middle is of size 61.78kb vs
59.66kb.
• Identify location of polymorphism in data.
• Determine phasing between neighboring polymorphisms.
• Problem:
– Similar to Gentig Problem, if distributions of restriction site polymorphism (due to SNPs) and restriction fragment length polymorphism (due to indels) are known…
– How can one separate partial digestions from restriction site polymorphism and sizing error from restriction length polymorphism?
• SNPs that coincide with restriction sites:
– Restriction site missing in 50% of data.
• Insertions & deletions from 100’s of bases to several kilo bases:
– Fragment sizes appear to be drawn from two distributions with different means.
H
1
H
2
• Finding overlapped maps: O(M m log G)
• Assembling approx H: O(M m log C)
• Optimizing H & H
1
, H
2
: O(M m 2 N 2 LN)
– M= number of input maps
– m = average sites per input map
– C = Genome coverage (data redundancy)
– G = Total sites in Genome
– L = Search look ahead factor
– N = Total sites per chromosome
Chr1 length
12 Mb
12 Mb
30 Mb
30 Mb
100 Mb
100 Mb
200 Mb
200 Mb
Coverage
20x
100x
20x
100x
20x
50x
20x
50x
USC time
3431 secs
17052 secs
28993 secs
124194 secs
NA
NA
NA
NA
Note: Haptig times based on 2.4 GHz CPU
NYU time
448 secs
2818 secs
1444 secs
9308 secs
25465 secs
74025 secs
88835 secs
259024 secs
1 10
6
800000
600000
USC Algorithm
4 10
6
3 10
6
USC Algorithm
2 10
6
400000
200000
NYU Algorithm
1 10
6
NYU Algorithm
50 100
Genome size (in Mb) a
150
20 X coverage
200
150 50 100
Genome size (in Mb) a
100 X coverage
Unlike the USC algorithm (by Waterman et al.), Gentig/Haptig algorithms easily scale to human genome sized problems. Gentig is routinely used by
OpGen and other labs to map.
200
• USC assembler is consistently slower
– (Even without including the time to compute a
Draft Map.)
• USC assembler takes O(N 2 ) time even for N under 30Mb.
• NYU assembler is O(N 1.4
) up to N=30Mb.
– NYU assembler is quadratic above N=100Mb because of the choice of a threshold in the current implementation.
• Sequence Validation
• Haplotyping
• Sequencing
• Comparative Genomics
• Rearrangement events
• Hemizygous Deletions
• Epigenomics
• Characterizing cDNAs
– Expression Profiling
– Alternate Splicing
• Sequence a human size genome of about 6 Gb— include both haplotypes.
• Integrate three component technologies:
1. Optical Mapping to create Ordered Restriction Maps with respect to one or more restriction enzymes,
2. Hybridization of a pool of short nucleobase probes
(PNA or LNA oligomers) with Single Genome dsDNA molecules on a surface, and
3. Efficient polynomial time algorithms to solve
“localized versions” of the PSBH (Positional
Sequencing by Hybridization) problems over the whole genome.
“We haven't the money, so we've got to think."
• A novel class of nucleic acid analogues.
– LNA monomers are bicyclic compounds structurally similar to RNA nucleosides.
– The term "Locked Nucleic Acid" has been coined to emphasize that the furanose ring conformation is restricted in LNA by a methylene linker that connects the 2'-O position to the 4'-C position (see figure).
– LNA oligomers obey Watson-Crick base pairing rules and hybridize to
Complementary oligo-nucleotides.
• LNA/DNA or LNA/RNA duplexes have increased thermal stability compared with similar duplexes formed by DNA or
RNA.
– The LNA modification has been shown to increase the biological stability of nucleic acids.
– Fully modified LNA oligonucleotides are resistant towards most nucleases tested.
• 5’ biotin – GATGAACGg
• Corresponds to Lambda sequence at
18,689 bp from 5’ COS site
• Label with Alexa 546 – Streptavidin (3 fluors per)
– Strategy: simple replication of previous experiments… In response to limited resources, no protocol customization
Manual imaging, image processing, etc.
• The correct DNA sequence is s
• Computable:
– Correct Haplotypic Restriction Maps (with respect to multiple enzymes)
– Positional Spectra: For a position x, w x
& w c x
, the two k-mers at position x on the watson- and crick-strands.
• Data:
– Map Errors: Errors in sizing, false positives and negatives in cleavage
– Spectral Errors: Errors in position and false positives and negatives in hybridization
• Compute a sequence t .
• How closely does t approximate s ?
• Overcoming the complexity issues
– Spectrum of k-mer probes + constraints on the location of the k-mers in the sequence:
– However if the location constraints are in the form of k contiguous locations then the reconstruction problem is exponential in k , rather than the sequence length m.
– NP-Complete, if k ¸ 3
• A special case of the PSBH problem,
– Separate constraints for each of the multiple instances of each k-mer.
– By focusing on a small window of about 2 k-1 bp, in which most k-mers will occur only once, we hope to approximately solve the standard PSBH problem
• Here separate constraints for multiple instances of each k-mer are not important.
– Amortize over many instances
• Simulated sequencing errors (per 1000 bp averaged over 100 samples)
– Parameters of Simulation: Assume
• Sizing error SD = 60bp,
• False negative = 2%,
• False Positive = 6/Megabase (in the consensus probe maps),
– Each of which is about 40% of conservative estimates in the grant.
– But this would realistic if we can get the raw hybridization rate up to 60% (or alternately increase coverage from 25x to 125x).
– The current software implementation runs out of memory with more pessimistic error rates.
• (per 1000 bp averaged over 100 samples)
Probe Size
5 bp
6 bp
7 bp
8 bp
Substitutions Insertions
135.57
17.40
3.64
0.24
8.64
0.54
0.14
0.04
Deletions
23.74
1.49
0.14
0.09
70
10
60
Substitution Error
50
40
Deletion Error
30
20
Insertion Error
5 6 7
Size of the Probe a
8 9 10
(8bp probes) gaaatggggtcgacacgttggcgcgtacttccaccaatggtacacccaccgtgcaaccgaatccc tccttagtgtcaacccaatcaagcggggatatacgaggaaccaggcaggttgagagcgcggcggg ctcttcccgtgatcgtcctacgcattgcgcatcctctaggacgtgctaggaacccacactcgtta gaatggatcaggtgaggcgactcgccaactaataaggagacagagggctccccttgctgggtgga tctagtctccgatttcccgaacctaagttcttggtcgcaaggaggataccaggggggcggggcac ccaggcaccgtggttaacgtaggcccgtgtcacggggttagtgatgaaattacccggagcttcac atattcgttggcggcggcatggggctgcctgggtggcacagagctaacgaccgtaattccgagct caaccctgtccactgcatcaac C t C a A gttgatgcctcttgctacccccgggaccgctccttcgg gagtagggcgttaccgtttccggatattcggagctgtccctaatccgccgggacgacgagaggta ggtatataatgggttccccgagtgggtgcgggccgccggggacctcgcgattgtggccagcggga cagagcgtgagttagttggtaccgcagtgcatggtctgacttcgttgtcctggaaaatagcagtg ccccctacatcagcgaaatactcatactgacactctgcatccatggtggtttgcagccctgtatt cggttgccgaacagatgagtaaacgatcctcggtggcttaatccgcaacgagtatccttggacct ttctggtgtaacccgaccagagtaccgggaggaaggggccgggatcccgcggtgtctgtggcagc gctctctcttcccctgagcgggcagagacggttgattaatacccgctttcgctaggacgctccta cgagttgacatgttcactaccgct -
Sequencing-by-Synthesis:
• BASE EXTENSION: A single-stranded
DNA fragment, known as the template, is anchored to a surface with the starting point of a complementary strand, called the primer, attached to one of its ends
(a). When fluorescently tagged nucleotides (dNTPs) and polymerase are exposed to the template, a base complementary to the template will be added to the primer strand (b).
Remaining polymerase and dNTPs are washed away, then laser light excites the fluorescent tag, revealing the identity of the newly incorporated nucleotide (c). Its fluorescent tag is then stripped away, and the process starts anew.
• LIGATION: An “anchor primer” is attached to a single-stranded template to designate the beginning of an unknown sequence (a). Short, fluorescently labeled
“query primers” are created with degenerate DNA, except for one nucleotide at the query position bearing one of the four base types (b). The enzyme ligase joins one of the query primers to the anchor primer, following base-pairing rules to match the base at the query position in the template strand
(c). The anchor-query primer complex is then stripped away and the process repeated for a different position in the template.
• AMPLIFICATION: Because light signals are difficult to detect at the scale of a single DNA molecule, baseextension or ligation reactions are often performed on millions of copies of the same template strand simultaneously. Cellfree methods (a and b) for making these copies involve
PCR on a miniaturized scale.
• Sequencing-by-
Synthesis
– Bottom-up
– Short high-quality reads without location information
– Shot-gun Assembler
• Must use maps or paired-reads
• Error Prone
– Genotypic Sequence
– Less Flexible
• Sequencing-by-
Mapping
– Top-down
– Shorter low-quality
“reads” with “location” information
– Bayesian Assembler
– Haplotypic Sequence
– Flexible
• Gap-less Sequencing
• Large high-quality contigs
• Haplotype phasing
• Structural changes (e.g., chromosomal aberrations)
“I have become more and more impressed by the power of the scientific method of extending our knowledge of nature.
Experiment, directed by the imagination of either an individual, or still better of a group of individuals of varied mental outlook is able to achieve results which far transcend the imagination alone of the greatest natural philosopher.”
“Experiment without imagination, or imagination without recourse to experiment, can accomplish little. But for effective progress, a happy blend of these powers is necessary”
• Collaborators:
– Bhardwaj
– Brugge
– Cordoza
– Cantor
– Demidov
– Dynlacht
– Fitch
– Gimzewski
– Gromov
– Hubbard
– Jett
– Lazebnik
– Ostrer
– Pagano
– Parida
– Ramakrishnan
– Reed
– Sobel
– States
– Teitell
– Zhou
• Colleagues &
Students
– Anantharaman
– Antoniotti
– Daruwala
– Mishra
– Paxia
– Cheng
– Gill
– Heymann
– Ionita
– Mysore
– Rudkevich
– Sun
The end