Talk_VT - People at VT Computer Science

Sequence Assembly
and
Protein Docking Algorithms
Vicky Choi
Department of Computer Science
Duke University
Outline
• Sequence Assembly Algorithm
(Joint work with
Martin Farach-Colton @Rutgers)
• Local Search Algorithm for Rigid Protein
Docking
(Joint work with Pankaj K. Agarwal,
Herbert Edelsbrunner,
Johannes Rudolph @Duke)
2/73
Outline: Sequence Assembly
• Biological Background
• Human Genome Project and the Sequence Assembly Problem
• The BARNACLE Assembler
3/73
DNA
A DNA molecule consists of two strands which
are tied together in a helical structure.
Image Credit: US Department of Energy Human Genome Program http://www.ornl.gov/hgmis
Each strand is represented by a string over the
alphabet {A,C,G,T}, called a DNA sequence.
Example
AAGCTTCAGTTCCTGACCTTCCAATCGCAA
{A,C,G,T} = nucleotide, base, basepair (bp)
4/73
Two Strands: Reverse Complementary
Orientation: 5’ → 3’
Complement: A ↔ T, C ↔ G
one strand ⇒ the other strand
Example
(5') ACCATGGTGCACCTGACTCCTGAGGAG (3')
(3') TGGTACCACGTGGACTGAGGACTCCTC (5')
Image Credit: US Department of Energy Human Genome Program http://www.ornl.gov/hgmis
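The complement rule on this slide can be sketched in a few lines of Python. The function names are illustrative and not from the talk; the two strings are the example strands shown above.

```python
# Watson-Crick pairing: A <-> T, C <-> G
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    """Base-by-base complement, read in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)

def reverse_complement(seq):
    """The opposite strand, written in its own 5' -> 3' orientation."""
    return complement(seq)[::-1]

top = "ACCATGGTGCACCTGACTCCTGAGGAG"      # (5') ... (3')
bottom = "TGGTACCACGTGGACTGAGGACTCCTC"   # (3') ... (5'), as drawn on the slide

print(complement(top) == bottom)          # the two rows pair base by base
print(reverse_complement(top))            # bottom strand read 5' -> 3'
```

Because the two strands determine each other, an assembler must treat a read and its reverse complement as the same piece of evidence.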
5/73
A genome is the complete set of DNA
sequences of an organism.
Human Chromosomes
Image Credit: Sanger Center http://www.sanger.ac.uk/
Human Genome ~ 3×10^9 bp
6/73
DNA Sequencing
DNA Sequencing is the process for determining the
sequence of nucleotides of a region of DNA.
CGAATCGTCGATGCTAATG
Current technology : ~500bp
Question: How to sequence a longer stretch of DNA?
7/73
Shotgun Sequencing
Target DNA
DNA Cloning
Copies of Target
Shotgun
DNA Sequencing
ACGTAAGAGTACCGATTGGCCA
Sequence Reads
Assembly
Consensus
Directed Read
Final
…ACGTAGTCTTAGATGATAGTAGA…
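To make the "Sequence Reads, then Assembly" step concrete, here is a toy greedy overlap-merge in Python. It is a textbook-style sketch, not the method of any assembler discussed in this talk; the read boundaries and the `min_len` threshold are assumptions chosen for illustration, and reads are assumed error-free and already in the same orientation.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest suffix/prefix overlap."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining overlaps of at least min_len
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)

target = "ACGTAAGAGTACCGATTGGCCA"
reads = [target[0:10], target[6:16], target[12:22]]  # three overlapping reads
print(greedy_assemble(reads))
```

On repeat-rich genomes a greedy merge like this can join reads that only share a repeat, which is the difficulty the later slides address.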
8/73
Shotgun Sequencing History
• 1980s: 5 to 10 Kbp
• 1990: 40 Kbp
• 1995: 1.8 Mbp (H. influenzae)
• 2000: draft Drosophila (120 Mbp)
• 2001: draft Human Genome (3×10^9 bp) (attempted by Celera)
9/73
Outline: Sequence Assembly
• Biological Background
• Human Genome Project and the Sequence Assembly Problem
• The BARNACLE Assembler
10/73
Human Genome Project (HGP)
• 1988: “Mapping and Sequencing the
Human Genome”
• 1990: HGP started in US
• 2001: A “working draft” version
• 2003: Completed by HGP Consortium
standard
11/73
Hierarchical Shotgun Sequencing
(BAC-by-BAC)
•Map First, Then Sequence
Human Genome
BAC library
(100-200Kb)
Physical Map
A BAC is a segment of DNA
from a chromosome.
Each BAC is ~100-200Kb.
Tiling Path of BACs
Shotgun Sequence &
Assembly of each TP BAC
Final Sequence
12/73
Hierarchical Shotgun Sequencing
(BAC-by-BAC)
•Map First??
Human Genome
BAC library
(100-200Kb)
Physical Map
?
?
?
?
Physical Map is difficult to build!
(original expected time: 5 years)
13/73
BAC-by-BAC → BAC-Based
New Idea: Map + Sequencing concurrently
Randomly pick BACs (not wait for Physical Map)
and shotgun sequence BACs
BAC
Sequence Reads
Fragments
Phase 1: Draft
Ordered Fragments
Phase 2: Draft
Phase 3: Finished
14/73
BAC-Based
Human Genome
BAC library
(100-200Kb)
Finished + Draft BACs
Sequence Assembly Problem
the working draft of the human genome
15/73
Outline: Sequence Assembly
• Biological Background
• Human Genome Project and the Sequence Assembly Problem
• The BARNACLE Assembler
– Details of Input
– Difficulties
– Basic Idea
16/73
Details of Input
• Sequence Information:
– BACs
• Overlap Information:
– Local Alignments
– NT-pairs
• Orientation Information:
Plasmid, EST, mRNA
17/73
Input: Sequence Information
Recall: A BAC is a contiguous stretch of DNA from a
chromosome. Each comes as a set of fragments.
BAC:
Accession     Phase   Chrm   # frags
AC002092.1    1       17     4
Fragments:
Frag acc.         length
AC002092.1~1      888
AC002092.1~2      45312
AC002092.1~3      38725
AC002092.1~4.1    10245
•Phase 1,2 = Draft
•Phase 3 = Finished
18/73
Input: Overlap Information
• Preprocessing:
– Local alignments of all fragment pairs
– NT-pairs: Generated from GenBank annotation
submitted from genome centers
19/73
Example: Input of Dec 2001 freeze
Sequence Information:
Phase   BACs    Fragments   Total Length (Gbp)   Average Number of Fragments
1       15298   246424      2.5                  16.11
2       2154    8161        3.3                  3.79
3       17624   17624       2                    1
Total   35076   272209      4.9                  7.76
Chromosome Assignments:
31543 by STS; 2450 by Genbank; 1083 unknown
Overlap Information: 403,466 fragment pairs,
12,656 NT-pairs
Orientation Information: 321,751 fragment pairs
20/73
True vs Repeat-induced Overlap
True Overlap
Repeat-induced Overlap
21/73
Repeats of the Human Genome
• High-copy repeats
e.g. ALU, L1
• Low-copy repeats (segmental duplication)
•Large block (>200Kb)
•Highly Similar (>97%)
22/73
Noise
• False positives (FP): due to repeats
• False negatives (FN): polymorphism, draft quality
• Chimeric BAC (CB)
23/73
The Basic Idea
1. “Conservatively” assemble fragments
24/73
Necessary Condition for True Overlaps
A overlaps with B
A overlaps with C
Does B overlap with C?
[Figure: two layouts of A, B and C. Yes: B and C overlap each other. No: B and C do not overlap.]
Idea: assemble non-conflicting overlaps first
25/73
The Basic Idea
1. “Conservatively” assemble fragments into subcontigs
BAC Graph
26/73
Interval Graph
The BAC graph is an interval graph!
Definition: A graph G is called an interval graph if there is a one-to-one
correspondence between its vertices and a set of intervals on the
real line such that two vertices are adjacent in G iff their corresponding
intervals overlap.
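The definition can be illustrated by going from intervals to a graph. In this Python sketch each BAC is assumed to occupy a closed interval on the chromosome; the names and coordinates are made up for illustration.

```python
from itertools import combinations

def interval_graph(intervals):
    """intervals: dict name -> (start, end). Returns the set of edges (as frozensets)
    joining every pair of intervals that overlap."""
    edges = set()
    for (u, (s1, e1)), (v, (s2, e2)) in combinations(intervals.items(), 2):
        if s1 <= e2 and s2 <= e1:  # closed intervals intersect
            edges.add(frozenset((u, v)))
    return edges

# Hypothetical BAC placements (coordinates in Kb)
bacs = {"A": (0, 150), "B": (120, 270), "C": (250, 400)}
print(interval_graph(bacs))  # A-B and B-C overlap; A-C do not
```

The assembly problem runs in the opposite direction: the overlaps are observed and the placement on the line has to be recovered.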
27/73
Necessary… But Not Sufficient
Long Repeats
Under-represented
28/73
Non-interval Graph
Collapsing Repeats:
Chimeric BAC
29/73
Forbidden Subgraphs
Theorem (Lekkerkerker & Boland 1962)
A graph is interval iff it does not contain any of the
following as an induced subgraph:
30/73
Resolving Non-interval Graphs
Definition: A vertex u ∈ V is I-critical if G|V\{u} is interval.
Given a non-interval graph G, identify a forbidden
subgraph.
If at least one of the vertices of the forbidden
subgraph is I-critical, we say G is fixable.
Based on the structure of the forbidden subgraph,
a fixable graph G is resolved by
1. adding an FN edge; or
2. removing FP edges; or
3. removing a vertex.
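The I-critical definition can be checked by brute force on small graphs. The Python sketch below does not use the forbidden-subgraph characterization from the previous slide; it relies instead on the classical fact that a graph is interval iff its maximal cliques can be ordered so that the cliques containing each vertex are consecutive. It tries every ordering, so it is only practical for tiny examples, and all function names are illustrative.

```python
from itertools import permutations

def maximal_cliques(adj):
    """Bron-Kerbosch enumeration; adj maps each vertex to the set of its neighbours."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return cliques

def is_interval(adj):
    """True iff some ordering of the maximal cliques keeps each vertex's cliques consecutive."""
    cliques = maximal_cliques(adj)
    for order in permutations(cliques):
        ok = True
        for v in adj:
            pos = [i for i, c in enumerate(order) if v in c]
            if pos[-1] - pos[0] + 1 != len(pos):
                ok = False
                break
        if ok:
            return True
    return False

def i_critical_vertices(adj):
    """Vertices u such that deleting u leaves an interval graph."""
    result = []
    for u in adj:
        sub = {v: nbrs - {u} for v, nbrs in adj.items() if v != u}
        if is_interval(sub):
            result.append(u)
    return result

# A chordless 4-cycle is not an interval graph; deleting any vertex leaves a path.
c4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
print(is_interval(c4), sorted(i_critical_vertices(c4)))
```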
31/73
Divide and Conquer Method
For the non-fixable graphs, we employ a divide-and-conquer method:
divide the graph at some articulation points
such that each subcomponent is fixable.
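Articulation points (vertices whose removal disconnects the graph) can be found with a standard depth-first search. This Python sketch shows only that step; how the talk's method chooses among articulation points and tests fixability is not described here, so nothing below should be read as the BARNACLE procedure.

```python
def articulation_points(adj):
    """Classic DFS low-link computation; adj maps each vertex to the set of its neighbours."""
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v not in disc:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # u separates v's subtree from the rest of the graph
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])
        if parent is None and children > 1:
            points.add(u)

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return points

# Two triangles sharing vertex "c": removing "c" splits the graph.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d", "e"},
     "d": {"c", "e"}, "e": {"c", "d"}}
print(articulation_points(g))
```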
32/73
The Basic Idea
1. “Conservatively” assemble fragments into subcontigs
BAC Graph
2. Resolve Non-interval Graph
and
Find an Interval Realization of the BAC Graph
3. Orient and order subcontigs
33/73
Error Detection
1. “Conservatively” assemble fragments into subcontigs
• wrong NT-pairs (annotation from genome centers)
• chromosome misassignments
2. Resolve Non-interval Graph
and
Find an Interval Realization of the BAC Graph
• chimerics
3. Orient and order subcontigs
• fragment misassemblies
This is the only algorithm available that does Error Detection.
34/73
Output: Contigs
35/73
Other Two Assemblers
• GigAssembler by Jim Kent and David Haussler
(stopped after the April 2001 freeze)
• NCBI’s assembler – top-down approach:
• build a physical map using sequence overlaps as
fingerprint overlaps;
• use scoring functions to resolve conflicts.
36/73
NCBI’s assembly
BARNACLE’s assembly
37/73
Comparison with NCBI’s Assembly
(Dec 2001)
Assembled BAC Length    Barnacle   NCBI
≤ 250K (good BACs)      33921      29952
250K-300K               434        461
300K-500K               549        1328
500K-800K               33         798
800K-1M                 0          248
1M-2M                   0          496
2M-3M                   0          129
3M-10M                  0          259
10M-20M                 0          67
Total (>250K)           1016       3786
38/73
How was the human genome “finished”?
• Hand-curate tiling path of BACs (by Genome Centers)
• Finish sequencing the tiling path of BACs only
• Assemble by NCBI’s assembler based on the
hand-curated tiling paths
39/73
Incorporating segmental duplication database
Collaboration with Evan E. Eichler
(Department of Genetics,
Case Western Reserve University)
BARNACLE’s assembly suggested that at least 89
repeat-contained BACs were dropped from
the tiling path.
– 69 were added to HGP’s final tiling path
– 20 were declared unnecessary
• Due to disagreement about repeat structure of genome
The Sequence and Assembly of Highly Duplicated Regions in the Human Genome.
V. Choi, J. Bailey, G. Schuler, Z. Gu, P. Li, M. Farach-Colton and E. Eichler.
Genome Sequencing & Biology meeting at Cold Spring Harbor Laboratory 2002.
40/73
Conclusions
• Better assembly
– Error detection
– Measured by the assembled BAC length
• Efficient (3 minutes on a Pentium III)
Reference: V. Choi, M. Farach-Colton. BARNACLE: An
Assembly Algorithm for Clone-based Sequences of Whole
Genomes. Gene, 320, 165-176, 2003.
• To do large scale sequencing:
–Handle repeats
–Design in data acquisition that will permit error
detection & correction
41/73
Acknowledgement
Wojciech Makalowski (NCBI/Penn State University)
David Lipman (NCBI)
Greg Schuler (NCBI)
Evan E. Eichler (Case Western Reserve University)
Granger Sutton (Celera)
JinSheng Lai (Waksman Institute, Rutgers University)
• NCBI/NIH pre-doctoral visiting fellowship
• Program in Mathematics and Molecular Biology (PMMB),
Burroughs Wellcome Fund Interface Program fellowship
42/73
Outline : Protein Docking
1. Protein-Protein Docking
2. Local Search Algorithm
3. Test Results
43/73
Protein-Protein Docking
Barstar
Barnase
1BRS : Barnase + Barstar
44/73
Protein Re-Docking Problem
(Bound Protein Docking)
Given a known protein-protein complex A-B (native
configuration), randomly separate the two proteins.
Fix A, find a rigid motion m such that m(B) is near-native.
Rigid Body Assumption
45/73
Formulation of Rigid Protein Docking
• A scoring function that can discriminate the correct
docking configuration from incorrect ones;
• A search algorithm that finds the best docking
configuration as measured by the scoring function.
46/73
Protein
A protein molecule consists of a set of atoms.
Each atom is represented by a ball in R^3.
Notation:
A = { (a1, r1), …, (an, rn) }
where ai ∈ R^3 is the ith atom center with
(van der Waals) radius ri
Atom Type            C       N     O       P      S
Radius in Angstrom   1.548   1.4   1.348   1.88   1.808
47/73
Our Scoring Function
48/73
Exhaustive Search
• Sample the rigid motion space (6-dimensional)
• Evaluate each motion using the scoring function
Rigid Motion = Rotation + Translation
A rotation in R^3 can be specified by a rotation angle q
about a rotation axis u, represented by a unit quaternion.
Sampling Rotation Space ⇒ S^3 (unit sphere in R^4)
A translation in R^3 is a 3-dimensional vector (x,y,z) ∈ R^3.
Sampling Translation Space : a 3-dimensional grid
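As a small illustration of "Rigid Motion = Rotation + Translation", the Python sketch below converts an axis and angle into a unit quaternion and applies the resulting motion to a point. The function names are illustrative; the angle is taken in degrees to match the sampling used later in the talk.

```python
import math

def axis_angle_to_quaternion(u, q_deg):
    """Unit quaternion (w, x, y, z) for a rotation by q_deg degrees about axis u."""
    ux, uy, uz = u
    norm = math.sqrt(ux * ux + uy * uy + uz * uz)
    half = math.radians(q_deg) / 2.0
    s = math.sin(half) / norm
    return (math.cos(half), ux * s, uy * s, uz * s)

def rotate(quat, p):
    """Rotate point p by the unit quaternion, via the equivalent rotation matrix."""
    w, x, y, z = quat
    px, py, pz = p
    return (
        (1 - 2 * (y * y + z * z)) * px + 2 * (x * y - w * z) * py + 2 * (x * z + w * y) * pz,
        2 * (x * y + w * z) * px + (1 - 2 * (x * x + z * z)) * py + 2 * (y * z - w * x) * pz,
        2 * (x * z - w * y) * px + 2 * (y * z + w * x) * py + (1 - 2 * (x * x + y * y)) * pz,
    )

def rigid_motion(quat, t, p):
    """Rotation followed by translation t."""
    r = rotate(quat, p)
    return (r[0] + t[0], r[1] + t[1], r[2] + t[2])

quat = axis_angle_to_quaternion((0.0, 0.0, 1.0), 90.0)
print(rigid_motion(quat, (1.0, 2.0, 3.0), (1.0, 0.0, 0.0)))  # approximately (1, 3, 3)
```

A quaternion and its negation describe the same rotation, so sampling S^3 covers each rotation twice.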
49/73
Protein Re-Docking Without False Positives
Empirical Results:
The configuration m(B) for which
• Score(A,m(B)) is maximized;
• Bump(A,m(B)) ≤ 7;
is near-native : RMSD(Bnative, m(B)) ≤ 3.
Other prior works (e.g. FFT-based, geometric hashing)
generate multiple possible docking configurations
(i.e. near-native + false positives).
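The near-native criterion compares the moved copy of B with its native position. A minimal Python sketch of RMSD follows, assuming it is taken over corresponding atom centers with no extra superposition step; the talk does not spell out which atoms are included, so that choice is left to the caller.

```python
import math

def rmsd(P, Q):
    """Root-mean-square deviation between two equally long lists of 3D points."""
    if len(P) != len(Q) or not P:
        raise ValueError("point lists must be non-empty and of equal length")
    total = sum((p[k] - q[k]) ** 2 for p, q in zip(P, Q) for k in range(3))
    return math.sqrt(total / len(P))

native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
moved = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]  # every atom shifted by 3 along x
print(rmsd(native, moved))  # 3.0
```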
50/73
Sampling Rigid Motion Space : High Resolution
Rotations : 12,036 (~5 degree)
Translation: 0.4 grid step size (10^6 ~ 10^7)
Diverse test set (25 protein-protein complexes):
1A22, 1A4Y, 1BI8, 1BUH, 1BXI, 1CHO, 1CSE, 1DFJ, 1F47, 1FC2
1FIN, 1FS1, 3HLA, 1JAT, 1JLT, 1MCT, 1MEE, 2PTC, 3SGB, 4SGB
1STF, 1TEC, 1TGS, 1TX4, 3YGS
Running time: 13 hours ~ two days on 50 machines
Duke Internet Systems and Storage Group Cluster (~200 machines)
51/73
Why high resolution?
Even for two close configurations (i.e. small RMSD),
Score and Bump can fluctuate greatly.
Example:
[Score, Bump] = [309, 2], [467, 39], [158, 2]
52/73
Protein Re-Docking: Local Search Approach
Given A = {(aj,rj) : 1 ≤ j ≤ n}, B = {(bi,si) : 1 ≤ i ≤ m},
find a rigid motion m such that
• Score(A,m(B)) is maximized; and
• Bump(A,m(B)) ≤ 7.
B
A
53/73
Outline : Protein Docking
1. Protein-Protein Docking
2. Local Search Algorithm
3. Test Results
54/73
Weighted Least Squares Rigid Motion
m = WLSM(w,B,C):
Σ_i wi ||m(bi) – ci||^2 is minimized
tentative goal
absolute orientation problem in computer vision
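The weighted absolute orientation problem has a well-known closed-form solution due to Horn, in which the optimal rotation is the unit quaternion given by the dominant eigenvector of a 4×4 symmetric matrix. The pure-Python sketch below follows that formulation and finds the eigenvector by shifted power iteration. Whether the talk's WLSM uses this or another solver is not stated, and the iteration count and start vector are assumptions.

```python
import math

def wlsm(w, B, C, iters=500):
    """Return (R, t) minimising sum_i w[i] * |R b_i + t - c_i|^2."""
    total = sum(w)
    b_bar = [sum(wi * b[k] for wi, b in zip(w, B)) / total for k in range(3)]
    c_bar = [sum(wi * c[k] for wi, c in zip(w, C)) / total for k in range(3)]
    # Weighted cross-covariance of the centred point sets.
    S = [[sum(wi * (b[r] - b_bar[r]) * (c[s] - c_bar[s]) for wi, b, c in zip(w, B, C))
          for s in range(3)] for r in range(3)]
    (Sxx, Sxy, Sxz), (Syx, Syy, Syz), (Szx, Szy, Szz) = S
    N = [
        [Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
        [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
        [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
        [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz],
    ]
    # Shift makes all eigenvalues non-negative so power iteration finds the largest one.
    shift = max(sum(abs(x) for x in row) for row in N)
    q = [1.0, 0.5, 0.25, 0.125]  # arbitrary start; fails only if orthogonal to the answer
    for _ in range(iters):
        nxt = [sum(N[i][j] * q[j] for j in range(4)) + shift * q[i] for i in range(4)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        q = [x / norm for x in nxt]
    norm = math.sqrt(sum(x * x for x in q))
    qw, qx, qy, qz = [x / norm for x in q]
    R = [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qw * qz), 2 * (qx * qz + qw * qy)],
        [2 * (qx * qy + qw * qz), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qw * qx)],
        [2 * (qx * qz - qw * qy), 2 * (qy * qz + qw * qx), 1 - 2 * (qx * qx + qy * qy)],
    ]
    t = [c_bar[r] - sum(R[r][k] * b_bar[k] for k in range(3)) for r in range(3)]
    return R, t

def apply_motion(R, t, p):
    """Apply the rigid motion x -> R x + t to a point p."""
    return tuple(sum(R[r][k] * p[k] for k in range(3)) + t[r] for r in range(3))
```

In practice a library eigen-solver or an SVD-based variant would replace the power iteration; it is used here only to keep the example dependency-free.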
55/73
Local Search Algorithm
56/73
Preprocessing: Candidate Positions
mid-spheres = {(a,r+s+0.75): (a,r) ∈ A}
Vertex set = {v: v is a vertex of arrangement of mid-spheres,
Bump((v,s),A)=0}
Sc(v)=Score((v,s),A)
57/73
Example: Candidate Positions
58/73
Outer Loop : Increasing Score
Local search neighborhood distance D (≤ 4.5),
Tentative goal ci is the largest score vertex within
the local neighborhood of bi
59/73
Apply Least Squares Rigid Motion
60/73
Inner Loop: Collision Resolution
F = {(b,s) ∈ B : Bump(m(b,s),A)=0, Score(m(b,s),A)>0}
Y = {(b,s) ∈ B : Bump(m(b,s),A) ≠ 0}
61/73
Collision Resolution
For (bi, s) ∈ F, ci = m(bi), wi = 1
For (bi, s) ∈ Y, ci = the nearest vertex within distance 2
wi = W(m) / ||m(bi) – ci||^2
62/73
Example: 1BRS
Notation: [Score, Bump] (RMSD)
Native : [309, 2] (0)
Input : [91, 5] (3.78)
Increasing Score    Resolving Collisions
[297, 59]           [236, 34], [215, 28], …, [132, 2] (5.45)
[326, 59]           [298, 43], [282, 30], …, [119, 4] (4.67)
[246, 13]           [247, 10], [200, 9], [174, 4] (2.67)
[351, 30]           [332, 16], [323, 7] (1.98)
[386, 18]           [377, 7] (0.53)
Running Time: 30 seconds ~ 2 minutes for preprocessing
1~3 seconds per local search
63/73
Outline : Protein Docking
1. Protein-Protein Docking
2. Local Search Algorithm
3. Test Results
64/73
Perturbations
Perturb protein B locally from its native position:
rotation=(u,q) followed by translation=(v,t)
Sampling:
u,v ∈ {32 uniformly distributed unit vectors in R^3}
q ∈ {0,3,6, …, 27} (degree)
t ∈ {0,0.5,1.0, …, 4.5} (Angstrom)
Total:
(32x9+1) (rotations) x (32x9+1)(translations) = 83,521
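The total can be reproduced with a short count. The Python sketch assumes, as the "+1" in the formula suggests, that a zero angle or zero distance contributes a single motion regardless of the axis.

```python
N_AXES = 32
angles = list(range(0, 28, 3))           # q = 0, 3, ..., 27 degrees (10 values)
dists = [0.5 * k for k in range(10)]     # t = 0, 0.5, ..., 4.5 Angstrom (10 values)

def count_motions(magnitudes, n_axes=N_AXES):
    """Each non-zero magnitude pairs with every axis; zero gives one identity motion."""
    nonzero = sum(1 for m in magnitudes if m != 0)
    has_zero = any(m == 0 for m in magnitudes)
    return n_axes * nonzero + (1 if has_zero else 0)

rotations = count_motions(angles)        # 32 * 9 + 1 = 289
translations = count_motions(dists)      # 32 * 9 + 1 = 289
print(rotations, translations, rotations * translations)  # 289 289 83521
```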
65/73
Test Results
Example:
q = 18, t=2.5
829/1024 = 81%
Success :
Score > 90% Native_Score,
Bump ≤ 7,
RMSD ≤ 2
66/73
Test Results
{(q ≤ 12, t ≤ 3.5),
(q ≤ 15, t ≤ 3.0),
(q ≤ 18, t ≤ 2.5),
(q ≤ 21, t ≤ 2.0)}
40,903/44,481=92%
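The denominator 44,481 is consistent with counting every sampled perturbation that falls in the union of the four (q, t) boxes. The Python sketch below checks this under the assumption that angles are sampled in 3-degree steps up to 27, distances in 0.5 Angstrom steps up to 4.5, with 32 axes for each non-zero magnitude and one identity motion for zero.

```python
N_AXES = 32
angles = list(range(0, 28, 3))            # degrees
dists = [0.5 * k for k in range(10)]      # Angstrom
boxes = [(12, 3.5), (15, 3.0), (18, 2.5), (21, 2.0)]   # (max q, max t) per box

def in_region(q, t):
    """True if (q, t) lies in at least one of the boxes."""
    return any(q <= q_max and t <= t_max for q_max, t_max in boxes)

def multiplicity(magnitude):
    """Number of sampled motions with this magnitude: one if zero, else one per axis."""
    return 1 if magnitude == 0 else N_AXES

total = sum(multiplicity(q) * multiplicity(t)
            for q in angles for t in dists if in_region(q, t))
print(total)                               # 44481
print(round(100 * 40903 / total))          # 92 (percent)
```

The same counting gives 193 × 225 = 43,425 perturbations for the single box q ≤ 18, t ≤ 3.5.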
67/73
10 Different Protein-Protein Complexes
q ≤ 18, t ≤ 3.5 Angstrom
(43,425 perturbations)
68/73
Conclusions
• Works well in neighborhood:
- Rotation angle ≤ 18 degrees
- Translation distance ≤ 3.5 Angstrom
• Global Search
• Incorporate conformational flexibility
Reference: V. Choi, H. Edelsbrunner, P.K. Agarwal and J. Rudolph.
Local Search Heuristic for Rigid Protein Docking. To be submitted.
69/73
Acknowledgement
Biogeometry Group @ Duke:
Tammy Bailey
Andrew Ban
Sergei Bespamiatnykh
Abhijit Guria
Vijay Natarajan
Alper Ungor
Yusu Wang
Navin Goyal
(Rutgers University)
Raimund Seidel
(Univ. des Saarlandes)
Stefan Leopoldseder
(Vienna Univ. of Technology)
VMD – Visual Molecular Dynamics:
http://www.ks.uiuc.edu/Research/vmd
70/73
Future Work
• Protein Docking Problem : Unbound case
• Repeats in the Human Genome:
71/73
Repeats: Junk DNA?
G
G
Aberrant recombination
G
G
Human disease or structural polymorphism
Not Junk at All!
72/73
Future Work
• Protein Docking Problem : Unbound case
• Repeats in the Human Genome:
Characterization and distribution of repeats (both
high copy and low copy) in the human genome
Thank You!
73/73