Sequence Assembly and Protein Docking Algorithms

Vicky Choi
Department of Computer Science
Duke University

Outline
• Sequence Assembly Algorithm (joint work with Martin Farach-Colton @Rutgers)
• Local Search Algorithm for Rigid Protein Docking (joint work with Pankaj K. Agarwal, Herbert Edelsbrunner, Johannes Rudolph @Duke)

Outline: Sequence Assembly
• Biological Background
• Human Genome Project and the Sequence Assembly Problem
• The BARNACLE Assembler

DNA
A DNA molecule consists of two strands which are tied together in a helical structure.
Image Credit: US Department of Energy Human Genome Program http://www.ornl.gov/hgmis
Each strand is represented by a string over the alphabet {A,C,G,T}, called a DNA sequence.
Example: AAGCTTCAGTTCCTGACCTTCCAATCGCAA
{A,C,G,T} = nucleotide, base, basepair (bp)

Two Strands: Reverse Complementary
Orientation: 5' → 3'
Complement: A ↔ T, C ↔ G
one strand ⇒ another strand
Example:
(5') ACCATGGTGCACCTGACTCCTGAGGAG (3')
(3') TGGTACCACGTGGACTGAGGACTCCTC (5')
Image Credit: US Department of Energy Human Genome Program http://www.ornl.gov/hgmis

A genome is the complete set of DNA sequences of an organism.
[Figure: Human Chromosomes. Image Credit: Sanger Center http://www.sanger.ac.uk/]
Human Genome ~ $3 \times 10^9$ bp

DNA Sequencing
DNA sequencing is the process of determining the sequence of nucleotides of a region of DNA.
Example read: CGAATCGTCGATGCTAATG
Current technology: ~500 bp
Question: How to sequence a longer stretch of DNA?

Shotgun Sequencing
[Figure: Target DNA → DNA Cloning → Copies of Target → Shotgun → DNA Sequencing → Sequence Reads (e.g. ACGTAAGAGTACCGATTGGCCA) → Assembly → Consensus → Directed Read → Final (…ACGTAGTCTTAGATGATAGTAGA…)]

Shotgun Sequencing History
• 1980s: 5 to 10 Kbp
• 1990: 40 Kbp
• 1995: 1.8 Mbp (H. influenzae)
• 2000: draft Drosophila (120 Mbp)
• 2001: draft Human Genome ($3 \times 10^9$ bp) (attempted by Celera)

Human Genome Project (HGP)
• 1988: “Mapping and Sequencing the Human Genome”
• 1990: HGP started in US
• 2001: A “working draft” version
• 2003: Completed by HGP Consortium standard

Hierarchical Shotgun Sequencing (BAC-by-BAC)
• Map First, Then Sequence
[Figure: Human Genome → BAC library (100-200 Kb) → Physical Map → Tiling Path of BACs → Shotgun Sequence & Assemble each TP BAC → Final Sequence]
A BAC is a segment of DNA from a chromosome. Each BAC is ~100-200 Kb.

Hierarchical Shotgun Sequencing (BAC-by-BAC)
• Map First??
[Figure: Human Genome → BAC library (100-200 Kb) → Physical Map ? ? ? ?]
Physical Map is difficult to build! (original expected time: 5 years)

BAC-by-BAC → BAC-Based
New Idea: Map + Sequencing concurrently
Randomly pick BACs (do not wait for the Physical Map) and shotgun sequence them.
[Figure: BAC → Sequence Reads → Fragments (Phase 1: Draft) → Ordered Fragments (Phase 2: Draft) → Phase 3: Finished]

BAC-Based
[Figure: Human Genome → BAC library (100-200 Kb) → Finished + Draft BACs → Sequence Assembly Problem → the working draft of the human genome]

Outline: Sequence Assembly
• Biological Background
• Human Genome Project and the Sequence Assembly Problem
• The BARNACLE Assembler
  – Details of Input
  – Difficulties
  – Basic Idea

Details of Input
• Sequence Information:
  – BACs
• Overlap Information:
  – Local Alignments
  – NT-pairs
• Orientation Information: Plasmid, EST, mRNA

Input: Sequence Information
Recall: A BAC is a contiguous stretch of DNA from a chromosome. Each comes as a set of fragments.

Accession     Phase   Chrm   # frags
AC002092.1    1       17     4

Frag acc.         length
AC002092.1~1      888
AC002092.1~2      45312
AC002092.1~3      38725
AC002092.1~4.1    10245

[Figure labels: BAC, fragment]
• Phase 1, 2 = Draft
• Phase 3 = Finished

Input: Overlap Information
• Preprocessing:
  – Local alignments of all fragment pairs
  – NT-pairs: generated from GenBank annotation submitted by genome centers

Example: Input of Dec 2001 freeze
Sequence Information:

Phase   BACs    Fragments   Total Length (Gbp)   Average Number of Fragments
1       15298   246424      2.5                  16.11
2       2154    8161        3.3                  3.79
3       17624   17624       2                    1
Total   35076   272209      4.9                  7.76

Chromosome Assignments: 31543 by STS; 2450 by GenBank; 1083 unknown
Overlap Information: 403,466 fragment pairs, 12,656 NT-pairs
Orientation Information: 321,751 fragment pairs

True vs Repeat-induced Overlap
[Figure panels: True Overlap; Repeat-induced Overlap]

Repeats of the Human Genome
• High-copy repeats, e.g. ALU, L1
• Low-copy repeats (segmental duplication)
  – Large block (>200 Kb)
  – Highly similar (>97%)

Noise
• False positives (FP): due to repeats
• False negatives (FN): polymorphism, draft quality
• Chimeric BAC (CB)

The Basic Idea
1. “Conservatively” assemble fragments

Necessary Condition for True Overlaps
A overlaps with B, and A overlaps with C. Does B overlap with C? Yes or No, depending on the layout.
[Figure: the two possible layouts of A, B and C]
Idea: assemble non-conflicting overlaps first

The Basic Idea
1. “Conservatively” assemble fragments into subcontigs → BAC Graph

Interval Graph
The BAC graph is an interval graph!
Definition: A graph G is called an interval graph if there is a one-to-one correspondence between its vertices and a set of intervals on the real line such that two vertices are adjacent in G iff their corresponding intervals overlap.
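
The interval graph definition above can be made concrete with a small sketch. It builds the graph of a set of intervals by testing every pair for overlap. The BAC names and coordinates are hypothetical, chosen only for illustration, and the closed-interval overlap test is an assumption; this is not the BARNACLE implementation.

```python
from itertools import combinations

def overlaps(a, b):
    """Closed intervals (start, end) overlap iff each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]

def interval_graph(intervals):
    """Return adjacency sets: two names are adjacent iff their intervals overlap."""
    adj = {name: set() for name in intervals}
    for u, v in combinations(intervals, 2):
        if overlaps(intervals[u], intervals[v]):
            adj[u].add(v)
            adj[v].add(u)
    return adj

# Hypothetical BAC placements along a chromosome (coordinates in Kb).
bacs = {"A": (0, 150), "B": (100, 260), "C": (240, 400), "D": (500, 640)}
graph = interval_graph(bacs)
for name in sorted(graph):
    print(name, sorted(graph[name]))
# A ['B']
# B ['A', 'C']
# C ['B']
# D []
```

Here A and C both overlap B but not each other, which is the "No" layout of the Necessary Condition slide: B and C sit on opposite ends of A's neighbor.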

Necessary… But Not Sufficient
[Figure labels: Long Repeats; Under-represented]

Non-interval Graph
[Figure panels: Collapsing Repeats; Chimeric BAC]

Forbidden Subgraphs
Theorem (Lekkerkerker & Boland 1962): A graph is interval iff it does not contain any of the following (induced) subgraphs:
[Figure: the forbidden induced subgraphs]

Resolving Non-interval Graphs
Definition: A vertex $u \in V$ is I-critical if $G|_{V \setminus \{u\}}$ is interval.
Given a non-interval graph G, identify a forbidden subgraph. If at least one of the vertices of the forbidden subgraph is I-critical, we say G is fixable.
Based on the structure of the forbidden subgraph, a fixable graph G is resolved by
1. adding an FN edge; or
2. removing FP edges; or
3. removing a vertex.

Divide and Conquer Method
For the non-fixable graphs, we employ a divide-and-conquer method, dividing the graph at articulation points such that each subcomponent is fixable.

The Basic Idea
1. “Conservatively” assemble fragments into subcontigs → BAC Graph
2. Resolve Non-interval Graph and Find an Interval Realization of the BAC Graph
3. Orient and order subcontigs

Error Detection
1. “Conservatively” assemble fragments into subcontigs
   • wrong NT-pairs (annotation from genome centers)
   • chromosome misassignments
2. Resolve Non-interval Graph and Find an Interval Realization of the BAC Graph
   • chimerics
3. Orient and order subcontigs
   • fragment misassemblies
This is the only algorithm available that does Error Detection.

Output: Contigs

Other Two Assemblers
• GigAssembler by Jim Kent and David Haussler (stopped after the April 2001 freeze)
• NCBI’s assembler
  – top-down approach:
    • build a physical map using sequence overlaps as fingerprint overlaps;
    • use scoring functions to resolve conflicts.
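
The I-critical definition from the Resolving Non-interval Graphs slide can be sketched by brute force on small graphs. Instead of searching for the forbidden subgraphs directly, this sketch uses the equivalent characterization from the same Lekkerkerker & Boland result: a graph is interval iff it is chordal and has no asteroidal triple. All function names are hypothetical, the approach is exponential-free but far from optimal, and it is not the BARNACLE implementation.

```python
from itertools import combinations

def without(adj, removed):
    """Induced subgraph after deleting the vertices in `removed`."""
    return {v: nbrs - removed for v, nbrs in adj.items() if v not in removed}

def is_chordal(adj):
    """Chordal iff simplicial vertices can be peeled off until none remain."""
    g = {v: set(n) for v, n in adj.items()}
    while g:
        simplicial = next(
            (v for v, n in g.items()
             if all(b in g[a] for a, b in combinations(n, 2))),
            None,
        )
        if simplicial is None:
            return False
        g = without(g, {simplicial})
    return True

def connected(adj, s, t):
    """True if s and t lie in the same connected component of adj."""
    seen, stack = {s}, [s]
    while stack:
        v = stack.pop()
        if v == t:
            return True
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return False

def has_asteroidal_triple(adj):
    """Look for three pairwise non-adjacent vertices such that each pair is
    joined by a path avoiding the closed neighbourhood of the third."""
    for triple in combinations(adj, 3):
        if any(b in adj[a] for a, b in combinations(triple, 2)):
            continue
        rotations = (triple, triple[1:] + triple[:1], triple[2:] + triple[:2])
        if all(connected(without(adj, adj[z] | {z}), x, y)
               for x, y, z in rotations):
            return True
    return False

def is_interval(adj):
    return is_chordal(adj) and not has_asteroidal_triple(adj)

def i_critical_vertices(adj):
    """Vertices u such that deleting u leaves an interval graph."""
    return [u for u in adj if is_interval(without(adj, {u}))]

# A 4-cycle is the smallest non-interval graph; deleting any vertex leaves a path.
c4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
print(is_interval(c4), sorted(i_critical_vertices(c4)))
```

In this toy case every vertex of the cycle is I-critical, so the graph is fixable in the sense of the slide; which of the three repairs applies would depend on the overlap evidence, which the sketch does not model.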

[Figure panels: NCBI’s assembly; BARNACLE’s assembly]

Comparison with NCBI’s Assembly (Dec 2001)

Assembled BAC Length     Barnacle   NCBI
≤ 250K (good BACs)       33921      29952
250K-300K                434        461
300K-500K                549        1328
500K-800K                33         798
800K-1M                  0          248
1M-2M                    0          496
2M-3M                    0          129
3M-10M                   0          259
10M-20M                  0          67
Total (>250K)            1016       3786

How was the human genome “finished”?
• Hand-curate tiling path of BACs (by Genome Centers)
• Finish sequencing the tiling path of BACs only
• Assemble by NCBI’s assembler based on the hand-curated tiling paths

Incorporating segmental duplication database
Collaboration with Evan E. Eichler (Department of Genetics, Case Western Reserve University)
BARNACLE’s assembly suggested that at least 89 repeat-contained BACs were dropped from the tiling path.
  – 69 were added to HGP’s final tiling path
  – 20 were declared unnecessary
    • Due to disagreement about repeat structure of genome
The Sequence and Assembly of Highly Duplicated Regions in the Human Genome. V. Choi, J. Bailey, G. Schuler, Z. Gu, P. Li, M. Farach-Colton and E. Eichler. Genome Sequencing & Biology meeting at Cold Spring Harbor Laboratory 2002.

Conclusions
• Better assembly
  – Error detection
  – Measured by the assembled BAC length
• Efficient (3 minutes on a Pentium III)
Reference: V. Choi, M. Farach-Colton. BARNACLE: An Assembly Algorithm for Clone-based Sequences of Whole Genomes. Gene, 320, 165-176, 2003.
• To do large scale sequencing:
  – Handle repeats
  – Design in data acquisition that will permit error detection & correction

Acknowledgement
Wojciech Makalowski (NCBI/Penn State University)
David Lipman (NCBI)
Greg Schuler (NCBI)
Evan E. Eichler (Case Western Reserve University)
Granger Sutton (Celera)
JinSheng Lai (Waksman Institute, Rutgers University)
• NCBI/NIH pre-doctoral visiting fellowship
• Program in Mathematics and Molecular Biology (PMMB), Burroughs Wellcome Fund Interface Program fellowship

Outline: Protein Docking
1. Protein-Protein Docking
2. Local Search Algorithm
3. Test Results

Protein-Protein Docking
[Figure: 1BRS: Barnase + Barstar]

Protein Re-Docking Problem (Bound Protein Docking)
Given a known protein-protein complex A-B (native configuration), randomly separate the two proteins. Fix A, find a rigid motion m such that m(B) is near-native.
Rigid Body Assumption

Formulation of Rigid Protein Docking
• A scoring function that can discriminate correct docking configurations from incorrect ones;
• A search algorithm that finds the docking configuration measured by the scoring function.

Protein
A protein molecule consists of a set of atoms. Each atom is represented by a ball in $\mathbb{R}^3$.
Notation: $A = \{(a_1, r_1), \ldots, (a_n, r_n)\}$ where $a_i \in \mathbb{R}^3$ is the $i$th atom center with (van der Waals) radius $r_i$.

Atom Type   Radius in Angstrom
C           1.548
N           1.4
O           1.348
P           1.88
S           1.808

Our Scoring Function

Exhaustive Search
• Sample the rigid motion space (6-dimensional)
• Evaluate each motion using the scoring function
Rigid Motion = Rotation + Translation
A rotation in $\mathbb{R}^3$ can be specified by a rotation angle $\theta$ about a rotation axis $u$, represented by a unit quaternion.
Sampling Rotation Space ⇒ $S^3$ (unit sphere in $\mathbb{R}^4$)
A translation in $\mathbb{R}^3$ is a 3-dimensional vector $(x, y, z) \in \mathbb{R}^3$.
Sampling Translation Space: a 3-dimensional grid

Protein Re-Docking Without False Positives
Empirical Results: The configuration m(B) for which
• $\mathrm{Score}(A, m(B))$ is maximized;
• $\mathrm{Bump}(A, m(B)) \le 7$;
is near-native: $\mathrm{RMSD}(B_{\mathrm{native}}, m(B)) \le 3$.
Other prior works (e.g. FFT-based, geometric hashing) generate multiple possible docking configurations (i.e. near-native + false positives).
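
The rigid-motion parameterization from the Exhaustive Search slide and the RMSD criterion above can be illustrated with a minimal sketch. It builds a unit quaternion from an axis and an angle, applies rotation followed by translation to a list of atom centers, and computes an atom-wise RMSD between two placements of the same atoms. The function names are hypothetical, and treating RMSD as the plain atom-wise deviation without superposition is an assumption, since the slides do not spell out the definition.

```python
import math

def quaternion(axis, angle_deg):
    """Unit quaternion (w, x, y, z) for a rotation by angle_deg about axis."""
    norm = math.sqrt(sum(c * c for c in axis))
    half = math.radians(angle_deg) / 2.0
    s = math.sin(half) / norm
    return (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)

def rotate(q, p):
    """Rotate point p by unit quaternion q using the equivalent rotation matrix."""
    w, x, y, z = q
    px, py, pz = p
    return (
        (1 - 2 * (y * y + z * z)) * px + 2 * (x * y - w * z) * py + 2 * (x * z + w * y) * pz,
        2 * (x * y + w * z) * px + (1 - 2 * (x * x + z * z)) * py + 2 * (y * z - w * x) * pz,
        2 * (x * z - w * y) * px + 2 * (y * z + w * x) * py + (1 - 2 * (x * x + y * y)) * pz,
    )

def rigid_motion(points, axis, angle_deg, translation):
    """Apply rotation (axis, angle) followed by translation to every point."""
    q = quaternion(axis, angle_deg)
    return [tuple(c + t for c, t in zip(rotate(q, p), translation)) for p in points]

def rmsd(ps, qs):
    """Root-mean-square deviation between corresponding points of ps and qs."""
    total = sum((a - b) ** 2 for p, q in zip(ps, qs) for a, b in zip(p, q))
    return math.sqrt(total / len(ps))

# Three toy atom centers, rotated 90 degrees about the z-axis and lifted by 2.
atoms = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
moved = rigid_motion(atoms, axis=(0.0, 0.0, 1.0), angle_deg=90.0,
                     translation=(0.0, 0.0, 2.0))
print(round(rmsd(atoms, moved), 3))  # 2.309
```

An exhaustive search in this style would loop such a motion over sampled quaternions and grid translations and evaluate the scoring function for each candidate.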

Sampling Rigid Motion Space: High Resolution
Rotations: 12,036 (~5 degrees)
Translation: 0.4 grid step size ($10^6$ ~ $10^7$)
Diverse test set (25 protein-protein complexes):
1A22, 1A4Y, 1BI8, 1BUH, 1BXI, 1CHO, 1CSE, 1DFJ, 1F47, 1FC2,
1FIN, 1FS1, 3HLA, 1JAT, 1JLT, 1MCT, 1MEE, 2PTC, 3SGB, 4SGB,
1STF, 1TEC, 1TGS, 1TX4, 3YGS
Running time: 13 hours ~ two days on 50 machines
Duke Internet Systems and Storage Group Cluster (~200 machines)

Why high resolution?
For two close configurations (i.e. small RMSD), Score and Bump fluctuate greatly.
Example: [Score, Bump] = [309, 2], [467, 39], [158, 2]

Protein Re-Docking: Local Search Approach
Given $A = \{(a_j, r_j) : 1 \le j \le n\}$ and $B = \{(b_i, s_i) : 1 \le i \le m\}$, find a rigid motion $m$ such that
• $\mathrm{Score}(A, m(B))$ is maximized; and
• $\mathrm{Bump}(A, m(B)) \le 7$.
[Figure: proteins A and B]

Weighted Least Squares
Rigid Motion $m = \mathrm{WLSM}(w, B, C)$: $\sum_i w_i \, \|m(b_i) - c_i\|^2$ is minimized, where $c_i$ is the tentative goal of $b_i$.
This is the absolute orientation problem in computer vision.

Local Search Algorithm

Preprocessing: Candidate Positions
mid-spheres $= \{(a, r + s + 0.75) : (a, r) \in A\}$
Vertex set $= \{v : v$ is a vertex of the arrangement of mid-spheres, $\mathrm{Bump}((v, s), A) = 0\}$
$\mathrm{Sc}(v) = \mathrm{Score}((v, s), A)$

Example: Candidate Positions

Outer Loop: Increasing Score
Local search neighborhood distance $D$ ($\le 4.5$).
Tentative goal $c_i$ is the largest-score vertex within the local neighborhood of $b_i$.

Apply Least Squares Rigid Motion

Inner Loop: Collision Resolution
$F = \{(b, s) \in B : \mathrm{Bump}(m(b, s), A) = 0,\ \mathrm{Score}(m(b, s), A) > 0\}$
$Y = \{(b, s) \in B : \mathrm{Bump}(m(b, s), A) > 0\}$

Collision Resolution
For $(b_i, s) \in F$: $c_i = m(b_i)$, $w_i = 1$
For $(b_i, s) \in Y$: $c_i$ = the nearest vertex within distance 2, $w_i = W(m) / \|m(b_i) - c_i\|^2$

Example: 1BRS
Notation: [Score, Bump] (RMSD)
Native: [309, 2] (0)
Input: [91, 5] (3.78)

Increasing Score   Resolving Collisions
[297, 59]          [236, 34], [215, 28], …, [132, 2] (5.45)
[326, 59]          [298, 43], [282, 30], …, [119, 4] (4.67)
[246, 13]          [247, 10], [200, 9], [174, 4] (2.67)
[351, 30]          [332, 16], [323, 7] (1.98)
[386, 18]          [377, 7] (0.53)

Running Time:
• 30 seconds ~ 2 minutes for preprocessing
• 1 ~ 3 seconds per local search

Perturbations
Perturb protein B locally from its native position: rotation = (axis $u$, angle $\theta$) followed by translation = (direction $v$, distance $t$).
Sampling:
$u, v \in \{$32 uniformly distributed unit vectors in $\mathbb{R}^3\}$
$\theta \in \{0, 3, 6, \ldots, 27\}$ (degrees)
$t \in \{0, 0.5, 1.0, \ldots, 4.5\}$ (Angstrom)
Total: $(32 \times 9 + 1)$ rotations $\times$ $(32 \times 9 + 1)$ translations $= 83{,}521$

Test Results
Example: $\theta = 18$, $t = 2.5$: 829/1024 = 81%
Success: Score > 90% Native_Score, Bump $\le 7$, RMSD $\le 2$

Test Results
$\{(\theta \le 12, t \le 3.5), (\theta \le 15, t \le 3.0), (\theta \le 18, t \le 2.5), (\theta \le 21, t \le 2.0)\}$: 40,903/44,481 = 92%

10 Different Protein-Protein Complexes
$\theta \le 18$, $t \le 3.5$ Angstrom (43,425 perturbations)

Conclusions
Works well in neighborhood:
  – Rotation angle $\le 18$ degrees
  – Translation distance $\le 3.5$ Angstrom
• Global Search
• Incorporate conformation flexibility
Reference: V. Choi, H. Edelsbrunner, P.K. Agarwal and J. Rudolph. Local Search Heuristic for Rigid Protein Docking. To be submitted.

Acknowledgement
Biogeometry Group @ Duke: Tammy Bailey, Andrew Ban, Sergei Bespamiatnykh, Abhijit Guria, Vijay Natarajan, Alper Ungor, Yusu Wang
Navin Goyal (Rutgers University)
Raimund Seidel (Univ. des Saarlandes)
Stefan Leopoldseder (Vienna Univ. of Technology)
VMD – Visual Molecular Dynamics: http://www.ks.uiuc.edu/Research/vmd

Repeats: Junk DNA?
[Figure: Aberrant recombination → Human disease or structural polymorphism]
Not Junk at All!

Future Work
• Protein Docking Problem: Unbound case
• Repeats in the Human Genome: Characterization and distribution of repeats (both high copy and low copy) in the human genome

Thank You!