The Role of Algorithmic Research in Computational Genomics
Richard M. Karp
IEEE Computer Society Bioinformatics Conference
August 14, 2003

Algorithmic Research in Computer Science
• Computer science is a "science of the artificial."
• Problems are precisely stated and are often generic rather than application-specific.
• The quality of an algorithm is measured by its worst-case time bound.
• Mathematical elegance is just as important as relevance to applications.

Algorithmic Research in Computational Genomics
• The goal is to understand ground truth.
• Problem statements are often fuzzy.
• Problems are often application-specific, and problem formulations must be faithful to those applications.
• The quality of an algorithm is measured by its performance on real data.
• Biological findings are more important than computational methods.

Genomics Can Benefit from Algorithmic Research in C.S.
• Data structures such as suffix trees.
• Randomized algorithms and sampling techniques.
• Dynamic programming (sequence alignment, RNA folding, protein threading, haplotype block structure…)
• Network flows, graph theory, NP-completeness, integer programming, semidefinite programming.

Adapting to Genomics
• Choose problems that are fundamental, timely and relevant.
• Mathematical depth and elegance are highly desirable, but often simple mathematics, artfully applied, is the key to success.
• Avoid problems that will change when technology changes.
• Learn the biological background of your problem, the available sources of data and their noise characteristics.
• Work with an application-oriented team and don't get typecast as an algorithms specialist.
• Benchmark your algorithms on real data, establish a user community and make your software available and easy to use.

Sequence Assembly
• Given many noisy "reads" of short substrings of a target string, identify the target string.
• The shortest superstring problem, an elegant but flawed abstraction: find a shortest string containing a set of given strings as substrings. The problem is NP-hard, and theoretical results focus on constant-factor approximation algorithms.

Shortest Superstring Problem
The shortest superstring problem is only superficially related to the sequence assembly problem. Its difficulty stems from pathological examples that are unlikely to occur in practice. It does not take noisy reads into account, and it admits solutions with an unreasonably large number of mutually overlapping reads.

Progress in Sequence Assembly Algorithms
• Phred provides highly accurate base-specific quality scores based on signal analysis of sequence traces.
• Celera assembler: realistic simulations based on the structure of repeats in genomic sequence suggested that full-genome sequence assembly would be possible using double-ended reads. A sophisticated heuristic assembly algorithm was constructed, leading to the successful assembly of the Drosophila, human and mouse genomes.

Physical Mapping
• Goal: determine the relative locations of sequence-tagged sites, restriction sites or clones on a target DNA molecule.
• Radiation hybrid mapping: fragment the target, recover random sets of fragments and detect the sequence-tagged sites within them.
• Optical mapping: directly image the restriction sites on many incomplete copies of the target.
• Clone-based mapping: generate a clone library together with a restriction-site or sequence-tagged-site fingerprint of each clone. Computationally infer the relative positions of the clones.

A Generic Subproblem
• X(i): distance in bases of site i from the 5' end of the target.
• Experimental data yield inequalities of the form a(i,j) ≤ X(i) - X(j) ≤ b(i,j).
• In nearly every case, no solution existed.
• The algorithm was then modified to find the minimal obstructions to a solution and pinpoint the places where the experimental data needed to be corrected.
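Inequalities of the form a(i,j) ≤ X(i) - X(j) ≤ b(i,j) form a system of difference constraints. One standard way to test such a system for feasibility is Bellman-Ford on a constraint graph, where a negative cycle certifies that no solution exists. The sketch below illustrates that textbook technique only; it is not claimed to be the algorithm used in the mapping projects described here, and the function name `solve_difference_constraints` and the tuple layout `(i, j, a, b)` are assumptions made for the example. The cycle it reports is one obstruction, not necessarily a minimal one.

```python
from typing import List, Optional, Tuple, Union

# Hypothetical layout: (i, j, a, b) means a <= X[i] - X[j] <= b.
Constraint = Tuple[int, int, float, float]


def solve_difference_constraints(
    n: int, constraints: List[Constraint]
) -> Tuple[str, Union[List[float], List[int]]]:
    """Return ("feasible", X) or ("infeasible", cycle_of_site_indices)."""
    # An edge (u, v, w) encodes the inequality X[v] - X[u] <= w.
    edges: List[Tuple[int, int, float]] = []
    for i, j, a, b in constraints:
        edges.append((j, i, b))    # X[i] - X[j] <= b
        edges.append((i, j, -a))   # X[j] - X[i] <= -a

    # Starting every distance at 0 acts like a virtual source that is
    # joined to each site by a zero-weight edge.
    dist: List[float] = [0.0] * n
    pred: List[Optional[int]] = [None] * n
    last: Optional[int] = None

    for _ in range(n):
        last = None
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
                last = v
        if last is None:
            # No relaxation in this pass: dist satisfies every inequality.
            return "feasible", dist

    # A relaxation in the n-th pass implies a negative cycle.
    # Walking back n predecessor steps guarantees we stand on that cycle.
    v = last
    for _ in range(n):
        v = pred[v]
    cycle = [v]
    u = pred[v]
    while u != v:
        cycle.append(u)
        u = pred[u]
    cycle.reverse()
    return "infeasible", cycle
```

In the feasible case the returned positions are one valid assignment (only differences matter, so any common shift is equally valid). In the infeasible case the returned sites indicate which measurements jointly contradict one another, which is where one would look for experimental errors.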
Why My Physical Mapping Projects Had Little Influence
• Some problem formulations were technology-dependent and hence of transient interest.
• Difficulty in infiltrating existing projects and acquiring test data.
• Implementations lacked good user interfaces.
• Whole-genome sequencing supplanted physical mapping to some extent.

Elegance vs. Realism: the Case of Probe Selection
• Probe Selection Problem: find a maximum number of DNA probes, such that each hybridizes strongly to its complement, but not to the complement of any other probe.
• For highly realistic models of hybridization there appears to be no method of solution short of brute-force search.
• A reasonable simplified model has an elegant solution.

Simplified Model
• 2-4 rule: the melting temperature of a DNA sequence is twice the number of A's and T's within it, plus four times the number of C's and G's.
• Simplified problem: find a maximum number of probes such that each has melting temperature a, but no sequence of melting temperature b occurs as a substring of two different probes.
• Open question: how to modify the solution to the simplified problem to satisfy the constraints of more realistic models.

Principles for Designing Computational Strategies
• An organism is best understood in the light of its evolutionary relationship to other organisms.
• The use of diverse sources of data is often the key to success.
• Problems of finding structure within data should be framed within statistical models, so that significance can be attached to the structures that are found.

Fundamental Problems that Need Better Algorithms
• Multiple alignment
• Global alignment of multiple genomes
• Phylogeny construction
• Genome rearrangement
• Approximate string matching
• Clustering biological data
• Feature selection: finding small sets of input variables that most accurately predict a given output variable.

SNPs, Genotypes, and Haplotypes
• SNP: site where the two copies of a chromosome commonly contain different bases.
• Genotype: the pair of bases occurring at each SNP.
• Haplotype: designates which base lies on which copy.

Haplotyping Problems
• Given the genotypes of a sample of individuals, determine:
  – The common haplotypes and their frequencies
  – The haplotype of each individual
  – The influence of an individual's haplotype on observable phenotypes such as disease.

Analysis of Gene Regulation
• Gene finding
• Breaking the cis-regulatory code (analysis of transcriptional regulation)
  – Characterize the binding sites of transcription factors
  – Find sets of transcription factors that work in combination to induce or repress many genes
• Analysis of signal transduction pathways and protein complexes using protein-protein interaction data.

Combinatorial Analysis of Transcriptional Regulation
• Goal: find sets of transcription factors whose binding sites co-occur frequently in the promoter regions of selected sets of genes and determine whether these transcription factors combine to activate or suppress transcription.

Databases and Tools
• TRANSFAC: binding site motifs for 414 TFs occurring in vertebrate genomes
• RefSeq: database of human genes
• LBNL alignment of human and mouse genomes
• rVista: tool for finding human-mouse conserved motif occurrences
• Expression data and phases for cell-cycle regulated genes
• Stress response genes in the GO database and their subcategories

General Mechanisms for Combining Diverse Data Sources
• Biclustering
  – A. Tanay, R. Sharan, M. Kupiec, R. Shamir, manuscript (2003)
• Probabilistic graphical models
• Kernel-based data fusion
  – G. Lanckriet, M. Deng, N. Cristianini, M. Jordan, W. Noble, Technical Report 645, UC Berkeley Department of Statistics (2003)

Biclustering
Given a (0,1) matrix in which the rows represent genes, the columns represent properties of genes (function, expression, association with diseases etc.) and a 1 in the (g,p) entry indicates that gene g has property p, find submatrices with an unusually high density of 1s.
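To make the biclustering objective concrete, the sketch below measures the density of 1s in a chosen set of rows and columns and applies a simple greedy peeling heuristic: repeatedly drop the sparsest remaining row or column until the target density is reached. This is only an illustration of the problem statement, not the method of the cited manuscript, and it attaches no statistical significance to what it finds. The names `density` and `greedy_dense_submatrix` and the `min_density` parameter are assumptions introduced for the example.

```python
from typing import List, Set, Tuple

Matrix = List[List[int]]  # rows are genes, columns are properties, entries 0 or 1


def density(matrix: Matrix, rows: Set[int], cols: Set[int]) -> float:
    """Fraction of 1s in the submatrix selected by rows x cols."""
    if not rows or not cols:
        return 0.0
    ones = sum(matrix[r][c] for r in rows for c in cols)
    return ones / (len(rows) * len(cols))


def greedy_dense_submatrix(
    matrix: Matrix, min_density: float = 0.9
) -> Tuple[List[int], List[int]]:
    """Peel off the sparsest row or column until density >= min_density."""
    rows: Set[int] = set(range(len(matrix)))
    cols: Set[int] = set(range(len(matrix[0]))) if matrix else set()

    while rows and cols and density(matrix, rows, cols) < min_density:
        # Fraction of 1s in each remaining row and column,
        # restricted to the current submatrix.
        row_score = {r: sum(matrix[r][c] for c in cols) / len(cols) for r in rows}
        col_score = {c: sum(matrix[r][c] for r in rows) / len(rows) for c in cols}
        worst_row = min(sorted(rows), key=row_score.get)
        worst_col = min(sorted(cols), key=col_score.get)
        # Remove whichever is sparser; ties go to the row.
        if row_score[worst_row] <= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)

    return sorted(rows), sorted(cols)
```

A greedy peel can shrink to a trivially small block, so a realistic method would also score size and compare against a null model of the matrix, in line with the principle that structures found in data should carry a significance estimate.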
Probabilistic Graphical Model (PGM)
• Graph-theoretic representation of the probabilistic and deterministic relationships among a set of variables. Vertices correspond to variables and directed edges represent dependencies. Some variables are observed and others are hidden.
• A PGM provides an algorithm for generating samples from the joint distribution of its variables.
• Given the values of the observed variables, there is an automatic (but not necessarily efficient) procedure for inferring the most likely values of the hidden variables.

Application: Finding Binding Site Motifs
• Observed variables: the genomic sequences within which the motifs occur
• Hidden variables: the locations of the motif occurrences, the nucleotide distributions at sites within and between motif occurrences, and metaparameters governing these nucleotide distributions.
• Questions: How much data is required to train a PGM, and how rapidly will the inference algorithm converge?

Classification Using Diverse Sources of Data
Example: classifying proteins based on five types of data:
(1) Their domain structures
(2) Protein-protein interactions
(3) Genetic interactions
(4) Co-participation in protein complexes
(5) Cell cycle gene expression measurements

Support Vector Machine (SVM)
• Input: a training set {p_1, p_2, …, p_n} of proteins, a class label (positive or negative) for each protein and an n × n positive-definite matrix S = (s_jk) giving the similarities between all pairs of proteins.
• The SVM algorithm produces a decision rule that achieves maximal separation between the positive and negative examples in the training set and can be used to classify additional proteins on the basis of their similarities to proteins in the training set.

Extension to Diverse Data Sources
• The t-th data source gives a positive-definite matrix S_t = (s_jk^t) of similarities between the proteins.
• Data fusion: any positive linear combination of these matrices gives a positive-definite similarity matrix.
• The problem of choosing the linear combination giving the largest margin of separation between positive and negative examples can be solved by semidefinite programming.

Computer Science Paradigms from Biology and Genomics
• Living cells can adapt to environmental changes, but large computer programs are brittle. Does biology hold clues for software engineering?
• Genomics algorithms are required to perform well on real-life data, not on all possible data. Should theoretical computer science depart from worst-case analysis?
• The Celera whole-genome shotgun sequencing algorithm is an instance of a general approach to combinatorial puzzle solving in which constraints on the solution are enforced in an order determined by the strength of evidence for them. Should this approach be studied within theoretical computer science?