The Role of Algorithmic Research in Computational Genomics

Richard M. Karp
IEEE Computer Society
Bioinformatics Conference
August 14, 2003
Algorithmic Research in
Computer Science
• Computer science is a “science of the
artificial.”
• Problems are precisely stated and are often
generic rather than application-specific.
• The quality of an algorithm is measured by
its worst-case time bound.
• Mathematical elegance is just as important
as relevance to applications.
Algorithmic Research in
Computational Genomics
• The goal is to understand ground truth.
• Problem statements are often fuzzy.
• Problems are often application-specific, and
problem formulations must be faithful to those
applications.
• The quality of an algorithm is measured by its
performance on real data.
• Biological findings are more important than
computational methods.
Genomics can Benefit from
Algorithmic Research in C.S.
• Data structures such as suffix trees.
• Randomized algorithms and sampling techniques.
• Dynamic programming (sequence alignment,
RNA folding, protein threading, haplotype block
structure…)
• Network flows, graph theory, NP-completeness,
integer programming, semidefinite programming.
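As one concrete instance of the dynamic programming item above, here is a minimal sketch of global sequence alignment scoring (the standard Needleman-Wunsch recurrence). The function name and the match, mismatch and gap values are illustrative assumptions, not taken from the talk.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b.

    The scoring parameters are illustrative assumptions.
    Only two rows of the DP table are kept in memory.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))  # 2: three matches, one gap
```

The table has one cell per pair of prefixes, so the running time is proportional to the product of the two lengths.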
Adapting to Genomics
• Choose problems that are fundamental, timely and
relevant.
• Mathematical depth and elegance are highly
desirable, but often simple mathematics, artfully
applied, is the key to success.
• Avoid problems that will change when technology
changes.
• Learn the biological background of your problem,
the available sources of data and their noise
characteristics.
Adapting to Genomics
• Work with an application-oriented team and
don’t get typecast as an algorithms
specialist.
• Benchmark your algorithms on real data,
establish a user community and make your
software available and easy to use.
Sequence Assembly
• Given many noisy ‘reads’ of short substrings of a
target string, identify the target string.
• The shortest superstring problem, an elegant but
flawed abstraction: find a shortest string
containing a set of given strings as substrings. The
problem is NP-hard, and theoretical results focus
on constant-factor approximation algorithms.
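To make the abstraction concrete, the sketch below uses the well-known greedy merging heuristic for the shortest superstring problem: repeatedly merge the two strings with the largest overlap. It is an illustration only; it assumes error-free reads and is not the assembly method of any real assembler. The function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedy heuristic: returns a superstring, not necessarily a shortest one."""
    # Drop duplicates and reads contained in another read.
    strings = sorted(s for s in set(reads)
                     if not any(s != t and s in t for t in reads))
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (i, j)]
        strings.append(merged)
    return strings[0] if strings else ""


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(greedy_superstring(reads))  # ACGTACGGTT
```

Because the heuristic requires exact overlaps, a single sequencing error in a read breaks the merge, which illustrates why the abstraction is a poor fit for noisy data.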
Shortest Superstring Problem
The shortest superstring problem is only
superficially related to the sequence assembly
problem. Its difficulty stems from pathological
examples that are unlikely to occur in practice. It
does not take noisy reads into account, and admits
solutions with an unreasonably large number of
mutually overlapping reads.
Progress in Sequence Assembly
Algorithms
• Phred provides highly accurate base-specific
quality scores based on signal analysis of
sequence traces.
• Celera assembler: realistic simulations based on
the structure of repeats in genomic sequence
suggested that full-genome sequence assembly
would be possible using double-ended reads. A
sophisticated heuristic assembly algorithm was
constructed, leading to the successful assembly of
the Drosophila, human and mouse genomes.
Physical Mapping
• Goal: determine the relative locations of
sequence-tagged sites, restriction sites or
clones on a target DNA molecule.
• Radiation hybrid mapping: fragment the
target, recover random sets of fragments
and detect the sequence-tagged sites within
them.
Physical Mapping
• Optical mapping: directly image the
restriction sites on many incomplete copies
of the target.
• Clone-based mapping: generate a clone
library together with a restriction-site or
sequence-tagged-site fingerprint of each
clone. Computationally infer the relative
positions of the clones.
A Generic Subproblem
• X(i): distance in bases of site i from the 5’ end of
the target.
• Experimental data yield inequalities of the form
a(i,j) ≤ X(i) − X(j) ≤ b(i,j)
• In nearly every case, no solution existed.
• The algorithm was then modified to find the
minimal obstructions to a solution and pinpoint
the places where the experimental data needed to
be corrected.
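Systems of two-sided bounds on differences X(i) − X(j) are commonly solved as shortest-path problems. The sketch below assumes each datum is a tuple (i, j, a, b) meaning a ≤ X(i) − X(j) ≤ b, and uses Bellman-Ford relaxation: it returns positions if the system is consistent and None if a negative cycle (an inconsistent set of constraints) exists. It is a generic textbook technique, not necessarily the algorithm used in the projects described here, and it does not report the minimal obstructions.

```python
def solve_difference_constraints(n, constraints):
    """constraints: list of (i, j, a, b) meaning a <= X[i] - X[j] <= b.

    Returns positions X (shifted so the smallest is 0), or None if the
    constraints are inconsistent.
    """
    edges = []
    for i, j, a, b in constraints:
        edges.append((j, i, b))    # X[i] <= X[j] + b
        edges.append((i, j, -a))   # X[j] <= X[i] - a
    # Starting from all zeros is equivalent to adding a virtual source
    # with zero-weight edges to every site.
    x = [0] * n
    for _ in range(n + 1):
        changed = False
        for u, v, w in edges:
            if x[u] + w < x[v]:
                x[v] = x[u] + w
                changed = True
        if not changed:
            lo = min(x, default=0)
            return [v - lo for v in x]
    return None  # still relaxing after n + 1 rounds: negative cycle


data = [(1, 0, 100, 200), (2, 1, 50, 80), (2, 0, 150, 300)]
print(solve_difference_constraints(3, data))
print(solve_difference_constraints(3, data + [(2, 0, 400, 500)]))  # None
```

In the second call the added measurement contradicts the first two, which together cap X(2) − X(0) at 280.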
Why My Physical Mapping
Projects Had Little Influence
• Some problem formulations were technology-dependent
and hence of transient interest.
• Difficulty in infiltrating existing projects and
acquiring test data.
• Implementations lacked good user interfaces.
• Whole-genome sequencing supplanted physical
mapping to some extent.
Elegance vs. Realism: the Case
of Probe Selection
• Probe Selection Problem: find a maximum
number of DNA probes, such that each hybridizes
strongly to its complement, but not to the
complement of any other probe.
• For highly realistic models of hybridization there
appears to be no method of solution short of brute
force search.
• A reasonable simplified model has an elegant
solution.
Simplified Model
• 2-4 rule: the melting temperature of a DNA
sequence is twice the number of A’s and T’s within
it, plus four times the number of C’s and G’s.
• Simplified problem: Find a maximum number of
probes such that each has melting temperature ≥ a,
but no sequence of melting temperature ≥ b occurs
as a substring of two different probes.
• Open question: how to modify solution to the
simplified problem to satisfy constraints of more
realistic models.
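The 2-4 rule and the constraints of the simplified problem can be sketched as follows. The code only checks whether a given probe set is valid; it does not find a maximum set. It assumes both thresholds a and b are lower bounds on melting temperature, and the function names are hypothetical.

```python
def melting_temperature(seq):
    """2-4 rule: 2 per A or T, 4 per C or G."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("C") + seq.count("G")
    return 2 * at + 4 * gc


def hot_substrings(seq, b):
    """All substrings of seq whose melting temperature is at least b."""
    return {seq[i:j]
            for i in range(len(seq))
            for j in range(i + 1, len(seq) + 1)
            if melting_temperature(seq[i:j]) >= b}


def is_valid_probe_set(probes, a, b):
    """True if every probe has Tm >= a and no substring with Tm >= b
    occurs in two different probes (assumed reading of the constraints)."""
    if any(melting_temperature(p) < a for p in probes):
        return False
    seen = set()
    for p in probes:
        subs = hot_substrings(p, b)
        if subs & seen:
            return False
        seen |= subs
    return True


print(melting_temperature("ACGT"))                         # 12
print(is_valid_probe_set(["ACGTAA", "TTACGT"], 12, 12))    # False: both contain ACGT
```

Enumerating every substring is quadratic in probe length, which is acceptable for short probes but is only meant to make the definition precise.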
Principles for Designing
Computational Strategies
• An organism is best understood in the light of its
evolutionary relationship to other organisms.
• The use of diverse sources of data is often the key
to success.
• Problems of finding structure within data should
be framed within statistical models, so that
significance can be attached to the structures that
are found.
Fundamental Problems that Need
Better Algorithms
• Multiple alignment
• Global alignment of multiple genomes
• Phylogeny construction
• Genome rearrangement
• Approximate string matching
• Clustering biological data
• Feature selection: finding small sets of input
variables that most accurately predict a given
output variable.
SNPs, Genotypes, and
Haplotypes
• SNP: site where the two copies of a
chromosome commonly contain different
bases.
• Genotype: the pair of bases occurring at
each SNP.
• Haplotype: designates which base lies on
which copy.
Haplotyping Problems
• Given the genotypes of a sample of
individuals, determine:
– The common haplotypes and their frequencies
– The haplotype of each individual
– The influence of an individual’s haplotype on
observable phenotypes such as disease.
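The difficulty is that a genotype does not determine the haplotypes: every site where the two bases differ can be phased in two ways. The sketch below enumerates all haplotype pairs consistent with one genotype. The representation (a list of two-letter strings, one per SNP) and the function name are assumptions made for illustration.

```python
from itertools import product


def haplotype_pairs(genotype):
    """genotype: list of 2-character strings, the unordered pair of bases
    at each SNP, e.g. ["AG", "CC", "TC"].

    Returns the sorted list of distinct unordered haplotype pairs that
    are consistent with the genotype.
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        # choice[s] says which base of site s goes on the first copy.
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


print(haplotype_pairs(["AG", "CC", "TC"]))
# [('ACC', 'GCT'), ('ACT', 'GCC')]
```

With k heterozygous sites there are 2^(k−1) candidate pairs, which is why population-level information is needed to choose among them.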
Analysis of Gene Regulation
• Gene finding
• Breaking the cis-regulatory code (analysis of
transcriptional regulation)
– Characterize the binding sites of transcription factors
– Find sets of transcription factors that work in
combination to induce or repress many genes
• Analysis of signal transduction pathways and
protein complexes using protein-protein
interaction data.
Combinatorial Analysis of
Transcriptional Regulation
• Goal: find sets of transcription factors
whose binding sites co-occur frequently in
the promoter regions of selected sets of
genes and determine whether these
transcription factors combine to activate or
suppress transcription.
Databases and Tools
• TRANSFAC: binding site motifs for 414 TFs
occurring in vertebrate genomes
• RefSeq: database of human genes
• LBNL alignment of human and mouse genomes
• rVista: tool for finding human-mouse conserved
motif occurrences
• Expression data and phases for cell-cycle
regulated genes
• Stress response genes in the GO database and their
subcategories
General Mechanisms for
Combining Diverse Data Sources
• Biclustering
– A. Tanay, R. Sharan, M. Kupiec, R. Shamir,
manuscript (2003)
• Probabilistic graphical models
• Kernel-based data fusion
– G. Lanckriet, M. Deng, N. Cristianini,
M. Jordan, W. Noble, Technical Report 645, UC
Berkeley Department of Statistics (2003)
Biclustering
Given a (0,1) matrix in which the rows
represent genes, the columns represent
properties of genes (function, expression,
association with diseases etc.) and a 1 in
the (g,p) entry indicates that gene g has
property p, find submatrices with an
unusually high density of 1s.
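The objective can be made concrete with a small sketch that measures the density of a chosen submatrix and finds the densest one of a fixed size by exhaustive search. The exhaustive search is exponential and only meant for toy inputs; it is not the method of the cited manuscript, and the function names are hypothetical.

```python
from itertools import combinations


def submatrix_density(matrix, rows, cols):
    """Fraction of 1s in the submatrix selected by rows and cols."""
    ones = sum(matrix[r][c] for r in rows for c in cols)
    return ones / (len(rows) * len(cols))


def densest_submatrix(matrix, n_rows, n_cols):
    """Brute-force search over all n_rows x n_cols submatrices.

    Returns (density, rows, cols) for the first densest submatrix found.
    """
    best = None
    for rows in combinations(range(len(matrix)), n_rows):
        for cols in combinations(range(len(matrix[0])), n_cols):
            d = submatrix_density(matrix, rows, cols)
            if best is None or d > best[0]:
                best = (d, rows, cols)
    return best


# Rows are genes, columns are properties (toy data).
m = [[1, 1, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0]]
print(densest_submatrix(m, 2, 2))  # (1.0, (0, 1), (0, 1))
```

Whether a density counts as "unusually high" depends on a background model for the matrix, which this sketch leaves out.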
Probabilistic Graphical Model
(PGM)
• Graph-theoretic representation of the probabilistic
and deterministic relationships among a set of
variables. Vertices correspond to variables and
directed edges represent dependencies. Some
variables are observed and others are hidden.
• A PGM provides an algorithm for generating
samples from the joint distribution of its variables.
• Given the values of the observed variables, there
is an automatic (but not necessarily efficient)
procedure for inferring the most likely values of
the hidden variables.
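For the smallest possible model, one hidden variable with one observed child, inference reduces to Bayes' rule by enumeration. The sketch below assumes a hidden label ("motif" or "background") and an observed nucleotide; all probabilities are made-up illustrative values and the function name is hypothetical.

```python
def most_likely_hidden(prior, likelihood, observed):
    """prior: {h: P(h)}; likelihood: {h: {o: P(o | h)}}.

    Returns (most likely hidden value, posterior distribution) given
    the observed value, by enumerating the hidden values.
    """
    joint = {h: prior[h] * likelihood[h][observed] for h in prior}
    total = sum(joint.values())
    posterior = {h: p / total for h, p in joint.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


# Illustrative numbers only.
prior = {"motif": 0.1, "background": 0.9}
likelihood = {
    "motif": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
best, posterior = most_likely_hidden(prior, likelihood, "A")
print(best)  # background: the prior outweighs the likelihood here
```

Enumeration grows exponentially with the number of hidden variables, which is the sense in which the automatic procedure is not necessarily efficient.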
Application: Finding Binding
Site Motifs
• Observed variables: the genomic sequences
within which the motifs occur
• Hidden variables: the locations of the motif
occurrences, the nucleotide distributions at sites
within and between motif occurrences, and
metaparameters governing these nucleotide
distributions.
• Questions: How much data is required to train a
PGM, and how rapidly will the inference
algorithm converge?
Classification Using Diverse
Sources of Data
Example: Classifying proteins based on five
types of data:
(1) Their domain structures
(2) Protein-protein interactions
(3) Genetic interactions
(4) Co-participation in protein complexes
(5) Cell cycle gene expression measurements
Support Vector Machine (SVM)
• Input: a training set {p_1, p_2, …, p_n} of proteins, a
class label (positive or negative) for each protein
and an n × n positive-definite matrix S = (s_jk) giving
the similarities between all pairs of proteins.
• The SVM algorithm produces a decision rule that
achieves maximal separation between the positive
and negative examples in the training set and can
be used to classify additional proteins on the basis
of their similarities to proteins in the training set.
Extension to Diverse Data
Sources
• The t-th data source gives a positive-definite
matrix S_t = (s_jk^(t)) of similarities between the
proteins.
• Data fusion: Any positive linear combination of
these matrices gives a positive-definite similarity
matrix.
• The problem of choosing the linear combination
giving the largest margin of separation between
positive and negative examples can be solved by
semidefinite programming.
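The data fusion step can be illustrated with a small pure-Python sketch: form a positive linear combination of similarity matrices and confirm, via a Cholesky factorization, that the result is still positive definite. The weights here are fixed by hand; choosing them to maximize the margin is the semidefinite programming problem and is not attempted. Function names and matrices are illustrative assumptions.

```python
import math


def combine_kernels(kernels, weights):
    """Weighted sum of equally sized square matrices with positive weights."""
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels))
             for j in range(n)] for i in range(n)]


def is_positive_definite(m, tol=1e-12):
    """A symmetric matrix is positive definite iff Cholesky succeeds."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= tol:       # non-positive pivot: not positive definite
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


s1 = [[2.0, 1.0], [1.0, 2.0]]   # toy similarities from one data source
s2 = [[1.0, 0.5], [0.5, 3.0]]   # toy similarities from another
fused = combine_kernels([s1, s2], [0.3, 0.7])
print(is_positive_definite(fused))  # True
```

The check succeeds for any positive weights because x^T (Σ w_t S_t) x = Σ w_t (x^T S_t x), a sum of positive terms for every nonzero x.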
Computer Science Paradigms
from Biology and Genomics
• Living cells can adapt to environmental
changes, but large computer programs are
brittle. Does biology hold clues for software
engineering?
• Genomics algorithms are required to
perform well on real-life data, not on all
possible data. Should theoretical computer
science depart from worst-case analysis?
Computer Science Paradigms
from Biology and Genomics
• The Celera whole-genome shotgun
sequencing algorithm is an instance of a
general approach to combinatorial puzzle
solving in which constraints on the solution
are enforced in an order determined by the
strength of evidence for them. Should this
approach be studied within theoretical
computer science?