Gibbs sampling for motif finding in biological sequences

advertisement
Gibbs sampling for motif
finding in biological sequences
Christopher Sheldahl
Biological Sequences
• Proteins : “strings” of amino acids.
20 common types
|alphabet| = 20
• Nucleic acids : “strings” of nucleotides
4 common types for RNA; 4 for DNA
|alphabet| = 4
Motif Finding
• Multiple sequences : How to detect related
segments (or “motifs”)?
1: …ATACGTTGAGA…
2: …CGACGTTGCAA…
3: …CCACGCTGGAC…
Sequence Databases are Large
• Genbank : NA : 108 million sequences
• SwissProt : proteins : ~500,000
sequences
Growth of SwissProt
http://www.expasy.org/sprot/relnotes/relstat.html
The basic problem
• We have N biological sequences.
• Assume that there is one evolutionarily
related segment (substring) of known
width W in each sequence.
• How do we find this segment in each
sequence? Remember that there might
be mutations.
Protein case : We’d like to
know the following:
• Motif description (matrix): Probability of each
of the 20 aa at each position in the motif of
width w:
qi,R : Probability of R type aa at position i
of motif. Matrix is W x 20.
• Background probability (vector) of each of the
20 aa in sites not part of motif: p1….p20
• Ak : alignment vector of starting indices
for the motif in each of the N
sequences.
Scoring a given string
• If we knew the motif description matrix and
the background vector we could score any
string of width W :
W
W
P   pRi
i1
i1
Q   qi,Ri
Q
L
P

L : score of segment wrt the motif.
Ri : the aa type at position i of the segment.

Scoring an alignment of N
sequences
We can score an entire alignment :
N
AlignScore   Lx
x1

Qx
Lx 
Px
What if we just had one new
sequence?
• What if we knew the motif description
matrix and the background vector for a
large number of sequences, and we
want to quickly align a new (presumably
related) sequence?
What if we just had one new
sequence?
• Score possible locations of the motif in the
new sequence and pick one with a high (or
max) value of L:
Q
L
P
• Note that the better our motif description and
background model are, the better the new
alignment will be.

Motif Finding Dilemma
• If we knew the alignment, we could calculate
the motif description matrix and the
background vector.
• If we knew the motif description and the
background, we could calculate the alignment
(by choosing high L positions).
• We don’t know either of these - we’ve seen
one way to approach this problem before...
Expectation Maximization or
The Road Not Taken in this Lecture
Start with Random motif description and background.
Repeat until motif description converges:
1) E - step : Calculate alignment points (Ak) from Motif Description
and Background.
2) M - step : Calculate Motif Description and background from
alignments.
Lawrence et. al. 1990. “An Expectation Maximization Algorithm for the Identification and Characterization
of Common Sites in Unaligned Biolpolymer Sequences”. Proteins.
Local minima
• After initialization, simple EM algorithm
is a deterministic process.
• If the initialization is bad, can get
trapped in local minima.
• Heuristic : Try many different starting
points.
Random Sampling
• Note the heuristic of trying many
random initial positions. Maybe we can
overcome local minima if we employ
randomness more thoroughly.
• Can we take random sample alignments
from the distribution :
p(Align | S, W, Motif Matrix, Background)
where Align means A1, A2…AN.
Markov Chain Monte Carlo
• Markov Chain Monte Carlo methods
generate a Markov chain of points that
converges to a distribution of interest.
• “Monte Carlo” : The methods employ
randomness.
Metropolis-Hastings
• Metropolis-Hastings is an MCMC model that can sample from
any distribution P, using a proposal distribution
Q(x’; x).
• Initialize with random x.
• Generate new x’ =
Proposal position according to
Q(x’; x)
• Compute α = min( (P(x’) / P(x) ), 1)
and accept change with probability α.
Figure : Wikipedia
Gibbs Sampling
• Gibbs sampling is a variety of MetropolisHastings sampling where the sampling step is
always accepted.
• For multivariate distributions, in Gibbs
sampling only one parameter is changed at a
time.
• This makes Gibbs sampling particularly
useful for multivariate distributions.
Motif Finding with Gibbs :
Site Sampler
• Site sampler : Sample starting points for motif
in each sequence.
• Start with random alignments, then use
random sampling for one sequence at a time
to gradually improve the alignments.
Lawrence et al., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for
Multiple Alignment”, Science. 1993. 262(5131), 208-214
Gibbs sampling: Changing
One Thing at a Time
• N=3
• We want to sample alignment points A1 for
sequence 1, given the pattern description M,
background model B, the sequences S and
width W, and alignments A2 (for sequence 2)
and A3 (for sequence 3).
p( A1 | S, M, B,W, A2, A3)
Site Sampler Algorithm
Initialize with random alignment points.
While not converged:
Do steps 1 and 2 for each of N sequences :
1) Predictive update step :
Calculate motif matrix and background using all
sequences but the currently selected one.
2) Sampling step. Calculate Lx = Qx / Px for all starting
points x in this sequence. Choose one with a
probability proportional to Lx.
• The inner loop is an iteration of sequences.
• In step 2 you sample a new value for Ak i.e. the alignment position for sequence k
• Convergence : In theory you sample until alignment no longer changes
Example: Random Alignments
Rouchka, 1997,
“A Brief Overview
of Gibbs Sampling”
DNA:
Alphabet =
{A,C,G,T}
Example:
Initial Counts
New Counts
Remove 1st sequence (ATTTAT)
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
Rouchka, 1997.
Sequence 1:
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
Rouchka, 1997.
Final Alignment
Assumptions for Site Sampler
• Assumption of one motif.
• Assumption of one copy of motif in each
sequence.
• Assumption that we know the motif
width.
Motif Sampler
• Sample different motifs for each part of the
sequence. Store motif descriptions and
background vector for each motif.
• Allows multiple motifs, and varying number of
each motif to be present.
• Still subject to assumption of known motif
width.
Neuwald et al., “Gibbs Motif Sampling: Detection of bacterial membrane protein
repeats”. 1995. Protein Science. 4:1618-1632.
Null model
• Null model has pattern description
probabilities identical to the background
probabilities for its constituent amino
acids.
Widths and Copy Numbers
• The motif sampler requires widths and expected
numbers of each motif per sequence as input
parameters. The widths are fixed during the
algorithm, but the number of copies can change.
• Construction of subfamilies (more specific models)
favored by larger widths and smaller expected
number of copies.
• Construction of superfamilies (less specific models)
favored by smaller widths and larger expected
number of copies.
Motif Sampler Algorithm
Initialize with Ei non-overlapping random segments for each motif i in each of the
sequences.
While not converged:
Do steps 1 and 2 for successive segments in the biological sequences :
1) Predictive update step :
If this segment is in an alignment remove it and recalculate
motif matrixand background data excluding this segment.
2) Sampling step : Sample one of the motifs with a probability
proportional to the score that the motif description assigns this
segment.
• In the inner loop the motif sampler iterates through segments (substrings)
• Motif sampler samples among different motif models for each segment.
Modifications to the Samplers
• There are methods to allow for gapped
motifs.
• High scoring alignments can be stored
for later investigation.
Neuwald et al., 1995
Porin structures
Porin alignment
Neuwald et al., 1995
Motif descriptions
Motivation for Gibbs sampling
• We have a joint probability density
f(x, y1, y2,…yp)
• We want a marginal density, f(x).
• The integrations are difficult.
Casella and George, “Explaining the Gibbs Sampler”
Near-optimum Sampler
• A particular motif may not be present in
the best alignment found, even though
its found in many other near-optimal
alignments.
• Sample among near-optimal alignments
to capture all sites present.
Column sampler
• The column sampler allows for a motif
to shift:
GCACCTG --> GCACCTG
• It also allows for the development of
motifs with gaps:
GCACCTG --> GCACCTG
Download