CS5263 Bioinformatics Probabilistic modeling approaches for motif finding

Motif representation
• Collection of exact words
– {ACGTTAC, ACGCTAC, AGGTGAC, …}
• Consensus sequence (with wild cards)
– {AcGTgTtAC}
– {ASGTKTKAC} S=C/G, K=G/T (IUPAC code)
• Position specific weight matrices (PWMs)
Sequence Logo
[Figure: sequence logo]
Position    1     2     3     4     5     6     7     8     9
A          .97   .10   .02   .03   .10   .01   .05   .85   .03
C          .01   .40   .01   .04   .05   .01   .05   .05   .03
G          .01   .40   .95   .03   .40   .01   .30   .05   .03
T          .01   .10   .02   .90   .45   .97   .60   .05   .91
I          1.76  0.28  1.64  1.37  0.40  1.76  0.60  1.15  1.42
I: information content of each position, in bits
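The I row of the PWM above can be reproduced as 2 minus the Shannon entropy of each column. The sketch below assumes a uniform background (so the maximum is 2 bits); the names `PWM` and `information_content` are illustrative, not from the slides.

```python
import math

# The 9-position PWM from the slide: one list of probabilities per base.
PWM = {
    "A": [.97, .10, .02, .03, .10, .01, .05, .85, .03],
    "C": [.01, .40, .01, .04, .05, .01, .05, .05, .03],
    "G": [.01, .40, .95, .03, .40, .01, .30, .05, .03],
    "T": [.01, .10, .02, .90, .45, .97, .60, .05, .91],
}

def information_content(column):
    """Return 2 - H(column) in bits, where H is the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column if p > 0)
    return 2.0 - entropy

# Transpose the rows into per-position columns (A, C, G, T).
columns = list(zip(PWM["A"], PWM["C"], PWM["G"], PWM["T"]))
ic = [information_content(col) for col in columns]
print([round(x, 2) for x in ic])
```

Positions with one dominant base (1, 3, 6) come out close to 2 bits; nearly flat positions (2, 5) come out close to 0.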
Finding Motifs
Classification of approaches
• Combinatorial search
– Based on enumeration of words and
computing word similarities
– Analogy to DP for sequence alignment
• Probabilistic modeling
– Construct models to distinguish motifs vs nonmotifs
– Analogy to HMM for sequence alignment
Combinatorial motif finding
Given a set of sequences S = {x1, …, xn}
• A motif W is a consensus string w1…wK
• Find motif W* with “best” match to x1, …, xn
Definition of “best”:
$d(W, x_i)$ = min Hamming distance between W and a word in $x_i$
$d(W, S) = \sum_i d(W, x_i)$
$W^* = \arg\min_W d(W, S)$
Exhaustive searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T ($4^K$ possibilities)
Find d( W, S )
Report W* = argmin( d(W, S) )
Running time: $O(K N 4^K)$ (where $N = \sum_i |x_i|$)
Guaranteed to find the optimal solution.
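A minimal sketch of the pattern-driven search, assuming DNA sequences over ACGT that are each at least K long. The function names and the toy sequences are made up for illustration.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_word_seq(w, x):
    """d(W, x_i): minimum Hamming distance between w and any len(w)-long word in x."""
    k = len(w)
    return min(hamming(w, x[i:i + k]) for i in range(len(x) - k + 1))

def d_word_set(w, seqs):
    """d(W, S): sum of d(W, x_i) over all sequences."""
    return sum(d_word_seq(w, x) for x in seqs)

def pattern_driven(seqs, k):
    """Try all 4^k words and return (best word, its total distance)."""
    candidates = ("".join(p) for p in product("ACGT", repeat=k))
    best = min(candidates, key=lambda w: d_word_set(w, seqs))
    return best, d_word_set(best, seqs)

seqs = ["TTACGTACTT", "GGACGAACGG", "CCACGTACCC"]
print(pattern_driven(seqs, 6))
```

The 4^K candidate loop is what makes this practical only for short motifs.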
Exhaustive searches
2. Sample-driven algorithm:
For W = a K-long word in some xi
Find d( W, S )
Report W* = argmin( d( W, S ) )
OR Report a local improvement of W*
Running time: $O(K N^2)$
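A sketch of the sample-driven variant under the same assumptions (function names are illustrative). Candidates are restricted to K-long words that actually occur in some sequence, so the true optimum can be missed when the consensus itself never occurs.

```python
def total_distance(w, seqs):
    """d(W, S): for each sequence take the closest K-long word, then sum."""
    k = len(w)
    return sum(
        min(sum(a != b for a, b in zip(w, x[i:i + k]))
            for i in range(len(x) - k + 1))
        for x in seqs
    )

def sample_driven(seqs, k):
    """Evaluate only words present in the input; return (best word, distance)."""
    words = {x[i:i + k] for x in seqs for i in range(len(x) - k + 1)}
    # sorted() only makes tie-breaking deterministic.
    best = min(sorted(words), key=lambda w: total_distance(w, seqs))
    return best, total_distance(best, seqs)

print(sample_driven(["TTACGTACTT", "GGACGAACGG", "CCACGTACCC"], 6))
print(sample_driven(["AAAT", "AATA", "ATAA"], 4))
```

In the second call the consensus AAAA (total distance 3) occurs in no sequence, so the best sampled word has distance 4. This is why a local improvement step after the search can help.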
WEEDER: algorithm sketch
[Figure: tree traversal for the current pattern P, |P| < K; edges are labeled A, C, G, T and each node carries (e, B): e = # mismatches, B = seq occ]
• A list containing all eligible
nodes: with at most α
mismatches to P
• For each node, remember
#mismatches accumulated (e),
and bit vector (B) for seq occ,
e.g. [011100010]
• Bit OR all B’s to get seq
occurrence for P
• Suppose #occ >= m
– Pattern still valid
• Now add a letter
WEEDER: algorithm sketch
[Figure: the same tree after extending the current pattern P by one letter; each node carries (e, B)]
• Simple extension: no branches.
– No change to B
– e may increase by 1 or stay unchanged
– Drop node if e > α
• Branches: replace a node with
its child nodes
– Drop if e > α
– B may change
• Re-do Bit OR using all B’s
• Try a different char if #occ < m
• Report P when |P| = K
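The extension idea can be sketched without a suffix tree. This is a simplification, not WEEDER itself: each occurrence is kept as an explicit (sequence, start, e) tuple instead of a tree node, a set of sequence indices stands in for the bit vector B, and α is a fixed total rather than a rate that depends on pattern length. All names are illustrative.

```python
def extend_patterns(seqs, K, alpha, m):
    """Return all K-long patterns that occur with <= alpha mismatches
    in at least m of the sequences."""
    results = []
    # Every window that can hold a K-long word starts with 0 mismatches.
    start = [(s, p, 0) for s, seq in enumerate(seqs)
             for p in range(len(seq) - K + 1)]

    def grow(pattern, occs):
        if len(pattern) == K:
            results.append(pattern)          # report P when |P| = K
            return
        j = len(pattern)
        for letter in "ACGT":                # try each possible next letter
            nxt = []
            for s, p, e in occs:
                e2 = e + (seqs[s][p + j] != letter)
                if e2 <= alpha:              # drop occurrence if e > alpha
                    nxt.append((s, p, e2))
            # Sequence occurrence for the extended pattern.
            if len({s for s, _, _ in nxt}) >= m:
                grow(pattern + letter, nxt)

    grow("", start)
    return results

seqs = ["TTACGTACTT", "GGACGAACGG", "CCACGTACCC"]
print("ACGTAC" in extend_patterns(seqs, 6, 1, 3))
```

Pruning happens as soon as a prefix falls below m sequences, which is what keeps the search far smaller than the full 4^K enumeration.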
Probabilistic modeling approaches
for motif finding
Probabilistic modeling approaches
• A motif model
– Usually a PWM
– M = (Pij), i = 1..4, j = 1..k, k: motif length
• A background model
– Usually the distribution of base frequencies in
the genome (or other selected subsets of
sequences)
– B = (bi), i = 1..4
• A word can be generated by M or B
Expectation-Maximization
• For any word W,
– $P(W \mid M) = P_{W[1],1} \, P_{W[2],2} \cdots P_{W[K],K}$
– $P(W \mid B) = b_{W[1]} \, b_{W[2]} \cdots b_{W[K]}$
• Let λ = P(M), i.e., the probability for any word to be generated by M.
• Then P(B) = 1 - λ
• Can compute the posterior probability P(M|W) and P(B|W)
– $P(M \mid W) \propto P(W \mid M) \cdot \lambda$
– $P(B \mid W) \propto P(W \mid B) \cdot (1 - \lambda)$
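A small sketch of the posterior computation, assuming the PWM is stored as 4 rows (A, C, G, T) by K columns as in M = (Pij). The toy PWM, the uniform background, and the prior `lam` = 0.1 are made-up values for illustration.

```python
IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def prob_motif(word, pwm):
    """P(W | M): pwm[i][j] is the probability of base i at motif position j."""
    p = 1.0
    for j, base in enumerate(word):
        p *= pwm[IDX[base]][j]
    return p

def prob_background(word, bg):
    """P(W | B): product of background base frequencies."""
    p = 1.0
    for base in word:
        p *= bg[IDX[base]]
    return p

def posterior_motif(word, pwm, bg, lam):
    """P(M | W), normalized so that P(M | W) + P(B | W) = 1."""
    pm = prob_motif(word, pwm) * lam
    pb = prob_background(word, bg) * (1.0 - lam)
    return pm / (pm + pb)

# Toy 3-column PWM that prefers the word ACG (each column sums to 1).
pwm = [[0.7, 0.1, 0.1],   # A
       [0.1, 0.7, 0.1],   # C
       [0.1, 0.1, 0.7],   # G
       [0.1, 0.1, 0.1]]   # T
bg = [0.25, 0.25, 0.25, 0.25]
print(posterior_motif("ACG", pwm, bg, 0.1), posterior_motif("TTT", pwm, bg, 0.1))
```

Even with a small prior, a word that matches the PWM well gets a high posterior, while a poorly matching word stays near zero.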
Expectation-Maximization
Initialize:
Randomly assign each word to M or B
• Let Zxy = 1 if position y in sequence x is a motif, and 0
otherwise
• Estimate parameters M, λ, B
Iterate until convergence:
• E-step: Zxy = P(M | X[y..y+k-1]) for all x and y
• M-step: re-estimate M, λ given Z (B usually fixed)
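The loop above might be sketched as follows. Assumptions not in the slides: a fixed number of iterations instead of a convergence test, a pseudocount to keep probabilities away from zero, and B taken as the overall base frequencies of the input. The names `em_motif`, `pseudo`, and `lam` are illustrative.

```python
import random

BASES = "ACGT"

def em_motif(seqs, k, n_iter=50, seed=0, pseudo=0.1):
    rng = random.Random(seed)
    # Every k-long window of every sequence is one word.
    words = [x[y:y + k] for x in seqs for y in range(len(x) - k + 1)]
    # Background B: overall base frequencies, kept fixed.
    text = "".join(seqs)
    bg = {b: text.count(b) / len(text) for b in BASES}
    # Initialize: randomly assign each word to M (Z = 1) or B (Z = 0).
    z = [1.0 if rng.random() < 0.5 else 0.0 for _ in words]
    for _ in range(n_iter):
        # M-step: re-estimate the PWM and lambda from the current Z.
        counts = [{b: pseudo for b in BASES} for _ in range(k)]
        for w, zw in zip(words, z):
            for j, b in enumerate(w):
                counts[j][b] += zw
        pwm = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
        lam = sum(z) / len(z)
        # E-step: Z becomes the posterior P(M | word).
        z_new = []
        for w in words:
            pm, pb = lam, 1.0 - lam
            for j, b in enumerate(w):
                pm *= pwm[j][b]
                pb *= bg[b]
            z_new.append(pm / (pm + pb))
        z = z_new
    return pwm, lam, z, words

pwm, lam, z, words = em_motif(["TTACGTACTT", "GGACGTACGG", "CCACGTACCC"], 6)
print(round(lam, 3), max(zip(z, words)))
```

The result depends on the random initialization, which is one reason for running from multiple starting points.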
Expectation-Maximization
[Figure: probability of a motif at each position (1, 5, 9, …) of a sequence, shown after Initialize, E-step, and M-step]
• E-step: Zxy = P(M | X[y..y+k-1]) for all x and y
• M-step: re-estimate M, λ given Z
MEME
• Multiple EM for Motif Elicitation
• Bailey and Elkan, UCSD
• http://meme.sdsc.edu/
• Multiple starting points
• Multiple modes: ZOOPS, OOPS, TCM
Gibbs Sampling
• Another very useful technique for
estimating missing parameters
• EM is deterministic
– Often trapped by local optima
• Gibbs sampling: stochastic behavior to
avoid local optima
Gibbs sampling
Initialize:
Randomly assign each word to M or B
• Let Zxy = 1 if position y in sequence x is a motif, and 0
otherwise
• Estimate parameters M, B, λ
Iterate:
• Randomly remove a sequence X* from S
• Recalculate model parameters using S \ X*
• Compute Zx*y for X*
• Sample a y* from Zx*y
• Let Zx*y = 1 for y = y* and 0 otherwise
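A minimal sketch of the iteration above, under a simplifying assumption that is not in the slide: exactly one motif occurrence per sequence, so λ drops out and the sampling weight of each position is P(W | M) / P(W | B). The names and the pseudocount are illustrative.

```python
import random

BASES = "ACGT"

def gibbs_motif(seqs, k, n_iter=200, seed=0, pseudo=0.5):
    rng = random.Random(seed)
    text = "".join(seqs)
    bg = {b: (text.count(b) + pseudo) / (len(text) + 4 * pseudo) for b in BASES}
    # Initialize: one random motif start per sequence (Z = 1 there, 0 elsewhere).
    starts = [rng.randrange(len(x) - k + 1) for x in seqs]
    for _ in range(n_iter):
        held = rng.randrange(len(seqs))      # randomly remove a sequence X*
        # Recalculate the PWM from the current sites of all other sequences.
        counts = [{b: pseudo for b in BASES} for _ in range(k)]
        for i, (x, s) in enumerate(zip(seqs, starts)):
            if i != held:
                for j, b in enumerate(x[s:s + k]):
                    counts[j][b] += 1
        pwm = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
        # Weight of every candidate start y in X*.
        x = seqs[held]
        weights = []
        for y in range(len(x) - k + 1):
            ratio = 1.0
            for j, b in enumerate(x[y:y + k]):
                ratio *= pwm[j][b] / bg[b]
            weights.append(ratio)
        # Sample y* in proportion to the weights instead of taking the maximum.
        starts[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["TTACGTACTT", "GGACGTACGG", "CCACGTACCC"]
print(gibbs_motif(seqs, 6))
```

Because y* is sampled rather than maximized, different seeds can give different answers; in practice the sampler is run several times.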
Gibbs Sampling
[Figure: probability (0 to 0.2) of the motif starting at each position (0 to 20) of a sequence; sampling picks one position from this distribution]
• Gibbs sampling: sample one position according to probability
– Update prediction of one training sequence at a time
• Viterbi: always take the highest
• EM: take weighted average
– Simultaneously update predictions of all sequences
Better background model
• Repetitive DNA can be mistaken for a motif
– Especially low-complexity sequence such as CACACA…, AAAAA, etc.
• Solution: more elaborate background model
– Higher-order Markov model
0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK); X, bi ∈ {A,C,G,T} }
Has been applied to EM and Gibbs (up to 3rd order)
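A sketch of how a higher-order background could be estimated and used. The pseudocount and the uniform fallback for unseen contexts are assumptions made here, and the function names are illustrative.

```python
from collections import defaultdict
import math

def train_markov(seqs, order, pseudo=1.0):
    """Estimate P(X | previous `order` bases) from the given sequences."""
    counts = defaultdict(lambda: {b: pseudo for b in "ACGT"})
    for x in seqs:
        for i in range(order, len(x)):
            counts[x[i - order:i]][x[i]] += 1
    return {ctx: {b: c[b] / sum(c.values()) for b in "ACGT"}
            for ctx, c in counts.items()}

def log_prob(word, model, order):
    """log P(word | B), using only positions that have a full context."""
    lp = 0.0
    for i in range(order, len(word)):
        probs = model.get(word[i - order:i])
        p = probs[word[i]] if probs else 0.25   # unseen context: uniform
        lp += math.log(p)
    return lp

background = ["CACACACACACACACA"]
m0 = train_markov(background, 0)   # 0th order: base frequencies
m1 = train_markov(background, 1)   # 1st order: P(X | previous base)
print(log_prob("CACACA", m0, 0), log_prob("CACACA", m1, 1))
```

Under the 1st-order model the repeat CACACA is far more probable as background than under the 0th-order model, so it gains less when scored against a motif model.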
Gibbs sampling motif finders
• Gibbs Sampler
– First appeared as: Lawrence et al. Science 262(5131):208-214.
– Continually developed and updated. webpage
– The newest version: Thompson et al. Nucleic Acids Res. 35 (s2):W232-W237
• AlignACE
– Hughes et al., J. of Mol Bio, 2000 10;296(5):1205-14.
– Allows don’t-care positions
– Additional tools to scan motifs on new seqs, and to compare and group motifs
• BioProspector (X. Liu et al., PSB 2001), an improvement of AlignACE
– Liu, Brutlag and Liu. Pac Symp Biocomput. 2001;:127-38.
– Allows two-block motifs
– Considers higher-order Markov models
Limits of Motif Finders
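[Figure: upstream region of a gene; the gene starts at position 0 and the extent of the region to search is unknown (???)]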
• Given upstream regions of coregulated genes:
– Increasing length makes motif finding harder –
random motifs clutter the true ones
– Decreasing length makes motif finding harder – true
motif missing in some sequences
Challenging problem
[Figure: n = 20 sequences of length L = 600, each containing an instance of a motif of length k with d mutations]
• (k, d)-motif challenge problem
• Many algorithms fail at (15, 4)-motif for n = 20 and L = 600
• Combinatorial algorithms usually work better on challenge problem
– However, they are usually designed to find (k, d)-motifs
– Performance in real data varies
(15, 4)-motif
• Information content: 11.7 bits
• Roughly equivalent to an exact 6-mer. Expected occurrence by chance: about 1 per 3k bp (2^11.7 ≈ 3300)
[Figure: logo of the actual motif, followed by logos of the results reported by MEME]
Results by MEME:
• llr = 163, E-value = 3.2e+005
• llr = 177, E-value = 1.5e+006
• llr = 88, E-value = 2.5e+005
Motif finding in practice
• Now we’ve found some good looking
motifs
– This is probably the easiest step
• What to do next?
– Are they real?
– How do we find more instances in the rest of
the genome?
– What is their functional meaning?
• Motifs => regulatory networks
How to make sense of the motifs?
• Each program usually reports a number of motifs
(tens to hundreds)
– Many motifs are variations of each other
– Each program also reports some different ones
• Each program has its own way of scoring motifs
– Best-scored motifs are often not interesting
– AAAAAAAA
– ACACACAC
– TATATATAT
How to make sense of the motifs?
• Combining results from different algorithms is usually helpful
– Ones that appeared multiple times are probably more
interesting
• Except simple repeats like AAAAA or ATATATATA
– Cluster motifs into groups.
• Compare with known motifs in database
– TRANSFAC
– JASPAR
– YPD (yeast promoter database)
Strategies to improve results
• How to tell real (functional) motifs from noise? Statistical test of significance.
– Enrichment in target sequences vs
background sequences
Target set T: assumed to contain a common motif, P
Background set B: assumed not to contain P, or to contain it at very low frequency
Ideal case: every sequence in T has P, no sequence in B has P
Statistical test for significance
Target set T: N sequences, P appeared in n of them
Background set + target set (B+T): M sequences, P appeared in m of them
• If n / N >> m / M
– P is enriched (over-represented) in T
– Statistical significance?
• If we randomly draw N sequences from (B+T), how likely is it that we will see at least n sequences having P?
Hypergeometric distribution
• A box with M balls (seqs), of which m
are red (with motifs), and the rest are
blue (without motifs).
– Red ball: sequences with motifs
– Blue ball: sequences without motifs
• We randomly draw N balls (seqs) from
the box
• What’s the probability we’ll see exactly n red balls?
$$\mathrm{hyperGeom}(n; M, N, m) = \frac{\binom{m}{n}\binom{M-m}{N-n}}{\binom{M}{N}}$$
Numerator: # of choices to have n red balls
Denominator: total # of choices to draw N balls
Cumulative hypergeometric test for
motif significance
• We are interested in: if we randomly pick N balls, how likely is it that we’ll see at least n red balls?
$$\mathrm{cHyperGeom}(n; M, N, m) = \sum_{i=n}^{\min(m,N)} \mathrm{hyperGeom}(i; M, N, m) = 1 - \sum_{i=0}^{n-1} \mathrm{hyperGeom}(i; M, N, m) = 1 - \sum_{i=0}^{n-1} \frac{\binom{m}{i}\binom{M-m}{N-i}}{\binom{M}{N}}$$
Null hypothesis: our selection is random.
Alternative hypothesis: our selection favored red balls.
When prob is small, we reject the null hypothesis.
Equivalent: we accept the alternative hypothesis (the number of red balls is larger than expected).
Example
• Yeast genome has 6000 genes
• Select 50 genes believed to be co-regulated by a common TF
• Found a motif from the promoter seqs of these 50 genes
• The motif appears in 20 of these 50 genes
• In the rest of the genome, 100 genes have this motif
• M = 6000, N = 50, m = 100+20 = 120, n = 20
• Intuitively:
– m/M = 120/6000 = 1/50 (1 out of 50 genes has the motif)
– N = 50, would expect only 1 gene in the target set to have the motif
– 20-fold enrichment
• P-value = cHyperGeom(20; 6000, 50, 120) = 6 x 10^-22
• This motif is significantly enriched in the set of genes
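The P-value in this example can be checked with exact integer binomial coefficients. The function names below are illustrative; the upper tail is summed directly because computing 1 minus the lower tail would lose a value this small to floating-point rounding.

```python
from math import comb

def hypergeom_pmf(n, M, N, m):
    """Probability of exactly n red balls when drawing N from M balls, m of them red."""
    return comb(m, n) * comb(M - m, N - n) / comb(M, N)

def hypergeom_upper_tail(n, M, N, m):
    """Probability of at least n red balls (the cumulative test)."""
    return sum(hypergeom_pmf(i, M, N, m) for i in range(n, min(m, N) + 1))

# Yeast example: M = 6000 genes, N = 50 selected, m = 120 with the motif, n = 20.
p_value = hypergeom_upper_tail(20, 6000, 50, 120)
print(f"{p_value:.1e}")
```

The same numbers would be passed to a statistics library in practice; the explicit sum is shown only to mirror the formula.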
ROC curve for motif significance
• Motif is usually a PWM
• Any word will have a score
– Typical scoring function: Log (P(W | M) / P(W | B))
– W: a word.
– M: a PWM.
– B: background model
• To determine whether motif M occurred in a sequence, a cutoff has to be decided
– Different cutoffs give different # of occurrences
– Stringent cutoff: low occurrence in both + and - sequences
– Loose cutoff: high occurrence in both + and - sequences
– It may be better to look at a range of cutoffs
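A sketch of the scoring function and the cutoff decision, assuming a 0th-order background and natural logarithms (the slide does not fix the base). The PWM is stored as one dict per motif position; the toy values and function names are made up for illustration.

```python
import math

def log_odds(word, pwm, bg):
    """Log (P(W | M) / P(W | B)) for a word as long as the PWM."""
    return sum(math.log(pwm[j][b] / bg[b]) for j, b in enumerate(word))

def best_score(seq, pwm, bg):
    """Highest log-odds score over all windows of the sequence."""
    k = len(pwm)
    return max(log_odds(seq[i:i + k], pwm, bg) for i in range(len(seq) - k + 1))

def occurs(seq, pwm, bg, cutoff):
    """The motif is called present if any window scores at or above the cutoff."""
    return best_score(seq, pwm, bg) >= cutoff

# Toy PWM of length 3 that prefers ACG, with a uniform background.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(occurs("TTACGTT", pwm, bg, 3.0), occurs("TTTTTTT", pwm, bg, 0.0))
```

Lowering the cutoff makes more sequences count as occurrences in both the + and - sets, which is why a range of cutoffs is worth examining.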
ROC curve for motif significance
Given a score cutoff:
Target set T: N sequences, P appeared in n of them
Background set + target set (B+T): M sequences, P appeared in m of them
• With different score cutoff, will have different m and n
• Assume you want to use P to classify T and B
• Sensitivity: n / N
• Specificity: (M-N-m+n) / (M-N)
• False Positive Rate = 1 – specificity: (m – n) / (M-N)
• With decreasing cutoff, sensitivity ↑, FPR ↑
ROC curve for motif significance
[Figure: ROC curves (sensitivity vs. 1-specificity, both from 0 to 1) for Motif 1, Motif 2, and Random, with a good cutoff marked on the curve]
• Highest cutoff: no motif can pass the cutoff. Sensitivity = 0, specificity = 1.
• Lowest cutoff: every sequence has the motif. Sensitivity = 1, specificity = 0.
• ROC-AUC: area under curve. 1: the best. 0.5: random.
• Motif 1 is more enriched than motif 2.
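A sketch of how the curve and its area might be computed from per-sequence motif scores (for example, the best log-odds score in each sequence). In the slide's notation, sensitivity is n / N and FPR is (m - n) / (M - N) at each cutoff. The function names and the toy scores are illustrative.

```python
def roc_points(target_scores, background_scores):
    """(FPR, sensitivity) at every distinct cutoff, from strictest to loosest."""
    cutoffs = sorted(set(target_scores) | set(background_scores), reverse=True)
    points = [(0.0, 0.0)]          # cutoff above every score: nothing passes
    for c in cutoffs:
        sens = sum(s >= c for s in target_scores) / len(target_scores)
        fpr = sum(s >= c for s in background_scores) / len(background_scores)
        points.append((fpr, sens))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

print(auc(roc_points([5, 6, 7], [1, 2, 3])))   # perfectly separated scores
print(auc(roc_points([3, 1], [2, 0])))         # partially overlapping scores
```

Comparing the AUC of two motifs on the same target and background sets avoids having to commit to a single cutoff.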
Other strategies
• Cross-validation
– Randomly divide sequences into 10 sets, hold 1 set
for test.
– Do motif finding on 9 sets. Does the motif also appear
in the testing set?
• Phylogenetic conservation information
– Does a motif also appear in the homologous genes of another species?
– Strongest evidence
– However, will not be able to find species-specific ones
Other strategies
• Finding motif modules
– Will two motifs always appear in the same gene?
• Location preference
– Some motifs tend to appear at certain locations
• E.g., within 50-150bp upstream of the transcription start
– If a detected motif has a strong positional bias, this may be a sign of its function
• Evidence from other types of data sources
– Do the genes having the motif always have similar activities
(gene expression levels) across different conditions?
– Interact with the same set of proteins?
– Similar functions?
– etc.
To search for new instances
• Usually many false positives
• Score cutoff is critical
• Can estimate a score cutoff from the “true”
binding sites
[Figure: motif finding gives a set of “true” sites; the scoring function Log (P(W | M) / P(W | B)) gives a set of scores for them]
Take mean - std of the scores of the “true” sites as a cutoff (or a cutoff such that the majority of “true” sites can be predicted).
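The mean - std rule can be sketched in a few lines. The scores below are made-up values, and the use of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption; the slide does not say which is meant.

```python
from statistics import mean, stdev

def cutoff_from_true_sites(scores):
    """Cutoff = mean - std of the scores of the known ("true") sites."""
    return mean(scores) - stdev(scores)

def fraction_recovered(scores, cutoff):
    """Fraction of the true sites that would still be predicted at this cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)

true_site_scores = [8.0, 9.0, 10.0, 11.0, 12.0]   # hypothetical log-odds scores
cutoff = cutoff_from_true_sites(true_site_scores)
print(cutoff, fraction_recovered(true_site_scores, cutoff))
```

Checking the recovered fraction shows directly how many known sites a given cutoff would sacrifice.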
To search for new instances
• Use other information, such as positional biases
of motifs to restrict the regions that a motif may
appear
• Use gene expression data to help: the genes
having the true motif should have similar
activities
– Risk of circular reasoning: most likely this is how you
get the initial sequences to do motif finding
• Phylogenetic conservation is the key
References
• D’haeseleer P (2006) What are DNA sequence motifs?
NATURE BIOTECHNOLOGY, 24 (4):423-425
• D’haeseleer P (2006) How does DNA sequence motif
discovery work? NATURE BIOTECHNOLOGY, 24
(8):959-961
• MacIsaac KD, Fraenkel E (2006) Practical strategies for
discovering regulatory DNA sequence motifs. PLoS
Comput Biol 2(4): e36
• Lawrence CE et al. (1993) Detecting Subtle Sequence
Signals: A Gibbs Sampling Strategy for Multiple
Alignment, Science, 262(5131):208-214