Techniques and Applications of Sequence Comparison.

Techniques & Applications of
Sequence Comparison
Limsoon Wong
Institute for Infocomm Research
27 August 2003
Copyright 2003 limsoon wong
Lecture plan
• Basic sequence comparison methods
– Pairwise alignment
– Multiple alignment
• Applications
– Active sites
– Homologs
– Key mutation sites
• P-value
• More advanced sequence comparison
methods
Copyright 2003 limsoon wong
Basic Sequence Comparison Methods
A brief refresher
Copyright 2003 limsoon wong
Sequence Comparison:
Motivations
• DNA is the blueprint for living organisms
⇒ Evolution is related to changes in DNA
⇒ By comparing DNA sequences we can
infer evolutionary relationships between
the sequences w/o knowledge of the
evolutionary events themselves
• Foundation for inferring function, active
site, and key mutations
Alignment
Copyright 2003 limsoon wong
Alignment:
An Example
[Figure: example alignment of Sequence U and Sequence V, with a match, a mismatch, and an indel labeled]
Alignment:
Simple-minded Probability & Score
• Define score S(A) by simple log likelihood as
S(A) = log(prob(A)) - [n log(s) + r log(s)], with log(p/s) = 1
• Then S(A) = #matches - #mismatches - #indels
Copyright 2003 limsoon wong
Global Pairwise Alignment:
Problem Definition
• Given sequences U and V of lengths n
and m, the number of possible
alignments f(n, m) is given by
– f(n, m) = f(n-1, m) + f(n-1, m-1) + f(n, m-1)
– $f(n,n) \sim (1+\sqrt{2})^{2n+1}\, n^{-1/2}$
• The problem of finding a global pairwise
alignment is to find an alignment A so
that S(A) is max among exponential
number of possible alternatives
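A minimal sketch of the counting recurrence for f(n, m), to make the exponential growth concrete. The base case f(n, 0) = f(0, m) = 1 is an assumption not stated on the slide (an empty sequence can only be aligned by pairing every remaining letter with an indel), and the name `count_alignments` is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Number of alignments of two sequences of lengths n and m."""
    # Assumed base case: one sequence is exhausted, so every remaining
    # letter of the other is paired with an indel (exactly one way).
    if n == 0 or m == 0:
        return 1
    # Last column is: letter of U over indel, letter over letter,
    # or indel over letter of V.
    return (count_alignments(n - 1, m)
            + count_alignments(n - 1, m - 1)
            + count_alignments(n, m - 1))


if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        print(n, count_alignments(n, n))
```

Even for n = 10 the count is already in the millions, which is why enumerating alignments is not an option and dynamic programming is used instead.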
Copyright 2003 limsoon wong
Global Pairwise Alignment:
Dynamic Programming Solution
• Define an indel-similarity matrix s(.,.);
e.g.,
– s(x,x) = 1
– s(x,y) = -μ, if x ≠ y
• Then
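The recurrence that followed on the slide is not preserved in this extract. As a hedged sketch, assuming the standard global-alignment (Needleman-Wunsch style) recurrence S[i][j] = max(S[i-1][j] - δ, S[i-1][j-1] + s(u_i, v_j), S[i][j-1] - δ), with s(x,x) = 1, a mismatch penalty `mu`, and an indel penalty `delta` (the names and default values are assumptions), the optimal score can be computed as:

```python
def global_alignment_score(u: str, v: str, mu: float = 1.0, delta: float = 1.0) -> float:
    """Best global alignment score of u and v under the assumed scoring."""
    n, m = len(u), len(v)
    # S[i][j] = best score of aligning the prefixes u[:i] and v[:j].
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -delta * i  # u[:i] aligned entirely against indels
    for j in range(1, m + 1):
        S[0][j] = -delta * j  # v[:j] aligned entirely against indels
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = 1.0 if u[i - 1] == v[j - 1] else -mu
            S[i][j] = max(S[i - 1][j] - delta,      # u_i over an indel
                          S[i - 1][j - 1] + pair,   # u_i over v_j
                          S[i][j - 1] - delta)      # indel over v_j
    return S[n][m]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

With mu = delta = 1 this maximizes #matches - #mismatches - #indels, and it fills only (n+1)(m+1) table entries instead of enumerating all alignments.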
Copyright 2003 limsoon wong
Global Pairwise Alignment:
More realistic handling of indels
• In Nature, indels of several adjacent
letters are not the sum of single indels,
but the result of one event
• So reformulate as follows:
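The reformulated recurrence is not preserved in this extract. One common way to treat a run of adjacent indels as a single event is an affine gap penalty, computed with three tables (often attributed to Gotoh). The sketch below assumes that formulation, which may differ from the one on the slide; `gap_open` (cost of the first letter of a gap) and `gap_extend` (cost of each further letter) are hypothetical parameters.

```python
NEG_INF = float("-inf")


def affine_global_score(u: str, v: str, mu: float = 1.0,
                        gap_open: float = 2.0, gap_extend: float = 0.5) -> float:
    """Global alignment score where a gap of length k costs
    gap_open + gap_extend * (k - 1)."""
    n, m = len(u), len(v)
    # M: u_i aligned to v_j; X: u_i aligned to an indel; Y: v_j aligned to an indel.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + gap_extend * (i - 1))
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + gap_extend * (j - 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = 1.0 if u[i - 1] == v[j - 1] else -mu
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + pair
            # Open a new gap after a different state, or extend the current one.
            X[i][j] = max(M[i - 1][j] - gap_open, Y[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open, X[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # One deletion event of three letters (CCC), scored as a single gap.
    print(affine_global_score("AAACCCGGG", "AAAGGG"))
```

In the usage example, the three deleted letters cost 2 + 0.5 + 0.5 = 3 as one event, whereas three independent single indels at the opening cost would total 6.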
Copyright 2003 limsoon wong
Variations of Pairwise Alignment
• Fitting a “short” seq to a “long” seq
– Indels at beginning and end are not penalized
• Find “local” alignment
– find i, j, k, l, so that
– S(A) is maximized,
– A is alignment of u_i…u_j and v_k…v_l
[Figures: sequences U and V for each variation]
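A minimal sketch of the “local” variation, assuming the standard Smith-Waterman recurrence (the same scoring as global alignment, but every table entry is floored at 0 so that an alignment may start and end anywhere). The parameter names `mu` and `delta` and their defaults are assumptions.

```python
def local_alignment_score(u: str, v: str, mu: float = 1.0, delta: float = 1.0) -> float:
    """Best score over all alignments of a segment of u with a segment of v."""
    best = 0.0
    prev = [0.0] * (len(v) + 1)  # previous row of the DP table
    for i in range(1, len(u) + 1):
        cur = [0.0] * (len(v) + 1)
        for j in range(1, len(v) + 1):
            pair = 1.0 if u[i - 1] == v[j - 1] else -mu
            cur[j] = max(0.0,                  # start a fresh local alignment here
                         prev[j] - delta,      # u_i over an indel
                         prev[j - 1] + pair,   # u_i over v_j
                         cur[j - 1] - delta)   # indel over v_j
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```

The maximum over the whole table, rather than the bottom-right entry, gives the score of the best pair of segments u_i…u_j and v_k…v_l.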
Copyright 2003 limsoon wong
Multiple Alignment
Copyright 2003 limsoon wong
Multiple Alignment:
Naïve Approach
• Let S(A) be the score of a multiple alignment A.
The optimal multiple alignment A of sequences
U_1, …, U_r can be extracted from the following
dynamic programming computation of S_{m_1,…,m_r}:
• This requires O(2^r) steps
• Exercise: Propose a practical approximation
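To see where the 2^r factor comes from, here is a small sketch that assumes the usual r-dimensional formulation: each table entry takes a maximum over every non-empty subset of sequences that contribute a letter to the last alignment column, while the rest contribute an indel. The function names are illustrative.

```python
from itertools import product
from math import prod


def predecessor_moves(r: int) -> list:
    """All non-zero 0/1 vectors of length r: a 1 means that sequence
    contributes a letter to the last column, a 0 means an indel."""
    return [move for move in product((0, 1), repeat=r) if any(move)]


def naive_cost(lengths: list) -> tuple:
    """(number of table entries, predecessors examined per entry)."""
    entries = prod(n + 1 for n in lengths)
    return entries, len(predecessor_moves(len(lengths)))


if __name__ == "__main__":
    print(predecessor_moves(2))      # the three moves of pairwise alignment
    print(naive_cost([10, 10, 10]))
```

For r = 2 this recovers the three cases of pairwise alignment. The table itself also grows as the product of the sequence lengths, which is the main reason practical methods approximate.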
Copyright 2003 limsoon wong
Applications of Sequence Comparison
Copyright 2003 limsoon wong
Emerging Patterns
• An emerging pattern is a pattern that
occurs significantly more frequently in
one class of data compared to other
classes of data
• A lot of biological sequence analysis
problems can be thought of as extracting
emerging patterns from sequence
comparison results
Copyright 2003 limsoon wong
A protein is a ...
• A protein is a large
complex molecule
made up of one or
more chains of
amino acids
• Protein performs a
wide variety of
activities in the cell
Copyright 2003 limsoon wong
Function Assignment to Protein Sequence
SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNR
YVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWE
QNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGD
VTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTG
TFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELE
VT
• How do we attempt to assign a function
to a new protein sequence?
Copyright 2003 limsoon wong
Function Assignment:
Guilty-by-Association
• Compare the target sequence T with
sequences S1, …, Sn of known function
in a database
• Determine which ones amongst S1, …,
Sn are the most likely homologs of T
• Then assign to T the same function as
these homologs
• Finally, confirm with suitable wet
experiments
Copyright 2003 limsoon wong
Guilty-by-Association:
Homologs obtained by BLAST
• Thus our example sequence could be a
protein tyrosine phosphatase (PTP)
Copyright 2003 limsoon wong
Guilty-by-Association:
Caveats
• Ensure that the effect of database size
has been accounted for
• Ensure that the function of the homolog
is not derived via invalid “transitive
assignment”
• Ensure that the target sequence has all
the key features associated with the
function, e.g., active site and/or domain
Copyright 2003 limsoon wong
Effect of database size:
Interpretation of P-value
• Seq. comparison progs, e.g. BLAST, often associate a P-value to each hit
• P-value is interpreted as prob. that a random seq. has an equally good alignment
• Suppose the P-value of an alignment is 10^-6
• If database has 10^7 seqs, then you expect 10^7 * 10^-6 = 10 seqs in it that give an equally good alignment
⇒ Need to correct for database size if your seq. comparison prog does not do that!
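The arithmetic on this slide can be written out as a short sketch. The function names are illustrative, and the second function additionally assumes that database sequences behave as independent random sequences, which the slide does not state.

```python
import math


def expected_random_hits(p_value: float, db_size: int) -> float:
    """Expected number of database sequences that match at least as
    well as the hit purely by chance."""
    return p_value * db_size


def prob_at_least_one_random_hit(p_value: float, db_size: int) -> float:
    """Chance that some sequence in the database matches this well by
    chance alone (assumes independence between database sequences)."""
    return 1.0 - (1.0 - p_value) ** db_size


if __name__ == "__main__":
    print(expected_random_hits(1e-6, 10**7))
    print(prob_at_least_one_random_hit(1e-6, 10**7))
```

A per-sequence P-value of 10^-6 looks convincing in isolation, but against 10^7 sequences about ten equally good chance hits are expected, so the uncorrected value says little.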
Copyright 2003 limsoon wong
Examples of Invalid Function Assignment:
The IMP dehydrogenases (IMPDH)
A partial list of IMP dehydrogenase misnomers
in complete genomes remaining in some
public databases
IMPDH:
Domain Structure
IMPDH Misnomer in Methanococcus jannaschii
IMPDH Misnomers in Archaeoglobus fulgidus
• Typical IMPDHs have 2 IMPDH domains that
form the catalytic core and 2 CBS domains.
• A less common but functional IMPDH (E70218)
lacks the CBS domains.
• Misnomers show similarity to the CBS domains
Copyright 2003 limsoon wong
IMPDH:
Invalid Transitive Assignment
[Figure: root of invalid transitive assignment among sequences A, B, and C; labels: “Mis-assignment of function”, “No IMPDH domain”]
IMPDH:
Emerging Pattern
Typical IMPDH
Functional IMPDH w/o CBS
IMPDH Misnomer in Methanococcus jannaschii
IMPDH Misnomers in Archaeoglobus fulgidus
• Most IMPDHs have 2 IMPDH and 2 CBS domains.
• Some IMPDHs (e.g., E70218) lack CBS domains.
⇒ IMPDH domain is the emerging pattern
Discover Active Site and/or Domain
• How to discover the active site and/or
domain of a function in the first place?
– Multiple alignment of homologous seqs
– Determine conserved positions
⇒ Emerging patterns relative to background
⇒ Candidate active sites and/or domains
• Easier if sequences of distant
homologs are used
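A minimal sketch of the “determine conserved positions” step, assuming the multiple alignment is already given as equal-length strings with “-” for indels. The toy alignment and the `threshold` parameter are hypothetical, not data from the lecture.

```python
from collections import Counter


def column_conservation(alignment: list) -> list:
    """For each column, return (most common symbol, its frequency)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, count / len(column)))
    return result


def conserved_positions(alignment: list, threshold: float = 1.0) -> list:
    """0-based columns whose dominant residue (not an indel) reaches the threshold."""
    return [i for i, (symbol, freq) in enumerate(column_conservation(alignment))
            if symbol != "-" and freq >= threshold]


if __name__ == "__main__":
    toy_alignment = ["VHCSAGV", "IHCSDGK", "VHCTAGR"]  # hypothetical sequences
    print(conserved_positions(toy_alignment))
```

With close homologs nearly every column is conserved, so the signal is weak. With distant homologs, only positions under functional constraint tend to stay conserved, which is why they make the search easier.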
Copyright 2003 limsoon wong
Discover Active Site:
Multiple Alignment of PTPs
• Notice the PTPs agree with each other on
some positions more than other positions
• These positions are more important w.r.t. PTPs
• Else they wouldn’t be conserved by evolution
⇒ They are candidate active sites
Identifying Key Mutation Sites
Sequence from a typical PTP domain D2
• Some PTPs have 2 PTP domains
• PTP domain D1 has much more activity
than PTP domain D2
• Why? And how do you figure that out?
Copyright 2003 limsoon wong
Key Mutation Site:
Emerging Patterns of PTP D1 vs D2
• Collect example PTP D1 sequences
• Collect example PTP D2 sequences
• Make multiple alignment A1 of PTP D1
• Make multiple alignment A2 of PTP D2
• Are there positions conserved in A1 that
are violated in A2?
• These are candidate mutations that
cause PTP activity to weaken
• Confirm by wet experiments
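The comparison of A1 against A2 can be sketched as follows, assuming both alignments share one column coordinate system (i.e., D1 and D2 have been aligned to each other). That is an assumption, as is the reading of “violated” as “no D2 sequence carries the D1 residue”. The toy sequences are hypothetical.

```python
def conserved_residues(alignment: list) -> dict:
    """Map column index -> residue for columns where every row agrees."""
    return {i: column[0]
            for i, column in enumerate(zip(*alignment))
            if len(set(column)) == 1 and column[0] != "-"}


def candidate_mutation_sites(a1: list, a2: list) -> list:
    """Columns conserved in a1 whose residue never appears in that column of a2.

    Returns (column, residue in a1, residues seen in a2) triples.
    """
    sites = []
    for pos, residue in sorted(conserved_residues(a1).items()):
        a2_column = [row[pos] for row in a2]
        if residue not in a2_column:
            sites.append((pos, residue, "".join(sorted(set(a2_column)))))
    return sites


if __name__ == "__main__":
    d1 = ["WPDHG", "WPDFG", "WPDYG"]  # hypothetical D1 fragments
    d2 = ["WPEHG", "WPEQG", "WPENG"]  # hypothetical D2 fragments
    print(candidate_mutation_sites(d1, d2))
```

Each reported column is an emerging pattern of D1 relative to D2, and therefore a candidate for a mutagenesis experiment rather than a confirmed cause.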
Copyright 2003 limsoon wong
Key Mutation Site:
PTP D1 vs D2
D2
D1
• Positions marked by “!” and “?” are likely
places responsible for reduced PTP activity
– All PTP D1 agree on them
– All PTP D2 disagree on them
Copyright 2003 limsoon wong
Key Mutation Site:
PTP D1 vs D2
D2
D1
• Positions marked by “!” are even more likely as
3D modeling predicts they induce large
distortion to structure
Copyright 2003 limsoon wong
Key Mutation Sites:
Confirmation by Mutagenesis Expt
• What wet experiments are needed to
confirm the prediction?
– Mutate E → D in D2 and see if there is gain
in PTP activity
– Mutate D → E in D1 and see if there is loss
in PTP activity
Understanding P-value
Copyright 2003 limsoon wong
What is P-value?
• What does P-value mean?
– Statistical notion of P-value
– Prob that a random seq gives an equally
good alignment
• How do we calculate it?
Copyright 2003 limsoon wong
Hypothesis Testing
• Null hypothesis H0
– A claim (about a
probability
distribution) that we
are interested in
rejecting or refuting
• Alternative
hypothesis H1
– The contrasting
hypothesis that must
be true if H0 is
rejected
• Type I error
– H0 is wrongly rejected
• Type II error
– H1 is wrongly rejected
• Level of significance
– Probability of getting
a type I error
[Figure: test-statistic distribution showing the acceptance region and two rejection regions]
Description Level, aka P-value
• Instead of fixing the significance level at a value α, we may be interested in computing the probability of getting a result as extreme as, or more extreme than, the observed result under H0
• The description level of a test of H0 is the smallest level of significance α at which the observed test result would be declared significant, that is, would be declared indicative of rejection of H0
P-value:
Key Questions
• Recall S(A) scores an alignment A
• Let H(U,V) = max{ S(A) | A is an
alignment of U and V}
• Suppose the letters in U and V are iid.
Can we calculate
h = E(H(U,V))?
• Furthermore, can we calculate
P(H(U,V) > h + c)?
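Before any theory, both questions can be explored by simulation. The sketch below is a Monte Carlo estimate under a deliberate simplification: H(U,V) is taken to be the length of the longest exact match shared by U and V (no mismatches or indels), and the letters are drawn iid and uniformly. The function names, alphabet, and parameters are assumptions for illustration.

```python
import random


def longest_common_run(u, v) -> int:
    """Length of the longest contiguous exact match between u and v."""
    best = 0
    prev = [0] * (len(v) + 1)
    for a in u:
        cur = [0] * (len(v) + 1)
        for j, b in enumerate(v, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1  # extend the match ending just before
                best = max(best, cur[j])
        prev = cur
    return best


def simulate(n: int, trials: int, c: float, alphabet: str = "ACGT", seed: int = 0):
    """Estimate h = E(H(U,V)) and P(H(U,V) > h + c) for random U, V of length n."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        u = rng.choices(alphabet, k=n)
        v = rng.choices(alphabet, k=n)
        scores.append(longest_common_run(u, v))
    h = sum(scores) / trials
    tail = sum(score > h + c for score in scores) / trials
    return h, tail


if __name__ == "__main__":
    h, tail = simulate(n=100, trials=200, c=2)
    print(f"estimated h = {h:.2f}, estimated P(H > h + 2) = {tail:.3f}")
```

Simulation becomes impractical for the very small tail probabilities that matter in database search, which motivates the analytic results that follow.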
Copyright 2003 limsoon wong
Alignment:
Statistical Understanding
• Ignoring indels for now, we can think of a
good alignment as one that has a long
contiguous stretch of matches
• The matches are essentially a long run of
“heads’’ in a series of coin tosses
• So we think in terms of “a headrun of
length t begins at position i ”
Copyright 2003 limsoon wong
E(H(U,V)):
Erdos-Renyi Thm for Exact Match
Copyright 2003 limsoon wong
E(H(U,V)):
Arratia-Waterman Thm for Local Alignment
Copyright 2003 limsoon wong
Alignment:
Statistical Understanding
• Recall we think in terms of “a headrun of
length t begins at position i ”
• Caution:
– headruns occur in “clumps”
– if there is a headrun of length t at position i,
then with high prob there is also a headrun of
length t at position i+1
• So, we count only 1st headrun in a clump;
i.e., a headrun preceded by a tail
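A small sketch of the clumping issue, with tosses encoded as 1 = head and 0 = tail. Treating a run that starts at position 0 as “preceded by a tail” is a boundary convention assumed here, and the function names are illustrative.

```python
def longest_head_run(tosses: list) -> int:
    """Length of the longest run of consecutive heads."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss else 0
        best = max(best, current)
    return best


def headrun_starts(tosses: list, t: int) -> list:
    """Positions i where a headrun of length t begins."""
    return [i for i in range(len(tosses) - t + 1) if all(tosses[i:i + t])]


def declumped_headrun_starts(tosses: list, t: int) -> list:
    """Only the first headrun of each clump: preceded by a tail (or at position 0)."""
    return [i for i in headrun_starts(tosses, t) if i == 0 or not tosses[i - 1]]


if __name__ == "__main__":
    tosses = [0, 1, 1, 1, 1, 0, 1, 1, 0]
    print(longest_head_run(tosses))
    print(headrun_starts(tosses, 2))            # overlapping starts within a clump
    print(declumped_headrun_starts(tosses, 2))  # one start per clump
```

In the example, the single run of four heads produces three overlapping headruns of length 2, but only one after declumping, so counts of declumped runs are much closer to counts of independent events.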
Copyright 2003 limsoon wong
Arratia-Gordon Thm on Large
Deviations for binomials
Copyright 2003 limsoon wong
E(H(U,V)):
Exact Match, Accounting for Clumps
Copyright 2003 limsoon wong
E(H(U,V)):
Local Alignment, Accounting for Clumps
Copyright 2003 limsoon wong
P(H(U,V) > E(H(U,V))):
Approximate P-Value
Our E(Yn) from previous
slides are in the right form :-)
Copyright 2003 limsoon wong
More Advanced
Sequence Comparison Methods
• PHI-BLAST
• Iterated BLAST
Copyright 2003 limsoon wong
PHI-BLAST:
Pattern-Hit Initiated BLAST
• Input
– protein sequence and
– pattern of interest that
it contains
• Output
– protein sequences
containing the pattern
and having good
alignment
surrounding the
pattern
• Impact
– able to detect
statistically significant
similarity between
homologous proteins
that are not
recognizably related
using traditional one-pass methods
PHI-BLAST:
How it works
• find from database all seqs containing the given pattern
• find, among these, sequences with good flanking alignment
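A hedged sketch of the first step only (finding database sequences that contain the pattern), assuming the pattern is written in a small subset of PROSITE-style syntax and translated to a regular expression. This is not PHI-BLAST's actual implementation, the flanking-alignment step is not shown, and the pattern, function names, and toy database are illustrative.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden},
# optionally followed by a repeat count (n) or (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a hyphen-separated PROSITE-style pattern into a regex."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def pattern_hits(pattern: str, database: dict) -> dict:
    """Map sequence name -> (start, end) of the first pattern occurrence."""
    compiled = re.compile(prosite_to_regex(pattern))
    hits = {}
    for name, sequence in database.items():
        found = compiled.search(sequence)
        if found:
            hits[name] = found.span()
    return hits


if __name__ == "__main__":
    toy_db = {"query_fragment": "AGAIVVHCSAGVGRTG", "other": "MKTAYIAKQR"}
    print(pattern_hits("C-x(5)-R", toy_db))
```

The fragment `AGAIVVHCSAGVGRTG` is taken from the example protein sequence earlier in the lecture. Restricting the search to sequences that carry the pattern is what lets the second step focus alignment effort on the regions flanking it.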
Copyright 2003 limsoon wong
PHI-BLAST:
IMPACT
Copyright 2003 limsoon wong
ISS:
Intermediate Sequence Search
• Two homologous seqs, which have
diverged beyond the point where their
homology can be recognized by a simple
direct comparison, can be related
through a third sequence that is suitably
intermediate between the two
• High score betw A & C, and betw B & C,
imply A & B are related even though their
own match score is low
Copyright 2003 limsoon wong
ISS:
Search Procedure
Input seq A
→ BLAST against db (p-value @ 0.081)
→ Matched seqs M1, M2, ...
→ Keep regions in M1, M2, … that match A. Discard rest of M1, M2, ...
→ Matched regions R1, R2, ...
→ BLAST against db (p-value @ 0.0006)
→ Results H1, H2, ...
ISS:
IMPACT
No obvious match between
Amicyanin and Ascorbate Oxidase
Copyright 2003 limsoon wong
ISS:
IMPACT
Convincing homology
via Plastocyanin
Previously only
this part was
matched
Copyright 2003 limsoon wong
PSI-BLAST:
Position-Specific Iterated BLAST
• given a query seq, initial set of homologs is collected from db using GAP-BLAST
• weighted multiple alignment is made from query seq and homologs scoring better than threshold
• position-specific score matrix is constructed from this alignment
• matrix is used to search db for new homologs
• new homologs with good score are used to construct new position-specific score matrix
• iterate the search until no new homologs found, or until specified limit is reached
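A minimal sketch of the matrix-construction step, assuming an unweighted alignment, a uniform background distribution, and simple additive pseudocounts. PSI-BLAST itself weights the sequences and uses a more elaborate pseudocount scheme, so this only illustrates the idea of a position-specific score matrix. All names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list, pseudocount: float = 1.0) -> list:
    """One dict per column, mapping residue -> log2 odds against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: all residues equally likely
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)  # skip indels
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_segment(pssm: list, segment: str) -> float:
    """Ungapped score of a segment placed against the matrix columns."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


if __name__ == "__main__":
    pssm = build_pssm(["HCSAG", "HCSDG", "HCTAG"])  # hypothetical aligned homologs
    print(round(score_segment(pssm, "HCSAG"), 2))
    print(round(score_segment(pssm, "WWWWW"), 2))
```

Unlike a single substitution matrix, the score for a residue now depends on the column, so conserved positions of the family dominate the search in the next iteration.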
Copyright 2003 limsoon wong
SAM-T98 HMM Method
• similar to PSI-BLAST
• but use HMM instead of position-specific
score matrix
Copyright 2003 limsoon wong
Comparisons
Iterated seq.
comparisons vs
pairwise seq.
comparison
Copyright 2003 limsoon wong
Suggested Readings
Copyright 2003 limsoon wong
Function Assignment
• S.E.Brenner. “Errors in genome annotation”, TIG,
15:132--133, 1999
• T.F.Smith & X.Zhang. “The challenges of genome
sequence annotation or `The devil is in the details’”,
Nature Biotech, 15:1222--1223, 1997
• D. Devos & A.Valencia. “Intrinsic errors in genome
annotation”, TIG, 17:429--431, 2001.
• K.L.Lim et al. “Interconversion of kinetic identities of the
tandem catalytic domains of receptor-like protein
tyrosine phosphatase PTP-alpha by two point
mutations is synergistic and substrate dependent”, JBC,
273:28986--28993, 1998.
Alignment Applications
• J. Park et al. “Sequence comparisons using multiple
sequences detect three times as many remote homologs as
pairwise methods”, JMB, 284(4):1201-1210, 1998
• J. Park et al. “Intermediate sequences increase the
detection of homology between sequences”, JMB, 273:349-354, 1997
• Z. Zhang et al. “Protein sequence similarity searches using
patterns as seeds”, NAR, 26(17):3986--3990, 1998
• M.S.Gelfand et al. “Gene recognition via spliced sequence
alignment”, PNAS, 93:9061--9066, 1996
• S.F.Altschul et al. “Gapped BLAST and PSI-BLAST: A new
generation of protein database search programs”, NAR,
25(17):3389--3402, 1997.
Copyright 2003 limsoon wong
Alignment Statistics
• P.Erdos & A. Renyi. “On a new law of large numbers”,
J. Anal. Math., 22:103--111, 1970
• R. Arratia & M. S. Waterman. “Critical phenomena in
sequence matching”, Ann. Prob., 13:1236--1249, 1985
• R. Arratia, P. Morris, & M. S. Waterman. “Stochastic
scrabble: Large deviations for sequences with scores”,
J. Appl. Prob., 25:106--119, 1988
• R. Arratia, L. Gordon. “Tutorial on large deviations for
the binomial distribution”, Bull. Math. Biol., 51:125-131, 1989
Copyright 2003 limsoon wong
lecture is (27/8/2003) wednesday at 6.30pm at LT4.