BCB 444/544 Lecture 9 #9_Sept10 Finish:

BCB 444/544
Lecture 9
Finish: Scoring Matrices & Alignment Statistics
BLAST vs FASTA (not yet!)
Smith-Waterman Algorithm
#9_Sept10
BCB 444/544 F07 ISU Dobbs #9 - Scoring Statistics
9/10/07
1
Required Reading
(before lecture)
Mon Sept 10 - for Lecture 9
BLAST variations; BLAST vs FASTA, SW
• Chp 4 - pp 51-62
Wed Sept 12 - for Lecture 10 & Lab 4
Multiple Sequence Alignment (MSA)
• Chp 5 - pp 63-74
Fri Sept 14 - for Lecture 11
Position Specific Scoring Matrices & Profiles
• Chp 6 - pp 75-78 (but not HMMs)
• Good Additional Resource re: Sequence Alignment?
• Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment
Assignments & Announcements - #1
Revised Grading Policy has been posted online
(see Handout) - Please review!
Mon Sept 10 - Lab 3 Exercise due 5 PM:
to: terrible@iastate.edu
Thu Sept 13 - Graded Lab 3
will be returned at beginning of Lab 4
Fri Sept 14 - HW#2 due by 5 PM (106 MBB)
Study Guide for Exam 1 will be posted by 5 PM
Assignments & Announcements - #2
Mon Sept 17 - Answers to HW#2
will be posted online by 5 PM
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12
• Labs 1-4
• HW2
• All assigned reading:
Chps 2-6 (but not HMMs)
Eddy: What is Dynamic Programming
Chp 3- Sequence Alignment
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
• √Evolutionary Basis
• √Sequence Homology versus Sequence Similarity
• √Sequence Similarity versus Sequence Identity
• √Methods - (Dot Plots, DP; Global vs Local Alignment)
• √Scoring Matrices (PAM vs BLOSUM)
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
First, let's re-visit DP
for Local Alignment:
• Email explaining "confusion" in Lecture 8 on Friday was
sent on Sunday (so you wouldn't try to do HW2 without a
better explanation!)
• Answers to DP Examples given in Lectures are included
in Lecture PPTs for Lectures 8 (Friday) & 9 (Today):
• Global Alignment
• Local Alignment
What are the 2 Global Alignments
with Optimal Score = 33?
Top: C T C G C A G C
Left: C A T T C A C
1:
C - T C G C A G C
C A T - T C A - C
2:
C - T C G C A G C
C A T T - C A - C
Check the scores: +10 for match, -2 for mismatch, -5 for space
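The score check can also be done by machine. The following is a minimal Python sketch, not the lecture's own code: the function names `global_score` and `column_score` are illustrative. It fills the global-alignment DP matrix with the scoring scheme stated above and, separately, scores one gapped alignment column by column (the gapped strings used here are one alignment with 5 matches, 1 mismatch and 3 spaces).

```python
def global_score(x, y, match=10, mismatch=-2, space=-5):
    """Optimal global alignment score of x and y with a linear space penalty."""
    rows, cols = len(y) + 1, len(x) + 1
    S = [[0] * cols for _ in range(rows)]
    # First row/column: a prefix aligned against nothing costs one space per character.
    for j in range(1, cols):
        S[0][j] = j * space
    for i in range(1, rows):
        S[i][0] = i * space
    for i in range(1, rows):
        for j in range(1, cols):
            diag = S[i - 1][j - 1] + (match if y[i - 1] == x[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + space, S[i][j - 1] + space)
    return S[rows - 1][cols - 1]


def column_score(top, left, match=10, mismatch=-2, space=-5):
    """Score an already-gapped alignment, one column at a time."""
    total = 0
    for a, b in zip(top, left):
        if a == "-" or b == "-":
            total += space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(global_score("CTCGCAGC", "CATTCAC"))
print(column_score("C-TCGCAGC", "CAT-TCA-C"))
```

Both calls should report the same value if the gapped alignment is optimal.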
Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
• Non-coding regions (if "non-functional") are more likely to
contain mutations than coding regions
• Local alignment between two protein-encoding sequences is
likely to be between two exons
• To locate protein domains or motifs:
• Proteins with similar structures and/or similar functions but
from different species (for example), often exhibit local
sequence similarities
• Local sequence similarities may indicate "functional modules"
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic
genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by
RNA processing & are not translated into protein
Local Alignment: Example
G G T C T G A G
A A A C G A
Match: +2
Mismatch or space: -1
Best local alignment:
G G T C T G A G
A A A C – G A -
Score = 5
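To check this score, here is a small Python sketch of the local-alignment score computation (an illustration only; the name `local_score` is made up for this example). It uses the scoring scheme of this slide and keeps just the previous row of the DP matrix, since only the best cell value is needed here.

```python
def local_score(x, y, match=2, mismatch=-1, space=-1):
    """Best local alignment score of x and y (linear space penalty)."""
    prev = [0] * (len(x) + 1)   # row 0 of the DP matrix is all zeros
    best = 0
    for b in y:
        curr = [0]              # column 0 is all zeros
        for j, a in enumerate(x, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            # Negative values are replaced by 0, as required for local alignment.
            curr.append(max(0, diag, prev[j] + space, curr[j - 1] + space))
        best = max(best, max(curr))
        prev = curr
    return best


print(local_score("GGTCTGAG", "AAACGA"))
```

The aligned segment C T G A over C - G A has three matches and one space, i.e. 2 + 2 + 2 - 1.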
Local Alignment: Algorithm
This slide has
been changed!
1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix:
In local alignment, no negative scores
Assign "0" to cells with negative scores
3) Optimal score? in highest scoring cell(s)
4) Optimal alignment(s)? Traceback from each cell
containing the optimal score, until a cell with "0" is
reached (not just from lower right corner)
Local Alignment DP:
Initialization & Recursion
S 0,0  0
New Slide
S(i,0)  0 S(0, j)  0

S i 1, j 1   x , y
  i j
 
S i, j   max S i 1, j   

S i, j 1  

0
Filling in DP Matrix for Local Alignment
No negative scores - fill in "0"
     λ    C    T    C    G    C    A    G    C
λ    0    0    0    0    0    0    0    0    0
C    0    1    0    1    0    1    0    0    1
A    0    0    0    0    0    0    2    0    0
T    0    0    1    0    0    0    0    1    0
T    0    0    1    0    0    0    0    0    0
C    0    1    0    2    0    1    0    0    1
A    0    0    0    0    1    0    2    0    0
C    0    1    0    1    0    2    0    1    1
+1 for match, -1 for mismatch, -5 for space
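The matrix can be reproduced with a few lines of Python. This is a sketch under the scoring scheme stated above; `fill_local_matrix` is an illustrative name, with the top sequence indexing columns and the left sequence indexing rows.

```python
def fill_local_matrix(top, left, match=1, mismatch=-1, space=-5):
    """Fill the local-alignment DP matrix; row 0 and column 0 stay at 0."""
    S = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for i in range(1, len(left) + 1):
        for j in range(1, len(top) + 1):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(
                0,                        # no negative scores in local alignment
                S[i - 1][j - 1] + sub,    # align left[i-1] with top[j-1]
                S[i - 1][j] + space,      # space in the top sequence
                S[i][j - 1] + space,      # space in the left sequence
            )
    return S


S = fill_local_matrix("CTCGCAGC", "CATTCAC")
for label, row in zip("λ" + "CATTCAC", S):
    print(label, *row)
```

With a space penalty of -5 and cell values this small, the two space moves never win, so every nonzero cell comes from its diagonal neighbor.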
Traceback - for Local Alignment
     λ    C    T    C    G    C    A    G    C
λ    0    0    0    0    0    0    0    0    0
C    0    1    0    1    0    1    0    0    1
A    0    0    0    0    0    0    2[1] 0    0
T    0    0    1    0    0    0    0    1    0
T    0    0    1    0    0    0    0    0    0
C    0    1    0    2[4] 0    1    0    0    1
A    0    0    0    0    1    0    2[2] 0    0
C    0    1    0    1    0    2[3] 0    1    1
[n] = cell with the optimal score (2) where the traceback for local alignment n starts
+1 for match, -1 for mismatch, -5 for space
What are the 4 Local Alignments with
Optimal Score = 2?
Top: C T C G C A G C
Left: C A T T C A C
What are the 4 Local Alignments with
Optimal Score = 2?
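Top: C T C G C A G C
Left: C A T T C A C
1:
C A (Top, positions 5-6)
C A (Left, positions 1-2)
2:
C A (Top, positions 5-6)
C A (Left, positions 5-6)
3:
T C G C (Top, positions 2-5)
T C A C (Left, positions 4-7)
4:
T C (Top, positions 2-3)
T C (Left, positions 4-5)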
Check the scores: +1 for match, -1 for mismatch, -5 for space
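The traceback step can be sketched in Python as follows. This is an illustration, not the course's reference solution: `local_alignments` is a made-up name, and for simplicity it follows only one traceback path per optimal cell, trying the diagonal move first.

```python
def local_alignments(top, left, match=1, mismatch=-1, space=-5):
    """Return (best score, [(top segment, left segment), ...]), one per optimal cell."""
    n, m = len(top), len(left)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j - 1] + sub,
                          S[i - 1][j] + space, S[i][j - 1] + space)
    best = max(max(row) for row in S)
    results = []
    if best == 0:
        return best, results
    for i0 in range(1, m + 1):
        for j0 in range(1, n + 1):
            if S[i0][j0] != best:
                continue
            i, j = i0, j0
            seg_top, seg_left = [], []
            # Walk back until a cell containing 0 is reached.
            while i > 0 and j > 0 and S[i][j] > 0:
                sub = match if left[i - 1] == top[j - 1] else mismatch
                if S[i][j] == S[i - 1][j - 1] + sub:
                    seg_top.append(top[j - 1])
                    seg_left.append(left[i - 1])
                    i, j = i - 1, j - 1
                elif S[i][j] == S[i - 1][j] + space:
                    seg_top.append("-")
                    seg_left.append(left[i - 1])
                    i -= 1
                else:
                    seg_top.append(top[j - 1])
                    seg_left.append("-")
                    j -= 1
            results.append(("".join(reversed(seg_top)), "".join(reversed(seg_left))))
    return best, results


best, found = local_alignments("CTCGCAGC", "CATTCAC")
print(best)
for seg_top, seg_left in found:
    print(seg_top, seg_left)
```

Starting the traceback from every cell that holds the optimal score, rather than from the lower right corner, is what yields several distinct local alignments.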
Some Results re: Alignment Algorithms
(for ComS, CprE & Math types)
• Most pairwise sequence alignment problems can be
solved in O(mn) time
• Space requirement can be reduced to O(m+n), while
keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn)
time, where d measures the distance between the
sequences [Landau86]
for Biologists: Big O notation
• used when analyzing algorithms for efficiency
• refers to time or number of steps it takes to
solve a problem
• expressed as a function of size of the problem
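As a rough illustration of these bounds, the Python sketch below (illustrative name `global_score_two_rows`, scoring values borrowed from the global alignment example earlier in this lecture) counts the cell updates, which grow as m × n, while storing only two rows of the matrix. Note that this simple trick recovers only the score; the linear-space method cited as [Myers88] also recovers the alignment itself and is more involved.

```python
def global_score_two_rows(x, y, match=10, mismatch=-2, space=-5):
    """Global alignment score using O(len(x)) memory; also returns the step count."""
    prev = [j * space for j in range(len(x) + 1)]   # row 0
    steps = 0
    for i, b in enumerate(y, start=1):
        curr = [i * space]                          # column 0 of row i
        for j, a in enumerate(x, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(diag, prev[j] + space, curr[j - 1] + space))
            steps += 1                              # one cell filled per step
        prev = curr                                 # older rows are discarded
    return prev[-1], steps


score, steps = global_score_two_rows("CTCGCAGC", "CATTCAC")
print(score, steps)
```

For sequences of length 8 and 7, the loop body runs 8 × 7 times, which is what O(mn) time means in practice.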
"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
• PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in alignments of closely related proteins
• BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins
PAM Matrix:
Point Accepted Mutation
I added 2 bullets
to this slide
Relies on "evolutionary model" based on observed
differences in closely related proteins [Dayhoff78]
• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed:
rate of expected mutation if n% of amino acids had changed
• e.g., PAM1 matrix estimates what rate of substitution would be
expected if 1% of the amino acids had changed
• PAM1 matrix is used as basis for calculating other matrices:
assumes that repeated mutations would follow same pattern as
those in PAM1 matrix, and multiple substitutions can occur at the
same site
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)
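The extrapolation from PAM1 to higher PAM numbers can be pictured as repeated matrix multiplication. The Python sketch below uses a hypothetical 3-letter alphabet with made-up substitution probabilities (real PAM matrices cover 20 amino acids, and the published score tables are log-odds values derived from such probability matrices). The helper names `mat_mul` and `mat_power` are illustrative.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n):
    """Return M multiplied by itself n times (identity for n = 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical "PAM1-like" probabilities: entry [i][j] is the chance that
# letter i is replaced by letter j in one step; each row sums to 1.
toy_pam1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After many steps the diagonal entries shrink, reflecting that repeated substitutions at the same site make divergent sequences look less alike.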
BLOSUM:
BLOck SUbstitution Matrix
I added 2 bullets
to this slide
Based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins (in BLOCKS database) [Henikoff & Henikoff92]
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity:
avg % aa identity in MSA from which matrix was generated
• e.g., BLOSUM62 is derived from sequence alignments of proteins
with no more than 62% identity
• Blocks database contains ungapped aligned segments
corresponding to the most highly conserved regions of proteins
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences
Scoring Matrices:
What are the scores?
See Xiong Textbook:
Fig 3.5 = PAM250
Fig 3.6 = BLOSUM62
Usually only 1/2 of matrix is displayed
(it is symmetric)
s(a,b) corresponds to score of
aligning character a with
character b
These are log-odds scores:
each entry ~ log( freq(observed) / freq(expected) )
+ → more likely than random
0 → at random base rate
- → less likely than random
Log-odds scoring
• What are the odds that this alignment is meaningful?
$$\begin{array}{l} x_1 x_2 x_3 \ldots x_N \\ y_1 y_2 y_3 \ldots y_N \end{array}$$
• If sequences are not related: we're observing a chance event, & the probability is:
$$\prod_i p_{x_i} \prod_i p_{y_i}$$
where $p_{x_i}$ is the probability of $x_i$ and $p_{y_i}$ is the probability of $y_i$
• If sequences are related by evolution: they are derived from a common ancestor, & the probability is:
$$\prod_i p_{x_i y_i}$$
where $p_{x_i y_i}$ is the joint probability that $x_i$ and $y_i$ evolved from the same ancestor
Log-odds scoring matrix
• Odds ratio = Relative likelihood of the 2 possibilities:
$$\frac{\prod_i p_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{p_{x_i y_i}}{p_{x_i} \, p_{y_i}}$$
• Alignment score = Log-odds ratio:
$$S = \sum_i s(x_i, y_i) \quad \text{where} \quad s(x_i, y_i) = \log \left( \frac{p_{x_i y_i}}{p_{x_i} \, p_{y_i}} \right)$$
• Thus, s(xi, yi) gives the substitution matrix score for the pair xi, yi.
• Together all the scores s(xi, yi) define the log-odds scoring matrix
How do we estimate s(x, y)?
• The score for matching x and y is:
$$s(x, y) = \log \left( \frac{p_{xy}}{p_x \, p_y} \right)$$
• $p_{xy}$ is the probability that x and y substitute for each other (are aligned) in related sequences
• $p_x$ is the probability of amino acid x
(on average ~5% with 20 amino acids; similarly for $p_y$)
• Trusted (manual) alignments of related sequences provide information about biologically permissible mutations
• Frequency of amino acid substitutions in trusted alignments is used to generate substitution matrices
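A single log-odds entry can be computed directly from the three frequencies. The Python sketch below is illustrative only: the function name `log_odds`, the choice of base-2 logarithms, and the example frequencies are all assumptions, not values from a real substitution matrix.

```python
import math


def log_odds(p_xy, p_x, p_y, base=2):
    """s(x, y) = log( p_xy / (p_x * p_y) ), here in bits when base=2."""
    return math.log(p_xy / (p_x * p_y), base)


# Made-up frequencies: both residues occur 5% of the time, so the pair is
# expected by chance with frequency 0.05 * 0.05 = 0.0025.
print(log_odds(0.0100, 0.05, 0.05))  # pair seen more often than chance: positive
print(log_odds(0.0025, 0.05, 0.05))  # pair seen about as often as chance: near 0
print(log_odds(0.0010, 0.05, 0.05))  # pair seen less often than chance: negative
```

Published matrices typically scale and round such values to integers; that step is left out here.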
A Few Words about Parameter Selection
in Sequence Alignment
Optimal alignment between a pair of sequences depends critically
on the selection of substitution matrix &
gap penalty function
S i 1, j 1  xi , y j 

S i, j   max S i 1, j   
S i, j 1  

 
In using BLAST or similar software, it is important to understand and,
sometimes, to adjust these parameters (default is NOT always best!)
How do we pick parameters that give the most biologically
meaningful alignments and alignment scores?
Which is Better Substitution Matrix?
PAM or BLOSUM
• PAM matrices
• derived from evolutionary model
• often used in reconstructing phylogenetic trees - but, not
very good for highly divergent sequences
• BLOSUM matrices
• based on direct observations
• more "realistic" - and outperform PAM matrices in terms of
accuracy in local alignment
Empirical Tests May be Needed:
Several other types of matrices available:
• Gonnet & Jones-Taylor-Thornton:
• very robust in tree construction
• "Best" matrix depends on task:
• different matrices for different applications
ADVICE: if unsure, try several different matrices
& choose the one that gives best alignment result
How Should Gaps be Scored?
So far, we've used a simple linear gap penalty function:
a gap of length k incurs a penalty proportional to its length (k times the per-space penalty)
However, in biological sequences, gaps often occur in clusters:
AGKLAVRSTMIESTRVILTWRKW
AGKLAVRS------RVILTWRKW
More realistic? "Affine" gap penalty:
$$w(k) = \gamma + (k - 1) \times \delta$$
$\gamma$ = gap opening penalty, $\delta$ = gap extension penalty
Penalty for one long gap is smaller than penalty for many smaller gaps that add up to same size
Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties used to reflect
cost differences between opening a gap and extending an existing
gap
Total Gap Penalty is function of gap length:
$$W = \gamma + \delta \times (k - 1)$$
where
$\gamma$ = gap opening penalty
$\delta$ = gap extension penalty
k = length of gap
Can also be solved in
O(nm) time using DP
Sometimes, a Constant Gap Penalty is used, but it is usually less
realistic than the Affine Gap Penalty
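The effect of an affine penalty is easy to see numerically. In the Python sketch below, `affine_gap_penalty` is an illustrative name and the opening and extension values (10 and 2) are example settings, not recommended defaults.

```python
def affine_gap_penalty(k, gap_open, gap_extend):
    """Penalty for one gap of length k: opening + extension * (k - 1)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (k - 1)


# One gap of length 6 versus six separate gaps of length 1.
one_long_gap = affine_gap_penalty(6, gap_open=10, gap_extend=2)
six_short_gaps = 6 * affine_gap_penalty(1, gap_open=10, gap_extend=2)
print(one_long_gap, six_short_gaps)
```

Because opening costs more than extending, the single long gap is penalized far less than the scattered gaps covering the same number of positions.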
Calculating an Alignment Score using
a Substitution Matrix &
an Affine Gap Penalty
• Alignment score is sum of all match/mismatch scores (from substitution matrix) with an affine penalty subtracted for each gap
Alignment:
a b c - - d
a c c e f d
Match scores (values from substitution matrix): 9 + 2 + 7 + 6 => 24
Gap opening + extension: (10 + 2) = 12
Alignment score: 24 - 12 = 12
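The same arithmetic can be written as a short Python sketch. The function name `alignment_score` is illustrative, and the assignment of the values 9, 2, 7 and 6 to the four ungapped columns, in left-to-right order, is an assumption based on the slide layout.

```python
def alignment_score(top, bottom, sub_scores, gap_open, gap_extend):
    """Sum substitution scores and subtract an affine penalty for each gap."""
    score = 0
    gap_row = None   # which row the current run of '-' is in, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # First position of a gap pays the opening cost, later ones the extension cost.
            score -= gap_extend if row == gap_row else gap_open
            gap_row = row
        else:
            score += sub_scores[(a, b)]
            gap_row = None
    return score


# Assumed column scores for the example alignment.
sub_scores = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}
print(alignment_score("abc--d", "accefd", sub_scores, gap_open=10, gap_extend=2))
```

The four aligned columns contribute 24, and the single gap of length 2 costs 10 + 2.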
Sequence Alignment Statistics
• Distribution of similarity scores in sequence alignment
is not a simple "normal" distribution
• "Gumbel extreme value distribution" - a highly skewed
distribution with a long tail toward high scores
How to Assess Statistical Significance
of an Alignment?
• Compare score of an alignment with distribution of scores
of alignments for many 'randomized' (shuffled) versions of
the original sequence
• If score is in extreme margin, then unlikely due to random
chance
• P-value = probability of obtaining a score at least as high as
the original alignment's by random chance (lower P means
alignment more significant)
P = 10^-5 to 10^-50: sequences have clear homology
P > 10^-1: alignment is no better than random
Check out: PRSS (Probability of Random Shuffles)
http://www.ch.embnet.org/software/PRSS_form.html
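The shuffling idea can be sketched in Python as below. This is not a reimplementation of PRSS: it simply counts how often a shuffled sequence scores at least as well as the original, whereas a real tool would typically fit an extreme value distribution to the shuffled scores. The function names, the scoring values, the number of shuffles, the fixed seed and the +1 correction in the estimate are all choices made for this illustration.

```python
import random


def local_score(x, y, match=2, mismatch=-1, space=-1):
    """Best local alignment score of x and y (linear space penalty)."""
    prev = [0] * (len(x) + 1)
    best = 0
    for b in y:
        curr = [0]
        for j, a in enumerate(x, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(0, diag, prev[j] + space, curr[j - 1] + space))
        best = max(best, max(curr))
        prev = curr
    return best


def shuffle_p_value(x, y, n_shuffles=200, seed=0):
    """Estimate how often a shuffled y aligns to x at least as well as y does."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    observed = local_score(x, y)
    letters = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)             # same composition, random order
        if local_score(x, "".join(letters)) >= observed:
            hits += 1
    # The +1 terms keep the estimate from being exactly 0.
    return observed, (hits + 1) / (n_shuffles + 1)


observed, p = shuffle_p_value("GGTCTGAG", "AAACGA")
print(observed, p)
```

For sequences this short the estimate is not expected to be small; meaningful P-values need longer sequences and many more shuffles.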