#9 Scoring Statistics 9/10/07 BCB 444/544 Lecture 9

BCB 444/544
Lecture 9
Finish: Scoring Matrices & Alignment Statistics
BLAST vs FASTA (not yet!)
Smith-Waterman Algorithm

Required Reading
(before lecture)
Mon Sept 10 - for Lecture 9
BLAST variations; BLAST vs FASTA, SW
• Chp 4 - pp 51-62
Wed Sept 12 - for Lecture 10 & Lab 4
Multiple Sequence Alignment (MSA)
• Chp 5 - pp 63-74
Fri Sept 14 - for Lecture 11
Position Specific Scoring Matrices & Profiles
• Chp 6 - pp 75-78 (but not HMMs)
• Good Additional Resource re: Sequence Alignment?
• Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment
BCB 444/544 F07 ISU Dobbs #9 - Scoring Statistics
9/10/07
Assignments & Announcements - #1
Mon Sept 10 - Lab 3 Exercise due 5 PM:
to: terrible@iastate.edu
Thu Sept 13 - Graded Lab 3
will be returned at beginning of Lab 4
Fri Sept 14 - HW#2 due by 5 PM (106 MBB)
Study Guide for Exam 1 will be posted by 5 PM

Assignments & Announcements - #2
Mon Sept 17 - Answers to HW#2
will be posted by 5 PM
Revised Grading Policy has been posted online
(see Handout) - Please review!
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12
• Labs 1-4
• HW2
• All assigned reading:
  Chps 2-6 (but not HMMs)
  Eddy: What is Dynamic Programming

Chp 3 - Sequence Alignment
SECTION II  SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
√ Evolutionary Basis
√ Sequence Homology versus Sequence Similarity
√ Sequence Similarity versus Sequence Identity
√ Methods - (Dot Plots, DP; Global vs Local Alignment)
√ Scoring Matrices (PAM vs BLOSUM)
Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

First, let's re-visit DP
for Local Alignment:
• Email explaining "confusion" in Lecture 8 on Friday was
sent on Sunday (so you wouldn't try to do HW2 without a
better explanation!)
• Answers to DP Examples given in Lectures are included
in Lecture PPTs for Lectures 8 (Friday) & 9 (Today):
  • Global Alignment
  • Local Alignment
What are the 2 Global Alignments
with Optimal Score = 33?
Top:  C T C G C A G C
Left: C A T T C A C
1:
C - T C G C A G C
C A T - T C A - C
2:
C - T C G C A G C
C A T T - C A - C
Check the scores: +10 for match, -2 for mismatch, -5 for space

Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to
    contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is
    likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but
    from different species (for example), often exhibit local
    sequence similarities
  • Local sequence similarities may indicate "functional modules"
Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic
genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by
RNA processing & are not translated into protein
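The "check the scores" exercise for the two global alignments (Top: CTCGCAGC, Left: CATTCAC) can be done mechanically. The sketch below is only an illustration: the function name `score_alignment` is made up here, and the gapped strings are my reading of the two alignments on the slide. It adds +10, -2 or -5 per column.

```python
def score_alignment(top, left, match=10, mismatch=-2, space=-5):
    """Score two equal-length gapped strings column by column."""
    if len(top) != len(left):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, left):
        if a == "-" or b == "-":
            total += space       # a space in either sequence
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

# Alignment 1 and alignment 2 (gap placement as read from the slide)
print(score_alignment("C-TCGCAGC", "CAT-TCA-C"))
print(score_alignment("C-TCGCAGC", "CATT-CA-C"))
```

Each alignment has 5 matches, 1 mismatch and 3 spaces, so 5(10) + 1(-2) + 3(-5) = 33.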
Local Alignment: Example
G G T C T G A G
A A A C G A
Match: +2
Mismatch or space: -1
Best local alignment:
G G T C T G A G
A A A C – G A -
Score = 5
(aligned region: C T G A over C – G A, i.e. 2 - 1 + 2 + 2 = 5)

Local Alignment: Algorithm
(This slide has been changed!)
1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix:
   In local alignment, no negative scores
   Assign "0" to cells with negative scores
3) Optimal score? in highest scoring cell(s)
4) Optimal alignment(s)? Traceback from each cell
   containing the optimal score, until a cell with "0" is
   reached (not just from lower right corner)

Local Alignment DP:
Initialization & Recursion
$$S(0,0) = 0, \qquad S(i,0) = 0, \qquad S(0,j) = 0$$
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \\ 0 \end{cases}$$
where σ(x_i, y_j) is the match/mismatch score and γ is the penalty for a space

Filling in DP Matrix for Local Alignment
(New Slide)
No negative scores - fill in "0"
     λ  C  T  C  G  C  A  G  C
λ    0  0  0  0  0  0  0  0  0
C    0  1  0  1  0  1  0  0  1
A    0  0  0  0  0  0  2  0  0
T    0  0  1  0  0  0  0  1  0
T    0  0  1  0  0  0  0  0  0
C    0  1  0  2  0  1  0  0  1
A    0  0  0  0  1  0  2  0  0
C    0  1  0  1  0  2  0  1  1
+1 for match, -1 for mismatch, -5 for space
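The four steps of the local alignment algorithm (initialize with 0, fill with the max-of-four recursion, find the highest cell(s), trace back to a 0) can be sketched in a few lines. This is a minimal illustration, not the course's reference code: the function names are made up, `gamma` is the positive space penalty, and ties in the traceback are broken in favor of the diagonal.

```python
def local_alignment_matrix(top, left, match, mismatch, gamma):
    """Fill the local-alignment DP matrix (rows follow `left`, columns follow `top`)."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]      # step 1: top row and left column are 0
    for i in range(1, rows):
        for j in range(1, cols):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,   # align the two characters
                          S[i - 1][j] - gamma,       # space in top
                          S[i][j - 1] - gamma,       # space in left
                          0)                         # step 2: no negative scores
    return S

def optimal_cells(S):
    """Step 3: return the optimal score and every cell that holds it."""
    best = max(max(row) for row in S)
    return best, [(i, j) for i, row in enumerate(S)
                  for j, v in enumerate(row) if v == best]

def traceback(S, top, left, i, j, match, mismatch, gamma):
    """Step 4: walk back from (i, j) until a 0 cell; return (top part, left part)."""
    a_top, a_left = [], []
    while S[i][j] > 0:
        sigma = match if left[i - 1] == top[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sigma:
            a_top.append(top[j - 1]); a_left.append(left[i - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] - gamma:
            a_top.append("-"); a_left.append(left[i - 1])
            i -= 1
        else:
            a_top.append(top[j - 1]); a_left.append("-")
            j -= 1
    return "".join(reversed(a_top)), "".join(reversed(a_left))

# The matrix from the slides: +1 match, -1 mismatch, -5 space
S = local_alignment_matrix("CTCGCAGC", "CATTCAC", 1, -1, 5)
best, cells = optimal_cells(S)
print(best, cells)
for i, j in cells:
    print(traceback(S, "CTCGCAGC", "CATTCAC", i, j, 1, -1, 5))
```

Run with match +2 and mismatch/space -1 on GGTCTGAG and AAACGA, the same functions give the score 5 example (C T G A over C - G A).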
Traceback - for Local Alignment
     λ    C    T    C    G    C    A    G    C
λ    0    0    0    0    0    0    0    0    0
C    0    1    0    1    0    1    0    0    1
A    0    0    0    0    0    0    2(1) 0    0
T    0    0    1    0    0    0    0    1    0
T    0    0    1    0    0    0    0    0    0
C    0    1    0    2(4) 0    1    0    0    1
A    0    0    0    0    1    0    2(2) 0    0
C    0    1    0    1    0    2(3) 0    1    1
(1)-(4) mark the cells holding the optimal score, where the 4 tracebacks start
+1 for match, -1 for mismatch, -5 for space

What are the 4 Local Alignments with
Optimal Score = 2?
1:  C T C G C A G C
            C A T T C A C        aligned region: C A over C A
2:  C T C G C A G C
    C A T T C A C                aligned region: C A over C A
3:      C T C G C A G C
    C A T T C A C                aligned region: T C G C over T C A C
4:      C T C G C A G C
    C A T T C A C                aligned region: T C over T C
Check the scores: +1 for match, -1 for mismatch, -5 for space

Some Results re: Alignment Algorithms
(for ComS, CprE & Math types)
• Most pairwise sequence alignment problems can be
solved in O(mn) time (m, n = lengths of the two sequences)
• Space requirement can be reduced to O(m+n), while
keeping run-time at O(mn) [Myers88]
• Highly similar sequences can be aligned in O(dn)
time, where d measures the distance between the
sequences [Landau86]
for Biologists:
Big O notation
• used when analyzing algorithms for efficiency
• refers to time or number of steps it takes to
solve a problem
• expressed as a function of size of the problem
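To give a feel for the time and space claims: the sketch below computes only the optimal global alignment score, visiting every (i, j) cell once (O(mn) steps) while storing just two rows. It is not the [Myers88] algorithm, which also recovers the alignment itself in linear space; the function name and defaults are assumptions chosen to match the earlier global alignment example.

```python
def global_score_two_rows(x, y, match=10, mismatch=-2, gamma=5):
    """Optimal global alignment score using two rows of the DP matrix."""
    prev = [-gamma * j for j in range(len(y) + 1)]   # row 0: j leading spaces
    for i in range(1, len(x) + 1):
        curr = [-gamma * i] + [0] * len(y)           # column 0: i leading spaces
        for j in range(1, len(y) + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sigma,       # align x[i-1] with y[j-1]
                          prev[j] - gamma,           # space in y
                          curr[j - 1] - gamma)       # space in x
        prev = curr                                  # older rows are discarded
    return prev[-1]

print(global_score_two_rows("CTCGCAGC", "CATTCAC"))
```

The inner statement runs m × n times, which is what O(mn) expresses; memory grows only with the length of one sequence.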
"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
• PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in alignments of closely related proteins
• BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins

PAM Matrix:
Point Accepted Mutation
(I added 2 bullets to this slide)
Relies on "evolutionary model" based on observed
differences in closely related proteins [Dayhoff78]
• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed:
rate of expected mutation if n% of amino acids had changed
• e.g., PAM1 matrix estimates what rate of substitution would be
expected if 1% of the amino acids had changed
• PAM1 matrix is used as basis for calculating other matrices:
assumes that repeated mutations would follow same pattern as
those in PAM1 matrix, and multiple substitutions can occur at the
same site
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)
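The statement that PAM1 is the basis for the other PAM matrices amounts to repeated matrix multiplication: applying the PAM1 substitution probabilities n times gives the PAM-n probabilities. The sketch below uses a made-up 3-letter alphabet and invented probabilities, not Dayhoff's data, purely to show that extrapolation step. The published PAM250 scoring matrix is additionally converted to log-odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Return m multiplied by itself n times (n = 0 gives the identity)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Hypothetical "PAM1-like" matrix: each letter stays unchanged with probability 0.99
toy_pam1 = [[0.990, 0.005, 0.005],
            [0.005, 0.990, 0.005],
            [0.005, 0.005, 0.990]]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the probability of seeing the original letter has dropped well below 0.99, yet each row still sums to 1, because multiple substitutions at the same site are allowed.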
BLOSUM:
BLOck SUbstitution Matrix
(I added 2 bullets to this slide)
Based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins (in BLOCKS database) [Henikoff & Henikoff92]
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity:
avg % aa identity in MSA from which matrix was generated
• e.g., BLOSUM62 is derived from sequence alignments of proteins
with no more than 62% identity
• Blocks database contains ungapped aligned segments
corresponding to the most highly conserved regions of proteins
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences

Scoring Matrices:
What are the scores?
See Xiong Textbook:
Fig 3.5 = PAM250
Fig 3.6 = BLOSUM62
Usually only 1/2 of matrix is displayed
(it is symmetric)
s(a,b) corresponds to score of
aligning character a with
character b
These are log-odds scores:
each entry ~
log (freq(observed)/freq(expected))
+ → more likely than random
0 → at random base rate
- → less likely than random
Log-odds scoring
• What are the odds that this alignment is meaningful?
x1 x2 x3 … xN
y1 y2 y3 … yN
• If sequences are not related: we're observing a chance
event,
& the probability is:
$$\prod_i p_{x_i} \prod_i p_{y_i}$$
where p_x is the probability of x, p_y is probability of y
• If sequences are related by evolution: they are derived
from a common ancestor,
& the probability is:
$$\prod_i p_{x_i y_i}$$
where p_xy is the joint probability that x and y evolved from the
same ancestor

Log-odds scoring matrix
• Odds ratio = Relative likelihood of the 2 possibilities:
$$\frac{\prod_i p_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{p_{x_i y_i}}{p_{x_i} p_{y_i}}$$
• Alignment score = Log-odds ratio:
$$S = \sum_i s(x_i, y_i) \qquad \text{where} \qquad s(x_i, y_i) = \log\left(\frac{p_{x_i y_i}}{p_{x_i} p_{y_i}}\right)$$
• Thus, s(xi, yi) gives the substitution matrix score for
the pair xi, yi.
• Together all the scores
s(xi, yi) define the log-odds
scoring matrix
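A small numeric check can make the step from odds ratio to alignment score concrete: taking the log turns the product of per-column ratios into a sum of per-column scores. The sketch below uses an invented two-letter alphabet with made-up probabilities (`p_single`, `p_joint`); none of these numbers come from a real substitution matrix.

```python
import math

# Hypothetical background and joint probabilities for a 2-letter alphabet
p_single = {"A": 0.5, "B": 0.5}
p_joint = {("A", "A"): 0.4, ("B", "B"): 0.4,
           ("A", "B"): 0.1, ("B", "A"): 0.1}

def odds_ratio(x, y):
    """P(alignment | related) / P(alignment | unrelated), column by column."""
    ratio = 1.0
    for a, b in zip(x, y):
        ratio *= p_joint[(a, b)] / (p_single[a] * p_single[b])
    return ratio

def s(a, b):
    """Log-odds score for aligning a with b."""
    return math.log(p_joint[(a, b)] / (p_single[a] * p_single[b]))

def alignment_score(x, y):
    """S = sum of s(x_i, y_i) over all columns of an ungapped alignment."""
    return sum(s(a, b) for a, b in zip(x, y))

print(alignment_score("AAB", "AAB"), math.log(odds_ratio("AAB", "AAB")))
print(alignment_score("AB", "BA"))
```

Identical columns score positive here (0.4 / 0.25 > 1) and mismatched columns score negative (0.1 / 0.25 < 1), matching the "+ / 0 / -" reading of log-odds entries.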
How do we estimate s(x, y)?
• The score for matching x and y is:
$$s(x, y) = \log\left(\frac{p_{xy}}{p_x p_y}\right)$$
• p_xy is probability of substituting x and y
• p_x is probability of amino acid x
(on average ~ 5% with 20 amino acids, similarly for p_y)
• Trusted (manual) alignments of related sequences
provide information about biologically permissible
mutations
• Frequency of amino acid substitutions in trusted
alignments is used to generate substitution matrices

A Few Words about Parameter Selection
in Sequence Alignment
Optimal alignment between a pair of sequences depends critically
on the selection of substitution matrix &
gap penalty function
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$
In using BLAST or similar software, it is important to understand and,
sometimes, to adjust these parameters (default is NOT always best!)
How do we pick parameters that give the most biologically
meaningful alignments and alignment scores?
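How substitution frequencies in trusted alignments turn into scores can be sketched with simple counting. The example below is a toy in the spirit of that idea, not the actual PAM or BLOSUM procedure (which involves many alignments, weighting or clustering, scaling and rounding): the function name and the 20 aligned column pairs are invented for illustration. Pairs are treated as unordered, so the chance expectation for two different residues is 2 × p_x × p_y.

```python
import math
from collections import Counter

def estimate_log_odds(aligned_pairs):
    """Estimate s(x, y) = log(observed / expected) from aligned residue pairs."""
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[tuple(sorted((a, b)))] += 1   # unordered pair
        residue_counts[a] += 1
        residue_counts[b] += 1
    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        p_a = residue_counts[a] / n_residues
        p_b = residue_counts[b] / n_residues
        observed = count / n_pairs
        expected = p_a * p_b if a == b else 2 * p_a * p_b
        scores[(a, b)] = math.log(observed / expected)
    return scores

# Hypothetical columns from a "trusted" alignment: 8 L/L, 8 K/K, 4 L/K
columns = [("L", "L")] * 8 + [("K", "K")] * 8 + [("L", "K")] * 4
for pair, score in sorted(estimate_log_odds(columns).items()):
    print(pair, round(score, 3))
```

With these counts, conserved pairs (L/L, K/K) are seen more often than chance predicts and get positive scores, while L/K is seen less often and gets a negative score.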
Which is Better Substitution Matrix?
PAM or BLOSUM
• PAM matrices
  • derived from evolutionary model
  • often used in reconstructing phylogenetic trees - but, not
    very good for highly divergent sequences
• BLOSUM matrices
  • based on direct observations
  • more "realistic" - and outperform PAM matrices in terms of
    accuracy in local alignment

Empirical Tests May be Needed:
Several other types of matrices available:
• Gonnet & Jones-Taylor-Thornton:
  • very robust in tree construction
• "Best" matrix depends on task:
  • different matrices for different applications
ADVICE: if unsure, try several different matrices
& choose the one that gives best alignment result
How Should Gaps be Scored?
So far, we've used
Simple linear gap penalty function:
Gap of length k incurs penalty -k × γ
However, in biological sequences, gaps often occur in clusters:
AGKLAVRSTMIESTRVILTWRKW
AGKLAVRS------RVILTWRKW
More realistic? "Affine" gap penalty:
w(k) = γ + (k - 1) × δ
(γ = gap opening, (k - 1) × δ = gap extension)
penalty for one long gap
is smaller than penalty
for many smaller gaps
that add up to same size
Can also be solved in
O(nm) time using DP

Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties used to reflect
cost differences between opening a gap and extending an existing
gap
Total Gap Penalty is function of gap length:
W = γ + δ × (k - 1)
where
γ = gap opening penalty
δ = gap extension penalty
k = length of gap
Sometimes, a Constant Gap Penalty is used, but it is usually less
realistic than the Affine Gap Penalty

Calculating an Alignment Score using
a Substitution Matrix &
an Affine Gap Penalty
• Alignment score is sum of all match/mismatch
scores (from substitution matrix) with an affine
penalty subtracted for each gap
a b c - - d
a c c e f d
Match score (values from substitution matrix): 9 + 2 + 7 + 6 => 24
Gap opening + extension: (10 + 2) = 12
Alignment Score = 24 - 12 = 12

Sequence Alignment Statistics
• Distribution of similarity scores in sequence alignment
is not a simple "normal" distribution
• "Gumbel extreme value distribution" - a highly skewed
distribution with a long tail
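Returning to the affine gap penalty w(k) = γ + (k - 1) × δ and the worked alignment score from the gap-scoring slides, the calculation can be reproduced as below. This is an illustrative sketch only: the function names are made up, the substitution values 9, 2, 7, 6 are assigned to the four ungapped columns in the order shown on the slide, and a run of consecutive gap columns is treated as a single gap.

```python
def affine_gap_penalty(k, gamma=10, delta=2):
    """Penalty for one gap of length k: opening (gamma) plus k - 1 extensions (delta)."""
    return gamma + (k - 1) * delta

def alignment_score(row1, row2, sub_score, gamma=10, delta=2):
    """Sum substitution scores and subtract an affine penalty per gap run."""
    total = 0
    gap_len = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_len += 1                 # still inside a gap
            continue
        if gap_len:                      # a gap just ended
            total -= affine_gap_penalty(gap_len, gamma, delta)
            gap_len = 0
        total += sub_score[frozenset((a, b))]
    if gap_len:                          # gap at the very end
        total -= affine_gap_penalty(gap_len, gamma, delta)
    return total

# Substitution values taken from the slide's example columns (a/a, b/c, c/c, d/d)
slide_scores = {frozenset({"a"}): 9, frozenset({"b", "c"}): 2,
                frozenset({"c"}): 7, frozenset({"d"}): 6}

print(alignment_score("abc--d", "accefd", slide_scores))   # 24 - 12
print(affine_gap_penalty(6), 6 * affine_gap_penalty(1))    # one long gap vs six short gaps
```

The second line of output illustrates the slide's point that one gap of length 6 (penalty 20) costs less than six separate gaps of length 1 (penalty 60).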
How to Assess Statistical Significance
of an Alignment?
• Compare score of an alignment with distribution of scores
of alignments for many 'randomized' (shuffled) versions of
the original sequence
• If score is in extreme margin, then unlikely due to random
chance
• P-value = probability that a score at least this high arises by
random chance (lower P means alignment more significant)
P = 10^-5 to 10^-50   sequences have clear homology
P > 10^-1             alignment is no better than random
Check out: PRSS (Probability of Random Shuffles)
http://www.ch.embnet.org/software/PRSS_form.html
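The shuffling idea can be sketched as a small permutation test: score the real pair, score many shuffled versions of one sequence, and report the fraction that score at least as high. This is only a toy, not how PRSS is implemented (PRSS fits an extreme value distribution to the shuffled scores); the function names, the +1/-1/-1 scoring, the 200 shuffles and the test sequence are all assumptions for illustration.

```python
import random

def local_score(x, y, match=1, mismatch=-1, gamma=1):
    """Best local alignment score, keeping only two DP rows."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sigma, prev[j] - gamma, curr[j - 1] - gamma)
            best = max(best, curr[j])
        prev = curr
    return best

def shuffle_p_value(x, y, n_shuffles=200, seed=0):
    """Empirical P-value: share of shuffled y's scoring >= the real score."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    observed = local_score(x, y)
    letters = list(y)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)             # same composition, random order
        if local_score(x, "".join(letters)) >= observed:
            at_least += 1
    return (at_least + 1) / (n_shuffles + 1)   # +1 avoids reporting P = 0

seq = "ACGTTGCAAGCTTGACCGTA"
print(shuffle_p_value(seq, seq))
```

With only 200 shuffles the smallest P-value this estimate can report is 1/201, so values as small as 10^-5 or 10^-50 need a fitted distribution rather than direct counting.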