Lecture 13 #13_Sept19 BCB 444/544 Star Alignment & Clustal (for MSA)

BCB 444/544
Lecture 13
Star Alignment & Clustal (for MSA)
Perhaps: Profiles &
Hidden Markov Models (HMMs)
#13_Sept19
BCB 444/544 F07 ISU Dobbs #13- Star Alignment; HMMs
9/19/07
1
Required Reading
(before lecture)
√Mon Sept 17 - Lecture 12
Position Specific Scoring Matrices & PSI-BLAST
• Chp 6 - pp 75-78 (but not HMMs)
Wed Sept 19 - Lecture 13
(not covered on Exam 1)
Profiles & Hidden Markov Models
• Chp 6 - pp 79-84
• Eddy: What is a hidden Markov Model?
2004 Nature Biotechnol 22:1315
http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Fri Sept 21 - EXAM 1
Assignments & Announcements
√Sun Sept 16 - Study Guide for Exam 1 was posted
√Mon Sept 17 - Answers to HW#2 were posted
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading:
Chps 2-6 (but not HMMs)
Eddy: What is Dynamic Programming?
Chp 5- Multiple Sequence Alignment
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 5
Multiple Sequence Alignment
• √Scoring Function
• √Exhaustive Algorithms
• Heuristic Algorithms
• Star Alignment
• Clustal
• √Practical Issues
• First, review MSA scoring briefly, then back to
Star Alignment & ClustalW
Scoring an Alignment - in Lecture 12,
so will be covered on Exam 1
In practice, simple scoring functions are used
Usually, columns are scored independently:
$S(m) = \sum_i S(m_i) + G$
where G is the gap penalty and $m_i$ is the ith column of alignment m
[Figure: example alignment with each column scored independently]
Sum of Pairs (SP) Score
• SP = sum of pairs = sum of scores of all
possible pairs of sequences in an MSA,
based on a particular scoring matrix
• Compute for each column $m_i$ (e.g., the column F, F, I, -):
$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$
where $m_i^l$ is residue l of column $m_i$ and s is a PAM or BLOSUM score
Example: Calculating SP Score
Alignment M, by column:
m1 = (F, F, F)
m2 = (-, Y, -)
m3 = (G, G, D)
Matrix entries used: s(F,F) = 5, s(G,G) = 4, s(G,D) = -3
BLOSUM 60
Gap penalty = -8
s(-,-) = 0
S(m) = S(m1) + S(m2) + S(m3)
= 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D)
= 15 -16 + 0 + 4 -6 = -3
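The calculation above can be sketched in a few lines of Python. This is only an illustration: the helper names (`pair_score`, `sp_score`) and the order of the three rows are assumptions; the column contents, the three matrix entries, the gap penalty, and s(-,-) = 0 come from the example.

```python
from itertools import combinations

# Only the matrix entries needed for this example (values from the slide).
SCORES = {("F", "F"): 5, ("G", "G"): 4, ("D", "G"): -3}
GAP = -8  # gap penalty from the slide


def pair_score(a, b):
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0  # s(-,-) = 0
    if a == "-" or b == "-":
        return GAP
    return SCORES[tuple(sorted((a, b)))]


def sp_score(rows):
    """Sum-of-pairs score: sum pair scores over all pairs in every column."""
    total = 0
    for column in zip(*rows):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


# Row order is an assumption; columns are m1 = FFF, m2 = -Y-, m3 = GGD.
print(sp_score(["F-G", "FYG", "F-D"]))
```

With these inputs the three column scores are 15, -16, and -2, which add up to the -3 shown above.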
Algorithms & Software for MSA? #1
Exhaustive Methods
• √ Multidimensional dynamic programming (DP)
• Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"
web-based version available - see textbook
• Full DP Optimal Global Alignment? Prohibitive in both time
& space requirements for more than 10 sequences!!
Heuristic Methods
• Progressive (Star Alignment, Clustal)
• Iterative
• Block-based
Dynamic Programming for MSA
• As with pairwise alignments, MSAs can be computed by
dynamic programming*
*(if you're not in a rush!)
[Figure: dynamic programming matrix in 2D (two sequences) and in 3D (three sequences)]
Generalized Needleman-Wunsch Algorithm
Given 3 sequences x, y, and z:
Main iteration loop:
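$$
S(i,j,k) = \max \begin{cases}
S(i-1, j-1, k-1) + \delta(x_i, y_j, z_k) \\
S(i-1, j-1, k) + \delta(x_i, y_j, -) \\
S(i-1, j, k-1) + \delta(x_i, -, z_k) \\
S(i-1, j, k) + \delta(x_i, -, -) \\
S(i, j-1, k-1) + \delta(-, y_j, z_k) \\
S(i, j-1, k) + \delta(-, y_j, -) \\
S(i, j, k-1) + \delta(-, -, z_k)
\end{cases}
$$
where $\delta$ is the score of one alignment column (e.g., its sum-of-pairs score)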
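A minimal Python sketch of this recurrence for three sequences. The column score is a sum-of-pairs score with illustrative values (match = 1, mismatch = -1, gap = -2); these values and the function names are assumptions, not part of the lecture. The sketch returns only the optimal score and does not trace back the alignment.

```python
from itertools import product


def sp_column(col, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column (illustrative scoring values)."""
    score = 0
    for a in range(len(col)):
        for b in range(a + 1, len(col)):
            x, y = col[a], col[b]
            if x == "-" and y == "-":
                continue  # gap-gap pairs score 0
            if x == "-" or y == "-":
                score += gap
            else:
                score += match if x == y else mismatch
    return score


def align3_score(x, y, z):
    """Optimal global alignment score of three sequences by 3D DP."""
    n, m, p = len(x), len(y), len(z)
    neg_inf = float("-inf")
    S = [[[neg_inf] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    S[0][0][0] = 0
    # The 7 predecessor moves: every 0/1 combination except (0, 0, 0).
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(p + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    col = (x[pi] if di else "-",
                           y[pj] if dj else "-",
                           z[pk] if dk else "-")
                    cand = S[pi][pj][pk] + sp_column(col)
                    if cand > S[i][j][k]:
                        S[i][j][k] = cand
    return S[n][m][p]


print(align3_score("AT", "AT", "AT"))
```

The three nested loops over i, j, k and the 7 moves per cell are what drive the cost discussed on the next slide.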
What Happens to Computational Complexity?
Given k sequences of length n
• Space for matrix: $O(n^k)$
• Neighbors/cell: $2^k - 1$
• Time to compute SP score: $O(k^2)$
• Overall runtime: $O(k^2 2^k n^k)$
Wow!!!
What's so bad about those exponents?
Example: Running Time of DP for MSA
• Overall runtime: $O(k^2 2^k n^k)$
# Sequences    Running Time
2              1 second
3              2 minutes
4              5 hours
5              3 weeks
6              9 years
Sequences? Globins only ≈150 aa !!
But: There are fast heuristics
Progressive Alignment
Heuristic procedure:
1. Align most similar sequences first
2. Add sequences progressively
Multiple Alignment by
adding sequences
Often: use guide tree to determine
order of alignments
2 Examples:
Star Alignment
ClustalW
Guide Trees
Binary tree
• Leaves correspond to sequences
• Internal nodes represent alignments
• Root corresponds to final MSA
[Figure: guide tree for the sequences ATC, ATG, TCG, and TCC]
Star Alignment - skipped on Monday:
will NOT be covered on Exam 1
Back to 2 Examples of
Progressive Alignment Heuristics for MSA:
1. STAR Alignment
2. Clustal
Star Alignment
• Fast heuristic to compute MSA
• Good approximation of optimal MSA, if scoring
scheme satisfies triangle inequality
Algorithm:
1. Compute pairwise similarities
2. Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$
3. Add sequences in decreasing order of similarity to center $s_c$
4. Produce a multiple alignment M such that, for every i,
the induced pairwise alignment of $s_c$ and $s_i$ is same as
the optimal alignment of $s_c$ and $s_i$
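Steps 1 and 2 can be sketched in Python as follows. The pairwise similarity here is a plain Needleman-Wunsch global alignment score with illustrative values (match = 1, mismatch = -1, gap = -2); the scoring values and the function names are assumptions. The test sequences MPE, MKE, MSKE, SKE are the ones used in the Step 3 example of this lecture.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b (linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def select_center(seqs):
    """Index c that maximizes the sum over i != c of S(s_c, s_i)."""
    def total_similarity(c):
        return sum(nw_score(seqs[c], s) for i, s in enumerate(seqs) if i != c)
    return max(range(len(seqs)), key=total_similarity)


seqs = ["MPE", "MKE", "MSKE", "SKE"]
center = select_center(seqs)
print(center, seqs[center])
```

Under this toy scoring the summed similarities are -1, 3, 1, and 1, so the second sequence (MKE) is chosen as the center.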
Step 2 - Select center $s_c$ that maximizes
$\sum_{i \neq c} S(s_c, s_i)$
Does that function look familiar?
Recall:
Consensus sequence = single sequence
(more accurately; "model") that
represents most common residue of
each column in MSA
FGGHL-GF
F-GHLPGF
FGGHP-FG
FGGHL-GF
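As a small illustration of the consensus definition, the sketch below takes the most common symbol in each column of the four aligned sequences above. Treating the gap character as just another symbol, and breaking ties by first occurrence, are simplifying assumptions.

```python
from collections import Counter


def consensus(rows):
    """Most common symbol in each column of an alignment (gap counts as a symbol)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


msa = ["FGGHL-GF", "F-GHLPGF", "FGGHP-FG", "FGGHL-GF"]
print(consensus(msa))
```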
Steiner consensus sequence or string: Given sequences
$s_1, \ldots, s_k$, find a sequence $s^*$ that maximizes $\sum_i S(s^*, s_i)$
"String" equivalent of arithmetic mean: consensus sequence is string
that minimizes sum of edit distances to members of a family of
strings (thus, maximizing similarity score…)
Step 3 - Add sequences in decreasing order
of similarity to center sc
Sequences:
s1: MPE
s2: MKE (center)
s3: MSKE
s4: SKE

Pairwise alignments with the center s2:
s1  MPE      s3  MSKE     s2  MKE
    | |          | ||          ||
s2  MKE      s2  M-KE     s4  SKE

Progressive MSA:
s2+s3:   MSKE
         M-KE

+s1:     M-PE
         MSKE
         M-KE

+s4:     S-KE
         M-PE
         MSKE
         M-KE
Step 4 - Produce a multiple alignment M
such that for every i:
the induced pairwise alignment of sc and si
is same as optimal alignment of sc and si
Pairwise alignments with the center:
Sc  AA--CCTT        Sc  A-ACC-TT
S1  AATGCC--        S2  AGACCGT-

Resulting multiple alignment:
S1  A-ATGCC---
Sc  A-A--CC-TT
S2  AGA--CCGT-
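One way to build such an alignment is to merge each pairwise alignment into the growing MSA, inserting a gap column into all existing rows whenever the new pairwise alignment has a gap in the center that the MSA does not yet have. The Python sketch below follows that idea; the function name and the convention that row 0 of the MSA is the gapped center are assumptions, and the inputs are assumed to be consistent (both center rows spell the same ungapped sequence).

```python
def merge_into_star(msa, center_aln, seq_aln):
    """Add one sequence to a star MSA.

    msa        -- list of equal-length rows; msa[0] is the gapped center
    center_aln -- center row of the new pairwise alignment
    seq_aln    -- new sequence row of that pairwise alignment
    """
    out = [[] for _ in range(len(msa) + 1)]
    i = j = 0
    width, plen = len(msa[0]), len(center_aln)
    while i < width or j < plen:
        msa_gap = i < width and msa[0][i] == "-"
        pw_gap = j < plen and center_aln[j] == "-"
        if pw_gap and not msa_gap:
            # New gap in the center: open a gap column in all existing rows.
            for r in range(len(msa)):
                out[r].append("-")
            out[-1].append(seq_aln[j])
            j += 1
        elif msa_gap and not pw_gap:
            # Gap already present in the MSA: pad the new sequence.
            for r in range(len(msa)):
                out[r].append(msa[r][i])
            out[-1].append("-")
            i += 1
        else:
            # Same center residue (or a gap in both): share the column.
            for r in range(len(msa)):
                out[r].append(msa[r][i])
            out[-1].append(seq_aln[j])
            i += 1
            j += 1
    return ["".join(row) for row in out]


# Start from the Sc/S1 pairwise alignment, then add S2.
msa = ["AA--CCTT", "AATGCC--"]
for row in merge_into_star(msa, "A-ACC-TT", "AGACCGT-"):
    print(row)
```

The returned rows are in the order center, S1, S2. Deleting the all-gap columns from any pair (center, S_i) gives back the pairwise alignment that was merged in, which is the property Step 4 asks for.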
Complexity of Star Alignment?
Given k sequences of length n, and an upper bound l
for alignment length
We need:
• $O(k^2 n^2)$ to compute the alignments
• $O(k^2)$ to compute the center
• $O(k^2 l)$ to build multiple alignment
Overall: $O(k^2 n^2)$
Duh - Is this really much better than $O(k^2 2^k n^k)$?
YES!
Remember:
k = # of sequences
n = length of sequences
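To get a feel for the difference, the sketch below simply evaluates the two growth terms for n = 150 (roughly globin length). These are operation-count proxies with constant factors ignored, not measured running times, and the function names are made up for illustration.

```python
def dp_msa_ops(k, n):
    """Growth term of full DP for MSA: k^2 * 2^k * n^k."""
    return k**2 * 2**k * n**k


def star_ops(k, n):
    """Growth term of star alignment: k^2 * n^2."""
    return k**2 * n**2


n = 150
for k in range(2, 7):
    ratio = dp_msa_ops(k, n) // star_ops(k, n)
    print(f"k={k}: full DP / star = {ratio:,}")
```

The ratio is $2^k n^{k-2}$, so every additional sequence multiplies the gap between the two methods by about 2n.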
CLUSTAL: Overview
[Figure: pairwise alignments (1+2, 1+3, 1+4, 2+3, 2+4, 3+4) → distance matrix → guide tree → progressive alignment]
1. Compute pairwise alignments (DP)
2. Convert similarities into distances
Distance between a pair = # of mismatched
positions in alignment (divided by total # of
matches)
3. Build guide tree from distances by
Neighbor Joining
4. Align with respect to guide tree
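Step 2 can be illustrated with a short Python sketch. It reads the definition as: mismatched positions divided by the number of aligned positions in which neither sequence has a gap. That reading of "matches", and the function name, are assumptions; real Clustal implementations may apply further corrections.

```python
def alignment_distance(a, b):
    """Fraction of mismatches among ungapped columns of a pairwise alignment."""
    aligned = mismatched = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gapped columns are not counted (assumption)
        aligned += 1
        if x != y:
            mismatched += 1
    if aligned == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatched / aligned


print(alignment_distance("MPE", "MKE"))    # one mismatch in three columns
print(alignment_distance("M-KE", "MSKE"))  # gap column skipped, no mismatches
```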
CLUSTAL: Example
One "small" problem?
Finding the Guide Tree
Goal: Given k sequences and their pairwise
distances, find a tree, such that all distances
correspond to path lengths between leaves
Problem: Such a tree might not exist!
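Whether such a tree exists can be checked with the four-point condition, a standard criterion for tree-like (additive) distances that is not covered on the slides: for every four leaves, of the three ways to pair them up, the two largest sums of distances must be equal. A minimal Python sketch, with an assumed function name and a numerical tolerance chosen for illustration:

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix d."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False  # the two largest sums differ: no exact tree
    return True


# Distances read off the tree ((0,1),(2,3)) with all branch lengths 1.
tree_like = [[0, 2, 3, 3],
             [2, 0, 3, 3],
             [3, 3, 0, 2],
             [3, 3, 2, 0]]
# All distances 1, except d(0,2) = 2: no tree fits these exactly.
not_tree_like = [[0, 1, 2, 1],
                 [1, 0, 1, 1],
                 [2, 1, 0, 1],
                 [1, 1, 1, 0]]
print(is_additive(tree_like), is_additive(not_tree_like))
```

Distances computed from real alignments rarely pass this test exactly, which is why a method such as Neighbor Joining is used to build an approximate guide tree.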
CLUSTAL W Tree
Tree calculated from an alignment of >1100 ring finger domains,
using ClustalW 1.83
Algorithms & Software for MSA? #2
√ Exhaustive Methods
• Multidimensional dynamic programming (DP)
• Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"
web-based version available - see textbook
• Full DP Optimal Global Alignment? Prohibitive in both time
& space requirements for more than 10 sequences!!
Heuristic Methods
• √Progressive (Star Alignment, Clustal)
• Iterative
• Block-based
Algorithms & Software for MSA? #3
will NOT be covered on Exam1
Heuristic Methods - continued
• Progressive alignments (Star Alignment, Clustal)
• Others: T-Coffee, DbClustal -see text: can be better than Clustal
• Match closely-related sequences first using a guide tree
• Partial order alignments (POA)
• Doesn't rely on guide tree; adds sequences in order given
• PRALINE
• Preprocesses input sequences by building profiles for each
• Iterative methods
• Idea: optimal solution can be found by repeatedly modifying existing
suboptimal solutions (eg: PRRN)
• Block-based Alignment
• Multiple re-building attempts to find best alignment
(eg: DIALIGN2 & Match-Box)
• Local alignments
• Profiles, Blocks, Patterns - more on these soon!
Chp 6 - Profiles & Hidden Markov Models
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 6
Profiles & HMMs
• √Position Specific Scoring Matrices (PSSMs)
• √PSI-BLAST
First, review above briefly, then:
• Profiles
• Markov Models & Hidden Markov Models
PSI-BLAST (Covered in Lecture 12, so
will be covered on Exam1)
• Position Specific Iterated BLAST
• Intuition: substitution matrices should be
"sensitive" to protein context
• e.g., larger penalty for Ala→Gly substitution if
in a helix rather than in a loop
• Basic idea:
• Use BLAST with high stringency to generate a set of
closely related sequences
• Align those sequences to create a new substitution
matrix for each position
• Use this matrix (iteratively) to find additional sequences
PSI-BLAST Pseudocode
Convert query to PSSM (or a Profile)
do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM
}
Print current set of homologs
PSSM = Position-Specific Scoring Matrix
Deciding which hits count as new homologs requires a user-defined threshold
Note: Xiong textbook distinguishes between PSSMs
(which have no gaps) & Profiles (can include gaps).
Thus, based on these definitions, PSI-BLAST uses a
Profile to iteratively add new homologs - other authors
refer to pattern used by PSI-BLAST as a PSSM.
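The control flow of the pseudocode can be made runnable with a toy stand-in for BLAST. Everything specific in this sketch is an assumption made for illustration: the "model" is just the set of sequences found so far (not a real PSSM), a database sequence counts as a hit if it differs from any member at no more than `max_mismatches` positions (playing the role of the user-defined threshold), and the function names are made up.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def toy_search(database, members, max_mismatches=1):
    """Stand-in for 'BLAST database with PSSM': hits close to any member."""
    return {s for s in database
            if any(len(s) == len(m) and hamming(s, m) <= max_mismatches
                   for m in members)}


def iterative_search(query, database, max_iterations=5):
    """Iterate the search until no new homologs are found."""
    homologs = {query}
    for _ in range(max_iterations):
        hits = toy_search(database, homologs)
        new = hits - homologs
        if not new:
            break  # stop if no new homologs are found
        homologs |= new  # add new homologs to the model
    return homologs


database = ["AAAT", "AATT", "ATTT", "GGGG"]
print(sorted(iterative_search("AAAA", database)))
```

Note how ATTT is only reachable after two earlier rounds have pulled in AAAT and AATT: iteration finds more distant relatives than a single search, and by the same mechanism one wrong inclusion can pull in further wrong hits.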
What is a PSSM?
Position-Specific Scoring Matrix
A PSSM is:
• a representation of a motif
• an n by m matrix, where n is
size of alphabet & m is length of
sequence
• a matrix of scores in which
entry at (i, j) is score assigned
by PSSM to letter i at the jth
position
Xiong: PSSM = table that contains
probability information re: residues at
each position of an ungapped MSA
Also, sometimes called:
Position Weight Matrix (PWM)
PSSM for an 8 residue sequence (rows = residues, columns = positions 1-8):

     1   2   3   4   5   6   7   8
A   -1  -2  -1   0  -1  -2   0  -2
R    5   0   5  -2   1  -3  -2   0
N    0   6   0   0   0  -3   0   1
D   -2   1  -2  -1   0  -3  -1  -1
C   -3  -3  -3  -3  -3  -2  -3  -3
Q    1   0   1  -2   5  -3  -2   0
E    0   0   0  -2   2  -3  -2   0
G   -2   0  -2   6  -2  -3   6  -2
H    0   1   0  -2   0  -1  -2   8
I   -3  -3  -3  -4  -3   0  -4  -3
L   -2  -3  -2  -4  -2   0  -4  -3
K    2   0   2  -2   1  -3  -2  -1
M   -1  -2  -1  -3   0   0  -3  -2
F   -3  -3  -3  -3  -3   6  -3  -1
P   -2  -2  -2  -2  -1  -4  -2  -2
S   -1   1  -1   0   0  -2   0  -1
T   -1   0  -1  -2  -1  -2  -2  -2
W   -3  -4  -3  -2  -2   1  -2  -2
Y   -2  -2  -2  -3  -1   3  -3   2
V   -3  -3  -3  -3  -2  -1  -3  -3

"K" at position 3 gets a score of 2
Note: Assumes positions are independent
Assigning a "Match" Score with a PSSM
PSSM assigns sequence
NMFWAFGH
a score of:
0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 =
12
     1   2   3   4   5   6   7   8
A   -1  -2  -1   0  -1  -2   0  -2
R    5   0   5  -2   1  -3  -2   0
N    0   6   0   0   0  -3   0   1
D   -2   1  -2  -1   0  -3  -1  -1
C   -3  -3  -3  -3  -3  -2  -3  -3
Q    1   0   1  -2   5  -3  -2   0
E    0   0   0  -2   2  -3  -2   0
G   -2   0  -2   6  -2  -3   6  -2
H    0   1   0  -2   0  -1  -2   8
I   -3  -3  -3  -4  -3   0  -4  -3
L   -2  -3  -2  -4  -2   0  -4  -3
K    2   0   2  -2   1  -3  -2  -1
M   -1  -2  -1  -3   0   0  -3  -2
F   -3  -3  -3  -3  -3   6  -3  -1
P   -2  -2  -2  -2  -1  -4  -2  -2
S   -1   1  -1   0   0  -2   0  -1
T   -1   0  -1  -2  -1  -2  -2  -2
W   -3  -4  -3  -2  -2   1  -2  -2
Y   -2  -2  -2  -3  -1   3  -3   2
V   -3  -3  -3  -3  -2  -1  -3  -3
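The lookup-and-add procedure can be written as a short Python sketch. The matrix values are the ones on this slide; the variable and function names are assumptions.

```python
# PSSM from the slide: one row per residue, one column per position 1-8.
PSSM_TEXT = """
A -1 -2 -1  0 -1 -2  0 -2
R  5  0  5 -2  1 -3 -2  0
N  0  6  0  0  0 -3  0  1
D -2  1 -2 -1  0 -3 -1 -1
C -3 -3 -3 -3 -3 -2 -3 -3
Q  1  0  1 -2  5 -3 -2  0
E  0  0  0 -2  2 -3 -2  0
G -2  0 -2  6 -2 -3  6 -2
H  0  1  0 -2  0 -1 -2  8
I -3 -3 -3 -4 -3  0 -4 -3
L -2 -3 -2 -4 -2  0 -4 -3
K  2  0  2 -2  1 -3 -2 -1
M -1 -2 -1 -3  0  0 -3 -2
F -3 -3 -3 -3 -3  6 -3 -1
P -2 -2 -2 -2 -1 -4 -2 -2
S -1  1 -1  0  0 -2  0 -1
T -1  0 -1 -2 -1 -2 -2 -2
W -3 -4 -3 -2 -2  1 -2 -2
Y -2 -2 -2 -3 -1  3 -3  2
V -3 -3 -3 -3 -2 -1 -3 -3
"""

PSSM = {}
for line in PSSM_TEXT.strip().splitlines():
    residue, *scores = line.split()
    PSSM[residue] = [int(s) for s in scores]


def pssm_score(sequence, pssm=PSSM):
    """Sum the PSSM entries for each residue at its position."""
    width = len(next(iter(pssm.values())))
    if len(sequence) != width:
        raise ValueError(f"sequence must have length {width}")
    return sum(pssm[res][pos] for pos, res in enumerate(sequence))


print(pssm_score("NMFWAFGH"))
```

Because each position is looked up independently, the score is a plain sum, which is the independence assumption noted on the previous slide.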
Creating a PSSM from 1 Sequence
Query: RNRGQFGH (length L = 8)
For each query position, take the BLOSUM62 matrix (20 by 20) scores for the residue at that position (e.g., R at position 1)
Result: a 20 by L PSSM

     R   N   R   G   Q   F   G   H
A   -1  -2  -1   0  -1  -2   0  -2
R    5   0   5  -2   1  -3  -2   0
N    0   6   0   0   0  -3   0   1
D   -2   1  -2  -1   0  -3  -1  -1
C   -3  -3  -3  -3  -3  -2  -3  -3
Q    1   0   1  -2   5  -3  -2   0
E    0   0   0  -2   2  -3  -2   0
G   -2   0  -2   6  -2  -3   6  -2
H    0   1   0  -2   0  -1  -2   8
I   -3  -3  -3  -4  -3   0  -4  -3
L   -2  -3  -2  -4  -2   0  -4  -3
K    2   0   2  -2   1  -3  -2  -1
M   -1  -2  -1  -3   0   0  -3  -2
F   -3  -3  -3  -3  -3   6  -3  -1
P   -2  -2  -2  -2  -1  -4  -2  -2
S   -1   1  -1   0   0  -2   0  -1
T   -1   0  -1  -2  -1  -2  -2  -2
W   -3  -4  -3  -2  -2   1  -2  -2
Y   -2  -2  -2  -3  -1   3  -3   2
V   -3  -3  -3  -3  -2  -1  -3  -3
Creating a PSSM from Multiple Sequences
1. Discard columns that contain gaps in query sequence
2. Compute relative sequence weights
3. Compute PSSM entries, taking into account
• Observed residues in column
• Sequence weights
• Substitution matrix
1- Discard Columns with Gaps in Query
Before:
EEFG----SVDGLVNNA
QKYG----RLDVMINNA
RRLG----TLNVLVNNA
GGIG----PVD-LVNNA
KALG----GFNVIVNNA
ARFG----KID-LIPNA
FEPEGPEKGMWGLVNNA
AQLK----TVDVLINGA

After:
EEFGSVDGLVNNA
QKYGRLDVMINNA
RRLGTLNVLVNNA
GGIGPVD-LVNNA
KALGGFNVIVNNA
ARFGKID-LIPNA
FEPEGMWGLVNNA
AQLKTVDVLINGA
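This step is easy to express in Python. The sketch assumes the query is the first row of the alignment (the slide does not say which row is the query); the function name is also an assumption.

```python
def drop_query_gap_columns(alignment, query_index=0):
    """Remove every column in which the query row has a gap."""
    query = alignment[query_index]
    keep = [c for c, ch in enumerate(query) if ch != "-"]
    return ["".join(row[c] for c in keep) for row in alignment]


before = [
    "EEFG----SVDGLVNNA",
    "QKYG----RLDVMINNA",
    "RRLG----TLNVLVNNA",
    "GGIG----PVD-LVNNA",
    "KALG----GFNVIVNNA",
    "ARFG----KID-LIPNA",
    "FEPEGPEKGMWGLVNNA",
    "AQLK----TVDVLINGA",
]
for row in drop_query_gap_columns(before):
    print(row)
```

Gaps in other rows (for example in GGIGPVD-LVNNA) stay in place, because only gaps in the query decide which columns are dropped.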
2- Compute Sequence Weights
Sequence         Weight
EEFGSVDGLVNNA    1.2
QKYGRLDVMINNA    1.2
RRLGTLNVLVNNA    0.8
GGIGPVDLLVNNA    0.8
KALGGFNVIVNNA    1.1
ARFGKIDTLIPNA    0.9
FEPEGMWGLVNNA    1.1
AQLKTVDVLINGA    1.3
• Smaller weights are assigned
to redundant sequences
• Larger weights are assigned
to unique sequences
How are weights determined?
Based on branch lengths in guide tree: value for each sequence is
then used to multiply raw alignment scores
Goal of weighting? to decrease matching scores of frequent
characters in MSA & increase scores of infrequent characters
3- Compute PSSM Entries
(Observed residues) / (Background frequencies) = PSSM column

Observed residues in the column: E, Q, R, G, K, A, F, A

Background frequencies (usually derived from large sequence database):
A 0.085   C 0.019   D 0.054   E 0.065   F 0.040
G 0.072   H 0.023   I 0.058   K 0.056   L 0.096
M 0.024   P 0.053   Q 0.042   R 0.054   S 0.072
T 0.063   V 0.073   W 0.016   Y 0.034
PSSM Entries = Log-Odds Scores
Observed frequency
of residue “A”
1. Estimate probability of observing
each residue (probability of A given
M, where M is PSSM model)
2. Divide by background probability of
observing each residue (probability
of A given B, where B is background
model)
3. Take log so that can add (rather than
multiply) scores
$\log_2 \left( \frac{\Pr(A \mid M)}{\Pr(A \mid B)} \right)$
where M is the foreground model (i.e., the PSSM) and B is the background model
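A simplified Python sketch of the three steps, using the example column E, Q, R, G, K, A, F, A and the background frequencies listed in this lecture. It uses raw, unweighted observed frequencies and scores only residues that actually occur in the column, which is a deliberate simplification: a real PSSM also uses sequence weights and pseudocounts so that unobserved residues get a finite score. The names are assumptions.

```python
import math
from collections import Counter

# Background frequencies as listed in the lecture.
BACKGROUND = {
    "A": 0.085, "C": 0.019, "D": 0.054, "E": 0.065, "F": 0.040,
    "G": 0.072, "H": 0.023, "I": 0.058, "K": 0.056, "L": 0.096,
    "M": 0.024, "P": 0.053, "Q": 0.042, "R": 0.054, "S": 0.072,
    "T": 0.063, "V": 0.073, "W": 0.016, "Y": 0.034,
}


def log_odds_column(column, background=BACKGROUND):
    """log2(observed frequency / background frequency) per observed residue."""
    counts = Counter(column)
    total = len(column)
    return {res: math.log2((n / total) / background[res])
            for res, n in counts.items()}


scores = log_odds_column("EQRGKAFA")
for residue, score in sorted(scores.items()):
    print(residue, round(score, 2))
```

A is observed in 2 of 8 sequences (0.25) against a background of 0.085, so its score is positive: it is more common in this column than expected by chance.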
Why (not) PSI-BLAST?
• PSI-BLAST weights sequences according to observed
diversity specific to family under investigation
• Advantage: If sequences used to construct PSSMs are
all homologous, sensitivity for a given level of
specificity improves significantly
• Disadvantage: However, if any non-homologous
sequences are included in PSSMs, they become
“corrupted” and "pull in" additional non-homologous
sequences, resulting in false positive hits
How to Use PSI-BLAST Effectively
• Set initial thresholds high
• Inspect each iteration's result for suspicious
sequences (When in doubt, leave it out!)
• Do several iterations (~5), or until no new sequences
are found
• Make initial search very broad
• First, use NR (large, inclusive database) with up to 5 iterations
to set PSSM
• Then use that PSSM to search in a more restricted domain, if
possible
• Be particularly cautious about matches to sequences
with highly biased amino acid content
Summary: DP, BLAST & PSI-BLAST
• Dynamic programming is O(NM) for pairwise alignment
• BLAST is O(M)
• BLAST produces an index of words in query sequence
that allows fast matching to the database
• At NCBI, target databases are also pre-indexed to
indicate positions in all database sequences that
match each possible search word above some score
threshold
• PSI-BLAST iterates BLAST, adding new homologs at
each iteration
Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns:
• Regulatory motifs (TF binding sites)
• Splice sites
• Protein domains
• Identifying and characterizing protein families
• Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) &
mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)
Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
Rationale: if sequences are homologous (derived from a common ancestor),
they may be structurally/functionally equivalent
TATA box = transcriptional
promoter element
Sequence Logo
Sequence Motifs (Patterns)
Other types of representations?
• √ Consensus Sequence
• √ PSSM - Position-Specific Scoring Matrix
• √ Sequence Logo - "enhanced" consensus sequence,
in which symbol size reflects information entropy
• Information entropy??? In information theory, the Shannon
entropy or information entropy is a measure of the [decrease in]
uncertainty associated with a random variable. Entropy quantifies
information in a piece of data.
- Wikipedia
• Check out this fun website: Tom Schneider, NCIF
• http://www.ccrnp.ncifcrf.gov/~toms/glossary.html#sequence_logo
• Profile
• HMM - Hidden Markov Model
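To make the entropy idea concrete, here is a minimal Python sketch that computes the Shannon entropy (in bits) of one alignment column, $H = -\sum_a p_a \log_2 p_a$, where $p_a$ is the observed frequency of symbol a. The function name is an assumption, and the small-sample corrections used by real sequence-logo software are left out.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the symbols in one alignment column."""
    total = len(column)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(column).values())


print(column_entropy("AAAA"))  # fully conserved column
print(column_entropy("ACGT"))  # four equally frequent nucleotides
```

A fully conserved column has zero entropy (no uncertainty), while a DNA column with four equally frequent bases has the maximum of 2 bits. A sequence logo draws conserved columns tall because they reduce uncertainty the most.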