BCB 444/544 PSSMs & Psi-BLAST Lecture 12 #12_Sept17

BCB 444/544
Lecture 12
Multiple Sequence Alignment (MSA)
PSSMs & Psi-BLAST
#12_Sept17
BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST
9/17/07
Required Reading
(before lecture)
√Mon Sept 17 - Lecture 12
Position Specific Scoring Matrices & PSI-BLAST
• Chp 6 - pp 75-78 (but not HMMs)
Wed Sept 19 - Lecture 13
(not covered on Exam 1)
Hidden Markov Models
• Chp 6 - pp 79-84
• Eddy: What is a hidden Markov Model?
2004 Nature Biotechnol 22:1315
http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Fri Sept 21 - EXAM 1
Assignments & Announcements
Sun Sept 16 - Study Guide for Exam 1 was posted
Mon Sept 17 - Answers to HW#2
will be posted ~ Noon
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading:
Chps 2-6 (but not HMMs)
Eddy: What is Dynamic Programming?
Chp 5- Multiple Sequence Alignment
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 5
Multiple Sequence Alignment
• Scoring Function
• Exhaustive Algorithms
• Heuristic Algorithms
• Practical Issues
Multiple Sequence Alignments
Credits for slides: Caragea & Brown, 2007;
Fernandez-Baca, Heber & Hunter
Overview
1. What is a multiple sequence alignment (MSA)?
2. Where/why do we need MSA?
3. What is a good MSA?
4. Algorithms to compute a MSA
Multiple Sequence Alignment
• Generalize pairwise alignment of sequences to
include > 2 homologous sequences
• Analyzing more than 2 sequences gives us much more
information:
• Which amino acids are required? Correlated?
• Evolutionary/phylogenetic relationships
• Similar to PSI-BLAST idea (not yet covered in lecture):
use a set of homologous sequences to provide
more "sensitivity"
Definition: MSA
Given a set of sequences, a multiple sequence
alignment is an assignment of gap characters, such
that
• resulting sequences have same length
• no column contains only gaps
Example 1: NO (rows differ in length)
ATT-GC
ATTTGC
ATTTG
Example 2: YES
AT-TGC
ATTTGC
ATTTG-
Example 3: NO (column 5 contains only gaps)
AT-T-GC
ATTT-GC
ATTT-G-
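The two conditions in the definition can be checked mechanically. The following is a minimal sketch; the function name `is_valid_msa` is illustrative and not from the lecture. It is applied to the three example alignments above.

```python
def is_valid_msa(rows, gap="-"):
    """Return True if the gapped rows satisfy both MSA conditions."""
    if not rows:
        return False
    # Condition 1: all gapped sequences have the same length.
    if len({len(r) for r in rows}) != 1:
        return False
    # Condition 2: no column consists only of gap characters.
    return all(any(ch != gap for ch in col) for col in zip(*rows))

examples = [
    ["ATT-GC", "ATTTGC", "ATTTG"],      # rows differ in length
    ["AT-TGC", "ATTTGC", "ATTTG-"],     # valid
    ["AT-T-GC", "ATTT-GC", "ATTT-G-"],  # column 5 is all gaps
]
results = [is_valid_msa(rows) for rows in examples]
print(results)
```

The length check comes first because `zip` would silently truncate rows of unequal length.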
Displaying MSAs: using CLUSTAL W
Residue colors:
RED: AVFPMILW (small)
BLUE: DE (acidic, negative chg)
MAGENTA: RHK (basic, positive chg)
GREEN: STYHCNGQ (hydroxyl + amine + basic)
Conservation symbols:
* entirely conserved column
: all residues have ~ same size AND hydropathy
. all residues have ~ same size OR hydropathy
What is a Consensus Sequence?
A single sequence that represents most common
residue of each column in a MSA
Example:
FGGHL-GF
F-GHLPGF
FGGHP-FG
FGGHL-GF
Steiner consensus sequence: Given sequences $s_1, \dots, s_k$,
find a sequence $s^*$ that maximizes $\sum_i S(s^*, s_i)$
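The simple column-majority consensus (not the Steiner consensus, which requires an optimization over candidate sequences) can be sketched in a few lines. This sketch assumes the first three rows of the example are the aligned sequences and treats the gap character as an ordinary symbol; the function name is illustrative.

```python
from collections import Counter

def consensus(rows):
    """Most common symbol in each column; gaps count as symbols."""
    # On ties, Counter.most_common keeps the symbol seen first in the column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

msa = ["FGGHL-GF", "F-GHLPGF", "FGGHP-FG"]
print(consensus(msa))
```

For these three rows the result agrees with the fourth row of the example.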
Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns, e.g.:
• Regulatory motifs (TF binding sites)
• Splice sites
• Protein domains
• Identifying and characterizing protein families
• Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) &
mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)
Application: Recover Phylogenetic Tree
What was the series of events that led to the current species?
[Figure: phylogenetic tree relating the sequences NYLS, NFLS, and NYLS]
Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
Rationale: if they are homologous (derived from a common ancestor),
they may be structurally equivalent
TATA box = transcriptional
promoter element
Goal: Characterize Protein Families
Which parts of globin sequences are most highly conserved?
Databases of Multiple Alignments
• Pfam (Protein Domain Families data base)
• Contains alignments and HMMs of protein families
• InterPro
• Integrates: Prosite, Prints, ProDom, Pfam, and SMART
• BLOCKS
• Segments of highly conserved multiple alignments
• Hovergen (Homologous Vertebrate Genes Database)
• COGs (Clusters of Orthologous Groups)
• BaliBASE (Benchmark alignments database)
Scoring an Alignment
Goal: Align homologous positions.
But: Without knowledge of the phylogenetic tree, this is
very hard (sometimes impossible) to achieve!
[Figure: phylogenetic tree relating the sequences NYLS, NFLS, and NYLS]
Scoring an Alignment
In practice, simple scoring functions are used:
usually, columns are scored independently, i.e.
$S(m) = G + \sum_i S(m_i)$
where $G$ is the gap penalty and $m_i$ is the ith column of alignment $m$
[Figure: example alignments with individual columns highlighted]
Sum of Pairs (SP) Score
• SP = sum of scores of all possible pairs of
sequences in an MSA based on a particular
scoring matrix
• Compute for each column $m_i$:
$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$
where $m_i^l$ is residue $l$ of column $m_i$ and $s$ is a PAM or BLOSUM score
[Figure: example alignment with one column $m_i$ highlighted]
How to Score Gaps in MSAs?
Want to align gaps with each other over all sequences.
A gap in a pairwise alignment that “matches” a gap in
another pairwise alignment should cost less than
introducing a totally new gap.
• Possible that a new gap could be made to “match”
an older one by adjusting older pairwise alignment
• Change gap penalty near conserved domains of
various kinds (e.g. secondary structure elements,
hydrophobic regions)
Example: SP Score
Substitution scores (BLOSUM 60):
     F   Y   G   D
F    5  -2  -2  -1
Y        7   1  -5
G            4  -3
D                5
Gap penalty: -8
s(-,-) = 0
Alignment m:
F-G
F-G
FYD
S(m) = S(m1) + S(m2) + S(m3)
= 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D)
= 15 - 16 + 0 + 4 - 6 = -3
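The worked example can be reproduced with a short sum-of-pairs routine. This is a sketch only: the score table below holds just the three substitution scores that the calculation above actually uses (s(F,F) = 5, s(G,G) = 4, s(G,D) = -3), and the function names are illustrative.

```python
from itertools import combinations

GAP = "-"
GAP_PENALTY = -8  # score of a residue paired with a gap, as in the example

# Only the residue pairs needed for the worked example, keyed in sorted order.
SCORES = {("F", "F"): 5, ("G", "G"): 4, ("D", "G"): -3}

def pair_score(a, b):
    """Score one pair of symbols taken from the same column."""
    if a == GAP and b == GAP:
        return 0                      # s(-,-) = 0
    if a == GAP or b == GAP:
        return GAP_PENALTY
    return SCORES[tuple(sorted((a, b)))]

def sp_score(rows):
    """Sum-of-pairs score: add pair scores over all columns and all row pairs."""
    return sum(pair_score(a, b)
               for col in zip(*rows)
               for a, b in combinations(col, 2))

m = ["F-G", "F-G", "FYD"]
print(sp_score(m))
```

Column by column this gives 15, -16, and -2, matching the total of -3 above.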
Overcoming problems with SP scoring
• Use weights to incorporate evolution in sum of pairs
scoring:
• Some pairwise alignments are more important than
others
• e.g., more important to have a good alignment
between mouse & human sequences than between
mouse & bird
• Assign different weights to different pairwise
alignments
• Weight decreases with evolutionary distance
How to Compute a Multiple Alignment?
Algorithms for MSA:
• Multidimensional dynamic programming
• Optimal global alignment (time & space intensive!!!)
• Progressive alignments (Star alignment, ClustalW)
• Match closely-related sequences first using a guide tree
• Iterative methods
• Combined local alignments (Dialign)
• Multiple re-building attempts to find best alignment
• Partial order alignment (POA)
• Local alignments
• Profiles, Blocks, Patterns
Dynamic Programming for MSA
• As with pairwise alignments, multiple sequence
alignments can be computed by dynamic programming
[Figure: dynamic programming matrix F in 2D (two sequences) and 3D (three sequences)]
Generalized Needleman-Wunsch Algorithm
Given 3 sequences x, y, and z:
Main iteration loop:
F(i,j,k) = max ( F(i-1, j-1, k-1) + S(xi, yj, zk),
F(i-1, j-1, k ) + S(xi, yj, - ),
F(i-1, j , k-1) + S(xi, -, zk),
F(i-1, j , k ) + S(xi, -, - ),
F(i , j-1, k-1) + S( -, yj, zk),
F(i , j-1, k ) + S( -, yj, -),
F(i , j , k-1) + S( -, -, zk) )
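The recurrence can be turned into a small score-only implementation. The sketch below assumes a sum-of-pairs column score with match = +1, mismatch = -1, residue-gap = -2, and gap-gap = 0; these values and the function names are illustrative choices, not taken from the lecture.

```python
from itertools import product

def column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score S(a, b, c) of one column; '-' marks a gap."""
    total = 0
    for x, y in ((a, b), (a, c), (b, c)):
        if x == "-" and y == "-":
            continue                  # s(-,-) = 0
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total

def align3_score(x, y, z):
    """Optimal global alignment score of three sequences via 3D DP."""
    nx, ny, nz = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    F[0][0][0] = 0
    # The 7 moves: each sequence either consumes a residue (1) or gets a gap (0).
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i, j, k in product(range(nx + 1), range(ny + 1), range(nz + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[pi] if di else "-"
            b = y[pj] if dj else "-"
            c = z[pk] if dk else "-"
            F[i][j][k] = max(F[i][j][k], F[pi][pj][pk] + column_score(a, b, c))
    return F[nx][ny][nz]

print(align3_score("ATTGC", "ATTGC", "ATGC"))
```

The table has (nx+1)(ny+1)(nz+1) cells and each cell looks at 7 predecessors, so even this three-sequence version becomes slow for sequences of realistic length.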
What Happens to Computational
Complexity?
Given k sequences of length n:
• Space for matrix: $O(n^k)$
• Neighbors/cell: $2^k - 1$
• Time to compute SP score: $O(k^2)$
• Overall runtime: $O(k^2 2^k n^k)$
Ouch!!!
What's so bad about those exponents?
An example: Running Time of DP
• Overall runtime: $O(k^2 2^k n^k)$
# sequences   running time
2             1 second
3             2 minutes
4             5 hours
5             3 weeks
6             9 years
Sequences: globins (~150 aa)
But: There are fast heuristics.
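To get a feel for the growth, the operation-count estimate $k^2 2^k n^k$ can be evaluated directly. This is only a sketch of the asymptotic formula with n = 150; the running times in the table are illustrative figures from the slide and are not reproduced by this calculation.

```python
def dp_cost(k, n):
    """Operation-count estimate k^2 * 2^k * n^k for k sequences of length n."""
    return k**2 * 2**k * n**k

base = dp_cost(2, 150)
for k in range(2, 7):
    # Growth relative to the pairwise case (k = 2).
    print(k, dp_cost(k, 150) // base)
```

Each additional sequence multiplies the estimate by several hundred, which is why exact multidimensional DP is limited to a handful of short sequences.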
Progressive Alignment
Heuristic procedure:
1. Align most similar sequences first
2. Add sequences progressively
Often: use guide tree to determine
order of alignments
Examples:
• Star alignment
• ClustalW
[Figure: multiple alignment built by adding sequences 1, 2, 3, 4 in turn]
Guide Tree
Binary tree
• Leaves correspond to sequences
• Internal nodes represent alignments
• Root corresponds to final MSA
[Figure: guide tree with leaf sequences ATC, ATG, TCG, and TCC; internal nodes hold intermediate alignments and the root holds the final MSA]
Star Alignment - will skip for now,
come back to this on Wed
Star alignment will NOT be covered on
Exam 1
Chp6 - Profiles & Hidden Markov Models
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 6
Profiles & HMMs
• Position Specific Scoring Matrices (PSSMs)
• PSI-BLAST
• Profiles
• Markov Model & Hidden Markov Model
PSI Blast
• Position Specific Iterated BLAST
• Intuition: substitution matrices should be specific
to a particular site: penalize alanine→glycine more
in a helix
• Basic idea:
• Use BLAST with high stringency to get a set of closely
related sequences
• Align those sequences to create a new substitution
matrix for each position
• Then use that matrix (iteratively) to find additional
sequences
Psi-BLAST
[Figure: PSI-BLAST cycle linking Query, PSSM, BLAST search of the sequence database, and Multiple alignment]
PSI-BLAST pseudocode
Convert query to PSSM
do {
BLAST database with PSSM
Stop if no new homologs are found
Add new homologs to PSSM
}
Print current set of homologs
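The control flow of the pseudocode can be made runnable with toy stand-ins. In the sketch below, `build_profile` and `search` are hypothetical simplifications: the "PSSM" is just the set of residues seen at each position, and "BLAST" is a scan that counts matching positions in equal-length, ungapped sequences. Only the loop structure mirrors PSI-BLAST; the database and threshold are made up for illustration.

```python
def build_profile(homologs):
    """Toy stand-in for a PSSM: the set of residues seen at each position."""
    return [set(col) for col in zip(*homologs)]

def search(profile, database, threshold):
    """Toy stand-in for BLAST: keep sequences matching at >= threshold positions."""
    return {s for s in database
            if sum(ch in allowed for ch, allowed in zip(s, profile)) >= threshold}

def psi_blast_sketch(query, database, threshold, max_iterations=5):
    homologs = {query}
    profile = build_profile([query])                 # Convert query to PSSM
    for _ in range(max_iterations):
        hits = search(profile, database, threshold)  # BLAST database with PSSM
        new = hits - homologs
        if not new:                                  # Stop if no new homologs are found
            break
        homologs |= new                              # Add new homologs to PSSM
        profile = build_profile(sorted(homologs))
    return homologs                                  # Current set of homologs

db = ["RNRGQFGH", "KNRGQFGH", "KNKGEFGH", "KDKGEYGH", "AAAAAAAA"]
first_pass = search(build_profile(["RNRGQFGH"]), db, 6)
found = psi_blast_sketch("RNRGQFGH", db, threshold=6)
print(sorted(first_pass))
print(sorted(found))
```

A single search finds two sequences, while iterating reaches two more that are too far from the query to pass the threshold on their own. The same mechanism is what lets one wrongly included sequence pull in further unrelated ones.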
PSI-BLAST pseudocode
Position-specific
scoring matrix
Convert query to PSSM
do {
BLAST database with PSSM
Stop if no new homologs are found
Add new homologs to PSSM
}
Print current set of homologs
PSI-BLAST pseudocode
Convert query to PSSM
do {
BLAST database with PSSM
Stop if no new homologs are found
Add new homologs to PSSM
}
Print current set of homologs
This step requires a
user-defined
threshold
Position-specific scoring matrix - PSSM
• A PSSM is an n by m matrix,
where n is the size of
alphabet, and m is length of
sequence
• Entry at (i, j) is score assigned
by PSSM to letter i at the jth
position
Position:  1   2   3   4   5   6   7   8
A         -1  -2  -1   0  -1  -2   0  -2
R          5   0   5  -2   1  -3  -2   0
N          0   6   0   0   0  -3   0   1
D         -2   1  -2  -1   0  -3  -1  -1
C         -3  -3  -3  -3  -3  -2  -3  -3
Q          1   0   1  -2   5  -3  -2   0
E          0   0   0  -2   2  -3  -2   0
G         -2   0  -2   6  -2  -3   6  -2
H          0   1   0  -2   0  -1  -2   8
I         -3  -3  -3  -4  -3   0  -4  -3
L         -2  -3  -2  -4  -2   0  -4  -3
K          2   0   2  -2   1  -3  -2  -1
M         -1  -2  -1  -3   0   0  -3  -2
F         -3  -3  -3  -3  -3   6  -3  -1
P         -2  -2  -2  -2  -1  -4  -2  -2
S         -1   1  -1   0   0  -2   0  -1
T         -1   0  -1  -2  -1  -2  -2  -2
W         -3  -4  -3  -2  -2   1  -2  -2
Y         -2  -2  -2  -3  -1   3  -3   2
V         -3  -3  -3  -3  -2  -1  -3  -3
Position-specific scoring matrix
• A PSSM is an n by m matrix,
where n is the size of the
alphabet, and m is the
length of the sequence.
• The entry at (i, j) is the
score assigned by the PSSM
to letter i at the jth
position.
Example: "K" at position 3 gets a score of 2
Position-specific scoring matrix
This PSSM assigns sequence
NMFWAFGH a score of:
0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12
Position-specific scoring matrix
• What score does this PSSM
assign to KRPGHFLA?
2 + 0 + -2 + 6 + 0 + 6 + -4 + -2 = 6
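Scoring a sequence against a PSSM is a lookup and a sum: take the entry for each residue at its position and add them up. The sketch below encodes the 20 by 8 PSSM from the slides (rows in the slide's residue order) and reproduces both worked scores; the names `PSSM` and `pssm_score` are illustrative.

```python
RESIDUES = "ARNDCQEGHILKMFPSTWYV"
# One row per residue (in RESIDUES order); columns are positions 1-8.
ROWS = [
    [-1, -2, -1,  0, -1, -2,  0, -2],  # A
    [ 5,  0,  5, -2,  1, -3, -2,  0],  # R
    [ 0,  6,  0,  0,  0, -3,  0,  1],  # N
    [-2,  1, -2, -1,  0, -3, -1, -1],  # D
    [-3, -3, -3, -3, -3, -2, -3, -3],  # C
    [ 1,  0,  1, -2,  5, -3, -2,  0],  # Q
    [ 0,  0,  0, -2,  2, -3, -2,  0],  # E
    [-2,  0, -2,  6, -2, -3,  6, -2],  # G
    [ 0,  1,  0, -2,  0, -1, -2,  8],  # H
    [-3, -3, -3, -4, -3,  0, -4, -3],  # I
    [-2, -3, -2, -4, -2,  0, -4, -3],  # L
    [ 2,  0,  2, -2,  1, -3, -2, -1],  # K
    [-1, -2, -1, -3,  0,  0, -3, -2],  # M
    [-3, -3, -3, -3, -3,  6, -3, -1],  # F
    [-2, -2, -2, -2, -1, -4, -2, -2],  # P
    [-1,  1, -1,  0,  0, -2,  0, -1],  # S
    [-1,  0, -1, -2, -1, -2, -2, -2],  # T
    [-3, -4, -3, -2, -2,  1, -2, -2],  # W
    [-2, -2, -2, -3, -1,  3, -3,  2],  # Y
    [-3, -3, -3, -3, -2, -1, -3, -3],  # V
]
PSSM = dict(zip(RESIDUES, ROWS))

def pssm_score(seq, pssm=PSSM):
    """Sum the PSSM entries for each residue at its (0-based) position."""
    return sum(pssm[res][pos] for pos, res in enumerate(seq))

print(pssm_score("NMFWAFGH"), pssm_score("KRPGHFLA"))
```

This version assumes an ungapped sequence of exactly the PSSM's length; a real search would also slide the PSSM along longer sequences and allow gaps.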
Position-specific iterated BLAST
[Figure: PSI-BLAST cycle linking Query, PSSM, BLAST search of the sequence database, and Multiple alignment, with a "?" marking one step]
Creating a PSSM from 1 sequence
Query: RNRGQFGH (length L)
BLOSUM62 matrix: 20 by 20
For each query position, the row of the BLOSUM62 matrix for that query residue becomes the corresponding PSSM column. The result is a 20 by L PSSM; for this query it is the 20 by 8 matrix shown on the preceding slides.
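The single-sequence construction amounts to copying substitution-matrix rows into PSSM columns. The sketch below uses a hypothetical three-letter alphabet with made-up scores (not BLOSUM62) purely to keep the example short; `pssm_from_query` is an illustrative name.

```python
def pssm_from_query(query, subst, alphabet):
    """PSSM[a][j] is the substitution score between letter a and query[j]."""
    return {a: [subst[q][a] for q in query] for a in alphabet}

# Hypothetical alphabet and scores, for illustration only.
ALPHABET = "xyz"
SUBST = {
    "x": {"x": 4, "y": -1, "z": -2},
    "y": {"x": -1, "y": 5, "z": 0},
    "z": {"x": -2, "y": 0, "z": 6},
}
pssm = pssm_from_query("xyzx", SUBST, ALPHABET)
for letter in ALPHABET:
    print(letter, pssm[letter])
```

With a 20-letter alphabet and the BLOSUM62 matrix, the same function would give a 20 by L matrix for a query of length L.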
Position-specific iterated BLAST
[Figure: PSI-BLAST cycle linking Query, PSSM, BLAST search of the sequence database, and Multiple alignment, with a "?" marking one step]
Creating a PSSM from multiple sequences
• Discard columns that contain gaps in query
• For each column C
• Compute relative sequence weights
• Compute PSSM entries, taking into account
• Observed residues in this column
• Sequence weights
• Substitution matrix
Discard query gap columns
Before:
EEFG----SVDGLVNNA
QKYG----RLDVMINNA
RRLG----TLNVLVNNA
GGIG----PVD-LVNNA
KALG----GFNVIVNNA
ARFG----KID-LIPNA
FEPEGPEKGMWGLVNNA
AQLK----TVDVLINGA
After:
EEFGSVDGLVNNA
QKYGRLDVMINNA
RRLGTLNVLVNNA
GGIGPVD-LVNNA
KALGGFNVIVNNA
ARFGKID-LIPNA
FEPEGMWGLVNNA
AQLKTVDVLINGA
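Dropping query-gap columns is a column filter keyed on one row. The sketch below assumes the query is the first row of the alignment (the slide does not say which row is the query); the function name is illustrative.

```python
def drop_query_gap_columns(rows, query_index=0, gap="-"):
    """Remove every column in which the query row has a gap."""
    keep = [j for j, ch in enumerate(rows[query_index]) if ch != gap]
    return ["".join(row[j] for j in keep) for row in rows]

alignment = [
    "EEFG----SVDGLVNNA",  # assumed to be the query
    "QKYG----RLDVMINNA",
    "RRLG----TLNVLVNNA",
    "GGIG----PVD-LVNNA",
    "KALG----GFNVIVNNA",
    "ARFG----KID-LIPNA",
    "FEPEGPEKGMWGLVNNA",
    "AQLK----TVDVLINGA",
]
trimmed = drop_query_gap_columns(alignment)
print("\n".join(trimmed))
```

Gaps in other rows are kept, so the PSSM keeps one column per query residue.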
Compute sequence weights
Sequence        Weight
EEFGSVDGLVNNA   1.2
QKYGRLDVMINNA   1.2
RRLGTLNVLVNNA   0.8
GGIGPVDLLVNNA   0.8
KALGGFNVIVNNA   1.1
ARFGKIDTLIPNA   0.9
FEPEGMWGLVNNA   1.1
AQLKTVDVLINGA   1.3
• Low weights are assigned to
redundant sequences
• High weights are assigned to
unique sequences
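One common way to obtain such weights is position-based weighting in the style of Henikoff and Henikoff: in each column, a residue shared by many sequences contributes little to each of them. The slide does not say how its weights were computed, so the sketch below is an assumption and does not attempt to reproduce the numbers above. Weights are scaled to average 1.

```python
from collections import Counter

def position_based_weights(rows):
    """Position-based sequence weights, scaled so that they average 1."""
    weights = [0.0] * len(rows)
    for col in zip(*rows):
        counts = Counter(col)
        distinct = len(counts)
        for idx, ch in enumerate(col):
            # Each column hands out a total weight of 1, split first by
            # residue type and then among the sequences sharing that residue.
            weights[idx] += 1.0 / (distinct * counts[ch])
    scale = len(rows) / len(rows[0])
    return [w * scale for w in weights]

toy = ["AA", "AA", "CC"]
print(position_based_weights(toy))
```

In the toy alignment the two identical rows share their weight, while the unique row gets twice as much.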
Compute PSSM entries
(simplified version)
Observed residues in the column (E, Q, R, G, K, A, F, A) + background frequencies = PSSM column
Background frequencies (these are usually derived from a large sequence database):
A 0.085   C 0.019   D 0.054   E 0.065   F 0.040
G 0.072   H 0.023   I 0.058   K 0.056   L 0.096
M 0.024   P 0.053   Q 0.042   R 0.054   S 0.072
T 0.063   V 0.073   W 0.016   Y 0.034
Log-odds score
1. Estimate the probability of observing each residue
2. Divide by the background probability of observing the same residue
3. Take log so scores will be additive
$\log_2 \frac{\Pr(A \mid M)}{\Pr(A \mid B)}$
Log-odds score
1. Estimate the probability of observing each residue
2. Divide by the background probability of observing the same residue
3. Take log so scores will be additive
$\log_2 \frac{\Pr(A \mid M)}{\Pr(A \mid B)}$
where
• A: residue "A" is observed
• M: residue was generated by the foreground model (i.e., the PSSM)
• B: residue was generated by the background model (i.e., randomly selected)
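The three steps map onto a few lines of code. The sketch below estimates Pr(a | M) from the observed column with an added pseudocount so that unseen residues do not produce log(0); the pseudocount, the omission of sequence weights, and the four-letter uniform background are simplifying assumptions for illustration, not part of the slide.

```python
import math
from collections import Counter

def log_odds_column(observed, background, pseudocount=1.0):
    """Return log2(Pr(a|M) / Pr(a|B)) for every residue a in the background."""
    counts = Counter(observed)
    total = len(observed) + pseudocount * len(background)
    scores = {}
    for residue, p_background in background.items():
        p_model = (counts[residue] + pseudocount) / total  # step 1: estimate
        scores[residue] = math.log2(p_model / p_background)  # steps 2 and 3
    return scores

# Toy four-letter alphabet with a uniform background (illustration only).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
column_scores = log_odds_column("AAAC", background)
print(column_scores)
```

Residues seen more often than the background predicts get positive scores, and residues seen less often get negative scores.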
Why (not) PSI-BLAST
• Weights sequence according to observed diversity
specific to family of interest
• Advantage: If sequences used to construct Position
Specific Scoring Matrices (PSSMs) are all homologous,
sensitivity at a given specificity improves significantly
• Disadvantage: If any non-homologous sequences are included
in the PSSMs, the PSSMs are "corrupted." They then "pull in"
additional non-homologous sequences and become worse than a
generic substitution matrix
How to use PSI BLAST
• Set initial thresholds high
• Inspect each iteration's result for suspicious
sequences
• Do several iterations (~5), or until no new
sequences are found
• Even if only looking for a small set of sequences,
make initial search very broad
• First, use NR (large, inclusive database) with up to 5
iterations to set PSSM
• Then use that PSSM to search in restricted domain
PSI-BLAST caveats
• Goal: Increased ability to find distant homologs
• Cost? additional care to prevent non-homologous sequences
from being included in PSSM calculation
• When in doubt, leave it out!
• Examine sequences with moderate similarity carefully
• Be particularly cautious about matches to sequences with
highly biased amino acid content
PSI-BLAST example
Query is human NF-Kappa-B sequence
First Iteration
Second iteration
Summary
• Dynamic programming is O(NM)
• BLAST is O(M)
• BLAST produces an index of query sequence that
allows fast matching to the database
• Target database is pre-indexed to indicate
positions in all database sequences that match
each possible search word above some score
threshold
• PSI-BLAST iterates BLAST, adding new homologs
at each iteration