#11 - MSAs; PSSMs & Psi-BLAST 9/17/07 BCB 444/544 PSSMs & Psi-BLAST

BCB 444/544 F07 ISU Dobbs

Lecture 12
Multiple Sequence Alignment (MSA)
PSSMs & Psi-BLAST
#12_Sept17

Required Reading (before lecture)
√ Mon Sept 17 - Lecture 12
Position Specific Scoring Matrices & PSI-BLAST
• Chp 6 - pp 75-78 (but not HMMs)
Wed Sept 19 - Lecture 13 (not covered on Exam 1)
Hidden Markov Models
• Chp 6 - pp 79-84
• Eddy: What is a hidden Markov Model?
  2004 Nature Biotechnol 22:1315
  http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Fri Sept 21 - EXAM 1

Assignments & Announcements
Sun Sept 16 - Study Guide for Exam 1 was posted
Mon Sept 17 - Answers to HW#2 will be posted ~ Noon
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading:
  Chps 2-6 (but not HMMs)
  Eddy: What is Dynamic Programming?

Chp 5 - Multiple Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 5 Multiple Sequence Alignment
• Scoring Function
• Exhaustive Algorithms
• Heuristic Algorithms
• Practical Issues

Overview
Multiple Sequence Alignments
1. What is a multiple sequence alignment (MSA)?
2. Where/why do we need MSA?
3. What is a good MSA?
4. Algorithms to compute a MSA
Credits for slides: Caragea & Brown, 2007; Fernandez-Baca, Heber & Hunter

Multiple Sequence Alignment
• Generalize pairwise alignment of sequences to include > 2 homologous sequences
• Analyzing more than 2 sequences gives us much more information:
  • Which amino acids are required? Correlated?
  • Evolutionary/phylogenetic relationships
• Similar to PSI-BLAST idea (not yet covered in lecture): use a set of homologous sequences to provide more "sensitivity"

Definition: MSA
Given a set of sequences, a multiple sequence alignment is an assignment of gap characters, such that
• resulting sequences have same length
• no column contains only gaps

Displaying MSAs: using CLUSTAL W
Examples for the MSA definition (is this a valid MSA?):

ATT-GC
ATTTGC
ATTTG
NO (the sequences do not all have the same length)

AT-TGC
ATTTGC
ATTTG-
YES

AT-T-GC
ATTT-GC
ATTT-G-
NO (one column contains only gaps)

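The two conditions in the MSA definition can be checked mechanically. The following is a minimal sketch in Python; the function name `is_valid_msa` is hypothetical and not part of the lecture. It is run on the three example alignments from the slide.

```python
def is_valid_msa(rows):
    """Return True if the gapped sequences satisfy both MSA conditions."""
    if not rows:
        return False
    # Condition 1: all gapped sequences have the same length.
    if len({len(r) for r in rows}) != 1:
        return False
    # Condition 2: no column consists only of gap characters.
    return all(any(ch != "-" for ch in col) for col in zip(*rows))


print(is_valid_msa(["ATT-GC", "ATTTGC", "ATTTG"]))       # unequal lengths
print(is_valid_msa(["AT-TGC", "ATTTGC", "ATTTG-"]))      # valid
print(is_valid_msa(["AT-T-GC", "ATTT-GC", "ATTT-G-"]))   # all-gap column
```

The three calls reproduce the NO / YES / NO verdicts given for the examples.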
CLUSTAL W display conventions
Residue colors:
• RED: AVFPMILW (small)
• BLUE: DE (acidic, negative chg)
• MAGENTA: RHK (basic, positive chg)
• GREEN: STYHCNGQ (hydroxyl + amine + basic)
Conservation symbols:
• * entirely conserved column
• : all residues have ~ same size AND hydropathy
• . all residues have ~ same size OR hydropathy

What is a Consensus Sequence?
A single sequence that represents the most common residue of each column in an MSA
Example:
FGGHL-GF
F-GHLPGF
FGGHP-FG
Consensus: FGGHL-GF

Steiner consensus sequence: Given sequences $s_1, \ldots, s_k$, find a sequence $s^*$ that maximizes $\sum_i S(s^*, s_i)$

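The majority-vote definition of a consensus sequence can be sketched in a few lines of Python. This is an illustration only: the function name `consensus` is hypothetical, gaps are treated as ordinary characters, and ties are broken by whichever character appears first in the column (an arbitrary choice not specified in the lecture). It does not compute the Steiner consensus.

```python
from collections import Counter


def consensus(rows):
    """Most common character (gaps included) in each column of an MSA.

    Ties go to the character seen first in the column.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


print(consensus(["FGGHL-GF", "F-GHLPGF", "FGGHP-FG"]))  # FGGHL-GF
```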
Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns, e.g.:
  • Regulatory motifs (TF binding sites)
  • Splice sites
  • Protein domains
• Identifying and characterizing protein families
  • Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)

Application: Recover Phylogenetic Tree
What was the series of events that led to current species?
[Figure: example with sequences NYLS, NFLS, NYLS]

Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
TATA box = transcriptional promoter element

Goal: Characterize Protein Families
Which parts of globin sequences are most highly conserved?
Rationale: if they are homologous (derived from a common ancestor), they may be structurally equivalent

Databases of Multiple Alignments
• Pfam (Protein Domain Families database)
  • Contains alignments and HMMs of protein families
• InterPro
  • Integrates: Prosite, Prints, ProDom, Pfam, and SMART
• BLOCKS
  • Segments of highly conserved multiple alignments
• Hovergen (Homologous Vertebrate Genes Database)
• COGs (Clusters of Orthologous Groups)
• BaliBASE (Benchmark alignments database)

Scoring an Alignment
Goal: Align homologous positions.
But: Without knowledge of the phylogenetic tree, this is very hard (sometimes impossible) to achieve!
[Figure: example with sequences NYLS, NFLS, NYLS]

Scoring an Alignment
In practice, simple scoring functions are used: usually, columns are scored independently, i.e.

$$S(m) = G + \sum_i S(m_i)$$

where $m_i$ is the ith column of alignment m and G is the gap penalty.

Sum of Pairs (SP) Score
• SP = sum of scores of all possible pairs of sequences in an MSA based on a particular scoring matrix
• Compute for each column $m_i$:

$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$

where $m_i^l$ is residue l of column i and s is the PAM or BLOSUM score.

[Figure: example protein alignment with one column $m_i$ highlighted]

How Score Gaps in MSAs?
Want to align gaps with each other over all sequences. A gap in a pairwise alignment that "matches" a gap in another pairwise alignment should cost less than introducing a totally new gap.
• Possible that a new gap could be made to "match" an older one by adjusting older pairwise alignment
• Change gap penalty near conserved domains of various kinds (e.g. secondary structure elements, hydrophobic regions)

Example: SP Score
Alignment m:
F-G
F-G
FYD

Scores used (BLOSUM 60): s(F,F) = 5, s(G,G) = 4, s(G,D) = -3
Gap penalty: -8
s(-,-) = 0

S(m) = S(m1) + S(m2) + S(m3)
     = 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D)
     = 15 - 16 + 0 + 4 - 6 = -3

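The SP calculation in the example above can be reproduced with a short Python sketch. The helper names `pair_score` and `sp_score` are hypothetical, and the score table holds only the three substitution values that the worked example actually uses; a real implementation would load a full PAM or BLOSUM matrix.

```python
from itertools import combinations

GAP = "-"
GAP_PENALTY = -8
# Only the substitution scores needed for the slide example.
SCORES = {("F", "F"): 5, ("G", "G"): 4, ("D", "G"): -3}


def pair_score(a, b):
    """Score one pair of characters taken from the same column."""
    if a == GAP and b == GAP:
        return 0                      # s(-,-) = 0
    if a == GAP or b == GAP:
        return GAP_PENALTY            # residue aligned with a gap
    return SCORES[tuple(sorted((a, b)))]


def sp_score(rows):
    """Sum-of-pairs score: every pair of rows, summed over every column."""
    return sum(pair_score(a, b)
               for col in zip(*rows)
               for a, b in combinations(col, 2))


print(sp_score(["F-G", "F-G", "FYD"]))  # -3
```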
Overcoming problems with SP scoring
• Use weights to incorporate evolution in sum of pairs scoring:
  • Some pairwise alignments are more important than others
  • e.g., more important to have a good alignment between mouse & human sequences than between mouse & bird
  • Assign different weights to different pairwise alignments
  • Weight decreases with evolutionary distance

How Compute a Multiple Alignment?
Algorithms for MSA:
• Multidimensional dynamic programming
  • Optimal global alignment (time & space intensive!!!)
• Progressive alignments (Star alignment, ClustalW)
  • Match closely-related sequences first using a guide tree
• Iterative methods
  • Multiple re-building attempts to find best alignment
• Partial order alignment (POA)
• Local alignments
  • Combined local alignments (Dialign)
  • Profiles, Blocks, Patterns

Dynamic Programming for MSA
• As with pairwise alignments, multiple sequence alignments can be computed by dynamic programming

Generalized Needleman-Wunsch Algorithm
Given 3 sequences x, y, and z:
Main iteration loop:
F(i,j,k) = max ( F(i-1, j-1, k-1) + S(xi, yj, zk),
                 F(i-1, j-1, k  ) + S(xi, yj, - ),
                 F(i-1, j  , k-1) + S(xi, - , zk),
                 F(i-1, j  , k  ) + S(xi, - , - ),
                 F(i  , j-1, k-1) + S(- , yj, zk),
                 F(i  , j-1, k  ) + S(- , yj, - ),
                 F(i  , j  , k-1) + S(- , - , zk) )

[Figure: the 2D dynamic-programming matrix F and its 3D generalization]

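The seven-way recurrence for three sequences can be sketched directly in Python. This is an illustration under stated assumptions, not the course's implementation: the function names are hypothetical, the column score S is taken to be a sum-of-pairs score with made-up match/mismatch/gap values (1, -1, -2), and only the optimal score is returned, without traceback.

```python
from itertools import product


def sp_column(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column (toy scoring values, assumed)."""
    def s(p, q):
        if p == "-" and q == "-":
            return 0
        if p == "-" or q == "-":
            return gap
        return match if p == q else mismatch
    return s(a, b) + s(a, c) + s(b, c)


def align3_score(x, y, z):
    """Optimal global alignment score of three sequences by 3D DP."""
    nx, ny, nz = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    F[0][0][0] = 0
    # The 2^3 - 1 = 7 predecessor moves: each sequence either consumes
    # one residue (1) or contributes a gap (0); all-gap is excluded.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(nx + 1), range(ny + 1), range(nz + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[pi] if di else "-"
            b = y[pj] if dj else "-"
            c = z[pk] if dk else "-"
            cand = F[pi][pj][pk] + sp_column(a, b, c)
            if cand > F[i][j][k]:
                F[i][j][k] = cand
    return F[nx][ny][nz]


print(align3_score("AT", "AT", "AT"))  # 6
```

The table holds (n+1)^3 cells and each cell looks at 7 neighbors, which is the k = 3 case of the cost discussed next.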
What Happens to Computational Complexity?
Given k sequences of length n:
• Space for matrix: $O(n^k)$
• Neighbors/cell: $2^k - 1$
• Time to compute SP score: $O(k^2)$
• Overall runtime: $O(k^2 2^k n^k)$
Ouch!!!

What's so bad about those exponents?
• Overall runtime: $O(k^2 2^k n^k)$
An example: Running Time of DP
Sequences: globins (≈ 150 aa)

# sequences    running time
2              1 second
3              2 minutes
4              5 hours
5              3 weeks
6              9 years

But: There are fast heuristics.

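To get a feel for how quickly the bound grows, the expression $k^2 2^k n^k$ can simply be evaluated for n = 150. This sketch (with a hypothetical helper `dp_cost`) ignores constant factors, so the numbers are operation counts suggested by the bound, not predictions of the running times in the table.

```python
def dp_cost(k, n):
    """Value of k^2 * 2^k * n^k, the bound on DP work (constants ignored)."""
    return k**2 * 2**k * n**k


for k in range(2, 7):
    print(k, dp_cost(k, 150))
```

Each additional sequence multiplies the count by roughly 2n, which is why exact DP is impractical beyond a handful of sequences.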
Progressive Alignment
Heuristic procedure:
1. Align most similar sequences first
2. Add sequences progressively
Often: use guide tree to determine order of alignments
Examples:
• Star alignment
• ClustalW

Guide Tree
Binary tree
• Leaves correspond to sequences
• Internal nodes represent alignments
• Root corresponds to final MSA
[Figure: multiple alignment by adding sequences; guide tree over the sequences ATC, ATG, TCG, and TCC, with alignment steps numbered 1-4]

Star Alignment - will skip for now, come back to this on Wed

Star alignment will NOT be covered on Exam 1

Chp 6 - Profiles & Hidden Markov Models
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
• Position Specific Scoring Matrices (PSSMs)
• PSI-BLAST
• Profiles
• Markov Model & Hidden Markov Model

Psi-BLAST
• Position Specific Iterated BLAST
• Intuition: substitution matrices should be specific to a particular site: penalize alanine→glycine more in a helix
• Basic idea:
  • Use BLAST with high stringency to get a set of closely related sequences
  • Align those sequences to create a new substitution matrix for each position
  • Then use that matrix (iteratively) to find additional sequences
[Figure: PSI-BLAST components: Query, PSSM, Sequence database, Multiple alignment]

PSI-BLAST pseudocode
Convert query to PSSM
do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM
}
Print current set of homologs

The loop relies on a user-defined threshold for deciding which hits count as homologs.

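The control flow of the pseudocode can be written as a small Python function. This is only a sketch of the loop structure: `search` and `build_pssm` are hypothetical callables supplied by the caller, standing in for the BLAST search (including its inclusion threshold) and for PSSM construction, neither of which is implemented here. The default of 5 iterations is an assumption.

```python
def psi_blast(query, search, build_pssm, max_iterations=5):
    """Iterate search -> rebuild PSSM until no new homologs appear.

    search(pssm) returns an iterable of hit identifiers that pass the
    user-defined threshold; build_pssm(query, homologs) returns a PSSM.
    Both are placeholders provided by the caller.
    """
    homologs = set()
    pssm = build_pssm(query, homologs)      # Convert query to PSSM
    for _ in range(max_iterations):
        hits = set(search(pssm))            # BLAST database with PSSM
        new = hits - homologs
        if not new:                         # Stop if no new homologs are found
            break
        homologs |= new                     # Add new homologs to PSSM
        pssm = build_pssm(query, homologs)
    return homologs                         # Current set of homologs
```

Passing the search and PSSM-building steps in as functions keeps the sketch independent of any particular BLAST installation or library.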
Position-specific scoring matrix - PSSM
• A PSSM is an n by m matrix, where n is the size of the alphabet, and m is the length of the sequence
• The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position

Position:   1   2   3   4   5   6   7   8
A          -1  -2  -1   0  -1  -2   0  -2
R           5   0   5  -2   1  -3  -2   0
N           0   6   0   0   0  -3   0   1
D          -2   1  -2  -1   0  -3  -1  -1
C          -3  -3  -3  -3  -3  -2  -3  -3
Q           1   0   1  -2   5  -3  -2   0
E           0   0   0  -2   2  -3  -2   0
G          -2   0  -2   6  -2  -3   6  -2
H           0   1   0  -2   0  -1  -2   8
I          -3  -3  -3  -4  -3   0  -4  -3
L          -2  -3  -2  -4  -2   0  -4  -3
K           2   0   2  -2   1  -3  -2  -1
M          -1  -2  -1  -3   0   0  -3  -2
F          -3  -3  -3  -3  -3   6  -3  -1
P          -2  -2  -2  -2  -1  -4  -2  -2
S          -1   1  -1   0   0  -2   0  -1
T          -1   0  -1  -2  -1  -2  -2  -2
W          -3  -4  -3  -2  -2   1  -2  -2
Y          -2  -2  -2  -3  -1   3  -3   2
V          -3  -3  -3  -3  -2  -1  -3  -3

• "K" at position 3 gets a score of 2
• This PSSM assigns sequence NMFWAFGH a score of:
  0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12

• What score does this PSSM assign to KRPGHFLA?
  2 + 0 + -2 + 6 + 0 + 6 + -4 + -2 = 6

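Scoring a sequence against a PSSM is a per-position table lookup followed by a sum. The sketch below stores the 20 by 8 matrix from the slides as a Python dictionary (one list of 8 position scores per residue) and recomputes both worked examples; the function name `pssm_score` is hypothetical.

```python
# PSSM from the slides: residue -> scores at positions 1..8.
PSSM = {
    "A": [-1, -2, -1, 0, -1, -2, 0, -2],
    "R": [5, 0, 5, -2, 1, -3, -2, 0],
    "N": [0, 6, 0, 0, 0, -3, 0, 1],
    "D": [-2, 1, -2, -1, 0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [1, 0, 1, -2, 5, -3, -2, 0],
    "E": [0, 0, 0, -2, 2, -3, -2, 0],
    "G": [-2, 0, -2, 6, -2, -3, 6, -2],
    "H": [0, 1, 0, -2, 0, -1, -2, 8],
    "I": [-3, -3, -3, -4, -3, 0, -4, -3],
    "L": [-2, -3, -2, -4, -2, 0, -4, -3],
    "K": [2, 0, 2, -2, 1, -3, -2, -1],
    "M": [-1, -2, -1, -3, 0, 0, -3, -2],
    "F": [-3, -3, -3, -3, -3, 6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1, 1, -1, 0, 0, -2, 0, -1],
    "T": [-1, 0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2, 1, -2, -2],
    "Y": [-2, -2, -2, -3, -1, 3, -3, 2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}


def pssm_score(seq):
    """Sum the PSSM entry for each residue at its (0-based) position."""
    return sum(PSSM[res][pos] for pos, res in enumerate(seq))


print(pssm_score("NMFWAFGH"))  # 12
print(pssm_score("KRPGHFLA"))  # 6
```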
Position-specific iterated BLAST
[Figure: PSI-BLAST cycle with components Query, PSSM, BLAST, Sequence database, and Multiple alignment; one step is marked "?"]

Creating a PSSM from 1 sequence
Query: RNRGQFGH (length L = 8)
BLOSUM62 matrix (20 by 20) → PSSM (20 by L)
Each PSSM column is the BLOSUM62 column for the query residue at that position, which yields the 20 by 8 matrix shown earlier.

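Building a PSSM from a single query amounts to copying one substitution-matrix column per query position. The sketch below illustrates this on a deliberately tiny four-letter alphabet (A, R, N, D) using the BLOSUM62 entries for those residues only; the restricted alphabet, the short query "RNR", and the names `sub` and `pssm_from_query` are assumptions made to keep the example small.

```python
ALPHABET = "ARND"
# BLOSUM62 entries for these four residues only (symmetric matrix,
# stored once per unordered pair).
SUB = {
    ("A", "A"): 4, ("A", "R"): -1, ("A", "N"): -2, ("A", "D"): -2,
    ("R", "R"): 5, ("R", "N"): 0, ("R", "D"): -2,
    ("N", "N"): 6, ("N", "D"): 1,
    ("D", "D"): 6,
}


def sub(a, b):
    """Look up a substitution score regardless of argument order."""
    return SUB[(a, b)] if (a, b) in SUB else SUB[(b, a)]


def pssm_from_query(query):
    """PSSM row for each residue: its score against every query position."""
    return {res: [sub(res, q) for q in query] for res in ALPHABET}


pssm = pssm_from_query("RNR")
for res in ALPHABET:
    print(res, pssm[res])
```

The rows printed for A, R, N, and D agree with the first three positions of the corresponding rows in the lecture's PSSM for RNRGQFGH.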
Creating a PSSM from multiple sequences
• Discard columns that contain gaps in query
• For each column C
  • Compute relative sequence weights
  • Compute PSSM entries, taking into account
    • Observed residues in this column
    • Sequence weights
    • Substitution matrix

Discard query gap columns
Before:
EEFG----SVDGLVNNA
QKYG----RLDVMINNA
RRLG----TLNVLVNNA
GGIG----PVD-LVNNA
KALG----GFNVIVNNA
ARFG----KID-LIPNA
FEPEGPEKGMWGLVNNA
AQLK----TVDVLINGA
After:
EEFGSVDGLVNNA
QKYGRLDVMINNA
RRLGTLNVLVNNA
GGIGPVD-LVNNA
KALGGFNVIVNNA
ARFGKID-LIPNA
FEPEGMWGLVNNA
AQLKTVDVLINGA

Compute sequence weights
EEFGSVDGLVNNA
QKYGRLDVMINNA
RRLGTLNVLVNNA
GGIGPVDLLVNNA
KALGGFNVIVNNA
ARFGKIDTLIPNA
FEPEGMWGLVNNA
AQLKTVDVLINGA
Sequence weights shown in the figure: 1.2, 1.2, 0.8, 0.8, 1.1, 0.9, 1.1, 1.3
• Low weights are assigned to redundant sequences
• High weights are assigned to unique sequences

Compute PSSM entries (simplified version)
Observed residues + Background frequencies = PSSM column
• Observed residues (one alignment column): E, Q, R, G, K, A, F, A
• Background frequencies: these are usually derived from a large sequence database
• Background frequency values shown in the figure: 0.085, 0.019, 0.054, 0.065, 0.040, 0.072, 0.023, 0.058, 0.056, 0.096, 0.024, 0.053, 0.042, 0.054, 0.072, 0.063, 0.073, 0.016, 0.034

Log-odds score
1. Estimate the probability of observing each residue
2. Divide by the background probability of observing the same residue
3. Take log so scores will be additive

$$\log_2 \frac{\Pr(A \mid M)}{\Pr(A \mid B)}$$

• Numerator: residue "A" is observed, and the residue was generated by the foreground model M (i.e., the PSSM)
• Denominator: residue "A" is observed, and the residue was generated by the background model B (i.e., randomly selected)

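The three log-odds steps can be sketched for a single alignment column in Python. Several things here are assumptions not taken from the lecture: the function name `pssm_column`, the use of a background-proportional pseudocount to avoid taking the log of zero, the uniform background of 0.05 per residue in the usage example, and the pairing of the eight weights with the eight observed residues in order. The substitution-matrix contribution mentioned on the slide is left out.

```python
import math
from collections import defaultdict


def pssm_column(residues, weights, background, pseudocount=1.0):
    """Log-odds scores (base 2) for one alignment column.

    residues/weights: observed residue and weight for each sequence.
    background: residue -> background probability (should sum to 1).
    pseudocount must be > 0 so that unobserved residues get a finite score.
    """
    counts = defaultdict(float)
    for res, w in zip(residues, weights):
        counts[res] += w
    total = sum(weights) + pseudocount
    scores = {}
    for res, p_bg in background.items():
        # Step 1: weighted estimate of Pr(res | M), smoothed by pseudocounts.
        p_model = (counts[res] + pseudocount * p_bg) / total
        # Steps 2 and 3: divide by the background probability, take log2.
        scores[res] = math.log2(p_model / p_bg)
    return scores


# Hypothetical usage: uniform background, weights paired in order.
background = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
column = "EQRGKAFA"
weights = [1.2, 1.2, 0.8, 0.8, 1.1, 0.9, 1.1, 1.3]
scores = pssm_column(column, weights, background)
print(round(scores["A"], 2), round(scores["C"], 2))
```

Residues observed in the column (such as A) receive positive scores and unobserved ones (such as C) receive negative scores, which is the behavior the log-odds formulation is meant to produce.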
Why (not) PSI-BLAST
• Weights sequences according to observed diversity specific to the family of interest
• Advantage: If the sequences used to construct Position Specific Scoring Matrices (PSSMs) are all homologous, sensitivity at a given specificity improves significantly
• Disadvantage: However, if any non-homologous sequences are included in PSSMs, they are "corrupted." Then they "pull in" additional non-homologous sequences, and become worse than a generic substitution matrix

How to use PSI-BLAST
• Set initial thresholds high
• Inspect each iteration's result for suspicious sequences
• Do several iterations (~5), or until no new sequences are found
• Even if only looking for a small set of sequences, make initial search very broad
  • First, use NR (large, inclusive database) with up to 5 iterations to set PSSM
  • Then use that PSSM to search in restricted domain

PSI-BLAST caveats
• Goal: Increased ability to find distant homologs
• Cost? Additional care to prevent non-homologous sequences from being included in PSSM calculation
• When in doubt, leave it out!
• Examine sequences with moderate similarity carefully
• Be particularly cautious about matches to sequences with highly biased amino acid content

PSI-BLAST example
Query is human NF-Kappa-B sequence
[Figure: PSI-BLAST results, first iteration]
[Figure: PSI-BLAST results, second iteration]

Summary
• Dynamic programming is O(NM)
• BLAST is O(M)
• BLAST produces an index of query sequence that
allows fast matching to the database
• Target database is pre-indexed to indicate
positions in all database sequences that match
each possible search word above some score
threshold
• PSI-BLAST iterates BLAST, adding new homologs
at each iteration