BLAST

An Introduction to Bioinformatics
Comparing biological sequences (3):
Database searching and
Multiple alignment
Database searching
• Goal: find sequences similar (homologous) to a
query sequence in a sequence database
• Input: query sequence & database
• Output: hits (pairwise alignments)
Database searching
• Core: pair-wise alignment algorithm
• Speed (fast sequence comparison)
• Relevance of the search results (statistical
tests)
• Recovering all information of interest
• The results depend on the search parameters,
such as gap penalty and scoring matrix.
• Sometimes searches with more than one
matrix should be performed
What program to use for searching?
1) BLAST is fastest and easily accessed on the Web
• limited sets of databases
• nice translation tools (BLASTX, TBLASTN)
2) FASTA
• precise choice of databases
• more sensitive for DNA-DNA comparisons
• FASTX and TFASTX can find similarities in sequences with
frameshifts
3) Smith-Waterman is slower, but more sensitive
• known as a “rigorous” or “exhaustive” search
• SSEARCH in GCG and standalone FASTA
FASTA
1) Derived from logic of the dot plot
• compute best diagonals from all frames of
alignment
2) Word method looks for exact matches
between words in query and test sequence
• hash tables (fast computer technique)
• DNA words are usually 6 bases
• protein words are 1 or 2 amino acids
• only searches for diagonals in region of word
matches = faster searching
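The word method above can be sketched in a few lines. This is a minimal illustration, not the FASTA program's actual code: the function names and the two toy sequences are made up, and it only counts exact word matches per diagonal (the step that tells FASTA which diagonals deserve a closer look).

```python
from collections import defaultdict

def word_index(seq, k):
    """Hash table: each k-letter word -> list of start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def diagonal_hits(query, target, k=6):
    """Count exact word matches on each diagonal (target pos - query pos)."""
    index = word_index(query, k)
    counts = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            counts[j - i] += 1
    return dict(counts)

# Toy data (hypothetical): the target is the query with two extra bases on each side.
query = "ACGTACGGTCA"
target = "TTACGTACGGTCATT"
hits = diagonal_hits(query, target, k=6)
best = max(hits, key=hits.get)  # diagonal with the most word matches
```

Only diagonals that collect word matches are examined further, which is why the search is fast compared with filling a full dot plot.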
FASTA Algorithm
Makes Longest Diagonal
3) after all diagonals found, tries to join
diagonals by adding gaps
4) computes alignments in regions of best
diagonals
FASTA Alignments
FASTA Results - Histogram
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Search-set/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunData:fastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores
z-score obs exp
        (=) (*)
< 20      0   0:
  22      0   0:
  24      3   0:=
  26      2   0:=
  28      5   0:==
  30     11   3:*==
  32     19  11:==*==
  34     38  30:=======*==
  36     58  61:===============*
  38     79 100:====================    *
  40    134 140:==================================*
  42    167 171:==========================================*
  44    205 189:===============================================*====
  46    209 192:===============================================*=====
  48    177 184:=============================================*
FASTA Results - List
The best scores are:                               init1 initn   opt   z-sc E(1018780)..
SW:PPI1_HUMAN Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph...            1854  1854  1854 2249.3 1.8e-117
SW:PPI1_RABIT Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi...            1840  1840  1840 2232.4 1.6e-116
SW:PPI1_RAT Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho...            1543  1543  1837 2228.7 2.5e-116
SW:PPI1_MOUSE Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph...            1542  1542  1836 2227.5 2.9e-116
SW:PPI2_HUMAN Begin: 1 End: 270
! P48739 homo sapiens (human). phosph...            1533  1533  1533 1861.0 7.7e-96
SPTREMBL_NEW:BAC25830 Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10, ...            1488  1488  1522 1847.6 4.2e-95
SP_TREMBL:Q8N5W1 Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila...            1477  1477  1522 1847.6 4.3e-95
SW:PPI2_RAT Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho...            1482  1482  1516 1840.4 1.1e-94
FASTA Results - Alignment
SCORES
Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap (83-957:151-1022)

                   60        70        80        90       100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
                                           || |||  | ||||| |    ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150       160       170       180

                  120       130       140       150       160       170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
             |||||||||   ||  |||   |   | ||  |||  |       || || ||||| ||
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                    190       200       210       220       230       240

                  180       190       200       210       220       230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
               |||  | ||||| ||   |||   ||||    | || |  |||||||| || ||| ||
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                    250       260       270       280       290       300

                  240       250       260       270       280       290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
             ||||||||||     |||||  |    |||||| |||| |||   || |||  || |
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
                    310       320       330       340       350       360
FASTA on the Web
Many websites offer FASTA searches
• Various databases and various other services
• Be sure to use FASTA 3
• Each server has its limits
• Be aware that you are depending on the
kindness of strangers.
Institut de Génétique Humaine, Montpellier France,
GeneStream server
http://www2.igh.cnrs.fr/bin/fasta-guess.cgi
Oak Ridge National Laboratory GenQuest server
http://avalon.epm.ornl.gov/
European Bioinformatics Institute, Cambridge, UK
http://www.ebi.ac.uk/htbin/fasta.py?request
EMBL, Heidelberg, Germany
http://www.embl-heidelberg.de/cgi/fasta-wrapper-free
Munich Information Center for Protein Sequences (MIPS)
at Max-Planck-Institut, Germany
http://speedy.mips.biochem.mpg.de/mips/programs/fasta.html
Institute of Biology and Chemistry of Proteins Lyon, France
http://www.ibcp.fr/serv_main.html
Institute Pasteur, France
http://central.pasteur.fr/seqanal/interfaces/fasta.html
GenQuest at The Johns Hopkins University
http://www.bis.med.jhmi.edu/Dan/gq/gq.form.html
National Cancer Center of Japan
http://bioinfo.ncc.go.jp
BLAST Searches GenBank
[BLAST= Basic Local Alignment Search Tool]
The NCBI BLAST web server lets you
compare your query sequence to various
sections of GenBank:
• nr = non-redundant (main sections)
• month = new sequences from the past few
weeks
• ESTs
• human, Drosophila, yeast, or E. coli genomes
• proteins (by automatic translation)
• This is a VERY fast and powerful computer.
BLAST
• Uses word matching like FASTA
• Similarity matching of words (3 aa’s, 11 bases)
• does not require identical words.
• If no words are similar, then no alignment
• won’t find matches for very short sequences
• Does not handle gaps well
• New “gapped BLAST” (BLAST 2) is better
BLAST Algorithm
BLAST Word Matching
Break query into words:
MEAAVKEEISVEDEAVDKNI
MEA EAA AAV AVK VKE KEE EEI EIS ISV ...
Break database sequences into words
Compare Word Lists
Query Word List:
MEA EAA AAV AVK VKL KEE EEI EIS ISV
Compare word lists by hashing (allow near matches)
Database Sequence Word Lists:
RTT SDG SRW QEL VKI DKI LFC AAV PFR …
AAQ KSS LLN RWY GKG NIS WDV KVR DEI …
Find locations of matching words
in database sequences
Query words:
MEA EAA AAV AVK KLV KEE EEI EIS ISV
Database sequences:
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
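The three steps above (break the query into words, hash the database words, look up where they occur) can be sketched as follows. This is a simplified illustration with hypothetical function names and a made-up sequence label `seq1`; it finds exact word matches only, whereas BLAST also accepts near matches scored with a substitution matrix, which is omitted here.

```python
from collections import defaultdict

def words(seq, w=3):
    """All overlapping words of length w, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]

def index_database(database, w=3):
    """Hash table: word -> list of (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in words(seq, w):
            index[word].append((name, pos))
    return index

def exact_word_hits(query, index, w=3):
    """Return (query position, sequence name, database position) per word hit."""
    hits = []
    for qpos, word in words(query, w):
        for name, dpos in index.get(word, ()):
            hits.append((qpos, name, dpos))
    return hits

query = "MEAAVKEEISVEDEAVDKNI"
# "seq1" is a hypothetical label for the first database sequence on the slide.
database = {"seq1": "ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT"}
index = index_database(database)
hits = exact_word_hits(query, index)
```

Each hit is a seed: a position pair from which an alignment can then be extended.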
Extend hits one base at a time
Seq_XYZ: HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
Query:   QSVFDYIYYGCYCGWGLG_GK__PRDA
E-val = 10^-13
•Use two word matches as anchors to build an alignment
between the query and a database sequence.
•Then score the alignment.
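Extending a word hit and scoring the result can be sketched as below. This is a hedged illustration, not NCBI's implementation: it extends without gaps, uses assumed scores (match +1, mismatch -1), and stops once the running score falls `x_drop` below the best score seen. The function names and toy sequences are hypothetical.

```python
def extend_right(query, subject, qi, si, match=1, mismatch=-1, x_drop=3):
    """Extend rightwards from (qi, si) one letter at a time.

    Returns (best score, length of the extension that achieved it).
    """
    score = best = 0
    length = best_len = 0
    while qi + length < len(query) and si + length < len(subject):
        same = query[qi + length] == subject[si + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break  # score dropped too far below the best: give up
    return best, best_len

def extend_seed(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=3):
    """Extend an exact word hit of length w at (qi, si) in both directions."""
    r_score, r_len = extend_right(query, subject, qi + w, si + w,
                                  match, mismatch, x_drop)
    # Leftward extension = rightward extension on the reversed prefixes.
    l_score, l_len = extend_right(query[:qi][::-1], subject[:si][::-1], 0, 0,
                                  match, mismatch, x_drop)
    return w * match + l_score + r_score, qi - l_len, qi + w + r_len

# Toy example: the word "ACG" matches at position 2 in both sequences.
score, q_start, q_end = extend_seed("TTACGTACCA", "GGACGTACTT", 2, 2, 3)
```

Here the extension grows the three-letter seed into the six-letter segment `ACGTAC`.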
HSPs are Aligned Regions
• The results of the word matching and
attempts to extend the alignment are
segments
- called HSPs (High-scoring Segment Pairs)
• BLAST often produces several short HSPs
rather than a single aligned region
BLAST 2 algorithm
• The NCBI’s BLAST website now uses
BLAST 2 (also known as “gapped BLAST”)
• This algorithm is more complex than the
original BLAST
• It requires two word matches close to each
other on a pair of sequences (i.e. with a
gap) before it creates an alignment
Statistical tests
• Evaluate the probability of an event taking
place by chance (at random).
• P-value
• Randomized data
• Distribution under the same setup
• Z-score
• Chebyshev Inequality
BLAST Statistics
• The E value is the expected number of alignments scoring
at least this well by chance in a search of this database
(based on the Karlin-Altschul theorem); for small values it is
approximately equal to the standard P value
• Significant if E < 0.05 (smaller numbers are
more significant)
• An E-value of 1 means that one alignment this good is
expected by chance when a random sequence is
searched against this database.
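The relation between score, search-space size and E-value can be made concrete with the Karlin-Altschul formula E = K·m·n·e^(−λS), where m is the query length, n the database length and S the raw alignment score; the corresponding P value is 1 − e^(−E). The sketch below uses placeholder values for λ and K, which in practice depend on the scoring matrix and gap penalties.

```python
import math

def e_value(score, m, n, lam, K):
    """Expected number of chance alignments with at least this raw score."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one chance alignment, given the E-value."""
    return -math.expm1(-e)  # 1 - exp(-e), accurate for small e

# Placeholder parameters (assumptions, not values from any real matrix).
LAM, K = 0.3, 0.1
e = e_value(score=50, m=300, n=1_000_000, lam=LAM, K=K)
```

Doubling the database size doubles E, and for small E the P value and the E-value are nearly the same number.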
BLAST variants for different searches
(after S. Brenner, Trends Guide to Bioinformatics, 1998)
Program  Query    Database  Comparison     Common use
blastn   DNA      DNA       DNA level      Seek identical DNA sequences and splicing patterns
blastp   Protein  Protein   Protein level  Find homologous proteins
blastx   DNA      Protein   Protein level  Analyze new DNA to find genes and seek homologous proteins
tblastn  Protein  DNA       Protein level  Search for genes in unannotated DNA
tblastx  DNA      DNA       Protein level  Discover gene structure
BLAST is Approximate
• BLAST makes similarity searches very
quickly because it takes shortcuts.
• looks for short, nearly identical “words” (11 bases)
• It also makes errors
• misses some important similarities
• makes many incorrect matches
• easily fooled by repeats or skewed composition
Interpretation of output
• very low E values (e-100) are homologs or
identical genes
• moderate E values are related genes
• a long list of gradually declining E values
indicates a large gene family
• long regions of moderate similarity are more
significant than short regions of high identity
Biological Relevance
• It is up to you, the biologist to scrutinize these
alignments and determine if they are significant.
• Were you looking for a short region of nearly
identical sequence or a larger region of general
similarity?
• Are the mismatches conservative ones?
• Are the matching regions important structural
components of the genes or just introns and
flanking regions?
Borderline similarity
• What to do with matches with E() values in
the 0.5 -1.0 range?
• this is the “Twilight Zone”
• retest these sequences and look for related
hits (not just your original query sequence)
• similarity is transitive:
if A~B and B~C, then A~C
Position Specific Iterated BLAST
• Collect all database sequence segments that
have been aligned with the query sequence with an
E-value below a set threshold (default 0.01)
• Construct position specific scoring matrix for
collected sequences. Rough idea:
• Align all sequences to the query sequence as the
template.
• Assign weights to the sequences
• Construct position specific scoring matrix
• Iterate
Motif finding
• Observation : Some regions have been better
conserved than others during evolution
• Idea: By analyzing the constant and variable
properties of such groups of similar
sequences, it is possible to derive a signature
for a protein family or domain (motifs)
PROSITE patterns
• PROSITE fingerprints are described by regular grammars
• A number of programs can search databases
for PROSITE patterns (for example, the GCG package)
• Example: [EDQH]-x-K-x-[DN]-G-x-R-[GACV]
• Rules:
• Each position is separated by a hyphen
• One character denotes the residue at a given position
• […] denotes a set of allowed residues
• (n) denotes a repeat of n
• (n,m) denotes a repeat between n and m inclusive
Ex. ATP/GTP binding motif: [SG]-x(4)-G-K-[DT]
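Because the patterns follow a regular grammar, the rules listed above map directly onto regular expressions. The converter below is a minimal sketch covering only those rules (hyphen-separated positions, `x` for any residue, `[...]` sets, `(n)` and `(n,m)` repeats); the function name is hypothetical and other PROSITE syntax is not handled.

```python
import re

# One pattern element: a residue, x, or a [set], optionally followed by (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = m.groups()
        if residue in ("x", "X"):
            residue = "."          # x matches any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low  # (n): exactly n repeats
        else:
            repeat = "{%s,%s}" % (low, high)  # (n,m): between n and m
        parts.append(residue + repeat)
    return "".join(parts)

regex = prosite_to_regex("[SG]-x(4)-G-K-[DT]")
match = re.search(regex, "MMGAAAAGKTLL")  # made-up test sequence
```

The resulting expression can be run over every sequence in a database with `re.finditer`.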
Multiple sequence
alignment
Generalizing the Notion of Pairwise Alignment
• Alignment of 2 sequences is represented as a
2-row matrix
• In a similar way, we represent alignment of 3
sequences as a 3-row matrix
A T _ G C G _
A _ C G T _ A
A T C A C _ A
• Score: more conserved columns, better alignment
Alignments = Paths
• Align 3 sequences: ATGC, AATC, ATGC
A   --  T   G   C
A   A   T   --  C
--  A   T   G   C
Alignment Paths
x coordinate  0   1   1   2   3   4
                A   --  T   G   C
                A   A   T   --  C
                --  A   T   G   C
Alignment Paths
• Align the 3 sequences: ATGC, AATC, ATGC
x coordinate  0   1   1   2   3   4
                A   --  T   G   C
y coordinate  0   1   2   3   3   4
                A   A   T   --  C
                --  A   T   G   C
Alignment Paths
x coordinate  0   1   1   2   3   4
                A   --  T   G   C
y coordinate  0   1   2   3   3   4
                A   A   T   --  C
z coordinate  0   0   1   2   3   4
                --  A   T   G   C
• Resulting path in (x,y,z) space:
(0,0,0) (1,1,0) (1,2,1) (2,3,2) (3,3,3) (4,4,4)
Aligning Three Sequences
• Same strategy as
aligning two sequences
• Use a 3-D “Manhattan
Cube”, with each axis
representing a sequence
to align
• For global alignments,
go from source to sink
[Figure: 3-D Manhattan cube with the source and sink vertices marked]
2-D vs 3-D Alignment Grid
[Figure: 2-D edit graph for sequences V and W, next to a 3-D edit graph]
2-D cell versus 3-D Alignment Cell
In 2-D, 3 edges in each unit square
In 3-D, 7 edges in each unit cube
Architecture of 3-D Alignment Cell
(i-1,j,k-1)
(i-1,j-1,k-1)
(i-1,j,k)
(i-1,j-1,k)
(i,j,k-1)
(i,j-1,k-1)
(i,j-1,k)
(i,j,k)
Multiple Alignment: Dynamic Programming
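• $s_{i,j,k} = \max \begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}$
• $\delta(x, y, z)$ is an entry in the 3-D scoring matrix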
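The seven-way recurrence can be written compactly by enumerating the seven non-zero moves in {0,1}³. The sketch below fills the cube for three short sequences; the column score δ is an assumption (sum of pairwise scores with match +1, mismatch −1, letter-gap −1, gap-gap 0), since the slides leave the 3-D scoring matrix unspecified, and the function names are hypothetical.

```python
from itertools import product

def delta(a, b, c, match=1, mismatch=-1, indel=-1):
    """Assumed column score: sum over the three pairs; gap-gap pairs score 0."""
    total = 0
    for x, y in ((a, b), (a, c), (b, c)):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            total += indel
        else:
            total += match if x == y else mismatch
    return total

def align3_score(v, w, u):
    """Score of the best global alignment of three sequences."""
    n, m, l = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    # The 7 incoming edges of a cell: every (di, dj, dk) in {0,1}^3 except (0,0,0).
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    for i, j, k in product(range(n + 1), range(m + 1), range(l + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = v[pi] if di else "-"
            b = w[pj] if dj else "-"
            c = u[pk] if dk else "-"
            s[i][j][k] = max(s[i][j][k], s[pi][pj][pk] + delta(a, b, c))
    return s[n][m][l]

identical = align3_score("ATGC", "ATGC", "ATGC")
example = align3_score("ATGC", "AATC", "ATGC")
```

Each cell looks at 7 predecessors, so the work grows as 7n³ for three sequences of length n.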
Multiple Alignment: Running Time
• For 3 sequences of length n, the run time is $7n^3$;
$O(n^3)$
• For k sequences, build a k-dimensional
Manhattan, with run time $(2^k-1)n^k$; $O(2^k n^k)$
• Conclusion: dynamic programming approach for
alignment between two sequences is easily
extended to k sequences but it is impractical
due to exponential running time.
Profile Representation of Multiple Alignment
    -   A   G   G   C   T   A   T   C   A   C   C   T   G
    T   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   T   C   A   C   -   G   G
    C   A   G   -   C   T   A   T   C   G   C   -   G   G

A       1                   1           .8
C   .6              1           .4  1       .6  .2
G           1   .2                      .2          .4  1
T   .2                  1       .6                  .2
-   .2          .8                          .4  .8  .4
Previously we aligned a sequence against a sequence.
Can we align a sequence against a profile?
Can we align a profile against a profile?
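A profile is simply the per-column frequency of each symbol, so it can be computed directly from the alignment rows. The sketch below rebuilds the profile for the five-row alignment shown on the slide (the function name is hypothetical).

```python
from collections import Counter

def profile(alignment, alphabet="ACGT-"):
    """Per-column symbol frequencies of a multiple alignment."""
    n_rows = len(alignment)
    columns = []
    for column in zip(*alignment):
        counts = Counter(column)
        columns.append({sym: counts[sym] / n_rows for sym in alphabet})
    return columns

alignment = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
prof = profile(alignment)
```

Aligning a sequence against a profile then amounts to scoring each letter against a column's frequencies instead of against a single letter.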
Aligning alignments
• Given two alignments, can we align them?
Alignment 1
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
Alignment 2
w GGACGTACC--
v GGACCT-----
Aligning alignments
• Given two alignments, can we align them?
• Hint: use alignment of corresponding profiles
Combined Alignment
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----
Multiple Alignment: Greedy Approach
• Choose the most similar pair of strings and combine them into a
profile, thereby reducing alignment of k sequences to an
alignment of k-1 sequences/profiles. Repeat
• This is a heuristic greedy method
k sequences:                k-1 sequences/profiles:
u1 = ACGTACGTACGT…          u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…          u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…          …
…                           uk = CCGGCCGGCCGG…
uk = CCGGCCGGCCGG…
Greedy Approach: Example
• Consider these 4 sequences
s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC
Greedy Approach: Example (cont’d)
• There are $\binom{4}{2} = 6$ possible pairwise alignments
s2 GTCTGA
s4 GTCAGC (score = 2)

s1 GATTCA--
s4 G-T-CAGC (score = 0)

s1 GAT-TCA
s2 G-TCTGA (score = 1)

s2 G-TCTGA
s3 GATAT-T (score = -1)

s1 GAT-TCA
s3 GATAT-T (score = 1)

s3 GAT-ATT
s4 G-TCAGC (score = -1)
Greedy Approach: Example (cont’d)
s2 and s4 are closest; combine:
s2    GTCTGA
s4    GTCAGC
s2,4  GTCt/aGa/c (profile)
new set of 3 sequences:
s1    GATTCA
s3    GATATT
s2,4  GTCt/aGa/c
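The first greedy step, scoring all six pairs and picking the closest one, can be sketched as follows. The scoring scheme (match +1, mismatch −1, indel −1) is an assumption; it is consistent with the scores quoted on the slides but is not stated there. The function name is hypothetical.

```python
from itertools import combinations

def global_score(a, b, match=1, mismatch=-1, indel=-1):
    """Optimal global (Needleman-Wunsch) alignment score of two strings."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]

seqs = {"s1": "GATTCA", "s2": "GTCTGA", "s3": "GATATT", "s4": "GTCAGC"}
scores = {(x, y): global_score(seqs[x], seqs[y])
          for x, y in combinations(seqs, 2)}
closest = max(scores, key=scores.get)  # the pair to merge into a profile first
```

After merging the closest pair into a profile, the same step is repeated on the reduced set of sequences and profiles.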
Progressive Alignment
• Progressive alignment is a variation of greedy
algorithm with a somewhat more intelligent
strategy for choosing the order of alignments.
• Progressive alignment works well for close
sequences, but deteriorates for distant
sequences
• Gaps in consensus string are permanent
• Use profiles to compare sequences
ClustalW
• Popular multiple alignment tool today
• ‘W’ stands for ‘weighted’ (different parts of
alignment are weighted differently).
• Three-step process
1.) Construct pairwise alignments
2.) Build Guide Tree
3.) Progressive Alignment guided by the tree
Step 1: Pairwise Alignment
• Aligns each sequence against every other sequence,
giving a similarity matrix
• Similarity = exact matches / sequence length
(percent identity)
      v1    v2    v3    v4
v1    -
v2   .17    -
v3   .87   .28    -
v4   .59   .33   .62    -
(.17 means 17 % identical)
Step 2: Guide Tree
•Create Guide Tree using the similarity matrix
•ClustalW uses the neighbor-joining method
•Guide tree roughly reflects evolutionary
relations
Step 2: Guide Tree (cont’d)
      v1    v2    v3    v4
v1    -
v2   .17    -
v3   .87   .28    -
v4   .59   .33   .62    -
Guide tree leaves, in joining order: v1, v3, v4, v2
Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
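The joining order can be reproduced from the similarity matrix with a simple clustering loop. Note that this sketch uses average-linkage clustering as a stand-in, not the neighbor-joining method that ClustalW actually uses; on this small matrix it yields the same order. The function name is hypothetical.

```python
def guide_tree_order(names, sim):
    """Repeatedly merge the two most similar clusters (average linkage)."""
    clusters = [(name,) for name in names]

    def cluster_sim(a, b):
        values = [sim[frozenset((x, y))] for x in a for y in b]
        return sum(values) / len(values)

    merges = []
    while len(clusters) > 1:
        a, b = max(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda pair: cluster_sim(*pair),
        )
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Similarity matrix from the slide (fraction of identical positions).
sim = {
    frozenset(("v1", "v2")): 0.17,
    frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33,
    frozenset(("v3", "v4")): 0.62,
}
merges = guide_tree_order(["v1", "v2", "v3", "v4"], sim)
```

Each merge corresponds to one profile alignment in the progressive step that follows.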
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Following the guide tree, add in the next
sequences, aligning to the existing alignment
• Insert gaps as necessary
FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
            .   . :   **.      :..  *:.*   *   . *                   **:
Dots and stars show how well-conserved a column is.
Multiple Alignments: Scoring
• Number of matches (multiple longest
common subsequence score)
• Entropy score
• Sum of pairs (SP-Score)
Multiple LCS Score
• A column is a “match” if all the letters in the
column are the same
AAA
AAA
AAT
ATC
• Only good for very similar sequences
Entropy
• Define frequencies for the occurrence of each letter in
each column of multiple alignment
• pA = 1, pT=pG=pC=0 (1st column)
• pA = 0.75, pT = 0.25, pG=pC=0 (2nd column)
• pA = 0.50, pT = 0.25, pC=0.25 pG=0 (3rd column)
• Compute entropy of each column:
$-\sum_{X \in \{A,T,G,C\}} p_X \log p_X$
AAA
AAA
AAT
ATC
Entropy: Example
Best case: the column is (A, A, A, A)
$\text{entropy} = 0$
Worst case: the column is (A, T, G, C)
$\text{entropy} = -\sum \frac{1}{4} \log \frac{1}{4} = -4 \left( \frac{1}{4} \cdot (-2) \right) = 2$
Multiple Alignment: Entropy Score
Entropy for a multiple alignment is the
sum of entropies of its columns:
$\sum_{\text{over all columns}} \left( -\sum_{X=A,T,G,C} p_X \log p_X \right)$
Entropy of an Alignment: Example
column entropy:
-( pAlogpA + pClogpC + pGlogpG + pTlogpT)
A A A
A C C
A C G
A C T
•Column 1 = -[1*log(1) + 0*log0 + 0*log0 + 0*log0]
= 0
•Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0]
= -[ (1/4)*(-2) + (3/4)*(-.415) ] = +0.811
•Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)]
= 4* -[(1/4)*(-2)] = +2.0
•Alignment Entropy = 0 + 0.811 + 2.0 = +2.811
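The worked example can be checked with a few lines of code. This sketch uses base-2 logarithms, as the arithmetic above implies (log(1/4) = −2), and skips symbols with zero frequency, which matches the convention 0·log 0 = 0. Function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (base 2) of the letters in one alignment column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def alignment_entropy(rows):
    """Sum of the column entropies of a multiple alignment."""
    return sum(column_entropy(col) for col in zip(*rows))

rows = ["AAA", "ACC", "ACG", "ACT"]
total = alignment_entropy(rows)
```

Lower totals mean more conserved columns, so a better alignment has a smaller entropy score.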
Multiple Alignment Induces Pairwise
Alignments
Every multiple alignment induces pairwise alignments
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
Sum of Pairs Score(SP-Score)
• Consider pairwise alignment of sequences
ai and aj
imposed by a multiple alignment of k sequences
• Denote the score of this suboptimal (not necessarily
optimal) pairwise alignment as
s*(ai, aj)
• Sum up the pairwise scores for a multiple alignment:
s(a1,…,ak) = Σi,j s*(ai, aj)
Computing SP-Score
Aligning 4 sequences: 6 pairwise alignments
Given a1,a2,a3,a4:
s(a1…a4) = Σ s*(ai,aj) = s*(a1,a2) + s*(a1,a3)
+ s*(a1,a4) + s*(a2,a3)
+ s*(a2,a4) + s*(a3,a4)
SP-Score: Example
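a1 ATG-C-AAT
.  A-G-CATAT
ak ATCCCATTT
To calculate each column, sum over all $\binom{n}{2}$ pairs of sequences:
$s(a_1, \ldots, a_k) = \sum_{i,j} s^*(a_i, a_j)$
Column 1 (A, A, A): each of the three pairs scores 1
Score = 3
Column 3 (G, G, C): the G-G pair scores 1, each G-C pair scores $-\mu$
Score = $1 - 2\mu$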
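The column-by-column calculation can be sketched as follows. The pair scores are assumptions chosen to match the example (match +1, mismatch penalty passed as a negative `mismatch` value, letter-gap −1, gap-gap 0); the function names are hypothetical.

```python
from itertools import combinations

def pair_score(x, y, match=1, mismatch=-1, indel=-1):
    """Score of one pair of symbols in a column (gap-gap scores 0)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return indel
    return match if x == y else mismatch

def column_sp(column, **scoring):
    """Sum-of-pairs score of a single column."""
    return sum(pair_score(x, y, **scoring) for x, y in combinations(column, 2))

def sp_score(rows, **scoring):
    """Sum-of-pairs score of a whole multiple alignment."""
    return sum(column_sp(col, **scoring) for col in zip(*rows))

rows = ["ATG-C-AAT", "A-G-CATAT", "ATCCCATTT"]
columns = list(zip(*rows))
col1 = column_sp(columns[0])                 # A, A, A
col3 = column_sp(columns[2], mismatch=-0.5)  # G, G, C with a penalty of 0.5
```

Because every pair is counted in every column, summing the column scores gives the same total as summing the induced pairwise alignment scores.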
Multiple Alignment: History
1975 Sankoff
Formulated multiple alignment problem and gave dynamic
programming solution
1988 Carrillo-Lipman
Branch and Bound approach for MSA
1990 Feng-Doolittle
Progressive alignment
1994 Thompson-Higgins-Gibson-ClustalW
Most popular multiple alignment program
1998 Morgenstern et al.-DIALIGN
Segment-based multiple alignment
2000 Notredame-Higgins-Heringa-T-coffee
Using the library of pairwise alignments
2004 MUSCLE