Powerpoint Slides (seqalign_1999)

Sequence Comparison
& Alignment
Computing in Molecular Biology
Hugues Sicotte
National Center for Biotechnology Information
sicotte@ncbi.nlm.nih.gov
Sequence Comparison
& Alignment
COMPARATIVE ANALYSIS
Sequence alignment
is similar to other
types of comparative
analysis
Involves scoring
similarities and
differences among a
group of related
entities
Finches of the Galápagos Islands observed by
Charles Darwin on the voyage of HMS Beagle
Sequence Comparison
& Alignment
Homology
“Homology is the central concept for all of biology.
Whenever we say that a mammalian hormone is the
‘same’ hormone as a fish hormone, that a human gene
sequence is the ‘same’ as a sequence in a chimp or a
mouse, that a HOX gene is the ‘same’ in a mouse, a fruit
fly, a frog and a human - even when we argue that
discoveries about a worm, a fruit fly, a frog, a mouse, or a
chimp have relevance to the human condition - we have
made a bold and direct statement about homology. The
aggressive confidence of modern biomedical science
implies that we know what we are talking about.”
David B. Wake
Sequence Comparison
& Alignment
Alignment algorithms
model evolutionary
processes
Derivation from a
common ancestor
through incremental
change due to DNA
replication errors,
mutations, damage,
or unequal crossing-over.
COMPARATIVE ANALYSIS
[Figure: tree of sequences descending from the ancestor GATTACCA through substitution, insertion and deletion events: GATGACCA, GATTATCA, GATCATCA, GATTGATCA, GATACCA]
Sequence Comparison
& Alignment
Alignment algorithms
model evolutionary
processes
Derivation from a
common ancestor
through incremental
change
COMPARATIVE ANALYSIS
GATTACCA
GATGACCA
GATTACCA
GATTACCA
GATTATCA
GATCATCA
GATTACCA
GATTGATCA
GATACCA
Only extant sequences are known,
ancestral sequences are postulated.
Sequence Comparison
& Alignment
Alignment algorithms
model evolutionary
processes
Derivation from a
common ancestor
through incremental
change. Mutations
that do not kill the
host may carry over
to the population.
Only rarely are mutations
actively kept or rejected by
natural selection.
COMPARATIVE ANALYSIS
GATTACCA
GATGACCA
GATTACCA
GATTACCA
GATTATCA
GATCATCA
GATTACCA
GATTGATCA
GATACCA
The term homology implies a common
ancestry, which may be inferred from
observations of sequence similarity
Sequence Comparison
& Alignment
Comparative Analysis of Genes
[Figure: species tree relating Bacteria, Yeast, Worm, Fly, Mouse and Human, with divergence times marked at 3000 Myr, 1000 Myr and 500 Myr]
Align Extant Sequences
MSH2_Human  TGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATK
SPE1_DROME  VGTAVLMAHIGAFVPCSLATISMVDSILGRVGASDNIIKGLSTFMVEMIETSGIIRTATD
MSH2_Yeast  VGVISLMAQIGCFVPCEEAEIAIVDAILCRVGAGDSQLKGVSTFMVEILETASILKNASK
MUTS_ECOLI  TALIALMAYIGSYVPAQKVEIGPIDRIFTRVGAADDLASGRSTFMVEMTETANILRNATE
Human Colon Cancer MSH2 gene is homologous to DNA repair proteins
Sequence Comparison
& Alignment
Why Align sequences?
- Finding similar sequences helps determine the properties
and function of a new sequence. (Must be verified
experimentally)
- Conserved positions in homologous sequences hint at
functionally important sites in proteins (active or catalytic
sites, DNA-binding domains, disulfide bridges, structural
bends, hydrophobic pockets, protein-binding domains, …).
-Conserved nucleotides can hint at regulatory elements,
either pre-transcriptional or post-transcriptional.
Sequence Comparison
& Alignment
Sound alignment methods reflect evolution.
DNA Evolution:
- Mutation: Errors in DNA replication or DNA repair.
- Substitutions: replacement of one base by another.
- Deletions/insertions: by DNA mispairing during
replication or unequal crossing over.
- Gene conversion or unequal crossing over: Large
segments of DNA can be inserted/deleted.
- Mutations that do not kill the host are propagated.
Sometimes positive mutations are selected for.
Reference: Wen-Hsiung Li, Molecular Evolution, Sinauer Associates, 1997.
[Figure: bar chart of substitution rate per nucleotide site per billion years (vertical axis 0 to 5) for different gene regions.
Non-Coding: 5' Flank, 5' UTR, introns, 3' UTR, 3' Flank, Pseudogenes.
Coding: non-degenerate, twofold degenerate and 4-fold degenerate sites.]
Different regions evolve at different rates, consistent with evolutionary constraints.
Synonymous versus non-synonymous mutations
Sequence Comparison
& Alignment
Alignment definition and Type:
Alignment:
Each Base is used at most once.
Global Alignment:
All bases aligned with another base or with
a gap (symbol of “-” or sometimes “.”).
G-ATES
GRATED
Local Alignments:
Do not need to align all the bases in all sequences.
Align BILLGATESLIKESCHEESE and GRATEDCHEESE
G-ATESLIKESCHEESE
GRATED-----CHEESE
or
G-ATES  &  CHEESE
GRATED  &  CHEESE
Sequence Comparison
& Alignment
Insertions and
deletions (‘indels’)
are represented
by gaps in
alignments
COMPARATIVE ANALYSIS
GATTATACCA
GATTA---CA
gap of length 3
Sequence Comparison
& Alignment
SEQUENCE ALIGNMENT
Alignment of trypsin sequences from mouse and crayfish
An alignment
provides a mapping
of residues in one
sequence onto
those of another
Conserved residues are often of structural or functional importance
[Annotation marks above the alignment: S-S and *]
Mouse     IVGGYNCEENSVPYQVSLNS-----GYHFCGGSLINEQWVVSAGHCYK-------SRIQV
Crayfish  IVGGTDAVLGEFPYQLSFQETFLGFSFHFCGASIYNENYAITAGHCVYGDDYENPSGLQI

Mouse     RLGEHNIEVLEGNEQFINAAKIIRHPQYDRKTLNNDIMLIKLSSRAVINARVSTISLPTA
Crayfish  VAGELDMSVNEGSEQTITVSKIILHENFDYDLLDNDISLLKLSGSLTFNNNVAPIALPAQ

Mouse     PPATGTKCLISGWGNTASSGADYPDELQCLDAPVLSQAKCEASYPG-KITSNMFCVGFLE
Crayfish  GHTATGNVIVTGWG-TTSEGGNTPDVLQKVTVPLVSDAECRDDYGADEIFDSMICAGVPE

Mouse     GGKDSCQGDSGGPVVCNG----QLQGVVSWGDGCAQKNKPGVYTKVYNYVKWIKNTIAAN
Crayfish  GGKDSCQGDSGGPLAASDTGSTYLAGIVSWGYGCARPGYPGVYTEVSYHVDWIKANAV--
Figure 7.1
Sequence Comparison
& Alignment
SEQUENCE ALIGNMENT
Alignment of trypsin sequences from mouse and crayfish
Conserved positions are often of functional importance.
Alignment of trypsin proteins of mouse (Swiss-Prot P07146) and crayfish
(Swiss-Prot P00765). Identical residues are highlighted red and underlined.
Indicated above the alignment are three disulfide bonds (-S-S-) whose
participating cysteine residues are conserved, amino acids whose side chains
are involved in the charge relay system (asterisk) and the active-site residue
which governs substrate specificity (diamond). The other conserved positions
have no known role. These residues could be conserved by coincidence
or could have some unknown structural role. Figure 7.1
Sequence Comparison
& Alignment
SEQUENCE ALIGNMENT
Human zeta crystallin vs
E.coli quinone
oxidoreductase
Stars indicate identical residues and
dots indicate conservative substitutions
CLUSTAL W (1.7) multiple sequence alignment
Human-Zcr MATGQKLMRAVRVFEFGGPEVLKLRSDIAVPIPKDHQVLIKVHACGVNPVETYIRSGTYS
Ecoli-QOR ------MATRIEFHKHGGPEVLQA-VEFTPADPAENEIQVENKAIGINFIDTYIRSGLYP
:
:...:.******:
::: . * :::: :: :* *:* ::****** *.
Human-Zcr RKPLLPYTPGSDVAGVIEAVGDNASAFKKGDRVFTSSTISGGYAEYALAADHTVYKLPEK
Ecoli-QOR -PPSLPSGLGTEAAGIVSKVGSGVKHIKAGDRVVYAQSALGAYSSVHNIIADKAAILPAA
* **
*::.**::. **.... :* ****. :.: *.*:.
... **
Human-Zcr LDFKQGAAIGIPYFTAYRALIHSACVKAGESVLVHGASGGVGLAACQIARAYGLKILGTA
Ecoli-QOR ISFEQAAASFLKGLTVYYLLRKTYEIKPDEQFLFHAAAGGVGLIACQWAKALGAKLIGTV
:.*:*.** : :*.* * :: :*..*..*.*.*:***** *** *:* * *::**.
Human-Zcr GTEEGQKIVLQNGAHEVFNHREVNYIDKIKKYVGEKGIDIIIEMLANVNLSKDLSLLSHG
Ecoli-QOR GTAQKAQSALKAGAWQVINYREEDLVERLKEITGGKKVRVVYDSVGRDTWERSLDCLQRR
** : : .*: ** :*:*:** : ::::*: .* * : :: : :.. . .:.*. *.:
Human-Zcr GRVIVVG-SRGTIEINPRDTMAKES----SIIGVTLFSSTKEEFQQYAAALQAGMEIGWL
Ecoli-QOR GLMVSFGNSSGAVTGVNLGILNQKGSLYVTRPSLQGYITTREELTEASNELFSLIASGVI
* :: .* * *::
. : ::.
: .: : :*:**: : : * : : * :
Human-Zcr KPVIGSQ--YPLEKVAEAHENIIHGSGATGKMILLL
Ecoli-QOR KVDVAEQQKYPLKDAQRAHE-ILESRATQGSSLLIP
* :..* ***:.. .*** *:.. .: *. :*:
Figure 7.2
Sequence Comparison
& Alignment
Score and Statistics
Percent Identity.
Can be misleading.
Score:
A simple quality measure is the “score”. The
score assigns points for each aligned base
(or gap) of the alignment.
identical bases : “match” score
mismatching bases: “mismatch” score
gaps: “gap opening” penalty for starting a gap
“gap extension” penalty for each gap symbol.
Example:
match = +1 , mismatch =-1,
gap opening = -5, gap extension=-1
G-ATESLIKESCHEESE
GRATED-----CHEESE
Score = 10*(+1) + 1*(-1) + (-5 - 1) + (-5 + 5*(-1))
= -7
AND/OR
G-ATES  &  CHEESE
GRATED  &  CHEESE
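The scoring rule on this slide can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `alignment_score` and its keyword arguments are illustrative assumptions. It walks the alignment column by column and charges the gap-opening penalty once per run of gap symbols plus the extension penalty for every gap symbol.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score a gapped pairwise alignment column by column (hypothetical helper)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap = None  # which row carried the gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != prev_gap:       # a new gap starts here: pay the opening penalty once
                score += gap_open
            score += gap_extend       # every gap symbol pays the extension penalty
            prev_gap = row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


print(alignment_score("G-ATESLIKESCHEESE", "GRATED-----CHEESE"))  # -7
```

With the same parameters the two local pieces score -3 (G-ATES over GRATED) and +6 (CHEESE over CHEESE), which illustrates why a local alignment may be preferred when only parts of the sequences are related.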
Sequence Comparison
& Alignment
Which alignment is
“better”?
SCORING SYSTEMS
GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
0 mismatches, 5 gaps
GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
3 mismatches, 1 gap
Sequence Comparison
& Alignment
SCORING SYSTEMS
High penalty for
“opening” a gap
(e.g. G = 5)
Lower penalty for
“extending” a gap
(e.g. L = 1)
GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 5G + 6L = 31
GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 1G + 6L = 11
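As a rough check of the two penalties above, the sketch below counts gap runs and gap symbols in the gapped sequence. It is an illustration only; the helper names `gap_penalty` and `mismatches` are assumptions, and G and L follow the example values on the slide.

```python
import re


def gap_penalty(aligned, G=5, L=1):
    """Penalty = (number of gaps) * G + (number of gap symbols) * L."""
    runs = re.findall(r"-+", aligned)          # each run of '-' is one gap
    return len(runs) * G + sum(len(run) for run in runs) * L


def mismatches(top, bottom):
    """Count aligned columns that hold two different letters (gaps excluded)."""
    return sum(1 for a, b in zip(top, bottom) if "-" not in (a, b) and a != b)


reference = "GCTACTAGCTCTAGCGCGTATAGC"
for gapped in ("GCTACTAG-T-T--CGC-T-TAGC", "GCTACTAGTT------CGCTTAGC"):
    print(gapped, mismatches(gapped, reference), gap_penalty(gapped))
# first alignment:  0 mismatches, penalty 31
# second alignment: 3 mismatches, penalty 11
```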
Sequence Comparison
& Alignment
Mix-and-match
protein modules
confound alignment
algorithms
LOCAL SIMILARITY
Protein modules in coagulation factor XII (F12) and
tissue plasminogen activator (PLAT)
F12:  F2 E F1 E K Catalytic
PLAT: F1 E K K Catalytic
F1, F2: Fibronectin repeats
E: EGF similarity domain
K: Kringle domain
Catalytic: Serine protease activity
Figure 7.3
Sequence Comparison
& Alignment
Mix-and-match
protein modules
confound alignment
algorithms
LOCAL SIMILARITY
Protein modules in coagulation factor XII (F12) and
tissue plasminogen activator (PLAT)
F12:  F2 E F1 E K Catalytic
PLAT: F1 E K K Catalytic
[Callout: modules in reverse order]
F1, F2: Fibronectin repeats
E: EGF similarity domain
K: Kringle domain
Catalytic: Serine protease activity
Figure 7.3
Sequence Comparison
& Alignment
Mix-and-match
protein modules
confound alignment
algorithms
LOCAL SIMILARITY
Protein modules in coagulation factor XII (F12) and
tissue plasminogen activator (PLAT)
F12:  F2 E F1 E K Catalytic
PLAT: F1 E K K Catalytic
[Callout: repeated modules]
F1, F2: Fibronectin repeats
E: EGF similarity domain
K: Kringle domain
Catalytic: Serine protease activity
Figure 7.3
Sequence Comparison
& Alignment
DOT PLOTS
Dot-plot Fitch : Biochem. Genet. (1969)3,99-108
Horizontal axis is
coordinates for
one sequence
Vertical axis is
coordinates for the
other
    C G T A C C G T
A   0 0 0 1 0 0 0 0
C   1 0 0 0 1 1 0 0
G   0 1 0 0 0 0 1 0
T   0 0 1 0 0 0 0 1
Figure 7.4
Sequence Comparison
& Alignment
DOT PLOTS
Dot-plot Fitch : Biochem. Genet. (1969)3,99-108
Horizontal axis is
coordinates for
one sequence
Vertical axis is
coordinates for the
other
Dots can also be scored over a sliding window rather than one position at a time. For
example, a window of 3 nucleotides, scoring 1 for identical
triplets and 0 for all other combinations, yields:
    C G T A C C G T
A   0 0 0 0 0 0
C   1 0 0 0 0 1
G
T
Figure 7.4b
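Both dot-plot variants can be reproduced with a few lines of Python. This is a minimal sketch under the assumption that a dot is scored only for exact identity of the compared windows; the function name `dot_plot` is illustrative and does not come from the slides.

```python
def dot_plot(horizontal, vertical, window=1):
    """Return a matrix with 1 where the windows starting at (row, column) are identical."""
    rows = len(vertical) - window + 1
    cols = len(horizontal) - window + 1
    return [
        [1 if vertical[i:i + window] == horizontal[j:j + window] else 0
         for j in range(cols)]
        for i in range(rows)
    ]


# One position at a time (the first dot-plot example)
for row in dot_plot("CGTACCGT", "ACGT"):
    print(*row)

# Sliding window of 3 nucleotides (Figure 7.4b)
for row in dot_plot("CGTACCGT", "ACGT", window=3):
    print(*row)
```

Larger windows suppress isolated chance matches, at the cost of missing short or imperfect similarities.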
Sequence Comparison
& Alignment
DOT PLOTS
Horizontal axis is
coordinates for
one sequence
Vertical axis is
coordinates for the
other
Tissue Plasminogen Activator (PLAT)
Coagulation Factor XII (F12)
Figure 7.4
Sequence Comparison
& Alignment
DOT PLOTS
K
K
Catalytic
Adjacent dots
merge to form
diagonal segments
Tissue Plasminogen Activator (PLAT)
Plot dots for high
similarity within a
short window
E F1
Coagulation Factor XII (F12)
F2
E F1 E
K
Catalytic
Figure 7.4
Sequence Comparison
& Alignment
DOT PLOTS
Catalytic
K
K
Tissue Plasminogen Activator (PLAT)
Repeated domains
show a
characteristic
pattern
E F1
Coagulation Factor XII (F12)
F2
E F1 E
K
Catalytic
Figure 7.4
Sequence Comparison
& Alignment
PATH GRAPHS
Dot plots suggest
paths through the
alignment space
EGF similarity domains of urokinase plasminogen activator
(PLAU) and tissue plasminogen activator (PLAT)
[Figure: dot plot and path graph for PLAU and PLAT; axis coordinates run from 23 to 72 and from 90 to 137]
Each path is a
unique alignment
Path graphs are
more explicit
representations
EPKKVKDHCSKHSPCQKGGTCVNMP--SGPH-CLCPQHLTGNHCQKEK---CFE
ELHQVPSNCD----CLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYE
Figure 7.5
Sequence Comparison
& Alignment
Best-path problems
are common in
computer science
A best-path
algorithm used for
sequence alignment
is called ‘dynamic
programming’
PATH GRAPHS
Routing a phone call from
Washington DC to San Francisco
Sequence Comparison
& Alignment
DYNAMIC PROGRAMMING
Dynamic Programming Example
Construct an
optimal alignment of these
two sequences:
G A T A C T A
G A T T A C C A
Using these
scoring rules:
Match: +1
Mismatch: -1
Gap: -1
Sequence Comparison
& Alignment
Arrange the
sequence
residues along a
two-dimensional
lattice
Vertices of the
lattice fall
between letters
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
Sequence Comparison
& Alignment
The goal is to find
the optimal path
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
from here
to here
Sequence Comparison
& Alignment
Each path
corresponds to a
unique alignment
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
Which one is
optimal?
Sequence Comparison
& Alignment
The score for a
path is the sum of
its incremental
edges scores
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
A aligned with A
Match = +1
Sequence Comparison
& Alignment
The score for a
path is the sum of
its incremental
edges scores
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
A aligned with T
Mismatch = -1
Sequence Comparison
& Alignment
The score for a
path is the sum of
its incremental
edges scores
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
T aligned with NULL
Gap = -1
NULL aligned with T
Sequence Comparison
& Alignment
Incrementally
extend the path
DYNAMIC PROGRAMMING
        G   A   T   A   C   T   A
    0  -1
G  -1  +1
A
T
T
A
C
C
A
Sequence Comparison
& Alignment
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
        G   A   T   A   C   T   A
    0  -1  -2
G  -1  +1
A  -2
T
T
A
C
C
A
Sequence Comparison
& Alignment
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
        G   A   T   A   C   T   A
    0  -1  -2
G  -1  +1   0
A  -2   0  +2
T
T
A
C
C
A
Sequence Comparison
& Alignment
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
[Figure: score lattice for G A T A C T A (columns) against G A T T A C C A (rows), with further cells filled in]
Sequence Comparison
& Alignment
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
[Figure: score lattice for G A T A C T A (columns) against G A T T A C C A (rows), with further cells filled in]
Sequence Comparison
& Alignment
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
        G   A   T   A   C   T   A
    0  -1  -2  -3  -4  -5
G  -1  +1   0  -1  -2  -3
A  -2   0  +2  +1   0  -1
T  -3  -1  +1  +3  +2  +1
T  -4  -2   0  +2  +2  +1
A  -5  -3  -1  +1  +3  +2
C
C
A
Sequence Comparison
& Alignment
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
        G   A   T   A   C   T   A
    0  -1  -2  -3  -4  -5  -6  -7
G  -1  +1   0  -1  -2  -3  -4  -5
A  -2   0  +2  +1   0  -1  -2  -3
T  -3  -1  +1  +3  +2  +1   0  -1
T  -4  -2   0  +2  +2  +1  +2  +1
A  -5  -3  -1  +1  +3  +2  +1  +3
C  -6  -4  -2   0  +2  +4  +3  +2
C  -7  -5  -3  -1  +1  +3  +3  +2
A  -8  -6  -4  -2   0  +2  +2  +4
Sequence Comparison
& Alignment
Trace-back to get
optimal path and
alignment
DYNAMIC PROGRAMMING
        G   A   T   A   C   T   A
    0  -1  -2  -3  -4  -5  -6  -7
G  -1  +1   0  -1  -2  -3  -4  -5
A  -2   0  +2  +1   0  -1  -2  -3
T  -3  -1  +1  +3  +2  +1   0  -1
T  -4  -2   0  +2  +2  +1  +2  +1
A  -5  -3  -1  +1  +3  +2  +1  +3
C  -6  -4  -2   0  +2  +4  +3  +2
C  -7  -5  -3  -1  +1  +3  +3  +2
A  -8  -6  -4  -2   0  +2  +2  +4
Sequence Comparison
& Alignment
Print out the
alignment
G A - T A C T A
G A T T A C C A
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
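The lattice-filling and trace-back steps of this example can be sketched in Python as follows. This is a minimal illustration using the slide's scoring rules (match +1, mismatch -1, gap -1); the function name and the tie-breaking order in the trace-back (diagonal first, then a gap in the column sequence) are assumptions, and other optimal alignments with the same score may exist.

```python
def needleman_wunsch(cols, rows, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (lattice, aligned cols, aligned rows)."""
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        F[0][j] = j * gap                      # leading gaps along the top edge
    for i in range(1, n + 1):
        F[i][0] = i * gap                      # leading gaps along the left edge
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align two residues
                          F[i - 1][j] + gap,     # gap in the column sequence
                          F[i][j - 1] + gap)     # gap in the row sequence
    # Trace back from the bottom-right corner to the top-left corner
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and rows[i - 1] == cols[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(cols[j - 1]); bottom.append(rows[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append("-"); bottom.append(rows[i - 1]); i -= 1
        else:
            top.append(cols[j - 1]); bottom.append("-"); j -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


lattice, top, bottom = needleman_wunsch("GATACTA", "GATTACCA")
print(lattice[-1][-1])  # 4
print(top)              # GA-TACTA
print(bottom)           # GATTACCA
```

Storing only the scores and re-deriving the path during trace-back keeps the sketch short; implementations often store a pointer per cell instead.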
Sequence Comparison
& Alignment
Two different types of Alignment
Global Alignment methods:
Needleman & Wunsch (J. Mol. Biol. (1970) 48, 443-453):
Problem of finding the best path. Key insight: any partial sub-path that ends at a point along the true optimal path must itself
be the optimal path leading to that point. This provides a
method to create a matrix of path “scores”, each the score of the best path
leading to that point. Trace the optimal path from one end to
the other of the two sequences.
Local Alignment methods:
Smith & Waterman (J. Mol. Biol. (1981) 147, 195-197): Use
Needleman & Wunsch, but report all non-overlapping paths,
starting at the highest scoring points in the path graph.
FASTP (Lipman & Pearson (1985), Science 227, 1435-1441) and
BLAST (Altschul et al. (1990), J. Mol. Biol. 215, 403-410): don’t
report all overlapping paths, but only attempt to find paths if
there are words that are high-scoring. This speeds up the
alignments considerably.
Sequence Comparison
& Alignment
Implementations of
dynamic
programming for
global and local
similarities
GLOBAL & LOCAL SIMILARITY
Optimal global
alignment
Optimal local
alignment
Needleman & Wunsch (1970)
Smith & Waterman (1981)
Sequences align
essentially from
end to end
Sequences align
only in small,
isolated regions
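The difference between the two methods can be illustrated with a small local-alignment sketch in Python. It follows the usual Smith & Waterman recurrence, in which a cell score is never allowed to drop below zero, and it reports only the single best-scoring end point rather than all non-overlapping paths. The function name and the simple linear gap penalty are assumptions made for brevity.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b, plus the lattice cell where it ends."""
    best, best_end = 0, (0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local path may restart anywhere, hence the floor at 0
            value = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(value)
            if value > best:
                best, best_end = value, (i, j)
        prev = cur
    return best, best_end


print(smith_waterman_score("BILLGATESLIKESCHEESE", "GRATEDCHEESE"))  # (6, (20, 12))
```

With these parameters the best local path is the shared CHEESE segment, ending at the last residue of both sequences.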
Sequence Comparison
& Alignment
Score and Statistics
Some amino acid mutations do not affect structure/function
very much. Amino acids with similar physico-chemical and
steric properties can often replace each other.
A good scoring system therefore applies only a small penalty to
mutations to a similar amino acid.
PAM matrices: Point Accepted Mutations. Defined in terms of
a divergence of 1 percent (1 PAM). For distant sequences use
PAM250, while for closer sequences (like DNA) use PAM100.
Some sites accumulate mutations and others don’t, thus use
of the PAM100 matrix doesn’t mean that the sequences
compared were 100% mutated.
BLOSUM: BLOCKS substitution matrices. Started with the
BLOCKS database of multiple alignments only involving
distant sequences. BLOSUM62 means that the proteins
compared were never closer than 62% identity. BLOSUM50
matrices involved alignment of more distant sequences.
BLOSUM matrices (BLOSUM62) are recommended for most
protein alignments.
Sequence Comparison
& Alignment
SCORING SYSTEMS
Some amino acid
substitutions are
more common than
others
Substitution scores
come from an odds
ratio based on
measured
substitution rates
BLOSUM62
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
Figure 7.8
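To make the "odds ratio" idea concrete, here is a small Python sketch of a log-odds score. It assumes the commonly described convention that BLOSUM62 entries are log base 2 odds ratios scaled by 2 (half-bit units) and rounded to integers. The frequencies in the example are made-up numbers for illustration, not values from the BLOCKS database, and `log_odds_score` is a hypothetical helper name.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded, scaled log2 of (observed pair frequency / frequency expected by chance)."""
    return round(scale * math.log2(observed / expected))


# A pair seen twice as often as chance predicts gets a positive score
print(log_odds_score(observed=0.02, expected=0.01))   # 2
# A pair seen half as often as chance predicts gets a negative score
print(log_odds_score(observed=0.005, expected=0.01))  # -2
```

A score of zero therefore means the substitution is observed about as often as expected by chance.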
Sequence Comparison
& Alignment
SCORING SYSTEMS
Identities get
positive scores, but
some are better
than others
BLOSUM62
[Figure: BLOSUM62 substitution matrix, as in Figure 7.8; for example W/W scores 11 while A/A scores 4]
Figure 7.8
Sequence Comparison
& Alignment
SCORING SYSTEMS
Some non-identities
have positive
scores, but most
are negative
BLOSUM62
[Figure: BLOSUM62 substitution matrix, as in Figure 7.8; for example D/E scores 2 while D/W scores -4]
Figure 7.8
Sequence Comparison
& Alignment
Compare one query
sequence against an
entire database
A typical search has
four basic elements
DATABASE SEARCHING
> fasta myquery swissprot -ktup 2
search program: fasta
query sequence: myquery
sequence database: swissprot
optional parameters: -ktup 2
Sequence Comparison
& Alignment
With exponential
database growth,
searches keep
taking more time
DATABASE SEARCHING
> fasta myquery swissprot -ktup 2
searching . . . . . .
Sequence Comparison
& Alignment
The “hit list” gives
titles and scores for
matched sequences
DATABASE SEARCHING
> fasta myquery swissprot -ktup 2
The best scores are:                               initn init1  opt    z-sc E(77110)
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)-    996   996  996  1262.1 0
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)    412   382  395   507.6 1.4e-21
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI    238   133  316   407.4 5.4e-16
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT-    153    98  190   253.1 2.1e-07
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7    163   163  184   244.8 6.1e-07
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN           164   164  170   227.2 5.8e-06
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI    130    91  157   210.3 5.1e-05
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1    125   125  148   199.7 0.0002
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4      42    42  140   191.3 0.00058
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9    128    73  139   188.7 0.00082
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT-     76    76  133   181.0 0.0022
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1      27    27  119   165.2 0.017
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO     66    66  118   163.0 0.022
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO     65    65  116   160.5 0.03
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT-     52    52  117   160.3 0.031
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO     66    66  115   159.3 0.035
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO     66    66  112   155.5 0.057
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN     73    73  112   155.4 0.058
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN     76    76  110   153.8 0.072
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP     58    58  104   138.5 0.51
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE     47    47  103   137.8 0.56
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T     63    63   98   131.3 1.3
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA     58    58   99   129.4 1.6
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA     70    48   91   122.9 3.7
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE     75    50   92   121.9 4.3
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU     36    36   85   121.3 4.7
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC     36    36   84   120.0 5.5
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA     45    45   90   118.9 6.3
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA     48    48   92   117.4 7.6
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED     59    59   89   117.0 8
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC     48    48   97   117.0 8
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO     38    38   83   116.8 8.3
Sequence Comparison
& Alignment
E-value
“Hits” can be sorted according to their E-value or their
score.
The E-value, also known as the EXPECT value,
is a function of score, database size and query
sequence length.
E-value: Number of alignments with a score >= S that
you expect to find if the database were a collection of
random letters.
e.g. For a score of 1, only 1 match is required, and there should
be an enormous number of alignments. One expects to find fewer
alignments with a score of 5, and so on. Eventually, when the
score is big enough, one expects to find an insignificant number
of alignments that could be due to chance.
E-values of less than 1e-6 (1 x 10^-6 in scientific notation) are usually
very good, and for proteins E < 1e-2 is usually considered
significant. It is still possible for a hit with E > 1 to be biologically
meaningful, but more analysis is required to confirm that.
Even for VERY good hits, it is possible that the hit is due to a
biological artifact (sequencing/cloning vector, repeats, low-complexity sequence…)
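How the E-value depends on score, query length and database size can be sketched with the Karlin-Altschul form E = K * m * n * exp(-lambda * S), which applies to ungapped local alignments. This is only an illustration: the values of K and lambda below are placeholders rather than parameters of any real scoring system, the function names are assumptions, and the FASTA program shown earlier estimates its E() values by a different procedure.

```python
import math


def expect_value(score, query_len, db_len, K=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S); K and lam are placeholder values."""
    return K * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


for s in (20, 40, 60, 80):
    e = expect_value(s, query_len=150, db_len=25_000_000)
    print(s, f"{e:.3g}", f"{p_value(e):.3g}")
```

The expected number of chance hits falls exponentially with the score and grows linearly with database size, and for small E the P-value is nearly equal to E.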
Sequence Comparison
& Alignment
E-value
Another type of statistic is the P-value, which, given a
score S for an alignment, is the probability that an
alignment of the query against a database of random
sequences has a score >= S. For gapless alignments the P-value can be computed from theory.
Sometimes the alignment algorithm, or a biologically
complex database, does not allow the computation of a P-value
based on the statistical theory of a uniform database. In this case,
one uses an alternative statistic, the Z-value (e.g. FASTA
suite), which shuffles the query sequence and thus creates many
compositionally identical query sequences. Each random sequence
is then re-queried against the database. When done enough times,
this provides a distribution of scores which is approximately
normally distributed (if lucky) around some mean.
Z-value = (S - mean) / σ
where S = score of alignment and σ = standard deviation.
A Z-value of 3 or greater is good.
[Figure: probability distribution of scores, with the alignment score S shown as a deviation from the mean]
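The shuffling procedure can be sketched in Python as below. This is a simplified illustration, not the FASTA implementation: `score_fn` stands for whatever alignment score is being evaluated (here a toy identity count against one subject sequence), and the function names and the fixed random seed are assumptions.

```python
import random
import statistics


def shuffled_scores(query, score_fn, n=200, seed=0):
    """Score n shuffled copies of the query; composition is preserved by shuffling."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        letters = list(query)
        rng.shuffle(letters)
        scores.append(score_fn("".join(letters)))
    return scores


def z_value(score, background):
    """(score - mean of background scores) / standard deviation of background scores."""
    return (score - statistics.mean(background)) / statistics.stdev(background)


# Toy score: number of identical positions against a fixed subject sequence
subject = "MSFRFGQHLIKPSVVFLKTELSFALVNRKPVV"
identity = lambda q: sum(a == b for a, b in zip(q, subject))

query = "MSFRFGQHLIKPSVVFLKTELSFALVNRKPVV"
background = shuffled_scores(query, identity)
print(z_value(identity(query), background))
```

In a real search each shuffled query would be compared against the whole database, which is why this statistic is expensive to compute.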
Sequence Comparison
& Alignment
Detailed alignments
are shown farther
down in the output
DATABASE SEARCHING
> fasta myquery swissprot -ktup 2
>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395; 52.3% identity in 109 aa overlap
10
20
30
40
50
MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
: X: .:.:: :.:: ::..:::::: : : : :..:: :.:..:::
gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF
10
20
30
40
50
60
gi|170
60
70
80
90
100
110
gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.:
gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK
70
80
90
100
110
120
120
130
140
gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ
..
gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE
130
140
150
160
170
180
>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316; 37.4% identity in 131 aa overlap
gi|170
10
20
30
40
MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER
:.. :. .v^: :.. ..:::: ::.::::::. ::X :
Sequence Comparison
& Alignment
HASHING METHODS
The simplest database
search is one
large dynamic
programming
problem.
Database sequence
For a query of N
letters against a
database of M letters,
it requires MxN
comparisons.
Query sequence
Sequence Comparison
& Alignment
Hashing is a
common method for
accelerating
database searches
Compile “dictionary”
of words from the
query sequence. Put
each word in a lookup table that points
to the original
position in the
sequence. Thus
given one word, you
can know if it is in
the query in a single
operation.
HASHING METHODS
query sequence: MLIIKRDELVISWASHERE
all overlapping words of size 3:
MLI
LII
IIK
IKR
KRD
RDE
DEL
ELV
LVI
VIS
ISW
SWA
WAS
ASH
SHE
HER
ERE
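The word dictionary described on this slide can be sketched with a Python dictionary that maps each word to the list of positions where it starts in the query. The function name `build_word_table` is an illustrative assumption, and positions are counted from 0.

```python
from collections import defaultdict


def build_word_table(query, k=3):
    """Map every overlapping word of size k to its start positions in the query."""
    table = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos)
    return table


table = build_word_table("MLIIKRDELVISWASHERE")
print(len(table))        # 17 distinct words
print(table["KRD"])      # [4]
print("XYZ" in table)    # False: one lookup tells us the word is absent
```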
Sequence Comparison
& Alignment
Index lookup
Each word is assigned a unique integer.
E.g. for a word of 3 letters made up of an alphabet of 20
letters:
1. Assign a code to each letter, Code(l) (0 to 19).
2. For a word of 3 letters L1 L2 L3 the code is
index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3)
3. Have an array, indexed by that code, with a list of the positions that have
that word.
[Figure: array indexed 0, 1, 2, 3, ...; each entry holds the position in the query sequence of the word]
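The index formula can be sketched as follows. The particular letter ordering in `AMINO_ACIDS` is an assumption (any fixed assignment of codes 0 to 19 works), and the function names are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"             # assumed ordering of the 20 letters
CODE = {letter: code for code, letter in enumerate(AMINO_ACIDS)}


def word_index(word):
    """index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3), generalised to any word length."""
    index = 0
    for letter in word:
        index = index * len(AMINO_ACIDS) + CODE[letter]
    return index


def build_position_array(query, k=3):
    """Array with 20^k entries; entry i lists the query positions of the word with index i."""
    positions = [[] for _ in range(len(AMINO_ACIDS) ** k)]
    for pos in range(len(query) - k + 1):
        positions[word_index(query[pos:pos + k])].append(pos)
    return positions


print(word_index("AAA"), word_index("AAC"), word_index("YYY"))   # 0 1 7999
positions = build_position_array("MLIIKRDELVISWASHERE")
print(positions[word_index("MLI")])                               # [0]
```

A plain array gives constant-time lookup at the cost of 20^k entries, which is practical for short words such as k = 2 or 3.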
Sequence Comparison
& Alignment
Building the
dictionary for the
query sequence
requires (N-2)
operations.
The database
contains (M-2)
words, and it takes
only one operation to
see if the word was
in the query.
HASHING METHODS
query
sequence
MLIIKRDELVISWASHERE
MLI
LII
IIK
IKR
all overlapping
KRD
words of size 3
RDE
DEL
ELV
LVI
VIS
ISW
SWA
WAS
ASH
SHE
HER
ERE
Sequence Comparison
& Alignment
HASHING METHODS
Query sequence
Using word hits to
determine where to
search for alignments
fills the dynamic
programming matrix
in (N-2)+(M-2)
operations instead
of MxN.
Database sequence
Scan the database,
looking up words in
the dictionary
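Scanning a database sequence against the query dictionary can be sketched as below. Each hit is recorded as a (query position, database position) pair, and hits that share the same diagonal (database position minus query position) point at a region worth aligning. The function names and the short database string are made up for illustration.

```python
from collections import Counter, defaultdict


def word_hits(query, database, k=3):
    """Return (query_pos, db_pos) for every word of the database found in the query."""
    table = defaultdict(list)
    for qpos in range(len(query) - k + 1):           # (N - k + 1) dictionary insertions
        table[query[qpos:qpos + k]].append(qpos)
    hits = []
    for dpos in range(len(database) - k + 1):        # (M - k + 1) dictionary lookups
        for qpos in table.get(database[dpos:dpos + k], ()):
            hits.append((qpos, dpos))
    return hits


def hits_per_diagonal(hits):
    """Count hits on each diagonal of the dynamic programming matrix."""
    return Counter(dpos - qpos for qpos, dpos in hits)


hits = word_hits("MLIIKRDELVISWASHERE", "XXKRDELVXX")
print(hits)                       # [(4, 2), (5, 3), (6, 4), (7, 5)]
print(hits_per_diagonal(hits))    # Counter({-2: 4})
```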
Sequence Comparison
& Alignment
HASHING METHODS
Query sequence
Use word hits to
determine where to
search for alignments
Database sequence
Scan the database,
looking up words in
the dictionary
FASTA searches in a band
Sequence Comparison
& Alignment
HASHING METHODS
Query sequence
Use word hits to
determine where to
search for alignments
Database sequence
Scan the database,
looking up words in
the dictionary
BLAST extends from word hits
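Extending from a word hit can be sketched as an ungapped extension in both directions that keeps the best running score and stops once the score has dropped too far below it. This is a simplified illustration rather than the BLAST source code: the match/mismatch scores, the drop-off threshold `xdrop`, and the function name are all assumptions.

```python
def extend_hit(query, subject, qpos, spos, k, match=1, mismatch=-1, xdrop=3):
    """Ungapped extension of a word hit; returns (score, query_start, query_end)."""
    def pair(a, b):
        return match if a == b else mismatch

    def extend(step):
        # Walk away from the word in one direction, remembering the best gain seen
        gain = best_gain = best_len = 0
        length = 0
        q = qpos + k if step > 0 else qpos - 1
        s = spos + k if step > 0 else spos - 1
        while 0 <= q < len(query) and 0 <= s < len(subject):
            gain += pair(query[q], subject[s])
            length += 1
            if gain > best_gain:
                best_gain, best_len = gain, length
            elif best_gain - gain >= xdrop:
                break                       # score fell too far below the best: give up
            q += step
            s += step
        return best_gain, best_len

    word_score = sum(pair(query[qpos + i], subject[spos + i]) for i in range(k))
    right_gain, right_len = extend(+1)
    left_gain, left_len = extend(-1)
    return word_score + left_gain + right_gain, qpos - left_len, qpos + k + right_len


score, start, end = extend_hit("TTGATTACATT", "CCGATTACAGG", qpos=2, spos=2, k=3)
print(score, "TTGATTACATT"[start:end])   # 7 GATTACA
```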
Sequence Comparison
& Alignment
Multiple Alignment
FHIT_HUMAN MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV...
APH1_SCHPO MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLV...
HNT2_YEAST MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
A true multiple
alignment method
will align all the
sequences
together at the
same time.
Sequence Comparison
& Alignment
Multiple Alignment
FHIT_HUMAN -----------MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO -----------MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST MILSKTKKPKSMNKP IYFSKFLVT-EQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA -----------MCIF CKIINGEIP-AKVVY EDEHVLAFLDINPRN KGHTLV...
A true multiple
alignment method
will align all the
sequences
together at the
same time.
Unfortunately, there is no formal computationally tractable method for
more than 3 sequences.
There are many approximate methods, such as Progressive multiple
alignment methods.
Sequence Comparison
& Alignment
Progressive Multiple Alignment
HNT2_YEAST
Y866_METJA
FHIT_HUMAN
APH1_SCHPO
Align all pairs of
sequences.
Pairwise alignments: compute distance matrix
             FHIT_HUMAN  APH1_SCHPO  HNT2_YEAST  Y866_METJA
FHIT_HUMAN
APH1_SCHPO   395
HNT2_YEAST   316         380
Y866_METJA   290         300         340
Sequence Comparison
& Alignment
Progressive Multiple Alignment
FHIT_HUMAN
Guide Tree
APH1_SCHPO
HNT2_YEAST
Y866_METJA
Pairwise alignments: compute distance matrix
             FHIT_HUMAN  APH1_SCHPO  HNT2_YEAST  Y866_METJA
FHIT_HUMAN
APH1_SCHPO   395
HNT2_YEAST   316         380
Y866_METJA   290         300         340
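The order in which a progressive method adds sequences can be sketched from the pairwise scores in the matrix above. The greedy rule used here (start with the highest-scoring pair, then repeatedly add the sequence with the best score to any sequence already included) is a simplification chosen for illustration; CLUSTAL W itself builds its guide tree with the neighbor-joining method. The function name is an assumption.

```python
def progressive_order(pair_scores):
    """Greedy alignment order from pairwise similarity scores (simplified guide tree)."""
    def score(a, b):
        return pair_scores.get((a, b), pair_scores.get((b, a)))

    names = {name for pair in pair_scores for name in pair}
    first, second = max(pair_scores, key=pair_scores.get)   # the two closest sequences
    order = [first, second]
    remaining = names - set(order)
    while remaining:
        # Next closest sequence to anything already in the alignment
        nxt = max(remaining, key=lambda n: max(score(n, m) for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


SCORES = {
    ("FHIT_HUMAN", "APH1_SCHPO"): 395,
    ("FHIT_HUMAN", "HNT2_YEAST"): 316,
    ("APH1_SCHPO", "HNT2_YEAST"): 380,
    ("FHIT_HUMAN", "Y866_METJA"): 290,
    ("APH1_SCHPO", "Y866_METJA"): 300,
    ("HNT2_YEAST", "Y866_METJA"): 340,
}
print(progressive_order(SCORES))
# ['FHIT_HUMAN', 'APH1_SCHPO', 'HNT2_YEAST', 'Y866_METJA']
```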
Sequence Comparison
& Alignment
Multiple Alignment
FHIT_HUMAN MSFR
MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV...
FGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO MPKQ
MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLV...
LYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA MCIF
MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
CKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
Align two closest
sequences
This alignment creates a consensus
sequence that is next used to align
subsequent sequences.
From the point of view of this pairwise
alignment, the gap can be inserted anywhere
in the green region (between the 1st M and
base 13 (S)).
Sequence Comparison
& Alignment
Multiple Alignment
FHIT_HUMAN -----------MSF
MS-F RFGQHLIKP-SVVFL
RFGQHLIKP-SVVFL
KTELSFALVNRKPVV
KTELSFALVNRKPVV
PGHVLV... PGHVLV...
APH1_SCHPO -----------MPK
MPKQ LYFSKFPVG-SQVFY
QLYFSKFPVGSQVFY
RTKLSAAFVNLKPIL
RTKLSAAFVNLKPIL
PGHVLV... PGHVLV...
HNT2_YEAST MILSKTKKPKSMNK
MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
PIYFSKFLVTEQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA MCIF
MCIFCKIINGEIP-AKVVYEDEHVLAFLDINPRNKGHTLV...
CKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
Align Next closest
sequence to the
consensus.
Once inserted, gap positions
cannot move because they
are part of the consensus.
Sequence Comparison
& Alignment
Multiple Alignment
FHIT_HUMAN
FHIT_HUMAN -----------MS-F
-----------MSFR RFGQHLIKP-SVVFL
FGQHLIKP-SVVFL KTELSFALVNRKPVV
KTELSFALVNRKPVV PGHVLV...
PGHVLV...
APH1_SCHPO
APH1_SCHPO -----------MPKQ
-----------MPKQ LYFSKFPVG-SQVFY
LYFSKFPVGSQVFY RTKLSAAFVNLKPIL
RTKLSAAFVNLKPIL PGHVLV...
PGHVLV...
HNT2_YEAST
HNT2_YEAST MILSKTKKPKSMNKP
MILSKTKKPKSMNKP IYFSKFLVT-EQVFY
IYFSKFLVTEQVFY KSKYTYALVNLKPIV
KSKYTYALVNLKPIV PGHVLI...
PGHVLI...
Y866_METJA
Y866_METJA MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
-----------MCIF CKIINGEIPAKVVY EDEHVLAFLDINPRN KGHTLV...
Align Next closest
sequence to new
consensus.
Hopefully, the result should be similar
to what a true multiple alignment
method would have yielded. We saw
that the order of alignment determines
the existence of gaps.
Because of the order of alignments, the gap
position cannot be changed to align these two P,
which would have resulted in a higher score.
Sequence Comparison
& Alignment
CLUSTALW is a progressive multiple alignment tool.
- Adaptive gap opening and extension scores make
it relatively insensitive to small changes in gap
parameters.
- Choice of DNA or protein gap penalty alignments.
- Available on the web or on PC/Mac/Unix.
http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html
The uppercase “O” in Options is relevant.
Sequence Comparison
& Alignment
BLAST and BLAST2SEQUENCES
BLAST is a database search engine that uses
hashing to accelerate the search.
blastn (for nucleotides) or
blastp (for proteins)
blastx (translates a nucleotide query in all 6 reading frames
and compares it to a protein database.)
tblastn (compares a protein against a nucleotide database
translated in all 6 reading frames.)
tblastx (compares a nucleotide sequence against a
nucleotide database by translating the query and
database in all 6 reading frames.)
http://www.ncbi.nlm.nih.gov/BLAST/
A pairwise alignment implementation of these
programs is available at:
http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
Sequence Comparison
& Alignment
Query-Anchored Alignments (master-slave)
Clustalw:
Is a multiple alignment program. Every
sequence is aligned to every other one.
Blast:
NOT a multiple alignment program, but may
display query-anchored multiple pairwise
alignments that look like a multiple alignment, but
all the sequences are only aligned to the first
sequence!
Gaps in the query
mean NOTHING can
be aligned to it. Gaps
may optionally be
shown (flat view), or the
entire column
omitted.
Gap in subject sequence:
this column is NOT
aligned together. It is
displayed there for
convenience.
Sequence Comparison
& Alignment
BLAST and BLAST2SEQUENCES
Exercises: Use Entrez to find the
protein sequences with LOCUS names
FHIT_HUMAN
HNT2_YEAST
Use clustalw to align these two sequences,
and WITHOUT LOSING THAT RESULT SCREEN!!!
use pairwise blast to align these two sequences as
well.
EXERCISE: Try to reproduce the example of the
clustalW alignment (the order of input sequences is
not important)
Sequence Comparison
& Alignment
References
Textbook:
"Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins". Edited by Andy D. Baxevanis and B.F. Ouellette.
Readings: chapters 7, 8, 9
http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html