Last lecture summary

• Sequence database searching
• exhaustive, heuristic
• BLAST
• How it works, steps, parameters
• BLAST variants, reading frame
1. The query sequence is cut into words of length W. For each word, a list of similar words is created using a substitution matrix.
2. The database sequences are scanned for matches to the words in the list.
3. The similarity is extended on both sides of each matched word to give a high-scoring pair.
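The word-cutting and scanning steps can be sketched in Python as follows. This is a simplified illustration that looks only for exact word matches; real BLAST also scans for the similar words generated with the substitution matrix. The function names and the two example sequences are made up for illustration.

```python
def query_words(query, w):
    """Cut the query into all overlapping words of length w."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def exact_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a query word occurs in subject."""
    # Index every word of the subject sequence by its start positions.
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    hits = []
    for i, word in enumerate(query_words(query, w)):
        for j in index.get(word, []):
            hits.append((i, j))
    return hits


print(query_words("MVHLTPEEK", 3))
print(exact_hits("MVHLTPEEK", "GLSDGEWHLTPE", 3))
```

Each hit would then be the starting point for the extension step.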
New stuff
Multiple sequence alignment
MSA
What is MSA
• Comparison of many (i.e., >2) sequences
• local or global
Why MSA
• Biological sequences often occur in families. Homologous
sequences often retain similar structures and functions.
• related genes within an organism
• genes in various species
• sequences within a population (polymorphic variants)
• MSA reveals more biological information than pairwise
alignment
• two sequences that may not align well to each other can be aligned
via their relationship to a third sequence, thereby integrating
information in a way not possible using only pairwise alignments
Sequence motif
• A short conserved region in DNA, RNA or protein
sequence
• Corresponds to a structural or functional feature in
proteins
• Shared by several sequences, can be generated by MSA
Edgar R.S. et al. Peroxiredoxins are conserved markers of circadian rhythms. Nature 2012 485(7399):459-64
Protein family
• A protein family is a group of evolutionarily-related
proteins.
• Proteins in a family descend from a common ancestor
and typically have similar three-dimensional structures,
functions, and significant sequence similarity.
• Currently, several thousand protein families have been
defined, although ambiguity in the definition of a protein
family leads different researchers to widely varying
numbers
Pfam - http://pfam.xfam.org/
• Database of protein families that includes their
annotations and multiple sequence alignments
Sequence logo
Conserved: BIG letters with few others in that space
Divergent: small letters with many others in that space
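The letter heights in a logo reflect how much information each alignment column carries. A minimal sketch of that calculation is shown below, assuming a 20-letter protein alphabet and ignoring the small-sample correction that logo tools commonly apply. The toy alignment and the function name are made up for illustration.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, gaps ignored."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    # Shannon entropy of the observed residue frequencies.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Maximum possible entropy minus observed entropy.
    return math.log2(alphabet_size) - entropy


alignment = ["WGKV", "WRAV", "WAIL", "FEAF"]  # toy aligned sequences
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
for i, col in enumerate(columns):
    print(i, col, round(column_information(col), 2))
```

A fully conserved column reaches the maximum of log2(20) bits, while a column in which every sequence has a different residue scores much lower.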
Logo visualization of the alignment
• Make a logo from your alignment
• Can be easier to compare
• Nice graphic
• Students love them
• http://weblogo.berkeley.edu/logo.cgi
Doing MSA
• As with aligning a pair of sequences, the difficulty in
aligning a group of sequences varies considerably with
sequence similarity.
• If the amount of sequence variation is minimal, it is quite
straightforward to align the sequences, even without the
assistance of a computer program.
• If the amount of sequence variation is great, it may be
very difficult to find an optimal alignment of the sequences
because so many combinations of substitutions,
insertions, and deletions, each predicting a different
alignment, are possible.
Challenges of the MSA
• Finding an optimal alignment of more than two sequences
that includes matches, mismatches and gaps, and that
takes into account the degree of variation in all of the
sequences at the same time poses a very difficult
challenge.
• A second computational challenge is identifying a
reasonable method of obtaining a cumulative score for the
substitutions in the column of an MSA.
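One common way to obtain a cumulative column score is the sum-of-pairs score: every pair of symbols in a column is scored and the pair scores are added up. The sketch below assumes arbitrary placeholder values for match, mismatch and gap; a real program would take the pair scores from a substitution matrix.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols; two gaps facing each other score 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Sum the pair scores over every column and every pair of sequences."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


msa = ["MVHLT", "MV-LT", "MGHLS"]  # toy alignment, made up for illustration
print(sum_of_pairs(msa))
```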
MSA algorithms
• As with the pairwise sequence comparisons, there are
two types of multiple alignment algorithms
• optimal
• heuristic
Optimal algorithms
• Extension of dynamic programming to multiple sequences
• Exhaustive search
• Produce best alignment
• Computationally expensive
• Not feasible for n>10 sequences of length m>200
residues
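To see why the exhaustive approach does not scale: dynamic programming over n sequences fills an n-dimensional lattice with (m + 1)^n cells, assuming all sequences have length m. The short sketch below only counts cells; the function name is made up for illustration.

```python
def dp_cells(n_sequences, length):
    """Number of cells in the n-dimensional dynamic programming lattice."""
    return (length + 1) ** n_sequences


# Lattice size for sequences of 200 residues as the number of sequences grows.
for n in (2, 3, 5, 10):
    print(n, f"{dp_cells(n, 200):.3e}")
```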
Heuristic algorithms
• Limit the exhaustive search
• Attempt to rapidly find a good, but not necessarily optimal
alignment
• Most popular methods:
• progressive methods (ClustalW)
• start from the most similar sequences and progressively add new
sequences
• iterative methods (MUSCLE)
• make initial crude alignment, then revise it
Progressive sequence alignment
• Progressive alignment is the most commonly used algorithm;
ClustalW is the most commonly used software.
• Latest upgrade – Clustal Omega
• Permits the rapid alignment of even hundreds of
sequences.
• Limitation: the final alignment depends on the order in
which sequences are joined.
• Not guaranteed to provide the most accurate alignments.
ClustalW
• http://www.clustal.org
• EMBOSS – a free open source software analysis package
(European Molecular Biology Open Software Suite) http://emboss.sourceforge.net
• Program emma is a ClustalW wrapper
• A variety of EMBOSS servers hosting emma are available, e.g.
http://embossgui.sourceforge.net/demo/emma.html
• ClustalX – a downloadable stand-alone program offering
a graphical user interface for editing multiple sequence
alignments
• http://www.clustal.org/clustal2/
Sequences to align
>neuroglobin 1OJ6A NP_067080.1 [Homo sapiens]
-------------MERPEPELIRQSWRAVSRSPLEHGTVLFARLFALEPDLLPLFQYNCR
QFSSPEDCLSSPEFLDHIRKVMLVI---DAAVTNVEDLSSLEEYLASLGRKHRAVGVKLS
SFSTVGESLLYMLEKCLGPA-FTPATRAAWSQLYGAVVQAMSRGWDGE---
>rice_globin 1D8U rice Non-Symbiotic Plant Hemoglobin NP_001049476.1 [Oryza sativa (japonica cultivar-group)]
MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMFSF-LR
NSDVP-LEKNPKLKTHAMSVFVMTCEAAAQLRKAGKVTVRDTTLKRLGATHLKYGVGDA
HFEVVKFALLDTIKEEVPADMWSPAMKSAWSEAYDHLVAAIKQEMKPAE--
>soybean_globin 1FSL leghemoglobin P02238 LGBA_SOYBN [Glycine max]
----------MVAFTEKQDALVSSSFEAFKANIPQYSVVFYTSILEKAPAAKDLFSF-LA
NGVDP---TNPKLTGHAEKLFALVRDSAGQLKASGTVVAD----AALGSVHAQKAVTDP
QFVVVKEALLKTIKAAVGDK-WSDELSRAWEVAYDELAAAIKKA-------
>beta_globin 2hhbB NP_000509.1 [Homo sapiens]
----------MVHLTPEEKSAVTALWGKVNVD--EVGGEALGRLLVVYPWTQRFFES-FG
DLSTPDAVMGNPKVKAHGKKVLGAF---SDGLAHLDNLKGTFATLSELHCDKLH--VDPE
NFRLLGNVLVCVLAHHFGKE-FTPPVQAAYQKVVAGVANALAHKYH-----
>myoglobin 2MM1 NP_005359.1 [Homo sapiens]
-----------MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDK-FK
HLKSEDEMKASEDLKKHGATVLTAL---GGILKKKGHHEAEIKPLAQSHATKHK--IPVK
YLEFISECIIQVLQSKHPGD-FGADAQGAMNKALELFRKDMASNYKELGFQG
ClustalW – how does it work?
Three stages
1st stage
Global alignment (Needleman-Wunsch) is used to create
pairwise alignments of every protein pair.
Number of pairwise alignments for n sequences: n(n − 1)/2
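The first stage can be sketched as a Needleman-Wunsch score computed for every pair of sequences. The sketch below assumes a simple match/mismatch score and a linear gap penalty as placeholders, whereas ClustalW itself relies on substitution matrices and separate gap opening and extension penalties. The short sequences and the function names are made up for illustration.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev holds the previous row of the dynamic programming matrix.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def all_pair_scores(seqs):
    """Score every unordered pair of named sequences: n(n - 1)/2 alignments."""
    return {(x, y): nw_score(seqs[x], seqs[y]) for x, y in combinations(seqs, 2)}


seqs = {"s1": "MVHLTPEE", "s2": "MVLTPEE", "s3": "MGLSDGE", "s4": "MALVED"}
scores = all_pair_scores(seqs)
print(len(scores))  # 4 sequences give 4 * 3 / 2 pairs
for pair, score in scores.items():
    print(pair, score)
```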
2nd stage
• A guide tree is calculated from these scores.
• A guide tree is a template used in the third stage of
ClustalW to define the order in which sequences are
added to multiple alignment.
• The guide tree is written in Newick format.
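Newick notation writes a tree as nested parentheses terminated by a semicolon. A minimal sketch of producing such a string from a nested tuple is given below. Branch lengths are omitted, and the tree topology is a hypothetical example, not the guide tree ClustalW would compute for these sequences.

```python
def to_newick(tree):
    """Serialize a nested tuple of leaf names into a Newick string."""
    def render(node):
        if isinstance(node, str):
            return node
        return "(" + ",".join(render(child) for child in node) + ")"
    return render(tree) + ";"


# Hypothetical topology, chosen only to illustrate the notation.
guide_tree = (("beta_globin", "myoglobin"), ("rice_globin", "soybean_globin"))
print(to_newick(guide_tree))
```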
The construction of a guide tree
• Unweighted Pair Group Method with Arithmetic Mean
(UPGMA)
• A simple hierarchical clustering method
• How does it work?
http://www.southampton.ac.uk/~re1u06/teaching/upgma/
• Neighbor joining
UPGMA
based on A Tutorial on Clustering Algorithms
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
Similarity
Let’s describe each of the four objects (A, B, C, D) by a vector containing:
• The size of the object (‘b’ or ‘s’)
• The color of the inner area (‘g’ or ‘b’)
• The color of the border line (‘r’ or ‘o’)
• The width of the border line (3 or 6)
Similarity
The more similar two objects are, the more features they share.
A = (b, g, r, 6)
B = (b, b, o, 6)
C = (b, b, o, 3)
D = (s, b, r, 6)
Similarity
Let’s score the vector pairs feature by feature.
Match = 1
Mismatch = 0
The similarity of a pair is the sum of its four feature scores.
Pairs to score: AB, AC, AD, BC, BD, CD
Similarity and distance
While the similarity between two objects is the number of features they share, the distance is the number of features in which they differ. With four features, distance = 4 − similarity.
A = (b, g, r, 6)
B = (b, b, o, 6)
C = (b, b, o, 3)
D = (s, b, r, 6)
Pair   similarity   distance
AB     2            2
AC     1            3
AD     2            2
BC     3            1
BD     2            2
CD     1            3
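The match/mismatch scoring can be checked with a few lines of Python. The sketch below encodes the four objects as the feature vectors (size, inner color, border color, border width) and counts shared and differing features for every pair; the function names are made up for illustration.

```python
from itertools import combinations

# Feature vectors: (size, inner color, border color, border width).
objects = {
    "A": ("b", "g", "r", 6),
    "B": ("b", "b", "o", 6),
    "C": ("b", "b", "o", 3),
    "D": ("s", "b", "r", 6),
}


def similarity(u, v):
    """Number of features on which the two vectors agree (match = 1)."""
    return sum(1 for x, y in zip(u, v) if x == y)


def distance(u, v):
    """Number of features on which the two vectors differ."""
    return len(u) - similarity(u, v)


for p, q in combinations(objects, 2):
    print(p + q, similarity(objects[p], objects[q]), distance(objects[p], objects[q]))
```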
Cluster analysis
We have data, but we don’t know the classes.
Assign data objects into groups (called clusters) so that data objects from the same cluster are more similar to each other than to objects from different clusters.
Cluster analysis
• Many algorithms for cluster analysis exist.
• We are interested in so-called hierarchical clustering.
• UPGMA is a hierarchical clustering method.
• Hierarchical clustering algorithms connect objects to form
clusters based on their distance. Closest (i.e. most
similar) objects are connected first and then other objects
are gradually added.
• This process is represented graphically by a tree structure
called a dendrogram. In UPGMA, a guide tree is a
dendrogram.
• I will explain the construction of a guide tree (dendrogram)
in UPGMA in the next section.
Distances between six cities (BA = Bari, FL = Florence, MI = Milano, NA = Naples, RM = Rome, TO = Torino):
      BA    FL    MI    NA    RM    TO
BA     0   662   877   255   412   996
FL   662     0   295   468   268   400
MI   877   295     0   754   564   138
NA   255   468   754     0   219   869
RM   412   268   564   219     0   669
TO   996   400   138   869   669     0
The closest pair is MI and TO (138), so they are merged into the cluster MI/TO. The distance from MI/TO to each remaining city is the mean of that city's distances to MI and to TO:
BA to MI/TO: (877 + 996) / 2 = 936.5
FL to MI/TO: (295 + 400) / 2 = 347.5
NA to MI/TO: (754 + 869) / 2 = 811.5
RM to MI/TO: (564 + 669) / 2 = 616.5
Distance matrix after merging MI and TO:
         BA      FL      MI/TO   NA      RM
BA       0       662     936.5   255     412
FL       662     0       347.5   468     268
MI/TO    936.5   347.5   0       811.5   616.5
NA       255     468     811.5   0       219
RM       412     268     616.5   219     0
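The closest pair is now NA and RM (219), so they are merged into NA/RM:
BA to NA/RM: (255 + 412) / 2 = 333.5
FL to NA/RM: (468 + 268) / 2 = 368
MI/TO to NA/RM: (811.5 + 616.5) / 2 = 714
         BA      FL      MI/TO   NA/RM
BA       0       662     936.5   333.5
FL       662     0       347.5   368
MI/TO    936.5   347.5   0       714
NA/RM    333.5   368     714     0
The closest pair is BA and NA/RM (333.5). In UPGMA the distance between two clusters is the mean of all pairwise distances between their members, so clusters of unequal size contribute in proportion to their size:
FL to BA/NA/RM: (662 + 468 + 268) / 3 = 466
MI/TO to BA/NA/RM: (877 + 996 + 754 + 869 + 564 + 669) / 6 ≈ 788.17
            BA/NA/RM   FL      MI/TO
BA/NA/RM    0          466     788.17
FL          466        0       347.5
MI/TO       788.17     347.5   0
The closest pair is FL and MI/TO (347.5):
FL/MI/TO to BA/NA/RM: (466 + 2 × 788.17) / 3 ≈ 680.78
            BA/NA/RM   FL/MI/TO
BA/NA/RM    0          680.78
FL/MI/TO    680.78     0
Dendrogram
Torino → Milano (138)
Rome → Naples (219)
→ Bari (333.5)
Florence → Torino–Milano (347.5)
Join Florence–Torino–Milano and Bari–Rome–Naples (680.78)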
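The whole clustering procedure can be sketched in a few lines of Python. This is a minimal, unoptimized illustration of UPGMA, assuming the cluster-to-cluster distance is the mean of all leaf-to-leaf distances between the two clusters; the function name and the data layout are made up for illustration. It is applied to the six-city distance matrix and prints each merge together with the distance at which it happens.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the merges as (cluster_a, cluster_b, distance) in order.

    dist maps frozenset({x, y}) to the distance between leaves x and y.
    """
    clusters = [(name,) for name in names]

    def cluster_distance(a, b):
        # Mean of all leaf-to-leaf distances between the two clusters.
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


names = ["BA", "FL", "MI", "NA", "RM", "TO"]
rows = [
    [0, 662, 877, 255, 412, 996],
    [662, 0, 295, 468, 268, 400],
    [877, 295, 0, 754, 564, 138],
    [255, 468, 754, 0, 219, 869],
    [412, 268, 564, 219, 0, 669],
    [996, 400, 138, 869, 669, 0],
]
dist = {frozenset((names[i], names[j])): rows[i][j]
        for i in range(6) for j in range(i + 1, 6)}

for a, b, d in upgma(names, dist):
    print("/".join(a), "+", "/".join(b), round(d, 2))
```

Recomputing the means from the original leaf distances at every step is slow for large inputs, but it keeps the averaging rule explicit.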
3rd stage
• The multiple sequence alignment is created in a series of
steps based on the order presented in the guide tree.
• First, select the two most closely related sequences from the
guide tree and create a pairwise alignment.
3rd stage
• The next sequence is either added to the pairwise
alignment (to generate an aligned group of three
sequences, sometimes called a profile) or used in
another pairwise alignment.
• At some point, profiles are aligned with profiles.
• The alignment continues progressively until the root of the
tree is reached, and all sequences have been aligned.
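The order of operations in the third stage can be read off the guide tree by visiting it from the leaves towards the root. The sketch below only lists the steps and their kind (sequence with sequence, sequence with profile, profile with profile); it does not perform any alignment. The tree is a hypothetical nested tuple, not the guide tree ClustalW would compute, and a real program would also follow the branch order given by the distances.

```python
def alignment_steps(tree):
    """List the alignments a progressive method performs for a guide tree.

    The tree is a nested tuple of sequence names; each internal node is one step.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left = visit(node[0])
        right = visit(node[1])
        kind_left = "sequence" if len(left) == 1 else "profile"
        kind_right = "sequence" if len(right) == 1 else "profile"
        steps.append((f"{kind_left}-{kind_right}", left, right))
        # The merged group is passed up the tree as a single profile.
        return left + right

    visit(tree)
    return steps


# Hypothetical guide tree, chosen only to show all three kinds of step.
guide_tree = ((("beta_globin", "myoglobin"), "neuroglobin"),
              ("rice_globin", "soybean_globin"))
for kind, left, right in alignment_steps(guide_tree):
    print(kind, left, "+", right)
```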
Gaps
• “once a gap, always a gap” rule: a gap placed in an early
alignment is kept in all later steps.
• The most closely related pair of sequences is aligned first.
• As further sequences are added to the alignment, there
are many ways that gaps could be included.
• Gaps are often added to the first two (closest) sequences.
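In practice the rule means that a gap needed to accommodate a new sequence is inserted as a whole column into the already aligned group, and the gaps already present stay where they are. A minimal sketch, with a made-up two-sequence profile and a hypothetical helper name:

```python
def insert_gap_column(profile, position):
    """Insert a gap at the same column of every sequence in an aligned group."""
    return [seq[:position] + "-" + seq[position:] for seq in profile]


profile = ["MV-LT", "MVHLT"]  # toy aligned pair with one existing gap
print(insert_gap_column(profile, 2))
```

The existing gap in the first sequence is kept, and both sequences stay the same length.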
Iterative approaches
• Progressive alignment methods have the inherent
limitation that once an error occurs in the alignment
process it cannot be corrected.
• Iterative approaches can overcome this limitation.
• Create an initial alignment and then modify it to try to
improve it.
• e.g. MUSCLE, IterAlign, Praline, MAFFT
MUSCLE
• Since its introduction in 2004, the MUSCLE program of
Robert Edgar has become popular because of its
accuracy and its exceptional speed, especially for multiple
sequence alignments involving a large number of
sequences.
• MUSCLE stands for MUltiple Sequence Comparison by Log-Expectation
• Three stages
MUSCLE
1. Draft alignment
2. Improved alignment
3. Refinement
Edgar, R. C. Nucl. Acids Res. 2004 32:1792-1797; doi:10.1093/nar/gkh340
MUSCLE online
https://www.ebi.ac.uk/Tools/msa/muscle/