Multiple Sequence Alignment Software: ClustalW

Multiple Sequence Alignment
• What is it
• Why do we use it
• How to use it
• Tools
• ClustalW
• Exercise
Multiple Sequence Alignment
• Many genes are represented in highly conserved forms in a wide
range of organisms
• Patterns of change in these gene sequences may be analyzed by
simultaneous alignment of the sequences (identify conserved
regions)
• This is known as multiple sequence alignment (MSA)
• A multiple alignment arranges a set of sequences in a scheme
where positions believed to be homologous are written in a
common column.
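The column view described above can be sketched in a few lines of Python. This is a minimal illustration only: the three sequences and the names `columns` and `conserved_positions` are made up for the example and do not come from any alignment program.

```python
# Toy alignment (hypothetical sequences); '-' marks a gap.
alignment = {
    "seq1": "MKT-AYIAK",
    "seq2": "MKTQAYLAK",
    "seq3": "MRT-AYIAR",
}

def columns(aln):
    """Yield each alignment column as a string, one character per sequence."""
    rows = list(aln.values())
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    for chars in zip(*rows):
        yield "".join(chars)

def conserved_positions(aln):
    """Return 0-based indices of gap-free columns holding a single residue."""
    return [i for i, col in enumerate(columns(aln))
            if "-" not in col and len(set(col)) == 1]

print(conserved_positions(alignment))  # [0, 2, 4, 5, 7]
```

Each string yielded by `columns` is one set of positions believed to be homologous; the fully conserved columns are the simplest kind of conserved region to spot.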
Applications of Multiple Sequence Alignment
• Predict protein function
• Predict protein structure (using structure superposition
programs).
• Predict the evolutionary history of sequences (using phylogenetic
analysis programs).
• Contig Assembly (Shotgun sequences & ESTs)
• Identify new family members
• Design PCR primers for amplification of related sequences
• Database searching with the consensus sequences to identify
other sequences with a similar pattern.
Multiple Sequence Alignment Guidelines
• Select the sequences carefully. Make sure they are members of
the same family and they all share a common ancestor
• Use protein sequences if possible. Translate if necessary and then
convert back to DNA after the alignment.
• Protein sequences are three times shorter and provide a more informative
alphabet
• If there is little signal at the amino acid level, there will be no signal
at the nucleotide level
• If you are interested in non-coding sequences you have no choice,
but beware: DNA alignment is tricky (it needs a very high level of
conservation)
Multiple Sequence Alignment Guidelines Cont.
• Ensure that at least half of the sequences share more than 30%
identity and avoid sequences that have > 90% identity to another
sequence
• An alignment that contains only very similar sequences is not very
informative
• If you make sure that each sequence is between 30 and 70%
identical with half of the sequences in the set you will have made a
reasonable compromise between new information and alignment
quality
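The identity thresholds above can be checked with a small script. This is a sketch under stated assumptions: percent identity is computed here over columns where neither sequence has a gap (other definitions exist), and the function names and toy sequences are hypothetical.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

def flag_redundant(aln, upper=90.0):
    """Return name pairs whose identity exceeds `upper` percent."""
    return [(n1, n2)
            for (n1, s1), (n2, s2) in combinations(aln.items(), 2)
            if percent_identity(s1, s2) > upper]

# Hypothetical pre-aligned sequences, for illustration only.
toy = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQR", "c": "MRTAWLAKQK"}
print(percent_identity(toy["a"], toy["c"]))  # 60.0
print(flag_redundant(toy))                   # [('a', 'b')]
```

Pairs returned by `flag_redundant` are candidates for dropping one member, since near-identical sequences add little new information.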
Multiple Sequence Alignment Guidelines Cont.
• Start with 10-15 sequences and avoid aligning more than 50
sequences (if you do, employ a high level of manual curation)
• Multiple alignment programs are not good at handling large sets of
sequences.
• Visualizing large alignments is difficult, and if an alignment spans more
than one page interpretation can become difficult if not impossible.
• Aligning a lot of sequences is computationally difficult and public
servers have limited resources, so a run may take a long time,
making it difficult for you to fine-tune alignment parameters or try
alternative sequences.
Multiple Sequence Alignment Guidelines Cont.
• Tree building and structure prediction programs do not handle big
alignments well
• Big alignments are difficult to make accurately and are less reliable,
so it is hard to be confident that every sequence you include really
belongs to the family. It is best to start small and
gradually increase the size of the multiple alignment.
• Before adding a sequence to a multiple alignment, you can figure
out whether it is a good choice by doing a pairwise comparison.
Multiple Sequence Alignment Guidelines Cont.
• Use sequences of similar length. Programs have problems
aligning partial and complete sequences.
• Repeated domains are problematic for the alignment programs,
especially if the number of domains is different.
• Name Sequences appropriately
• Never use white space: write clone2 or clone_2, not "clone 2"
• Do not use special symbols; stick to plain letters, numbers and the
underscore
• Do not use names longer than 15 characters
• Use unique names for each sequence
• Use informative names (OSJLBa0001A01f compared to
Main_Clone1)
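The naming rules above are easy to check before submitting sequences. The sketch below is one possible way to do it; `clean_name`, `check_names` and the constants are hypothetical helpers, not part of ClustalW or any server.

```python
import re

MAX_LEN = 15                                # length limit from the guidelines above
VALID = re.compile(r"^[A-Za-z0-9_]+$")      # plain letters, numbers, underscore

def clean_name(name):
    """Replace runs of whitespace or special symbols with a single underscore."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name.strip()).strip("_")

def check_names(names):
    """Return a list of (name, problem) tuples; an empty list means all names pass."""
    problems = []
    seen = set()
    for name in names:
        if not VALID.match(name):
            problems.append((name, "invalid characters"))
        if len(name) > MAX_LEN:
            problems.append((name, "longer than 15 characters"))
        if name in seen:
            problems.append((name, "duplicate"))
        seen.add(name)
    return problems

print(clean_name("clone 2"))                  # clone_2
print(check_names(["clone_2", "clone 2", "clone_2"]))
```

Cleaning a name can create a duplicate (for example "clone 2" and "clone_2" both become clone_2), so it is worth running `check_names` again on the cleaned names.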
EXPASY INTEGRATED BLAST & MSA SERVER
EXPASY INTEGRATED BLAST & MSA SERVER (databases and options)
• Output of search displayed
• Links to Pfam
• Scroll down
• View alignments (helps inform selection)
• Make selections for inclusion in the MSA
• Send your selections options
• Select your sequences in FASTA format
• Example selected sequences
• Note the range of scores and E values selected
ACC #   SwissProt #  Description [Organism]                           Score  E value
P20472  PRVA_HUMAN   Parvalbumin alpha [PVALB] [Homo sapiens          186    9e-47
P80079  PRVA_FELCA   Parvalbumin alpha [PVALB] [Felis silvestris...   162    1e-39
P02626  PRVA_AMPME   Parvalbumin alpha [Amphiuma (Salamand...         109    1e-23
P02619  PRVB_ESOLU   Parvalbumin beta [Esox lucius (Northern pike)]   95     2e-19
P43305  PRVU_CHICK   Parvalbumin, thymic CPV3 (Parvalbumin 3) [G      92     2e-18
P32930  ONCO_HUMAN   Oncomodulin (OM) (Parvalbumin beta) [OCM]        89     2e-17
Q91482  PRV1_SALSA   Parvalbumin beta 1 (Major allergen Sal s 1)...   85     3e-16
P02620  PRVB_MERME   Parvalbumin beta [Merluccius merluccius (Eu...   80     7e-15
P02622  PRVB_GADCA   Parvalbumin beta (Allergen Gad c 1) (Gad c ...   74     5e-13
Multiple Sequence Alignment Software
• ClustalW (Unix, Mac, PC, VMS).
• ClustalX (IGBMC, EBI) (graphical interface) (Unix, Mac, PC, VMS).
• Multalin
• MSA (Unix).
• DIALIGN (Unix).
• DCA (Unix).
• Multiple alignment by randomized iterative strategy (Unix).
• MACAW (Mac, PC).
• T-Coffee (Unix).
• MAFFT (Linux, Unix, Windows XP, Mac OS X).
Multiple Sequence Alignment Online Tools
• ClustalW at EBI (Hinxton, UK). Display and edit alignments with JalView.
• ClustalW, Multalin at PBIL (Lyon, France). Colored alignments and secondary
structure predictions.
• ClustalW, MAP, PIMA at BCM
• MSA, ClustalW, ctree at IBC (St Louis, USA)
• Multalin at INRA (Toulouse, France). Colored alignments.
• ClustalW, DCA, DIALIGN2 at Pasteur (Paris, France)
• ClustalW at EMBL (Heidelberg, Germany). Performs multiple alignment on
homologous sequences detected by BLAST.
• ClustalW at DDBJ (Mishima, Japan)
• MAP (Michigan Tech. Univ., USA)
• ProbModel at CBRG (Zurich, Switzerland)
• DIALIGN2 at BiBiServ (Bielefeld, Germany)
• DCA at BiBiServ (Bielefeld, Germany)
• ITERALIGN (Stanford, USA)
• T-COFFEE (Lausanne, Switzerland)
• MATCH-BOX (Namur, Belgium)
• BLOCK Maker at FHCRC (Washington, USA)
• MEME at SDSC (San Diego, USA)
• MEME at Pasteur (Paris, France)
• PIMA II at BMERC (Boston, USA)
• MAVID at UCB (Berkeley, USA)
Multiple Sequence Alignment Software: ClustalW
• First MSA program that could run on almost any platform
• Most widely used MSA program
• ClustalW is the latest version
• There are many Clustal servers around the world,
most operating the same version, but their different
interfaces provide access to different options.
• It is also available as a stand-alone package.
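For the stand-alone package, a run can be scripted. The sketch below only assembles the command line and runs it if the program is installed. It assumes the executable is named `clustalw2` (on some systems it is `clustalw`) and uses the `-INFILE`, `-ALIGN`, `-OUTFILE` and `-OUTPUT` options from ClustalW's command-line help; the file names and function names are hypothetical.

```python
import shutil
import subprocess

def build_clustalw_command(infile, outfile, executable="clustalw2"):
    """Assemble a ClustalW command line for a full multiple alignment."""
    return [
        executable,
        f"-INFILE={infile}",    # input sequences, e.g. in FASTA format
        "-ALIGN",               # do a full multiple alignment
        f"-OUTFILE={outfile}",  # where to write the alignment
        "-OUTPUT=CLUSTAL",      # Clustal (.aln) output format
    ]

def run_clustalw(infile, outfile, executable="clustalw2"):
    """Run ClustalW if it is installed; raise a clear error otherwise."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    subprocess.run(build_clustalw_command(infile, outfile, executable), check=True)

# Show the command without running it (file names are placeholders).
print(" ".join(build_clustalw_command("parvalbumins.fasta", "parvalbumins.aln")))
```

Check the options against the help text of the installed version before relying on them, since option support differs between releases.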
Multiple Sequence Alignment Software: ClustalW
• ClustalW uses a progressive method to build
its alignments
• It compares two sequences at a time and clusters them
by similarity.
• This clustering resembles a phylogenetic tree and is
called a dendrogram (the .dnd file in the ClustalW output).
[Figure: dendrogram with a root and four leaves A, B, C and D]
• The dendrogram reveals that A and B are more
similar to each other than C and D are
• To make the progressive alignment,
ClustalW follows the dendrogram: it
first aligns A with B, and then
C with D.
• It then treats the resulting alignments
like single sequences and aligns them
two by two.
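The order in which a progressive aligner joins sequences can be illustrated with a toy clustering. This sketch uses simple average-linkage clustering as a stand-in; it is not ClustalW's actual tree-building code, and the distances for A, B, C and D are invented for the example.

```python
from itertools import combinations

def average_distance(members1, members2, dist):
    """Mean pairwise distance between the sequences of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in members1 for b in members2)
    return total / (len(members1) * len(members2))

def guide_tree(names, dist):
    """Return (tree, merge_order); tree is a name or a nested pair of trees."""
    clusters = [(name, [name]) for name in names]   # (tree, member names)
    merge_order = []
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(
                clusters[ij[0]][1], clusters[ij[1]][1], dist),
        )
        (tree_i, mem_i), (tree_j, mem_j) = clusters[i], clusters[j]
        merge_order.append((tree_i, tree_j))
        merged = ((tree_i, tree_j), mem_i + mem_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0], merge_order

# Hypothetical pairwise distances (smaller means more similar).
pairs = {("A", "B"): 0.10, ("C", "D"): 0.20, ("A", "C"): 0.60,
         ("A", "D"): 0.65, ("B", "C"): 0.62, ("B", "D"): 0.66}
dist = {frozenset(k): v for k, v in pairs.items()}

tree, order = guide_tree(["A", "B", "C", "D"], dist)
print(tree)   # (('A', 'B'), ('C', 'D'))
print(order)  # [('A', 'B'), ('C', 'D'), (('A', 'B'), ('C', 'D'))]
```

The merge order is the alignment order: A with B first, then C with D, and finally the two sub-alignments with each other.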
Multiple Sequence Alignment Software: ClustalW
• Pairwise Scores
• These are the pairwise comparisons
ClustalW uses to build its tree
• This section of the output can be ignored
Multiple Sequence Alignment Software: ClustalW
• Shows the alignment
• Can be saved as a text file
• Can view it in color
Multiple Sequence Alignment Software: ClustalW
• The Guide Tree
• Shows the tree that ClustalW uses
to guide its progressive alignment
• It is displayed in Phylip tree format
• A cladogram is a branching diagram (tree)
assumed to be an estimate of a phylogeny
where the branches are of equal length,
thus cladograms show common ancestry,
but do not indicate the amount of evolutionary
"time" separating taxa
Multiple Sequence Alignment Software: ClustalW
• The Phylogram Tree
• A Phylogram is a branching diagram
(tree) assumed to be an estimate of
a phylogeny, branch lengths are
proportional to the amount of inferred
evolutionary change
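The difference between the two tree views shows up directly in the tree file, which stores the branching pattern in nested parentheses with an optional `:length` after each node. The sketch below sums branch lengths from the root to each leaf; it assumes a tree without internal node labels, and the example tree and the name `leaf_depths` are made up for illustration.

```python
import re

TOKEN = re.compile(r"\(|\)|,|;|:[0-9.eE+-]+|[^():,;]+")

def leaf_depths(newick):
    """Return {leaf: summed branch length from the root} for a parenthesized tree."""
    stack = []        # ids of internal nodes that are currently open
    node_len = {}     # internal node id -> branch length above it
    leaf_len = {}     # leaf name -> its own branch length
    ancestors = {}    # leaf name -> ids of enclosing internal nodes
    last = None       # most recently completed node
    next_id = 0
    for tok in TOKEN.findall(newick):
        if tok == "(":
            stack.append(next_id)
            next_id += 1
        elif tok == ")":
            last = ("node", stack.pop())
        elif tok in (",", ";"):
            continue
        elif tok.startswith(":"):
            if last is None:
                continue
            target = leaf_len if last[0] == "leaf" else node_len
            target[last[1]] = float(tok[1:])
        else:
            name = tok.strip()
            if not name:          # skip line breaks between tokens
                continue
            ancestors[name] = list(stack)
            leaf_len.setdefault(name, 0.0)
            last = ("leaf", name)
    return {name: leaf_len[name] + sum(node_len.get(i, 0.0) for i in ids)
            for name, ids in ancestors.items()}

example = "((A:0.5,B:0.5):1.0,(C:0.25,D:0.75):2.0);"
print(leaf_depths(example))  # {'A': 1.5, 'B': 1.5, 'C': 2.25, 'D': 2.75}
```

A phylogram draws each leaf at its summed depth, so C and D end up at different distances from the root; a cladogram of the same tree keeps only the branching pattern and ignores these numbers.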
Interpreting Multiple Sequence Alignments
• Interpreting an alignment is more an art than a science!
• Unlike database searching, there are no E values to tell us how
reliable the alignment is
• Best method of evaluation is based on knowledge of
protein structures.
• Structures contain loops that evolve rapidly
• Loops are softer portions of the protein that connect
its more rigid portions
• Protein structures also contain core regions inside the
protein that act as support walls for the protein.
These support walls evolve less rapidly than the loops
on the surface
• In a good multiple alignment you can expect to find gap-free
blocks that correspond to core regions and gap-rich
regions that correspond to the loops
Interpreting Multiple Sequence Alignments
• How can you tell whether a block is good?
• Take a look at the alignment symbols
• * A star indicates an entirely conserved column
• : A colon indicates columns where all the residues
have roughly the same size and same hydropathy
• . A period indicates columns where the size or hydropathy
has been preserved in the course of evolution
• An average GOOD block is 10-30 amino acids long,
exhibiting at least 1 to 3 stars, five to seven colons
and a few periods
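The rule of thumb for a good block can be turned into a quick check on the symbol line printed under an alignment block. This is a rough sketch: the thresholds follow the numbers above, except that "a few periods" is taken to mean at least one, which is an assumption, and the function names and example line are hypothetical.

```python
from collections import Counter

def score_block(symbols):
    """Count the conservation symbols in the line under an alignment block."""
    counts = Counter(symbols)
    return {
        "length": len(symbols),
        "stars": counts["*"],
        "colons": counts[":"],
        "periods": counts["."],
    }

def looks_good(symbols, has_gaps=False):
    """Apply the rule of thumb: gap free, >= 10 columns, >= 1 star, >= 5 colons."""
    s = score_block(symbols)
    return (not has_gaps
            and s["length"] >= 10
            and s["stars"] >= 1
            and s["colons"] >= 5
            and s["periods"] >= 1)   # "a few periods" read as at least one

block = "*:.::*. ::*:"   # hypothetical symbol line for a 12-column block
print(score_block(block))
print(looks_good(block))   # True
```

A check like this only flags candidate blocks; whether a block is meaningful still depends on what is known about the protein's structure.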
Multiple Sequence Alignment Tools
• BLAST servers with integrated MSAs
• www.expasy.ch/cgi-bin/BLASTEMBnet-ch.pl
  • Extract entire sequences
  • Export sequences in FASTA format
  • Submit sequences to ClustalW
  • Submit sequences to T-Coffee
• http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_blast.html
  • Extract entire sequences
  • Extract sequence fragments
  • Export sequences in FASTA format
  • Submit sequences to ClustalW
• srs.ebi.ac.uk
  • Submit sequences to ClustalW