Alignments and alignment reliability
The first critical step in sequence
analysis – the know how
Eyal Privman and Osnat Penn
Tel Aviv University
COST Training School
Rehovot, 2010
What are alignments good for?
 To compare sequences
  • Find homology
  • Similar sequence → similar function
 To learn about sequence evolution
  • Mismatch = point mutation
  • Gap = indel (insertion or deletion)
  • Reconstruct phylogenetic tree
  • Infer selection forces, e.g., detecting positive selection
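The reading of an alignment as "mismatch = point mutation, gap = indel" can be sketched in a few lines of Python. This is only an illustration: the two aligned rows are a toy example, and the function name is made up for this sketch. A run of consecutive gap characters in one row is counted as a single indel event.

```python
def summarize_pairwise_alignment(row_a, row_b):
    """Count mismatch columns and indel events (gap runs) in two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    mismatches = 0
    indels = 0
    previous_gap_row = None  # which row carried the gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a gap-only column carries no information
        gap_row = "a" if a == "-" else "b" if b == "-" else None
        if gap_row is None:
            if a != b:
                mismatches += 1  # interpreted as a point mutation
        elif gap_row != previous_gap_row:
            indels += 1  # a new gap run starts: one insertion or deletion
        previous_gap_row = gap_row
    return mismatches, indels


# Toy aligned rows (illustrative only).
print(summarize_pairwise_alignment("ATGTTT---TAA", "ATGCCCAAATAA"))  # (3, 1)
```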
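Sequence evolution
[Figure: sequence evolution along a tree with time points 30 MYA, 5 MYA, and today, ending in human, chimp, and mouse sequences. Sequences shown include ATGAAATAA, ATGTTTTAA, and ATGCCCAAATAA; the accompanying alignment marks the differences between them as mismatches and gaps.]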
Alignment and phylogeny
are mutually dependent
[Figure: unaligned sequences → sequence alignment → MSA → phylogeny reconstruction, with the label "Inaccurate tree building".]
Alignment and phylogeny
are both challenging
 25% of residues are aligned incorrectly
(based on BAliBASE: a large representative set of proteins)
 5% of tree branches are wrong
(based on simulations of 100 protein sequences)
Making an alignment
 For 2 sequences: use exact methods.
 For more sequences:
  • Exact methods are not feasible (too slow)
  • We use heuristic methods
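As an illustration of an exact method for two sequences, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch algorithm). The slides do not name a specific algorithm or scoring scheme, so both the choice of algorithm and the match, mismatch, and gap scores below are assumptions made for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of b aligned against gaps only

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best score for aligning a[:i] with b[:j].
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


aligned_a, aligned_b, total = needleman_wunsch("ACGT", "AGT")
print(aligned_a, aligned_b, total)  # ACGT A-GT 1
```

The table has (len(a)+1) x (len(b)+1) cells, which is cheap for two sequences. Extending the same exact approach to many sequences multiplies the table dimensions, which is why heuristics are used instead.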
Progressive alignment
First step: compute pairwise distances
Compute the pairwise alignments for all against all
(10 pairwise alignments for the five sequences A, B, C, D, E).
The similarities are converted to distances and stored in a table:

      A    B    C    D
 B    8
 C   15   17
 D   16   14   10
 E   32   31   31   32
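The all-against-all step can be sketched as follows. This is a rough illustration, not what any MSA program actually does: the five toy sequences are invented, and `difflib.SequenceMatcher` from the Python standard library is used only as a stand-in for a real pairwise alignment score. The conversion "distance = 100 x (1 - similarity)" is likewise an assumption for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distance_table(sequences):
    """Return {(name_a, name_b): distance} for every unordered pair."""
    table = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2):
        # Stand-in similarity in [0, 1]; a real pipeline would use an alignment score.
        similarity = SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
        table[(name_a, name_b)] = round(100 * (1 - similarity))
    return table


# Hypothetical toy sequences, for illustration only.
toy_sequences = {
    "A": "ATGTTTTAA",
    "B": "ATGTTCTAA",
    "C": "ATGCCCAAATAA",
    "D": "ATGCCCAAGTAA",
    "E": "GGGCATCCGTTG",
}
table = pairwise_distance_table(toy_sequences)
print(len(table))  # 5 sequences give 5 * 4 / 2 = 10 pairs
```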
Second step:
build a guide tree
Cluster the sequences to create a tree (guide tree):
• it represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree
The guide tree is imprecise and is NOT the tree which
truly describes the evolutionary relationship
between the sequences!
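One simple way to cluster a distance table into a guide tree is average-linkage clustering (UPGMA). The slides do not say which clustering method is used, so UPGMA is an assumption here; the distances are the ones from the slide's table for sequences A to E. The tree is represented as nested tuples.

```python
from itertools import combinations

# Pairwise distances for sequences A to E, as given in the slide's table.
DIST = {
    ("A", "B"): 8,
    ("A", "C"): 15, ("B", "C"): 17,
    ("A", "D"): 16, ("B", "D"): 14, ("C", "D"): 10,
    ("A", "E"): 32, ("B", "E"): 31, ("C", "E"): 31, ("D", "E"): 32,
}


def leaf_distance(x, y):
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def leaves(node):
    """Leaf names under a node (a node is a name or a pair of nodes)."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def cluster_distance(a, b):
    """Average distance over all leaf pairs (the UPGMA criterion)."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(leaf_distance(x, y) for x, y in pairs) / len(pairs)


def build_guide_tree(names):
    clusters = list(names)
    while len(clusters) > 1:
        # Merge the two closest clusters into one new node.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters = [c for c in clusters if c != a and c != b]
        clusters.append((a, b))
    return clusters[0]


guide_tree = build_guide_tree(["A", "B", "C", "D", "E"])
print(guide_tree)  # ('E', (('A', 'B'), ('C', 'D')))
```

With these distances, A and B are joined first (distance 8), then C and D (distance 10), then the two pairs, and E is attached last.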
Third step: align sequences in a bottom up order
1. Align the most similar (neighboring) pairs
2. Align pairs of pairs
3. Align sequences clustered to pairs of pairs
deeper in the tree
[Figure: guide tree with leaves A, B, C, D, E next to Sequence A to Sequence E.]
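The bottom-up order can be read off a guide tree with a post-order traversal. The sketch below assumes a guide tree written as nested tuples in which A and B are neighbors, C and D are neighbors, and E joins last; this shape is an assumption consistent with the distance table, and the function name is made up for this illustration. The code only lists the order of the merge steps and does not perform the alignments themselves.

```python
def alignment_order(tree, steps=None):
    """Post-order walk of a nested-tuple tree.

    Returns (leaves_under_tree, steps), where steps lists the groups
    that are aligned to each other, in the order they are processed.
    """
    if steps is None:
        steps = []
    if isinstance(tree, str):
        return [tree], steps
    left, _ = alignment_order(tree[0], steps)
    right, _ = alignment_order(tree[1], steps)
    steps.append(("".join(left), "".join(right)))  # align the two sub-alignments
    return left + right, steps


# Assumed guide tree shape for sequences A to E.
guide_tree = ((("A", "B"), ("C", "D")), "E")
_, steps = alignment_order(guide_tree)
for left, right in steps:
    print(f"align {left} with {right}")
# align A with B
# align C with D
# align AB with CD
# align ABCD with E
```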
Multiple sequence alignment
(MSA)
Sequences A to E → pairwise distance table → guide tree → iterative progressive alignment → MSA
Multiple sequence alignment
(MSA)
Several advanced MSA programs are available.
Today we will use two:
 MAFFT – fastest and one of the most accurate
 PRANK – distinct from all other MSA programs because of its
correct treatment of insertions/deletions
MAFFT
MAFFT: a novel method for rapid multiple sequence
alignment based on fast Fourier transform
Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma and Takashi Miyata
Nucleic Acids Research, 2002, Vol. 30, No. 14, 3059-3066
© 2002 Oxford University Press
 Web server & download:
http://align.bmr.kyushu-u.ac.jp/mafft/online/server/
 Efficiency-tuned variants
 quick & dirty or slow but accurate
Choosing a MAFFT strategy
quick & dirty
slow but accurate
E-INS-i
oooooooooXXX------XXXX---------------------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXooooooooooooo
---------XXXXXXXXXXXXXooo------------------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX-------------
-----ooooXXXXXX---XXXXooooooooooo----------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooo
---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX-------------
---------XXXXX----XXXX---------------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo--------
L-INS-i
ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------
G-INS-i
XXXXXXXXXXX-XXXXXXXXXXXXXXX
XX-XXXXXXXXXXXXXXX-XXXXXXXX
XXXXX----XXXXXXXX---XXXXXXX
XXXXX-XXXXXXXXXX----XXXXXXX
XXXXXXXXXXXXXXXX----XXXXXXX
MAFFT output
A colored view of the alignment
Saving the output
 Choose a format: Clustal, Fasta,
or click "Reformat" to convert to
a selection of other formats
 Save page as a text file
e.g. save as "phylip" file and upload
to PhyML for reconstructing the tree
PhyML: tree reconstruction
The most widely used maximum likelihood (ML) program
 Web server & download: http://www.atgc-montpellier.fr/phyml/
PRANK
Classical alignment errors for HIV env
PRANK
 Web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/
PRANK output
If you need a different format – copy the
results to the READSEQ sequence converter:
http://www-bimas.cit.nih.gov/molbio/readseq/
1. Download and save the sequences file from Osnat's homepage
(you can google "Osnat Penn" and look for the workshop
materials under "Teaching"). Save the file as "trim5a.AA.fas"
(File → "Save page as"). This file contains 20 protein sequences
in FASTA format.
2. Run the PRANK web server to create a protein alignment:
a. In the "Default alignment" section browse for
"trim5a.AA.fas".
b. Run (press the "Start alignment" button).
3. While you wait: copy the sequences into the MAFFT web server
and run the "automatic" "moderately accurate" strategy. Which
strategy did MAFFT choose for you? Click on the "Fasta
format" link, save as "trim5a.AA.mafft.aln" (File → "Save
page as"), and try the "Jalview" button.
4. When PRANK finishes click on the "Show Fasta file" button,
and save the MSA under the name "trim5a.AA.prank.aln".
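Once the files are saved, a quick sanity check is to count the sequences and confirm that all rows of each alignment have the same length. The sketch below is a minimal FASTA reader written for this purpose, not part of MAFFT or PRANK; it runs on a small in-memory example, and the commented lines show how it might be applied to the files saved in the exercise.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}; later duplicates overwrite earlier ones."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


def is_aligned(records):
    """True if all sequences have the same length, as rows of an MSA must."""
    return len({len(seq) for seq in records.values()}) == 1


# Small in-memory example (toy data).
example = """>seq1
MKT-AYIA
>seq2
MKTQAY-A
"""
records = parse_fasta(example)
print(len(records), is_aligned(records))  # 2 True

# Possible use on the exercise files (assuming they are in the working directory):
# with open("trim5a.AA.mafft.aln") as handle:
#     mafft = parse_fasta(handle.read())
# print(len(mafft), is_aligned(mafft))
```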
Sources of alignment errors
Progressive alignment algorithms are greedy heuristics
 Co-optimal solutions
 Heads-or-Tails (HoT) scores (Landan & Graur 2007)
 Guide-tree errors
 GUIDANCE scores (Penn, Privman et al. MBE 2010)
GUIDANCE: Guide-tree based alignment
confidence scores
Base MSA → bootstrap sampling of NJ trees (Tree 1, Tree 2, …, Tree 100) → progressive alignment on each tree (MSA 1, MSA 2, …, MSA 100) → GUIDANCE scores, from 1 (confident) to 0 (uncertain)
Penn, Privman et al. MBE. 2010
http://guidance.tau.ac.il
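The idea of scoring the base MSA against the MSAs built from alternative guide trees can be sketched as follows. This is a simplified illustration, not the GUIDANCE implementation: it scores each column of the base MSA by the fraction of alternative MSAs that contain exactly the same column, and the toy alignments stand in for the 100 bootstrap MSAs. All function names are made up for this sketch.

```python
def columns_as_residue_indices(msa):
    """Describe each column by the residue index each sequence contributes (None for a gap)."""
    counters = [0] * len(msa)
    columns = []
    for column in zip(*msa):
        entry = []
        for row, char in enumerate(column):
            if char == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        columns.append(tuple(entry))
    return columns


def column_support(base_msa, alternative_msas):
    """For each base column, the fraction of alternative MSAs that reproduce it exactly."""
    alternative_sets = [set(columns_as_residue_indices(m)) for m in alternative_msas]
    return [
        sum(column in s for s in alternative_sets) / len(alternative_sets)
        for column in columns_as_residue_indices(base_msa)
    ]


# Toy data: the same two sequences (ACGT and AGCT) aligned in two ways.
base = ["ACG-T", "A-GCT"]
alternatives = [["ACG-T", "A-GCT"], ["ACG-T", "AG-CT"]]
print(column_support(base, alternatives))  # [1.0, 0.5, 0.5, 1.0, 1.0]
```

Columns that every alternative alignment reproduces score 1 (confident); columns that depend on the guide tree score lower (uncertain).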
[Figure: GUIDANCE score per alignment column, colored from confident to uncertain, across the extracellular, transmembrane, and cytoplasmic domains. Panel (a) labels: HIV1 group M, SIV chimp, HIV1 group N, HIV1 group O, SIV gorilla, SIV cerco. Panel (b) labels: HIV1 group M, SIV chimp, HIV1 group O.]
1. Run the GUIDANCE web server to calculate confidence scores for
the MAFFT alignment:
a. In the "Upload your sequence file" window browse for
"trim5a.AA.fas".
b. Choose "Amino Acids" in the "Sequences Type" option.
c. To speed up the run, change the "Number of bootstrap
repeats" in the "Advanced options" section to 30. Note that
such a low number is not recommended for real analyses.
d. Run (press the "Submit" button).
Detecting
selection forces
 Positive selection
Empirical findings
Variation among genes:
"Important" proteins evolve more slowly
than "unimportant" ones
Example: histone 3 protein
Empirical findings
Variation among sites:
Functional sites evolve more slowly
than nonfunctional sites
Silent and non-silent mutations
Silent:
UUU -> UUC
(both encode
phenylalanine)
Non-silent:
UUU -> CUU
(phenylalanine to leucine)
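The distinction can be checked mechanically by translating the codon before and after the change. The sketch below builds the standard genetic code from the usual compact 64-letter string and classifies a single codon change; the function names are made up for this illustration, and RNA codons (with U) are converted to DNA letters (T) internally.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(codon):
    """Translate one DNA or RNA codon into a one-letter amino acid code."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def classify_substitution(codon_before, codon_after):
    """Return 'silent' if the encoded amino acid is unchanged, else 'non-silent'."""
    if translate(codon_before) == translate(codon_after):
        return "silent"
    return "non-silent"


print(classify_substitution("UUU", "UUC"))  # silent (F -> F)
print(classify_substitution("UUU", "CUU"))  # non-silent (F -> L)
```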
For most proteins, the rate of silent
substitutions is much higher
than the non-silent rate.
This is called purifying selection
(= conservation).
There are rare cases where the non-silent
rate is much higher than the silent rate.
This is called positive
selection.
Positive Selection
Examples:
 Pathogen proteins evading the host immune
system
 Proteins of the immune system detecting
pathogen proteins
 Pathogen proteins that are drug targets
 Proteins that are products of gene duplication
 Proteins involved in the reproductive system
http://selecton.tau.ac.il
Selecton results
False positive predictions
 Selecton uses an MSA as input
 The MSA may contain unreliable regions
→ Errors in Selecton computations
→ Errors in the positive selection inference
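One way to limit such errors is to drop low-confidence columns before running the downstream analysis. The sketch below assumes that one confidence score per column is already available (for example from GUIDANCE); the cutoff value, the toy alignment, and the function name are all illustrative assumptions, not settings taken from Selecton or GUIDANCE.

```python
def mask_unreliable_columns(msa, column_scores, cutoff=0.93):
    """Keep only columns whose confidence score is at least `cutoff`.

    Returns (filtered_rows, kept_column_indices). The cutoff is an
    illustrative value and should be chosen for the analysis at hand.
    """
    if any(len(row) != len(column_scores) for row in msa):
        raise ValueError("need exactly one score per alignment column")
    kept = [i for i, score in enumerate(column_scores) if score >= cutoff]
    filtered = ["".join(row[i] for i in kept) for row in msa]
    return filtered, kept


# Toy alignment and toy per-column scores.
msa = ["ACG-T", "A-GCT"]
scores = [1.0, 0.5, 0.5, 1.0, 1.0]
filtered, kept = mask_unreliable_columns(msa, scores)
print(filtered, kept)  # ['A-T', 'ACT'] [0, 3, 4]
```

Keeping the list of retained column indices makes it possible to map any site reported by the downstream analysis back to its position in the original alignment.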
1. Go to the GUIDANCE results of the last exercise.
2. Which columns are not well aligned? Are these sites
also predicted to evolve under positive selection?
See Selecton results in:
http://selecton.tau.ac.il/results/1268662868/colors.html
Summary
 Different alignment programs may produce
different MSAs.
 Alignment uncertainty may cause errors in
downstream analyses such as positive
selection analysis.
 GUIDANCE can detect alignment errors.
Thanks for your attention!