Phylogenetic Methods 1

An Introduction to Phylogenetic Methods
Part one
Dr Laura Emery
Laura.Emery@ebi.ac.uk
www.ebi.ac.uk/training
Objectives
• After this tutorial you should be able to…
• Discuss a range of methods for phylogenetic inference, their
advantages, assumptions and limitations
• Implement some phylogenetic methods using publicly
available software
• Appreciate some approaches for assessing branch support
and selecting an appropriate substitution model
Outline
• Alignment for phylogenetics
• Phylogenetics: The general approach
• Phylogenetic Methods (1 – simple methods)
• Assessing Branch Support
BREAK
• Substitution Models
• Phylogenetic Methods (2 – statistical inference)
• Deciding which model to use (hypothesis testing)
• Software
Alignment for phylogenetics
• Phylogenetic analyses are typically applied to alignments
of sequence data
• Occasionally other data such as morphological traits are
used (e.g. when no sequence data is available)
• Alignments must contain homologous sequences
• We assume that sites in the same column in an alignment
are homologous
Alignment for phylogenetics
Columns in alignments should be homologous
Figures Benjamin Redelings
Phylogenetics: The general approach
• We want to find the tree that best explains our aligned
sequences
• We need to be able to define “best explains”
• we need a model of sequence evolution
• we need a criterion (or set of criteria) to use to choose
between alternative trees
• then evaluate all possible trees
(NB: if N = 20, there are about 2 x 10^20 possible unrooted trees!)
• or take a short cut
Paul Sharp
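The size of tree space quoted above can be checked with a short sketch. It assumes strictly bifurcating unrooted trees with labelled tips, for which the count is the standard product (2N-5) x (2N-7) x ... x 3 x 1. The function name is made up for this illustration.

```python
def num_unrooted_trees(n_taxa):
    """Number of distinct unrooted, bifurcating, tip-labelled trees.

    Assumes the standard counting formula (2n - 5)!! for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Because the count grows so quickly, exhaustive evaluation is only practical for a handful of sequences, which is why heuristic short cuts are used.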
There is only one true tree
• The true tree refers to what actually happened in the
evolutionary past
• All methods attempt to reconstruct the true phylogeny
• Even the best method may not give you the true tree
Methodological approaches
1. Distance matrix methods (pre-computed distances)
• UPGMA: assumes a perfect molecular clock. Sokal & Michener (1958)
• Minimum evolution (e.g. Neighbor-joining, NJ). Saitou & Nei (1987)
2. Maximum parsimony. Fitch (1971)
• Minimises the number of mutational steps
3. Maximum likelihood (ML)
• Evaluates the statistical likelihood of alternative trees, based on an explicit model of substitution
4. Bayesian methods
• Like ML but can incorporate prior knowledge
What is a distance matrix?
A table that indicates the number of substitutions between
pairs of sequences
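As a rough illustration of such a table, the sketch below counts observed differences between every pair of aligned sequences. The toy alignment and the function names are invented for this example, and skipping gapped columns is an assumption of the sketch rather than a rule from the slides. Observed differences underestimate the true number of substitutions when sites have changed more than once.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Count columns where two aligned sequences differ (gapped columns skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(
        1
        for x, y in zip(seq_a, seq_b)
        if x != y and x != "-" and y != "-"
    )


def distance_matrix(alignment):
    """Return {(name_a, name_b): differences} for every pair of sequences."""
    return {
        (a, b): count_differences(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


# Hypothetical toy alignment, for illustration only.
toy_alignment = {
    "taxon1": "ACGTACGT",
    "taxon2": "ACGTACGA",
    "taxon3": "TCGAACGA",
}
for pair, d in distance_matrix(toy_alignment).items():
    print(pair, d)
```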
Distance Matrix Methods
Andrew Rambaut
UPGMA Method
1. Identify the pair of most closely related taxa according to
the pairwise-genetic distance matrix
2. Cluster these together
Figures Andrew Rambaut
UPGMA Method
3. Recalculate distance matrix (calculate the distances from
the new cluster to every other sequence)
Take the average of both distances (when larger clusters are merged, the average is weighted by the number of taxa in each cluster)
E.g. distance[spinach, monkey/human]:
• = (distance[spinach, human] + distance[spinach, monkey]) / 2
• = (86.3 + 90.8) / 2 = 88.55
Figures Andrew Rambaut
UPGMA Method
4. Repeat the procedure until the tree is finished
The distance between (spi,ric) and (mos,(mon,hum)) is 108.7
Andrew Rambaut
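The four UPGMA steps can be sketched in a few lines. This is a minimal illustration that returns only the topology and the order of merges, not branch lengths. The spinach distances (86.3 and 90.8) come from the worked example above; the human/monkey distance of 10.0 is a made-up placeholder, and the function name is hypothetical.

```python
from itertools import combinations


def upgma(names, pair_distances):
    """Cluster taxa by UPGMA.

    names: list of taxon names.
    pair_distances: {(a, b): distance} for every pair of taxa.
    Returns (tree, merges): tree is a nested tuple, merges lists
    (cluster_a, cluster_b, distance) in the order clusters were joined.
    """
    d = {frozenset(pair): dist for pair, dist in pair_distances.items()}
    sizes = {name: 1 for name in names}  # cluster -> number of taxa in it
    merges = []
    while len(sizes) > 1:
        # Steps 1 and 2: find and join the closest pair of clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merges.append((a, b, d[frozenset((a, b))]))
        n_a, n_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Step 3: size-weighted average distance to every other cluster.
        for c in sizes:
            d[frozenset((merged, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        sizes[merged] = n_a + n_b
        # Step 4: the loop repeats until one cluster remains.
    return next(iter(sizes)), merges


example = {
    ("human", "monkey"): 10.0,   # hypothetical value, not from the slides
    ("human", "spinach"): 86.3,
    ("monkey", "spinach"): 90.8,
}
tree, merges = upgma(["human", "monkey", "spinach"], example)
print(tree)
print(merges)
```

With these inputs the last merge joins spinach to the human/monkey cluster at the averaged distance worked out above.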
UPGMA Method
• Assumptions:
• Strict molecular clock
• Ultrametric distance data
• Advantages:
• Fast and simple
• Disadvantages:
• Data are almost never ultrametric
• Usage: Almost never used
Neighbour Joining Method
• An improvement over UPGMA: does not require the data
to be ultrametric
• Identifies the topology that gives the least total branch
length at each step
Figures Olivier Gascuel
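One way to see how neighbour joining picks a pair at each step is the Q criterion from the standard formulation of the algorithm: Q(i,j) = (n - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k), where the pair with the smallest Q is joined. The sketch below covers only this selection step, not branch-length estimation or the matrix update. The five-taxon distances and the function name are invented for illustration.

```python
from itertools import combinations


def nj_q_matrix(names, d):
    """Compute the neighbour-joining Q value for every pair of taxa.

    d is a nested dict with d[i][j] the distance between taxa i and j.
    """
    n = len(names)
    # Sum of distances from each taxon to all others.
    totals = {i: sum(d[i][k] for k in names if k != i) for i in names}
    return {
        (i, j): (n - 2) * d[i][j] - totals[i] - totals[j]
        for i, j in combinations(names, 2)
    }


# Hypothetical distances between five taxa, for illustration only.
taxa = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {i: dict(zip(taxa, row)) for i, row in zip(taxa, rows)}

q = nj_q_matrix(taxa, dist)
pair_to_join = min(q, key=q.get)
print(pair_to_join, q[pair_to_join])
```

Note that d and e have the smallest raw distance here, yet a and b have the smallest Q value. Correcting for each taxon's average distance to all the others is what frees the method from the ultrametric assumption.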
Neighbour Joining Method
• Advantages:
• allows the use of an explicit model of evolution
• fast and simple
• able to deal with thousands of taxa
• Disadvantages:
• only produces one tree
• reduces all sequence information into a single distance
value
• dependent on the evolutionary model used
• Usage: commonly used due to being widely available in
many software packages
Maximum Parsimony
The most parsimonious tree is the tree requiring the
smallest number of substitutions to explain the sequences
[Figure: candidate unrooted trees for the same sequences, with substitutions marked by * and tree lengths of 3, 2, 3 and 3; the shortest tree (length = 2) is the most parsimonious (MP)]
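Counting the minimum number of substitutions a tree needs for one alignment column can be sketched with Fitch's algorithm, cited in the methods overview. The sketch assumes a bifurcating tree written as nested tuples with single-character tip states. It roots the tree arbitrarily, which does not change the count. The example trees are made up and are not the ones in the figure.

```python
def fitch(node):
    """Return (possible_states, min_substitutions) for one site on a tree.

    A tip is a one-character string; an internal node is a (left, right) tuple.
    """
    if isinstance(node, str):
        return {node}, 0
    left, right = node
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    shared = left_states & right_states
    if shared:
        # The children can agree on a state: no extra substitution needed.
        return shared, left_cost + right_cost
    # The children disagree: one substitution is required at this node.
    return left_states | right_states, left_cost + right_cost + 1


# Two hypothetical groupings of the same four tip states (C, C, A, A).
grouped = (("C", "C"), ("A", "A"))
mixed = (("C", "A"), ("C", "A"))
print(fitch(grouped)[1], fitch(mixed)[1])
```

The tree with the smaller count would be preferred for this site. In a full analysis the counts are summed over all columns of the alignment.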
Maximum Parsimony
• Assumptions:
• Multiple substitutions rare
• Advantages:
• fast
• Disadvantages:
• not consistent with most models of evolution
• can result in multiple optimal trees
• Usage: still used with morphological data
Figures Andrew Rambaut
The problem of multiple substitutions
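[Figure: substitutions (*) along lineages with site states G, A, A and A, T; some of them are hidden mutations]
• Multiple substitutions are more likely to have occurred between distantly related species
• We therefore need an explicit model of evolution to account for them (to be covered in part two)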
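To give a feel for what such a correction does, the sketch below uses the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. Jukes-Cantor is not named in these slides. It is assumed here only because it is the simplest substitution model, and the function name is invented.

```python
import math


def jukes_cantor_distance(p):
    """Estimate substitutions per site from the observed proportion of differences.

    Assumes the Jukes-Cantor model (equal base frequencies, equal rates).
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.3, 0.6):
    print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as sequences become more divergent, which matches the point that hidden mutations matter most between distantly related species.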
How well supported are my branches?
A tree is a collection of hypotheses, so we assess our confidence in each of its parts (branches) independently
[Figure: example tree with branch support values shown in pairs: 0.99 and 100, 0.81 and 63, 0.93 and 85]
There are three main approaches:
• Bootstraps
• Bayesian methods (probabilistic)
• Approximate likelihood ratio test (aLRT) methods (probabilistic)
Bootstrapping
1. Take your alignment, and consider each column separately
2. Resample columns with replacement to create many dummy alignments (repeat lots)
3. Use these to draw many trees (repeat lots) and count up the occurrences of each branch among these trees
Figures Andrew Rambaut
Felsenstein, J. 1985. Confidence limits on phylogenies: an approach
using the bootstrap. Evolution 39: 783-791.
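Step 2 of the procedure, resampling columns with replacement, can be sketched as follows. The toy alignment and function name are invented for illustration. Building a tree from each replicate and counting branches (step 3) is left to whichever tree-building method is in use.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate made by sampling columns with replacement.

    alignment: {name: aligned sequence}, all sequences the same length.
    rng: a random.Random instance, passed in so results can be reproduced.
    """
    length = len(next(iter(alignment.values())))
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    # Apply the same column choice to every sequence to keep columns intact.
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment, for illustration only.
toy = {
    "taxon1": "ACGTACGT",
    "taxon2": "ACGTACGA",
    "taxon3": "TCGAACGA",
}
rng = random.Random(1)
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate has the same dimensions as the original alignment, but some columns appear more than once and others not at all.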
Issues with bootstrapping
• Sites may not evolve independently
• P values are biased (too conservative)
• Calculating bootstraps for many branches results in
multiple testing
• Bootstrapping does not correct biases in phylogeny
methods
• Nevertheless, bootstraps perform surprisingly well
Now it's your turn…
• Open your course manuals and begin Tutorial 1
• Also available to download from:
http://www.ebi.ac.uk/training/course/scuola-dibioinformatica-2013
• You will require the alignment file 5SrRNA.txt
• There are answers available online but it is much better
to ask for help!
Thank you!
www.ebi.ac.uk
Twitter: @emblebi
Facebook: EMBLEBI