tree - KU Leuven

IE68 - Biological databases
Phylogenetic analysis
• Phylogeny
a reconstruction of the evolutionary (genealogical)
history of a group of organisms/genes or proteins
from biological data
organisms: populations, species, genera,... => taxa =>
operational taxonomic units (OTU’s)
data: molecular, morphological, archaeological,... =>
characters
• Phylogenetic tree
the graphical reconstruction of a phylogeny
tree structure: phylogram, cladogram
IE68 - biological databases phylogeny
Phylogenetic tree
A tree consists of nodes connected by branches
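• leaves A, B, C, D, E => OTU’s for which we have data
• root (placed with an outgroup or at the midpoint) => ancestor of all the taxa that comprise the tree
• polytomy: a node with more than two descendant branches, e.g. (C,D,E)
• notation: ((A,B),(C,D,E))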
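The bracket notation can be read mechanically. The sketch below is only an illustration: the function names `parse_newick` and `polytomies` are made up for this example, and the parser handles bare taxon names only (no branch lengths, no trailing semicolon). It turns the string into nested tuples and reports nodes with more than two descendants.

```python
def parse_newick(text):
    """Parse a bare bracket string such as '((A,B),(C,D,E))' into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip '('
            children = [node()]
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(node())
            pos += 1  # skip ')'
            return tuple(children)
        # otherwise read a taxon name up to the next delimiter
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def polytomies(tree):
    """Return every internal node that has more than two children."""
    if isinstance(tree, str):
        return []
    found = [tree] if len(tree) > 2 else []
    for child in tree:
        found.extend(polytomies(child))
    return found


tree = parse_newick("((A,B),(C,D,E))")
print(tree)              # (('A', 'B'), ('C', 'D', 'E'))
print(polytomies(tree))  # [('C', 'D', 'E')]
```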
Phylogenetics <> Phenetics
• Phenetics: method of grouping taxa that is
based on overall (dis)similarities of
characters => with no reference to evolution!
• Phylogenetics: method of grouping taxa that
is based on shared derived characters
(synapomorphies) or a model of evolution
Why do we need phylogenies?
• Intrinsic interest in the tree => tree of life
origin of organisms
Why do we need phylogenies?
• Phylogenies can also be used as tools for
investigating other problems
e.g. biogeography
phylogeny reflects the order of separation of
the areas the different taxa occupy
Why do we need phylogenies?
• Phylogenies can also be used as tools for
investigating other problems
e.g. forensic science
Phylogenetic analysis
• Molecular Phylogenetics
reconstruction of the evolutionary (genealogical)
history of a group of organisms from molecular
data, i.e. DNA or protein sequences
In this lecture, we will focus on phylogenetic
analysis of organisms based on DNA sequence
data
Molecular phylogenetics:
approach
Step 1: PCR with primers that target
cytoplasmic DNA or nuclear loci of
taxa, followed by DNA sequence
analysis
Step 2: Multiple DNA sequence alignment
Step 3: Phylogenetic analysis
PCR and DNA sequencing
• Which loci?
DNA sequence information, primers, variability,
single or low-copy, orthologous, neutral,
recombination...
• Gene trees versus organismal trees
phylogenies for genes do not always match those
of their corresponding organisms => analyse more
than one gene
Confounding influence of gene
duplication
2 types of homology:
orthology (speciation) and paralogy (gene duplication)
Lineage sorting and coalescence
[Figure: alleles sorting within a species tree]
Multiple DNA sequence alignment
• Problem: alternative alignments
possible to align any two sequences by postulating
some combination of gaps (insertion/deletions =
indels) and substitutions
=> which one to choose?
• Basic task of sequence alignment is to find
the alignment with the highest similarity,
smallest distance, or lowest overall cost
Multiple DNA sequence alignment
• 2 sequences + scoring scheme => optimal alignment
• Scoring scheme:
- scoring matrix: distance weights or similarity scores for
each pair of aligned bases
e.g. transition – transversion matrix
      A  T  G  C
  A   0  5  1  5
  T   5  0  5  1
  G   1  5  0  5
  C   5  1  5  0
- gap weight, cost or penalty
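As a rough illustration of how such a scoring scheme is applied, the sketch below sums the matrix weights over the columns of an already aligned pair. The gap weight of 4 is an arbitrary assumption (the slide gives no value), and `alignment_cost` is a name invented for this example.

```python
# Transition-transversion weights from the matrix above (transitions cost 1, transversions 5).
COST = {
    "A": {"A": 0, "T": 5, "G": 1, "C": 5},
    "T": {"A": 5, "T": 0, "G": 5, "C": 1},
    "G": {"A": 1, "T": 5, "G": 0, "C": 5},
    "C": {"A": 5, "T": 1, "G": 5, "C": 0},
}
GAP = 4  # assumed gap weight per gapped position, not taken from the slides


def alignment_cost(row1, row2, gap=GAP):
    """Total cost of two aligned rows of equal length ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap
        else:
            total += COST[a][b]
    return total


# One gap (4) + two transversions T/A and C/G (5 + 5)
print(alignment_cost("AGCGTCT", "AG-GAGT"))  # 14
```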
Multiple DNA sequence alignment
• Cost of an alignment D = s + wg
s = number of substitutions, g = total length of gaps
w = gap penalty = cost of a gap relative to a substitution
• Gap penalty w makes implicit assumptions
about how the sequences have evolved
if indels are thought to be rare, then w should be
large (and vice versa)
=> have to use knowledge of biology e.g. translation (3
bp indel, position), transition<>transversion, ...
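To make the cost D = s + wg concrete, here is a minimal dynamic-programming sketch that finds the smallest D over all global alignments of two sequences. It assumes every substitution costs 1 and every gapped position costs w; the function name and the example sequences are illustrative only.

```python
def min_alignment_cost(x, y, w=2):
    """Smallest D = s + w*g over all global alignments of x and y."""
    # prev[j] = cost of aligning the empty prefix of x with y[:j] (j gaps)
    prev = [j * w for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * w]  # x[:i] against the empty prefix of y
        for j in range(1, len(y) + 1):
            substitute = prev[j - 1] + (x[i - 1] != y[j - 1])
            gap_in_y = prev[j] + w      # x[i-1] aligned to a gap
            gap_in_x = curr[j - 1] + w  # y[j-1] aligned to a gap
            curr.append(min(substitute, gap_in_y, gap_in_x))
        prev = curr
    return prev[-1]


# Best alignment uses one gap (w = 2) and two substitutions: D = 2 + 2*1 = 4
print(min_alignment_cost("AGCGTCT", "AGGAGT", w=2))  # 4
```

Raising w makes alignments with gaps less attractive, which is the sense in which the gap penalty encodes an assumption about how rare indels are.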
Multiple DNA sequence alignment
• Software programs:
e.g. CLUSTALW (global alignment)
http://www.ebi.ac.uk/clustalw/index.html
• The optimal alignment is not always the
true alignment => new developments:
phylogenetic analysis without the multiple DNA
sequence alignment step
Inferring phylogenies from DNA sequences
Sequence alignment (rows = taxa, columns = characters):
A  ..AGCGTCT..
B  ..AGCGTGT..
C  ..AG-GAGT..
=> phylogenetic methods =>
unrooted tree or rooted tree of A, B and C
Phylogenetic methods
• Character-based methods
– methods based on an explicit model of evolution: maximum likelihood methods
– methods not based on an explicit model of evolution: maximum parsimony methods
• Non-character-based methods
– methods based on an explicit model of evolution: pairwise distance methods
Pairwise distance methods
• Dissimilarity matrix:
count the number of differences
between all possible pairs of sequences
• Convert dissimilarity to evolutionary
distance by correcting for multiple
events per site according to a certain
model of evolution
• Infer tree topology on the basis of the
evolutionary distances by using a
clustering algorithm or optimality
criterion
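The first step, counting differences between all pairs of sequences, can be sketched as follows. The three sequences are the short alignment fragment used earlier in these slides; skipping columns that contain a gap is an assumption of this sketch, and the function names are illustrative.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (gap columns skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    differences = sum(x != y for x, y in pairs)
    return differences / len(pairs)


def dissimilarity_matrix(alignment):
    """Map every pair of taxon names to its proportion of differing sites."""
    return {
        (s, t): p_distance(alignment[s], alignment[t])
        for s, t in combinations(alignment, 2)
    }


alignment = {"A": "AGCGTCT", "B": "AGCGTGT", "C": "AG-GAGT"}
for pair, p in dissimilarity_matrix(alignment).items():
    print(pair, round(p, 3))
```

These uncorrected proportions are the input to the distance correction described in the second step.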
3 taxa, 3 sequences
=> dissimilarity matrix
     1     2
2  0.26
3  0.20  0.33
=> evolutionary distance matrix
     1     2
2  0.32
3  0.23  0.44
=> tree
Models of sequence evolution
expected difference (linear) vs. observed difference (not linear) => correction
Apply a substitution model that tries to estimate the correct number of substitutions
Models of sequence evolution
• Distance “correction” methods:
convert observed distances into measures that
correspond to ACTUAL distances
• Several methods have been proposed, all with
different assumptions about the nature of the
evolutionary process
• Essentially they differ by the number of
parameters they include
• We can use a general framework to show how
these models are inter-related
Substitution models: general framework
e.g. Model of Jukes & Cantor (JC)
• One of the first proposed – perhaps the simplest model of
evolution
• Assumes that all four bases have equal frequency and that
all substitutions are equally likely
• Under this model, the distance between any two sequences
is given by d = -(3/4) ln(1 - (4/3)p), where p is the proportion of
nucleotides that are different in the two sequences
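A small sketch of the JC correction, with `jukes_cantor` as an illustrative function name. Applied to the observed proportions 0.26 and 0.20 from the three-sequence example it gives about 0.32 and 0.23, which agrees with the corrected values shown there; the slides do not state which model was used for that example, and 0.33 gives about 0.43 rather than the 0.44 shown.

```python
import math


def jukes_cantor(p):
    """JC distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the logarithm to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.26, 0.20, 0.33):
    print(p, round(jukes_cantor(p), 2))
# 0.26 0.32
# 0.2 0.23
# 0.33 0.43
```

The correction grows faster than p itself: as p approaches 0.75 the estimated distance tends to infinity, reflecting saturation of the observed differences.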
e.g. Kimura 2 parameter model (K2P)
• incorporates the observation that transitions accumulate
more rapidly than transversions
• assumes all four bases have equal frequencies but that there
are 2 rate classes for substitutions
• Under this model, the distance between any two sequences
is given by d = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)], where
P and Q are the proportional differences between the two
sequences due to transitions and transversions, respectively
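The K2P formula can be sketched in the same way. The code below counts transitions and transversions between two aligned sequences and applies the formula above; the two test sequences are invented for illustration, and columns containing anything other than A, C, G or T are skipped (an assumption of this sketch).

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq1, seq2):
    """K2P distance between two aligned DNA sequences."""
    sites = [(x, y) for x, y in zip(seq1, seq2) if x in "ACGT" and y in "ACGT"]
    # transition: change within purines (A<->G) or within pyrimidines (C<->T)
    transitions = sum(x != y and (x in PURINES) == (y in PURINES) for x, y in sites)
    # transversion: change between a purine and a pyrimidine
    transversions = sum((x in PURINES) != (y in PURINES) for x, y in sites)
    P = transitions / len(sites)
    Q = transversions / len(sites)
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))


# One transition (A->G) and one transversion (A->C) in 10 sites: P = Q = 0.1
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAC")
print(round(d, 3))  # 0.234
```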
Substitution models
Other models: adding more parameters
- Felsenstein model (F81)
variation in base composition => base frequency
f = [A C G T] may vary
- Hasegawa Kishino Yano (HKY) model
unequal base frequencies, different transition and transversion rates
- General reversible model (REV)
unequal base frequency, all six pairs of substitutions have
different rates
=> ideally, we want the simplest model we can get away with
that still yields a reasonable estimate
Substitution models
• Assumptions of these models:
– all nucleotide sites change independently
– base composition equilibrium
– substitution rate is constant over time and in
different lineages
– each site in a sequence is equally likely to
undergo substitution
=> the last assumption can be relaxed with a gamma distribution, which has a
parameter that specifies the range of rate variation among sites: model + Γ
Pairwise distance methods
• Dissimilarity matrix:
count the number of differences between
all possible pairs of sequences
• Convert dissimilarity to evolutionary
distance by correcting for multiple events
per site according to a certain model of
evolution
• Infer tree topology on the basis of the
evolutionary distances by using a
clustering algorithm
Clustering methods
• Clustering methods follow a set of steps (an
algorithm) and arrive at a tree
– UPGMA (Unweighted Pair Group Method using
Arithmetic Averages): results in a rooted and additive
tree with molecular clock
– Neighbor-joining: results in an unrooted and
additive tree
– Other approaches: least-squares, Fitch, Kitch,...
UPGMA clustering
Distance matrix:
     B   C   D
A    2   4   6
B        4   6
C            6
A and B show the least differences (dAB = 2) => join A and B, each on a branch of length 1
Compute new distances between (AB) and other OTU’s
d(AB)C = (dAC + dBC) /2 = 4
d(AB)D = (dAD + dBD) /2 = 6
UPGMA clustering
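Reduced distance matrix:
      C   D
AB    4   6
C         6
(AB) and C are closest (d(AB)C = 4) => join C to (AB) at height 2:
branch of length 2 to C, internal branch of length 1 to (AB)
Compute new distances between (ABC) and other OTU’s
d(ABC)D = (2 d(AB)D + dCD) /3 = (dAD + dBD + dCD) /3 = 6
(UPGMA averages over all members of a cluster, so (AB) counts twice)
=> join D to (ABC) at height 3
Resulting tree: (((A,B),C),D), with nodes at heights 1, 2 and 3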
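The worked example (dAB = 2, dAC = dBC = 4, dAD = dBD = dCD = 6) can be reproduced with a short UPGMA sketch. It is a minimal illustration rather than a production implementation: cluster labels are built by concatenating taxon names, ties are broken by input order, and the function name `upgma` is chosen for this example.

```python
from itertools import combinations


def upgma(names, distances):
    """Return (tree as nested tuples, list of (cluster label, node height))."""
    dist = {frozenset(pair): value for pair, value in distances.items()}
    clusters = {name: (name, 1) for name in names}  # label -> (subtree, number of leaves)
    merges = []
    while len(clusters) > 1:
        # join the two clusters with the smallest distance
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        height = dist[frozenset((a, b))] / 2
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        new = a + b
        for c in clusters:
            # average weighted by cluster size, so every leaf counts equally
            dist[frozenset((new, c))] = (
                n_a * dist[frozenset((a, c))] + n_b * dist[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[new] = ((tree_a, tree_b), n_a + n_b)
        merges.append((new, height))
    tree, _ = next(iter(clusters.values()))
    return tree, merges


distances = {
    ("A", "B"): 2, ("A", "C"): 4, ("A", "D"): 6,
    ("B", "C"): 4, ("B", "D"): 6, ("C", "D"): 6,
}
tree, merges = upgma(["A", "B", "C", "D"], distances)
print(tree)    # ('D', ('C', ('A', 'B')))
print(merges)  # [('AB', 1.0), ('CAB', 2.0), ('DCAB', 3.0)]
```

The node heights 1, 2 and 3 are half the distance at which each pair of clusters is joined, which is what makes the resulting tree ultrametric.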
Clustering methods
• UPGMA
additive and ultrametric distances
=> assumes a molecular clock => very sensitive to
unequal rate of evolution! => relative-rate test
• Use other clustering methods for phylogeny
e.g. Neighbor-joining
• “Goodness of fit” statistics: to select the metric
tree that best accounts for the observed distances
Pairwise distance methods
• Dissimilarity matrix:
count the number of differences between
all possible pairs of sequences
• Convert dissimilarity to evolutionary
distance by correcting for multiple events
per site according to a certain model of
evolution
• Infer tree topology on the basis of the
evolutionary distances by using an
optimality criterion
Minimum evolution
• Distance matrix => unrooted metric trees
• Each tree has a length L, which is the sum
of all the branch lengths
• Optimality criterion:
the minimum evolution tree ME is the tree
which minimizes L
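As a small illustration of the tree length L, the sketch below fits branch lengths to the corrected distances of the three-sequence example (0.32, 0.23, 0.44) and sums them. With three taxa there is only one unrooted topology, so there is nothing to minimize here; for more taxa the same sum would be computed for each candidate topology and the smallest L chosen. The function name is illustrative.

```python
def three_taxon_branches(d12, d13, d23):
    """Branch lengths from the central node to taxa 1, 2 and 3 that reproduce the distances exactly."""
    b1 = (d12 + d13 - d23) / 2
    b2 = (d12 + d23 - d13) / 2
    b3 = (d13 + d23 - d12) / 2
    return b1, b2, b3


branches = three_taxon_branches(0.32, 0.23, 0.44)
L = sum(branches)  # tree length = sum of all branch lengths
print([round(b, 3) for b in branches], round(L, 3))  # [0.055, 0.265, 0.175] 0.495
```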
Pairwise distance method
• Advantages
– very fast
– based on a model of evolution
• Disadvantages
– sequence information is reduced to one number
– branch lengths may not be biologically
interpretable
– most methods provide only one tree topology
– dependent on the model of evolution used
Character-based methods
• Character-based (discrete) methods operate
directly on sequences, rather than on
pairwise distances
• Two major discrete methods:
– Maximum parsimony (MP): chooses tree(s) that
require fewest evolutionary changes
– Maximum Likelihood (ML): chooses the tree(s)
most likely to have produced the observed data
Maximum parsimony
• Maximum parsimony infers a phylogenetic
tree by minimizing the total number of
evolutionary steps
• Principle:
– Investigate all possible tree topologies
– Reconstruct ancestral sequences
– Choose topology with smallest number of steps
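The three steps can be sketched for four taxa. The example below uses Fitch's algorithm to count the minimum number of changes per site on a given topology; this is one standard way to do the counting and not necessarily the procedure drawn in the original figures. The four sequences are invented for illustration, and each unrooted topology is written with an arbitrary root, which does not change the number of steps.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of steps) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch(left, states)
    right_set, right_steps = fitch(right, states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra step
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Total number of steps over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


alignment = {"1": "AAG", "2": "AAG", "3": "GCT", "4": "GCA"}  # hypothetical data
topologies = {
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}
lengths = {name: tree_length(t, alignment) for name, t in topologies.items()}
print(lengths)                        # {'((1,2),(3,4))': 4, '((1,3),(2,4))': 6, '((1,4),(2,3))': 6}
print(min(lengths, key=lengths.get))  # ((1,2),(3,4))
```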
Maximum parsimony - principle
[Figure: for four taxa (1-4) there are three possible unrooted tree topologies: ((1,2),(3,4)), ((1,3),(2,4)) and ((1,4),(2,3))]
Maximum parsimony - generalized
• In previous example, cost of each substitution was
“one step” => equal weight
• Instead, we can use different costs for different
types of change (e.g. transitions vs transversions)
to better match our assumptions about
evolutionary processes => weighted parsimony
according to Dollo, Wagner, Fitch, ...
Maximum parsimony – search methods
• Number of unrooted tree topologies for n sequences: Nu = (2n-5)! / [2^(n-3) (n-3)!]
e.g., 3 sequences ~ 1 tree, 4 seq ~ 3 trees, 5 seq ~ 15, 6 ~ 105 =>
the more sequences (~ taxa), the more trees => computationally
expensive
• Finding optimal trees:
– Exhaustive search: limited number of taxa (<10)
find the minimum tree of all possible trees
– Branch and bound: small number of taxa (<18)
find the minimum tree without evaluating all trees by
discarding families of trees during tree construction that
cannot be shorter than the shortest tree found so far
– Heuristic search: large number of taxa
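The growth in the number of topologies is easy to check numerically. This sketch evaluates Nu = (2n-5)! / [2^(n-3) (n-3)!] for unrooted, fully resolved trees; the function name is illustrative.

```python
from math import factorial


def n_unrooted_trees(n):
    """Number of unrooted, fully resolved tree topologies for n taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 6, 10):
    print(n, n_unrooted_trees(n))
# 3 1
# 4 3
# 5 15
# 6 105
# 10 2027025
```

With ten taxa there are already about two million topologies, which is why exhaustive search is only practical for small data sets.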
Maximum parsimony – search methods
- Heuristic search:
explore a subset of all possible trees, by using stepwise
addition of taxa plus a rearrangement process (branch
swapping), but not guaranteed to find the minimal tree
[Figure: a heuristic search can end in a local optimum instead of the global optimum]
Maximum parsimony - output
• Consensus tree:
MP can yield multiple equally most parsimonious
(optimal) trees => relationships common to all the
optimal trees are summarized with a consensus
tree
– Strict consensus: includes splits found in all trees
– Majority-rule consensus: includes splits found in the
majority of the trees (> 50%)
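Both consensus rules reduce to counting how often each grouping occurs. In the sketch below each tree is represented simply as a set of taxon groups (one side of each split); the three trees are hypothetical and the function names are chosen for this example.

```python
from collections import Counter


def strict_consensus(trees):
    """Groups present in every tree."""
    return set.intersection(*(set(t) for t in trees))


def majority_rule_consensus(trees):
    """Groups present in more than 50% of the trees."""
    counts = Counter(group for t in trees for group in set(t))
    return {group for group, c in counts.items() if c > len(trees) / 2}


g = frozenset  # frozenset("AB") is the group {A, B}
trees = [
    {g("AB"), g("ABC")},  # hypothetical equally parsimonious trees
    {g("AB"), g("ABD")},
    {g("AB"), g("ABC")},
]
print(strict_consensus(trees) == {g("AB")})                    # True
print(majority_rule_consensus(trees) == {g("AB"), g("ABC")})   # True
```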
Maximum parsimony - output
• Consistency index (CI) - Retention index (RI)
– measures of the parsimony fit of a character to a
tree, or of the average fit of all characters to a tree
– more specifically: index of how much homoplasy
the constructed tree has
– Value from 0 to 1
– higher value => less homoplasy
Parsimony – branch support and tree stability
• Bootstrap analysis
– is a resampling technique used to measure sampling
error
– gives an idea about the reliability of branches and
clusters
– original dataset => resample characters with replacement => construct trees =>
count how often each branch of the original tree is recovered
– bootstrap value > 70%: quite confident of that branch
• Decay index (Bremer support)
– gives us a sense of how many steps would be required
before a grouping collapses
– higher value => better branch support
Maximum parsimony
• Advantages
– based on shared derived characters
– evaluates different tree topologies
– does not reduce the information
• Disadvantages
– computationally intensive for large datasets
– no correction for multiple mutations
– sensitive to unequal rates of evolution (long
branch attraction)
Maximum likelihood
• Statistical method
• Given some data D and a hypothesis H,
the likelihood of the data is given by
LD = Pr (D|H),
which is the probability of D given H
Maximum likelihood
• In the context of molecular phylogenetics
– D is the set of sequences being compared
– H is a phylogenetic tree
• We want to find the likelihood of obtaining
the observed data given the tree
• The tree that makes the data the most
probable evolutionary outcome is the
Maximum Likelihood estimate of the
phylogeny
Maximum likelihood
• In other words
Which tree is most likely to have yielded
these sequences (observed data) under a
given model of evolution (JC, K2P, ...)?
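A full ML tree search is beyond a short example, but the principle can be shown for the simplest possible hypothesis: two sequences separated by a single branch of length d under the JC model. The sketch below assumes 100 aligned sites with 26 differences (invented numbers), evaluates Pr(D|H) on a grid of d values, and keeps the d that makes the data most probable. Under JC this maximum coincides with the JC distance formula.

```python
import math


def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of n_diff differences in n_sites sites for branch length d under JC."""
    e = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # the factor 0.25 is the equilibrium frequency of the base in the first sequence
    return (n_sites - n_diff) * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


n_sites, n_diff = 100, 26                     # hypothetical data
grid = [i / 1000 for i in range(1, 2001)]     # candidate branch lengths 0.001 .. 2.000
d_ml = max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))
d_jc = -0.75 * math.log(1 - 4 * (n_diff / n_sites) / 3)
print(d_ml, round(d_jc, 3))  # both close to 0.32
```

For real trees the same idea applies, except that the hypothesis H includes the topology and all branch lengths, and the likelihood is summed over the unknown ancestral states.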
Maximum likelihood
• Advantages
– Statistically well founded
– Based on a model of evolution
– Evaluates different topologies
– Uses all sequence information
– Often yields estimates that have lower variance
than other methods
• Disadvantages
– Very slow (computationally intensive)
– Dependent on the model of evolution used
Software programs for phylogenetic
analysis
• Overview:
http://evolution.genetics.washington.edu/phylip/software.html
• Most widely used software programs
– PHYLIP: freely available (downloadable or online
http://bioweb.pasteur.fr/seqanal/phylogeny/phylipuk.html)
– PAUP: user friendly but not freely available
Phylogenetic information on the internet
• http://tolweb.org/tree/phylogeny.html
• http://www.treebase.org/treebase/
• ....
If you need more information
• Jacqueline Vander Stappen
K.U.Leuven
Laboratory of Gene Technology
Kasteelpark Arenberg 21
B-3001 Leuven
• Jacqueline.vanderstappen@agr.kuleuven.ac.be