Phylogenetic analysis

A brief introduction in 2 x 4 hours
brigitte.boeckmann@isb-sib.ch
What you can learn today
• Understand trees
• Different types of gene relationships
• The difference between a cladogram and a phylogram
• Phylogenetic analysis methods
• Steps performed during a phylogenetic analysis
• Search strategies for tree topologies
• Measures for tree robustness
• Gene relationships and function prediction
© 2009 SIB
Outline
• Introduction to phylogenetic analysis
• Application: Protein function prediction
• Databases, servers and software
• TP5
Introduction
Phylogeny is the study of evolutionary relationships.
Phylogenetic analysis is the means of inferring evolutionary relationships.
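[Figure: an ancestral genome gives rise to the genomes of species 1 and species 2; horizontal gene transfer (HGT) occurs between lineages.]
Sources of genomic change shown in the figure:
• Polymorphisms, copy number variation (CNV)
• Gene duplication, gene loss, gene fusion, gene fission, exon shuffling, retroposition, mobile elements, de novo gene origination
• HGT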
Trees
[Figure: a tree with leaves A to G; labelled parts are end nodes, internal nodes, branches and roots.]
Phylogenetic trees
• Cladogram
• Phylogram: the branch length represents the number of character changes
Molecular clock
Phylogenetic trees
• A phylogenetic tree is a model of the evolutionary relationships between operational taxonomic units (OTUs), based on homologous characters.
• But not all trees are phylogenetic trees
  – Dendrogram: general term for a branching diagram
  – Cladogram: branching diagram without branch length estimates
  – Phylogram or phylogenetic tree: branching diagram with branch length estimates
Please note:
Guide trees produced during multiple sequence alignment have no phylogenetic meaning. These dendrograms are based on distances derived from pairwise alignments; they are used only to determine the order in which sequences are aligned during the construction of the MSA.
Rooted and unrooted trees
[Figure: rooted and unrooted trees; the outgroup is labelled.]
How many distinct trees?
Resolved (bifurcating) and unresolved (multifurcating) trees
[Figure: two example trees, each with leaves A to G.]
Speciation and gene duplication
[Figure: two example trees with gene duplication nodes marked; leaves include the duplicated genes A1/A2, B1/B2 and C1/C2 and the single-copy genes C, D, E and F.]
Relationships within homologs
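[Figure: gene tree descending from an ancestral gene, with leaves Frog gene 1, Human gene 1, Mouse gene 1, Mouse gene 2, Human gene 2, Frog gene 2 and Drosophila gene. A gene duplication separates the "gene 1" group from the "gene 2" group. Labels: orthologs (genes within one group), paralogs (genes across the two groups), homologs (all genes in the tree).]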
Relationships between orthologs and paralogs
[Figure: the same gene tree (Frog, Human and Mouse genes 1 and 2, Drosophila gene, ancestral gene, gene duplication) with the following labels:
• Orthologs (Group 1): Frog gene 1, Human gene 1, Mouse gene 1
• Orthologs (Group 2): Mouse gene 2, Human gene 2, Frog gene 2
• Inparalogs of Group 2
• Outparalogs of Group 1
• Co-orthologs of the Drosophila gene]
Gene trees versus species trees …
Gene relationships
Homologs = Genes of common origin
Orthologs = 1. Genes resulting from a speciation event; 2. Genes originating from an ancestral gene in the last common ancestor of the compared genomes
Co-orthologs = Orthologs that have undergone lineage-specific gene duplications subsequent to a particular speciation event
Paralogs = Genes resulting from gene duplication
Inparalogs = Paralogs resulting from lineage-specific duplication(s) subsequent to a particular speciation event
Outparalogs = Paralogs resulting from gene duplication(s) preceding a particular speciation event
One-to-one (1:1) orthologs = Orthologs with no (known) lineage-specific gene duplications subsequent to a particular speciation event
One-to-many (1:n) orthologs = Orthologs of which at least one, and at most all but one, has undergone lineage-specific gene duplication subsequent to a particular speciation event
Many-to-many (n:n) orthologs = Orthologs which have undergone lineage-specific gene duplications subsequent to a particular speciation event
Pseudo-orthologs = Paralogs with lineage-specific gene loss of orthologs
Xenologs = Orthologs derived by horizontal gene transfer from another lineage
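The ortholog/paralog definitions above can be sketched in code: two genes are orthologs if their last common ancestor in the gene tree is a speciation node, and paralogs if it is a duplication node. The tree below is a hypothetical encoding of the frog/human/mouse/Drosophila example; the event labels and the branching order inside each group are assumptions made for illustration, not output of a real reconciliation tool.

```python
# Hypothetical reconciled gene tree: internal nodes are (event, left, right),
# leaves are plain strings. Event labels are assumed to be known already.
TREE = (
    "speciation",
    "Drosophila gene",
    (
        "duplication",
        ("speciation", "Frog gene 1",
         ("speciation", "Human gene 1", "Mouse gene 1")),
        ("speciation", "Frog gene 2",
         ("speciation", "Human gene 2", "Mouse gene 2")),
    ),
)


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(node, gene_a, gene_b):
    """Classify two distinct genes by the event at their last common ancestor."""
    if not {gene_a, gene_b} <= leaves(node) or gene_a == gene_b:
        raise ValueError("need two distinct genes that are both in the tree")
    _, left, right = node
    for child in (left, right):
        # Descend while one subtree still contains both genes.
        if not isinstance(child, str) and {gene_a, gene_b} <= leaves(child):
            return relationship(child, gene_a, gene_b)
    return "orthologs" if node[0] == "speciation" else "paralogs"


print(relationship(TREE, "Human gene 1", "Mouse gene 1"))
print(relationship(TREE, "Human gene 1", "Mouse gene 2"))
print(relationship(TREE, "Drosophila gene", "Human gene 2"))
```

In this toy tree, Human gene 1 and Mouse gene 2 are separated by the duplication (outparalogs with respect to the human/mouse speciation), while both human genes are co-orthologs of the single Drosophila gene.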
Phylogenetic analysis – an approach I
Sequence data of actin-related protein 2
Species are:
• Caenorhabditis briggsae
• Drosophila melanogaster
• Homo sapiens
• Mus musculus
• Schizosaccharomyces pombe
>Species A - RecName: Full=Actin-related protein 2;
MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDE
ASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNR
EKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTR
RLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVL
VESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKH
IVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAV
LADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR
>Species B - RecName: Full=Actin-related protein 2;
MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDE
ASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNR
EKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTR
RLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVL
VESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKH
IVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAV
LADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR
….
Multiple sequence alignment:

ARP2_A  MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_B  MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_C  MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ARP2_D  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_E  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_A  AEAVRSLLQVKYPMENGIIRDFEEMNQLWDYTF-FEKLKIDPRGRKILLTEPPMNPVANR
ARP2_B  CSQLRQMLDINYPMDNGIVRNWDDMAHVWDHTFGPEKLDIDPKECKLLLTEPPLNPNSNR
ARP2_C  ASQLRSLLEVSYPMENGVVRNWDDMCHVWDYTFGPKKMDIDPTNTKILLTEPPMNPTKNR
ARP2_D  ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR
ARP2_E  ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR

ARP2_A  EKMCETMFERYGFGGVYVAIQAVLSLYAQGLSSGVVVDSGDGVTHIVPVYESVVLNHLVG
ARP2_B  EKMFQVMFEQYGFNSIYVAVQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFALHHLTR
ARP2_C  EKMIEVMFEKYGFDSAYIAIQAVLTLYAQGLISGVVIDSGDGVTHICPVYEEFALPHLTR
ARP2_D  EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR
ARP2_E  EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR

ARP2_A  RLDVAGRDATRYLISLLLRKGYAFNRTADFETVREMKEKLCYVSYDLELDHKLSEETTVL
ARP2_B  RLDIAGRDITKYLIKLLLQRGYNFNHSADFETVRQMKEKLCYIAYDVEQEERLALETTVL
ARP2_C  RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRIMKEKLCYIGYDIEMEQRLALETTVL
ARP2_D  RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL
ARP2_E  RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL

ARP2_A  MRNYTLPDGRVIKVGSERYECPECLFQPHLVGSEQPGLSEFIFDTIQAADVDIRKYLYRA
ARP2_B  SQQYTLPDGRVIRLGGERFEAPEILFQPHLINVEKAGLSELLFGCIQASDIDTRLDFYKH
ARP2_C  VESYTLPDGRVIKVGGERFEAPEALFQPHLINVEGPGIAELAFNTIQAADIDIRPELYKH
ARP2_D  VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH
ARP2_E  VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH

ARP2_A  IVLSGGSSMYAGLPSRLEKEIKQLWFERVLHGDPARLPNFKVKIEDAPRRRHAVFIGGAV
ARP2_B  IVLSGGTTMYPGLPSRLEKELKQLYLDRVLHGNTDAFQKFKIRIEAPPSRKHMVFLGGAV
ARP2_C  IVLSGGSTMYPGLPSRLEREIKQLYLERVLKNDTEKLAKFKIRIEDPPRRKDMVFIGGAV
ARP2_D  IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV
ARP2_E  IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV

ARP2_A  LADIMAQND-HMWVSKAEWEEYGV-RALDKLGPRTT
ARP2_B  LANLMKDRDQDFWVSKKEYEEGGIARCMAKLGIKA-
ARP2_C  LAEVTKDRD-GFWMSKQEYQEQGL-KVLQKLQKISH
ARP2_D  LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR
ARP2_E  LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR
Species are:
Caenorhabditis briggsae
Drosophila melanogaster
Homo sapiens
Mus musculus
Schizosaccharomyces pombe
Which sequence is likely to correspond
to which species?
Pairwise comparison of the first alignment block (sequence labels as in the multiple alignment):

ARP2_A  MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_B  MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE

ARP2_A  MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_C  MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE

ARP2_A  MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_D  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_A  MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_E  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_B  MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_C  MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE

ARP2_B  MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_D  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_B  MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_E  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_C  MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ARP2_D  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_C  MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ARP2_E  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_D  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_E  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
Distance matrix
     A    B    C    D    E
A    0    -    -    -    -
B  158    0    -    -    -
C  143  107    0    -    -
D  139   97   73    0    -
E  139   97   73    0    0
Expected species tree for …
• Caenorhabditis briggsae
• Drosophila melanogaster
• Homo sapiens
• Mus musculus
• Schizosaccharomyces pombe
Phylogenetic analysis
1. Data selection
2. Data comparison
3. Selection of a data model
4. Selection of an evolutionary model
5. Tree-building
6. Tree evaluation
What data types can be used to infer phylogenies?
• Morphological characters
• Physiological characters
• Gene order
• Sequence data (nucleotide sequences, amino acid sequences)
• Mixed characters
• ….
Data selection
• To be considered:
  – Input data must be homologous!
  – Taxonomic range and taxonomic distribution (balanced sampling; avoid long branches)
  – Content of phylogenetic information
  – Number of character states
  – Size of the dataset
  – etc.
Data comparison
• To be considered:
  – Prediction of characters that are derived from a common ancestor
  – Choose a suitable alignment method
  – Highly diverged sequences
    • Domain/family predictions
    • Structures
Alignment
• Pairwise alignment versus MSA
• MSA methods
  – ClustalW (very fast)
  – MUSCLE (very fast)
  – MAFFT (fast)
  – ProbCons
  – T-Coffee
  – …
• When to use which method and why?
Selection of a data model
• Characters to be selected for the analysis
• To be considered:
  – Each position in the alignment should be homologous!
  – Missing data (in some OTUs)
  – Number of characters
  – etc.
Selection of a data model
• Common methods
– Gap removal
– GBLOCKS
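Gap removal, the simpler of the two methods, can be sketched as follows. This is a minimal illustration that drops every alignment column containing at least one gap; it is not a reimplementation of Gblocks, which applies additional conservation criteria. The three sequences are the first 20 columns of ARP2_A, ARP2_B and ARP2_C from the alignment shown earlier.

```python
def remove_gap_columns(alignment):
    """Return a copy of the alignment without any column that contains '-'."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [column for column in columns if "-" not in column]
    return {
        name: "".join(column[index] for column in kept)
        for index, name in enumerate(names)
    }


msa = {
    "ARP2_A": "MESAP---IVLDNGTGFVKV",
    "ARP2_B": "MDSQGRKVIVVDNGTGFVKC",
    "ARP2_C": "MDSKGRNVIVCDNGTGFVKC",
}
trimmed = remove_gap_columns(msa)
for name, sequence in trimmed.items():
    print(name, sequence)
```

Strict gap removal can discard a large share of the columns when many sequences are included, which is one reason to inspect the trimmed alignment before tree-building.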
Evolutionary models
• Phylogenetic tree-building presumes particular evolutionary models
• The model chosen influences the outcome of the analysis and should be considered in the interpretation of the analysis results
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
[Figure: http://www.russell.embl-heidelberg.de/aas/other_images/lb3.gif]
Frequencies of aa exchange
• Substitution matrices
  – Empirically derived from alignment datasets
    • PAM (Dayhoff, 1968)
    • JTT (Jones, Taylor, Thornton, 1992)
    • Gonnet et al. (1992)
    • WAG (Whelan, Goldman, 2001)
    • mtREV (Adachi, Hasegawa, 1996; specific for mitochondrial data)
  – Estimated rate matrix -> series of replacement probability matrices (e.g. PAM1 … PAM250)
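The step from one matrix to a series such as PAM1 … PAM250 can be illustrated by repeated matrix multiplication: the probabilities after n evolutionary steps are the n-th power of the one-step matrix. The 3-state matrix below is invented for illustration only (real PAM matrices have 20 states and empirically estimated values).

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def matrix_power(matrix, n):
    """Return matrix multiplied by itself n times (identity for n = 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, matrix)
    return result


# Hypothetical one-step matrix: each row sums to 1, with a 1% chance of change.
P1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]
P250 = matrix_power(P1, 250)
for row in P250:
    print([round(value, 3) for value in row])
```

Each row of the result still sums to 1, and the diagonal entries shrink as n grows because multiple substitutions accumulate at the same position.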
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
Why?
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
• GC content
  – Differs between species (20-72%)
  – Differs within a genome (isochores)
  – Biased recombination-associated DNA repair
  – Temperature
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
• An exchangeability matrix can be built for a particular dataset
• JTT + F: JTT exchangeabilities combined with the amino acid frequencies observed in the dataset
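The "+ F" option relies on amino acid frequencies counted from the dataset itself. A minimal sketch of that counting step, ignoring gap characters, is shown below; the two short sequences are the first ten columns of ARP2_A and ARP2_B, and the function name is made up for this example.

```python
from collections import Counter


def aa_frequencies(sequences):
    """Return observed amino acid frequencies, ignoring gap characters."""
    counts = Counter(aa for seq in sequences for aa in seq if aa != "-")
    total = sum(counts.values())
    return {aa: count / total for aa, count in sorted(counts.items())}


freqs = aa_frequencies(["MESAP---IV", "MDSQGRKVIV"])
for aa, frequency in freqs.items():
    print(aa, round(frequency, 3))
```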
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
3. Between-site rate variation (among-site substitution rate heterogeneity)
[Figure: alignment]
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
3. Between-site rate variation (among-site substitution rate heterogeneity)
• Variation in substitution rates among different positions
• Mostly discrete gamma model
The gamma distribution is a continuous probability distribution with two parameters: the alpha (shape) parameter and a scaling factor.
  – Infinitely large alpha: the rate is the same for all sites (no rate variation)
  – alpha = 1: extensive rate variation
  – alpha < 1: many invariable sites
[Figure: probability density (y-axis) against relative evolutionary rate (x-axis) for different alpha values. Source: http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Gamma_distribution_pdf.png]
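For reference, the density behind these statements can be written out. The form below assumes the scaling factor is fixed at 1/alpha so that the mean relative rate equals 1, a common convention for among-site rate variation; the symbol r for the relative rate is introduced here for illustration.

```latex
f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1} e^{-\alpha r},
\qquad r > 0,
\qquad \mathrm{E}[r] = 1,
\qquad \mathrm{Var}[r] = \frac{1}{\alpha}
```

Under this parameterization the variance 1/alpha shrinks to zero as alpha grows (all sites evolve at the same rate), while small alpha values put most of the probability mass near r = 0, that is, on slowly evolving sites.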
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
3. Between-site rate variation (among-site substitution rate heterogeneity)
• Variation in substitution rates among different positions
• Mostly discrete gamma model
• Select the number of categories (4/8)
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
3. Between-site rate variation (among-site substitution rate heterogeneity)
4. Presence of invariable sites
Evolutionary models
Notation, e.g.
JTT
JTT + F
JTT + F + gamma (4)
JTT + F + gamma (8) + I (under discussion)
JTT + F + I
• It is not always the most complex model that produces the best result.
• The more complex the model, the more complex the explanation of the results.
Evolutionary models
• Statistical selection of the best-fit model of evolution
  – ProtTest
• AIC (Akaike Information Criterion)
  – simple relationship between the likelihood and the number of parameters, used to estimate the distance of a model from the truth
• BIC (Bayesian Information Criterion)
  – includes a penalty for the number of parameters that grows with the size of the dataset, to avoid overfitting of the selected model
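Both criteria are simple functions of the maximized log-likelihood and the number of free parameters; the model with the lowest score is preferred. The sketch below uses the standard formulas AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The log-likelihoods and parameter counts are invented for illustration and are not ProtTest output; n is set to 396, the number of columns in the ARP2 alignment shown earlier.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical results: model name -> (log-likelihood, number of free parameters)
candidates = {"JTT": (-5210.0, 7), "JTT+G": (-5150.0, 8)}
n_sites = 396

best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_bic = min(candidates, key=lambda m: bic(*candidates[m], n_sites))
print("best by AIC:", best_aic)
print("best by BIC:", best_bic)
```

With these made-up numbers the gain in likelihood outweighs the cost of the extra parameter under both criteria; with a smaller gain, BIC would switch to the simpler model before AIC does.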
Tree-building methods
• Distance (matrix) methods
  1. Calculate distances for all pairs of taxa based on the sequence alignment
  2. Construct a phylogenetic tree based on the distance matrix
• Character-based (sequence) methods
  1. Construct a phylogenetic tree based directly on the sequence alignment
Step 1: Compute distances
Simple measure for the extent of sequence divergence:
p distance: $\hat{p} = n_d / n$
$\hat{p}$ = proportion of differing sites (p distance)
$n_d$ = number of aa differences
$n$ = number of aa used
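The p distance can be computed directly from two aligned sequences. The sketch below assumes that columns with a gap in either sequence are not counted in n, which is one common choice; the example strings are the first ten alignment columns of ARP2_A and ARP2_B.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites among columns without gaps."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    n_d = sum(1 for a, b in pairs if a != b)  # number of aa differences
    return n_d / len(pairs)                   # len(pairs) = number of aa used


p = p_distance("MESAP---IV", "MDSQGRKVIV")
print(round(p, 3))
```

Here seven columns are compared (three contain gaps) and three of them differ, so the p distance is 3/7.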
Step 1: Compute distances
• Relationship of p with t (time)
[Figure: number of substitutions per site (y-axis, ticks at 0.5 and 1.0) against time in million years (x-axis, ticks at 25, 50 and 75).]
Step 1: Compute distances
• Nonlinear relationship of p with t (time)
• Estimate the true number of amino acid substitutions between sequence pairs
  – Poisson correction (PC distance)
  – Gamma correction (Gamma distance)
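Both corrections turn an observed p distance into an estimate of the number of substitutions per site. The sketch below uses the textbook formulas d = -ln(1 - p) for the Poisson correction and d = a[(1 - p)^(-1/a) - 1] for the gamma distance, where a is the gamma shape parameter; the value p = 0.3 is an arbitrary example.

```python
import math


def poisson_distance(p):
    """Poisson-corrected distance: d = -ln(1 - p)."""
    return -math.log(1.0 - p)


def gamma_distance(p, alpha):
    """Gamma distance with shape parameter alpha: d = a * ((1 - p)^(-1/a) - 1)."""
    return alpha * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


p = 0.3
print(round(poisson_distance(p), 4))
print(round(gamma_distance(p, 2.0), 4))
```

Both corrected distances exceed p because they account for multiple substitutions at the same site, and the gamma distance approaches the Poisson-corrected distance as alpha becomes large.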
Step 2: Tree-building
Common distance methods
• Neighbor Joining (NJ)
• Unweighted / weighted pair-group method using arithmetic averages (UPGMA / WPGMA)
• Least Squares (LS)
• Minimum Evolution (ME)
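As an illustration of the simplest of these methods, the sketch below runs UPGMA on the ARP2 distance matrix shown earlier. It repeatedly merges the two closest clusters and averages their distances to all other clusters, weighted by cluster size. Only the topology is returned (as nested tuples); branch lengths are omitted to keep the example short.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Cluster with UPGMA and return the topology as nested tuples."""
    sizes = {name: 1 for name in labels}  # cluster -> number of leaves
    dist = {
        frozenset((a, b)): matrix[i][j]
        for (i, a), (j, b) in combinations(enumerate(labels), 2)
    }
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        for other in sizes:
            if other not in (a, b):
                # Size-weighted average of the distances to the merged clusters.
                dist[frozenset((merged, other))] = (
                    sizes[a] * dist[frozenset((a, other))]
                    + sizes[b] * dist[frozenset((b, other))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


LABELS = ["A", "B", "C", "D", "E"]
MATRIX = [
    [0, 158, 143, 139, 139],
    [158, 0, 107, 97, 97],
    [143, 107, 0, 73, 73],
    [139, 97, 73, 0, 0],
    [139, 97, 73, 0, 0],
]
tree = upgma(LABELS, MATRIX)
print(tree)
```

UPGMA assumes a molecular clock and returns a rooted tree, so the result should be read as an illustration of the clustering principle rather than as a recommended analysis.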
Neighbor Joining (NJ)
• Saitou, Nei (1987)
• Principle
  – Bottom-up clustering method
  – Neighbours are defined as taxa connected by a single node in an unrooted tree; closest neighbours are successively joined by a new node until the tree is resolved.
  – Result: a single, unrooted tree with branch length estimates
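The first joining step can be sketched with the ARP2 distance matrix shown earlier. The code uses the Q criterion commonly given for NJ, Q(i, j) = (n - 2) d(i, j) - r_i - r_j with r_i the sum of distances from taxon i to all others, and the matching branch length formula; only one step is shown, whereas a full implementation repeats it on the reduced matrix until the tree is resolved.

```python
from itertools import combinations


def nj_first_join(labels, matrix):
    """Return the first pair joined by NJ and the two new branch lengths."""
    n = len(labels)
    totals = [sum(row) for row in matrix]  # r_i: summed distance to all taxa

    def q(i, j):
        return (n - 2) * matrix[i][j] - totals[i] - totals[j]

    i, j = min(combinations(range(n), 2), key=lambda pair: q(*pair))
    # Branch lengths from the new internal node to taxa i and j.
    length_i = matrix[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = matrix[i][j] - length_i
    return labels[i], labels[j], length_i, length_j


LABELS = ["A", "B", "C", "D", "E"]
MATRIX = [
    [0, 158, 143, 139, 139],
    [158, 0, 107, 97, 97],
    [143, 107, 0, 73, 73],
    [139, 97, 73, 0, 0],
    [139, 97, 73, 0, 0],
]
first = nj_first_join(LABELS, MATRIX)
print(first)
```

For this matrix the identical sequences D and E are joined first, each with a branch length of zero.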
Neighbor Joining (NJ)
[Figure from: Mol Biol Evol. 1987 Jul;4(4):406-25.]
Neighbor Joining (NJ)
• Advantages
  – Very efficient
  – Suitable also for large datasets
• Disadvantage
  – Does not examine all possible topologies
Character- (Sequence-) based methods
Most common:
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)
• Bayesian Inference
Maximum Parsimony (MP)
• Hennig, 1966
• Originally developed for morphological characters
• William of Ockham (1285-1349, Franciscan friar): the best hypothesis is the one that requires the smallest number of assumptions
• The topology of the result tree is the one that requires the smallest number of evolutionary changes
• Group of related methods
Maximum Parsimony (MP)
• Principle:
  – Estimate the minimum number of substitutions for a given topology
  – Parsimony-informative sites (shared derived characters; invariable sites and singletons are excluded)
  – Searching MP trees
    • Exhaustive search
    • Branch-and-bound (Hendy-Penny, 1982)
      – Good but time-consuming if m > 20
    • Heuristic search
      – Result tree might not be the most parsimonious tree
  – Result
    • Multiple result trees are possible (consensus tree)
    • Most parsimonious tree vs true tree
    • Unrooted result trees
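The first step of the principle, counting the minimum number of substitutions for a given topology, can be sketched with Fitch's algorithm (named here as the technique used; the slides do not specify one). The tree is written as nested pairs and rooted arbitrarily, which does not change the count. The four nucleotide sequences W to Z are invented so that the two topologies receive different scores.

```python
def fitch(node, states):
    """Return (possible ancestral states, minimum changes) for one column.

    node: a leaf name (str) or a (left, right) pair; states: leaf -> character.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_cost = fitch(node[0], states)
    right_set, right_cost = fitch(node[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical data: columns 1-2 favour (W,X), column 3 is a singleton,
# column 4 favours (W,Y).
ALIGNMENT = {"W": "AAGT", "X": "AAGC", "Y": "CTGT", "Z": "CTAC"}
TREE_1 = (("W", "X"), ("Y", "Z"))
TREE_2 = (("W", "Y"), ("X", "Z"))
print(parsimony_score(TREE_1, ALIGNMENT), parsimony_score(TREE_2, ALIGNMENT))
```

The singleton in column 3 costs one change on either topology, which is why such sites are not parsimony-informative; only the other three columns discriminate between the trees.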
Maximum Parsimony (MP)
• Advantages
  – Free from explicit model assumptions (model-free)
• Disadvantages
  – Generally produces multiple result trees
  – Does not take into account homoplasy
  – Long-branch attraction (LBA): creates wrong topologies if the substitution rate varies extensively between lineages
Maximum Likelihood (ML)
• Cavalli-Sforza, Edwards (1967): gene frequency data
• Felsenstein (1981): nucleotide sequences
• Kishino (1990): proteins
• Principle
  – Calculates likelihoods for each position in the alignment and for all possible topologies (gaps generally removed)
  – Result = tree with the highest likelihood
  – Maximizes the likelihood of observing the sequence data for a specific model of character state changes
  – For each topology, the likelihood is maximized over the branch lengths; the topology itself is not a parameter of the maximization
• Search strategies: rarely exhaustive, mostly heuristic
  – NNI (nearest-neighbor interchange)
  – TBR (tree bisection and reconnection)
  – SPR (subtree pruning and regrafting)
Number of possible trees
[Table: number of leaves versus number of unrooted and rooted trees.]
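The table values can be regenerated from the standard counting formulas for bifurcating trees: (2n - 5)!! unrooted trees and (2n - 3)!! rooted trees for n leaves. The short sketch below prints a few rows and shows why exhaustive searches are rarely feasible.

```python
def count_unrooted(n_leaves):
    """Number of unrooted bifurcating trees, (2n - 5)!!, for n >= 3 leaves."""
    count = 1
    for factor in range(3, 2 * n_leaves - 4, 2):
        count *= factor
    return count


def count_rooted(n_leaves):
    """Number of rooted bifurcating trees, (2n - 3)!!, for n >= 2 leaves."""
    # Rooting a tree is equivalent to adding one extra leaf to an unrooted tree.
    return count_unrooted(n_leaves + 1)


print("Leaves  Unrooted  Rooted")
for n in (3, 4, 5, 10, 20):
    print(n, count_unrooted(n), count_rooted(n))
```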
Maximum Likelihood (ML)
• Software
  – PhyML (fast)
  – ProML (Phylip)
  – ProtML
  – RAxML (very fast)
  – …
Bayesian estimation of phylogenies
• Very time-intensive
• Programs: MrBayes, PhyloBayes
[Figure: probability (0 to 1.0) of tree topologies 1, 2 and 3; the prior distribution is combined with the data (observations) to give the posterior distribution.]
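The relationship between prior, data and posterior can be sketched for three candidate topologies with Bayes' rule: the posterior is proportional to the prior times the likelihood of the data. The likelihood values below are invented; real programs such as MrBayes cannot enumerate all topologies and instead approximate the posterior by sampling.

```python
def posterior(prior, likelihood):
    """Combine prior probabilities and likelihoods into posterior probabilities."""
    joint = {tree: prior[tree] * likelihood[tree] for tree in prior}
    total = sum(joint.values())
    return {tree: value / total for tree, value in joint.items()}


# Flat prior over three topologies and hypothetical likelihoods of the data.
prior = {"topology 1": 1 / 3, "topology 2": 1 / 3, "topology 3": 1 / 3}
likelihood = {"topology 1": 0.002, "topology 2": 0.006, "topology 3": 0.002}

post = posterior(prior, likelihood)
for tree, probability in post.items():
    print(tree, round(probability, 2))
```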
Tree evaluation
Analyze how well the data support the result tree
Tests
1. Topology
  • Tree reconciliation (comparison of the gene tree with the species tree)
  • Robustness, e.g. bootstrap, aLRT (PhyML)
2. Branch length tests
Bootstrap
• Introduced by Bradley Efron (1979)
• Applied to phylogenies by Felsenstein (1985)
• Used to test the robustness of a tree topology
• Principle:
  – New MSA datasets are created by randomly choosing N columns, with replacement, from the original MSA, where N is the length of the original MSA
  – Phylogenetic analysis is then performed on all bootstrap replicates
  – The consensus tree indicates bootstrap support for each node
• Mostly 1000 replicates (100 replicates for large datasets)
• Bootstrap support values: min. 98% (strict), min. 95% (accepted)
Create a bootstrap replicate
Seq_1  ILKAEEK
Seq_2  IVRSTQR
Seq_3  IIRSSTK
Seq_4  IIRSTTK
Seq_5  LLKTTSR
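Creating one replicate from this example alignment can be sketched as follows: N column indices are drawn at random with replacement and the same columns are taken from every sequence. The seed is fixed only to make the example reproducible.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


msa = {
    "Seq_1": "ILKAEEK",
    "Seq_2": "IVRSTQR",
    "Seq_3": "IIRSSTK",
    "Seq_4": "IIRSTTK",
    "Seq_5": "LLKTTSR",
}
replicate = bootstrap_replicate(msa, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

Some original columns appear several times in a replicate and others not at all; a tree is then built from each replicate and the frequency of each grouping across replicates gives its bootstrap support.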
Bootstrap and Bayesian support values.
PhyML aLRT
• approximate Likelihood-Ratio Test (aLRT)
• aLRT is a statistical test to compute branch supports: it uses the likelihood score of a branch to calculate the approximate probability that a particular branch really exists in the true tree. It is much faster than bootstrapping.
  1. aLRT
  2. Chi2: parametric branch support
  3. aLRT-SH: non-parametric branch support based on a Shimodaira-Hasegawa-like procedure
  4. aLRT Chi2 and SH: calculates parametric and non-parametric branch support; the result is the minimum support of both methods
Application
Phylogenetic analysis for function prediction
Gene duplication
• Prokaryotes: at least 50%
• Eukaryotes: >90%
After gene duplication
• Coexistence (normally only for a short while)
• Mostly, only one copy is retained
– becomes nonfunctional (non-functionalization),
– becomes a pseudogene (pseudogenization)
– is lost
• Both copies are retained
– Distinct expression pattern
– Distinct subcellular location (rare)
– One copy keeps the original function, the other copy
acquires a new function (neofunctionalization)
– Deleterious mutations in both copies (subfunctionalization)
After gene duplication
• Synfunctionalization
1. Functional divergence of the paralogs (e.g. expression)
2. One paralog takes over (in part or fully) the function of the other paralog, which leads either to
  a. orthologs that are not functionally equivalent, or
  b. gene loss
Gene duplication followed by lineage-specific a) gene loss, b) function shuffling
Steroid hormone receptors
The cephalochordate Branchiostoma floridae diverged from other chordates after duplication of the ancestral SR gene.
BfER, the ortholog of vertebrate estrogen receptors, negatively regulates BfSR.
BfSR is specifically activated by estrogens and recognizes estrogen response elements.
Bridgham JT, et al: PLoS Genet. 2008 Sep 12;4(9)
[Figure; one label reads "ancestral function".]
Phylogenomic databases
Some phylogenomic databases
• COG/KOG
• eggNOG
• Ensembl (Compara)
• HOGENOM
• InParanoid
• OMA browser
• OrthoDB
• OrthoMCL
• PhylomeDB
Phylogenomic databases differ in their
• Goals
• Methodologies
• Number of species
• Taxonomic range
• Hierarchies
• Result presentation
• Update frequencies
Software for phylogenetic analysis
Examples of software packages
• Phylip
• BioNJ
• PhyML
• PAML
• MEGA
• PAUP
• Tree Puzzle
• MrBayes
Servers for phylogenetic analysis
• http://www.phylogeny.fr/
• http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html
• http://atgc.lirmm.fr/phyml/
• http://phylobench.vital-it.ch/raxml-bb/
• http://power.nhri.org.tw/power/home.htm
Take home
• Phylogenetic trees are models - not knowledge
• Data selection is a very important step and can largely facilitate
phylogenetic analysis
• It is not always the most complex evolutionary model that leads to
the best results – but complex models make the interpretation of the
results more difficult!
• The most widely used tree-building method is ML
• Tree evaluation is the major step in phylogenetic analysis
• Orthology prediction is helpful for function assignment, but the function is only known when confirmed by wet lab experiments.
Further reading …
• Masatoshi Nei, Sudhir Kumar. Molecular Evolution and Phylogenetics. Oxford University Press 2000.
• Dan Graur and Wen-Hsiung Li. Fundamentals of Molecular Evolution. Sinauer Associates, Massachusetts.
TP5 1/2: Phylogenetic analysis
http://education.expasy.org/cours/phylo/MPB10_phylo_TP5.html
TP5 2/2 - Analysis Refinement, Interpretation of Results
Preparation for the next course (if you wish, you can work in groups of up to 5)
Phylogenetic analysis of the X,K-ATPase beta subunit family
• Collect homologs from chordates (human, macaque, mouse, rat, chicken, zebra finch, frog, zebrafish, fugu, Ciona intestinalis, C. savignyi); outgroup: Drosophila, Caenorhabditis elegans
• Perform a multiple sequence alignment
• Construct a data model
• Reconstruct a phylogenetic tree using ML
• Create one or more slides to present your analysis results to your colleagues next Monday. It is easiest to paste links to the analysis servers for alignments, trees, etc.
• Please send me your slides by Friday morning; if you work in groups, please indicate the names of your colleagues
Thank You
Remember:
Monday 17 December 2012
The course and practicals will take place in the computer room located in
10-12 Passage Baud-Bovy, behind UniMail