Large Scale Phylogenetic Inference
Mark Pagel and Andrew Meade
Reading University m.pagel@rdg.ac.uk
Large-Scale Phylogenetic Inference: Approaches and Problems
Availability of data
Inference from aligned gene sequences: traversing the universe
MCMC and MCMCMC inference (assessing the potential for large-scale inference)
A model of pattern-heterogeneity suitable for concatenated sequences
A Tree of Life: n = 4000 species (David Hillis)
The accumulation of gene sequence data
Year    No. of sequences
1994    215,273
2001    14,976,310
= 70X growth over 7 years
Compare 20% per annum = 3.6X growth over 7 years
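The growth comparison can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and the implied annual rate is a derived figure that does not appear on the slide.

```python
def fold_growth(start, end):
    """Ratio of the final count to the initial count."""
    return end / start


def implied_annual_rate(fold, years):
    """Constant yearly growth rate that compounds to `fold` over `years`."""
    return fold ** (1.0 / years) - 1.0


seqs_1994 = 215_273
seqs_2001 = 14_976_310
years = 2001 - 1994

fold = fold_growth(seqs_1994, seqs_2001)        # observed fold growth
benchmark = 1.2 ** years                        # 20% per annum, compounded
rate = implied_annual_rate(fold, years)         # yearly rate implied by the data

print(f"observed growth: {fold:.1f}X over {years} years")
print(f"20% per annum:   {benchmark:.1f}X over {years} years")
print(f"implied annual growth rate: {rate:.0%}")
```

Compounding is the point of the comparison: 20% per annum multiplies to only about 3.6X, so the observed growth implies a far higher yearly rate.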
[Figure: vertical axis from 0 to 16,000,000]
Numbers of gene sequences for metazoan phyla
Source: GenBank
All nucleotide sequences
Group        No. of genera   No. of species
Primates     58              263
Carnivores   110             204
Aves         1139            2903
Cytochrome-b
Group        No. of sub/families   No. of genera
Primates     20                    58
Carnivores   12                    93
Number of Possible Phylogenetic Trees
Species   Unrooted        Rooted
3         1               3
4         3               15
5         15              105
6         105             945
10        2,027,025       34,459,425
50        2.83 X 10^74    2.75 X 10^76
[Figure: number of trees plotted against number of tips (species)]
N=50 No. rooted = 27529213532835651545259729751524430639300973035816196098326553772152587890625
No. unrooted = 283806325080779912837729172696128150920628587998105114415737667754150390625
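These counts follow the standard double-factorial formulas for strictly bifurcating, labelled trees: (2n-5)!! unrooted and (2n-3)!! rooted trees for n species. The sketch below reproduces them with exact integer arithmetic; the function names are illustrative.

```python
def odd_double_factorial(k):
    """Return k * (k - 2) * ... * 3 * 1 for odd k (1 if k < 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_trees(n_species):
    """Number of unrooted bifurcating trees, (2n - 5)!!, for n >= 3."""
    if n_species < 3:
        raise ValueError("need at least 3 species")
    return odd_double_factorial(2 * n_species - 5)


def n_rooted_trees(n_species):
    """Number of rooted bifurcating trees, (2n - 3)!!, for n >= 2."""
    if n_species < 2:
        raise ValueError("need at least 2 species")
    return odd_double_factorial(2 * n_species - 3)


for n in (3, 4, 5, 6, 10, 50):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

Python integers are arbitrary precision, so the n = 50 values print in full rather than in scientific notation.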
Markov-Chain Monte Carlo (MCMC) Methods
• Generate a large number of phylogenetic trees from a Markov chain
• At equilibrium, randomly sample from the universe of trees
• Sampling mechanism: the Metropolis-Hastings algorithm
Accept new tree with p = 1.0 if L(T_{n+1}) > L(T_n); otherwise accept with probability L(T_{n+1}) / L(T_n)
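The acceptance rule can be illustrated on a toy problem. The sketch below assumes five hypothetical "trees" with made-up likelihoods and a symmetric proposal that picks any state uniformly; it is not a tree sampler, only a demonstration that the rule visits states in proportion to their likelihood.

```python
import random


def metropolis_sample(likelihoods, n_iter, seed=0):
    """Run a Metropolis chain over discrete states; return visit frequencies."""
    rng = random.Random(seed)
    n_states = len(likelihoods)
    current = rng.randrange(n_states)
    visits = [0] * n_states
    for _ in range(n_iter):
        proposal = rng.randrange(n_states)   # symmetric (uniform) proposal
        ratio = likelihoods[proposal] / likelihoods[current]
        # Accept with p = 1.0 if the likelihood improves, else with p = ratio.
        if ratio >= 1.0 or rng.random() < ratio:
            current = proposal
        visits[current] += 1
    return [v / n_iter for v in visits]


# Hypothetical likelihoods for five candidate trees (illustrative values only).
toy_likelihoods = [1.0, 2.0, 4.0, 2.0, 1.0]
freqs = metropolis_sample(toy_likelihoods, n_iter=200_000)
target = [l / sum(toy_likelihoods) for l in toy_likelihoods]

for f, t in zip(freqs, target):
    print(f"sampled {f:.3f}   expected {t:.3f}")
```

With real alignments the likelihoods are far too small to hold as floating-point numbers, so the ratio would be computed from log-likelihoods instead.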
Sampling the universe of possible trees:
Markov-chain Monte Carlo methods
Long Interspersed Nuclear Elements (LINEs)
-- autonomously replicating retrotransposons
[Diagram: 5’ end, endonuclease, reverse transcriptase, 3’ end; ~6000 bases]
-- as old as mammals (at least)
-- 20-40 active elements
-- 500,000-1,000,000 ‘fossil’ fragments
-- account for ~20% of nucleotide content of human genome
Phylogenetic tree of LINEs in the human genome (n=500), sampled from Markov chain
Convergence of a Markov chain sampling a phylogenetic tree of n=500 tips using an alignment of n=4400 nucleotides
[Figure: log-likelihood (-1,000,000 to -700,000) against iteration number (up to about 6,900,000); inset shows log-likelihood from -700,380 to -700,260 over iterations 5,000,000 to 6,500,000]
NB: 99% of increase in likelihood in first 2.8% of run. 0.07% change in final 2 million iterations
Frequency histogram of log-likelihoods for phylogenetic trees of n=500 LINEs in human genome (alignment = 4000 bp).
Note: unconverged chain.
n = 1000 trees; mean = -700299.7; std. dev. = 15.91
[Figure: histogram; x-axis: log-likelihood, -700350 to -700250; y-axis: frequency, 0 to 50]
Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC)
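Given m simultaneous Markov chains, at each iteration propose a swap of states between a randomly chosen pair of chains i and j, and accept it with probability:
R = \min\left[1, \frac{f_i(y_i)}{f_i(x_i)} \cdot \frac{f_j(y_j)}{f_j(x_j)}\right]
{likelihood ratio chain i * likelihood ratio chain j}
where x denotes a chain's state before the swap and y its state after the swap
[Diagram: states x_i, x_j, x_k before and y_i, y_j, y_k after the swap]
‘Temperatures’ of heated chains
[Figure: temperature against chain number i, with the cold chain labelled; heating schemes 1/i and 1/(1 + t(i-1)), shown for t=0.2 and t=0.5]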
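The heating scheme and the swap rule can be sketched as follows. This assumes that chain k samples from the likelihood raised to the power of its temperature (a flat prior), which is a common choice but is not stated on the slide. Under that assumption the swap ratio reduces to a simple expression in log-likelihoods. Function names are illustrative.

```python
import math


def temperatures(n_chains, t):
    """Heating scheme 1 / (1 + t * (i - 1)) for chains i = 1..n_chains.

    Chain 1 has temperature 1.0 and is the cold chain.
    """
    return [1.0 / (1.0 + t * (i - 1)) for i in range(1, n_chains + 1)]


def swap_acceptance(log_lik_i, log_lik_j, beta_i, beta_j):
    """Probability of accepting a state swap between chains i and j.

    Assumption: chain k targets likelihood ** beta_k. With y_i = x_j and
    y_j = x_i, the product of the two ratios equals
    exp((beta_i - beta_j) * (log_lik_j - log_lik_i)).
    """
    log_r = (beta_i - beta_j) * (log_lik_j - log_lik_i)
    return 1.0 if log_r >= 0.0 else math.exp(log_r)


betas = temperatures(4, t=0.2)
print([round(b, 3) for b in betas])

# A heated chain holding a better tree than the cold chain: swap is accepted.
print(swap_acceptance(-37800.0, -37790.0, betas[0], betas[1]))
# A heated chain holding a worse tree: swap is accepted only occasionally.
print(swap_acceptance(-37800.0, -37810.0, betas[0], betas[1]))
```

Working with log-likelihoods avoids underflow, since the likelihoods themselves are far smaller than the smallest representable floating-point number.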
Swapping behaviour of an MCMCMC analysis
Phylogeny of Human LINE-1 elements (92 elements, 4kb sequences)
[Figure: phylogenetic tree with possum, mouse and gorilla tips and human elements labelled with the prefixes C1, C6, C21 and C22 (e.g. C1.18, C6.20, C21.19R, C22.14); annotations at ~120, ~90 and ~10-15 millions of years ago; scale bar 0.1]
LINEs data (truncated alignment). Simultaneous chains with heating and swapping
[Figure: log-likelihood (-39500 to -37700) against generation (15000 to 135000) for the cold chain, warm chain and hot chain; annotation: chain swapping]
LINEs data: log-likelihoods of trees from cold chain (‘converged’ chain)
[Figure: histogram; x-axis: log-likelihoods, -37820 to -37760; y-axis: frequency, 0 to 45; series: pre-swap trees, post-swap trees]
Pattern-Heterogeneity Model of Gene-Sequence Evolution
Allows different genes in a single concatenated alignment, or different regions of the same gene, to evolve in qualitatively different ways
Contrast with rate heterogeneity, which can only detect differences in rates
Implements pattern-heterogeneity without partitioning the data
P-H will always equal or better the performance of the gamma rate-heterogeneity model, and normally yields substantial improvements (100s of log-units)
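One way to allow several patterns without partitioning the data is to treat each site's likelihood as a weighted sum over rate matrices. The sketch below assumes that mixture formulation; the slide does not spell it out. The per-site likelihoods are made-up values standing in for what a tree likelihood calculation would return under each matrix.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a mixture of rate matrices.

    site_likelihoods[s][k] is the likelihood of site s under matrix k;
    weights[k] is the mixture weight of matrix k (weights sum to 1).
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    total = 0.0
    for per_matrix in site_likelihoods:
        total += math.log(sum(w * l for w, l in zip(weights, per_matrix)))
    return total


# Hypothetical per-site likelihoods: sites 1-2 fit matrix 1, sites 3-4 fit matrix 2.
site_liks = [
    [0.020, 0.002],
    [0.030, 0.004],
    [0.001, 0.025],
    [0.003, 0.015],
]

single_1 = mixture_log_likelihood(site_liks, [1.0, 0.0])   # matrix 1 only
single_2 = mixture_log_likelihood(site_liks, [0.0, 1.0])   # matrix 2 only
mixed = mixture_log_likelihood(site_liks, [0.5, 0.5])      # both, equal weights

print(f"matrix 1 only: {single_1:.3f}")
print(f"matrix 2 only: {single_2:.3f}")
print(f"mixture:       {mixed:.3f}")
```

No site is assigned to a matrix in advance: every site is evaluated under every matrix, and a weight of 1.0 on one matrix recovers the single-matrix model as a special case.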
Applications
Detecting regions of genes that evolve differently
Large-scale inference: suitable for concatenated gene sequences (e.g. a recent phylogeny of the mammals was based upon 16 genes and 16,000 nucleotides), or “supermatrix” alignments
Applications of pattern-heterogeneity model
[Diagram: three alignment layouts, with regions labelled pattern 1, pattern 2 and pattern 3. Single gene alignment (rows species 1 ... species n); concatenated gene alignment (columns gene 1, gene 2, gene 3 ... gene k; rows species 1, species 2 ... species n); “supermatrix” alignment (rows species 1, species 2 ... species n-k ... species n)]
[Figure: example binary (0/1) character matrix with one row per species, e.g. 'Oceanodroma_hornbyi', 'Gavia_stellata', 'Spheniscus_demersus', 'Diomedea_exulans', 'Thalassarche_cauta', 'Pterodroma_cookii', 'Puffinus_puffinus'; rows differ in length]
Testing the Pattern Heterogeneity Model: two different rate matrices
[Two lower-triangular rate matrices over A, C, G, T, shown side by side. Entries as extracted, by row:
C: 5.0, 0.5
G: 0.5, 5.0, 8.0, 1.5
T: 1.5, 3.0, 0.1, 8.0, 1.0, 1.0]
Generate data on a known tree according to these two matrices and form a concatenated alignment: ‘gene 1’ = 600 bases, ‘gene 2’ = 400 bases
log-likelihoods obtained from three models applied to simulated pattern-heterogeneity data
log-likelihoods by site in the simulated pattern-heterogeneity data
Pattern-heterogeneity model: Simulated and obtained values of the rate parameters
log-likelihoods for combined LSU/SSU nrRNA data set: 54 species n=800 sites
log-likelihoods by site in the LSU/SSU combined data set
The divide between the two genes
Log-likelihoods for cytochrome-b data set. N=433 sites of which 300 are fixed for a single nucleotide
Metropolis-Hastings Algorithm: Accept new tree according to
R = \min\left[1, \frac{f(X \mid T')}{f(X \mid T)} \times \frac{f(T')}{f(T)} \times \frac{f(T \mid T')}{f(T' \mid T)}\right]
(likelihood ratio × prior ratio × proposal ratio)
X = data (e.g., gene sequences); T = tree (topology, branches, parameters)
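In practice the three ratios are usually combined on the log scale, because tree likelihoods are too small to represent directly. The following is a minimal sketch of the acceptance step only; the function and argument names are hypothetical, and the likelihood, prior and proposal densities are assumed to be computed elsewhere.

```python
import math
import random


def log_acceptance_ratio(log_lik_new, log_lik_old,
                         log_prior_new, log_prior_old,
                         log_q_old_given_new, log_q_new_given_old):
    """log of (likelihood ratio * prior ratio * proposal ratio)."""
    return ((log_lik_new - log_lik_old)
            + (log_prior_new - log_prior_old)
            + (log_q_old_given_new - log_q_new_given_old))


def accept(log_r, rng=random):
    """Accept with probability min(1, exp(log_r))."""
    if log_r >= 0.0:
        return True
    return rng.random() < math.exp(log_r)


# Example: the proposed tree improves the log-likelihood by 10 units,
# with equal priors and a symmetric proposal (both ratios equal to 1).
log_r = log_acceptance_ratio(-700290.0, -700300.0, 0.0, 0.0, 0.0, 0.0)
print(log_r, accept(log_r))
```

With a symmetric proposal and a flat prior the last two terms vanish, and the rule reduces to accepting according to the likelihood ratio alone.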