Large-Scale Phylogenetic Inference


Mark Pagel and Andrew Meade

Reading University m.pagel@rdg.ac.uk

Large-Scale Phylogenetic Inference: Approaches and Problems

Availability of data

Inference from aligned gene sequences: traversing the universe

MCMC and MCMCMC inference (assessing the potential for large-scale inference)

A model of pattern-heterogeneity suitable for concatenated sequences

A Tree of Life, n = 4000 species (David Hills)

The accumulation of gene sequence data

Year    No. of sequences
1994    215,273
2001    14,976,310

= 70X growth over 7 years

Compare: 20% per annum = 3.6X growth over 7 years
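The two growth figures above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are hypothetical and only the sequence counts come from the slide.

```python
def growth_factor(start, end):
    """Overall fold-change between two counts."""
    return end / start

def compound_factor(rate, years):
    """Fold-change produced by a fixed annual growth rate."""
    return (1.0 + rate) ** years

def implied_annual_rate(start, end, years):
    """Constant annual rate that would turn `start` into `end`."""
    return (end / start) ** (1.0 / years) - 1.0

# Counts quoted on the slide for 1994 and 2001.
observed = growth_factor(215_273, 14_976_310)
baseline = compound_factor(0.20, 7)
rate = implied_annual_rate(215_273, 14_976_310, 7)
print(f"observed: {observed:.1f}X, 20% p.a.: {baseline:.1f}X, implied rate: {rate:.0%}")
```

The implied annual rate is not stated on the slide; it is included only to put the 70X figure on the same footing as the 20% per annum comparison.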


Numbers of gene sequences for metazoan phyla

Source: GenBank

All nucleotide sequences

Group         No. of genera    No. of species
Primates      58               263
Carnivores    110              204
Aves          1139             2903

Cytochrome-b

Group         No. of sub/families    No. of genera
Primates      20                     58
Carnivores    12                     93


Number of Possible Phylogenetic Trees

Species    Unrooted        Rooted
3          1               3
4          3               15
5          15              105
6          105             945
10         2,027,025       34,459,425
50         2.83 x 10^74    2.75 x 10^76

[Figure: No. of trees plotted against No. of tips (species)]

N=50

No. rooted = 27529213532835651545259729751524430639300973035816196098326553772152587890625

No. unrooted = 283806325080779912837729172696128150920628587998105114415737667754150390625
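The counts in the table follow the standard double-factorial formulas for strictly bifurcating trees with labelled tips: (2n-5)!! unrooted and (2n-3)!! rooted trees for n species. A minimal sketch, with hypothetical function names, that reproduces the small entries:

```python
def double_factorial(k):
    """Product k * (k-2) * (k-4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def num_unrooted_trees(n_species):
    """Number of unrooted bifurcating trees: (2n - 5)!!, defined for n >= 3."""
    if n_species < 3:
        raise ValueError("need at least 3 species")
    return double_factorial(2 * n_species - 5)

def num_rooted_trees(n_species):
    """Number of rooted bifurcating trees: (2n - 3)!!, defined for n >= 2."""
    if n_species < 2:
        raise ValueError("need at least 2 species")
    return double_factorial(2 * n_species - 3)

for n in (3, 4, 5, 6, 10, 50):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Python integers are arbitrary precision, so the n=50 values print in full rather than in scientific notation.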

Sampling the Universe of Phylogenetic Trees

Markov-Chain Monte Carlo (MCMC) Methods

• Generate a large number of phylogenetic trees from a Markov chain

• At equilibrium, the chain randomly samples from the universe of trees

• Sampling mechanism: the Metropolis-Hastings algorithm

Accept the new tree with p = 1.0 if $L(T_{n+1}) > L(T_n)$; otherwise accept it with probability $L(T_{n+1}) / L(T_n)$
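The acceptance rule above can be sketched on a toy problem. In this illustration four labelled "trees" with made-up likelihoods stand in for tree space, and the proposal is symmetric, so the likelihood ratio alone decides acceptance. All names and values are hypothetical; a real analysis would propose topology and branch-length changes and compute likelihoods from an alignment.

```python
import math
import random

def accept(loglik_new, loglik_old, rng):
    """Always accept an improvement; otherwise accept with probability
    L(new) / L(old), computed from log-likelihoods."""
    if loglik_new > loglik_old:
        return True
    return rng.random() < math.exp(loglik_new - loglik_old)

def run_chain(loglik, propose, start, n_iter, rng):
    """Run the chain and return the state held at each iteration."""
    state, samples = start, []
    for _ in range(n_iter):
        candidate = propose(state, rng)
        if accept(loglik(candidate), loglik(state), rng):
            state = candidate
        samples.append(state)
    return samples

# Toy stand-in for tree space: unnormalised likelihoods 1, 2, 3, 4.
LIKELIHOOD = {"t1": 1.0, "t2": 2.0, "t3": 3.0, "t4": 4.0}

def toy_loglik(tree):
    return math.log(LIKELIHOOD[tree])

def toy_propose(tree, rng):
    # Symmetric proposal: any other tree, chosen uniformly at random.
    return rng.choice([t for t in LIKELIHOOD if t != tree])

samples = run_chain(toy_loglik, toy_propose, "t1", 40_000, random.Random(1))
for tree in LIKELIHOOD:
    print(tree, round(samples.count(tree) / len(samples), 3))
```

With a symmetric proposal and no prior, the chain should visit each toy tree in proportion to its likelihood, which is the sense in which the equilibrium chain samples the universe of trees.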

Sampling the universe of possible trees:

Markov-chain Monte Carlo methods

Long Interspersed Nuclear Elements -- LINEs

[Diagram: element of ~6000 bases, 5’ to 3’, with endonuclease and reverse transcriptase regions]

-- autonomously replicating retrotransposons

-- as old as mammals (at least)

-- 20-40 active elements

-- 500,000-1,000,000 ‘fossil’ fragments

-- account for ~20% of nucleotide content of human genome

Phylogenetic tree of LINEs in the human genome (n=500), sampled from a Markov chain

Convergence of a Markov chain sampling phylogenetic trees of n=500 tips using an alignment of n=4400 nucleotides

[Figure: trace of log-likelihood against iteration number, with an inset of the later iterations]

NB: 99% of the increase in likelihood occurs in the first 2.8% of the run; 0.07% change in the final 2 million iterations

Frequency histogram of log-likelihoods for phylogenetic trees of n=500 LINEs in the human genome (alignment = 4000 bp).

Note: unconverged chain.

n = 1000 trees; mean = -700299.7; std. dev. = 15.91

[Figure: histogram of frequency against log-likelihood]

Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC)

Given m simultaneous Markov chains, at each iteration propose a swap of states between a randomly chosen pair of chains i and j, accepting it according to:

$$R = \min\left[1,\ \frac{f_i(y_i)\, f_j(y_j)}{f_i(x_i)\, f_j(x_j)}\right]$$

{likelihood ratio chain i * likelihood ratio chain j}

[Diagram: chains with states x_i, x_j, x_k and y_i, y_j, y_k; the cold chain is labelled]
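The swap ratio can be evaluated in log space. The sketch below makes two assumptions that the slide does not state: that chain i targets the posterior raised to a heat beta_i (so log f_i(x) = beta_i * log p(x)), and that after a swap y_i = x_j and y_j = x_i. Function and argument names are hypothetical.

```python
import math

def swap_log_ratio(beta_i, beta_j, logp_i, logp_j):
    """log of [f_i(y_i) f_j(y_j)] / [f_i(x_i) f_j(x_j)] when the two chains
    exchange states, assuming log f_k(x) = beta_k * log p(x)."""
    ratio_chain_i = beta_i * (logp_j - logp_i)  # chain i moves to chain j's state
    ratio_chain_j = beta_j * (logp_i - logp_j)  # chain j moves to chain i's state
    return ratio_chain_i + ratio_chain_j

def swap_probability(beta_i, beta_j, logp_i, logp_j):
    """R = min(1, ratio), computed without overflow."""
    return math.exp(min(0.0, swap_log_ratio(beta_i, beta_j, logp_i, logp_j)))

# A heated chain that has found a better state hands it to the cold chain.
print(swap_probability(1.0, 0.5, logp_i=-100.0, logp_j=-90.0))
# The reverse exchange is accepted only rarely.
print(swap_probability(1.0, 0.5, logp_i=-90.0, logp_j=-100.0))
```

Under these assumptions the log ratio simplifies to (beta_i - beta_j) * (logp_j - logp_i), so a swap is always accepted when the hotter chain currently holds the higher-scoring state.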

‘Temperatures’ of heated chains

[Figure: temperature against number of chains, i, for the schemes 1/i and 1/(1+t(i-1)) with t=0.2 and t=0.5]
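The two heating schemes named on the slide can be tabulated directly. This sketch assumes chains are numbered from i = 1, so that the first chain has temperature 1 under both schemes and plays the role of the cold chain; the function names are hypothetical.

```python
def heat_reciprocal(i):
    """Scheme 1/i."""
    return 1.0 / i

def heat_incremental(i, t):
    """Scheme 1/(1 + t(i - 1))."""
    return 1.0 / (1.0 + t * (i - 1))

print("i   1/i     t=0.2   t=0.5")
for i in range(1, 7):
    print(f"{i}   {heat_reciprocal(i):.3f}   "
          f"{heat_incremental(i, 0.2):.3f}   {heat_incremental(i, 0.5):.3f}")
```

Smaller values of t keep neighbouring chains closer in temperature; setting t = 1 makes the second scheme coincide with 1/i.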

Swapping behaviour of an MCMCMC analysis

Phylogeny of Human LINE-1 elements (92 elements, 4kb sequences)

[Figure: phylogenetic tree of the elements, with possum, mouse and gorilla sequences; nodes annotated ~120, ~90 and ~10-15 millions of years ago; scale bar 0.1]

LINEs data (truncated alignment). Simultaneous chains with heating and swapping

[Figure: traces for the cold, warm and hot chains against generation, with chain swapping marked]

LINEs data: log-likelihoods of trees from the cold chain (‘converged’ chain)

[Figure: histogram of log-likelihoods for pre-swap trees and post-swap trees]


Pattern-Heterogeneity Model of Gene-Sequence Evolution

Allow for different genes in a single concatenated alignment or different regions of the same gene to evolve in qualitatively different ways

Contrast with rate heterogeneity, which can only detect differences in rates

Implement pattern-heterogeneity without partitioning data

The P-H model will always equal or better the performance of the gamma rate-heterogeneity model. It normally yields substantial improvements (hundreds of log-likelihood units)
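The slides do not spell out how pattern-heterogeneity is fitted without partitioning. One way to avoid assigning sites to partitions, assumed here purely for illustration, is a mixture: each site's likelihood is a weighted sum of its likelihoods under the candidate patterns, and the log-likelihood of the alignment is the sum of the logs of those per-site values. The function name and the numbers below are hypothetical.

```python
import math

def mixture_log_likelihood(site_likelihoods, weights):
    """site_likelihoods[s][k] is the likelihood of site s under pattern k;
    weights[k] is the mixture weight of pattern k (weights sum to 1)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    total = 0.0
    for per_pattern in site_likelihoods:
        total += math.log(sum(w * l for w, l in zip(weights, per_pattern)))
    return total

# Two made-up sites: the first fits pattern 1 well, the second fits pattern 2.
sites = [[0.20, 0.01], [0.01, 0.30]]
print(mixture_log_likelihood(sites, [1.0, 0.0]))  # a single pattern for all sites
print(mixture_log_likelihood(sites, [0.5, 0.5]))  # two patterns, no partition given
```

Putting all weight on one pattern recovers the single-matrix model, so in this formulation a fitted mixture cannot score worse than that model.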

Applications

Detecting regions of genes that evolve differently

Large-scale inference: suitable for concatenated gene sequences (e.g. recent phylogeny of the mammals was based upon 16,000 nucleotides and 16 genes), or “supermatrix” alignments

Applications of pattern-heterogeneity model

[Diagram: three alignment layouts, with regions labelled pattern 1, pattern 2 and pattern 3]

Single gene alignment: species 1 ... species n

Concatenated gene alignment: gene 1, gene 2, gene 3, ..., gene k; species 1, species 2, ..., species n

“Supermatrix” alignment: species 1, species 2, ..., species n-k, ..., species n

'Oceanodroma_hornbyi'

'Gavia_stellata'

'Gavia_immer'

'Spheniscus_demersus'

'Pygoscelis_adeliae'

'Eudyptula_minor'

'Eudyptes_pachyrhynchus'

'Megadyptes_antipodes'

0001000000000000011010

0000000000000000000110

0000000000000000000110

1110000000000000000001

00000000000001

010000000000011110000000000000000001

11000000000001

110000000000010110000000000000000001

'Fregetta_grallaria' 00000000101000000000000000000000000000000000000000000000000000000000000000000000011010

'Pygoscelis_antarctica' 00100000000000000000000000000000000000000000000000000000000000000000000000000000000001

'Pygoscelis_papua' 00100000000000000000000000000000000000000000000000000000000000000000000000000000000001 0010000000000000000001

'Eudyptes_chrysolophus' 11000000000000000000000000000000000000000000000000000000000000000000000000000000000001

'Eudyptes_chrysocome' 11000000000000000000000000000000000000000000000000000000000000000000000000000000000001

'Aptenodytes_patagonicus' 01000000000000000000000000000000000000000000000000000000000000000000000000000000000001 0000000000000000000001

'Oceanodroma_melania' 00000001000000000000000000000000000000000000000000000000000000000000000000000000000110

'Oceanodroma_tethys' 00010001000000000000000000000000000000000000000000000000000000000000000000000000000110

'Halocyptena_microsoma' 00010001000000000000000000000000000000000000000000000000000000000000000000000000000110

'Oceanodroma_furcata' 00001010000000000000000000000000000000000000000000000000000000000000000000000000000110 0001000000000000011010

'Oceanodroma_tristrami' 00000110000000000000000000000000000000000000000000000000000000000000000000000000000110

'Oceanites_oceanicus' 00000000000000000000000000000000000000000000000000000000000000000000000000000000011010 0000000000000000011010

'Fregetta_tropica' 00000000101000000000000000000000000000000000000000000000000000000000000000000000011010

'Garrodia_nereis' 00000000011000000000000000000000000000000000000000000000000000000000000000000000011010

'Pelagodroma_marina' 0000000001100000000000000000000000000000000000000000000000000000000000000000000001101000000000000110

'Pelecanoides_garnotii' 00000000000000000000000000000000000000000000000000000000000000000000000000000110101010

'Pelecanoides_magellani' 00000000000000000000000000000001100000000000000000000000000000000000000000000110101010

'Pelecanoides_georgicus' 00000000000000000000000000000001100000000000000000000000000000000000000000000110101010000000001110100000000000000110101010

'Lugensa_brevirostris' 00000000000000000000000000000000000000000000000000000000000000000000001010101010101010 0000000001011010101010

'Calonectris_leucomelas' 00000000000000000000000000000000000000000000000000000000000000000011011010101010101010

'Puffinus_opisthomelas' 00000000000000000000000000000000000000000000000000000000000000011101011010101010101010

'Procellaria_westlandica' 0000000000000000000000000000000000000000000000000000010000100000000000011010101010101000000101011010

'Procellaria_parkinsoni' 00000000000000000000000000000000000000000000000000001100001000000000000110101010101010

'Procellaria_aequinoctialis' 00000000000000000000000000000000000000000000000000001100001000000000000110101010101010

'Pachyptila_turtur' 0000000000000000000000000000000000000000000000000000000011000000000000011010101010101000010000111010

'Pachyptila_desolata' 00000000000000000000000000000000000000000000000000000011110000000000000110101010101010

'Pachyptila_salvini' 00000000000000000000000000000000000000000000000000000011110000000000000110101010101010

'Pachyptila_vittata' 00000000000000000000000000000000000000000000000000000001110000000000000110101010101010000100001110100000000000111010101010

'Halobaena_caerulea' 00000000000000000000000000000000000000000000000000000000010000000000000110101010101010

'Thalassoica_antarctica' 00000000000000000000000000000000000000000000000000010000000000000000000001101010101010 0000000000111010101010

'Daption_capense' 0000000000000000000000000000000000000000000000000011000000000000000000000110101010101000000000001010

'Macronectes_halli' 00000000000000000000000000000000000000000000000101110000000000000000000001101010101010

'Phoebastria_irrorata' 00000000000000000001000000000010000000000000000000000000000000000000000000000001101010 0000011000000001101010

'Phoebastria_nigripes' 00000000000110000001000000000010000000000000000000000000000000000000000000000001101010 0000001000000001101010

'Diomedea_sanfordi' 00000000000000000110000000000010000000000000000000000000000000000000000000000001101010

'Diomedea_dabbenena' 00000000000000001010000000000010000000000000000000000000000000000000000000000001101010

'Diomedea_antipodensis' 00000000000001011010000000000010000000000000000000000000000000000000000000000001101010

'Diomedea_gibsoni' 00000000000001011010000000000010000000000000000000000000000000000000000000000001101010

'Thalassarche_impavida' 00000000000000000000100011010100000000000000000000000000000000000000000000000001101010

'Thalassarche_melanophris' 00000000000000000000100011010100000000000000000000000000000000000000000000000001101010

'Thalassarche_salvini' 00000000000000000000011101010100000000000000000000000000000000000000000000000001101010

'Thalassarche_eremita' 00000000000000000000011101010100000000000000000000000000000000000000000000000001101010

'Thalassarche_cauta' 00000000000000000000001101010100000000000000000000000000000000000000000000000001101010 0000000000000001101010

'Thalassarche_bassi' 00000000000000000000000000110100000000000000000000000000000000000000000000000001101010

'Thalassarche_chlororhynchos' 00000000000000000000000000110100000000000000000000000000000000000000000000000001101010

'Pterodroma_axillaris' 0000000000000000000000000001

'Pterodroma_cervicalis' 1000000000000000000000000001

'Pterodroma_hypoleuca' 000000000000000000000000001000000000000000000000000000000000000000000000011000000000000000000000000000011010101010 0000000011011010101010

'Pterodroma_defilippiana' 0111000000000000000000001110

'Pterodroma_cookii' 011100000000000000000000111000000000000000000000000000000000010000000000011000000000000000000000000000011010101010000000110110100000000011011010101010

'Pterodroma_leucoptera' 0011000000000000000000001110

'Pterodroma_brevipes' 0001000000000000000000001110

'Pterodroma_longirostris' 000010000000000000000000111000000000000000000000000000000000010000000000011000000000000000000000000000011010101010

'Pterodroma_pycrofti' 0000100000000000000000001110

'Pterodroma_inexpectata' 00000000000000000000010101100000000000000000000000000000000000010000000110100000000000000000000000000001101010101000000011011010

'Pterodroma_ultima' 0000000000000000000011010110

'Pterodroma_solandri' 0000000000000000000111010110

'Pterodroma_macroptera' 000000000000000000111101011000000000000000000000000000000000000001110110101000000000000000000000000000011010101010

'Pterodroma_magentae' 000000000000000001111101011000000000000000000000000000000000000000010110101000000000000000000000000000011010101010

'Pterodroma_lessonii' 000000000000001011111101011000000000000000000000000000000000000001110110101000000000000000000000000000011010101010

'Pterodroma_incerta' 000000000000011011111101011000000000000000000000000000000000000000110110101000000000000000000000000000011010101010

'Pterodroma_hasitata' 000000000000111011111101011000000000000000000000000000000000000000001110101000000000000000000000000000011010101010 0000000011011010101010

'Pterodroma_cahow' 000000000000111011111101011000000000000000000000000000000000000010001110101000000000000000000000000000011010101010

'Pterodroma_mollis' 000000000000000111111101011000000000000000000000000000000000000000000010101000000000000000000000000000011010101010

'Pterodroma_madeira' 0000000000010001111111010110

'Pterodroma_feae' 000000000001000111111101011000000000000000000000000000000000000010001110101000000000000000000000000000011010101010

'Pterodroma_alba' 0000000000000000000000110110

'Pterodroma_heraldica' 0000000001100000000000110110

'Pterodroma_sandwichensis' 0000010001100000000000110110

'Pterodroma_phaeopygia' 000001000110000000000011011000000000000000000000000000000000001100000001101000000000000000000000000000011010101010

'Pterodroma_neglecta' 000000101010000000000011011000000000000000000000000000000000000000000001101000000000000000000000000000011010101010

'Pterodroma_externa' 000000101010000000000011011000000000000000000000000000000000001100000001101000000000000000000000000000011010101010

'Pterodroma_arminjoniana' 0000000110100000000000110110 0000000011011010101010

'Diomedea_epomophora' 001010000000000110 00000000000000000110000000000010000000000000000000000000000000000000000000000001101010001000000001100000111000000001101010

'Diomedea_amsterdamensis' 001010000000000110 00000000000000111010000000000010000000000000000000000000000000000000000000000001101010

'Phoebastria_immutabilis' 000110000000000110 00000000000110000001000000000010000000000000000000000000000000000000000000000001101010 0000111000000001101010

'Phoebastria_albatrus' 000110000000000110 00000000000010000001000000000010000000000000000000000000000000000000000000000001101010

'Phoebetria_palpebrata' 100001000000000110 00000000000000000000000000001100000000000000000000000000000000000000000000000001101010

'Phoebetria_fusca' 100001000000000110 00000000000000000000000000001100000000000000000000000000000000000000000000000001101010

'Thalassarche_chrysostoma' 010001000000000110 00000000000000000000000011010100000000000000000000000000000000000000000000000001101010 0000000000000001101010

'Thalassarche_bulleri' 010001000000000110 00000000000000000000000101010100000000000000000000000000000000000000000000000001101010001000000001100000000000000001101010

'Fulmarus_glacialoides' 000000100000011010 00000000000000000000000000000000000000000000000011110000000000000000000001101010101010

'Hydrobates_pelagicus' 000000000000000001 00001010000000000000000000000000000000000000000000000000000000000000000000000000000110

'Oceanodroma_castro' 000000000000000001

'Pterodroma_baraui' 000000001110 0000000110100000000000110110

'Pagodroma_nivea' 000000110110 00000000000000000000000000000000000000000000000000000000000000000000000001101010101010

'Procellaria_cinerea' 000011010110000000000001101010

'Pseudobulweria_rostrata' 010101010110

'Pseudobulweria_aterrima' 010101010110

'Pterodroma_nigripennis' 000000001110 100000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000011010101010

'Macronectes_giganteus' 100000110110000000000000011010 00000000000000000000000000000000000000000000000101110000000000000000000001101010101010 0000000000001010101010

'Calonectris_diomedea' 001101010110000000000011101010 00000000000000000000000000000000000000000000000000000000000000000011011010101010101010 0000000100111010101010

'Bulweria_bulwerii' 000011010110000000000000101010 00000000000000000000000000000000000000000000000000000000001000000000000110101010101010

'Pelecanoides_urinatrix' 000000000010 00000000000000000000000000000000100000000000000000000000000000000000000000000110101010 0000000000000110101010

'Oceanodroma_leucorhoa' 000000000001 00000110000000000000000000000000000000000000000000000000000000000000000000000000000110 0001000000000000011010

'Diomedea_exulans' 000000000001 00000000000000111010000000000010000000000000000000000000000000000000000000000001101010

'Fulmarus_glacialis' 00000000100000110110000000100000011010 00000000000000000000000000000000000000000000000011110000000000000000000001101010101010

'Puffinus_creatopus' 10000011 00000000000000000000000000000000000000000000000000000000000110000000111010101010101010

'Puffinus_carneipes' 10000011 00000000000000000000000000000000000000000000000000000000000110000000111010101010101010

'Puffinus_gravis' 00000011 00000000000000000000000000000000000000000000000000000000000010000000111010101010101010

'Puffinus_griseus' 00000011 00000000000000000000000000000000000000000000000000000000000010000000111010101010101010000011010110100000000100111010101010

'Puffinus_tenuirostris' 00000011

'Puffinus_bulleri' 01000011 00000000000000000000000000000000000000000000000000000000000001000000111010101010101010

'Puffinus_pacificus' 01000011001101010110 00000000000000000000000000000000000000000000000000000000000001000000111010101010101010

'Puffinus_nativitatis' 00000101 00000000000000000000000000000000000000000000000000000000000000000101011010101010101010

'Puffinus_mauretanicus' 00101101 000000011111101010

'Puffinus_yelkouan' 00101101 000000011111101010

'Puffinus_gavia' 00011101

'Puffinus_huttoni' 00011101 0000000000000000000000000000000000000000000000000000000000000000110101101010101010101000001101011010

'Puffinus_assimilis' 00001101 000000000111101010 00000000000000000000000000000000000000000000000000000000000000111101011010101010101010

'Puffinus_lherminieri' 00001101 00000000000000000000000000000000000000000000000000000000000000111101011010101010101010

'Puffinus_auricularis' 00001101

'Puffinus_puffinus' 00001101 000000001111101010 00000000000000000000000000000000000000000000000000000000000000011101011010101010101010

Testing the Pattern Heterogeneity Model: two different rate matrices

Rows and columns in the order A, C, G, T (diagonal ---):

A    ---
C    5.0  0.5  ---
G    0.5  5.0  8.0  1.5  ---
T    1.5  3.0  0.1  8.0  1.0  1.0  ---

Generate data on a known tree according to these two matrices and form a concatenated alignment: ‘gene 1’ = 600 bases, ‘gene 2’ = 400 bases
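The simulation design can be sketched as follows. Everything specific in this block is an assumption made for illustration: the three-species tree, its branch lengths, the two sets of rates (which are not a transcription of the slide's matrices), symmetric rates with a uniformly random root sequence, and all function names. Each site is evolved along each branch as a continuous-time Markov chain, and the two simulated genes are then joined end to end.

```python
import random

BASES = "ACGT"

def make_rates(ac, ag, at, cg, ct, gt):
    """Symmetric table of substitution rates between distinct bases."""
    pairs = {"AC": ac, "AG": ag, "AT": at, "CG": cg, "CT": ct, "GT": gt}
    rates = {b: {} for b in BASES}
    for (x, y), r in pairs.items():
        rates[x][y] = r
        rates[y][x] = r
    return rates

def evolve_site(base, rates, length, rng):
    """Evolve one site along a branch: exponential waiting times, then jumps
    to another base in proportion to the rates out of the current base."""
    elapsed = 0.0
    while True:
        targets = rates[base]
        elapsed += rng.expovariate(sum(targets.values()))
        if elapsed > length:
            return base
        base = rng.choices(list(targets), weights=list(targets.values()))[0]

def simulate_gene(tree, rates, n_sites, rng):
    """tree: a tip name (str) or a list of (child, branch_length) pairs."""
    root = [rng.choice(BASES) for _ in range(n_sites)]
    tips = {}
    def descend(node, seq):
        if isinstance(node, str):
            tips[node] = "".join(seq)
            return
        for child, length in node:
            descend(child, [evolve_site(b, rates, length, rng) for b in seq])
    descend(tree, root)
    return tips

def concatenated_alignment(tree, rates_1, rates_2, rng, n_1=600, n_2=400):
    """Simulate 'gene 1' and 'gene 2' on the same tree and join them."""
    gene_1 = simulate_gene(tree, rates_1, n_1, rng)
    gene_2 = simulate_gene(tree, rates_2, n_2, rng)
    return {name: gene_1[name] + gene_2[name] for name in gene_1}

# Hypothetical tree and rate values, chosen only so the two genes differ.
TREE = [("sp1", 0.05), ([("sp2", 0.02), ("sp3", 0.02)], 0.03)]
RATES_1 = make_rates(4.0, 0.5, 1.0, 6.0, 0.2, 1.0)
RATES_2 = make_rates(0.5, 4.0, 2.0, 1.0, 6.0, 1.0)

alignment = concatenated_alignment(TREE, RATES_1, RATES_2, random.Random(0))
for name, seq in alignment.items():
    print(name, len(seq), seq[:30])
```

The rates are left unnormalised, so branch lengths here are in arbitrary time units rather than expected substitutions per site.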

log-likelihoods obtained from three models applied to simulated pattern-heterogeneity data

log-likelihoods by site in the simulated pattern-heterogeneity data

Pattern-heterogeneity model: Simulated and obtained values of the rate parameters

log-likelihoods for combined LSU/SSU nrRNA data set: 54 species n=800 sites

log-likelihoods by site in the LSU/SSU combined data set

The divide between the two genes

Log-likelihoods for cytochrome-b data set. N=433 sites of which 300 are fixed for a single nucleotide

Metropolis-Hastings Algorithm: Accept new tree according to

$$R = \min\left[1,\ \frac{f(X \mid T')}{f(X \mid T)} \times \frac{f(T')}{f(T)} \times \frac{f(T \mid T')}{f(T' \mid T)}\right]$$

(likelihood ratio x prior ratio x proposal ratio)

X=data (e.g., gene sequences) T=tree (topology, branches, parameters)
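Because tree likelihoods are extremely small numbers, the three ratios are usually combined on the log scale. A minimal sketch with hypothetical argument names, where each argument is the log of the corresponding term in the expression for R:

```python
import math

def log_mh_ratio(loglik_new, loglik_old,
                 logprior_new, logprior_old,
                 logq_back, logq_forward):
    """log of (likelihood ratio) x (prior ratio) x (proposal ratio).
    logq_back = log f(T | T'), logq_forward = log f(T' | T)."""
    likelihood_term = loglik_new - loglik_old
    prior_term = logprior_new - logprior_old
    proposal_term = logq_back - logq_forward
    return likelihood_term + prior_term + proposal_term

def acceptance_probability(*log_terms):
    """R = min(1, ratio), evaluated without overflow."""
    return math.exp(min(0.0, log_mh_ratio(*log_terms)))

# Symmetric proposal and flat prior: only the likelihood ratio matters.
print(acceptance_probability(-700301.0, -700299.0, 0.0, 0.0, 0.0, 0.0))
```

With a symmetric proposal and a flat prior the last two terms vanish and the rule reduces to accepting in proportion to the likelihood ratio.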
