Bioinf!

Phylogenetics
Phylogenetic Trees
[Figure: a phylogenetic tree drawn along a time axis, with labels for the ROOT, a BRANCH, a NODE (hypothetical taxonomic unit) and the tips (operational taxonomic units, OTUs)]
Information
• Branching order (topology)
– Relative closeness of different taxa
• Branch length
– Amount of divergence
Rooted and unrooted trees
[Figure: five OTUs (A to E) drawn as a ROOTED tree and as an UNROOTED tree]
[Figure: ROOTED and UNROOTED topologies for 3 OTUs (A, B, C) and 4 OTUs (A, B, C, D); … 15 rooted trees of 4 OTUs]
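The count of 15 rooted trees for 4 OTUs follows a standard formula for strictly bifurcating, labelled trees: (2n - 3)!! rooted trees and (2n - 5)!! unrooted trees for n OTUs. The sketch below only illustrates that formula; the function names are made up for this example.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1; defined as 1 for k <= 0."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_rooted_trees(n_otus):
    """Number of rooted bifurcating trees for n labelled OTUs: (2n - 3)!!"""
    return double_factorial(2 * n_otus - 3)


def n_unrooted_trees(n_otus):
    """Number of unrooted bifurcating trees for n labelled OTUs: (2n - 5)!!"""
    return double_factorial(2 * n_otus - 5)


for n in (3, 4, 5, 10):
    print(n, n_rooted_trees(n), n_unrooted_trees(n))
```

The numbers grow so quickly that examining every tree becomes impractical beyond a modest number of OTUs.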
Monophyletic & Paraphyletic
[Figure: tree of birds, crocodiles, snakes and lizards, turtles and tortoises, and mammals, with the group REPTILES marked]
• Monophyletic
– Natural clade: a common ancestor and all of
the taxa derived from it
• Paraphyletic
– Taxonomic group whose most recent common
ancestor is shared by another taxon that is left out
of the group (e.g. reptiles, which exclude birds)
Reconstruct phylogeny from molecular data
[Figure: a set of DNA sequences (ACTGTTACCGA) whose relationships are unknown (?)]
Types of phylogenetic analysis methods
• Phenetic: trees are constructed based on
observed characteristics, not on
evolutionary history
– Distance methods
• Cladistic: trees are constructed based on
fitting observed characteristics to some
model of evolutionary history
– Parsimony and Maximum Likelihood methods
Methods of Tree reconstruction
• Distance
• Maximum Parsimony
• Maximum Likelihood
• Bayesian
Phylogeny Estimation: Traditional and Bayesian Approaches
Nature Reviews Genetics (2003) 4:275
Genetic distance
• Distance from one sequence to another
• Hamming Distance
– Count number of differences
• Multiple hits – number of events is greater than
number of differences
– Estimate number of events
• Infer tree from genetic distance using
Neighbour-joining (NJ) method
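As a rough illustration of the two ideas above, the sketch below counts differences (Hamming distance) and then estimates the number of events with the Jukes-Cantor correction. The slide does not say which correction is intended, so Jukes-Cantor is an assumption chosen because it is the simplest common one; the function names are hypothetical.

```python
import math


def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def jukes_cantor(a, b):
    """Estimated substitutions per site, allowing for multiple hits.

    Assumes the Jukes-Cantor model: d = -3/4 * ln(1 - 4p/3),
    where p is the observed proportion of differing sites.
    """
    p = hamming(a, b) / len(a)
    if p >= 0.75:
        raise ValueError("sequences too divergent for this correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seq1 = "ACTGTTACCGA"
seq2 = "ACTGATACCGT"  # made-up variant of the slide's sequence
print(hamming(seq1, seq2), jukes_cantor(seq1, seq2))
```

The corrected distance is never smaller than the observed proportion of differences, which is the "multiple hits" point in the bullet above.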
UPGMA shown for illustrative purposes. Neighbour-joining is the preferred method.
• The algorithm in the text means: find the
closest distance between two sequences,
cluster those; then find the next closest
distance, cluster those; as sequences are
added to existing clusters find the average
distance between existing clusters
• Work through the notation!
• UPGMA assumes a molecular clock
mechanism of evolution
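The clustering described above can be sketched as follows. This is a minimal illustration rather than the algorithm exactly as written in the text: it returns only the topology (as a nested string), ignores branch lengths, and the distance matrix is invented for the example.

```python
def upgma(names, d):
    """Cluster by repeatedly joining the two closest clusters.

    names: list of sequence labels; d: symmetric distance matrix (list of lists).
    Returns the topology as a nested, Newick-like string.
    """
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (label, size)
    dist = {(i, j): d[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        (a, b), _ = min(dist.items(), key=lambda kv: kv[1])
        (label_a, size_a), (label_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        merged = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(merged)
        clusters[next_id] = ("(" + label_a + "," + label_b + ")", size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical clock-like distances for four sequences.
example = [
    [0, 2, 4, 8],
    [2, 0, 4, 8],
    [4, 4, 0, 8],
    [8, 8, 8, 0],
]
print(upgma(["A", "B", "C", "D"], example))
```

Because the example distances are clock-like, joining the closest pair at each step recovers the expected grouping; when rates differ between branches the closest pair need not be true neighbours.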
• Neighbour-joining: corrects for UPGMA’s
assumption of the same rate of evolution
for each branch by modifying the distance
matrix to reflect different rates of change.
• The net difference between sequence i and
all other sequences is
$r_i = \sum_k d_{ik}$
• The rate-corrected distance matrix is then
$M_{ij} = d_{ij} - (r_i + r_j)/(n - 2)$, where n is the number of sequences (nodes) still to be joined
• Join the two sequences i and j whose $M_{ij}$ is minimal;
then calculate the distance from this new node k to
every other sequence m using
$d_{km} = (d_{im} + d_{jm} - d_{ij})/2$
• Again correct for rates and join nodes.
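One round of the joining procedure above might look like the sketch below. It computes the net differences, the rate-corrected matrix and the updated distances, but not branch lengths. The matrix is a made-up example built from a tree in which B and D evolve faster than A and C, so the smallest raw distance (A to C) is not between true neighbours.

```python
def nj_join_step(names, d):
    """Perform one neighbour-joining step; needs at least three sequences.

    Returns the reduced list of names and the reduced distance matrix,
    with the newly created node placed last.
    """
    n = len(names)
    r = [sum(row) for row in d]  # net difference of each sequence
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            m_ij = d[i][j] - (r[i] + r[j]) / (n - 2)  # rate-corrected distance
            if best is None or m_ij < best[0]:
                best = (m_ij, i, j)
    _, i, j = best
    keep = [k for k in range(n) if k not in (i, j)]
    # Distance from the new node to each remaining sequence m.
    to_new = [(d[i][m] + d[j][m] - d[i][j]) / 2 for m in keep]
    new_names = [names[k] for k in keep] + ["(" + names[i] + "," + names[j] + ")"]
    new_d = [[d[a][b] for b in keep] + [to_new[x]] for x, a in enumerate(keep)]
    new_d.append(to_new + [0.0])
    return new_names, new_d


# Hypothetical additive distances for the tree ((A,B),(C,D)).
dist = [
    [0, 5, 3, 6],
    [5, 0, 6, 9],
    [3, 6, 0, 5],
    [6, 9, 5, 0],
]
new_names, new_dist = nj_join_step(["A", "B", "C", "D"], dist)
print(new_names, new_dist)
```

Here the rate correction leads to A and B being joined even though A and C have the smallest raw distance.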
Maximum Parsimony (MP)
• Find topology requiring smallest number of
evolutionary changes
• Consider each position (site) in the
sequence alignment independently
• Not all sites are informative
• Informative
– Favours one topology over others
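To make "smallest number of evolutionary changes" concrete, the sketch below scores a given topology site by site using Fitch's algorithm. The slides do not name a counting method, so Fitch's algorithm is an assumption here, and the tree encoding (nested tuples of leaf names) and the alignment are invented for the example.

```python
def fitch(node, site):
    """Return (candidate ancestral states, minimum changes) for one site.

    node: a leaf name (str) or a (left, right) tuple of sub-trees.
    site: dict mapping leaf name -> nucleotide at this position.
    """
    if isinstance(node, str):
        return {site[node]}, 0
    left, right = node
    states_l, cost_l = fitch(left, site)
    states_r, cost_r = fitch(right, site)
    common = states_l & states_r
    if common:
        return common, cost_l + cost_r
    # No shared state: one extra change is needed at this node.
    return states_l | states_r, cost_l + cost_r + 1


def parsimony_score(tree, alignment):
    """Sum the per-site minimum changes, treating sites independently."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[pos] for name, seq in alignment.items()})[1]
        for pos in range(n_sites)
    )


alignment = {"a": "AAG", "b": "AAG", "c": "GGG", "d": "GGC"}  # hypothetical data
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

With this toy alignment the first topology needs fewer changes, so parsimony would prefer it.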
Informative sites
[Figure: example alignment columns for four sequences (a, b, c, d) placed on the alternative topologies, showing which sites favour one topology over the others]
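A common working rule, stated here as an assumption since the slide only gives the idea, is that a site is parsimony-informative when at least two different nucleotides each occur in at least two sequences. A minimal sketch with a made-up alignment:

```python
from collections import Counter


def is_informative(column):
    """True if at least two states each appear in at least two sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(seqs):
    """Return the 0-based positions of parsimony-informative columns."""
    return [pos for pos, column in enumerate(zip(*seqs)) if is_informative(column)]


rows = ["AAG", "AAG", "GGG", "GGC"]  # hypothetical four-sequence alignment
print(informative_sites(rows))
```

Invariant columns and columns where only one sequence differs cost the same number of changes on every topology, which is why they are skipped.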
Maximum Likelihood (ML)
• Likelihood L of a tree is the probability of
observing the data given the tree
L = P(data|tree)
• Find the tree with the highest L value
• Results depend on the model of nucleotide
substitution
• Computationally time-consuming
• Actually, all the other methods discussed
implicitly use a simple model of evolution
similar to the typical model made explicit in
maximum likelihood:
• All sites selectively neutral
• All mutate independently, forward and
reverse rates equal, given by μ
• Also assume discrete generations and sites
change independently
• Given this model, can calculate probability
that a site with initial nucleotide i will
change to nucleotide j within time t:
• $P_{ij}(t) = \delta_{ij} e^{-\mu t} + (1 - e^{-\mu t}) g_j$, where $\delta_{ij} = 1$ if i = j
and $\delta_{ij} = 0$ otherwise, and where $g_j$ is the
equilibrium frequency of nucleotide j
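The change probability above is easy to evaluate directly. The sketch below is only an illustration: `mu` stands for the substitution rate, `freqs` for the equilibrium frequencies, and the uniform frequencies and rate value are assumptions made for the example.

```python
import math


def transition_prob(i, j, t, mu, freqs):
    """Probability that nucleotide i has become nucleotide j after time t."""
    decay = math.exp(-mu * t)
    same = 1.0 if i == j else 0.0
    return same * decay + (1.0 - decay) * freqs[j]


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed frequencies
for t in (0.0, 0.5, 5.0):
    row = [round(transition_prob("A", j, t, 1.0, uniform), 4) for j in "ACGT"]
    print(t, row)
```

At t = 0 nothing has changed, each row of probabilities sums to 1, and for long times the probabilities approach the equilibrium frequencies.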
• The likelihood that some site is in state i at the
kth node of a tree is $L_i(k)$
• The likelihoods for all states for each site for
each node are calculated separately; the
product of the likelihoods for each site gives
the overall likelihood for the observed data
• Different tree topologies are searched to find
the highest overall likelihood
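The per-node, per-site calculation can be sketched with Felsenstein's pruning algorithm, which is a standard way to do it and is named here as an assumption about what the slides intend. The tree encoding (leaf name, or a tuple of left sub-tree, left branch length, right sub-tree, right branch length), the branch lengths, the rate and the alignment are all hypothetical.

```python
import math

BASES = "ACGT"


def transition_prob(i, j, t, mu, freqs):
    """Change probability from i to j over a branch of length t."""
    decay = math.exp(-mu * t)
    return (1.0 if i == j else 0.0) * decay + (1.0 - decay) * freqs[j]


def partial_likelihoods(node, site, mu, freqs):
    """Return {state: likelihood of the data below this node given that state}."""
    if isinstance(node, str):  # leaf: only the observed nucleotide is possible
        return {b: 1.0 if site[node] == b else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    l_left = partial_likelihoods(left, site, mu, freqs)
    l_right = partial_likelihoods(right, site, mu, freqs)
    out = {}
    for i in BASES:
        via_left = sum(transition_prob(i, j, t_left, mu, freqs) * l_left[j] for j in BASES)
        via_right = sum(transition_prob(i, j, t_right, mu, freqs) * l_right[j] for j in BASES)
        out[i] = via_left * via_right
    return out


def tree_likelihood(tree, alignment, mu, freqs):
    """Product over sites of the likelihood summed over root states."""
    n_sites = len(next(iter(alignment.values())))
    total = 1.0
    for pos in range(n_sites):
        site = {name: seq[pos] for name, seq in alignment.items()}
        at_root = partial_likelihoods(tree, site, mu, freqs)
        total *= sum(freqs[b] * at_root[b] for b in BASES)
    return total


uniform = {b: 0.25 for b in BASES}
data = {"x": "AAAA", "y": "AAAA", "z": "GGGG"}  # made-up alignment
tree_xy = (("x", 0.1, "y", 0.1), 0.5, "z", 0.6)
tree_xz = (("x", 0.1, "z", 0.1), 0.5, "y", 0.6)
print(tree_likelihood(tree_xy, data, 1.0, uniform))
print(tree_likelihood(tree_xz, data, 1.0, uniform))
```

In this toy case the tree that groups the two identical sequences has the higher likelihood. For long alignments a real implementation would sum log-likelihoods rather than multiply raw probabilities, to avoid numerical underflow.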
• Maximum likelihood is arguably the “gold
standard” for phylogenetic analysis; but
because of its computational intensity it
can only be used for select data and only
after much initial fine tuning of many
parameters of sequence alignments
• Often used to distinguish between several
already generated trees
Bayesian (B) Phylogeny Estimation
• Searches for best trees consistent with
both model and data
• Incorporates prior knowledge (prior
probability)
• B maximises probability of tree given data
and model
• Searches for best set of trees
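The "probability of tree given data and model" in the bullets above is the posterior probability from Bayes' theorem. Written out in a notation chosen for this note (the slides give no formula), with the prior probability of the tree appearing explicitly:

```latex
P(\text{tree} \mid \text{data}, \text{model})
  = \frac{P(\text{data} \mid \text{tree}, \text{model}) \, P(\text{tree} \mid \text{model})}
         {P(\text{data} \mid \text{model})}
```

The first factor in the numerator is the same likelihood that the maximum likelihood method maximises; the denominator sums over all trees and is the reason Bayesian methods usually sample a set of trees rather than evaluate the posterior exactly.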
Comparison of methods
How much information are they using?
• MP, ML, B use actual DNA whereas NJ
summarises information into distance matrix
• BUT, not all sites are used by MP (“informative”
sites only)
How can the nature of the data affect the
methods?
• NJ better for recent divergences
• MP works well for a high number of informative
sites
How do they cope with lots of sequences?
• MP requires comparison of all possible trees
– Not possible for large number of taxa
• ML is computationally intensive and very slow
for large number of taxa
• NJ efficient for large number of taxa
Anything else?
• ML requires explicit assumptions about rate and
pattern of substitution (model)
– ML may perform poorly if model is incorrect
• ML or B may get stuck on local maxima
Outgroup rooting of unrooted trees
• Outgroup – related sequence that
definitely diverged earlier (paleontological
evidence)
[Figure: an unrooted tree of human, mouse and rat, and the same tree rooted with chicken as the outgroup]
Rate (r) of evolution
• K = number of substitutions per site
• T = time since divergence
• r = K/2T
• Rate is expressed as substitutions per site per year
[Figure: Species A and Species B diverging from a common ancestor at time T]
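The factor of 2 in r = K/2T reflects that both lineages have been accumulating substitutions for time T, so the total evolutionary time separating the two species is 2T. A small sketch with invented numbers (the function name and values are not from the slides):

```python
def substitution_rate(k, t_years):
    """Rate in substitutions per site per year, from r = K / (2T)."""
    return k / (2.0 * t_years)


# Hypothetical example: 0.1 substitutions per site, divergence 50 million years ago.
print(substitution_rate(0.1, 50e6))
```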
Estimating species divergence times
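• Fossil evidence shows
that T1 = 310 mya
• What is T2?
$$r = \frac{K_{AC} + K_{BC}}{2(2T_1)}$$
$$r = \frac{K_{AB}}{2T_2}$$
$$T_2 = \frac{K_{AB}}{2r}$$
$$T_2 = \frac{K_{AB} \times 2T_1}{K_{AC} + K_{BC}}$$
• Only need to have
sequences and
information on one
divergence time
[Figure: tree of Rat (A), Human (B) and Chicken (C); A and B diverge at T2, and their common lineage diverges from C at T1]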
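The calculation above can be followed step by step in code: calibrate the rate from the two comparisons against chicken, then apply it to the rat-human distance. Only T1 = 310 mya comes from the slide; the K values below are invented purely to show the arithmetic.

```python
def divergence_time(k_ab, k_ac, k_bc, t1):
    """Estimate T2 from pairwise distances and one known divergence time T1."""
    # Average the two distances to the outgroup C to calibrate the rate.
    rate = (k_ac + k_bc) / (2 * (2 * t1))
    # Apply the calibrated rate to the A-B distance.
    return k_ab / (2 * rate)


T1 = 310.0  # million years ago, from the fossil evidence on the slide
K_AB, K_AC, K_BC = 0.2, 0.6, 0.6  # hypothetical substitutions per site
print(divergence_time(K_AB, K_AC, K_BC, T1))
```

The result is in the same units as T1 and relies on the rate being the same along all three lineages.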
True tree and inferred tree
• There is only one true
tree of species
relationships
• Inferred tree may not be
correct
1. Some genes may not be
representative
2. Tree inference method
may have produced an
incorrect tree
– e.g. parsimony method:
may get several equally
parsimonious results
How credible is the tree?
• The tree is a hypothesis of the true
relationship
• Need some measure of the support for
that hypothesis
• Note: Bayesian methods simultaneously
estimate tree and measures of uncertainty
for each branch
Standard Error of branches
[Figure: tree of human, chimp, gorilla and orangutan]
Bootstrap
• The bootstrap: randomly sample all
positions (columns in an alignment) with
replacement -- meaning some columns can
be repeated -- but conserving the number
of positions; build a large dataset of these
randomized samples
• Then use your method (distance, parsimony,
likelihood) to generate another tree
• Do this a thousand or so times
• Note that if the assumptions the method is based
on hold, you should always get the same tree
from the bootstrapped alignments as you did
originally
• The frequency of some feature of your phylogeny
in the bootstrapped set gives some measure of
the confidence you can have for this feature
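The resampling itself is short to write down. In the sketch below the "feature" being tested is deliberately trivial (which two sequences are closest by number of differences), standing in for a real tree-building method; the alignment, function names and seed are all assumptions made for illustration.

```python
import random


def resample_columns(seqs, rng):
    """Draw alignment columns with replacement, keeping the number of positions."""
    n_cols = len(seqs[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(s[c] for c in picks) for s in seqs]


def closest_pair(seqs):
    """Stand-in for tree building: indices of the two most similar sequences."""
    best = None
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            diff = sum(x != y for x, y in zip(seqs[i], seqs[j]))
            if best is None or diff < best[0]:
                best = (diff, i, j)
    return best[1], best[2]


def bootstrap_support(seqs, n_reps=1000, seed=0):
    """Fraction of bootstrap replicates that reproduce the original feature."""
    rng = random.Random(seed)
    original = closest_pair(seqs)
    hits = sum(
        closest_pair(resample_columns(seqs, rng)) == original for _ in range(n_reps)
    )
    return hits / n_reps


aln = ["ACGTACGT", "ACGTACGA", "TTGTCCGA"]  # hypothetical alignment
print(bootstrap_support(aln))
```

In practice `closest_pair` would be replaced by the distance, parsimony or likelihood method actually used, and the feature would be a particular grouping in the tree.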
Applications of phylogenetics
• Detection of orthology and
paralogy
• Estimation of divergence times
• Reconstruction of ancient
proteins
• Identifying residues important
to selection
• Detecting recombination points
• Identifying mutations likely to
be associated with disease
• Determining the identity of new
pathogens
The time will come, I believe, though I shall not live to
see it, when we shall have fairly true genealogical trees
of each great kingdom of Nature.
Charles Darwin
The Tree of Life
• Traditional
classification of life
into five kingdoms
– Bacteria (inc.
cyanobacteria)
– Protista (inc. ciliates,
flagellates, amoebae)
– Fungi
– Plantae
– Animalia
Archaebacteria
• Carl Woese and colleagues
• Study relationships by
comparing rRNAs
• Methanogens were expected
to group with other bacteria
• BUT, found to be equally
distant from bacteria and
eukaryotes
• Made new taxon Archaebacteria
• Includes many extremophiles
– thermophiles
– hyperthermophiles
– halophiles (salt dependent)
Where is the root of the Tree of Life?
• No possible outgroup (by definition)
• Iwabe et al. (1989)
• Examined phylogenetic tree of pairs of genes that
exist in all organisms
– derived from gene duplication that predates lineage
divergences
[Figure: Gene A duplicates into Gene A1 and Gene A2 before lineages 1, 2 and 3 diverge; each copy then has its own tree of the three lineages]
• Homologous elongation factor genes EF-Tu and
EF-G present in all prokaryotes and eukaryotes
• Both genes show the same topology
[Figure: EF-Tu and EF-G trees, each containing Archaea, Eucarya and Bacteria]
Changing view of The Tree of Life …
• based on morphological characteristics (Chatton, 1925)
(Gaucher et al, 2010)
• based on DNA sequence analysis (Woese & Fox, 1977)
• based on membrane architecture & gene indels
Most modern view …
• based on ancient gene duplication
• based on phylogenies of hundreds of genes
Phylogeny of humans and apes
• Darwin – Gorilla and Chimpanzee
our closest relatives and human
evolutionary origins in Africa
• Many people preferred
anthropocentric idea that humans
were special
[Figure: traditional view of the relationships among human, chimp, gorilla, orangutan and gibbon]
So what is the evidence?
• Serological precipitation
(Goodman 1962) – H, G, C
constitute a natural clade,
orangutans & gibbons earlier
diverging
• However, H,G,C relative
relationships remained unclear
• Most DNA sequence data
support ((H,C),G)
• Some genes show different
relationship
[Figure: tree of human, chimp, gorilla, orangutan and gibbon]
Conservation biology – the dusky
seaside sparrow
• Last one died June 1987
(DisneyWorld)
• Discovered 1872
• Ammodramus maritimus
nigrescens
• Geographically confined to
small salt marsh in Florida
• 2000 individuals in 1900
• 6 individuals (all male) in 1980
• Conservation program
– artificial breeding
Conservation genetics
• Mating of remaining males with females
from closest subspecies available
• Female hybrids of first generation then
“back-crossed” to original males
• Continue as long as original males live
• Which subspecies should the females be
taken from?
• 8 other A. maritimus
subspecies
• Geographically dispersed
along coast
• Artificial breeding with Scott’s
seaside sparrow (A. m.
peninsulae)
• Chosen based on
morphological and behavioural
similarities
• Was this the best choice?
[Figure: nigrescens placed with the Atlantic Coast subspecies, peninsulae with the Gulf Coast subspecies]
Woops!
• Two subspecies diverged about 250,000 – 500,000
years ago
• A. m. nigrescens almost indistinguishable molecularly
from other Atlantic Coast subspecies
• Any Atlantic Coast subspecies would have been a better
choice
• Created a new species instead of saving old
• Dusky seaside sparrow officially declared extinct in 1990
Origin of angiosperms
• Flowering plants: carpel-enclosed ovules and
seed
• Fossils
– began to radiate mid-Cretaceous (~115 mya)
– Dominant land plants 90
mya
• 275,000 species
described
• Probably arose from
gymnosperm-like
ancestor up to 370-380
mya
• Gymnosperm = “naked
seed” (e.g. conifers)
• Long time span of
possible origin
• Why no fossils?
– Didn’t exist prior to
Cretaceous?
– Lived in habitats not
conducive to fossilisation?
Monocot and Dicot divergence
• Monocotyledons
• Dicotyledons
• Two major classes of
angiosperm
• Date of their divergence
gives minimum estimate
for age of angiosperms
• Phylogenetic analysis of
DNA sequences
Monocot – Dicot divergence
• Initial estimate of 300-320 mya (Martin et al. 1989)
– Glyceraldehyde-3-phosphate dehydrogenase from plants,
animals and fungi
• Implied origin close (within 100 myr) to the time of origin
of earliest land plants – seems too ancient
– implies all vascular plants arose within 100 myr
• Alternative study (Wolfe et al., 1989)
• Calibrated molecular clock with maize-wheat divergence
(50-70 mya)
• Monocot-dicot divergence estimated as 200 mya
• Existed long before prominence in paleoflora
Cetaceans
[Figure: tree of the artiodactyls: cow, deer, hippo, pig, peccary and camel]
• Link to ungulates
(hoofed mammals)
suggested by
comparative anatomy
• Early protein and
mtDNA phylogenetic
studies indicated that
Cetaceans are closely
related to Artiodactyls
• Graur and Higgins (1994)
• Protein and DNA
sequence from several
cetaceans and from three
suborders of artiodactyls
• Showed cetaceans are
within artiodactyls
• Confirmed by analysis of
distribution of SINE
elements
Cetartiodactyls