Phylogenetic Reconstruction
A simple way to understand evolution…
DATA: Gene sequence alignment
How can we transform this information into a historical context?
Phylogeny inference
1. Distance-based methods
- Pairwise distance matrix
- Adjust tree branch lengths to fit the distance matrix (e.g. least squares, neighbor joining)
2. Character-based methods
- Parsimony
- Maximum likelihood, or model-based evolution
In 1866, Ernst Haeckel coined the word “phylogeny” and presented
phylogenetic trees for most known groups of living organisms.
The Tree of Life project
Surf the tree of life at:
http://tolweb.org/tree/phylogeny.html
What is a tree?
A tree is a mathematical structure which is used to model
the actual evolutionary history of a group of sequences or organisms,
i.e. an evolutionary hypothesis.
A tree consists of nodes connected by branches.
The ancestor of all the
sequences is the root of
the tree
Internal nodes represent
hypothetical ancestors
Terminal nodes represent sequences or
organisms for which we have data.
Each is typically called an “Operational Taxonomic Unit” or OTU.
Types of Trees
Rooted vs. Unrooted
            Branches               Nodes
            Interior   Total       Interior   Total
Rooted      M – 2      2M – 2      M – 1      2M – 1
Unrooted    M – 3      2M – 3      M – 2      2M – 2
M is the number of OTUs
The number of rooted and unrooted trees:
Number of OTUs   Possible number of rooted trees   Possible number of unrooted trees
2                1                                 1
3                3                                 1
4                15                                3
5                105                               15
6                945                               105
7                10395                             945
8                135135                            10395
9                2027025                           135135
10               34459425                          2027025
OTU – Operational Taxonomic Unit
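The counts in the table follow the standard double-factorial formulas for bifurcating trees: (2M – 3)!! rooted trees and (2M – 5)!! unrooted trees for M OTUs. The sketch below reproduces the table. The function names are illustrative and do not come from the slides.

```python
def double_factorial(n: int) -> int:
    """Product n * (n - 2) * (n - 4) * ... down to 1; returns 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def num_rooted_trees(m: int) -> int:
    """Number of rooted bifurcating trees for m >= 2 OTUs: (2m - 3)!!"""
    return double_factorial(2 * m - 3)


def num_unrooted_trees(m: int) -> int:
    """Number of unrooted bifurcating trees for m >= 2 OTUs: (2m - 5)!!"""
    return double_factorial(2 * m - 5)


for m in range(2, 11):
    print(m, num_rooted_trees(m), num_unrooted_trees(m))
```

The number of unrooted trees for M OTUs equals the number of rooted trees for M – 1 OTUs, which is why the two columns of the table are shifted copies of each other.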
Types of Trees
Bifurcating
Multifurcating
Polytomy
Polytomies: Soft vs. Hard
• Soft: designate a lack of information about the
order of divergence.
• Hard: the hypothesis that multiple divergences
occurred simultaneously
Types of Trees
Trees: only one path between any pair of nodes
Networks: more than one path between any pair of nodes
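A shorthand for trees: the Newick format
((1,2),(3,4))
[Figure: tree in which tips 1 and 2 are grouped together and tips 3 and 4 are grouped together]
(((1,2),((3,4),5)),6)
[Figure: tree with tips 1 to 6 grouped as in the Newick string above]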
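To make the bracket notation concrete, here is a minimal sketch of a reader for the simplified Newick strings shown here (tip labels only, with no branch lengths or internal node names). It turns each pair of parentheses into a nested tuple. The function names are illustrative, and the full Newick format has features this sketch ignores.

```python
def parse_newick(text: str):
    """Parse a Newick string without branch lengths into nested tuples of tip labels."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        # Otherwise read a tip label up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError(f"unexpected text at position {pos}")
    return tree


def leaves(tree):
    """List the tip labels of a nested-tuple tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


print(parse_newick("((1,2),(3,4))"))
print(leaves(parse_newick("(((1,2),((3,4),5)),6)")))
```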
Comments on Trees
• Trees give insights into underlying data
• Identical trees can appear differently depending upon the method of display
• Information may be lost when creating the tree. The tree is not the underlying data.
[Figure: four drawings of a tree with tips A, B and C]
Different kinds of trees can be used to depict different aspects of evolutionary history
1. Cladogram: simply shows relative recency of common ancestry
2. Additive trees: a cladogram with branch lengths, also called phylograms and metric trees
3. Ultrametric trees (dendrograms): special kind of additive tree in which the tips of the trees are all equidistant from the root
Making trees according to morphological features
Ridley New Scientist (Dec. 1983) 100, 647-51
Given a multiple alignment, how do we construct the tree?
A    GCTTGTCCGTTACGAT
B    ACTTGTCTGTTACGAT
C    ACTTGTCCGAAACGAT
D    ACTTGACCGTTTCCTT
E    AGATGACCGTTTCGAT
F    ACTACACCCTTATGAG
Distance methods
Logic: Evolutionary distance is a tree metric and hence defines a tree
General Method:
• Evolutionary distances are computed for all pairs of taxa.
• A phylogenetic tree is constructed by considering the
relationships among these distance data (fitting a tree to the
matrix).
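As a rough illustration of the first step, the sketch below computes one simple evolutionary distance, the raw number of mismatched positions, for every pair of sequences in the six-sequence alignment. This is only one possible distance and an assumption of this example. Distance matrices used in practice (including those shown in these slides) may be based on a different or corrected measure, so the values need not coincide.

```python
from itertools import combinations


def mismatch_count(seq_a: str, seq_b: str) -> int:
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def pairwise_distances(alignment: dict) -> dict:
    """Map each unordered pair of taxon names to its mismatch count."""
    return {
        frozenset((a, b)): mismatch_count(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


alignment = {
    "A": "GCTTGTCCGTTACGAT",
    "B": "ACTTGTCTGTTACGAT",
    "C": "ACTTGTCCGAAACGAT",
    "D": "ACTTGACCGTTTCCTT",
    "E": "AGATGACCGTTTCGAT",
    "F": "ACTACACCCTTATGAG",
}
distances = pairwise_distances(alignment)
print(distances[frozenset(("A", "B"))])
```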
Methods we’ll talk about
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean )
• Neighbor Joining
Distance methods
Ultrametric Trees
[Figure: ultrametric tree with branch lengths, and three points a, b and c]
Metric distances must obey 4 rules:
Non-negativity        d(a,b) >= 0
Symmetry              d(a,b) = d(b,a)
Triangle Inequality   d(a,c) <= d(a,b) + d(b,c)
Distinctness          d(a,b) = 0 iff a = b
Construction of a distance tree using clustering with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
First, construct a distance matrix:
A    GCTTGTCCGTTACGAT
B    ACTTGTCTGTTACGAT
C    ACTTGTCCGAAACGAT
D    ACTTGACCGTTTCCTT
E    AGATGACCGTTTCGAT
F    ACTACACCCTTATGAG

     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8
From http://www.icp.ucl.ac.be/~opperd/private/upgma.html
UPGMA
First round
     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8
Choose the most similar pair, cluster them together and calculate the new distance matrix.
dist(A,B),C = (distAC + distBC) / 2 = 4
dist(A,B),D = (distAD + distBD) / 2 = 6
dist(A,B),E = (distAE + distBE) / 2 = 6
dist(A,B),F = (distAF + distBF) / 2 = 8
     A,B   C   D   E
C    4
D    6     6
E    6     6   4
F    8     8   8   8
UPGMA
Second round
     A,B   C   D   E
C    4
D    6     6
E    6     6   4
F    8     8   8   8
Third round
      A,B   C   D,E
C     4
D,E   6     6
F     8     8   8
UPGMA
Fourth round
      AB,C   D,E
D,E   6
F     8      8
Fifth round
      ABC,DE
F     8
Note that this method identifies the root of the tree.
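The rounds above can be sketched in a few lines. This minimal version recomputes each cluster-to-cluster distance as the arithmetic mean of all tip-to-tip distances between the two clusters, which is the UPGMA definition, instead of updating the matrix incrementally as on the slides. The function name and data layout are illustrative assumptions. Ties, such as the two pairs at distance 4 in the second round, are broken by whichever pair is encountered first, so the order of those two merges may differ from the slides.

```python
from itertools import combinations


def upgma(labels, rows):
    """Cluster with UPGMA.

    rows[i] lists the distances from labels[i + 1] to labels[0..i]
    (a lower-triangular matrix). Returns a list of (cluster, height) merges.
    """
    leaf_dist = {}
    for i, row in enumerate(rows, start=1):
        for j, value in enumerate(row):
            leaf_dist[frozenset((labels[i], labels[j]))] = value

    def cluster_distance(c1, c2):
        # Arithmetic mean of all tip-to-tip distances between the clusters.
        total = sum(leaf_dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset([name]) for name in labels]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        # The new node sits halfway between the two clusters it joins.
        merges.append((c1 | c2, cluster_distance(c1, c2) / 2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


labels = ["A", "B", "C", "D", "E", "F"]
rows = [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]]
for cluster, height in upgma(labels, rows):
    print(sorted(cluster), height)
```

The last merge joins F to all other sequences, and that final node is the root that UPGMA places on the tree.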
A tree of human mitochondria sequences
http://www.genpat.uu.se/mtDB/
• The mitochondrial genome has 16,500 base pairs.
• In 2000, Gyllensten and colleagues sequenced the mitochondrial genomes of 53 people of diverse geographical, racial and linguistic backgrounds.
• A molecular clock seems to hold: these sequences diverge at a rate of 1.7 × 10^-8 substitutions per site per year.
Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408: 708-713.
The deepest branches lead exclusively to sub-Saharan mtDNAs, with the
second branch containing both Africans and non-Africans.
sub-Sahara mtDNA
A tree of 86 mitochondrial sequences.
Downloaded from
http://www.genpat.uu.se/mtDB/sequences.html and
analyzed using MEGA, method: UPGMA
Rooting the tree with an outgroup
Root
Outgroup
Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408: 708-713.
Phylogeny based upon the molecular clock
• Evidence for a human mitochondrial origin in Africa: African sequence diversity is twice as large as that of non-Africans.
• Gyllensten and colleagues estimate that the divergence of Africans and non-Africans occurred 52,000 to 28,000 years ago.
Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408: 708-713.
UPGMA assumes a molecular clock
• The UPGMA clustering method is very sensitive to unequal evolutionary rates (it assumes that the evolutionary rate is the same for all branches).
• Clustering works only if the data are ultrametric.
• Ultrametric distances are defined by the satisfaction of the 'three-point condition'.
The three-point condition:
[Figure: three taxa A, B and C]
For any three taxa, the two greatest distances are equal.
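The three-point condition is easy to test directly. The sketch below checks every triple of taxa in two of the distance matrices from these slides: the matrix used for the UPGMA walkthrough, and the unequal-rates matrix used in the UPGMA failure example. The helper names are illustrative assumptions.

```python
from itertools import combinations


def lower_triangle_to_dict(labels, rows):
    """rows[i] holds the distances from labels[i + 1] to labels[0..i]."""
    dist = {}
    for i, row in enumerate(rows, start=1):
        for j, value in enumerate(row):
            dist[frozenset((labels[i], labels[j]))] = value
    return dist


def is_ultrametric(labels, dist, tol=1e-9):
    """True if, for every three taxa, the two greatest distances are equal."""
    for a, b, c in combinations(labels, 3):
        d = sorted([dist[frozenset((a, b))],
                    dist[frozenset((a, c))],
                    dist[frozenset((b, c))]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True


labels = ["A", "B", "C", "D", "E", "F"]
clock_like = lower_triangle_to_dict(
    labels, [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])
unequal_rates = lower_triangle_to_dict(
    labels, [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]])

print(is_ultrametric(labels, clock_like))
print(is_ultrametric(labels, unequal_rates))
```

In the unequal-rates matrix the triple A, B, C already fails, because its distances are 5, 4 and 7 and the two greatest are not equal.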
UPGMA fails when rates of evolution are not constant
A tree in which the evolutionary rates are not equal
From http://www.icp.ucl.ac.be/~opperd/private/upgma.html
     A   B    C   D   E
B    5
C    4   7
D    7   10   7
E    6   9    6   5
F    8   11   8   9   8
(Neighbor joining will get the right tree in this case.)
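Neighbors
[Figure: unrooted tree in which A (branch a) and B (branch b) meet at one internal node, C (branch c) and D (branch d) meet at another, and the two internal nodes are joined by branch x]
A and B are neighbors because they are connected through a single internal node.
C and D are also neighbors, but A and D are not neighbors.
The 4-point condition can be used to identify neighbors.
It basically states that neighbors are closer than non-neighbors.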
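The Four Point Condition
[Figure: unrooted tree with A (branch a) and B (branch b) on one side of internal branch x, and C (branch c) and D (branch d) on the other]
dAC + dBD = dAD + dBC = a + b + c + d + 2x = dAB + dCD + 2x
dAB + dCD < dAC + dBD
dAB + dCD < dAD + dBC
(left-hand side: neighbors; right-hand side: non-neighbors)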
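As a small worked case, the sketch below applies the four-point condition to taxa A, B, C and D using the distances from the unequal-rates matrix. It forms the three possible pairings and reports the one with the smallest sum. The function name is an illustrative assumption.

```python
def quartet_neighbors(dist, a, b, c, d):
    """Return (best pairing, all pairing sums) for four taxa.

    The pairing with the smallest sum of within-pair distances
    identifies the two pairs of neighbors.
    """
    def get(x, y):
        return dist[frozenset((x, y))]

    sums = {
        ((a, b), (c, d)): get(a, b) + get(c, d),
        ((a, c), (b, d)): get(a, c) + get(b, d),
        ((a, d), (b, c)): get(a, d) + get(b, c),
    }
    best = min(sums, key=sums.get)
    return best, sums


# Distances among A, B, C and D taken from the unequal-rates matrix.
dist = {
    frozenset(("A", "B")): 5, frozenset(("A", "C")): 4,
    frozenset(("B", "C")): 7, frozenset(("A", "D")): 7,
    frozenset(("B", "D")): 10, frozenset(("C", "D")): 7,
}
best, sums = quartet_neighbors(dist, "A", "B", "C", "D")
print(best, sorted(sums.values()))
```

Here dAB + dCD = 12 while the other two sums are both 14, so A and B are neighbors even though the single smallest distance in this quartet is dAC = 4. That is the situation in which clustering on the smallest distance goes wrong.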
Neighbor Joining
An algorithm for finding the shortest tree
Start with a star (no hierarchical structure)
[Figure: star tree with tips a, b, c and d, and a formula giving the length of the tree in terms of the pair-wise distances and the number of OTUs]
Neighbor Joining
(Saitou and Nei, 1987)
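The formula on this slide did not survive, so the sketch below uses the Q-criterion that is commonly given as the pair-selection rule for neighbor joining: Q(i,j) = (n – 2) d(i,j) – r(i) – r(j), where r is the sum of a taxon's distances to all others. Treat this formulation and the function name as assumptions, not as a reconstruction of the slide. It shows only the first step (choosing which pair to join) on the unequal-rates matrix.

```python
from itertools import combinations


def nj_q_values(labels, dist):
    """Q-criterion for every pair; the pair with the smallest Q is joined first."""
    n = len(labels)
    row_sum = {
        x: sum(dist[frozenset((x, y))] for y in labels if y != x)
        for x in labels
    }
    return {
        (a, b): (n - 2) * dist[frozenset((a, b))] - row_sum[a] - row_sum[b]
        for a, b in combinations(labels, 2)
    }


labels = ["A", "B", "C", "D", "E", "F"]
rows = [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]]
dist = {
    frozenset((labels[i], labels[j])): value
    for i, row in enumerate(rows, start=1)
    for j, value in enumerate(row)
}
q = nj_q_values(labels, dist)
smallest = min(q.values())
print(sorted(pair for pair, value in q.items() if value == smallest))
```

On this matrix the smallest Q is shared by the pairs (A, B) and (D, E), whereas the smallest raw distance is between A and C, the pair a clustering method would join first.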
Character state methods
MAXIMUM PARSIMONY
Logic:
Examine each column in the multiple alignment of the sequences.
Examine all possible trees and choose among them according to some optimality criterion.
Method we’ll talk about
• Maximum parsimony
Maximum Parsimony
Simpler hypotheses are preferable to more complicated ones, and ad hoc hypotheses should be avoided whenever possible (Occam’s Razor).
Thus, find the tree that requires the smallest number of evolutionary changes.
     0123456789012345
W    ACTTGACCCTTACGAT
X    AGCTGGCCCTGATTAC
Y    AGTTGACCATTACGAT
Z    AGCTGGTCCTGATGAC
[Figure: tree relating W, X, Y and Z]
Maximum Parsimony
Start by classifying the sites:
             123456789012345678901
Mouse        CTTCGTTGGATCAGTTTGATA
Rat          CCTCGTTGGATCATTTTGATA
Dog          CTGCTTTGGATCAGTTTGAAC
Human        CCGCCTTGGATCAGTTTGAAC
             ---------------------
Invariant    *  * ******** *****
Variant       ** *        *     **
             ---------------------
Informative   **                **
Non-inform.      *        *
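The classification can be sketched as follows, using the usual definition that a variant site is informative when at least two different character states each occur in at least two sequences. The function name is an illustrative assumption. Positions are numbered from 1 to match the ruler above the alignment.

```python
from collections import Counter


def classify_sites(sequences):
    """Sort alignment columns into invariant, informative and non-informative sites."""
    result = {"invariant": [], "informative": [], "non_informative": []}
    for pos, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        if len(counts) == 1:
            result["invariant"].append(pos)
        elif sum(1 for c in counts.values() if c >= 2) >= 2:
            # At least two states, each shared by at least two sequences.
            result["informative"].append(pos)
        else:
            result["non_informative"].append(pos)
    return result


sequences = [
    "CTTCGTTGGATCAGTTTGATA",  # Mouse
    "CCTCGTTGGATCATTTTGATA",  # Rat
    "CTGCTTTGGATCAGTTTGAAC",  # Dog
    "CCGCCTTGGATCAGTTTGAAC",  # Human
]
sites = classify_sites(sequences)
print(sites["informative"], sites["non_informative"])
```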
[Figure: the Mouse, Rat, Dog and Human alignment with sites 2, 3 and 5 marked. For each of Site 2, Site 3 and Site 5, the three possible unrooted trees are drawn with the nucleotide at every tip and the changes each tree requires.]
Maximum Parsimony
             123456789012345678901
Mouse        CTTCGTTGGATCAGTTTGATA
Rat          CCTCGTTGGATCATTTTGATA
Dog          CTGCTTTGGATCAGTTTGAAC
Human        CCGCCTTGGATCAGTTTGAAC
Informative   **                **
Number of informative sites supporting each tree:
(Mouse,Rat),(Dog,Human)     3
(Dog,Rat),(Mouse,Human)     0
(Mouse,Dog),(Rat,Human)     1
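One common way to count the changes a tree requires is Fitch's small-parsimony procedure, sketched below for the three four-taxon trees. The slides do not name the counting method, so treat the choice of Fitch's algorithm, the nested-tuple tree encoding and the function names as assumptions. Each unrooted tree is written with an arbitrary root between its two pairs, which does not affect the count.

```python
def fitch(tree, states):
    """Return (candidate state set, number of changes) for one site.

    tree is a tip name or a 2-tuple of subtrees; states maps tip name -> base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total number of changes over all columns of the alignment."""
    names = list(alignment)
    length = len(alignment[names[0]])
    return sum(
        fitch(tree, {name: alignment[name][i] for name in names})[1]
        for i in range(length)
    )


alignment = {
    "Mouse": "CTTCGTTGGATCAGTTTGATA",
    "Rat":   "CCTCGTTGGATCATTTTGATA",
    "Dog":   "CTGCTTTGGATCAGTTTGAAC",
    "Human": "CCGCCTTGGATCAGTTTGAAC",
}
trees = [
    (("Mouse", "Rat"), ("Dog", "Human")),
    (("Dog", "Rat"), ("Mouse", "Human")),
    (("Mouse", "Dog"), ("Rat", "Human")),
]
for tree in trees:
    print(tree, parsimony_score(tree, alignment))
```

The non-informative sites add the same number of changes to every tree, so the ranking of the trees is decided by the informative sites alone.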
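Maximum Parsimony
The situation is more complicated when there are more than four units.
[Figure: tree with more than four tips; candidate nucleotide sets such as (TAGC), (AT), (TAG), (AG), (AGT), (CT) and (GT) are shown at the internal nodes]
Problems with maximum parsimony:
• Only uses “informative” sites
• Long branches “attract”
[Figure: tree illustrating long-branch attraction]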
Maximum Likelihood Analysis
• Same as Maximum Parsimony, except that nucleotide substitutions are not all considered equally probable.
• All possible unrooted trees are evaluated. (Same for Parsimony)
• Each column of the alignment is processed. (Same for Parsimony)
• The substitution A -> T will have a different probability than the substitution G -> C.
• Start with a frequency distribution table that specifies the probability of one base being substituted for another base.
• See probabilities of nucleotide substitution. (Table 6.5 pg 275)
• The probability that the unrooted tree predicts each column of the alignment is calculated.
• The probabilities for each column are multiplied together for each tree (equivalently, their logarithms are summed).
• The unrooted tree with the highest probability is chosen.
Maximum Likelihood Example
Four sequences are compared (w, x, y and z)
• Sequence w    ACGCGTTGGG
• Sequence x    ACGCGTTGGG
• Sequence y    ACGCAATGAA
• Sequence z    ACACAGGGAA
All unrooted trees are shown
In this example we will examine the first unrooted tree.
[Figure: the three possible unrooted trees for w, x, y and z]
Maximum Likelihood Example Continued
L(Tree x) = L0 * L1 * L2 * L3 * L4 * L5 * L6
• L0: base probability of the nucleotide at node 0 (0.25)
• L1: probability of the nucleotide changing from the value at node 0 to the value at node 1.
• L2: probability of the nucleotide changing from the value at node 0 to the value at node 2.
• L3: probability of the nucleotide changing from the value at node 1 to the value at leaf w (T).
• L4, L5, L6: probability of the nucleotide changing to the value at the leaf.
[Figure: tree with root node 0 (L0), branches L1 and L2 leading to internal nodes 1 and 2, and branches L3, L4, L5 and L6 leading to leaves w (T), x (T), y (A) and z (G)]
Maximum Likelihood Example Continued
• There are 64 likelihood trees to evaluate: (number of bases) ^ (number of internal nodes), or 4^3.
• We will show the evaluation of TTG (the values at nodes 0, 1 and 2) against the first unrooted tree for column TTAG.
• Determine values for L0, …, L6. Values are determined by looking up probabilities in the substitution probability table.
  Probability of L2 is T -> G
  Probability of L5 is G -> A
  Probability of L3 is T -> T
• Determine the combined probability L0 * L1 * L2 * … * L6
[Figure: the tree with node 0 = T, node 1 = T and node 2 = G, and leaves w = T, x = T, y = A and z = G]
Maximum Likelihood Example Continued
• Determine the probability for combination TGG.
• Determine the probability for the other 62 combinations.
• Sum over all 64 combinations: L(Tree) = L(Tree1) + L(Tree2) + … + L(Tree64)
• Move to the next column and repeat the same procedure.
• Once all columns are complete, multiply the column probabilities together (equivalently, sum their logarithms). This is the likelihood of the first unrooted tree.
• Continue this process for the other unrooted trees.
• Pick the unrooted tree with the highest probability. This is the most likely unrooted tree.
[Figure: the same tree with the internal nodes set to the combination TGG]
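The calculation for one column can be sketched as below. The substitution table referred to in the slides (Table 6.5) is not reproduced here, so this example substitutes a hypothetical table in which a base stays the same with probability 0.9 and changes to each of the other three bases with probability 0.1 / 3. The function names and the value of p_change are assumptions for illustration only. Node numbering follows the slides: node 0 is the root, nodes 1 and 2 are internal, and w, x, y and z are leaves.

```python
import math
from itertools import product

BASES = "ACGT"


def sub_prob(a, b, p_change=0.1):
    """Hypothetical substitution table: equal chance of changing to any other base."""
    return 1 - p_change if a == b else p_change / 3


def column_likelihood(w, x, y, z, p_change=0.1):
    """Likelihood of one column on the tree ((w,x),(y,z)): sum over all 64 node assignments."""
    total = 0.0
    for n0, n1, n2 in product(BASES, repeat=3):
        l0 = 0.25                          # base probability at node 0
        l1 = sub_prob(n0, n1, p_change)    # node 0 -> node 1
        l2 = sub_prob(n0, n2, p_change)    # node 0 -> node 2
        l3 = sub_prob(n1, w, p_change)     # node 1 -> leaf w
        l4 = sub_prob(n1, x, p_change)     # node 1 -> leaf x
        l5 = sub_prob(n2, y, p_change)     # node 2 -> leaf y
        l6 = sub_prob(n2, z, p_change)     # node 2 -> leaf z
        total += l0 * l1 * l2 * l3 * l4 * l5 * l6
    return total


def tree_log_likelihood(w, x, y, z, p_change=0.1):
    """Sum of log column likelihoods, i.e. the log of their product."""
    return sum(math.log(column_likelihood(*col, p_change))
               for col in zip(w, x, y, z))


print(column_likelihood("T", "T", "A", "G"))
print(tree_log_likelihood("ACGCGTTGGG", "ACGCGTTGGG", "ACGCAATGAA", "ACACAGGGAA"))
```

Working with logarithms avoids the very small numbers that arise when many column probabilities are multiplied together.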
IN VITRO EVOLUTION BY MEANS OF PCR
Conclusion
• Phylogenetic Prediction can be used for more than Evolutionary Distance
  – Verification of Taxonomy
  – Identification of unknown
  – Techniques work for genetic and non-genetic data (Fatty Acid).
• Use multiple methods for verification
  – Pick at least two different types of methods from Parsimony, Distance and Likelihood.
  – If the analyses are in agreement there is a higher level of confidence that the analysis is correct.
BOOTSTRAPPING
How confident are we in this tree?
Bootstrapping
A statistical method that can be used to place confidence intervals on phylogenies
Resampling from the Data
Original data:
human_myoglobin          -GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHL ...
pig_myoglobin            -GLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHL ...
horse_myoglobin          -GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHL ...
common_seal_myoglobin    -GLSEGEWQLVLNVWGKVEADLAGHGQDVLIRLFKGHPETLEKFDKFKHL ...
sperm_whale_myoglobin    MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL ...
sea_hare_myoglobin       -SLSAAEADLAGKSWAPVFANKDANGDAFLVALFEKFPDSANFFADFKG- ...
Resampled data number 1:
Pick with replacement
human_myoglobin          LQKWDQKHNVHTEFGAEELQGDKLSWKKLDQGKKVVKKELGLDEDEWLGE
pig_myoglobin            LQKWDQKHNVHTEFGAEELQGDKLSWKKLDQGKKVVKKELGLDEDEWLGE
horse_myoglobin          LQKWDQTHNVHTEFGAEELQGDKLSWKTLDQGKKVVTKELGQDEDEWLGE
common_seal_myoglobin    LQKWEQKHNVHTEFGADELQGDKLSWKKLDQGKKVVKKELGLDEDDWLGE
sperm_whale_myoglobin    LQRWEQKHHVHTEFAADELQGDKLSWKKLDQGRKVVKKELGLDEDDWLGE
sea_hare_myoglobin       LDDWADENKSNSNFAAAELDANFASAPELNDGDKVAEKFAALNNAAWAAN
Repeat 99 more times (or 999, 9999, ...)
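The resampling step can be sketched as follows: columns of the alignment are drawn with replacement, and the same columns are used for every sequence so that each resampled column is still a real column of the original data. The toy alignment is just the first ten positions of three of the sequences above, and the function name and the fixed random seed are assumptions for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return (resampled alignment, chosen column indices), drawn with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    columns = [rng.randrange(length) for _ in range(length)]
    resampled = {
        name: "".join(alignment[name][c] for c in columns)
        for name in names
    }
    return resampled, columns


# First ten alignment positions only, to keep the example short.
toy = {
    "human_myoglobin":       "-GLSDGEWQL",
    "sperm_whale_myoglobin": "MVLSEGEWQL",
    "sea_hare_myoglobin":    "-SLSAAEADL",
}
rng = random.Random(0)  # fixed seed so the run is repeatable
resampled, columns = bootstrap_alignment(toy, rng)
for name, seq in resampled.items():
    print(name, seq)
```

A tree is then built from each resampled alignment with the same method used for the original data, and the collection of trees is used to judge how often each grouping recurs.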
Given the following tree, estimate the confidence of the two internal branches
[Figure: tree of Chimpanzee, Human, Gibbon, Gorilla and Orang-utan]
Estimating Confidence from the Resamplings
1. Of the 100 trees:
[Figure: three alternative topologies, found in 41/100, 28/100 and 31/100 of the trees]
2. Upon the original tree we superimpose bootstrap values:
[Figure: the original tree with bootstrap values 41 and 100 on its two internal branches]
In 41 of the 100 trees, chimp and gorilla are split from the rest.
In 100 of the 100 trees, gibbon and orang-utan are split from the rest.
THE TREE OF LIFE
Relationships between 16S ribosomal RNAs
[Figure: 16S ribosomal RNA tree showing eukaryotes, bacteria and archaea, with distant and close relationships marked]
The three domains of Life as identified by phylogenetic analysis of the highly conserved 16S ribosomal RNA
(Woese and Fox 1977)
Where is the root of the tree of life?
(by definition there is no outgroup)
An ancient gene duplication can root a tree
Gene duplication
Speciation of 3 and 1-2
Speciation of 1 and 2
Outgroups for A2
Outgroups for A1
Root of 1,2,3
Graur & Li. Fundamentals of Molecular Evolution (1999)
The root of the tree of life as inferred from EF-Tu and EF-G
Both trees show Archaea and Eucarya as sister taxa
Graur & Li. Fundamentals of Molecular Evolution (1999)
Horizontal Gene Transfer
[Figure: tree of a Mn-dependent transcriptional regulator, with archaea and eubacteria marked]
(Tatusov, 1996)
What is the origin of the mitochondria?
http://www.mitomap.org/
The endosymbiotic theory
The evidence:
• Both mitochondria and chloroplasts can arise only from preexisting mitochondria and chloroplasts. They cannot be formed in a cell that lacks them because nuclear genes encode only some of the proteins of which they are made.
• Both mitochondria and chloroplasts have their own genome.
• Both genomes consist of a single circular molecule of DNA.
• There are no histones associated with the DNA.
The Mitochondria sit with the proteobacteria in the tree of life
mitochondrial (MT)
Small-subunit (SSU) ribosomal RNA tree
Gray MW Nature. 1998 Nov 12;396(6707):109-10.
mitochondrion
chloroplast
Lack
mitochondria (?)
The genome sequence of Rickettsia prowazekii and the
origin of mitochondria.
Andersson SG Nature 1998 Nov 12;396(6707):133-40
Mitochondrial ribosomal proteins are most similar
to those of R. prowazekii
Andersson SG Nature 1998 Nov 12;396(6707):133-40
Mitochondrial proteins involved in ATP synthesis
are most similar to those of R. prowazekii
Andersson SG Nature 1998 Nov 12;396(6707):133-40
Mitochondria derive from α-purple bacteria
Chloroplasts derive from cyanobacteria
Graur & Li. Fundamentals of Molecular Evolution (1999)
The tree of life
with mitochondria and chloroplast endosymbiotic events
(Doolittle, 1999)
Horizontal transfer is a dominant feature of the “tree” of life
(Doolittle, 1999)