#30 - Phylogenetics Distance-Based Methods, BCB 444/544, 11/02/07

Required Reading (before lecture)
Wed Oct 31 - Lecture 29
Phylogenetics Basics
• Chp 10 - pp 127 - 141
Thurs Nov 1 - Lab 9
Gene & Regulatory Element Prediction
Fri Nov 2 - Lecture 30
Phylogenetics – Distance-Based Methods
• Chp 11 - pp 142 - 169
Mon Nov 5 - Lecture 31
Phylogenetics – Parsimony and ML
• Chp 11 - pp 142 - 169
BCB 444/544 F07 ISU Terribilini #30- Phylogenetics - Distance-Based Methods

Assignments & Announcements
Mon Oct 29 - HW#5
HW#5 = Hands-on exercises with phylogenetics and tree-building software
Due: Mon Nov 5 (not Fri Nov 1 as previously posted)

BCB 544 "Team" Projects
Last week of classes will be devoted to Projects
• Written reports due:
  • Mon Dec 3 (no class that day)
• Oral presentations (20-30') will be:
  • Wed-Fri Dec 5,6,7
  • 1 or 2 teams will present during each class period
• See Guidelines for Projects posted online

BCB 544 Only: New Homework Assignment
544 Extra#2
Due:
√ PART 1 - ASAP
PART 2 - meeting prior to 5 PM Fri Nov 2
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Bob Jernigan BBMB, ISU
• Control of Protein Motions by Structure
BCB 444/544 Fall 07 Dobbs

Chp 10 - Phylogenetics
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 10 Phylogenetics Basics
• Evolution and Phylogenetics
• Terminology
• Gene Phylogeny vs. Species Phylogeny
• Forms of Tree Representation
• Why Finding a True Tree is Difficult
• Procedure of Building a Phylogenetic Tree

Tree Building Procedure
• Choose molecular markers
• Perform MSA
• Choose a model of evolution
• Determine tree building method
• Assess tree reliability

Choice of Molecular Markers
• Very closely related organisms - use nucleic acid sequences, which show more differences than protein sequences
• For individuals within a species - use noncoding regions of mtDNA, which have a faster mutation rate
• More distantly related species - use slowly evolving nucleic acid sequences like ribosomal RNA, or protein sequences
• Very distantly related species - use highly conserved protein sequences

Multiple Sequence Alignment
• Most critical step in tree building - cannot build correct tree without correct alignment
• Should build alignments with multiple programs, then inspect and compare to identify the most reasonable one
• Most alignments need manual editing
  • Make sure important functional residues align
  • Align secondary structure elements
  • Use full alignment or just parts

Automatic Editing of Alignments
• Rascal and NorMD – correct alignment errors, remove potentially unrelated or highly divergent sequences
• Gblocks – detect and eliminate poorly aligned positions and divergent regions

How do we measure divergence between sequences?
• Simple measure – just count the number of substitutions observed between the sequences in the MSA
• Problem – number of substitutions may not represent the number of evolutionary events that actually occurred
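
The simple measure described above (counting observed substitutions between two aligned sequences) can be sketched in a few lines of Python. This is only an illustration: the function names and the toy sequences are assumptions, and the choice to skip gapped columns is one common convention rather than something stated in the lecture.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count aligned, gap-free positions where the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y and "-" not in (x, y))


def p_distance(seq_a: str, seq_b: str) -> float:
    """Observed proportion of differing sites among gap-free columns."""
    compared = sum(1 for x, y in zip(seq_a, seq_b) if "-" not in (x, y))
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return count_differences(seq_a, seq_b) / compared


# Hypothetical rows of an MSA (not taken from the lecture).
s1 = "ACGTACGTAC"
s2 = "ACGTTCGTAA"
print(count_differences(s1, s2))  # 2
print(p_distance(s1, s2))         # 0.2
```

The observed proportion only counts visible differences, so it can underestimate the number of evolutionary events, which is the problem the following slides address.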

Multiple Substitutions
[Figure: nucleotide substitution diagram; not recoverable from the extracted text]
Just because we see only one difference does not mean that there was only one evolutionary event

Multiple Substitutions
[Figure: nucleotide substitution diagram; not recoverable from the extracted text]
Just because we see no difference does not mean that there were no evolutionary events

Choosing Substitution Models
• Statistical models of evolution are used to correct for the multiple substitution problem
• Focus on DNA models

Jukes-Cantor Model
• Jukes-Cantor model assumes all nucleotides are substituted with equal probability
• Can be used to correct for multiple substitutions
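
As a rough illustration of how the Jukes-Cantor model corrects for multiple substitutions, the sketch below applies the standard Jukes-Cantor distance formula, d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites. The function name and the sample values of p are assumptions made for this example.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance d = -(3/4) * ln(1 - (4/3) * p)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Sample observed proportions (hypothetical values).
for p in (0.05, 0.20, 0.50):
    print(f"p = {p:.2f}  ->  d_JC = {jukes_cantor_distance(p):.4f}")
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as sequences diverge, reflecting the hidden substitutions the model tries to account for.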

Many Other Models

Evolutionary Models for Protein Sequences
• PAM and JTT substitution matrices already take into account multiple substitutions
• There are also models similar to Jukes-Cantor for protein sequences

What about differences in mutation rates between positions within a sequence?
• One of our assumptions was that all positions in a sequence are evolving at the same rate
• Bad assumption
  • Third position in a codon changes with higher frequency
  • In proteins, some amino acids can change and others cannot
• This variation is called among-site rate heterogeneity
• Many tree building programs have parameters meant to deal with this problem – adds to complexity of getting the correct tree

Chp 11 – Phylogenetic Tree Construction Methods and Programs
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 11 Phylogenetic Tree Construction Methods and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs

Tree Construction
• Two main categories of tree building methods
  • Distance-based
    • Overall similarity between sequences
  • Character-based
    • Consider the entire MSA

Distance-Based Methods
• Given an MSA and an evolutionary model, calculate the distance between all pairs of sequences
• Construct distance matrix
• Construct phylogenetic tree based on the distance matrix

Distance Matrices
      a    b    c    d
a     0
b     6    0
c     7    3    0
d    14   10    9    0
[Figure: tree for taxa a, b, c, d drawn against a scale from 0 to 8]

Distance-Based Methods
• Two ways to construct a tree based on a distance matrix
  • Clustering
  • Optimality
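
The first two steps of a distance-based method (pairwise distances under a model, then a distance matrix) might look like the sketch below. It assumes the Jukes-Cantor correction as the evolutionary model and uses a made-up toy alignment; the function names and the dictionary layout are choices made for this example only.

```python
import math
from itertools import combinations


def jc_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor distance between two aligned sequences, ignoring gapped columns."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if "-" not in (x, y)]
    p = sum(x != y for x, y in pairs) / len(pairs)
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def distance_matrix(msa: dict[str, str]) -> dict[tuple[str, str], float]:
    """Return pairwise distances, stored under both (a, b) and (b, a)."""
    dist = {}
    for a, b in combinations(msa, 2):
        dist[(a, b)] = dist[(b, a)] = jc_distance(msa[a], msa[b])
    return dist


# Hypothetical toy alignment; taxon names and sequences are made up.
msa = {
    "a": "ACGTACGTACGTACGTACGT",
    "b": "ACGTACGTACGTACGAACGT",
    "c": "ACGTACCTACGTACGAACGT",
    "d": "ATGTACCTACGTTCGAACGA",
}
dist = distance_matrix(msa)
for a, b in combinations(msa, 2):
    print(a, b, round(dist[(a, b)], 4))
```

The resulting matrix is the only input the tree-building step sees, which is why the choice of model at this stage matters.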

Clustering-Based Methods
• E.g., UPGMA and Neighbor-Joining
• A cluster is a set of taxa
• Interspecies distances translate into intercluster distances
• Clusters are repeatedly merged
  • "Closest" clusters merged first
  • Distances are recomputed after merging

UPGMA
• UPGMA – Unweighted Pair Group Method Using Arithmetic Average
• Uses molecular clock assumption – all taxa evolve at a constant rate and are equally distant from the root (ultrametric tree)
• This assumption is usually wrong
• So why use UPGMA?
  • Very fast

UPGMA Example
[Figures: worked UPGMA example over four slides; not recoverable from the extracted text]
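
Because the worked UPGMA slides did not survive extraction, here is a minimal sketch of the clustering loop described above: merge the closest pair, place the new node at half their distance, and recompute distances as arithmetic averages. The input matrix is a hypothetical ultrametric example, and the data layout (frozenset keys, nested tuples) is an assumption of this sketch, not the lecture's notation.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree.

    names: list of taxon labels.
    dist:  dict mapping frozenset({x, y}) -> distance for every pair of labels.
    Returns a nested tuple (left, right, height) describing the rooted tree.
    """
    clusters = {name: (name, 1) for name in names}  # id -> (subtree, leaf count)
    d = dict(dist)
    while len(clusters) > 1:
        # The closest pair of clusters is merged first.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2  # molecular clock: node sits halfway
        new = f"({a},{b})"
        # Recompute distances as a size-weighted arithmetic average.
        for c in clusters:
            d[frozenset((new, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[new] = ((tree_a, tree_b, height), n_a + n_b)
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical ultrametric distances (not from the lecture).
pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
         ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
dist = {frozenset(k): v for k, v in pairs.items()}
tree = upgma(["A", "B", "C", "D"], dist)
print(tree)  # ('D', ('C', ('A', 'B', 1.0), 2.0), 3.0)
```

With ultrametric input the recovered node heights (1, 2, 3) match the data exactly; when rates differ between lineages, the halfway placement is where UPGMA goes wrong.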

Neighbor Joining
• Idea: Find a pair of taxa that are close to each other but far from other taxa
  • Implicitly finds a pair of neighboring taxa
• No molecular clock assumption
• NJ corrects for unequal evolutionary rates between sequences by using a conversion step
• The conversion step requires calculation of "r-values" and "transformed r-values"

The r-value for a sequence is:
$$r_i = \sum_{j \neq i} d_{ij}$$
The sum of the distances between sequence i and all other sequences

The transformed r-value for a sequence is:
$$r'_i = \frac{r_i}{n - 2}$$
Where n is the number of taxa
Transformed r-values are used to determine the distance of a taxon to the nearest node

The converted distance between two sequences is:
$$d'_{ij} = d_{ij} - \frac{1}{2}(r_i + r_j)$$
These converted distances are used in building the tree

The final equation we need is for computing the distance from a new cluster to each taxon. Assume taxa i and j were merged into a cluster u. The distance from taxon i to cluster u is:
$$d_{iu} = \frac{d_{ij} + (r'_i - r'_j)}{2}$$
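
The four Neighbor Joining quantities defined above translate almost directly into code. The sketch below is illustrative only: the function names and the dict-of-dicts matrix layout are assumptions, and the sample matrix is simply a small additive example chosen so the arithmetic is easy to follow.

```python
def r_values(d):
    """r_i = sum of distances from taxon i to every other taxon."""
    return {i: sum(row.values()) for i, row in d.items()}


def transformed_r_values(d):
    """r'_i = r_i / (n - 2), where n is the number of taxa."""
    n = len(d)
    return {i: r / (n - 2) for i, r in r_values(d).items()}


def converted_distance(d, i, j):
    """d'_ij = d_ij - (r_i + r_j) / 2."""
    r = r_values(d)
    return d[i][j] - (r[i] + r[j]) / 2


def branch_length(d, i, j):
    """d_iu = [d_ij + (r'_i - r'_j)] / 2 for the node u that joins i and j."""
    rt = transformed_r_values(d)
    return (d[i][j] + (rt[i] - rt[j])) / 2


# Illustrative symmetric matrix for four taxa (values are an assumption of this sketch).
d = {
    "a": {"b": 6, "c": 7, "d": 14},
    "b": {"a": 6, "c": 3, "d": 10},
    "c": {"a": 7, "b": 3, "d": 9},
    "d": {"a": 14, "b": 10, "c": 9},
}
print(r_values(d))                      # {'a': 27, 'b': 19, 'c': 19, 'd': 33}
print(converted_distance(d, "a", "b"))  # -17.0
print(branch_length(d, "a", "b"))       # 5.0
```

Note that the converted distance uses the raw r-values, while the branch length uses the transformed ones; mixing the two up is an easy mistake when working the formulas by hand.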

Neighbor Joining Example
• Initialize tree into a star shape with all taxa connected to the center
[Figure: star tree with taxa A, B, C, D]

Distance matrix:
       A      B      C
B    0.40
C    0.35   0.45
D    0.60   0.70   0.55

• Step 1: Compute r-values and transformed r-values for all taxa
$$r_A = d_{AB} + d_{AC} + d_{AD} = 0.4 + 0.35 + 0.6 = 1.35$$
$$r'_A = \frac{r_A}{4 - 2} = \frac{1.35}{2} = 0.675$$
• Step 2: Compute converted distances
$$d'_{AB} = d_{AB} - \frac{1}{2}(r_A + r_B) = 0.4 - \frac{1}{2}(1.35 + 1.55) = -1.05$$
• Step 3: Fill out converted distance matrix
       A       B       C
B    -1.05
C    -1      -1
D    -1      -1      -1.05
• Step 4: Create a node by merging closest taxa
  • In this example, the converted distance between A and B is the same as the converted distance between C and D
  • We can pick either pair to start with
  • Let's pick A and B and create a node called U
[Figure: A and B joined at node U, with C and D still attached to the center; branch lengths marked "?"]
• Step 5: Compute branch lengths
  • Use the equation for computing the distance from a taxon to a node
$$d_{AU} = \frac{d_{AB} + (r'_A - r'_B)}{2} = \frac{0.4 + (0.675 - 0.775)}{2} = 0.15$$
[Figure: branch A-U = 0.15, branch B-U = 0.25]
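
Steps 1 through 5 can be checked numerically with a short script. The sketch below uses the example's distance matrix; the variable names, the helper `dist`, and the rounding used to break the A-B / C-D tie are choices made for this illustration rather than part of the lecture.

```python
from itertools import combinations

# Pairwise distances from the example's distance matrix.
D = {
    ("A", "B"): 0.40, ("A", "C"): 0.35, ("A", "D"): 0.60,
    ("B", "C"): 0.45, ("B", "D"): 0.70, ("C", "D"): 0.55,
}
taxa = ["A", "B", "C", "D"]


def dist(i, j):
    """Look up a distance regardless of the order of the pair."""
    return D[(i, j)] if (i, j) in D else D[(j, i)]


# Step 1: r-values and transformed r-values.
r = {i: sum(dist(i, j) for j in taxa if j != i) for i in taxa}
r_t = {i: r[i] / (len(taxa) - 2) for i in taxa}

# Steps 2-3: converted distances d'_ij = d_ij - (r_i + r_j) / 2.
conv = {(i, j): dist(i, j) - (r[i] + r[j]) / 2 for i, j in combinations(taxa, 2)}

# Step 4: merge the pair with the smallest converted distance.
# (A, B) and (C, D) tie; rounding keeps float noise from deciding the tie,
# so the first pair in iteration order, (A, B), is chosen.
i, j = min(conv, key=lambda pair: round(conv[pair], 9))

# Step 5: branch lengths from the merged taxa to the new node U.
d_iu = (dist(i, j) + (r_t[i] - r_t[j])) / 2
d_ju = dist(i, j) - d_iu

print({k: round(v, 3) for k, v in r.items()})
print({k: round(v, 2) for k, v in conv.items()})
print(i, j, round(d_iu, 2), round(d_ju, 2))
```

The second branch length is obtained here as d_AB minus d_AU, which gives the 0.25 shown for B in the figure.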

Neighbor Joining Example
• Step 6: Construct reduced distance matrix by computing the distance from each remaining taxon to the new node U
  • In UPGMA, we simply calculated the average
$$d_{CU} = \frac{(d_{AC} - d_{UA}) + (d_{BC} - d_{UB})}{2} = \frac{(0.35 - 0.15) + (0.45 - 0.25)}{2} = 0.2$$

Our reduced distance matrix:
       U      C
C    0.20
D    0.45   0.55
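
The reduction in step 6 can be sketched as a small function that replaces the merged pair with the new node. The function name `reduce_matrix`, its parameters, and the dict-of-dicts layout are assumptions made for this example; the numbers are the ones from the worked example.

```python
def reduce_matrix(d, i, j, d_iu, d_ju, new="U"):
    """Replace taxa i and j by node `new`.

    d is a dict of dicts of pairwise distances. For every remaining taxon k:
        d_ku = [(d_ik - d_iu) + (d_jk - d_ju)] / 2
    """
    rest = [k for k in d if k not in (i, j)]
    # Distances among the remaining taxa are carried over unchanged.
    reduced = {k: {m: d[k][m] for m in rest if m != k} for k in rest}
    reduced[new] = {}
    for k in rest:
        d_ku = ((d[i][k] - d_iu) + (d[j][k] - d_ju)) / 2
        reduced[new][k] = reduced[k][new] = d_ku
    return reduced


d = {
    "A": {"B": 0.40, "C": 0.35, "D": 0.60},
    "B": {"A": 0.40, "C": 0.45, "D": 0.70},
    "C": {"A": 0.35, "B": 0.45, "D": 0.55},
    "D": {"A": 0.60, "B": 0.70, "C": 0.55},
}
reduced = reduce_matrix(d, "A", "B", d_iu=0.15, d_ju=0.25)
print(round(reduced["U"]["C"], 2), round(reduced["U"]["D"], 2), reduced["C"]["D"])
```

Unlike the plain averaging used in UPGMA, each distance is first shortened by the branch already assigned to the merged taxon, so unequal rates on the A and B branches do not distort the distances to U.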
• From here, we go back to step 1
• Continue until all taxa have been decomposed from the star tree

Optimality-Based Methods
• Clustering methods produce a single tree with no ability to judge how good it is compared to alternative tree topologies
• Optimality-based methods compare all possible tree topologies and select a tree that best fits the distance matrix
• Two algorithms:
  • Fitch-Margoliash
  • Minimum evolution

Fitch-Margoliash
• Selects best tree among all possible trees based on minimum deviation between distances calculated in the tree and distances in the distance matrix
• Basically, a least squares method
  • D_ij = distance between i and j in matrix
  • d_ij = distance between i and j in tree
• Objective: Find tree that minimizes
$$\sum_{1 \le i < j \le n} (D_{ij} - d_{ij})^2$$

Minimum Evolution
• Similar to Fitch-Margoliash, but uses a different optimality criterion
• Searches for a tree with the minimum total branch length
• This is an indirect way of achieving the best fit of the branch lengths with the original data
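
To make the two optimality criteria concrete, the sketch below scores two candidate four-taxon trees against a distance matrix using the unweighted sum of squares given above, and also reports total branch length. Everything here is an assumption for illustration: the matrix, the candidate topologies, and especially the branch lengths, which are supplied by hand rather than optimized as a real Fitch-Margoliash or minimum evolution program would do.

```python
from itertools import combinations


def tree_distances(pairing, leaf_len, internal_len):
    """Path-length distances in a 4-taxon unrooted tree.

    pairing:  ((w, x), (y, z)), the two leaf pairs separated by the internal branch.
    leaf_len: dict leaf -> terminal branch length.
    """
    side = {leaf: idx for idx, pair in enumerate(pairing) for leaf in pair}
    dist = {}
    for i, j in combinations(sorted(side), 2):
        crosses = side[i] != side[j]
        dist[(i, j)] = leaf_len[i] + leaf_len[j] + (internal_len if crosses else 0.0)
    return dist


def least_squares_score(matrix, tree_dist):
    """Sum over i<j of (D_ij - d_ij)^2, the unweighted least squares criterion."""
    return sum((matrix[pair] - tree_dist[pair]) ** 2 for pair in matrix)


def tree_length(leaf_len, internal_len):
    """Total branch length, the quantity minimum evolution tries to minimize."""
    return sum(leaf_len.values()) + internal_len


# Hypothetical distance matrix D_ij for taxa a-d.
D = {("a", "b"): 6, ("a", "c"): 7, ("a", "d"): 14,
     ("b", "c"): 3, ("b", "d"): 10, ("c", "d"): 9}

# Two candidate trees with hand-picked (not optimized) branch lengths.
tree1 = ((("a", "b"), ("c", "d")), {"a": 5, "b": 1, "c": 1, "d": 8}, 1.0)
tree2 = ((("a", "c"), ("b", "d")), {"a": 5.5, "b": 1.5, "c": 1.5, "d": 8.5}, 0.0)

for pairing, lengths, internal in (tree1, tree2):
    score = least_squares_score(D, tree_distances(pairing, lengths, internal))
    print(pairing, "score =", score, "length =", tree_length(lengths, internal))
```

In this toy case the first tree fits the matrix exactly (score 0) and is also shorter, so both criteria prefer it; in general the two criteria need not agree.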
Summary of Distance-Based Methods
• Clustering-based methods:
• Computationally very fast and can handle large datasets
that other methods cannot
• Not guaranteed to find the best tree
• Optimality-based methods:
• Better overall accuracies
• Computationally slow
• All distance-based methods lose all sequence
information and cannot infer the most likely state
at an internal node