#31 - Phylogenetics: Character-Based Methods (BCB 444/544, 11/05/07)

BCB 444/544
Lecture 31
Phylogenetics – Character-Based Methods
#31_Nov05

Required Reading (before lecture)
Fri Oct 30 - Lecture 30
Phylogenetics – Distance-Based Methods
• Chp 11 - pp 142 – 169
Mon Nov 5 - Lecture 31
Phylogenetics – Parsimony and ML
• Chp 11 - pp 142 – 169
Wed Nov 7 - Lecture 32
Machine Learning
Fri Nov 9 - Lecture 33
Functional and Comparative Genomics
• Chp 17 and Chp 18
BCB 444/544 F07 ISU Terribilini #31- Phylogenetics - Character-Based Methods
Assignments & Announcements
Mon Oct 29 - HW#5
HW#5 = Hands-on exercises with phylogenetics
and tree-building software
Due: Mon Nov 5
(not Fri Nov 1 as previously posted)

BCB 544 Only:
New Homework Assignment
544 Extra#2
Due:
√PART 1 - ASAP
PART 2 - meeting prior to 5 PM Fri Nov 2
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 7 Wed - BBMB Seminar 4:10 in 1414 MBB
• Sharon Roth Dent, MD Anderson Cancer Center
• Role of chromatin and chromatin modifying proteins in
regulating gene expression
• Nov 8 Thurs - BBMB Seminar 4:10 in 1414 MBB
• Jianzhi George Zhang, U. Michigan
• Evolution of new functions for proteins
• Nov 9 Fri - BCB Faculty Seminar 2:10 in 102 SciI
• Amy Andreotti, ISU
• Something about NMR

Chp 11 – Phylogenetic Tree Construction Methods
and Programs
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 11 Phylogenetic Tree Construction Methods
and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs
BCB 444/544 Fall 07 Dobbs
Tree Construction
• Two main categories of tree building
methods
• Distance-based
• Overall similarity between sequences
• Character-based
• Consider the entire MSA

Summary of Distance-Based Methods
• Clustering-based methods:
• Computationally very fast and can handle large datasets
that other methods cannot
• Not guaranteed to find the best tree
• Optimality-based methods:
• Better overall accuracies
• Computationally slow
• All distance-based methods reduce the alignment to pairwise
distances, so site-level sequence information is lost and the
most likely state at an internal node cannot be inferred
Character-Based Methods
• Based directly on the sequence characters
in the MSA rather than overall distances
• Count mutational events accumulated on
sequences
• Evolutionary dynamics of each character
can be studied and ancestral sequences
inferred
• Two popular approaches
• Parsimony
• Maximum Likelihood (ML)

Parsimony
• Parsimony is based on Occam’s Razor –
the simplest explanation is most likely
correct
• Goal: Find the tree that allows
evolution of the sequences with the
fewest changes
Parsimony
• Parsimony score of a tree: The smallest
(weighted) number of steps required by the tree
• Two parsimony problems:
• Large Parsimony problem: Find the tree with the
lowest parsimony score
• Small Parsimony problem: Given a tree, find its
parsimony score
• Use the small parsimony problem to solve the large
parsimony problem

Algorithms for Small Parsimony
• Fitch’s algorithm:
• Based on set operations
• Evolutionary steps have the same weight
• Sankoff’s algorithm:
• Based on dynamic programming
• Allows steps to have different weights
• Both algorithms compute the minimum
(weighted) number of steps a tree requires
at a given site
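The set-based idea behind Fitch's algorithm can be sketched for a single site as follows. This is a minimal illustration, not the course's implementation: the nested-tuple tree encoding and the name `fitch_score` are assumptions made for the example.

```python
def fitch_score(tree, states):
    """Minimum number of equally weighted changes at one site.

    tree   : rooted binary tree as nested 2-tuples of leaf names
             (an assumed encoding), e.g. (("a", "b"), ("c", "d"))
    states : dict mapping leaf name -> observed character at this site
    """
    def visit(node):
        if isinstance(node, str):              # leaf: singleton set, no cost
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:                             # children can agree: no new step
            return common, left_cost + right_cost
        # children disagree: take the union and count one step
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


site = {"a": "A", "b": "A", "c": "G", "d": "G"}
print(fitch_score((("a", "b"), ("c", "d")), site))   # grouping that matches the data
print(fitch_score((("a", "c"), ("b", "d")), site))   # grouping that conflicts with it
```

Summing this per-site score over all alignment columns gives the parsimony score of the tree.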
Fitch’s Algorithm Example
Sankoff’s Algorithm
• Allows for different weights for
different evolutionary steps
• Transitions (A <-> G or C <-> T) are
more probable than transversions, so
give a lower weight to transitions
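A minimal sketch of Sankoff's dynamic programming for one site is given below. The weights (1 for a transition, 2 for a transversion), the nested-tuple tree encoding, and all function names are assumptions chosen for illustration; the slides do not specify them.

```python
INF = float("inf")
BASES = "ACGT"
PURINES = {"A", "G"}


def step_cost(x, y, transition=1, transversion=2):
    """Cost of changing x into y (the weights are illustrative assumptions)."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)   # purine<->purine or pyrimidine<->pyrimidine
    return transition if same_class else transversion


def sankoff_costs(node, states):
    """Return {base: minimum cost of the subtree if `node` carries that base}."""
    if isinstance(node, str):                       # leaf: only the observed base is allowed
        return {b: (0 if b == states[node] else INF) for b in BASES}
    child_tables = [sankoff_costs(child, states) for child in node]
    return {
        b: sum(min(step_cost(b, c) + table[c] for c in BASES) for table in child_tables)
        for b in BASES
    }


def sankoff_score(tree, states):
    """Weighted parsimony score of the tree at one site."""
    return min(sankoff_costs(tree, states).values())


site = {"a": "A", "b": "G", "c": "C", "d": "T"}
print(sankoff_score((("a", "b"), ("c", "d")), site))
```

With all weights set to 1, the score equals the unweighted count produced by Fitch's algorithm.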
Sankoff’s Algorithm Example
Sankoff’s Algorithm Traceback

Searching for a Most Parsimonious
Tree
• Solving the large parsimony problem
requires searching all possible trees
(or does it?)
• Exhaustive search (exact)
• Branch-and-Bound (exact)
• Heuristic search methods (not exact)

Exhaustive Search
• Build the only possible unrooted tree
for three taxa (can be randomly
chosen)
• Try all possible places to add the
fourth taxon and score each tree
• Try all places to add the fifth taxon
to the trees and score again …
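The enumeration described above can be sketched as follows. This is an illustrative sketch under stated assumptions: an unrooted tree is stored as `(first_taxon, subtree)` with nested tuples, so every node of `subtree` stands for one edge of the unrooted tree; scoring uses unweighted Fitch parsimony; the names `insertions`, `all_unrooted_trees`, and `exhaustive_search` and the toy alignment are hypothetical.

```python
def insertions(subtree, taxon):
    """Yield every tree obtained by attaching `taxon` to one edge of `subtree`."""
    yield (subtree, taxon)                       # the edge above this node
    if not isinstance(subtree, str):
        left, right = subtree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def all_unrooted_trees(taxa):
    """Enumerate all unrooted binary trees by adding one taxon at a time."""
    first, second, *rest = taxa
    subtrees = [second]
    for taxon in rest:
        subtrees = [t for s in subtrees for t in insertions(s, taxon)]
    return [(first, s) for s in subtrees]


def fitch_score(tree, alignment):
    """Unweighted parsimony score summed over all sites."""
    n_sites = len(next(iter(alignment.values())))

    def visit(node, i):
        if isinstance(node, str):
            return {alignment[node][i]}, 0
        left_set, left_cost = visit(node[0], i)
        right_set, right_cost = visit(node[1], i)
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return sum(visit(tree, i)[1] for i in range(n_sites))


def exhaustive_search(alignment):
    """Score every tree and return the best score with all trees achieving it."""
    scored = [(fitch_score(t, alignment), t)
              for t in all_unrooted_trees(sorted(alignment))]
    best = min(score for score, _ in scored)
    return best, [t for score, t in scored if score == best]


toy = {"a": "AAG", "b": "AAC", "c": "CCG", "d": "CCC"}   # made-up alignment
print(exhaustive_search(toy))
```

The number of trees visited grows so quickly with the number of taxa that this approach is only practical for small datasets.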
Adding the Fourth Taxon

Why Finding a True Tree is Difficult
• The number of possible
trees grows faster than
exponentially with the
number of species (or
sequences)
• Number of rooted trees for n taxa:
$N_r = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
• Number of unrooted trees for n taxa:
$N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
• To find the best tree,
you must explore all
possibilities (or must
you?)
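To make the growth concrete, the counts can be evaluated directly. The sketch below assumes the standard formulas for binary trees, Nr = (2n-3)! / (2^(n-2) (n-2)!) and Nu = (2n-5)! / (2^(n-3) (n-3)!); the function names are hypothetical.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted binary trees on n taxa (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted binary trees on n taxa (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Note that the number of rooted trees on n taxa equals the number of unrooted trees on n + 1 taxa, because rooting a tree is equivalent to attaching one extra taxon to one of its edges.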
Adding the Fifth Taxon
Branch and Bound
• Similar to exhaustive search except that
we maintain the score of best tree obtained
so far
• If score of current (partial) tree exceeds the
current best score, backtrack and take
next available path
• Main idea: The parsimony score of a tree
can never decrease as we add another taxon

Branch and Bound
• When a tip of the search tree is
reached the tree is either optimal
(and retained) or suboptimal (and
rejected)
• When all paths leading from the initial
3 taxon tree have been explored, the
algorithm terminates, and all most
parsimonious trees will have been
identified
Branch and Bound
• One way to find a reasonable upper
bound quickly:
• Use UPGMA or NJ to build a complete
tree
• Calculate the parsimony score of this
tree and use it as the initial upper bound
(best score so far) in our search
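A compact sketch of branch and bound for the large parsimony problem follows. Several things here are assumptions made for illustration: the starting bound comes from an arbitrary caterpillar tree rather than a UPGMA or NJ tree, trees are nested tuples `(first_taxon, subtree)`, scoring is unweighted Fitch parsimony, and all names are hypothetical.

```python
def insertions(subtree, taxon):
    """Yield every tree obtained by attaching `taxon` to one edge of `subtree`."""
    yield (subtree, taxon)
    if not isinstance(subtree, str):
        left, right = subtree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def fitch_score(tree, alignment):
    """Unweighted parsimony score summed over all sites."""
    n_sites = len(next(iter(alignment.values())))

    def visit(node, i):
        if isinstance(node, str):
            return {alignment[node][i]}, 0
        left_set, left_cost = visit(node[0], i)
        right_set, right_cost = visit(node[1], i)
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return sum(visit(tree, i)[1] for i in range(n_sites))


def branch_and_bound(alignment):
    taxa = sorted(alignment)
    # Starting bound: score of any complete tree. A caterpillar tree is used
    # here as a stand-in (assumption) for a quickly built UPGMA/NJ tree.
    start = taxa[1]
    for taxon in taxa[2:]:
        start = (start, taxon)
    best = {"score": fitch_score((taxa[0], start), alignment), "trees": []}

    def extend(subtree, placed):
        tree = (taxa[0], subtree)
        score = fitch_score(tree, alignment)
        if score > best["score"]:
            return                               # bound: adding taxa cannot lower the score
        if placed == len(taxa):                  # complete tree reached
            if score < best["score"]:
                best["score"], best["trees"] = score, []
            best["trees"].append(tree)
            return
        for bigger in insertions(subtree, taxa[placed]):
            extend(bigger, placed + 1)           # branch: try every attachment edge

    extend(taxa[1], 2)
    return best["score"], best["trees"]


toy = {"a": "AAG", "b": "AAC", "c": "CCG", "d": "CCC"}   # made-up alignment
print(branch_and_bound(toy))
```

Partial trees are pruned only when their score strictly exceeds the best score so far, so all equally parsimonious trees are retained.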
Heuristic Search
• Shortcuts have been designed to reduce
the search space
• Idea: Build a tree quickly (by NJ or some
other fast method) and rearrange parts of
it to explore some of the possible trees
• Branch swapping
• Nearest neighbor interchange
• Subtree pruning and regrafting
• Tree bisection and reconnection

Nearest-Neighbor Interchange
Subtree Pruning and Regrafting
Tree Bisection and Reconnection
Stepwise Addition – Another Heuristic
• A greedy method
• Start with 3 taxon tree
• Add one taxon at a time
• Keep only the best tree found so far
• No guarantee of optimality, but may
provide a good starting point for a
search

Maximum Likelihood Method
• ML is based on a Markov model of evolution
• Observed: The species labeling the leaves
• Hidden: The ancestral states
• Transition probabilities: The mutation
probabilities
• Assumptions:
• Only mutations are allowed
• Sites are independent
Models of Evolution at a Site
• Transition probability matrix:
$M = [m_{ij}]$, $i, j \in \{A, C, T, G\}$
where $m_{ij} = \mathrm{Prob}(i \to j \text{ mutation in 1 time unit})$
• Branches may have different lengths

The Probability of an Assignment
• Tree (figure): root T with internal children G and T; G has leaves A and G; T has leaves C and T
$\mathrm{Probability} = m_{TG} \cdot m_{GA} \cdot m_{GG} \cdot m_{TT} \cdot m_{TC} \cdot m_{TT}$

Ancestral Reconstruction: Most Likely
Assignment
• Tree (figure): root X with internal children Y and Z; Y has leaves A and G; Z has leaves C and T
$L^* = \max_{X,Y,Z} \{ m_{XY} \cdot m_{YA} \cdot m_{YG} \cdot m_{XZ} \cdot m_{ZC} \cdot m_{ZT} \}$
• Compute using Viterbi algorithm

Likelihood of a Tree
• Same tree (figure), with the ancestral states X, Y, Z summed out
$L = \sum_{X,Y,Z} \{ m_{XY} \cdot m_{YA} \cdot m_{YG} \cdot m_{XZ} \cdot m_{ZC} \cdot m_{ZT} \}$
• Compute using forward algorithm
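The product, max, and sum over the ancestral states X, Y, Z can be checked by brute force on the four-leaf example (leaves A, G, C, T). The sketch below enumerates all 64 assignments instead of using the Viterbi and forward recursions named on the slides, and the transition matrix (stay with probability 0.9, otherwise change uniformly) is a made-up assumption, as are the function names.

```python
from itertools import product

BASES = "ACGT"


def toy_matrix(p_change=0.1):
    """Hypothetical one-time-unit matrix m[i][j]: keep base i with prob 1 - p_change."""
    return {i: {j: (1 - p_change if i == j else p_change / 3) for j in BASES}
            for i in BASES}


def assignment_probability(m, x, y, z, leaves=("A", "G", "C", "T")):
    """m_XY * m_Y,leaf1 * m_Y,leaf2 * m_XZ * m_Z,leaf3 * m_Z,leaf4."""
    l1, l2, l3, l4 = leaves
    return m[x][y] * m[y][l1] * m[y][l2] * m[x][z] * m[z][l3] * m[z][l4]


def most_likely_assignment(m):
    """Maximise over X, Y, Z; returns (probability, (X, Y, Z))."""
    return max((assignment_probability(m, x, y, z), (x, y, z))
               for x, y, z in product(BASES, repeat=3))


def tree_likelihood(m):
    """Sum over all ancestral assignments X, Y, Z."""
    return sum(assignment_probability(m, x, y, z)
               for x, y, z in product(BASES, repeat=3))


m = toy_matrix()
print(assignment_probability(m, "T", "G", "T"))   # the assignment shown on the slide
print(most_likely_assignment(m))
print(tree_likelihood(m))
```

Brute force costs 4^k evaluations for k internal nodes, which is why the dynamic programming recursions are used in practice.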
Maximum Likelihood Comments
• ML is robust
• ML converges to the correct answer
as more data are added
• Can put in a Bayesian statistical
framework to obtain a distribution of
possible phylogenies
• ML can be slow

Phylogenetic Tree Evaluation
• Bootstrapping
• Jackknifing
• Bayesian Simulation
• Statistical difference tests (are two
trees significantly different?)
• Kishino-Hasegawa Test (paired t-test)
• Shimodaira-Hasegawa Test (χ² test)
Bootstrapping
• A bootstrap sample is obtained by sampling sites
randomly with replacement
• Obtain a data matrix with same number of taxa and
number of characters as original one
• Construct trees for samples
• For each branch in original tree, compute fraction
of bootstrap samples in which that branch appears
• Assigns a bootstrap support value to each branch
• Idea: If a grouping has a lot of support, it will be
supported by at least some positions in most of the
bootstrap samples

Bootstrapping Comments
• Bootstrapping doesn’t really assess the
accuracy of a tree, only indicates the
consistency of the data
• To get reliable statistics, bootstrapping
needs to be done on your tree 500 – 1000
times. This is a big problem if your tree
took a few days to construct

Jackknifing
• Another resampling technique
• Randomly delete half of the sites in the
dataset
• Construct new tree with this smaller
dataset, see how often taxa are grouped
• Advantage – sites aren’t duplicated
• Disadvantage – again really only measuring
consistency of the data

Bayesian Simulation
• Using a Bayesian ML method to produce a tree
automatically calculates the probability of many
trees during the search
• Most trees sampled in the Bayesian ML search are
near an optimal tree
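The bootstrap and jackknife resampling steps described in this section can be sketched as below. The alignment is a dict of equal-length strings, and `build_groups` is a hypothetical user-supplied function that builds a tree from an alignment and returns its groupings as a set of frozensets of taxon names; no particular tree-building program is assumed.

```python
import random


def bootstrap_sample(alignment, rng):
    """Resample sites with replacement; same taxa and same number of sites."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def jackknife_sample(alignment, rng):
    """Keep a random half of the sites, without duplicating any site."""
    n_sites = len(next(iter(alignment.values())))
    keep = sorted(rng.sample(range(n_sites), n_sites // 2))
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


def support_values(alignment, build_groups, resample, replicates=500, seed=0):
    """Fraction of replicates in which each grouping of the original tree reappears."""
    rng = random.Random(seed)
    original = build_groups(alignment)
    counts = dict.fromkeys(original, 0)
    for _ in range(replicates):
        groups = build_groups(resample(alignment, rng))
        for group in original:
            if group in groups:
                counts[group] += 1
    return {group: counts[group] / replicates for group in original}


toy = {"a": "AAGTCA", "b": "AACTCA", "c": "CCGTGA", "d": "CCCTGA"}   # made-up alignment
rng = random.Random(1)
print(bootstrap_sample(toy, rng))
print(jackknife_sample(toy, rng))
```

Passing `bootstrap_sample` or `jackknife_sample` as the `resample` argument switches between the two techniques while the support calculation stays the same.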
Phylogenetic Programs
• Huge list at:
• http://evolution.genetics.washington.edu/phylip/software.html
• PAUP* - one of the most popular programs,
commercial, Mac and Unix only, nice user interface
• PHYLIP – free, multiplatform, a bit difficult to use
but web servers make it easier
• WebPhylip – another interface for PHYLIP online
• TREE-PUZZLE – uses a heuristic to allow ML on
large datasets, also available as a web server
• PHYML – web based, uses genetic algorithm
• MrBayes – Bayesian program, fast and can handle
large datasets, multiplatform download
• BAMBE – web based Bayesian program
Final Comments on Phylogenetics
• No method is perfect
• Different methods make very
different assumptions
• If multiple methods using different
assumptions come up with similar
results, we should trust the results
more than any single method