Molecular Clocks

Rose Hoberman
The Holy Grail
Fossil evidence is
sparse and imprecise
(or nonexistent)
Predict divergence times by
comparing molecular data
• Given
– a phylogenetic tree
– branch lengths (rt)
– a time estimate for
one (or more) node
[Figure: example tree with labels C, D, R, M, H and one node dated 110 MYA]
• Can we date other nodes in the tree?
• Yes... if the rate of molecular change is
constant across all branches
Rate Constancy?
Page & Holmes p240
Protein Variability
• Protein structures & functions differ
– Proportion of neutral sites differs
• Rate constancy does not hold across
different protein types
• However...
– Each protein does appear to have a
characteristic rate of evolution
Evidence for Rate Constancy
in Hemoglobin
Large carnivorous
marsupial
Page and Holmes
p229
The
Molecular Clock
Hypothesis
• Amount of genetic difference between
sequences is a function of time since
separation.
• Rate of molecular change is constant
(enough) to predict times of divergence
Outline
• Methods for estimating time under a molecular
clock
– Estimating genetic distance
– Determining and using calibration points
– Sources of error
• Rate heterogeneity
– reasons for variation
– how it's taken into account when estimating times
• Reliability of time estimates
• Estimating gene duplication times
Measuring Evolutionary time with a
molecular clock
1. Estimate genetic distance
d = number of amino acid replacements
2. Use paleontological data to determine date of
common ancestor
T = time since divergence
3. Estimate calibration rate (number of genetic
changes expected per unit time)
r = d / 2T
4. Calculate time of divergence for novel
sequences
T_ij = d_ij / 2r
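The four steps above can be sketched in a few lines of Python. The function names and the numbers in the usage lines are illustrative assumptions, not values from the lecture.

```python
def calibration_rate(d_calib: float, t_calib: float) -> float:
    """Step 3: r = d / 2T. The factor 2 counts both lineages since the split."""
    if t_calib <= 0:
        raise ValueError("calibration time must be positive")
    return d_calib / (2.0 * t_calib)


def divergence_time(d_ij: float, rate: float) -> float:
    """Step 4: T_ij = d_ij / 2r for a pair of sequences with unknown split date."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return d_ij / (2.0 * rate)


# Hypothetical numbers: a calibration pair with distance 0.22
# replacements per site that split 110 MYA, and a new pair with distance 0.08.
r = calibration_rate(0.22, 110.0)   # replacements per site per MY per lineage
t_new = divergence_time(0.08, r)    # MY since the new pair diverged
```

The estimate is only as good as the assumption that the calibration rate r also applies on the branches leading to the new pair.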
Estimating Genetic Differences
• If all nt are equally likely, the observed
difference would plateau at 0.75
• Simply counting differences
underestimates distances
– fails to account for multiple hits
(Page & Holmes p148)
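One standard way to correct for multiple hits is the Jukes-Cantor formula, which assumes all four nucleotides are equally likely and all substitutions equally probable. The slide does not name a specific correction, so treating Jukes-Cantor as the model here is an assumption; the helper names are illustrative.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned, ungapped sites that differ (the raw count)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p: float) -> float:
    """Expected substitutions per site: d = -(3/4) ln(1 - 4p/3).

    Undefined once p reaches the 0.75 plateau, where the sequences
    look like random pairs and the distance is effectively infinite.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For small p the correction is negligible; as p approaches 0.75 the corrected distance grows without bound, which is why saturated sequences carry little dating information.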
Estimating Genetic Distance with a
Substitution Model
• accounts for relative frequency of different
types of substitutions
• allows variation in substitution rates
between sites
• given learned parameter values
– nucleotide frequencies
– transition/transversion bias
– alpha parameter of gamma distribution
• can infer branch length from differences
Distances from
Gamma-Distributed Rates
• rate variation among sites
– “fast/variable” sites
• 3rd codon positions
• codons on surface of globular protein
– “slow/invariant” sites
• Tryptophan (1 codon) structurally required
• 1st or 2nd codon position when disulfide bond needed
• alpha parameter of gamma distribution
describes degree of variation of rates across
positions
• modeling rate variation changes branch length/
sequence differences curve
Gamma Corrected Distances
• high rate sites
saturate quickly
• sequence difference
rises much more
slowly as the
low-rate sites
gradually accumulate
differences
• Felsenstein, Inferring
Phylogenies p219
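To see how the alpha parameter changes the distance/difference curve, the sketch below uses the gamma-corrected form of the Jukes-Cantor distance. Choosing Jukes-Cantor as the underlying model is an assumption made for brevity; the slides do not tie the correction to a particular substitution model.

```python
def gamma_jc_distance(p: float, alpha: float) -> float:
    """Jukes-Cantor distance with gamma-distributed rates across sites.

    d = (3/4) * alpha * ((1 - 4p/3) ** (-1/alpha) - 1)

    Small alpha means strong rate variation among sites; as alpha grows
    the result approaches the plain Jukes-Cantor distance.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


# Same observed difference, different assumed rate variation
# (alpha values are illustrative).
for alpha in (0.5, 1.0, 5.0):
    print(alpha, round(gamma_jc_distance(0.3, alpha), 4))
```

For a fixed observed difference, a smaller alpha implies a longer branch, because much of the change is assumed to be hidden at fast sites that have already saturated.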
The ‘Sloppy’ Clock
• ‘Ticks’ are stochastic, not deterministic
– Mutations happen randomly according to a
Poisson distribution.
• Many divergence times can result in the
same number of mutations
• Actually over-dispersed Poisson
– Correlations due to structural constraints
Poisson Variance
(Assuming a Perfect Molecular Clock)
If one mutation occurs every MY
• Poisson variance
– 95% of lineages 15 MY
old have 8-22
substitutions
– 8 substitutions could
also be 5 MY old
Molecular Systematics p532
Need for Calibrations
• Changes = rate*time
• Can explain any observed branch length
– Fast rate, short time
– Slow rate, long time
• Suppose 16 changes along a branch
– Could be 2 * 8 or 8 * 2
– No way to distinguish
– If told time = 8, then rate = 2
• Assume rate=2 along all branches
– Can infer all times
Estimating Calibration Rate
• Calculate separate rate for each data set
(species/genes) using known date of
divergence (from fossil, biogeography)
• One calibration point
– Rate = d/2T
• More than one calibration point
– use regression
– use generative model that constrains time
estimates (more later)
Calibration Complexities
• Cannot date fossils perfectly
• Fossils usually not direct ancestors
– branched off tree before (after?) splitting
event.
• Impossible to pinpoint the age of last
common ancestor of a group of living
species
Linear Regression
• Fix intercept at (0,0)
• Fit line between
divergence estimates and
calibration times
• Calculate regression and
prediction confidence limits
Molecular Systematics p536
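A regression through the origin has a closed-form slope, sum(x*y) / sum(x*x). The sketch below fits that slope to calibration points and inverts it to date a new divergence. The data are made up for illustration, the slope is the rate of accumulation of pairwise distance (already including the factor of 2 for two lineages), and the confidence limits mentioned on the slide are not computed.

```python
def slope_through_origin(times, distances):
    """Least-squares slope for distance = slope * time, intercept fixed at 0."""
    if len(times) != len(distances) or not times:
        raise ValueError("need equal-length, non-empty inputs")
    sxx = sum(t * t for t in times)
    sxy = sum(t * d for t, d in zip(times, distances))
    if sxx == 0:
        raise ValueError("calibration times must not all be zero")
    return sxy / sxx


def predict_time(distance, slope):
    """Invert the fitted line to estimate a divergence time."""
    return distance / slope


# Hypothetical calibration points: (time in MY, pairwise distance).
calib_times = [10.0, 20.0, 40.0]
calib_dists = [0.02, 0.04, 0.08]

slope = slope_through_origin(calib_times, calib_dists)
print(predict_time(0.06, slope))
```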
Molecular Dating
Sources of Error
• Both X and Y values only estimates
– substitution model could be incorrect
– tree could be incorrect
– errors in orthology assignment
– Poisson variance is large
• Pairwise divergences correlated (Systematics p534?)
– inflates correlation between divergence & time
• Sometimes calibrations correlated
– if using derived calibration points
• Error in inferring slope
• Confidence interval for predictions much larger than
confidence interval for slope
Rate Heterogeneity
• Rate of molecular evolution can differ between
– nucleotide positions
– genes
– genomic regions
– genomes (nuclear vs organelle)
– species
– over time
• If not considered, introduces bias into
time estimates
Rate Heterogeneity
among Lineages
Cause              Reason
Repair equipment   e.g. RNA viruses have error-prone polymerases
Metabolic rate     More free radicals
Generation time    Copies DNA more frequently
Population size    Affects mutation fixation rate
Local Clocks?
• Closely related species often share similar
properties, likely to have similar rates
• For example
– murid rodents on average 2-6 times faster
than apes and humans (Graur & Li p150)
– mouse and rat rates are nearly equal (Graur &
Li p146)
Rate Changes within a Lineage
Cause                     Reason
Population size changes   Genetic drift more likely to fix
                          neutral alleles in small population
Strength of selection     1. new role/environment
changes over time         2. gene duplication
                          3. change in another gene
Working Around Rate
Heterogeneity
1. Identify lineages that deviate and remove
them
2. Quantify degree of rate variation to put
limits on possible divergence dates
– requires several calibration dates, not always
available
– gives very conservative estimates of
molecular dates
3. Explicitly model rate variation
Search for Genes with
Uniform Rate across Taxa
Many ‘clock’ tests:
– Relative rates tests
• compares rates of sister nodes using an outgroup
– Tajima test
• Number of sites in which character shared by outgroup and
only one of two ingroups should be equal for both ingroups
– Branch length test
• deviation of distance from root to leaf compared to average
distance
– Likelihood ratio test
• identifies deviance from clock but not the deviant sequences
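As an illustration of the Tajima test described above, the sketch below counts sites where exactly one ingroup sequence differs from a state shared by the other ingroup and the outgroup, then compares the two counts with a one-degree-of-freedom chi-square statistic. The function name and return layout are assumptions made for this example.

```python
import math


def tajima_test(seq1: str, seq2: str, outgroup: str):
    """Return (m1, m2, chi2, p_value) for a Tajima-style relative rate test.

    m1: sites where seq2 matches the outgroup but seq1 differs
    m2: sites where seq1 matches the outgroup but seq2 differs
    Under a clock, m1 and m2 are expected to be equal.
    """
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a != b:
            if b == o:
                m1 += 1
            elif a == o:
                m2 += 1
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper tail of a chi-square with 1 df, written with erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return m1, m2, chi2, p_value
```

With short sequences the counts are small and the chi-square approximation is rough, which is one reason these tests have little power.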
Likelihood Ratio Test
• estimate a phylogeny under molecular
clock and without it
– e.g. root-to-tip distances must be equal
• 2 × difference in log-likelihood ~ Chi^2 with n-2
degrees of freedom (n = number of sequences)
– asymptotically
– when models are nested
– when nested parameters aren’t set to
boundary
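Given the two maximized log-likelihoods, the test itself is a short calculation. The sketch below assumes the log-likelihoods were obtained elsewhere from a phylogenetics program, and it implements the chi-square tail for integer degrees of freedom with the standard library only; the function names and the numbers in the usage lines are illustrative.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square with integer df.

    Uses Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1),
    starting from Q(1, y) = exp(-y) or Q(1/2, y) = erfc(sqrt(y)).
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += math.exp(-y) * y ** a / math.gamma(a + 1.0)
        a += 1.0
    return q


def clock_lrt(lnl_clock: float, lnl_free: float, n_taxa: int):
    """Return (statistic, df, p_value) for clock vs non-clock models."""
    if n_taxa < 3:
        raise ValueError("need at least 3 sequences")
    stat = max(2.0 * (lnl_free - lnl_clock), 0.0)
    df = n_taxa - 2
    return stat, df, chi2_sf(stat, df)


# Hypothetical log-likelihoods for a 6-sequence data set.
stat, df, p = clock_lrt(lnl_clock=-1010.0, lnl_free=-1000.0, n_taxa=6)
```

A small p-value rejects the clock for the data set as a whole; as noted on the previous slide, it does not say which sequences deviate.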
Relative Rates Tests
• Tests whether the distances from two taxa to an
outgroup are equal (or average rate of two clades vs an
outgroup)
– need to compute expected variance
– many triples to consider, and not independent
• Lacks power, esp. with
– short sequences
– low rates of change
• Given length and number of variable sites in typical
sequences used for dating, Bromham et al (2000) say:
– unlikely to detect moderate variation between lineages (1.5-4x)
– likely to result in substantial error in date estimates
Modeling Rate Variation
Relaxing the Molecular Clock
[Figure: example tree with node labels A, B, C, D, E, F, M, N, R]
• Learn rates and times, not just
branch lengths
– Assume root-to-tip times equal
– Allow different rates on different branches
– Rates of descendants correlate with that of
common ancestor
• Restricts choice of rates, but still too much
flexibility to choose rates well
Relaxing the Molecular Clock
• Likelihood analysis
– Assign each branch a rate parameter
• explosion of parameters, not realistic
– User can partition branches based on domain knowledge
– Rates of partitions are independent
• Nonparametric methods
– smooth rates along tree
• Bayesian approach
– stochastic model of evolutionary change
– prior distribution of rates
– Bayes theorem
– MCMC
Parsimonious Approaches
• Sanderson 1997, 2002
– infer branch lengths via parsimony
– fit divergence times to minimize difference
between rates in successive branches
– (unique solution?)
• Cutler 2000
– infer branch lengths via parsimony
– rates drawn from a normal distribution
(negative rates set to zero)
Bayesian Approaches
Learn rates, times, and substitution parameters
simultaneously
Devise model of relationship between rates
– Thorne/Kishino et al
• Assigns new rates to descendant lineages from a
lognormal distribution with mean equal to
ancestral rate and variance increasing with
branch length
– Huelsenbeck et al
• Poisson process generates random rate changes
along tree
• new rate is current rate * gamma-distributed
random variable
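The Thorne/Kishino idea of autocorrelated rates can be illustrated by drawing a descendant rate from a lognormal centered on its ancestor's rate. This is a minimal sketch, not the published model: the parameter `nu` (variance per unit time) and the log-scale shift of -sigma^2/2, used so that the expected descendant rate equals the ancestral rate, are assumptions of this example.

```python
import math
import random


def draw_descendant_rate(parent_rate: float, branch_time: float,
                         nu: float, rng: random.Random) -> float:
    """Draw a descendant rate from a lognormal tied to the ancestral rate.

    Variance on the log scale is nu * branch_time, so longer branches
    allow the rate to drift further from the ancestral value.
    """
    if parent_rate <= 0 or branch_time < 0 or nu < 0:
        raise ValueError("rate must be positive; time and nu non-negative")
    sigma2 = nu * branch_time
    mu = math.log(parent_rate) - sigma2 / 2.0  # keeps E[rate] = parent_rate
    return rng.lognormvariate(mu, math.sqrt(sigma2))


# Simulate rates down a hypothetical chain of three branches.
rng = random.Random(42)
rate = 1.0
for t in (5.0, 10.0, 20.0):
    rate = draw_descendant_rate(rate, t, nu=0.01, rng=rng)
    print(f"after {t} MY branch: rate = {rate:.3f}")
```

In a full Bayesian analysis a distribution like this serves as the prior on rates, and MCMC samples rates and times jointly given the sequence data.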
Comparison of Likelihood & Bayesian
Approaches for Estimating Divergence
Times (Yang & Yoder 2003)
• Analyzed two mitochondrial genes
– each codon position treated separately
– tested different model assumptions
– used 7 calibration points
• Neither model reliable when
– using only one codon position
– using a single model for all positions
• Results similar for both methods
– when using the most complex model
– when using separate parameters for each codon position (could use
a codon model?)
Sources of Error/Variance
• Lack of rate constancy (due to lineage,
population size or selection effects)
• Wrong assumptions in evolutionary model
• Errors in orthology assignment
• Incorrect tree
• Stochastic variability
• Imprecision of calibration points
• Imprecision of regression
• Human sloppiness in analysis
– self-fulfilling prophecies
Reading the entrails of chickens
(Graur and Martin 2004)
• single calibration point
• error bars removed from calibration points
• standard error bars instead of 95% confidence
intervals
• secondary/tertiary calibration points treated as
reliable and precise
– based on incorrect initial estimates
– variance increases with distance from
original estimate
• few proteins used
Multiple Gene Loci
• “Trying to estimate time of divergence from
one protein is like trying to estimate the
average height of humans by measuring
one human”
--Molecular Systematics p539
Use multiple genes!
(and multiple calibration points)
Even so...
Be Very Wary
Of Molecular Times
• Point estimates are absurd
• Sample errors often based
only on the difference between
estimates in the same study
• Even estimates with confidence
intervals unlikely to really capture
all sources of variance
McLysaght, Hokamp, Wolfe 2002
Dating Human Gene Duplications
• [758] Trees generated (ML method using PAM matrix)
• [602] Alpha parameter for gamma distribution learned
– (Gu and Zhang 1997) faster than ML, more accurate than
parsimony
– Thrown out if variance > mean. Why would this happen?
– “May be problematic to apply this model for gene family
evolution because of the possible functional divergence among
paralogous genes”
• [481] NJ trees built from Gamma-corrected distances
– Family kept only if worm/fly group together
• [191] Two-cluster test of rate constancy (Takezaki et al
1995)
Blanc, Hokamp, Wolfe
Dating Arabidopsis Duplications
• Create nucleotide alignments
• Estimate “Level of” Synonymous
substitutions (Yang’s ML method)
– per site? per synonymous site?
• Ks values > 10 ignored (Yang; Anisimova)
• Why was a different method used than for
human?
• How reliable is ranking of Ks values? How
much variance expected?
Ks > 10 unreliable ?
• Yang (abstract) calculates effect of
evolutionary rate on accuracy of
phylogenetic reconstruction
• Anisimova calculates accuracy and power
of LRT in detecting adaptive molecular
evolution
• Neither seems to give any cutoff regarding
dS > 10.
Future Improvements
• Calculate accurate confidence
intervals taking into account
multiple sources of variance
• Novel models that account for variation in
rates between taxa
• Build explicit models that predict rates
based on an understanding of the
underlying processes that generate
differences in substitution rates
General References
Reviews/Critiques
1. Bromham and Penny. The modern molecular clock.
Nature Reviews Genetics, 2003.
2. Graur and Martin. Reading the entrails of chickens...the
illusion of precision. Trends in Genetics, 2004.
Textbooks:
1. Molecular Systematics. 2nd edition. Edited by Hillis,
Moritz, and Mable.
2. Inferring Phylogenies. Felsenstein.
3. Molecular Evolution, a phylogenetic approach. Page
and Holmes.
Rate Heterogeneity References
Dealing with Rate Heterogeneity
1. Yang and Yoder. Comparison of likelihood and bayesian methods for
estimating divergence times... Syst. Biol, 2003.
2. Kishino, Thorne, and Bruno. Performance of a divergence time
estimation method under a probabilistic model of rate evolution. Mol.
Biol. Evol, 2001.
3. Huelsenbeck, Larget, and Swofford. A compound poisson process
for relaxing the molecular clock. Genetics, 2000.
Testing for Rate heterogeneity
1. Takezaki, Rzhetsky and Nei. Phylogenetic test of the molecular clock
and linearized trees. Mol. Biol. Evol., 1995.
2. Bromham, Penny, Rambaut, and Hendy. The power of relative rates
tests depends on the data. J Mol Evol, 2000.
Dating Duplications References
Dating duplications:
• McLysaght, Hokamp, and Wolfe. Extensive genomic duplication
during early chordate evolution. Nature Genetics, 2002.
• Blanc, Hokamp, and Wolfe. Recent polyploidy superimposed on
older large-scale duplications in the Arabidopsis genome. Genome
Research, 2003.
References used for dating duplications in above papers
• Gu and Zhang. A simple method for estimating the parameter of
substitution rate variation among sites. Mol. Biol. Evol., 1997.
• Yang Z. On the best evolutionary rate for phylogenetic analysis.
Syst. Biol, 1998.
• Anisimova, Bielawski, Yang. Accuracy and power of the likelihood
ratio test in detecting adaptive molecular evolution. Mol. Biol. Evol.,
2001.