Data - Jarno Tuimala

more sequence or more individuals,
to combine or not?






14.4. Tue Introduction to models (Jarno)
16.4. Thu Distance-based methods (Jarno)
17.4. Fri ML analyses (Jarno)
20.4. Mon
21.4. Tue
(Jarno)
23.4. Thu



24.4. Fri
Assessing hypotheses (Jarno)
Problems with molecular data
Problems with molecular data (Jarno)
Phylogenomics
Search algorithms, visualization, and
other computational aspects (Jarno)

The trivial truth
◦ All extant species
◦ The whole genome

Impractical? Well, then
◦ As many species as possible
◦ As much data as possible

Finite constraints on resources (time, money)
◦ Know your group – which taxa are the most relevant
for your study?
◦ Know what gene sequences are available from
previous studies




The days of single gene datasets are over
Mitochondrial and chloroplast DNA have been
popular because they are easy to amplify and
sequence
It is worth increasing the number of nuclear
genes
One should aim for at least 3 genes,
preferably more (maybe 10?)


It is now possible to increase the number of
genes being sequenced significantly
Whole genome analyses will allow us to
understand:
◦ Intron-exon boundary dynamics
◦ Gene duplication-deletion dynamics
◦ Gene transfer dynamics

Soon we will have a good understanding of
the regions of the genome that are most
suitable for systematics

Sometimes not all genes amplify from all
samples
◦ Should these samples be discarded?


Increased taxon sampling, despite missing
data, increases resolution
All possible data should be used!



Can separate independent data sets be
combined for analysis?
How can we assess the possibility of conflict
between different data?
What does the potential conflict then mean?

For instance
◦ Different genes may have different phylogenetic
signal (different history?)

If both genes have equally strong signal

If one gene has a stronger signal than the
other

Never combine
Combine sometimes
Always combine



The different data sets may represent
different evolutionary histories (e.g. different
selection pressures)
Big data sets dominate small data sets
When analyzed separately, the different data
sets can be tests of each other's phylogenetic
hypotheses
Data set A + Data set B = Their consensus
My own experience:
[Figure: example with taxa A to H]


Would be fantastic to get genealogical
histories of individual genes
But!
◦ Single genes are generally short (1000-2000 bases)
◦ Lots of homoplasy
◦ Unreliable phylogenies



If the data sets are congruent, combine them
If the data sets are incongruent, don’t
combine them
One can use the ILD (incongruence length difference)
test to decide whether data sets are incongruent

If there is no conflict between data sets:
◦ The length of most parsimonious tree from the
combined data [L(x+y)] is equal to the sum of the
lengths of the MP trees from the separately
analyzed data [L(x) + L(y)]
Dxy = L(x+y) – (L(x) + L(y))
Dxy = 0
(Farris et al 1994)



Combining the data sets leads to increased
homoplasy
But is it statistically significant?
Can be tested with the Mann-Whitney U test,
where the null hypothesis is that the data sets
are combinable
[Figure: the original data sets x and y are combined (x + y) and then sampled randomly to get equally large data sets p and q]




Search for MP trees and calculate Dpq values
Repeat many times (e.g. 1000), which gives us
a distribution for the value of D
Compare whether Dxy differs from random
distribution at P < 0.05
However:
◦ ILD-test is sensitive to relative sizes of compared data
sets and to the evolutionary history of the different
data sets
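The randomization procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in any parsimony program: it handles only four taxa, so the most parsimonious tree can be found by checking all three unrooted topologies, and it reads the P value off the random distribution of D by counting replicates. All function names are hypothetical.

```python
import random

# The three unrooted topologies for four taxa (indices 0-3), each written
# as the two pairs of taxa on either side of the internal branch.
TOPOLOGIES = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def fitch_length(column, topology):
    """Fitch parsimony length of one character (a 4-letter string) on one topology."""
    changes = 0
    pair_sets = []
    for i, j in topology:
        if column[i] == column[j]:
            pair_sets.append({column[i]})
        else:
            pair_sets.append({column[i], column[j]})
            changes += 1
    # One more change is needed if the two sides share no state.
    if not pair_sets[0] & pair_sets[1]:
        changes += 1
    return changes


def mp_length(columns):
    """Length of the most parsimonious tree, by exhaustive search (four taxa only)."""
    return min(sum(fitch_length(c, t) for c in columns) for t in TOPOLOGIES)


def ild(x, y):
    """D_xy = L(x+y) - (L(x) + L(y))."""
    return mp_length(x + y) - (mp_length(x) + mp_length(y))


def ild_test(x, y, replicates=1000, seed=1):
    """Return (observed D_xy, P value from random partitions p and q)."""
    rng = random.Random(seed)
    observed = ild(x, y)
    pooled = x + y
    as_extreme = 0
    for _ in range(replicates):
        rng.shuffle(pooled)
        # p and q have the same sizes as the original data sets x and y.
        p, q = pooled[:len(x)], pooled[len(x):]
        if ild(p, q) >= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (replicates + 1)


# Toy data: x supports (taxon 1, taxon 2), y supports (taxon 1, taxon 3).
x = ["AACC"] * 5
y = ["ACAC"] * 5
d_xy, p_value = ild_test(x, y)
print(d_xy, p_value)  # D_xy is 5 for this toy data
```

With real data the exhaustive search would be replaced by a heuristic tree search, and the caveat about unequal data set sizes still applies.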




Combining all available data leads to more
resolved trees = the combined data has higher
explanatory power
“Hidden support” can only be detected through
combined analysis
Conflicts at different nodes can only be
discovered in a combined analysis framework
The effects of combined analysis can be
investigated using indices related to Bremer
support

Partitioned Bremer Support (PBS)
◦ Baker & DeSalle 1997: Syst Biol 46:654

Partition Congruence Index (PCI)
◦ Brower 2006: Cladistics 22:378

Hidden Bremer Support (HBS)
◦ Gatesy et al 1999: Cladistics 15:271


The different data partitions in a data set
contribute to the Bremer support in an
additive way
For each node:
◦ A negative Partitioned Bremer support value
indicates conflict
◦ A positive Partitioned Bremer support value
indicates congruence
[Figure: two example nodes, each with Bremer support 7, with partitioned values 3, 4 and -6, 13; partitions: Morphology, COI, EF1a, Wgl]
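The additivity of Partitioned Bremer support can be checked with a small helper. This is only a sketch: the function name and the partition labels are hypothetical, and the numbers reuse the pairs 3, 4 and -6, 13 that accompany a Bremer support of 7 on the slide.

```python
def summarize_node(pbs_by_partition):
    """Return (Bremer support, partitions in conflict) for one node.

    pbs_by_partition maps a partition name to its PBS value at that node.
    """
    bremer_support = sum(pbs_by_partition.values())
    conflicting = sorted(
        name for name, value in pbs_by_partition.items() if value < 0
    )
    return bremer_support, conflicting


# Hypothetical partition labels; the values are illustrative.
node_1 = {"partition 1": 3, "partition 2": 4}
node_2 = {"partition 1": -6, "partition 2": 13}

print(summarize_node(node_1))  # (7, [])
print(summarize_node(node_2))  # (7, ['partition 1'])
```

Both nodes have the same Bremer support, but only the second hides a conflict between partitions.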




Tells us about the magnitude of conflict
between data partitions in a combined
analysis
PCI is always equal to or less than BS for a
given branch
PCI = BS when there is no conflict
PCI is negative when there is low BS because
of strong conflicts between data partitions
Brower 2006: Cladistics 22:378-386



Underlying phylogenetic signal can be
confounded by homoplasy in separate
analyses
Combining datasets can bring out this signal,
as homoplasy is largely random noise
Can be measured using HBS and Partitioned
HBS

Hidden support can be defined as increased
support for the node of interest in the
simultaneous analysis of all data partitions
relative to the sum of support for that node in
the separate analyses of each partition

For a particular combined data set and a
particular node, HBS is the difference between
BS for that node in the combined analysis and
the sum of BS values for that node from each
data partition
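As a worked example of this definition, HBS is a simple subtraction. The function name and the numbers below are hypothetical and serve only to illustrate the arithmetic.

```python
def hidden_bremer_support(bs_combined, bs_separate):
    """HBS = BS in the combined analysis minus the sum of BS from each partition.

    bs_separate is a list with one BS value per data partition, each taken
    from the separate analysis of that partition.
    """
    return bs_combined - sum(bs_separate)


# Hypothetical node: BS of 10 in the combined analysis, 2 and 3 separately.
print(hidden_bremer_support(10, [2, 3]))  # 5, hidden support
# Hypothetical node where the combined analysis supports the node less.
print(hidden_bremer_support(4, [5, 3]))   # -4, hidden conflict
```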



With a small dataset, it is probably always
best to combine everything
With large datasets (10 or 20 gene regions?)
one can find sets of congruent genes and
combine them
But!
◦ Is there a biological reason for incongruence, or is
it just a property of the data?
Niklas Wahlberg





Saturation
Bias in nucleotide composition
Orthology vs paralogy
Lineage sorting
Lateral Gene Transfer
Saturation




Saturation is due to multiple changes at the same
site subsequent to lineage splitting
Models of evolution attempt to infer the missing
information through correcting for “multiple hits”
Most data will contain some fast-evolving sites
which are potentially saturated (e.g. in protein-coding
genes, often the third codon position)
In severe cases the data becomes essentially
random and all information about relationships
can be lost
Multiple changes at a single site - hidden changes
Ancest GGCGCG
Seq 1 AGCGAG
Seq 2 GCGGAC
[Figure: number of changes at individual sites along the lineages leading to Seq 1 and Seq 2]
[Figure: pairwise distance calculated from sequences plotted against time since divergence]
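The effect of correcting for multiple hits can be sketched with the Jukes-Cantor model, applied here to Seq 1 and Seq 2 from above. The choice of Jukes-Cantor is an assumption made for simplicity (any substitution model could be used), and the function names are hypothetical.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jukes_cantor(p):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3).

    At p >= 0.75 the sequences are saturated and the distance is undefined,
    so infinity is returned.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("AGCGAG", "GCGGAC")  # 4 of the 6 sites differ
print(round(p, 3), round(jukes_cantor(p), 3))
```

The corrected distance is larger than the observed proportion of differences, and the gap widens as the sequences approach saturation.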



Homoplasy is a problem with molecular data
Elevated rates of molecular evolution in
unrelated lineages
Sparse taxon sampling leading to long
branches
The classical long-branch
attraction example
Based on one gene 18S
Nardi et al. 2003:
Science 299: 1887-1889



Taxon sampling is important
For divergent taxa with few extant species,
can be a problem
More data from different sources
◦ Could be that molecular data are not able to resolve
the position of some taxa
◦ Morphological data!
Biased base composition
Do sequences manifest biased base compositions (e.g.
thermophilic convergence) or biased codon usage patterns
which may obscure phylogenetic signal?
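A first check for biased base composition is to compare the GC content of the sequences. The sketch below is a minimal example; the function name and the toy sequences are hypothetical.

```python
def gc_content(sequence):
    """Percentage of G + C among the unambiguous bases of a sequence.

    Gaps and ambiguity codes are ignored; U is accepted for RNA.
    """
    bases = [b for b in sequence.upper() if b in "ACGTU"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(b in "GC" for b in bases)
    return 100 * gc / len(bases)


# Toy sequences, not real 16S rRNA data.
sequences = {"taxon 1": "GGCCGCAT", "taxon 2": "ATTAGCAT"}
for name, seq in sequences.items():
    print(name, gc_content(seq))
```

Large differences in GC content between taxa are a warning that unrelated lineages may group together because of composition rather than history.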
% Guanine + Cytosine in 16S rRNA genes

                           %GC        %GC             %GC
                           all sites  variable sites  parsimony sites
Thermophiles:
  Thermotoga maritima      62         72              73
  Thermus thermophilus     64         72              70
  Aquifex pyrophilus       65         73              71
Mesophiles:
  Deinococcus radiodurans  55         52              48
  Bacillus subtilis        55         50              38
A case study in phylogenetic analysis:
Deinococcus and Thermus


Deinococcus are radiation resistant bacteria
Thermus are thermophilic bacteria
BUT:
◦ Both have the same very unusual cell wall based upon
ornithine
◦ Both have the same menaquinones (Mk 9)
◦ Both have the same unusual polar lipids

Congruence between these complex characters
supports a phylogenetic relationship between
Deinococcus and Thermus
An appropriate method can correct for GC bias
[Figure: three trees of Aquifex, Thermotoga, Thermus, Deinococcus and Bacillus: Jukes & Cantor tree, parsimony tree and LogDet tree]
Orthology and paralogy


Are the sequences being generated from
different species the same (homologous)?
Gene duplication
◦ duplicate gene degenerates
◦ duplicate gene acquires new function

A problem that is particularly acute currently as we
search for new genes
Orthology: gene trees and species trees
[Figure: gene phylogeny (a, b, c) alongside organism phylogeny (A, B, C), labelled ORTHOLOGY]
Darwin’s theory reinterpreted homology as common ancestry.
Ancestral sequence
ATCGGCCACTTTCGCGATCA
ATCGGCCACTTTCGCGATCG
ATCGGCCACTTTCGTGATCG
ATCGGCCACGTTCGTGATCG
ATCGGCCACGTTCGCGATCG
ATAGGCCACTTTCGCGATCA
ATAGGCCACTTTCGCGATTA
ATAGGGCAGTTTCGCGATTA
ATAGGGCAGTTTTGCGATTA
ATCGGCCACCTTCGCGATCG
ATAGGGCAGTTTCGCGATTA
ATAGGGCAGTCTCGCGATTA
ACCGGCCACCTTCGCGATCG
Homologous sequences
ACCGGCCACCTTCGCGATCG
ATAGGGCAGTCTCGCGATTA
Orthologs arise by speciation
[Figure: a speciation event splits the sequence in the ancestral organism (ATCGGCCACTTTCGCGATCA) into two orthologous sequences (ATAGGGCAGTCTCGCGATTA and ACCGGCCACCTTCGCGATCG) in modern species A and B]
Orthologs are “evolutionary counterparts” – Koonin (2001)
Paralogs arise by duplications
[Figure: a duplication event splits the sequence in the ancestral organism (ATCGGCCACTTTCGCGATCA) into two paralogous sequences (ATAGGGCAGTCTCGCGATTA and ACCGGCCACCTTCGCGATCG), modern duplicates A and B]
An evolutionary tale…
Duplication of A in human
Duplication of A in worm
Sonnhammer & Koonin (2002) TIGs 18: 619-620
Evolutionary Relationships
The yeast gene is orthologous to all worm and
human genes, which are all co-orthologous to the
yeast gene
All genes in the HA* set are co-orthologous to all
genes in the WA* set
The genes HA* are hence ‘inparalogs’ of each
other when comparing human to worm
By contrast, the genes HB and HA* are
‘outparalogs’ when comparing human with worm
HB and HA*, and WB and WA*, are inparalogs
when comparing with yeast, because the animal–yeast
split pre-dates the HA*–HB duplication
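The relationships above follow from one rule: two genes are orthologs if their last common ancestor in the gene tree is a speciation event, and paralogs if it is a duplication. The sketch below encodes the yeast, worm and human example as a small labelled tree. The tree encoding, the function names and the gene labels HA1, HA2, WA1, WA2 (standing in for the HA* and WA* sets) are assumptions made for illustration.

```python
# Gene tree as nested tuples: (event, left, right); leaves are gene names.
GENE_TREE = (
    "speciation",                      # yeast versus animals
    "Y",
    ("duplication",                    # A/B duplication in the animal lineage
     ("speciation",                    # human versus worm, gene A
      ("duplication", "HA1", "HA2"),   # duplication of A in human
      ("duplication", "WA1", "WA2")),  # duplication of A in worm
     ("speciation", "HB", "WB")),      # human versus worm, gene B
)


def leaves(node):
    """Set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def lca_event(node, gene1, gene2):
    """Event label at the last common ancestor of two different genes."""
    event, left, right = node
    for child in (left, right):
        if not isinstance(child, str) and {gene1, gene2} <= leaves(child):
            return lca_event(child, gene1, gene2)
    return event


def relation(tree, gene1, gene2):
    """Classify a pair of genes as orthologs or paralogs."""
    if lca_event(tree, gene1, gene2) == "speciation":
        return "orthologs"
    return "paralogs"


print(relation(GENE_TREE, "Y", "HA1"))    # orthologs
print(relation(GENE_TREE, "HA1", "WA2"))  # orthologs
print(relation(GENE_TREE, "HA1", "HA2"))  # paralogs
print(relation(GENE_TREE, "HB", "HA1"))   # paralogs
```

Telling inparalogs from outparalogs additionally requires asking whether the duplication came before or after the speciation event being compared, which this sketch does not do.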
Paralogy can produce misleading trees
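[Figure: after a gene duplication, organisms A, B and C each carry two gene copies (a1/a2, b1/b2, c1/c2). A gene phylogeny built from the starred copies a1*, b2* and c1* gives a misleading tree compared with the organism phylogeny: PARALOGY]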
Ancient gene duplications can be used to root the
tree of life
Ancestral elongation factor gene
Gene duplication prior to the split into the 3 domains of life
EF-Tu/1-alpha + EF-2/EF-G = paralogues of each other
Sequences from one paralogue can be
used to root a tree formed using
sequences from the other and vice versa
Lineage sorting



Gene trees may not be the same as species
trees
Extant populations may retain ancestral
polymorphisms
Species level phylogenies should never
sample single individuals of different species



Implicit assumption in many studies using
mtDNA
The mode of speciation can now be studied
using DNA sequences
Theoretical studies predict that DNA lineages
pass through several phases in a species
The assumption: monophyly
[Figure: gene lineages of species A and B traced through time from the ancestral gene pool]



Paraphyly can occur when one population in a
set of locally panmictic populations speciates
Polyphyly occurs when a highly polymorphic
population is subdivided
Can be highly informative of the history of
divergence
Paraphyly
[Figure: gene lineages of species A and B traced through time from the ancestral gene pool]
Polyphyly
[Figure: gene lineages of species A and B traced through time]
An empirical example: Phyciodes butterflies
Wahlberg et al. 2003. Syst Ent 28:257-273
[Figure: tree of individual Phyciodes specimens with support values on the branches; species sampled: vesta, picta, pallescens, orseis, pallida, mylitta, phaon, pulchella, batesii, cocyta and tharos]
Paraphyly of a
species can be due to
incomplete lineage
sorting and/or
secondary gene flow
G = generations, starting with ten unrelated females at G = 0
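Lineage sorting of this kind can be illustrated with a small simulation. The sketch assumes a population of constant size in which every female picks her mother at random from the previous generation. This model, the function name and the parameter values are assumptions for illustration and are not taken from the figure.

```python
import random


def simulate_lineage_sorting(n_females=10, generations=50, seed=1):
    """Return the number of founding maternal lineages left in each generation.

    Every female at G = 0 founds her own lineage. In each later generation
    every female inherits the lineage of a randomly chosen mother.
    """
    rng = random.Random(seed)
    population = list(range(n_females))  # lineage label of each female
    surviving = [len(set(population))]
    for _ in range(generations):
        population = [rng.choice(population) for _ in range(n_females)]
        surviving.append(len(set(population)))
    return surviving


history = simulate_lineage_sorting()
print(history)
```

Lineages are lost by chance and never regained, so the count can only fall. Two populations that split before the count reaches one can each keep a different subset of the ancestral lineages, which is how paraphyly and polyphyly arise.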
Lateral gene transfer

Widespread in single-celled organisms
◦ Even between distantly related lineages

In multicellular organisms more a problem in
closely related species
◦ hybridization

Is the Tree of Life really a Web of Life?



These “problems” are highly interesting
phenomena in themselves!
When the different factors are taken into
account, they can be informative about
evolutionary history
“When in doubt, get more data”
- Brooks and McLennan 2002