Phyloinformatics Workshop Edinburgh 2007
iPhy: tools for collation and analysis of phylogenomic data
Martin Jones and Mark Blaxter
[Figure: tree of eukaryote diversity, with a marked root and labelled groups including opisthokonts, excavates, discicristates, cercozoa, foraminiferans, oxymonads, diatoms, cryptophytes and slime molds]
1: Forests of trees, and loads of kindling
2: Organising principles
3: iPhy design
4: iPhy deployment
5: Nameless taxa & endless forms
1: Forests of trees, and loads of kindling
Phylogenetics is a growth area.
The raw materials (sequences) are being added at a startling rate.
Tree databases are also growing (both in number and size).
So how does a lab worker bee keep up?
[Figure: Metazoan Phyla: Sequences per phylum (log scale, 1 to 100,000,000), for 38 groups listed from Porifera to Chaetognatha; data as of 10/05/2006]
[Figure: Metazoan Phyla: Species per phylum (log scale, 1 to 10,000,000), for 38 groups listed from Porifera to Chaetognatha; data as of 10/05/2006]
[Figure: Metazoan Phyla: Sequences per species (log scale, 0.1 to 1000), for 38 groups listed from Porifera to Chaetognatha; data as of 10/05/2006]
[Figure, from Rod Page, “Towards a Taxonomically Intelligent Phylogenetic Database”: cumulative number (0 to 7000) of molecular phylogenies and of TreeBASE studies, by year, 1980 to 2000]
Two modes of data acquisition
(a) wet lab - compute lab synergy
explicitly source the sequences needed
preformed ideas of
the best taxa to sample
the best genes to sample
[this is the source of most phylogenetic data]
(b) magpie surfing / tree surgery
using phyloinformatic tools
to discover the set of available
genes AND taxa
to address a particular problem
2: Organising principles
On average …
• more data are better
more taxa
more genes
• multiple methods are better
2: Organising principles
• assess all relevant taxa
• assess all relevant sequence
While the NCBI taxonomy isn’t the best in the world,
at least every sequence is attached to a taxon, and TAX_IDs are unique.
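Because every sequence is attached to one unique TAX_ID, gathering all sequence for a clade can be treated as a walk up a parent-pointer table. The following is a minimal sketch of that idea, not the iPhy implementation: the TAX_IDs, names and accessions are invented for illustration and are not real NCBI identifiers.

```python
from collections import defaultdict

# Toy taxonomy: child TAX_ID -> parent TAX_ID (all identifiers are invented).
PARENT = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
NAMES = {1: "root", 2: "clade A", 3: "clade B",
         4: "species A1", 5: "species A2", 6: "species B1"}

# Each sequence record is attached to exactly one TAX_ID (hypothetical accessions).
SEQUENCES = [("seq001", 4), ("seq002", 4), ("seq003", 5), ("seq004", 6)]


def lineage(tax_id):
    """Return the TAX_IDs from tax_id up to the root, inclusive."""
    path = [tax_id]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def sequences_under(clade_id):
    """Group accessions by TAX_ID for every sequence whose lineage includes clade_id."""
    hits = defaultdict(list)
    for accession, tax_id in SEQUENCES:
        if clade_id in lineage(tax_id):
            hits[tax_id].append(accession)
    return dict(hits)
```

Calling `sequences_under(2)` collects the sequences of both species placed under the invented "clade A", which is the kind of query that a unique taxon key makes straightforward.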
The Edinburgh EST analysis pipeline
• trace2dbest: process raw sequence traces; trim off vector and low-quality sequence
• CLOBB: cluster into putative gene objects; predict consensus sequence
• prot4EST: predict translation reading frame; generate protein translation
• annot8r: annotate using BLAST, GOtcha, PSort, Pfam, SigPep, KEGG
• PartiGene: collate information in relational database
NEMBASE3 http://www.nematodes.org/
The web portal to NEMBASE3
Mark Blaxter, James Wasmuth,
Ann Hedley & Ralf Schmid
University of Edinburgh,
Institute of Evolutionary Biology,
Edinburgh UK EH9 3JT
mark.blaxter@ed.ac.uk
[Figure, NEMBASE3: Collectors’ curve of nematode protein families: number of families (0 to 50,000) against total number of proteins (0 to 150,000), with labels for Trichinella spiralis, Brugia malayi, Meloidogyne incognita, Strongyloides stercoralis, Ancylostoma caninum and Caenorhabditis elegans, and annotations A, B and C]
[Figure, NEMBASE3: Earliest origins of nematode protein families: family counts mapped onto the tree of Nematoda, covering Rhabditina (Clade V: Strongyloidea, Rhabditoidea, Diplogasteromorpha), Tylenchina (Clade IV: Panagrolaimomorpha, Cephalobomorpha, Tylenchomorpha), Spirurina (Clade III: Ascaridomorpha, Spiruromorpha) and Dorylaimia (Clade I: Dorylaimida, Trichinellida)]
2: Organising principles
• assess all relevant taxa
• assess all relevant sequence
• store aligned sequences locally
• output ‘slices’ of data in analysis-ready formats
[Figure: many taxa, missing data: a presence/absence matrix of genes (columns a to i) against taxa (rows 1 to 9)]
Generating a slice that
• maximises taxonomic coverage
• maximises present data/minimises missing data
[Figure: the resulting slice of the gene/taxon matrix, retaining taxa 1, 3, 7 and 9 and genes a, b, e, f, g and i]
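One simple way to approach the two goals above is a greedy trim of the presence/absence matrix. The sketch below is an assumption for illustration, not the method iPhy uses (the design slides mention maximal bicliques): it repeatedly drops the taxon or gene with the highest proportion of missing cells until the slice meets a missing-data threshold, and on ties it drops a gene so that taxa are kept. The toy matrix and the names `t1` to `t4` are invented.

```python
def count_missing(presence, taxa, genes):
    """Number of empty cells in the taxa x genes sub-matrix."""
    return sum(1 for t in taxa for g in genes if g not in presence[t])


def select_slice(presence, max_missing=0.0):
    """Greedily trim taxa/genes until the missing fraction is <= max_missing.

    presence maps each taxon to the set of genes sequenced for it.
    """
    taxa = sorted(presence)
    genes = sorted(set().union(*presence.values()))
    while taxa and genes:
        cells = len(taxa) * len(genes)
        if count_missing(presence, taxa, genes) <= max_missing * cells:
            break
        row_gaps = {t: sum(g not in presence[t] for g in genes) for t in taxa}
        col_gaps = {g: sum(g not in presence[t] for t in taxa) for g in genes}
        worst_taxon = max(taxa, key=lambda t: row_gaps[t])
        worst_gene = max(genes, key=lambda g: col_gaps[g])
        # Compare missing proportions by cross-multiplying (avoids float ties).
        if row_gaps[worst_taxon] * len(taxa) > col_gaps[worst_gene] * len(genes):
            taxa.remove(worst_taxon)
        else:
            genes.remove(worst_gene)  # ties favour keeping taxa
    return taxa, genes


# Invented toy matrix: taxon -> genes present.
toy = {
    "t1": {"a", "b", "c"},
    "t2": {"a", "b"},
    "t3": {"a", "b", "c", "d"},
    "t4": {"d"},
}
complete_slice = select_slice(toy)          # no missing data allowed
relaxed_slice = select_slice(toy, 0.2)      # up to 20% missing cells
```

A greedy trim is not guaranteed to find the largest possible slice; it is shown only because it is short and makes the trade-off between coverage and completeness explicit.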
2: Organising principles
• assess all relevant taxa
• assess all relevant sequence
• store aligned sequences locally
• output ‘slices’ of data in analysis-ready formats
• store trees locally
• store alternative taxonomic systems
[Figure: alternative metazoan trees, contrasting complete genome sequences with analyses including neglected taxa ESTs (Philippe et al.); taxa shown: Platyhelminthes, Annelida, Mollusca, Tardigrada, Nematoda, Arthropoda, Vertebrata, Urochordata, Cephalochordata, Echinodermata, Ctenophora, Cnidaria, Choanoflagellata and Fungi; clade labels L, P, C, E and D]
3: iPhy design
[Diagram: iPhy design, built up over several slides]
Inputs and their processing
• sequence, alignment: identify relevant sequences and store locally; associate sequences and taxa
• TreeFam, TreeBASE: identify relevant sequences and store locally; capture tree data; reconcile tree nodes with existing systems
• user tree, systematic: capture tree data; reconcile tree nodes with existing systems
iPhy database, with attached components
• Alignment Cycle: POA, tranAlign
• Orthologue Inference Engine: TreeFam, Ortho-MCL
• Dataset Exploration Tools: maximal bicliques, Slice Selecter
• Phylogenetics Cycle: PhyML, MrBayes, PAUP, ...; Tree Comparer
Outputs
• trees & alignments
• Publication Quality Analyses
4: iPhy deployment
version 0.1: ‘TaxMan’
TaxMan: a taxonomic database manager
Martin Jones* and Mark Blaxter
BMC Bioinformatics 2006, 7:536 (Software, Open Access; BioMed Central)
doi:10.1186/1471-2105-7-536
Address: Institute of Evolutionary Biology, King's Buildings, Ashworth Laboratories, West Mains Road, Edinburgh EH9 3JT, UK
Email: Martin Jones* - martin.jones@ed.ac.uk; Mark Blaxter - mark.blaxter@ed.ac.uk
* Corresponding author
Received: 11 October 2006; Accepted: 18 December 2006; Published: 18 December 2006
This article is available from: http://www.biomedcentral.com/1471-2105/7/536
© 2006 Jones and Blaxter; licensee BioMed Central Ltd.
4: iPhy deployment
version 0.1: ‘TaxMan’
TaxMan automates assembly of large
sequence datasets for chosen taxa
TaxMan automates generation of aligned
sequence sets for chosen genes
4: iPhy deployment
version 0.1: ‘TaxMan’
TaxMan simplifies selection of taxa for
analysis
e.g. given a gene set, choosing one species per family
(choosing the species with the least missing data)
e.g. given a taxon set, choosing the genes
(choosing genes with less than a given % missing data)
e.g. generating custom defined alignments
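The first two selection rules above can be sketched in a few lines. This is an illustrative sketch only, assuming a simple in-memory list of records; the function names, family and species labels are hypothetical and do not reflect TaxMan's actual interface or database layout.

```python
def one_species_per_family(records, gene_set):
    """Pick, for each family, the species missing the fewest genes in gene_set.

    records is a list of (family, species, genes_present) tuples.
    Ties keep the species that appears first in records.
    """
    best = {}
    for family, species, genes in records:
        missing = len(gene_set - genes)
        if family not in best or missing < best[family][0]:
            best[family] = (missing, species)
    return {family: species for family, (_, species) in best.items()}


def genes_below_missing(records, taxon_set, gene_set, max_missing_pct):
    """Return genes absent from less than max_missing_pct percent of the chosen taxa."""
    chosen = [genes for _, species, genes in records if species in taxon_set]
    keep = []
    for gene in sorted(gene_set):
        missing = sum(gene not in genes for genes in chosen)
        if chosen and 100 * missing / len(chosen) < max_missing_pct:
            keep.append(gene)
    return keep


# Invented example data: (family, species, genes present).
records = [
    ("FamA", "sp1", {"18S", "28S"}),
    ("FamA", "sp2", {"18S", "28S", "H3"}),
    ("FamB", "sp3", {"18S"}),
    ("FamB", "sp4", {"H3"}),
]
gene_set = {"18S", "28S", "H3"}
representatives = one_species_per_family(records, gene_set)
strict_genes = genes_below_missing(records, {"sp2", "sp3"}, gene_set, 50)
```

Here `representatives` holds one species per family, and `strict_genes` holds only the genes that pass the 50% missing-data filter for the two chosen species.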
4: iPhy deployment
version 0.1: ‘TaxMan’
TaxMan simplifies analysis by exporting
formatted alignments (NEXUS)
of nucleotides
(with codon positions and genes as defined partitions)
of amino acids
(with genes as defined partitions)
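A NEXUS export with genes and codon positions as partitions might look like the output of the sketch below. It is a hedged illustration rather than TaxMan's exporter: the function name is hypothetical, the taxon and gene names are invented, and it assumes every gene starts at codon position 1 and that character sets are declared in a `sets` block (individual programs may expect partitions to be declared differently).

```python
def nexus_with_partitions(alignment, gene_lengths):
    """Build a NEXUS string for a concatenated nucleotide alignment.

    alignment maps taxon name -> concatenated aligned sequence.
    gene_lengths is an ordered list of (gene_name, length_in_bases).
    Assumes each gene begins in frame, at codon position 1.
    """
    nchar = sum(length for _, length in gene_lengths)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(alignment)} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
    ]
    for taxon, seq in alignment.items():
        if len(seq) != nchar:
            raise ValueError(f"{taxon}: expected {nchar} characters, got {len(seq)}")
        lines.append(f"    {taxon} {seq}")
    lines += ["  ;", "end;", "begin sets;"]
    start = 1
    for gene, length in gene_lengths:
        end = start + length - 1
        lines.append(f"  charset {gene} = {start}-{end};")
        # One character set per codon position: every third site from an offset.
        for pos in range(3):
            lines.append(f"  charset {gene}_pos{pos + 1} = {start + pos}-{end}\\3;")
        start = end + 1
    lines.append("end;")
    return "\n".join(lines)


# Invented two-gene, two-taxon example.
nexus_text = nexus_with_partitions(
    {"taxon_A": "ATGGCCAAATTT", "taxon_B": "ATGGCAAAGTT?"},
    [("geneX", 6), ("geneY", 6)],
)
```

For an amino acid export, the same approach would apply with the per-gene character sets only, since codon positions have no meaning there.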
4: iPhy deployment
version 0.1: ‘TaxMan’
TaxMan simplifies post-phylogenetic analysis by
saving trees
(with links to the original data)
saving analytical metadata
(algorithm, parameters, settings)
saving tree statistics
(bootstraps, branch lengths)
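Storing a tree alongside its analytical metadata could be sketched with two linked tables, as below. The schema, table names and helper functions are hypothetical and are not TaxMan's actual schema; the parameter string is free text, not a real command-line flag of any program. Bootstrap values and branch lengths are assumed to travel inside the Newick string itself.

```python
import sqlite3

# Hypothetical two-table layout: one row per analysis, one row per saved tree.
SCHEMA = """
CREATE TABLE analysis (
    analysis_id INTEGER PRIMARY KEY,
    algorithm   TEXT NOT NULL,
    parameters  TEXT NOT NULL,
    alignment   TEXT NOT NULL
);
CREATE TABLE tree (
    tree_id     INTEGER PRIMARY KEY,
    analysis_id INTEGER NOT NULL REFERENCES analysis(analysis_id),
    newick      TEXT NOT NULL
);
"""


def save_tree(conn, algorithm, parameters, alignment, newick):
    """Record an analysis and its tree; return the new tree_id."""
    cur = conn.execute(
        "INSERT INTO analysis (algorithm, parameters, alignment) VALUES (?, ?, ?)",
        (algorithm, parameters, alignment),
    )
    cur = conn.execute(
        "INSERT INTO tree (analysis_id, newick) VALUES (?, ?)",
        (cur.lastrowid, newick),
    )
    return cur.lastrowid


def trees_for_algorithm(conn, algorithm):
    """Return (newick, parameters) pairs for every tree built with algorithm."""
    rows = conn.execute(
        "SELECT tree.newick, analysis.parameters "
        "FROM tree JOIN analysis USING (analysis_id) "
        "WHERE analysis.algorithm = ? ORDER BY tree.tree_id",
        (algorithm,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Internal node label 95 stands for a bootstrap value; numbers after ':' are branch lengths.
first_id = save_tree(conn, "PhyML", "model=GTR", "slice_001",
                     "((A:0.1,B:0.2)95:0.05,C:0.3);")
```

Keeping the alignment identifier on the analysis row is what provides the link from a saved tree back to the original data.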
Lophotrochozoa
● 70,000 annotated sequences
● 630,000 EST sequences
● 21 genes (mt + 18S, 28S, actin, H3, WG, EF1A)
● 53,000 sequences extracted
● 17,000 aligned consensus sequences
● 8,700 species represented
● One day for data collection, one for alignment
Molecular Phylogenetics and Evolution 43 (2007) 583–595
www.elsevier.com/locate/ympev
The effect of model choice on phylogenetic inference using mitochondrial sequence data: Lessons from the scorpions
Martin Jones a,*, Benjamin Gantenbein b, Victor Fet c, Mark Blaxter a
a Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK
b AO Research Institute, Clavadelerstrasse 8, Davos Platz CH-7270, Switzerland
c Department of Biological Sciences, Marshall University, Huntington, WV 25755-2510, USA
Received 25 April 2006; revised 14 November 2006; accepted 14 November 2006
Available online 29 November 2006
5: Nameless taxa & endless forms
"... endless forms
most beautiful
and most wonderful
have been,
and are being,
evolved"
(Darwin 1859)
http://www.nematodes.org/NeglectedGenomes/ARTHROPODA/Chelicerata.html
[Figure: Metazoan species per phylum (log scale, 1 to 100,000,000), for 36 groups listed from Choanoflagellida to Chordata]
[Figure: organism-size curve for eukaryotes: number of individuals (log scale: few, lots, squillions) against size of organism (log scale: minuscule, tiny, just visible, small, big), with regions marked POSSIBLE PREDATORS and FOOD ITEMS]
Sourhope farm
NERC "Soil Biodiversity and Ecosystem Function" Programme Study Site
120 m x 75 m of raw Scottish upland grass
13 000 000 000 nematodes
MAN IS BVT A WORM
[Figure: Marine Nematode Barcodes: tree of individual specimens (codes 1000ED to 1179ED) sampled from Loch Fyne (Fyne1, Fyne2), Gullane and Orkney; scale bar 5 changes]
5: Nameless taxa & endless forms
MOTU: Molecular Operational Taxonomic Units
motu
1. to cut; to snap off
motu-á te hau, the fishing line snapped off
2. to engrave, to inscribe
letters or pictures in stone or in wood, like the motu mo rogorogo, inscriptions for recitation in lines called kohau.
3. islet
some names of islets: Motu Motiro Hiva, Motu Nui, Motu Iti, Motu Kaokao,
Motu Tapu, Motu Marotiri, Motu Kau, Motu Tavake, Motu Tautara, Motu Ko
Hepa Ko Maihori, Motu Hava.
5: Nameless taxa & endless forms
MOTU
specimen-based surveys
CBoL Barcode of Life (CO1)
anonymous, specimen-free surveys
environmental sampling
bulk community DNA
millions of sequences
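In either kind of survey, sequences have to be grouped into MOTU without reference to names. One way to do that, shown here purely as an assumption-laden sketch, is single-linkage clustering of aligned sequences with a cutoff on the number of differing bases; the cutoff value, the clustering rule and the specimen labels are all illustrative choices, not the procedure used for the barcode data in this talk.

```python
def differences(a, b):
    """Count mismatched positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_motu(sequences, cutoff):
    """Single-linkage clustering: specimens joined by a chain of pairs that
    each differ by <= cutoff bases end up in the same MOTU.

    sequences maps specimen label -> aligned sequence.
    Returns a sorted list of sorted specimen lists.
    """
    names = sorted(sequences)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if differences(sequences[a], sequences[b]) <= cutoff:
                parent[find(b)] = find(a)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Invented specimens and sequences.
specimens = {
    "s1": "AAAAAAAA",
    "s2": "AAAAAAAT",
    "s3": "AAAAAATT",
    "s4": "CCCCGGGG",
}
motu_loose = assign_motu(specimens, cutoff=1)
motu_strict = assign_motu(specimens, cutoff=0)
```

Note that `s1` and `s3` differ by two bases yet share a MOTU at cutoff 1, because `s2` links them; that chaining is a property of single linkage and one reason the choice of rule matters. The all-pairs comparison is also quadratic, so it would not scale unchanged to millions of sequences.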
5: Nameless taxa & endless forms
~1.2 million described species
~10-100 million species in reality
Thus, most ‘species’
will never be formally named.
5: Nameless taxa & endless forms
How do we incorporate these myriad
‘nameless taxa’ into our systems?
Martin Jones: TaxMan, iPhy & chelicerate evolution
Robin Floyd & Jenna Mann: MOTU and barcoding
Ralf Schmid, James Wasmuth & Ann Hedley: PartiGene & EST analysis