Phyloinformatics Workshop Edinburgh 2007

iPhy: tools for collation and analysis of phylogenomic data
Martin Jones and Mark Blaxter

[Figure: tree of eukaryote diversity. Legible group labels include cercozoa, excavates, opisthokonts, discicristates, oomycetes, diatoms, cryptophytes, haptophytes, oxymonads and foraminiferans.]

1: Forests of trees, and loads of kindling
2: Organising principles
3: iPhy design
4: iPhy deployment
5: Nameless taxa & endless forms

1: Forests of trees, and loads of kindling

Phylogenetics is a growth area.
The raw materials (sequences) are being added at a startling rate.
Tree databases are also growing (both in number and size).
So how does a lab worker bee keep up?

Metazoan phyla: sequences per phylum (10/05/2006)
[Figure: bar chart, log scale from 1 to 100,000,000, one bar per phylum: Porifera, Placozoa, Buddenbrockia, Myxozoa, Mesozoa, Ctenophora, Cnidaria, Micrognathozoa, Cycliophora, Acoelomorpha, Gnathostomulida, Seisonidea, Rotifera, Gastrotricha, Sipuncula, Nemertea, Mollusca, Entoprocta, Bryozoa, Brachiopoda, Pogonophora, Echiura, Annelida, Platyhelminthes, Nematomorpha, Nematoda, Kinorhyncha, Acanthocephala, Priapulida, Tardigrada, Onychophora, Arthropoda, Xenoturbellida, Enteropneusta, Hemichordata, Echinodermata, Chordata, Chaetognatha.]

Metazoan phyla: species per phylum (10/05/2006)
[Figure: bar chart, log scale from 1 to 10,000,000, same phyla.]

Metazoan phyla: sequences per species (10/05/2006)
[Figure: bar chart, log scale from 0.1 to 1,000, same phyla.]

[Figure, from Rod Page, “Towards a Taxonomically Intelligent Phylogenetic Database”: cumulative number of molecular phylogenies and of TreeBASE studies by year, 1980 to 2000; y-axis 0 to 7000.]

Two modes of data acquisition

(a) wet lab - compute lab synergy
    explicitly source the sequences needed
    preformed ideas of
      the best taxa to sample
      the best genes to sample
    [this is the source of most phylogenetic data]

(b) magpie surfing / tree surgery
    using phyloinformatic tools to discover the set of available genes AND taxa to address a particular problem

2: Organising principles

On average …
• more data are better
    more taxa
    more genes
• multiple methods are better

• assess all relevant taxa
• assess all relevant sequence

While the NCBI taxonomy isn’t the best in the world, at least every sequence is attached to a taxon, and TAX_IDs are unique.

The Edinburgh EST analysis pipeline

(trace2dbest)  Process raw sequence traces
               Trim off vector & low quality
(CLOBB)        Cluster into putative gene objects
               Predict consensus sequence
(prot4EST)     Predict translation reading frame
               Generate protein translation
(annot8r)      Annotate using BLAST, GOtcha, PSort, Pfam, SigPep, KEGG
(PartiGene)    Collate information in relational database

NEMBASE3  http://www.nematodes.org/
The web portal to NEMBASE3
Mark Blaxter, James Wasmuth, Ann Hedley & Ralf Schmid
University of Edinburgh, Institute of Evolutionary Biology, Edinburgh UK EH9 3JT
mark.blaxter@ed.ac.uk

NEMBASE3: Collectors’ curve of nematode protein families
[Figure: number of families (y-axis, 0 to 50000) against total number of proteins (x-axis); species labels Trichinella spiralis, Brugia malayi, Meloidogyne incognita, Strongyloides stercoralis, Ancylostoma caninum and Caenorhabditis elegans; points marked A, B, C.]
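
A collectors' curve of the kind shown for NEMBASE3 tracks how many distinct protein families have been seen as more proteins are sampled. The following is only a minimal sketch of that bookkeeping, assuming each protein already carries a family label; the labels and the function name `collectors_curve` are illustrative and are not taken from NEMBASE3.

```python
from typing import Hashable, Iterable, List, Tuple


def collectors_curve(families: Iterable[Hashable]) -> List[Tuple[int, int]]:
    """Return (proteins sampled, distinct families seen) after each protein.

    `families` gives one family label per protein, in sampling order.
    """
    seen = set()
    curve = []
    for n_proteins, family in enumerate(families, start=1):
        seen.add(family)  # a repeat of a known family adds nothing new
        curve.append((n_proteins, len(seen)))
    return curve


# Toy input: five proteins falling into three families (labels are made up).
sample = ["fam1", "fam2", "fam1", "fam3", "fam2"]
curve = collectors_curve(sample)
print(curve)
```

A curve that keeps rising as proteins are added suggests that new families are still being discovered; a curve that flattens suggests the sampled diversity is close to saturated.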

NEMBASE3: Earliest origins of nematode protein families
[Figure: phylogeny of NEMATODA annotated with protein family counts. Groups shown: Rhabditina (Clade V: Strongyloidea, Rhabditoidea, Diplogasteromorpha), Tylenchina (Clade IV: Panagrolaimomorpha, Cephalobomorpha, Tylenchomorpha), Spirurina (Clade III: Ascaridomorpha, Spiruromorpha), Dorylaimia (Clade I: Dorylaimida, Trichinellida). The pairing of counts with nodes cannot be recovered from the extracted text.]

2: Organising principles

• assess all relevant taxa
• assess all relevant sequence
• store aligned sequences locally
• output ‘slices’ of data in analysis-ready formats

[Figure: gene × taxon matrix with many taxa and missing data; genes 1 to 9 across, taxa a to i down.]

Generating a slice that
• maximises taxonomic coverage
• maximises present data / minimises missing data

[Figure: the resulting slice, keeping genes 1, 3, 7, 9 and taxa a, b, e, f, g, i.]

2: Organising principles

• assess all relevant taxa
• assess all relevant sequence
• store aligned sequences locally
• output ‘slices’ of data in analysis-ready formats
• store trees locally
• store alternative taxonomic systems

Complete genome sequences
[Figure: metazoan phylogeny (Philippe et al.); tip labels begin Platyhelminthes, Annelida; clade label L.]
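
The slice-generation goal stated under the organising principles (maximise taxonomic coverage while minimising missing data) can be illustrated with a simple greedy heuristic. This is a sketch under stated assumptions, not the iPhy algorithm: the input is a hypothetical presence map from taxon to the genes it has data for, `select_slice` and `max_missing` are invented names, and the tie-breaking rule (drop a gene rather than a taxon) is one arbitrary way to favour taxonomic coverage.

```python
from typing import Dict, List, Set, Tuple


def select_slice(presence: Dict[str, Set[str]],
                 max_missing: float = 0.1) -> Tuple[List[str], List[str]]:
    """Greedily trim a taxon x gene matrix until few cells are missing.

    presence maps each taxon to the set of genes it has sequence for.
    Returns (taxa kept, genes kept), both sorted.
    """
    taxa = set(presence)
    genes = set().union(*presence.values())

    def filled(taxon: str) -> int:
        # Number of currently kept genes present for this taxon.
        return len(presence[taxon] & genes)

    def sampled(gene: str) -> int:
        # Number of currently kept taxa that have this gene.
        return sum(gene in presence[t] for t in taxa)

    while taxa and genes:
        total = len(taxa) * len(genes)
        missing = total - sum(filled(t) for t in taxa)
        if missing <= max_missing * total:
            break
        worst_taxon = min(sorted(taxa), key=filled)
        worst_gene = min(sorted(genes), key=sampled)
        # Compare completeness fractions by cross-multiplying (no floats).
        # On a tie the gene is dropped, which favours taxonomic coverage.
        if sampled(worst_gene) * len(genes) <= filled(worst_taxon) * len(taxa):
            genes.discard(worst_gene)
        else:
            taxa.discard(worst_taxon)
    return sorted(taxa), sorted(genes)


# Toy matrix (made-up data): taxon d is sparse, gene g2 is patchy.
toy = {
    "a": {"g1", "g2", "g3"},
    "b": {"g1", "g3"},
    "c": {"g1", "g2", "g3"},
    "d": {"g2"},
}
kept_taxa, kept_genes = select_slice(toy, max_missing=0.0)
print(kept_taxa, kept_genes)
```

Setting `max_missing` to 0.0 asks for a block with no missing cells; a looser threshold keeps more taxa and genes at the cost of gaps in the exported alignment.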
[Figure, continued: further tip labels Mollusca, Tardigrada, Nematoda, Arthropoda, Vertebrata, Urochordata, Cephalochordata, Echinodermata, Ctenophora, Cnidaria, Choanoflagellata, Fungi; clade labels P, C, E, D; a second build is captioned "Including neglected taxa" (ESTs).]

3: iPhy design

[Diagram, built up over several slides. Each box is drawn with example data: AGGCT / ACGGT / CCGGA, PheTyr.]

Input types shown: sequence, alignment, TreeFam, TreeBASE, user tree, systematic

Processing blocks, in slide order:

Processing to
* identify relevant sequences and store locally
* associate sequences and taxa

Processing to
* identify relevant sequences and store locally
* capture tree data
* reconcile tree nodes with existing systems

Processing to
* capture tree data
* reconcile tree nodes with existing systems

Components added in later builds:
• Alignment Cycle (POA, tranAlign)
• iPhy database
• Orthologue Inference Engine (TreeFam, Ortho-MCL)
• Dataset Exploration Tools
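
The processing blocks in the design include capturing tree data and reconciling tree nodes with existing systems. As a rough illustration for tip nodes only, the sketch below assumes Newick input with unquoted Genus_species tip labels and uses a small hand-made name-to-TAX_ID table as a stand-in for the NCBI taxonomy. The function names and the example tree are hypothetical; this is not how iPhy itself does the reconciliation.

```python
import re
from typing import Dict, List, Tuple

# Toy stand-in for the NCBI taxonomy: scientific name -> TAX_ID.
NAME_TO_TAXID: Dict[str, int] = {
    "Caenorhabditis elegans": 6239,
    "Brugia malayi": 6279,
    "Drosophila melanogaster": 7227,
}


def tip_labels(newick: str) -> List[str]:
    """Extract tip labels: names that directly follow '(' or ','.

    Simplification: quoted labels and Newick comments are not handled.
    """
    found = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [label.strip() for label in found if label.strip()]


def reconcile_tips(newick: str,
                   name_to_taxid: Dict[str, int]
                   ) -> Tuple[Dict[str, int], List[str]]:
    """Split tip labels into those with a TAX_ID and those without."""
    matched: Dict[str, int] = {}
    unmatched: List[str] = []
    for label in tip_labels(newick):
        name = label.replace("_", " ")  # Newick convention for spaces
        if name in name_to_taxid:
            matched[label] = name_to_taxid[name]
        else:
            unmatched.append(label)
    return matched, unmatched


# Made-up example tree; the last tip has no formal name.
tree = ("((Caenorhabditis_elegans:0.1,Brugia_malayi:0.2):0.05,"
        "Drosophila_melanogaster:0.3,Unnamed_isolate_7:0.4);")
matched, unmatched = reconcile_tips(tree, NAME_TO_TAXID)
print(matched)
print(unmatched)
```

Tips that land in `unmatched` are the ones that would need manual curation or an alternative taxonomic system.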
[Diagram, final builds. Components shown: Alignment Cycle (POA, tranAlign); iPhy database; Orthologue Inference Engine (TreeFam, Ortho-MCL); Dataset Exploration Tools (maximal bicliques); Slice Selecter; Phylogenetics Cycle (PhyML, MrBayes, PAUP, ...); Tree Comparer; output of trees & alignments for Publication Quality Analyses.]

4: iPhy deployment
version 0.1: ‘TaxMan’

BMC Bioinformatics (BioMed Central), Software, Open Access
TaxMan: a taxonomic database manager
Martin Jones* and Mark Blaxter
Address: Institute of Evolutionary Biology, King's Buildings, Ashworth Laboratories, West Mains Road, Edinburgh EH9 3JT, UK
Email: Martin Jones* - martin.jones@ed.ac.uk; Mark Blaxter - mark.blaxter@ed.ac.uk
* Corresponding author
Received: 11 October 2006; Accepted: 18 December 2006; Published: 18 December 2006
BMC Bioinformatics 2006, 7:536 doi:10.1186/1471-2105-7-536
This article is available from: http://www.biomedcentral.com/1471-2105/7/536
© 2006 Jones and Blaxter; licensee BioMed Central Ltd.

TaxMan automates assembly of large sequence datasets for chosen taxa
TaxMan automates generation of aligned sequence sets for chosen genes

TaxMan simplifies selection of taxa for analysis
e.g. given a gene set, choosing one species per family (choosing the species with the least missing data)
e.g. given a taxon set, choosing the genes (choosing genes with less than a given % missing data)
e.g.
generating custom-defined alignments

4: iPhy deployment
version 0.1: ‘TaxMan’

TaxMan simplifies analysis by exporting formatted alignments (NEXUS)
  of nucleotides (with codon positions and genes as defined partitions)
  of amino acids (with genes as defined partitions)

TaxMan simplifies post-phylogenetic analysis by
  saving trees (with links to the original data)
  saving analytical metadata (algorithm, parameters, settings)
  saving tree statistics (bootstraps, branch lengths)

Lophotrochozoa
● 70,000 annotated sequences
● 630,000 EST sequences
● 21 genes (mt + 18S, 28S, actin, H3, WG, EF1A)
● 53,000 sequences extracted
● 17,000 aligned consensus sequences
● 8,700 species represented
● One day for data collection, one for alignment

Molecular Phylogenetics and Evolution 43 (2007) 583–595
www.elsevier.com/locate/ympev
The effect of model choice on phylogenetic inference using mitochondrial sequence data: Lessons from the scorpions
Martin Jones a,*, Benjamin Gantenbein b, Victor Fet c, Mark Blaxter a
a Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK
b AO Research Institute, Clavadelerstrasse 8, Davos Platz CH-7270, Switzerland
c Department of Biological Sciences, Marshall University, Huntington, WV 25755-2510, USA
Received 25 April 2006; revised 14 November 2006; accepted 14 November 2006
Available online 29 November 2006

5: Nameless taxa & endless forms

"...
endless forms most beautiful and most wonderful have been, and are being, evolved" (Darwin 1859)

http://www.nematodes.org/NeglectedGenomes/ARTHROPODA/Chelicerata.html

Metazoan species per phylum
[Figure: bar chart, log scale from 1 to 100,000,000, one bar per phylum: Choanoflagellida, Porifera, Placozoa, Cnidaria, Ctenophora, Acoela, Mesozoa, Myxozoa, Nematoda, Nematomorpha, Loricifera, Kinorhyncha, Priapulida, Onychophora, Arthropoda, Tardigrada, Gastrotricha, Nemertea, Myzostomida, Gnathostomulida, Cycliophora, Platyhelminthes, Acanthocephala, Rotifera, Chaetognatha, Sipunculida, Bryozoa, Brachiopoda, Entoprocta, Annelida, Pogonophora, Echiura, Mollusca, Hemichordata, Echinodermata, Chordata.]

organism-size curve
[Figure: number of individuals (log scale: few, lots, squillions) against size of organism (log scale: minuscule, tiny, just visible, small, big); regions labelled Eukaryotes, POSSIBLE PREDATORS, FOOD ITEMS.]

Sourhope farm
NERC "Soil Biodiversity and Ecosystem Function" Programme Study Site
120 m x 75 m of raw Scottish upland grass
13 000 000 000 nematodes

MAN IS BVT A WORM

Marine Nematode Barcodes
[Figure: tree of marine nematode barcode sequences; scale bar 5 changes; tips are specimen codes of the form 1034ED, each labelled with its sampling site: Loch Fyne (Fyne1, Fyne2), Gullane or Orkney.]

5: Nameless taxa & endless forms

MOTU
Molecular Operational Taxonomic Units

motu
1. to cut; to snap off
   motu-á te hau, the fishing line snapped off
2. to engrave, to inscribe letters or pictures in stone or in wood, like the motu mo rogorogo, inscriptions for recitation in lines called kohau.
3. islet
   some names of islets: Motu Motiro Hiva, Motu Nui, Motu Iti, Motu Kaokao, Motu Tapu, Motu Marotiri, Motu Kau, Motu Tavake, Motu Tautara, Motu Ko Hepa Ko Maihori, Motu Hava.

MOTU
• specimen-based surveys
    CBoL Barcode of Life (CO1)
• anonymous, specimen-free surveys
    environmental sampling
    bulk community DNA
    millions of sequences

~1.2 million described species
~10-100 million species in reality
Thus, most ‘species’ will never be formally named.
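
The slides do not say how sequences are grouped into MOTU. One simple convention is to place two sequences in the same unit when they differ at no more than a fixed number of aligned positions, and to chain such links together (single linkage). The sketch below assumes pre-aligned, equal-length sequences and an arbitrary cutoff; the specimen names, sequences and the function name `cluster_motu` are made up for illustration.

```python
from itertools import combinations
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Count positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def cluster_motu(seqs: Dict[str, str], cutoff: int = 2) -> List[List[str]]:
    """Single-linkage grouping of sequences that differ by <= cutoff bases."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if hamming(seqs[a], seqs[b]) <= cutoff:
            parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[str, List[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Made-up barcode fragments: s1 to s3 form a chain of 1-base steps, s4 is distant.
barcodes = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",
    "s3": "ACGAACGTAT",
    "s4": "TTTTACGGGG",
}
motus = cluster_motu(barcodes, cutoff=1)
print(motus)
```

Because linkage is transitive here, s1 and s3 end up in the same unit even though they differ by two bases; a stricter rule (for example complete linkage) would split them, so the choice of rule and cutoff is itself part of the MOTU definition.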

How do we incorporate these myriad ‘nameless taxa’ into our systems?

Martin Jones: TaxMan, iPhy & chelicerate evolution
Robin Floyd & Jenna Mann: MOTU and barcoding
Ralf Schmid, James Wasmuth & Ann Hedley: PartiGene & EST analysis