Knowledge Sharing and Collaborative Problem Solving in Biodiversity Informatics
Andrew C. Jones
Cardiff University, UK

The Species 2000 vision
• To enumerate all known species of plants, animals, fungi and microbes on Earth as the baseline dataset for studies of global biodiversity
• To provide a simple access point enabling users to link from Species 2000 to other data systems for all groups of organisms, using direct species-links
• To enable users worldwide to verify the scientific name, status and classification of any known species through species checklist data drawn from an array of participating databases
• (More recently) to provide a “synonymy server” for use as a service by other applications needing to obtain suitable scientific names, e.g. for querying biological data sets

Need for a catalogue
• Suppose we wished to retrieve all locations where specimens of Caragana arborescens have been collected, from various specimen distribution databases.
• A taxonomic checklist might include:
    Caragana arborescens Lam. [accepted name]
    Caragana sibirica Medikus [synonym]
• Classification of organisms is based on opinion regarding
  – what the groups are
  – identification of individuals
• So we need to use both these names as search terms
• In practice the problem might be far worse

SPICE for Species 2000: Meeting the computing challenges
• The SPICE for Species 2000 project aimed to:
  – build a federated ‘registry’ of scientific names organised by taxon (species, etc.)
  – accommodate GSD (Global Species Database) heterogeneity
  – accommodate GSD autonomy & instability
  – ensure scalability
• Funding:
  – SPICE was funded by the UK BBSRC/EPSRC Bioinformatics panel
  – EuroCat: a new EU-funded project to augment the SPICE catalogue of life & develop/maintain the SPICE software

SPICE Project Staff
Cardiff – Prof. Alex Gray, Dr. Andrew Jones, Prof. Nick Fiddian, Dr. Xuebiao Xu, (Mr. Nick Pittas).
Object and Knowledge-based Systems Group, Department of Computer Science, Cardiff University, PO Box 916, Cardiff CF24 3XF
Email: {W.A.Gray|Andrew.C.Jones|N.Fiddian|X.Xu|N.Pittas}@cs.cf.ac.uk
Telephone: +44 (0)29 2087 4812

Reading – Prof. Frank Bisby, Prof. Sir Ghillean Prance and Dr. Sue Brandt.
Centre for Plant Diversity & Systematics, The University of Reading, Reading RG6 6AS
Email: {F.A.Bisby|S.M.Brandt}@reading.ac.uk
Telephone: +44 (0)118 378 6437

Southampton – Dr. Richard White and Mr. John Robinson.
Biodiversity & Ecology Research Division, School of Biological Sciences, University of Southampton, Southampton SO16 7PX
Email: {R.J.White|J.S.Robinson}@soton.ac.uk
Telephone: +44 (0)23 8059 2021

Royal Botanic Gardens, Kew – Prof. Peter Crane, Dr. Don Kirkup, Ms. Sally Hinchcliffe, Mr. Graham Christian and others
Natural History Museum, London – Prof. Paul Henderson, Mr. Charles Hussey and others
BIOSIS UK – Mr. Michael Dadd, Ms. Judith Howcroft and others

Interactive use of SPICE …

Basic uses for the catalogue
• User wishes to check taxonomy of some organisms interactively; or
• User wishes to access or store data (observations, gene sequences, …) associated with a given species:
  – Catalogue gives information about accepted name/synonyms
  – Can use all names for retrieval, for example
  – May well want to use the accepted name provided by SPICE for storing new data.
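
The two uses above (retrieving with every known name, storing under the accepted name) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how SPICE is implemented: the in-memory `CHECKLIST`, the placeholder `SPECIMEN_LOCALITIES` records and the function names are all hypothetical. Only the two Caragana names and their statuses are taken from the checklist example earlier in the talk.

```python
# Hypothetical in-memory checklist: accepted name -> list of synonyms.
# (The real catalogue is a federation of databases, not a dictionary.)
CHECKLIST = {
    "Caragana arborescens Lam.": ["Caragana sibirica Medikus"],
}

# Placeholder specimen records (name as recorded, locality label).
# The locality labels are invented purely for illustration.
SPECIMEN_LOCALITIES = [
    ("Caragana arborescens Lam.", "locality A"),
    ("Caragana sibirica Medikus", "locality B"),
    ("Abrus precatorius L.", "locality C"),
]


def accepted_name_for(name):
    """Return the accepted name for `name`, or None if the checklist lacks it."""
    for accepted, synonyms in CHECKLIST.items():
        if name == accepted or name in synonyms:
            return accepted
    return None


def names_for_retrieval(name):
    """Return every name to search with: accepted name first, then synonyms."""
    accepted = accepted_name_for(name)
    if accepted is None:
        # Unknown to the checklist: fall back to the name as given.
        return [name]
    return [accepted] + CHECKLIST[accepted]


def localities_for(name):
    """Collect localities recorded under any name of the same species."""
    wanted = set(names_for_retrieval(name))
    return [loc for recorded, loc in SPECIMEN_LOCALITIES if recorded in wanted]
```

Querying with either Caragana name returns the records filed under both, and new data would be stored under the value returned by `accepted_name_for`.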

The “standard data”
• Comprises the information about a species which Species 2000 wishes to provide:
  – AVCNameWithRefs
  – SynonymWithRefs
  – CommonNameWithRefs
  – Family
  – Comment
  – Scrutiny
  – DataLink
  – Geography
• Minimalistic CDM devised:
  – The basic information needed for a catalogue of life;
  – If a GSD can’t be wrapped to conform, it probably doesn’t contain the required information

Request Types 0-5
• Again, a fairly simple set of operations is required:
  – Type 0: Get CDM version compliance for a GSD
  – Type 1: Search for a name in a GSD
  – Type 2: Fetch “standard data” about a chosen species
  – Type 3: Get information about a GSD
  – Type 4: Move up the taxonomic hierarchy
  – Type 5: Move down the taxonomic hierarchy

Type 1 response (XML) extract
<type1result>
  <SPECIESNAME>
    <SYNONYMWITHAVC>
      <SYNONYM>
        <FULLNAME>
          <GENUS>Abrus</GENUS>
          <SPECIES>abrus</SPECIES>
          <AUTHORITY>(L.) Wright</AUTHORITY>
        </FULLNAME>
        <INFRASPECIFICPORTION> </INFRASPECIFICPORTION>
        <SYNONYMSTATUS>synonym</SYNONYMSTATUS>
      </SYNONYM>
      <AVCNAME>
        <FULLNAME>
          <GENUS>Abrus</GENUS>
          <SPECIES>precatorius</SPECIES>
          <AUTHORITY>L.</AUTHORITY>
        </FULLNAME>
        <AVCSTAT>accepted</AVCSTAT>
        <IDL>1571</IDL>
      </AVCNAME>
    </SYNONYMWITHAVC>
  </SPECIESNAME>
  <SPECIESNAME> …

SPICE architecture
[Architecture diagram. Labelled elements: User (Web browser), repeated; User Server module (HTTP); ‘Query’ co-ordinator; CAS knowledge repository (taxonomic hierarchy, annual checklist, genus and other caches, ...); Common Access System (CAS); CORBA; Wrapper (e.g. JDBC); Wrapper (e.g. CGI/XML + ODBC) (in some cases, generic); CORBA ‘wrapper’ element of GSD Wrapper; Internal wrapper; External wrapper; CGI; XML; GSD.]

Why a federation of autonomous, heterogeneous GSDs?
• Taxonomists have specialist knowledge of a limited range of organisms, and want to make their data available in various ways
• So
  – the hierarchy is divided into sectors, with an individual or group of scientists responsible for each
  – scientists are given control over their databases
  – we accommodate existing heterogeneous GSDs; also new ones built for various purposes
• This helps assure taxonomic data quality (peer review of GSDs is also used)

Specialist GSDs mean better data quality than non-specialist ones …
• … but data quality problems still arise:
  – “Non-overlapping” sectors may, in fact, overlap
  – GSDs may be inconsistent taxonomically
  – GSDs may be formed by merging two or more other databases, mutually inconsistent

LITCHI Project
A rule-based tool for the detection and repair of conflicts and merging of data in taxonomic databases

Project Staff
Suzanne Embury, Alex Gray, Andrew Jones, Iain Sutherland
Object and Knowledge-based Systems Group, Department of Computer Science, University of Wales, Cardiff, PO Box 916, Cardiff CF24 3XF
Frank Bisby, Sue Brandt
Centre for Plant Diversity and Systematics, School of Plant Sciences, The University of Reading, Reading RG6 6AS
John Robinson, Richard White
Biodiversity & Ecology Research Division, School of Biological Sciences, University of Southampton, Southampton SO16 7PX

Summary
• We modelled the knowledge integrity rules in a taxonomic treatment
• The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon (examples later)
• Practical uses include detecting and resolving taxonomic conflicts when merging or linking two databases

Example 1
Checklist A
• Caragana arborescens Lam. [accepted name]
  Caragana sibirica Medikus [synonym]
Checklist B
• Caragana sibirica Medikus [accepted name]
  Caragana arborescens Lam.
[synonym]

Example 2
Treatment A recognises one genus, Cytisus:
  Genus Cytisus
    Cytisus multiflorus
    Cytisus praecox
    Cytisus scoparius
    Cytisus striatus
Treatment B recognises two genera, Cytisus and Sarothamnus:
  Genus Cytisus
    Cytisus multiflorus
    Cytisus praecox
  Genus Sarothamnus
    Sarothamnus scoparius
    Sarothamnus striatus
In the case of the species Cytisus scoparius:
  Treatment A will list it as Cytisus scoparius (synonym Sarothamnus scoparius)
  Treatment B will list it as Sarothamnus scoparius (synonym Cytisus scoparius)

Example of a rule
• In each of the 2 examples, merging the checklists would lead to violation of:
  – “A full name which is not a pro-parte name may not appear as both an accepted name and a synonym in the same checklist”
  $\forall n, a, l, c_1, c_2, t_1, t_2:\ \mathit{accepted\_name}(n, a, c_1, l, t_1) \land \mathit{synonym}(n, a, c_2, l, t_2) \Rightarrow \mathit{pro\_parte}(c_1) \land \mathit{pro\_parte}(c_2)$
  violation :- accepted_name(N,A,C1,L,T1), synonym(N,A,C2,L,T2), (\+pro_parte(C1); \+pro_parte(C2)).
• (Violations of other rules help the user to distinguish the taxonomic causes; there are various options to repair this violation)

Conflict display

LITCHI: current status
• Good selection of rules (for botanical nomenclature)
• A research project, now in need of re-engineering:
  – Implemented in Prolog & Visual Basic; not portable
  – Uses XDF file format for data import/export

Some future developments of LITCHI
• BiodiversityWorld
  – BiodiversityWorld is not funded to develop LITCHI at all, but will be able to take advantage of LITCHI developments for ‘taxonomically intelligent navigation’
• EuroCat
  – Re-engineer LITCHI, to work with GSDs wrapped to SPICE CDM 1.2
  – Use for
    • Intra- and inter-GSD consistency checking
    • Navigation between resources organised according to differing taxonomies, e.g.
for access to regional hubs
  – Use in conjunction with, and for generating, ‘cross-maps’

LITCHI in (future) use
[Flow diagram. Labelled elements: Checklist A; Checklist B; Read into system; Taxonomic intelligence; Conflict detection; Rules; Conflict display; Conflict description; Possible repairs; Conflict repair (not necessarily used in this context); Write; Cross-map.]

BiodiversityWorld
• Problem solving environment for biodiversity informatics on the GRID
• UK BBSRC-funded
• Universities of Reading, Cardiff & Southampton, and The Natural History Museum, London

BiodiversityWorld – The Challenge
Some difficult biodiversity questions
• How should conservation efforts be concentrated?
  – (example of Biodiversity Richness & Conservation Evaluation)
• Where might a species be expected to occur, under present or predicted climatic conditions?
  – (example of Bioclimatic Modelling and Climate Change)
• Is geography a good predictor of relationship between lineages? (e.g. are the more closely related species found near each other?)
  – (example of Phylogenetic Analysis & Biogeography)

Some relevant resource types
• Data sources:
  – Catalogue of life
  – Species Information Sources (SISs)
    • Species geography
    • Descriptive data
    • Specimen distribution
  – Geographical
    • Boundaries of geographical & political units
    • Climate surfaces
  – Genetic sequences
• Analytic tools:
  – Biodiversity richness assessment: various metrics
  – Bioclimatic modelling: bioclimatic ‘envelope’ generation
  – Phylogenetic analysis (generation of phylogenetic trees)

Some challenges …
• Finding the resources
• Knowing how to use these heterogeneous resources
  – Originally constructed for various reasons
  – Often little thought was given to standards or interoperability
• One important specific issue: using the appropriate scientific name for SIS queries (hence SPICE for Species 2000)

Our vision
• Biodiversity Problem Solving Environment:
  – Heterogeneous diverse resources
  – Flexible workflows
  – Main challenges centre around metadata,
interoperability, etc.;
  – High-performance computing secondary (though relevant)
• Our previous GRAB demonstrator illustrates some Bioclimatic Modelling elements, with a fixed workflow …

Typical GRAB display
[Screenshot. Captions: Web browser ‘front-end’ to the GRAB server; Applet monitoring communication between GRAB server and GRAB databases.]

Why the GRID for BiodiversityWorld (or even GRAB)?
• HPC; mobility of data & programs
• Resource discovery
• OGSA (Open Grid Services Architecture)
  – not Globus-specific
  – gives Web Services & life cycle management, etc.
• Workflow for orchestrating resources, etc.

BiodiversityWorld architecture
[Architecture diagram. Labelled elements: Taxonomic index (SPICE Catalogue of Life); GSD (several); Analytic tool (several); Proxy (several); Ontology: Metadata, Intelligent links, Resource & Analytic tool descriptions, Maintenance tools; BioD-GRID Problem Solving Environment: Broker agents, Facilitator agents, Presentation agents; User; Thematic Data source; Abiotic Data source; Local tools; Problem Solving Environment User Interface.]

Bioclimatic modelling Case Study – Leucaena leucocephala
• Leucaena leucocephala (Lam.)
De Wit
• Native of Central America
• Widely introduced around the tropics
• Widely utilised around the globe for:
  – Wood
  – Forage
  – Soil enrichment and erosion control
• Regarded as an invasive weed in some areas

Point data from various herbaria

Distribution data from ILDIS database

GARP prediction of climatic suitability

Workflow
• Our PSE should provide flexible support for development of complex workflows for:
  – experimental design of in silico biodiversity-related experiments
  – repeatability
  – modification of experiments

Typical workflow
[Workflow diagram. Labelled elements: START; Enquiry name(s); Species 2000 Catalogue of Life; Distributed Array of GSDs; STAGE 1: Returns list of accepted taxa, synonyms and common names; Enquiry: select ‘data’ for ‘taxon set’; Distributed array of thematic data sources; STAGE 2: Return dataset composed of homologous responses from multiple thematic data sources; Reference to Abiotic datasets; Analytical Toolbox; STAGE 3: Presentation and storage of results.]

Initial test workflow
[Workflow diagram. Labelled elements: Submit scientific name; retrieve accepted name & synonyms for species; SPICE; Retrieve distribution maps for species of interest; Localities; Climate surfaces; Possibly different climate surfaces (e.g. predicted climate); Climate; Climate Space Model; Model of climatic conditions where species is currently found; Base Maps; World or regional maps; Prediction; Prediction of suitable regions for species of interest.]

BiodiversityWorld – much more complex than SPICE
• Much more heterogeneity: diverse kinds of databases and tools
• Much greater range of data quality and terminology problems, e.g.
  – accuracy of “point data”
  – country names
  – …

Role/use of metadata
• Descriptive
• Create electronic book for user
• Create workflows
  – necessary transformations
  – provenances
  – interoperability
• Locate appropriate elements
• Rerun processing (possibly with modifications)

Conclusion
• The field of biodiversity informatics presents various challenges including:
  – taxonomic/naming
  – heterogeneity & autonomy
  – data quality
  – need for extensive metadata