Knowledge Sharing and
Collaborative Problem Solving in
Biodiversity Informatics
Andrew C. Jones
Cardiff University, UK
The Species 2000 vision
• To enumerate all known species of plants, animals, fungi and
microbes on Earth as the baseline dataset for studies of global
biodiversity
• To provide a simple access point enabling users to link from
Species 2000 to other data systems for all groups of organisms,
using direct species-links
• To enable users worldwide to verify the scientific name, status
and classification of any known species through species
checklist data drawn from an array of participating databases
• (More recently) to provide a “synonymy server” for use as a
service by other applications needing to obtain
suitable scientific names, e.g. for querying
biological data sets
Need for a catalogue
• Suppose we wished to retrieve all locations where
specimens of Caragana arborescens have been
collected, from various specimen distribution
databases.
• A taxonomic checklist might include:
Caragana arborescens Lam. [accepted name]
Caragana sibirica Medikus [synonym]
• Classification of organisms is based on opinion
regarding
– what the groups are
– identification of individuals
• So we need to use both these names as search
terms
• In practice the problem might be far worse
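A minimal sketch of the point above, assuming a checklist held as a simple in-memory mapping (the `CHECKLIST` structure and the `search_terms` helper are illustrative, not part of Species 2000 or SPICE): whichever name the user starts from, every name for the same taxon is returned for use as a search term.

```python
# Hypothetical in-memory checklist: accepted name -> list of synonyms.
CHECKLIST = {
    "Caragana arborescens Lam.": ["Caragana sibirica Medikus"],
}


def search_terms(name, checklist=CHECKLIST):
    """Return all names (accepted name first) for the taxon `name` belongs to."""
    for accepted, synonyms in checklist.items():
        if name == accepted or name in synonyms:
            return [accepted, *synonyms]
    # Unknown name: fall back to searching on the name as given.
    return [name]


terms = search_terms("Caragana sibirica Medikus")
print(terms)
```

Each specimen distribution database would then be queried once per returned name.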
SPICE for Species 2000: Meeting the
Computing challenges
• The SPICE for Species 2000 project aimed to:
– build a federated ‘registry’ of scientific names organised by taxon
(species, etc.)
– accommodate GSD (Global Species Database) heterogeneity
– accommodate GSD autonomy & instability
– ensure scalability
• Funding:
– SPICE was funded by the UK BBSRC/EPSRC Bioinformatics panel
– EuroCat – new EU-funded project to augment
SPICE catalogue of life & develop/maintain
SPICE software
SPICE Project Staff
Cardiff – Prof. Alex Gray, Dr. Andrew Jones, Prof. Nick Fiddian, Dr. Xuebiao Xu,
(Mr. Nick Pittas).
Object and Knowledge-based Systems Group, Department of Computer Science, Cardiff
University, PO Box 916, Cardiff CF24 3XF
Email:
{W.A.Gray|Andrew.C.Jones|N.Fiddian|X.Xu|N.Pittas}@cs.cf.ac.uk
Telephone +44 (0)29 2087 4812
Reading – Prof. Frank Bisby, Prof. Sir Ghillean Prance and Dr. Sue Brandt.
Centre for Plant Diversity & Systematics, The University of Reading, Reading RG6 6AS
Email:
{F.A.Bisby|S.M.Brandt}@reading.ac.uk
Telephone +44 (0) 118 378 6437
Southampton – Dr. Richard White and Mr. John Robinson.
Biodiversity & Ecology Research Division, School of Biological Sciences,
University of Southampton, Southampton SO16 7PX
Email:
{R.J.White|J.S.Robinson}@soton.ac.uk
Telephone +44 (0)23 8059 2021
Royal Botanic Gardens, Kew - Prof. Peter Crane, Dr. Don Kirkup,
Ms. Sally Hinchcliffe, Mr. Graham Christian and others
Natural History Museum, London - Prof. Paul Henderson, Mr. Charles Hussey
and others
BIOSIS UK - Mr. Michael Dadd, Ms. Judith Howcroft and others
Interactive use of SPICE …
Basic uses for the catalogue
• User wishes to check taxonomy of some
organisms interactively; or
• User wishes to access or store data
(observations, gene sequences; …)
associated with a given species:
– Catalogue gives information about accepted
name/synonyms
– Can use all names for retrieval, for example
– May well want to use the accepted name provided
by SPICE for storing new data.
The “standard data”
• Comprises the information about a species which
Species 2000 wishes to provide:
– AVCNameWithRefs
– SynonymWithRefs
– CommonNameWithRefs
– Family
– Comment
– Scrutiny
– DataLink
– Geography
• Minimalistic CDM (Common Data Model) devised:
– The basic information needed for a catalogue of life;
– If a GSD can’t be wrapped to conform, it probably doesn’t
contain the required information
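As a rough illustration, the eight “standard data” items could be held in a record like the one below. Only the field names come from the slide; the Python types, the defaults and the `StandardData` class name are assumptions, and the real CDM defines its own structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StandardData:
    """Illustrative container for the 'standard data' about one species."""
    avc_name_with_refs: str                                   # AVCNameWithRefs
    synonyms_with_refs: List[str] = field(default_factory=list)      # SynonymWithRefs
    common_names_with_refs: List[str] = field(default_factory=list)  # CommonNameWithRefs
    family: Optional[str] = None                              # Family
    comment: Optional[str] = None                             # Comment
    scrutiny: Optional[str] = None                            # Scrutiny
    data_link: Optional[str] = None                           # DataLink
    geography: Optional[str] = None                           # Geography


# Names taken from the Type 1 response extract in this deck; other fields left unset.
record = StandardData(
    avc_name_with_refs="Abrus precatorius L.",
    synonyms_with_refs=["Abrus abrus (L.) Wright"],
)
print(record)
```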
Request Types 0-5
• Again, a fairly simple set of operations is
required:
– Type 0: Get CDM version compliance for a GSD
– Type 1: Search for a name in a GSD
– Type 2: Fetch “standard data” about a chosen
species
– Type 3: Get information about a GSD
– Type 4: Move up the taxonomic hierarchy
– Type 5: Move down the taxonomic hierarchy
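The six request types listed above can be sketched as an enumeration. The numeric codes and descriptions are taken from the slide; the Python identifiers (`RequestType`, `describe`, and the member names) are hypothetical and do not reflect the actual SPICE interface.

```python
from enum import IntEnum


class RequestType(IntEnum):
    """Request types 0-5; member names are illustrative only."""
    CDM_VERSION = 0
    SEARCH_NAME = 1
    FETCH_STANDARD_DATA = 2
    GSD_INFO = 3
    MOVE_UP = 4
    MOVE_DOWN = 5


DESCRIPTIONS = {
    RequestType.CDM_VERSION: "Get CDM version compliance for a GSD",
    RequestType.SEARCH_NAME: "Search for a name in a GSD",
    RequestType.FETCH_STANDARD_DATA: "Fetch standard data about a chosen species",
    RequestType.GSD_INFO: "Get information about a GSD",
    RequestType.MOVE_UP: "Move up the taxonomic hierarchy",
    RequestType.MOVE_DOWN: "Move down the taxonomic hierarchy",
}


def describe(code: int) -> str:
    """Map a numeric request type to its description; raises ValueError if unknown."""
    return DESCRIPTIONS[RequestType(code)]


print(describe(1))
```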
Type 1 response (XML) extract
<type1result>
<SPECIESNAME>
<SYNONYMWITHAVC>
<SYNONYM>
<FULLNAME>
<GENUS>Abrus</GENUS>
<SPECIES>abrus</SPECIES>
<AUTHORITY>(L.) Wright</AUTHORITY>
</FULLNAME>
<INFRASPECIFICPORTION> </INFRASPECIFICPORTION>
<SYNONYMSTATUS>synonym</SYNONYMSTATUS>
</SYNONYM>
<AVCNAME>
<FULLNAME>
<GENUS>Abrus</GENUS>
<SPECIES>precatorius</SPECIES>
<AUTHORITY>L.</AUTHORITY>
</FULLNAME>
<AVCSTAT>accepted</AVCSTAT>
<IDL>1571</IDL>
</AVCNAME>
</SYNONYMWITHAVC>
</SPECIESNAME>
<SPECIESNAME> …
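A client could read such a response with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a single-entry copy of the extract; the closing tags are added here so that the sample is well-formed, and the helper names `full_name` and `synonym_pairs` are illustrative.

```python
import xml.etree.ElementTree as ET

# One SPECIESNAME entry copied from the extract, closed so that it parses.
SAMPLE = """<type1result>
  <SPECIESNAME>
    <SYNONYMWITHAVC>
      <SYNONYM>
        <FULLNAME>
          <GENUS>Abrus</GENUS>
          <SPECIES>abrus</SPECIES>
          <AUTHORITY>(L.) Wright</AUTHORITY>
        </FULLNAME>
        <INFRASPECIFICPORTION> </INFRASPECIFICPORTION>
        <SYNONYMSTATUS>synonym</SYNONYMSTATUS>
      </SYNONYM>
      <AVCNAME>
        <FULLNAME>
          <GENUS>Abrus</GENUS>
          <SPECIES>precatorius</SPECIES>
          <AUTHORITY>L.</AUTHORITY>
        </FULLNAME>
        <AVCSTAT>accepted</AVCSTAT>
        <IDL>1571</IDL>
      </AVCNAME>
    </SYNONYMWITHAVC>
  </SPECIESNAME>
</type1result>"""


def full_name(elem):
    """Join GENUS, SPECIES and AUTHORITY from the FULLNAME child of `elem`."""
    fn = elem.find("FULLNAME")
    parts = [fn.findtext(tag, default="").strip()
             for tag in ("GENUS", "SPECIES", "AUTHORITY")]
    return " ".join(p for p in parts if p)


def synonym_pairs(xml_text):
    """Return (synonym, accepted name) pairs found in a Type 1 response."""
    root = ET.fromstring(xml_text)
    return [(full_name(item.find("SYNONYM")), full_name(item.find("AVCNAME")))
            for item in root.iter("SYNONYMWITHAVC")]


print(synonym_pairs(SAMPLE))
```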
SPICE architecture
[Architecture diagram. Components shown:]
• Users (Web browsers), connecting over HTTP to the User Server module
• Common Access System (CAS): User Server module, ‘Query’ co-ordinator, and
CAS knowledge repository (taxonomic hierarchy, annual checklist, genus
and other caches, ...)
• CORBA between the CAS and the CORBA ‘wrapper’ element of each GSD wrapper
• GSD wrappers (in some cases, generic), e.g. JDBC, or CGI/XML + ODBC,
each in front of a GSD
• An internal wrapper and an external wrapper, with CGI/XML between them,
in front of a GSD
Why a federation of autonomous,
heterogeneous GSDs?
• Taxonomists have specialist knowledge of a
limited range of organisms, and want to make
their data available in various ways
• So
– the hierarchy is divided into sectors, with an
individual or group of scientists responsible for
each
– scientists are given control over their databases
– we accommodate existing heterogeneous GSDs;
also new ones built for various purposes
• This helps assure taxonomic data quality
(peer review of GSDs is also used)
Specialist GSDs mean better data quality
than non-specialist ones …
• … but data quality problems still arise:
– “Non-overlapping” sectors may, in fact,
overlap
– GSDs may be inconsistent taxonomically
– GSDs may be formed by merging two or
more other databases, mutually
inconsistent
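One simple check for the first problem, overlapping sectors, is to look for names that more than one GSD supplies. This is a sketch only: the GSD identifiers `gsd_a` and `gsd_b` and their contents are hypothetical, and real overlap detection would also have to account for synonymy.

```python
from collections import defaultdict


def overlapping_names(gsd_names):
    """Map each name supplied by more than one GSD to the sorted list of those GSDs."""
    seen = defaultdict(set)
    for gsd, names in gsd_names.items():
        for name in names:
            seen[name].add(gsd)
    return {name: sorted(gsds) for name, gsds in seen.items() if len(gsds) > 1}


# Hypothetical sector contents, for illustration only.
sectors = {
    "gsd_a": {"Cytisus scoparius", "Cytisus striatus"},
    "gsd_b": {"Cytisus scoparius", "Caragana arborescens"},
}
print(overlapping_names(sectors))
```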
LITCHI Project
A rule-based tool
for the detection and repair of
conflicts and merging of data
in taxonomic databases
Project Staff
Suzanne Embury, Alex Gray, Andrew Jones, Iain
Sutherland
Object and Knowledge-based Systems Group,
Department of Computer Science, University of
Wales, Cardiff, PO Box 916, Cardiff CF24 3XF
Frank Bisby, Sue Brandt
Centre for Plant Diversity and Systematics, School
of Plant Sciences, The University of Reading,
Reading RG6 6AS
John Robinson, Richard White
Biodiversity & Ecology Research Division, School of
Biological Sciences, University of Southampton,
Southampton SO16 7PX
Summary
• We modelled the knowledge integrity rules in
a taxonomic treatment
• The knowledge tested is implicit in the
assemblage of scientific names and
synonyms used to represent each taxon
(examples later)
• Practical uses include detecting and resolving
taxonomic conflicts when merging or linking
two databases
Example 1
Checklist A
• Caragana arborescens Lam. [accepted name]
Caragana sibirica Medikus [synonym]
Checklist B
• Caragana sibirica Medikus [accepted name]
Caragana arborescens Lam. [synonym]
Example 2
Treatment A recognises one genus, Cytisus:
– Genus Cytisus: Cytisus multiflorus, Cytisus praecox, Cytisus scoparius,
Cytisus striatus
Treatment B recognises two genera, Cytisus and Sarothamnus:
– Genus Cytisus: Cytisus multiflorus, Cytisus praecox
– Genus Sarothamnus: Sarothamnus scoparius, Sarothamnus striatus
In the case of the species Cytisus scoparius:
– Treatment A will list it as Cytisus scoparius
(synonym Sarothamnus scoparius)
– Treatment B will list it as Sarothamnus scoparius
(synonym Cytisus scoparius)
Example of a rule
• In each of the 2 examples, merging the checklists would
lead to violation of:
– “A full name which is not a pro-parte name may not appear as both
an accepted name and a synonym in the same checklist”
$$\forall n, a, l, c_1, c_2, t_1, t_2:\ \mathrm{accepted\_name}(n, a, c_1, l, t_1) \land \mathrm{synonym}(n, a, c_2, l, t_2) \Rightarrow \mathrm{pro\_parte}(c_1) \land \mathrm{pro\_parte}(c_2)$$
violation :- accepted_name(N,A,C1,L,T1),
    synonym(N,A,C2,L,T2),
    (\+pro_parte(C1); \+pro_parte(C2)).
• (Violations of other rules help user to distinguish the
taxonomic causes; various options to repair this
violation)
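The rule can also be sketched outside Prolog. The Python version below is not LITCHI code: the `NameRecord` fields are an assumed reading of the rule's arguments (name, author, pro-parte status, checklist, taxon), and the taxon labels are placeholders. It reports a violation when the same full name appears as an accepted name and as a synonym in one checklist and at least one of the two uses is not pro parte, which is the condition in the Prolog clause.

```python
from typing import List, NamedTuple, Tuple


class NameRecord(NamedTuple):
    name: str        # n: scientific name
    author: str      # a: authority
    status: str      # "accepted" or "synonym"
    pro_parte: bool  # whether this use of the name is pro parte
    checklist: str   # l: checklist the record belongs to
    taxon: str       # t: taxon the record is attached to (placeholder labels)


def violations(records: List[NameRecord]) -> List[Tuple[NameRecord, NameRecord]]:
    """Return (accepted, synonym) pairs that break the rule."""
    accepted = [r for r in records if r.status == "accepted"]
    synonyms = [r for r in records if r.status == "synonym"]
    found = []
    for acc in accepted:
        for syn in synonyms:
            same_full_name = (acc.name, acc.author, acc.checklist) == (
                syn.name, syn.author, syn.checklist)
            if same_full_name and (not acc.pro_parte or not syn.pro_parte):
                found.append((acc, syn))
    return found


# Example 1, with both checklists already merged into one list called "merged".
checklist_a = [
    NameRecord("Caragana arborescens", "Lam.", "accepted", False, "merged", "taxon_a"),
    NameRecord("Caragana sibirica", "Medikus", "synonym", False, "merged", "taxon_a"),
]
checklist_b = [
    NameRecord("Caragana sibirica", "Medikus", "accepted", False, "merged", "taxon_b"),
    NameRecord("Caragana arborescens", "Lam.", "synonym", False, "merged", "taxon_b"),
]

print(len(violations(checklist_a)))
print(len(violations(checklist_a + checklist_b)))
```

Either checklist alone is consistent; only the merged data contains names that are both accepted and synonyms.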
Conflict display
LITCHI: current status
• Good selection of rules (for botanical
nomenclature)
• A research project, now in need of reengineering:
– Implemented in Prolog & Visual Basic; not
portable
– Uses XDF file format for data import/export
Some future developments of LITCHI
• BiodiversityWorld
– BiodiversityWorld is not funded to develop LITCHI at all,
but will be able to take advantage of LITCHI
developments for ‘taxonomically intelligent navigation’
• EuroCat
– Re-engineer LITCHI, to work with GSDs wrapped to
SPICE CDM 1.2
– Use for
• Intra- and inter- GSD consistency checking
• Navigation between resources organised according to differing
taxonomies, e.g. for access to regional hubs
– Use in conjunction with, and for generating, ‘cross-maps’
LITCHI in (future) use
[Flow diagram. Stages shown, in order:]
• Checklist A and Checklist B: read into system
• Conflict detection (taxonomic intelligence), driven by the rules
• Conflict display: conflict description and possible repairs
• Conflict repair (not necessarily used in this context)
• Write cross-map
BiodiversityWorld
• Problem solving environment for
biodiversity informatics on the GRID
• UK BBSRC-funded
• Universities of Reading, Cardiff &
Southampton, and The Natural History
Museum, London
BiodiversityWorld – The Challenge
Some difficult Biodiversity questions
• How should conservation efforts be concentrated?
– (example of Biodiversity Richness & Conservation
Evaluation)
• Where might a species be expected to occur, under
present or predicted climatic conditions?
– (example of Bioclimatic modelling and Climate Change)
• Is geography a good predictor of relationship
between lineages? (e.g. are the more closely related
species found near each other?)
– (example of Phylogenetic Analysis & Biogeography)
Some relevant resource types
• Data sources:
– Catalogue of life
– Species Information Sources (SISs)
• Species geography
• Descriptive data
• Specimen distribution
– Geographical
• Boundaries of geographical & political units
• Climate surfaces
– Genetic sequences
• Analytic tools:
– Biodiversity richness assessment – various metrics
– Bioclimatic modelling – bioclimatic ‘envelope’ generation
– Phylogenetic analysis (generation of phylogenetic trees)
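To make the idea of a bioclimatic ‘envelope’ concrete, here is a deliberately simple sketch: a rectilinear envelope that records the minimum and maximum of each climate variable over the localities where a species has been found, and then tests whether another location falls inside it. This is not GARP or any tool named in this deck, and the variable names and values are invented purely for illustration.

```python
def fit_envelope(samples):
    """Return {variable: (min, max)} over climate samples taken at known localities."""
    variables = samples[0].keys()
    return {v: (min(s[v] for s in samples), max(s[v] for s in samples))
            for v in variables}


def is_suitable(envelope, cell):
    """True if every climate variable of `cell` lies within the envelope."""
    return all(low <= cell[v] <= high for v, (low, high) in envelope.items())


# Invented climate values at three hypothetical collection localities.
localities = [
    {"temp_c": 20.0, "rain_mm": 1500.0},
    {"temp_c": 28.0, "rain_mm": 600.0},
    {"temp_c": 24.0, "rain_mm": 900.0},
]
envelope = fit_envelope(localities)
print(envelope)
print(is_suitable(envelope, {"temp_c": 25.0, "rain_mm": 1000.0}))
print(is_suitable(envelope, {"temp_c": 10.0, "rain_mm": 1000.0}))
```

Applying `is_suitable` to every cell of a different climate surface (for example a predicted future climate) would give a crude map of potentially suitable regions.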
Some challenges …
• Finding the resources
• Knowing how to use these heterogeneous
resources
– Originally constructed for various reasons
– Often little thought was given to standards or
interoperability
• One important specific issue: using
appropriate scientific name for SIS queries
(hence SPICE for Species 2000)
Our vision
• Biodiversity Problem Solving Environment –
– Heterogeneous diverse resources
– Flexible workflows
– Main challenges centre around metadata,
interoperability, etc;
– High-performance computing secondary (though
relevant)
• Our previous GRAB demonstrator illustrates
some Bioclimatic Modelling elements, with a
fixed workflow …
Typical GRAB display
[Screenshot, with two annotations:]
• Web browser ‘front-end’ to the GRAB server
• Applet monitoring communication between GRAB server and GRAB databases
Why the GRID for BiodiversityWorld
(or even GRAB?)
• HPC; mobility of data & programs
• Resource discovery
• OGSA (Open Grid Services
Architecture) – not Globus-specific –
gives Web Services & life cycle
management, etc
• Workflow for orchestrating resources,
etc.
BiodiversityWorld architecture
[Architecture diagram. Components shown:]
• User, working through the Problem Solving Environment User Interface
• BioD-GRID Problem Solving Environment: broker agents, facilitator agents,
presentation agents
• Ontology: metadata, intelligent links, resource & analytic tool
descriptions, maintenance tools
• Taxonomic index (SPICE Catalogue of Life), with its GSDs
• Analytic tools, thematic data sources, abiotic data sources and local tools
• Proxies in front of the resources
Bioclimatic modelling
Case Study - Leucaena leucocephala
• Leucaena leucocephala (Lam.) De Wit
• Native of Central America
• Widely introduced around the tropics
• Widely utilised around the globe for:
– Wood
– Forage
– Soil enrichment and erosion control
• Regarded as an invasive weed in some
areas
Point data from various herbaria
Distribution data from ILDIS database
GARP prediction of climatic suitability
Workflow
• Our PSE should provide flexible support
for development of complex workflows
for:
– experimental design of in silico
biodiversity-related experiments
– repeatability
– modification of experiments
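A very small sketch of what repeatable, modifiable workflows might look like in code, assuming a workflow is just an ordered list of named steps. The `run_workflow` function and the two string-handling steps are hypothetical stand-ins; real steps would call the catalogue, data sources and analytic tools.

```python
def run_workflow(steps, initial):
    """Apply each (name, function) step in order; return the result and the step names run."""
    value = initial
    log = []
    for name, func in steps:
        value = func(value)
        log.append(name)
    return value, log


# Hypothetical two-step workflow on a species name string.
steps = [
    ("trim", str.strip),
    ("split", str.split),
]
result, log = run_workflow(steps, "  Leucaena leucocephala ")
print(result, log)

# Modifying the experiment: keep the first step, replace the second.
modified = steps[:1] + [("upper", str.upper)]
result2, log2 = run_workflow(modified, "  Leucaena leucocephala ")
print(result2, log2)
```

Because the step list is plain data, it can be stored alongside the results, rerun unchanged for repeatability, or edited to produce a variant experiment.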
Typical workflow
[Workflow diagram. Elements shown:]
• START: enquiry name(s) sent to the Species 2000 Catalogue of Life
(distributed array of GSDs)
• STAGE 1: returns list of accepted taxa, synonyms and common names
• Enquiry: select ‘data’ for ‘taxon set’ (distributed array of thematic
data sources)
• STAGE 2: return dataset composed of homologous responses from multiple
thematic data sources
• Analytical Toolbox, with reference to abiotic datasets
• STAGE 3: presentation and storage of results
Initial test workflow
[Workflow diagram. Elements shown:]
• SPICE: submit scientific name; retrieve accepted name & synonyms for
species
• Localities: retrieve distribution maps for species of interest
• Climate surfaces and localities feed the Climate Space Model: a model of
climatic conditions where the species is currently found
• Climate: possibly different climate surfaces (e.g. predicted climate)
• Base maps: world or regional maps
• Prediction: prediction of suitable regions for species of interest
BiodiversityWorld – much more
complex than SPICE
• Much more heterogeneity
– diverse kinds of databases and tools
• Much greater range of data quality and
terminology problems, e.g.
– accuracy of “point data”
– country names
–…
Role/use of metadata
• Descriptive
• Create electronic book for user
• Create workflows
– necessary transformations
– provenances
– interoperability
• Locate appropriate elements
• Rerun processing (possibly with
modifications)
Conclusion
• The field of biodiversity informatics
presents various challenges including:
– taxonomic/naming
– heterogeneity & autonomy
– data quality
– need for extensive metadata