Biodiversity Informatics Projects
Richard White
Thoughts
• Role of biodiversity data in bioinformatics
– assisting with organising and retrieving bioinformatic (molecular)
data
– a separate area with different users (taxonomy, ecology,
conservation, resource management …)
• Demand from users for taxonomic and species diversity
information on the Web
• Pressure on the taxonomic community to deliver
• Demand for more sophisticated use of available data:
interoperability = online analysis, not just browsing
Assembling biodiversity
information sources
Delivering species diversity information by
• assembling,
• merging &
• linking databases and
• publishing on the Web,
with special emphasis on linking
Issues in assembling and linking
biodiversity information sources
• Assembling a web-site (ERMS)
• Assembling databases by merging (ILDIS)
• Linking on-line databases through a gateway
(Species 2000 and SPICE)
• Onward links to related information
• Checking the reliability of links (LITCHI)
• Intelligent linking
• Persistent identifiers
Assembling species databases
First of all, before we start merging and linking
databases, let’s assemble a database from scratch:
• ERMS (European Register of Marine Species)
• Now at www.marbef.org/data/erms.php
ERMS
Incoming data
• Approximately 100 separate lists for different
taxonomic groups
• Mostly compiled as spreadsheets
• Scientific names, synonyms, geography (at least
Atlantic or Mediterranean)
• Some optional fields
• Objective to create a book and a web-site, partially
supported by a database
List conversion
was carried out in several stages:
• Excel spreadsheets were exported to text files
• Tab-delimited text files were imported into a client-server database (MySQL)
• Database query results were passed through
templates to generate either RTF (for the printed
publication) or HTML (for the Web site)
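The conversion stages above can be sketched roughly as follows. This is a minimal illustration, not the ERMS code: it substitutes Python's built-in SQLite for MySQL, and the table layout, column names, template and the two sample rows are all assumptions made for the example.

```python
import csv
import html
import io
import sqlite3
from string import Template

# Illustrative tab-delimited export of one spreadsheet (not an ERMS data file).
TSV = (
    "genus\tspecies\tauthority\n"
    "Dermochelys\tcoriacea\t(Vandelli, 1761)\n"
    "Caretta\tcaretta\t(Linnaeus, 1758)\n"
)

def load(conn, tsv_text):
    """Import tab-delimited rows into a hypothetical `species` table."""
    conn.execute("CREATE TABLE species (genus TEXT, species TEXT, authority TEXT)")
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    conn.executemany(
        "INSERT INTO species VALUES (:genus, :species, :authority)", rows
    )

# Hypothetical HTML template for one checklist entry.
ITEM = Template("<li><i>$genus $species</i> $authority</li>")

def render_html(conn):
    """Pass query results through the template to produce an HTML list."""
    query = "SELECT genus, species, authority FROM species ORDER BY genus, species"
    items = [
        ITEM.substitute(
            genus=html.escape(g), species=html.escape(s), authority=html.escape(a)
        )
        for g, s, a in conn.execute(query)
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

conn = sqlite3.connect(":memory:")  # SQLite stands in for the MySQL server
load(conn, TSV)
page = render_html(conn)
print(page)
```

A second template over the same query would give the RTF output; only the template changes, not the database step.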
Variations on a theme
• Fields may be combined or separated
e.g. genus species authority date
• Higher taxa may be:
– repeated in fields of the species record
– given once in separate preceding records in various
different formats
• Synonyms may be:
– in a separate field of the species record, or mixed with
other remarks, with various delimiters and separators
– in separate records, linked by code or by name or even
abbreviated
– implied, e.g. Genus1 specname (Smith as Genus2)
• Geographical information is often free text
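Two of these variations can be handled with simple pattern matching, as in the sketch below. The regular expressions and function names are hypothetical and cover only the tidy cases; it also assumes that "Genus1 specname (Smith as Genus2)" implies the synonym "Genus2 specname". Real lists would need many more rules.

```python
import re

# Combined field: "Genus species authorship", with an optional year inside.
NAME = re.compile(r"^(?P<genus>[A-Z][a-z]+)\s+(?P<species>[a-z-]+)(?:\s+(?P<rest>.+))?$")
YEAR = re.compile(r"\b(\d{4})\b")

# Implied synonym: "Genus1 specname (Smith as Genus2)".
IMPLIED = re.compile(
    r"^(?P<genus>[A-Z]\w*)\s+(?P<species>[a-z-]+)\s+"
    r"\((?P<author>[^()]+?)\s+as\s+(?P<old_genus>[A-Z]\w*)\)$"
)

def split_name(text):
    """Separate a combined name field into genus, species, authorship and year."""
    match = NAME.match(text.strip())
    if match is None:
        return None
    rest = match.group("rest") or ""
    year = YEAR.search(rest)
    return {
        "genus": match.group("genus"),
        "species": match.group("species"),
        "authorship": rest,
        "year": int(year.group(1)) if year else None,
    }

def implied_synonym(text):
    """Return the synonym implied by an '(Author as Genus2)' remark, if any."""
    match = IMPLIED.match(text.strip())
    if match is None:
        return None
    return f"{match.group('old_genus')} {match.group('species')}"

print(split_name("Caretta caretta (Linnaeus, 1758)"))
print(implied_synonym("Genus1 specname (Smith as Genus2)"))
```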
ERMS book page
Osteichthyes: brief checklist
Reptilia: full details
Taxonomic hierarchy for Reptilia
Merging versus linking
• Merging databases to create a single larger database
• Linking databases to create a distributed information
system
Merging species databases
1 The original databases are physically copied into a new combined database.
2 The user interacts with the new combined database.
[Diagram: “Plants of Europe” and “Plants of Africa” are copied (1) into “Plants of the World”, which the user then queries (2)]
Linking
1 The user interacts with an access system which does not itself contain data.
2 When the user requests data, it is fetched from the appropriate database.
[Diagram: the user interacts (1) with “Plants of the World”, which fetches data (2) from “Plants of Europe” and “Plants of Africa”]
Assembling databases by merging
Now we have some databases, let’s build a bigger one
by merging:
• ILDIS (International Legume Database and
Information Service)
ILDIS
International Legume Database and
Information Service
• International collaborative project
– 10 Regional Centres
– 30 Taxonomic Coordinators
• Its goals include
– building, maintaining and enhancing the ILDIS
World Database of Legumes
– designing and providing services from it to users,
including:
• ILDIS LegumeWeb
• via Species 2000
ILDIS World Database of Legumes v. 7.00
• Taxa 19,500
– Species 15,500
– Subspecies 1,600
– Varieties 2,400
• Names 39,500
– Accepted names 19,500
– Synonyms 19,000
ILDIS’s data model: core data
• A core taxonomic checklist, assembled from
regional data sets and nearing completion, provides
a consensus taxonomy - a unified taxonomic
treatment or backbone on which other data can be
hung
• Various kinds of additional data may be attached to
this backbone (see later)
Features of ILDIS LegumeWeb
We’ll look at examples of the use of LegumeWeb, to
show a couple of features:
• Two-stage access with “synonymic indexing”
• A gateway to external information - “onward links” (direct species
name links) to further sources of information
User access to LegumeWeb: Step 1
• The user types in a name, which may be incomplete
(or wrong!)
• LegumeWeb responds by showing a list of the species
names which fit the user’s specification
User access to LegumeWeb: Step 2
• The user chooses one of the species names provided
(which may be a synonym or an accepted name)
– In this example, the user chooses Abrus
cyaneus (a synonym for Abrus precatorius)
• LegumeWeb responds by showing a standard set of
information about
the chosen species
Synonymic indexing
• Automated synonymic indexing
• synonym entered → accepted name found
(name → taxon)
• taxon found → synonyms listed
• Types of synonyms
– Unambiguous
– Ambiguous
• pro parte
• homonyms
• misapplied names
• In these cases an explanation is offered
to the user
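The two-stage access and the synonymic index can be sketched as two lookup tables built from a flat checklist. The names below are taken from the examples in these slides; the row layout, status labels and function names are assumptions for illustration, not LegumeWeb's actual design.

```python
from collections import defaultdict

# (name, status, accepted name) rows of a hypothetical flat checklist.
CHECKLIST = [
    ("Abrus precatorius", "accepted", "Abrus precatorius"),
    ("Abrus cyaneus", "synonym", "Abrus precatorius"),
    ("Caesalpinia crista", "accepted", "Caesalpinia crista"),
    ("Caesalpinia bonduc", "accepted", "Caesalpinia bonduc"),
    ("Caesalpinia crista", "synonym p.p.", "Caesalpinia bonduc"),
]

def build_index(rows):
    """Build name -> taxa and taxon -> synonyms lookups."""
    name_to_taxa = defaultdict(set)
    taxon_to_synonyms = defaultdict(set)
    for name, status, accepted in rows:
        name_to_taxa[name].add(accepted)
        if status != "accepted":
            taxon_to_synonyms[accepted].add(name)
    return name_to_taxa, taxon_to_synonyms

NAME_TO_TAXA, TAXON_TO_SYNONYMS = build_index(CHECKLIST)

def search(fragment):
    """Step 1: list every name (accepted or synonym) starting with the fragment."""
    fragment = fragment.lower()
    return sorted(n for n in NAME_TO_TAXA if n.lower().startswith(fragment))

def resolve(name):
    """Step 2: name -> taxon. More than one result marks an ambiguous name."""
    return sorted(NAME_TO_TAXA.get(name, ()))

def synonyms_of(taxon):
    """taxon -> synonyms listed."""
    return sorted(TAXON_TO_SYNONYMS.get(taxon, ()))

print(search("abrus"))
print(resolve("Abrus cyaneus"), synonyms_of("Abrus precatorius"))
print(resolve("Caesalpinia crista"))  # two taxa: an explanation is needed
```

When `resolve` returns more than one taxon, as for the pro parte name, the interface would show the explanation rather than pick one silently.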
Assembling databases by linking
Now we have some biggish databases, let’s build
something even bigger by linking databases together:
• Species 2000
– SPICE
– Species 2000 Europa
The Catalogue of Life
(Species 2000)
• An international collaborative project to provide
access to an authoritative and up-to-date checklist of
all the world’s species
• A distributed array of Global Species Databases
(GSDs) can be accessed through a Web gateway or
Common Access System (CAS)
• The array of GSDs provides an index to a further
range of information about each species, using
onward links (see later)
• www.sp2000.org
Species 2000 organisation
[Diagram, three levels:]
• Taxonomic hierarchy (or hierarchies)
• Species: Global species databases (GSDs) and interim checklists: the species index
• Species information sources (SISs): regional faunas and floras, specialist or sectoral databases, web pages etc.
Architecture of Species 2000
[Diagram: the user interface sits on the data collector (CAS), which reaches each of three GSDs through its own wrapper]
Species 2000’s Common Access System
• Species 2000 gives users a single point of access to
GSDs
• Access involves a two-stage search process similar to
that used in LegumeWeb
• In the second stage, the user sees a screen of
“standard data” about a species
The “standard data”
This comprises the information about a species
which Species 2000 wishes to provide:
– Accepted name (with references)
– Synonyms (with references)
– Common Names (with references)
– Family or other higher taxon
– Geography
– Comment
– Scrutiny information
– URL or URLs linking to further data sources for
this species
Need for communication
• Different people are building the various components
of the system:
– GSDs
– wrappers
– CAS
– user interface
• We need to ensure they all have a common
understanding of the data to avoid mistakes
Common Data Model
• We use a Common Data Model (CDM)
– A definition of the information being passed to
and fro
– Human-readable, not machine-readable
– Helps to manage complexity
– Used to create specific machine-readable
implementations for Corba (IDL), CGI/XML
(DTD, XML Schema), Web Services, etc.
What does the CDM look like?
• It defines the input (“request”) and output
(“response”) for six fundamental operations which
the system needs to be able to carry out
Request Types 0-5
– Type 0: Get CDM version supported by
a GSD’s wrapper
– Type 3: Get information about a GSD
– Type 1: Search for a name in a GSD
– Type 2: Fetch “standard data” about a
chosen species
– Type 4: Move up the taxonomic
hierarchy (towards the root of the tree)
– Type 5: Move down the taxonomic
hierarchy (towards the species level)
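A wrapper answering these six request types might look like the sketch below. It is an in-memory stand-in, not the Corba or XML implementation: the class, method and field names, the version string and the sample record are all assumptions, and only the request numbering comes from the list above.

```python
from enum import IntEnum

class RequestType(IntEnum):
    CDM_VERSION = 0    # CDM version supported by the wrapper
    SEARCH_NAME = 1    # search for a name in the GSD
    STANDARD_DATA = 2  # fetch "standard data" for a chosen species
    GSD_INFO = 3       # information about the GSD
    MOVE_UP = 4        # towards the root of the hierarchy
    MOVE_DOWN = 5      # towards the species level

class InMemoryWrapper:
    """Hypothetical wrapper exposing one GSD through the six request types."""

    def __init__(self, gsd_name, standard_data, parent_of):
        self.gsd_name = gsd_name
        self.standard_data = standard_data  # accepted name -> standard data
        self.parent_of = parent_of          # taxon -> parent taxon

    def handle(self, request_type, argument=None):
        if request_type == RequestType.CDM_VERSION:
            return "1.0"  # placeholder version string (assumption)
        if request_type == RequestType.GSD_INFO:
            return {"name": self.gsd_name, "species": len(self.standard_data)}
        if request_type == RequestType.SEARCH_NAME:
            text = argument.lower()
            return sorted(n for n in self.standard_data if text in n.lower())
        if request_type == RequestType.STANDARD_DATA:
            return self.standard_data[argument]
        if request_type == RequestType.MOVE_UP:
            return self.parent_of.get(argument)
        if request_type == RequestType.MOVE_DOWN:
            return sorted(c for c, p in self.parent_of.items() if p == argument)
        raise ValueError(f"unknown request type: {request_type}")

def cas_search(wrappers, text):
    """The CAS holds no data itself: it forwards the search to every wrapper."""
    return [
        (w.gsd_name, name)
        for w in wrappers
        for name in w.handle(RequestType.SEARCH_NAME, text)
    ]

ildis = InMemoryWrapper(
    "ILDIS",
    {"Abrus precatorius": {"synonyms": ["Abrus cyaneus"], "family": "Leguminosae"}},
    {"Abrus precatorius": "Abrus", "Abrus": "Leguminosae"},
)
print(cas_search([ildis], "abrus"))
print(ildis.handle(RequestType.MOVE_UP, "Abrus precatorius"))
```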
Spice CAS in use
Screen-shots of an old version of the Spice system in
use:
Spice 1 CAS
Onward links to related external data
Species databases such as ILDIS and
federated systems such as Species 2000
envisage providing links from their data
to external sources of related data, so-called “onward links”
• Example from ILDIS ...
“Onward links”
• The user may follow a hyperlink to some other data
source for further information, not managed by ILDIS
– In this example, the user chooses to go to
W3Tropicos at Missouri Botanical Garden to
see more information
• In this way LegumeWeb acts as
a gateway to other information about
legume species
LegumeWeb page with onward
links
Destination of an onward link
Further information obtained
Checking the reliability of links
• Whether in
– merging data sets to construct a species
database like ILDIS, or in
– linking from one data set to another,
• it is necessary to ensure that the species concepts in
the different databases do not conflict
Example 1
Database A
• Caragana arborescens Lam. [accepted name]
  Caragana sibirica Medikus [synonym]
Database B
• Caragana sibirica Medikus [accepted name]
  Caragana arborescens Lam. [synonym]
Example 2
Database A
• Caesalpinia crista L. [accepted name]
Database B
• Caesalpinia crista L. [accepted name]
• Caesalpinia bonduc (L.) Roxb. [accepted name]
  Caesalpinia crista L., p.p. [synonym]
LITCHI project
• We modelled the knowledge integrity rules in a
taxonomic treatment
• The knowledge tested is implicit in the assemblage of
scientific names and synonyms used to represent each
taxon
• Practical uses include
– helping a taxonomist to detect and resolve taxonomic
conflicts when merging or linking two databases
– helping a non-taxonomist user follow links from one
database to another, in which the species may be
differently classified
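One simple integrity rule of this kind can be sketched as follows: flag any name that is accepted in one database but appears as a synonym in the other. This is a single illustrative rule under an assumed data layout, not LITCHI's actual rule set, and it ignores qualifiers such as p.p. The data are Examples 1 and 2 above.

```python
def roles(db):
    """Map each name to the set of (status, accepted name) pairs it has in db."""
    index = {}
    for accepted, synonyms in db.items():
        index.setdefault(accepted, set()).add(("accepted", accepted))
        for synonym in synonyms:
            index.setdefault(synonym, set()).add(("synonym", accepted))
    return index

def conflicts(db_a, db_b):
    """List (name accepted in A, taxon in B that treats it as a synonym)."""
    roles_b = roles(db_b)
    found = []
    for accepted in sorted(db_a):
        for status, taxon in sorted(roles_b.get(accepted, ())):
            if status == "synonym":
                found.append((accepted, taxon))
    return found

# Each database: accepted name -> list of synonyms (hypothetical layout).
example1_a = {"Caragana arborescens": ["Caragana sibirica"]}
example1_b = {"Caragana sibirica": ["Caragana arborescens"]}
example2_a = {"Caesalpinia crista": []}
example2_b = {"Caesalpinia crista": [], "Caesalpinia bonduc": ["Caesalpinia crista"]}

print(conflicts(example1_a, example1_b))
print(conflicts(example2_a, example2_b))
```

In Example 1 the rule reports the reversed synonymy; in Example 2 it reports that part of the concept accepted in Database A belongs to a different taxon in Database B.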
Conflict display
Outcome of LITCHI project
• A prototype tool for merging checklists & checking
integrity of individual checklists was implemented
• In the Species 2000 Europa project, we are now
creating a completely new second version with a view
to allowing:
– dynamic linking (so-called “taxonomically intelligent
links”)
– Presentation of “attached data” to be organised,
merged and used to support conflict resolution
“Intelligent” linking
• The Catalogue of Life (Species 2000) is
– not just a catalogue (which lists things)
– it is an index (which points to things)
• GSDs, and gateways to them such as the Catalogue of
Life, can serve not only as catalogues of species but
also as indexes giving access, potentially, to all
species information on the Internet
“Intelligent” linking
• Species 2000 plans to provide links to take a user
– from a species entry (from a GSD)
– to further sources of information about that
particular species (Species Information
Sources or SISs)
“Intelligent” species links
• Given that it is possible to detect many cases of
potential taxonomic conflict when linking species
databases, how can such links be managed?
• There are a number of choices in the ways links may
be made and handled
Cross-mapping
So how can we make intelligent links work, especially in
the difficult cases where a species in one database
does not have an exact match in the other?
– One way is to create and maintain “cross-maps” which describe how one or more taxa
in one resource (such as the Species 2000
index) relate to one or more taxa in another
resource
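A cross-map can be pictured as a list of entries, each relating a set of taxa in one resource to a set of taxa in another. The sketch below is only one possible shape: the class name, the relation labels ("equivalent", "overlaps") and the two entries are assumptions for illustration, loosely based on the Caragana and Caesalpinia examples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CrossMapEntry:
    source_taxa: frozenset  # one or more taxa in the source resource
    relation: str           # hypothetical label for how the sets relate
    target_taxa: frozenset  # one or more taxa in the target resource

CROSS_MAP = [
    CrossMapEntry(
        frozenset({"Caragana arborescens"}),
        "equivalent",
        frozenset({"Caragana sibirica"}),
    ),
    CrossMapEntry(
        frozenset({"Caesalpinia crista"}),
        "overlaps",
        frozenset({"Caesalpinia crista", "Caesalpinia bonduc"}),
    ),
]

def follow(cross_map, taxon):
    """Return (relation, target taxa) for every entry that covers the taxon."""
    return [
        (entry.relation, sorted(entry.target_taxa))
        for entry in cross_map
        if taxon in entry.source_taxa
    ]

print(follow(CROSS_MAP, "Caragana arborescens"))
print(follow(CROSS_MAP, "Caesalpinia crista"))
```

A link that lands on an "overlaps" entry can then warn the user that the destination treats the species differently, instead of failing or linking blindly.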
A dream
A system for managing intelligent species links
would maximise the potential of the plethora
of species-based catalogues, indexes and
rich species resources currently being
assembled all over the world
• Perhaps on the Web, as with the current Spice/Species
2000 prototypes
• Or ...
The Grid
The Grid is often thought of as a new toy for
particle physicists, with
– very high bandwidth
– distributed computational resources
But it also provides opportunities for more
structured and reliable access to data and
information sources, using improved protocols
with metadata
– For example, access to such knowledge
sources as these cross-maps
Using biodiversity information
resources
• Helping Biodiversity Researchers to do their Work
• Collaborative e-Science and Virtual Organisations
Biodiversity analysis and modelling
Scientists working with biodiversity information employ
a wide variety of resources:
• data sources
• statistical analysis and modelling tools
• presentation or visualisation software
which may be available on various local and remote
computer platforms.
Examples of biodiversity resources
Data sources:
• Names: Species 2000 & ITIS Catalogue of Life
• Data: GBIF, sequence databases
• Geography: Gazetteers
• Collections and distributions: BioCASE, MaNIS
Analysis tools:
• Statistical and multivariate analysis
• Modelling
• Visualisation
Use of resources together
Scientists frequently need to use several of these
resources in sequence to carry out their research.
Much effort is currently expended in
• initially acquiring resources
• installing and sometimes adapting them to run on the
user’s own machine
• converting and transporting data sets between stages
of the analysis process
Biodiversity research
Biologists are working to understand the adaptation of
organisms to their environmental niche,
eventually by combining knowledge of all the levels of
biological organisation
• genome
• transcription
• proteome
• metabolic pathways
• cell
• tissue
• organ
• individual whole organism
• population
• species
• evolutionary pathways
and to predict their interactions with their environment
Workflows
Resources are called into use in an appropriate
sequence from an interactive workflow.
The facility for scientists to be able to create their own
workflows, without the need for regular assistance
from computer scientists, is an essential part of the
BDWorld system. Accessible tools for resource
discovery and for workflow design, enactment and reuse are therefore required.
For example
Changes in distribution in response to climate changes
brought about by global warming
CSM: Climate-space modelling
Modelling and predicting changes in distribution in
response to climate changes such as those brought
about by global warming
An unreasonably brief explanation:
• Get current distribution of a species (e.g. specimen
records)
• Get current or recent climate data for those localities
• Calculate a model for the climate space the species can
occupy
• Predict the distribution the species would have in any
specified climate (may be different to the climate used
Example work-flow (Climate-space Modelling)
[Workflow diagram; components and their roles:]
• SPICE: submit scientific name; retrieve accepted name & synonyms for species
• Localities: retrieve distribution maps for species of interest
• Climate surfaces (possibly different climate surfaces for the prediction, e.g. predicted climate)
• Climate Space Model: model of climatic conditions where species is currently found
• Prediction: prediction of suitable regions for species of interest
• Base Maps: world or regional maps
• Projection: projection of predicted distribution on to base map
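The modelling and prediction steps can be illustrated with the simplest kind of climate-space model, a rectangular envelope: the range of each climate variable over the localities where the species occurs. This is a stand-in chosen for brevity, not necessarily the model used in the project, and the variable names, grid cells and all numbers are invented for the example.

```python
def fit_envelope(climate_at_localities):
    """Per-variable (min, max) over the localities where the species is found."""
    variables = climate_at_localities[0].keys()
    return {
        v: (
            min(record[v] for record in climate_at_localities),
            max(record[v] for record in climate_at_localities),
        )
        for v in variables
    }

def suitable(envelope, climate):
    """A cell is suitable if every variable lies inside the envelope."""
    return all(low <= climate[v] <= high for v, (low, high) in envelope.items())

def predict(envelope, surface):
    """Return the cells of a climate surface that fall inside the envelope."""
    return sorted(cell for cell, climate in surface.items() if suitable(envelope, climate))

# Invented climate values at three specimen localities.
localities = [
    {"temp": 12.0, "rain": 800},
    {"temp": 15.0, "rain": 650},
    {"temp": 13.5, "rain": 900},
]
# Invented climate surfaces: grid cell -> climate.
current = {
    "A": {"temp": 13.0, "rain": 700},
    "B": {"temp": 18.0, "rain": 700},
    "C": {"temp": 11.0, "rain": 850},
}
predicted = {
    "A": {"temp": 16.0, "rain": 650},
    "B": {"temp": 14.5, "rain": 700},
    "C": {"temp": 12.5, "rain": 850},
}

envelope = fit_envelope(localities)
print(envelope)
print(predict(envelope, current))
print(predict(envelope, predicted))
```

The same fitted envelope is applied to both surfaces, which is how a different (for example predicted) climate yields a shifted distribution.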
Triana screen-shots
1 Creation (design, editing)
2 Execution (enactment, run-time)
And finally …
Elements of the BDWorld system
• What did the system have to do to make that example
happen?
Role of the work-flow engine
• Create and edit a workflow
– locate an appropriate resource
– check interoperability
– arrange any necessary transformations
– record provenance of generated data sets
• Execute a workflow, passing data sets to and fro
• Create a log or ‘lab book’ for user
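These responsibilities can be sketched as a very small engine that checks each step against the previous one when the workflow is built, then records provenance and a lab-book entry as it runs. It is not Triana or the BDWorld engine; the class names, type labels, sample steps and coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]
    consumes: str  # label of the data set type the step accepts
    produces: str  # label of the data set type the step returns

@dataclass
class Workflow:
    steps: list = field(default_factory=list)
    lab_book: list = field(default_factory=list)

    def add(self, step):
        """Creation: refuse a step that cannot take the previous step's output."""
        if self.steps and self.steps[-1].produces != step.consumes:
            raise ValueError(f"{self.steps[-1].name} does not feed {step.name}")
        self.steps.append(step)
        return self

    def execute(self, data):
        """Execution: pass data along, recording provenance and a log."""
        provenance = []
        for step in self.steps:
            data = step.run(data)
            provenance.append(step.name)
            self.lab_book.append(f"{step.name}: produced {step.produces}")
        return data, provenance

# Invented stand-ins for remote resources.
SYNONYMS = {"Abrus cyaneus": "Abrus precatorius"}
LOCALITIES = {"Abrus precatorius": [(-1.3, 36.8), (6.5, 3.4)]}

workflow = (
    Workflow()
    .add(Step("resolve_name", lambda n: SYNONYMS.get(n, n),
              "scientific name", "accepted name"))
    .add(Step("fetch_localities", lambda n: LOCALITIES.get(n, []),
              "accepted name", "localities"))
)
result, provenance = workflow.execute("Abrus cyaneus")
print(result, provenance)
print(workflow.lab_book)
```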
Difficulties with resources
• Finding the resources
• Knowing how to use these heterogeneous resources
– Originally constructed for various reasons,
often with little attention to standards or
interoperability
– Have to pass data sets from one to another
– Some involve user interaction
Role of metadata
Metadata is needed to enable discovery of
resources and to indicate how they are to be
used.
• Properties to help locate appropriate resources
• Check interoperability, suggest transformations
• Provenance of data sets
• Log of work-flows executed
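The first two uses of metadata can be illustrated with plain records describing each resource. Everything here is hypothetical: the property names, the resource and converter names and the format labels are invented to show discovery by property and the suggestion of a transformation between two resources.

```python
# Hypothetical metadata records for three resources.
RESOURCES = [
    {"name": "name_resolver", "kind": "data source",
     "consumes": "scientific name", "produces": "accepted name"},
    {"name": "locality_lookup", "kind": "data source",
     "consumes": "accepted name", "produces": "localities (CSV)"},
    {"name": "envelope_model", "kind": "analysis tool",
     "consumes": "localities (table)", "produces": "climate model"},
]

# Hypothetical known transformations: (from format, to format) -> converter.
CONVERTERS = {("localities (CSV)", "localities (table)"): "csv_to_table"}

def discover(resources, **properties):
    """Locate resources whose metadata matches every given property."""
    return [
        r["name"] for r in resources
        if all(r.get(key) == value for key, value in properties.items())
    ]

def connect(upstream, downstream, converters):
    """Return the transformations needed to pass data from one to the other.

    An empty list means the two resources interoperate directly.
    """
    if upstream["produces"] == downstream["consumes"]:
        return []
    key = (upstream["produces"], downstream["consumes"])
    if key in converters:
        return [converters[key]]
    raise LookupError(f"no way to link {upstream['name']} to {downstream['name']}")

by_name = {r["name"]: r for r in RESOURCES}
print(discover(RESOURCES, kind="analysis tool"))
print(connect(by_name["name_resolver"], by_name["locality_lookup"], CONVERTERS))
print(connect(by_name["locality_lookup"], by_name["envelope_model"], CONVERTERS))
```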
What is biodiversity informatics?
• The preceding project, among others, shows that the
challenges facing biodiversity informatics include not
only
– Describing the diversity of life at all levels of
organisation, so that biologists can understand,
conserve and exploit it,
• But also
– Inventing ways to describe the ever-increasing
diversity of information resources and analysis
tools available, so that users can find and use them
A challenge to link resources
• It is potentially very difficult to link all these
resources together
• Much attention is currently being given to:
– Providing unique identifiers for data objects
– Which can return metadata about themselves
– Which can be stitched together into a distributed
collaborative information system: see the biodiversity
informatics organisations TDWG and GBIF (later)
End