Prof. Jessie Kennedy e-SI Theme:
Exploiting Diverse Sources of Scientific Data
Wealth and diversity of scientific data collected and stored is growing rapidly
Increase in automation
Genetic sequencing, remote sensing, astronomy satellites
Decrease in technological costs
Computers more powerful, disk space greater for the same £
Huge potential for scientific discovery by exploiting this data
especially multi-disciplinary research
Number, complexity and diversity of resources makes this a difficult task
Case Study
Data Integration
Matching data sets on biological names
Science Environment for Ecological Knowledge
USA National Science Foundation funding
Multidisciplinary project
Biology: Ecology, Taxonomy
Environmental science: Geography, Remote sensing, Meteorology, Climatology
Computer Science: Database, GRID/Web, Ontologies, Workflows, Algorithms, Human Computer Interaction
The SEEK Prototype: Ecological Niche Modeling
Inputs
Biodiversity information, e.g. data from museum specimens, ecological surveys
Geospatial and remotely sensed data
Geographic space
Occurrence points on native species distribution
Ecological space
Ecological niche modeling
Model of niche in ecological dimensions (axis shown: temperature)
Project back onto geography
Native range prediction
Results taken to integrate with other data realms (e.g. human populations, public health)
Predicted Distribution: Amur snakehead (Channa argus)
Image from http://www.lifemapper.org
Data is Distributed
Data is Heterogeneous
Syntax
e.g. text, Excel, relational database, …
Schema
e.g. names of the tables, columns in tables
Semantics (principal focus for SEEK)
From many disciplines
Biodiversity surveys, hydrology, atmospheric chemistry, spatial data, behavioural experiments,…
Data on economics, demographics, legal issues,…
EcoGrid:
Making diverse environmental data
Semantic Mediation System:
“Smart” data discovery and integration
BEAM WG:
Biodiversity and Ecological Analysis and Modelling
Knowledge Representation WG:
Taxonomic name/concept resolution server
EcoGrid
[Figure: map of EcoGrid data sources]
Networks shown (counts as given on the slide):
Partnership for Interdisciplinary Studies of Coastal Oceans (4)
Natural History Collections (>> 100)
System (36)
Multi-agency Rocky Intertidal Network (60)
LTER Network (24)
Organization of Biological Field Stations (180)
Site labels: NTL, HBR, VCR, LUQ
Legend: Metacat node, VegBank node, Xanthoria node, SRB node, DiGIR node, Legacy system
EcoGrid registry to discover data sources
EML (Ecological Metadata Language)
Experimental data, survey data, spatial raster and vector data, etc.
XML based
Discovery information
Creator, Title, Abstract, Keyword, etc.
Coverage
Geographic, temporal, and taxonomic extent
Logical and physical data structure
Data semantics via unit definitions and typing
Protocols and methods
DarwinCore
Museum collections
Service to Analysis and Modelling Layer
Interaction with Kepler – Workflows
Interaction with Grid Computing Facilities
Distributed computation
Service to Semantic Mediation Layer
Access to Ontologies; Taxon Services
Access to Legacy Apps
LifeMapper
Spatial Data Workbench
AMS
Model the way scientists currently work with data
coordinate export and import of data among software systems
Workflows emphasize data flow
Output generation includes creating appropriate metadata
The analysis workflow itself becomes metadata
The workflow describes the data lineage as it has been transformed
Derived data sets can be stored in EcoGrid with provenance
Query EcoGrid to find data
Archive output to EcoGrid with workflow metadata
EML provides semi-automated data binding
(200 to 500 runs per species x
2000 mammal species x
3 minutes/run)
=
833 to 2083 days
Utilize distributed computing resources
Execute single steps or sub-workflows on distributed machines
KeplerGrid for Niche Modeling
(200 to 500 runs per species x
2000 mammal species x
3 minutes/run)
/
100 nodes
=
8 to 20 days
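The run-time estimates above can be checked with a few lines of arithmetic. This is only a sketch: the function name `total_days` is invented here, and it assumes every run is independent and the work divides evenly across nodes with no scheduling or data-transfer overhead.

```python
MINUTES_PER_DAY = 24 * 60


def total_days(runs_per_species, species=2000, minutes_per_run=3, nodes=1):
    """Wall-clock days, assuming runs split evenly across nodes with no overhead."""
    total_minutes = runs_per_species * species * minutes_per_run
    return total_minutes / nodes / MINUTES_PER_DAY


for nodes in (1, 100):
    low = total_days(200, nodes=nodes)
    high = total_days(500, nodes=nodes)
    print(f"{nodes:>3} node(s): {low:.1f} to {high:.1f} days")
```

On one machine this gives roughly 833 to 2083 days; on 100 nodes roughly 8.3 to 20.8 days, which the slide truncates to 8 to 20.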
SMS
Key information needed to read and machine process a data file is in the metadata
Physical descriptors (CSV, Excel, RDBMS, etc.)
Logical entity (table, image, …) and attribute (column) descriptions
Name
Type (integer, float, string…)
Codes (missing values, nulls...)
Integrity constraints
Semantic descriptions (ontology-based type systems)
Metadata driven data ingestion
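A minimal sketch of metadata-driven ingestion, to make the idea concrete. The `METADATA` dictionary is a hypothetical stand-in for the physical and logical descriptors listed above (real EML expresses them in XML), and the attribute names, missing-value codes and sample data are invented for illustration.

```python
import csv
import io

# Hypothetical metadata record: delimiter, attribute names, types, missing-value codes.
METADATA = {
    "delimiter": ",",
    "attributes": [
        {"name": "site", "type": str, "missing": []},
        {"name": "count", "type": int, "missing": ["-999", ""]},
        {"name": "area_m2", "type": float, "missing": ["NA"]},
    ],
}


def ingest(text, metadata):
    """Parse delimited text into typed rows, guided only by the metadata."""
    reader = csv.DictReader(io.StringIO(text), delimiter=metadata["delimiter"])
    rows = []
    for raw in reader:
        row = {}
        for attr in metadata["attributes"]:
            value = raw[attr["name"]].strip()
            # Declared missing-value codes become None; everything else is cast.
            row[attr["name"]] = None if value in attr["missing"] else attr["type"](value)
        rows.append(row)
    return rows


SAMPLE = "site,count,area_m2\nA,12,4.0\nB,-999,NA\n"
rows = ingest(SAMPLE, METADATA)
print(rows)
```

The reader never hard-codes column types or null codes; changing the metadata is enough to ingest a differently structured file.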
What was measured (biomass or photosynthetic solar radiation)
Type of quantity measured (mass, length)
Context of measurement (Psychotria limonensis, wavelength band)
How it was measured (dry weight, total solar radiation)
Label data with semantic types
Label inputs and outputs of analytical components with semantic types
Data Ontology Workflow Components
Use reasoning engine to generate transformation step
Use reasoning engine to discover relevant component
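As a rough illustration of component discovery by semantic type, the sketch below walks a toy "is-a" hierarchy. It is not a reasoning engine and not the SEEK implementation: the ontology terms and component names are all hypothetical, and a simple ancestor walk stands in for real ontology reasoning.

```python
# Toy "is-a" ontology, child -> parent. All terms are hypothetical.
ONTOLOGY = {
    "DryWeightBiomass": "Biomass",
    "Biomass": "Mass",
    "Mass": "Quantity",
    "Length": "Quantity",
}

# Hypothetical workflow components, each labelled with the semantic type it accepts.
COMPONENTS = {
    "BiomassSummary": "Biomass",
    "LengthHistogram": "Length",
    "GenericStatistics": "Quantity",
}


def is_a(semantic_type, ancestor):
    """True if semantic_type equals ancestor or descends from it in ONTOLOGY."""
    while semantic_type is not None:
        if semantic_type == ancestor:
            return True
        semantic_type = ONTOLOGY.get(semantic_type)
    return False


def discover(data_type):
    """Names of components whose accepted input type subsumes data_type."""
    return sorted(name for name, accepts in COMPONENTS.items() if is_a(data_type, accepts))


print(discover("DryWeightBiomass"))
print(discover("Length"))
```

Data labelled `DryWeightBiomass` is offered to the biomass-specific and generic components but not to the length histogram, because the label alone carries enough meaning to rule it out.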
Homogeneous data integration
Integration via EML metadata is relatively straightforward
Heterogeneous Data integration
Requires advanced metadata and processing
Attributes must be semantically typed
Collection protocols must be known
Units and measurement scale must be known
Measurement relationships must be known
e.g., that ArealDensity=Count/Area
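A small sketch of why the measurement relationship matters: once it is known that ArealDensity = Count / Area, a data set that records counts and plot areas can be combined with one that records densities directly. The record layouts, field names and values below are invented for illustration.

```python
def areal_density(count, area):
    """ArealDensity = Count / Area; area must be positive."""
    if area <= 0:
        raise ValueError("area must be positive")
    return count / area


# Hypothetical data sets: A records counts per plot, B records density directly.
dataset_a = [{"taxon": "Picea rubens", "count": 12, "area_m2": 4.0}]
dataset_b = [{"taxon": "Picea rubens", "density_per_m2": 2.5}]

# Convert A to the common representation, then concatenate with B.
integrated = [
    {"taxon": r["taxon"], "density_per_m2": areal_density(r["count"], r["area_m2"])}
    for r in dataset_a
] + dataset_b

print(integrated)
```

The conversion is only valid if the units and collection protocols of the two data sets are also compatible, which is why those must be known as well.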
Much of the data gathered in ecological studies and used in ecological data analysis is bioreferenced data
typically organisms are referenced by a Latin name
e.g. Picea rubens
Many analyses require integrating data
originating in many locations and
at various points in time
For most bio-referenced data, integration involves matching on organism name
SEEK Taxon investigating associated issues
Used for communicating information about known organisms and groups of organisms – taxa
Framework for all biologists to communicate…
Arise from taxonomists applying them to species and higher taxa following classification
Formalized according to strict codes of nomenclature
differ depending on kingdom
Use a Latin naming scheme
polynomial for species + below; monomial for genus + above
Quoted as: LatinName NameAuthors Year
Example: Carya floridana Sarg. 1913
Can cause problems in data analysis…
[Figure: a pile of specimens, including type specimens, is classified into a taxonomic hierarchy (genus, species) of taxon concepts: Taxon_concept_a, Taxon_concept_b, Taxon_concept_c, Taxon_concept_d]
[Figure: a pile of specimens is classified into Taxon_concept_d]
Taxonomic Revisions
[Figure: genus concepts and species concepts, with their specimens and names, as treated in the publications Linnaeus 1758, Archer 1965, Fry 1989, Tucker 1991 and Pargiter 2003, plus the purely nomenclatural publication Pyle 1990]
Names shown: Aus L. 1758; Aus aus L. 1758; Aus bea Archer 1965; Aus cea BFry 1989; Aus ceus BFry 1989; Xus Pargiter 2003; Xus beus (Archer) Pargiter 2003
Archer splits Aus aus L. 1758 into two species
… Aus aus L. 1758 and Aus bea Archer 1965 into one species, retains the name
A diligent nomenclaturist, Pyle (1990), notes that the species epithets of Aus bea and Aus cea are of the wrong gender and publishes the corrected names Aus beus corrig. Archer 1965 and Aus ceus corrig. BFry 1989
Purely nomenclatural observation: bea and cea noted as invalid names and replaced with beus and ceus
Tucker publishes his revision without noting Pyle’s corrigendum of the name of Aus cea to Aus ceus
Are not unique
“Re-use” of names with changed definition
Name is ambiguous without definition/context
Subject to alterations and 'corrections' in time
Often recorded inappropriately in datasets
No author and/or year (e.g. Carya floridana )
Abbreviated (e.g. C. floridana )
Internal code (e.g. PicRub for Picea rubens)
Vernacular used (e.g. Scrub Hickory)
Misspelled
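To show how these recording styles might be detected before any matching is attempted, here is a rough sketch based on regular expressions. The patterns and category labels are assumptions made for illustration; they handle only simple two-part names and cannot tell a lower-case vernacular name or a misspelling from a valid bare name.

```python
import re

# Simplified patterns for two-part names (illustrative only).
FULL = re.compile(r"^([A-Z][a-z]+) ([a-z]+) (.+?) \(?(\d{4})\)?$")
BARE = re.compile(r"^([A-Z][a-z]+) ([a-z]+)$")
ABBREVIATED = re.compile(r"^([A-Z])\. ([a-z]+)$")


def classify_name(text):
    """Classify how a name string was recorded in a data set."""
    text = " ".join(text.split())  # normalise whitespace
    if FULL.match(text):
        return "full"
    if BARE.match(text):
        return "no author/year"
    if ABBREVIATED.match(text):
        return "abbreviated genus"
    return "unrecognised"


for name in ["Carya floridana Sarg. 1913", "Carya floridana",
             "C. floridana", "PicRub", "Scrub Hickory"]:
    print(f"{name!r}: {classify_name(name)}")
```

Internal codes and vernacular names fall through to "unrecognised" and would need a lookup table or expert review rather than pattern matching.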
The published expert opinion defining and describing a group of organisms which are given a (scientific) name
Scientific names qualified with a reference to the definition of a concept
Should be used for communicating about groups of organisms
Comparing or integrating data based on taxon concepts will be more accurate
Created by someone - an Author
Described in a Publication
Given a Name
Related to the type specimen
Definition
Referenced by
Full Scientific name + “according to” (Author + Publication + Date)
Definition
Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)
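The reference format above can be sketched as a small record type. The class and field names are hypothetical and are not taken from any exchange schema; the point is only that two concepts sharing a scientific name remain distinct once the "according to" part is included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaxonConcept:
    """Hypothetical record: a scientific name qualified by its defining publication."""
    name: str
    name_authors: str
    name_year: int
    according_to: Optional[str] = None  # author + publication + date

    def label(self):
        base = f"{self.name} {self.name_authors} ({self.name_year})"
        if self.according_to is None:
            return base  # name only, no definition attached
        return f'{base} "according to" {self.according_to}'


original = TaxonConcept(
    "Carya floridana", "Sarg.", 1913,
    "Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)",
)
revised = TaxonConcept(
    "Carya floridana", "Sarg.", 1913,
    "Stone, Flora of North America 3:424 (1997)",
)

print(original.label())
print(original == revised)  # False: same name, different concept
```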
Defined by
set of Specimens examined during classification
set of common Characters
context dependent; differentiate taxa rather than fully describe them;
use natural language with all its ambiguities
relationships to other Taxon Concepts
Taxon circumscription
the lower level taxa
Congruence, overlap, includes etc. to taxa in other classifications
Original concept
1st use of name as described by the taxonomist
same author + date in scientific name and “according to”
Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)
TC_a
Revised concept
Re-classification of a group
Carya floridana Sarg. (1913) “according to” Stone, Flora of North America 3:424 (1997)
TC_b
Relationship between the taxon concepts
TC_b includes TC_a
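A sketch of how the two concepts and their relationship might be held and queried. The dictionary layout and function names are invented for illustration. It also assumes, as one reading of "includes", that data recorded against TC_a may be used in an analysis of TC_b but not the reverse.

```python
# The two concepts from the slide, keyed by the slide's own identifiers.
CONCEPTS = {
    "TC_a": {"name": "Carya floridana Sarg. (1913)",
             "according_to": "Sargent, Trees & Shrubs 2:193 plate 177 (1913)"},
    "TC_b": {"name": "Carya floridana Sarg. (1913)",
             "according_to": "Stone, Flora of North America 3:424 (1997)"},
}

# Relationships as (subject, relation, object) triples.
RELATIONSHIPS = [("TC_b", "includes", "TC_a")]


def concepts_for_name(name):
    """All concept identifiers that carry the given scientific name."""
    return sorted(cid for cid, c in CONCEPTS.items() if c["name"] == name)


def safe_to_merge(observed_as, analysed_as):
    """Assumed rule: data fits an analysis of the same or an including concept."""
    return (observed_as == analysed_as
            or (analysed_as, "includes", observed_as) in RELATIONSHIPS)


print(concepts_for_name("Carya floridana Sarg. (1913)"))
print(safe_to_merge("TC_a", "TC_b"))
print(safe_to_merge("TC_b", "TC_a"))
```

One name resolves to two concepts, so matching on the name alone cannot say which direction of merging is sound; the recorded relationship can.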
In legacy data names often appear in place of concepts
Names are imprecise
inappropriate for referring to information regarding taxa
e.g. observational/collection data
BUT…sometimes that’s all we have
How do we interpret names?…..
potentially multiple definitions
the sum of all definitions that exist for the name
one of the existing definitions
the “attributes” in common to all the definitions
represented by the type specimen
Nominal concepts
Sub-set of TaxonConcepts
Name but no AccordingTo
non-unique (concept) identifier attributes
can be given a unique concept identifier
No definition
Explicitly saying it’s something with this name
but not really sure what is/was meant by the name
Encourage people to understand and address the issue of names
Allowing mark-up of data with names allows them to believe names are really good enough
Will improve long term usefulness of scientific data
Ease integration
Scientific names are not unique identifiers for biological entities
Integrating data from different sources based on names alone could cause serious errors in analysis of the integrated data
Biologists must reference organisms precisely
if datasets are to be of use long term or to other users
Reference by taxon concept rather than name
integrate data for analysis on taxon concepts
Main taxonomic list servers are still name based
single perspective on taxonomy
don’t represent multiple classifications
unclear what the definition is (don’t even try!)
provide non-standardised interface (web page, xml download)
SEEK Taxon aims to prototype a concept/name resolution service for ecologists working with SEEK
Find concepts given a name
Compare concepts
Relate concepts
Mark up ecological data sets with concepts
First
Need data on names and concepts
Need an exchange standard….
TCS standard for exchange of taxonomic names/concept data
Taxonomic Databases Working Group (TDWG)
Global Biodiversity Information Facility (GBIF)
XML based exchange schema
Makes heavy use of Globally Unique Identifiers (GUIDs)
Not designed as the “correct way” to model a Taxon Concept
No “rules” as to what a taxon must have
Design to accommodate different models
Includes Taxon Names
more constrained - the codes of nomenclature
TCS/EML
TCS modifications to EML taxon coverage
Important to be able to pass names alone
For nomenclatural and some taxonomic purposes
But not for identifications/observations
Taxon Concepts refer to Names
By GUID
Names must not change
Can’t record original taxon concept
Taxon Object Server
Schema based on the TCS model
Implements the GUIDs using LSID technology
Tool to import/export data from TCS documents
TOS Allows
registration, retrieval of taxonomic datasets
Match concepts given names, concepts, etc.
Allow users to
See different taxonomic opinions
Uses GUIDs to reference concepts (LSIDs)
Find concepts…
Author new concepts
Make new relationships between existing concepts
Integrated with Kepler workflow system
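Since the server implements its GUIDs as LSIDs, a short sketch of splitting an LSID into its parts may help. It assumes the general LSID layout `urn:lsid:authority:namespace:object[:revision]`; the example identifier, including the authority `example.org`, is made up and does not refer to any real Taxon Object Server record.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not p for p in parts[2:5]):
        raise ValueError(f"empty LSID component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier, for illustration only.
print(parse_lsid("urn:lsid:example.org:concepts:TC_a"))
```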
Concept mapper
A desktop tool to assist taxonomists to relate concepts from one source to another
For use in creating data sets for TOS or TCS
For creating new relationships between concepts in TOS
Taxonomy comparison visualisation
Visualisation tool to explore different classifications
Compare concepts
Query concepts
Concepts
Relationships
Environment to support large scale ecological data analysis
Scientific Workflows: Kepler
Semantic Mediation
Ecological ontology creation/use for data integration
Grid/Web-based data discovery
Resolution of Taxonomic Names/Concepts
Standards development
Concept matching server
Visualisation tools
http://seek.ecoinformatics.org
I hope I have convinced you that the answer is
NO
as a general rule…
BUT
Depends on the purpose of the data
therefore the accuracy required
The degree of automation used in matching
greater automation – greater potential problem
Expertise of person involved in the matching
Educating biologists about the inherent problem in names
Not limited to the Linnaean system of nomenclature
Lack of good taxon concept data
Widening usage and application of taxon concepts
Adopting GUIDs
Provision of reliable ‘look up’ facilities
Cross referencing of GUIDs
Reuse is vital
Must not create duplicate GUIDs if possible
Conversion of legacy data
Develop good matching algorithms
Potential move from XML schema -> semantic web technologies
……..
This material is based upon work supported by:
The National Science Foundation
SEEK Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Arizona State University, UC Davis
Matt Jones, for many of the slides
Global Biodiversity Information Facility
eScience Institute
Research Theme Programme
Malcolm Atkinson
Upcoming Workshop
discussing possible technology solutions
RDF, Ontologies and Meta-Data Workshop
7th – 9th June, 2006 e-Science Institute
15 South College Street
Edinburgh http://www.nesc.ac.uk/esi/events/683/