Integrating Diverse Sources of Scientific Data: Prof. Jessie Kennedy


Integrating Diverse Sources of Scientific Data: Is it safe to match on names?

Prof. Jessie Kennedy

e-SI Theme: Exploiting Diverse Sources of Scientific Data

Exploiting Diverse Sources of Scientific Data

 Wealth and diversity of scientific data collected and stored is growing rapidly

Increase in automation

 Genetic sequencing, remote sensing, astronomy satellites

Decrease in technological costs

 Computers more powerful, disk space greater for the same £

Huge potential for scientific discovery by exploiting this data

 especially multi-disciplinary research

Number, complexity and diversity of resources makes this a difficult task

 Case Study

 Data Integration

 Matching data sets on biological names


SEEK

Science Environment for Ecological Knowledge

 USA National Science Foundation funding

Multidisciplinary project

 Biology: Ecology, Taxonomy

 Environmental science: Geography, Remote sensing, Meteorology, Climatology

 Computer Science: Database, GRID/Web, Ontologies, Workflows, Algorithms, Human Computer Interaction


The SEEK Prototype: Ecological Niche Modeling

[Figure: ecological niche modeling pipeline. Inputs are biodiversity information (e.g. data from museum specimens, ecological surveys) and geospatial and remotely sensed data. Occurrence points on the native species distribution (geographic space) are used to build a model of the niche in ecological dimensions such as temperature (ecological space). The model is projected back onto geography to give a native range prediction.]

Results taken to integrate with other data realms (e.g., human populations, public health, etc.)

Species prediction map

Predicted distribution: Amur snakehead (Channa argus)

Image from http://www.lifemapper.org


SEEK - Informatics Challenges

Data is Distributed

Data is Heterogeneous

 Syntax

 e.g. Text, Excel, Relational Database…..

Schema

 e.g. Names of the tables, columns in tables

Semantics: principal focus for SEEK

From many disciplines

 Biodiversity surveys, hydrology, atmospheric chemistry, spatial data, behavioural experiments,…

 Data on economics, demographics, legal issues,…


SEEK Overview

EcoGrid:

Making diverse environmental data

Semantic Mediation System:

“Smart” data discovery and integration

BEAM WG: Biodiversity and Ecological Analysis and Modelling

Knowledge Representation WG:

Taxonomic name/concept resolution server


EcoGrid

SEEK Overview


EcoGrid Resources

[Figure: map of EcoGrid resources. Node types shown: Metacat node, VegBank node, Xanthoria node, SRB node, DiGIR node, Legacy system. Site labels shown: NTL, HBR, VCR, LUQ.]

Partnership for Interdisciplinary Studies of Coastal Oceans (4)

Natural History Collections (>> 100)

System (36)

Multi-agency Rocky Intertidal Network (60)

LTER Network (24)

Organization of Biological Field Stations (180)


EcoGrid Data Access

EcoGrid registry to discover data sources

EML (Ecological Metadata Language)

Experimental data, survey data, spatial raster and vector data, etc.

XML based

 Discovery information

 Creator, Title, Abstract, Keyword, etc.

 Coverage

 Geographic, temporal, and taxonomic extent

Logical and physical data structure

 Data semantics via unit definitions and typing

Protocols and methods

DarwinCore

Museum collections


EcoGrid Services

Service to Analysis and Modelling Layer

Interaction with Kepler – Workflows

Interaction with Grid Computing Facilities

 Distributed computation

Service to Semantic Mediation Layer

Access to Ontologies; Taxon Services

Access to Legacy Apps

LifeMapper

 Spatial Data Workbench


AMS

SEEK Overview


Scientific Workflows

Model the way scientists currently work with data

 coordinate export and import of data among software systems

 Workflows emphasize data flow

Output generation includes creating appropriate metadata

The analysis workflow itself becomes metadata

 The workflow describes the data lineage as it has been transformed

 Derived data sets can be stored in EcoGrid with provenance

Query EcoGrid to find data

Archive output to EcoGrid with workflow metadata

Scientific workflows

EML provides semi-automated data binding


Kepler: Ecological Niche Model

(200 to 500 runs per species x 2000 mammal species x 3 minutes/run) = 833 to 2083 days


Grid-enable Kepler

 Utilize distributed computing resources

Execute single steps or sub-workflows on distributed machines

KeplerGrid for Niche Modeling

(200 to 500 runs per species x 2000 mammal species x 3 minutes/run) / 100 nodes = 8 to 20 days
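The two run-time estimates can be checked with a few lines of arithmetic. This is a minimal sketch that assumes runs are independent, divide evenly across nodes, and incur no scheduling or data-transfer overhead; the function name `total_days` is illustrative and not part of Kepler.

```python
MINUTES_PER_DAY = 24 * 60

def total_days(runs_per_species, species=2000, minutes_per_run=3, nodes=1):
    """Wall-clock days, assuming perfect division of runs across nodes."""
    total_minutes = runs_per_species * species * minutes_per_run
    return total_minutes / nodes / MINUTES_PER_DAY

for nodes in (1, 100):
    low = total_days(200, nodes=nodes)
    high = total_days(500, nodes=nodes)
    print(f"{nodes} node(s): {low:.1f} to {high:.1f} days")
# 1 node(s): 833.3 to 2083.3 days
# 100 node(s): 8.3 to 20.8 days
```

The slide figures are these values rounded down to whole days.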


SMS

SEEK Overview


Metadata

 Key information needed to read and machine process a data file is in the metadata

Physical descriptors (CSV, Excel, RDBMS, etc.)

Logical entity (table, image, ...) and attribute (column) descriptions

 Name

 Type (integer, float, string…)

 Codes (missing values, nulls...)

 Integrity constraints

 Semantic descriptions (ontology-based type systems)

Metadata driven data ingestion
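As a rough illustration of metadata-driven ingestion, the sketch below reads a delimited text file using only a metadata description of its attributes. The `METADATA` dictionary is a hypothetical, simplified stand-in for the kinds of descriptors listed above (name, type, missing-value codes, one integrity constraint); it is not the EML schema, and the sample data are invented.

```python
import csv
import io

# Hypothetical metadata for one table (not real EML).
METADATA = {
    "delimiter": ",",
    "attributes": [
        {"name": "site", "type": str, "missing": {""}},
        {"name": "count", "type": int, "missing": {"-999", "NA"}, "min": 0},
        {"name": "area_m2", "type": float, "missing": {"-999", "NA"}, "min": 0.0},
    ],
}

def ingest(text, metadata):
    """Parse delimited text into typed rows, driven only by the metadata."""
    reader = csv.DictReader(io.StringIO(text), delimiter=metadata["delimiter"])
    rows = []
    for raw in reader:
        row = {}
        for attr in metadata["attributes"]:
            value = raw[attr["name"]]
            if value in attr["missing"]:
                row[attr["name"]] = None  # declared missing-value code
                continue
            typed = attr["type"](value)  # convert to the declared type
            if "min" in attr and typed < attr["min"]:
                raise ValueError(f"{attr['name']}={typed} is below {attr['min']}")
            row[attr["name"]] = typed
        rows.append(row)
    return rows

SAMPLE = "site,count,area_m2\nA,12,4.0\nB,-999,2.5\n"  # invented sample data
rows = ingest(SAMPLE, METADATA)
print(rows)
```

Because the reader consults only the metadata, the same function could ingest a differently structured file if given a different description.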


Ecological ontologies

What was measured ( biomass or photosynthetic solar radiation )

Type of quantity measured ( mass, length )

Context of measurement ( Psychotria limonensis, wavelength band )

How it was measured ( dry weight, total solar radiation )


Semantic Mediation

 Label data with semantic types

Label inputs and outputs of analytical components with semantic types

[Figure: data and workflow components, each linked to an ontology]

 Use reasoning engine to generate transformation step

Use reasoning engine to discover relevant component


Data integration

 Homogeneous data integration

 Integration via EML metadata is relatively straightforward

Heterogeneous Data integration

Requires advanced metadata and processing

 Attributes must be semantically typed

 Collection protocols must be known

 Units and measurement scale must be known

 Measurement relationships must be known

 e.g., that ArealDensity=Count/Area
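The measurement relationship ArealDensity = Count / Area is what lets a data set of raw counts be combined with one that already reports densities. The sketch below assumes both data sets use square metres and that the plot area is recorded with each count; the class names and values are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class CountObservation:
    taxon: str
    count: int
    area_m2: float  # plot area in square metres

@dataclass
class DensityObservation:
    taxon: str
    density_per_m2: float

def to_density(obs):
    """Apply the known relationship ArealDensity = Count / Area."""
    return DensityObservation(obs.taxon, obs.count / obs.area_m2)

def integrate(counts, densities):
    """Bring both data sets onto the common ArealDensity scale."""
    return [to_density(c) for c in counts] + list(densities)

merged = integrate(
    [CountObservation("Picea rubens", 12, 400.0)],   # invented values
    [DensityObservation("Picea rubens", 0.05)],
)
for obs in merged:
    print(obs)
```

Without the semantic typing and units, nothing would tell an integration tool that the first data set needs this conversion before the two can be compared.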


Simple Example


Life Sciences Data

Much of the data gathered in ecological studies and used in ecological data analysis is bioreferenced data

 typically organisms are referenced by a Latin name

 e.g. Picea rubens

Many analyses require integrating data

 originating in many locations and

 at various points in time

For most bio-referenced data, integration involves matching on organism name

SEEK Taxon is investigating the associated issues


Biological (Scientific) Names

Used for communicating information about known organisms and groups of organisms – taxa

 Framework for all biologists to communicate…

Arise from taxonomists applying them to species and higher taxa following classification

Formalized according to strict codes of nomenclature

 differ depending on kingdom

Use a Latin naming scheme

 polynomial for species + below; monomial for genus + above

Quoted as: LatinName NameAuthors Year

 Example: Carya floridana Sarg. 1913

 Can cause problems in data analysis…..


Classification, Concepts & Names

[Figure: a pile of specimens, including type specimens, is classified into Taxon_concept_a, Taxon_concept_b, Taxon_concept_c and Taxon_concept_d, which sit at Genus and Species levels of a taxonomic hierarchy.]


Classification, Concepts & Names

[Figure: a pile of specimens is classified into Taxon_concept_d.]


Taxonomic history of Aus L. 1758

[Figure: genus concepts (i) to (v) of Aus L. 1758 with their species concepts, specimens and species names, as treated in five publications of taxonomic revisions (Linnaeus 1758, Archer 1965, Fry 1989, Tucker 1991, Pargiter 2003) and one publication of purely nomenclatural observation (Pyle 1990).]

Names shown: Aus aus L. 1758; Aus bea Archer 1965; Aus cea BFry 1989; Aus beus corrig. Archer 1965; Aus ceus corrig. BFry 1989; Xus Pargiter 2003; Xus beus (Archer) Pargiter 2003.

Archer splits Aus aus L. 1758 into two species, Aus aus L. 1758 and Aus bea Archer 1965.

A diligent nomenclaturist, Pyle (1990), notes that the species epithets of Aus bea and Aus cea are of the wrong gender and publishes the corrected names Aus beus corrig. Archer 1965 and Aus ceus corrig. BFry 1989.

bea and cea noted as invalid names and replaced with beus and ceus.

Tucker publishes his revision without noting Pyle’s corrigendum of the name of Aus cea to Aus ceus.

… Aus aus L. 1758 and Aus bea Archer 1965 into one species, retains the name.

Problems with Taxonomic Names

Are not unique

 “Re-use” of names with changed definition

Name is ambiguous without definition/context

Subject to alterations and 'corrections' in time

Often recorded inappropriately in datasets

 No author and/or year (e.g. Carya floridana )

Abbreviated (e.g. C. floridana )

 Internal code (e.g. PicRub for Picea rubens)

Vernacular used (e.g. Scrub Hickory)

 Misspelled
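The recording problems listed above are easy to reproduce. The sketch below takes several ways the same organism might be written in a data set and tests them against the full scientific name; the misspelling is invented for illustration, and `difflib` similarity is used only as a stand-in for a real name-matching algorithm.

```python
import difflib

recorded = [
    "Carya floridana Sarg. 1913",  # full name with author and year
    "Carya floridana",             # no author or year
    "C. floridana",                # abbreviated genus
    "Scrub Hickory",               # vernacular name
    "Carya floridanna",            # misspelled (invented example)
]
target = "Carya floridana Sarg. 1913"

# Exact string matching finds only the fully quoted form.
exact = [name for name in recorded if name == target]
print(f"exact matches: {len(exact)} of {len(recorded)}")

def similarity(a, b):
    """Character-level similarity in [0, 1]; says nothing about meaning."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

for name in recorded:
    print(f"{name!r}: {similarity(name, 'Carya floridana'):.2f}")
```

Fuzzy matching can recover the misspelling and the abbreviation, but it cannot link the vernacular name, and even a perfect string match does not say which definition of the name was intended.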


Taxon Concepts ……

The published expert opinion defining and describing a group of organisms which are given a (scientific) name

 Scientific names qualified with a reference to the definition of a concept

Should be used for communicating about groups of organisms

Comparing or integrating data based on taxon concepts will be more accurate


Taxon Concepts…

Created by someone - an Author

Described in a Publication

Given a Name

Related to the type specimen

Definition

Referenced by

 Full Scientific name + “according to” (Author + Publication + Date)

Definition

Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)
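One way to picture such a reference is as a small record pairing the scientific name with its “according to” part. The sketch below is an assumption-laden illustration, not the TCS data model: the class and field names are invented, and the second concept uses the Stone (1997) treatment of the same name from Flora of North America.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScientificName:
    latin_name: str
    authors: str
    year: int

@dataclass(frozen=True)
class AccordingTo:
    author: str
    publication: str
    year: int

@dataclass(frozen=True)
class TaxonConcept:
    name: ScientificName
    according_to: AccordingTo

    def label(self):
        n, ref = self.name, self.according_to
        return (f'{n.latin_name} {n.authors} ({n.year}) "according to" '
                f"{ref.author}, {ref.publication} ({ref.year})")

name = ScientificName("Carya floridana", "Sarg.", 1913)
sargent = TaxonConcept(
    name, AccordingTo("Charles Sprague Sargent", "Trees & Shrubs 2:193 plate 177", 1913))
stone = TaxonConcept(
    name, AccordingTo("Stone", "Flora of North America 3:424", 1997))

print(sargent.label())
print(sargent.name == stone.name)  # same name
print(sargent == stone)            # different concepts
```

The two records share one name yet compare as different concepts, which is the distinction a name on its own cannot express.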


Taxon Concepts ……

Defined by

 set of Specimens examined during classification

 set of common Characters

 context dependent; differentiate taxa rather than fully describe them;

 use natural language with all its ambiguities

 relationships to other Taxon Concepts

Taxon circumscription

 the lower level taxa

Congruence, overlap, includes etc. to taxa in other classifications


Taxon Concepts ……

Original concept

 1st use of name as described by the taxonomist

 same author + date in scientific name and “according to”

 Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)

 = TC_a

Revised concept

 Re-classification of a group

 Carya floridana Sarg. (1913) “according to” Stone, Flora of North America 3:424 (1997)

 = TC_b

Relationship between the taxon concepts

 TC_b includes TC_a
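A relationship such as "TC_b includes TC_a" can be stored as a simple triple and used to decide in which direction data may safely be carried from one concept to another. The sketch below is a minimal illustration: the relationship vocabulary is a small assumed subset, the function name is invented, and chains of relationships are deliberately not followed.

```python
# (subject, relationship, object) triples; identifiers are those used on the slide.
relationships = [("TC_b", "includes", "TC_a")]

def can_substitute(source, target, rels):
    """True if data recorded under `source` certainly also belongs to `target`.

    Only direct relationships are checked; chains are not followed.
    """
    if source == target:
        return True
    for subj, rel, obj in rels:
        if rel == "includes" and subj == target and obj == source:
            return True  # target is the wider concept
        if rel == "congruent" and {subj, obj} == {source, target}:
            return True  # both concepts cover the same organisms
    return False

print(can_substitute("TC_a", "TC_b", relationships))  # True
print(can_substitute("TC_b", "TC_a", relationships))  # False
```

Observations made under the original concept can be used in an analysis of the revised, wider concept, but not the other way round, because some organisms in TC_b may fall outside TC_a.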


Legacy Data …

In legacy data names often appear in place of concepts

 Names are imprecise

 inappropriate for referring to information regarding taxa

 e.g. observational/collection data

 BUT…sometimes that’s all we have

 How do we interpret names?…..

 potentially multiple definitions

 the sum of all definitions that exist for the name

 one of the existing definitions

 the “attributes” in common to all the definitions

 represented by the type specimen


Names as Taxon Concepts

Nominal concepts

Sub-set of TaxonConcepts

Name but no AccordingTo

 non-unique (concept) identifier attributes

 can be given a unique concept identifier

No definition

 Explicitly saying it’s something with this name

 but not really sure what is/was meant by the name

Encourage people to understand and address the issue of names

Allowing mark-up of data with names alone lets people believe that names are really good enough

Will improve long term usefulness of scientific data

Ease integration


SEEK Taxon’s Message…..

Scientific names are not unique identifiers for biological entities

Integrating data from different sources based on names alone could cause serious errors in analysis of the integrated data

Biologists must reference organisms precisely

 if datasets are to be of use long term or to other users

Reference by taxon concept rather than name

 integrate data for analysis on taxon concepts
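The difference between integrating on names and integrating on concepts can be seen with the hypothetical genus Aus used earlier, where Archer 1965 splits Aus aus L. 1758 into two species. In the sketch below the counts are invented, and a concept is identified simply by the pair (name, "according to").

```python
from collections import defaultdict

# Invented records: (scientific name, "according to", count of individuals).
records = [
    ("Aus aus L. 1758", "Linnaeus 1758", 10),   # wide, pre-split concept
    ("Aus aus L. 1758", "Archer 1965", 4),      # narrower, post-split concept
    ("Aus bea Archer 1965", "Archer 1965", 6),
]

def totals(records, key):
    """Sum counts after grouping records with the given key function."""
    out = defaultdict(int)
    for name, according_to, count in records:
        out[key(name, according_to)] += count
    return dict(out)

by_name = totals(records, key=lambda name, sec: name)
by_concept = totals(records, key=lambda name, sec: (name, sec))

print(by_name)
print(by_concept)
```

Grouping by name adds the 10 and the 4 together even though they refer to different circumscriptions, while grouping by concept keeps them apart until a relationship between the two concepts says how they may be combined.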


Taxonomic Databases

 Main taxonomic list servers are still name based

 single perspective on taxonomy

 don’t represent multiple classifications

 unclear what the definition is (don’t even try!)

 provide non-standardised interface (web page, xml download)

SEEK Taxon aims to prototype a concept/name resolution service for ecologists working with SEEK

 Find concepts given a name

Compare concepts

Relate concepts

Mark up ecological data sets with concepts

First

Need data on names and concepts

 Need an exchange standard….


Taxon Concept Schema

 TCS standard for exchange of taxonomic names/concept data

Taxonomic Databases Working Group (TDWG)

Global Biodiversity Information Facility (GBIF)

 XML based exchange schema

 Makes heavy use of Globally Unique Identifiers (GUIDs)

 Not designed as the “correct way” to model a Taxon Concept

 No “rules” as to what a taxon must have

Designed to accommodate different models

Includes Taxon Names

 more constrained - the codes of nomenclature

TCS/EML

TCS modifications to EML taxon coverage


Taxon Names and Taxon Concepts

Important to be able to pass names alone

For nomenclatural and some taxonomic purposes

But not for identifications/observations

Taxon Concepts refer to Names

By GUID

 Names must not change

 Can’t record original taxon concept


Taxon Concept/Name Resolution Server

 Taxon Object Server

 Schema based on the TCS model

Implements the GUIDs using LSID technology

Tool to import/export data from TCS documents

 TOS Allows

 registration, retrieval of taxonomic datasets

Match concepts given names, concepts, etc.

 Allow users to

 See different taxonomic opinions

 Uses GUIDs to reference concepts (LSIDs)

 Find concepts…

Author new concepts

 Make new relationships between existing concepts

Integrated with Kepler workflow system
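To make the GUID-based lookup concrete, the sketch below parses identifiers in the general LSID form `urn:lsid:authority:namespace:object[:revision]` and finds concepts for a name in a tiny in-memory registry. It is not the Taxon Object Server API: the registry, the `example.org` identifiers and the function names are all hypothetical.

```python
from typing import NamedTuple, Optional

class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]

def parse_lsid(text):
    """Split an LSID string into its parts, raising ValueError if malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)

# Hypothetical registry mapping concept GUIDs to concept labels.
REGISTRY = {
    "urn:lsid:example.org:concepts:1":
        'Carya floridana Sarg. (1913) "according to" Sargent (1913)',
    "urn:lsid:example.org:concepts:2":
        'Carya floridana Sarg. (1913) "according to" Stone (1997)',
}

def find_concepts(name):
    """Return the GUIDs of all registered concepts whose label starts with `name`."""
    return sorted(guid for guid, label in REGISTRY.items() if label.startswith(name))

print(parse_lsid("urn:lsid:example.org:concepts:1"))
print(find_concepts("Carya floridana"))
```

A single name resolves to two GUIDs here, so the user (or a workflow step) still has to choose which concept was meant.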


SEEK User Interface Tools

Concept mapper

A desktop tool to assist taxonomists to relate concepts from one source to another

 For use in creating data sets for TOS or TCS

 For creating new relationships between concepts in TOS

Taxonomy comparison visualisation

Visualisation tool to explore different classifications

 Compare concepts


Concept Mapper Main GUI

Query concepts

Concepts

Relationships


Concept Comparison Visualisation


SEEK Summary

Environment to support large scale ecological data analysis

 Scientific Workflows: Kepler

Semantic Mediation

Ecological ontology creation/use for data integration

Grid/Web based data discovery

Resolution of Taxonomic Names/Concepts

 Standards development

Concept matching server

 Visualisation tools

 http://seek.ecoinformatics.org


Is it safe to match on names?

I hope I have convinced you that the answer is

NO

 as a general rule…

BUT

Depends on the purpose of the data

 therefore the accuracy required

The degree of automation used in matching

 greater automation – greater potential problem

Expertise of person involved in the matching


Many Outstanding Issues….

 Educating biologists about the inherent problems with names

 Not limited to the Linnaean system of nomenclature

Lack of good taxon concept data

Widening usage and application of taxon concepts

 Adopting GUIDs

 Provision of reliable ‘look up’ facilities

Cross referencing of GUIDs

 Reuse is vital

 Must not create duplicate GUIDs if possible

Conversion of legacy data

Develop good matching algorithms

 Potential move from XML schema -> semantic web technologies

 ……..


Acknowledgements

 This material is based upon work supported by:

The National Science Foundation

 SEEK Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Arizona State University, UC Davis

Matt Jones – for many of the slides….

Global Biodiversity Information Facility

 eScience Institute

Research Theme Programme

 Malcolm Atkinson


Exploiting Diverse Sources of Scientific Data

Upcoming Workshop

 discussing possible technology solutions

RDF, Ontologies and Meta-Data Workshop

7th – 9th June, 2006

e-Science Institute, 15 South College Street, Edinburgh

http://www.nesc.ac.uk/esi/events/683/

