Deck_BiSciCol-TDWG2011

BiSciCol: Tracking Biodiversity Objects to Brokering Standards

“Or, Gustav’s Big Problem”

John Deck, University of California, Berkeley

Brian Stucky, University of Colorado, Boulder

Lukasz Ziemba, University of Florida, Gainesville

Nico Cellinese, University of Florida, Gainesville

Rob Guralnick, University of Colorado, Boulder

BiSciCol Team

Reed Beaman, Nico Cellinese, Jonathan Coddington, Neil Davies, John Deck, Rob Guralnick, Bryan P. Heidorn, Chris Meyer, Tom Orrell, Rich Pyle, Kate Rachwal, Brian Stucky, Rob Whitton, Lukasz Ziemba

Biological Science Collections Tracker

Working towards an infrastructure designed to tag and track scientific collections and all of their derivatives.

National Science Foundation funded 2010 – 2014

Partners are University of Florida at Gainesville, University of Colorado at Boulder, Bishop Museum, University of California at Berkeley, Smithsonian Institution, University of Arizona at Tucson

Relies on globally unique identifiers (GUIDs) to track objects

Implements a Linked Data approach

Provides support for the Global Names Architecture

Tracking Facebook relationships

From “Facebook Visualizer”

Can we track relationships for Biological Objects as well?

Why? Here is Gustav’s Problem…

Lots of Data…

Generates …

(Prefers to collect stuff)

Due to project requirements and integration needs, Gustav is left navigating a plethora of redundant and disconnected distributed databases. Lots of effort to track objects and their derivatives.

Can we borrow from Facebook and social networking to help solve Gustav’s Problem?

A Biological Relationship Graph …

[Figure: relationship graph viewer with a Taxonomic Type Filter, a Class Filter (Specimens, Tissues, Sequences, Functions), and an option to Infer Relationships Across Providers]

Moorea Biocode Example: Tracking biological material from field collection through analysis, across multiple systems

[Figure: objects tracked across systems: (Biocode Event), (Essig Museum Specimen), (CAMERA Gut Sample Event), (metagenomic Sequencing), (Genbank Sequence), (Smithsonian Tissue); node labels: Taxon, Taxon*n, Key, Blast*n, Blast]

How do we Track Biological Objects and their Relations Across Distributed, Heterogeneous Systems?

Tracking Biological Object Relationships

Group like terms into classes. In Darwin Core, for example, we have the following “groups of terms”: Events, Locations, Occurrences, GeologicalContext, Identification, Taxon.

Assign Identifiers. Use globally unique, resolvable, persistent identifiers for each class or term.

Link Identifiers using Relationship Terms. For example, “This object is related to that object.”

Put this data on the Web.

Related Projects that are Grouping Like Terms into Classes

Darwin-SW ( http://code.google.com/p/darwin-sw/ )

Building an ontology of Darwin Core Terms to make it possible to describe biodiversity resources on the web.

Gene Ontology ( http://www.geneontology.org/ )

Standardizing the representation of gene and gene product attributes across species and databases.

ENVO ( http://environmentontology.org/ )

Annotating the environment for any biological sample.

OBO Foundry ( http://www.obofoundry.org/ )

A suite of orthogonal interoperable reference ontologies in the biomedical domain

Creating Globally Unique Identifiers (GUIDs)

Globally unique (mandatory)

Persistent (not mandatory, but very helpful)

Resolvable (not mandatory, but very helpful)

GUID = Resolution/Domain + Identifier

Resolution/Domain:

http://mycollection.org/specimen/

http://example.org/urn:lsid:example.org:specimen/

Identifier:

JDeckSpecimen1 (a named identifier)

+1-541-914-4739 (unique, at least for phones)

7217D220-836A-11DF-8395-0800200C9A66 (opaque)

Examples:

http://mycollection.org/specimen/JDeckSpecimen1

http://mycollection.org/specimen/uuid=7217D220-836A-11DF-8395-0800200C9A66

http://example.org/urn:lsid:example.org:specimen/7217D220-836A-11DF-8395-0800200C9A66
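A minimal sketch of the "Resolution/Domain + Identifier" pattern, assuming Python. The helper name `make_guid` is hypothetical, and the randomly generated UUID only stands in for an opaque identifier; it is not how BiSciCol itself mints identifiers.

```python
import uuid

def make_guid(resolution_prefix: str, identifier: str) -> str:
    """Join a resolution/domain prefix and a local identifier into one GUID."""
    return resolution_prefix + identifier

# A named identifier, as in the first example above.
named = make_guid("http://mycollection.org/specimen/", "JDeckSpecimen1")

# An opaque identifier: a random UUID (assumption; any unique string would do).
opaque_id = str(uuid.uuid4()).upper()
opaque = make_guid("http://mycollection.org/specimen/uuid=", opaque_id)

print(named)
print(opaque)
```

The prefix makes the identifier resolvable; the suffix only has to be unique within that prefix.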

Linking Identifiers Using Relationship Terms

An RDF Statement: Subject --Predicate--> Object

OR: GUID1 --relatedTo--> GUID2

(Transitive): GUID1 relatedTo GUID2 relatedTo GUID3

GUID1 <-> GUID2

GUID2 <-> GUID3

GUID1 <-> GUID3

A Simple BiSciCol Graph (graph = set of RDF Statements):

GUID1 a Event, Date “2011-05-01”

GUID1 relatedTo GUID2

GUID2 a Specimen, Date “2011-06-01”

GUID2 relatedTo GUID3

GUID3 a Tissue, Date “2011-06-20”
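The transitive reading of relatedTo can be sketched in a few lines of Python. This is an illustration under assumptions, not BiSciCol code: statements are plain tuples, and relatedTo is treated as both symmetric and transitive, matching the `<->` pairs on the slide.

```python
from itertools import product

# A graph is a set of (subject, predicate, object) statements.
statements = {
    ("GUID1", "relatedTo", "GUID2"),
    ("GUID2", "relatedTo", "GUID3"),
}

def related_closure(triples):
    """Return every pair connected by relatedTo (symmetric and transitive)."""
    pairs = {(s, o) for s, p, o in triples if p == "relatedTo"}
    pairs |= {(o, s) for s, o in pairs}  # symmetric: add the reverse of each pair
    changed = True
    while changed:
        # Chain a->b and b->d into a->d until nothing new appears.
        new = {(a, d) for (a, b), (c, d) in product(pairs, pairs)
               if b == c and a != d}
        changed = not new <= pairs
        pairs |= new
    return pairs

closure = related_closure(statements)
print(("GUID1", "GUID3") in closure)
```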

Getting the most out of your data:

Inferring Object Relationships

Facebook Inferencing:

“Let us sell you, to others (or vice-versa)”

BiSciCol Inferencing:

“What relationships exist that haven’t been explicitly expressed?”

[Figure: Inferred Relationship Chains. Nodes: Georeference1 (BioGeomancer, “48.198,16.371;crs=wgs84;u=40”), Location1 (Essig Museum), Organism1 (Essig Museum), Organism2 (Smithsonian), Tissue1 (Essig Museum), Tissue2 (Smithsonian). Edge labels: relatedTo, hasSpatialThing, sameAs (Organism1 sameAs Organism2)]

Even though Tissue2 is not directly related to Location1, we can still infer its relationship through Organism1 and Organism2 being the same as each other.
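A hedged Python sketch of this sameAs inference. The edge list is an assumption made for illustration (the slide does not spell out every edge): each tissue is taken to be relatedTo its organism, and Organism1 relatedTo Location1. Nodes joined by sameAs are merged into one representative before following relatedTo links.

```python
triples = [
    ("Tissue1", "relatedTo", "Organism1"),    # Essig Museum (assumed edge)
    ("Tissue2", "relatedTo", "Organism2"),    # Smithsonian (assumed edge)
    ("Organism1", "relatedTo", "Location1"),  # assumed edge
    ("Organism1", "sameAs", "Organism2"),
]

def canonical_finder(triples):
    """Return a function mapping each node to the representative of its sameAs group."""
    rep = {}
    def find(x):
        while rep.get(x, x) != x:
            x = rep[x]
        return x
    for s, p, o in triples:
        if p == "sameAs":
            rep[find(o)] = find(s)
    return find

def infer_related(triples, start):
    """Return everything reachable from start once sameAs nodes are merged."""
    find = canonical_finder(triples)
    edges = {}
    for s, p, o in triples:
        if p != "sameAs":
            edges.setdefault(find(s), set()).add(find(o))
    seen, stack = set(), [find(start)]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print("Location1" in infer_related(triples, "Tissue2"))
```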

Tools in Development

“Bio-Plugins”

Update Mechanisms

Gustav’s Watchlist:

GP12345-3939-33939 (Occurrence)

BE99999-3939-3dd39 (Event)

GP12346-3939-33II3 (Occurrence)

GP12dd6-3939-3xxxI (Tissue)

GP9999-xkx9d-dkdkd (Occurrence)

Search Descendants (by recent modification)

BiSciCol API (search on date and return graph of object)
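One way a "search descendants by recent modification" lookup could work, sketched in Python over in-memory data. The identifiers, timestamps, and the function `recent_descendants` are all hypothetical; this is not the BiSciCol API.

```python
from datetime import datetime

# Parent -> children links and last-modified times (hypothetical sample data).
children = {
    "Event1": ["Occurrence1", "Occurrence2"],
    "Occurrence1": ["Tissue1"],
}
modified = {
    "Occurrence1": datetime(2011, 10, 17),
    "Occurrence2": datetime(2011, 10, 18, 9, 10),
    "Tissue1": datetime(2011, 10, 18, 9, 0),
}

def recent_descendants(root, since):
    """Return descendants of root modified on or after `since`, newest first."""
    found, stack = [], list(children.get(root, []))
    while stack:
        node = stack.pop()
        if modified[node] >= since:
            found.append(node)
        stack.extend(children.get(node, []))  # keep walking below unchanged nodes
    return sorted(found, key=modified.get, reverse=True)

print(recent_descendants("Event1", datetime(2011, 10, 18)))
```

A watchlist entry would then be a root identifier plus the date of the last check.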

Genomic Rosetta Stone

Uses GUIDs, classed data, and links to tie Organismal data to Genomic Data.

“Triplifier”: linking biological objects

[Figure: data sources (Darwin Core Archive, MySQL, KEMU, MySQL) pass through the “Triplifier”, which creates links from native data formats, into BiSciCol]
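To make "create links from native data formats" concrete, here is a small Python sketch that turns rows of a relational table into relatedTo statements. It is not the Triplifier's implementation: SQLite stands in for MySQL so the example runs anywhere, and the table, column names, and `http://example.org/` prefix are assumptions.

```python
import sqlite3

# An in-memory table standing in for a native collection database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tissue (tissue_id TEXT, specimen_id TEXT)")
conn.executemany("INSERT INTO tissue VALUES (?, ?)",
                 [("T1", "S1"), ("T2", "S1")])

BASE = "http://example.org/"  # hypothetical resolution prefix

def rows_to_triples(conn):
    """Emit one (specimen, relatedTo, tissue) statement per table row."""
    rows = conn.execute(
        "SELECT tissue_id, specimen_id FROM tissue ORDER BY tissue_id")
    return [(BASE + "specimen/" + s, "relatedTo", BASE + "tissue/" + t)
            for t, s in rows]

for triple in rows_to_triples(conn):
    print(triple)
```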

Example Taxonomic Query

Client Interface:

Search Scientific Name: Aedes increpitus [Run]

Results:

OccurrenceID1 (Aedes increpitus Dyar, 1916)

OccurrenceID3 (Aedes vittata Theobald, 1903)

Taxon SERVICE (ITIS / GNUB):

http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126314

http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126317

http://gnub.org/8E19F1DC-74BA-47D4-A505-6498414B4CCE

BISCICOL SERVICE LOOKUP:

dwc:IdentificationID1 :relatedTo http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126314

dwc:IdentificationID1 :relatedTo dwc:OccurrenceID1

dwc:IdentificationID2 :relatedTo http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126317

dwc:IdentificationID2 :relatedTo dwc:OccurrenceID3
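The lookup step can be sketched in Python using the statements listed above. The function `occurrences_for_taxa` is hypothetical, and the taxon URIs are simply passed in as if a taxon service had already returned them; no real ITIS or GNUB call is made.

```python
ITIS = "http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:"

# The four relatedTo statements from the BiSciCol service lookup.
triples = [
    ("dwc:IdentificationID1", ":relatedTo", ITIS + "126314"),
    ("dwc:IdentificationID1", ":relatedTo", "dwc:OccurrenceID1"),
    ("dwc:IdentificationID2", ":relatedTo", ITIS + "126317"),
    ("dwc:IdentificationID2", ":relatedTo", "dwc:OccurrenceID3"),
]

def occurrences_for_taxa(taxon_uris, triples):
    """Find identifications linked to the taxa, then the occurrences they point to."""
    taxon_uris = set(taxon_uris)
    identifications = {s for s, p, o in triples if o in taxon_uris}
    return sorted(o for s, p, o in triples
                  if s in identifications and o.startswith("dwc:Occurrence"))

taxa = [ITIS + "126314", ITIS + "126317",
        "http://gnub.org/8E19F1DC-74BA-47D4-A505-6498414B4CCE"]
print(occurrences_for_taxa(taxa, triples))
```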

Working with Locations

E.g. Tracking location in space of a moving individual (whales)

[Figure: IndividualID1 linked to EventID1, EventID2, and EventID3, with GeoreferenceID1, GeoreferenceID2, and GeoreferenceID3]

Data Impact Factor – Graph Metrics

Graphs

[ ] GBIF Relations Graph

[X] Moorea Biocode

[X] SI MSNGR System

[+] Add New Graph

Collectors

Gustav Paulay (102,000 direct children)

Christopher Meyer (83,000 direct children)

Craig Moritz (523 direct children)

Events

Biocode10234 (4234 direct children)

Expedition21234 (1023 direct children)

Occurrences

MBIO99999 (1024 total descendants)

IMBL8888888 (723 total descendants)

What’s New?

Occurrence:MBIO1234 (“2011-10-18 09:10:00”)

DNA Extraction:Extrac9999 (“2011-10-18 09:00:00”)

Sequence:s1113939999 (“2011-10-18 08:00:00”)

Occurrence:MBIO1235 (“2011-10-17 00:00:00”)

Photo:P123456 (“2011-10-17 00:00:00”)

Web Interface

(Demonstration Wed. 2pm at BiSciCol Meeting)

Summary

All objects are re-usable in the semantic web. We only need to express an identifier once, and then anything else can link to it (either directly or indirectly).

By using sameAs relations it is possible to infer relations for data that was not previously expressed.

Queries are easily federated – possibility to create global graphs and ask questions against heterogeneous databases.

Graph-based databases can help us understand the relevance of individual objects. For example, indicate the number of relations a particular object has for 1st, 2nd, 3rd, or nth order relations.
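Counting 1st, 2nd, 3rd, or nth order relations amounts to a breadth-first walk from the object of interest. A minimal Python sketch follows; the sample graph and the function `relations_by_order` are hypothetical and only illustrate the metric.

```python
from collections import deque

def relations_by_order(graph, start):
    """Count how many objects are first reached at each distance from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    counts = {}
    for node, d in dist.items():
        if d > 0:
            counts[d] = counts.get(d, 0) + 1
    return counts

# Hypothetical parent -> children links.
graph = {
    "Event1": ["Occurrence1", "Occurrence2"],
    "Occurrence1": ["Tissue1"],
    "Tissue1": ["Sequence1"],
}
print(relations_by_order(graph, "Event1"))
```

Here order 1 corresponds to "direct children", and the sum over all orders to "total descendants".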

How to Get Involved

“Create stable identifiers, link them to other stable identifiers, and put them on the web.”

http://biscicol.blogspot.com/

http://code.google.com/p/biscicol/
