Semantic Mediation in SEEK/Kepler:
Exploiting Semantic Annotation for Discovery, Analysis, and
Integration of Scientific Data and Workflows
Shawn Bowers
UC Davis Genome Center
sbowers @ ucdavis.edu
Bertram Ludäscher
Dept. of Computer Science, UC Davis
UC Davis Genome Center
ludaesch @ ucdavis.edu
seek.ecoinformatics.org | kepler-project.org | www.sdsc.edu | dbis.ucdavis.edu | genomics.ucdavis.edu
Science Environment for Ecological Knowledge
SEEK is an NSF-funded, multidisciplinary research project to facilitate …
• Access to distributed ecological, environmental, and biodiversity data
– Enable data sharing & reuse
– Enhance data discovery at global scales
• Scalable analysis and synthesis
– Taxonomic, spatial, temporal, conceptual integration of data, addressing data heterogeneity issues
– Enable communication and collaboration for analysis
– Enable reuse of analytical components
– Support scientific workflow design and modeling
Semantic Mediation System, SEEK/Kepler
SEEK data access, analysis, mediation
• Data Access (EcoGrid)
– Distributed data network for environmental, ecological, and systematics data
– Interoperate diverse environmental data systems
• Workflow Tools (Kepler)
– Problem-solving environment for scientific data analysis and visualization (“scientific workflows”)
• Semantic Mediation (SMS)
– Leverage ontologies for “smart” data/component discovery and integration
Managing Data Heterogeneity
• Data comes from heterogeneous sources
– Real-world observations
– Spatial-temporal contexts
– Collection/measurement protocols and procedures
– Many representations for the same information (count, area, density)
– Data, syntax, schema, and semantic heterogeneity
• Discovery and “synthesis” (integration) performed manually
– Discovery often based on intuitive notion of “what is out there”
– Synthesis of data is very time consuming, and limits use
Scientific workflow systems support data analysis
KEPLER
A simple Kepler workflow
[Workflow figure (T. McPhillips), with a composite component (sub-workflow) highlighted]
Loops are often used in SWFs, e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...)
A simple Kepler workflow
[Workflow figure (T. McPhillips), with the following annotations on its actors:]
• Lists Nexus files to process (project)
• Reads text files
• Parses Nexus format
• PhylipPars infers trees from discrete, multi-state characters
• Workflow runs PhylipPars iteratively to discover all of the most parsimonious trees
• UniqueTrees discards redundant trees in each collection
• Draws phylogenetic trees
A simple Kepler workflow
An example workflow run, executed as a Dataflow Process Network
SMS motivation
• Scientific Workflow Life-cycle
– Resource Discovery
• discover relevant datasets
• discover relevant actors or workflow templates
– Workflow Design and Configuration
• data → actor (data binding)
• data ↔ data (data integration / merging / interlinking)
• actor → actor (actor / workflow composition)
• Challenge: do all this in the presence of …
– 100’s of workflows and templates
– 1000’s of actors (e.g. actors for web services, data analytics, …)
– 10,000’s of datasets
– 1,000,000’s of data items
– … highly complex, heterogeneous data
– price to pay for these resources: $$$ (lots)
– scientist’s time wasted: priceless!
Approach & SMS capabilities
[Diagram of SMS capabilities: Ontologies, Semantic Annotation, Resource Discovery, Workflow Validation, Resource Integration, Workflow Elaboration; labeled “Iterative Development”]
Approach & SMS capabilities: Ontologies
• SEEK KR group is developing OWL-DL ontologies:
– Various workflow-component ontologies (for categorizing by function, project, scientific discipline, …)
– Scientific observation ontology (OBOE), an upper ontology for defining and relating observations, measurements, and units
– Domain-specific ontologies that extend OBOE (standard and derived units, ecology and biodiversity concepts, …)
Approach & SMS capabilities: Semantic Annotation
• Annotations “connect” resources to ontologies
– Conceptually describe a resource and/or its “data schema”
– Annotations provide the means for ontology-based discovery, integration, …
“Hybrid” types … Semantic + Structural Typing
[Figure: a dataset and an actor’s ports, each carrying a semantic type (O, Oout) and a structural type (S, Sout)]
O : Observation ⊓ ∃obsProperty.SpeciesOccurrence
S : SpeciesData(site, day, spp, occ)
Structural Types: Given a structural type language 𝒮
– Datasets, inputs, and outputs can be assigned structural types S ∈ 𝒮
Semantic Types: Given an ontology language 𝒪 (e.g., OWL-DL)
– Datasets, inputs, and outputs can be assigned ontology types O ∈ 𝒪
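[Figure: output port of actor A1 (Oout, Sout) connected to input port of actor A2 (Oin, Sin)]
Semantically compatible (Oout ⊑ Oin) but structurally incompatible (Sout is not a subtype of Sin)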
Semantic & structural types can be combined using logic constraints
α := ∀(site, day, sp, occ) SpeciesData(site, day, sp, occ) →
∃(y) Observation(y), obsProp(y, occ), SpeciesOccurrence(occ)
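The "semantically compatible but structurally incompatible" case can be illustrated with a small sketch. It is not the SMS implementation: the class hierarchy, the `HybridType` record, and the rule that structural types must match exactly are all assumptions made for illustration, and a real system would delegate subsumption to an OWL-DL reasoner.

```python
from dataclasses import dataclass
from typing import Tuple

# Toy subclass axioms (child -> parents). Hypothetical, not the SEEK ontologies.
SUBCLASS_OF = {
    "SpeciesOccurrence": {"Occurrence"},
    "Occurrence": {"ObservedProperty"},
}

def is_subsumed(sub: str, sup: str) -> bool:
    """True if `sub` equals `sup` or is a (transitive) subclass of it."""
    if sub == sup:
        return True
    return any(is_subsumed(parent, sup) for parent in SUBCLASS_OF.get(sub, ()))

@dataclass(frozen=True)
class HybridType:
    semantic: str                 # ontology class name (the "O" part)
    structural: Tuple[str, ...]   # ordered attribute names (the "S" part)

def check_connection(out: HybridType, inp: HybridType) -> Tuple[bool, bool]:
    """Return (semantic_ok, structural_ok) for an output -> input connection."""
    semantic_ok = is_subsumed(out.semantic, inp.semantic)
    # Assumption: structural compatibility means identical schemas.
    structural_ok = out.structural == inp.structural
    return semantic_ok, structural_ok

a1_out = HybridType("SpeciesOccurrence", ("site", "day", "spp", "occ"))
a2_in = HybridType("Occurrence", ("site", "occ"))
print(check_connection(a1_out, a2_in))  # (True, False)
```

A result of `(True, False)` is the situation in which a structural adapter between the two ports would be needed.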
Semantic Type Annotation in Kepler
• Component input and output port annotation
– Each port can be annotated with multiple classes from multiple ontologies
– Annotations are stored within the component metadata
Component Annotation and Indexing
• Component Annotations
– New components can be annotated and indexed into the component library (e.g., specializing generic actors)
– Existing components can also be revised, annotated, and indexed (hiding previous versions)
Approach & SMS capabilities: Resource Discovery
• Ontology-based “smart” search
– Find components by semantic types
– Find components by input/output semantic types
– Ontology-based query rewriting for discovery/integration
• Joint work with GEON project (see SSDBM-04, SWDB-04)
Smart Search
Browse for Components
Search for Component Name
Search for Category / Keyword
Find a component (here: an actor) in different locations (“categories”)
• … based on the semantic annotation of the component (or its ports)
Searching in context
• Search for components with compatible input/output semantic types
– … searches over actor library
– … applies subsumption checking on port annotations
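A context search of this kind could look roughly like the sketch below. The single-parent class hierarchy, the actor names, and the one-annotation-per-port library layout are hypothetical; Kepler ports can carry several annotations, and the real search would rely on an ontology reasoner.

```python
# Hypothetical single-inheritance hierarchy: class -> direct superclass.
PARENTS = {
    "SpeciesAbundance": "Abundance",
    "Abundance": "Measurement",
    "Biomass": "Measurement",
}

# Hypothetical actor library: one semantic annotation per input/output port.
LIBRARY = {
    "AbundancePlotter": {"in": "Abundance", "out": "Plot"},
    "BiomassModel": {"in": "Biomass", "out": "Measurement"},
    "GenericStats": {"in": "Measurement", "out": "Measurement"},
}

def ancestors(cls):
    """Yield `cls` and all of its superclasses, most specific first."""
    while cls is not None:
        yield cls
        cls = PARENTS.get(cls)

def find_downstream(output_type):
    """Names of actors whose input annotation subsumes `output_type`."""
    accepted = set(ancestors(output_type))
    return sorted(name for name, ports in LIBRARY.items()
                  if ports["in"] in accepted)

print(find_downstream("SpeciesAbundance"))  # ['AbundancePlotter', 'GenericStats']
```

`BiomassModel` is excluded because `Biomass` is a sibling of, not a superclass of, `SpeciesAbundance` in this toy hierarchy.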
Approach & SMS capabilities: Workflow Validation
• Workflow validation and analysis
– Check that workflows are semantically & structurally well-typed
– Infer semantic type annotations of derived data (i.e., type inference)
• An initial approach and prototype based on mapping composition (see QLQP-05)
– User-oriented provenance
• Collect & query data-lineage of WF runs (see IPAW-06)
Workflow validation in Kepler
• Statically perform semantic and structural type checking
• Navigate errors and warnings within the workflow
– Search for and insert “adapters” to fix (structural and semantic) errors …
Approach & SMS capabilities: Resource Integration
• Integrating and transforming data
– Merge (“smart union”) datasets
– Find mappings between data schemas for transformation
• data binding, component connections (see DILS-04)
Smart (Data) Integration: Merge
• Discover data of interest
• … connect to merge actor
• … “compute merge”
– align attributes via annotations
– open dialog for user refinement
– store merge mapping in MOML
• … enjoy!
– … your merged dataset
– almost: in practice it can be much more complicated
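Under the hood of “Smart Merge” …
[Figure: two source tables feed a Merge actor. Attributes a1 and a8 are annotated with Site, and a3 and a6 with Biomass, so the merge aligns a1 with a8 and a3 with a6.]
Source 1: a1 = (a, b); a2 (values not shown); a3 = (5, 6); a4 = (10, 11)
Source 2: a5 (values not shown); a6 = (0.1, 0.2, 0.3); a7 (values not shown); a8 = (a, c, d)
Merge Result:
a1   a3    a4
a    5.0   10
b    6.0   11
a    0.1
c    0.2
d    0.3
• Exploits semantic type annotations and ontology definitions to find mappings between sources
• Executing the merge actor results in an integrated data product (via “outer union”)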
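The merge on this slide can be approximated by the following sketch, which aligns columns that share a semantic type and then takes an outer union. The function `smart_merge`, the dictionary-based annotations, and the row pairing of the sample values are assumptions made for illustration; the actual merge actor also consults ontology definitions and lets the user refine the mapping.

```python
def smart_merge(left, right, left_types, right_types):
    """Outer-union two lists of row dicts, aligning columns by semantic type."""
    # Map each semantic type to the left-hand column that carries it.
    type_to_left = {t: col for col, t in left_types.items() if t is not None}
    # Rename right-hand columns whose type also occurs on the left.
    rename = {col: type_to_left.get(t, col) for col, t in right_types.items()}
    renamed = [{rename.get(c, c): v for c, v in row.items()} for row in right]
    # Result schema: left columns first, then any unmatched right columns.
    columns = list(left_types)
    for col in rename.values():
        if col not in columns:
            columns.append(col)
    # Missing values become None (the "outer" part of the union).
    return [{c: row.get(c) for c in columns} for row in left + renamed]

# Sample values taken from the slide; a4 has no annotation in this sketch.
left = [{"a1": "a", "a3": 5.0, "a4": 10}, {"a1": "b", "a3": 6.0, "a4": 11}]
right = [{"a6": 0.1, "a8": "a"}, {"a6": 0.2, "a8": "c"}, {"a6": 0.3, "a8": "d"}]
left_types = {"a1": "Site", "a3": "Biomass", "a4": None}
right_types = {"a6": "Biomass", "a8": "Site"}

merged = smart_merge(left, right, left_types, right_types)
for row in merged:
    print(row)
```

Rows from the second source end up with `None` in `a4`, matching the empty cells in the merge result.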
Approach & SMS capabilities: Workflow Elaboration
• Workflow design support
– (Semi-) automatically combine resource discovery, integration, and validation
– Abstract → Executable WF
– … ongoing work!
• Automated SWF Refinement
Summary
• Outlook:
– Ontologies and semantic annotations for WF design & reuse
– Put ontologies to actual use in Kepler
– Continue to develop Kepler tools for annotation (KR observation ontology), discovery, integration, design, …
• Issues & Challenges:
– Tools/approaches for ontology (OWL) management, organization, reasoning
– Open source (distributed) ontology (OWL) storage and reasoning
– Tools and techniques for robust ontology versioning and extension
• Acknowledgements
– Timothy McPhillips, Dave Thau (UC Davis)
– Mark Schildhauer, Josh Madin, Matt Jones (UCSB)
– Deana Pennington (UNM)
– Rich Williams (Microsoft Research)
– Ferdinando Villa, Sergey Krivov (UVM)