Semantic Mediation in SEEK/Kepler: Exploiting Semantic Annotation for Discovery, Analysis, and Integration of Scientific Data and Workflows

Shawn Bowers
UC Davis Genome Center
sbowers@ucdavis.edu

Bertram Ludäscher
Dept. of Computer Science, UC Davis
UC Davis Genome Center
ludaesch@ucdavis.edu

seek.ecoinformatics.org | kepler-project.org | www.sdsc.edu | dbis.ucdavis.edu | genomics.ucdavis.edu

Science Environment for Ecological Knowledge
SEEK is an NSF-funded, multidisciplinary research project to facilitate …
• Access to distributed ecological, environmental, and biodiversity data
– Enable data sharing & reuse
– Enhance data discovery at global scales
• Scalable analysis and synthesis
– Taxonomic, spatial, temporal, conceptual integration of data, addressing data heterogeneity issues
– Enable communication and collaboration for analysis
– Enable reuse of analytical components
– Support scientific workflow design and modeling

Semantic Mediation System, SEEK/Kepler

SEEK data access, analysis, mediation
• Data Access (EcoGrid)
– Distributed data network for environmental, ecological, and systematics data
– Interoperate diverse environmental data systems
• Workflow Tools (Kepler)
– Problem-solving environment for scientific data analysis and visualization (“scientific workflows”)
• Semantic Mediation (SMS)
– Leverage ontologies for “smart” data/component discovery and integration

Managing Data Heterogeneity
• Data comes from heterogeneous sources
– Real-world observations
– Spatial-temporal contexts
– Collection/measurement protocols and procedures
– Many representations for the same information (count, area, density)
– Data, syntax, schema, and semantic heterogeneity
• Discovery and “synthesis” (integration) performed manually
– Discovery often based on intuitive notion of “what is out there”
– Synthesis of data is very time consuming, and limits use

Scientific workflow systems support data analysis
KEPLER

A simple Kepler workflow
• Composite Component (Sub-workflow)
• Loops often used in SWFs; e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...)
(T. McPhillips)

A simple Kepler workflow
• Lists Nexus files to process (project)
• Reads text files
• Parses Nexus format
• Draws phylogenetic trees
• PhylipPars infers trees from discrete, multi-state characters.
• Workflow runs PhylipPars iteratively to discover all of the most parsimonious trees.
• UniqueTrees discards redundant trees in each collection.
(T. McPhillips)

A simple Kepler workflow
• An example workflow run, executed as a Dataflow Process Network

SMS motivation
• Scientific Workflow Life-cycle
– Resource Discovery
  • discover relevant datasets
  • discover relevant actors or workflow templates
– Workflow Design and Configuration
  • data-to-actor (data binding)
  • data-to-data (data integration / merging / interlinking)
  • actor-to-actor (actor / workflow composition)
• Challenge: do all this in the presence of …
– 100’s of workflows and templates
– 1000’s of actors (e.g. actors for web services, data analytics, …)
– 10,000’s of datasets
  • price to pay for these resources: $$$ (lots)
– 1,000,000’s of data items
  • scientist’s time wasted: priceless!
– … highly complex, heterogeneous data

Approach & SMS capabilities
[Diagram: Ontologies, Iterative Development, Semantic Annotation, Resource Discovery, Workflow Validation, Resource Integration, Workflow Elaboration]

Approach & SMS capabilities
• SEEK KR group is developing OWL-DL ontologies:
– Various workflow-component ontologies (for categorizing by function, project, scientific discipline, …)
– Scientific observation ontology (OBOE), an upper ontology for defining and relating observations, measurements, and units
– Domain-specific ontologies that extend OBOE (standard and derived units, ecology and biodiversity concepts, …)

Approach & SMS capabilities
• Annotations “connect” resources to ontologies
– Conceptually describe a resource and/or its “data schema”
– Annotations provide the means for ontology-based discovery, integration, …

“Hybrid” types … Semantic + Structural Typing
[Figure: a port carrying both a semantic type O and a structural type S, for example:]
O : Observation ⊓ ∃obsProperty.SpeciesOccurrence
S : SpeciesData(site, day, spp, occ)
• Structural Types: Given a structural type language S
– Datasets, inputs, and outputs can be assigned structural types S
• Semantic Types: Given an ontology language O (e.g., OWL-DL)
– Datasets, inputs, and outputs can be assigned ontology types O
[Figure: the output port of actor A1 (types Oout, Sout) connected to the input port of actor A2 (types Oin, Sin). Caption: Semantically compatible but structurally incompatible]
• Semantic & structural types can be combined using logic constraints:
∀(site, day, sp, occ) SpeciesData(site, day, sp, occ) → ∃(y) Observation(y) ∧ obsProp(y, occ) ∧ SpeciesOccurrence(occ)

Semantic Type Annotation in Kepler
• Component
input and output port annotation
– Each port can be annotated with multiple classes from multiple ontologies
– Annotations are stored within the component metadata

Component Annotation and Indexing
• Component Annotations
– New components can be annotated and indexed into the component library (e.g., specializing generic actors)
– Existing components can also be revised, annotated, and indexed (hiding previous versions)

Approach & SMS capabilities
• Ontology-based “smart” search
– Find components by semantic types
– Find components by input/output semantic types
– Ontology-based query rewriting for discovery/integration
  • Joint work with GEON project (see SSDBM-04, SWDB-04)

Smart Search
• Browse for Components
• Search for Component Name
• Search for Category / Keyword
• Find a component (here: an actor) in different locations (“categories”)
– … based on the semantic annotation of the component (or its ports)

Searching in context
• Search for components with compatible input/output semantic types
– … searches over actor library
– … applies subsumption checking on port annotations

Approach & SMS capabilities
• Workflow validation and analysis
– Check that workflows are semantically & structurally well-typed
– Infer semantic type annotations of derived data (i.e., type inference)
  • An initial approach and prototype based on mapping composition (see QLQP-05)
– User-oriented provenance
  • Collect & query data-lineage of WF runs (see IPAW-06)

Workflow validation in Kepler
• Statically
perform semantic and structural type checking
• Navigate errors and warnings within the workflow
– Search for and insert “adapters” to fix (structural and semantic) errors …

Approach & SMS capabilities
• Integrating and transforming data
– Merge (“smart union”) datasets
– Find mappings between data schemas for transformation
  • data binding, component connections (see DILS-04)

Smart (Data) Integration: Merge
• Discover data of interest
• … connect to merge actor
• … “compute merge”
– align attributes via annotations
– open dialog for user refinement
– store merge mapping in MOML
• … enjoy!
– … your merged dataset
– almost, can be much more complicated

Under the hood of “Smart Merge” …
[Figure: two source tables with attributes a1 … a8 feed a Merge actor. Attributes a1 and a8 are aligned through the concept Site, and a3 and a6 through the concept Biomass. Merge result: a1 = (a, b, a, c, d), a3 = (5.0, 6.0, 0.1, 0.2, 0.3), a4 = (10, 11)]
• Exploits semantic type annotations and ontology definitions to find mappings between sources
• Executing the merge actor results in an integrated data product (via “outer union”)

Approach & SMS capabilities
• Workflow design support
– (Semi-) automatically combine resource discovery, integration, and validation
– Abstract to executable WF
– … ongoing work!
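The “smart merge” described above (align attributes via their semantic annotations, then combine the sources by outer union) can be illustrated with a minimal Python sketch. This is not the Kepler merge actor: the function name `smart_merge`, the dictionary-based tables, and the flat column-to-concept annotation maps are assumptions made for illustration, and real alignment would use ontology reasoning rather than exact concept-name equality. The sample values follow one reading of the figure on the “Smart Merge” slide.

```python
from typing import Dict, List

Row = Dict[str, object]


def smart_merge(
    left: List[Row],
    left_types: Dict[str, str],
    right: List[Row],
    right_types: Dict[str, str],
) -> List[Row]:
    """Outer union of two tables, aligning columns that share a semantic type.

    left_types / right_types map a column name to a concept name
    (hypothetical stand-ins for semantic annotations).
    """
    # Index the left table's columns by their annotated concept.
    by_concept = {concept: col for col, concept in left_types.items()}

    def target(col: str) -> str:
        # A right column takes the name of the left column annotated with
        # the same concept; unannotated or unmatched columns keep their name.
        return by_concept.get(right_types.get(col), col)

    # Collect output columns: left columns first, then new right columns.
    columns: List[str] = []
    for row in left:
        for col in row:
            if col not in columns:
                columns.append(col)
    for row in right:
        for col in row:
            if target(col) not in columns:
                columns.append(target(col))

    # Outer union: keep every row from both sources, padding with None.
    merged: List[Row] = [{c: row.get(c) for c in columns} for row in left]
    for row in right:
        renamed = {target(col): value for col, value in row.items()}
        merged.append({c: renamed.get(c) for c in columns})
    return merged


# Sample data, assumed from the slide figure: a1/a8 are sites, a3/a6 biomass.
source1 = [{"a1": "a", "a3": 5.0, "a4": 10}, {"a1": "b", "a3": 6.0, "a4": 11}]
source2 = [{"a8": "a", "a6": 0.1}, {"a8": "c", "a6": 0.2}, {"a8": "d", "a6": 0.3}]
result = smart_merge(
    source1, {"a1": "Site", "a3": "Biomass"},
    source2, {"a8": "Site", "a6": "Biomass"},
)
for row in result:
    print(row)
```

Rows from the second source have no value for the unaligned attribute a4, so the sketch pads them with `None`. In the workflow described on the slides, the computed alignment would additionally be shown to the user for refinement and stored with the workflow.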
Automated SWF Refinement

Summary
• Outlook:
– Ontologies and semantic annotations for WF design & reuse
– Put ontologies to actual use in Kepler
– Continue to develop Kepler tools for annotation (KR observation ontology), discovery, integration, design, …
• Issues & Challenges:
– Tools/approaches for ontology (OWL) management, organization, reasoning
– Open source (distributed) ontology (OWL) storage and reasoning
– Tools and techniques for robust ontology versioning and extension
• Acknowledgements
– Timothy McPhillips, Dave Thau (UC Davis)
– Mark Schildhauer, Josh Madin, Matt Jones (UCSB)
– Deana Pennington (UNM)
– Rich Williams (Microsoft Research)
– Ferdinando Villa, Sergey Krivov (UVM)