Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis
Chad Berkley
NCEAS

National Center for Ecological Analysis and Synthesis (NCEAS),
University of California Santa Barbara
Long Term Ecological Research Network Office, University of New Mexico
University of Kansas
San Diego Supercomputer Center
Edinburgh, Scotland
http://seek.ecoinformatics.org
December 4, 2003
Outline
- Quick history
- SEEK overview
- Ecological Metadata Language
- Using workflows in Ecology
- Workflow editing with Kepler
- Future visions
History
- Late 1990s: patterns noticed in the problems surrounding data synthesis at NCEAS
- 1999: Michener et al. paper on ecological metadata
- 2000: Knowledge Network for Biocomplexity (KNB)
  - Morpho, Metacat, Ecological Metadata Language
  - Some footholds into workflow creation and execution
- 2003: Science Environment for Ecological Knowledge (SEEK) grant
  - Continues the work done on the KNB grant
  - Emphasis on using metadata for advanced data processing
SEEK approach
- General approach to specific ecological problems
- Data described with adequate metadata in a grid-accessible repository
- Reasoning engine (ontology-based) to locate and extract data and processes
- Modeling system to put it all together and control execution flow
SEEK Components
- Ecogrid
  - Analysis Library
  - Metadata and data repository
- Semantic Mediation System
  - Controlled semantic vocabulary
  - Ontological discovery system
- Analysis and Modeling System (Kepler)
  - Workflow control system
  - Utilizes resources from other components
SEEK Architecture
Ecological Metadata Language
- Common language for archiving and transport of datasets
- XML-based
- Designed for/by the ecological community
- Describes physical and logical structure of data
- Also includes project, literature and software information
- SEEK will add semantic information
Workflows in SEEK
- In the SEEK model, data ingestion/cleaning is metadata-driven (specifically with EML)
- Output generation includes creating appropriate metadata
- The analysis pipeline itself becomes metadata
Metadata-driven data ingestion
- Key information needed to read and machine process a data file is in the metadata
  - File descriptors (CSV, Excel, RDBMS, etc.)
  - Entity (table) and Attribute (column) descriptions
    - Name
    - Type (integer, float, string, etc.)
    - Codes (missing values, nulls, etc.)
    - In the future, this will include semantic typing
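
As a rough illustration of metadata-driven ingestion, the sketch below reads a delimited file using only a description of its attributes (name, type, missing-value codes). The `METADATA` dictionary is a hypothetical, heavily simplified stand-in for what an EML document carries; real EML is XML and much richer, and the names `ingest` and `CASTS` are illustrative only.

```python
import csv
import io

# Hypothetical, simplified stand-in for the file- and attribute-level
# information that an EML document would carry (real EML is XML).
METADATA = {
    "delimiter": ",",
    "header_lines": 1,
    "attributes": [
        {"name": "site", "type": "string", "missing": []},
        {"name": "count", "type": "integer", "missing": ["-999"]},
        {"name": "area_m2", "type": "float", "missing": ["NA", ""]},
    ],
}

# Map declared attribute types to Python conversion functions.
CASTS = {"string": str, "integer": int, "float": float}


def ingest(text, metadata):
    """Parse delimited text into typed records using only the metadata."""
    reader = csv.reader(io.StringIO(text), delimiter=metadata["delimiter"])
    rows = list(reader)[metadata["header_lines"]:]
    records = []
    for row in rows:
        record = {}
        for attr, raw in zip(metadata["attributes"], row):
            value = raw.strip()
            if value in attr["missing"]:
                record[attr["name"]] = None  # declared missing-value code
            else:
                record[attr["name"]] = CASTS[attr["type"]](value)
        records.append(record)
    return records


RAW = "site,count,area_m2\nA,12,4.0\nB,-999,2.5\nC,7,NA\n"
records = ingest(RAW, METADATA)
```

The point of the design is that nothing about the file layout is hard-coded in `ingest`; changing the metadata is enough to read a differently structured file.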
Metadata revision
- Metadata is revised following any transformation
- Versioning of metadata and data is very important
- This process results in a lineage of the data file as it has been transformed
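
One way to picture metadata revision is a transformation helper that never changes data without also emitting a new metadata version. This is a minimal sketch under assumed names (`transform`, `version`, `lineage` are hypothetical fields, not EML elements):

```python
import copy


def transform(data, metadata, step_name, func):
    """Apply func to the data and return the result with revised metadata."""
    new_data = func(data)
    new_meta = copy.deepcopy(metadata)  # leave the earlier version untouched
    new_meta["version"] = metadata["version"] + 1
    new_meta["lineage"] = metadata["lineage"] + [step_name]
    return new_data, new_meta


meta_v1 = {"id": "example-dataset", "version": 1, "lineage": []}
data_v1 = [3, None, 5]

data_v2, meta_v2 = transform(
    data_v1, meta_v1, "drop missing values",
    lambda rows: [r for r in rows if r is not None],
)
data_v3, meta_v3 = transform(
    data_v2, meta_v2, "double values",
    lambda rows: [r * 2 for r in rows],
)
```

Each version keeps the list of steps that produced it, so the lineage of `data_v3` can be read directly from `meta_v3`.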
Typical ecological workflow example
- Workflows can automate the integration process if data is described with adequate structured metadata
Homogeneous data integration
- Integration of homogeneous or mostly homogeneous data via EML metadata is relatively straightforward
Heterogeneous data integration
- Integration of heterogeneous data requires much more advanced metadata and processing
  - Attributes must be semantically typed
  - Collection protocols must be known
  - Units and measurement scale must be known
  - Measurement mechanics must be known (e.g. that Density = Count/Area)
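
The Density = Count/Area relation shows why units must be recorded: two sites can only be compared once their areas are expressed in the same unit. The sketch below is an illustration only; the function name, the unit labels, and the sample numbers are assumptions, not part of SEEK.

```python
# Conversion factors from common area units to square metres.
AREA_TO_M2 = {"m2": 1.0, "ha": 10_000.0, "km2": 1_000_000.0}


def density_per_m2(count, area, area_unit):
    """Derive density (individuals per square metre) from a count and an area."""
    if area_unit not in AREA_TO_M2:
        raise ValueError(f"unknown area unit: {area_unit}")
    area_m2 = area * AREA_TO_M2[area_unit]
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return count / area_m2


# Hypothetical sites: one reports plot area in square metres, one in hectares.
site_a = density_per_m2(count=12, area=4.0, area_unit="m2")
site_b = density_per_m2(count=30_000, area=2.0, area_unit="ha")
```

Without the unit in the metadata, the two raw area values (4.0 and 2.0) would be indistinguishable in kind, and the derived densities would not be comparable.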
Semantic typing and ontologies
- Label data with semantic types
- Label inputs and outputs of analytical components with semantic types
- Use Semantic Mediation System (SMS) to generate transformation steps
  - Beware analytical constraints
- Use SMS to discover relevant components
- Ontology: a specification of a conceptualization (a knowledge map)
[Diagram labels: Data, Ontology, Workflow Components]
Measurement Ontology
- Density is part of a larger measurement ontology
- SEEK's intent is to create one or more community-created ecological ontologies
- Creates a controlled vocabulary for ecological metadata
- More about this in Bertram's talk
About Kepler
- Kepler is the name of the SEEK/SDM additions to the Ptolemy modeling system
- Ptolemy was designed by the UC Berkeley EECS department
- Primary use is modeling EE circuits
- Free, open source, pure Java
- Flexible design GUI for building workflows
Kepler
- A Kepler model consists of linked "actors" (which correspond to workflow steps)
- Timing is controlled by a "director"
- All actors are written in Java but can call other applications (such as SAS and MATLAB) or native-language code via JNI
- Actors can call arbitrary Web (or Grid) Services
- Ptolemy already has a very large inventory of actors
- Easy-to-use drag-and-drop interface
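
To make the actor/director split concrete, here is a toy sketch of the idea. It is not the Ptolemy or Kepler API (those are Java, with typed ports and several director types); the class names `Actor` and `SequentialDirector` are invented for illustration.

```python
class Actor:
    """A workflow step with a single input and a single output."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.downstream = []

    def link(self, other):
        """Connect this actor's output to another actor's input."""
        self.downstream.append(other)
        return other


class SequentialDirector:
    """Decides when actors fire: each fires once, as soon as its input arrives."""

    def run(self, source, value):
        results = {}
        queue = [(source, value)]
        while queue:
            actor, token = queue.pop(0)
            out = actor.func(token)
            results[actor.name] = out
            for nxt in actor.downstream:
                queue.append((nxt, out))
        return results


read = Actor("read", lambda text: [int(x) for x in text.split(",")])
total = Actor("sum", sum)
scale = Actor("scale", lambda n: n * 2)
read.link(total).link(scale)

results = SequentialDirector().run(read, "1,2,3")
```

The actors know nothing about scheduling; swapping in a different director would change when steps fire without touching the steps themselves.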
SEEK Contributions to Kepler (so far)
- EML data ingestion actor
- Actor design tool
EML data ingestion actor
- Ingests any data format described by EML metadata
- Converts raw data to Kepler format
- Data can then be operated on with other actors
- Produces one output port for each attribute in the dataset
- Individual attributes can then be mapped to other actors
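
The "one output port per attribute" idea can be sketched as turning row-oriented records into one stream per column. This is an illustration in Python, not the actor's actual Java implementation; `to_ports` and the sample records are hypothetical.

```python
def to_ports(records, attribute_names):
    """Turn row-oriented records into one token stream per attribute."""
    return {name: [rec[name] for rec in records] for name in attribute_names}


# Hypothetical records, as an ingestion step might produce them.
records = [
    {"site": "A", "count": 12},
    {"site": "B", "count": 7},
]
ports = to_ports(records, ["site", "count"])

# A downstream step connects only to the port it needs.
total_count = sum(ports["count"])
```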
Ptolemy model with EML ingestion actor
Actor design tool
- Allows "place-holder" actors to be defined on the fly by non-programmers during workflow creation
- Domain scientists can thereby create workflows without programming knowledge
- Workflows created with these actors can be executed once their functionality is implemented by a programmer
- Allows quick prototyping of workflows by domain scientists
- "Place-holder" actors can still be linked to other working actors
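
A place-holder actor can be thought of as a declared interface with no behavior yet. The sketch below shows that idea in Python under invented names (`PlaceholderActor`, `implement`, `fire`); it is not how the Kepler actor design tool is implemented.

```python
class PlaceholderActor:
    """A workflow step whose ports are declared before its logic exists."""

    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = list(inputs)
        self.outputs = list(outputs)
        self.implementation = None

    def implement(self, func):
        """Attach the behavior later, once a programmer has written it."""
        self.implementation = func

    def fire(self, **kwargs):
        if self.implementation is None:
            raise NotImplementedError(
                f"actor '{self.name}' has no implementation yet"
            )
        missing = [p for p in self.inputs if p not in kwargs]
        if missing:
            raise ValueError(f"missing inputs: {missing}")
        return self.implementation(**kwargs)


# A domain scientist declares the step while sketching the workflow.
density = PlaceholderActor("density", inputs=["count", "area"], outputs=["density"])

try:
    density.fire(count=12, area=4.0)
    ran_before = True
except NotImplementedError:
    ran_before = False

# Later, a programmer supplies the behavior.
density.implement(lambda count, area: {"density": count / area})
result = density.fire(count=12, area=4.0)
```

Because the ports are declared up front, the step can be wired to other actors during prototyping even though it cannot execute until `implement` is called.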
Ptolemy and dynamically created actor
How domain scientists will benefit
- More fully automated integration systems
- A library of pre-defined analytical processes which can be executed on heterogeneous data
- Semantic data discovery and processing
- Automated unit and measurement scale conversions
- A fuller understanding of cross-site research implications
Acknowledgements
More info: http://seek.ecoinformatics.org
Questions? IRC: irc.ecoinformatics.org #seek
This material is based upon work supported by:
- The National Science Foundation under Grant Numbers 9980154, 9904777, and 0225676 to NCEAS and its collaborators.
- The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
Primary Collaborators: University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)