Overview of the Science Environment for Ecological Knowledge (SEEK)

http://seek.ecoinformatics.org
http://kepler-project.org
Ricardo Scachetti Pereira
(with many, many slides from Matt Jones, Bertram Ludäscher, Ilkay Altintas, Chad Berkeley and others)
University of Kansas, USA
June 30, 2005
Outline
• Introduction to SEEK
• Introduction to Kepler
• Kepler capabilities and sample workflows
• Current and future developments
SWDB
http://seek.ecoinformatics.org
Aug 29, 2004
June, 2005
What is SEEK?
Science Environment for Ecological Knowledge
Multidisciplinary project to create:
Scientific-workflow system (Kepler)
– Design, document, reuse, and execute scientific analyses
Distributed data network (EcoGrid)
– Environmental, ecological, and systematics data
Knowledge Representation & Semantic Mediation
– Discover, integrate, and compose hard-to-relate data and
services via ontologies
Taxonomic, Biology, and Education subcomponents
Collaborators (the SEEK team)
• NCEAS, UNM, SDSC/UCSD, U Kansas, UC Davis
• Vermont, Napier, ASU, UNC
Scientific Workflows
• Model the way scientists work with their data now
– Mentally coordinate export and import of data among software
systems
1) Capture data in the field
2) Digitize it into Excel spreadsheets
3) Export as CSV files
4) Import into statistical package
5) Perform analysis
6) Export results, tables and graphics
7) Write and publish article
Query EcoGrid to find data
Archive output to EcoGrid with workflow metadata
Scientific Workflows
• Scientific workflows:
– Are not linear
– Involve multiple data sets
– Involve multiple analytical steps
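A minimal sketch of what "not linear" can mean in practice, assuming a hypothetical workflow in which two data sets are cleaned separately and then merged into one analysis with two outputs. The step names are invented for illustration; the standard-library `graphlib` module finds an order in which every step runs after its inputs.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "load_field_counts": set(),
    "load_plot_areas": set(),
    "clean_counts": {"load_field_counts"},
    "clean_areas": {"load_plot_areas"},
    "compute_density": {"clean_counts", "clean_areas"},
    "plot_results": {"compute_density"},
    "export_table": {"compute_density"},
}

def execution_order(graph):
    """Return one valid order in which every step follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(workflow)
print(order)
```

Several orders are valid for a branching graph like this one, which is why a workflow system tracks dependencies rather than a fixed sequence of steps.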
Metadata driven data ingestion
• Key information needed to read and machine process a data file is
in the metadata
– File descriptors (CSV, Excel, RDBMS, etc.)
– Entity (table) and Attribute (column) descriptions
• Name
• Type (integer, float, string, etc.)
• Codes (missing values, nulls, etc.)
• In the future, this will include semantic typing
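As an illustration of metadata-driven ingestion, the sketch below reads a delimited file using only what a metadata record says about it. The metadata layout, attribute names, and missing-value codes are assumptions made for this example; they are not EML or any SEEK format.

```python
import csv
import io

# Hypothetical metadata record: file descriptor plus attribute descriptions.
metadata = {
    "delimiter": ",",
    "header_lines": 1,
    "attributes": [
        {"name": "site", "type": str, "missing": [""]},
        {"name": "count", "type": int, "missing": ["-999", "NA"]},
        {"name": "area_m2", "type": float, "missing": ["-999", "NA"]},
    ],
}

def ingest(text, meta):
    """Parse delimited text into typed rows, mapping missing-value codes to None."""
    reader = csv.reader(io.StringIO(text), delimiter=meta["delimiter"])
    rows = list(reader)[meta["header_lines"]:]
    records = []
    for row in rows:
        record = {}
        for attr, raw in zip(meta["attributes"], row):
            value = raw.strip()
            if value in attr["missing"]:
                record[attr["name"]] = None
            else:
                record[attr["name"]] = attr["type"](value)
        records.append(record)
    return records

sample = "site,count,area_m2\nA,12,4.0\nB,-999,2.5\n"
records = ingest(sample, metadata)
print(records)
```

The reader contains no knowledge of the file itself: changing the delimiter, column types, or missing-value codes only requires a different metadata record.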
Metadata driven data ingestion
• Metadata is revised following any transformation
• Versioning of metadata and data is very important
• This process results in a lineage of the data file as it has
been transformed
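One way to picture this, as a rough sketch: each transformation returns a new version of the data set with revised metadata and one more lineage entry. The `Dataset` class, the `transform` helper, and the field names are hypothetical and chosen only for this example.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Dataset:
    rows: list
    metadata: dict
    lineage: list = field(default_factory=list)

def checksum(rows):
    """Short fingerprint of the data content, used to identify a version."""
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()[:12]

def transform(dataset, step_name, func, metadata_updates):
    """Apply func to the rows; return a new version with revised metadata and lineage."""
    new_rows = func(dataset.rows)
    new_meta = {
        **dataset.metadata,
        **metadata_updates,
        "version": dataset.metadata.get("version", 1) + 1,
    }
    entry = {
        "step": step_name,
        "input": checksum(dataset.rows),
        "output": checksum(new_rows),
    }
    return Dataset(new_rows, new_meta, dataset.lineage + [entry])

raw = Dataset([{"count": 12}, {"count": None}], {"version": 1, "rows": 2})
cleaned = transform(
    raw,
    "drop_missing",
    lambda rows: [r for r in rows if r["count"] is not None],
    {"rows": 1},
)
print(cleaned.metadata)
print(cleaned.lineage)
```

The original version is left untouched, so both the data and its metadata remain available at every point in the lineage.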
Data integration
• Integration of heterogeneous data requires much more advanced metadata and processing
– Attributes must be semantically typed
– Collection protocols must be known
– Units and measurement scale must be known
– Measurement mechanics must be known (e.g., that Density = Count/Area)
– This is an advanced research topic within the SEEK project
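The Density = Count/Area relation can be sketched as follows, assuming two hypothetical surveys: one reports a count and a sampled area in hectares, the other already reports density per square metre. The unit table and field names are assumptions for illustration.

```python
# Conversion factors to square metres (units assumed for this sketch).
AREA_TO_M2 = {"m2": 1.0, "ha": 10_000.0, "km2": 1_000_000.0}

def density_per_m2(count, area, area_unit):
    """Derive density (individuals per square metre) from a count and a sampled area."""
    if area <= 0:
        raise ValueError("area must be positive")
    return count / (area * AREA_TO_M2[area_unit])

# Two hypothetical surveys reporting the same quantity in different forms.
survey_a = {"count": 50, "area": 0.5, "area_unit": "ha"}  # count plus area
survey_b = {"density_per_m2": 0.02}                       # density already

integrated = [
    density_per_m2(survey_a["count"], survey_a["area"], survey_a["area_unit"]),
    survey_b["density_per_m2"],
]
print(integrated)
```

Without the unit and the measurement relation recorded in metadata, the two values could not be placed on a common scale automatically.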
Semantic typing
• Label data with semantic types
• Label inputs and outputs of analytical components with semantic types
[Figure: Data, Ontology, Workflow Components]
• Use SMS to generate transformation steps
• Use SMS to discover relevant components
• Ontology = specification of a conceptualization (a knowledge map)
– Beware analytical constraints
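A toy sketch of component discovery by semantic type, under stated assumptions: the ontology, the concept names, and the component names below are all invented for illustration and are not taken from SEEK. A component is considered relevant when the semantic type of the data is the same as, or a specialization of, the type the component accepts.

```python
# Hypothetical toy ontology: each concept maps to its parent concept.
ONTOLOGY = {
    "SpeciesCount": "Abundance",
    "Density": "Abundance",
    "Abundance": "Measurement",
    "Temperature": "Measurement",
    "Measurement": None,
}

def is_a(concept, ancestor):
    """True if concept equals ancestor or is one of its descendants."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)
    return False

# Hypothetical components annotated with the semantic type of their input.
COMPONENTS = {
    "ShannonIndex": "Abundance",
    "DegreeDayModel": "Temperature",
    "SummaryStatistics": "Measurement",
}

def discover(data_type):
    """List components whose input type subsumes the data's semantic type."""
    return sorted(
        name
        for name, input_type in COMPONENTS.items()
        if is_a(data_type, input_type)
    )

print(discover("SpeciesCount"))
```

Type compatibility alone does not guarantee that an analysis is statistically appropriate for the data, which is one reading of the caution about analytical constraints.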
SEEK Components Revisited
SEEK EcoGrid
• Goal: allow diverse environmental data systems to interoperate
• Data systems
– Hide complexity of underlying systems using lightweight interfaces
– Integrate diverse data networks from ecology, biodiversity, and environmental sciences
– Any system can implement these interfaces
– Prototyping using:
• Metacat, SRB, DiGIR, Xanthoria, etc.
• Supports multiple metadata standards
– EML, Darwin Core as foci
• Implemented as OGSA Grid Services
– Query()
– Get()
– Put()
– Login()
– …
• Tiered implementation critical to adoption
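To illustrate the idea of a lightweight interface that any system can implement, here is a sketch in which the method names follow the operations listed above but every signature, return value, and class name is an assumption. It is not the EcoGrid service definition. An in-memory store stands in for a real backend.

```python
from abc import ABC, abstractmethod

class EcoGridNode(ABC):
    """Hypothetical interface; method names mirror the slide, signatures are assumed."""

    @abstractmethod
    def login(self, user, password): ...

    @abstractmethod
    def query(self, text): ...

    @abstractmethod
    def get(self, identifier): ...

    @abstractmethod
    def put(self, identifier, document, session): ...

class InMemoryNode(EcoGridNode):
    """Stand-in for a real data system exposed through the same interface."""

    def __init__(self):
        self._documents = {}
        self._sessions = set()

    def login(self, user, password):
        # No real authentication here; the sketch only issues a session token.
        session = f"session-{user}"
        self._sessions.add(session)
        return session

    def query(self, text):
        needle = text.lower()
        return sorted(
            ident for ident, doc in self._documents.items()
            if needle in doc.lower()
        )

    def get(self, identifier):
        return self._documents[identifier]

    def put(self, identifier, document, session):
        if session not in self._sessions:
            raise PermissionError("login required before put")
        self._documents[identifier] = document

node = InMemoryNode()
session = node.login("demo", "secret")
node.put("doc.1", "Bird survey counts, 2004", session)
node.put("doc.2", "Soil temperature series", session)
print(node.query("survey"))
```

A client written against `EcoGridNode` does not need to know which storage system sits behind it, which is the point of keeping the interface small.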
Kepler: Scientific Workflows
• Implements the workflow system in SEEK
• Open, collaborative effort of:
– SEEK, SciDAC/SDM, GEON, Ptolemy Project
– Ecology, biodiversity, molecular bio, geology, engineering
• Based on Ptolemy II system
• Kepler aims to extend the Ptolemy system with:
– Web and grid service access
– Data integration support
– Semantic reasoning
• Kepler actors are written in Java but can wrap other applications (such as MATLAB, GRASS)
• Actors can call arbitrary Web (or Grid) Services
• Ptolemy already has a very large inventory of actors
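The idea of an actor that wraps another application can be sketched in a few lines. Kepler actors themselves are Java classes; the Python class below is only a language-neutral illustration and is not the Ptolemy II or Kepler API. The method name `fire` echoes Ptolemy terminology, and the wrapped "external tool" is a child Python process chosen so the example runs anywhere.

```python
import subprocess
import sys

class CommandLineActor:
    """Hypothetical actor: takes an input token, runs an external program, emits its output."""

    def __init__(self, command):
        self.command = command  # argument list for the wrapped program

    def fire(self, input_token):
        """Send the token to the program's stdin and return its stdout as the output token."""
        result = subprocess.run(
            self.command,
            input=input_token,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()

# Stand-in for an external tool: a child process that sums numbers read from stdin.
summing_tool = [
    sys.executable,
    "-c",
    "import sys; print(sum(float(x) for x in sys.stdin.read().split()))",
]

actor = CommandLineActor(summing_tool)
output = actor.fire("1.5 2.5 4")
print(output)
```

The workflow only sees tokens entering and leaving the actor, so the wrapped program could be replaced by a statistical package or a GIS tool without changing the rest of the workflow.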
Actor Search and Browse
• Actors Panel
– Large number of actors
– Organized hierarchically
– Search makes it easy to find the right actor
– Ontology-based
• Plan to support multiple views
EcoGrid: EML Data Access
EcoGrid: Queries
EML Metadata Display
EcoGrid: DarwinCore Access
Kepler: database access
Kepler: web service example
Kepler: grid services access
Kepler: ecological modeling
New ENM Workflow
Data Analysis: Biodiversity Indices
‘R’ in Kepler
Source: Dan Higgins, Kepler/SEEK
ORB
Kepler today
• Supports scientific workflows
– Ecology, molecular bio, geology, …
– Variety of analytical components (including spatial data transformations)
– Support for R scripts and MATLAB scripts
• EcoGrid access to heterogeneous data
– EML Data support
• Experimental data, survey data, spatial raster and vector data, etc.
– DarwinCore Data support
• Museum collections
– EcoGrid registry to discover data sources
• Ontology-based browsing for analytical components
– Exploit semantics to improve the user experience
• Demonstration workflows
– Ecology: Ecological Niche Modeling
– Genomics: Promoter Identification Workflow
– Geology: Geologic Map Information Integration
– Oceanography: Real-time Revelle example of data access
Kepler this year
• Usability engineering
– Full evaluation and user-oriented customization of all UI components
• Distributed computing/grid computing
– Large jobs, lots of machines
– Detached execution
• Component repository / downloadable components
• “Smart” data and component discovery
– Support annotating data sources
• Automated data and service integration and transformation using ontologies
• Complete EcoGrid access
– Full EML support
– Support for “large” data and 3rd-party transfer
– More data sources and types of data sources (e.g., JDBC, GEON data)
• Provenance and metadata propagation
Acknowledgements
This material is based upon work supported by:
The National Science Foundation under Grant Numbers 9980154,
9904777, 0131178, 9905838, 0129792, and 0225676.
Collaborators: NCEAS (UC Santa Barbara), University of New Mexico
(Long Term Ecological Research Network Office), San Diego
Supercomputer Center, University of Kansas (Center for Biodiversity
Research), University of Vermont, University of North Carolina, Napier
University, Arizona State University, UC Davis
The National Center for Ecological Analysis and Synthesis, a Center
funded by NSF (Grant Number 0072909), the University of California,
and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON