Overview of the Science Environment for Ecological Knowledge (SEEK)
http://seek.ecoinformatics.org
http://kepler-project.org
Ricardo Scachetti Pereira
(with many, many slides from Matt Jones, Bertram Ludäscher, Ilkay Altintas, Chad Berkeley and others)
University of Kansas, USA
June 30, 2005

Outline
• Introduction to SEEK
• Introduction to Kepler
• Kepler capabilities and sample workflows
• Current and future developments

SWDB http://seek.ecoinformatics.org Aug 29, 2004 June, 2005

What is SEEK?
Science Environment for Ecological Knowledge
Multidisciplinary project to create:
• Scientific-workflow system (Kepler)
  – Design, document, reuse, and execute scientific analyses
• Distributed data network (EcoGrid)
  – Environmental, ecological, and systematics data
• Knowledge Representation & Semantic Mediation
  – Discover, integrate, and compose hard-to-relate data and services via ontologies
• Taxonomic, Biology, and Education subcomponents
Collaborators (the SEEK team)
• NCEAS, UNM, SDSC/UCSD, U Kansas, UC Davis
• Vermont, Napier, ASU, UNC

Scientific Workflows
• Model the way scientists work with their data now
  – Mentally coordinate export and import of data among software systems
  1) Capture data in the field
  2) Digitize it into Excel spreadsheets
  3) Export as CSV files
  4) Import into statistical package
  5) Perform analysis
  6) Export results, tables and graphics
  7) Write and publish article
• Query EcoGrid to find data
• Archive output to EcoGrid with workflow metadata

Scientific Workflows
• Scientific workflows:
  – Are not linear
  – Involve multiple data sets
  – Involve multiple analytical steps

Metadata-driven data ingestion
• Key information needed to read and machine-process a data file is in the metadata
  – File descriptors (CSV, Excel, RDBMS, etc.)
  – Entity (table) and Attribute (column) descriptions
    • Name
    • Type (integer, float, string, etc.)
    • Codes (missing values, nulls, etc.)
    • In the future, this will include semantic typing

Metadata-driven data ingestion
• Metadata is revised following any transformation
• Versioning of metadata and data is very important
• This process results in a lineage of the data file as it has been transformed

Data integration
• Integration of heterogeneous data requires much more advanced metadata and processing
  – Attributes must be semantically typed
  – Collection protocols must be known
  – Units and measurement scale must be known
  – Measurement mechanics must be known (e.g., that Density = Count/Area)
  – This is an advanced research topic within the SEEK project

Semantic typing
• Label data with semantic types
• Label inputs and outputs of analytical components with semantic types
(Diagram labels: Data, Ontology, Workflow Components)
• Use SMS to generate transformation steps
  – Beware analytical constraints
• Use SMS to discover relevant components
• Ontology = specification of a conceptualization (a knowledge map)

SEEK Components Revisited

SEEK EcoGrid
• Goal: allow diverse environmental data systems to interoperate
• Data systems
  – Hides complexity of underlying systems using lightweight interfaces
  – Integrates diverse data networks from ecology, biodiversity, and environmental sciences
  – Any system can implement these interfaces
  – Prototyping using:
    • Metacat, SRB, DiGIR, Xanthoria, etc.
• Supports multiple metadata standards
  – EML, Darwin Core as foci
• Implemented as OGSA Grid Services
  – Query()
  – Get()
  – Put()
  – Login()
  – …
• Tiered implementation critical to adoption

Kepler: Scientific Workflows
• Implements the workflow system in SEEK
• Open, collaborative effort of:
  – SEEK, SciDAC/SDM, GEON, Ptolemy Project
  – Ecology, biodiversity, molecular bio, geology, engineering
• Based on Ptolemy II system
• Kepler aims to extend the Ptolemy system with:
  – Web and grid service access
  – Data integration support
  – Semantic reasoning
• Kepler actors are written in Java but can wrap other applications (such as MATLAB, GRASS)
• Actors can call arbitrary Web (or Grid) Services
• Ptolemy already has a very large inventory of actors

Actor Search and Browse
• Actors Panel
  – Large number of actors
  – Organized hierarchically
  – Search makes it easy to find the right actor
  – Ontology-based
• Plan to support multiple views

EcoGrid: EML Data Access

EcoGrid: Queries

EcoGrid: Queries

EML Metadata Display

EcoGrid: DarwinCore Access

Kepler: database access

Kepler: web service example

Kepler: grid services access

Kepler: ecological modeling

New ENM Workflow

Data Analysis: Biodiversity Indices
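
The deck does not say which indices the "Data Analysis: Biodiversity Indices" workflow computes. As a minimal sketch, assuming the Shannon index and the Gini-Simpson index as representative choices (picked here for illustration, not taken from the slides), and using made-up observation records, the core calculation on per-species counts could look like this in Python:

```python
import math
from collections import Counter


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)).

    `counts` holds non-negative individual counts, one per species;
    species with a zero count contribute nothing to the sum.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one observed individual")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson diversity 1 - sum(p_i ** 2)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one observed individual")
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical survey records (illustrative only): one species name
# per observed individual.
observations = ["oak", "oak", "pine", "maple", "pine", "oak", "birch", "oak"]
counts = list(Counter(observations).values())

print(f"species richness: {len(counts)}")
print(f"Shannon H': {shannon_index(counts):.3f}")
print(f"Gini-Simpson: {simpson_index(counts):.3f}")
```

In a workflow setting, the per-species counts would come from an upstream data-access step rather than a hard-coded list; the two functions only need a sequence of counts.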

‘R’ in Kepler
Source: Dan Higgins, Kepler/SEEK

ORB

Kepler today
• Supports scientific workflows
  – Ecology, molecular bio, geology, …
  – Variety of analytical components (including spatial data transformations)
  – Support for R scripts and Matlab scripts
• EcoGrid access to heterogeneous data
  – EML Data support
    • Experimental data, survey data, spatial raster and vector data, etc.
  – DarwinCore Data support
    • Museum collections
  – EcoGrid registry to discover data sources
• Ontology-based browsing for analytical components
  – Exploit semantics to improve the user experience
• Demonstration workflows
  – Ecology: Ecological Niche Modeling
  – Genomics: Promoter Identification Workflow
  – Geology: Geologic Map Information Integration
  – Oceanography: Real-time Revelle example of data access

Kepler this year
• Usability engineering
  – Full evaluation and user-oriented customization of all UI components
• Distributed computing/grid computing
  – Large jobs, lots of machines
  – Detached execution
• Component repository / downloadable components
• “Smart” data and component discovery
  – Support annotating data sources
• Automated data and service integration and transformation using ontologies
• Complete EcoGrid access
  – Full EML support
  – Support for “large” data and 3rd-party transfer
  – More data sources and types of data sources (e.g., JDBC, GEON data)
• Provenance and metadata propagation

Acknowledgements
This material is based upon work supported by:
The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.

Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis

Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON