The Way Things Go
- e-Science is a complex activity
- Scientific knowledge is comprehensible only in the context of those activities
- Adopt the Rube Goldberg view
[Image: Rube Goldberg]
National Center for Supercomputing Applications

Grand challenge: systems-scale science
- “... modeling complex systems will be a major research challenge for the 21st century” - National Science Foundation
- Observation and modeling of multiple systems at multiple scales
- Linking data and tools from different disciplines to get a valid global result!

Building current practices up isn't working
- Heterogeneous tools, data formats
- Little global coordination of research
- Little funding for sustained stewardship of tools and data
[Image: M.C. Escher, “Tower of Babel” (1928)]

Proposed solutions aren't working
- e-Journals: not machine-interpretable
- Collaboration tools: scientists just use email like everyone else
- Portals and digital libraries: typically centralized, domain-specific
- The Grid: can orchestrate complex processing jobs, but that's not science

Only networks work at scale
- Desktop (single researcher): ad hoc data mgt, single-user apps
- Workgroup (community): community tools, resources, control
- Network (global): no global practice, tools, control

How do we get there?
[Diagram: model, refine, predict, observe; critical interface]

e-Science means managing process and data
- Current approaches favor one or the other
- Information is getting lost
[Diagram label: data]

Trends: process, data
[Chart: process axis (batch, interactive, workflow) against data axis (data, metadata, semantics). Plotted items: provenance, the grid, portals; desktop apps, e-notebooks, digital libraries; formats, mainframes; rules, ontologies]

Key technologies
- Semantic web: data/metadata
  - Provides means of merging descriptive information even if it only partially agrees (e.g., comes from two different communities)
- Workflow: process
  - Describes complex procedures independently of how they are executed
- Provenance: process + data/metadata
  - Links workflow, data, and any ancillary descriptive information (e.g., attribution)

Semantics: data to knowledge
(from abstract to concrete; cf. Reagan Moore)
- Knowledge: ontologies, rules, models, etc. (a.k.a. semantics)
  - Learning, inference
- Information: collections, tags, attributes, etc. (a.k.a. metadata)
  - Aggregation, annotation
- Data: streams, arrays, swaths, etc. (a.k.a. files)

Semantic web: RDF triple
[Diagram: subject, predicate, object]
- Declarative: asserts a fact
- Subject and object URIs identify arbitrary entities (things, people, concepts, events)
- Predicate identifies the relationship between them

Triples form an open network
[Diagram: example graph with a hasBreed arc]
- Subject nodes aren't “owned” by any single agent or container
- Any actor can add arcs to the implicit, total, world graph
- Any two graphs can be joined

Non satis scire (to know is not enough)
- Semantic web “layer cake” (source: World Wide Web Consortium)
- Where do we manage process? User interface? Applications?
- “Semantic Grid” (D. De Roure, C. Goble)

Workflow: process description
- Describe complex operations as networks of simpler operations
- Abstract operation execution from description
- Can be shared (but may not be portable)
[Screenshots: Taverna, Kepler]

Anatomy of a workflow
- Execution model (usu. implicit)
- Declarative: says what to do
- Modules identify arbitrary procedures
- Arcs identify flow of control and/or data (data flow is usually implicit)
[Diagram labels: “Module”, control flow]

Workflow systems
- Modules representing units of computation
- Language for specifying WF modules, control flow
- Engine for executing WF
[Screenshot: D2K (source: NCSA)]

Work vs. workflow systems
- Scientists are not WF modules
- Science work also involves
  - social organization, incl. funding
  - field and “wet lab” manual work
  - discourse: review, validation
[Image (source: CNRS/UCSD)]

Provenance: what happened
- Answers critical questions
  - What led to this result?
  - When and how were observations made, conclusions reached?
- Is a causal network of events

Complementary incomplete notions of provenance
- Artifact-centric (e.g., digital libraries)
  - “lineage” = events in lifecycle of artifact, e.g., custody
  - IRs focus on curation events (not antecedent processes)
- Process-centric (e.g., workflow)
  - computational events (e.g., service invocations)
  - control flow
  - artifacts are either not mentioned or opaque (tool-specific)

Provenance Challenges 1 & 2
- IPAW 2006, HPDC 2007
- 20 teams, 1 workflow, 9 queries
- major players
- Interoperability?
  - lots of manual work required
  - call for standards
(source: gridprovenance.org)

Artifact + process provenance = “open provenance”
(source: Luc Moreau et al.)
- Can describe any process, not just WF execution (e.g., science!)
- Allows alternate accounts by different observers
- Rules for inferring transitive causal relationships

Open Provenance Model
(source: Luc Moreau et al.)
- 3 node types: artifact, process, agent
- 5 arc types: used, generated, triggered, derived, controlled; plus inference rules
- Generic: extensibility via annotation
- Choice of granularity and focus (e.g., artifact- or process-centric)

NCSA Provenance Infrastructure
[Layer diagram, top to bottom]
- Visualization, interaction: desktop, portal, etc.
- Tracking, modeling, presentation: OPM toolkit (Open Provenance Model)
- Abstraction, inference, storage: Tupelo Semantic Content Repository (three Context boxes, three Store boxes)

Tupelo: semantic content (tupeloproject.org)
- Abstracts content from storage impls (e.g., Sesame, Mulgara)
- Provides location-independent addressing of content and metadata
- Supports transparent mirroring, caching, failover, etc.

CyberIntegrator: workflow by example
- Records what users do as provenance
  - source, intermediate, and final artifacts
  - steps and parameters
- Can re-enact interaction as a workflow

MAEviz: analysis/viz app, workflow “behind the scenes”
- GIS app. platform
- Earthquake hazard analysis plug-in
- Data catalog
  - built environment
  - fragility/hazard models
- Driven by workflow -> provenance

CyberCollaboratory: collaboration + provenance
- User interaction with tools generates events
- Events are captured using the OPM and published to Tupelo
- Non-portal apps can browse / use provenance

Summary
- “The way things go” is critical to e-Science at scale
- Provenance is an open causal network
- New infrastructure supports provenance

Resources / acknowledgements
- Grid Provenance Challenge: http://twiki.gridprovenance.org/
- NCSA technologies
  - Tupelo: http://tupeloproject.org/
  - CyberIntegrator: http://isda.ncsa.uiuc.edu/
  - MAEviz: http://maeviz.cee.uiuc.edu/
  - CyberCollaboratory: http://ecid.ncsa.uiuc.edu/cybercollab/
- Acknowledgements: Jim Myers, Luc Moreau, Juliana Freire, Patrick Paulson, Simon Miles, Bob McGrath, and more ...
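
Illustrative sketch: the slides on RDF triples, joining graphs, and the Open Provenance Model's transitive inference rules can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions, not the OPM toolkit or the Tupelo API. The `ex:` identifiers, the two "accounts", and the helper names `merge` and `transitive_closure` are hypothetical, and the single rule shown (treating "derived" as transitive) is a simplification of the model's inference rules.

```python
from typing import Dict, Set, Tuple

# A triple is (subject, predicate, object); all identifiers below are
# hypothetical "ex:" names invented for this illustration.
Triple = Tuple[str, str, str]


def merge(*graphs: Set[Triple]) -> Set[Triple]:
    """Join any number of triple graphs; shared triples collapse into one."""
    world: Set[Triple] = set()
    for graph in graphs:
        world |= graph
    return world


def transitive_closure(graph: Set[Triple], predicate: str) -> Set[Triple]:
    """Return every (s, predicate, o) reachable by chaining `predicate` arcs.

    Simplifying assumption: the chosen arc type is treated as transitive.
    """
    direct: Dict[str, Set[str]] = {}
    for s, p, o in graph:
        if p == predicate:
            direct.setdefault(s, set()).add(o)

    inferred: Set[Triple] = set()
    for start, targets in direct.items():
        seen: Set[str] = set()
        stack = list(targets)
        while stack:  # depth-first walk; `seen` guards against cycles
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(direct.get(node, ()))
        inferred |= {(start, predicate, node) for node in seen}
    return inferred


# Two accounts of the same work from different observers. Arcs point from
# effect to cause, so ("a", "derived", "b") reads "a was derived from b".
field_account: Set[Triple] = {
    ("ex:cleaned_data", "derived", "ex:sensor_data"),
    ("ex:cleaning_run", "controlled", "ex:field_team"),
}
modeling_account: Set[Triple] = {
    ("ex:cleaned_data", "derived", "ex:sensor_data"),  # overlaps the field account
    ("ex:model_output", "derived", "ex:cleaned_data"),
    ("ex:figure", "derived", "ex:model_output"),
}

world = merge(field_account, modeling_account)
inferred = transitive_closure(world, "derived")

# Causal links that neither account stated directly.
for triple in sorted(inferred - world):
    print(triple)
```

Plain set union is enough for the merge because a triple carries its full meaning in its three identifiers; neither account has to "own" a node for the other to add arcs to it.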