Artificial Intelligence and Cyberinfrastructure: Workflow Planning and Beyond

Yolanda Gil
USC/Information Sciences Institute
gil@isi.edu
www.isi.edu/~gil

In collaboration with others in the Intelligent Systems Division and the Center for Grid Technologies at USC/ISI, including Ewa Deelman, Carl Kesselman, and Jim Blythe.
Supported in part by NSF’s GriPhyN and SCEC/CME projects, and by internal grants from USC/ISI.

USC INFORMATION SCIENCES INSTITUTE

Interactive Knowledge Capture @ USC/ISI
http://www.isi.edu/ikcap
Research focus: Acquiring knowledge from end users within a problem solving context or task
Previous and ongoing work: User-centered knowledge capture techniques, including:
• knowledge gaps and interdependencies [EXPECT, KANAL, CALO-AT]
• model-based acquisition wizards [ETM, CONSTABLE]
• visualization for knowledge elicitation [VEIL]
• incorporating instructional and tutoring principles [SLICK]
• from informal to formal representations [ACE]
New directions: distributed knowledge capture and problem solving
• Deriving structure from large collections of semi-structured k [TRELLIS]
• Distributed acquisition of knowledge for communities of practice [IKRAFT]
• Distributed problem solving in computational grids [PEGASUS, CAT]

The Southern California Earthquake Center’s Community Modeling Environment (SCEC-CME)
(http://iowa.usc.edu/cmeportal/)

Outline
Motivation
• Scientific workflows
• Challenges and opportunities for Artificial Intelligence
Research on workflow planning at USC/ISI
• Using AI techniques in Pegasus to generate executable grid workflows
Future directions in support of scientific workflows
• Cognitive grids
• Intelligent interactive assistance and automatic completion
• Active workflows
Knowledge infrastructure for science

Integrating Diverse Models of Complex
Phenomena…
[Figure: historic records, fault models, fault ruptures, wave propagation, site response models, effect on structures]

…for Broader Use
Geophysicists, civil and structural engineers, city planners, emergency managers, …
• Analyze seismic hazard
• Learn and understand seismic hazard
Of course, scientists need this infrastructure as well!

How This is Done Today
Scientists:
• Verbal communication needed to compose models
• When an earthquake occurs, hard to respond quickly
Other users (e.g., building engineers):
• Use models based on correlations of historical data
• Employ consultants that know how to set up these models
• Delay in accessing state-of-the-art scientific models

Scientific Workflows
Models composed into end-to-end scientific workflows that model/analyze complex physical phenomena
• In-silico experimentation
• Data collection and analysis
Reproducibility, reusability, pedigree
[Figure: workflow for a seismic hazard task. Components shown: UTM Converter (get-Lat-Long-given-UTM), PEER-Fault ruptures (rfml), CVM-getVelocity-at-point, Basin-Depth Calculator, Field (2000) IMR (SA exc. prob.), Hazard Curve Calculator (SA vs. prob. exc.). Task result: hazard curve, SA vs. prob. exc.]
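A minimal sketch, in Python, of how a workflow like the one in the figure could be held as a dependency graph and ordered for execution. The component names come from the slide; the dependency edges between them are assumptions made for illustration (the flattened figure does not preserve its arrows), and `execution_order` is a hypothetical helper, not part of any tool named in this talk.

```python
from graphlib import TopologicalSorter

# Each key is a workflow component; its value is the set of components
# whose outputs it consumes. These edges are ASSUMED for illustration.
workflow = {
    "UTM Converter": set(),
    "PEER-Fault ruptures": set(),
    "CVM-getVelocity-at-point": {"UTM Converter"},
    "Basin-Depth Calculator": {"UTM Converter"},
    "Field (2000) IMR": {
        "PEER-Fault ruptures",
        "CVM-getVelocity-at-point",
        "Basin-Depth Calculator",
    },
    "Hazard Curve Calculator": {"Field (2000) IMR"},
}


def execution_order(deps):
    """Return one valid execution order for the components.

    Raises graphlib.CycleError if the dependencies contain a loop,
    in which case no execution order exists.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(workflow)
for step, name in enumerate(order, start=1):
    print(step, name)
```

Any order returned this way runs a component only after everything it depends on, which is the minimum a workflow executor needs before it can bind steps to grid resources.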

Executing Scientific Workflows on Grids
Grids support this process through middleware services:
• Seamless integration and management of resources (OGSA)
• Job submission (Condor)
• Resource Monitoring and Directory Service (MDS)
• Replica Location Service (RLS)
• Metadata Catalog Services (MCS)
From [Kesselman 04]:
• Many sources of data, services, computation
• Registries organize services of interest to a community
• Security & policy must underlie access & management decisions
• Resource management is needed to ensure progress & arbitrate competing demands
• Data integration activities may require access to, & exploration/analysis of, data at many locations
• Exploration & analysis may involve complex, multi-step workflows
[Figure labels: Discovery, Access, R, RM, Security service, Policy service]

Application Development and Execution Process
[Figure: starting from the application domain (example: FFT), application component selection produces an abstract workflow (FFT filea). Resource selection, data replica selection, and transformation instance selection produce a concrete workflow (DataTransfer: transfer filea from host1://home/filea to host2://home/file1; then /usr/local/bin/fft /home/file1), which runs in the execution environment (host1, host2). Failure recovery methods: retry, pick different resources, specify a different workflow.]

How Scientists Develop Workflows
Scientists have high-level requirements naturally stated in terms of the application domain
• Ex: Obtain frequency spectrum for signal S in instrument I and timeframe T
These requirements can be achieved by formulating workflows
Workflows are often complex in terms of size and HPC requirements (grid)
So, scientists must be well trained on high performance/distributed computing
First, they have to turn these requirements into executable job workflows in detailed scripts
• They must figure out which code generates
desired products, which files contain it, physical location of the files, hosts that support execution given code requirements, availability of hosts, access policies, etc.
• They have to be able to query grid middleware: metadata catalog, replica locator, resource descriptor and monitoring, etc.
They must also oversee execution
• Diagnose failures (code, memory, network, resource, etc.) and design recovery strategies (replace resource, rearrange data, replace code, etc.)

Challenges
Complexity: Many choices are involved as workflow is composed
• Alternative application components, files, and locations
• Many different interdependencies may occur among components
• May reach many dead ends
Usability: Users should not need to be aware of infrastructure details
• Files are distributed, indexed, replicated
• Match application requirements to host capabilities
Solution cost: Evaluate the alternative solution costs
• Performance
• Reliability
• Resource Usage
Global cost: minimizing cost across organizations
• Individual user’s choices in light of other users’ choices
Reliability of execution: job resubmission upon failure
• Detection, diagnosis, repair
• Anticipation and avoidance, resource reservations

Not Just Large-Scale and HPC Issues: Large-Scope Science and Engineering Research
“Whereas large-scale means increasing the resolution of the solution to a fixed physical model problem, large-scope means increasing the physical complexity of the model itself. Increasing the scope involves adding more physical realism to the simulation, making the actual code more complex and heterogeneous, while keeping the resolution more or less constant.”
-- Report from ACM Workshop on Strategic Directions in Computing Research, A. Sameh et al on Computational Science and Engineering, June 1996

Challenges Revisited
LARGE SCOPE
Complexity: Many choices are involved as workflow is composed
• Alternative application components and versions may be available
• Many different interdependencies and domain-specific constraints may occur among components
• May reach many dead ends
Usability: Users should not need to be aware of infrastructure details
• Files are distributed, indexed, replicated
• Match application requirements to host capabilities
LARGE SCALE
Solution cost: Evaluate the alternative solution costs
• Performance
• Reliability
• Resource Usage
Global cost: minimizing cost across organizations
• Individual user’s choices in light of other users’ choices
Reliability of execution: job resubmission upon failure
• Detection, diagnosis, repair
• Anticipation and avoidance, resource reservations

Ongoing Work in Grids
GRaDS
• High-level languages and flexible compiler technology
GriPhyN
• Virtual data concept, Chimera
Others: asymmetric matchmaking, OGSA, etc.
All are limited because they rely on programmatic approaches and impoverished schemas that lack the flexibility and expressivity required by the dynamics and scale of scientific applications

Challenges and opportunities for Artificial Intelligence
We need alternative foundations that offer
• expressive representations to capture the complex knowledge involved in both the application domain and the execution environment
• flexible reasoners to explore this complex space systematically and incorporate constraints, tradeoffs, policies
Many Artificial Intelligence (AI) techniques are relevant:
– Planning to achieve given requirements
– Searching through problem spaces of related choices
– Using and combining heuristics
– Reasoners that can incorporate rules, definitions, axioms, etc.
– Schedulers and resource allocation techniques
– Coordination and communication in distributed problem solving
– Expressive knowledge representation languages
– Reasoning under uncertainty
– Dynamic replanning and reactive control
– Learning in complex dynamic environments
– Learning to improve problem solving skills

Reasoning about Distributed Execution Infrastructure in Grids with Pegasus (work with J. Blythe, E. Deelman, C. Kesselman, and others)
[Figure: Pegasus architecture. Components shown: Virtual Data Language, Chimera, Abstract Workflow, Request Manager, Workflow Planning, Workflow Generation, Workflow Reduction, Replica and Resource Selector, Data Management, Data Publication, Concrete Workflow, Submission and Monitoring System, workflow executor (DAGMan), Execution, Globus Monitoring and Discovery Service, Globus Replica Location Service, Transformation Catalog, Information and Models (dynamic information, replica location, available resources), monitoring information, Grid tasks, raw data.]

Pegasus: Using AI Planning Techniques to Generate Executable Grid Workflows
Given: desired result and constraints
• A desired result (high-level, metadata description)
• A set of application components described in the grid
• A set of resources in the grid (dynamic, distributed)
• A set of constraints and preferences on solution quality
Find: an executable job workflow
• A configuration of components that generates the desired result
• A specification of resources where components can be executed and data can be stored
Approach: Use AI planning techniques to search the solution space and evaluate tradeoffs
• Exploit heuristics to direct the search for solutions and represent optimality and policy criteria

Workflow Generation as AI Planning
Goal (provided by the user)
• A metadata specification of the information the user requires and the desired location for the output file
Initial State (automatically extracted from Grid environment)
• Information about the state of the Grid
• Information about data location
Operators (encoded for the application domain)
• Represent the execution of a component at a particular location and the generation of particular file(s)
• File movements across the network
Heuristics as search control rules (Grid or application specific)
• Specify options that should be exclusively considered at any choice point in the search algorithm
(e.g., execute “close” to the data)

Advantages of Using AI Planning
Provide broad-based, generic foundation
Use general techniques to search for solutions
Explores alternatives, supports backtracking
Incorporates domain-specific and domain-independent heuristics (as search control rules)
Allow easy addition of new constraints and rules
Incorporate optimality and policy into the search for solutions
Interleave decisions at various levels
Can integrate the generation of workflows across users and policies within virtual orgs.

Example operator

(operator pulsar-search
  ;; typed variables, followed by the conditions that must hold
  (preconds
    ((<host> (or Condor-pool Mpi))
     (<file> File-Handle)
     (<start-time> Number)
     (<channel> Channel)
     (<fcenter> Number)
     (<right-ascension> Number)
     (<sample-rate> Number)
     …
     (<f0> (and Number (get-low-freq-from-center-and-band <fcenter> <fband>)))
     (<fN> (and Number (get-high-freq-from-center-and-band <fcenter> <fband>)))
     (<run-time> (and Number (estimate-pulsar-search-run-time <start-time> <end-time> <sample-rate> <f0> <fN> <host> <run-time>))))
    (and (available pulsar-search <host>)
         (forall ((<sub-sft-file-group>
                   (and File-Group-Handle
                        (gen-sub-sft-range-for-pulsar-search <f0> <fN> <start-time> <end-time> <sub-sft-file-group>))))
                 (and (sub-sft-group <start-time> <end-time> <channel> <instrument> <format> <f0> <fN> <sample-rate> <sub-sft-file-group>)
                      (at <sub-sft-file-group> <host>)))))
  ;; facts added to the state once the component has run
  (effects
    ()
    ((add (created <file>))
     (add (at <file> <host>))
     (add (pulsar <start-time> <end-time> <channel> <instrument> <format>
                  <fcenter> <fband> <fderv1> <fderv2> <fderv3> <fderv4> <fderv5>
                  <right-ascension> <declination> <sample-rate> <file>)))))

Search Control Rules

Grid-specific:

(control-rule only-transfer-from-loc-with-greatest-bandwidth
  (if (and (current-ops (transfer-file))
           (current-goal (at ?file ?dest))
           (true-in-state (at ?file ?loc1))
           (true-in-state (at ?file ?loc2))
           (higher-bandwidth ?loc1 ?loc2 ?dest)))
  (then reject bindings ((?from-loc ?loc2))))

Domain-specific:

(control-rule prefer-mpi-to-condor-for-pulsar-search
  (if (and (current-ops (pulsar-search))
           (type-of ?mpi Mpi)
           (type-of ?condor Condor-pool)))
  (then prefer bindings ((?host ?mpi)) ((?host ?condor))))

Reasoning about Workflows in Pegasus
[Figure: from the desired results and the data processing tasks (nodes a to i) to the final workflow. Key: original node, input transfer node, registration node, output transfer node, unnecessary nodes.]

Searching for Pulsars with the Pegasus Planner
Used AI planning techniques to compose executable grid workflows with hundreds of jobs
Data from the Laser Interferometer Gravitational-Wave Observatory (LIGO), which aims to detect waves predicted by Einstein’s theory of relativity
Used LIGO’s data collected during the first scientific run of the instruments in Fall 2002
Targeted a set of 1000 locations of known pulsars as well as random locations in the sky
Performed using compute and storage resources at Caltech, University of Southern California, and University of Wisconsin Milwaukee.

Sample Pulsar Search Results
SC 2002: Over 58 pulsar searches
Total of
• 330 tasks
• 469 data transfers
• 330 output files produced
The total runtime was 11:24:35.
Fall 2002: 185 pulsar searches
Total of
• 975 tasks
• 1365 data transfers
• 975 output files
Total runtime 96:49:47

Pegasus Application Domains (work with E.
Deelman and dozens of scientists)
Pulsar search for gravitational-wave physics (LIGO)
Galaxy morphology for NVO and NASA in Montage
Tomography for neural structure reconstruction
High-energy physics – Compact Muon Solenoid
• 7 days, 678 jobs, produced ~200 GB
Gene alignment
• In 24 hours, ~10,000 Grid jobs, >200,000 BLAST executions, produced 50 GB

Small Montage Workflow
~1200 nodes [Deelman et al, 04]

Related Work
Improving grids with algorithmic approaches
• GRaDS, GriPhyN (Chimera)
Improving grids with knowledge/semantics
• myGrid (semantic component matching)
• Semantic grid, Knowledge grid
Planning techniques for software and service composition
• [Lansky et al 94] [Chien et al 96] [Golden et al 02] [McDermott 02] [McIlraith et al 02]

Pegasus: Status and Ongoing Work
Fully automated generation of executable grid workflows
Heuristic state-space search AI planner
• Prodigy [Veloso et al 94]
• Expressive language for control rules and heuristic estimation
Integration with grid environment
• Initially application and resource information populated manually
• Work almost completed to do so automatically
Exploring tradeoffs and optimization
• Current heuristics address minimal execution time
• Adding criteria for resource and replica selection
If components are (well) described, AI planner can select application components and generate the entire workflow from scratch

pegasus.isi.edu
Publications in AI forums
“The Role of Planning in Grid Computing” Jim Blythe, Ewa Deelman, Yolanda Gil, Carl Kesselman, Amit Agarwal, Gaurang Mehta, Karan Vahi. International Conference on Automated Planning and Scheduling (ICAPS) 2003.
“Transparent Grid Computing: a Knowledge-Based Approach” Jim Blythe, Ewa Deelman, Yolanda Gil, Carl Kesselman. Innovative Applications of Artificial Intelligence Conference (IAAI) 2003.
“Artificial Intelligence in Grids: Workflow Planning and Beyond” Yolanda Gil, Ewa Deelman, Jim Blythe, Carl Kesselman, H. Tangmurarunkit. IEEE Intelligent Systems, Jan/Feb 2004.
Publications in Grid forums
“Mapping Abstract Complex Workflows onto Grid Environments” Ewa Deelman, Jim Blythe, Yolanda Gil, Carl Kesselman, Gaurang Mehta, Karan Vahi, Adam Arbree, Richard Cavanaugh, Kent Blackburn, Albert Lazzarini, Scott Koranda. Journal of Grid Computing, Vol. 1 No. 1, 2003.
“Workflow Management in GriPhyN” Chapter in “The Grid Resource Management” book, E. Deelman, J. Blythe, Y. Gil, Carl Kesselman 2003.

Scientific Workflows: Future Directions
Using AI to augment the execution infrastructure
• Cognitive grids
Using AI to support the workflow creation process
• Interactive assistance and automatic completion
Using AI to support the scientific experimentation process
• Active workflows

Pervasive Knowledge Sources and Reasoners (work with J. Blythe, E. Deelman, C. Kesselman, H.
Tangmurarunkit) [Gil et al, IEEE IS 04]
[Figure: a high-level specification of desired results, constraints, requirements, and user policies enters a Workflow Manager with a Smart Workflow Pool. Intelligent reasoners: Workflow Refinement, Resource Matching, Workflow Repair, Policy Management. Pervasive knowledge sources: Resource KB, Application KB, Policy KB, Other KB, workflow history. Grid services shown: resource indexes, replica locators, simulation codes, policy information services, other Grid services. Community distributed resources (e.g., computers, storage, network, simulation codes, data).]

Cognitive Grids: Pervasive Semantic Representations of the Environment at all Levels
[Figure: layered architecture. Users and Applications (high-level request descriptions; current request status, results, provenance information). Intelligent Reasoners (matchmaking, refinement, repair, coordination, negotiation…) producing a refined workflow. Higher-Level Services (Virtual Data Tools, Resource Brokers) handling tasks, monitoring, and resource knowledge. Basic Grid Middleware (Globus Toolkit, Condor-G, DAGMan). Grid Resources (Compute, Data, Network). Knowledge sources shown: user and VO policy models, application component models, semantics for file-based data, policy knowledgebases, resource knowledgebases, provenance and monitoring, resource policy descriptions, semantic resource descriptions.]

Cognitive Grids: Distributed Intelligent Reasoners that Incrementally Generate the Workflow
[Figure: the user’s request is refined over time across levels of abstraction: relevant components, full abstract workflow, logical tasks, tasks bound to resources and sent for execution. Reasoners shown: workflow refinement, workflow repair, policy reasoner, ontology-based matchmaker, application-level knowledge. The workflow is marked as executed, in partial execution, or not yet executed.]

Many Opportunities for AI Techniques
The Grid Now
Syntax-based matchmaking of resources to job requirements
• Condor matchmaker
• Attribute based discovery and selection
Grid-able users that specify job execution sequences and computing requirements
• Scripting languages
• Workflow languages, Task graphs
Scheduling of jobs based on
• Explicit mappings from task to jobs, simple job brokers
• Explicit service negotiation and recovery strategies
The Future Grid
Knowledge-based reasoning about resources enables more agility and coordination
• Semantic matchmaking
• Aggregate resource reasoning
Wide range of users can specify high level requirements in a mixed-initiative mode
• Mapping of high-level requirements to details required for execution
Task-level reasoning to plan and schedule jobs and resources
• End-to-end resource negotiation and adaptive strategies to accommodate failure

The Process of Creating an Executable Workflow
User guided:
1. Creating a valid workflow template (human guided)
• Selecting application components and connecting inputs and outputs
• Adding other steps for data conversions/transformations
2. Creating instantiated workflow
• Providing input data to pathway inputs (logical assignments)
Automated:
3. Creating executable workflow (automatically)
• Given requirements of each model, find and assign adequate resources for each model
• Select physical locations for logical names
• Include data movement steps, including data deposition steps

Challenges for Interactive Composition of Valid Workflow Templates
Provide flexible interaction
• User can start from initial data, from data products, or steps
• User can specify abstract descriptions of steps and later specialize them
• User can reuse, merge, or build from scratch
Automatic tracking of workflow constraints
• User is notified if there are problems but does not have to keep track of details
Proactive assistance
• System should not just point out problems but help user by suggesting fixes (always)
And… how do we define what “valid” means?

Desirable Properties of Workflow Templates Based on AI Planning Formalisms (with J. Kim and M. Spraragen)
Satisfied iff the sources of input parameters for all components are specified
• A parameter p ∈ input-parameters(c) is satisfied iff ∃ a link <co, po, ci, pi> ∈ L s.t. pi = p
Purposeful iff the workflow template specifies at least one end result
• A workflow template <C, L, I, G> is purposeful iff G ≠ Ø.
Grounded iff each component has a unique assignment to an executable component
• A workflow template <C, L, I, G> is grounded iff ∀ c ∈ C, grounded(c)
Complete iff satisfied, purposeful, and grounded
Acyclic iff no loops
• A workflow template <C, L, I, G> is acyclic iff ∀ c ∈ C, c is not Linked to c.
Justified iff all components contribute to the end results
• A component c ∈ C is justified iff c ∈ G or ∃ c2 ∈ G where c is Linked to c2.
Parsimonious iff there are no redundant links or components
• A Link l <co, po, ci, pi> ∈ L is redundant iff ∃ link l2 <co’, po’, ci’, pi’> ∈ L s.t. l ≠ l2 and co = co’ and po’ = po and ci = ci’ and pi = pi’.
Well-Formed iff acyclic, justified, and parsimonious
Consistent iff all links satisfy defined component requirements and constraints
• A Link <c1, p1, c2, p2> is type-consistent iff subtype-of(range(c1,p1), range(c2,p2))
• A Link <c1, p1, c2, p2> is semantically-consistent iff subsumes(range(c1,p1), range(c2,p2))
Correct iff complete, well-formed, and consistent

Assisting Users in Creating Workflow Templates (with J. Kim and M. Spraragen) [Kim et al, IUI-04] [Spraragen et al, 04]
User interaction results in modifications to workflows
• Specify desired result, external/user provided input
• Add/remove step, add/remove link
• Specialize step (e.g., IMR -> IMR-SA)
As user creates a workflow, intermediate stages result in possibly incorrect workflows
ErrorScan algorithm detects errors and generates possible fixes
• Knowledge base that represents components and constraints
• Formal definitions of desirable properties of workflows
Fixes are multi-step and “click-through”
Errors and fixes are ranked using heuristics
If no errors detected, workflow is guaranteed to be correct

ErrorScan algorithm [Kim et al, IUI-04] [Spraragen et al, 04]
Input: Workflow W <C, L, I, G>
Output: list of errors and corresponding fix suggestions
I. If W is not purposeful, return Error.
   Suggestions: define end result e using types from the KB, AddEndResult(e).
II. For each Component C in W:
   a. If C is not Justified, return Error.
      Suggestions: ∀ p that is output-parameter(c), find components cj in the workflow or the KB that have pj as input-parameter(cj), and subsumes(pj, p), AddLink(c, p, cj, pj)
   b. If C is not grounded, return Error.
      Suggestions: ∀ Cj ∈ FindDirectSubtypes(c), SpecializeComponent(C, Cj).
   c. For each i in input-parameter(c): If i is not Satisfied, return Error.
      Suggestions: ∀ cj ∈ C with output parameter pj such that subsumes(range(c,i), range(cj,pj)), AddLink(cj, pj, c, i).
      Suggestions: ∀ cj ∈ FindMatchingOutput(i), AddLink(cj, pj, c, i).
      Suggestion: AddAndLinkComponent(W, AddInitialInput(i), range(i), c, i)
III. For each Link L in W:
   a. If L is not Consistent, return Error.
      Suggestions: ∀ Ci ∈ FindInterPosingComponent(L), InterposeComponent(Ci, L).
      Suggestion: RemoveLink(L).
   b. If L is Redundant, return Error.
      Suggestion: RemoveLink(L).

CAT: Composition Analysis Tool to Create Workflow Templates
User builds a workflow specification from library of models
Declarative descriptions of models are linked to ontologies and reasoners
System reasons about model constraints and points out errors and fixes
System guarantees correctness of workflow templates

The Process of Creating an Executable Workflow
1. Creating a valid workflow template -- reuse, pedigree
• Selecting application components and connecting inputs and outputs
• Adding other steps for data conversions/transformations
2. Creating instantiated workflow
• Providing input data to pathway inputs (logical assignments)
3.
Creating executable workflow
• Given requirements of each model, find and assign adequate resources for each model
• Select physical locations for logical names
• Include data movement steps, including data deposition steps

Integration of Scientific Data Sources in Dynamic Distributed Environments
Common challenges:
• Data (often large sets) is distributed and replicated
• Data is often stored separately from its models
• Models evolve (even shared ones)
• Integrating data with different semantics, stored with different models
New challenges:
• Users need to define their own attributes on the fly as they store new kinds of data
• Data sources come and go at any time

Artemis: Integrating Distributed Info Sources on the Grid (work with E. Deelman, S. Thakkar, R. Tuchinda) [Tuchinda et al, IAAI-04]
[Figure: the user works through a Query Wizard (entity selection, filters). A Dynamic Model Generator builds models and model mappings from an ontology and from the Metadata Catalog Services. The Prometheus query mediator and Theseus query execution run queries against several Metadata Catalog Services, each in front of a data source.]

Robust Integration of Data Sources: Some Implications for Semantic Representations
Semantic models of the data sources are not predefined
• Metadata Catalog Services (MCS) [Deelman et al 03]
– New attributes can be defined on the fly as new data is stored
– As a result, different replicas may contain additional attributes
Semantic models for mediator are not predefined
• Dynamic Model Generation [Tuchinda et al 04]
– Obtain from current (and on-line) MCSs updated attributes
– Create model that is appropriate for query mediator [Knoblock et al 02]
Query language does not have fixed set of terms
• Interactive query wizard [Tuchinda et al 04]
– Guides users to formulate queries based on current model
Many more challenges remain, e.g.,
• Execution monitoring via service state => build on Theseus [Knoblock et al 03]
• Heterogeneous catalog services => meta-mediators for services?
• Customizable query languages, i.e., terms and views

Supporting the Interactive and Incremental Nature of Scientific Exploration (with M. Ellisman, E. Deelman, C. Kesselman)
Workflows cannot always be created in advance
• Experimental design depends on initial / partial results
• Scientific experimentation is often exploratory
Need to support interactive and incremental creation and execution of workflows
Active workflows: represent evolving workflows and are continually authored, refined, executed, and modified

Supporting the Evolution of Active Workflows (I)
Supporting the Evolution of Active Workflows (II)
Supporting the Evolution of Active Workflows (and III)

Knowledge infrastructure for science: Future Directions
Representing scientific knowledge
• Challenges to knowledge representation technology
Proactive acquisition and
scaffolding of knowledge
Contributors of scientific knowledge
• Staged policies for contributors with different skills

Requirements for SCEC Ontology
1. Be a community-wide effort
2. Have community-wide acceptance
3. Be used in practice on a daily basis to compose simulation code and annotate their results

Scientists Ask Lots of Questions, Knowledge Representation Has Few Answers
How do you get started?
How to ensure the community will accept it (use it)?
How do you (can you?) represent alternative views?
What is the process to contribute to it?
What is the process to make changes to it?
What happens when there is an update?
How is it implemented?
How is it managed?
Who does what, when, where, why?

SCEC/GO Workshop on Ontology Development: Lessons Learned and Prospects (Nov’02, Cambridge UK)
SCEC learns from the Gene Ontology (GO) experience [Bada et al, forthcoming]:
• Had a successful jumpstart
• Done by biologists, not knowledge engineers
• Developed by a wide, distributed community
• Focused on specific aspects of genomics
  – Fly-base, yeast, mouse
• Used 24/7 from day 1
• Accepted widely by the community
• Extended based on use requirements of a wide community
• Quite large (13K terms)
• Simple (and messy) representation
• Simple infrastructure
• Process to accommodate changes, curation

Scientific Workflows: Future Directions
Representing scientific knowledge
• Challenges to knowledge representation technology
Proactive acquisition and scaffolding of knowledge
Contributors of scientific knowledge
• Staged policies for contributors with different skills

Proactive Acquisition of Knowledge through Analogy [Chklovski 03]

Formalization Aids through Natural Language
Processing [Chklovski 04]
Automatic entity detection in concise statements
Formalization can be very weak and yet useful

IKRAFT [Gil & Ratnakar 02]
1) Start with free text document
2) Formulate concise statement
3) Formalize Statement (e.g., in RDF)
RESULT: FORMALIZATION GROUNDED IN ORIGINAL DOCUMENT

Scientific Workflows: Future Directions
Representing scientific knowledge
• Challenges to knowledge representation technology
Proactive acquisition and scaffolding of knowledge
Contributors of scientific knowledge
• Staged policies for contributors with different skills

Some Policies for Organizing Contributions
Curated by knowledge engineers: processes changes requested by users
• http://www.ecocyc.org
Curated by domain experts: group of domain curators processes changes requested by users
• http://www.geneontology.org
Open contributions: any user can add content
• http://www.dmoz.org, http://www.openmind.org
Open editing: any user can edit and create any page on a web site.
• http://wiki.org

Comparing Policies for Organizing Contributions
Curated by knowledge engineers
(+) supports inference and ensures consistency
(-) does not scale, not clear community buy-in
Curated by domain experts
(+) ensures consistency and community buy-in
(-) scale and content size are limited by resources
Open contributions
(+) engages massive amounts of contributors
(-) one-shot content creation
Open editing
(+) enables massive content and updates
(-) no assurance of consistency, validity (or inference)

Staged Policies for Multi-User Contributions
Process/content/user/policy relations at different stages of knowledge entry process:
1. Initial stage of broad knowledge entry provides large amount of content by broad range of users in open editing format
2. Content structuring by selected set of users adopts open contributions
3. Pockets of expertise maintained by domain curators
4. Application-oriented pockets developed by knowledge engineers

Staged Policies
[Figure: stages 1 to 4, illustrated with sample markup fragments such as "<subclassOf foton …"]

A Knowledge Infrastructure for Science
[Figure: spectrum of representations across the stages]
• One end: richer representations, more ambiguous, more versatile
• Other end: more formal, more concrete, more mechanizable

“As We May Think”
“Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them […]. The lawyer has at his touch the associated opinions and decisions of his whole experience, and of the experience of friends and authorities. The patent attorney has on call the millions of issued patents, with familiar trails to every point of his client's interest. […] The chemist, struggling with the synthesis of an organic compound, has all the chemical literature before him in his laboratory, with trails following the analogies of compounds, and side trails to their physical and chemical behavior. […] There is a new profession of trail blazers, those who find delight in the task of establishing useful trails through the enormous mass of the common record.
The inheritance from the master becomes, not only his additions to the world's record, but for his disciples the entire scaffolding by which [their additions] were erected.”
--- Vannevar Bush, 1945
http://www.theatlantic.com/unbound/flashbks/computer/bushf.htm

Summary: Scientific Workflows and AI
Clear requirement to operate in complex, human-guided, dynamic decision space
Need to support scientific exploration process
Tremendous opportunity for AI techniques: flexible and expressive representations and reasoners
Work to date demonstrates leap forward
• Pegasus can isolate users from complexities of the grid
Many opportunities ahead for AI!
• Cognitive grids
• Interactive assistance and automatic completion
• Active workflows
• …

Summary: Knowledge Infrastructure for Science
State-of-the-art AI techniques need to be complemented with significant investment in novel directions
• Self-assessment and proactive acquisition of new knowledge
• Scaffolding formal knowledge into its sources
• Integration with natural language processing
• Contributors with different skills

Thank you!
Scientific workflows
• pegasus.isi.edu
• www.isi.edu/ikcap/cat
Cognitive grids
• www.isi.edu/ikcap/cognitive-grids
AI and science
• IEEE Intelligent Systems Jan/Feb 2004, De Roure, Gil, Hendler (Eds), Special issue on e-Science
• Panel at 2004 IAAI/AAAI conference
www.isi.edu/~gil
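
Backup sketch: the three "Creating executable workflow" steps (assign adequate resources, select physical locations for logical names, add data movement and deposition steps) might be illustrated roughly as below. This is a minimal sketch under stated assumptions, not how Pegasus is implemented: the `Task` class, the `make_executable` function, the dictionary-based resource and replica catalogs, the site names, and the file names are all hypothetical, and tasks are assumed to arrive already ordered by dependency.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Task:
    """One abstract workflow step: a model plus its logical input/output names."""
    name: str
    inputs: List[str]
    outputs: List[str]
    min_memory_gb: int = 1  # hypothetical stand-in for "requirements of each model"


def make_executable(tasks: List[Task],
                    resources: Dict[str, int],
                    replicas: Dict[str, List[str]],
                    archive_site: str) -> List[Tuple]:
    """Turn abstract tasks into a flat list of concrete steps.

    resources: {site: available memory in GB}       (hypothetical resource catalog)
    replicas:  {logical name: [sites holding copy]} (hypothetical replica catalog)
    """
    steps: List[Tuple] = []
    # Work on a copy so the caller's replica catalog is not modified.
    location = {lfn: list(sites) for lfn, sites in replicas.items()}
    for task in tasks:
        # Step 1: find and assign an adequate resource for the model.
        adequate = [s for s, mem in sorted(resources.items())
                    if mem >= task.min_memory_gb]
        if not adequate:
            raise ValueError(f"no adequate resource for {task.name}")
        site = adequate[0]
        # Step 2: select a physical location for each logical input name.
        # Step 3: add a data movement step if the data is not at the chosen site.
        for lfn in task.inputs:
            sites = location.get(lfn)
            if not sites:
                raise KeyError(f"no known replica of {lfn}")
            if site not in sites:
                steps.append(("transfer", lfn, sites[0], site))
                sites.append(site)
        steps.append(("run", task.name, site))
        # Record where outputs now live and add a data deposition step.
        for lfn in task.outputs:
            location.setdefault(lfn, []).append(site)
            if site != archive_site:
                steps.append(("deposit", lfn, site, archive_site))
                location[lfn].append(archive_site)
    return steps


tasks = [
    Task("rupture-model", ["fault.params"], ["ruptures.rfml"], min_memory_gb=2),
    Task("hazard-curve", ["ruptures.rfml", "site.vs30"], ["hazard.curve"],
         min_memory_gb=8),
]
resources = {"siteA": 4, "siteB": 16}
replicas = {"fault.params": ["siteB"], "site.vs30": ["siteB"]}
plan = make_executable(tasks, resources, replicas, archive_site="siteB")
for step in plan:
    print(step)
```

The sketch picks the first adequate site in alphabetical order, which is only a placeholder for a real resource-selection policy; a planner would weigh load, data locality, and cost instead.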