Artificial Intelligence and Cyberinfrastructure: Workflow Planning and Beyond Yolanda Gil

Artificial Intelligence and
Cyberinfrastructure:
Workflow Planning and Beyond
Yolanda Gil
USC/Information Sciences Institute
gil@isi.edu
www.isi.edu/~gil
In collaboration with others in the Intelligent Systems Division and the Center for Grid
Technologies at USC/ISI including:
Ewa Deelman, Carl Kesselman, Jim Blythe
Supported in part by NSF’s GriPhyN and SCEC/CME projects,
and by internal grants from USC/ISI
USC INFORMATION SCIENCES INSTITUTE
Yolanda Gil
1
Interactive Knowledge Capture @ USC/ISI
http://www.isi.edu/ikcap

Research focus: Acquiring knowledge from end users within a problem solving context or task

Previous and ongoing work: User-centered knowledge capture techniques, including:
• knowledge gaps and interdependencies [EXPECT, KANAL, CALO-AT]
• model-based acquisition wizards [ETM, CONSTABLE]
• visualization for knowledge elicitation [VEIL]
• incorporating instructional and tutoring principles [SLICK]
• from informal to formal representations [ACE]

New directions: distributed knowledge capture and problem solving
• Deriving structure from large collections of semi-structured k [TRELLIS]
• Distributed acquisition of knowledge for communities of practice [IKRAFT]
• Distributed problem solving in computational grids [PEGASUS, CAT]

The Southern California Earthquake Center’s
Community Modeling Environment (SCEC-CME)
(http://iowa.usc.edu/cmeportal/)
Outline

Motivation
• Scientific workflows
• Challenges and opportunities for Artificial Intelligence
Research on workflow planning at USC/ISI
• Using AI techniques in Pegasus to generate executable grid workflows
Future directions in support of scientific workflows
• Cognitive grids
• Intelligent interactive assistance and automatic completion
• Active workflows
Knowledge infrastructure for science

Integrating Diverse Models of Complex Phenomena…
[Figure: models to be integrated: historic records, fault models, fault ruptures, wave propagation, site response models, effect on structures]

…for Broader Use

Geophysicists, civil and structural engineers, city planners, emergency managers, …
• Analyze seismic hazard
• Learn and understand seismic hazard
Of course, scientists need this infrastructure as well!

How This is Done Today

Scientists:
• Verbal communication needed to compose models
• When an earthquake occurs, hard to respond quickly
Other users (e.g., building engineers):
• Use models based on correlations of historical data
• Employ consultants that know how to set up these models
• Delay in accessing state-of-the-art scientific models

Scientific Workflows

Models composed into end-to-end scientific workflows that model/analyze complex physical phenomena
• In-silico experimentation
• Data collection and analysis
Reproducibility, reusability, pedigree
[Figure: example seismic hazard workflow whose task result is a hazard curve (SA vs. prob. exc.). Components shown: UTM Converter (get-Lat-Long-given-UTM), PEER-Fault, CVM-get-Velocity-at-point, Basin-Depth Calculator, Field (2000) IMR: SA exc. prob., Hazard Curve Calculator: SA vs. prob. exc. Data and parameters shown: UTM, Lat., Long., Gaussian Dist, No Truncation, Total Moment Rate, Duration-Year, Fault-Grid-Spacing, Rupture Offset, Mag-Length-sigma, Dip, Rake, Magnitude (min, max, mean), Ruptures (rfml), Velocity, Basin-Depth, Site VS30, Site Basin-Depth-2.5, SA Period, Gaussian Truncation, Std. Dev. Type, SA exc. probs.]

Executing Scientific Workflows on Grids

Grids support this process through middleware services:
• Seamless integration and management of resources (OGSA)
• Job submission (Condor)
• Resource Monitoring and Directory Service (MDS)
• Replica Location Service (RLS)
• Metadata Catalog Services (MCS)
From [Kesselman 04]:
[Figure: resources (R) with resource managers (RM), registries, discovery, access, security services, and policy services, annotated as follows]
• Many sources of data, services, computation
• Registries organize services of interest to a community
• Security & policy must underlie access & management decisions
• Resource management is needed to ensure progress & arbitrate competing demands
• Data integration activities may require access to, & exploration/analysis of, data at many locations
• Exploration & analysis may involve complex, multi-step workflows

Application Development and Execution Process
[Figure: from the application domain, application component selection yields an abstract workflow (e.g., FFT filea). Resource selection, data replica selection, and transformation instance selection yield a concrete workflow (e.g., a data transfer of filea from host1://home/filea to host2://home/file1, followed by /usr/local/bin/fft /home/file1 on host2), which runs in the execution environment. Failure recovery methods shown: retry, pick different resources, specify a different workflow.]

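The step from abstract to concrete workflow in the FFT example can be sketched in a few lines of Python. This is only an illustrative sketch: the `AbstractJob` class, the `REPLICAS` and `EXECUTABLES` tables, and the `concretize` function are hypothetical stand-ins for replica and transformation lookups, not part of any grid middleware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractJob:
    transformation: str  # logical transformation name, e.g. "FFT"
    input_file: str      # logical file name, e.g. "filea"


# Hypothetical catalogs: where replicas live, and where each code is installed.
REPLICAS = {"filea": ["host1"]}
EXECUTABLES = {"FFT": {"host2": "/usr/local/bin/fft"}}


def concretize(job, replicas, executables):
    """Return concrete steps: an optional data transfer, then the execution."""
    sites = executables[job.transformation]
    holders = replicas[job.input_file]
    # Prefer a host that already holds the data, so no transfer is needed.
    local = [h for h in holders if h in sites]
    host = local[0] if local else sorted(sites)[0]
    steps = []
    if host not in holders:
        steps.append(("transfer", job.input_file, holders[0], host))
    steps.append(("execute", sites[host], job.input_file, host))
    return steps


plan = concretize(AbstractJob("FFT", "filea"), REPLICAS, EXECUTABLES)
for step in plan:
    print(step)
```

Here the only host that can run the FFT code does not hold the file, so the sketch emits a transfer step before the execution step, mirroring the example on the slide.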
How Scientists Develop Workflows

Scientists have high level requirements naturally stated in terms of the application domain
• Ex: Obtain frequency spectrum for signal S in instrument I and timeframe T
These requirements can be achieved by formulating workflows
Workflows are often complex in terms of size and HPC requirements (grid)
So, scientists must be well trained on high performance/distributed computing
First, they have to turn these requirements into executable job workflows in detailed scripts
• They must figure out which code generates desired products, which files contain it, physical location of the files, hosts that support execution given code requirements, availability of hosts, access policies, etc.
• They have to be able to query grid middleware: metadata catalog, replica locator, resource descriptor and monitoring, etc.
They must also oversee execution
• Diagnose failures (code, memory, network, resource, etc) and design recovery strategies (replace resource, rearrange data, replace code, etc)

Challenges

Usability: Users should not need to be aware of infrastructure details
• Files are distributed, indexed, replicated
• Match application requirements to host capabilities
Complexity: Many choices are involved as workflow is composed
• Alternative application components, files, and locations
• Many different interdependencies may occur among components
• May reach many dead ends
Solution cost: Evaluate the alternative solution costs
• Performance
• Reliability
• Resource Usage
Global cost: minimizing cost across organizations
• Individual user’s choices in light of other users’ choices
Reliability of execution: job resubmission upon failure
• Detection, diagnosis, repair
• Anticipation and avoidance, resource reservations

Not Just Large-Scale and HPC Issues:
Large-Scope Science and Engineering Research

“Whereas large-scale means increasing the resolution of the solution to a fixed physical model problem, large-scope means increasing the physical complexity of the model itself. Increasing the scope involves adding more physical realism to the simulation, making the actual code more complex and heterogeneous, while keeping the resolution more or less constant.”
-- Report from ACM Workshop on Strategic Directions in Computing Research, A. Sameh et al on Computational Science and Engineering, June 1996

Challenges Revisited
[Margin labels on the slide group these challenges under LARGE SCOPE and LARGE SCALE]

Usability: Users should not need to be aware of infrastructure details
• Files are distributed, indexed, replicated
• Match application requirements to host capabilities
Complexity: Many choices are involved as workflow is composed
• Alternative application components and versions may be available
• Many different interdependencies and domain-specific constraints may occur among components
• May reach many dead ends
Solution cost: Evaluate the alternative solution costs
• Performance
• Reliability
• Resource Usage
Global cost: minimizing cost across organizations
• Individual user’s choices in light of other users’ choices
Reliability of execution: job resubmission upon failure
• Detection, diagnosis, repair
• Anticipation and avoidance, resource reservations

Ongoing Work in Grids

GRaDS
• High-level languages and flexible compiler technology
GriPhyN
• Virtual data concept, Chimera
Others: asymmetric matchmaking, OGSA, etc.
All are limited because they rely on programmatic approaches and impoverished schemas that lack the flexibility and expressivity required by the dynamics and scale of scientific applications

Challenges and opportunities for Artificial Intelligence

We need alternative foundations that offer
• expressive representations to capture the complex knowledge involved in both the application domain and the execution environment
• flexible reasoners to explore this complex space systematically and incorporate constraints, tradeoffs, policies
Many Artificial Intelligence (AI) techniques are relevant:
– Planning to achieve given requirements
– Searching through problem spaces of related choices
– Using and combining heuristics
– Reasoners that can incorporate rules, definitions, axioms, etc.
– Schedulers and resource allocation techniques
– Coordination and communication in distributed problem solving
– Expressive knowledge representation languages
– Reasoning under uncertainty
– Dynamic replanning and reactive control
– Learning in complex dynamic environments
– Learning to improve problem solving skills

Reasoning about Distributed Execution Infrastructure in Grids with Pegasus (work with J. Blythe, E. Deelman, C. Kesselman, and others)
[Figure: Pegasus architecture. Components shown: Virtual Data Language, Chimera, Abstract Workflow, Request Manager, Workflow Planning (Workflow Reduction, Workflow Generation, Replica and Resource Selector), Data Management, Data Publication, Concrete Workflow, Submission and Monitoring System, workflow executor (DAGMan), Execution, Grid. Information sources shown: Globus Monitoring and Discovery Service (available resources), Globus Replica Location Service (replica location), Transformation Catalog, dynamic information, information and models, monitoring, tasks, raw data.]

Pegasus: Using AI Planning Techniques to Generate Executable Grid Workflows

Given: desired result and constraints
• A desired result (high-level, metadata description)
• A set of application components described in the grid
• A set of resources in the grid (dynamic, distributed)
• A set of constraints and preferences on solution quality
Find: an executable job workflow
• A configuration of components that generates the desired result
• A specification of resources where components can be executed and data can be stored
Approach: Use AI planning techniques to search the solution space and evaluate tradeoffs
• Exploit heuristics to direct the search for solutions and represent optimality and policy criteria

Workflow Generation as AI Planning

Goal (Provided by the user)
• A metadata specification of the information the user requires and the desired location for the output file
Initial State (Automatically extracted from Grid environment)
• Information about the state of the Grid, information about data location
Operators (Encoded for the application domain)
• Represent the execution of a component at a particular location and the generation of particular file(s)
• File movements across the network
Heuristics as search control rules (Grid or application specific)
• Specify options that should be exclusively considered at any choice point in the search algorithm (e.g., execute “close” to the data)

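The goal / initial state / operators formulation can be illustrated with a toy planner. This is a minimal sketch under stated assumptions: it uses plain breadth-first forward search rather than the planner used in Pegasus, and the `HOSTS` list, the `COMPONENTS` table, and the file and host names are all hypothetical.

```python
from collections import deque

HOSTS = ["host1", "host2"]
# Hypothetical component table: name -> (input file, output file, hosts that can run it)
COMPONENTS = {"fft": ("signal", "spectrum", ["host2"])}


def successors(state):
    """Yield (action, next_state) for every operator applicable in state."""
    # Operator 1: execute a component on a host that holds its input file.
    for name, (inp, out, hosts) in COMPONENTS.items():
        for h in hosts:
            if ("at", inp, h) in state:
                yield ("run", name, h), state | {("at", out, h)}
    # Operator 2: move a file across the network.
    for (_, f, src) in list(state):
        for dst in HOSTS:
            if dst != src and ("at", f, dst) not in state:
                yield ("transfer", f, src, dst), state | {("at", f, dst)}


def plan(initial, goal):
    """Breadth-first search for a state that contains every goal fact."""
    start = frozenset(initial)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for action, nxt in successors(state):
            nxt = frozenset(nxt)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [action]))
    return None  # no workflow achieves the goal


# Initial state: the raw signal is on host1. Goal: the spectrum delivered to host1.
steps = plan({("at", "signal", "host1")}, {("at", "spectrum", "host1")})
for s in steps:
    print(s)
```

Because only host2 can run the component in this toy setup, the resulting workflow moves the input to host2, runs the component there, and moves the output back to the requested location.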
Advantages of Using AI Planning

Provides broad-based, generic foundation
Uses general techniques to search for solutions
Explores alternatives, supports backtracking
Incorporates domain-specific and domain-independent heuristics (as search control rules)
Allows easy addition of new constraints and rules
Incorporates optimality and policy into the search for solutions
Interleaves decisions at various levels
Can integrate the generation of workflows across users and policies within virtual orgs.

Example operator

(operator pulsar-search
  (preconds
    ((<host> (or Condor-pool Mpi))
     (<file> File-Handle)
     (<start-time> Number)
     (<channel> Channel)
     (<fcenter> Number)
     (<right-ascension> Number)
     (<sample-rate> Number)
     …
     (<f0> (and Number (get-low-freq-from-center-and-band
                         <fcenter> <fband>)))
     (<fN> (and Number (get-high-freq-from-center-and-band
                         <fcenter> <fband>)))
     (<run-time> (and Number
                      (estimate-pulsar-search-run-time
                        <start-time> <end-time> <sample-rate>
                        <f0> <fN> <host> <run-time>))))
    (and (available pulsar-search <host>)
         (forall ((<sub-sft-file-group>
                   (and File-Group-Handle
                        (gen-sub-sft-range-for-pulsar-search
                          <f0> <fN> <start-time> <end-time>
                          <sub-sft-file-group>))))
                 (and (sub-sft-group <start-time> <end-time>
                                     <channel> <instrument> <format>
                                     <f0> <fN> <sample-rate> <sub-sft-file-group>)
                      (at <sub-sft-file-group> <host>)))))
  (effects
    ()
    ((add (created <file>))
     (add (at <file> <host>))
     (add (pulsar <start-time> <end-time> <channel>
                  <instrument> <format>
                  <fcenter> <fband>
                  <fderv1>
                  <fderv2> <fderv3> <fderv4> <fderv5>
                  <right-ascension> <declination> <sample-rate>
                  <file>)))))

Search Control Rules

Grid-specific:
(control-rule only-transfer-from-loc-with-greatest-bandwidth
  (if (and (current-ops (transfer-file))
           (current-goal (at ?file ?dest))
           (true-in-state (at ?file ?loc1))
           (true-in-state (at ?file ?loc2))
           (higher-bandwidth ?loc1 ?loc2 ?dest)))
  (then reject bindings ((?from-loc ?loc2))))

Domain-specific:
(control-rule prefer-mpi-to-condor-for-pulsar-search
  (if (and (current-ops (pulsar-search))
           (type-of ?mpi Mpi)
           (type-of ?condor Condor-pool)))
  (then prefer bindings ((?host ?mpi)) ((?host ?condor))))

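The two kinds of control rules shown here, a reject rule and a prefer rule, can be read as filters and orderings over candidate bindings. The Python sketch below is an illustration only: the function names, the `bandwidth` table, and the `host_type` table are hypothetical and do not reproduce the planner's rule language.

```python
def reject_lower_bandwidth(candidates, bandwidth, dest):
    """Reject rule: keep only source locations with the greatest bandwidth to dest."""
    best = max(bandwidth[(loc, dest)] for loc in candidates)
    return [loc for loc in candidates if bandwidth[(loc, dest)] == best]


def prefer_mpi(hosts, host_type):
    """Prefer rule: try Mpi hosts before Condor-pool hosts, discarding neither."""
    # sorted() is stable, so hosts of the same type keep their original order.
    return sorted(hosts, key=lambda h: host_type[h] != "Mpi")


# Hypothetical grid information for illustration.
bandwidth = {("locA", "dest"): 10, ("locB", "dest"): 100}
host_type = {"pool1": "Condor-pool", "cluster1": "Mpi"}

sources = reject_lower_bandwidth(["locA", "locB"], bandwidth, "dest")
ordered = prefer_mpi(["pool1", "cluster1"], host_type)
print(sources, ordered)
```

The design difference matters for search: a reject rule prunes alternatives outright, while a prefer rule only changes the order in which they are tried, so the planner can still backtrack to the less preferred option.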
Reasoning about Workflows in Pegasus
[Figure: a workflow of data processing tasks with nodes a through i, shown from the desired results to the final workflow. Key: the original node, input transfer node, registration node, output transfer node, unnecessary nodes.]

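One way to read the "unnecessary nodes" in the key is as jobs whose outputs already exist and therefore need not be rerun. The sketch below assumes that reading; the `JOBS` table, the file names, and the `reduce_workflow` function are hypothetical illustrations rather than the Pegasus implementation.

```python
# Hypothetical jobs: name -> (input files, output files)
JOBS = {
    "a": (["raw"], ["f1"]),
    "b": (["f1"], ["f2"]),
    "c": (["f2"], ["result"]),
}


def reduce_workflow(jobs, requested, existing):
    """Return the jobs still needed, walking back from the requested files."""
    producer = {out: name for name, (_, outs) in jobs.items() for out in outs}
    needed, pending = set(), list(requested)
    while pending:
        f = pending.pop()
        if f in existing or f not in producer:
            continue  # already materialized, or a raw input with no producer
        job = producer[f]
        if job not in needed:
            needed.add(job)
            pending.extend(jobs[job][0])  # the job's inputs must be available too
    return needed


# f2 already exists somewhere on the grid, so jobs a and b become unnecessary.
needed = reduce_workflow(JOBS, ["result"], existing={"raw", "f2"})
print(sorted(needed))
```

In a fuller treatment the remaining jobs would then gain input transfer, output transfer, and registration nodes, as the key suggests.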
Searching for Pulsars with the Pegasus Planner

Used AI planning techniques to compose executable grid workflows with hundreds of jobs
Applied to data from the Laser Interferometer Gravitational-Wave Observatory (LIGO), which aims to detect waves predicted by Einstein’s theory of relativity
Used LIGO’s data collected during the first scientific run of the instruments in Fall 2002
Targeted a set of 1000 locations of known pulsars as well as random locations in the sky
Performed using compute and storage resources at Caltech, University of Southern California, and University of Wisconsin Milwaukee.

Sample Pulsar Search Results

SC 2002:
Over 58 pulsar searches
Total of
• 330 tasks
• 469 data transfers
• 330 output files produced.
The total runtime was 11:24:35.

Fall 2002:
185 pulsar searches
Total of
• 975 tasks
• 1365 data transfers
• 975 output files
Total runtime 96:49:47

Pegasus Application Domains
(work with E. Deelman and dozens of scientists)

Pulsar search for gravitational-wave physics (LIGO)
Galaxy morphology for NVO and NASA in Montage
Tomography for neural structure reconstruction
High-energy physics – Compact Muon Solenoid
• 7 days, 678 jobs, produced ~200 GB
Gene alignment
• In 24 hours, ~10,000 Grid jobs, >200,000 BLAST executions, produced 50 GB

Small Montage Workflow
~1200 nodes
[Deelman et al, 04]
Related Work

Improving grids with algorithmic approaches
• GRaDS, GriPhyN (Chimera)
Improving grids with knowledge/semantics
• myGrid (semantic component matching)
• Semantic grid, Knowledge grid
Planning techniques for software and service composition
• [Lansky et al 94] [Chien et al 96] [Golden et al 02] [McDermott 02] [McIlraith et al 02]

Pegasus: Status and Ongoing Work

Fully automated generation of executable grid workflows
Heuristic state-space search AI planner
• Prodigy [Veloso et al 94]
• Expressive language for control rules and heuristic estimation
Integration with grid environment
• Initially application and resource information populated manually
• Work almost completed to do so automatically
Exploring tradeoffs and optimization
• Current heuristics address minimal execution time
• Adding criteria for resource and replica selection
If components are (well) described, AI planner can select application components and generate the entire workflow from scratch

pegasus.isi.edu

Publications in AI forums
• “The Role of Planning in Grid Computing”, Jim Blythe, Ewa Deelman, Yolanda Gil, Carl Kesselman, Amit Agarwal, Gaurang Mehta, Karan Vahi. International Conference on Automated Planning and Scheduling (ICAPS) 2003.
• “Transparent Grid Computing: a Knowledge-Based Approach”, Jim Blythe, Ewa Deelman, Yolanda Gil, Carl Kesselman. Innovative Applications of Artificial Intelligence Conference (IAAI) 2003.
• “Artificial Intelligence in Grids: Workflow Planning and Beyond”, Yolanda Gil, Ewa Deelman, Jim Blythe, Carl Kesselman, H. Tangmurarunkit. IEEE Intelligent Systems, Jan/Feb 2004.
Publications in Grid forums
• “Mapping Abstract Complex Workflows onto Grid Environments”, Ewa Deelman, Jim Blythe, Yolanda Gil, Carl Kesselman, Gaurang Mehta, Karan Vahi, Adam Arbree, Richard Cavanaugh, Kent Blackburn, Albert Lazzarini, Scott Koranda. Journal of Grid Computing, Vol. 1 No. 1, 2003.
• “Workflow Management in GriPhyN”, chapter in the book “The Grid Resource Management”, E. Deelman, J. Blythe, Y. Gil, Carl Kesselman. 2003.

Scientific Workflows: Future Directions

Using AI to augment the execution infrastructure
• Cognitive grids
Using AI to support the workflow creation process
• Interactive assistance and automatic completion
Using AI to support the scientific experimentation process
• Active workflows

Pervasive Knowledge Sources and Reasoners
(work with J. Blythe, E. Deelman, C. Kesselman, H. Tangmurarunkit)
[Gil et al, IEEE IS 04]
[Figure: a Workflow Manager receives a high-level specification of desired results, constraints, requirements, and user policies, and maintains a Smart Workflow Pool. Intelligent Reasoners shown: Workflow Refinement, Policy Management, Resource Matching, Workflow Repair. Pervasive Knowledge Sources shown: Resource KB, Application KB, Policy KB, Workflow History, Other KB. Community Distributed Resources (e.g., computers, storage, network, simulation codes, data) are reached through resource indexes, replica locators, simulation codes, policy information services, and other Grid services.]

Cognitive Grids: Pervasive Semantic Representations of the Environment at all Levels
[Figure: layered architecture, top to bottom: Users and Applications (high-level request descriptions; current request status, results, provenance information); Intelligent Reasoners (matchmaking, refinement, repair, coordination, negotiation…) with the refined workflow, Policy Knowledgebases, Resource Knowledgebases, and provenance and monitoring; Higher-Level Services (Virtual Data Tools, Resource Brokers); Basic Grid Middleware (Globus Toolkit, Condor-G, DAGMan); Grid Resources (Compute, Data, Network). Semantic representations shown: user and VO policy models, application component models, semantics for file-based data, resource policy descriptions, semantic resource descriptions, tasks, monitoring and resources knowledge.]

Cognitive Grids: Distributed Intelligent Reasoners that Incrementally Generate the Workflow
[Figure: over time, a user’s request is refined through levels of abstraction: relevant components, full abstract workflow, logical tasks, and tasks bound to resources and sent for execution. Reasoners shown: workflow refinement, policy reasoner, workflow repair, onto-based matchmaker, drawing on application-level knowledge. Workflow status labels: executed, partial execution, not yet executed.]

Many Opportunities for AI Techniques

The Grid Now
Syntax-based matchmaking of resources to job requirements
• Condor matchmaker
• Attribute based discovery and selection
Grid-able users that specify job execution sequences and computing requirements
• Scripting languages
• Workflow languages, task graphs
Scheduling of jobs based on explicit mappings from task to jobs, simple job brokers
Explicit service negotiation and recovery strategies

The Future Grid
Knowledge-based reasoning about resources enables
• Semantic matchmaking
• Aggregate resource reasoning
Wide range of users can specify high level requirements in a mixed-initiative mode
• Mapping of high-level requirements to details required for execution
Task-level reasoning to plan and schedule jobs and resources
More agility and coordination
• End-to-end resource negotiation and adaptive strategies to accommodate failure

The Process of Creating an Executable Workflow

User guided:
1. Creating a valid workflow template (human guided)
• Selecting application components and connecting inputs and outputs
• Adding other steps for data conversions/transformations
2. Creating instantiated workflow
• Providing input data to pathway inputs (logical assignments)
Automated:
3. Creating executable workflow (automatically)
• Given requirements of each model, find and assign adequate resources for each model
• Select physical locations for logical names
• Include data movement steps, including data deposition steps

Challenges for Interactive Composition of Valid Workflow Templates

Provide flexible interaction
• User can start from initial data, from data products, or steps
• User can specify abstract descriptions of steps and later specialize them
• User can reuse, merge, or build from scratch
Automatic tracking of workflow constraints
• User is notified if there are problems but does not have to keep track of details
Proactive assistance
• System should not just point out problems but help user by suggesting fixes (always)
And… how do we define what “valid” means?

Desirable Properties of Workflow Templates Based on AI Planning Formalisms (with J. Kim and M. Spraragen)

Satisfied iff the sources of input parameters for all components are specified
• A parameter p ∈ input-parameters(c) is satisfied iff ∃ a link <co,po,ci,pi> ∈ L s.t. pi = p
Grounded iff each component has a unique assignment to an executable component
• A workflow template <C, L, I, G> is grounded iff ∀ c ∈ C, grounded(c)
Purposeful iff the workflow template specifies at least one end result
• A workflow template <C, L, I, G> is purposeful iff G ≠ Ø.
Complete iff satisfied, purposeful, and grounded
Acyclic iff no loops
• A workflow template <C, L, I, G> is acyclic iff ∀ c ∈ C, c is not Linked to c.
Justified iff all components contribute to the end results
• A component c ∈ C is justified iff c ∈ G or ∃ c2 ∈ G where c is Linked to c2.
Parsimonious iff there are no redundant links or components
• A Link l <co,po,ci,pi> ∈ L is redundant iff ∃ link l2 <co’,po’,ci’,pi’> ∈ L s.t. l ≠ l2 and co = co’ and po’ = po and ci = ci’ and pi = pi’.
Well-Formed iff acyclic, justified, and parsimonious
Consistent iff all links satisfy defined component requirements and constraints
• A Link <c1,p1,c2,p2> is type-consistent iff subtype-of(range(c1,p1),range(c2,p2))
• A Link <c1,p1,c2,p2> is semantically-consistent iff subsumes(range(c1,p1),range(c2,p2))
Correct iff complete, well-formed, and consistent

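Several of these definitions translate directly into checks over the tuple <C, L, I, G>. The following Python sketch is one possible encoding and not the authors' implementation: the `Template` class and function names are hypothetical, G is assumed to hold the components that produce end results, I is assumed to hold user-provided inputs, and "Linked to" is assumed to mean reachable by following links downstream.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    components: set                              # C: component names
    links: list                                  # L: tuples (co, po, ci, pi)
    inputs: dict = field(default_factory=dict)   # I: assumed user-provided inputs
    goals: set = field(default_factory=set)      # G: assumed end-result components


def is_purposeful(t):
    """Purposeful iff G is not empty."""
    return len(t.goals) > 0


def linked_to(t, c):
    """All components reachable from c by following links downstream."""
    reached, frontier = set(), [c]
    while frontier:
        cur = frontier.pop()
        for co, _, ci, _ in t.links:
            if co == cur and ci not in reached:
                reached.add(ci)
                frontier.append(ci)
    return reached


def is_acyclic(t):
    """Acyclic iff no component is Linked to itself."""
    return all(c not in linked_to(t, c) for c in t.components)


def is_justified(t, c):
    """Justified iff c is an end result or is Linked to one."""
    return c in t.goals or bool(linked_to(t, c) & t.goals)


def is_satisfied(t, c, p):
    """Input parameter p of c is satisfied iff some link ends at (c, p)."""
    return any(ci == c and pi == p for _, _, ci, pi in t.links)


def redundant_links(t):
    """Links that repeat an earlier link with the same four fields."""
    seen, dups = set(), []
    for link in t.links:
        if link in seen:
            dups.append(link)
        seen.add(link)
    return dups


t = Template({"a", "b", "c"},
             [("a", "out", "b", "in"), ("a", "out", "b", "in")],
             goals={"b"})
print(is_purposeful(t), is_acyclic(t), is_justified(t, "c"), redundant_links(t))
```

Groundedness and consistency are left out because they depend on a knowledge base of executable components and parameter types, which this sketch does not model.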
Assisting Users in Creating Workflow Templates (with J. Kim and M. Spraragen)
[Kim et al, IUI-04] [Spraragen et al, 04]

User interaction results in modifications to workflows
• Specify desired result, external/user provided input
• Add/remove step, add/remove link
• Specialize step (e.g., IMR -> IMR-SA)
As user creates a workflow, intermediate stages result in possibly incorrect workflows
ErrorScan algorithm detects errors and generates possible fixes
• Knowledge base that represents components and constraints
• Formal definitions of desirable properties of workflows
Fixes are multi-step and “click-through”
Errors and fixes are ranked using heuristics
If no errors detected, workflow is guaranteed to be correct

ErrorScan algorithm (with J. Kim and M. Spraragen)
[Kim et al, IUI-04] [Spraragen et al, 04]

Input: Workflow W <C,L,I,G>
Output: list of errors and corresponding fix suggestions
I. If W is not purposeful, return Error.
   Suggestions: define end result e using types from the KB, AddEndResult(e).
II. For each Component c in W:
   a. If c is not Justified, return Error.
      Suggestions: ∀ p that is output-parameter(c), find components cj in the workflow or the KB that have pj as input-parameter(cj), and subsumes(pj,p), AddLink(c,p,cj,pj).
   b. If c is not grounded, return Error.
      Suggestions: ∀ cj ∈ FindDirectSubtypes(c), SpecializeComponent(c, cj).
   c. For each i in input-parameter(c):
      1. If i is not Satisfied, return Error.
         Suggestions: ∀ cj ∈ C with output parameter pj such that subsumes(range(c,i),range(cj,pj)), AddLink(cj,pj,c,i).
         Suggestions: ∀ cj ∈ FindMatchingOutput(i), AddLink(cj,pj,c,i).
         Suggestion: AddAndLinkComponent(W, AddInitialInput(i), range(i), c, i).
III. For each Link l in W:
   a. If l is not Consistent, return Error.
      Suggestions: ∀ ci ∈ FindInterPosingComponent(l), InterposeComponent(ci, l).
      Suggestion: RemoveLink(l).
   b. If l is Redundant, return Error.
      Suggestion: RemoveLink(l).

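A reduced version of the scan can be sketched in Python to show how each detected error is paired with fix suggestions. This is an illustration under assumptions, not the published algorithm: it covers only the purposeful, justified, satisfied, and redundant checks, collects all errors instead of returning at the first one, omits the grounded and consistency checks because they need a knowledge base, and renders suggestions as plain strings. The function signature is hypothetical.

```python
def error_scan(components, links, goals, input_params):
    """Return a list of (error, suggestions) pairs.

    components: set of names; links: list of (co, po, ci, pi);
    goals: set of components producing end results;
    input_params: dict mapping a component to its input parameter names.
    """
    errors = []
    if not goals:  # step I: purposeful
        errors.append(("not purposeful", ["AddEndResult(e)"]))

    downstream = {c: {ci for co, _, ci, _ in links if co == c} for c in components}

    def reaches_goal(c, visited=()):
        if c in goals:
            return True
        return any(reaches_goal(n, visited + (c,))
                   for n in downstream.get(c, ())
                   if n not in visited and n != c)

    for c in sorted(components):  # step II
        if not reaches_goal(c):   # II.a: justified
            errors.append((f"{c} not justified", [f"AddLink({c}, p, cj, pj)"]))
        for i in input_params.get(c, []):  # II.c: satisfied
            if not any(ci == c and pi == i for _, _, ci, pi in links):
                errors.append((f"{c}.{i} not satisfied",
                               [f"AddLink(cj, pj, {c}, {i})",
                                f"AddInitialInput({i})"]))

    seen = set()
    for link in links:  # step III.b: redundant
        if link in seen:
            errors.append((f"{link} redundant", [f"RemoveLink{link}"]))
        seen.add(link)
    return errors


report = error_scan({"a", "b"}, [("a", "out", "b", "in")], {"b"},
                    {"b": ["in", "cfg"]})
for error, fixes in report:
    print(error, "->", fixes)
```

An empty report from this reduced scan does not guarantee correctness in the sense of the slides, since the omitted checks would still have to pass.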
CAT: Composition Analysis Tool to Create Workflow Templates
[Screenshot with the following annotations]
• Declarative descriptions of models are linked to ontologies and reasoners
• User builds a workflow specification from library of models
• System reasons about model constraints and points out errors and fixes
• System guarantees correctness of workflow templates

The Process of Creating an Executable Workflow

1. Creating a valid workflow template -- reuse, pedigree
• Selecting application components and connecting inputs and outputs
• Adding other steps for data conversions/transformations
2. Creating instantiated workflow
• Providing input data to pathway inputs (logical assignments)
3. Creating executable workflow
• Given requirements of each model, find and assign adequate resources for each model
• Select physical locations for logical names
• Include data movement steps, including data deposition steps

Integration of Scientific Data Sources in Dynamic Distributed Environments

Common challenges:
• Data (often large sets) is distributed and replicated
• Data is often stored separately from its models
• Models evolve (even shared ones)
• Integrating data with different semantics, stored with different models
New challenges:
• Users need to define their own attributes on the fly as they store new kinds of data
• Data sources come and go at any time

Artemis: Integrating Distributed Info Sources on the Grid (work with E. Deelman, S. Thakkar, R. Tuchinda)
[Tuchinda et al, IAAI-04]
[Figure: Artemis architecture. Components shown: User, Query Wizard (entity selection, filters), Dynamic Model Generator, Models, Model mappings, Ontology, Prometheus Query Mediator, Theseus query execution, and several Metadata Catalog Services, each paired with a Data Source.]

Robust Integration of Data Sources: Some Implications for Semantic Representations

Semantic models of the data sources are not predefined
• Metadata Catalog Services (MCS) [Deelman et al 03]
– New attributes can be defined on the fly as new data is stored
– As a result, different replicas may contain additional attributes
Semantic models for mediator are not predefined
• Dynamic Model Generation [Tuchinda et al 04]
– Obtain from current (and on-line) MCSs updated attributes
– Create model that is appropriate for query mediator [Knoblock et al 02]
Query language does not have fixed set of terms
• Interactive query wizard [Tuchinda et al 04]
– Guides users to formulate queries based on current model
Many more challenges remain, e.g.,
• Execution monitoring via service state => build on Theseus [Knoblock et al 03]
• Heterogeneous catalog services => meta-mediators for services?
• Customizable query languages, i.e., terms and views

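The idea of generating the mediator's model from whatever attributes the catalogs currently advertise can be illustrated with a short Python sketch. The `build_model` function and the catalog contents are hypothetical; real metadata catalog queries are not shown.

```python
def build_model(catalogs):
    """Merge the attribute sets currently advertised by each on-line catalog."""
    model = {}
    for name, attrs in catalogs.items():
        for attr, typ in attrs.items():
            entry = model.setdefault(attr, {"type": typ, "sources": []})
            entry["sources"].append(name)
    return model


# Hypothetical snapshot: the second catalog has gained a user-defined attribute.
catalogs = {
    "mcs1": {"station": "string"},
    "mcs2": {"station": "string", "magnitude": "float"},
}
model = build_model(catalogs)
for attr, entry in model.items():
    print(attr, entry)
```

Rebuilding the model on each request is what lets a query wizard offer terms that did not exist when the system was deployed, at the cost of querying every catalog that is currently on-line.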
Supporting the Interactive and Incremental Nature of
Scientific Exploration (with M. Ellisman, E. Deelman, C. Kesselman)

Workflows cannot always be created in advance
• Experimental design depends on initial / partial results
• Scientific experimentation is often exploratory

Need to support interactive and incremental creation and execution of workflows

Active workflows: represent evolving workflows and are continually authored, refined, executed, and modified
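One way to picture an active workflow is as a structure that can be extended and revised between partial runs. The sketch below is only an illustration under that reading; the `Step` and `ActiveWorkflow` classes and their methods are hypothetical and are not part of Pegasus or any system named in these slides.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable
    inputs: List[str] = field(default_factory=list)


class ActiveWorkflow:
    """A workflow that can be extended, revised, and run piecemeal."""

    def __init__(self):
        self.steps = {}    # step name -> Step
        self.results = {}  # step name -> output of the last run

    def add_step(self, name, action, inputs=()):
        """Add a new step, or replace an existing one and drop stale results."""
        self.steps[name] = Step(name, action, list(inputs))
        self._invalidate(name)

    def _invalidate(self, name):
        # Drop this step's result and, recursively, results computed from it.
        self.results.pop(name, None)
        for other in self.steps.values():
            if name in other.inputs and other.name in self.results:
                self._invalidate(other.name)

    def run_ready(self):
        """Run every step whose inputs are available; return the names run."""
        executed = []
        progress = True
        while progress:
            progress = False
            for name, step in self.steps.items():
                if name in self.results:
                    continue
                if all(dep in self.results for dep in step.inputs):
                    args = [self.results[dep] for dep in step.inputs]
                    self.results[name] = step.action(*args)
                    executed.append(name)
                    progress = True
        return executed


wf = ActiveWorkflow()
wf.add_step("load", lambda: [3, 1, 2])
first = wf.run_ready()                     # run what exists so far

# After looking at the partial result, author the next step.
wf.add_step("sorted", lambda data: sorted(data), inputs=["load"])
second = wf.run_ready()

# Revise an earlier step; everything downstream is re-run.
wf.add_step("load", lambda: [9, 7])
third = wf.run_ready()
print(first, second, third, wf.results["sorted"])
```

The design choice worth noting is that authoring and execution share one object: results are kept per step, so a revision invalidates only what depends on it.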
Supporting the Evolution of Active Workflows
(I)
Supporting the Evolution of Active Workflows
(II)
Supporting the Evolution of Active Workflows
(and III)
Outline

Motivation
• Scientific workflows
• Challenges and opportunities for Artificial Intelligence

Research on workflow planning at USC/ISI
• Using AI techniques in Pegasus to generate executable grid workflows

Future research in support of scientific workflows
• Cognitive grids
• Intelligent interactive assistance and automatic completion
• Active workflows

Knowledge infrastructure for science
Knowledge infrastructure for science:
Future Directions

Representing scientific knowledge
• Challenges to knowledge representation technology

Proactive acquisition and scaffolding of knowledge

Contributors of scientific knowledge
• Staged policies for contributors with different skills
Requirements for SCEC Ontology
1. be a community-wide effort
2. have community-wide acceptance
3. be used in practice on a daily basis to compose simulation code and annotate their results
Scientists Ask Lots of Questions, Knowledge Representation has few Answers

How do you get started?
How to ensure the community will accept it (use it)?
How do you (can you?) represent alternative views?
What is the process to contribute to it?
What is the process to make changes to it?
What happens when there is an update?
How is it implemented?
How is it managed?
Who does what, when, where, why?
SCEC/GO Workshop on Ontology Development:
Lessons Learned and Prospects (Nov’02, Cambridge UK)

SCEC learns from the Gene Ontology (GO) experience [Bada et al, forthcoming]:
• Had a successful jumpstart
• Done by biologists, not knowledge engineers
• Developed by a wide, distributed community
• Focused on specific aspects of genomics
– Fly-base, yeast, mouse
• Used 24/7 from day 1
• Accepted widely by the community
• Extended based on use requirements of a wide community
• Quite large (13K terms)
• Simple (and messy) representation
• Simple infrastructure
• Process to accommodate changes, curation
Scientific Workflows:
Future Directions

Representing scientific knowledge
• Challenges to knowledge representation technology

Proactive acquisition and scaffolding of knowledge

Contributors of scientific knowledge
• Staged policies for contributors with different skills
Proactive Acquisition of Knowledge through
Analogy [Chklovski 03]
Formalization Aids through Natural Language
Processing [Chklovski 04]
Automatic entity detection in concise statements
Formalization can be very weak and yet useful
IKRAFT
[Gil & Ratnakar 02]
1) Start with free text document
2) Formulate concise statement
3) Formalize statement (e.g., in RDF)
RESULT: FORMALIZATION GROUNDED IN ORIGINAL DOCUMENT
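The three steps above can be sketched as a record that keeps each formal statement tied to the passage it came from. This is not IKRAFT's data model; the `GroundedStatement` class, the sample text, the triple vocabulary, and the `http://example.org/` namespace are all hypothetical, and the RDF is serialized by hand in N-Triples form.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GroundedStatement:
    source_text: str              # step 1: the free text document
    start: int                    # character span of the supporting passage
    end: int
    concise: str                  # step 2: the concise statement
    triple: Tuple[str, str, str]  # step 3: (subject, predicate, object)

    def evidence(self):
        """Return the passage of the original document behind the triple."""
        return self.source_text[self.start:self.end]

    def to_ntriples(self, base="http://example.org/"):
        """Serialize the triple as one N-Triples line (hypothetical namespace)."""
        s, p, o = self.triple
        return f"<{base}{s}> <{base}{p}> <{base}{o}> ."


# Hypothetical document and statement, for illustration only.
document = ("Field notes. The rupture followed the northern fault strand "
            "for ten kilometers.")
phrase = "The rupture followed the northern fault strand"
start = document.index(phrase)

stmt = GroundedStatement(
    source_text=document,
    start=start,
    end=start + len(phrase),
    concise="The rupture followed the northern fault strand.",
    triple=("rupture", "followed", "northern_fault_strand"),
)
print(stmt.to_ntriples())
print(stmt.evidence())
```

Keeping the span alongside the triple is what makes the formalization grounded: any formal statement can be traced back to the text that justified it.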
Scientific Workflows:
Future Directions

Representing scientific knowledge
• Challenges to knowledge representation technology

Proactive acquisition and scaffolding of knowledge

Contributors of scientific knowledge
• Staged policies for contributors with different skills
Some Policies for Organizing Contributions

Curated by knowledge engineers: processes changes requested by users
• http://www.geneontology.org

Curated by domain experts: group of domain curators processes changes requested by users
• http://www.ecocyc.org

Open contributions: any user can add content
• http://www.dmoz.org, http://www.openmind.org

Open editing: any user can edit and create any page on a web site.
• http://wiki.org
Comparing Policies for Organizing
Contributions

Curated by knowledge engineers
(+) supports inference and ensures consistency
(-) does not scale, not clear community buy-in

Curated by domain experts
(+) ensures consistency and community buy-in
(-) scale and content size are limited by resources

Open contributions
(+) engages massive numbers of contributors
(-) one-shot content creation

Open editing
(+) enables massive content and updates
(-) no assurance of consistency, validity (or inference)
Staged Policies for Multi-User Contributions

Process/content/user/policy relations at different stages of knowledge entry process:
1. Initial stage of broad knowledge entry provides large amount of content by broad range of users in open editing format
2. Content structuring by selected set of users adopts open contributions
3. Pockets of expertise maintained by domain curators
4. Application-oriented pockets developed by knowledge engineers
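The four stages can be written down as a small table relating each stage to its policy and to the kind of contributor it relies on. The sketch below is one possible encoding, not something given in the slides: the `Role` enum, the `STAGES` table, and `may_contribute` are hypothetical, and the rule that more specialized roles may also act at earlier stages is an added assumption.

```python
from enum import Enum


class Role(Enum):
    ANY_USER = 1
    SELECTED_USER = 2
    DOMAIN_CURATOR = 3
    KNOWLEDGE_ENGINEER = 4


# stage number -> (activity, policy, least specialized role that may act)
STAGES = {
    1: ("broad knowledge entry", "open editing", Role.ANY_USER),
    2: ("content structuring", "open contributions", Role.SELECTED_USER),
    3: ("pockets of expertise", "domain curation", Role.DOMAIN_CURATOR),
    4: ("application-oriented pockets", "knowledge engineering",
        Role.KNOWLEDGE_ENGINEER),
}


def may_contribute(stage, role):
    """Check whether a role may act at a stage.

    Assumption (not from the slides): roles are ordered, and a more
    specialized role may also contribute at any earlier stage.
    """
    required = STAGES[stage][2]
    return role.value >= required.value


for stage, (activity, policy, role) in STAGES.items():
    print(stage, activity, policy, role.name)
```

For example, `may_contribute(1, Role.ANY_USER)` holds, while `may_contribute(3, Role.ANY_USER)` does not under the ordering assumed here.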
Staged Policies
[Figure: stages 1, 2, 3, and 4 of the staged policies, with markup fragments such as "<subclassOf foton …"]
A Knowledge Infrastructure for Science
[Figure: contrast between two kinds of representation. Labels on one side: richer representations, more ambiguous, more versatile. Labels on the other side: more formal, more concrete, more mechanizable (markup fragments such as "<subclassOf foton …").]
“As We May Think”
“Wholly new forms of encyclopedias will appear, ready made with a
mesh of associative trails running through them […]. The lawyer has
at his touch the associated opinions and decisions of his whole
experience, and of the experience of friends and authorities. The
patent attorney has on call the millions of issued patents, with
familiar trails to every point of his client's interest. […] The chemist,
struggling with the synthesis of an organic compound, has all the
chemical literature before him in his laboratory, with trails following
the analogies of compounds, and side trails to their physical and
chemical behavior. […]
There is a new profession of trail blazers, those who find delight in the
task of establishing useful trails through the enormous mass of the
common record. The inheritance from the master becomes, not only
his additions to the world's record, but for his disciples the entire
scaffolding by which [their additions] were erected.”
--- Vannevar Bush, 1945
http://www.theatlantic.com/unbound/flashbks/computer/bushf.htm
Summary:
Scientific Workflows and AI

Clear requirement to operate in complex, human-guided, dynamic decision space

Need to support scientific exploration process

Tremendous opportunity for AI techniques: flexible and expressive representations and reasoners

Work to date demonstrates leap forward
• Pegasus can isolate users from complexities of the grid

Many opportunities ahead for AI!
• Cognitive grids
• Interactive assistance and automatic completion
• Active workflows
• …
Summary:
Knowledge Infrastructure for Science

State-of-the-art AI techniques need to be complemented with significant investment in novel directions
• Self-assessment and proactive acquisition of new knowledge
• Scaffolding formal knowledge into its sources
• Integration with natural language processing
• Contributors with different skills
Thank you!

Scientific workflows
• pegasus.isi.edu
• www.isi.edu/ikcap/cat

Cognitive grids
• www.isi.edu/ikcap/cognitive-grids

AI and science
• IEEE Intelligent Systems Jan/Feb 2004, De Roure, Gil, Hendler (Eds), Special issue on e-Science
• Panel at 2004 IAAI/AAAI conference

www.isi.edu/~gil