Artificial Intelligence and Large-Scope Science: Workflow Planning and Beyond Yolanda Gil

Artificial Intelligence
and Large-Scope Science:
Workflow Planning and Beyond
Yolanda Gil
USC/Information Sciences Institute
gil@isi.edu
www.isi.edu/~gil
In collaboration with others in the Intelligent Systems Division and the Center for Grid
Technologies at USC/ISI including:
Ewa Deelman, Carl Kesselman, Jim Blythe
Supported in part by NSF’s GriPhyN and SCEC/CME projects,
and by internal grants from USC/ISI
USC INFORMATION SCIENCES INSTITUTE
Yolanda Gil
1
Outline

Motivation
  • Large-scope large-scale science
  • Challenges and opportunities for Artificial Intelligence

Research on workflow planning at USC/ISI
  • Using AI techniques in Pegasus to generate executable grid workflows

Future directions in support of scientific workflows
  • Intelligent interactive assistance and automatic completion
  • Active workflows
  • Cognitive grids

Knowledge infrastructure for science
  • Challenges in Community-Based Knowledge Capture and Representation
The Southern California Earthquake Center’s
Community Modeling Environment (SCEC-CME)
(http://iowa.usc.edu/cmeportal/)
Integrating Diverse Models of Complex
Phenomena…
[Figure: kinds of models being integrated: historic records, fault models, fault ruptures, wave propagation, site response models, effect on structures]
…for Broader Use

Geophysicists, civil and structural engineers, city planners, emergency managers, …
  • Analyze seismic hazard
  • Learn and understand seismic hazard

Of course, scientists need this infrastructure as well!
Not Just Large-Scale and HPC Issues:
Large-Scope Science and Engineering Research

“Whereas large-scale means increasing the resolution of
the solution to a fixed physical model problem, large-scope
means increasing the physical complexity of the
model itself. Increasing the scope involves adding more
physical realism to the simulation, making the actual
code more complex and heterogeneous, while keeping
the resolution more or less constant.”
-- Report from ACM Workshop on Strategic Directions in Computing
Research, A. Sameh et al. on Computational Science and
Engineering, June 1996
How This is Done Today

Scientists:
  • Verbal communication needed to compose models
  • When an earthquake occurs, hard to respond quickly

Other users (e.g., building engineers):
  • Use models based on correlations of historical data
  • Employ consultants who know how to set up these models
  • Delay in accessing state-of-the-art scientific models
Scientific Workflows

Models composed into end-to-end scientific workflows that model/analyze complex physical phenomena
  • In-silico experimentation
  • Data collection and analysis

Reproducibility, reusability, pedigree

[Figure: example workflow whose task result is a hazard curve (SA vs. prob. exc.). Components shown: UTM Converter (get-Lat-Long-given-UTM); PEER-Fault; CVM-getVelocity-at-point; Basin-Depth Calculator; Field (2000) IMR: SA exc. prob.; Hazard Curve Calculator: SA vs. prob. exc. Parameter and data labels shown: UTM; Lat, Long; Gaussian Dist; No Truncation; Total Moment Rate; Duration-Year; Fault-Grid-Spacing; Rupture Offset; Mag-Length-sigma; Dip; Rake; Magnitude (min), (max), (mean); Ruptures (rfml); Velocity; Basin-Depth; Site VS30; Site Basin-Depth-2.5; SA Period; Gaussian Truncation; Std. Dev. Type; SA exc. probs. (rfml).]
Executing Scientific Workflows on Grids

Grids support this process through middleware services:
  • Seamless integration and management of resources (OGSA)
  • Job submission (Condor)
  • Resource Monitoring and Directory Service (MDS)
  • Replica Location Service (RLS)
  • Metadata Catalog Services (MCS)

From [Kesselman 04]:
[Figure: resources (R) with resource managers (RM), security services, and policy services, linked by discovery and access. Callouts:]
  • Many sources of data, services, computation
  • Registries organize services of interest to a community
  • Security & policy must underlie access & management decisions
  • Resource management is needed to ensure progress & arbitrate competing demands
  • Data integration activities may require access to, & exploration/analysis of, data at many locations
  • Exploration & analysis may involve complex, multi-step workflows
Application Development and Execution Process
[Figure: stages from the application domain to the execution environment]
Application Domain
  • Application Component Selection yields an Abstract Workflow (e.g., "FFT filea")
  • Resource Selection, Data Replica Selection, and Transformation Instance Selection yield a Concrete Workflow (e.g., DataTransfer: "transfer filea from host1://home/filea to host2://home/file1", then "/usr/local/bin/fft /home/file1")
  • Execution Environment (host1, host2, Data)
Failure Recovery Method
  • Retry
  • Pick different Resources
  • Specify a Different Workflow
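The step from abstract to concrete workflow in the slide above can be illustrated with a minimal sketch. It is not Pegasus code: the names AbstractTask and concretize, the dictionary-based replica and executable catalogs, and the staged_path parameter are all assumptions made for illustration. The example reuses the slide's FFT task and file names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractTask:
    """Hypothetical abstract workflow node: a logical component and a logical file."""
    transformation: str   # logical component name, e.g. "FFT"
    logical_input: str    # logical file name, e.g. "filea"


def concretize(task, replicas, executables, target_host, staged_path):
    """Return the ordered concrete steps for one abstract task.

    replicas:    logical file name -> list of (host, path) copies (assumed catalog)
    executables: (transformation, host) -> path of the installed binary (assumed catalog)
    staged_path: where the input is placed on target_host if it must be transferred
    """
    exe = executables[(task.transformation, target_host)]
    copies = replicas[task.logical_input]

    # If a replica already sits on the target host, no transfer step is needed.
    for host, path in copies:
        if host == target_host:
            return [f"{exe} {path}"]

    # Otherwise stage the first known replica in, then run the component.
    src_host, src_path = copies[0]
    return [
        f"transfer {task.logical_input} from {src_host}:/{src_path} "
        f"to {target_host}:/{staged_path}",
        f"{exe} {staged_path}",
    ]


steps = concretize(
    AbstractTask("FFT", "filea"),
    replicas={"filea": [("host1", "/home/filea")]},
    executables={("FFT", "host2"): "/usr/local/bin/fft"},
    target_host="host2",
    staged_path="/home/file1",
)
for step in steps:
    print(step)
```

Choosing a different target_host or a different replica changes only the concrete steps; the abstract task stays the same, which is the separation the slide draws.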
Challenges

Complexity: Many choices are involved as workflow is composed
  • Alternative application components, files, and locations
  • Many different interdependencies may occur among components
  • May reach many dead ends

Usability: Users should not need to be aware of infrastructure details
  • Files are distributed, indexed, replicated
  • Match application requirements to host capabilities

Solution cost: Evaluate the alternative solution costs
  • Performance
  • Reliability
  • Resource Usage

Global cost: minimizing cost across organizations
  • Individual user’s choices in light of other users’ choices

Reliability of execution: job resubmission upon failure
  • Detection, diagnosis, repair
  • Anticipation and avoidance, resource reservations
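Job resubmission upon failure, together with the "Retry" and "Pick different Resources" recovery options, can be sketched as a simple escalation loop. This is an illustrative sketch only: run_with_recovery, the submit callback, and the fake_submit stand-in are hypothetical and do not correspond to any Condor or Globus API.

```python
def run_with_recovery(job, hosts, submit, max_retries=2):
    """Try each host in order, retrying up to max_retries extra times per host.

    submit(job, host) is a caller-supplied function returning True on success.
    Returns (host, log) on success, where log records every attempt;
    raises RuntimeError if the job fails on every host.
    """
    log = []
    for host in hosts:                          # escalation: pick a different resource
        for attempt in range(1 + max_retries):  # first level: retry on the same resource
            ok = submit(job, host)
            log.append((host, attempt, ok))
            if ok:
                return host, log
    raise RuntimeError(f"job {job!r} failed on all hosts")


# Stand-in for a real submission service: host1 always fails,
# host2 succeeds on its second attempt.
calls = {"host2": 0}


def fake_submit(job, host):
    if host == "host1":
        return False
    calls["host2"] += 1
    return calls["host2"] >= 2


chosen_host, attempts = run_with_recovery(
    "fft", ["host1", "host2"], fake_submit, max_retries=1
)
print(chosen_host, attempts)
```

The third recovery option on the slide, specifying a different workflow, would sit one level above this loop and is not modeled here.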
Challenges and opportunities for Artificial
Intelligence

We need alternative foundations that offer
  • expressive representations to capture the complex knowledge involved in both the application domain and the execution environment
  • flexible reasoners to explore this complex space systematically and incorporate constraints, tradeoffs, policies

Many Artificial Intelligence (AI) techniques are relevant:
  – Planning to achieve given requirements
  – Searching through problem spaces of related choices
  – Using and combining heuristics
  – Reasoners that can incorporate rules, definitions, axioms, etc.
  – Schedulers and resource allocation techniques
  – Coordination and communication in distributed problem solving
  – Expressive knowledge representation languages
  – Reasoning under uncertainty
  – Dynamic replanning and reactive control
  – Learning in complex dynamic environments
  – Learning to improve problem solving skills
Outline

Motivation
  • Scientific workflows
  • Challenges and opportunities for Artificial Intelligence

Research on workflow planning at USC/ISI
  • Using AI techniques in Pegasus to generate executable grid workflows

Future directions in support of scientific workflows
  • Intelligent interactive assistance and automatic completion
  • Active workflows
  • Cognitive grids

Knowledge infrastructure for science
  • Challenges in Community-Based Knowledge Capture and Representation
Reasoning about Distributed Execution Infrastructure in Grids
with Pegasus (work with J. Blythe, E. Deelman, C. Kesselman, and others)
[Gil et al, IEEE IS 04]
[Figure: Pegasus architecture. Components shown: Virtual Data Language; Chimera; Abstract Workflow; Request Manager; Workflow Planning (Workflow Reduction, Workflow Generation); Data Management; Replica and Resource Selector; Data Publication; Concrete Workflow; Submission and Monitoring System; workflow executor (DAGMan); Execution; Grid; Raw data. Information sources shown: Globus Monitoring and Discovery Service; Globus Replica Location Service; Transformation Catalog. Arrow labels: information; dynamic information; monitoring; tasks; replica location; available resources; information and models.]
Pegasus: Using AI Planning Techniques to
Generate Executable Grid Workflows

Given: desired result and constraints
  • A desired result (high-level description of data product)
  • A set of application components described in the grid
  • A set of resources in the grid (dynamic, distributed)
  • A set of constraints and preferences on solution quality

Find: an executable job workflow
  • A configuration of components that generates the desired result
  • A specification of resources where components can be executed and data can be stored
  • A specification of data sources and data movements

Approach: Use AI planning techniques to search the solution space and evaluate tradeoffs
  • Exploit heuristics to direct the search for solutions and represent optimality and policy criteria
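The given/find formulation above can be made concrete with a small backward-chaining search. This is a sketch under stated assumptions, not the planner used in Pegasus: the Component record, the plan function, the per-host cost table, and the component and data names in the example are all hypothetical. It assumes every component lists at least one host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical description of an application component."""
    name: str
    inputs: tuple   # logical data products consumed
    output: str     # logical data product produced
    hosts: tuple    # hosts where the component is installed (assumed non-empty)


def plan(goal, components, available, host_cost, visiting=frozenset()):
    """Return (cost, steps) for the cheapest plan producing `goal`, or None.

    available: set of data products that already exist
    host_cost: host -> estimated cost of running one job there
    steps:     list of (component name, host) in execution order
    """
    if goal in available:
        return 0, []
    if goal in visiting:              # cyclic dependency: treat as a dead end
        return None
    best = None
    for comp in components:
        if comp.output != goal:
            continue
        cost, steps = 0, []
        for needed in comp.inputs:
            sub = plan(needed, components, available, host_cost, visiting | {goal})
            if sub is None:
                break                 # dead end: backtrack to the next component
            cost += sub[0]
            steps += sub[1]
        else:
            # All inputs can be produced; pick the cheapest host for this component.
            host = min(comp.hosts, key=host_cost.__getitem__)
            candidate = (cost + host_cost[host], steps + [(comp.name, host)])
            if best is None or candidate[0] < best[0]:
                best = candidate
    return best


components = [
    Component("rupture-model", (), "ruptures", ("host1",)),
    Component("imr-sa", ("ruptures", "site-vs30"), "sa-exc-prob", ("host1", "host2")),
    Component("hazard-curve-calc", ("sa-exc-prob",), "hazard-curve", ("host2",)),
    Component("hazard-curve-alt", ("missing-input",), "hazard-curve", ("host1",)),
]
result = plan(
    "hazard-curve",
    components,
    available={"site-vs30"},
    host_cost={"host1": 5, "host2": 2},
)
print(result)
```

The host cost table stands in for the "constraints and preferences on solution quality"; a fuller treatment would also cost data movements and avoid double-counting shared intermediate products, which this sketch does not attempt.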
Advantages of Using AI Planning

• Provides a broad-based, generic foundation
• Uses general techniques to search for solutions
• Explores alternatives, supports backtracking
• Incorporates domain-specific and domain-independent heuristics (as search control rules)
• Allows easy addition of new constraints and rules
• Incorporates optimality and policy into the search for solutions
• Interleaves decisions at various levels
• Can integrate the generation of workflows across users and policies within virtual orgs.
Reasoning about Workflows in Pegasus
[Figure: reasoning from the desired results to the final workflow over data processing tasks labeled a through i. Key: the original node; input transfer node; registration node; output transfer node; unnecessary nodes.]
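One way to read the "unnecessary nodes" in the figure is that a task can be dropped when the data it would produce already exists. The sketch below illustrates that reading; it is an assumption about the figure, not a description of the Pegasus implementation, and the function name, data structures, and task names are hypothetical. Raw inputs are assumed to be listed in `existing`.

```python
def reduce_workflow(producers, requested, existing):
    """Return the set of tasks that still need to run.

    producers: data product -> (task name, tuple of input data products)
    requested: iterable of desired data products
    existing:  set of data products already materialized somewhere
    """
    needed = set()
    seen = set()
    stack = list(requested)
    while stack:
        product = stack.pop()
        if product in seen or product in existing:
            continue                 # already handled, or can be fetched instead of recomputed
        seen.add(product)
        task, inputs = producers[product]
        needed.add(task)
        stack.extend(inputs)         # walk back towards the raw data
    return needed


producers = {
    "d_b": ("task_b", ("raw",)),
    "d_c": ("task_c", ("d_b",)),
    "d_f": ("task_f", ("d_c",)),
}
full = reduce_workflow(producers, ["d_f"], existing={"raw"})
reduced = reduce_workflow(producers, ["d_f"], existing={"raw", "d_c"})
print(sorted(full), sorted(reduced))
```

With the intermediate product d_c already available, task_b and task_c drop out and only task_f remains, plus whatever transfer and registration nodes a real planner would add around it.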
Pegasus Application Domains
(work with E. Deelman and dozens of scientists)

Pulsar search for gravitational-wave physics (LIGO)
  • 975 tasks, 1365 data transfers, 975 output files, 96 hrs runtime

Galaxy morphology for NVO and NASA in Montage

Tomography for neural structure reconstruction

High-energy physics: Compact Muon Solenoid
  • 7 days, 678 jobs, produced ~200 GB

Gene alignment
  • In 24 hours, ~10,000 Grid jobs, >200,000 BLAST executions, produced 50 GB
Small Montage Workflow
~1200 nodes
[Deelman et al, 04]
Artemis: Integrating Distributed Info Sources on
the Grid (work with E. Deelman, S. Thakkar, R. Tuchinda)
[Tuchinda et al, IAAI-04]
[Figure: Artemis architecture. Components shown: Query Wizard (entity selection, user filters); Dynamic Model Generator (models, model mappings, ontology); Prometheus Query Mediator; Theseus query execution; several Metadata Catalog Services and Data Sources.]
Outline

Motivation
  • Scientific workflows
  • Challenges and opportunities for Artificial Intelligence

Research on workflow planning at USC/ISI
  • Using AI techniques in Pegasus to generate executable grid workflows

Future directions in support of scientific workflows
  • Intelligent interactive assistance and automatic completion
  • Active workflows
  • Cognitive grids

Knowledge infrastructure for science
  • Challenges in Community-Based Knowledge Capture and Representation
Scientific Workflows:
Future Directions

Using AI to support the workflow creation process
  • Interactive assistance and automatic completion

Using AI to support the scientific experimentation process
  • Active workflows

Using AI to augment the execution infrastructure
  • Cognitive grids
The Process of Creating an Executable
Workflow
1. Creating a valid workflow template (human guided)
  • Selecting application components and connecting inputs and outputs
  • Adding other steps for data conversions/transformations
2. Creating instantiated workflow
  • Providing input data to pathway inputs (logical assignments)
3. Creating executable workflow (automatically)
  • Given requirements of each model, find and assign adequate resources for each model
  • Select physical locations for logical names
  • Include data movement steps, including data deposition steps

[Side labels on the slide: User guided (top); Automated (bottom)]
Challenges for Interactive Composition of
Valid Workflow Templates

Provide flexible interaction
  • User can start from initial data, from data products, or steps
  • User can specify abstract descriptions of steps and later specialize them
  • User can reuse, merge, or build from scratch

Automatic tracking of workflow constraints
  • User is notified if there are problems but does not have to keep track of details

Proactive assistance
  • System should not just point out problems but help user by suggesting fixes (always)

And… how do we define what “valid” means?
Assisting Users in Creating Workflow
Templates (with J. Kim and M. Spraragen)
[Kim et al, IUI-04] [Spraragen et al, 04]

User interaction results in modifications to workflows
  • Specify desired result, external/user provided input
  • Add/remove step, add/remove link
  • Specialize step (e.g., IMR -> IMR-SA)

As user creates a workflow, intermediate stages result in possibly incorrect workflows

ErrorScan algorithm detects errors and generates possible fixes
  • Knowledge base that represents components and constraints
  • Formal definitions of desirable properties of workflows based on AI planning techniques

Fixes are multi-step and “click-through”

Errors and fixes are ranked using heuristics

If no errors detected, workflow is guaranteed to be correct
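The idea of detecting an error and proposing fixes can be illustrated with one simple check: every input of every step must be linked to a producer or supplied by the user. The sketch below is not the ErrorScan algorithm, whose details are in the cited papers; the scan function, its data structures, and the lowercase data type names are hypothetical. The step names are borrowed from earlier slides.

```python
def scan(steps, links, user_inputs, catalog):
    """Return a list of (error, suggested fixes) for unsatisfied step inputs.

    steps:       step name -> tuple of required input data types
    links:       set of (step name, input data type) pairs already connected
    user_inputs: set of (step name, input data type) pairs the user will supply
    catalog:     component name -> output data type, for components that could be added
    """
    findings = []
    for step in sorted(steps):
        for needed in steps[step]:
            if (step, needed) in links or (step, needed) in user_inputs:
                continue
            # Suggest every catalog component able to produce the missing input,
            # then the fallback of declaring the input as user-provided.
            fixes = [
                f"add step {name} and link it to {step}"
                for name, out in sorted(catalog.items())
                if out == needed
            ]
            fixes.append(f"mark {needed} as user-provided input of {step}")
            findings.append((f"{step}: input {needed} is not satisfied", fixes))
    return findings


steps = {"IMR-SA": ("ruptures", "site-VS30")}
catalog = {"PEER-Fault": "ruptures", "UTM-Converter": "lat-long"}
user_inputs = {("IMR-SA", "site-VS30")}

findings = scan(steps, set(), user_inputs, catalog)
after_fix = scan(steps, {("IMR-SA", "ruptures")}, user_inputs, catalog)
print(findings)
print(after_fix)
```

An empty result here only means this single check passed. The correctness guarantee stated on the slide rests on the formal definitions of desirable workflow properties, which this sketch does not reproduce.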
Scientific Workflows:
Future Directions

Using AI to support the workflow creation process
  • Interactive assistance and automatic completion

Using AI to support the scientific experimentation process
  • Active workflows

Using AI to augment the execution infrastructure
  • Cognitive grids
Supporting the Interactive and Incremental Nature of
Scientific Exploration (with M. Ellisman, E. Deelman, C. Kesselman)

Workflows cannot always be created in advance
  • Experimental design depends on initial / partial results
  • Scientific experimentation is often exploratory

Need to support interactive and incremental creation and execution of workflows

Active workflows: represent evolving workflows and are continually authored, refined, executed, and modified
Supporting the Evolution of Active Workflows
(I)
Supporting the Evolution of Active Workflows
(II)
Supporting the Evolution of Active Workflows
(and III)
Scientific Workflows:
Future Directions

Using AI to support the workflow creation process
  • Interactive assistance and automatic completion

Using AI to support the scientific experimentation process
  • Active workflows

Using AI to augment the execution infrastructure
  • Cognitive grids
Pervasive Knowledge Sources and Reasoners
(work with J. Blythe, E. Deelman, C. Kesselman, H. Tangmurarunkit)
[Gil et al, IEEE IS 04]
[Figure: a Workflow Manager surrounded by reasoners and knowledge sources]
Input: high-level specification of desired results, constraints, requirements, user policies
Workflow Manager
  • Smart Workflow Pool
  • Workflow history
Intelligent Reasoners
  • Policy Management
  • Workflow Refinement
  • Resource Matching
  • Workflow Repair
Pervasive Knowledge Sources
  • Resource KB
  • Application KB
  • Policy KB
  • Other KB
Grid services
  • Resource Indexes
  • Replica Locators
  • Simulation codes
  • Policy Information Services
  • Other Grid services
Community Distributed Resources (e.g., computers, storage, network, simulation codes, data)
Cognitive Grids: Pervasive Semantic
Representations of the Environment at all Levels
[Figure: layered architecture, top to bottom]
  • Users and Applications
  • Intelligent Reasoners (matchmaking, refinement, repair, coordination, negotiation…)
  • Higher-Level Service (Virtual Data Tools, Resource Brokers)
  • Basic Grid Middleware (Globus Toolkit, Condor-G, DAGMan)
  • Grid Resources (Compute, Data, Network)
Knowledge and information labels shown around the layers: User and VO policy models; Application Component Models; Semantics for File-based data; High-level Request descriptions; Current Request Status, Results, Provenance Information; Refined Workflow; Policy Knowledgebases; Provenance and Monitoring; Resource Knowledgebases; Tasks; Monitoring, Resources knowledge; Resource Policy Descriptions; Semantic Resource Descriptions
Cognitive Grids: Distributed Intelligent Reasoners
that Incrementally Generate the Workflow
[Figure: a user’s request refined over time through levels of abstraction]
Levels of abstraction: application-level knowledge; relevant components; full abstract workflow; logical tasks; tasks bound to resources and sent for execution
Reasoners shown: Workflow refinement; Policy reasoner; Workflow repair; Onto-based Matchmaker
Time axis: not yet executed; partial execution; executed
Many Opportunities for AI Techniques
The Grid Now

Syntax-based matchmaking of resources to job requirements
  • Condor matchmaker
  • Attribute based discovery and selection
Scheduling of jobs based on
  • Explicit mappings from task to jobs, simple job brokers
Grid-able users that specify job execution sequences and computing requirements
  • Scripting languages
  • Workflow languages, Task graphs
Explicit service negotiation and recovery strategies

The Future Grid

Knowledge-based reasoning about resources enables
  • Semantic matchmaking
  • Aggregate resource reasoning
Task-level reasoning to plan and schedule jobs and resources
  • More agility and coordination
Wide range of users can specify high level requirements in a mixed-initiative mode
  • Mapping of high-level requirements to details required for execution
End-to-end resource negotiation and adaptive strategies to accommodate failure
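The contrast between syntax-based and semantic matchmaking can be shown with a toy example. This sketch does not use Condor ClassAds or any real ontology language: the SUBCLASS_OF table, the attribute names, and both matching functions are hypothetical, and the class hierarchy is assumed to be acyclic.

```python
# Hypothetical, hand-written class hierarchy standing in for an ontology.
SUBCLASS_OF = {
    "Linux": "Unix",
    "Solaris": "Unix",
    "Unix": "OperatingSystem",
}


def is_a(concept, ancestor, subclass_of):
    """True if `concept` equals `ancestor` or is one of its descendants."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = subclass_of.get(concept)
    return False


def syntactic_match(request, resource):
    """Attribute-by-attribute equality, in the spirit of attribute-based matching."""
    return all(resource.get(key) == value for key, value in request.items())


def semantic_match(request, resource, subclass_of):
    """Match when each resource value is the requested class or a subclass of it."""
    return all(
        key in resource and is_a(resource[key], value, subclass_of)
        for key, value in request.items()
    )


request = {"os": "Unix"}
linux_host = {"os": "Linux", "host": "host2"}
windows_host = {"os": "Windows", "host": "host3"}

syntactic_result = syntactic_match(request, linux_host)
semantic_result = semantic_match(request, linux_host, SUBCLASS_OF)
semantic_miss = semantic_match(request, windows_host, SUBCLASS_OF)
print(syntactic_result, semantic_result, semantic_miss)
```

The syntactic matcher rejects the Linux host because the strings differ, while the semantic matcher accepts it because the hierarchy records Linux as a kind of Unix.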
Outline

Motivation
  • Scientific workflows
  • Challenges and opportunities for Artificial Intelligence

Research on workflow planning at USC/ISI
  • Using AI techniques in Pegasus to generate executable grid workflows

Future research in support of scientific workflows
  • Intelligent interactive assistance and automatic completion
  • Active workflows
  • Cognitive grids

Knowledge infrastructure for science
  • Challenges in Community-Based Knowledge Capture and Representation
Knowledge Infrastructure for Science: Challenges in
Community-Based Knowledge Capture & Representation
1. be a community-wide effort
2. have community-wide acceptance
3. be used in practice on a daily basis to compose simulation code and annotate their results
Scientists Ask Lots of Questions, Knowledge
Representation has few Answers

• How do you get started?
• How to ensure the community will accept it (use it)?
• How do you (can you?) represent alternative views?
• What is the process to contribute to it?
• What is the process to make changes to it?
• What is the impact to my application when there is an update?
• How is it implemented?
• How is it managed?
• Who does what, when, where, why?
SCEC/GO Workshop on Ontology Development:
Lessons Learned and Prospects [Bada et al, forthcoming]

SCEC learns from the Gene Ontology (GO) experience (Workshop Nov’02, Cambridge UK):
  • Had a successful jumpstart
  • Done by biologists, not knowledge engineers
  • Developed by a wide, distributed community
  • Focused on specific aspects of genomics
    – Fly-base, yeast, mouse
  • Used 24/7 from day 1
  • Accepted widely by the community
  • Extended based on use requirements of a wide community
  • Quite large (13K terms)
  • Simple (and messy) representation
  • Simple infrastructure
  • Process to accommodate changes, curation
Some Policies for Organizing Contributions

Curated by knowledge engineers: processes changes requested by users
  • http://www.ecocyc.org

Curated by domain experts: group of domain curators processes changes requested by users
  • http://www.geneontology.org

Open contributions: any user can add content
  • http://www.dmoz.org, http://www.openmind.org

Open editing: any user can edit and create any page on a web site.
  • http://wiki.org
Broad Range of Contributors of Scientific Knowledge
(with T. Chklovski)
[Figure: spectrum of contributors of scientific knowledge; a formal-markup fragment is shown at one end: "<subclassOf foton …"]
One end: more inexpensive, more inaccurate, more ambiguous, deeper into society/impact
Other end: more expensive, more accurate, more concrete, deeper into the science
Thank you!

Scientific workflows
  • pegasus.isi.edu

Cognitive grids
  • www.isi.edu/ikcap/cognitive-grids

AI and science
  • IEEE Intelligent Systems Jan/Feb 2004, De Roure, Gil, Hendler (Eds), Special issue on e-Science

www.isi.edu/~gil
“As We May Think”
“Wholly new forms of encyclopedias will appear, ready made with a
mesh of associative trails running through them […]. The lawyer has
at his touch the associated opinions and decisions of his whole
experience, and of the experience of friends and authorities. The
patent attorney has on call the millions of issued patents, with
familiar trails to every point of his client's interest. […] The chemist,
struggling with the synthesis of an organic compound, has all the
chemical literature before him in his laboratory, with trails following
the analogies of compounds, and side trails to their physical and
chemical behavior. […]
There is a new profession of trail blazers, those who find delight in the
task of establishing useful trails through the enormous mass of the
common record. The inheritance from the master becomes, not only
his additions to the world's record, but for his disciples the entire
scaffolding by which [their additions] were erected.”
--- Vannevar Bush, 1945
http://www.theatlantic.com/unbound/flashbks/computer/bushf.htm
Searching for Pulsars with the Pegasus
Planner

• Used AI planning techniques to compose executable grid workflows with hundreds of jobs
• Applied to data from the Laser Interferometer Gravitational-Wave Observatory (LIGO), which aims to detect waves predicted by Einstein’s theory of relativity
• Used LIGO’s data collected during the first scientific run of the instruments in Fall 2002
• Targeted a set of 1000 locations of known pulsars as well as random locations in the sky
• Performed using compute and storage resources at Caltech, University of Southern California, and University of Wisconsin Milwaukee