Planning on the Grid
With slides contributed by
Ewa Deelman and Yolanda Gil
Thinking about applications of planning
You’ve seen Planning as X,
X ∈ {SAT, CSP, ILP, …}
Now: Y as Planning
Y ∈ {Grid/Web services composition, …}
USC INFORMATION SCIENCES INSTITUTE
Problem-solving on Grids

Users pool access to distributed resources
(computers, instruments, data, ..)

Applications are often composed of separate
components run at several locations

Grid middleware tools, e.g. the Globus Toolkit, support job scheduling and resource discovery
The Computational Grid
Emerging computational and networking infrastructure
Enable entirely new approaches to applications and problem solving
  remote resources the rule, not the exception
  can solve ever bigger problems
Wide-area distributed computing
  bring together compute resources, data storage systems, instruments, human resources
  national and international
Facilitate collaborative environments
  Sharing of data which can be expensive to produce (experimentation/simulation)
Example: LIGO Experiment
(Laser Interferometer Gravitational-Wave Observatory)
Aims to detect gravitational waves predicted by the general theory of relativity.
Can be used to detect
  binary pulsars
  mergers of black holes
  “starquakes” in neutron stars
Two installations: in Louisiana (Livingston) and Washington State (Hanford)
Other projects: Virgo (Italy), GEO (Germany), Tama (Japan)
Instruments are designed to measure the effect of gravitational waves on test masses suspended in vacuum.
Data collected during experiments is a collection of time series (multi-channel)
Analysis is performed in time and Fourier domains
LIGO’s Pulsar Search
(Laser Interferometer Gravitational-wave Observatory)
[Figure: data flow of the pulsar search. Labeled stages: interferometer; archive of raw channels in long time frames; extract channel (single frame); transpose into short time frames (30 minutes); Short Fourier Transform; extract frequency range; construct time-frequency image (Hz against time); find candidate; store in event DB]
Motivation: Using Today’s Grid
Users have high level requirements naturally stated in terms of the application domain
  Ex: Obtain frequency spectrum for signal S in instrument I and timeframe T
Users have to turn these requirements into executable job workflows in detailed scripts
  Users must figure out which code generates the desired products, which files contain them, the physical location of the files, hosts that support execution given code requirements, availability of hosts, access policies, etc.
  Users must query Grid middleware: metadata catalog, replica locator, resource descriptor and monitoring, etc.
  Users must oversee execution
Problems with today’s Grid
Usability: users must be proficient in grid computing
Complexity: many interrelated choices and dead ends
Solution cost: even any-cost solutions are already hard to find
Global cost: optimization is necessary when there is contention for resources
Reliability of execution: jobs must be resubmitted upon failure
Planning for workflow generation and maintenance
Outline:
Formalization as a planning problem
Integration with the grid middleware
Case study: planning for workflows in LIGO
The grid as a test bed for planning and scheduling research
Application Development and Execution Process
[Figure: stages from application domain to execution. Application component selection in the application domain (e.g. FFT) feeds abstract workflow generation, which yields an abstract workflow (e.g. "FFT filea"). Resource selection, data replica selection, and transformation instance selection feed concrete workflow generation, which yields a concrete workflow (e.g. a data transfer "transfer filea from host1://home/filea to host2://home/file1" followed by "/usr/local/bin/fft /home/file1" on host2). The concrete workflow runs in the execution environment (host1, host2, data). Failure recovery methods: retry, pick different resources, specify a different workflow.]
Desiderata for workflow generator
Allow users to refer to data requirements by descriptions, not file names
  Intuitive, requires far less input
Seek high quality workflows according to variable metric
Model variety of constraints declaratively
  Data dependencies, resource constraints, user access rights, …
Planning for workflow generation and maintenance
Outline:
Formalization as a planning problem
Integration with the grid middleware
Case study: planning for workflows in LIGO
The grid as a test bed for planning and scheduling research
Planning for workflow generation

Application components as operators

Desired data as goals

World state includes available hosts, existing data
products, network bandwidths, …
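The mapping above can be sketched as a tiny forward-search planner. This is a minimal illustration under stated assumptions, not the planner used in the talk: the `Operator` type, the `plan` function, and the two-step pipeline are hypothetical, and the world state is reduced to a set of fact names.

```python
from collections import deque
from typing import NamedTuple


class Operator(NamedTuple):
    name: str
    preconds: frozenset  # facts that must hold before the component runs
    adds: frozenset      # facts (data products) the component creates


def plan(initial, goal, operators):
    """Breadth-first forward search; returns a list of operator names or None."""
    start = frozenset(initial)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for op in operators:
            # Apply an operator only if it is applicable and adds something new.
            if op.preconds <= state and not op.adds <= state:
                nxt = state | op.adds
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [op.name]))
    return None


# Hypothetical two-step pipeline, loosely modeled on the pulsar search.
OPS = [
    Operator("create-sft", frozenset({"raw-channel"}), frozenset({"sft"})),
    Operator("pulsar-search", frozenset({"sft"}), frozenset({"pulsar-result"})),
]

workflow = plan({"raw-channel"}, {"pulsar-result"}, OPS)
print(workflow)  # ['create-sft', 'pulsar-search']
```

The goal is stated as a data product, and the planner works out which components must run to produce it.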
Existing tools for building workflows:
abstract workflow generation

Chimera

Input-output transforms for files, in ‘Virtual Data Language’:
DV third1->pulsar(a=@{input:"H2_sSFT_LSC-AS-Q_714384000_256_50_1.ilwd"},
b=@{output:"H2_pulsar_LSC-AS-Q_714384000_256_50.5_0.004_3.ilwd"},
t1="714384000", t2="714384255", format="ilwd", channel="LSC-AS-Q",
fcenter="50.5", fband="0.004", instrument="H2", ra="3.123643", de="+2.56234",
fderv1="0.0", fderv2="0.0", fderv3="0.0", fderv4="0.0", fderv5="0.0");
Planning operator
(operator pulsar-search
  (preconds
    ((<start-time> 7143800)
     (<channel> LSC-AS-Q)
     (<fcenter> 0.5)
     (<right-ascension> 50)
     (<sample-rate> 20)
     …)
    (and
      (created "H2_sSFT_LSC-AS-Q_714384000_256_50_1.ilwd")))
  (effects
    ()
    ((add
       (created "H2_pulsar_LSC-AS-Q_714384000_256_50.5_0.004_3.ilwd")))))
Operator with metadata parameters
(operator pulsar-search
  (preconds
    ((<start-time> Number)
     (<channel> Channel)
     (<fcenter> Number)
     (<right-ascension> Number)
     (<sample-rate> Number)
     (<file> File-Handle)
     ;; These two are parameters for the frequency-extract.
     (<f0> (and Number (get-low-freq-from-center-and-band
                         <fcenter> <fband>)))
     (<fN> (and Number (get-high-freq-from-center-and-band
                         <fcenter> <fband>)))
     …)
    (and
      (forall ((<sub-sft-file-group>
                 (and File-Group-Handle
                      (gen-sub-sft-range-for-pulsar-search
                        <f0> <fN> <start-time> <end-time>
                        <sub-sft-file-group>))))
        (and (sub-sft-group <start-time> <end-time>
                            <channel> <instrument> <format>
                            <f0> <fN> <sample-rate> <sub-sft-file-group>)
             (at <sub-sft-file-group> <host>)))))
  (effects
    ()
    (
     (add (created <file>))
     (add (pulsar <start-time> <end-time> <channel>
                  <instrument> <format>
                  <fcenter> <fband>
                  <fderv1> <fderv2> <fderv3> <fderv4> <fderv5>
                  <right-ascension> <declination> <sample-rate>
                  <file>))
    )
  ))
Operator with host identified
(operator pulsar-search
  (preconds
    ((<host> (or Condor-pool Mpi))
     (<start-time> Number)
     (<channel> Channel)
     (<fcenter> Number)
     (<right-ascension> Number)
     (<sample-rate> Number)
     (<file> File-Handle)
     ;; These two are parameters for the frequency-extract.
     (<f0> (and Number (get-low-freq-from-center-and-band
                         <fcenter> <fband>)))
     (<fN> (and Number (get-high-freq-from-center-and-band
                         <fcenter> <fband>)))
     (<run-time> (and Number
                      (estimate-pulsar-search-run-time
                        <start-time> <end-time> <sample-rate>
                        <f0> <fN> <host> <run-time>)))
     …)
    (and (available pulsar-search <host>)
         (forall ((<sub-sft-file-group>
                    (and File-Group-Handle
                         (gen-sub-sft-range-for-pulsar-search
                           <f0> <fN> <start-time> <end-time>
                           <sub-sft-file-group>))))
           (and (sub-sft-group <start-time> <end-time>
                               <channel> <instrument> <format>
                               <f0> <fN> <sample-rate> <sub-sft-file-group>)
                (at <sub-sft-file-group> <host>)))))
  (effects
    ()
    (
     (add (created <file>))
     (add (at <file> <host>))
     (add (pulsar <start-time> <end-time> <channel>
                  <instrument> <format>
                  <fcenter> <fband>
                  <fderv1> <fderv2> <fderv3> <fderv4> <fderv5>
                  <right-ascension> <declination> <sample-rate>
                  <file>))
    )
  ))
Planning for workflow generation

Application components as operators

Parameters include host: plan is a concrete workflow

Desired data (in descriptive form) as goals

World state includes available hosts, existing data
products, network bandwidths, …
Operator descriptions

Represent applying a given component at a particular
location with fixed parameters, inputs and outputs.

Preconditions combine


data dependencies – derive input requirements from outputs
Task constraints – e.g. component must be run on an MPI
machine
Plan quality

Objective function may include



Performance – expected runtime, variance
Reliability – probability of failure, expected number
of retries
Computational cost – use of ‘expensive’ resources,
conformance to policies
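One way to fold these criteria into a single number is a weighted sum over plan steps. The sketch below is an assumption for illustration only: the `Step` fields, the weights, and the model of independent retries (a geometric distribution, so the expected number of attempts is 1 / (1 - p_fail)) are not taken from the slides, and steps are assumed to run one after another.

```python
from dataclasses import dataclass


@dataclass
class Step:
    runtime: float  # expected runtime of one attempt, in seconds
    p_fail: float   # probability that a single attempt fails (must be < 1)
    cost: float     # charge for one attempt on an 'expensive' resource


def expected_attempts(p_fail):
    """Expected attempts until first success, assuming independent retries."""
    return 1.0 / (1.0 - p_fail)


def plan_score(steps, w_time=1.0, w_cost=1.0):
    """Lower is better: weighted expected runtime plus weighted expected cost."""
    total_time = sum(s.runtime * expected_attempts(s.p_fail) for s in steps)
    total_cost = sum(s.cost * expected_attempts(s.p_fail) for s in steps)
    return w_time * total_time + w_cost * total_cost


# Hypothetical plan: one unreliable, costly step and one reliable, free step.
example = [Step(runtime=100.0, p_fail=0.5, cost=2.0),
           Step(runtime=10.0, p_fail=0.0, cost=0.0)]
print(plan_score(example))  # 214.0
```

Changing the weights changes which plan wins, which is one reading of "variable metric" in the desiderata.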
Using local heuristics and global metrics

Need local heuristics since search space is
intractable


e.g. prefer host for program with high-bandwidth connection
to where the output is required
Need to test a global metric (e.g. overall runtime)
since local heuristics can lead to globally poor
solution


Create as many plans as possible, return best
Search control to eliminate redundant solutions
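The gap between a local heuristic and a global metric shows up even with two tasks. In this hedged sketch the host names, runtimes, and the makespan model (tasks on one host run back to back, different hosts run in parallel) are invented for illustration; exhaustive enumeration stands in for "create as many plans as possible" and would not scale to a real search space.

```python
from itertools import product


def makespan(assignment, runtime):
    """Global metric: the plan finishes when the most heavily loaded host does."""
    load = {}
    for task, host in assignment.items():
        load[host] = load.get(host, 0) + runtime[task][host]
    return max(load.values())


def greedy_plan(tasks, hosts, runtime):
    """Local heuristic: each task independently picks its fastest host."""
    return {t: min(hosts, key=lambda h: runtime[t][h]) for t in tasks}


def best_plan(tasks, hosts, runtime):
    """Enumerate every assignment and return one with the smallest makespan."""
    candidates = (dict(zip(tasks, choice))
                  for choice in product(hosts, repeat=len(tasks)))
    return min(candidates, key=lambda a: makespan(a, runtime))


TASKS = ["t1", "t2"]
HOSTS = ["fast", "slow"]
RUNTIME = {"t1": {"fast": 4, "slow": 6}, "t2": {"fast": 4, "slow": 6}}

local = greedy_plan(TASKS, HOSTS, RUNTIME)
best = best_plan(TASKS, HOSTS, RUNTIME)
print(makespan(local, RUNTIME), makespan(best, RUNTIME))  # 8 6
```

Both tasks prefer the fast host when considered alone, but queueing them there takes longer overall than spreading them across both hosts.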
Example search heuristics
(control-rule only-transfer-from-loc-with-greatest-bandwidth
  (if (and (current-ops (transfer-file))
           (current-goal (at <file> <dest>))
           (true-in-state (at <file> <loc1>))
           (true-in-state (at <file> <loc2>))
           (higher-bandwidth <loc1> <loc2> <dest>)))
  (then reject bindings ((<from-loc> . <loc2>))))
(control-rule prefer-mpi-to-condor-for-pulsar-search
  (if (and (current-ops (pulsar-search))
           (type-of <mpi> Mpi)
           (type-of <condor> Condor-pool)))
  (then prefer bindings ((<host> . <mpi>)) ((<host> . <condor>))))
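The two control rules above behave differently: a reject rule prunes candidate bindings, while a prefer rule only reorders them. A rough Python analogue follows, as a sketch only; the function names, the bandwidth table, and the host-type table are hypothetical and do not reproduce the planner's rule engine.

```python
def reject_lower_bandwidth(sources, dest, bandwidth):
    """Analogue of the reject rule: drop any source that a faster source beats."""
    best = max(bandwidth[(s, dest)] for s in sources)
    return [s for s in sources if bandwidth[(s, dest)] == best]


def prefer_mpi(hosts, host_type):
    """Analogue of the prefer rule: try Mpi hosts first, keep all candidates."""
    # sorted() is stable, so hosts of the same type keep their original order.
    return sorted(hosts, key=lambda h: host_type[h] != "Mpi")


# Hypothetical state: the file sits at two locations with different links.
BANDWIDTH = {("loc1", "dest"): 100, ("loc2", "dest"): 10}
HOST_TYPE = {"c1": "Condor-pool", "m1": "Mpi", "c2": "Condor-pool"}

print(reject_lower_bandwidth(["loc1", "loc2"], "dest", BANDWIDTH))  # ['loc1']
print(prefer_mpi(["c1", "m1", "c2"], HOST_TYPE))  # ['m1', 'c1', 'c2']
```

Because preference keeps the Condor pools as fallbacks, the search can still backtrack to them if no plan exists on an MPI host.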
Planning for workflow generation and maintenance
Outline:
Formalization as a planning problem
Integration with the grid middleware
The grid as a test bed for planning and scheduling research
[Figure: system architecture. The Request Manager receives high-level specs of desired results and intermediate data products. Workflow Planning pairs an AI-based Planner with a Current State Generator, which supplies models and current state information gathered from the Metadata Catalog Service, the Globus Replica Location Service, the Globus Monitoring and Discovery Service (dynamic information), and Resource Models. The resulting concrete workflow goes to the Submission and Monitoring System, whose workflow executor (DAGMan) sends tasks to the Grid for execution and receives monitoring information. Raw data comes from the detector.]
Generating the planning problem

Currently, static file representation for available hosts,
bandwidths

Query grid services prior to planning to find which
relevant files exist


Future versions will make dynamic queries
Goal is translated from user request, plan is
translated into DAG format suitable for grid
scheduler.
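The last translation step, from an ordered plan to a DAG, can be sketched by linking each step to the steps that produced its inputs. The code below is an illustrative assumption, not the system's translator: the step tuples and the `.sub` submit-file names are placeholders, and the output uses the `JOB` and `PARENT ... CHILD` line forms of a DAGMan input file.

```python
def plan_to_dag(steps):
    """steps: list of (job_name, input_files, output_files) in plan order.

    Returns DAGMan-style lines. Submit-file names are placeholders.
    """
    producer = {}  # file -> name of the job that creates it
    lines = []
    edges = []
    for name, inputs, outputs in steps:
        lines.append(f"JOB {name} {name}.sub")
        for f in inputs:
            # Files with no producer already exist and need no dependency edge.
            if f in producer and (producer[f], name) not in edges:
                edges.append((producer[f], name))
        for f in outputs:
            producer[f] = name
    lines.extend(f"PARENT {p} CHILD {c}" for p, c in edges)
    return lines


# Hypothetical plan: move a file to host2, then run the FFT there.
STEPS = [
    ("transfer", ["filea@host1"], ["file1@host2"]),
    ("fft", ["file1@host2"], ["spectrum@host2"]),
]
dag = plan_to_dag(STEPS)
print("\n".join(dag))
```

Only data dependencies become edges, so steps that share no files remain free to run in parallel.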
LIGO’s Pulsar Search at SC’02

Used LIGO’s data collected during
the first scientific run of the
instrument

Targeted a set of 1000 locations:
known pulsars or random locations

Results of the analysis published
to the LIGO Scientific Collaboration

Performed using LDAS and
compute and storage resources at
Caltech, University of Southern
California, University of Wisconsin
Milwaukee.
Summary: benefits of planning
Automating workflow composition
  Just being addressed in Grid middleware
Reasoning with explicit descriptions of data
  More intuitive for users
  Far fewer inputs required than at file level
Better workflows by searching many plans
Planning for workflow generation and maintenance
Outline:
Existing Grid tools for workflow generation
Formalization as a planning problem
Integration with the grid middleware
The grid as a test bed for planning and scheduling research
Many areas of planning research relevant for grid
Planning for a dynamic environment: plan monitoring and repair, planning under uncertainty
Scheduling: resource reasoning, temporal reasoning
Plan quality: learning, acquiring preferences, local search planning
Planning for information gathering: integrating access to grid services with workflow creation
Domain modeling: handling multiple ontologies, acquiring metadata descriptions, acquiring operators
Fault-tolerant planning for a dynamic
environment

Grid resources become unavailable, queue length &
network bandwidth change

Exploring plan repair strategies, balance of work
done off-line and on-line

Modeling failures, keeping statistics for creating plans
more likely to succeed, conditional plans, ..
Fault-tolerant straw men
1. Current version: build fully detailed plan offline, resource allocation is fixed
   Ignores world dynamics
2. Build abstract plan (without specifying hosts) offline, use a matchmaker online
   Matchmaker makes local decisions only
Global reasoning is needed for resource allocation
[Figure: task graph running from Start to Finish with tasks A (3), B (1), and C (5)]
Approaches for fault-tolerant planning in
dynamic domains

RAX (Jonsson et al.) general framework. As implemented:
offline: builds complete plan
online: adjusts temporal intervals

Combining planning and scheduling
offline: build several abstract plans
online: reason about critical path to instantiate each plan

MDP/POMDP approaches

Open area..
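Reasoning about the critical path, as in the combined planning and scheduling approach listed above, amounts to finding the longest duration-weighted path through the task graph. The sketch below is a generic longest-path computation under assumptions: the durations reuse the labels A (3), B (1), C (5), but the dependency structure (C waits for both A and B) is hypothetical.

```python
from functools import lru_cache


def critical_path(duration, preds):
    """Return (length, path) of the longest duration-weighted path in a DAG."""

    @lru_cache(maxsize=None)
    def finish(task):
        # Earliest finish = own duration + finish of the slowest predecessor.
        best_len, best_path = 0, ()
        for p in preds.get(task, ()):
            length, path = finish(p)
            if length > best_len:
                best_len, best_path = length, path
        return best_len + duration[task], best_path + (task,)

    return max((finish(t) for t in duration), key=lambda r: r[0])


DURATION = {"A": 3, "B": 1, "C": 5}
PREDS = {"C": ("A", "B")}  # assumed structure: C needs the outputs of A and B

length, path = critical_path(DURATION, PREDS)
print(length, path)  # 8 ('A', 'C')
```

Under this assumed structure B has two units of slack, so a scheduler that sees the whole graph would give the better resource to A, which a matchmaker making purely local decisions cannot know.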
Challenge: understanding when different
approaches are more important


Hypotheses:

Uneven task distribution, in terms of computational and data
expense and resource constraints will indicate global
planning

Time-dependency, e.g. need to re-plan during execution, will
indicate local planning
Interesting project: use experiments in synthetic and
real domains to test hypotheses and uncover new
insights
Empirical tests with synthetic LIGO problems
Example: Problem requires 100 files on one machine. Vary the number that exist.
[Plot: "distribution - 1 machine". x-axis: no of files, from 100 down to 10; y-axis: run-time, 300 to 800; series: min, max, p-max, g-max, avg]
Domain modeling
Current system: knowledge from several sources must be used
[Figure: a monolithic planner, with KBs combined in one location, contains a component selector, a resource selector, and an execution monitor. Its inputs are task requirements, available resources, resource policies, user policies, and state info (files, resources) drawn from Grid services (RLS, MCS etc.): existing data in files, resource queues, network bandwidth. It passes concrete tasks to Grid task schedulers.]
Where does knowledge used by our planners come from?
[Figure: an operator's preconditions and effects are assembled from task resource requirements, data dependencies (VDL*), user policies & preferences, and resource policies]
Each knowledge component is used for other purposes beyond planning
Automatically generated operators for several application domains
Domains: Digital sky survey, LIGO, GEO, Galaxy morphology, Tomography
[Figure: operators (preconditions and effects) generated from task resource requirements, data dependencies (VDL*), and policies]
Investigating patterns of data descriptions for more efficient planning

Question: if operators are gathered from distributed services, can we still guarantee soundness and completeness?
Under what kinds of conditions?
Representing appropriate information units with metadata
E.g. Have 60,000 files, want to allocate 60 tasks each dealing with 1,000 files.
Previously, application components specified in terms of specific files:
(1000 files per clause)
DV run59000->extractSFTData( input=[@{input:"nSFT.59000"},…,@{input:"nSFT.59999"}],
output=[@{output:"eSFT.59000"},…,@{output:"eSFT.59999"}],
t1="714384000", t2="714384063", freq="1008",band="4",instrument="H2");
… 59 similar clauses…
(60000 files in one clause)
DV final->computeFStatistic( input=[@{input:"eSFT.00000"},…,@{input:"eSFT.59999"}],…);
Metadata representation
Replace with two clauses, two input predicates
  A predicate now represents a range of files
  Simpler to model, greater generality, more efficient for reasoner
(operator run-extractSFTData-range
  (preconds
    ((<begin-file> Number)
     (<number-of-files> (and Number (> <number-of-files> 0)))
     (<local-begin-file> (and Number
                              (gen-smaller-number <number-of-files> 1000 <begin-file>))))
    (and (range "eSFT" <begin-file> 2 1 <local-begin-file>)
         (range "nSFT" <local-begin-file> 2 1 999)))
  (effects ()
    ((add (range "eSFT" <begin-file> 2 <number-of-files>)))))
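The idea of describing files by ranges instead of names can be illustrated outside the planner. In this sketch a range is a hypothetical `(begin, count)` pair, which is simpler than the predicate arguments used in the operator above, and `split_range` is an invented helper that carves a large range into task-sized pieces.

```python
def split_range(begin, count, chunk):
    """Split files [begin, begin + count) into consecutive (begin, count) ranges."""
    end = begin + count
    return [(b, min(chunk, end - b)) for b in range(begin, end, chunk)]


# 60,000 files allocated to tasks of 1,000 files each.
tasks = split_range(0, 60000, 1000)
print(len(tasks), tasks[0], tasks[-1])  # 60 (0, 1000) (59000, 1000)
```

Sixty pairs stand in for sixty clauses that would otherwise each list 1,000 file names.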
Requires library operators for ranges
E.g. if a range of files exists, then so does any subrange
Questions: what are the required operators? Similar to spatial calculus RCC-8?
(operator subranges-exist
  (preconds
    ((<begin-file> Number)
     (<type> Object)
     (<number-of-files> (and Number (> <number-of-files> 0)))
     (<enclosing-begin> (and Number (gen-known-enclosing-begins <type> <begin-file>
                                      2 1 <number-of-files>)))
     (<enclosing-number-of-files>
       (and Number (gen-known-enclosing-number-of-files <type> <enclosing-begin>
                     2 1 <number-of-files>
                     <begin-file>))))
    (created-range <type> <enclosing-begin> 2 1 <enclosing-number-of-files>))
  (effects ()
    ((add (created-range <type> <begin-file> 2 1 <number-of-files>)))))
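The subrange inference can be stated compactly as an interval containment test. This is a sketch under assumptions: ranges are hypothetical `(begin, count)` pairs, and the check considers one known range at a time, as the operator above does.

```python
def contains(enclosing, sub):
    """True if the (begin, count) range `sub` lies entirely inside `enclosing`."""
    e_begin, e_count = enclosing
    s_begin, s_count = sub
    return e_begin <= s_begin and s_begin + s_count <= e_begin + e_count


def subrange_exists(known_ranges, wanted):
    """A wanted range exists if some single known range encloses it."""
    return any(contains(r, wanted) for r in known_ranges)


KNOWN = [(0, 1000), (1000, 1000)]
print(subrange_exists(KNOWN, (100, 50)))   # True
print(subrange_exists(KNOWN, (990, 20)))   # False: spans two known ranges
```

The second query fails even though every file in it exists, because no single known range encloses it. Covering that case would need a further operator that merges adjacent ranges, which is the kind of gap the question about required operators points at.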
Conclusions

Implemented system takes data description requests
from LIGO users, composes workflow and executes
on the Grid

Planning and scheduling technologies can make a
large contribution to Grid infrastructure

Many interesting challenges for planning and
scheduling research from Grid applications
http://www.isi.edu/ikcap/cognitive-grids
http://www.isi.edu/~deelman/pegasus.htm
Koehler and Srivastava

Different approaches to specifying workflows by hand
WSDL service specification
(no workflow specified)
<definitions targetNamespace="http://..."
xmlns="http://schemas.xmlsoap.org/wsdl/">
<message name = "OrderEvent"></message>
<message name = "TripRequest"></message>
<message name = "FlightRequest"></message>
<message name = "HotelRequest"></message>
<message name = "BookingFailure"></message>
<portType name ="pt1">
<operation name ="CToCI">
<input message ="TripRequest"/>
</operation>
</portType>
<portType name ="pt2">
<operation name ="CIToHS">
<output message ="HotelRequest"/>
</operation>
</portType>
<portType name ="pt3">
<operation name ="CIToFS">
<output message ="FlightRequest"/>
</operation>
</portType>
BPEL4WS
<sequence>
<receive partner="Customer"
portType ="pt1"
operation ="CToCI"
container ="OrderEvent">
</receive>
<flow>
<invoke partner ="HotelService"
portType ="pt2"
operation ="CIToHS"
inputContainer ="HotelRequest">
</invoke>
<invoke partner ="FlightService"
portType ="pt3"
operation ="CIToFS"
inputContainer ="FlightRequest">
</invoke>
</flow>
Golog
Back-up slides
What is Needed
We need alternative foundations that offer
  expressive representations
  flexible reasoners
Many Artificial Intelligence (AI) techniques are relevant:
  Planning to achieve given requirements
  Searching through problem spaces of related choices
  Using and combining heuristics
  Expressive knowledge representation languages
  Reasoners that can incorporate rules, definitions, axioms, etc.
  Schedulers and resource allocation techniques
Existing tools for building workflows:
abstract workflow generation

Chimera

Input-output transforms at level of actual files, in ‘Virtual Data Language’:
DV first1->createSFT( b=@{output:"H2_SFT_LSC-AS-Q_714384000_64.gwf"},
t1="714384000", t2="714384063", format="frame", channel="H2:LSC-AS-Q",
instrument="H2");
DV first2->createSFT( b=@{output:"H2_SFT_LSC-AS-Q_714384064_64.gwf"},
t1="714384064", t2="714384127", format="frame", channel="H2:LSC-AS-Q",
instrument="H2");
DV third1->pulsar(a=@{input:"H2_sSFT_LSC-AS-Q_714384000_256_50_1.ilwd"},
b=@{output:"H2_pulsar_LSC-AS-Q_714384000_256_50.5_0.004_3.123643_+2.56234.ilwd"},
t1="714384000", t2="714384255", format="ilwd", channel="LSC-AS-Q",
fcenter="50.5", fband="0.004", instrument="H2", ra="3.123643", de="+2.56234",
fderv1="0.0", fderv2="0.0", fderv3="0.0", fderv4="0.0", fderv5="0.0");
Existing tools 2: concrete planner
Assigns specific hosts and data locations for tasks
Makes random selection of resources and data
Provided a feasible solution
Reused existing data products
[Figure: example concrete workflow. Input F.a passes through Extract, Resample, Decimate, and Concat nodes, with intermediate files F.b1, F.b2, F.c1, F.c2 and output F.d. Data transfer nodes (e.g. "Gridftp host://f.a ….lumpy.isi.edu/nfs/temp/f.a") and replica catalog registration nodes (e.g. "Register /F.d at home/malcolm/f2") surround the compute nodes, which name executables such as "lumpy.isi.edu://usr/local/bin/extract" and "Jet.caltech.edu://home/malcom/resample -I /home/malcolm/F.b1".]
Sample Pulsar Search Results to Date
SC 2002 run:
  Over 58 pulsar searches
  Total of 330 tasks, 469 data transfers, 330 output files produced
  The total runtime was 11:24:35
To date:
  185 pulsar searches
  Total of 975 tasks, 1365 data transfers, 975 output files
  Total runtime 96:49:47