Planning on the Grid
With slides contributed by Ewa Deelman and Yolanda Gil

Thinking about applications of planning
- You've seen: Planning as X, X ∈ {SAT, CSP, ILP, …}
- Now: Y as Planning, Y ∈ {Grid/Web services composition, …}

USC INFORMATION SCIENCES INSTITUTE

Problem-solving on Grids
- Users pool access to distributed resources (computers, instruments, data, ...)
- Applications are often composed of separate components run at several locations
- Grid middleware tools allow for scheduling jobs and resource discovery, e.g. the Globus toolkit

The Computational Grid
- Emerging computational and networking infrastructure
- Enables entirely new approaches to applications and problem solving
  - remote resources the rule, not the exception
  - can solve ever bigger problems
- Wide-area distributed computing
  - brings together compute resources, data storage systems, instruments, human resources
  - national and international
- Facilitates collaborative environments
  - sharing of data which can be expensive to produce (experimentation/simulation)

Example: LIGO Experiment (Laser Interferometer Gravitational-Wave Observatory)
- Aims to detect gravitational waves predicted by the theory of relativity.
- Can be used to detect
  - binary pulsars
  - mergers of black holes
  - “starquakes” in neutron stars
- Two installations: in Louisiana (Livingston) and Washington State
- Other projects: Virgo (Italy), GEO (Germany), Tama (Japan)
- Instruments are designed to measure the effect of gravitational waves on test masses suspended in vacuum.
- Data collected during experiments is a collection of (multi-channel) time series
- Analysis is performed in the time and Fourier domains

LIGO's Pulsar Search (Laser Interferometer Gravitational-wave Observatory)
[Figure: pulsar search pipeline. Labels in the figure: interferometer; archive; raw channels; long time frames; extract channel; transpose; single frame; short time frames; Short Fourier Transform (30 minutes); extract frequency range; construct image; time-frequency image (axes: Hz, time); find candidate; store; event DB.]

Motivation: Using Today's Grid
- Users have high-level requirements, naturally stated in terms of the application domain
- Users have to turn these requirements into executable job workflows in detailed scripts
- Ex: Obtain the frequency spectrum for signal S in instrument I and timeframe T
- Users must figure out which code generates the desired products, which files contain them, the physical location of the files, hosts that support execution given code requirements, availability of hosts, access policies, etc.
- Users must query Grid middleware: metadata catalog, replica locator, resource descriptor and monitoring, etc.
- Users must oversee execution

Problems with today's Grid
- Usability: users must be proficient in grid computing
- Complexity: many interrelated choices and dead ends
- Solution cost: finding any solution at all, regardless of cost, is already hard
- Global cost: optimization is necessary when there is contention
- Reliability of execution: job resubmission upon failure

Planning for workflow generation and maintenance
Outline:
- Formalization as a planning problem
- Integration with the grid middleware
- Case study: planning for workflows in LIGO
- The grid as a test bed for planning and scheduling research

Application Development and Execution Process
[Figure: three stages. (1) Abstract workflow generation from the application domain: application component selection yields an abstract workflow, e.g. "FFT filea". (2) Concrete workflow generation: resource selection, data replica selection and transformation instance selection yield a concrete workflow, e.g. "transfer filea from host1://home/filea to host2://home/file1" followed by "/usr/local/bin/fft /home/file1". (3) Execution environment (host1, host2) with data transfer. Failure recovery methods: retry, pick different resources, specify a different workflow.]

Desiderata for workflow generator
- Allow users to refer to data requirements by descriptions, not file names
  - Intuitive, requires far less input
- Seek high-quality workflows according to a variable metric
- Model a variety of constraints declaratively
  - Data dependencies, resource constraints, user access rights, …
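The refinement from an abstract to a concrete workflow described above (choose a host that has the component, choose a replica of the input, and add a data transfer when the input is not on the execution host) can be sketched in a few lines of Python. This is an illustrative sketch only: `AbstractTask`, `concretize`, and the two dictionary "catalogs" are hypothetical stand-ins and do not correspond to the interface of any Grid middleware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractTask:
    component: str   # logical component name, e.g. "fft"
    input_file: str  # logical file name, e.g. "filea"


def concretize(task, replicas, installed):
    """Turn one abstract task into a list of concrete steps.

    replicas:  logical file name -> {host: physical path}   (hypothetical replica catalog)
    installed: component name    -> {host: executable path} (hypothetical component catalog)
    """
    locations = replicas.get(task.input_file, {})
    hosts = installed.get(task.component, {})
    if not locations or not hosts:
        raise ValueError(f"no feasible mapping for {task}")

    # Prefer a host that already holds the input, so no transfer is needed.
    for host in sorted(hosts):
        if host in locations:
            return [f"{host}: {hosts[host]} {locations[host]}"]

    # Otherwise pick an execution host and stage the data there first.
    exec_host = sorted(hosts)[0]
    src_host = sorted(locations)[0]
    dest_path = f"/home/{task.input_file}"  # assumed staging location
    return [
        f"transfer {src_host}:{locations[src_host]} -> {exec_host}:{dest_path}",
        f"{exec_host}: {hosts[exec_host]} {dest_path}",
    ]


replicas = {"filea": {"host1": "/home/filea"}}
installed = {"fft": {"host2": "/usr/local/bin/fft"}}
steps = concretize(AbstractTask("fft", "filea"), replicas, installed)
```

The sketch makes purely local choices (first suitable host, first replica). The slides argue that a planner is needed precisely because such choices interact across the whole workflow.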

Planning for workflow generation and maintenance
Outline:
- Formalization as a planning problem
- Integration with the grid middleware
- Case study: planning for workflows in LIGO
- The grid as a test bed for planning and scheduling research

Planning for workflow generation
- Application components as operators
- Desired data as goals
- World state includes available hosts, existing data products, network bandwidths, …

Existing tools for building workflows: abstract workflow generation
Chimera
- Input-output transforms for files, in 'Virtual Data Language':

    DV third1->pulsar(
      a=@{input:"H2_sSFT_LSC-AS-Q_714384000_256_50_1.ilwd"},
      b=@{output:"H2_pulsar_LSC-AS-Q_714384000_256_50.5_0.004_3.ilwd"},
      t1="714384000", t2="714384255", format="ilwd", channel="LSC-AS-Q",
      fcenter="50.5", fband="0.004", instrument="H2",
      ra="3.123643", de="+2.56234",
      fderv1="0.0", fderv2="0.0", fderv3="0.0", fderv4="0.0", fderv5="0.0");

Planning operator

    (operator pulsar-search
      (preconds
        ((<start-time> 7143800)
         (<channel> LSC-AS-Q)
         (<fcenter> 0.5)
         (<right-ascension> 50)
         (<sample-rate> 20)
         …)
        (and (created "H2_sSFT_LSC-AS-Q_714384000_256_50_1.ilwd")))
      (effects
        ()
        ((add (created "H2_pulsar_LSC-AS-Q_714384000_256_50.5_0.004_3.ilwd")))))

Operator with metadata parameters

    (operator pulsar-search
      (preconds
        ((<start-time> Number)
         (<channel> Channel)
         (<fcenter> Number)
         (<right-ascension> Number)
         (<sample-rate> Number)
         (<file> File-Handle)
         ;; These two are parameters for the frequency-extract.
         (<f0> (and Number (get-low-freq-from-center-and-band <fcenter> <fband>)))
         (<fN> (and Number (get-high-freq-from-center-and-band <fcenter> <fband>)))
         …)
        (and (forall ((<sub-sft-file-group>
                        (and File-Group-Handle
                             (gen-sub-sft-range-for-pulsar-search
                               <f0> <fN> <start-time> <end-time> <sub-sft-file-group>))))
               (and (sub-sft-group <start-time> <end-time> <channel> <instrument> <format>
                                   <f0> <fN> <sample-rate> <sub-sft-file-group>)
                    (at <sub-sft-file-group> <host>)))))
      (effects
        ()
        ((add (created <file>))
         (add (pulsar <start-time> <end-time> <channel> <instrument> <format>
                      <fcenter> <fband>
                      <fderv1> <fderv2> <fderv3> <fderv4> <fderv5>
                      <right-ascension> <declination> <sample-rate> <file>)))))

Operator with host identified

    (operator pulsar-search
      (preconds
        ((<host> (or Condor-pool Mpi))
         (<start-time> Number)
         (<channel> Channel)
         (<fcenter> Number)
         (<right-ascension> Number)
         (<sample-rate> Number)
         (<file> File-Handle)
         ;; These two are parameters for the frequency-extract.
         (<f0> (and Number (get-low-freq-from-center-and-band <fcenter> <fband>)))
         (<fN> (and Number (get-high-freq-from-center-and-band <fcenter> <fband>)))
         (<run-time> (and Number
                          (estimate-pulsar-search-run-time
                            <start-time> <end-time> <sample-rate> <f0> <fN> <host> <run-time>)))
         …)
        (and (available pulsar-search <host>)
             (forall ((<sub-sft-file-group>
                        (and File-Group-Handle
                             (gen-sub-sft-range-for-pulsar-search
                               <f0> <fN> <start-time> <end-time> <sub-sft-file-group>))))
               (and (sub-sft-group <start-time> <end-time> <channel> <instrument> <format>
                                   <f0> <fN> <sample-rate> <sub-sft-file-group>)
                    (at <sub-sft-file-group> <host>)))))
      (effects
        ()
        ((add (created <file>))
         (add (at <file> <host>))
         (add (pulsar <start-time> <end-time> <channel> <instrument> <format>
                      <fcenter> <fband>
                      <fderv1> <fderv2> <fderv3> <fderv4> <fderv5>
                      <right-ascension> <declination> <sample-rate> <file>)))))

Planning for workflow generation
- Application components as operators
  - Parameters include host: plan is a concrete workflow
- Desired data (in descriptive form) as goals
- World state includes available hosts, existing data products, network bandwidths, …

Operator descriptions
- Represent applying a given component at a particular location with fixed parameters, inputs and outputs.
- Preconditions combine
  - data dependencies: derive input requirements from outputs
  - task constraints: e.g. component must be run on an MPI machine

Plan quality
Objective function may include
- Performance: expected runtime, variance
- Reliability: probability of failure, expected number of retries
- Computational cost: use of 'expensive' resources, conformance to policies

Using local heuristics and global metrics
- Need local heuristics since the search space is too large to search exhaustively
  - e.g. prefer a host for a program with a high-bandwidth connection to where the output is required
- Need to test a global metric (e.g.
  overall runtime) since local heuristics can lead to a globally poor solution
- Create as many plans as possible, return the best
- Search control to eliminate redundant solutions

Example search heuristics

    (control-rule only-transfer-from-loc-with-greatest-bandwidth
      (if (and (current-ops (transfer-file))
               (current-goal (at <file> <dest>))
               (true-in-state (at <file> <loc1>))
               (true-in-state (at <file> <loc2>))
               (higher-bandwidth <loc1> <loc2> <dest>)))
      (then reject bindings ((<from-loc> . <loc2>))))

    (control-rule prefer-mpi-to-condor-for-pulsar-search
      (if (and (current-ops (pulsar-search))
               (type-of <mpi> Mpi)
               (type-of <condor> Condor-pool)))
      (then prefer bindings ((<host> . <mpi>)) ((<host> . <condor>))))

Planning for workflow generation and maintenance
Outline:
- Formalization as a planning problem
- Integration with the grid middleware
- The grid as a test bed for planning and scheduling research

[Figure: system architecture. Labels in the figure: high-level specs of desired results and intermediate data products; Request Manager; Metadata Catalog Service; Workflow Planning (AI-based Planner, Current State Generator); Globus Replica Location Service; Globus Monitoring and Discovery Service; models and current state information; resource models; dynamic information; concrete workflow; Submission and Monitoring System; workflow executor (DAGman); monitoring information; tasks; execution; Grid; raw data; detector.]

Generating the planning problem
- Currently, a static file represents the available hosts and bandwidths
- Grid services are queried prior to planning to find which relevant files exist
- Future versions will make dynamic queries
- The goal is translated from the user request; the plan is translated into a DAG format suitable for the grid scheduler.
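The last step above, translating a plan into a DAG, amounts to deriving dependency edges from the files each step consumes and produces. The Python sketch below illustrates the idea under stated assumptions: the step names, file names, `plan_to_dag`, and `to_dag_lines` are hypothetical, and the text listing it emits is only loosely modelled on job/parent-child descriptions; it is not claimed to match the input format of DAGMan or any other scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_to_dag(steps):
    """Derive dependencies between plan steps from their data products.

    steps: step name -> (list of input files, list of output files)
    Returns: step name -> set of step names that must run first.
    """
    producer = {}
    for name, (_, outputs) in steps.items():
        for f in outputs:
            producer[f] = name

    parents = {name: set() for name in steps}
    for name, (inputs, _) in steps.items():
        for f in inputs:
            # Files with no producer are assumed to exist already on the grid.
            if f in producer and producer[f] != name:
                parents[name].add(producer[f])
    return parents


def to_dag_lines(parents):
    """Emit one line per job, then one line per dependency edge."""
    order = list(TopologicalSorter(parents).static_order())
    lines = [f"JOB {name}" for name in order]
    for child in order:
        for parent in sorted(parents[child]):
            lines.append(f"PARENT {parent} CHILD {child}")
    return lines


# Hypothetical three-step plan, listed out of execution order on purpose.
plan = {
    "pulsar": (["sft.dat"], ["pulsar.out"]),
    "extract": (["raw.gwf"], ["chan.dat"]),
    "sft": (["chan.dat"], ["sft.dat"]),
}
parents = plan_to_dag(plan)
dag_lines = to_dag_lines(parents)
```

Because the edges come only from data dependencies, steps with no file in common stay unordered and a scheduler is free to run them in parallel.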

LIGO's Pulsar Search at SC'02
- Used LIGO's data collected during the first scientific run of the instrument
- Targeted a set of 1000 locations: known pulsars or random locations
- Results of the analysis published to the LIGO Scientific Collaboration
- Performed using LDAS and compute and storage resources at Caltech, University of Southern California, University of Wisconsin Milwaukee.

Summary: benefits of planning
- Automating workflow composition
- Reasoning with explicit descriptions of data
- Just being addressed in Grid middleware
- More intuitive for users
- Far fewer inputs required than at file level
- Better workflows by searching many plans

Planning for workflow generation and maintenance
Outline:
- Existing Grid tools for workflow generation
- Formalization as a planning problem
- Integration with the grid middleware
- The grid as a test bed for planning and scheduling research

Many areas of planning research relevant for grid
- Planning for a dynamic environment: plan monitoring and repair, planning under uncertainty
- Scheduling: resource reasoning, temporal reasoning
- Plan quality: learning, acquiring preferences, local search planning
- Planning for information gathering: integrating access to grid services with workflow creation
- Domain modeling: handling multiple ontologies, acquiring metadata descriptions, acquiring operators

Fault-tolerant planning for a dynamic environment
- Grid resources become unavailable, queue length & network bandwidth change
- Exploring plan repair strategies, balance of work done off-line and on-line
- Modeling failures, keeping statistics for creating plans more likely to succeed, conditional plans, ...

Fault-tolerant straw men
1. Current version: build fully detailed plan offline, resource allocation is fixed
   - Ignores world dynamics
2. Build abstract plan (without specifying hosts) offline, use a matchmaker online
   - Matchmaker makes local decisions only

Global reasoning is needed for resource allocation
[Figure: task graph between Start and Finish with tasks A (3), B (1) and C (5).]

Approaches for fault-tolerant planning in dynamic domains
- RAX (Jonsson et al.): general framework. As implemented:
  - offline: builds complete plan
  - online: adjusts temporal intervals
- Combining planning and scheduling
  - offline: build several abstract plans
  - online: reason about critical path to instantiate each plan
- MDP/POMDP approaches
- Open area...

Challenge: understanding when different approaches are more important
Hypotheses:
- Uneven task distribution, in terms of computational and data expense and resource constraints, will indicate global planning
- Time-dependency, e.g. need to re-plan during execution, will indicate local planning
Interesting project: use experiments in synthetic and real domains to test hypotheses and uncover new insights

Empirical tests with synthetic LIGO problems
- Example: Problem requires 100 files on one machine. Vary the number that exist.
[Plot: "distribution - 1 machine". x-axis: no. of files; y-axis: run-time; series: min, max, p-max, g-max, avg.]

Domain modeling
Current system: knowledge from several sources must be used
- task requirements
- available resources
- resource policies
- info from Grid services (RLS, MCS etc.)
- existing data in files
[Figure: monolithic planner with KBs combined in one location. Labels in the figure: comp. selector; resource selector; exec. monitor; user policies; resource queues; state info (files, resources); network bandwidth; concrete tasks; Grid task schedulers.]

Where does knowledge used by our planners come from?
[Figure: an operator of the form (Operator … (preconditions ..) (effects ..)) assembled from: task resource requirements; data dependencies (VDL*); user policies & preferences; resource policies.]
- Each knowledge component is used for other purposes beyond planning

Automatically generated operators for several application domains
- Digital sky survey
- LIGO
- GEO
- Galaxy morphology
- Tomography
[Figure: an operator of the form (Operator … (preconditions ..) (effects ..)) generated from: task resource requirements; data dependencies (VDL*); policies.]
- Investigating patterns of data descriptions for more efficient planning

Question: if operators are gathered from distributed services, can we still guarantee soundness and completeness? Under what kinds of conditions?

Representing appropriate information units with metadata
- E.g. have 60,000 files, want to allocate 60 tasks each dealing with 1,000 files.
- Previously, application components were specified in terms of specific files:

    1000 files:
    DV run59000->extractSFTData(
      input=[@{input:"nSFT.59000"},…,@{input:"nSFT.59999"}],
      output=[@{output:"eSFT.59000"},…,@{output:"eSFT.59999"}],
      t1="714384000", t2="714384063", freq="1008", band="4", instrument="H2");
    … 59 similar clauses …

    60000 files:
    DV final->computeFStatistic(
      input=[@{input:"eSFT.00000"},…,@{input:"eSFT.59999"}],…);

Metadata representation
- Replace with two clauses, two input predicates
- A predicate now represents a range of files
- Simpler to model, greater generality, more efficient for reasoner

    (operator run-extractSFTData-range
      (preconds
        ((<begin-file> Number)
         (<number-of-files> (and Number (> <number-of-files> 0)))
         (<local-begin-file> (and Number
                                  (gen-smaller-number <number-of-files> 1000 <begin-file>))))
        (and (range "eSFT" <begin-file> 2 1 <local-begin-file>)
             (range "nSFT" <local-begin-file> 2 1 999)))
      (effects
        ()
        ((add (range "eSFT" <begin-file> 2 <number-of-files>)))))

Requires library operators for ranges
- E.g.
  if a range of files exists, then so does any subrange
- Questions: what are the required operators? Similar to spatial calculus RCC-8?

    (operator subranges-exist
      (preconds
        ((<begin-file> Number)
         (<type> Object)
         (<number-of-files> (and Number (> <number-of-files> 0)))
         (<enclosing-begin>
           (and Number
                (gen-known-enclosing-begins <type> <begin-file> 2 1 <number-of-files>)))
         (<enclosing-number-of-files>
           (and Number
                (gen-known-enclosing-number-of-files
                  <type> <enclosing-begin> 2 1 <number-of-files> <begin-file>))))
        (created-range <type> <enclosing-begin> 2 1 <enclosing-number-of-files>))
      (effects
        ()
        ((add (created-range <type> <begin-file> 2 1 <number-of-files>)))))

Conclusions
- Implemented system takes data description requests from LIGO users, composes a workflow and executes it on the Grid
- Planning and scheduling technologies can make a large contribution to Grid infrastructure
- Many interesting challenges for planning and scheduling research from Grid applications
- http://www.isi.edu/ikcap/cognitive-grids
- http://www.isi.edu/~deelman/pegasus.htm

Koehler and Srivastava
- Different approaches to specifying workflows by hand

WSDL service specification (no workflow specified)

    <definitions targetNamespace="http://..."
                 xmlns="http://schemas.xmlsoap.org/wsdl/">
      <message name="OrderEvent"></message>
      <message name="TripRequest"></message>
      <message name="FlightRequest"></message>
      <message name="HotelRequest"></message>
      <message name="BookingFailure"></message>
      <portType name="pt1">
        <operation name="CToCI">
          <input message="TripRequest"/>
        </operation>
      </portType>
      <portType name="pt2">
        <operation name="CIToHS">
          <output message="HotelRequest"/>
        </operation>
      </portType>
      <portType name="pt3">
        <operation name="CIToFS">
          <output message="FlightRequest"/>
        </operation>
      </portType>

BPEL4WS

    <sequence>
      <receive partner="Customer" portType="pt1"
               operation="CToCI" container="OrderEvent">
      </receive>
      <flow>
        <invoke partner="HotelService" portType="pt2"
                operation="CIToHS" inputContainer="HotelRequest">
        </invoke>
        <invoke partner="FlightService" portType="pt3"
                operation="CIToFS" inputContainer="FlightRequest">
        </invoke>
      </flow>

Golog

Back-up slides

What is Needed
- We need alternative foundations that offer
  - expressive representations
  - flexible reasoners
- Many Artificial Intelligence (AI) techniques are relevant:
  - Planning to achieve given requirements
  - Searching through problem spaces of related choices
  - Using and combining heuristics
  - Expressive knowledge representation languages
  - Reasoners that can incorporate rules, definitions, axioms, etc.
  - Schedulers and resource allocation techniques

Existing tools for building workflows: abstract workflow generation
Chimera
- Input-output transforms at the level of actual files, in 'Virtual Data Language':

    DV first1->createSFT(
      b=@{output:"H2_SFT_LSC-AS-Q_714384000_64.gwf"},
      t1="714384000", t2="714384063", format="frame",
      channel="H2:LSC-AS-Q", instrument="H2");

    DV first2->createSFT(
      b=@{output:"H2_SFT_LSC-AS-Q_714384064_64.gwf"},
      t1="714384064", t2="714384127", format="frame",
      channel="H2:LSC-AS-Q", instrument="H2");

    DV third1->pulsar(
      a=@{input:"H2_sSFT_LSC-AS-Q_714384000_256_50_1.ilwd"},
      b=@{output:"H2_pulsar_LSC-AS-Q_714384000_256_50.5_0.004_3.123643_+2.56234.ilwd"},
      t1="714384000", t2="714384255", format="ilwd", channel="LSC-AS-Q",
      fcenter="50.5", fband="0.004", instrument="H2",
      ra="3.123643", de="+2.56234",
      fderv1="0.0", fderv2="0.0", fderv3="0.0", fderv4="0.0", fderv5="0.0");

Existing tools 2: concrete planner
- Assigns specific hosts and data locations for tasks
- Makes random selection of resources and data
- Provided a feasible solution
- Reused existing data products
[Figure: example concrete workflow. A task graph (Extract, Resample, Decimate, Concat) over files F.a, F.b1, F.b2, F.c1, F.c2 and F.d is mapped to concrete commands (lumpy.isi.edu://usr/local/bin/extract; Jet.caltech.edu://home/malcom/resample -I /home/malcolm/F.b1), with added data transfer nodes (Gridftp host://f.a ….lumpy.isi.edu/nfs/temp/f.a) and replica catalog registration nodes (Register /F.d at home/malcolm/f2).]

Sample Pulsar Search Results to Date
SC 2002 run:
- Over 58 pulsar searches
- Total of 330 tasks, 469 data transfers, 330 output files produced
- The total runtime was 11:24:35
To date:
- 185 pulsar searches
- Total of 975 tasks, 1365 data transfers, 975 output files
- Total runtime 96:49:47