Building Scalable Scientific Applications using Makeflow Dinesh Rajan and Peter Sempolinski University of Notre Dame Cooperative Computing Lab University of Notre Dame http://www.nd.edu/~ccl The Cooperative Computing Lab • We collaborate with people who have large large computing scale computing problems in science, scale problems in science, engineering, and other fields. • We operate computer systems on the • We operatecores: computer systems on grids. the O(10,000) clusters, clouds, O(10,000) cores: clusters, clouds, grids. • We conduct computer science research in the develop context of realsource peoplesoftware and problems. • We open for large distributed • scale We develop opencomputing. source software for large scale distributed computing. http://www.nd.edu/~ccl 3 Plan for Today’s Tutorial 1. Our CCTools Software i. Makeflow, Work Queue, Parrot, Chirp 2. Makeflow i. Lecture: Overview, features ii. Tutorial: Write simple Makeflows 3. Work Queue i. Lecture: Overview, features ii. Tutorial: Write simple WQ programs Science Depends on Computing! The Good News: Computing is Plentiful! 6 Superclusters by the Hour http://arstechnica.com/business/news/2011/09/30000-core-cluster-built-on-amazon-ec2-cloud.ars 9 I have a standard, debugged, trusted application that runs on my laptop. A toy problem completes in one hour. A real problem will take a month (I think.) Can I get a single result faster? Can I get more results in the same time? Last year, I heard about this grid thing. This year, I heard about this cloud thing. 10 I have allocations on clusters (unlimited) + grids (limited) + clouds ($)! How do I run my application on those machines? Should I port my program to MPI or Hadoop? Learn MPI / Hadoop Learn C / Java Re-architect Re-write Re-test Re-debug Re-certify And my application looks like this… Makeflow & Work Queue • Easy to scale from one desktop to national scale infrastructure. • Harness all available resources: desktops, clusters, clouds, grids. • Portable across operating systems, storage systems, batch systems. • No special privileges required. Makeflow part1 part2 part3: input.data split.py ./split.py input.data out1: part1 mysim.exe ./mysim.exe part1 >out1 out2: part2 mysim.exe ./mysim.exe part2 >out2 out3: part3 mysim.exe ./mysim.exe part3 >out3 result: out1 out2 out3 join.py ./join.py out1 out2 out3 > result 15 Work Queue Library #include “work_queue.h” while( not done ) { while (more work ready) { task = work_queue_task_create(); // add some details to the task work_queue_submit(queue, task); } task = work_queue_wait(queue); // process the completed task } http://www.nd.edu/~ccl/software/workqueue 16 Parrot and Chirp Ordinary Appl Filesystem Interface: open/read/write/close Parrot Virtual File System Local Web Servers HTTP CVMFS CVMFS Network Chirp Chirp Server iRODS iRODS Server 17 http://www.nd.edu/~ccl/software/manuals Source code in GitHub http://github.com/cooperative-computing-lab/cctools Makeflow & Work Queue • Federate/harness all available resources: desktops, clusters, clouds, grids. • Simple interfaces & API • Part of CCTools software – No special privileges required to install. Makeflow Lecture: Outline 1. What is Makeflow? – Portable: One Makeflow program for SGE, Condor, PBS 2. How to write an application using Makeflow? – Simple rule-based syntax 3. How to run Makeflow? – Features, commands, using Work Queue An Old Idea: Makefiles part1 part2 part3: input.data split.py ./split.py input.data out1: part1 mysim.exe ./mysim.exe part1 >out1 out2: part2 mysim.exe ./mysim.exe part2 >out2 out3: part3 mysim.exe ./mysim.exe part3 >out3 result: out1 out2 out3 join.py ./join.py out1 out2 out3 > result 22 Makeflow Language - Rules • Each rule specifies: – a set of target files to create; – a set of source files needed to create them; – a command that generates the target files from the source files. part1 part2 part3: input.data split.py ./split.py input.data out1: part1 mysim.exe ./mysim.exe part1 >out1 out2: part2 mysim.exe ./mysim.exe part2 >out2 out3: part3 mysim.exe ./mysim.exe part3 >out3 result: out1 out2 out3 join.py ./join.py out1 out2 out3 > result out1 : part1 mysim.exe mysim.exe part1 > out1 You must state all the files needed by the command. sims.mf out.10 : in.dat calib.dat sim.exe sim.exe –p 10 in.data > out.10 out.20 : in.dat calib.dat sim.exe sim.exe –p 20 in.data > out.20 out.30 : in.dat calib.dat sim.exe sim.exe –p 30 in.data > out.30 Makeflow = Make + Workflow http://www.nd.edu/~ccl/software/makeflow • • • • Provides portability across batch systems. Enable parallelism (but not too much!) Fault tolerance at multiple scales. Data and resource management. Makeflow Local Condor SGE Work Queue Makeflow + Batch System Makefile XSEDE Cluster Private Cluster Campus Condor Pool Public Cloud Provider Makeflow Local Files and Programs How to run a Makeflow • Run a workflow local % makeflow -T local sims.mf • Run the workflow on SGE: % makeflow -T sge sims.mf • Run the workflow on Condor: % makeflow -T condor sims.mf • Clean up the workflow outputs: % makeflow -c sims.mf Makeflow Syntax Checker • Makeflow can verify if your Makeflow file is syntactically correct % makeflow -k sims.mf Makeflow: Syntax OK. • Makeflow will point out syntax errors if any % makeflow -k sims.mf makeflow: out10 is defined multiple times at out.10:1 and out.10:4 Makeflow Visualization • Makeflow can output a makeflow file as a Dot graph. % makeflow -D dot sims.mf digraph { node [shape=ellipse,color = green,style = unfilled,fixedsize = false]; N2 [label="sim.exe"]; N1 [label="sim.exe"]; N0 [label="sim.exe"]; node [shape=box,color=blue,style=unfilled,fixedsize=false]; F3 [label = "out.30"]; F0 [label = "sim.exe"]; F5 [label = "out.10"]; F2 [label = "in.dat"]; F1 [label = "calib.dat"]; F4 [label = "out.20"]; .. .. Example App: Biocompute Portal BLAST SSAHA SHRIMP EST MAKER … Progress Bar Transaction Log Generate Makefile Update Status Run Workflow Make flow Submit Tasks Condor Pool Makeflow + Work Queue Makeflow + Batch System Makefile XSEDE Cluster Private Cluster Campus Condor Pool Public Cloud Provider Makeflow Local Files and Programs Makeflow + Work Queue sge_submit_workers W XSEDE Cluster Makefile Makeflow submit tasks W W W W W Private Cluster W Campus Condor Pool W Thousands of Workers in a Personal Cloud W W W W Public Cloud Provider Local Files and Programs condor_submit_workers ssh W Advantages of Work Queue • Scalability: Harness multiple infrastructure simultaneously. • Elasticity: Scale resources up & down as needed. • Data Management: Remote data caching. • Data Locality: Matches tasks to nodes with data. Fault Tolerance • MF +WQ is fault tolerant : – If Makeflow crashes (or killed), it recovers by reading log and continues where it left off. – If a worker crashes, the master will detect and restart the task elsewhere. – Workers can be added and removed any time during execution. Makeflow and Work Queue To start the Makeflow % makeflow -T wq sims.mf Could not create work_queue on port 9123. % makeflow -T wq -p 0 sims.mf Listening for workers on port 8374… To start one worker: % work_queue_worker ccl.cse.nd.edu 8374 Start Workers Everywhere! Submit workers to SGE: % sge_submit_workers ccl.cse.nd.edu 8374 25 Submit workers to Condor: % condor_submit_workers ccl.cse.nd.edu 8374 25 Submit workers to Torque: % torque_submit_workers ccl.cse.nd.edu 8374 25 Keeping track of port numbers gets old fast… Project Names makeflow … –a –N myproject work_queue_worker -a –N myproject connect to ccl.cse.nd.edu:4057 Makeflow Worker (port 4057) advertise query Catalog “myproject” is at ccl.cse.nd.edu:4057 Makeflow with Project Names Start Makeflow with a project name: % makeflow -T wq -p 0 -a -N xsede-tutorial sims.mf Listening for workers on port XYZ… Start one worker: % work_queue_worker -N xsede-tutorial Start many workers: % sge_submit_workers -N ccgrid-tutorial 5 http://www.nd.edu/~ccl/software/makeflow/ Makeflow The Cooperative Computing Lab •• Portable: One program for clusters, grids, clouds We collaborate with people who have large scale computing problems in science, • Simple syntax: inputs, outputs, command engineering, and other fields. • Alloperate files needed by command be specified • We computer systemsmust on the O(10,000) with cores: clusters, clouds, grids. • Makeflow Work Queue • We conduct computer science research in • Federation, Elasticity, Data management the context of real people and problems. • Project Names We develop open source software for large scale distributed computing. • Easy to remember locations of Makeflow masters 43 Acknowledgements • Chris Hempel (TACC) • David Gignac (TACC) Go to: http://www.nd.edu/~ccl Click on “Tutorial at XSEDE 2013” Click on “Tutorial” under Makeflow