Work Queue: A Scalable Master/Worker Framework
Peter Bui
June 29, 2010

Master/Worker Model
• Central Master application
  o Divides work into tasks
  o Sends tasks to Workers
  o Gathers results
• Distributed collection of Workers
  o Receives input and executable files
  o Runs executable files
  o Returns output files

Work Queue versus MPI
• Work Queue
  o Number of workers is dynamic
  o Scales up to a large number of workers (100s - 1000s)
  o Reliable and fault tolerant at the task level
  o Allows for heterogeneous deployment environments
  o Workers communicate only with the Master
• MPI
  o Number of workers is static
  o Scales up to a limited number of workers (16, 32, 64)
  o Reliable at the application level, but no fault tolerance
  o Requires a homogeneous deployment environment
  o Workers can communicate with anyone

Success Stories
• All-Pairs
• Makeflow
• Wavefront
• SAND

Architecture (Overview)

Architecture (Master)
• Uses Work Queue library
  o Creates a Queue
  o Submits Tasks, each specifying a command, input files, and output files
  o Library keeps track of Tasks: when a Worker is available, the library sends it Tasks
  o When Tasks complete, retrieves output files

Architecture (Workers)
• Users start workers on any machine
• Workers contact the Master and request work
• When a Task is received, perform the computation and return results
• After a set idle timeout, quit and clean up

API Overview (Work Queue)
Simple C API
• Work Queue
  o work_queue_create(int port)
    Create a new work queue.
  o work_queue_delete(struct work_queue *q)
    Delete a work queue.
  o work_queue_empty(struct work_queue *q)
    Determine whether there are any known tasks queued, running, or waiting to be collected.

API Overview (Task)
Simple C API
• Task
  o work_queue_task_create(const char *command)
    Create a new task specification.
  o work_queue_task_delete(struct work_queue_task *t)
    Delete a task specification.
  o work_queue_task_specify_input_file(struct work_queue_task *t, const char *fname, const char *rname)
    Add input file specification.
  o work_queue_task_specify_output_file(struct work_queue_task *t, const char *rname, const char *fname)
    Add output file specification.

API Overview (Execution)
Simple C API
• Execution
  o work_queue_submit(struct work_queue *q, struct work_queue_task *t)
    Submit a job to a work queue.
  o work_queue_wait(struct work_queue *q, int timeout)
    Wait for tasks to complete.

Software Configuration
Web Information
  http://cse.nd.edu/~ccl/software/installed.shtml
AFS
  $ setenv PATH ~ccl/software/cctools/bin:$PATH
  $ setenv PATH ~condor/software/bin:$PATH
CRC
  $ module use /afs/nd.edu/user37/ccl/software/modulefiles
  $ module load cctools
  $ module load condor

Example 1: DConvert
• Goal: convert a set of input images to a specified format in parallel
  o Input: <format> <input_image1> <input_image2> ...
  o Output: converted images in the specified format
• Skeleton:
  o ~pbui/www/scratch/workqueue-tutorial.tar.gz

DConvert (Preparation)
Set up scratch workspace
  $ mkdir /tmp/$USER-scratch
  $ cd /tmp/$USER-scratch
  $ pwd
Copy source tarball and extract it
  $ cp ~pbui/www/scratch/workqueue-tutorial.tar.gz .
  $ tar xzvf workqueue-tutorial.tar.gz
  $ cd workqueue-tutorial
  $ ls
Open dconvert.c source file for editing
  $ gedit dconvert.c &

DConvert (TODO 1, 2, and 3)
  // TODO 1: include work queue header file
  #include "work_queue.h"

  // TODO 2: declare work queue and task structs
  struct work_queue *q;
  struct work_queue_task *t;

  // TODO 3: create work queue using default port
  q = work_queue_create(0);

DConvert (TODO 4, 5, 6)
  // TODO 4: create task, specify input and output file, submit task
  t = work_queue_task_create(command);
  work_queue_task_specify_input_file(t, input_file, input_file);
  work_queue_task_specify_output_file(t, output_file, output_file);
  work_queue_submit(q, t);

  // TODO 5: while work queue is not empty, wait for a task, then delete the returned task
  while (!work_queue_empty(q)) {
      t = work_queue_wait(q, 10);
      if (t)
          work_queue_task_delete(t);
  }

  // TODO 6: delete work queue
  work_queue_delete(q);

DConvert (Demonstration)
Build and prepare application
  $ make
  $ cp /usr/share/pixmaps/*.png .
Start batch of workers
  $ condor_submit_workers `hostname` 9123 5
Start application
  $ ./dconvert jpg *.png

Tips and Tricks (Debugging)
Debugging
• Enable cctools debugging system
  o In master application:
      debug_flags_set("wq");
      debug_flags_set("debug");
  o In workers:
      work_queue_worker -d debug -d wq <hostname> <port>
• Test with an incrementally increasing number of workers
Failed Execution
• Include the executable and its dependencies as input files
• Make sure workers run on the right target platform (32-bit vs 64-bit, OS, etc.)
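
DConvert (Command Sketch)
For DConvert TODO 4, each task needs a command string and an output file name derived from its input image. The slides do not show how the skeleton builds these, so the following is only a minimal sketch under stated assumptions: the helper name make_convert_command is hypothetical (it is not part of Work Queue), and the conversion is assumed to be done by ImageMagick's convert program being available on the worker.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Work Queue): build the output file name
 * and the command line for converting one input image to `format`.
 * Assumes ImageMagick's `convert` is installed on the worker and that file
 * names contain no spaces or shell metacharacters.
 * Returns 0 on success, -1 if either buffer is too small. */
int make_convert_command(const char *format, const char *input_file,
                         char *output_file, size_t output_size,
                         char *command, size_t command_size)
{
    /* Find the final extension, ignoring dots in directory components. */
    const char *dot = strrchr(input_file, '.');
    const char *slash = strrchr(input_file, '/');
    size_t stem_len = strlen(input_file);
    if (dot && dot != input_file && (!slash || dot > slash + 1))
        stem_len = (size_t)(dot - input_file);

    /* output_file = <stem>.<format> */
    int n = snprintf(output_file, output_size, "%.*s.%s",
                     (int)stem_len, input_file, format);
    if (n < 0 || (size_t)n >= output_size)
        return -1;

    /* command = convert <input_file> <output_file> */
    n = snprintf(command, command_size, "convert %s %s",
                 input_file, output_file);
    if (n < 0 || (size_t)n >= command_size)
        return -1;

    return 0;
}
```

The resulting command, input_file, and output_file strings would then be passed to work_queue_task_create, work_queue_task_specify_input_file, and work_queue_task_specify_output_file as in TODO 4.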
Tips and Tricks (Tasks)
Tag Tasks
• Give a task an identifying tag so the Master can keep track of it
Use input and output buffers
• work_queue_task_specify_input_buf
  o Contents of buffer will be materialized as a file
• task->output
  o Buffer that contains standard output of task
Check task results
• task->result: result of task
• task->return_status: exit code of command line at worker

Tips and Tricks (Batch)
Custom Worker Environment
• Modify batch system specific submit scripts
  o condor_submit_workers: set requirements
  o sge_submit_workers: set environment, set modules

Tips and Tricks (CRC)
Submit master, find host, submit workers
• qsub myscript.sh
    #!/bin/csh
    master
• qstat -u <afsid> | grep myscript.sh
• sge_submit_workers <hostname> <port>

Example 2: Mandelbrot Generator
• Goal: generate Mandelbrot image
  o Input: <width> <height> <xmin> <xmax> <ymin> <ymax> <max_iterations>
  o Output: Mandelbrot image in PPM format
• Skeleton:
  o ~pbui/www/scratch/workqueue-tutorial.tar.gz

Mandelbrot (Overview)
z_{n+1} = z_n^2 + c
Escape Time Algorithm
• For each pixel (r, c) in the image, calculate whether the corresponding point (x, y) escapes the boundary
• Iterative algorithm where each pixel computation is independent
Application design
• Master partitions image into tasks
• Workers compute Escape Time Algorithm on partitions

Mandelbrot (Naive Approach)
Master
• For each pixel (r, c) in image (width x height)
  o Compute corresponding x, y
  o Submit a task for the pixel with x, y
  o Pass x, y parameters as input buffer
  o Tag task with r, c values
• Wait for each task to complete:
  o Retrieve output of worker from task->output
  o Retrieve r, c from task->tag
  o Store pixel[r, c] = output
• Output pixels in PPM format

Mandelbrot (Naive Approach)
Worker
• Read in parameters from input file:
  o x0, y0, max_iterations, black_value
• Perform Mandelbrot computation as specified on Wikipedia:
  o http://en.wikipedia.org/wiki/Mandelbrot_set#For_programmers
• Output result (iterations) to standard out

Mandelbrot (Analysis)
Problem
• Processing each pixel as a single task is inefficient
  o Too fine-grained
  o Overhead of sending parameters, running tasks, and retrieving results exceeds the computation time
Work Queue Golden Rule:
  Computation Time > Data Transfer Time + Task Setup Overhead

Mandelbrot (Better Approach)
Send Rows
• Process groups of pixels rather than individual ones:
  o Send a row and have the worker return a series of results
  o Perhaps send multiple rows?
• Should reduce execution time from minutes to seconds

Mandelbrot (Demonstration)
Build application
  $ make
Start batch of workers
  $ condor_submit_workers `hostname` 9123 10
Start application
  $ ./mandelbrot_master 512 512 -2 1 -1.5 1.5 250 > output.ppm
  $ display output.ppm

Advanced Features
Fast Abort
• Allow Work Queue to pre-emptively kill slow tasks
• work_queue_activate_fast_abort(q, X)
  o X is the fast abort multiplier
  o if (runtime >= average_runtime * X) fast_abort
Scheduling
• Change how workers are selected
  o FCFS: first come, first served
  o FILES: has the most cached files
  o TIME: fastest average turnaround time
• Can be set for queue or for task

Advanced Features (More)
Automatic Master Detection
• Start master with a project name:
  o setenv WORK_QUEUE_NAME "project_name"
• Enable master auto selection mode with workers
  o work_queue_worker -a -N "project_name"
  o work_queue_pool -T condor -a -N "project_name"
• Check out master at http://chirp.cse.nd.edu
Shut down workers
• work_queue_shut_down_workers

Web Resources
Website
  http://www.nd.edu/~ccl/software/workqueue/
• User manual and C API documentation
Bug Reports and Suggestions
  http://www.cse.nd.edu/~ccl/software/help.shtml
Python-API
  http://bitbucket.org/pbui/python-workqueue/
• Experimental Python binding
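
Mandelbrot (Row Sketch)
Returning to the Mandelbrot example, the Escape Time Algorithm and the row-based partitioning from "Mandelbrot (Better Approach)" can be sketched roughly as below. This is not the tutorial's worker source: the function names escape_time and mandelbrot_row are hypothetical, and the linear mapping from pixel (row, col) to point (x0, y0) is an assumption. The per-point loop follows the standard escape time formulation (iterate z_{n+1} = z_n^2 + c from z_0 = 0 until |z| > 2 or the iteration cap is reached).

```c
/* Escape time for the point c = (x0, y0): the number of iterations of
 * z_{n+1} = z_n^2 + c, starting from z_0 = 0, performed before |z| exceeds 2,
 * capped at max_iterations. */
int escape_time(double x0, double y0, int max_iterations)
{
    double x = 0.0, y = 0.0;
    int i = 0;
    while (x * x + y * y <= 4.0 && i < max_iterations) {
        double x_next = x * x - y * y + x0;  /* real part of z^2 + c */
        y = 2.0 * x * y + y0;                /* imaginary part of z^2 + c */
        x = x_next;
        i++;
    }
    return i;
}

/* Hypothetical worker routine: compute one whole row, so that a single task
 * carries `width` pixels of computation instead of one pixel.
 * Assumes pixels map linearly onto [xmin, xmax) x [ymin, ymax). */
void mandelbrot_row(int row, int width, int height,
                    double xmin, double xmax, double ymin, double ymax,
                    int max_iterations, int *out)
{
    double y0 = ymin + (ymax - ymin) * row / height;
    for (int col = 0; col < width; col++) {
        double x0 = xmin + (xmax - xmin) * col / width;
        out[col] = escape_time(x0, y0, max_iterations);
    }
}
```

With one task per row, a 512 x 512 image needs 512 tasks instead of 262,144, so the fixed transfer and setup cost is paid far less often relative to computation, which is the direction the Work Queue Golden Rule asks for.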