On-the-Fly Pipeline Parallelism

SPAA, July 24, 2013
I-Ting Angelina Lee*, Charles E. Leiserson*, Tao B. Schardl*,
Jim Sukha†, and Zhunping Zhang*
MIT CSAIL*
Intel Corporation†
Dedup PARSEC Benchmark [BKS08]
Dedup compresses a stream of data by compressing unique elements
and removing duplicates.
int fd_out = open_output_file();
bool done = false;
while(!done) {
  // Stage 0: while there is more data, read the next chunk from the stream.
  chunk_t *chunk = get_next_chunk();
  if(chunk == NULL) { done = true; }
  else {
    // Stage 1: check for duplicates.
    chunk->is_dup = deduplicate(chunk);
    // Stage 2: compress first-seen chunk.
    if(!chunk->is_dup) compress(chunk);
    // Stage 3: write to output file.
    write_to_file(fd_out, chunk);
  }
}
Parallelism in Dedup
while(!done) {
  chunk_t *chunk = get_next_chunk();        // Stage 0
  if(chunk == NULL) { done = true; }
  else {
    chunk->is_dup = deduplicate(chunk);     // Stage 1
    if(!chunk->is_dup) compress(chunk);     // Stage 2
    write_to_file(fd_out, chunk);           // Stage 3
  }
}
[Figure: the pipeline dag for iterations i0 through i5, with one row of nodes per stage (stages 0 through 3); horizontal cross edges connect the same stage in consecutive iterations.]

Let's model Dedup's execution as a pipeline dag.
• A node denotes the execution of a stage in an iteration.
• Edges denote dependencies between nodes; a cross edge connects a stage in one iteration to the same stage in the next.

Dedup exhibits pipeline parallelism.
Pipeline Parallelism
We can measure parallelism in terms of work and span [CLRS09].
Example: [Figure: a pipeline dag whose nodes have weight 1 or weight 8.]

Work T1: the sum of the weights of the nodes in the dag. Here T1 = 75.
Span T∞: the length of a longest path in the dag. Here T∞ = 20.
Parallelism T1 / T∞: the maximum possible speedup. Here T1 / T∞ = 3.75.
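Work and span can be computed directly from a weighted dag. Below is a minimal sketch of that computation by dynamic programming over a topologically ordered node array; the layout (struct dag_node, MAX_PREDS) is my own illustration, not anything from the talk.

/* Sketch: compute work and span of a weighted dag by dynamic programming.
   Assumes nodes[] is in topological order (every predecessor index is
   smaller than the node's own index). Illustrative layout only. */
#define MAX_PREDS 2   /* pipeline-dag nodes have at most two incoming edges */

struct dag_node {
    long weight;              /* cost of executing this node */
    int  npreds;              /* number of predecessors */
    int  preds[MAX_PREDS];    /* indices of predecessors */
};

void work_and_span(const struct dag_node *nodes, int n,
                   long *work, long *span) {
    long dist[n];             /* longest weighted path ending at node i */
    *work = 0;
    *span = 0;
    for (int i = 0; i < n; i++) {
        long best = 0;        /* longest path among predecessors */
        for (int j = 0; j < nodes[i].npreds; j++)
            if (dist[nodes[i].preds[j]] > best)
                best = dist[nodes[i].preds[j]];
        dist[i] = best + nodes[i].weight;
        *work += nodes[i].weight;
        if (dist[i] > *span) *span = dist[i];
    }
}

On the example dag above, this procedure yields T1 = 75 and T∞ = 20, giving parallelism 75/20 = 3.75.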
Executing a Parallel Pipeline
To execute Dedup in parallel, we must answer two questions:
• How do we encode the parallelism in Dedup?
• How do we assign work to parallel processors to execute this computation efficiently?

[Figure: the staged Dedup loop alongside its pipeline dag for iterations i0 through i5.]
Construct-and-Run Pipelining
tbb::pipeline pipeline;
GetChunk_Filter filter1(SERIAL, item);
Deduplicate_Filter filter2(SERIAL);
Compress_Filter filter3(PARALLEL);
WriteToFile_Filter filter4(SERIAL, out_item);
pipeline.add_filter(filter1);
pipeline.add_filter(filter2);
pipeline.add_filter(filter3);
pipeline.add_filter(filter4);
pipeline.run(pipeline_depth);
A construct-and-run pipeline specifies the stages and their
dependencies a priori before execution.
Ex: TBB [MRR12], StreamIt [GTA06], GRAMPS [SLY+11]
On-the-Fly Pipelining of X264
[Figure: an x264 frame sequence I P P P I P P I P P P with dependency edges between frames.]

Not easily expressible using TBB's pipeline construct [RCJ11].
An on-the-fly pipeline is constructed dynamically as the
program executes.
On-the-Fly Pipeline Parallelism in Cilk-P
We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system, named Cilk-P, which features:
• simple linguistics for specifying on-the-fly pipeline parallelism that are composable with Cilk's existing fork-join primitives; and
• PIPER, a theoretically sound randomized work-stealing scheduler that handles both pipeline and fork-join parallelism.

We hand-compiled 3 applications with pipeline parallelism (ferret, dedup, and x264 from PARSEC [BKS08]) to run on Cilk-P. Empirical results indicate that Cilk-P exhibits low serial overhead and good scalability.
Outline
• On-the-Fly Pipeline Parallelism
• The Pipeline Linguistics in Cilk-P
• The PIPER Scheduler
• Empirical Evaluation
• Concluding Remarks
The Pipeline Linguistics in Cilk-P
int fd_out = open_output_file();
bool done = false;
while(!done) {
  chunk_t *chunk = get_next_chunk();
  if(chunk == NULL) { done = true; }
  else {
    chunk->is_dup = deduplicate(chunk);
    if(!chunk->is_dup) compress(chunk);
    write_to_file(fd_out, chunk);
  }
}
Loop iterations may execute in parallel in a pipelined fashion, where stage 0 executes serially:

int fd_out = open_output_file();
bool done = false;
pipe_while(!done) {
  chunk_t *chunk = get_next_chunk();
  if(chunk == NULL) { done = true; }
  else {
    pipe_wait(1);      // End the current stage, advance to stage 1, and wait
                       // for the previous iteration to finish stage 1.
    chunk->is_dup = deduplicate(chunk);
    pipe_continue(2);  // End the current stage and advance to stage 2.
    if(!chunk->is_dup) compress(chunk);
    pipe_wait(3);      // End the current stage, advance to stage 3, and wait
                       // for the previous iteration to finish stage 3.
    write_to_file(fd_out, chunk);
  }
}
These keywords denote the logical parallelism of the computation:
• The pipe_while enforces that stage 0 executes serially.
• The pipe_wait(1) enforces cross edges across stage 1.
• The pipe_wait(3) enforces cross edges across stage 3.

[Figure: the resulting pipeline dag, stages 0 through 3 per iteration, with cross edges at stages 0, 1, and 3.]
These keywords have serial semantics: when elided or replaced with their serial counterparts, a legal serial program results, whose semantics is one of the legal interpretations of the parallel code [FLR98].
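The serial elision can be made concrete with preprocessor definitions. A minimal sketch, using hypothetical macros rather than Cilk-P's actual implementation:

/* Hypothetical serial elision of the pipeline keywords (illustration
   only; not how Cilk-P implements them). */
#define pipe_while        while   /* run the loop as an ordinary serial loop */
#define pipe_wait(n)              /* stage boundary becomes a no-op */
#define pipe_continue(n)          /* stage boundary becomes a no-op */

With these definitions, the pipe_while loop above compiles to exactly the serial Dedup loop shown earlier.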
On-the-Fly Pipelining of X264
The program controls the execution of pipe_wait and pipe_continue
statements, thus supporting on-the-fly pipeline parallelism.
[Figure: the x264 frame sequence I P P P I P P I P P P with data-dependent cross edges between frames.]

Program control can thus:
• skip stages;
• make cross edges data dependent; and
• vary the number of stages across iterations.

We can pipeline the x264 video encoder using Cilk-P, as the sketch below illustrates.
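A minimal sketch of what such data-dependent control might look like in these linguistics; frame_t, read_frame, TYPE_I, and the encode_* helpers are invented names, not the actual x264 port:

bool done = false;
pipe_while(!done) {
  frame_t *f = read_frame();       /* stage 0: serial frame input */
  if(f == NULL) { done = true; }
  else {
    if(f->type == TYPE_I) {
      pipe_continue(1);            /* no cross edge: an I-frame does not
                                      depend on the previous frame */
      encode_intra(f);
    } else {
      pipe_wait(1);                /* data-dependent cross edge: wait for
                                      the previous frame's stage 1 */
      encode_inter(f);
    }
    pipe_wait(2);                  /* serial, in-order output stage */
    write_frame(f);
  }
}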
Pipelining X264 with Pthreads
The scheduling logic is embedded in the application code.

[Figure: the frame sequence I P P P I P P I P P P; the main control thread creates one worker thread per frame with pthread_create and collects them with pthread_join.]
Cross-edge dependencies are enforced by data synchronization. A worker that makes progress publishes it:

pthread_mutex_lock(...);
update my_var;
pthread_cond_broadcast(...);
pthread_mutex_unlock(...);

while a worker that needs that progress waits for it:

pthread_mutex_lock(...);
while(my_var < value) {
  pthread_cond_wait(...);
}
pthread_mutex_unlock(...);
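Filled out, the handshake might look like the following minimal sketch; the names (frames_done, signal_progress, wait_for_progress) are hypothetical, not x264's actual code:

/* A minimal sketch of the cross-edge handshake with a mutex and a
   condition variable. Hypothetical names; not x264's actual code. */
#include <pthread.h>

static pthread_mutex_t mut  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int frames_done = 0;           /* progress of the previous frame */

void signal_progress(void) {          /* called by the finishing worker */
    pthread_mutex_lock(&mut);
    frames_done++;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&mut);
}

void wait_for_progress(int value) {   /* called by the waiting worker */
    pthread_mutex_lock(&mut);
    while (frames_done < value)
        pthread_cond_wait(&cond, &mut);
    pthread_mutex_unlock(&mut);
}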
The cross-edge dependencies are thus enforced via data synchronization with locks and condition variables, and all of this scheduling logic lives in the application code.
X264 Performance Comparison
[Plot: speedup over serial execution (0 to 16) versus number of processors P (0 to 16) for Pthreads and Cilk-P.]
Cilk-P achieves comparable performance to Pthreads on x264
without explicit data synchronization.
Outline
• On-the-Fly Pipeline Overview
• The Pipeline Linguistics in Cilk-P
• The PIPER Scheduler
  • A Work-Stealing Scheduler
  • Handling Runaway Pipelines
  • Avoiding Synchronization Overhead
• Concluding Remarks
Guarantees of a Standard Work-Stealing Scheduler [BL99, ABP01]

Definitions:
  TP — execution time on P processors
  T1 — work
  T∞ — span
  T1 / T∞ — parallelism
  SP — stack space on P processors
  S1 — stack space of a serial execution

Given a computation dag with fork-join parallelism, it achieves:
• Time bound: TP ≤ T1 / P + O(T∞ + lg P) expected time, which implies linear speedup when P ≪ T1 / T∞.
• Space bound: SP ≤ P·S1.

The Work-First Principle [FLR98]: minimize the scheduling overhead borne by the work path (T1) and amortize it against the steal path (T∞).
A Work-Stealing Scheduler
(Based on [BL99,ABP01])
[Figure: the pipeline dag for iterations i0 through i5, nodes numbered 1 through 24 (four stages per iteration); three workers P execute nodes, with done, not-done, ready, and executing nodes marked.]

Each worker maintains its own set of ready nodes. If executing a node enables:
  a. two nodes: mark one ready and execute the other one;
  b. one node: execute the enabled node;
  c. zero nodes: execute a node in its ready set.
If the ready set is empty, steal from a randomly chosen worker.
A node has at most two outgoing edges, so a standard work-stealing
scheduler just works ... well, almost.
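The enabling rules above can be sketched with an atomic predecessor count per node. The following is a minimal, single-file illustration under those assumptions; the types and helpers (struct ws_node, push_ready, pop_ready, steal_from_random_victim) are mine, not PIPER's actual interfaces:

/* Minimal sketch of the enabling rules. Each node keeps an atomic count
   of unfinished predecessors; the worker that decrements it to zero
   enables the node. Illustrative only; not PIPER's actual code. */
#include <stdatomic.h>
#include <stddef.h>

struct ws_node {
    void (*run)(struct ws_node *);
    struct ws_node *succ[2];          /* at most two outgoing edges */
    int nsucc;
    atomic_int pending;               /* unfinished predecessors */
};

/* Stubs standing in for the per-worker ready set and the steal protocol. */
static struct ws_node *ready_set[64];
static int nready = 0;
static void push_ready(struct ws_node *n) { ready_set[nready++] = n; }
static struct ws_node *pop_ready(void) {
    return nready > 0 ? ready_set[--nready] : NULL;
}
static struct ws_node *steal_from_random_victim(void) { return NULL; /* stub */ }

static void worker_loop(struct ws_node *node) {
    while (node != NULL) {
        node->run(node);
        struct ws_node *next = NULL;
        for (int i = 0; i < node->nsucc; i++) {
            struct ws_node *s = node->succ[i];
            if (atomic_fetch_sub(&s->pending, 1) == 1) {  /* we enabled s */
                if (next == NULL) next = s;   /* rule (b): execute it */
                else push_ready(s);           /* rule (a): mark the other ready */
            }
        }
        if (next == NULL) next = pop_ready(); /* rule (c): draw from ready set */
        if (next == NULL) next = steal_from_random_victim();
        node = next;
    }
}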
Runaway Pipeline
[Figure: a pipeline dag in which the workers have started many iterations without finishing the earlier ones.]

A runaway pipeline: the scheduler allows many new iterations to be started before finishing old ones.
Problem: unbounded space usage!
[Figure: the same dag with a throttling edge from iteration i0 to iteration i4 (throttling limit K = 4).]
Cilk-P automatically throttles pipelines by inserting a throttling edge
between iteration i and iteration i+K, where K is the throttling limit.
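One way to picture the effect of a throttling edge is a counting semaphore that admits at most K iterations in flight. This is only an analogy with hypothetical names; PIPER encodes the throttling edge in the dag itself rather than blocking on a semaphore:

/* Analogy only: a counting semaphore admitting at most K iterations in
   flight, mimicking a throttling edge between iterations i and i+K. */
#include <semaphore.h>

#define K 4                       /* throttling limit */
static sem_t throttle;

static void iteration_done(void) {
    sem_post(&throttle);          /* iteration i finished: admit i+K */
}

static void pipeline_driver(int n) {
    sem_init(&throttle, 0, K);
    for (int i = 0; i < n; i++) {
        sem_wait(&throttle);      /* blocks while K iterations are in flight */
        /* ... start iteration i; it calls iteration_done() when finished ... */
    }
}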
Synchronization Overhead
[Figure: the pipeline dag; a node whose two predecessors, the previous stage in its own iteration and the same stage in the previous iteration, run on different workers.]

If two predecessors of a node are executed by different workers, synchronization is necessary: whoever finishes last enables the node.
[Figure: the same dag annotated with a check-left arrow at a pipe_wait and a check-right arrow at the end of a stage.]
At pipe_wait(j), iteration i must check
left to see if stage j is done in iteration i-1.
At the end of a stage, iteration i must check
right to see if it enabled a node in iteration i+1.
Cilk-P implements “lazy enabling” to mitigate the check-right overhead
and “dependency folding” to mitigate the check-left overhead.
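A minimal sketch of the check-left test with a cached progress counter, in the spirit of dependency folding; the frame layout and names are my illustration, not Cilk-P's actual code:

/* Sketch of check-left with a cached progress counter. Illustrative
   names; not Cilk-P's actual code. */
#include <stdatomic.h>
#include <stdbool.h>

struct iter_frame {
    atomic_int stage_done;        /* last stage this iteration completed */
};

/* At pipe_wait(j), iteration i checks left against iteration i-1. A cached
   copy of the predecessor's counter lets later waits on already-satisfied
   stages skip the shared-memory read entirely. */
static bool check_left(struct iter_frame *prev, int j, int *cached) {
    if (*cached >= j) return true;             /* folded: no synchronization */
    *cached = atomic_load(&prev->stage_done);
    return *cached >= j;
}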
Lazy Enabling
[Figure: a thief steals from a worker and takes over the check-right responsibility for that worker's iteration.]

Idea: be really, really lazy about the check-right operation. Punt the responsibility of checking right onto a thief that steals, or defer it until the worker runs out of nodes to execute in its iteration.
Lazy enabling is in accordance with the work-first principle [FLR98].
PIPER's Guarantees
Definitions:
  TP — execution time on P processors
  T1 — work
  T∞ — span of the throttled dag
  T1 / T∞ — parallelism
  SP — stack space on P processors
  S1 — stack space of a serial execution
  K — throttling limit
  f — maximum frame size
  D — depth of nested pipelines

• Time bound: TP ≤ T1 / P + O(T∞ + lg P) expected time, which implies linear speedup when P ≪ T1 / T∞.
• Space bound: SP ≤ P(S1 + fDK).
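For illustration only, with hypothetical numbers (not measurements from the paper): taking S1 = 1 MB, f = 1 KB, D = 1 (no nesting), and K = 4P on P = 16 processors gives SP ≤ 16 · (1 MB + 1 KB · 1 · 64) = 17 MB, so the space bound grows linearly in P rather than with the number of iterations in flight.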
Outline
• On-the-Fly Pipeline Parallelism
• The Pipeline Linguistics in Cilk-P
• The PIPER Scheduler
• Empirical Evaluation
• Concluding Remarks
Experimental Setup
• All experiments were run on an AMD Opteron system with 4 quad-core 2 GHz CPUs and a total of 8 GBytes of memory.
• Code was compiled using GCC (or G++ for TBB) 4.4.5 with -O3 optimization (except for x264, which uses -O4 by default).
• The Pthreaded implementations of ferret and dedup employ the oversubscription method, which creates more than one thread per pipeline stage. We limited the number of cores used by the Pthreaded implementations using taskset, but experimented to find the best configuration.
• All benchmarks are throttled similarly.
• Each data point shown is the average of 10 runs, typically with standard deviation less than a few percent.
Ferret Performance Comparison
[Plot: speedup over serial execution (0 to 16) versus number of processors P (0 to 16) for Pthreads, TBB, and Cilk-P. Throttling limit = 10P.]
No performance penalty is incurred for using the more general on-the-fly pipeline instead of a construct-and-run pipeline.
Dedup Performance Comparison
[Plot: speedup over serial execution (0 to 16) versus number of processors P (0 to 16) for Pthreads, TBB, and Cilk-P. Throttling limit = 4P.]
The measured parallelism of the Cilk-P (and TBB) pipeline is merely 7.4. The Pthreaded implementation has more parallelism because its stages are unordered.
X264 Performance Comparison
[Plot: speedup over serial execution (0 to 16) versus number of processors P (0 to 16) for Pthreads and Cilk-P.]
Cilk-P achieves comparable performance to Pthreads on x264
without explicit data synchronization.
On-the-Fly Pipeline Parallelism in Cilk-P
We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system, named Cilk-P, which features simple linguistics that:
• specify on-the-fly pipelines;
• are composable with Cilk's fork-join primitives (see the sketch below);
• have serial semantics; and
• allow users to synchronize via control constructs;
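For instance, fork-join parallelism nests inside a pipeline stage. A hypothetical sketch, assuming a chunk can be split into two halves; the fields first_half and second_half are invented for illustration:

/* Hypothetical composition of fork-join and pipeline parallelism: a
   pipeline stage that spawns parallel subwork. */
pipe_while(!done) {
  chunk_t *chunk = get_next_chunk();
  if(chunk == NULL) { done = true; }
  else {
    pipe_wait(1);
    chunk->is_dup = deduplicate(chunk);
    pipe_continue(2);
    if(!chunk->is_dup) {
      cilk_spawn compress(chunk->first_half);  /* fork-join inside stage 2 */
      compress(chunk->second_half);
      cilk_sync;                               /* join before leaving the stage */
    }
    pipe_wait(3);
    write_to_file(fd_out, chunk);
  }
}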
and the PIPER scheduler that:
• supports both pipeline and fork-join parallelism;
• is asymptotically efficient;
• uses bounded space; and
• empirically demonstrates low serial overhead and good scalability.

Intel has created an experimental branch of its Cilk Plus runtime with support for on-the-fly pipelines based on Cilk-P:
https://intelcilkruntime@bitbucket.org/intelcilkruntime/intel-cilk-runtime.git
Impact of Throttling
We automatically throttle to save space, but the user shouldn’t
worry about throttling affecting performance.
How does throttling a pipeline computation affect its performance?
[Figure: a regular pipeline dag with uniform stages across iterations.]
If the dag is regular, then the work and span of the throttled dag
asymptotically match that of the unthrottled dag.
[Figure: an irregular pipeline dag with stage weights such as T1^{1/3} + 1 and (T1^{2/3} + T1^{1/3})/2.]
If the dag is irregular, then there are pipelines where no throttling
scheduler can achieve speedup.
Dedup Performance Comparison
[Plot: speedup over serial execution (0 to 16) versus number of processors P (0 to 16) for Pthreads, TBB, Cilk-P, and Modified Cilk-P. Throttling limit = 4P.]
Modified Cilk-P uses a single worker thread for writing out output, like the
Pthreaded implementation, which helps performance as well.
The Cilk Programming Model
int fib(int n) {
  if(n < 2) { return n; }
  int x = cilk_spawn fib(n-1);  // The named child function may execute in
                                // parallel with the parent caller.
  int y = fib(n-2);
  cilk_sync;                    // Control cannot pass this point until all
                                // spawned children have returned.
  return (x + y);
}
Cilk keywords grant permission for parallel execution.
They do not command parallel execution.
Pipelining with TBB
1. Create a pipeline object.
2. Execute.
…
tbb::parallel_pipeline(
  INNER_PIPELINE_NUM_TOKENS,
  tbb::make_filter< void, one_chunk* >(
    tbb::filter::serial, get_next_chunk) &
  tbb::make_filter< one_chunk*, one_procd_chunk* >(
    tbb::filter::parallel, deduplicate) &
  tbb::make_filter< one_procd_chunk*, one_procd_chunk* >(
    tbb::filter::parallel, compress) &
  tbb::make_filter< one_procd_chunk*, void >(
    tbb::filter::serial, write_to_file)
);
…
Pipelining with Pthreads
1. Encode each stage in its own thread.
2. Assign threads to workers.
3. Execute.

void *Fragment(void *targs) {
  …
  chunk = get_next_chunk();
  buf_insert(&send_buf, chunk);
  …
}

void *Deduplicate(void *targs) {
  …
  chunk = buf_remove(&recv_buf);
  is_dup = deduplicate(chunk);
  if (!is_dup)
    buf_insert(&send_buf_compress, chunk);
  else
    buf_insert(&send_buf_reorder, chunk);
  …
}

void *Compress(void *targs) {
  …
  chunk = buf_remove(&recv_buf);
  compress(chunk);
  buf_insert(&send_buf, chunk);
  …
}

void *Reorder(void *targs) {
  …
  chunk = buf_remove(&recv_buf);
  write_or_enqueue(chunk);
  …
}
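The buf_insert/buf_remove calls suggest bounded producer-consumer queues between stages. A minimal sketch of such a queue with an invented layout (struct buf, BUF_CAP), not dedup's actual implementation:

/* Sketch of a bounded producer-consumer buffer in the style of the
   buf_insert/buf_remove calls above. Invented layout; not dedup's code. */
#include <pthread.h>

#define BUF_CAP 64

struct buf {
    void *items[BUF_CAP];
    int head, tail, count;
    pthread_mutex_t mut;
    pthread_cond_t not_empty, not_full;
};

void buf_insert(struct buf *b, void *item) {
    pthread_mutex_lock(&b->mut);
    while (b->count == BUF_CAP)               /* block while the buffer is full */
        pthread_cond_wait(&b->not_full, &b->mut);
    b->items[b->tail] = item;
    b->tail = (b->tail + 1) % BUF_CAP;
    b->count++;
    pthread_cond_signal(&b->not_empty);
    pthread_mutex_unlock(&b->mut);
}

void *buf_remove(struct buf *b) {
    pthread_mutex_lock(&b->mut);
    while (b->count == 0)                     /* block while the buffer is empty */
        pthread_cond_wait(&b->not_empty, &b->mut);
    void *item = b->items[b->head];
    b->head = (b->head + 1) % BUF_CAP;
    b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->mut);
    return item;
}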
Pipelining X264 with Pthreads
The cross-edge dependencies are enforced via data synchronization with locks and condition variables.

[Figure: the frame sequence I P P P I P P I P P P with its cross-frame dependencies.]

Encoding a video with 512 frames on 16 processors, the application code makes:
  pthread_mutex_lock:      202,776 invocations
  pthread_cond_broadcast:   34,816 invocations
  pthread_cond_wait:        10,068 invocations