MapReduce As A Language for Parallel Computing
Wenguang CHEN, Dehao CHEN
Tsinghua University

Future Architecture
• Many alternatives
  – A few powerful cores (Intel/AMD: 2, 3, 4, 6 …)
  – Many simple cores (nVidia, ATI, Larrabee: 32, 128, 196, 256 …)
  – Heterogeneous (CELL: 1 PPE + 8 SPEs; FPGA speedup …)
• But programming them is not easy
• They all use different programming models; some are (relatively) easy, some are extremely difficult
  – OpenMP, MPI, MapReduce
  – CUDA, Brook
  – Verilog, SystemC

What makes parallel computing so difficult
• Parallelism identification and expression
  – Auto-parallelization has failed so far
• Complex synchronization may be required
  – Data races and deadlocks, which are difficult to debug
• Load balance …

Map-Reduce is promising
• Can only solve a subset of problems
  – But an important and fast-growing subset, such as indexing
• Easy to use
  – Programmers only need to write sequential code
  – The simplest practical parallel programming paradigm?
• The dominant programming paradigm in Internet companies
• Originally targeted distributed systems; now ported to GPU, CELL, and multicore
  – But many dialects, which hurt portability

Limitations on GPUs
• Rely on the CPU to allocate memory
  – How to support variable-length data?
    • Combine size and offset information with the key/value pair
  – How to allocate output buffers on GPUs?
    • Two-pass scan: get the counts first, then do the real execution (see the sketch after the Pros and Cons slide)
• Lack of lock support
  – How to synchronize to avoid write conflicts?
    • Memory is pre-allocated, so every thread knows where it should write

MapReduce on Multi-core CPU (Phoenix [HPCA'07])
Input → Split → Map → Partition → Reduce → Merge → Output

MapReduce on GPU (Mars [PACT'08])
Input → MapCount → PrefixSum → Allocate intermediate buffer on GPU → Map → Sort and Group → ReduceCount → PrefixSum → Allocate output buffer on GPU → Reduce → Output

Program Example
• Word Count (Phoenix implementation)

    ...
    for (i = 0; i < args->length; i++) {
        curr_ltr = toupper(data[i]);
        switch (state) {
        case IN_WORD:
            data[i] = curr_ltr;
            /* A word ends at the first character that is neither a
             * letter nor an apostrophe. */
            if ((curr_ltr < 'A' || curr_ltr > 'Z') && curr_ltr != '\'') {
                data[i] = 0;
                emit_intermediate(curr_start, (void *)1,
                                  &data[i] - curr_start + 1);
                state = NOT_IN_WORD;
            }
            break;
    ...

Program Example
• Word Count (Mars implementation)

    __device__ void GPU_MAP_COUNT_FUNC
    //(void *key, void *val, int keySize, int valSize)
    {
        ...
        do {
            ...
            if (*line != ' ')
                line++;
            else {
                line++;
                /* First pass: only record the size of the emission */
                GPU_EMIT_INTER_COUNT_FUNC(wordSize - 1, sizeof(int));
                while (*line == ' ')
                    line++;
                wordSize = 0;
            }
        } while (*line != '\n');
        ...
    }

    __device__ void GPU_MAP_FUNC
    //(void *key, void *val, int keySize, int valSize)
    {
        ...
        do {
            ...
            if (*line != ' ')
                line++;
            else {
                line++;
                /* Second pass: actually emit the word */
                GPU_EMIT_INTER_FUNC(word, &wordSize, wordSize - 1, sizeof(int));
                while (*line == ' ')
                    line++;
                wordSize = 0;
            }
        } while (*line != '\n');
        ...
    }

Pros and Cons
• Load balance
  – Phoenix: static + dynamic
  – Mars: static; assigns the same amount of map/reduce workload to each thread
• Pre-allocation
  – Lock-free
  – But requires a two-pass scan, which is not an efficient solution
• Sorting: the bottleneck of Mars
  – Phoenix uses insertion sort dynamically during emitting
  – Mars uses bitonic sort: O(n log² n) (see the bitonic sketch below)
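The two-pass scheme discussed under "Limitations on GPUs" can be made concrete with a small CUDA sketch. This is not the Mars API: the kernel names (map_count, map_emit) and the even-number filter are hypothetical, each thread emits at most one fixed-size record to keep the bookkeeping simple, and the exclusive prefix sum runs on the host for brevity (Mars computes it on the GPU). The point is that once the first pass has fixed every thread's output offset, the second pass can write without any locks.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Pass 1: every thread only reports how much output it will produce. */
    __global__ void map_count(const int *in, int n, int *counts)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            counts[i] = (in[i] % 2 == 0) ? 1 : 0;  /* one record per even value */
    }

    /* Pass 2: the prefix sum has fixed each thread's write position in
     * advance, so no synchronization is needed to avoid write conflicts. */
    __global__ void map_emit(const int *in, int n, const int *offsets, int *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] % 2 == 0)
            out[offsets[i]] = in[i];
    }

    int main(void)
    {
        const int n = 8;
        int h_in[] = {3, 4, 7, 8, 10, 1, 6, 5};
        int h_counts[8], h_offsets[8], h_out[8], total = 0;
        int *d_in, *d_counts, *d_offsets, *d_out;

        cudaMalloc(&d_in, n * sizeof(int));
        cudaMalloc(&d_counts, n * sizeof(int));
        cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

        map_count<<<1, n>>>(d_in, n, d_counts);             /* first pass */
        cudaMemcpy(h_counts, d_counts, n * sizeof(int), cudaMemcpyDeviceToHost);

        for (int i = 0; i < n; i++) {                       /* exclusive prefix sum */
            h_offsets[i] = total;
            total += h_counts[i];
        }

        cudaMalloc(&d_offsets, n * sizeof(int));
        cudaMalloc(&d_out, total * sizeof(int));            /* exact-size output buffer */
        cudaMemcpy(d_offsets, h_offsets, n * sizeof(int), cudaMemcpyHostToDevice);

        map_emit<<<1, n>>>(d_in, n, d_offsets, d_out);      /* second pass */

        cudaMemcpy(h_out, d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < total; i++)
            printf("%d ", h_out[i]);                        /* prints: 4 8 10 6 */
        return 0;
    }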
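For the sorting bullet above, here is a minimal sketch of the classic GPU bitonic sort behind the O(n log² n) figure. It follows the standard compare-exchange formulation, not Mars's actual code, and n must be a power of two.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void bitonic_step(int *a, int n, int j, int k)
    {
        unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned p = i ^ j;                       /* compare-exchange partner */
        if (i >= (unsigned)n || p <= i)
            return;                               /* each pair handled once */
        int up = ((i & k) == 0);                  /* direction of this subsequence */
        if ((up && a[i] > a[p]) || (!up && a[i] < a[p])) {
            int t = a[i]; a[i] = a[p]; a[p] = t;  /* swap out-of-order pair */
        }
    }

    int main(void)
    {
        int h[8] = {7, 2, 8, 1, 5, 6, 3, 4}, n = 8, *d;
        cudaMalloc(&d, n * sizeof(int));
        cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
        /* log n phases of up to log n steps each, with O(n) comparisons per
         * step: O(n log^2 n) work in total. */
        for (int k = 2; k <= n; k <<= 1)
            for (int j = k >> 1; j > 0; j >>= 1)
                bitonic_step<<<(n + 255) / 256, 256>>>(d, n, j, k);
        cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; i++)
            printf("%d ", h[i]);                  /* prints: 1 2 3 4 5 6 7 8 */
        return 0;
    }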
Map-Reduce as a Language, not a library
• Can we have a portable Map-Reduce that runs efficiently across different architectures?
• Promising
  – Map-Reduce already specifies the parallelism well
  – No complex synchronization in user code
• But still difficult
  – Different architectures provide different features
    • Leading to either portability or performance issues
  – Use the compiler and runtime to hide the architecture differences, as we already do when supporting high-level languages such as C

[Diagram: just as a compiler, library, and runtime let one C program target X86, Power, Sparc, …, a general Map-Reduce language could target cluster, multicore, and GPU through per-architecture libraries and runtimes, instead of a separate Map-Reduce dialect for each architecture.]

Case study on nVidia GPU
• Portability
  – Host function support
    • Annotate libc functions and inline them (see the annotation sketch below)
  – Dynamic memory allocation
    • A big problem; simply not support it in user code?
• Performance
  – Memory hierarchy optimization (identifying global, shared, and read-only memory)
  – A typed language is preferable (int4-type acceleration …; see the int4 sketch at the end)
  – Dynamic memory allocation (again!)

More to explore
• …
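A minimal sketch of the host-function-annotation idea from the case study: a function marked __host__ __device__ is compiled for both CPU and GPU, which is roughly what annotating a libc routine for inlining on the device amounts to. my_isspace and count_spaces are hypothetical names, not part of any real system; the atomicAdd is only there to keep the example tiny.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Compiled for both host and device, so map code can call it on the
     * GPU without a separate port; isspace-like behavior, simplified. */
    __host__ __device__ int my_isspace(int c)
    {
        return c == ' ' || c == '\t' || c == '\n';
    }

    __global__ void count_spaces(const char *s, int len, int *count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len && my_isspace(s[i]))
            atomicAdd(count, 1);
    }

    int main(void)
    {
        const char h[] = "hello world\n";
        int len = sizeof(h) - 1, zero = 0, result;
        char *d_s; int *d_c;
        cudaMalloc(&d_s, len);
        cudaMalloc(&d_c, sizeof(int));
        cudaMemcpy(d_s, h, len, cudaMemcpyHostToDevice);
        cudaMemcpy(d_c, &zero, sizeof(int), cudaMemcpyHostToDevice);
        count_spaces<<<1, len>>>(d_s, len, d_c);
        cudaMemcpy(&result, d_c, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d whitespace characters\n", result);   /* prints: 2 */
        return 0;
    }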
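And a minimal sketch of the int4 point: with a vector-typed pointer the compiler can issue one 128-bit load per thread instead of four separate 32-bit loads, which is the kind of acceleration a typed language makes easy to apply. The kernel name sum4 is hypothetical.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void sum4(const int4 *in, int *out, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            int4 v = in[i];                 /* single vectorized, coalesced load */
            out[i] = v.x + v.y + v.z + v.w;
        }
    }

    int main(void)
    {
        int h_in[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* two int4 elements */
        int h_out[2];
        int4 *d_in; int *d_out;
        cudaMalloc(&d_in, sizeof(h_in));
        cudaMalloc(&d_out, sizeof(h_out));
        cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
        sum4<<<1, 2>>>(d_in, d_out, 2);
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        printf("%d %d\n", h_out[0], h_out[1]);    /* prints: 10 26 */
        return 0;
    }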