MapReduce CPU vs GPU

MapReduce As A Language for Parallel Computing
Wenguang CHEN, Dehao CHEN
Tsinghua University
Future Architecture
• Many alternatives
– A few powerful cores (Intel/AMD: 2, 3, 4, 6 …)
– Many simple cores (nVidia, ATI, Larrabee: 32, 128, 196, 256 …)
– Heterogeneous (CELL: 1 PPE / 8 SPEs; FPGA speedups …)
• But programming them is not easy
• They all use different programming models; some are (relatively) easy, some are extremely difficult
– OpenMP, MPI, MapReduce
– CUDA, Brook
– Verilog, SystemC
What makes parallel computing so difficult
• Parallelism identification and expression
– Auto-parallelization has failed so far
• Complex synchronization may be required
– Data races and deadlocks, which are difficult to debug (see the sketch below)
• Load balance …
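To make the data-race point concrete (this example is mine, not from the slides): in the CUDA sketch below, 256 threads increment one shared counter. The unsynchronized kernel silently loses updates and nothing crashes, which is exactly why such bugs are hard to debug; the atomicAdd version always yields 256.

#include <cstdio>
#include <cuda_runtime.h>

// Every thread does a non-atomic read-modify-write on the same counter,
// so concurrent updates overwrite each other and increments are lost.
__global__ void racy_inc(int *c)   { (*c)++; }

// atomicAdd serializes the update; the count is always exact.
__global__ void atomic_inc(int *c) { atomicAdd(c, 1); }

int main() {
    int h, *d;
    cudaMalloc(&d, sizeof(int));

    cudaMemset(d, 0, sizeof(int));
    racy_inc<<<1, 256>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("racy:   %d (expected 256)\n", h);   // usually far less than 256

    cudaMemset(d, 0, sizeof(int));
    atomic_inc<<<1, 256>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic: %d\n", h);                  // always 256

    cudaFree(d);
    return 0;
}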
Map-Reduce is promising
• Can only solve a subset of problems
– But an important and fast-growing subset, such as indexing
• Easy to use
– Programmers only need to write sequential code
– The simplest practical parallel programming paradigm?
• The dominant programming paradigm in Internet companies
• Originally designed for distributed systems, now ported to GPU, CELL, and multicore
– But there are many dialects, which hurts portability
Limitations on GPUs
• Rely on the CPU to allocate memory
– How to support variable-length data?
• Combine size and offset information with the key/value pair
– How to allocate the output buffer on the GPU?
• Two-pass scan: get the counts first, then do the real execution
• Lack of lock support
– How to synchronize to avoid write conflicts?
• Memory is pre-allocated, so every thread knows where it should write (see the sketch below)
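A minimal CUDA sketch of this two-pass, lock-free scheme (my illustration, not Mars code; all names are invented), using a toy workload that keeps every non-space byte: pass 1 only counts output sizes, an exclusive prefix sum turns the counts into private write offsets, and pass 2 writes without any locks because no two threads share an output slot.

#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Pass 1: each thread reports how many bytes it will emit.
__global__ void count_pass(const char *in, int n, int *counts) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) counts[t] = (in[t] != ' ') ? 1 : 0;
}

// Pass 2: each thread writes at its pre-computed offset; by construction
// no two threads touch the same output byte, so no lock is needed.
__global__ void emit_pass(const char *in, int n, const int *offs, char *out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n && in[t] != ' ') out[offs[t]] = in[t];
}

int main() {
    const char h_in[] = "a b c";
    const int n = 5;
    thrust::device_vector<char> in(h_in, h_in + n);
    thrust::device_vector<int> counts(n), offs(n);

    count_pass<<<1, 256>>>(in.data().get(), n, counts.data().get());
    // Exclusive prefix sum: per-thread sizes -> per-thread start offsets.
    thrust::exclusive_scan(counts.begin(), counts.end(), offs.begin());
    int total = (int)offs[n - 1] + (int)counts[n - 1];  // exact output size

    thrust::device_vector<char> out(total);             // allocate once, up front
    emit_pass<<<1, 256>>>(in.data().get(), n, offs.data().get(), out.data().get());
    cudaDeviceSynchronize();
    printf("emitted %d bytes\n", total);                // "abc" -> 3
    return 0;
}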
MapReduce on Multi-core CPU (Phoenix [HPCA'07])
• Pipeline: Input → Split → Map → Partition → Reduce → Merge → Output
MapReduce on GPU (Mars [PACT'08])
• Pipeline: Input → MapCount → Prefixsum → Allocate intermediate buffer on GPU → Map → Sort and Group → ReduceCount → Prefixsum → Allocate output buffer on GPU → Reduce → Output
• The MapCount / Prefixsum / Allocate / Map sequence (and its Reduce-side twin) is the two-pass scheme sketched above, applied once per phase.
Program Example
• Word Count (Phoenix Implementation)
…
for (i = 0; i < args->length; i++)
{
    curr_ltr = toupper(data[i]);
    switch (state)
    {
    case IN_WORD:
        data[i] = curr_ltr;
        /* A word ends at the first character that is neither a letter
           nor an apostrophe. */
        if ((curr_ltr < 'A' || curr_ltr > 'Z') && curr_ltr != '\'') {
            data[i] = 0;  /* NUL-terminate the word in place */
            /* Emit (word, 1); the runtime groups equal keys later. */
            emit_intermediate(curr_start, (void *)1, &data[i] - curr_start + 1);
            state = NOT_IN_WORD;
        }
        break;
…
Program Example
• Word Count (Mars Implementation)
/* Pass 1: count only. Report each word's size so the runtime can
   prefix-sum the counts and pre-allocate the intermediate buffer. */
__device__ void GPU_MAP_COUNT_FUNC
//(void *key, void *val, int keySize, int valSize)
{
    ….
    do {
        ….
        if (*line != ' ') line++;
        else {
            line++;
            GPU_EMIT_INTER_COUNT_FUNC(wordSize - 1, sizeof(int));
            while (*line == ' ') line++;
            wordSize = 0;
        }
    } while (*line != '\n');
    …
}

/* Pass 2: the real emit. The parsing logic is duplicated; only now
   each word is written to its pre-computed offset. */
__device__ void GPU_MAP_FUNC
//(void *key, void *val, int keySize, int valSize)
{
    ….
    do {
        ….
        if (*line != ' ') line++;
        else {
            line++;
            GPU_EMIT_INTER_FUNC(word, &wordSize, wordSize - 1, sizeof(int));
            while (*line == ' ') line++;
            wordSize = 0;
        }
    } while (*line != '\n');
    …
}
Pros and Cons
• Load balance
– Phoenix: static + dynamic
– Mars: static; assigns the same amount of map/reduce work to each thread
• Pre-allocation
– Lock-free
– But requires the two-pass scan, which is not an efficient solution
• Sorting: the bottleneck of Mars
– Phoenix uses insertion sort dynamically while emitting
– Mars uses bitonic sort: O(n log² n) (sketched below)
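Not the Mars kernel itself, but a minimal standalone bitonic sort in CUDA (power-of-two input assumed) that makes the cost visible: the step kernel runs once per (k, j) pair, there are O(log² n) such pairs, and each pass does O(n) comparisons, giving O(n log² n) total. The data-independent compare-and-swap pattern is what makes the network a natural fit for lock-free GPU execution.

#include <cstdio>
#include <cuda_runtime.h>

// One compare-and-swap step of a bitonic sorting network.
__global__ void bitonic_step(int *d, int j, int k) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int p = i ^ j;                 // partner index for this step
    if (p > i) {
        bool ascending = ((i & k) == 0);    // direction of this subsequence
        if ((ascending && d[i] > d[p]) || (!ascending && d[i] < d[p])) {
            int t = d[i]; d[i] = d[p]; d[p] = t;
        }
    }
}

void bitonic_sort(int *d, int n) {          // n must be a power of two
    int threads = n < 256 ? n : 256;
    int blocks = n / threads;
    for (int k = 2; k <= n; k <<= 1)        // bitonic subsequence size
        for (int j = k >> 1; j > 0; j >>= 1)
            bitonic_step<<<blocks, threads>>>(d, j, k);
    cudaDeviceSynchronize();
}

int main() {
    int h[8] = {5, 1, 7, 2, 8, 3, 6, 4};
    const int n = 8;
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    bitonic_sort(d, n);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%d ", h[i]);    // 1 2 3 4 5 6 7 8
    printf("\n");
    cudaFree(d);
    return 0;
}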
Map-Reduce as a Language, not a Library
• Can we have a portable Map-Reduce that runs efficiently across different architectures?
• Promising
– Map-Reduce already specifies the parallelism well
– No complex synchronization in user code
• But still difficult
– Different architectures provide different features
• Hence either portability or performance problems
– Use the compiler and runtime to cover the architecture differences, just as is done for high-level languages such as C (see the sketch after the diagram below)
[Diagram: one C program is carried to X86, Power, SPARC, … by a single compiler, library, and runtime. Today, Map-Reduce Cluster, Map-Reduce Multicore, and Map-Reduce GPU are separate dialects, each with its own library and runtime; the proposal is one general Map-Reduce language whose per-target libraries and runtimes (cluster, multicore, GPU) hide the differences, just as C toolchains do.]
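To make the "user code stays sequential" point concrete, here is a hedged sketch in plain C (not from the slides; the emit_fn interface and every name in it are invented for illustration). The user writes only the sequential map_chunk function; everything behind the emit callback, here a trivial single-threaded loop, is the part a cluster, multicore, or GPU runtime would supply.

#include <stdio.h>
#include <ctype.h>

/* Hypothetical portable interface: user code is purely sequential and
 * just calls emit(); the runtime behind emit() can differ per target. */
typedef void (*emit_fn)(const char *key, int val);

/* User-written map: split one text chunk into words, emit (word, 1). */
static void map_chunk(const char *chunk, emit_fn emit) {
    char word[64];
    int len = 0;
    for (const char *p = chunk;; p++) {
        if (isalpha((unsigned char)*p) && len < 63) {
            word[len++] = (char)toupper((unsigned char)*p);
        } else if (len > 0) {
            word[len] = '\0';
            emit(word, 1);
            len = 0;
        }
        if (*p == '\0') break;
    }
}

/* Stand-in single-threaded "runtime": a real one would partition the
 * input, run map_chunk in parallel, and group keys before reducing. */
static int g_total = 0;
static void emit_print(const char *key, int val) {
    g_total += val;
    printf("(%s, %d)\n", key, val);
}

int main(void) {
    map_chunk("the quick brown fox", emit_print);
    printf("total words: %d\n", g_total);
    return 0;
}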
Case study on nVidia GPU
• Portability
– Host function support
• Annotate libc functions and inline them
– Dynamic memory allocation
• A big problem; perhaps simply not support it in user code?
• Performance
– Memory hierarchy optimization (identifying global, shared, and read-only memory)
– A typed language is preferable (e.g., int4 type acceleration; see the sketch below)
– Dynamic memory allocation (again!)
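One plausible reading of "int4 type acceleration" (the slide does not spell it out, so this is my interpretation) is vectorized memory access: CUDA's built-in int4 type lets each thread load and store 16 bytes in a single instruction, which a compiler can only do safely when types and alignment are known. A minimal sketch comparing a scalar and a vectorized copy:

#include <cstdio>
#include <cuda_runtime.h>

// Scalar copy: one 4-byte load and store per thread.
__global__ void copy_int(const int *in, int *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = in[t];
}

// Vectorized copy: viewing the buffers as int4 moves 16 bytes per thread
// per instruction. Requires n % 4 == 0 and 16-byte alignment, which
// cudaMalloc guarantees.
__global__ void copy_int4(const int4 *in, int4 *out, int n4) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n4) out[t] = in[t];
}

int main() {
    const int n = 1 << 20;
    int *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemset(d_in, 0, n * sizeof(int));

    copy_int<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    // Same data, a quarter of the threads, 16 bytes per access.
    const int n4 = n / 4;
    copy_int4<<<(n4 + 255) / 256, 256>>>((const int4 *)d_in, (int4 *)d_out, n4);
    cudaDeviceSynchronize();

    printf("copies done\n");
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}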
More to explore
• …