Statistical Analysis and Machine Learning using Hadoop
Seungjai Min
Samsung SDS
Knowing that…
 Hadoop/Map-Reduce has been successful in analyzing unstructured web content and social media data
 Another source of big data is semi-structured machine/device-generated logs, which require non-trivial data massaging and extensive statistical data mining
Question
 Is Hadoop/Map-Reduce the right framework for implementing statistical analysis (beyond counting and descriptive statistics) and machine learning algorithms (which involve iterations)?
Answer and Contents of this talk
 Yes, Hadoop/Map-Reduce is the right framework
– Why is it better than MPI and CUDA?
– Map-Reduce Design Patterns
– Data Layout Patterns
 No, but there are better alternatives
– Spark/Shark (as an example)
– R/Hadoop (which is neither RHadoop nor RHive)
Contents
 Programming Models
– Map-Reduce vs. MPI vs. Spark vs. CUDA(GPU)
 Map-Reduce Design Patterns
– Privatization Patterns (Summarization / Filtering / Clustering)
– Data Organization Patterns (Join / Transpose)
 Data Layout Patterns
– Row vs. Column vs. BLOB
 Summary
– How to choose the right programming model for your algorithm
Parallel Programming is Difficult
 Too many parallel programming models (languages): Titanium, Co-array Fortran, RapidMind, OpenMP, Cilk, Brook, CUDA, UPC, Chapel, P-threads, Erlang, PVM, OpenCL, Intel TBB, MPI, Fortress, X10
MPI Framework
 The "assembly language" of parallel programming

/* Each rank owns a contiguous block of the array. */
myN = N / nprocs;
for (i = 0; i < myN; i++) {
    A[i] = initialize(i);
}

/* Exchange boundary (halo) elements with the neighboring ranks. */
left_index = ...;   /* index of the element sent to the left neighbor      */
right_index = ...;  /* index where the right neighbor's element is stored  */
if (pid > 0)
    MPI_Send(&A[left_index], 1, MPI_DOUBLE, pid - 1, 0, MPI_COMM_WORLD);
if (pid < nprocs - 1)
    MPI_Recv(&A[right_index], 1, MPI_DOUBLE, pid + 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

/* Average each element with its right neighbor. */
for (i = 0; i < myN; i++) {
    B[i] = (A[i] + A[i + 1]) / 2.0;
}
Map-Reduce Framework
 Parallel programming for the masses!
[Diagram: three parallel pipelines, each running input → Map → key/value pairs → Reduce → output, with the stages labeled Map/Combine/Partition, Shuffle, and Sort/Reduce]
Map-Reduce vs. MPI
 Similarity
– Programming model
• Processes not threads
• Address spaces are separate (data communications are explicit)
– Data locality
• The “owner computes” rule dictates that computation is sent to where the data resides, not the other way round
Map-Reduce vs. MPI
 Differences
– Expressing parallelism: Map-Reduce expresses embarrassingly parallel work (filter + reduction); MPI expresses almost all parallel forms, but is not good for task parallelism
– Data communication: under the hood in Map-Reduce; explicit and user-provided in MPI
– Data layout (locality): under the hood in Map-Reduce; under explicit user control in MPI
– Fault tolerance: under the hood in Map-Reduce; none in MPI (as of MPI 1.0 and MPI 2.0)
GPU
 GPGPU (General-Purpose Graphics Processing Units)
 10~50 times faster than a CPU if the algorithm fits this model
 Good for embarrassingly parallel algorithms (e.g., image processing)
 Cost ($2K~$3.5K) vs. performance (two quad-core CPUs vs. one GPU)
[Diagram: a GPU with many small cores, per-block local memory, and a global memory, next to multi-core CPUs with per-core caches and a shared memory]
Programming CUDA

texture<float, 2> tex;                 // texture reference, declared at file scope

cudaArray* cu_array;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
// Allocate a CUDA array
cudaMallocArray(&cu_array, &desc, width, height);
// Copy image data to the array
cudaMemcpyToArray(cu_array, 0, 0, image, width * height * sizeof(float), cudaMemcpyHostToDevice);
// Bind the array to the texture
cudaBindTextureToArray(tex, cu_array);
// Launch the kernel on a 2-D grid of 16x16 thread blocks
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<< gridDim, blockDim, 0 >>>(d_odata, height, width);
cudaUnbindTexture(tex);

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    odata[y * width + x] = tex2D(tex, x, y);    // copy one texel per thread
}

Hard to program/debug → hard to find good engineers → hard to maintain code
Design Patterns in Parallel Programming
 Privatization Idiom
#pragma omp parallel private(p_sum)
{
    p_sum = 0;                     /* each thread initializes its own private copy */
    #pragma omp for
    for (i = 1; i <= N; i++)
    {
        p_sum += A[i];             /* accumulate into the private partial sum */
    }
    #pragma omp critical
    {
        sum += p_sum;              /* combine the partial sums into the global sum */
    }
}
Design Patterns in Parallel Programming
 Reduction Idiom
#define N 400

#pragma omp parallel for
for (i = 1; i <= N; i++) {
    A[i] = 1;
}

sum = 0;
/* the loop-carried dependency on sum is resolved by the reduction clause */
#pragma omp parallel for reduction(+:sum)
for (i = 1; i <= N; i++)
{
    sum += A[i];
}
printf("sum = %d\n", sum);
Design Patterns in Parallel Programming
 Induction Idiom
/* sequential form: x carries a dependency from one iteration to the next */
x = K;
for (i = 0; i < N; i++) {
    A[i] = x++;
}

/* induction variable rewritten in closed form, so iterations are independent */
x = K;
for (i = 0; i < N; i++) {
    A[i] = x + i;
}
Design Patterns in Parallel Programming
 Privatization Idiom
#pragma omp parallel private(p_sum)
{
    p_sum = 0;
    #pragma omp for
    for (i = 1; i <= N; i++)
    {
        p_sum += A[i];
    }
    #pragma omp critical
    {
        sum += p_sum;
    }
}

→ A perfect fit for the Map-Reduce framework: each thread's private partial sum corresponds to a Map task, and combining the partial sums in the critical section corresponds to the Reduce step (see the sketch below).
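As a concrete illustration (not from the original slides), here is a minimal Hadoop Streaming sketch in Python of the same pattern: each mapper keeps a private partial sum (in-mapper combining) and a single reducer plays the role of the critical section. The file name, the one-number-per-line input format, and the invocation paths are assumptions.

#!/usr/bin/env python
# sum_streaming.py -- run as "sum_streaming.py map" or "sum_streaming.py reduce"
import sys

def mapper():
    partial = 0.0                        # private partial sum, like p_sum
    for line in sys.stdin:
        line = line.strip()
        if line:
            partial += float(line)
    print("sum\t%f" % partial)           # one key/value pair per map task

def reducer():
    total = 0.0                          # plays the role of the critical section
    for line in sys.stdin:
        _, value = line.split("\t", 1)
        total += float(value)
    print("total\t%f" % total)

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

# Example invocation (all paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/numbers -output /data/sum \
#     -mapper "sum_streaming.py map" -reducer "sum_streaming.py reduce" \
#     -file sum_streaming.py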
MapReduce Design Patterns
 Book written by Donald Miner & Adam Shook
1. Summarization patterns
2. Filtering patterns
3. Data organization patterns
4. Join patterns
5. Meta-patterns
6. Input and output patterns
Design Patterns
 Linear Regression (1-dimension)
Y = bX + e
[Figure: scatter plot of points (x_i, y_i) with a fitted line of slope b; in vector form, y_i = b * x_i + e_i for observations i = 1, …, 5]
Design Patterns
 Linear Regression (2-dimension)
Y = bX + e
[Figure: the observation vector y = (y_1, …, y_5) written as the m x n design matrix X, with columns x_1 and x_2, times the coefficient vector b, plus the error vector e; i.e. y_k = b_1 x_1k + b_2 x_2k + e_k]
m : # of observations
n : # of dimensions
Design Patterns
 Linear Regression (distributing on 4 nodes)
(X^T X)_{ij} = \sum_{k=1}^{m/4} x_{ki} x_{kj} + \sum_{k=m/4+1}^{m/2} x_{ki} x_{kj} + \sum_{k=m/2+1}^{3m/4} x_{ki} x_{kj} + \sum_{k=3m/4+1}^{m} x_{ki} x_{kj}

X^T X is the product of an n x m matrix and an m x n matrix, giving an n x n result. Each of the 4 nodes computes the partial sum over its own quarter of the m observations, and the four partial n x n matrices are then added together.
Design Patterns
 Linear Regression
(X^T X)^{-1} is the inverse of an n x n matrix
 If n^2 is sufficiently small → invert it locally (e.g., with the Apache Commons Math library)
 n should be kept small anyway → avoid the curse of dimensionality
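The following sketch is mine, not the speaker's: it shows the same pattern in plain Python with NumPy. Each map-side function computes the partial X^T X and X^T y over one block of observations, and the reduce-side function adds the partials and solves the small n x n normal equations. The function names and the in-memory blocks (standing in for distributed splits) are assumptions.

import numpy as np

def map_partial_normal_equations(block):
    # Map: given one block of observations (rows of [y, x1, ..., xn]),
    # return the partial X^T X and X^T y for that block.
    block = np.asarray(block, dtype=float)
    y, X = block[:, 0], block[:, 1:]
    return X.T @ X, X.T @ y

def reduce_and_solve(partials):
    # Reduce: add the partial matrices and solve the n x n normal equations.
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)      # feasible locally because n is small

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
    data = np.column_stack([y, X])
    blocks = np.array_split(data, 4)      # stands in for 4 distributed splits
    b = reduce_and_solve([map_partial_normal_equations(blk) for blk in blocks])
    print(b)                              # close to [1.5, -2.0, 0.5]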
Design Patterns
 Join
Table A                              Table B
ID    age   name   …                 ID    time   dst   …
100   25    Bob    …                 100   7:28   CA    …
210   31    John   …                 100   8:03   IN    …
360   46    Kim    …                 210   4:26   WA    …

Inner join on ID:
A.ID   A.age   A.name   …   B.ID   B.time   B.dst   …
100    25      Bob      …   100    7:28     CA      …
100    25      Bob      …   100    8:03     IN      …
210    31      John     …   210    4:26     WA      …
Design Patterns
 Join
[Diagram: reduce-side join — the Map tasks read rows from both tables (100/25/Bob, 210/31/John, 360/46/Kim and 100/7:28/CA, 100/8:03/IN, 210/4:26/WA) and emit them keyed by ID; the shuffle sends every row with the same ID to the same Reduce task (network overhead), which emits the joined rows (see the sketch below)]
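The sketch below (mine, not from the slides) shows a reduce-side join in the Hadoop Streaming style with Python: the mapper tags each record with its source table and emits the join key, and the reducer, which receives its input sorted by key, buffers the rows of both tables for one ID and emits their cross product. The file name, the comma-separated input format, and the "A"/"B" tags are assumptions.

#!/usr/bin/env python
# join_streaming.py -- reduce-side join sketch for Hadoop Streaming
# (assumes table A rows look like "ID,age,name,..." and table B rows like "ID,time,dst,...")
import sys

def mapper(tag):
    # Emit the ID as the key and tag each record with its source table ("A" or "B").
    for line in sys.stdin:
        fields = line.strip().split(",")
        if fields and fields[0]:
            print("%s\t%s\t%s" % (fields[0], tag, ",".join(fields[1:])))

def reducer():
    # Streaming delivers the mapper output sorted by key, so all records
    # for one ID arrive together; buffer them and emit the cross product.
    current_id, a_rows, b_rows = None, [], []

    def flush():
        for a in a_rows:
            for b in b_rows:
                print("%s,%s,%s,%s" % (current_id, a, current_id, b))

    for line in sys.stdin:
        key, tag, rest = line.rstrip("\n").split("\t", 2)
        if key != current_id:
            if current_id is not None:
                flush()
            current_id, a_rows, b_rows = key, [], []
        (a_rows if tag == "A" else b_rows).append(rest)
    if current_id is not None:
        flush()

if __name__ == "__main__":
    reducer() if sys.argv[1] == "reduce" else mapper(sys.argv[1])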
Performance Overhead (1)
 Map-Reduce suffers from Disk I/O bottlenecks
[Diagram: the same Map/Combine/Partition → Shuffle → Reduce pipeline as before, with disk I/O between the stages on both sides of the shuffle]
Performance Overhead (2)
 Iterative algorithms & Map-Reduce chaining
[Diagram: Join, Group-by, and Decision-Tree workloads expressed as chains of Map → Reduce jobs, with disk I/O between consecutive jobs in each chain]
HBase Caching
 HBase provides scanner caching and block caching
– Scanner caching
  • setCaching(int cache);
  • tells the scanner how many rows to fetch at a time
– Block caching
  • setCacheBlocks(true);
 HBase caching helps read/write performance, but it is not sufficient to solve our problem
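The slide quotes the HBase Java client API (Scan.setCaching / Scan.setCacheBlocks). As a rough Python analogue, and purely as an illustration not taken from the talk, the happybase client exposes the same knob as the batch_size argument of Table.scan; the host, table name, and row-key prefix below are placeholders.

import happybase

# Connect through the HBase Thrift gateway (host name is a placeholder).
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("sensor_readings")            # hypothetical table name

# batch_size plays the role of Scan.setCaching(): how many rows the scanner
# fetches from the region server per round trip.
for row_key, columns in table.scan(row_prefix=b"device42|", batch_size=1000):
    print(row_key, len(columns))                        # real code would aggregate here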
Spark / Shark
 Spark
– In-memory computing framework
– An Apache incubator project
– RDD (Resilient Distributed Datasets)
– A fault-tolerant framework
– Targets iterative machine learning algorithms
 Shark
– Data warehouse for Spark
– Compatible with Apache Hive
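To make the in-memory, iteration-friendly point concrete, here is a minimal PySpark sketch (mine, not from the talk): the parsed RDD is cached once and then reused across gradient-descent iterations instead of being re-read from disk on every pass. The HDFS path and the one-"y x1 x2"-per-line input format are assumptions.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="iterative-regression-sketch")

def parse(line):
    y, x1, x2 = map(float, line.split())     # assumed record format: "y x1 x2"
    return np.array([x1, x2]), y

# Cache the parsed data in memory; every iteration below reuses it.
points = sc.textFile("hdfs:///data/points.txt").map(parse).cache()
count = points.count()

w = np.zeros(2)
for i in range(20):                           # simple batch gradient descent
    gradient = points.map(lambda p: (np.dot(w, p[0]) - p[1]) * p[0]) \
                     .reduce(lambda a, b: a + b)
    w -= 0.01 * gradient / count

print(w)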
Spark / Shark
 Scheduling
[Diagram: three deployment stacks]
– Hadoop (Map-Reduce) and stand-alone Spark side by side on Linux: no fine-grained scheduling within Spark
– Map-Reduce (Hadoop) and Spark running on Mesos, on Linux: no fine-grained scheduling between Hadoop and Spark
– Hadoop and Spark on Mesos / YARN, on Linux: Mesos has a Hadoop dependency; YARN is the alternative
Time-Series Data Layout Patterns
– Column layout (one value per row: Ti1, Ti2, …, Ti8)  + : no conversion  − : slow read
– BLOB layout (the series stored as one uncompressed binary object)  + : fast read/write  − : slow conversion
– Row layout (all values concatenated into one row: Ti1 Ti2 … Ti8)  + : fast read/write  − : slow search
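A small Python sketch, not from the slides, of what the three layouts look like for a single series of eight samples; the comma delimiter and the little-endian double encoding are arbitrary choices.

import struct

samples = [21.5, 21.7, 21.9, 22.4, 22.1, 21.8, 21.6, 21.3]   # Ti1 .. Ti8 (made-up values)

# Column layout: one (time index, value) record per row -- trivial to ingest, slow to read back
column_rows = [(i, v) for i, v in enumerate(samples, start=1)]

# Row layout: the whole series concatenated into a single delimited row
row_record = ",".join(str(v) for v in samples)

# BLOB layout: the series packed into one uncompressed binary object
blob = struct.pack("<%dd" % len(samples), *samples)

print(column_rows[0], row_record, len(blob))                  # 8 doubles -> 64 bytes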
Time-Series Data Layout Patterns
[Figure: an RDB table holding the series in column layout (one sample Ti1, Ti2, …, Ti9 per row; "RDB is columnar") next to the same series in row layout, Ti1 Ti2 … Ti8]
When loading/unloading from/to an RDB, it is really important to decide whether to store in column or row format.
R and Hadoop
 R is memory-based → it cannot process data that does not fit in memory
 R is not thread-safe → it cannot run in a multi-threaded environment
 Creating a distributed version of each and every R function → cannot take advantage of the ~3,500 R packages that are already built!
Running R from Hadoop
 What if the data are wide and fat?
[Figure: a 6000~7000 by 1M table of time series t1, t2, t3, t4, …, t1M]
 Pros: can re-use R packages with no modification (see the sketch below)
 Cons: cannot handle large data that does not fit into memory
– But do we need that much time-series data to predict the future?
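A minimal sketch (mine, not the speaker's code) of running R from Hadoop as a Streaming mapper: each mapper writes its split of the data to a temporary file and shells out to Rscript, so existing R packages run unmodified on each not-so-big chunk. The script name forecast.R and the one-record-per-line input format are hypothetical.

#!/usr/bin/env python
# r_mapper.py -- Hadoop Streaming mapper that delegates the statistics to R
import subprocess
import sys
import tempfile

def main():
    # Collect this mapper's input split into a temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
        for line in sys.stdin:
            tmp.write(line)
        path = tmp.name

    # forecast.R (hypothetical) reads the file path, runs the analysis with
    # ordinary R packages, and prints its result; emit that as mapper output.
    result = subprocess.run(["Rscript", "forecast.R", path],
                            capture_output=True, text=True, check=True)
    sys.stdout.write(result.stdout)

if __name__ == "__main__":
    main()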
Not so big data
 “Nobody ever got fired for using Hadoop on a cluster”
– HotCDP ’12 paper
– The Map-Reduce-like jobs it analyzed handle less than 14 GB on average
 Time-series analysis for forecasting
– Sampling every minute for two years to forecast the next year → less than 2M rows (2 × 365 × 1,440 ≈ 1.05M samples)
– It only becomes big when sampling at sub-second resolution
Statistical Analysis and Machine Learning Library
– Big data, filtering workloads → Map-Reduce
– Big data, chained / iterative workloads → Spark
– Plus SQL support → Hive / Shark / Impala / …
– Small, but many, data sets → R on Hadoop
– Small data → R on a single server
Summary
 Map-Reduce is a surprisingly efficient framework for most filter-and-reduce operations
 For data massaging (data pre-processing), in-memory capability with SQL support is a must
 Calling R from Hadoop can be quite useful when analyzing many, but not-so-big, data sets, and it is the fastest way to grow your list of statistical and machine learning functions
Thank you!