Statistical Analysis and Machine Learning using Hadoop
Seungjai Min, Samsung SDS

Knowing that…
– Hadoop/Map-Reduce has been successful in analyzing unstructured web content and social media data
– Another source of big data is semi-structured machine/device-generated logs, which require non-trivial data massaging and extensive statistical data mining

Question
Is Hadoop/Map-Reduce the right framework for implementing statistical analysis (more than counting and descriptive statistics) and machine learning algorithms (which involve iterations)?

Answer and Contents of This Talk
Yes, Hadoop/Map-Reduce is the right framework
– Why is it better than MPI and CUDA?
– Map-Reduce design patterns
– Data layout patterns
No, but there are better alternatives
– Spark/Shark (as an example)
– R/Hadoop (which is neither RHadoop nor RHive)

Contents
Programming models
– Map-Reduce vs. MPI vs. Spark vs. CUDA (GPU)
Map-Reduce design patterns
– Privatization patterns (summarization / filtering / clustering)
– Data organization patterns (join / transpose)
Data layout patterns
– Row vs. column vs. BLOB
Summary
– How to choose the right programming model for your algorithm

Parallel Programming is Difficult
Too many parallel programming models (languages): Titanium, Co-array Fortran, RapidMind, OpenMP, Cilk, Brook, CUDA, UPC, Chapel, P-threads, Erlang, PVM, OpenCL, Intel TBB, MPI, Fortress, X10

MPI Framework
The assembly language of parallel programming:

myN = N / nprocs;
for (i = 0; i <= myN; i++) {
    A[i] = initialize(i);
}
left_index = …;
right_index = …;
MPI_Send(&A[left_index], 1, MPI_INT, pid-1, …);
MPI_Recv(&A[right_index], 1, MPI_INT, pid+1, …);
for (i = 0; i <= myN; i++) {
    B[i] = (A[i] + A[i+1]) / 2.0;
}

(figure: a 400-element array block-distributed across four processes, elements 1–100, 101–200, 201–300, 301–400, with boundary elements exchanged between neighboring processes)

Map-Reduce Framework
Parallel programming for the masses!
(figure: three parallel lanes of input → Map / Combine / Partition → key/value pairs → Shuffle → Sort / Reduce → output)

Map-Reduce vs. MPI: Similarities
– Programming model
  • Processes, not threads
  • Address spaces are separate (data communications are explicit)
– Data locality
  • The "owner computes" rule dictates that computations are sent to where the data is, not the other way around

Map-Reduce vs. MPI: Differences
– Expressing parallelism: Map-Reduce covers embarrassingly parallel work (filter + reduction); MPI can express almost all parallel forms but is not good for task parallelism
– Data communication: under the hood in Map-Reduce; explicit / user-provided in MPI
– Data layout (locality): under the hood in Map-Reduce; explicit / user-controlled in MPI
– Fault tolerance: under the hood in Map-Reduce; none in MPI (as of MPI 1.0 and MPI 2.0)
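To make the Map/Shuffle/Reduce dataflow above concrete, here is a minimal word-count sketch in the Hadoop Java API (a summarization job; the class and field names are illustrative, not from the talk). The mapper emits (word, 1) pairs, the framework shuffles and sorts them by key, and the reducer adds up the counts for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);   // key/value pairs enter the shuffle
            }
        }
    }

    // Reduce: after shuffle/sort, all counts for the same word arrive together
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Everything between the mapper's context.write() and the reducer's Iterable (communication, grouping, fault handling) is "under the hood", which is exactly the contrast with MPI drawn above.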
GPU
– GPGPU (General-Purpose Graphics Processing Units)
– 10~50 times faster than a CPU if an algorithm fits this model
– Good for embarrassingly parallel algorithms (e.g. image processing)
– Costs ($2K~$3.5K) and performance (2 quad-cores vs. one GPU)
(figure: architecture comparison of a GPU, with per-block local memories and a global memory, against multi-core CPUs, with per-core caches and a shared memory)

Programming CUDA

texture<float, 2> tex;

// Allocate a CUDA array and copy the image data into it
cudaArray* cu_array;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &desc, width, height);
cudaMemcpyToArray(cu_array, 0, 0, image, width*height*sizeof(float), cudaMemcpyHostToDevice);

// Bind the array to the texture, launch the kernel, unbind
cudaBindTextureToArray(tex, cu_array);
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<< gridDim, blockDim, 0 >>>(d_odata, width, height);
cudaUnbindTexture(tex);

__global__ void kernel(float* odata, int width, int height)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    float c = tex2D(tex, x, y);
    odata[y*width + x] = c;
}

– Hard to program and debug
– Hard to find good engineers
– Hard to maintain the code

Design Patterns in Parallel Programming
Privatization idiom:

#pragma omp parallel private(p_sum)
{
    p_sum = 0;
    #pragma omp for
    for (i = 1; i <= N; i++) {
        p_sum += A[i];
    }
    #pragma omp critical
    {
        sum += p_sum;
    }
}

Design Patterns in Parallel Programming
Reduction idiom:

#define N 400

#pragma omp parallel for
for (i = 1; i <= N; i++) {
    A[i] = 1;
}

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 1; i <= N; i++) {
    sum += A[i];   // loop-carried dependency on sum
}
printf("sum = %d\n", sum);

Design Patterns in Parallel Programming
Induction idiom: the induction variable x is rewritten as a closed-form function of the loop index so that iterations become independent.

x = K;
for (i = 0; i < N; i++) {
    A[i] = x++;
}

becomes

x = K;
for (i = 0; i < N; i++) {
    A[i] = x + i;
}

Design Patterns in Parallel Programming
Privatization idiom, revisited: the partial-sum code above is a perfect fit for the Map-Reduce framework. Each thread's private partial sum corresponds to a Map task, and merging the partial sums into the global sum is the Reduce (a Hadoop sketch of this idiom follows the pattern list below).

MapReduce Design Patterns
Book by Donald Miner & Adam Shook:
1. Summarization patterns
2. Filtering patterns
3. Data organization patterns
4. Join patterns
5. Meta-patterns
6. Input and output patterns
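As a concrete bridge from the privatization idiom to these Map-Reduce patterns, here is a hedged sketch of "in-mapper combining" in the Hadoop Java API (class and key names are illustrative, and the input is assumed to hold one integer per line). Each mapper keeps a private partial sum, just like the OpenMP private(p_sum) copy, emits it once in cleanup(), and a single reducer plays the role of the critical section that merges the partial sums.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper side of the privatization idiom: accumulate locally, emit once.
public class PartialSumMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    private long partialSum = 0;   // the per-mapper "private" copy

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // local accumulation only: nothing is written to the shuffle here
        partialSum += Long.parseLong(value.toString().trim());
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // one record per mapper, all under the same key, so a single
        // reducer can add the partial sums into the global sum
        context.write(new Text("sum"), new LongWritable(partialSum));
    }
}

A Combiner configured on the job gives a similar reduction in shuffle traffic; the in-mapper version simply makes the private copy explicit.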
Design Patterns: Linear Regression (1-dimension)
Y = bX + e: for observations i = 1, …, 5, y_i = b*x_i + e_i; in vector form, (y1, …, y5)^T = b * (x1, …, x5)^T + (e1, …, e5)^T.

Design Patterns: Linear Regression (2-dimension)
Y = Xb + e, where Y is the m x 1 vector of observations, X is the m x n design matrix with columns x1 and x2, b is the n x 1 coefficient vector, and e is the m x 1 error vector (m = number of observations, n = number of dimensions). Row by row, y_i = b1*x_{i1} + b2*x_{i2} + e_i.

Design Patterns: Linear Regression (distributing on 4 nodes)
X^T X is an n x n matrix (an n x m matrix times an m x n matrix) with entries
  (X^T X)_{ij} = sum over k = 1..m of x_{ki} * x_{kj}
The sum over the m observations splits into four partial sums,
  k = 1..m/4,  k = m/4+1..m/2,  k = m/2+1..3m/4,  k = 3m/4+1..m,
so each node computes one partial sum over its quarter of the data and the partials are added in a final reduction.

Design Patterns: Linear Regression
(X^T X)^{-1} is the inverse of an n x n matrix. If n^2 is sufficiently small, the inverse can be computed on a single node (e.g. with the Apache Commons Math library). n should be kept small anyway, to avoid the curse of dimensionality.

Design Patterns: Join
Table A (ID, age, name, …):
  100  25  Bob
  210  31  John
  360  46  Kim
Table B (ID, time, dst, …):
  100  7:28  CA
  100  8:03  IN
  210  4:26  WA
Inner join on ID (A.ID, A.age, A.name, …, B.time, B.dst, …):
  100  25  Bob   7:28  CA
  100  25  Bob   8:03  IN
  210  31  John  4:26  WA

Design Patterns: Join (reduce-side join)
In a reduce-side join, the mappers read both tables and emit each record under its join key (the ID); the shuffle delivers all records with the same ID to the same reducer, which produces the joined rows. The network overhead is significant, since both tables are shuffled in full.
(figure: rows of tables A and B flowing through Map tasks and meeting, key by key, at the Reduce tasks)

Performance Overhead (1)
Map-Reduce suffers from disk I/O bottlenecks: intermediate key/value pairs are written to disk between the Map/Combine/Partition phase and the Shuffle/Reduce phase, and job output goes back to disk.
(figure: the Map-Reduce dataflow with "Disk I/O" marked at the stage boundaries)

Performance Overhead (2)
Iterative algorithms and Map-Reduce chaining (join, group-by, decision tree, …) pay this disk I/O cost at every boundary between chained jobs.
(figure: a chain of Map-Reduce jobs with "Disk I/O" between each pair of jobs)

HBase Caching
HBase provides scanner caching and block caching:
– Scanner caching: setCaching(int cache) tells the scanner how many rows to fetch at a time
– Block caching: setCacheBlocks(true)
HBase caching helps read/write performance, but it is not sufficient to solve our problem.

Spark / Shark
Spark
– In-memory computing framework
– An Apache incubator project
– RDD (Resilient Distributed Datasets)
– A fault-tolerant framework
– Targets iterative machine learning algorithms
Shark
– Data warehouse for Spark
– Compatible with Apache Hive
(a sketch of an iterative job on Spark follows the scheduling options below)

Spark / Shark Scheduling
– Hadoop and stand-alone Spark side by side on Linux: no fine-grained scheduling within Spark
– Hadoop and Spark on Mesos on Linux: no fine-grained scheduling between Hadoop and Spark
– Hadoop and Spark on Mesos or YARN: Mesos brings a Hadoop dependency; YARN is the alternative
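To show why this in-memory model suits iterative algorithms such as the linear regression above, here is a hedged sketch in Spark's Java API (written in the Java 8 lambda style of later Spark releases; the input path, step size, and iteration count are made up). The parsed points are cached once, and each gradient-descent pass over the model y = b*x is a map plus a reduce over the in-memory RDD, with no disk I/O between iterations.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeRegressionSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-regression-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Parse "x y" lines once and pin the points in memory; later passes
        // read from RAM instead of re-reading HDFS and writing intermediates.
        JavaRDD<double[]> points = sc.textFile("hdfs:///data/xy.txt")   // hypothetical input path
                .map(line -> {
                    String[] t = line.trim().split("\\s+");
                    return new double[] { Double.parseDouble(t[0]), Double.parseDouble(t[1]) };
                })
                .cache();
        final long m = points.count();   // number of observations

        // Gradient descent for the 1-dimensional model y = b*x:
        // each iteration is one map (per-point gradient) and one reduce (sum).
        double b = 0.0;
        final double step = 0.1;
        for (int iter = 0; iter < 20; iter++) {
            final double bi = b;
            double grad = points.map(p -> (bi * p[0] - p[1]) * p[0])
                                .reduce((u, v) -> u + v);
            b = bi - step * grad / m;
        }

        System.out.println("b = " + b);
        sc.stop();
    }
}

With plain Map-Reduce chaining as shown earlier, each of those twenty passes would be a separate job writing its intermediate output to disk.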
Time-Series Data Layout Patterns
– Column layout (one sample Ti per cell): + no conversion / - slow read
– Row layout (Ti1 Ti2 … Ti8 packed into one row): + fast read/write / - slow conversion
– BLOB (uncompressed binary): + fast read/write / - slow search

Time-Series Data Layout Patterns
When loading from or unloading to an RDB, it is really important to decide whether to store the series in column or row format (and whether the RDB itself is row-oriented or columnar).

R and Hadoop
– R is memory-based: it cannot process data that does not fit in memory
– R is not thread-safe: it cannot run in a multi-threaded environment
– Creating a distributed version of each and every R function means giving up the 3,500 R packages that are already built!

Running R from Hadoop
What if the data are wide and fat?
(figure: a table of roughly 6000~7000 series by 1M time points t1, t2, …, t1M)
– Pros: can re-use R packages with no modification
– Cons: cannot handle data that does not fit into memory
  • But do we really need that much time-series data to predict the future?

Not so big data
– "Nobody ever got fired for using Hadoop on a cluster" (HotCDP'12 paper): the average Map-Reduce-like job handles less than 14 GB
– Time-series analysis for forecasting: sampling every minute for two years to forecast the next year is less than 2M rows; it only becomes big when sampling at sub-second resolution

Statistical Analysis and Machine Learning Library
– Big data, filtering: Map-Reduce
– Big data, chained / iterative jobs: Spark + SQL (Hive / Shark / Impala / …)
– Small but many data sets: R on Hadoop
– Small data: R on a single server

Summary
– Map-Reduce is a surprisingly efficient framework for most filter-and-reduce operations
– For data massaging (data pre-processing), in-memory capability with SQL support is a must
– Calling R from Hadoop can be quite useful when analyzing many but not-so-big data sets, and it is the fastest way to grow your list of statistical and machine learning functions

Thank you!