CO-CLUSTERING USING CUDA

Co-Clustering Explained
- Problem: a large binary matrix of samples (rows) and features (columns)
- Which samples should be grouped together? Why? What are the shared features?
- Co-clustering provides the "why" explicitly, as a correlated sample/feature group pair
- Example: row cluster: s1 and s3 are in a group; column cluster: the distinguishing features are 2, 3, and 5

Co-Clustering - Details
- Uses Information Theoretic Co-clustering, as parallelized for the Hadoop architecture in: "Disco: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining", Papadimitriou et al., IEEE International Conference on Data Mining, 2008
- Partition the entire matrix into row groups and column groups
- Minimize the length of the encoding of the resulting partitioned matrix
- Competing code-length factors: the number of row and column groups vs. the homogeneity of the clusters
- Iterate over the rows, rearranging and sub-partitioning to find a better encoding using a heuristic
- Repeat for the columns, then the rows again, until a local optimum is found
- Complexity: O(n * fp * (row_groups + col_groups)^2 * iters)
- Credit: Chakrabarti et al., KDD 2004

Implementation - Basics
- Initial matrix generation: CPU
- Initial random row/column group assignment: CPU
- Memory structures are very simple: arrays of ints

Implementation – Stats step 1
- Statistics calculation: computes a statistic for each row under each column group
- The statistic is the number of 1's the row has within that column group
- Straightforward parallelization (each thread works on one row at a time), using global memory
- [Figure: example matrix annotated with column groups and row groups; Stat(Row 3, Column Group 3) = 1]

Room For Improvement
- Calculate the row statistics with the histogram algorithm from the textbook:
  - Block the columns
  - Assign one thread block to each block of columns
  - Compute shared-memory histograms within the block
  - Merge back to global memory when finished

Implementation – Stats step 2
- Calculates the cost for each row group under each column group
- Essentially a reduce on the per-row data from step 1
- Block the rows and assign each block of rows to a thread block
- Use shared memory and atomics to build a histogram over all rows in a given row group (see the sketch below)
- Merge the shared histogram with the global histogram for that row group
- Iterate over all row groups
- [Figure: example matrix annotated with column groups and row groups; Stat(Row Group 1, Column Group 3) = 2]
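A minimal sketch of this reduce, under assumptions about the data layout (row-major int arrays; rowStats holds the per-row, per-column-group counts from step 1; rowGroup holds the current row-group assignments; all names are hypothetical rather than taken from the actual implementation). Unlike the description above, it keeps a single shared-memory histogram covering every row-group/column-group pair and handles all row groups in one pass:

```cuda
#include <cuda_runtime.h>

// Stats step 2 (sketch): reduce per-row counts into per-(row group, column group)
// counts using a block-local shared-memory histogram, merged with atomics.
// groupStats must be zeroed before launch.
__global__ void rowGroupReduceKernel(const int *rowStats,   // nRows x nColGroups, from stats step 1
                                     const int *rowGroup,   // row -> row group assignment
                                     int *groupStats,       // nRowGroups x nColGroups, output
                                     int nRows, int nRowGroups, int nColGroups)
{
    extern __shared__ int localHist[];                      // nRowGroups * nColGroups ints
    int histSize = nRowGroups * nColGroups;

    // Zero the block-local histogram cooperatively.
    for (int i = threadIdx.x; i < histSize; i += blockDim.x)
        localHist[i] = 0;
    __syncthreads();

    // Each thread strides over rows, accumulating into the shared histogram.
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < nRows;
         row += gridDim.x * blockDim.x)
    {
        int rg = rowGroup[row];
        for (int cg = 0; cg < nColGroups; ++cg)
            atomicAdd(&localHist[rg * nColGroups + cg], rowStats[row * nColGroups + cg]);
    }
    __syncthreads();

    // Merge the block's histogram into the global result.
    for (int i = threadIdx.x; i < histSize; i += blockDim.x)
        if (localHist[i] != 0)
            atomicAdd(&groupStats[i], localHist[i]);
}
```

The launch would pass the histogram as dynamic shared memory, e.g. rowGroupReduceKernel<<<numBlocks, numThreads, nRowGroups * nColGroups * sizeof(int)>>>(...), after zeroing groupStats with cudaMemset. Toward the top of the 50-200 group range reported later, an nRowGroups x nColGroups histogram of ints no longer fits in the GTX 580's 48 KB of shared memory, which is one reason to iterate over row groups (or tile the histogram) as described above.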
Implementation – Row/Col Group Optimization
- For each row, find the optimal group it could belong to
- Parallelized straightforwardly: one row per thread, with a loop-and-stride pattern to cover all rows (see the sketch below)
- Each row's calculation goes through all row groups and determines the global cost of moving the row to that group
- Move all rows to their optimal groups
- Recompute the statistics
- Repeat for the column groups
- Continue alternating row and column groupings until convergence
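A minimal sketch of that per-row pass, under the same hypothetical layout as the earlier sketch. The cost function here is a simplified stand-in: it scores only the bits needed to encode one row against each candidate group's current per-column-group density (in the spirit of Chakrabarti et al.), rather than the full global cost the implementation evaluates:

```cuda
#include <math.h>

// Simplified stand-in for the code-length cost of placing one row into a
// candidate row group, given that group's current per-column-group density.
__device__ float rowCodeCost(const int *rowStats,      // this row's 1-counts per column group
                             const int *groupStats,    // candidate group's counts per column group
                             const int *colGroupSize,  // number of columns in each column group
                             int rowsInGroup, int nColGroups)
{
    float bits = 0.0f;
    for (int cg = 0; cg < nColGroups; ++cg) {
        int cells = rowsInGroup * colGroupSize[cg];
        float p = (cells > 0) ? (float)groupStats[cg] / cells : 0.5f;
        p = fminf(fmaxf(p, 1e-6f), 1.0f - 1e-6f);       // keep the logarithms finite
        int ones  = rowStats[cg];
        int zeros = colGroupSize[cg] - ones;
        bits += -ones * log2f(p) - zeros * log2f(1.0f - p);
    }
    return bits;
}

// Row-group optimization (sketch): one row per thread with a grid-stride loop;
// each thread tries every row group and records the cheapest assignment.
__global__ void assignRowsKernel(const int *rowStats,     // nRows x nColGroups (stats step 1)
                                 const int *groupStats,   // nRowGroups x nColGroups (stats step 2)
                                 const int *colGroupSize, // nColGroups
                                 const int *rowsPerGroup, // nRowGroups
                                 int *newRowGroup,        // nRows, this pass's assignments
                                 int nRows, int nRowGroups, int nColGroups)
{
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < nRows;
         row += gridDim.x * blockDim.x)
    {
        int   best     = 0;
        float bestCost = 1e30f;
        for (int g = 0; g < nRowGroups; ++g) {
            float c = rowCodeCost(&rowStats[row * nColGroups],
                                  &groupStats[g * nColGroups],
                                  colGroupSize, rowsPerGroup[g], nColGroups);
            if (c < bestCost) { bestCost = c; best = g; }
        }
        newRowGroup[row] = best;
    }
}
```

After this kernel, the host would apply the new assignments, recompute the statistics, and run the symmetric pass over the columns, alternating until the cost stops improving.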
Room For Improvement
- The parallelization could be more sophisticated
- Could block the rows and compute the cost of a row joining each row group in parallel, using shared-memory atomics to identify the minimum cost
- In practice, this algorithm heavily favors a small number of row and column groups, so the additional parallelism would be small

Implementation – Outer Loop
- After a local minimum is found, change the initial number of row and column groups and retry
- Change the number of row groups or the number of column groups, up or down
- Continue changing the count in that direction until the cost fails to decrease
- Try both directions in both dimensions before stopping
- The outer loop is performed on the CPU

Room for Improvement
- The outer loop could parallelize the inner-loop work across different GPUs
- Each GPU could explore the different dimensions and directions in parallel

Implementation – CPU + Validation
- The CPU implementation performs all of the steps described earlier, but sequentially
- Validation: the CPU implementation of the statistics calculations was used to validate the GPU statistics calculations
- The CPU and GPU log implementations differ, so the cost calculations were validated by allowing a tolerance of 5% between results
- Did not have time to validate the overall algorithm or to visualize its outputs to see whether the co-clusters produced were reasonable

Timing Measurements
- CPU implementation: time measured with clock_t / CLOCKS_PER_SEC
- GPU implementation: time measured with CUDA events

Development Lessons Learned
- CUDA and structured data is a bad idea; even structs of arrays are impossible to deal with
- Host-side pointer math on device pointers does not work
- The CUDA API has REALLY unfriendly error messages
- __device__ variables declared globally must be passed to kernels; take care to do very, very little through that API, or you get runtime errors
- You can malloc and free shared memory in device code as of CUDA 3.2

Development Lessons Learned (Cont.)
- Visual Studio CUDA integration leaves a lot to be desired:
  - Even with all optimizations removed, you still can't set breakpoints everywhere
  - Many variables show as freed
  - No in-IDE, real-time, in-editor compile errors
- But Visual Studio does give nice auto-complete and go-to-definition navigation
- No CUDA linker, so separate files must be directly #include'd

Experiment - Environment
- Float.cs.drexel.edu
- CPU: 4 quad-core Intel Xeon L5360 processors @ 2.13 GHz
- GPU: 2 NVIDIA GeForce GTX 580 GPUs @ 1544 MHz

Experiment - Description
- The sequential (CPU) and parallel (GPU) versions were tested on square matrices of order 100, 1000, and 10000; larger matrices caused memory problems
- The GPU was tested with varying block and thread counts:
  - Number of blocks: 10, 100, 5000
  - Number of threads per block: 10, 100, 1024 (the maximum)
- The resulting co-clusters usually stayed in the 50-200 row/column group range, regardless of matrix order
- Row and column groupings are important in the calculation of the matrix statistics, since rows and columns are blocked by these groupings

Experiment Results
[Chart "Speedup - 10 Blocks": speedup vs. matrix order (100, 1000, 10000), one series each for 10, 100, and 1024 threads per block]
- For a small number of blocks, 100-thread performance peaks at num_blocks * num_threads = matrix_order
- I would expect this to be the optimal configuration, when num_blocks ~= num_row_groups ~= num_col_groups
- The slowdown occurs when the matrix order exceeds the total number of threads and more work must be done serially

Experiment Results
[Chart "Speedup - 100 Blocks": speedup vs. matrix order (100, 1000, 10000), one series each for 10, 100, and 1024 threads per block]
[Chart "Speedup - 5000 Blocks": speedup vs. matrix order (100, 1000, 10000), one series each for 10, 100, and 1024 threads per block]
- Interestingly, the maximum speedup was the same across all block counts
- Roughly speaking, as long as num_blocks * num_threads >= matrix order, the maximum speedup of ~70 is achieved
- The 10-thread configurations never got there; due to block scheduling overhead? Possibly the cost of copying to shared memory for block processing was not recouped in the 10-thread case
- Maxing out the thread count is counter-productive on smaller matrices
- Hypothesis: when the block count is excessive (as it is for small matrices), scheduling large blocks of threads that return immediately is costly

Experiment Results
[Chart "Efficiency - 10 Blocks": efficiency vs. matrix order (100, 1000, 10000), one series each for 10, 100, and 1024 threads per block]
[Chart "Efficiency - 100 Blocks": efficiency vs. matrix order (100, 1000, 10000), one series each for 10, 100, and 1024 threads per block]
[Chart "Efficiency - 5000 Blocks": efficiency vs. matrix order (100, 1000, 10000), one series each for 10, 100, and 1024 threads per block]
- Efficiency is consistently highest for the smaller numbers of blocks and the smaller numbers of threads within those blocks
- Hypothesis: the overhead of starting blocks and threads is high enough to produce diminishing returns as blocks and threads are added
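The results above suggest sizing the launch to the matrix rather than maxing out either dimension. Purely as an illustration of that observation (this helper is hypothetical, not part of the measured implementation), a host-side routine might choose the configuration as follows:

```cuda
// Hypothetical launch-configuration helper reflecting the results above:
// aim for num_blocks * num_threads to just cover the matrix order, rather than
// launching far more blocks or threads than there are rows to process.
static void chooseLaunchConfig(int matrixOrder, int *numBlocks, int *numThreads)
{
    int threads = 100;                                    // the mid-sized thread count from the experiments
    if (threads > matrixOrder) threads = matrixOrder;     // don't exceed the work available
    int blocks = (matrixOrder + threads - 1) / threads;   // just enough blocks to cover every row
    *numBlocks  = blocks;
    *numThreads = threads;
}
```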