James A. Edwards, Uzi Vishkin
University of Maryland

Introduction
- Lossless data compression
    - Common tool: better use of memory (e.g., disk space) and network bandwidth.
    - Burrows-Wheeler (BW) compression, e.g., bzip2: relatively higher compression ratio (pro) but slower (con).
    - Snappy (Google): lower compression ratios but fast. Example: for MPI on large machines, speed is critical.
- Our motivation: fast and high compression ratio
- Unexpected: prior work unknown to us made empirical follow-up … stronger
- Assumption throughout: fixed constant-size alphabet

State of the field
- Irregular algorithms: prevalent in CS curriculum and daily work (open-ended problems/programs). Yet, very limited support on today's parallel hardware. Even more limited with strong scaling.
- Low support for irregular parallel code in HW -> SW developers limit themselves to regular algorithms -> HW benchmarks optimize HW for regular code -> …
- Namely, parallel data compression is of general interest as an undeniable application representing a big underrepresented "application crowd"

"Truly Parallel" BW compression
- Existing parallel approach: break input into blocks, compress blocks in parallel
    - Practical drawback: good compression & speed only with large input
    - Theory drawback: not really parallel
- Truly parallel: compress entire input using a parallel algorithm
    - Works for both large and small inputs
    - Can be combined with block-based approach
- Applications of small inputs:
    - Faster (decompression) & greater compression: better use of main memory [ISCA05] & cache [ISCA12]
    - Warehouse-scale computers. Bandwidth between various pairs of nodes can be extremely different; for MPI and MapReduce, low bandwidth between pairs is debilitating [HP 5th ed.] (i.e., Snappy was a solution)

Attempts at truly parallel BW compression
- A 2011 survey paper [Eirola] stipulates that parallelizing BW could hardly work on GPGPU, and decompression would fall behind further.
    - Portions require "very random memory accessing"
    - "…it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible."
- The best GPGPU result: even more painful
    - In 2012, Patel et al. concurrently attempted to develop parallel code for BW compression on GPUs, but their best result was a 2.8x slowdown.
    - Patel reported separately a 1.2x speedup for decompression (hence, not referenced in the SPAA13 version).

Stages of BW compression & decompression
  Compression:   S -> Block-Sorting Transform (BST) -> S_BST -> Move-to-Front (MTF) encoding -> S_MTF -> Huffman encoding -> S_BW
  Decompression: S_BW -> Huffman decoding -> S_MTF -> Move-to-Front (MTF) decoding -> S_BST -> Inverse Block-Sorting Transform (IBST) -> S

Inverse Block-Sorting Transform
Serial algorithm:
  1. Sort characters of S_BST; the sorted order T[i] forms a ring i -> T[i]
  2. Starting with $, traverse the ring to recover S

  i          0 1 2 3 4 5 6
  S_BST[i]   a n n b $ a a
  T[i]       1 5 6 4 0 2 3

  Linked ring: 0 -> 1 -> 5 -> 2 -> 6 -> 3 -> 4 -> 0

Parallel algorithm:
  1. Use parallel integer sorting to find T[i]
  2. Use parallel list ranking to traverse the ring
  Both steps require O(log n) time and O(n) work

  i          4 0 1 5 2 6 3 (END)
  rank[i]    6 5 4 3 2 1 0
  S_BST[i]   $ a n a n a b      S (read right to left)

  On current parallel HW list ranking gets you - why we chose this step

Conclusion and where to go from here?
- Despite being originally described as a serial algorithm, BW compression can be accomplished by a parallel algorithm.
    - Material for a few good exercises on prefix sum & list ranking?
    - For a more detailed description of our algorithm, see reference [4] in our brief announcement.
- This algorithm demonstrates the importance of parallel primitives such as prefix sums and list ranking.
    - Requires support of fine-grained, irregular parallelism and sometimes also strong scaling. Issues on all current parallel hardware.
- Indeed: While recent work from UC Davis (2012) on parallel BW compression on GPUs that we missed taxed ~20% of our originality (same Step 2), it failed to achieve any speedup on compression.
  Instead, a slowdown of 2.8x. For decompression: 1.2x speedup.
- On the UMD experimental Explicit Multi-Threading (XMT) architecture, we achieved speedups of 25x for compression and 13x for decompression [5].
- On balance, the UC Davis paper was a huge gift: 70x vs. GPU for compression (25 x 2.8) and 11x for decompression (13 / 1.2).

Where to go from here?
Remaining options for the community:
- Figure out how to do it on current HW
- Or, bash PRAM
- Or, the alternative we pursued: develop a parallel algorithm that will work well on buildable HW designed to support the best-established parallel algorithmic theory

Final thought connecting to several other SPAA presentations
- This is an example where MPI on large systems works in tandem with PRAM-like support on small systems.
    - Intra-node (of a large system): use PRAM compression & decompression algorithms for inter-node MPI messages
- Counter-argument to an often unstated position: that we need the same parallel programming model at very large and small scales

References
[4] J. A. Edwards and U. Vishkin. Parallel algorithms for Burrows-Wheeler compression and decompression. TR, UMD, 2012. http://hdl.handle.net/1903/13299.
[5] J. A. Edwards and U. Vishkin. Empirical speedup study of truly parallel data compression. TR, UMD, 2013. http://hdl.handle.net/1903/13890.

Block-Sorting Transform (BST)
Goal: bring occurrences of characters together
Serial algorithm:
  1. Form a list of all rotations of the input string
  2. Sort the list lexicographically
  3. Take the last column of the list as output
Equivalent to sorting the suffixes of the input string

  Input to BST: banana$

  List of rotations      Sorted
  banana$                $banana
  anana$b                a$banan
  nana$ba                ana$ban
  ana$ban                anana$b
  na$bana                banana$
  a$banan                na$bana
  $banana                nana$ba

  Output of BST (last column of the sorted list): annb$aa

Block-Sorting Transform (BST)
Parallel algorithm:
  1. Find the suffix tree of S (O(log^2 n) time, O(n) work)
  2. Find the suffix array SA of S by traversing the suffix tree (Euler tour technique: O(log n) time, O(n) work)
  3. Permute characters according to SA (O(1) time, O(n) work)

  [Figure: suffix tree of banana$; leaf labels as extracted: 6 0 5 4 2 1 3]

  i             0 1 2 3 4 5 6
  S[i]          b a n a n a $
  SA[i]         6 5 3 1 0 4 2
  S[SA[i]-1]    a n n b $ a a      (index -1 wraps around to the last character, $)

Move-to-Front (MTF) encoding
Goal: assign low codes to repeated characters
Serial algorithm: maintain list of characters in order last seen
Parallel algorithm: use prefix sums to compute the MTF list for each character (O(log n) time, O(n) work)
  Associative binary operator: X + Y = Y concat (X - Y)

  i           0 1 2 3
  S_BST[i]    a n n b
  L_i[0]      $ a n n
  L_i[1]      a $ a a
  L_i[2]      b b $ $
  L_i[3]      n n b b
  S_MTF[i]    1 3 0 3

  [Figure: prefix-sums tree over S_BST = a n n b with the assumed prefix $,a,b,n; for example a + n = n,a and n + b = b,n, which combine to b,n,a]

Move-to-Front (MTF) decoding
Same algorithm as encoding, with the following changes:
- Serial: the MTF lists are used in reverse
- Parallel: instead of combining MTF lists, combine permutation functions

  [Figure: each S_MTF code (1, 3, 0, …) is mapped to a permutation function of the list positions 0 1 2 3, and permutations are combined pairwise; for example (1 0 2 3) + (3 0 1 2) = (3 1 0 2)]

Huffman Encoding
Goal: assign shorter bit strings to more-frequent MTF codes
The parallelization of this step is already well known
Serial algorithm:
  1. Count frequencies of characters
  2. Build Huffman table based on frequencies
  3. Encode characters using the table
Parallel algorithm:
  1. Use integer sorting to count frequencies (O(log n) time, O(n) work)
  2. Build Huffman table using the (standard, heap-based) serial algorithm (O(1) time and work, since the alphabet has constant size)
  3. (a) Compute the prefix sums of the code lengths to determine where in the output to write the code for each character (O(log n) time, O(n) work)
     (b) Actually write the output (O(1) time, O(n) work)

Huffman Decoding
Serial algorithm: read through compressed data, decoding one character at a time
Parallel algorithm: partition input and apply serial algorithm to each partition
- Problem: decoding cannot start in the middle of the codeword for a character
- Solution: identify a set of valid starting bits using prefix sums (O(log n) time, O(n) work)

  [Figure: the bit string 1 1 0 1 0 0 0 0 1 0 shown at four successive stages of identifying valid starting bits]

Huffman Decoding
How to identify valid starting positions:
Divide the input string into partitions of length l (the length of the longest Huffman codeword)
  1. Assign a processor to each bit in the input. Processor i decodes the compressed input starting at index i and stops when it crosses a partition boundary, recording the index where it stopped. (O(1) time, O(n) work)
     Now each partition has l pointers entering it, all of which originate from the immediately preceding partition.
  2. Use prefix sums to merge consecutive pointers. (O(log n) time, O(n) work)
     Now each partition still has l pointers entering it, but they all originate from the first partition.
  3. For each bit in the input, mark it as a valid starting position if and only if the pointer that points to that bit originates from the first bit (index 0) of the first partition (O(1) time, O(n) work)

Lossless data compression on GPGPU architectures (2011)
- Inverse BST: "Problems would possibly arise from poor GPU performance of the very random memory accessing caused by the scattering of characters throughout the string."
- MTF decoding: "Speeding up decoding on GPGPU platforms might be more challenging since the character lookup is already constant time on serial implementations, and starting decoding from multiple places is difficult since the state of the stack is not known at the other places."
- Huffman decoding: "Here again, decompression is harder. This is due to the fact that the decoder doesn't know where one codeword ends and another begins before it has decoded the whole prior input."
- "As for the codeword tables for the VLE, it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible."
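
Illustration for the BST and Inverse BST slides: a minimal Python sketch, not the authors' implementation. It reproduces the banana$ example, with the parallel rounds emulated serially. The function names (bst, ring, list_rank, inverse_bst) are hypothetical. Two substitutions are assumed: a naive suffix sort stands in for the suffix-tree construction, and simple pointer jumping stands in for a work-optimal list-ranking algorithm (pointer jumping takes O(log n) rounds but O(n log n) work).

```python
def bst(s):
    """Block-sorting transform of s, which must end in a unique, smallest sentinel."""
    # Naive suffix sort: a stand-in for the suffix-tree based construction.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # Output character is S[SA[i] - 1]; index -1 wraps around to the sentinel.
    return "".join(s[i - 1] for i in sa)


def ring(sbst):
    """T[i] = position of index i after a stable sort of the characters of sbst."""
    order = sorted(range(len(sbst)), key=lambda i: sbst[i])  # sorted() is stable
    t = [0] * len(sbst)
    for rank, i in enumerate(order):
        t[i] = rank
    return t


def list_rank(nxt):
    """Distance from every node to the end of the list, by pointer jumping."""
    n = len(nxt)
    nxt = list(nxt)
    dist = [0 if p is None else 1 for p in nxt]
    while any(p is not None for p in nxt):
        # One synchronous round: every node reads only the previous round's arrays.
        dist = [dist[i] if nxt[i] is None else dist[i] + dist[nxt[i]] for i in range(n)]
        nxt = [None if nxt[i] is None else nxt[nxt[i]] for i in range(n)]
    return dist


def inverse_bst(sbst, sentinel="$"):
    """Recover S from S_BST by ranking the ring i -> T[i], starting at the sentinel."""
    t = ring(sbst)
    start = sbst.index(sentinel)
    # Cut the ring just before it returns to the sentinel's position.
    nxt = [None if t[i] == start else t[i] for i in range(len(sbst))]
    rank = list_rank(nxt)
    out = [""] * len(sbst)
    for i, r in enumerate(rank):
        out[r] = sbst[i]  # the traversal yields S right to left, so rank is the index in S
    return "".join(out)


print(bst("banana$"))
print(ring("annb$aa"))
print(inverse_bst("annb$aa"))
```

Each list comprehension inside list_rank corresponds to one parallel step in which every node updates independently, which is why the number of rounds is what matters for parallel time.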
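
Illustration for the MTF encoding slide: a minimal Python sketch of the operator X + Y = Y concat (X - Y) driven by a prefix-sums (scan) routine. The names mtf_combine, inclusive_scan and mtf_encode are hypothetical. The scan shown is a serially emulated Hillis-Steele scan, chosen for brevity; it does O(n log n) work, so it is not the work-optimal prefix sums the slide assumes.

```python
def mtf_combine(x, y):
    """X + Y = Y concat (X - Y): the characters of Y move to the front of X."""
    in_y = set(y)
    return y + [c for c in x if c not in in_y]


def inclusive_scan(items, op):
    """Hillis-Steele scan; op must be associative but need not be commutative."""
    vals = list(items)
    d = 1
    while d < len(vals):
        # One synchronous round: every position reads the previous round's values.
        vals = [vals[i] if i < d else op(vals[i - d], vals[i]) for i in range(len(vals))]
        d *= 2
    return vals


def mtf_encode(sbst, alphabet):
    """Return the MTF codes of sbst, starting from the list given by alphabet."""
    # Leaf 0 is the assumed initial list; leaf i + 1 is the one-character list [sbst[i]].
    leaves = [list(alphabet)] + [[c] for c in sbst[:-1]]
    lists = inclusive_scan(leaves, mtf_combine)  # lists[i] is L_i, the list seen by sbst[i]
    return [lists[i].index(c) for i, c in enumerate(sbst)]


print(mtf_encode("annb", "$abn"))
```

Once every L_i is available, each output code is an independent lookup, so the only sequential dependency in the serial algorithm is absorbed by the scan.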
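
Illustration for the Huffman decoding slides: a minimal Python sketch of the three steps for identifying valid starting positions, with every parallel step emulated serially. Everything named here is an assumption made for illustration: the functions decode_from, inclusive_scan, valid_starts and parallel_decode, the representation of the compressed input as a string of '0'/'1' characters, and the toy prefix code in the usage lines. As in the MTF sketch, the scan is a Hillis-Steele scan rather than a work-optimal one.

```python
def decode_from(bits, pos, stop, table, max_len):
    """Decode codewords from pos until pos >= stop; return (symbols, final position)."""
    out = []
    while pos < stop:
        for length in range(1, max_len + 1):
            sym = table.get(bits[pos:pos + length])
            if sym is not None:
                out.append(sym)
                pos += length
                break
        else:
            return out, len(bits)  # no codeword fits here: treat as running off the end
    return out, pos


def inclusive_scan(items, op):
    """Hillis-Steele scan; op must be associative but need not be commutative."""
    vals = list(items)
    d = 1
    while d < len(vals):
        vals = [vals[i] if i < d else op(vals[i - d], vals[i]) for i in range(len(vals))]
        d *= 2
    return vals


def valid_starts(bits, table):
    """Bit indices where decoding may start: index 0 plus one entry point per later partition."""
    n = len(bits)
    if n == 0:
        return []
    l = max(len(code) for code in table)  # partition length = longest codeword
    # Step 1: for every bit i, record where decoding from i first leaves i's partition.
    maps = []
    for p in range((n + l - 1) // l):
        lo, hi = p * l, min((p + 1) * l, n)
        maps.append((p, [decode_from(bits, i, hi, table, l)[1] for i in range(lo, hi)]))

    # Step 2: merge consecutive pointers; a map is (first partition covered, targets).
    def compose(a, b):
        base_b, targets_b = b
        return (a[0], [n if t >= n else targets_b[t - base_b * l] for t in a[1]])

    merged = inclusive_scan(maps, compose)
    # Step 3: keep only the pointers that originate from bit 0 of the first partition.
    return [0] + [m[1][0] for m in merged if m[1][0] < n]


def parallel_decode(bits, table):
    """Decode each segment between consecutive valid starts independently."""
    l = max(len(code) for code in table)
    bounds = valid_starts(bits, table) + [len(bits)]
    pieces = [decode_from(bits, s, e, table, l)[0] for s, e in zip(bounds, bounds[1:])]
    return "".join(sym for piece in pieces for sym in piece)


# Toy prefix code (hypothetical), mapping codeword -> symbol.
TABLE = {"0": "a", "10": "b", "110": "c", "111": "d"}
print(valid_starts("010111", TABLE))
print(parallel_decode("010111", TABLE))
```

Composition of the per-partition pointer maps is associative, which is what allows step 2 to be expressed as prefix sums; the segments decoded in parallel_decode have no dependencies on one another.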