ASYMMETRIC NUMERAL SYSTEMS AS ACCURATE REPLACEMENT FOR HUFFMAN CODING

Huffman coding (or any prefix code: Golomb, Elias, unary) – fast, but operates on an integer number of bits: it approximates probabilities with powers of 1/2, getting a suboptimal compression ratio.
Arithmetic coding (or range coding) – accurate probabilities, but many times slower (needs more computation, battery drain).
Asymmetric Numeral Systems (ANS) – accurate and faster than Huffman: we construct a low-state automaton optimized for a given probability distribution (also allows for simultaneous encryption).

Jarosław Duda, Warszawa, 31-01-2015

Huffman vs ANS in compressors (LZ + entropy coder), from Matt Mahoney's benchmarks, http://mattmahoney.net/dc/text.html, on enwik8 ($10^8$ bytes):

    compressor                compressed size   encode [ns/byte]   decode [ns/byte]
    lzturbo 1.2 -39              26,915,461          582                 2.8
    LZA 0.70b -mx9 -b7 -h7       27,111,239          378                10
    WinRAR 5.00 -ma5 -m5         27,835,431         1004                31
    WinRAR 5.00 -ma5 -m2         29,758,785          228                30
    lzturbo 1.2 -32              30,979,376           19                 2.7
    zhuff 0.97 -c2               34,907,478           24                 3.5
    gzip 1.3.5 -9                36,445,248          101                17
    pkzip 2.0.4 -ex              36,556,552          171                50
    ZSTD 0.0.1                   40,024,854            7.7               3.8
    WinRAR 5.00 -ma5 -m1         40,565,268           54                31

zhuff, ZSTD (Yann Collet): LZ4 + tANS (switched from Huffman)
lzturbo (Hamid Bouzidi): LZ + tANS (switched from Huffman)
LZA (Nania Francesco): LZ + rANS (switched from range coding)

E.g. lzturbo vs gzip: better compression, 5x faster encoding, 6x faster decoding – saving time and energy in a very frequent task.

We need $m$ bits of information to choose one of $2^m$ possibilities. For length-$n$ 0/1 sequences with $pn$ ones, how many bits do we need to choose one of them?
An entropy coder encodes a sequence with probability distribution $(p_s)_{s=0..m-1}$ using asymptotically at least $H = \sum_s p_s \lg(1/p_s)$ bits/symbol ($H \le \lg(m)$).
Seen as a weighted average: a symbol of probability $p$ contains $\lg(1/p)$ bits.
Encoding a $(p_s)$ distribution with an entropy coder optimal for a $(q_s)$ distribution costs
$$\Delta H = \sum_s p_s \lg(1/q_s) - \sum_s p_s \lg(1/p_s) = \sum_s p_s \lg\!\left(\frac{p_s}{q_s}\right) \approx \frac{1}{\ln 4} \sum_s \frac{(p_s - q_s)^2}{p_s}$$
more bits/symbol – the so-called Kullback-Leibler "distance" (not symmetric). A numeric sketch of these formulas follows below.

Huffman coding (HC), prefix codes: most of everyday compression, e.g. zip, gzip, cab, jpeg, gif, png, tiff, pdf, mp3... Example of the Huffman penalty: a truncated $p(1-p)^t$ distribution (Huffman can't use less than 1 bit/symbol). Zlibh (the fastest generally available implementation): encoding ~320 MB/s, decoding ~300-350 MB/s (per core, 3 GHz).
Range coding (RC): large-alphabet arithmetic coding, needs multiplication; e.g. 7-Zip, Google's VP video codecs (YouTube, Hangouts). Encoding ~100-150 MB/s, decoding ~80 MB/s (binary).
Arithmetic coding (AC): H.264, H.265 video, ultracompressors e.g. PAQ. Encoding/decoding ~20-30 MB/s.
Asymmetric Numeral Systems (ANS), tabled variant (tANS) – without multiplication. FSE implementation of tANS: encoding ~350 MB/s, decoding ~500 MB/s.

Compression ratio for the truncated $p(1-p)^t$ distribution (Huffman: 1 byte -> at least 1 bit, so ratio <= 8 here):

    ratio    p=0.5   p=0.6   p=0.7   p=0.8   p=0.9   p=0.95   p=0.99
    8/H      4.001   4.935   6.344   8.851   15.31   26.41    96.95
    zlibh    3.99    4.78    5.58    6.38     7.18    7.58     7.90
    FSE      4.00    4.93    6.33    8.84    15.29   26.38    96.43

RC -> ANS: ~7x decoding speedup, no multiplication (switched e.g. in the LZA compressor).
HC -> ANS: better compression and ~1.5x decoding speedup (e.g. zhuff, lzturbo).
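The $\Delta H$ formulas above are easy to check numerically. Below is a minimal C sketch (mine, not from the slides; the toy distributions p and q are made up) computing the entropy $H$, the exact Kullback-Leibler penalty, and its quadratic approximation:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double p[3] = {0.70, 0.20, 0.10};   /* true distribution        */
        const double q[3] = {0.65, 0.25, 0.10};   /* coder's (e.g. quantized) */
        double H = 0, dH = 0, sq = 0;
        for (int s = 0; s < 3; s++) {
            H  += p[s] * log2(1.0 / p[s]);        /* H = sum p_s lg(1/p_s)    */
            dH += p[s] * log2(p[s] / q[s]);       /* exact Kullback-Leibler   */
            sq += (p[s] - q[s]) * (p[s] - q[s]) / p[s];
        }
        printf("H  = %.6f bits/symbol\n", H);
        printf("dH = %.6f exact, %.6f approx bits/symbol\n", dH, sq / log(4.0));
        return 0;
    }

For these numbers the exact penalty is ~0.0105 bits/symbol and the approximation gives ~0.0116: inaccuracy in $q$ is cheap as long as $q$ stays close to $p$.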
Operating on a fractional number of bits:
- arithmetic coding: restricting a range of size $N$ to a subrange of length $L$ contains $\lg(N/L)$ bits; adding a symbol of probability $p$ (containing $\lg(1/p)$ bits): $\lg(N/L) + \lg(1/p) \approx \lg(N/L')$ for $L' \approx p \cdot L$;
- ANS: a number $x$ contains $\lg(x)$ bits; adding a symbol of probability $p$ (containing $\lg(1/p)$ bits): $\lg(x) + \lg(1/p) = \lg(x')$ for $x' \approx x/p$.

uABS – uniform binary variant (alphabet $\{0,1\}$) – extremely accurate.
Assume a binary alphabet, $p := \Pr(1)$, and denote $x_s = |\{y < x : s(y) = s\}| \approx x p_s$.
For a uniform symbol distribution we can choose:
$x_1 = \lceil xp \rceil$,  $x_0 = x - x_1 = x - \lceil xp \rceil$,
$s(x) = 1$ if there is a jump on the next position: $s = s(x) = \lceil (x+1)p \rceil - \lceil xp \rceil$.
Decoding function: $D(x) = (s, x_s)$. Its inverse, the coding function:
$C(0,x) = \lceil (x+1)/(1-p) \rceil - 1$,  $C(1,x) = \lfloor x/p \rfloor$.
Example for $p = \Pr(1) = 1 - \Pr(0) = 0.3$ (coding-table figure in the original slides; an exact-arithmetic check of these formulas follows after the code below).

Decoding steps (32-bit x, 16-bit renormalization, mask = 0xffff, p0 = 16-bit scaled probability):

uABS:
    xp = (uint64_t)x * p0;
    out[i] = ((xp & mask) >= p0);
    xp >>= 16;
    x = out[i] ? xp : x - xp;
    if (x < 0x10000) x = (x << 16) | *ptr++;

rABS (p0 = scaled Pr(0); an encoding-direction sketch follows below):
    xf = x & mask;
    xn = p0 * (x >> 16);
    if (xf < p0) { x = xn + xf;  out[i] = 0; }
    else         { x -= xn + p0; out[i] = 1; }
    if (x < 0x10000) x = (x << 16) | *ptr++;

arithmetic (range) coding:
    split = low + ((uint64_t)(hi - low) * p0 >> 16);
    out[i] = (x > split);
    if (out[i]) low = split + 1; else hi = split;
    if ((low ^ hi) < 0x10000) { x = (x << 16) | *ptr++; low <<= 16; hi = (hi << 16) | mask; }
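To see that the uABS coding and decoding functions really are mutually inverse, here is a small sketch (mine, not from the slides) that transcribes $C$ and $D$ into exact integer arithmetic for the rational $p = \Pr(1) = 3/10$ and brute-force checks the round trip; real coders use the fixed-point forms above:

    #include <stdio.h>

    #define NUM 3                /* p = Pr(1) = NUM/DEN = 0.3 */
    #define DEN 10

    static long ceil_div(long a, long b) { return (a + b - 1) / b; }

    /* coding function: C(s,x) = x' such that D(x') = (s,x) */
    static long C(int s, long x) {
        return s ? (x * DEN) / NUM                          /* floor(x/p)          */
                 : ceil_div((x + 1) * DEN, DEN - NUM) - 1;  /* ceil((x+1)/(1-p))-1 */
    }

    /* decoding function: D(x) = (s, x_s) */
    static int D(long x, long *xs) {
        long x1 = ceil_div(x * NUM, DEN);                   /* x_1 = ceil(x*p)     */
        int  s  = (int)(ceil_div((x + 1) * NUM, DEN) - x1); /* jump => s = 1       */
        *xs = s ? x1 : x - x1;                              /* x_0 = x - x_1       */
        return s;
    }

    int main(void) {
        for (long x = 1; x <= 100000; x++)
            for (int s = 0; s < 2; s++) {
                long xs;
                if (D(C(s, x), &xs) != s || xs != x)
                    { printf("failed at s=%d x=%ld\n", s, x); return 1; }
            }
        puts("uABS round-trip OK for x = 1..100000");
        return 0;
    }

Note how $C(1,x)$ grows roughly to $x/p$ and $C(0,x)$ to $x/(1-p)$: each symbol adds its $\lg(1/\Pr(s))$ bits to $\lg(x)$, which is exactly the fractional-bits picture above.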
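The snippets above only decode. For completeness, here is a hypothetical rABS encoding step (my sketch; the function name and the backward-writing interface are assumptions) that inverts the decoder: symbols are encoded in reverse order and the 16-bit renormalization words are written backwards, so that decoding consumes them forwards:

    #include <stdint.h>

    /* encode symbol s (0 or 1) into state *x; p0 = scaled Pr(0) in (0, 0x10000) */
    void rabs_encode(uint32_t *x, uint16_t **wptr, int s, uint32_t p0) {
        uint32_t f    = s ? 0x10000 - p0 : p0;      /* scaled frequency of s     */
        uint32_t base = s ? p0 : 0;                 /* start of s's subrange     */
        while (*x >= (f << 16))                     /* keep the new state 32-bit */
            { *--(*wptr) = (uint16_t)*x; *x >>= 16; }
        *x = ((*x / f) << 16) + (*x % f) + base;    /* inverse of the decode step */
    }

One can check directly against the rABS decoder above: the low 16 bits of the new state land in [base, base+f), selecting the right branch, and the quotient/remainder reassemble the old x.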
RENORMALIZATION – to prevent $x \to \infty$:
enforce $x \in I = \{L, \ldots, 2L-1\}$ by transmitting the lowest bits to the bitstream. The "buffer" $x$ contains $\lg(x)$ bits of information – produce bits when they accumulate. A symbol of $p_s = p$ contains $\lg(1/p)$ bits: $\lg(x) \to \lg(x) + \lg(1/p)$ modulo 1.

tABS: test all possible symbol distributions for a binary alphabet; store tables for quantized probabilities $p$ – e.g. $16 \cdot 16 = 256$ bytes for 16 states gives $\Delta H \approx 0.001$.
Compared with the H.264 "M decoder" (arithmetic coding), the tABS decoding step is just:

    t = decodingTable[p][x];
    x = t.newX + readBits(t.nbBits);
    useSymbol(t.symbol);

No branches, no bit-by-bit renormalization; the state is a single number (e.g. SIMD-friendly). A fleshed-out sketch of this step closes the deck.

Additional tANS advantage – simultaneous encryption: we can use the huge freedom during initialization – when choosing the symbol distribution, slightly disturb $s(x)$ using a PRNG initialized with a cryptographic key.
Advantages compared to standard (symmetric) cryptography:

                               standard (e.g. DES, AES)        ANS-based cryptography (initialized)
    based on                   XOR, permutations               highly nonlinear operations
    bit blocks                 fixed length                    pseudorandomly varying lengths
    brute-force / QC attacks   just start decoding             perform initialization first for each new
                               to test a cryptokey             cryptokey, fixed to need e.g. 0.1 s
    speed                      online calculation              most calculations done at initialization
    entropy                    operates on bits                operates on any input distribution

The entropy coder is usually the bottleneck of compression. ANS offers fast, accurate processing of a large alphabet without multiplication:
- an accurate and faster replacement for Huffman coding,
- a many (~7) times faster replacement for range coding,
- faster decoding in the adaptive binary case,
- can include a checksum for free (check that the final decoding state is right),
- can simultaneously encrypt the message.

Perfect for: DCT/wavelet coefficients, lossless image compression.
Perfect with: Lempel-Ziv, Burrows-Wheeler Transform.

Further research:
- applications – ANS improves both speed and compression ratio,
- finding a symbol distribution $s(x)$ with minimal $\Delta H$ (and quickly),
- optimal compression of the used probability distribution, to reduce headers,
- maybe finding another low-state entropy-coding family, e.g. forward-decoded,
- applying the cryptographic capabilities (also without entropy coding): cheap compression + encryption for HDD, wearables, sensors, IoT, RFIDs,
- ... ?
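As promised above, a fleshed-out sketch of the tABS decoding step. This is mine, not from the slides: only decodingTable, readBits, useSymbol, newX, nbBits, symbol come from the pseudocode; the struct layout and signatures are assumptions, the table and bit reader are assumed to be built elsewhere, and the slides pack each entry into a single byte where this sketch uses an unpacked struct for clarity:

    #include <stdint.h>

    typedef struct {
        uint8_t symbol;   /* decoded binary symbol                      */
        uint8_t nbBits;   /* renormalization bits to read               */
        uint8_t newX;     /* next state, before appending the read bits */
    } tabsEntry;

    /* one 16-state table per quantized probability level */
    extern tabsEntry decodingTable[16][16];
    extern uint32_t  readBits(int n);    /* next n bits of the bitstream */
    extern void      useSymbol(int s);   /* hand the symbol to the model */

    /* one decoding step: a single table lookup, no branches,
       no bit-by-bit renormalization */
    static inline void tabs_decode_step(int p, uint32_t *x) {
        tabsEntry t = decodingTable[p][*x];
        *x = t.newX + readBits(t.nbBits);
        useSymbol(t.symbol);
    }

Because the state is one small number and the step is branchless, several such decoders can run interleaved or in SIMD lanes – the source of the speed advantage over bit-by-bit arithmetic decoding.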