Asymmetric Numeral Systems (ANS)

ASYMMETRIC NUMERAL SYSTEMS:
ADDING FRACTIONAL BITS TO HUFFMAN CODER
Huffman coding
– fast, but operates on an integer number of bits:
it approximates probabilities with powers of ½,
giving an inferior compression ratio
Arithmetic coding
– accurate, but many times slower
(computationally more demanding)
Asymmetric Numeral Systems (ANS) – accurate and faster than Huffman:
we construct a low-state automaton optimized for a given probability distribution
Jarek Duda
Kraków, 28-04-2014
We need n bits of information to choose one of 2^n possibilities.
For a length-n 0/1 sequence containing pn ones, how many bits do we need to choose one?
Entropy coder: encodes a sequence with probability distribution (p_i)_{i=1..m}
using asymptotically at least H = ∑_{i=1}^m p_i lg(1/p_i) bits/symbol (H ≤ lg(m)).
Seen as a weighted average: a symbol of probability p contains lg(1/p) bits.
Encoding a (p_i) distribution with an entropy coder optimal for a (q_i) distribution costs
ΔH = ∑_i p_i lg(1/q_i) − ∑_i p_i lg(1/p_i) = ∑_i p_i lg(p_i/q_i) ≈ (1/ln 4) ∑_i (p_i − q_i)²/p_i
more bits/symbol – the so-called Kullback-Leibler “distance” (not symmetric).
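
As a quick illustration of these formulas, a minimal Python sketch computing H and the Kullback-Leibler penalty ΔH (the two distributions are made-up examples, not taken from the slides):

from math import log2, log

def entropy(p):
    """H = sum_i p_i * lg(1/p_i), in bits/symbol."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

def kl_penalty(p, q):
    """Extra bits/symbol paid when encoding distribution p with a coder optimal for q."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]          # true symbol probabilities (example)
q = [0.5, 0.375, 0.125]        # probabilities assumed by the coder (example)

print(entropy(p))              # 1.5 bits/symbol
print(kl_penalty(p, q))        # exact Delta H
# quadratic approximation: (1/ln 4) * sum_i (p_i - q_i)^2 / p_i
print(sum((pi - qi) ** 2 / pi for pi, qi in zip(p, q)) / log(4))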
Symbol frequencies from a version of the British
National Corpus containing 67 million words
( http://www.yorku.ca/mack/uist2011.html#f1 ):
Uniform: π₯𝐠(πŸπŸ•) ≈ πŸ’. πŸ•πŸ“ bits/symbol (π‘₯ → 27π‘₯ + 𝑠)
Huffman uses: 𝐻 ′ = ∑𝑖 𝑝𝑖 π’“π’Š ≈ πŸ’. 𝟏𝟐 bits/symbol
Shannon: 𝐻 = ∑𝑖 𝑝𝑖 π₯𝐠(𝟏/π’‘π’Š ) ≈ πŸ’. πŸŽπŸ– bits/symbol
We can theoretically improve by ≈ 𝟏% here
πš«π‘― ≈ 𝟎. πŸŽπŸ’ bits/symbol
order 1 Markov: ~3.3 bits/symbol
order 2: ~3.1 bits/symbol, word: ~2.1 bits/symbol
Currently the best text compression: “durilca’kingsize”
(http://mattmahoney.net/dc/text.html)
compresses 10^9 bytes of text from Wikipedia (enwik9)
into 127,784,888 bytes: ≈ 1 bit/symbol
… (lossy) video compression: ~1000× reduction
Huffman codes – encode symbols as bit sequences
Perfect if symbol probabilities are powers of 2,
e.g. p(A) = 1/2, p(B) = 1/4, p(C) = 1/4
[Huffman tree figure: A → 0, B → 10, C → 11]
We can reduce ΔH by grouping m symbols together (alphabet size 2^m or 3^m):
generally, the maximal depth (R = max_i r_i) grows proportionally to m here,
and ΔH drops approximately proportionally to 1/m:  ΔH ∝ 1/R  (= 1/lg(L))
for ANS: ΔH ∝ 4^{−R}   (= 1/L², where L = 2^R is the number of states)
Fast Huffman decoding step for maximal depth R: let X contain the last R bits
(requires a decodingTable of size proportional to L = 2^R):
t = decodingTable[X];              // X ∈ {0, …, 2^R − 1} is the current state
useSymbol(t.symbol);               // use or store decoded symbol
X = t.newX + readBits(t.nbBits);   // state transition
where t.newX for Huffman contains the unused bits of the state, shifted to the oldest position:
t.newX = (X ≪ nbBits) & (2^R − 1)        (mod(a, 2^R) = a & (2^R − 1))
tANS: the same decoder, different t.newX: it not only shifts the unused bits,
but also modifies them according to the remaining fractional bits (lg(1/p)):
x = X + L ∈ {L, …, 2L − 1} is a buffer containing ≈ lg(x) ∈ [R, R + 1) bits.
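
A minimal Python sketch of this table-driven decode loop, instantiated for the 3-symbol Huffman example above (p(A) = 1/2, codes A → 0, B → 10, C → 11, so R = 2); decodingTable, newX and nbBits follow the slide's pseudocode, while the string bit-reader and zero padding are illustrative assumptions:

from collections import namedtuple

Entry = namedtuple("Entry", "symbol nbBits newX")
R = 2                                # maximal code length
L = 1 << R

# decodingTable[X], where X = the next R bits of the stream:
# stores the symbol, its code length (bits to refill) and the unused bits moved to the oldest position
codes = {"A": "0", "B": "10", "C": "11"}
decodingTable = [None] * L
for sym, code in codes.items():
    r = len(code)
    for tail in range(1 << (R - r)):                 # all possible following bits
        X = (int(code, 2) << (R - r)) | tail
        newX = (X << r) & (L - 1)                    # t.newX = (X << nbBits) & (2^R - 1)
        decodingTable[X] = Entry(sym, r, newX)

def decode(bits, n_symbols):
    bits = bits + "0" * R                            # zero padding so the refill never runs dry
    pos = R
    X = int(bits[:R], 2)                             # state: the last R read bits
    out = []
    for _ in range(n_symbols):
        t = decodingTable[X]
        out.append(t.symbol)                         # useSymbol(t.symbol)
        refill = int(bits[pos:pos + t.nbBits] or "0", 2)
        pos += t.nbBits
        X = t.newX + refill                          # state transition
    return "".join(out)

msg = "ABCAB"
encoded = "".join(codes[c] for c in msg)
assert decode(encoded, len(msg)) == msg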
Operating on a fractional number of bits:
– arithmetic coding: restricting a range of size L to a length-l subrange, which contains lg(L/l) bits;
  adding a symbol of probability p – containing lg(1/p) bits:
  lg(L/l) + lg(1/p) ≈ lg(L/l′) bits   for l′ ≈ p·l
– ANS: a single natural number x contains lg(x) bits; adding a symbol of probability p:
  lg(x) + lg(1/p) = lg(x′)   for x′ ≈ x/p
Asymmetric numeral systems – redefine even/odd numbers:
symbol distribution s: ℕ → 𝒜        (𝒜 – alphabet, e.g. {0,1})
(s(x) = mod(x, b) for the base-b numeral system: C(s, x) = bx + s)
It should still be uniformly distributed – but with density p_s:
#{0 ≤ y < x: s(y) = s} ≈ x·p_s
then x′ becomes (roughly) the x-th appearance of its symbol:
D(x′) = (s(x′), #{0 ≤ y < x′: s(y) = s(x′)})
C(D(x′)) = x′,   D(C(s, x)) = (s, x),   x′ ≈ x/p_s
example: range asymmetric binary systems (rABS)
Pr(0) = 1/4, Pr(1) = 3/4 – take the base-4 system and merge 3 digits:
the cyclic (0123) symbol distribution s becomes cyclic (0111):
s(x) = 0 if mod(x, 4) = 0, else 1
to decode or encode symbol 1, localize the quadruple (⌊x/4⌋ or ⌊x/3⌋):
if s(x) = 0: D(x) = (0, ⌊x/4⌋), else D(x) = (1, 3⌊x/4⌋ + mod(x, 4) − 1)
C(0, x) = 4x
C(1, x) = 4⌊x/3⌋ + mod(x, 3) + 1
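
A minimal Python sketch of this rABS example (Pr(0) = 1/4, Pr(1) = 3/4), checking that C and D are mutual inverses; the function names follow the slide:

def s(x):                       # symbol distribution: 0 on multiples of 4, else 1
    return 0 if x % 4 == 0 else 1

def D(x):                       # decoding: x -> (symbol, number of its earlier appearances)
    if s(x) == 0:
        return (0, x // 4)
    return (1, 3 * (x // 4) + x % 4 - 1)

def C(sym, x):                  # coding: position of the x-th appearance (0-indexed) of symbol sym
    if sym == 0:
        return 4 * x
    return 4 * (x // 3) + x % 3 + 1

# sanity check on a range of states: C(D(x)) = x and D(C(s, x)) = (s, x)
for x in range(1000):
    assert C(*D(x)) == x
for sym in (0, 1):
    for x in range(1000):
        assert D(C(sym, x)) == (sym, x)
print([s(x) for x in range(12)])   # cyclic 0111 0111 0111 pattern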
rANS – range variant for a large alphabet 𝒜 = {0, …, m − 1}
assume Pr(s) = f_s/2^n,   c_s := f_0 + f_1 + ⋯ + f_{s−1}
start with the base-2^n numeral system and merge length-f_s ranges:
for x ∈ {0, 1, …, 2^n − 1},  s(x) = max{s: c_s ≤ x}
encoding:  C(s, x) = (⌊x/f_s⌋ ≪ n) + mod(x, f_s) + c_s
decoding:  s = s(x & (2^n − 1))   (e.g. tabled, or the alias method)
           D(x) = (s, f_s · (x ≫ n) + (x & (2^n − 1)) − c_s)
Similar to Range Coding, but decoding uses 1 multiplication (instead of 2), and the
state is 1 number (instead of 2), making it convenient for SIMD vectorization
(https://github.com/rygorous/ryg_rans).
Additionally, we can use the alias method to store s for very precise
probabilities; it is also convenient for dynamic updating.
‘Alias’ method: rearrange the probability distribution into m buckets,
each containing a primary symbol and possibly a single ‘alias’ symbol.
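
A minimal Python sketch of the rANS C/D pair for a small example alphabet (the frequencies are illustrative); the symbol lookup s(x) is done with a plain table here rather than the alias method:

n = 4                                   # probabilities quantized to f_s / 2^n
f = [3, 8, 4, 1]                        # example frequencies, sum = 2^n = 16
c = [sum(f[:s]) for s in range(len(f))] # cumulative frequencies c_s
mask = (1 << n) - 1

# s(x) for x in {0, ..., 2^n - 1}: which symbol owns this slot (tabled lookup)
slot_to_symbol = [max(s for s in range(len(f)) if c[s] <= x) for x in range(1 << n)]

def C(s, x):                            # encoding step
    return ((x // f[s]) << n) + (x % f[s]) + c[s]

def D(x):                               # decoding step: a single multiplication
    s = slot_to_symbol[x & mask]
    return s, f[s] * (x >> n) + (x & mask) - c[s]

for x in range(10000):                  # C and D are mutual inverses
    s, y = D(x)
    assert C(s, y) == x
for s in range(len(f)):
    for x in range(1000):
        assert D(C(s, x)) == (s, x)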
uABS – uniform binary variant (𝒜 = {0,1}) – extremely accurate
Assume a binary alphabet, p := Pr(1); denote x_s = #{y < x: s(y) = s} ≈ x·p_s
For a uniform symbol distribution we can choose:
x_1 = ⌈x·p⌉,   x_0 = x − x_1 = x − ⌈x·p⌉
s(x) = 1 if there is a jump at the next position:  s = s(x) = ⌈(x + 1)·p⌉ − ⌈x·p⌉
decoding function: D(x) = (s, x_s)
its inverse – coding function:
C(0, x) = ⌈(x + 1)/(1 − p)⌉ − 1
C(1, x) = ⌊x/p⌋
For p = Pr(1) = 1 − Pr(0) = 0.3:
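
The original slide shows the resulting coding tables for p = 0.3; a minimal Python sketch below reproduces them and checks that C and D are mutual inverses (exact rational arithmetic is my own choice here, used to avoid floating-point rounding):

from fractions import Fraction
from math import ceil, floor

p = Fraction(3, 10)                       # p = Pr(1) = 0.3

def s(x):                                 # symbol: 1 iff ceil((x+1)*p) jumps
    return ceil((x + 1) * p) - ceil(x * p)

def D(x):                                 # (symbol, number of its earlier appearances)
    sym = s(x)
    x1 = ceil(x * p)
    return (sym, x1) if sym == 1 else (sym, x - x1)

def C(sym, x):                            # inverse of D
    if sym == 0:
        return ceil(Fraction(x + 1) / (1 - p)) - 1
    return floor(Fraction(x) / p)

print([s(x) for x in range(20)])          # 1,0,0,1,0,0,1,0,0,0,1,...  (about 30% ones)
print([C(0, x) for x in range(6)])        # positions of successive 0s
print([C(1, x) for x in range(6)])        # positions of successive 1s
for x in range(2000):
    assert C(*D(x)) == x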
Stream version – renormalization
So far: we encode using successive C functions into one huge number x,
then decode (in the reverse direction!) using successive D.
As in arithmetic coding, we need renormalization to limit the working precision:
enforce x ∈ I = {L, …, bL − 1} by transferring the base-b youngest digits.
ANS decoding step from state x:
(s, x) = D(x);
useSymbol(s);
while x < L:  x = b·x + readDigit();
encoding step for symbol s from state x:
while x > maxX[s]                      // maxX[s] = b·L_s − 1
{writeDigit(mod(x, b)); x = ⌊x/b⌋};
x = C(s, x);
For unique decoding, we need to ensure that there is a single way to perform the above loops:
I = {L, …, bL − 1},  I_s = {L_s, …, bL_s − 1},  where I_s = {x: C(s, x) ∈ I}
Fulfilled e.g. for:
- rABS/rANS when p_s = f_s/2^n has 1/L accuracy: 2^n divides L,
- uABS when p has 1/L accuracy: b⌈Lp⌉ = ⌈bLp⌉,
- tabled variants (tABS/tANS), where it is fulfilled by construction.
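
A minimal Python sketch of the stream version for the rABS example from before (Pr(0) = 1/4, Pr(1) = 3/4), with b = 2 and L = 8 chosen so that 2^n = 4 divides L; the digits are kept on a stack, and symbols are encoded in reverse so that decoding runs forward (the message and the initial state are illustrative choices):

b, L = 2, 8                       # I = {8, ..., 15}
Ls = [2, 6]                       # L_s = L * p_s for Pr(0) = 1/4, Pr(1) = 3/4
maxX = [b * l - 1 for l in Ls]    # renormalize while x > maxX[s]

def C(s, x):                      # rABS coding function (as above)
    return 4 * x if s == 0 else 4 * (x // 3) + x % 3 + 1

def D(x):                         # rABS decoding function
    return (0, x // 4) if x % 4 == 0 else (1, 3 * (x // 4) + x % 4 - 1)

def encode(symbols):
    x, digits = L, []             # start anywhere in I; the final x must be transmitted
    for s in reversed(symbols):   # encode backwards, so decoding runs forwards
        while x > maxX[s]:        # transfer the youngest base-b digits
            digits.append(x % b)
            x //= b
        x = C(s, x)
    return x, digits

def decode(x, digits, n):
    out = []
    for _ in range(n):
        s, x = D(x)
        out.append(s)
        while x < L:              # renormalize back into I
            x = b * x + digits.pop()
    return out

msg = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
state, digits = encode(msg)
assert decode(state, digits, len(msg)) == msg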
Single step of the stream version:
to get x ∈ I into I_s = {L_s, …, bL_s − 1}, we need to transfer k digits:
x → (C(s, ⌊x/b^k⌋), mod(x, b^k)),   where k = ⌊log_b(x/L_s)⌋
for a given s, observe that k can take only one of two values:
k = k_s or k = k_s − 1,   for k_s = −⌊log_b(p_s)⌋ = −⌊log_b(L_s/L)⌋
e.g.: b = 2, k_s = 3, L_s = 13, L = 66, p_s = 13/66:  b^{k_s}·x + 3 = 115, x = 14
(the state 115 ∈ I transfers k = 3 digits and becomes x = 14 ∈ I_s).
General picture:
the encoder prepares (transfers digits) before consuming the succeeding symbol,
the decoder produces a symbol, then consumes succeeding digits.
Decoding runs in the reverse direction: we have a stack of symbols (LIFO)
- the final state has to be stored, but we can write information into the initial state,
- encoding should be done in the backward direction,
  but using context from the perspective of the future forward decoding.
In a single step (I = {L, …, bL − 1}):  lg(x) → ≈ lg(x) + lg(1/p),  modulo lg(b)
Three sources of unpredictability/chaos:
1) Asymmetry: the behavior strongly depends on the chosen
symbol – a small difference changes the decoded symbol
and so the entire behavior.
2) Ergodicity: usually log_b(1/p) is irrational –
succeeding iterations cover the entire range.
3) Diffusivity: C(s, x) is close to, but not exactly, x/p_s – there
is additional ‘diffusion’ around the expected value.
So lg(x) ∈ [lg(L), lg(L) + lg(b))
has a nearly uniform distribution –
x has approximately a
Pr(x) ∝ 1/x
probability distribution – and contains
lg(1/Pr(x)) ≈ lg(x) + const
bits of information.
Redundancy: instead of p_s we use the probability q_s = x/C(s, x):
ΔH ≈ (1/ln 4) ∑_s (p_s − q_s)²/p_s   …   ΔH ≈ (1/ln 4) ∑_{s,x} Pr(x) (ε_s(x))²/p_s
where the inaccuracy ε_s(x) = p_s − x/C(s, x) drops like 1/L:
ΔH ≤ (1/(ln 4 · L²)) · (1/p + 1/(1 − p))   for uABS
ΔH ≤ (m/(ln 4 · L²)) · ∑_s 1/p_s           for rANS
Denoting L = 2^R:  ΔH ≈ 4^{−R}, and it grows proportionally to (alphabet size)².
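
A quick numeric evaluation of the stated uABS bound, showing the 4^{−R} behaviour (p = 0.3 is an example value):

from math import log

p = 0.3
for R in range(4, 13, 2):
    L = 2 ** R
    bound = (1 / p + 1 / (1 - p)) / (log(4) * L * L)   # Delta H <= (1/p + 1/(1-p)) / (ln 4 * L^2)
    print(f"R = {R:2d}  L = {L:5d}  Delta H bound = {bound:.3e} bits/symbol")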
Tabled variants: tANS, tABS      (choose L = 2^R)
I = {L, …, 2L − 1},  I_s = {L_s, …, 2L_s − 1},  where I_s = {x: C(s, x) ∈ I}
we choose the symbol distribution (symbol spread) s(x) with
#{x ∈ I: s(x) = s} = L_s, approximating the probabilities: p_s ≈ L_s/L
Fast pseudorandom spread (https://github.com/Cyan4973/FiniteStateEntropy):
step = 5/8 · L + 3;        // step chosen to cover the whole range
X = 0;                     // X = x − L ∈ {0, …, L − 1} for table handling
for s = 0 to m − 1 do for i = 1 to L_s do
{symbol[X] = s; X = mod(X + step, L); }            // symbol[X] = s(X + L)
Then filling decodingTable for the decoding step
{t = decodingTable[X]; useSymbol(t.symbol);
X = t.newX + readBits(t.nbBits)}:
for s = 0 to m − 1 do next[s] = L_s;               // symbol appearance number
for X = 0 to L − 1 do                              // fill all positions
{t.symbol = symbol[X];                             // from symbol spread
x = next[t.symbol]++;                              // use symbol and shift appearance
t.nbBits = R − ⌊lg(x)⌋;                            // L = 2^R
t.newX = (x ≪ t.nbBits) − L;  decodingTable[X] = t; }
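
A minimal Python sketch putting the slide together: the fast pseudorandom spread, the decoding-table construction above, plus a straightforward table-based encoder so the round trip can be checked; L, the symbol counts, the encoder side and the message are illustrative choices, not part of the slide:

R = 4; L = 1 << R                       # number of states
Ls = {"a": 8, "b": 6, "c": 2}           # L_s, sum = L; p_s ~ L_s / L

# --- fast pseudorandom symbol spread (as in FiniteStateEntropy) ---
step = (5 * L) // 8 + 3                 # 13, coprime with 16: covers the whole range
symbol = [None] * L
X = 0
for s, cnt in Ls.items():
    for _ in range(cnt):
        symbol[X] = s
        X = (X + step) % L

# --- decoding table: state x = X + L, entry = (symbol, nbBits, newX) ---
nxt = dict(Ls)                          # next appearance number, starts at L_s
decodingTable = [None] * L
for X in range(L):
    s = symbol[X]
    x = nxt[s]; nxt[s] += 1
    nbBits = R - x.bit_length() + 1     # R - floor(lg x)
    decodingTable[X] = (s, nbBits, (x << nbBits) - L)

# --- encoder: state of the x-th appearance of s, for x in I_s ---
occurrence = {s: [] for s in Ls}
for X in range(L):
    occurrence[symbol[X]].append(X + L)

def encode(msg):
    x, bits = L, []
    for s in reversed(msg):             # backward encoding, forward decoding
        while x >= 2 * Ls[s]:           # renormalize into I_s = {L_s, ..., 2L_s - 1}
            bits.append(x & 1); x >>= 1
        x = occurrence[s][x - Ls[s]]    # C(s, x)
    return x, bits

def decode(x, bits, n):
    out = []
    for _ in range(n):
        s, nbBits, newX = decodingTable[x - L]
        out.append(s)
        r = 0
        for _ in range(nbBits):         # readBits(nbBits), most significant bit first
            r = (r << 1) | bits.pop()
        x = newX + L + r
    return out

msg = list("abacabbacabaabca")
state, bits = encode(msg)
assert decode(state, bits, len(msg)) == msg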
Precise initialization (heuristic)
The values N_s = {(0.5 + i)/p_s : i = 0, …, L_s − 1}
are uniformly spread – we need to shift
them to natural numbers
(priority queue with put, getmin):
for s = 0 to m − 1 do
put((0.5/p_s, s));
for X = 0 to L − 1 do
{(v, s) = getmin;
put((v + 1/p_s, s));
symbol[X] = s;
}
ΔH drops like 1/L²   (here L ∝ alphabet size)
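
A minimal Python sketch of this precise initialization, using heapq as the priority queue and exact fractions so that ties behave deterministically; with p_s = L_s/L each symbol ends up with exactly L_s slots (the symbol counts are an example):

import heapq
from fractions import Fraction

R = 4; L = 1 << R
Ls = [8, 6, 2]                          # L_s for symbols 0, 1, 2 (example)
p = [Fraction(l, L) for l in Ls]        # p_s = L_s / L

heap = []
for s in range(len(Ls)):
    heapq.heappush(heap, (Fraction(1, 2) / p[s], s))    # put((0.5/p_s, s))

symbol = [None] * L
for X in range(L):
    v, s = heapq.heappop(heap)          # (v, s) = getmin
    heapq.heappush(heap, (v + 1 / p[s], s))              # put((v + 1/p_s, s))
    symbol[X] = s

print(symbol)
print([symbol.count(s) for s in range(len(Ls))])         # == Ls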
tABS and tuning
- test all possible symbol spreads for the binary alphabet,
- store tables for quantized probabilities p,
  e.g. 16 probabilities × 16 states = 256 bytes;  for 16 states, ΔH ≈ 0.001
Compared with the H.264 “M decoder” (arithmetic coding), the tABS decoding step is just:
t = decodingTable[p][X];
X = t.newX + readBits(t.nbBits);
useSymbol(t.symbol);
no branches, no bit-by-bit renormalization,
and the state is a single number (e.g. for SIMD).
Additional tANS advantage – simultaneous encryption
we can use the huge freedom in the initialization (choosing the symbol spread):
slightly disturb s(x) using a PRNG initialized with a cryptographic key
ADVANTAGES compared to standard (symmetric) cryptography:

                 standard, e.g. DES, AES      | ANS-based cryptography (initialized)
based on         XOR, permutations            | highly nonlinear operations
bit blocks       fixed length                 | pseudorandomly varying lengths
“brute force”    just start decoding          | perform initialization first for each new
or QC attacks    to test a cryptokey          | cryptokey, fixed to need e.g. 0.1 s
speed            online calculation           | most calculations during initialization
entropy          operates on bits             | operates on any input distribution
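
A toy Python sketch of the idea of disturbing the spread with a keyed PRNG; this is only an illustration of how a key-dependent coding table could be derived (the swap-based perturbation, the hash-to-seed step and all parameters are my assumptions, not a vetted cryptographic construction from the slides):

import hashlib, random

R = 6; L = 1 << R
Ls = {"a": 32, "b": 20, "c": 12}        # example symbol counts, sum = L

def fast_spread(Ls, L):                 # the pseudorandom spread from the tANS slide
    step, X, out = (5 * L) // 8 + 3, 0, [None] * L
    for s, cnt in Ls.items():
        for _ in range(cnt):
            out[X] = s
            X = (X + step) % L
    return out

def keyed_spread(Ls, L, key: bytes, swaps=64):
    """Toy illustration only: perturb the spread with a key-seeded PRNG."""
    spread = fast_spread(Ls, L)
    seed = int.from_bytes(hashlib.sha256(key).digest(), "big")
    rng = random.Random(seed)           # key -> PRNG; a real design would use a cryptographic PRNG
    for _ in range(swaps):              # swapping entries preserves all L_s counts
        i, j = rng.randrange(L), rng.randrange(L)
        spread[i], spread[j] = spread[j], spread[i]
    return spread

print(keyed_spread(Ls, L, b"key one")[:16])
print(keyed_spread(Ls, L, b"key two")[:16])   # a different table, hence different codes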
Summary – we can construct accurate and extremely fast entropy coders:
- accurate (and faster) replacement for Huffman coding,
- many times faster replacement for Range coding,
- faster decoding in adaptive binary case (but more complex encoding),
- can simultaneously encrypt the message.
Perfect for: DCT/wavelet coefficients, lossless image compression
Perfect with: Lempel-Ziv, Burrows-Wheeler Transform
... and many others ...
Further research:
- finding a symbol distribution s(x) with minimal ΔH (and quickly),
- tuning according to the p_s ≈ L_s/L approximation (e.g. very low-probability
  symbols should have a single appearance at the end of the table),
- optimal compression of the used probability distribution to reduce headers,
- maybe finding other low-state entropy coding families, e.g. forward-decoded,
- applying cryptographic capabilities (also without entropy coding),
- ... ?