# Basics of information theory and information complexity: a tutorial

Mark Braverman, Princeton University
June 1, 2013
## Part I: Information theory
• Information theory, in its modern form, was introduced in the 1940s to study the problem of transmitting data over physical channels.

[Figure: Alice → communication channel → Bob]
## Quantifying "information"
• Information is measured in bits.
• The basic notion is Shannon's entropy.
• The entropy of a random variable is the (typical) number of bits needed to remove the uncertainty of the variable.
• For a discrete variable:
  H(X) ≔ Σ_x Pr[X = x] · log(1/Pr[X = x])
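The definition above is easy to check numerically; here is a minimal sketch in Python (the helper name `entropy` is ours, not from the slides):

```python
import math

def entropy(dist):
    """Shannon entropy H(X) = sum_x Pr[X=x] * log2(1/Pr[X=x])."""
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

# A fair coin carries exactly one bit of uncertainty.
print(entropy([0.5, 0.5]))   # 1.0
# A uniform distribution on 4 values needs log2(4) = 2 bits.
print(entropy([0.25] * 4))   # 2.0
```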
## Shannon's entropy
• Important examples and properties:
– If X = c is a constant, then H(X) = 0.
– If X is uniform on a finite set S of possible values, then H(X) = log |S|.
– If X is supported on at most s values, then H(X) ≤ log s.
– If Y is a random variable determined by X, then H(Y) ≤ H(X).
## Conditional entropy
• For two (potentially correlated) variables X, Y, the conditional entropy of X given Y is the amount of uncertainty left in X given Y:
  H(X|Y) ≔ E_{y∼Y} H(X|Y = y).
• One can show H(XY) = H(Y) + H(X|Y).
• This important fact is known as the chain rule.
• If X ⊥ Y, then
  H(XY) = H(X) + H(Y|X) = H(X) + H(Y).
## Example
• X = B1 B2 B3
• Y = B1⊕B2, B2⊕B4, B3⊕B4, B5
• where B1, B2, B3, B4, B5 ∈ {0,1} are independent fair bits.
• Then
– H(X) = 3; H(Y) = 4; H(XY) = 5;
– H(X|Y) = 1 = H(XY) − H(Y);
– H(Y|X) = 2 = H(XY) − H(X).
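These values can be verified by brute force over the 32 equally likely settings of the five bits (a quick sketch; the helper name is ours):

```python
import math
from collections import Counter
from itertools import product

def entropy_of(samples):
    """Entropy of the exact distribution induced by the equally likely samples."""
    counts = Counter(samples)
    total = len(samples)
    return sum(c / total * math.log2(total / c) for c in counts.values())

xs, ys, xys = [], [], []
for b1, b2, b3, b4, b5 in product([0, 1], repeat=5):
    x = (b1, b2, b3)
    y = (b1 ^ b2, b2 ^ b4, b3 ^ b4, b5)
    xs.append(x); ys.append(y); xys.append((x, y))

print(entropy_of(xs), entropy_of(ys), entropy_of(xys))  # 3.0 4.0 5.0
```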
## Mutual information
• X = B1 B2 B3
• Y = B1⊕B2, B2⊕B4, B3⊕B4, B5

[Figure: Venn diagram of H(X) and H(Y); the overlap is the mutual information I(X;Y), the part of H(X) outside the overlap is H(X|Y), and the part of H(Y) outside is H(Y|X)]
## Mutual information
• The mutual information is defined as
  I(X;Y) ≔ H(Y) − H(Y|X) = H(X) − H(X|Y)
• "By how much does knowing X reduce the entropy of Y?"
• Always non-negative: I(X;Y) ≥ 0.
• Conditional mutual information:
  I(X;Y|Z) ≔ H(Y|Z) − H(Y|XZ)
• Chain rule for mutual information:
  I(XY;Z) = I(X;Z) + I(Y;Z|X)
• Simple intuitive interpretation.
## Example – a biased coin
• A coin with an ε-Heads or ε-Tails bias is tossed several times.
• Let B ∈ {H, T} be the direction of the bias, and suppose that a priori both options are equally likely:
  H(B) = 1.
• How many tosses are needed to find B?
• Let T1, …, Tn be a sequence of tosses.
## What do we learn about B?
• I(B; T1 T2) = I(B; T1) + I(B; T2|T1)
  = I(B; T1) + I(B T1; T2) − I(T1; T2)
  ≤ I(B; T1) + I(B T1; T2)
  = I(B; T1) + I(B; T2) + I(T1; T2|B)
  = I(B; T1) + I(B; T2) = 2 · I(B; T1),
  since the tosses are independent given B.
• Similarly,
  I(B; T1 … Tn) ≤ n · I(B; T1).
• To determine B with constant accuracy, we need
  0 < c ≤ I(B; T1 … Tn) ≤ n · I(B; T1).
• Hence n = Ω(1/I(B; T1)).
## Kullback–Leibler (KL) divergence
• A measure of distance between distributions on the same space (not symmetric, so not a true metric).
• Plays a key role in information theory.
  D(P ∥ Q) ≔ Σ_x P[x] · log(P[x]/Q[x]).
• D(P ∥ Q) ≥ 0, with equality when P = Q.
• Caution: D(P ∥ Q) ≠ D(Q ∥ P)!
## Properties of KL-divergence
• Connection to mutual information:
  I(X;Y) = E_{y∼Y} D(X_{Y=y} ∥ X).
• If X ⊥ Y, then X_{Y=y} = X, and both sides are 0.
• Pinsker's inequality:
  ‖P − Q‖₁² = O(D(P ∥ Q)).
• Tight!
  D(1/2+ε ∥ 1/2) = Θ(ε²).
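The Θ(ε²) behavior is easy to check numerically (a sketch; `kl` is our helper, with logs in base 2, applied to Bernoulli distributions given by their heads-probabilities):

```python
import math

def kl(p, q):
    """D(P || Q) for Bernoulli distributions with heads-probabilities p, q."""
    return (p * math.log2(p / q)
            + (1 - p) * math.log2((1 - p) / (1 - q)))

for eps in [0.1, 0.01, 0.001]:
    d = kl(0.5 + eps, 0.5)
    # D(1/2+eps || 1/2) ~ (2/ln 2) * eps^2 ~ 2.885 * eps^2 for small eps
    print(eps, d, d / eps**2)

# KL divergence is not symmetric:
print(kl(0.6, 0.5), kl(0.5, 0.6))
```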
## Back to the coin example
• I(B; T1) = E_{b∼B} D(T1|_{B=b} ∥ T1) = D(1/2+ε ∥ 1/2) = Θ(ε²).
• Therefore n = Ω(1/I(B; T1)) = Ω(1/ε²).
• "Follow the information learned from the coin tosses."
• Can be done using combinatorics, but the information-theoretic language is more natural for expressing what's going on.
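The Ω(1/ε²) lower bound matches what a simulation shows: with n well above 1/ε² tosses, a simple majority vote recovers the direction of the bias reliably. A sketch (parameters are ours, chosen for illustration):

```python
import random

def guess_bias(eps, n, rng):
    """Toss a coin with heads-probability 1/2 + s*eps (s = +/-1); guess s by majority."""
    s = rng.choice([-1, 1])
    heads = sum(rng.random() < 0.5 + s * eps for _ in range(n))
    guess = 1 if heads > n / 2 else -1
    return guess == s

rng = random.Random(0)
eps = 0.1
n = int(10 / eps**2)  # well above the 1/eps^2 threshold
accuracy = sum(guess_bias(eps, n, rng) for _ in range(200)) / 200
print(accuracy)  # close to 1
```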
## Back to communication
• The reason information theory is so important for communication is that its quantities can be operationalized.
• For example, we can attach operational meaning to Shannon's entropy: H(X) ≈ "the cost of transmitting X".
• Let C(X) be the (expected) cost of transmitting a sample of X.
## C(X) = H(X)?
• Not quite.
• Let X be a uniform trit: X ∈ {1, 2, 3}.
• Encode 1 ↦ 0, 2 ↦ 10, 3 ↦ 11.
• C(X) = 5/3 ≈ 1.67.
• H(X) = log 3 ≈ 1.58.
• It is always the case that C(X) ≥ H(X).
## But C(X) and H(X) are close
• Huffman's coding: C(X) ≤ H(X) + 1.
• This is a compression result: "an uninformative message turned into a short one".
• Therefore: H(X) ≤ C(X) ≤ H(X) + 1.
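Huffman's bound can be checked directly on the trit example with a minimal Huffman implementation (a sketch; function names are ours):

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths of an optimal (Huffman) prefix-free code."""
    if len(probs) == 1:
        return [1]
    # Heap items: (probability, tie-breaker, symbol indices in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # every symbol in the merged subtrees gains one bit
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

trit = [1/3, 1/3, 1/3]
C = sum(p * l for p, l in zip(trit, huffman_lengths(trit)))
H = sum(p * math.log2(1 / p) for p in trit)
print(C, H)  # C = 5/3 ~ 1.667, H = log2(3) ~ 1.585, and H <= C <= H + 1
```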
## Shannon's noiseless coding
• The cost of communicating many copies of X scales as H(X).
• Shannon's source coding theorem:
– Let C(X^n) be the cost of transmitting n independent copies of X. Then the amortized transmission cost
  lim_{n→∞} C(X^n)/n = H(X).
• This equation gives H(X) its operational meaning.
[Figure: a stream X1, …, Xn, … is sent over the communication channel at a cost of H(X) per copy]
## H(X) is nicer than C(X)
• H(X) is additive for independent variables.
• Let X1, X2 ∈ {1, 2, 3} be independent uniform trits.
• H(X1 X2) = log 9 = 2 log 3.
• C(X1 X2) = 29/9 < C(X1) + C(X2) = 2 × 5/3 = 30/9.
• Works well with concepts such as channel capacity.
## "Proof" of Shannon's noiseless coding
• n · H(X) = H(X^n) ≤ C(X^n) ≤ H(X^n) + 1,
  using the additivity of entropy on the left and compression (Huffman) on the right.
• Therefore lim_{n→∞} C(X^n)/n = H(X).
## Operationalizing other quantities
• Conditional entropy H(X|Y) (cf. Slepian-Wolf theorem):

[Figure: Alice holds X1, …, Xn, …; Bob holds the correlated Y1, …, Yn, …; transmitting the X's over the communication channel costs H(X|Y) per copy]
## Operationalizing other quantities
• Mutual information I(X;Y):

[Figure: Alice holds X1, …, Xn, …; sampling correlated Y1, …, Yn, … on Bob's side costs I(X;Y) per copy over the communication channel]
## Information theory and entropy
• Allows us to formalize intuitive notions.
• Operationalized in the context of one-way transmission and related problems.
• Has nice properties (additivity, chain rule…).
• Next, we discuss extensions to more interesting communication scenarios.
## Communication complexity
• Focus on the two-party randomized setting.

[Figure: Alice (input X) and Bob (input Y) share randomness R and implement a functionality F(X,Y), e.g. F(X,Y) = "X = Y?"]
## Communication complexity
Goal: implement a functionality F(X,Y).
A protocol π(X,Y) computing F(X,Y):

[Figure: Alice (input X) and Bob (input Y), with shared randomness R, exchange messages m1(X,R), m2(Y,m1,R), m3(X,m1,m2,R), …]

Communication cost = # of bits exchanged.
## Communication complexity
• Numerous applications/potential applications (some will be discussed later today).
• Obtaining lower bounds is considerably more difficult than for transmission (though still much easier than for other models of computation!).
## Communication complexity
• (Distributional) communication complexity with input distribution μ and error ε: CC(F, μ, ε). Error ≤ ε w.r.t. μ.
• (Randomized/worst-case) communication complexity: CC(F, ε). Error ≤ ε on all inputs.
• Yao's minimax:
  CC(F, ε) = max_μ CC(F, μ, ε).
## Examples
• X, Y ∈ {0,1}^n.
• Equality EQ(X,Y) ≔ 1_{X=Y}.
• CC(EQ, ε) ≈ log 1/ε.
• CC(EQ, 0) ≈ n.
## Equality
• F is "X = Y?".
• μ is a distribution where w.p. ½, X = Y and w.p. ½, (X,Y) are random.

[Figure: Alice sends MD5(X) (128 bits); Bob replies "X = Y?" (1 bit)]

• Error? Only when MD5(X) = MD5(Y) but X ≠ Y.
• Shows that CC(EQ, μ, 2^{−129}) ≤ 129.
## Examples
• X, Y ∈ {0,1}^n.
• Inner product IP(X,Y) ≔ Σ_i X_i · Y_i (mod 2).
• CC(IP, 0) = n − o(n).
In fact, using information complexity:
• CC(IP, ε) = n − o_ε(n).
## Information complexity
• Information complexity IC(F, ε) :: communication complexity CC(F, ε)
as
• Shannon's entropy H(X) :: transmission cost C(X)
## Information complexity
• The smallest amount of information Alice and Bob need to exchange to solve F.
• How is information measured?
• Communication cost of a protocol?
– Number of bits exchanged.
• Information cost of a protocol?
– Amount of information revealed.
## Basic definition 1: The information cost of a protocol
• Prior distribution: (X, Y) ∼ μ.

[Figure: Alice (input X) and Bob (input Y) run protocol π, producing transcript Π]

  IC(π, μ) = I(Π; Y|X) + I(Π; X|Y)
## Example
• F is "X = Y?".
• μ is a distribution where w.p. ½, X = Y and w.p. ½, (X,Y) are random.

[Figure: Alice sends MD5(X) (128 bits); Bob replies "X = Y?" (1 bit)]

  IC(π, μ) = I(Π; Y|X) + I(Π; X|Y) ≈ 1 + 64.5 = 65.5 bits
## Prior μ matters a lot for information cost!
• If μ = 1_{(x,y)} is a singleton distribution, then
  IC(π, μ) = 0.
## Example
• F is "X = Y?".
• μ is a distribution where (X,Y) are just uniformly random.

[Figure: Alice sends MD5(X) (128 bits); Bob replies "X = Y?" (1 bit)]

  IC(π, μ) = I(Π; Y|X) + I(Π; X|Y) ≈ 0 + 128 = 128 bits
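The prior-dependence can be computed exactly for the trivial protocol in which Alice sends X: the transcript is Π = X, so IC(π, μ) = I(X; Y|X) + I(X; X|Y) = 0 + H(X|Y). A sketch computing H(X|Y) for one-bit inputs under two priors (helper names ours):

```python
import math

def cond_entropy(joint):
    """H(X|Y) for a joint distribution given as {(x, y): prob}."""
    py = {}
    for (x, y), p in joint.items():
        py[y] = py.get(y, 0) + p
    h = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            h += p * math.log2(py[y] / p)
    return h

# Singleton prior: (X, Y) = (0, 0) always -> the protocol reveals nothing new.
singleton = {(0, 0): 1.0}
# Uniform prior on two independent bits -> Bob learns a full bit about X.
uniform = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(cond_entropy(singleton), cond_entropy(uniform))  # 0.0 1.0
```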
## Basic definition 2: Information complexity
• Communication complexity:
  CC(F, μ, ε) ≔ min over protocols π that compute F with error ≤ ε of CC(π).
• Analogously:
  IC(F, μ, ε) ≔ inf over protocols π that compute F with error ≤ ε of IC(π, μ).
• The inf is needed here: unlike the communication minimum, the information infimum may not be attained.
## Prior-free information complexity
• Using minimax we can get rid of the prior.
• For communication:
  CC(F, ε) = max_μ CC(F, μ, ε).
• For information:
  IC(F, ε) ≔ inf over protocols π that compute F with error ≤ ε of max_μ IC(π, μ).
## Ex: The information complexity of Equality
• What is IC(EQ, 0)?
• Consider the following protocol: publicly sample a non-singular matrix A ∈ {0,1}^{n×n} with rows A1, …, An.
• Alice sends A1·X; Bob replies with A1·Y; Alice sends A2·X; Bob replies with A2·Y; …
• Continue for n steps, or until a disagreement is discovered.
## Analysis (sketch)
• If X ≠ Y, the protocol will terminate in O(1) rounds on average, and thus reveal O(1) information.
• If X = Y… the players only learn the fact that X = Y (≤ 1 bit of information).
• Thus the protocol has O(1) information complexity for any prior μ.
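The O(1)-rounds claim is easy to simulate. For simplicity this sketch uses independent random row vectors rather than rows of a fixed non-singular matrix (a simplification of the slide's protocol): when X ≠ Y, each random parity check disagrees with probability 1/2, so the expected number of rounds is 2.

```python
import random

def rounds_until_disagreement(x, y, n, rng):
    """Compare <A_i, X> vs <A_i, Y> (mod 2) for random 0/1 vectors A_i until they differ."""
    for r in range(1, n + 1):
        a = [rng.randint(0, 1) for _ in range(n)]
        px = sum(ai * xi for ai, xi in zip(a, x)) % 2
        py = sum(ai * yi for ai, yi in zip(a, y)) % 2
        if px != py:
            return r
    return n  # no disagreement in n rounds: conclude X = Y

rng = random.Random(1)
n = 32
x = [rng.randint(0, 1) for _ in range(n)]
y = list(x); y[0] ^= 1  # X and Y differ in one position
avg = sum(rounds_until_disagreement(x, y, n, rng) for _ in range(2000)) / 2000
print(avg)  # about 2
```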
## Operationalizing IC: Information equals amortized communication
• Recall [Shannon]: lim_{n→∞} C(X^n)/n = H(X).
• Turns out: lim_{n→∞} CC(F^n, μ^n, ε)/n = IC(F, μ, ε),
  for ε > 0. [Error ε allowed on each copy.]
• For ε = 0: lim_{n→∞} CC(F^n, μ^n, 0⁺)/n = IC(F, μ, 0).
• [Whether lim_{n→∞} CC(F^n, μ^n, 0)/n matches is an interesting open problem.]
## Information = amortized communication
• lim_{n→∞} CC(F^n, μ^n, ε)/n = IC(F, μ, ε).
• Two directions: "≤" and "≥".
• Compare with Shannon:
  n · H(X) = H(X^n) ≤ C(X^n) ≤ H(X^n) + 1
  (entropy additivity on the left, Huffman compression on the right).
## The "≤" direction
• lim_{n→∞} CC(F^n, μ^n, ε)/n ≤ IC(F, μ, ε).
• Take a protocol π whose IC(π, μ) is close to IC(F, μ, ε).
• Show how to compress many copies of π into a protocol whose communication cost is close to its information cost.
• More on compression later.
## The "≥" direction
• lim_{n→∞} CC(F^n, μ^n, ε)/n ≥ IC(F, μ, ε).
• Use the fact that
  CC(F^n, μ^n, ε)/n ≥ IC(F^n, μ^n, ε)/n = IC(F, μ, ε),
  where the last equality is the additivity of information complexity.
## Additivity of information complexity
• Let T1(X1, Y1) and T2(X2, Y2) be two two-party tasks.
• E.g. "Solve F(X,Y) with error ≤ ε w.r.t. μ".
• Then
  IC(T1 × T2, μ1 × μ2) = IC(T1, μ1) + IC(T2, μ2).
• "≤" is easy.
• "≥" is the interesting direction.
## IC(T1, μ1) + IC(T2, μ2) ≤ IC(T1 × T2, μ1 × μ2)
• Start from a protocol π for T1 × T2 with prior μ1 × μ2, whose information cost is I.
• Show how to construct two protocols: π1 for T1 with prior μ1 and π2 for T2 with prior μ2, with information costs I1 and I2, respectively, such that I1 + I2 = I.
## From π(X1, X2, Y1, Y2) to π1 and π2
π1(X1, Y1):
• Publicly sample X2 ∼ μ2.
• Bob privately samples Y2 ∼ μ2|X2.
• Run π(X1, X2, Y1, Y2).
π2(X2, Y2):
• Publicly sample Y1 ∼ μ1.
• Alice privately samples X1 ∼ μ1|Y1.
• Run π(X1, X2, Y1, Y2).
## Analysis - 1
π1(X1, Y1):
• Publicly sample X2 ∼ μ2.
• Bob privately samples Y2 ∼ μ2|X2.
• Run π(X1, X2, Y1, Y2).
• Alice learns about Y1: I(Π; Y1|X1 X2).
• Bob learns about X1: I(Π; X1|Y1 Y2 X2).
• I1 = I(Π; Y1|X1 X2) + I(Π; X1|Y1 Y2 X2).
## Analysis - 2
π2(X2, Y2):
• Publicly sample Y1 ∼ μ1.
• Alice privately samples X1 ∼ μ1|Y1.
• Run π(X1, X2, Y1, Y2).
• Alice learns about Y2: I(Π; Y2|X1 X2 Y1).
• Bob learns about X2: I(Π; X2|Y1 Y2).
• I2 = I(Π; Y2|X1 X2 Y1) + I(Π; X2|Y1 Y2).
## Adding up I1 + I2
I1 + I2
= I(Π; Y1|X1 X2) + I(Π; X1|Y1 Y2 X2) + I(Π; Y2|X1 X2 Y1) + I(Π; X2|Y1 Y2)
= [I(Π; Y1|X1 X2) + I(Π; Y2|X1 X2 Y1)] + [I(Π; X2|Y1 Y2) + I(Π; X1|Y1 Y2 X2)]
= I(Π; Y1 Y2|X1 X2) + I(Π; X1 X2|Y1 Y2) = I,
by the chain rule for mutual information.
## Summary
• IC is operationalized via "information = amortized communication":
  lim_{n→∞} CC(F^n, μ^n, ε)/n = IC(F, μ, ε).
• Seems to be the "right" analogue of entropy for interactive computation.
## Entropy vs. Information Complexity

|                  | Entropy                      | IC                                     |
|------------------|------------------------------|----------------------------------------|
| Operationalized? | Yes: lim_{n→∞} C(X^n)/n      | Yes: lim_{n→∞} CC(F^n, μ^n, ε)/n       |
| Compression?     | Huffman: C(X) ≤ H(X) + 1     | ???!                                   |
## Can interactive communication be compressed?
• Is it true that CC(F, μ, ε) ≤ IC(F, μ, ε) + O(1)?
• Less ambitiously:
  CC(F, μ, O(ε)) = O(IC(F, μ, ε))?
• (Almost) equivalently: given a protocol π with IC(π, μ) = I, can Alice and Bob simulate π using ~O(I) communication?
• Not known in general…
## Direct sum theorems
• Let F be any functionality.
• Let C(F) be the cost of implementing F.
• Let F^n be the functionality of implementing n independent copies of F.
• The direct sum problem:
  "Does C(F^n) ≈ n · C(F)?"
• In most cases it is obvious that C(F^n) ≤ n · C(F).
## Direct sum – randomized communication complexity
• Is it true that
  CC(F^n, μ^n, ε) = Ω(n · CC(F, μ, ε))?
• Is it true that CC(F^n, ε) = Ω(n · CC(F, ε))?
## Direct product – randomized communication complexity
• Direct sum:
  CC(F^n, μ^n, ε) = Ω(n · CC(F, μ, ε))?
• Direct product:
  CC(F^n, μ^n, 1 − (1−ε)^n) = Ω(n · CC(F, μ, ε))?
## Direct sum for randomized CC and interactive compression
Direct sum:
• CC(F^n, μ^n, ε) = Ω(n · CC(F, μ, ε))?
In the limit:
• n · IC(F, μ, ε) = Ω(n · CC(F, μ, ε))?
Interactive compression:
• CC(F, μ, ε) = O~(IC(F, μ, ε))?
Same question!
## The big picture

[Diagram: four quantities arranged in a cycle — CC(F, μ, ε); IC(F, μ, ε); IC(F^n, μ^n, ε)/n; CC(F^n, μ^n, ε)/n. The equality IC(F, μ, ε) = IC(F^n, μ^n, ε)/n is additivity (= direct sum) for information; IC(F^n, μ^n, ε)/n = CC(F^n, μ^n, ε)/n is information = amortized communication; closing the loop from IC(F, μ, ε) back to CC(F, μ, ε) is interactive compression, equivalently the direct sum question for communication.]
## Current results for compression
A protocol π that has C bits of communication, conveys I bits of information over prior μ, and works in r rounds can be simulated:
• using O~(I + r) bits of communication;
• using O~(√(I · C)) bits of communication;
• using 2^{O(I)} bits of communication;
• if μ = μ_X × μ_Y, then using O~(I · polylog C) bits of communication.
## Their direct sum counterparts
• CC(F^n, μ^n, ε) = Ω~(n^{1/2} · CC(F, μ, ε)).
• CC(F^n, ε) = Ω~(n^{1/2} · CC(F, ε)).
For product distributions μ = μ_X × μ_Y,
• CC(F^n, μ^n, ε) = Ω~(n · CC(F, μ, ε)).
When the number of rounds is bounded by r ≪ n, a direct sum theorem holds.
## Direct product
• The best one can hope for is a statement of the type:
  CC(F^n, μ^n, 1 − 2^{−O(n)}) = Ω(n · CC(F, μ, 1/3)).
• Can prove:
  CC(F^n, μ^n, 1 − 2^{−O(n)}) = Ω(n^{1/2} · CC(F, μ, 1/3)).
## Proof 2: Compressing a one-round protocol
• Say Alice speaks: IC(π, μ) = I(M; X|Y).
• Recall the connection to KL-divergence:
  I(M; X|Y) = E_{X,Y} D(M_X ∥ M_Y),
  where M_X is the distribution of Alice's message given X, and M_Y is its average given Bob's Y.
• Bottom line:
– Alice has M_X; Bob has M_Y;
– Goal: sample from M_X using ∼ D(M_X ∥ M_Y) communication.
## The dart board
• Interpret the public randomness as random points in U × [0,1], where U is the universe of all possible messages.
• The first point under the histogram of M_X is distributed ∼ M_X.

[Figure: points (u1, q1), (u2, q2), … scattered in U × [0,1]; the first one that falls under the graph of M_X is Alice's sample]
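The dart-board step is just rejection sampling from shared randomness: a sketch in which the first shared point under Alice's histogram is accepted, so the accepted points are distributed according to M_X (names are ours):

```python
import random

def dart_sample(mx, rng):
    """Return the first public dart (u, q) with q < M_X(u); the result is ~ M_X."""
    universe = list(mx)
    while True:
        u = rng.choice(universe)  # public random message candidate
        q = rng.random()          # public random height in [0, 1]
        if q < mx[u]:
            return u

rng = random.Random(2)
mx = {"a": 0.7, "b": 0.2, "c": 0.1}
counts = {u: 0 for u in mx}
trials = 5000
for _ in range(trials):
    counts[dart_sample(mx, rng)] += 1
print({u: counts[u] / trials for u in mx})  # roughly {'a': 0.7, 'b': 0.2, 'c': 0.1}
```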
## Proof idea
• Sample using O(log 1/ε + D(M_X ∥ M_Y)) communication with statistical error ε.

[Figure sequence: ~|U| public samples in U × [0,1], viewed against both histograms. Alice marks her first point under M_X (say u4). She sends hash values h1(u4), h2(u4), … so that Bob can identify it among his candidates under M_Y. If no candidate matches, Bob doubles his histogram (2·M_Y, 4·M_Y, …) and Alice sends further hashes h3(u4), h4(u4), …, h_{log 1/ε}(u4), until a unique candidate remains.]
## Analysis
• If M_X(u4) ≈ 2^t · M_Y(u4), then the protocol will reach round t of doubling.
• There will be ≈ 2^t candidates.
• About t + log 1/ε hashes are needed to narrow down to one.
• The contribution of u4 to the cost:
  M_X(u4) · (log(M_X(u4)/M_Y(u4)) + log 1/ε).
• Summing over u recovers
  D(M_X ∥ M_Y) ≔ Σ_u M_X(u) · log(M_X(u)/M_Y(u)).
Done!
## External information cost
• (X, Y) ∼ μ.

[Figure: Alice (input X) and Bob (input Y) run protocol π with transcript Π, watched by an external observer C]

  IC^{ext}(π, μ) = I(Π; XY)
## Example
• F is "X = Y?".
• μ is a distribution where w.p. ½, X = Y and w.p. ½, (X,Y) are random.

[Figure: Alice sends MD5(X) (128 bits); Bob replies "X = Y?" (1 bit)]

  IC^{ext}(π, μ) = I(Π; XY) = 129 bits
## External information cost
• It is always the case that
  IC^{ext}(π, μ) ≥ IC(π, μ).
• If μ = μ_X × μ_Y is a product distribution, then
  IC^{ext}(π, μ) = IC(π, μ).
## External information complexity
• IC^{ext}(F, μ, ε) ≔ inf over protocols π that compute F with error ≤ ε of IC^{ext}(π, μ).
• Can it be operationalized?
## Operational meaning of IC^{ext}?
• Conjecture: zero-error communication scales like external information:
  lim_{n→∞} CC(F^n, μ^n, 0)/n = IC^{ext}(F, μ, 0)?
• Recall:
  lim_{n→∞} CC(F^n, μ^n, 0⁺)/n = IC(F, μ, 0).
## Example – transmission with a strong prior
• X, Y ∈ {0,1}.
• μ is such that X is uniform on {0,1}, and Y = X with a very high probability (say 1 − 1/√n).
• F(X,Y) = X is just the "transmit X" function.
• Clearly, π should just have Alice send X to Bob.
• IC(F, μ, 0) = IC(π, μ) = H(X|Y) = o(1).
• IC^{ext}(F, μ, 0) = IC^{ext}(π, μ) = H(X) = 1.
## Example – transmission with a strong prior
• IC(F, μ, 0) = IC(π, μ) = H(X|Y) = o(1).
• IC^{ext}(F, μ, 0) = IC^{ext}(π, μ) = H(X) = 1.
• Thus CC(F^n, μ^n, 0⁺) = o(n).
• But CC(F^n, μ^n, 0) = Ω(n).
Other examples, e.g. the two-bit AND function, fit into this picture.
[Section divider: Information complexity · Interactive coding · Information theory in TCS]
## Interactive coding theory
• So far we have focused the discussion on noiseless coding.
• What if the channel has noise?
• [What kind of noise?]
• In the non-interactive case, each channel has a capacity C.
## Channel capacity
• The amortized number of channel uses needed to send X over a noisy channel of capacity C is
  H(X)/C.
• This decouples the task from the channel!
## Example: Binary Symmetric Channel
• Each bit gets independently flipped with probability ε < 1/2.
• One-way capacity: 1 − H(ε).

[Figure: BSC(ε) transition diagram — 0 → 0 and 1 → 1 w.p. 1 − ε; 0 → 1 and 1 → 0 w.p. ε]
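The capacity formula is easy to evaluate (a sketch; the function names are ours):

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(eps):
    """One-way capacity of the binary symmetric channel with flip probability eps."""
    return 1 - binary_entropy(eps)

print(bsc_capacity(0.0))   # 1.0: a noiseless channel
print(bsc_capacity(0.5))   # 0.0: the output is independent of the input
print(bsc_capacity(0.11))  # about 0.5
```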
## Interactive channel capacity
• Not clear one can decouple channel from task in such a clean way.
• Capacity is much harder to calculate/reason about.
• Example: the binary symmetric channel BSC(ε).
• One-way capacity: 1 − H(ε).
• Interactive capacity (for simple pointer jumping, [Kol-Raz'13]):
  1 − Θ(√H(ε)).
## Information theory in communication complexity and beyond
• A natural extension would be to multi-party communication complexity.
• Some success in the number-in-hand case.
• Explicit bounds for ≥ log n players in the number-on-forehead model would imply explicit ACC⁰ circuit lower bounds.
## Naïve multi-party information cost

[Figure: three players A, B, C with inputs X, Y, Z; each player sees two of the three inputs (XZ, YZ, XY)]

  IC(π, μ) = I(Π; X|YZ) + I(Π; Y|XZ) + I(Π; Z|XY)
## Naïve multi-party information cost
  IC(π, μ) = I(Π; X|YZ) + I(Π; Y|XZ) + I(Π; Z|XY)
• Doesn't seem to work.
• Secure multi-party computation [Ben-Or, Goldwasser, Wigderson] means that anything can be computed at near-zero information cost.
• Although, these constructions require the players to share private channels/randomness.
## Communication and beyond…
• The rest of today:
– Data structures;
– Streaming;
– Distributed computing;
– Privacy.
• Exact communication complexity bounds.
• Extended formulations lower bounds.
• Parallel repetition?
• …
## Thank You!
## Open problem: Computability of IC
• Given the truth table of F(X,Y), ε, and μ, compute IC(F, μ, ε).
• Via IC(F, μ, ε) = lim_{n→∞} CC(F^n, μ^n, ε)/n one can compute a sequence of upper bounds.
• But the rate of convergence as a function of n is unknown.
## Open problem: Computability of IC
• Can compute the r-round information complexity IC_r(F, μ, ε) of F.
• But the rate of convergence as a function of r is unknown.
• Conjecture:
  IC_r(F, μ, ε) − IC(F, μ, ε) = O_{F,μ,ε}(1/r²).
• This is the relationship for the two-bit AND function.