Basics of information theory
and information complexity:
a tutorial
Mark Braverman
Princeton University
June 1, 2013
Part I: Information theory
• Information theory, in its modern format,
was introduced in the 1940s to study the
problem of transmitting data over physical
channels.
[Diagram: Alice → communication channel → Bob.]
Quantifying “information”
• Information is measured in bits.
• The basic notion is Shannon’s entropy.
• The entropy of a random variable is the
(typical) number of bits needed to remove
the uncertainty of the variable.
• For a discrete variable:
𝐻(𝑋) ≔ ∑π‘₯ Pr[𝑋 = π‘₯] log(1/Pr[𝑋 = π‘₯])
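The definition can be checked numerically; a minimal sketch (the function name and the test distributions are mine, not from the slides):

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution {value: probability}."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

uniform8 = {i: 1 / 8 for i in range(8)}
print(entropy(uniform8))          # uniform on 8 values: 3.0 bits
print(entropy({"a": 1.0}))        # a constant: 0.0 bits
```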
Shannon’s entropy
• Important examples and properties:
– If 𝑋 = π‘₯ is a constant, then 𝐻 𝑋 = 0.
– If 𝑋 is uniform on a finite set 𝑆 of possible
values, then 𝐻 𝑋 = log 𝑆.
– If 𝑋 is supported on at most 𝑛 values, then
𝐻 X ≤ log 𝑛.
– If π‘Œ is a random variable determined by 𝑋,
then 𝐻 π‘Œ ≤ 𝐻(𝑋).
4
Conditional entropy
• For two (potentially correlated) variables
𝑋, π‘Œ, the conditional entropy of 𝑋 given π‘Œ is
the amount of uncertainty left in 𝑋 given π‘Œ:
𝐻 𝑋 π‘Œ ≔ 𝐸𝑦~π‘Œ H X Y = y .
• One can show 𝐻 π‘‹π‘Œ = 𝐻 π‘Œ + 𝐻(𝑋|π‘Œ).
• This important fact is knows as the chain
rule.
• If 𝑋 ⊥ π‘Œ, then
𝐻 π‘‹π‘Œ = 𝐻 𝑋 + 𝐻 π‘Œ 𝑋 = 𝐻 𝑋 + 𝐻 π‘Œ .
5
Example
• 𝑋 = (𝐡1, 𝐡2, 𝐡3)
• π‘Œ = (𝐡1 ⊕ 𝐡2, 𝐡2 ⊕ 𝐡4, 𝐡3 ⊕ 𝐡4, 𝐡5)
• where 𝐡1, 𝐡2, 𝐡3, 𝐡4, 𝐡5 ∈π‘ˆ {0,1} are uniform independent bits.
• Then
– 𝐻(𝑋) = 3; 𝐻(π‘Œ) = 4; 𝐻(π‘‹π‘Œ) = 5;
– 𝐻(𝑋|π‘Œ) = 1 = 𝐻(π‘‹π‘Œ) − 𝐻(π‘Œ);
– 𝐻(π‘Œ|𝑋) = 2 = 𝐻(π‘‹π‘Œ) − 𝐻(𝑋).
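The claimed values can be verified by enumerating all 32 settings of 𝐡1, … , 𝐡5; a small sketch (helper names are mine):

```python
import itertools
import math
from collections import Counter

def entropy_of(samples):
    """Entropy (bits) of the empirical distribution of a list of outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

outcomes_X, outcomes_Y, outcomes_XY = [], [], []
for b1, b2, b3, b4, b5 in itertools.product([0, 1], repeat=5):
    x = (b1, b2, b3)
    y = (b1 ^ b2, b2 ^ b4, b3 ^ b4, b5)
    outcomes_X.append(x)
    outcomes_Y.append(y)
    outcomes_XY.append((x, y))

H_X, H_Y, H_XY = map(entropy_of, (outcomes_X, outcomes_Y, outcomes_XY))
print(H_X, H_Y, H_XY)           # 3.0 4.0 5.0
print(H_XY - H_Y, H_XY - H_X)   # H(X|Y) = 1.0, H(Y|X) = 2.0
```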
Mutual information
• 𝑋 = (𝐡1, 𝐡2, 𝐡3)
• π‘Œ = (𝐡1 ⊕ 𝐡2, 𝐡2 ⊕ 𝐡4, 𝐡3 ⊕ 𝐡4, 𝐡5)
[Venn diagram: 𝐻(𝑋) splits into 𝐻(𝑋|π‘Œ) and 𝐼(𝑋; π‘Œ); 𝐻(π‘Œ) splits into 𝐻(π‘Œ|𝑋) and 𝐼(𝑋; π‘Œ). In the example, 𝐡1 sits in 𝐻(𝑋|π‘Œ); 𝐡1 ⊕ 𝐡2 and 𝐡2 ⊕ 𝐡3 in 𝐼(𝑋; π‘Œ); 𝐡4 and 𝐡5 in 𝐻(π‘Œ|𝑋).]
Mutual information
• The mutual information is defined as
𝐼(𝑋; π‘Œ) = 𝐻(𝑋) − 𝐻(𝑋|π‘Œ) = 𝐻(π‘Œ) − 𝐻(π‘Œ|𝑋)
• “By how much does knowing 𝑋 reduce the
entropy of π‘Œ?”
• Always non-negative: 𝐼(𝑋; π‘Œ) ≥ 0.
• Conditional mutual information:
𝐼(𝑋; π‘Œ|𝑍) ≔ 𝐻(𝑋|𝑍) − 𝐻(𝑋|π‘Œπ‘)
• Chain rule for mutual information:
𝐼(π‘‹π‘Œ; 𝑍) = 𝐼(𝑋; 𝑍) + 𝐼(π‘Œ; 𝑍|𝑋)
• Simple intuitive interpretation.
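The chain rule 𝐼(π‘‹π‘Œ; 𝑍) = 𝐼(𝑋; 𝑍) + 𝐼(π‘Œ; 𝑍|𝑋) can be sanity-checked on any small joint distribution; a sketch with made-up probabilities:

```python
import math
from collections import defaultdict

def H(pmf):
    """Entropy (bits) of a distribution {outcome: prob}."""
    return sum(p * math.log2(1 / p) for p in pmf.values() if p > 0)

def marginal(joint, idx):
    """Marginal pmf of the coordinates in idx, from a joint pmf over tuples."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

# An arbitrary correlated joint distribution over (x, y, z);
# the probabilities are made up purely for illustration.
joint = {(0, 0, 0): .2, (0, 1, 1): .1, (1, 0, 1): .25,
         (1, 1, 0): .15, (1, 1, 1): .3}

def I(a, b):  # I(A;B) = H(A) + H(B) - H(AB)
    return H(marginal(joint, a)) + H(marginal(joint, b)) - H(marginal(joint, a + b))

I_XY_Z = I((0, 1), (2,))
I_X_Z = I((0,), (2,))
# I(Y;Z|X) = H(Y|X) - H(Y|XZ) = H(XY) + H(XZ) - H(XYZ) - H(X)
I_Y_Z_given_X = (H(marginal(joint, (0, 1))) + H(marginal(joint, (0, 2)))
                 - H(marginal(joint, (0, 1, 2))) - H(marginal(joint, (0,))))
print(abs(I_XY_Z - (I_X_Z + I_Y_Z_given_X)) < 1e-12)   # True
```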
Example – a biased coin
• A coin with bias πœ€ towards Heads or towards
Tails is tossed several times.
• Let 𝐡 ∈ {𝐻, 𝑇} be the direction of the bias, and
suppose that a priori both options are equally
likely: 𝐻(𝐡) = 1.
• How many tosses are needed to find 𝐡?
• Let 𝑇1, … , π‘‡π‘˜ be a sequence of tosses.
• Start with π‘˜ = 2.
What do we learn about 𝐡?
• 𝐼(𝐡; 𝑇1𝑇2) = 𝐼(𝐡; 𝑇1) + 𝐼(𝐡; 𝑇2|𝑇1)
= 𝐼(𝐡; 𝑇1) + 𝐼(𝐡𝑇1; 𝑇2) − 𝐼(𝑇1; 𝑇2)
≤ 𝐼(𝐡; 𝑇1) + 𝐼(𝐡𝑇1; 𝑇2)
= 𝐼(𝐡; 𝑇1) + 𝐼(𝐡; 𝑇2) + 𝐼(𝑇1; 𝑇2|𝐡)
= 𝐼(𝐡; 𝑇1) + 𝐼(𝐡; 𝑇2) = 2 ⋅ 𝐼(𝐡; 𝑇1),
since the tosses are independent given 𝐡.
• Similarly,
𝐼(𝐡; 𝑇1 … π‘‡π‘˜) ≤ π‘˜ ⋅ 𝐼(𝐡; 𝑇1).
• To determine 𝐡 with constant accuracy, need
0 < 𝑐 < 𝐼(𝐡; 𝑇1 … π‘‡π‘˜) ≤ π‘˜ ⋅ 𝐼(𝐡; 𝑇1).
• Hence π‘˜ = Ω(1/𝐼(𝐡; 𝑇1)).
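The bound 𝐼(𝐡; 𝑇1𝑇2) ≤ 2 ⋅ 𝐼(𝐡; 𝑇1) can be verified by exact enumeration; a sketch (function names are mine):

```python
import itertools
import math

def H(probs):
    """Entropy (bits) of a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def mutual_info_bias(eps, k):
    """I(B; T1...Tk): B is the bias direction (Heads-prob 1/2+eps or 1/2-eps,
    each a priori equally likely), computed by exact enumeration."""
    joint = {}
    for b, p_heads in ((0, 0.5 + eps), (1, 0.5 - eps)):
        for tosses in itertools.product([0, 1], repeat=k):
            pr = 0.5
            for t in tosses:
                pr *= p_heads if t == 0 else 1 - p_heads
            joint[(b,) + tosses] = pr
    toss_marginals = {}
    for outcome, p in joint.items():
        toss_marginals[outcome[1:]] = toss_marginals.get(outcome[1:], 0) + p
    # I(B; T) = H(B) + H(T) - H(BT), with H(B) = 1
    return 1.0 + H(list(toss_marginals.values())) - H(list(joint.values()))

eps = 0.1
i1, i2 = mutual_info_bias(eps, 1), mutual_info_bias(eps, 2)
print(i2 <= 2 * i1 + 1e-12)   # True: matches the slide's bound
```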
Kullback–Leibler (KL)-Divergence
• A measure of distance between distributions
on the same space (not an actual metric: it is
asymmetric).
• Plays a key role in information theory.
𝐷(𝑃 βˆ₯ 𝑄) ≔ ∑π‘₯ 𝑃[π‘₯] log(𝑃[π‘₯]/𝑄[π‘₯]).
• 𝐷(𝑃 βˆ₯ 𝑄) ≥ 0, with equality when 𝑃 = 𝑄.
• Caution: 𝐷(𝑃 βˆ₯ 𝑄) ≠ 𝐷(𝑄 βˆ₯ 𝑃) in general!
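A minimal sketch of the definition, illustrating the asymmetry (the example distributions are chosen arbitrarily):

```python
import math

def kl(P, Q):
    """D(P || Q) in bits; P, Q are {outcome: prob} over the same space."""
    return sum(p * math.log2(p / Q[x]) for x, p in P.items() if p > 0)

P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.9, "b": 0.1}
print(kl(P, Q), kl(Q, P))   # two different values: D is asymmetric
print(kl(P, P))             # 0.0
```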
Properties of KL-divergence
• Connection to mutual information:
𝐼(𝑋; π‘Œ) = 𝐸𝑦∼π‘Œ [𝐷(π‘‹π‘Œ=𝑦 βˆ₯ 𝑋)].
• If 𝑋 ⊥ π‘Œ, then π‘‹π‘Œ=𝑦 = 𝑋, and both sides are 0.
• Pinsker’s inequality:
β€–π‘ƒ − 𝑄‖1 = 𝑂(√𝐷(𝑃 βˆ₯ 𝑄)).
• Tight!
𝐷(𝐡1/2+πœ€ βˆ₯ 𝐡1/2) = Θ(πœ€Β²).
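The Θ(πœ€Β²) behavior is easy to observe numerically; a sketch (computed in bits, so the constant is 2/ln 2 ≈ 2.885):

```python
import math

def kl_bits(p, q):
    """D(B_p || B_q) in bits, for Bernoulli distributions with parameters p, q."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

for eps in (0.1, 0.01, 0.001):
    print(eps, kl_bits(0.5 + eps, 0.5) / eps**2)   # ratio tends to 2/ln 2
```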
Back to the coin example
• 𝐼(𝐡; 𝑇1) = 𝐸𝑏∼𝐡 [𝐷(𝑇1,𝐡=𝑏 βˆ₯ 𝑇1)]
= 𝐷(𝐡1/2±πœ€ βˆ₯ 𝐡1/2) = Θ(πœ€Β²).
• π‘˜ = Ω(1/𝐼(𝐡; 𝑇1)) = Ω(1/πœ€Β²).
• “Follow the information learned from the
coin tosses.”
• Can be done using combinatorics, but the
information-theoretic language is more
natural for expressing what’s going on.
Back to communication
• The reason Information Theory is so
important for communication is because
information-theoretic quantities readily
operationalize.
• Can attach operational meaning to
Shannon’s entropy: 𝐻(𝑋) ≈ “the cost of
transmitting 𝑋”.
• Let 𝐢(𝑋) be the (expected) cost of
transmitting a sample of 𝑋.
𝐻(𝑋) = 𝐢(𝑋)?
• Not quite.
• Let trit 𝑇 ∈π‘ˆ {1,2,3}.
• The code 1 → 0, 2 → 10, 3 → 11 gives
𝐢(𝑇) = 5/3 ≈ 1.67.
• 𝐻(𝑇) = log 3 ≈ 1.58.
• It is always the case that 𝐢(𝑋) ≥ 𝐻(𝑋).
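The trit numbers can be reproduced directly from the code on the slide:

```python
import math

# The prefix code from the slide: 1 -> 0, 2 -> 10, 3 -> 11.
code = {1: "0", 2: "10", 3: "11"}
C = sum(len(w) for w in code.values()) / 3   # expected length, T uniform on {1,2,3}
H = math.log2(3)
print(C, H)   # 5/3 = 1.666... vs log 3 = 1.584...
```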
But 𝐻(𝑋) and 𝐢(𝑋) are close
• Huffman’s coding: 𝐢(𝑋) ≤ 𝐻(𝑋) + 1.
• This is a compression result: “an
uninformative message turned into a short
one”.
• Therefore: 𝐻(𝑋) ≤ 𝐢(𝑋) ≤ 𝐻(𝑋) + 1.
Shannon’s noiseless coding
• The cost of communicating many copies of 𝑋
scales as 𝐻(𝑋).
• Shannon’s source coding theorem:
– Let 𝐢(𝑋^𝑛) be the cost of transmitting 𝑛
independent copies of 𝑋. Then the
amortized transmission cost
lim𝑛→∞ 𝐢(𝑋^𝑛)/𝑛 = 𝐻(𝑋).
• This equation gives 𝐻(𝑋) operational
meaning.
𝐻(𝑋) operationalized
[Diagram: 𝑋1, … , 𝑋𝑛, … sent over a communication channel at 𝐻(𝑋) bits per copy to transmit the 𝑋’s.]
𝐻(𝑋) is nicer than 𝐢(𝑋)
• 𝐻(𝑋) is additive for independent variables.
• Let 𝑇1, 𝑇2 ∈π‘ˆ {1,2,3} be independent trits.
• 𝐻(𝑇1𝑇2) = log 9 = 2 log 3.
• 𝐢(𝑇1𝑇2) = 29/9 < 𝐢(𝑇1) + 𝐢(𝑇2) = 2 × 5/3 = 30/9.
• Works well with concepts such as channel
capacity.
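Both 𝐢(𝑇) = 5/3 and 𝐢(𝑇1𝑇2) = 29/9 can be recovered by running an actual Huffman construction; a sketch (a standard textbook Huffman, not code from the slides):

```python
import heapq
from itertools import count

def huffman_lengths(probs):
    """Codeword lengths of an optimal (Huffman) prefix code for probs."""
    tie = count()                      # tie-breaker: never compare the lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:              # merging deepens every leaf below
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
    return lengths

cost = lambda probs: sum(p * l for p, l in zip(probs, huffman_lengths(probs)))
print(cost([1 / 3] * 3))   # 5/3: one trit
print(cost([1 / 9] * 9))   # 29/9: a pair of trits, cheaper than 2 * 5/3
```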
“Proof” of Shannon’s noiseless
coding
• 𝑛 ⋅ 𝐻(𝑋) = 𝐻(𝑋^𝑛) ≤ 𝐢(𝑋^𝑛) ≤ 𝐻(𝑋^𝑛) + 1.
(The equality is additivity of entropy; the last
inequality is compression, e.g. Huffman.)
• Therefore lim𝑛→∞ 𝐢(𝑋^𝑛)/𝑛 = 𝐻(𝑋).
Operationalizing other quantities
• Conditional entropy 𝐻(𝑋|π‘Œ)
(cf. Slepian–Wolf theorem).
[Diagram: 𝑋1, … , 𝑋𝑛, … transmitted at 𝐻(𝑋|π‘Œ) bits per copy over a channel to a receiver who already holds π‘Œ1, … , π‘Œπ‘›, … .]
Operationalizing other quantities
• Mutual information 𝐼(𝑋; π‘Œ)
[Diagram: a party holding 𝑋1, … , 𝑋𝑛, … lets the other party sample the π‘Œ’s at 𝐼(𝑋; π‘Œ) bits of communication per copy.]
Information theory and entropy
• Allows us to formalize intuitive notions.
• Operationalized in the context of one-way
transmission and related problems.
• Has nice properties (additivity, chain rule…)
• Next, we discuss extensions to more
interesting communication scenarios.
Communication complexity
• Focus on the two-party randomized setting.
[Diagram: Alice holds X, Bob holds Y; shared
randomness R; A & B implement a functionality
𝐹(𝑋, π‘Œ), e.g. 𝐹(𝑋, π‘Œ) = “𝑋 = π‘Œ?”, and output F(X,Y).]
Communication complexity
Goal: implement a functionality 𝐹(𝑋, π‘Œ).
A protocol πœ‹(𝑋, π‘Œ) computing 𝐹(𝑋, π‘Œ):
[Diagram: with shared randomness R, Alice (holding X) and Bob (holding Y) alternate messages m1(X,R), m2(Y,m1,R), m3(X,m1,m2,R), … and output F(X,Y).]
Communication cost = # of bits exchanged.
Communication complexity
• Numerous applications/potential
applications (some will be discussed later
today).
• Considerably more difficult to obtain lower
bounds for than transmission (though still much
easier than other models of computation!).
Communication complexity
• (Distributional) communication complexity
with input distribution πœ‡ and error πœ€:
𝐢𝐢(𝐹, πœ‡, πœ€). Error ≤ πœ€ w.r.t. πœ‡.
• (Randomized/worst-case) communication
complexity: 𝐢𝐢(𝐹, πœ€). Error ≤ πœ€ on all inputs.
• Yao’s minimax:
𝐢𝐢(𝐹, πœ€) = maxπœ‡ 𝐢𝐢(𝐹, πœ‡, πœ€).
Examples
• 𝑋, π‘Œ ∈ 0,1 𝑛 .
• Equality 𝐸𝑄 𝑋, π‘Œ ≔ 1𝑋=π‘Œ .
• 𝐢𝐢 𝐸𝑄, πœ€ ≈
1
log .
πœ€
• 𝐢𝐢 𝐸𝑄, 0 ≈ 𝑛.
28
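One way to see the log(1/πœ€) upper bound for Equality: compare π‘˜ ≈ log(1/πœ€) shared random parities of 𝑋 and π‘Œ. A sketch (this particular hashing scheme is my illustration, not a protocol from the slides):

```python
import random

def eq_protocol(x, y, k, rng):
    """Compare k shared random parities of x and y (given as n-bit integers).
    Never errs when x == y; errs w.p. 2^-k when x != y."""
    n = max(x.bit_length(), y.bit_length(), 1)
    for _ in range(k):
        r = rng.getrandbits(n)                 # shared random subset of bits
        if bin(r & x).count("1") % 2 != bin(r & y).count("1") % 2:
            return 0                           # found a disagreeing parity
    return 1                                   # declare "equal"

rng = random.Random(0)
k = 7                                          # target error about 2^-7
errors = sum(eq_protocol(12345, 54321, k, rng) for _ in range(10000))
print(errors / 10000)                          # close to 2^-7 ~ 0.008
```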
Equality
•πΉβ‘is “𝑋 = π‘Œ? ”.
•πœ‡β‘is a distribution where w.p. ½β‘𝑋 = π‘Œ and w.p.
½ (𝑋, π‘Œ) are random.
Y
X
MD5(X) [128 bits]
X=Y? [1 bit]
A
Error?
• Shows that 𝐢𝐢 𝐸𝑄, πœ‡, 2−129 ≤ 129.⁑
B
Examples
• 𝑋, π‘Œ ∈ 0,1 𝑛 .
• Inner⁑product⁑𝐼𝑃 𝑋, π‘Œ ≔ ∑𝑖 𝑋𝑖 ⋅ π‘Œπ‘– ⁑(π‘šπ‘œπ‘‘β‘2).
• 𝐢𝐢 𝐼𝑃, 0 = 𝑛 − π‘œ(𝑛).
In fact, using information complexity:
• 𝐢𝐢 𝐼𝑃, πœ€ = 𝑛 − π‘œπœ€ (𝑛).
30
Information complexity
• Information complexity 𝐼𝐢(𝐹, πœ€) is to
communication complexity 𝐢𝐢(𝐹, πœ€)
as Shannon’s entropy 𝐻(𝑋) is to
transmission cost 𝐢(𝑋).
Information complexity
• The smallest amount of information Alice
and Bob need to exchange to solve 𝐹.
• How is information measured?
• Communication cost of a protocol?
– Number of bits exchanged.
• Information cost of a protocol?
– Amount of information revealed.
Basic definition 1: The
information cost of a protocol
• Prior distribution: (𝑋, π‘Œ) ∼ πœ‡.
[Diagram: Alice holds X, Bob holds Y; protocol πœ‹ produces transcript Π.]
𝐼𝐢(πœ‹, πœ‡) = 𝐼(Π; π‘Œ|𝑋) + 𝐼(Π; 𝑋|π‘Œ)
= what Alice learns about Y + what Bob learns about X
Example
• 𝐹 is “𝑋 = π‘Œ?”.
• πœ‡ is a distribution where w.p. ½ 𝑋 = π‘Œ and w.p.
½ (𝑋, π‘Œ) are random.
[Protocol: Alice sends MD5(X) (128 bits); Bob replies with X=Y? (1 bit).]
𝐼𝐢(πœ‹, πœ‡) = 𝐼(Π; π‘Œ|𝑋) + 𝐼(Π; 𝑋|π‘Œ) ≈ 1 + 64.5 = 65.5 bits
= what Alice learns about Y + what Bob learns about X
Prior πœ‡ matters a lot for
information cost!
• If πœ‡ = 1
π‘₯,𝑦
a singleton,
𝐼𝐢 πœ‹, πœ‡ = 0.
35
Example
• 𝐹 is “𝑋 = π‘Œ?”.
• πœ‡ is a distribution where (𝑋, π‘Œ) are just
uniformly random.
[Protocol: Alice sends MD5(X) (128 bits); Bob replies with X=Y? (1 bit).]
𝐼𝐢(πœ‹, πœ‡) = 𝐼(Π; π‘Œ|𝑋) + 𝐼(Π; 𝑋|π‘Œ) ≈ 0 + 128 = 128 bits
= what Alice learns about Y + what Bob learns about X
Basic definition 2: Information
complexity
• Communication complexity:
𝐢𝐢(𝐹, πœ‡, πœ€) ≔ min over protocols πœ‹ computing 𝐹
with error ≤ πœ€ of the communication cost of πœ‹.
• Analogously:
𝐼𝐢(𝐹, πœ‡, πœ€) ≔ inf over protocols πœ‹ computing 𝐹
with error ≤ πœ€ of 𝐼𝐢(πœ‹, πœ‡).
• (The inf is needed here: the infimum may not
be attained.)
Prior-free information complexity
• Using minimax we can get rid of the prior.
• For communication, we had:
𝐢𝐢(𝐹, πœ€) = maxπœ‡ 𝐢𝐢(𝐹, πœ‡, πœ€).
• For information:
𝐼𝐢(𝐹, πœ€) ≔ inf over protocols πœ‹ computing 𝐹
with error ≤ πœ€ of maxπœ‡ 𝐼𝐢(πœ‹, πœ‡).
Ex: The information complexity
of Equality
• What is 𝐼𝐢(𝐸𝑄, 0)?
• Consider the following protocol, based on a
non-singular matrix 𝐴 ∈ 𝒁^{𝒏×𝒏} with rows 𝐴1, 𝐴2, …:
[Diagram: Alice (X in {0,1}^n) sends A1·X; Bob (Y in {0,1}^n) replies A1·Y; Alice sends A2·X; Bob replies A2·Y; … Continue for n steps, or until a disagreement is discovered.]
Analysis (sketch)
• If X≠Y, the protocol will terminate in O(1)
rounds on average, and thus reveal O(1)
information.
• If X=Y… the players only learn the fact that
X=Y (≤1 bit of information).
• Thus the protocol has O(1) information
complexity for any prior πœ‡.
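The O(1) expected number of rounds can be checked by simulation; a sketch that uses fresh random rows in place of the fixed non-singular matrix 𝐴 (an assumption made to keep the code short):

```python
import random

def rounds_until_disagreement(x, y, rng):
    """Rounds of random-linear-form exchange before x != y is detected.
    (Fresh random rows stand in for the fixed non-singular matrix A.)"""
    n = max(x.bit_length(), y.bit_length())
    d = x ^ y                                 # nonzero since x != y
    r = 0
    while True:
        r += 1
        row = rng.getrandbits(n)
        if bin(row & d).count("1") % 2 == 1:  # A_i*x != A_i*y (mod 2)
            return r

rng = random.Random(1)
avg = sum(rounds_until_disagreement(12345, 54321, rng)
          for _ in range(10000)) / 10000
print(avg)   # about 2: each round detects the difference w.p. 1/2
```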
Operationalizing IC: Information
equals amortized communication
• Recall [Shannon]: lim𝑛→∞ 𝐢(𝑋^𝑛)/𝑛 = 𝐻(𝑋).
• Turns out: lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€)/𝑛 = 𝐼𝐢(𝐹, πœ‡, πœ€)
for πœ€ > 0. [Error πœ€ allowed on each copy.]
• For πœ€ = 0: lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, 0⁺)/𝑛 = 𝐼𝐢(𝐹, πœ‡, 0).
• [lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, 0)/𝑛 is an interesting open
problem.]
Information = amortized
communication
• lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€)/𝑛 = 𝐼𝐢(𝐹, πœ‡, πœ€).
• Two directions: “≤” and “≥”.
• Compare: 𝑛 ⋅ 𝐻(𝑋) = 𝐻(𝑋^𝑛) ≤ 𝐢(𝑋^𝑛) ≤ 𝐻(𝑋^𝑛) + 1
(additivity of entropy; compression, e.g. Huffman).
The “≤” direction
• lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€)/𝑛 ≤ 𝐼𝐢(𝐹, πœ‡, πœ€).
• Start with a protocol πœ‹ solving 𝐹, whose
𝐼𝐢(πœ‹, πœ‡) is close to 𝐼𝐢(𝐹, πœ‡, πœ€).
• Show how to compress many copies of πœ‹
into a protocol whose communication cost
is close to its information cost.
• More on compression later.
The “≥” direction
• lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€)/𝑛 ≥ 𝐼𝐢(𝐹, πœ‡, πœ€).
• Use the fact that
𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€)/𝑛 ≥ 𝐼𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€)/𝑛.
• Additivity of information complexity:
𝐼𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€)/𝑛 = 𝐼𝐢(𝐹, πœ‡, πœ€).
Proof: Additivity of information
complexity
• Let 𝑇1(𝑋1, π‘Œ1) and 𝑇2(𝑋2, π‘Œ2) be two two-party
tasks.
• E.g. “Solve 𝐹(𝑋, π‘Œ) with error ≤ πœ€ w.r.t. πœ‡”.
• Then
𝐼𝐢(𝑇1 × π‘‡2, πœ‡1 × πœ‡2) = 𝐼𝐢(𝑇1, πœ‡1) + 𝐼𝐢(𝑇2, πœ‡2).
• “≤” is easy.
• “≥” is the interesting direction:
𝐼𝐢(𝑇1, πœ‡1) + 𝐼𝐢(𝑇2, πœ‡2) ≤ 𝐼𝐢(𝑇1 × π‘‡2, πœ‡1 × πœ‡2).
• Start from a protocol πœ‹ for 𝑇1 × π‘‡2 with
prior πœ‡1 × πœ‡2, whose information cost is 𝐼.
• Show how to construct two protocols πœ‹1
for 𝑇1 with prior πœ‡1 and πœ‹2 for 𝑇2 with prior
πœ‡2, with information costs 𝐼1 and 𝐼2,
respectively, such that 𝐼1 + 𝐼2 = 𝐼.
πœ‹( 𝑋1 , 𝑋2 , π‘Œ1 , π‘Œ2 )
πœ‹1 𝑋1 , π‘Œ1
• Publicly sample 𝑋2 ∼ πœ‡2
• Bob privately samples
π‘Œ2 ∼ πœ‡2 |X2
• Run πœ‹( 𝑋1 , 𝑋2 , π‘Œ1 , π‘Œ2 )
πœ‹2 𝑋2 , π‘Œ2
• Publicly sample π‘Œ1 ∼ πœ‡1
• Alice privately samples
𝑋1 ∼ πœ‡1 |Y1
• Run πœ‹( 𝑋1 , 𝑋2 , π‘Œ1 , π‘Œ2 )
Analysis - πœ‹1
πœ‹1 𝑋1 , π‘Œ1
• Publicly sample 𝑋2 ∼ πœ‡2
• Bob privately samples
π‘Œ2 ∼ πœ‡2 |X2
• Run πœ‹( 𝑋1 , 𝑋2 , π‘Œ1 , π‘Œ2 )
• Alice learns about π‘Œ1 :
𝐼 Π; π‘Œ1 X1 X2 )
• Bob learns about 𝑋1 :
𝐼 Π; 𝑋1 π‘Œ1 π‘Œ2 𝑋2 ).
• 𝐼1 = 𝐼 Π; π‘Œ1 X1 X2 ) + 𝐼 Π; 𝑋1 π‘Œ1 π‘Œ2 𝑋2 ).
48
Analysis - πœ‹2
πœ‹2 𝑋2 , π‘Œ2
• Publicly sample π‘Œ1 ∼ πœ‡1
• Alice privately samples
𝑋1 ∼ πœ‡1 |Y1
• Run πœ‹( 𝑋1 , 𝑋2 , π‘Œ1 , π‘Œ2 )
• Alice learns about π‘Œ2 :
𝐼 Π; π‘Œ2 X1 X2 Y1 )
• Bob learns about 𝑋2 :
𝐼 Π; 𝑋2 π‘Œ1 π‘Œ2 ).
• 𝐼2 = 𝐼 Π; π‘Œ2 X1 X2 Y1 ) + 𝐼 Π; 𝑋2 π‘Œ1 π‘Œ2 ).
49
Adding 𝐼1 and 𝐼2
𝐼1 + 𝐼2
= 𝐼(Π; π‘Œ1|𝑋1𝑋2) + 𝐼(Π; 𝑋1|π‘Œ1π‘Œ2𝑋2)
+ 𝐼(Π; π‘Œ2|𝑋1𝑋2π‘Œ1) + 𝐼(Π; 𝑋2|π‘Œ1π‘Œ2)
= [𝐼(Π; π‘Œ1|𝑋1𝑋2) + 𝐼(Π; π‘Œ2|𝑋1𝑋2π‘Œ1)]
+ [𝐼(Π; 𝑋2|π‘Œ1π‘Œ2) + 𝐼(Π; 𝑋1|π‘Œ1π‘Œ2𝑋2)]
= 𝐼(Π; π‘Œ1π‘Œ2|𝑋1𝑋2) + 𝐼(Π; 𝑋1𝑋2|π‘Œ1π‘Œ2) = 𝐼,
by the chain rule for mutual information.
Summary
• Information complexity is additive.
• Operationalized via “Information =
amortized communication”.
• lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€)/𝑛 = 𝐼𝐢(𝐹, πœ‡, πœ€).
• Seems to be the “right” analogue of
entropy for interactive computation.
Entropy vs. Information Complexity

                 Entropy                    IC
Additive?        Yes                        Yes
Operationalized  lim𝑛→∞ 𝐢(𝑋^𝑛)/𝑛            lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€)/𝑛
Compression?     Huffman: 𝐢(𝑋) ≤ 𝐻(𝑋) + 1   ???!
Can interactive communication
be compressed?
• Is it true that 𝐢𝐢(𝐹, πœ‡, πœ€) ≤ 𝐼𝐢(𝐹, πœ‡, πœ€) + 𝑂(1)?
• Less ambitiously:
𝐢𝐢(𝐹, πœ‡, 𝑂(πœ€)) = 𝑂(𝐼𝐢(𝐹, πœ‡, πœ€))?
• (Almost) equivalently: given a protocol πœ‹ with
𝐼𝐢(πœ‹, πœ‡) = 𝐼, can Alice and Bob simulate πœ‹ using
𝑂(𝐼) communication?
• Not known in general…
Direct sum theorems
• Let 𝐹 be any functionality.
• Let 𝐢(𝐹) be the cost of implementing 𝐹.
• Let 𝐹^𝑛 be the functionality of implementing 𝑛
independent copies of 𝐹.
• The direct sum problem:
“Does 𝐢(𝐹^𝑛) ≈ 𝑛 βˆ™ 𝐢(𝐹)?”
• In most cases it is obvious that 𝐢(𝐹^𝑛) ≤ 𝑛 βˆ™ 𝐢(𝐹).
Direct sum – randomized
communication complexity
• Is it true that
𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€) = Ω(𝑛 βˆ™ 𝐢𝐢(𝐹, πœ‡, πœ€))?
• Is it true that 𝐢𝐢(𝐹^𝑛, πœ€) = Ω(𝑛 βˆ™ 𝐢𝐢(𝐹, πœ€))?
Direct product – randomized
communication complexity
• Direct sum:
𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€) = Ω(𝑛 βˆ™ 𝐢𝐢(𝐹, πœ‡, πœ€))?
• Direct product (error allowed to grow so that the
success probability is only (1 − πœ€)^𝑛):
𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, 1 − (1 − πœ€)^𝑛) = Ω(𝑛 βˆ™ 𝐢𝐢(𝐹, πœ‡, πœ€))?
Direct sum for randomized CC
and interactive compression
Direct sum:
• 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€) = Ω(𝑛 βˆ™ 𝐢𝐢(𝐹, πœ‡, πœ€))?
In the limit:
• 𝑛 ⋅ 𝐼𝐢(𝐹, πœ‡, πœ€) = Ω(𝑛 βˆ™ 𝐢𝐢(𝐹, πœ‡, πœ€))?
Interactive compression:
• 𝐢𝐢(𝐹, πœ‡, πœ€) = 𝑂(𝐼𝐢(𝐹, πœ‡, πœ€))?
Same question!
The big picture
𝐼𝐢(𝐹, πœ‡, πœ€)
additivity
(=direct sum)
for information
interactive
compression?
𝐢𝐢(𝐹, πœ‡, πœ€)
direct sum for
communication?
𝐼𝐢(𝐹𝑛, πœ‡π‘› , πœ€)/𝑛
information =
amortized
communication
𝐢𝐢(𝐹𝑛, πœ‡π‘› , πœ€)/𝑛
Current results for compression
A protocol πœ‹ that has 𝐢 bits of
communication, conveys 𝐼 bits of information
over prior πœ‡, and works in π‘Ÿ rounds can be
simulated:
• Using 𝑂(𝐼 + π‘Ÿ) bits of communication.
• Using 𝑂(√(𝐼 ⋅ 𝐢)) bits of communication.
• Using 2^{𝑂(𝐼)} bits of communication.
• If πœ‡ = πœ‡π‘‹ × πœ‡π‘Œ, then using 𝑂(𝐼 polylog 𝐢)
bits of communication.
Their direct sum counterparts
• 𝐢𝐢(𝐹𝑛, πœ‡π‘›, πœ€) ⁑ = ⁑ Ω(𝑛1/2 βˆ™ 𝐢𝐢(𝐹, πœ‡, πœ€)).
• 𝐢𝐢(𝐹𝑛, πœ€) ⁑ = ⁑ Ω(𝑛1/2 βˆ™ 𝐢𝐢(𝐹, πœ€)).
For product distributions πœ‡ = πœ‡π‘‹ × πœ‡π‘Œ ,
• 𝐢𝐢(𝐹𝑛, πœ‡π‘›, πœ€) ⁑ = ⁑ Ω(𝑛 βˆ™ 𝐢𝐢(𝐹, πœ‡, πœ€)).
When the number of rounds is bounded by
π‘Ÿ β‰ͺ 𝑛, a direct sum theorem holds.
60
Direct product
• The best one can hope for is a statement of the
type:
𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, 1 − 2^{−𝑂(𝑛)}) = Ω(𝑛 βˆ™ 𝐼𝐢(𝐹, πœ‡, 1/3)).
• Can prove:
𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, 1 − 2^{−𝑂(𝑛)}) = Ω(𝑛^{1/2} βˆ™ 𝐢𝐢(𝐹, πœ‡, 1/3)).
Proof 2: Compressing a one-round
protocol
• Say Alice speaks: 𝐼𝐢(πœ‹, πœ‡) = 𝐼(𝑀; 𝑋|π‘Œ).
• Recall KL-divergence:
𝐼(𝑀; 𝑋|π‘Œ) = πΈπ‘Œ [𝐷(π‘€π‘‹π‘Œ βˆ₯ π‘€π‘Œ)] = πΈπ‘Œ [𝐷(𝑀𝑋 βˆ₯ π‘€π‘Œ)].
• Bottom line:
– Alice has 𝑀𝑋; Bob has π‘€π‘Œ;
– Goal: sample from 𝑀𝑋 using ∼ 𝐷(𝑀𝑋 βˆ₯ π‘€π‘Œ)
communication.
The dart board
• Interpret the public randomness as random
points (𝑒, π‘ž) in π‘ˆ × [0,1], where π‘ˆ is the
universe of all possible messages.
• The first point under the histogram of 𝑀 is
distributed ∼ 𝑀.
[Figure: darts (𝑒1, π‘ž1), (𝑒2, π‘ž2), … thrown at π‘ˆ × [0,1]; the ones landing below the histogram of 𝑀 are the accepted samples.]
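The dart-board claim — the first public point under the histogram of 𝑀 is distributed ∼ 𝑀 — can be checked by simulation; a sketch (names and the example distribution are mine):

```python
import random
from collections import Counter

def first_dart_under(M, universe, rng):
    """Scan public darts (u, q) in U x [0,1]; return the first one that
    falls under the histogram of M. Its distribution is exactly M."""
    while True:
        u = rng.choice(universe)   # public random message
        q = rng.random()           # public random height in [0,1]
        if q < M[u]:
            return u

M = {"u1": 0.5, "u2": 0.25, "u3": 0.125, "u4": 0.125}
universe = sorted(M)
rng = random.Random(0)
counts = Counter(first_dart_under(M, universe, rng) for _ in range(20000))
for u in universe:
    print(u, counts[u] / 20000)    # empirical frequency close to M[u]
```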
Proof Idea
• Sample using 𝑂(log 1/πœ€ + 𝐷(𝑀𝑋 βˆ₯ π‘€π‘Œ))
communication with statistical error πœ€.
[Figure: the same public darts, roughly |π‘ˆ| samples, viewed against 𝑀𝑋 (Alice’s histogram) and π‘€π‘Œ (Bob’s histogram); Alice locates the first dart under 𝑀𝑋.]
Proof Idea
[Figure: Alice sends hashes h1(𝑒4), h2(𝑒4) of her sample 𝑒4; Bob checks them against the candidates under π‘€π‘Œ.]
Proof Idea
[Figure: if no candidate matches, Bob doubles his histogram (2π‘€π‘Œ) and Alice sends further hashes h3(𝑒4), h4(𝑒4), …, h_{log 1/πœ€}(𝑒4).]
Analysis
• If 𝑀𝑋(𝑒4) ≈ 2^π‘˜ π‘€π‘Œ(𝑒4), then the protocol
will reach round π‘˜ of doubling.
• There will be ≈ 2^π‘˜ candidates.
• About π‘˜ + log 1/πœ€ hashes to narrow down to one.
• The contribution of 𝑒4 to the cost:
𝑀𝑋(𝑒4) (log(𝑀𝑋(𝑒4)/π‘€π‘Œ(𝑒4)) + log 1/πœ€).
• Summing over 𝑒:
𝐷(𝑀𝑋 βˆ₯ π‘€π‘Œ) ≔ ∑𝑒 𝑀𝑋(𝑒) log(𝑀𝑋(𝑒)/π‘€π‘Œ(𝑒)).
Done!
External information cost
• (𝑋, π‘Œ)⁑~β‘πœ‡.
X
C
Y
Protocol
Protocol π
transcript π
B
A
𝐼𝐢𝑒π‘₯𝑑(πœ‹, πœ‡) ⁑ = ⁑𝐼(Π; π‘‹π‘Œ)⁑
what Charlie learns about (X,Y)
Example
• F is “X=Y?”.
• μ is a distribution where w.p. ½ X=Y and w.p. ½
(X,Y) are random.
[Protocol: Alice sends MD5(X); Bob replies with X=Y?]
𝐼𝐢𝑒π‘₯𝑑(πœ‹, πœ‡) = 𝐼(Π; π‘‹π‘Œ) = 129 bits
= what Charlie learns about (X,Y)
External information cost
• It is always the case that
𝐼𝐢𝑒π‘₯𝑑(πœ‹, πœ‡) ≥ 𝐼𝐢(πœ‹, πœ‡).
• If πœ‡ = πœ‡π‘‹ × πœ‡π‘Œ is a product distribution, then
𝐼𝐢𝑒π‘₯𝑑(πœ‹, πœ‡) = 𝐼𝐢(πœ‹, πœ‡).
External information complexity
• 𝐼𝐢𝑒π‘₯𝑑 𝐹, πœ‡, πœ€ ≔
inf
πœ‹β‘π‘π‘œπ‘šπ‘π‘’π‘‘π‘’π‘ 
πΉβ‘π‘€π‘–π‘‘β„Žβ‘π‘’π‘Ÿπ‘Ÿπ‘œπ‘Ÿβ‘≤πœ€
𝐼𝐢𝑒π‘₯𝑑 (πœ‹, πœ‡).
• Can it be operationalized?
71
Operational meaning of 𝐼𝐢𝑒π‘₯𝑑 ?
• Conjecture: zero-error communication
scales like external information:
lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, 0)/𝑛 = 𝐼𝐢𝑒π‘₯𝑑(𝐹, πœ‡, 0)?
• Recall:
lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, 0⁺)/𝑛 = 𝐼𝐢(𝐹, πœ‡, 0).
Example – transmission with a
strong prior
• 𝑋, π‘Œ ∈ 0,1
• πœ‡ is such that 𝑋 ∈π‘ˆ 0,1 , and 𝑋 = π‘Œ with a
very high probability (say 1 − 1/ 𝑛).
• 𝐹 𝑋, π‘Œ = 𝑋 is just the “transmit 𝑋” function.
• Clearly, πœ‹ should just have Alice send 𝑋 to Bob.
• 𝐼𝐢 𝐹, πœ‡, 0 = 𝐼𝐢 πœ‹, πœ‡ = 𝐻
1
𝑛
= o(1).
• 𝐼𝐢𝑒π‘₯𝑑 𝐹, πœ‡, 0 = 𝐼𝐢𝑒π‘₯𝑑 πœ‹, πœ‡ = 1.
73
Example – transmission with a
strong prior
• 𝐼𝐢 𝐹, πœ‡, 0 = 𝐼𝐢 πœ‹, πœ‡ = 𝐻
1
𝑛
= o(1).
• 𝐼𝐢𝑒π‘₯𝑑 𝐹, πœ‡, 0 = 𝐼𝐢𝑒π‘₯𝑑 πœ‹, πœ‡ = 1.
• 𝐢𝐢 𝐹 𝑛 , πœ‡π‘› , 0+ = π‘œ 𝑛 .
• 𝐢𝐢 𝐹 𝑛 , πœ‡π‘› , 0 = Ω(𝑛).
Other examples, e.g. the two-bit AND function fit
into this picture.
74
Additional directions
• Information complexity
• Interactive coding
• Information theory in TCS
Interactive coding theory
• So far we have focused the discussion on
noiseless coding.
• What if the channel has noise?
• [What kind of noise?]
• In the non-interactive case, each channel
has a capacity 𝐢.
Channel capacity
• The amortized number of channel uses
needed to send 𝑋 over a noisy channel of
capacity 𝐢 is 𝐻(𝑋)/𝐢.
• Decouples the task from the channel!
Example: Binary Symmetric
Channel
• Each bit gets independently flipped with
probability πœ€ < 1/2.
• One-way capacity: 1 − 𝐻(πœ€).
[Channel diagram: 0 → 0 and 1 → 1 w.p. 1 − πœ€;
0 → 1 and 1 → 0 w.p. πœ€.]
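The one-way capacity formula is easy to evaluate; a sketch:

```python
import math

def bsc_capacity(eps):
    """One-way capacity 1 - H(eps) of the binary symmetric channel."""
    if eps in (0.0, 1.0):
        return 1.0
    h = -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)
    return 1 - h

print(bsc_capacity(0.0))    # noiseless channel: 1.0 bit per use
print(bsc_capacity(0.5))    # pure noise: 0.0
print(bsc_capacity(0.11))   # about 0.5
```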
Interactive channel capacity
• Not clear one can decouple channel from task
in such a clean way.
• Capacity much harder to calculate/reason
about.
• Example: the binary symmetric channel, with
one-way capacity 1 − 𝐻(πœ€).
• Interactive capacity (for simple pointer jumping
[Kol-Raz’13]): 1 − Θ(√𝐻(πœ€)).
Information theory in communication
complexity and beyond
• A natural extension would be to multi-party
communication complexity.
• Some success in the number-in-hand case.
• What about the number-on-forehead case?
• Explicit bounds for ≥ log 𝑛 players would
imply explicit 𝐴𝐢𝐢^0 circuit lower bounds.
Naïve multi-party information cost
[Diagram: three players 𝐴, 𝐡, 𝐢; each sees two of the three inputs 𝑋, π‘Œ, 𝑍.]
𝐼𝐢(πœ‹, πœ‡) = 𝐼(Π; 𝑋|π‘Œπ‘) + 𝐼(Π; π‘Œ|𝑋𝑍) + 𝐼(Π; 𝑍|π‘‹π‘Œ)
Naïve multi-party information cost
𝐼𝐢 πœ‹, πœ‡ = 𝐼(Π; 𝑋|π‘Œπ‘)+ 𝐼(Π; π‘Œ|𝑋𝑍)+ 𝐼(Π; 𝑍|π‘‹π‘Œ)
• Doesn’t seem to work.
• Secure multi-party computation [BenOr,Goldwasser, Wigderson], means that
anything can be computed at near-zero
information cost.
• Although, these construction require the
players to share private channels/randomness.
82
Communication and beyond…
• The rest of today:
– Data structures;
– Streaming;
– Distributed computing;
– Privacy.
• Exact communication complexity bounds.
• Extended formulations lower bounds.
• Parallel repetition?
• …
Thank You!
Open problem: Computability of IC
• Given the truth table of 𝐹(𝑋, π‘Œ), πœ‡ and πœ€,
compute 𝐼𝐢(𝐹, πœ‡, πœ€).
• Via 𝐼𝐢(𝐹, πœ‡, πœ€) = lim𝑛→∞ 𝐢𝐢(𝐹^𝑛, πœ‡^𝑛, πœ€)/𝑛 one
can compute a sequence of upper bounds.
• But the rate of convergence as a function
of 𝑛 is unknown.
Open problem: Computability of IC
• Can compute the π‘Ÿ-round information
complexity πΌπΆπ‘Ÿ(𝐹, πœ‡, πœ€) of 𝐹.
• But the rate of convergence as a function
of π‘Ÿ is unknown.
• Conjecture:
πΌπΆπ‘Ÿ(𝐹, πœ‡, πœ€) − 𝐼𝐢(𝐹, πœ‡, πœ€) = 𝑂𝐹,πœ‡,πœ€(1/π‘ŸΒ²).
• This is the relationship for the two-bit AND.