Lecture 11. Information-theoretic argument
An interesting technique for proving lower bounds is the information-theoretic argument. Since its introduction in [CSWY01], information complexity has been studied and applied to prove strong lower bounds on communication complexity. In this lecture, we will first introduce/review some basic notions such as entropy and mutual information, and then show how to use them to prove a linear lower bound on the randomized communication complexity of the Disjointness function [BYJKS04], one of the most important functions in communication complexity.
1. Review of basic notions in information theory
Let’s first review some basic concepts in information theory. All of these can be found in standard textbooks such as [CT06] (Chapter 1). Suppose that 𝑋 is a discrete random variable with alphabet X and probability mass function 𝑝(π‘₯) = Pr[𝑋 = π‘₯]. Following the standard notation in information theory, we sometimes also use the capital letter 𝑋 to denote the distribution. By writing π‘₯ ← 𝑋, we mean to draw a sample π‘₯ from the distribution 𝑋. The entropy 𝐻(𝑋) of a discrete random variable 𝑋 is defined by
𝐻(𝑋) = − ∑_{π‘₯∈X} 𝑝(π‘₯) log₂ 𝑝(π‘₯).
(Here we use the convention that 0 log₂ 0 = 0.) One intuition is that entropy measures the amount of uncertainty of a random variable: the more entropy 𝑋 has, the more uncertain it is. In particular, if 𝐻(𝑋) = 0 then 𝑋 is fixed/deterministic. The maximum entropy of any random variable on X is log₂|X|:
0 ≤ 𝐻(𝑋) ≤ log₂|X|,
with the upper bound achieved by the uniform distribution.
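As a quick sanity check, here is a minimal Python sketch (using numpy; the helper name entropy and the example distributions are ours, purely illustrative) that computes 𝐻(𝑋) from a probability vector and illustrates the bounds 0 ≤ 𝐻(𝑋) ≤ log₂|X|.

import numpy as np

def entropy(p):
    """Shannon entropy H(X) in bits, with the convention 0*log2(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                          # drop zero-probability outcomes
    return -np.sum(nz * np.log2(nz))

p_uniform = [0.25, 0.25, 0.25, 0.25]       # uniform on an alphabet of size 4
p_skewed  = [0.9, 0.05, 0.03, 0.02]        # far from uniform
p_fixed   = [1.0, 0.0, 0.0, 0.0]           # deterministic

print(entropy(p_uniform))                  # 2.0 = log2(4), the maximum
print(entropy(p_skewed))                   # strictly between 0 and 2
print(entropy(p_fixed))                    # 0.0: a deterministic variable has no uncertainty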
Sometimes we have more than one random variable. If (𝑋, π‘Œ) is a pair of random variables on X × Y, distributed according to a joint distribution 𝑝(π‘₯, 𝑦), then the joint entropy is simply
𝐻(𝑋, π‘Œ) = − ∑_{π‘₯∈X, 𝑦∈Y} 𝑝(π‘₯, 𝑦) log₂ 𝑝(π‘₯, 𝑦).
We can also define the conditional entropy 𝐻(π‘Œ|𝑋) = 𝐸π‘₯←𝑋 [𝐻(π‘Œ|𝑋 = π‘₯)], namely the expected entropy of the conditional distribution π‘Œ|𝑋 = π‘₯. In particular, 𝐻(π‘Œ|𝑋) ≥ 0. Another basic fact is the chain rule:
𝐻(𝑋, π‘Œ) = 𝐻(𝑋) + 𝐻(π‘Œ|𝑋).
The chain rule also works with conditional entropy:
𝐻(𝑋, π‘Œ|𝑍) = 𝐻(𝑋|𝑍) + 𝐻(π‘Œ|𝑋, 𝑍).⁑
And if we have more variables, it also works:
𝐻(𝑋1 , … , 𝑋𝑛 |𝑍) = 𝐻(𝑋1 |𝑍) + 𝐻(𝑋2 |𝑋1 , 𝑍) + ⋯ + 𝐻(𝑋𝑛 |𝑋1 , … , 𝑋𝑛−1 , 𝑍)
≤ 𝐻(𝑋1 |𝑍) + 𝐻(𝑋2 |𝑍) + ⋯ + 𝐻(𝑋𝑛 |𝑍),
where the inequality is usually referred to as the subadditivity of entropy. The relative entropy
of two distributions 𝑝 and π‘ž is
𝐷(𝑝||π‘ž) = ∑ 𝑝(π‘₯) log 2
π‘₯∈X
𝑝(π‘₯)
⁑.
π‘ž(π‘₯)
For a joint distribution (𝑋, π‘Œ) ∼ 𝑝(π‘₯, 𝑦), the mutual information between 𝑋 and π‘Œ is
𝐼(𝑋; π‘Œ) = 𝐷(𝑝(π‘₯, 𝑦)||𝑝(π‘₯)𝑝(𝑦)) = 𝐻(𝑋) − 𝐻(𝑋|π‘Œ).
It is a good exercise to verify that the above equality holds. The latter expression has a clear interpretation: the mutual information is the amount of uncertainty of 𝑋 minus the uncertainty that remains once π‘Œ is known. In this way, it measures how much information π‘Œ contains about 𝑋. It is not hard to verify that it is symmetric, 𝐼(𝑋; π‘Œ) = 𝐻(𝑋) − 𝐻(𝑋|π‘Œ) = 𝐻(π‘Œ) − 𝐻(π‘Œ|𝑋), and nonnegative, 𝐼(𝑋; π‘Œ) ≥ 0. Since conditional entropy is also nonnegative, this gives 𝐼(𝑋; π‘Œ) ≤ min{𝐻(𝑋), 𝐻(π‘Œ)}.
The conditional mutual information is defined by
𝐼(𝑋; π‘Œ|𝐷) = 𝐸𝑑←𝐷 [𝐼(𝑋; π‘Œ|𝐷 = 𝑑)].
It is not hard to see that
𝐼(𝑋; π‘Œ|𝐷) ≤ min{𝐻(𝑋|𝐷), 𝐻(π‘Œ|𝐷)} ≤ log₂|X|
and
𝐼(𝑋; π‘Œ|𝐷) = 𝐻(𝑋|𝐷) − 𝐻(𝑋|π‘Œπ·).
Lemma 1.1. If 𝑋 = 𝑋1 … 𝑋𝑛 , π‘Œ = π‘Œ1 … π‘Œπ‘› , and 𝐷 = 𝐷1 … 𝐷𝑛 , and the (𝑋𝑖 , π‘Œπ‘– , 𝐷𝑖 )’s are independent, then
𝐼(𝑋; π‘Œ|𝐷) ≥ ∑_{𝑖=1}^𝑛 𝐼(𝑋𝑖 ; π‘Œ|𝐷).
Proof.
𝐼(𝑋; π‘Œ|𝐷) = 𝐻(𝑋|𝐷) − 𝐻(𝑋|π‘Œπ·)
= ∑𝑖 𝐻(𝑋𝑖 |𝐷) − 𝐻(𝑋|π‘Œπ·)    // 𝐻(𝑋|𝐷) = ∑𝑖 𝐻(𝑋𝑖 |𝐷): independence of the (𝑋𝑖 , 𝐷𝑖 )’s
≥ ∑𝑖 𝐻(𝑋𝑖 |𝐷) − ∑𝑖 𝐻(𝑋𝑖 |π‘Œπ·)    // 𝐻(𝑋|π‘Œπ·) ≤ ∑𝑖 𝐻(𝑋𝑖 |π‘Œπ·): subadditivity of entropy
= ∑_{𝑖=1}^𝑛 𝐼(𝑋𝑖 ; π‘Œ|𝐷).    β–‘
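To see the inequality in action, here is a minimal Python sketch with a toy example of ours: 𝑋1 , 𝑋2 are independent uniform bits, the conditioning variable 𝐷 is trivial (a constant), and 𝑍 = 𝑋1 ∧ 𝑋2 plays the role of π‘Œ. This mirrors how the lemma is applied below, where the second argument (a transcript) depends on all coordinates; note that the proof only uses independence of the (𝑋𝑖 , 𝐷𝑖 )’s.

import numpy as np
from itertools import product

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def mi(joint):
    """I(A;B) = H(A) + H(B) - H(A,B) from a joint table joint[a, b]."""
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# X1, X2: independent uniform bits; Z = X1 AND X2 depends on both coordinates.
p = np.zeros((2, 2, 2))                        # p[x1, x2, z]
for x1, x2 in product([0, 1], repeat=2):
    p[x1, x2, x1 & x2] = 0.25

lhs = mi(p.reshape(4, 2))                      # I(X1 X2 ; Z), grouping (X1, X2) together
rhs = mi(p.sum(axis=1)) + mi(p.sum(axis=0))    # I(X1 ; Z) + I(X2 ; Z)
print(lhs, rhs, bool(lhs >= rhs))              # 0.811... >= 0.622..., as Lemma 1.1 predicts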
2. Linear lower bound for the Disjointness function
Information complexity is an approach to proving lower bounds on communication complexity. The basic idea is that for some functions, any small-error protocol has to contain a lot of information about the input; thus the communication cost, namely the length of the transcript, is also large.
Now we focus on the Disjointness function 𝑓 = 𝐷𝑖𝑠𝑗𝑛 , where Alice and Bob each hold an input in {0,1}^𝑛 and 𝑓(π‘₯, 𝑦) = 1 iff ∃𝑖 ∈ [𝑛], π‘₯𝑖 = 𝑦𝑖 = 1. We sometimes use the subscript −𝑖 to denote the set [𝑛] − {𝑖}. Denote by π‘…πœ– (𝑓) the πœ€-error private-coin communication complexity of 𝑓. We want to prove the following.
Theorem 2.1. π‘…πœ– (𝑓) = Ω(𝑛).
First, we define a distribution on inputs. Again, we’ll use capital letters (𝑋, π‘Œ) to denote both the random variables and the distribution. We also introduce a third random variable 𝐷 with sample space {0,1}^𝑛 . The joint distribution of (𝑋, π‘Œ, 𝐷) is πœ‡^𝑛 , where πœ‡ puts probability 1/4 on each of the four triples (π‘₯𝑖 , 𝑦𝑖 , 𝑑𝑖 ) ∈ {(0,0,0), (0,1,0), (0,0,1), (1,0,1)}. That is, the (𝑋𝑖 , π‘Œπ‘– , 𝐷𝑖 )’s are independent for different 𝑖’s, and each (𝑋𝑖 , π‘Œπ‘– , 𝐷𝑖 ) is distributed according to πœ‡. Equivalently, 𝐷𝑖 is a uniform bit; if 𝐷𝑖 = 0 then 𝑋𝑖 = 0 and π‘Œπ‘– is uniform, and if 𝐷𝑖 = 1 then π‘Œπ‘– = 0 and 𝑋𝑖 is uniform.
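A minimal Python sketch of sampling from πœ‡^𝑛 (the function name sample_mu_n is ours, purely illustrative) makes the distribution concrete; note that every pair (π‘₯, 𝑦) it produces is a 0-input of 𝐷𝑖𝑠𝑗𝑛 .

import numpy as np

def sample_mu_n(n, rng):
    """Draw (X, Y, D) ~ mu^n: each D_i is a uniform bit; if D_i = 0 then X_i = 0 and
    Y_i is a uniform bit, and if D_i = 1 then Y_i = 0 and X_i is a uniform bit."""
    d = rng.integers(0, 2, size=n)
    u = rng.integers(0, 2, size=n)      # one fresh coin per coordinate
    x = np.where(d == 1, u, 0)
    y = np.where(d == 0, u, 0)
    return x, y, d

rng = np.random.default_rng(0)
x, y, d = sample_mu_n(8, rng)
print(x, y, d)
print(bool(np.any(x & y)))              # always False: x and y never intersect under mu^n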
Now take a private-coin protocol with minimum worst-case communication cost among all
protocols that have πœ€-error in the worst case. Consider the messages over all rounds, namely
the entire transcript of the conversation. Denote it by 𝛀, and its length (i.e. the number of bits)
by |𝛀|. Note that since the input (𝑋, π‘Œ) is a random variable, and the protocol is randomized,
the induced transcript 𝛀 is also a random variable. By the relation among mutual
information, entropy and the size of sample space, we have
π‘…πœ– (𝐷𝑖𝑠𝑗𝑛 ) = |𝛀| ≥ 𝐻(𝛀|𝐷) ≥ 𝐼(𝛀; π‘‹π‘Œ|𝐷).
Note that both 𝑋 and π‘Œ can be decomposed into n bits: 𝑋 = 𝑋1 … 𝑋𝑛 and π‘Œ = π‘Œ1 … π‘Œπ‘› .
Each 𝑋𝑖 and π‘Œπ‘– are also random variables, and note that (𝑋𝑖 , π‘Œπ‘– , 𝐷𝑖 ) for different i are
independent. Thus we can apply Lemma 1.1 and have
𝐼(𝛀; π‘‹π‘Œ|𝐷) ≥ ∑_{𝑖=1}^𝑛 𝐼(𝛀; 𝑋𝑖 π‘Œπ‘– |𝐷).
Since 𝐼(𝛀; 𝑋𝑖 π‘Œπ‘– |𝐷) = 𝐄𝑑−𝑖 [𝐼(𝛀; 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 , 𝐷−𝑖 = 𝑑−𝑖 )], we have
𝐼(𝛀; π‘‹π‘Œ|𝐷) ≥ ∑_{𝑖=1}^𝑛 𝐄𝑑−𝑖 [𝐼(𝛀; 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 , 𝐷−𝑖 = 𝑑−𝑖 )].
For each fixed 𝑑−𝑖 ∈ {0,1}^{𝑛−1} , we design a worst-case πœ€-error private-coin protocol 𝛀𝑖 (𝑑−𝑖 ) for the function 𝑔(π‘₯𝑖 , 𝑦𝑖 ) = π‘₯𝑖 ∧ 𝑦𝑖 , as follows. Note that the protocol should work for all possible inputs (π‘₯𝑖 , 𝑦𝑖 ), not only those in the support of (𝑋𝑖 , π‘Œπ‘– ).
𝛀𝑖 (𝑑−𝑖 ): On input (π‘₯𝑖 , 𝑦𝑖 ),
1. Generate (𝑋′−𝑖 , π‘Œ′−𝑖 ) from πœ‡^{𝑛−1} |𝐷−𝑖 = 𝑑−𝑖 .
2. Run protocol 𝛀 on (π‘₯𝑖 𝑋′−𝑖 , 𝑦𝑖 π‘Œ′−𝑖 ) and output the answer.
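The embedding is easy to phrase in code. Below is a minimal Python sketch of 𝛀𝑖 (𝑑−𝑖 ), where run_Gamma is a hypothetical stand-in for the original protocol 𝛀 (we only assume it can be called on a pair of 𝑛-bit inputs and returns an answer); the toy protocol at the end just reports whether the two inputs intersect, purely to exercise the code.

import numpy as np

def Gamma_i(x_i, y_i, i, d_minus_i, run_Gamma, rng):
    """Sketch of Gamma_i(d_{-i}) for g(x_i, y_i) = x_i AND y_i.

    For each coordinate j != i: if d_j = 1, Alice fills in X'_j with a private coin
    (and Y'_j = 0); if d_j = 0, Bob fills in Y'_j with a private coin (and X'_j = 0).
    This samples mu^{n-1} conditioned on D_{-i} = d_{-i} with no communication at all.
    """
    n = len(d_minus_i) + 1
    x = np.zeros(n, dtype=int)
    y = np.zeros(n, dtype=int)
    x[i], y[i] = x_i, y_i
    other = [j for j in range(n) if j != i]
    for j, d_j in zip(other, d_minus_i):
        if d_j == 1:
            x[j] = rng.integers(0, 2)     # Alice's private coin
        else:
            y[j] = rng.integers(0, 2)     # Bob's private coin
    return run_Gamma(x, y)                # output whatever Gamma outputs

# Toy usage: a stand-in "protocol" that simply reports whether x and y intersect.
rng = np.random.default_rng(1)
toy_Gamma = lambda x, y: int(np.any(x & y))
print(Gamma_i(1, 1, 2, [0, 1, 0, 1], toy_Gamma, rng))   # 1 = g(1, 1)
print(Gamma_i(1, 0, 2, [0, 1, 0, 1], toy_Gamma, rng))   # 0 = g(1, 0)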
Lemma 2.2.
1. 𝛀𝑖 (𝑑−𝑖 ) is private-coin.
2. 𝛀𝑖 (𝑑−𝑖 ) has πœ€-error in the worst case.
3. 𝐼(𝛀; 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 , 𝐷−𝑖 = 𝑑−𝑖 ) = 𝐼(𝛀𝑖 (𝑑−𝑖 ); 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 ).
Proof.
1. Note that once 𝑑−𝑖 is fixed, each (𝑋𝑗 , π‘Œπ‘— ) (𝑗 ∈ [𝑛] − {𝑖}) is a product distribution (over Alice’s and Bob’s spaces). Indeed, if 𝑑𝑗 = 0, then (𝑋𝑗 , π‘Œπ‘— ) ∼ 0 × π‘ˆ where π‘ˆ is the uniform distribution on {0,1}; if 𝑑𝑗 = 1, then (𝑋𝑗 , π‘Œπ‘— ) ∼ π‘ˆ × 0. Hence Alice can generate 𝑋′−𝑖 and Bob can generate π‘Œ′−𝑖 using only their private coins, so 𝛀𝑖 (𝑑−𝑖 ) is private-coin.
2. Since πœ‡ only puts weight on 0-inputs, 𝐷𝑖𝑠𝑗𝑛 (π‘₯𝑖 𝑋′−𝑖 , 𝑦𝑖 π‘Œ′−𝑖 ) = π‘₯𝑖 ∧ 𝑦𝑖 , and thus 𝛀𝑖 (𝑑−𝑖 ) is correct on (π‘₯𝑖 , 𝑦𝑖 ) iff 𝛀 is correct on (π‘₯𝑖 𝑋′−𝑖 , 𝑦𝑖 π‘Œ′−𝑖 ). Thus the error probability of 𝛀𝑖 (𝑑−𝑖 ) on (π‘₯𝑖 , 𝑦𝑖 ) is the average (over 𝑋′−𝑖 and π‘Œ′−𝑖 ) of the error probability of 𝛀 on (π‘₯𝑖 𝑋′−𝑖 , 𝑦𝑖 π‘Œ′−𝑖 ). Since 𝛀 has error probability at most πœ€ on all inputs, so does 𝛀𝑖 (𝑑−𝑖 ) on all its possible inputs (π‘₯𝑖 , 𝑦𝑖 ).
3. When its input (π‘₯𝑖 , 𝑦𝑖 ) is drawn from πœ‡|𝐷𝑖 , the protocol 𝛀𝑖 (𝑑−𝑖 ) does nothing but invoke 𝛀 on the input distribution πœ‡^𝑛 conditioned on 𝐷−𝑖 = 𝑑−𝑖 , so its transcript has the same joint distribution with (𝑋𝑖 , π‘Œπ‘– , 𝐷𝑖 ) as 𝛀’s, and the two mutual informations are equal.    β–‘
Now we have
𝐼(𝛀; π‘‹π‘Œ|𝐷) ≥ ∑_{𝑖=1}^𝑛 𝐄𝑑−𝑖 [𝐼(𝛀𝑖 (𝑑−𝑖 ); 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 )].
Next we show that any worst-case πœ€-error private-coin protocol for 𝑔 has to contain a constant amount of information about (𝑋𝑖 , π‘Œπ‘– ) (conditioned on 𝐷𝑖 ).
Lemma 2.3. For any worst-case πœ€-error private-coin protocol for 𝑔 with transcript 𝛱,
𝐼(𝛱; 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 ) = Ω(1).
Proof. Denote by π›±π‘Žπ‘ the distribution of the transcript on input (π‘Ž, 𝑏). First, we show that if the protocol does not contain enough information, then 𝛱00 is close to both 𝛱01 and 𝛱10 , and thus 𝛱01 is close to 𝛱10 . Closeness is measured by the Hellinger distance:
β„Ž(𝑝, π‘ž)² = 1 − ∑𝑖 √(𝑝𝑖 π‘žπ‘– ) = (1/2) ∑𝑖 (√𝑝𝑖 − √π‘žπ‘– )².
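For concreteness, here is a minimal Python sketch (the helper name and the two example distributions are ours) that computes β„Ž(𝑝, π‘ž), checks that the two expressions above agree, and checks the relation to the β„“1 distance used at the end of this proof.

import numpy as np

def hellinger(p, q):
    """h(p, q), with h^2 = (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))^2 = 1 - sum_i sqrt(p_i * q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
h = hellinger(p, q)
print(h ** 2, 1 - np.sum(np.sqrt(p * q)))     # the two expressions for h^2 agree
tv = 0.5 * np.sum(np.abs(p - q))              # total variation distance
print(bool(tv <= h * np.sqrt(2 - h ** 2)))    # True: (1/2)||p - q||_1 <= h * sqrt(2 - h^2)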
We have
𝐼(𝛱; 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 ) = (1/2) 𝐼(𝛱; 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 = 0) + (1/2) 𝐼(𝛱; 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 = 1)
≥ (1/2) β„Ž(𝛱00 , 𝛱01 )² + (1/2) β„Ž(𝛱00 , 𝛱10 )²    // given 𝐷𝑖 , one of 𝑋𝑖 , π‘Œπ‘– is fixed to 0 and the other is a uniform bit; for 𝐡 ∼ π‘ˆ, 𝐼(𝐡; 𝛱𝐡 ) ≥ β„Ž(𝛱0 , 𝛱1 )²
≥ (1/4) β„Ž(𝛱10 , 𝛱01 )²    // Cauchy-Schwarz (π‘Ž² + 𝑏² ≥ (π‘Ž + 𝑏)²/2) and the triangle inequality for β„Ž
Then, we show that any communication protocol enjoys the property that β„Ž(𝛱π‘₯𝑦 , 𝛱π‘₯′𝑦′ ) = β„Ž(𝛱π‘₯𝑦′ , 𝛱π‘₯′𝑦 ) for any π‘₯, 𝑦, π‘₯′, 𝑦′ (the cut-and-paste property). Thus 𝛱00 is close to 𝛱11 .
The property is proven by writing down Pr[𝛱π‘₯𝑦 = 𝛾] and Pr[𝛱π‘₯′𝑦′ = 𝛾] for any fixed transcript 𝛾. It is not hard to see that Pr[𝛱π‘₯𝑦 = 𝛾] is the product of some function of π‘₯ and some function of 𝑦 (both depending on 𝛾). So the product Pr[𝛱π‘₯𝑦 = 𝛾] Pr[𝛱π‘₯′𝑦′ = 𝛾] consists of the same four factors as Pr[𝛱π‘₯′𝑦 = 𝛾] Pr[𝛱π‘₯𝑦′ = 𝛾], and hence
Pr[𝛱π‘₯𝑦 = 𝛾] Pr[𝛱π‘₯′𝑦′ = 𝛾] = Pr[𝛱π‘₯′𝑦 = 𝛾] Pr[𝛱π‘₯𝑦′ = 𝛾].
Then, by the multiplicative nature of the definition of β„Ž(𝑝, π‘ž), one gets β„Ž(𝛱π‘₯𝑦 , 𝛱π‘₯′𝑦′ ) = β„Ž(𝛱π‘₯𝑦′ , 𝛱π‘₯′𝑦 ).
Now the contradiction comes: 𝑔(0,0) ≠ 𝑔(1,1), so 𝛱00 must be far from 𝛱11 . The best probability gap for distinguishing 𝛱00 from 𝛱11 is their total variation distance (1/2)‖𝛱00 − 𝛱11 β€–1 , which is at least 1 − 2πœ€ since the protocol errs with probability at most πœ€ on both (0,0) and (1,1). On the other hand, the total variation distance is upper bounded in terms of the Hellinger distance by the fact (1/2)‖𝑝 − π‘žβ€–1 ≤ β„Ž(𝑝, π‘ž)√(2 − β„Ž(𝑝, π‘ž)²) ≤ √2 β„Ž(𝑝, π‘ž). Hence β„Ž(𝛱00 , 𝛱11 )² ≥ (1 − 2πœ€)²/2, and therefore 𝐼(𝛱; 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 ) ≥ (1/4) β„Ž(𝛱10 , 𝛱01 )² = (1/4) β„Ž(𝛱00 , 𝛱11 )² ≥ (1 − 2πœ€)²/8 = Ω(1).
β–‘
Putting everything together, we get
π‘…πœ– (𝐷𝑖𝑠𝑗𝑛 ) = |𝛀| ≥ 𝐼(𝛀; π‘‹π‘Œ|𝐷)
≥ ∑_{𝑖=1}^𝑛 𝐄𝑑−𝑖 [𝐼(𝛀; 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 , 𝐷−𝑖 = 𝑑−𝑖 )]
= ∑_{𝑖=1}^𝑛 𝐄𝑑−𝑖 [𝐼(𝛀𝑖 (𝑑−𝑖 ); 𝑋𝑖 π‘Œπ‘– |𝐷𝑖 )]
= ∑_{𝑖=1}^𝑛 Ω(1) = Ω(𝑛).    β–‘
References
[BYJKS04] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 68(4), pp. 702–732, 2004.
[CSWY01] Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, and Andrew Yao. Informational complexity and the direct sum problem for simultaneous message complexity. In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science, pp. 270–278, 2001.
[CT06] Thomas Cover and Joy Thomas. Elements of Information Theory, second edition. Wiley-Interscience, 2006.