Lecture 11. Information theoretical argument

An interesting technique for proving lower bounds is to use information theoretical arguments. Since its introduction in [CSWY01], information complexity has been studied extensively and applied to prove strong lower bounds on communication complexity. In this lecture, we will first introduce/review some basic notions such as entropy and mutual information, and then show how to use them to prove a linear lower bound on the randomized communication complexity of the Disjointness function [BYJKS04], one of the most important functions in communication complexity.

1. Review of basic notions in information theory

Let us first review some basic concepts in information theory. All of them can be found in a standard textbook such as [CT06] (Chapter 2).

Suppose that X is a discrete random variable with alphabet 𝒳 and probability mass function p(x) = Pr[X = x]. Following standard notation in information theory, we sometimes also use the capital letter X to denote the distribution. By writing x ← X, we mean drawing a sample x from the distribution X.

The entropy H(X) of a discrete random variable X is defined by

    H(X) = -∑_{x∈𝒳} p(x) log_2 p(x).

(Here we use the convention that 0 log_2 0 = 0.) One intuition is that entropy measures the amount of uncertainty in a random variable: the more entropy X has, the more uncertain it is. In particular, if H(X) = 0 then X is fixed/deterministic. The maximum entropy of any random variable on 𝒳 is log_2 |𝒳|:

    0 ≤ H(X) ≤ log_2 |𝒳|,

with the upper bound achieved by the uniform distribution.

Sometimes we have more than one random variable. If (X, Y) is a pair of random variables on 𝒳 × 𝒴, distributed according to a joint distribution p(x, y), then the joint entropy is simply

    H(X, Y) = -∑_{x∈𝒳, y∈𝒴} p(x, y) log_2 p(x, y).

We can also define the conditional entropy

    H(X|Y) = E_{y←Y}[H(X|Y = y)],

namely the expected entropy of the conditional distribution X|Y = y. Thus H(X|Y) ≥ 0. Another basic fact is the chain rule:

    H(X, Y) = H(X) + H(Y|X).

The chain rule also holds with conditioning:

    H(X, Y|Z) = H(X|Z) + H(Y|X, Z),

and it extends to more variables:

    H(X_1, ..., X_n | Z) = H(X_1|Z) + H(X_2|X_1, Z) + ... + H(X_n|X_1, ..., X_{n-1}, Z)
                         ≤ H(X_1|Z) + H(X_2|Z) + ... + H(X_n|Z),

where the inequality is usually referred to as the subadditivity of entropy.

The relative entropy of two distributions p and q is

    D(p||q) = ∑_{x∈𝒳} p(x) log_2 (p(x)/q(x)).

For a joint distribution (X, Y) ∼ p(x, y), the mutual information between X and Y is

    I(X; Y) = D(p(x, y) || p(x)p(y)) = H(X) - H(X|Y).

It is a good exercise to verify that the above equality holds. The latter expression has a clear interpretation: the mutual information is the amount of uncertainty in X minus the amount remaining once Y is known. In this way, it measures how much information Y contains about X. It is not hard to verify that mutual information is symmetric,

    I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X),

and nonnegative: I(X; Y) ≥ 0. Since conditional entropy is also nonnegative, the two expressions above give

    I(X; Y) ≤ min{H(X), H(Y)}.

The conditional mutual information is defined by

    I(X; Y|D) = E_{d←D}[I(X; Y|D = d)].

It is not hard to see that

    I(X; Y|D) ≤ min{H(X|D), H(Y|D)} ≤ log_2 |𝒳|

and

    I(X; Y|D) = H(X|D) - H(X|Y, D).
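To make these definitions concrete, here is a small Python sketch (an illustration added here, not part of [CT06] or the original notes): it computes entropies and mutual information for a toy joint distribution on {0,1} × {0,1} and numerically checks the chain rule, the symmetry of mutual information, and its nonnegativity. The helper names H, marginal, and H_cond are ad hoc.

```python
# Toy numerical check of the definitions above (illustration only; not from the notes).
# For a small joint distribution p(x, y) on {0,1} x {0,1} we verify:
#   H(X, Y) = H(X) + H(Y|X)                      (chain rule)
#   I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) >= 0 (symmetry, nonnegativity)
from math import log2

# Joint distribution, stored as p[(x, y)] = Pr[X = x, Y = y].
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    """Entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def marginal(p, coord):
    """Marginal distribution of coordinate 0 (X) or coordinate 1 (Y)."""
    m = {}
    for xy, q in p.items():
        m[xy[coord]] = m.get(xy[coord], 0.0) + q
    return m

def H_cond(p, given):
    """Conditional entropy: H(X|Y) if given == 1, H(Y|X) if given == 0."""
    total = 0.0
    for z, qz in marginal(p, given).items():
        cond = {xy: q / qz for xy, q in p.items() if xy[given] == z}
        total += qz * H(cond)   # expected entropy of the conditional distribution
    return total

HX, HY, HXY = H(marginal(p, 0)), H(marginal(p, 1)), H(p)
print("chain rule holds:", abs(HXY - (HX + H_cond(p, 0))) < 1e-9)
I1 = HX - H_cond(p, 1)   # H(X) - H(X|Y)
I2 = HY - H_cond(p, 0)   # H(Y) - H(Y|X)
print("I(X;Y) =", I1, " symmetric:", abs(I1 - I2) < 1e-9, " nonnegative:", I1 >= -1e-9)
```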
Lemma 1.1. If X = X_1 ... X_n, D = D_1 ... D_n, and the pairs (X_i, D_i) are independent across i, then for any random variable Y,

    I(X; Y|D) ≥ ∑_{i=1}^n I(X_i; Y|D).

Proof.

    I(X; Y|D) = H(X|D) - H(X|Y, D)
              = ∑_i H(X_i|D) - H(X|Y, D)        // H(X|D) = ∑_i H(X_i|D): independence of the (X_i, D_i)'s
              ≥ ∑_i H(X_i|D) - ∑_i H(X_i|Y, D)  // H(X|Y, D) ≤ ∑_i H(X_i|Y, D): subadditivity of entropy
              = ∑_{i=1}^n I(X_i; Y|D).  ∎

2. Linear lower bound for the Disjointness function

Information complexity is an approach to proving lower bounds on communication complexity. The basic idea is that for some functions, any small-error protocol has to contain a lot of information about the input; thus the communication cost, namely the length of the transcript, is also large.

Now we focus on the Disjointness function, where Alice and Bob each hold an input in {0,1}^n and Disj_n(x, y) = 1 iff ∃i ∈ [n], x_i = y_i = 1. We sometimes use the subscript -i to denote the index set [n] - {i}. Denote by R_ε(f) the ε-error private-coin communication complexity of f. We want to prove the following.

Theorem 2.1. R_ε(Disj_n) = Ω(n).

First, we define a distribution on inputs. Again, we use capital letters (X, Y) to denote both the random variables and their distributions. We also introduce an auxiliary random variable D with sample space {0,1}^n. The joint distribution of (X, Y, D) is μ^n, where μ puts probability 1/4 on each of the four triples (x_i, y_i, d_i) ∈ {000, 010, 001, 101}. That is, the triples (X_i, Y_i, D_i) are independent for different i, and each (X_i, Y_i, D_i) is distributed according to μ. Equivalently: D_i is a uniform bit; if D_i = 0 then X_i = 0 and Y_i is uniform, and if D_i = 1 then Y_i = 0 and X_i is uniform.

Now take a private-coin protocol with minimum worst-case communication cost among all protocols that compute Disj_n with error ε in the worst case. Consider the messages over all rounds, namely the entire transcript of the conversation. Denote it by Π, and its length (i.e., the number of bits) by |Π|. Note that since the input (X, Y) is a random variable and the protocol is randomized, the induced transcript Π is also a random variable. By the relations among mutual information, entropy, and the size of the sample space, we have

    R_ε(Disj_n) = |Π| ≥ H(Π|D) ≥ I(Π; XY|D).

Note that both X and Y can be decomposed into n bits: X = X_1 ... X_n and Y = Y_1 ... Y_n. Each X_i and Y_i is also a random variable, and the triples (X_i, Y_i, D_i) for different i are independent. Thus we can apply Lemma 1.1, with the pair X_iY_i playing the role of the i-th block and the transcript Π playing the role of Y (recall that mutual information is symmetric), and get

    I(Π; XY|D) ≥ ∑_{i=1}^n I(Π; X_iY_i|D).

Since I(Π; X_iY_i|D) = E_{d_{-i}←D_{-i}}[I(Π; X_iY_i | D_i, D_{-i} = d_{-i})], we have

    I(Π; XY|D) ≥ ∑_{i=1}^n E_{d_{-i}}[I(Π; X_iY_i | D_i, D_{-i} = d_{-i})].

For each fixed d_{-i} ∈ {0,1}^{n-1}, we design a worst-case ε-error private-coin protocol Π_i(d_{-i}) for the two-bit AND function g(x_i, y_i) = x_i ∧ y_i, as follows. Note that the protocol should work for all possible inputs (x_i, y_i), not only those in the support of (X_i, Y_i). A small simulation of this embedding is sketched after the description.

Π_i(d_{-i}): On input (x_i, y_i),
1. Generate (u_{-i}, v_{-i}) from μ^{n-1} | D_{-i} = d_{-i}.
2. Run protocol Π on (x_i u_{-i}, y_i v_{-i}), i.e., on the n-bit inputs obtained by placing x_i and y_i in coordinate i and u_{-i} and v_{-i} in the remaining coordinates, and output its answer.
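Below is a short Python sketch of this embedding (an added illustration, not from [BYJKS04]; the names sample_coord, embed, and run_Pi are hypothetical, and run_Pi merely stands in for the black-box n-bit protocol Π). It samples (u_{-i}, v_{-i}) coordinate-wise from μ^{n-1} | D_{-i} = d_{-i}, with Alice's bit and Bob's bit drawn independently in each coordinate, and checks that on the padded inputs Disj_n evaluates to x_i ∧ y_i, which is the key point of Lemma 2.2 below. For the sanity check, the exact function Disj_n is plugged in where a real protocol would run.

```python
# Sketch of the embedding protocol Pi_i(d_{-i}) (illustration only; `run_Pi` stands in
# for the black-box n-bit protocol Pi, and all names here are ad hoc).
import random

def sample_coord(d_j):
    """Sample (u_j, v_j) from mu conditioned on D_j = d_j.
    d_j = 0: u_j = 0 and v_j is a uniform bit (Bob uses a private coin).
    d_j = 1: v_j = 0 and u_j is a uniform bit (Alice uses a private coin)."""
    if d_j == 0:
        return 0, random.randint(0, 1)
    return random.randint(0, 1), 0

def embed(i, d_minus_i, x_i, y_i, run_Pi):
    """Pi_i(d_{-i}): pad the 2-bit AND instance (x_i, y_i) into coordinate i
    and run the given n-bit protocol on the padded inputs."""
    pairs = [sample_coord(d_j) for d_j in d_minus_i]
    u = [a for a, _ in pairs]          # Alice's private padding u_{-i}
    v = [b for _, b in pairs]          # Bob's private padding v_{-i}
    x = u[:i] + [x_i] + u[i:]          # x_i u_{-i}: x_i sits in coordinate i
    y = v[:i] + [y_i] + v[i:]
    return run_Pi(x, y)

def disj(x, y):
    """Disj_n(x, y) = 1 iff some coordinate has x_j = y_j = 1."""
    return int(any(a == 1 and b == 1 for a, b in zip(x, y)))

# Sanity check of the key point in Lemma 2.2(2): mu puts weight only on 0-inputs of AND,
# so every padded coordinate has u_j AND v_j = 0, hence Disj_n(x_i u_{-i}, y_i v_{-i}) = x_i AND y_i.
n, i = 8, 3
d_minus_i = [random.randint(0, 1) for _ in range(n - 1)]
for x_i in (0, 1):
    for y_i in (0, 1):
        assert embed(i, d_minus_i, x_i, y_i, disj) == (x_i & y_i)
```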
Lemma 2.2.
1. Π_i(d_{-i}) is private-coin;
2. Π_i(d_{-i}) has error ε in the worst case;
3. I(Π; X_iY_i | D_i, D_{-i} = d_{-i}) = I(Π_i(d_{-i}); X_iY_i | D_i).

Proof.
1. Note that once d_{-i} is fixed, each (X_j, Y_j) (j ∈ [n] - {i}) is a product distribution (over Alice's and Bob's spaces). Indeed, if d_j = 0, then (X_j, Y_j) ∼ 0 × U where U is the uniform distribution on {0,1}, and if d_j = 1, then (X_j, Y_j) ∼ U × 0. Hence in step 1 Alice can sample u_{-i} and Bob can sample v_{-i} using only their private coins.
2. Since μ puts weight only on 0-inputs of AND, every padded coordinate satisfies u_j ∧ v_j = 0, so Disj_n(x_i u_{-i}, y_i v_{-i}) = x_i ∧ y_i, and thus Π_i(d_{-i}) is correct on (x_i, y_i) iff Π is correct on (x_i u_{-i}, y_i v_{-i}). Therefore the error probability of Π_i(d_{-i}) on (x_i, y_i) is the average (over u_{-i} and v_{-i}) of the error probability of Π on (x_i u_{-i}, y_i v_{-i}). Since Π has error probability at most ε on all inputs, so does Π_i(d_{-i}) on all of its possible inputs (x_i, y_i).
3. When (X_i, Y_i, D_i) ∼ μ, the protocol Π_i(d_{-i}) does nothing but invoke Π on an input distributed according to μ^n conditioned on D_{-i} = d_{-i}, so the joint distribution of the transcript with (X_i, Y_i, D_i) is the same in both cases.  ∎

Now we have

    I(Π; XY|D) ≥ ∑_{i=1}^n E_{d_{-i}}[I(Π_i(d_{-i}); X_iY_i | D_i)].

Next we show that any worst-case ε-error private-coin protocol for g has to contain a constant amount of information about (X_i, Y_i) (conditioned on D_i).

Lemma 2.3. For any worst-case ε-error private-coin protocol for g with transcript Λ, where (X_i, Y_i, D_i) ∼ μ, we have I(Λ; X_iY_i | D_i) = Ω(1).

Proof. Denote by Λ_{ab} the distribution of the transcript on input (a, b). First, we show that if the protocol does not contain enough information, then Λ_{00} is close to both Λ_{01} and Λ_{10}, and thus Λ_{01} is close to Λ_{10}. Consider the Hellinger distance

    h(p, q)^2 = 1 - ∑_ω √(p_ω q_ω) = (1/2) ∑_ω (√p_ω - √q_ω)^2.

We have

    I(Λ; X_iY_i | D_i) = (1/2) I(Λ; X_iY_i | D_i = 0) + (1/2) I(Λ; X_iY_i | D_i = 1)
                       ≥ (1/2) h(Λ_{00}, Λ_{01})^2 + (1/2) h(Λ_{00}, Λ_{10})^2   // for a uniform bit B, I(B; Z_B) ≥ h(Z_0, Z_1)^2
                       ≥ (1/4) h(Λ_{10}, Λ_{01})^2                               // Cauchy-Schwarz and the triangle inequality

Then we show that any communication protocol enjoys the property that h(Λ_{ab}, Λ_{a'b'}) = h(Λ_{ab'}, Λ_{a'b}) for any a, b, a', b'. Thus Λ_{01} being close to Λ_{10} implies that Λ_{00} is close to Λ_{11}. The property is proven by writing down Pr[Λ_{xy} = γ] and Pr[Λ_{x'y'} = γ] for any fixed transcript γ. It is not hard to see that Pr[Λ_{xy} = γ] is a product of some function of x (and γ) and some function of y (and γ). So by switching the function for x and that for y', we get

    Pr[Λ_{xy} = γ] Pr[Λ_{x'y'} = γ] = Pr[Λ_{x'y} = γ] Pr[Λ_{xy'} = γ].

Then, by the multiplicative nature of the definition of h(p, q), one gets h(Λ_{xy}, Λ_{x'y'}) = h(Λ_{xy'}, Λ_{x'y}).

Now the contradiction comes: g(0, 0) ≠ g(1, 1), so Λ_{00} should be far from Λ_{11}. The best probability gap in distinguishing Λ_{00} from Λ_{11} is their total variation distance (1/2)‖Λ_{00} - Λ_{11}‖_1, which is upper bounded in terms of h by the following fact:

    (1/2)‖p - q‖_1 ≤ h(p, q) √(2 - h(p, q)^2).

Since the protocol errs with probability at most ε on both (0, 0) and (1, 1) while g(0, 0) ≠ g(1, 1), this gap is at least 1 - 2ε, which forces h(Λ_{00}, Λ_{11})^2 = Ω(1) for any constant ε < 1/2. By the property above, h(Λ_{01}, Λ_{10})^2 = h(Λ_{00}, Λ_{11})^2 = Ω(1), and therefore I(Λ; X_iY_i | D_i) = Ω(1).  ∎

Putting everything together, we get

    R_ε(Disj_n) = |Π| ≥ I(Π; XY|D)
                ≥ ∑_{i=1}^n E_{d_{-i}}[I(Π; X_iY_i | D_i, D_{-i} = d_{-i})]
                = ∑_{i=1}^n E_{d_{-i}}[I(Π_i(d_{-i}); X_iY_i | D_i)]   // Lemma 2.2
                = ∑_{i=1}^n Ω(1) = Ω(n).                               // Lemma 2.3  ∎

References

[BYJKS04] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 68(4), pp. 702-732, 2004.
[CSWY01] Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, and Andrew Yao. Informational complexity and the direct sum problem for simultaneous message complexity. In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science, pp. 270-278, 2001.
[CT06] Thomas Cover and Joy Thomas. Elements of Information Theory, Second Edition. Wiley-Interscience, 2006.