Lecture 11. Information theoretical argument

An interesting technique for proving lower bounds is to use information theoretical arguments. Since its introduction in [CSWY01], information complexity has been studied extensively and applied to prove strong lower bounds on communication complexity. In this lecture, we will first introduce/review some basic notions such as entropy and mutual information, and then show how to use them to prove a linear lower bound on the randomized communication complexity of the Disjointness function [BYJKS04], one of the most important functions in communication complexity.

1. Review of basic notions in information theory

Let us first review some basic concepts in information theory. All of them can be found in a standard textbook such as [CT06] (Chapter 2).

Suppose that X is a discrete random variable with alphabet 𝒳 and probability mass function p(x) = Pr[X = x]. Following standard notation in information theory, we sometimes also use the capital letter X to denote the distribution. By writing x ← X, we mean drawing a sample x from the distribution X.

The entropy H(X) of a discrete random variable X is defined by

    H(X) = -∑_{x∈𝒳} p(x) log_2 p(x).

(Here we use the convention that 0 log_2 0 = 0.) One intuition is that entropy measures the amount of uncertainty in a random variable: the more entropy X has, the more uncertain it is. In particular, if H(X) = 0 then X is fixed/deterministic. The maximum entropy of any random variable on 𝒳 is log_2 |𝒳|:

    0 ≤ H(X) ≤ log_2 |𝒳|,

with the upper bound achieved by the uniform distribution.

Sometimes we have more than one random variable. If (X, Y) is a pair of random variables on 𝒳 × 𝒴, distributed according to a joint distribution p(x, y), then the joint entropy is simply

    H(X, Y) = -∑_{x∈𝒳, y∈𝒴} p(x, y) log_2 p(x, y).

We can also define the conditional entropy

    H(X|Y) = E_{y←Y}[H(X|Y = y)],

namely the expected entropy of the conditional distribution X|Y = y. Thus H(X|Y) ≥ 0. Another basic fact is the chain rule:

    H(X, Y) = H(X) + H(Y|X).

The chain rule also holds with conditioning:

    H(X, Y|Z) = H(X|Z) + H(Y|X, Z),

and it extends to more variables:

    H(X_1, ..., X_n | Z) = H(X_1|Z) + H(X_2|X_1, Z) + ... + H(X_n|X_1, ..., X_{n-1}, Z)
                         ≤ H(X_1|Z) + H(X_2|Z) + ... + H(X_n|Z),

where the inequality is usually referred to as the subadditivity of entropy.

The relative entropy of two distributions p and q is

    D(p||q) = ∑_{x∈𝒳} p(x) log_2 (p(x)/q(x)).

For a joint distribution (X, Y) ∼ p(x, y), the mutual information between X and Y is

    I(X; Y) = D(p(x, y) || p(x)p(y)) = H(X) - H(X|Y).

It is a good exercise to verify that the above equality holds. The latter expression has a clear interpretation: the mutual information is the amount of uncertainty in X minus the amount remaining once Y is known. In this way, it measures how much information Y contains about X. It is not hard to verify that mutual information is symmetric,

    I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X),

and nonnegative: I(X; Y) ≥ 0. Since conditional entropy is also nonnegative, the two expressions above give

    I(X; Y) ≤ min{H(X), H(Y)}.

The conditional mutual information is defined by

    I(X; Y|D) = E_{d←D}[I(X; Y|D = d)].

It is not hard to see that

    I(X; Y|D) ≤ min{H(X|D), H(Y|D)} ≤ log_2 |𝒳|

and

    I(X; Y|D) = H(X|D) - H(X|Y, D).
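To make these definitions concrete, here is a small Python sketch (an illustration added here, not part of [CT06] or the original notes): it computes entropies and mutual information for a toy joint distribution on {0,1} × {0,1} and numerically checks the chain rule, the symmetry of mutual information, and its nonnegativity. The helper names H, marginal, and H_cond are ad hoc.

```python
# Toy numerical check of the definitions above (illustration only; not from the notes).
# For a small joint distribution p(x, y) on {0,1} x {0,1} we verify:
#   H(X, Y) = H(X) + H(Y|X)                      (chain rule)
#   I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) >= 0 (symmetry, nonnegativity)
from math import log2

# Joint distribution, stored as p[(x, y)] = Pr[X = x, Y = y].
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    """Entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def marginal(p, coord):
    """Marginal distribution of coordinate 0 (X) or coordinate 1 (Y)."""
    m = {}
    for xy, q in p.items():
        m[xy[coord]] = m.get(xy[coord], 0.0) + q
    return m

def H_cond(p, given):
    """Conditional entropy: H(X|Y) if given == 1, H(Y|X) if given == 0."""
    total = 0.0
    for z, qz in marginal(p, given).items():
        cond = {xy: q / qz for xy, q in p.items() if xy[given] == z}
        total += qz * H(cond)   # expected entropy of the conditional distribution
    return total

HX, HY, HXY = H(marginal(p, 0)), H(marginal(p, 1)), H(p)
print("chain rule holds:", abs(HXY - (HX + H_cond(p, 0))) < 1e-9)
I1 = HX - H_cond(p, 1)   # H(X) - H(X|Y)
I2 = HY - H_cond(p, 0)   # H(Y) - H(Y|X)
print("I(X;Y) =", I1, " symmetric:", abs(I1 - I2) < 1e-9, " nonnegative:", I1 >= -1e-9)
```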
Lemma 1.1. If X = X_1 ... X_n, D = D_1 ... D_n, and the pairs (X_i, D_i) are independent across i, then for any random variable Y,

    I(X; Y|D) ≥ ∑_{i=1}^n I(X_i; Y|D).

Proof.

    I(X; Y|D) = H(X|D) - H(X|Y, D)
              = ∑_i H(X_i|D) - H(X|Y, D)        // H(X|D) = ∑_i H(X_i|D): independence of the (X_i, D_i)'s
              ≥ ∑_i H(X_i|D) - ∑_i H(X_i|Y, D)  // H(X|Y, D) ≤ ∑_i H(X_i|Y, D): subadditivity of entropy
              = ∑_{i=1}^n I(X_i; Y|D).  ∎

2. Linear lower bound for the Disjointness function

Information complexity is an approach to proving lower bounds on communication complexity. The basic idea is that for some functions, any small-error protocol has to contain a lot of information about the input; thus the communication cost, namely the length of the transcript, is also large.

Now we focus on the Disjointness function, where Alice and Bob each hold an input in {0,1}^n and Disj_n(x, y) = 1 iff ∃i ∈ [n], x_i = y_i = 1. We sometimes use the subscript -i to denote the index set [n] - {i}. Denote by R_ε(f) the ε-error private-coin communication complexity of f. We want to prove the following.

Theorem 2.1. R_ε(Disj_n) = Ω(n).

First, we define a distribution on inputs. Again, we use capital letters (X, Y) to denote both the random variables and their distributions. We also introduce an auxiliary random variable D with sample space {0,1}^n. The joint distribution of (X, Y, D) is μ^n, where μ puts probability 1/4 on each of the four triples (x_i, y_i, d_i) ∈ {000, 010, 001, 101}. That is, the triples (X_i, Y_i, D_i) are independent for different i, and each (X_i, Y_i, D_i) is distributed according to μ. Equivalently: D_i is a uniform bit; if D_i = 0 then X_i = 0 and Y_i is uniform, and if D_i = 1 then Y_i = 0 and X_i is uniform.

Now take a private-coin protocol with minimum worst-case communication cost among all protocols that compute Disj_n with error ε in the worst case. Consider the messages over all rounds, namely the entire transcript of the conversation. Denote it by Π, and its length (i.e., the number of bits) by |Π|. Note that since the input (X, Y) is a random variable and the protocol is randomized, the induced transcript Π is also a random variable. By the relations among mutual information, entropy, and the size of the sample space, we have

    R_ε(Disj_n) = |Π| ≥ H(Π|D) ≥ I(Π; XY|D).

Note that both X and Y can be decomposed into n bits: X = X_1 ... X_n and Y = Y_1 ... Y_n. Each X_i and Y_i is also a random variable, and the triples (X_i, Y_i, D_i) for different i are independent. Thus we can apply Lemma 1.1, with the pair X_iY_i playing the role of the i-th block and the transcript Π playing the role of Y (recall that mutual information is symmetric), and get

    I(Π; XY|D) ≥ ∑_{i=1}^n I(Π; X_iY_i|D).

Since I(Π; X_iY_i|D) = E_{d_{-i}←D_{-i}}[I(Π; X_iY_i | D_i, D_{-i} = d_{-i})], we have

    I(Π; XY|D) ≥ ∑_{i=1}^n E_{d_{-i}}[I(Π; X_iY_i | D_i, D_{-i} = d_{-i})].

For each fixed d_{-i} ∈ {0,1}^{n-1}, we design a worst-case ε-error private-coin protocol Π_i(d_{-i}) for the two-bit AND function g(x_i, y_i) = x_i ∧ y_i, as follows. Note that the protocol should work for all possible inputs (x_i, y_i), not only those in the support of (X_i, Y_i). A small simulation of this embedding is sketched after the description.

Π_i(d_{-i}): On input (x_i, y_i),
1. Generate (u_{-i}, v_{-i}) from μ^{n-1} | D_{-i} = d_{-i}.
2. Run protocol Π on (x_i u_{-i}, y_i v_{-i}), i.e., on the n-bit inputs obtained by placing x_i and y_i in coordinate i and u_{-i} and v_{-i} in the remaining coordinates, and output its answer.
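Below is a short Python sketch of this embedding (an added illustration, not from [BYJKS04]; the names sample_coord, embed, and run_Pi are hypothetical, and run_Pi merely stands in for the black-box n-bit protocol Π). It samples (u_{-i}, v_{-i}) coordinate-wise from μ^{n-1} | D_{-i} = d_{-i}, with Alice's bit and Bob's bit drawn independently in each coordinate, and checks that on the padded inputs Disj_n evaluates to x_i ∧ y_i, which is the key point of Lemma 2.2 below. For the sanity check, the exact function Disj_n is plugged in where a real protocol would run.

```python
# Sketch of the embedding protocol Pi_i(d_{-i}) (illustration only; `run_Pi` stands in
# for the black-box n-bit protocol Pi, and all names here are ad hoc).
import random

def sample_coord(d_j):
    """Sample (u_j, v_j) from mu conditioned on D_j = d_j.
    d_j = 0: u_j = 0 and v_j is a uniform bit (Bob uses a private coin).
    d_j = 1: v_j = 0 and u_j is a uniform bit (Alice uses a private coin)."""
    if d_j == 0:
        return 0, random.randint(0, 1)
    return random.randint(0, 1), 0

def embed(i, d_minus_i, x_i, y_i, run_Pi):
    """Pi_i(d_{-i}): pad the 2-bit AND instance (x_i, y_i) into coordinate i
    and run the given n-bit protocol on the padded inputs."""
    pairs = [sample_coord(d_j) for d_j in d_minus_i]
    u = [a for a, _ in pairs]          # Alice's private padding u_{-i}
    v = [b for _, b in pairs]          # Bob's private padding v_{-i}
    x = u[:i] + [x_i] + u[i:]          # x_i u_{-i}: x_i sits in coordinate i
    y = v[:i] + [y_i] + v[i:]
    return run_Pi(x, y)

def disj(x, y):
    """Disj_n(x, y) = 1 iff some coordinate has x_j = y_j = 1."""
    return int(any(a == 1 and b == 1 for a, b in zip(x, y)))

# Sanity check of the key point in Lemma 2.2(2): mu puts weight only on 0-inputs of AND,
# so every padded coordinate has u_j AND v_j = 0, hence Disj_n(x_i u_{-i}, y_i v_{-i}) = x_i AND y_i.
n, i = 8, 3
d_minus_i = [random.randint(0, 1) for _ in range(n - 1)]
for x_i in (0, 1):
    for y_i in (0, 1):
        assert embed(i, d_minus_i, x_i, y_i, disj) == (x_i & y_i)
```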
Lemma 2.2.
1. Π_i(d_{-i}) is private-coin;
2. Π_i(d_{-i}) has error ε in the worst case;
3. I(Π; X_iY_i | D_i, D_{-i} = d_{-i}) = I(Π_i(d_{-i}); X_iY_i | D_i).

Proof.
1. Note that once d_{-i} is fixed, each (X_j, Y_j) (j ∈ [n] - {i}) is a product distribution (over Alice's and Bob's spaces). Indeed, if d_j = 0, then (X_j, Y_j) ∼ 0 × U where U is the uniform distribution on {0,1}, and if d_j = 1, then (X_j, Y_j) ∼ U × 0. Hence in step 1 Alice can sample u_{-i} and Bob can sample v_{-i} using only their private coins.
2. Since μ puts weight only on 0-inputs of AND, every padded coordinate satisfies u_j ∧ v_j = 0, so Disj_n(x_i u_{-i}, y_i v_{-i}) = x_i ∧ y_i, and thus Π_i(d_{-i}) is correct on (x_i, y_i) iff Π is correct on (x_i u_{-i}, y_i v_{-i}). Therefore the error probability of Π_i(d_{-i}) on (x_i, y_i) is the average (over u_{-i} and v_{-i}) of the error probability of Π on (x_i u_{-i}, y_i v_{-i}). Since Π has error probability at most ε on all inputs, so does Π_i(d_{-i}) on all of its possible inputs (x_i, y_i).
3. When (X_i, Y_i, D_i) ∼ μ, the protocol Π_i(d_{-i}) does nothing but invoke Π on an input distributed according to μ^n conditioned on D_{-i} = d_{-i}, so the joint distribution of the transcript with (X_i, Y_i, D_i) is the same in both cases.  ∎

Now we have

    I(Π; XY|D) ≥ ∑_{i=1}^n E_{d_{-i}}[I(Π_i(d_{-i}); X_iY_i | D_i)].

Next we show that any worst-case ε-error private-coin protocol for g has to contain a constant amount of information about (X_i, Y_i) (conditioned on D_i).

Lemma 2.3. For any worst-case ε-error private-coin protocol for g with transcript Λ, where (X_i, Y_i, D_i) ∼ μ, we have I(Λ; X_iY_i | D_i) = Ω(1).

Proof. Denote by Λ_{ab} the distribution of the transcript on input (a, b). First, we show that if the protocol does not contain enough information, then Λ_{00} is close to both Λ_{01} and Λ_{10}, and thus Λ_{01} is close to Λ_{10}. Consider the Hellinger distance

    h(p, q)^2 = 1 - ∑_ω √(p_ω q_ω) = (1/2) ∑_ω (√p_ω - √q_ω)^2.

We have

    I(Λ; X_iY_i | D_i) = (1/2) I(Λ; X_iY_i | D_i = 0) + (1/2) I(Λ; X_iY_i | D_i = 1)
                       ≥ (1/2) h(Λ_{00}, Λ_{01})^2 + (1/2) h(Λ_{00}, Λ_{10})^2   // for a uniform bit B, I(B; Z_B) ≥ h(Z_0, Z_1)^2
                       ≥ (1/4) h(Λ_{10}, Λ_{01})^2                               // Cauchy-Schwarz and the triangle inequality

Then we show that any communication protocol enjoys the property that h(Λ_{ab}, Λ_{a'b'}) = h(Λ_{ab'}, Λ_{a'b}) for any a, b, a', b'. Thus Λ_{01} being close to Λ_{10} implies that Λ_{00} is close to Λ_{11}. The property is proven by writing down Pr[Λ_{xy} = γ] and Pr[Λ_{x'y'} = γ] for any fixed transcript γ. It is not hard to see that Pr[Λ_{xy} = γ] is a product of some function of x (and γ) and some function of y (and γ). So by switching the function for x and that for y', we get

    Pr[Λ_{xy} = γ] Pr[Λ_{x'y'} = γ] = Pr[Λ_{x'y} = γ] Pr[Λ_{xy'} = γ].

Then, by the multiplicative nature of the definition of h(p, q), one gets h(Λ_{xy}, Λ_{x'y'}) = h(Λ_{xy'}, Λ_{x'y}).

Now the contradiction comes: g(0, 0) ≠ g(1, 1), so Λ_{00} should be far from Λ_{11}. The best probability gap in distinguishing Λ_{00} from Λ_{11} is their total variation distance (1/2)‖Λ_{00} - Λ_{11}‖_1, which is upper bounded in terms of h by the following fact:

    (1/2)‖p - q‖_1 ≤ h(p, q) √(2 - h(p, q)^2).

Since the protocol errs with probability at most ε on both (0, 0) and (1, 1) while g(0, 0) ≠ g(1, 1), this gap is at least 1 - 2ε, which forces h(Λ_{00}, Λ_{11})^2 = Ω(1) for any constant ε < 1/2. By the property above, h(Λ_{01}, Λ_{10})^2 = h(Λ_{00}, Λ_{11})^2 = Ω(1), and therefore I(Λ; X_iY_i | D_i) = Ω(1).  ∎

Putting everything together, we get

    R_ε(Disj_n) = |Π| ≥ I(Π; XY|D)
                ≥ ∑_{i=1}^n E_{d_{-i}}[I(Π; X_iY_i | D_i, D_{-i} = d_{-i})]
                = ∑_{i=1}^n E_{d_{-i}}[I(Π_i(d_{-i}); X_iY_i | D_i)]   // Lemma 2.2
                = ∑_{i=1}^n Ω(1) = Ω(n).                               // Lemma 2.3  ∎

References

[BYJKS04] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 68(4), pp. 702-732, 2004.
[CSWY01] Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, and Andrew Yao. Informational complexity and the direct sum problem for simultaneous message complexity. In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science, pp. 270-278, 2001.
[CT06] Thomas Cover and Joy Thomas. Elements of Information Theory, Second Edition. Wiley-Interscience, 2006.