
Basics of information theory and information complexity: a tutorial
Mark Braverman, Princeton University
June 1, 2013

Part I: Information theory
• Information theory, in its modern form, was introduced in the 1940s to study the problem of transmitting data over physical channels.
[Diagram: Alice → communication channel → Bob]

Quantifying "information"
• Information is measured in bits.
• The basic notion is Shannon's entropy.
• The entropy of a random variable is the (typical) number of bits needed to remove the uncertainty of the variable.
• For a discrete variable: H(X) ≔ ∑_x Pr[X = x] log(1/Pr[X = x]).

Shannon's entropy
• Important examples and properties:
– If X = c is a constant, then H(X) = 0.
– If X is uniform on a finite set S of possible values, then H(X) = log|S|.
– If X is supported on at most s values, then H(X) ≤ log s.
– If Y is a random variable determined by X, then H(Y) ≤ H(X).

Conditional entropy
• For two (potentially correlated) variables X, Y, the conditional entropy of X given Y is the amount of uncertainty left in X once Y is known: H(X|Y) ≔ E_{y∼Y} H(X | Y = y).
• One can show H(XY) = H(Y) + H(X|Y).
• This important fact is known as the chain rule.
• If X ⊥ Y, then H(XY) = H(X) + H(Y|X) = H(X) + H(Y).

Example
• X = B1, B2, B3
• Y = B1⊕B2, B2⊕B4, B3⊕B4, B5, where B1, …, B5 are independent uniform bits in {0,1}.
• Then:
– H(X) = 3; H(Y) = 4; H(XY) = 5;
– H(X|Y) = 1 = H(XY) − H(Y);
– H(Y|X) = 2 = H(XY) − H(X).

Mutual information
• X = B1, B2, B3; Y = B1⊕B2, B2⊕B4, B3⊕B4, B5.
[Venn diagram showing H(X), H(X|Y), I(X;Y), H(Y|X), H(Y) for this example]

Mutual information
• The mutual information is defined as I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).
• "By how much does knowing X reduce the entropy of Y?"
• Always non-negative: I(X;Y) ≥ 0.
• Conditional mutual information: I(X;Y|Z) ≔ H(X|Z) − H(X|YZ).
• Chain rule for mutual information: I(XY;Z) = I(X;Z) + I(Y;Z|X).
• Simple intuitive interpretation.

Example – a biased coin
• A coin with an ε bias towards Heads or Tails is tossed several times.
• Let B ∈ {H, T} be the direction of the bias, and suppose that a priori both options are equally likely: H(B) = 1.
• How many tosses are needed to determine B?
• Let T1, …, Tn be the sequence of tosses.
• Start with n = 2.

What do we learn about B?
• I(B; T1 T2) = I(B; T1) + I(B; T2 | T1)
  = I(B; T1) + I(B T1; T2) − I(T1; T2)
  ≤ I(B; T1) + I(B T1; T2)
  = I(B; T1) + I(B; T2) + I(T1; T2 | B)
  = I(B; T1) + I(B; T2) = 2 · I(B; T1),
  where the last equality uses I(T1; T2 | B) = 0 (the tosses are independent given the bias).
• Similarly, I(B; T1 … Tn) ≤ n · I(B; T1).
• To determine B with constant accuracy, we need 0 < c < I(B; T1 … Tn) ≤ n · I(B; T1).
• Hence n = Ω(1/I(B; T1)).

Kullback–Leibler (KL) divergence
• A measure of distance between distributions on the same space (not symmetric, so not a metric).
• Plays a key role in information theory.
• D(P ∥ Q) ≔ ∑_x P[x] log(P[x]/Q[x]).
• D(P ∥ Q) ≥ 0, with equality when P = Q.
• Caution: D(P ∥ Q) ≠ D(Q ∥ P)!

Properties of KL divergence
• Connection to mutual information: I(X;Y) = E_{y∼Y} D(X|_{Y=y} ∥ X).
• If X ⊥ Y, then X|_{Y=y} = X, and both sides are 0.
• Pinsker's inequality: |P − Q|_1 = O(√(D(P ∥ Q))).
• Tight! D(1/2+ε ∥ 1/2) = Θ(ε²).

Back to the coin example
• I(B; T1) = E_{b∼B} D(T1|_{B=b} ∥ T1) = D(1/2+ε ∥ 1/2) = Θ(ε²).
• n = Ω(1/I(B; T1)) = Ω(1/ε²).
• "Follow the information learned from the coin tosses."
• Can be done using combinatorics, but the information-theoretic language is more natural for expressing what's going on.

Back to communication
• The reason information theory is so important for communication is that information-theoretic quantities readily operationalize.
• We can attach operational meaning to Shannon's entropy: H(X) ≈ "the cost of transmitting X".
• Let C(X) be the (expected) cost of transmitting a sample of X.

C(X) = H(X)?
• Not quite.
• Let X be a uniform trit: X ∈ {1, 2, 3}.
• The encoding 1 ↦ '0', 2 ↦ '10', 3 ↦ '11' gives C(X) = 5/3 ≈ 1.67.
• H(X) = log 3 ≈ 1.58.
• It is always the case that C(X) ≥ H(X).

But C(X) and H(X) are close
• Huffman's coding: C(X) ≤ H(X) + 1.
• This is a compression result: "an uninformative message turned into a short one".
• Therefore: H(X) ≤ C(X) ≤ H(X) + 1.
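As a quick sanity check of the trit example and the Huffman bound, here is a small Python sketch (illustrative only, not part of the original slides) that computes H(X) for a distribution and the expected codeword length of a Huffman code.

```python
import heapq
from math import log2

def entropy(p):
    """Shannon entropy H(X) in bits of a distribution given as {outcome: probability}."""
    return sum(q * log2(1 / q) for q in p.values() if q > 0)

def huffman_lengths(p):
    """Codeword lengths of a binary Huffman code for distribution p."""
    heap = [(q, i, {x: 0}) for i, (x, q) in enumerate(p.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        q1, _, d1 = heapq.heappop(heap)
        q2, _, d2 = heapq.heappop(heap)
        merged = {x: depth + 1 for x, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (q1 + q2, counter, merged))
        counter += 1
    return heap[0][2]

trit = {1: 1/3, 2: 1/3, 3: 1/3}
lengths = huffman_lengths(trit)   # one outcome gets a 1-bit codeword, the other two get 2 bits
cost = sum(trit[x] * lengths[x] for x in trit)
print(f"H(X) = {entropy(trit):.3f} bits")   # log 3 ≈ 1.585
print(f"C(X) = {cost:.3f} bits")            # 5/3 ≈ 1.667
assert entropy(trit) <= cost <= entropy(trit) + 1   # Huffman: H(X) <= C(X) <= H(X) + 1
```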
Shannon's noiseless coding
• The cost of communicating many copies of X scales as H(X).
• Shannon's source coding theorem:
– Let C(X^n) be the cost of transmitting n independent copies of X. Then the amortized transmission cost is lim_{n→∞} C(X^n)/n = H(X).
• This equation gives H(X) its operational meaning.
[Diagram: X1, …, Xn, … sent over a communication channel at H(X) bits per copy]

H(X) is nicer than C(X)
• H(X) is additive for independent variables.
• Let T1, T2 ∈ {1, 2, 3} be independent uniform trits.
• H(T1 T2) = log 9 = 2 · log 3.
• C(T1 T2) = 29/9 < C(T1) + C(T2) = 2 × 5/3 = 30/9.
• Works well with concepts such as channel capacity.

"Proof" of Shannon's noiseless coding
• n · H(X) = H(X^n) ≤ C(X^n) ≤ H(X^n) + 1.
  (The equality is additivity of entropy; the last inequality is compression, e.g. Huffman.)
• Therefore lim_{n→∞} C(X^n)/n = H(X).

Operationalizing other quantities
• Conditional entropy H(X|Y) (cf. the Slepian–Wolf theorem):
[Diagram: the sender holds X1, …, Xn, …, the receiver holds Y1, …, Yn, …; H(X|Y) bits per copy suffice to transmit the Xi's]
• Mutual information I(X;Y):
[Diagram: the sender holds X1, …, Xn, …; I(X;Y) bits per copy suffice for the receiver to sample the Yi's]

Information theory and entropy
• Allows us to formalize intuitive notions.
• Operationalized in the context of one-way transmission and related problems.
• Has nice properties (additivity, chain rule, …).
• Next, we discuss extensions to more interesting communication scenarios.

Communication complexity
• Focus on the two-party randomized setting.
• Alice holds X, Bob holds Y; using shared randomness R, A & B implement a functionality F(X, Y), e.g. F(X, Y) = "X = Y?".
[Diagram: Alice (X) and Bob (Y) communicate to output F(X, Y)]

Communication complexity
• Goal: implement a functionality F(X, Y).
• A protocol π(X, Y) computing F(X, Y) exchanges messages m1(X, R), m2(Y, m1, R), m3(X, m1, m2, R), …
• Communication cost = number of bits exchanged.

Communication complexity
• Numerous applications/potential applications (some will be discussed later today).
• Considerably more difficult to obtain lower bounds for than transmission (still much easier than other models of computation!).

Communication complexity
• (Distributional) communication complexity with input distribution μ and error ε: CC(F, μ, ε). Error ≤ ε w.r.t. μ.
• (Randomized/worst-case) communication complexity: R(F, ε). Error ≤ ε on all inputs.
• Yao's minimax: R(F, ε) = max_μ CC(F, μ, ε).

Examples
• X, Y ∈ {0,1}^n.
• Equality: EQ(X, Y) ≔ 1_{X=Y}.
• R(EQ, ε) ≈ log(1/ε).
• R(EQ, 0) ≈ n.

Equality
• F is "X = Y?".
• μ is a distribution where w.p. ½ X = Y and w.p. ½ (X, Y) are random.
• Protocol: Alice sends MD5(X) [128 bits]; Bob replies with "X = Y?" [1 bit].
• Error? Only if X ≠ Y and the hashes happen to collide.
• Shows that CC(EQ, μ, 2^(−129)) ≤ 129.

Examples
• X, Y ∈ {0,1}^n.
• Inner product: IP(X, Y) ≔ ∑_i X_i · Y_i (mod 2).
• R(IP, 0) = n − o(n).
• In fact, using information complexity: R(IP, ε) = n − o(n).

Information complexity
• Information complexity IC(F, ε) is to communication complexity R(F, ε) as Shannon's entropy H(X) is to transmission cost C(X).

Information complexity
• The smallest amount of information Alice and Bob need to exchange to solve F.
• How is information measured?
• Communication cost of a protocol? – Number of bits exchanged.
• Information cost of a protocol? – Amount of information revealed.

Basic definition 1: the information cost of a protocol
• Prior distribution: (X, Y) ∼ μ.
• The protocol π produces a transcript Π.
• IC(π, μ) = I(Π; Y | X) + I(Π; X | Y)
  = (what Alice learns about Y) + (what Bob learns about X).

Example
• F is "X = Y?".
• μ is a distribution where w.p. ½ X = Y and w.p. ½ (X, Y) are random.
• Protocol: Alice sends MD5(X) [128 bits]; Bob replies with "X = Y?" [1 bit].
• IC(π, μ) = I(Π; Y | X) + I(Π; X | Y) ≈ 1 + 64.5 = 65.5 bits
  = (what Alice learns about Y) + (what Bob learns about X).

The prior matters a lot for information cost!
• If μ = 1_{(x,y)} is a singleton distribution, then IC(π, μ) = 0.

Example
• F is "X = Y?".
• μ is a distribution where (X, Y) are just uniformly random.
• Protocol: Alice sends MD5(X) [128 bits]; Bob replies with "X = Y?" [1 bit].
• IC(π, μ) = I(Π; Y | X) + I(Π; X | Y) ≈ 0 + 128 = 128 bits
  = (what Alice learns about Y) + (what Bob learns about X).
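To make Definition 1 concrete, here is a hypothetical toy version of the hash-and-check Equality protocol (4-bit inputs and a deterministic 2-bit truncation hash instead of MD5; the protocol and parameters are my own, only the definition IC(π, μ) = I(Π;Y|X) + I(Π;X|Y) comes from the slides). Because the transcript is a function of (X, Y), each conditional mutual information reduces to a conditional entropy of the transcript, which we compute by exact enumeration.

```python
from math import log2
from collections import defaultdict

n, k = 4, 2                      # inputs are n-bit strings; Alice's hash keeps the top k bits

def transcript(x, y):
    """Toy deterministic protocol: Alice sends h(x) = top k bits of x, Bob replies whether x == y."""
    return (x >> (n - k), int(x == y))

# Prior mu: with prob 1/2, X = Y uniform; with prob 1/2, (X, Y) independent uniform.
mu = defaultdict(float)
for x in range(2 ** n):
    mu[(x, x)] += 0.5 / 2 ** n
    for y in range(2 ** n):
        mu[(x, y)] += 0.5 / 4 ** n

def H(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def cond_mi(side):
    """I(Pi; other input | own input): since Pi is determined by (X, Y), equals E[H(Pi | own input)]."""
    marg = defaultdict(float)
    for (x, y), p in mu.items():
        marg[(x, y)[side]] += p
    total = 0.0
    for z, pz in marg.items():
        cond = defaultdict(float)
        for (x, y), p in mu.items():
            if (x, y)[side] == z:
                cond[transcript(x, y)] += p / pz
        total += pz * H(cond)
    return total

alice_learns_about_Y = cond_mi(0)   # I(Pi; Y | X)
bob_learns_about_X = cond_mi(1)     # I(Pi; X | Y)
print(f"I(Pi;Y|X) = {alice_learns_about_Y:.3f} bits, I(Pi;X|Y) = {bob_learns_about_X:.3f} bits")
print(f"IC(pi, mu) = {alice_learns_about_Y + bob_learns_about_X:.3f} bits")
```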
Basic definition 2: information complexity
• Communication complexity: CC(F, μ, ε) ≔ min over protocols π computing F with error ≤ ε of the communication cost of π.
• Analogously: IC(F, μ, ε) ≔ inf over protocols π computing F with error ≤ ε of IC(π, μ). (Here an infimum is needed!)

Prior-free information complexity
• Using minimax we can get rid of the prior.
• For communication, we had: R(F, ε) = max_μ CC(F, μ, ε).
• For information: IC(F, ε) ≔ inf over protocols π computing F with error ≤ ε of max_μ IC(π, μ).

Ex: the information complexity of Equality
• What is IC(EQ, 0)?
• Consider the following protocol. Let A be a (publicly chosen) non-singular n × n matrix over GF(2) with rows A1, A2, …, and let X, Y ∈ {0,1}^n.
– Alice sends A1·X, Bob replies with A1·Y; Alice sends A2·X, Bob replies with A2·Y; …
– Continue for n steps, or until a disagreement is discovered.
– (Non-singularity guarantees that if all n inner products agree, then X = Y.)

Analysis (sketch)
• If X ≠ Y, the protocol will terminate in O(1) rounds on average, and thus reveal O(1) information.
• If X = Y… the players only learn the fact that X = Y (≤ 1 bit of information).
• Thus the protocol has O(1) information complexity for any prior μ.

Operationalizing IC: information equals amortized communication
• Recall [Shannon]: lim_{n→∞} C(X^n)/n = H(X).
• Turns out: lim_{n→∞} CC(F^n, μ^n, ε)/n = IC(F, μ, ε), for ε > 0. [Error ε allowed on each copy.]
• For ε = 0: lim_{n→∞} CC(F^n, μ^n, 0+)/n = IC(F, μ, 0).
• [Determining lim_{n→∞} CC(F^n, μ^n, 0)/n is an interesting open problem.]

Information = amortized communication
• lim_{n→∞} CC(F^n, μ^n, ε)/n = IC(F, μ, ε).
• Two directions: "≤" and "≥".
• (Compare with the noiseless coding "proof": n · H(X) = H(X^n) ≤ C(X^n) ≤ H(X^n) + 1, via additivity of entropy and compression (Huffman).)

The "≤" direction
• lim_{n→∞} CC(F^n, μ^n, ε)/n ≤ IC(F, μ, ε).
• Start with a protocol π solving F whose IC(π, μ) is close to IC(F, μ, ε).
• Show how to compress many copies of π into a protocol whose communication cost is close to its information cost.
• More on compression later.

The "≥" direction
• lim_{n→∞} CC(F^n, μ^n, ε)/n ≥ IC(F, μ, ε).
• Use the fact that CC(F^n, μ^n, ε) ≥ IC(F^n, μ^n, ε).
• Additivity of information complexity: IC(F^n, μ^n, ε) = n · IC(F, μ, ε).

Proof: additivity of information complexity
• Let T1(X1, Y1) and T2(X2, Y2) be two two-party tasks.
• E.g. "Solve F(X, Y) with error ≤ ε w.r.t. μ".
• Then IC(T1 × T2, μ1 × μ2) = IC(T1, μ1) + IC(T2, μ2).
• "≤" is easy.
• "≥" is the interesting direction.

IC(T1, μ1) + IC(T2, μ2) ≤ IC(T1 × T2, μ1 × μ2)
• Start from a protocol π for T1 × T2 with prior μ1 × μ2, whose information cost is I.
• Show how to construct two protocols π1 for T1 with prior μ1 and π2 for T2 with prior μ2, with information costs I1 and I2, respectively, such that I1 + I2 = I.

π1(X1, Y1):
• Publicly sample X2 ∼ μ2.
• Bob privately samples Y2 ∼ μ2 | X2.
• Run π(X1, X2, Y1, Y2).

π2(X2, Y2):
• Publicly sample Y1 ∼ μ1.
• Alice privately samples X1 ∼ μ1 | Y1.
• Run π(X1, X2, Y1, Y2).

Analysis – π1
• Alice learns about Y1: I(Π; Y1 | X1 X2).
• Bob learns about X1: I(Π; X1 | Y1 Y2 X2).
• I1 = I(Π; Y1 | X1 X2) + I(Π; X1 | Y1 Y2 X2).

Analysis – π2
• Alice learns about Y2: I(Π; Y2 | X1 X2 Y1).
• Bob learns about X2: I(Π; X2 | Y1 Y2).
• I2 = I(Π; Y2 | X1 X2 Y1) + I(Π; X2 | Y1 Y2).

Adding I1 and I2
I1 + I2 = I(Π; Y1 | X1 X2) + I(Π; X1 | Y1 Y2 X2) + I(Π; Y2 | X1 X2 Y1) + I(Π; X2 | Y1 Y2)
        = [I(Π; Y1 | X1 X2) + I(Π; Y2 | X1 X2 Y1)] + [I(Π; X2 | Y1 Y2) + I(Π; X1 | Y1 Y2 X2)]
        = I(Π; Y1 Y2 | X1 X2) + I(Π; X1 X2 | Y1 Y2)   (chain rule for mutual information)
        = I.

Summary
• Information complexity is additive.
• Operationalized via "information = amortized communication".
• lim_{n→∞} CC(F^n, μ^n, ε)/n = IC(F, μ, ε).
• Seems to be the "right" analogue of entropy for interactive computation.

Entropy vs. information complexity

                    Entropy                      IC
  Additive?         Yes                          Yes
  Operationalized   lim_{n→∞} C(X^n)/n           lim_{n→∞} CC(F^n, μ^n, ε)/n
  Compression?      Huffman: C(X) ≤ H(X) + 1     ???!

Can interactive communication be compressed?
• Is it true that CC(F, μ, ε) ≤ O(IC(F, μ, ε)) + O(1)?
• Less ambitiously: is CC(F, μ, ε) bounded by a polynomial in IC(F, μ, ε)?
• (Almost) equivalently: given a protocol π with IC(π, μ) = I, can Alice and Bob simulate π using roughly I bits of communication?
• Not known in general…
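As an illustration of the Equality protocol analyzed earlier in this part, here is a small Monte-Carlo sketch (my own; for simplicity it uses fresh independent random rows rather than a non-singular matrix, and an assumed input length n = 64). It checks that when X ≠ Y the exchange of random inner products over GF(2) stops after O(1) rounds on average, which is why the protocol reveals only O(1) information.

```python
import random

n = 64
trials = 10_000

def rounds_until_disagreement(x, y):
    """Compare A_i . X and A_i . Y (mod 2) for fresh random rows A_i, stopping at the
    first disagreement or after n rounds."""
    for r in range(1, n + 1):
        a = random.getrandbits(n)
        if bin(a & x).count("1") % 2 != bin(a & y).count("1") % 2:
            return r
    return n

total = 0
for _ in range(trials):
    x = random.getrandbits(n)
    y = random.getrandbits(n)
    while y == x:                      # force X != Y
        y = random.getrandbits(n)
    total += rounds_until_disagreement(x, y)

# For X != Y, each random row detects the difference with probability 1/2,
# so the expected number of rounds is about 2.
print(f"average rounds until disagreement: {total / trials:.2f}")
```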
Direct sum theorems
• Let F be any functionality.
• Let C(F) be the cost of implementing F.
• Let F^n be the functionality of implementing n independent copies of F.
• The direct sum problem: "Does C(F^n) ≈ n · C(F)?"
• In most cases it is obvious that C(F^n) ≤ n · C(F).

Direct sum – randomized communication complexity
• Is it true that CC(F^n, μ^n, ε) = Ω(n · CC(F, μ, ε))?
• Is it true that R(F^n, ε) = Ω(n · R(F, ε))?

Direct product – randomized communication complexity
• Direct sum: CC(F^n, μ^n, ε) = Ω(n · CC(F, μ, ε))?
• Direct product: CC(F^n, μ^n, 1 − (1 − ε)^n) = Ω(n · CC(F, μ, ε))?

Direct sum for randomized CC and interactive compression
• Direct sum: CC(F^n, μ^n, ε) = Ω(n · CC(F, μ, ε))?
• In the limit: n · IC(F, μ, ε) = Ω(n · CC(F, μ, ε))?
• Interactive compression: CC(F, μ, ε) = O(IC(F, μ, ε))?
• Same question!

The big picture
[Diagram relating four quantities: CC(F, μ, ε) ↔ IC(F, μ, ε) via "interactive compression?"; CC(F, μ, ε) ↔ CC(F^n, μ^n, ε)/n via "direct sum for communication?"; IC(F, μ, ε) ↔ IC(F^n, μ^n, ε)/n via "additivity (= direct sum) for information"; and the amortized quantities linked by "information = amortized communication"]

Current results for compression
A protocol that has C bits of communication, conveys I bits of information over prior μ, and works in r rounds can be simulated:
• Using Õ(I + r) bits of communication.
• Using Õ(√(I · C)) bits of communication.
• Using 2^{O(I)} bits of communication.
• If μ = μx × μy is a product distribution, then using O(I · polylog(C)) bits of communication.

Their direct sum counterparts
• CC(F^n, μ^n, ε) = Ω(n^{1/2} · CC(F, μ, ε)).
• R(F^n, ε) = Ω(n^{1/2} · R(F, ε)).
• For product distributions μ = μx × μy, CC(F^n, μ^n, ε) = Ω(n · CC(F, μ, ε)).
• When the number of rounds is bounded by r ≪ n, a direct sum theorem holds.

Direct product
• The best one can hope for is a statement of the type: CC(F^n, μ^n, 1 − 2^{−n}) = Ω(n · CC(F, μ, 1/3)).
• Can prove: CC(F^n, μ^n, 1 − 2^{−n}) = Ω(n^{1/2} · CC(F, μ, 1/3)).

Proof 2: compressing a one-round protocol
• Say Alice speaks: IC(π, μ) = I(M; X | Y), where M is Alice's message.
• Recall the connection to KL-divergence: I(M; X | Y) = E_{(x,y)∼μ} D(M_x ∥ M_y), where M_x is the distribution of the message given Alice's input x, and M_y is Bob's prior on the message given his input y.
• Bottom line:
– Alice has M_X; Bob has M_Y;
– Goal: sample from M_X using ∼ D(M_X ∥ M_Y) communication.

The dart board
• Interpret the public randomness as a sequence of random points (u_i, q_i) in U × [0,1], where U is the universe of all possible messages.
• The first point that falls under the histogram of M_X is distributed ∼ M_X.
[Diagram: darts (u_i, q_i) thrown uniformly at U × [0,1]; Alice keeps the first one below the graph of M_X]

Proof idea
• Sample from M_X using O(log(1/ε) + D(M_X ∥ M_Y)) communication, with statistical error ε.
• Alice finds the first dart under the histogram of M_X; the problem is to let Bob identify which dart it is.
• Bob maintains the candidate darts lying under 2·M_Y, then 4·M_Y, 8·M_Y, …; in each round Alice sends a few hash values h1(u), h2(u), … of her dart, which Bob uses to eliminate candidates.
[Diagrams: the shared darts against the histograms of M_X and M_Y, with hash values h1(u), h2(u), … singling out Alice's dart]

Analysis
• If M_X(u) ≈ 2^t · M_Y(u), then the protocol will reach round t of the doubling.
• There will be ≈ 2^t candidates.
• About t + log(1/ε) hashes suffice to narrow them down to one.
• The contribution of u to the cost: M_X(u) · (log(M_X(u)/M_Y(u)) + log(1/ε)).
• Summing over u gives D(M_X ∥ M_Y) + log(1/ε), since D(M_X ∥ M_Y) ≔ ∑_u M_X(u) log(M_X(u)/M_Y(u)).
• Done!
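The "first dart under the histogram" claim from the one-round compression proof above is easy to check numerically. Below is a small sketch (my own; the universe and the distribution M_X are made up for illustration): shared darts (u, q) are drawn uniformly from U × [0,1], Alice keeps the first one that falls under M_X, and the empirical distribution of the kept darts matches M_X. The communication in the actual protocol is then only about telling Bob which dart was kept, at a cost of roughly D(M_X ∥ M_Y) bits.

```python
import random
from collections import Counter

# A toy message universe and Alice's message distribution M_X over it.
universe = ["a", "b", "c", "d"]
M_X = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}

def first_dart_under_histogram(rng):
    """Throw darts (u, q) uniformly in U x [0,1]; return the first u with q < M_X(u)."""
    while True:
        u = rng.choice(universe)
        q = rng.random()
        if q < M_X[u]:
            return u

rng = random.Random(0)
trials = 100_000
samples = Counter(first_dart_under_histogram(rng) for _ in range(trials))
for u in universe:
    # The empirical frequency should be close to M_X(u): the first accepted dart is distributed ~ M_X.
    print(u, M_X[u], round(samples[u] / trials, 3))
```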
External information cost
• (X, Y) ∼ μ. Alice and Bob run the protocol π, producing a transcript Π; an external observer Charlie watches the transcript.
• IC^ext(π, μ) = I(Π; XY) = what Charlie learns about (X, Y).

Example
• F is "X = Y?".
• μ is a distribution where w.p. ½ X = Y and w.p. ½ (X, Y) are random.
• Protocol: Alice sends MD5(X); Bob replies with "X = Y?".
• IC^ext(π, μ) = I(Π; XY) ≈ 129 bits = what Charlie learns about (X, Y).

External information cost
• It is always the case that IC^ext(π, μ) ≥ IC(π, μ).
• If μ = μx × μy is a product distribution, then IC^ext(π, μ) = IC(π, μ).

External information complexity
• IC^ext(F, μ, ε) ≔ inf over protocols π computing F with error ≤ ε of IC^ext(π, μ).
• Can it be operationalized?

Operational meaning of IC^ext?
• Conjecture: zero-error communication scales like external information:
  lim_{n→∞} CC(F^n, μ^n, 0)/n = IC^ext(F, μ, 0)?
• Recall: lim_{n→∞} CC(F^n, μ^n, 0+)/n = IC(F, μ, 0).

Example – transmission with a strong prior
• X, Y ∈ {0, 1}.
• μ is such that Y is uniform in {0, 1}, and X = Y with a very high probability (say 1 − 1/√n).
• F(X, Y) = X is just the "transmit X" function.
• Clearly, π should just have Alice send X to Bob.
• IC(F, μ, 0) = IC(π, μ) = H(1/√n) = o(1).
• IC^ext(F, μ, 0) = IC^ext(π, μ) = 1.

Example – transmission with a strong prior (continued)
• IC(F, μ, 0) = H(1/√n) = o(1), while IC^ext(F, μ, 0) = 1.
• CC(F^n, μ^n, 0+) = o(n).
• CC(F^n, μ^n, 0) = Ω(n).
• Other examples, e.g. the two-bit AND function, fit into this picture.

Additional directions
• Information complexity
• Interactive coding
• Information theory in TCS

Interactive coding theory
• So far we have focused the discussion on noiseless coding.
• What if the channel has noise?
• [What kind of noise?]
• In the non-interactive case, each channel has a capacity C.

Channel capacity
• The amortized number of channel uses needed to send X over a noisy channel of capacity C is H(X)/C.
• This decouples the task from the channel!

Example: binary symmetric channel
• Each bit gets independently flipped with probability ε < 1/2.
• One-way capacity: 1 − H(ε).
[Diagram: BSC(ε) — each transmitted bit is received correctly with probability 1 − ε and flipped with probability ε]

Interactive channel capacity
• It is not clear that one can decouple the channel from the task in such a clean way.
• Capacity is much harder to calculate/reason about.
• Example: the binary symmetric channel BSC(ε).
• One-way capacity: 1 − H(ε).
• Interactive capacity (for simple pointer jumping, [Kol-Raz'13]): 1 − Θ(√(H(ε))).

Information theory in communication complexity and beyond
• A natural extension would be to multi-party communication complexity.
• Some success in the number-in-hand case.
• What about the number-on-forehead?
• Explicit bounds for k ≥ log n players would imply explicit ACC⁰ circuit lower bounds.

Naïve multi-party information cost
[Diagram: three players A, B, C in the number-on-forehead setting; each player sees the inputs on the other two players' foreheads]
• IC(π, μ) = I(Π; X | YZ) + I(Π; Y | XZ) + I(Π; Z | XY).

Naïve multi-party information cost
• IC(π, μ) = I(Π; X | YZ) + I(Π; Y | XZ) + I(Π; Z | XY).
• Doesn't seem to work.
• Secure multi-party computation [Ben-Or, Goldwasser, Wigderson] means that anything can be computed at near-zero information cost.
• Although, these constructions require the players to share private channels/randomness.

Communication and beyond…
• The rest of today:
– Data structures;
– Streaming;
– Distributed computing;
– Privacy.
• Exact communication complexity bounds.
• Extended formulations lower bounds.
• Parallel repetition?
• …

Thank You!

Open problem: computability of IC
• Given the truth table of F(X, Y), μ, and ε, compute IC(F, μ, ε).
• Via IC(F, μ, ε) = lim_{n→∞} CC(F^n, μ^n, ε)/n one can compute a sequence of upper bounds.
• But the rate of convergence as a function of n is unknown.

Open problem: computability of IC
• One can compute the r-round information complexity IC_r(F, μ, ε) of F.
• But the rate of convergence as a function of r is unknown.
• Conjecture: IC_r(F, μ, ε) − IC(F, μ, ε) = O_{F,μ,ε}(1/r²).
• This is the relationship for the two-bit AND function.
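Returning to the "transmission with a strong prior" example above: the gap between internal and external information cost is easy to verify numerically. The sketch below (my own; the concrete values of the disagreement probability δ are arbitrary) computes the internal cost H(δ) and external cost 1 of the "Alice sends X" protocol, and reuses the same binary-entropy function for the one-way BSC capacity 1 − H(ε) mentioned earlier.

```python
from math import log2

def h(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Strong-prior transmission: Y uniform in {0,1}, X = Y except with probability delta,
# and the protocol is "Alice sends X", so the transcript is Pi = X.
for delta in [0.25, 0.1, 0.01, 0.001]:
    internal = h(delta)   # I(Pi;X|Y) + I(Pi;Y|X) = H(X|Y) + 0 = H(delta) -> 0 as delta -> 0
    external = h(0.5)     # I(Pi;XY) = H(X) = 1 bit (X is still a uniform bit)
    print(f"delta={delta:<6} internal IC = {internal:.4f}   external IC = {external:.1f}")

# The same function gives the one-way capacity of the binary symmetric channel:
for eps in [0.0, 0.05, 0.11, 0.25]:
    print(f"BSC({eps}): capacity = {1 - h(eps):.3f} bits per channel use")
```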