Empirical distribution of good channel codes with non-vanishing error probability

Yury Polyanskiy and Sergio Verdú
Abstract
This paper studies several properties of channel codes that approach the fundamental limits of a given memoryless channel with a non-vanishing probability of error. The output distribution induced by an ǫ-capacity-achieving code is shown to be close in a strong sense to the capacity-achieving output distribution (for the DMC and the AWGN channel). Relying on the concentration-of-measure (isoperimetry) property enjoyed by the latter, it is shown that regular (Lipschitz) functions of channel outputs can be precisely estimated and turn out to be essentially non-random and independent of the code used. It is also shown that binary hypothesis testing between the output distribution of a good code and the capacity-achieving one cannot be performed with exponential reliability. Using related methods it is shown that quadratic forms and sums of q-th powers, when evaluated at codewords of good AWGN codes, approach the values obtained from a randomly generated Gaussian codeword. The random process produced at the output of the channel is shown to satisfy the asymptotic equipartition property (for the DMC and the AWGN channel).
Index Terms
Shannon theory, discrete memoryless channels, additive white Gaussian noise, information measures,
relative entropy, empirical output statistics, asymptotic equipartition property, concentration of measure,
optimal transportation
Y. Polyanskiy is with the Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA 02139 USA, e-mail: yp@mit.edu. S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA, e-mail: verdu@princeton.edu.
The work was supported in part by the National Science Foundation (NSF) under Grant CCF-1016625 and by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under Grant CCF-0939370. Parts of this work were presented at the 49th and 50th Allerton Conferences on Communication, Control, and Computing, Allerton Retreat Center, IL, 2011-2012.
I. INTRODUCTION
A reliable channel code is a collection of waveforms of fixed duration distinguishable with small probability of error when observed through a noisy channel. Such a code is optimal (extremal) if it possesses the maximal cardinality among all codebooks of equal duration and probability of error. In this paper we characterize several properties of optimal and close-to-optimal channel codes indirectly, i.e., without identifying the best code explicitly. This characterization provides theoretical insight and ultimately may facilitate the search for new good codes.
Shannon [1] was the first to recognize that to maximize information transfer across a memoryless channel the input (output) waveforms must be "noise-like", i.e., resemble a typical sample of a memoryless random process with marginal distribution $P^*_X$ ($P^*_Y$), a capacity-achieving input (output) distribution. The most general formal statement of this property of optimal codes has been put forward in [2], which showed that a capacity-achieving sequence of codes with vanishing probability of error must satisfy [2, Theorem 2]
\frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 ,   (1)
where $P_{Y^n}$ denotes the output distribution induced by the codebook (assuming equiprobable codewords) and
P^*_{Y^n} = P^*_Y \times \cdots \times P^*_Y   (2)
is the $n$-th power of the single-letter capacity-achieving output distribution $P^*_Y$, and $D(\cdot\|\cdot)$ is the relative entropy. Furthermore, [2] shows that (under regularity conditions) the empirical frequency of input letters (or sequential $k$-letter blocks) inside the codebook approaches the capacity-achieving input distribution (or its $k$-th power) in the sense of vanishing relative entropy. Note that (1) establishes a strong sense in which $P_{Y^n}$ is close to the memoryless $P^*_{Y^n}$.
In this paper we extend the result in [2] to the case of non-vanishing probability of error.
Studying this regime as opposed to vanishing probability of error has recently proved to be
fruitful for the non-asymptotic characterization of the maximal achievable rate [3]. Although
for the memoryless channels considered in this paper the ǫ-capacity Cǫ is independent of the
probability of error ǫ, it does not immediately follow that a Cǫ -achieving code necessarily satisfies
the empirical distribution property (1). In fact, somewhat surprisingly, we will show that (1) only
holds under the maximal probability of error criterion.
To illustrate the delicacy of the question of approximating $P_{Y^n}$ with $P^*_{Y^n}$, consider a good, capacity-achieving $k$-to-$n$ code for the binary symmetric channel (BSC) with crossover probability $\delta$ and capacity $C$. First, consider the set of all codewords; its probability under $P_{Y^n}$ is (approximately) $(1-\delta)^n$, while under $P^*_{Y^n}$ it is $2^{k-n}$ – the two being exponentially very different (since for a reliable code $k < n \log 2(1-\delta)$). On the other hand, consider a set $E$ consisting of a union of small Hamming balls surrounding each codeword, whose radius $\approx \delta n$ is chosen such that $P_{Y^n}[E] = \frac{1}{2}$, say. Assuming that the code is decodable with small probability of error, the union will be almost disjoint and hence $P^*_{Y^n}[E] \approx 2^{k-nC}$ – the two becoming exponentially comparable (provided $k \approx nC$). Thus, although $P_{Y^n}$ and $P^*_{Y^n}$ differ dramatically in "small" details, on a "larger" scale they look the same. One of the goals of this paper is to make this intuition precise. We will show that the closer the code approaches the fundamental limits, the better $P_{Y^n}$ approximates $P^*_{Y^n}$.
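To make the two scales concrete, the following short numerical sketch (an illustration only, not part of the paper's argument; the rate and crossover probability are arbitrary choices, assuming Python with the standard library) evaluates the exponents discussed above for a hypothetical rate-0.49 code on BSC(0.11):

    import math

    def h2(p):  # binary entropy in bits
        return -p*math.log2(p) - (1-p)*math.log2(1-p)

    delta, n = 0.11, 1000
    C = 1 - h2(delta)             # capacity of BSC(delta), about 0.5 bits
    k = int(0.49 * n)             # hypothetical reliable code: rate 0.49 < log2(2(1-delta))

    # Probability of the codeword set itself under the two measures (log2 scale)
    log2_P_code     = n * math.log2(1 - delta)   # under P_{Y^n}: ~ (1-delta)^n
    log2_Pstar_code = k - n                      # under P*_{Y^n}: 2^{k-n}
    print(log2_P_code, log2_Pstar_code)          # exponentially different exponents

    # Union of Hamming balls of radius ~ delta*n around the 2^k codewords:
    # each ball has P*-mass ~ 2^{n h(delta) - n}, so the union has ~ 2^{k - nC}
    log2_Pstar_balls = k - n * C
    print(log2_Pstar_balls)                      # close to log2(1/2) only when k ~ nC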
Studying the output distribution $P_{Y^n}$ also becomes important in the context of secure communication, where the output due to the code is required to resemble white noise, and in the problem of asynchronous communication, where the output statistics of the code impose limits on the quality of synchronization [4]. For example, in a multi-terminal communication problem, the channel output of one user may create interference for another. Assessing the average impairment caused by such interference involves the analysis of the expectation of a certain function of the channel output, $E[F(Y^n)]$. We show that under certain regularity assumptions on $F$, not only can one approximate the expectation of $F$ by substituting the unknown $P_{Y^n}$ with $P^*_{Y^n}$, as in
\int F(y^n)\, dP_{Y^n} \approx \int F(y^n)\, dP^*_{Y^n} ,   (3)
but one can also prove that in fact the distribution of F (Y n ) will be tightly concentrated around
its expectation. Thus, we are able to predict with overwhelming probability the random value
of F (Y n ) without any knowledge of the code used to produce Y n (but assuming the code is
ǫ-capacity-achieving).
Besides (1) and (3) we will show that
1) the hypothesis testing problem between $P_{Y^n}$ and $P^*_{Y^n}$ has zero Stein's exponent;
2) a convenient inequality holds for the conditional relative entropy of the channel output in terms of the cardinality of the employed code;
3) codewords of good codes for the additive white Gaussian noise (AWGN) channel become more and more isotropically distributed (in the sense of evaluating quadratic forms) and resemble white Gaussian noise (in the sense of $\ell_q$ norms) as the code approaches the fundamental limits;
4) the output process $Y^n$ enjoys an asymptotic equipartition property.
Throughout the paper we will observe repeated connections with concentration of measure (isoperimetry) and optimal transportation, which were introduced into information theory by the seminal works [5]–[7]. Although some key results are stated for general channels, most of the discussion is specialized to discrete memoryless channels (DMCs) (possibly with a (separable) input cost constraint) and to the AWGN channel.
The organization of the paper is as follows. Section II contains the main definitions and notation. Section III proves a sharp upper bound on the relative entropy $D(P_{Y^n} \| P^*_{Y^n})$. In Section IV we discuss various implications of the bounds on relative entropy and in particular prove approximation (3). Section V considers the hypothesis testing problem of discriminating between $P_{Y^n}$ and $P^*_{Y^n}$. The asymptotic equipartition property of the channel output process is established in Section VI. Section VII discusses results for quadratic forms and $\ell_q$ norms of the codewords of good Gaussian codes. Finally, Section VIII outlines extensions to other channels.
II. DEFINITIONS AND NOTATION
A random transformation $P_{Y|X} : \mathcal{X} \to \mathcal{Y}$ is a Markov kernel acting between a pair of measurable spaces. An $(M, \epsilon)_{\rm avg}$ code for the random transformation $P_{Y|X}$ is a pair of random transformations $f : \{1, \ldots, M\} \to \mathcal{X}$ and $g : \mathcal{Y} \to \{1, \ldots, M\}$ such that
P[\hat{W} \neq W] \le \epsilon ,   (4)
where in the underlying probability space $X = f(W)$ and $\hat{W} = g(Y)$ with $W$ equiprobable on $\{1, \ldots, M\}$, and $W, X, Y, \hat{W}$ forming a Markov chain:
W \xrightarrow{\ f\ } X \xrightarrow{\ P_{Y|X}\ } Y \xrightarrow{\ g\ } \hat{W} .   (5)
In particular, we say that $P_X$ (resp., $P_Y$) is the input (resp., output) distribution induced by the encoder $f$. An $(M, \epsilon)_{\rm max}$ code is defined similarly except that (4) is replaced with the more stringent maximal probability of error criterion:
\max_{1 \le j \le M} P[\hat{W} \neq W \mid W = j] \le \epsilon .   (6)
A code is called deterministic, denoted $(M, \epsilon)_{\rm det}$, if the encoder $f$ is a functional (non-random) mapping.
A channel is a sequence of random transformations $\{P_{Y^n|X^n}, n = 1, \ldots\}$ indexed by the parameter $n$, referred to as the blocklength. An $(M, \epsilon)$ code for the $n$-th random transformation is called an $(n, M, \epsilon)$ code. The non-asymptotic fundamental limit of communication is defined as¹
M^*(n, \epsilon) = \max\{M : \exists\, (n, M, \epsilon)\text{-code}\} .   (7)
To the three types of channels considered below we also associate a special sequence of output distributions $P^*_{Y^n}$, defined as the $n$-th power of a certain single-letter $P^*_Y$:²
P^*_{Y^n} \triangleq (P^*_Y)^n ,   (8)
where $P^*_Y$ is defined as follows:
1) A DMC (without feedback) is built from a single-letter transformation $P_{Y|X} : \mathcal{X} \to \mathcal{Y}$ acting between finite spaces by extending the latter to all $n \ge 1$ in a memoryless way. Namely, the input space of the $n$-th random transformation $P_{Y^n|X^n}$ is given by³
\mathcal{X}_n = \mathcal{X}^n = \underbrace{\mathcal{X} \times \cdots \times \mathcal{X}}_{n \text{ times}}   (9)
and similarly for the output space $\mathcal{Y}^n = \mathcal{Y} \times \cdots \times \mathcal{Y}$, while the transition kernel is set to be
P_{Y^n|X^n}(y^n|x^n) = \prod_{j=1}^{n} P_{Y|X}(y_j|x_j) .   (10)
The capacity $C$ and $P^*_Y$, the unique capacity-achieving output distribution (caod), are found by solving
C = \max_{P_X} I(X; Y) .   (11)
2) A DMC with input constraint $(c, P)$ is a generalization of the previous construction with an additional restriction on the input space $\mathcal{X}_n$:
\mathcal{X}_n = \left\{ x^n \in \mathcal{X}^n : \sum_{j=1}^{n} c(x_j) \le nP \right\} .   (12)
¹ Additionally, one should also specify which probability of error criterion, (4) or (6), is used.
² For general channels, the sequence $\{P^*_{Y^n}\}$ is required to satisfy a quasi-caod property, see [8, Section IV].
³ To unify notation we denote the input space as $\mathcal{X}_n$ (instead of the more natural $\mathcal{X}^n$) even in the absence of cost constraints.
In this case the capacity $C$ and the caod $P^*_Y$ are found from a restricted maximization
C = \max_{P_X} I(X; Y) ,   (13)
where $P_X$ is any distribution such that
E[c(X)] \le P .   (14)
3) The $AWGN(P)$ channel has an input space⁴
\mathcal{X}_n = \left\{ x \in \mathbb{R}^n : \|x\|_2 \le \sqrt{nP} \right\} ,   (15)
the output space $\mathcal{Y}^n = \mathbb{R}^n$ and the transition kernel
P_{Y^n|X^n=x} = \mathcal{N}(x, I_n) ,   (16)
where $\mathcal{N}(x, \Sigma)$ denotes a (multidimensional) normal distribution with mean $x$ and covariance matrix $\Sigma$, and $I_n$ is the $n \times n$ identity matrix. Then⁵
C = \frac{1}{2} \log(1 + P) ,   (17)
P^*_Y = \mathcal{N}(0, 1 + P) .   (18)
As shown in [9], [10], in all three cases $P^*_Y$ is unique and $P^*_{Y^n}$ satisfies the key property
D(P_{Y^n|X^n=x} \| P^*_{Y^n}) \le nC   (19)
for all $x \in \mathcal{X}_n$. Property (19) implies that for every input distribution $P_{X^n}$ the induced output distribution $P_{Y^n}$ satisfies
P_{Y^n} \ll P^*_{Y^n} ,   (20)
P_{Y^n|X^n=x^n} \ll P^*_{Y^n} , \qquad \forall x^n \in \mathcal{X}_n ,   (21)
D(P_{Y^n} \| P^*_{Y^n}) \le nC - I(X^n; Y^n) .   (22)
As a consequence of (21) the information density is well defined:
\imath^*_{X^n;Y^n}(x^n; y^n) \triangleq \log \frac{dP_{Y^n|X^n=x^n}}{dP^*_{Y^n}}(y^n) .   (23)
⁴ For convenience we denote the elements of $\mathbb{R}^n$ as $x, y$ (for non-random vectors) and $X^n, Y^n$ (for the random vectors).
⁵ As usual, all logarithms $\log$ and exponents $\exp$ are taken to an arbitrary fixed base, which also specifies the information units.
Moreover, for every channel considered here there is a constant $a_1 > 0$ such that⁶
\sup_{x^n \in \mathcal{X}_n} \mathrm{Var}\left[ \imath^*_{X^n;Y^n}(X^n; Y^n) \mid X^n = x^n \right] \le n a_1 .   (24)
⁶ For discrete channels (24) is shown, e.g., in [3, Appendix E].
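For the DMC the quantities introduced above are readily computable. The following sketch (illustrative only, assuming NumPy; the channel matrix is an arbitrary choice, not from the paper) computes $C$ and the caod $P^*_Y$ of (11) by the Blahut–Arimoto algorithm and then checks the single-letter form of the key property (19), $D(P_{Y|X=x} \| P^*_Y) \le C$:

    import numpy as np

    def blahut_arimoto(W, iters=2000):
        """W[x, y] = P_{Y|X}(y|x). Returns (capacity in bits, caod P*_Y)."""
        m = W.shape[0]
        p = np.full(m, 1.0 / m)                 # input distribution iterate
        for _ in range(iters):
            q = p @ W                           # induced output distribution
            d = np.sum(W * np.log2(W / q), axis=1)   # D(P_{Y|X=x} || q) per x
            p = p * np.exp2(d)
            p /= p.sum()
        q = p @ W
        C = float(p @ np.sum(W * np.log2(W / q), axis=1))
        return C, q

    # an arbitrary 3-input, 3-output DMC with a strictly positive matrix
    W = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
    C, q_star = blahut_arimoto(W)
    div = np.sum(W * np.log2(W / q_star), axis=1)   # D(P_{Y|X=x} || P*_Y) for each x
    print(C, q_star)
    print(div)    # every entry is <= C, with equality on the support of the optimal input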
In all three cases, the $\epsilon$-capacity $C_\epsilon$ equals $C$ for all $0 < \epsilon < 1$, i.e.,
\log M^*(n, \epsilon) = nC + o(n) , \qquad n \to \infty .   (25)
In fact, see [3]:⁷
\log M^*(n, \epsilon) = nC - \sqrt{nV}\, Q^{-1}(\epsilon) + O(\log n) , \qquad n \to \infty ,   (27)
for any $0 < \epsilon < \frac{1}{2}$ and a certain constant $V \ge 0$, the channel dispersion.
⁷ $Q^{-1}$ is the inverse of the standard Q-function:
Q(x) = \int_x^\infty \frac{e^{-y^2/2}}{\sqrt{2\pi}}\, dy .   (26)
In this paper we consider the following increasing degrees of optimality for sequences of $(n, M_n, \epsilon)$ codes:
1) A code sequence is called $\epsilon$-capacity-achieving, or $o(n)$-achieving, if
\frac{1}{n} \log M_n \to C .   (28)
2) A code sequence is called $O(\sqrt{n})$-achieving if
\log M_n = nC + O(\sqrt{n}) .   (29)
3) A code sequence is called capacity-dispersion achieving, or $o(\sqrt{n})$-achieving, if
\log M_n = nC - \sqrt{nV}\, Q^{-1}(\epsilon) + o(\sqrt{n}) .   (30)
4) A code sequence is called $O(\log n)$-achieving if
\log M_n = nC - \sqrt{nV}\, Q^{-1}(\epsilon) + O(\log n) .   (31)
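As a numerical illustration of these benchmarks (a sketch with assumptions, not taken from the paper: for the BSC($\delta$) the dispersion is the standard expression $V = \delta(1-\delta)\log_2^2\frac{1-\delta}{\delta}$ bits², and only the standard library is assumed), the two leading terms of (27) can be evaluated as follows:

    import math
    from statistics import NormalDist

    def h2(p):
        return -p*math.log2(p) - (1-p)*math.log2(1-p)

    def normal_approx_logM(n, eps, delta):
        """First two terms of (27) for BSC(delta), in bits: nC - sqrt(nV) Q^{-1}(eps)."""
        C = 1 - h2(delta)
        V = delta*(1-delta) * (math.log2((1-delta)/delta))**2   # BSC dispersion
        Qinv = NormalDist().inv_cdf(1 - eps)                    # Q^{-1}(eps)
        return n*C - math.sqrt(n*V)*Qinv

    for n in (200, 1000, 5000):
        print(n, normal_approx_logM(n, eps=1e-3, delta=0.11))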
We also need to introduce the performance of an optimal binary hypothesis test, which is one of the main tools in [3]. Consider an $\mathcal{A}$-valued random variable $B$ which can take probability measures $P$ or $Q$. A randomized test between those two distributions is defined by a random transformation $P_{Z|B} : \mathcal{A} \to \{0, 1\}$, where $0$ indicates that the test chooses $Q$. The best performance achievable among those randomized tests is given by⁸
\beta_\alpha(P, Q) = \min \sum_{a \in \mathcal{A}} Q(a) P_{Z|B}(1|a) ,   (32)
where the minimum is over all probability distributions $P_{Z|B}$ satisfying
\sum_{a \in \mathcal{A}} P(a) P_{Z|B}(1|a) \ge \alpha .   (33)
The minimum in (32) is guaranteed to be achieved by the Neyman-Pearson lemma. Thus, $\beta_\alpha(P, Q)$ gives the minimum probability of error under hypothesis $Q$ if the probability of error under hypothesis $P$ is no larger than $1 - \alpha$.
⁸ We sometimes write summations over alphabets for simplicity of exposition. For arbitrary measurable spaces $\beta_\alpha(P, Q)$ is defined by replacing the summation in (32) by an expectation.
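Since $\beta_\alpha$ is used throughout, the following sketch (an illustration only, assuming NumPy; the distributions are arbitrary) computes it for finite alphabets directly from the Neyman-Pearson lemma, by including outcomes in decreasing order of likelihood ratio and randomizing at the boundary:

    import numpy as np

    def beta_alpha(alpha, P, Q):
        """Neyman-Pearson beta_alpha(P, Q) for distributions on a finite alphabet."""
        P, Q = np.asarray(P, float), np.asarray(Q, float)
        # sort outcomes by decreasing likelihood ratio P/Q (Q = 0 outcomes first)
        lr = np.where(Q > 0, P / np.where(Q > 0, Q, 1), np.inf)
        order = np.argsort(-lr)
        beta, mass = 0.0, 0.0
        for a in order:
            if mass + P[a] < alpha:
                mass += P[a]
                beta += Q[a]
            else:
                # randomize on the boundary outcome to hit P-mass exactly alpha
                frac = 0.0 if P[a] == 0 else (alpha - mass) / P[a]
                beta += frac * Q[a]
                return beta
        return beta

    P = [0.5, 0.3, 0.2]
    Q = [0.2, 0.3, 0.5]
    print(beta_alpha(0.9, P, Q))   # minimal Q-probability of deciding "P" given P-power >= 0.9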
III. UPPER BOUND ON THE OUTPUT RELATIVE ENTROPY
The main goal of this section is to establish (for each of the three types of channels introduced)
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M_n + o(n) ,   (34)
where $P_{Y^n}$ is the sequence of output distributions induced by a sequence of $(n, M_n, \epsilon)$ codes. In fact we show that $o(n) = O(\sqrt{n})$ for all channels except DMCs with zeros in the transition matrix $P_{Y|X}$.
We start by showing an inequality tying the relative entropy of the output distribution to the size of the code (Section III-A); then we prove (34) for the case of the DMC (Sections III-B and III-C) and the AWGN channel (Section III-D).
A. Augustin's strong converse
The following result is a generalization of the strong converse of Wolfowitz [11] to the case of a variable threshold. It first appeared as part of the proofs in [12, Satz 7.3 and 8.2] by Augustin and was formally stated in [13, Section 2].
Theorem 1 ([12], [13]): Consider a random transformation $P_{Y|X}$, a distribution $P_X$ induced by an $(M, \epsilon)_{\rm max,det}$ code, an auxiliary output distribution $Q_Y$ and a function $\rho : \mathcal{X} \to \mathbb{R}$. Then, provided the denominator is positive,
M \le \frac{\exp\{E[\rho(X)]\}}{\inf_x P_{Y|X=x}\left[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \le \rho(x) \right] - \epsilon} ,   (35)
with the infimum taken over the support of $P_X$.
Proof: Fix a code, $Q_Y$, and $\rho$. Denoting by $c_i$ the $i$-th codeword, we have
Q_Y[\hat{W}(Y) = i] \ge \beta_{1-\epsilon}(P_{Y|X=c_i}, Q_Y) , \qquad i = 1, \ldots, M ,   (36)
since $\{\hat{W}(Y) = i\}$ is a suboptimal test to decide between $P_{Y|X=c_i}$ and $Q_Y$, which achieves error probability no larger than $\epsilon$ when $P_{Y|X=c_i}$ is true. Therefore,
\frac{1}{M} \ge \frac{1}{M} \sum_{i=1}^{M} \beta_{1-\epsilon}(P_{Y|X=c_i}, Q_Y)   (37)
\ge \frac{1}{M} \sum_{i=1}^{M} \left( P_{Y|X=c_i}\left[ \log \frac{dP_{Y|X=c_i}}{dQ_Y}(Y) \le \rho(c_i) \right] - \epsilon \right) \exp\{-\rho(c_i)\}   (38)
\ge \left( \inf_x P_{Y|X=x}\left[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \le \rho(x) \right] - \epsilon \right) \frac{1}{M} \sum_{i=1}^{M} \exp\{-\rho(c_i)\}   (39)
\ge \left( \inf_x P_{Y|X=x}\left[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \le \rho(x) \right] - \epsilon \right) \exp\{-E[\rho(X)]\} ,   (40)
where (37) is by taking the arithmetic average of (36) over $i$, (40) is by Jensen's inequality, and (38) is by the standard estimate of $\beta_\alpha$, e.g. [3, (102)],
\beta_{1-\epsilon}(P, Q) \ge \left( P\left[ \log \frac{dP}{dQ}(Z) \le \rho \right] - \epsilon \right) \exp\{-\rho\} ,   (41)
with $Z$ distributed according to $P$.
Remark 1: Unlike many other converse bounds, see [14, Section 2.7.3], we were unable to reduce the proof of Theorem 1 to a binary hypothesis testing argument.
Remark 2: Following the idea of Poor and Verdú [15] we may further strengthen Theorem 1 in the special case of $Q_Y = P_Y$: the maximal probability of error $\epsilon$ for any test of $M$ hypotheses $\{P_j, j = 1, \ldots, M\}$ satisfies
\epsilon \ge \left( 1 - \frac{\exp\{\bar{\rho}\}}{M} \right) \inf_{1 \le j \le M} P[\imath_{W;Y}(W; Y) \le \rho_j \mid W = j] ,   (42)
where $\rho_j \in \mathbb{R}$ are arbitrary, $\bar{\rho} = \frac{1}{M} \sum_{j=1}^{M} \rho_j$, $W$ is equiprobable on $\{1, \ldots, M\}$ and $P_{Y|W=j} = P_j$. Indeed, since $\imath_{W;Y}(a; b) \le \log M$ we get from [14, Lemma 35]
\exp\{\rho_j\} \beta_{1-\epsilon}(P_{Y|W=j}, P_Y) + \left( 1 - \frac{\exp\{\rho_j\}}{M} \right) P[\imath_{W;Y}(W; Y) > \rho_j \mid W = j] \ge 1 - \epsilon .   (43)
Multiplying by $\exp\{-\rho_j\}$ and using the resulting bound in place of (41), we repeat steps (37)-(40) to obtain
\frac{1}{M} \inf_{1 \le j \le M} P[\imath_{W;Y}(W; Y) \le \rho_j \mid W = j] \ge \left( \inf_{1 \le j \le M} P[\imath_{W;Y}(W; Y) \le \rho_j \mid W = j] - \epsilon \right) \exp\{-\bar{\rho}\} ,   (44)
which in turn is equivalent to (42).
Choosing $\rho(x) = D(P_{Y|X=x} \| Q_Y) + \Delta$ we can specialize Theorem 1 in the following convenient form:
Theorem 2: Consider a random transformation $P_{Y|X}$, a distribution $P_X$ induced by an $(M, \epsilon)_{\rm max,det}$ code and an auxiliary output distribution $Q_Y$. Assume that for all $x \in \mathcal{X}$ we have
d(x) \triangleq D(P_{Y|X=x} \| Q_Y) < \infty   (45)
and
\sup_x P_{Y|X=x}\left[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \ge d(x) + \Delta \right] \le \delta' ,   (46)
for some pair of constants $\Delta \ge 0$ and $0 \le \delta' < 1 - \epsilon$. Then we have
D(P_{Y|X} \| Q_Y | P_X) \ge \log M - \Delta + \log(1 - \epsilon - \delta') .   (47)
Remark 3: Note that (46) holding with a small $\delta'$ is a natural non-asymptotic embodiment of information stability of the underlying channel, cf. [8, Section IV].
One way to estimate the upper deviations in (46) is by using Chebyshev's inequality. As an example, we obtain
Corollary 3: If in the conditions of Theorem 2 we replace (46) with
\sup_x \mathrm{Var}\left[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \,\Big|\, X = x \right] \le S_m   (48)
for some constant $S_m \ge 0$, then we have
D(P_{Y|X} \| Q_Y | P_X) \ge \log M - \sqrt{\frac{2 S_m}{1-\epsilon}} + \log \frac{1-\epsilon}{2} .   (49)
Notice that when $Q_Y$ is chosen to be a product distribution, such as $P^*_{Y^n}$, $\log \frac{dP_{Y|X=x}}{dQ_Y}$ becomes a sum of independent random variables. In particular, (24) leads to an equivalent restatement of (1):
Theorem 4: Consider a memoryless channel belonging to one of the three classes in Section II. Then for any $0 < \epsilon < 1$ and any sequence of $(n, M_n, \epsilon)_{\rm max,det}$ capacity-achieving codes we have
\frac{1}{n} I(X^n; Y^n) \to C \iff \frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 ,   (50)
where $X^n$ is the output of the encoder.
Proof: The direction $\Rightarrow$ is trivial from property (22) of $P^*_{Y^n}$. For the direction $\Leftarrow$ we only need to lower-bound $I(X^n; Y^n)$, since $I(X^n; Y^n) \le nC$. To that end, we have from (24) and Corollary 3:
D(P_{Y^n|X^n} \| P^*_{Y^n} | P_{X^n}) \ge \log M_n + O(\sqrt{n}) .   (51)
Then the conclusion follows from (28) and the following identity applied with $Q_{Y^n} = P^*_{Y^n}$:
I(X^n; Y^n) = D(P_{Y^n|X^n} \| Q_{Y^n} | P_{X^n}) - D(P_{Y^n} \| Q_{Y^n}) ,   (52)
which holds for all $Q_{Y^n}$ such that the unconditional relative entropy is finite.
We remark that Theorem 4 can also be derived from a simple extension of the Wolfowitz
converse [11] to an arbitrary output distribution QY n , e.g. [14, Theorem 10], and then choosing
QY n = P Y∗ n . Note that Theorem 4 implies the result in [2] since the capacity-achieving codes
with vanishing error probability are a subclass of those considered in Theorem 4.
B. DMC with $C_1 < \infty$
For a given DMC denote the parameter introduced by Burnashev [16]
C_1 = \max_{a, a'} D(P_{Y|X=a} \| P_{Y|X=a'}) .   (53)
Note that $C_1 < \infty$ if and only if the transition matrix does not contain any zeros. In this section we show (34) for a (regular) class of DMCs with $C_1 < \infty$ by an application of the main inequality (47). We also demonstrate that (1) may not hold for codes with non-deterministic encoders or unconstrained maximal probability of error.
Theorem 5: Consider a DMC $P_{Y|X}$ with capacity $C > 0$ (with or without an input constraint). Then for any $0 < \epsilon < 1$ there exists a constant $a = a(\epsilon) > 0$ such that any $(n, M_n, \epsilon)_{\rm max,det}$ code satisfies
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M_n + a\sqrt{n} ,   (54)
where $P_{Y^n}$ is the output distribution induced by the code. In particular, for any capacity-achieving sequence of such codes we have
\frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 .   (55)
Remark 4: If $P^*_Y$ is equiprobable on $\mathcal{Y}$ (such as for some symmetric channels), (55) is equivalent to
H(Y^n) = n \log |\mathcal{Y}| + o(n) .   (56)
In any case, as we will see in Section IV-A, (55) always implies
H(Y^n) = n H(Y^*) + o(n)   (57)
(by (133) applied to $f(y) = \log P^*_Y(y)$). Note also that traditional combinatorial methods, e.g. [17], are not helpful in dealing with quantities like $H(Y^n)$, $D(P_{Y^n} \| P^*_{Y^n})$ or $P_{Y^n}$-expectations of functions which are not in the form of a cumulative average.
Proof: Fix $y^n, \bar{y}^n \in \mathcal{Y}^n$ which differ in the $j$-th letter only. Then, denoting $y_{\backslash j} = \{y_k, k \neq j\}$, we have
|\log P_{Y^n}(y^n) - \log P_{Y^n}(\bar{y}^n)| = \left| \log \frac{P_{Y_j|Y_{\backslash j}}(y_j|y_{\backslash j})}{P_{Y_j|Y_{\backslash j}}(\bar{y}_j|y_{\backslash j})} \right|   (58)
\le \max_{a, b, b'} \left| \log \frac{P_{Y|X}(b|a)}{P_{Y|X}(b'|a)} \right|   (59)
\triangleq a_1 < \infty ,   (60)
where (59) follows from
P_{Y_j|Y_{\backslash j}}(b|y_{\backslash j}) = \sum_{a \in \mathcal{X}} P_{Y|X}(b|a) P_{X_j|Y_{\backslash j}}(a|y_{\backslash j}) .   (61)
Thus, the function $y^n \mapsto \log P_{Y^n}(y^n)$ is $a_1$-Lipschitz in the Hamming metric on $\mathcal{Y}^n$. Its discrete gradient (see the definition of $D(f)$ in [18, Section 4]) is bounded by $n|a_1|^2$ and thus by the discrete Poincaré inequality [18, Theorem 4.1f] we have
\mathrm{Var}[\log P_{Y^n}(Y^n) \mid X^n = x^n] \le n |a_1|^2 .   (62)
Therefore, for some $0 < a_2 < \infty$ and all $x^n \in \mathcal{X}_n$ we have
\mathrm{Var}[\imath_{X^n;Y^n}(X^n; Y^n) \mid X^n = x^n] = \mathrm{Var}\left[ \log \frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \,\Big|\, X^n = x^n \right]   (63)
\le 2\, \mathrm{Var}[\log P_{Y^n|X^n}(Y^n|X^n) \mid X^n = x^n] + 2\, \mathrm{Var}[\log P_{Y^n}(Y^n) \mid X^n = x^n]   (64)
\le 2 n a_2 + 2 n |a_1|^2 ,   (65)
where (65) follows from (62) and the fact that the random variable in the first variance in (64) is a sum of $n$ independent terms. Applying Corollary 3 with $S_m = 2na_2 + 2n|a_1|^2$ and $Q_Y = P_{Y^n}$ we obtain
D(P_{Y^n|X^n} \| P_{Y^n} | P_{X^n}) \ge \log M_n + O(\sqrt{n}) .   (66)
We can now complete the proof:
D(P_{Y^n} \| P^*_{Y^n}) = D(P_{Y^n|X^n} \| P^*_{Y^n} | P_{X^n}) - D(P_{Y^n|X^n} \| P_{Y^n} | P_{X^n})   (67)
\le nC - D(P_{Y^n|X^n} \| P_{Y^n} | P_{X^n})   (68)
\le nC - \log M_n + O(\sqrt{n}) ,   (69)
where (68) is because $P^*_{Y^n}$ satisfies (19) and (69) follows from (66). This completes the proof of (54).
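The single-letter Lipschitz property (58)-(60) is easy to verify exhaustively for toy codes. The following sketch (an illustration only; the repetition code and the BSC crossover probability are arbitrary choices, assuming NumPy) checks that changing one output letter never changes $\log P_{Y^n}(y^n)$ by more than $a_1$:

    import itertools, math
    import numpy as np

    W = np.array([[0.89, 0.11],      # BSC(0.11): W[x, y]
                  [0.11, 0.89]])
    codebook = [(0, 0, 0), (1, 1, 1)]            # 3-bit repetition code, M = 2

    def P_out(y):                                 # P_{Y^n}(y^n), equiprobable codewords
        return np.mean([np.prod([W[x, b] for x, b in zip(c, y)]) for c in codebook])

    a1 = max(abs(math.log(W[a, b]) - math.log(W[a, bp]))
             for a in range(2) for b in range(2) for bp in range(2))

    worst = 0.0
    for y in itertools.product(range(2), repeat=3):
        for j in range(3):                        # flip coordinate j
            yb = list(y); yb[j] ^= 1
            worst = max(worst, abs(math.log(P_out(y)) - math.log(P_out(tuple(yb)))))
    print(worst, a1)                              # worst-case change never exceeds a1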
Remark 5: (55) need not hold if the maximal probability of error is replaced with the average or if the encoder is allowed to be random. Indeed, for any $0 < \epsilon < 1$ we construct a sequence of $(n, M_n, \epsilon)_{\rm avg}$ capacity-achieving codes which do not satisfy (55). Consider a sequence of $(n, M'_n, \epsilon'_n)_{\rm max,det}$ codes with $\epsilon'_n \to 0$ and
\frac{1}{n} \log M'_n \to C .   (70)
For all $n$ such that $\epsilon'_n < \frac{1}{2}$ this code cannot have repeated codewords, and we can additionally assume (perhaps by reducing $M'_n$ by one) that there is no codeword equal to $(x_0, \ldots, x_0) \in \mathcal{X}_n$, where $x_0$ is some fixed letter in $\mathcal{X}$ such that
D(P_{Y|X=x_0} \| P^*_Y) > 0   (71)
(the existence of such an $x_0$ relies on the assumption $C > 0$). Denote the output distribution induced by this code by $P'_{Y^n}$.
Next, extend this code by adding $\frac{\epsilon - \epsilon'_n}{1-\epsilon} M'_n$ identical codewords $(x_0, \ldots, x_0) \in \mathcal{X}_n$. Then the minimal average probability of error achievable with the extended codebook of size
M_n \triangleq \frac{1 - \epsilon'_n}{1 - \epsilon} M'_n   (72)
is easily seen to be not larger than $\epsilon$. Denote the output distribution induced by the extended code by $P_{Y^n}$ and define a binary random variable
S = 1\{X^n = (x_0, \ldots, x_0)\}   (73)
with distribution
P_S(1) = 1 - P_S(0) = \frac{\epsilon - \epsilon'_n}{1 - \epsilon'_n} .   (74)
We have then
D(P_{Y^n} \| P^*_{Y^n}) = D(P_{Y^n|S} \| P^*_{Y^n} | P_S) - I(S; Y^n)   (75)
\ge D(P_{Y^n|S} \| P^*_{Y^n} | P_S) - \log 2   (76)
= n D(P_{Y|X=x_0} \| P^*_Y) P_S(1) + D(P'_{Y^n} \| P^*_{Y^n}) P_S(0) - \log 2   (77)
= n D(P_{Y|X=x_0} \| P^*_Y) P_S(1) + o(n) ,   (78)
where (75) is by (52), (76) follows since $S$ is binary, (77) is by noticing that $P_{Y^n|S=0} = P'_{Y^n}$ (while $P_{Y^n|S=1} = (P_{Y|X=x_0})^n$), and (78) is by (55) applied to the original code. It is clear that (71) and (78) show the impossibility of (55) for this code.
Similarly, one shows that (55) cannot hold if the assumption of a deterministic encoder is dropped. Indeed, we can again take the very same $(n, M'_n, \epsilon'_n)$ code and make its encoder randomized so that with probability $\frac{\epsilon - \epsilon'_n}{1 - \epsilon'_n}$ it outputs $(x_0, \ldots, x_0) \in \mathcal{X}_n$ and otherwise it outputs the original codeword. The same analysis shows that (78) holds again and thus (55) fails.
Note that the counterexamples constructed above also demonstrate that in Theorem 2 (and hence Theorem 1) the assumptions of maximal probability of error and deterministic encoders are not superfluous, contrary to what is claimed by Ahlswede [13, Remark 1].
C. DMC with $C_1 = \infty$
Next, we show an estimate for $D(P_{Y^n} \| P^*_{Y^n})$ differing by a $\log^{3/2} n$ factor from (34) for DMCs with $C_1 = \infty$.
Theorem 6: For any DMC $P_{Y|X}$ with capacity $C > 0$ (with or without input constraints) and $0 < \epsilon < 1$ there exists a constant $b > 0$ with the property that for any sequence of $(n, M_n, \epsilon)_{\rm max,det}$ codes we have for all $n \ge 1$
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M_n + b \sqrt{n} \log^{3/2} n .   (79)
In particular, for any such sequence achieving capacity we have
\frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 .   (80)
Remark 6: Claim (80) is not true if either the maximal probability of error is replaced with the average, or if we allow the encoder to be stochastic. Counterexamples are constructed exactly as in Remark 5.
Proof: Let $c_i$ and $D_i$, $i = 1, \ldots, M_n$, denote the codewords and the decoding regions of the code. Denote the sequence
\ell_n = b_1 \sqrt{n \log n}   (81)
with $b_1 > 0$ to be further constrained shortly. According to the isoperimetric inequality for Hamming space [17, Corollary I.5.3], there is a constant $a > 0$ such that for every $i = 1, \ldots, M_n$
1 - P_{Y^n|X^n=c_i}[\Gamma^{\ell_n} D_i] \le Q\left( Q^{-1}(\epsilon) + \frac{\ell_n}{a\sqrt{n}} \right)   (82)
\le \exp\left\{ -\frac{b_2 \ell_n^2}{n} \right\}   (83)
= n^{-b_2}   (84)
\le \frac{1}{n} ,   (85)
where
\Gamma^{\ell} D = \{ b^n \in \mathcal{Y}^n : \exists\, y^n \in D \text{ s.t. } |\{j : y_j \neq b_j\}| \le \ell \}   (86)
denotes the $\ell$-th Hamming neighborhood of a set $D$, and we assumed that $b_1$ was chosen large enough so that there is $b_2 \ge 1$ satisfying (85).
Let
M'_n = \frac{M_n}{n \binom{n}{\ell_n} |\mathcal{Y}|^{\ell_n}}   (87)
and consider a subcode $F = (F_1, \ldots, F_{M'_n})$, i.e., an ordered list of $M'_n$ indices from $\{1, \ldots, M_n\}$. Then for every possible choice of $F$ we denote by $P_{X^n(F)}$ and $P_{Y^n(F)}$ the input/output distributions induced by the given subcode, so that for example
P_{Y^n(F)} = \frac{1}{M'_n} \sum_{j=1}^{M'_n} P_{Y^n|X^n = c_{F_j}} .   (88)
We aim to apply the random coding argument over all $M_n^{M'_n}$ equally likely choices of a subcode $F$. Random coding among subcodes was originally invoked in [6] to demonstrate the existence of a good subcode. One easily verifies that the expected induced output distribution is
E[P_{Y^n(F)}] \triangleq \frac{1}{M_n^{M'_n}} \sum_{F_1=1}^{M_n} \cdots \sum_{F_{M'_n}=1}^{M_n} P_{Y^n(F)}   (89)
= P_{Y^n} .   (90)
Next, for every $F$ we denote by $\epsilon'(F)$ the minimal possible average probability of error achieved by an appropriately chosen decoder. With this notation we have, for every possible value of $F$:
D(P_{Y^n(F)} \| P^*_{Y^n}) = D(P_{Y^n|X^n} \| P^*_{Y^n} | P_{X^n(F)}) - I(X^n(F); Y^n(F))   (91)
\le nC - I(X^n(F); Y^n(F))   (92)
\le nC - (1 - \epsilon'(F)) \log M'_n + \log 2   (93)
\le nC - \log M'_n + n \epsilon'(F) \log|\mathcal{X}| + \log 2   (94)
\le nC - \log M_n + n \epsilon'(F) \log|\mathcal{X}| + b_3 \sqrt{n} \log^{3/2} n ,   (95)
where (91) is by (52), (92) is by (19), (93) is by Fano's inequality, (94) is because $\log M'_n \le n \log|\mathcal{X}|$, and (95) holds for some $b_3 > 0$ by the choice of $M'_n$ in (87) and by
\log \binom{n}{\ell_n} \le \ell_n \log n .   (96)
Taking the expectation of both sides of (95), applying convexity of relative entropy and (90), we get
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M_n + n\, E[\epsilon'(F)] \log|\mathcal{X}| + b_3 \sqrt{n} \log^{3/2} n .   (97)
Accordingly, it remains to show that
n\, E[\epsilon'(F)] \le 2 .   (98)
To that end, for every subcode $F$ define the suboptimal randomized decoder
\hat{W}(y) = F_j \ \text{ with probability } \frac{1}{|L(y, F)|} , \qquad \forall F_j \in L(y, F) ,   (99)
where $L(y, F)$ is the list of those indices $i \in F$ for which $y \in \Gamma^{\ell_n} D_i$. Let $F_W$ be equiprobable on $F$; then, averaged over the selection of $F$, we have
E[\,|L(Y^n, F)| \mid F_W \in L(Y^n, F)\,] \le 1 + \binom{n}{\ell_n} |\mathcal{Y}|^{\ell_n} \frac{M'_n - 1}{M_n} ,   (100)
because each $y \in \mathcal{Y}^n$ can belong to at most $\binom{n}{\ell_n} |\mathcal{Y}|^{\ell_n}$ enlarged decoding regions $\Gamma^{\ell_n} D_i$ and since each $F_j$ is chosen independently and equiprobably among all $M_n$ alternatives. The average probability of error for the given decoder, when also averaged over $F$, can be upper-bounded as
E[\epsilon'(F)] \le P[F_W \notin L(Y^n, F)] + E\left[ \frac{|L(Y^n, F)| - 1}{|L(Y^n, F)|}\, 1\{F_W \in L(Y^n, F)\} \right]   (101)
\le P[F_W \notin L(Y^n, F)] + \binom{n}{\ell_n} |\mathcal{Y}|^{\ell_n} \frac{M'_n}{M_n}   (102)
\le \frac{1}{n} + \binom{n}{\ell_n} |\mathcal{Y}|^{\ell_n} \frac{M'_n}{M_n}   (103)
\le \frac{2}{n} ,   (104)
where (102) is by Jensen's inequality applied to $\frac{x-1}{x}$ and (100), (103) is by (85), and (104) is by (87). Since (104) also serves as an upper bound to $E[\epsilon'(F)]$, the proof of (98) is complete.
D. Gaussian channel
Theorem 7: For any $0 < \epsilon < 1$ and $P > 0$ there exists $a = a(\epsilon, P) > 0$ such that the output distribution $P_{Y^n}$ of any $(n, M_n, \epsilon)_{\rm max,det}$ code for the $AWGN(P)$ channel satisfies
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M_n + a\sqrt{n} ,   (105)
where $P^*_{Y^n} = \mathcal{N}(0, 1+P)^n$. In particular, for any capacity-achieving sequence of such codes we have
\frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 .   (106)
Remark 7: (106) need not hold if the maximal probability of error is replaced with the average or if the encoder is allowed to be stochastic. Counterexamples are constructed similarly to those for Remark 5 with $x_0 = 0$. Note also that Theorem 7 need not hold if the power constraint is in the average-over-the-codebook sense; see [14, Section 4.3.3].
Proof: Denote by lower-case $p_{Y^n|X^n=x}$ and $p_{Y^n}$ the densities of $P_{Y^n|X^n=x}$ and $P_{Y^n}$. Then an elementary computation shows
\nabla \log p_{Y^n}(y) = \log e \, \frac{\nabla p_{Y^n}(y)}{p_{Y^n}(y)}   (107)
= \frac{\log e}{p_{Y^n}(y)} \sum_{j=1}^{M} \frac{1}{M (2\pi)^{n/2}} \nabla e^{-\frac{1}{2}\|y - c_j\|^2}   (108)
= \frac{\log e}{p_{Y^n}(y)} \sum_{j=1}^{M} \frac{1}{M (2\pi)^{n/2}} (c_j - y)\, e^{-\frac{1}{2}\|y - c_j\|^2}   (109)
= (E[X^n \mid Y^n = y] - y) \log e .   (110)
For convenience denote
\hat{X}^n = E[X^n \mid Y^n]   (111)
and notice that since $\|X^n\| \le \sqrt{nP}$ we also have
\|\hat{X}^n\| \le \sqrt{nP} .   (112)
Then
\frac{1}{\log^2 e} E[\|\nabla \log p_{Y^n}(Y^n)\|^2 \mid X^n] = E\left[ \|Y^n - \hat{X}^n\|^2 \mid X^n \right]   (113)
\le 2 E[\|Y^n\|^2 \mid X^n] + 2 E[\|\hat{X}^n\|^2 \mid X^n]   (114)
\le 2 E[\|Y^n\|^2 \mid X^n] + 2nP   (115)
= 2 E[\|X^n + Z^n\|^2 \mid X^n] + 2nP   (116)
\le 4\|X^n\|^2 + 4n + 2nP   (117)
\le (6P + 4) n ,   (118)
where (114) is by
\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2 ,   (119)
(115) is by (112), in (116) we introduced $Z^n \sim \mathcal{N}(0, I_n)$ which is independent of $X^n$, (117) is by (119), and (118) is by the power constraint imposed on the codebook.
Conditioned on $X^n$, the random vector $Y^n$ is Gaussian. Thus, from Poincaré's inequality for the Gaussian measure, e.g. [19, (2.16)], we have
\mathrm{Var}[\log p_{Y^n}(Y^n) \mid X^n] \le E[\|\nabla \log p_{Y^n}(Y^n)\|^2 \mid X^n]   (120)
and together with (118) this yields the required estimate
\mathrm{Var}[\log p_{Y^n}(Y^n) \mid X^n] \le a_1 n   (121)
for some $a_1 > 0$. The argument then proceeds step by step as in the proof of Theorem 5, with (121) taking the place of (62) and recalling that property (19) holds for the AWGN channel too.
IV. DISCUSSION
We have shown that there is a constant $a = a(\epsilon)$ independent of $n$ and $M$ such that
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M + a\sqrt{n} ,   (122)
where $P_{Y^n}$ is the output distribution induced by an arbitrary $(n, M, \epsilon)_{\rm max,det}$ code. Therefore, any $(n, M, \epsilon)_{\rm max,det}$ code necessarily satisfies
\log M \le nC + a(\epsilon)\sqrt{n} ,   (123)
as is classically known [20]. In particular, (122) implies that any $\epsilon$-capacity-achieving code must satisfy (1). In this section we discuss this and other implications of this result, such as:
1) (122) implies that the empirical marginal output distribution
\bar{P}_n \triangleq \frac{1}{n} \sum_{j=1}^{n} P_{Y_j}   (124)
converges to $P^*_Y$ in a strong sense (Section IV-A);
2) (122) guarantees estimates of the precision in the approximation (3) (Sections IV-B and IV-E);
3) (122) provides estimates for the deviations of $F(Y^n)$ from its average (Section IV-C);
4) relation to optimal transportation (Section IV-D);
5) implications of (1) for the empirical input distribution of the code (Sections IV-G and IV-H).
A. Empirical distributions and empirical averages
Considering the empirical marginal distributions, the convexity of relative entropy and (1) result in
D(\bar{P}_n \| P^*_Y) \le \frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 ,   (125)
where $\bar{P}_n$ is the empirical marginal output distribution (124).
More generally, we have [2, (41)]
D(\bar{P}^{(k)}_n \| P^*_{Y^k}) \le \frac{k}{n-k+1} D(P_{Y^n} \| P^*_{Y^n}) \to 0 ,   (126)
where $\bar{P}^{(k)}_n$ is the $k$-th order empirical output distribution
\bar{P}^{(k)}_n = \frac{1}{n-k+1} \sum_{j=1}^{n-k+1} P_{Y_j^{j+k-1}} .   (127)
Knowing that a sequence of distributions $P_n$ converges in relative entropy to a distribution $P$, i.e.,
D(P_n \| P) \to 0 ,   (128)
implies convergence properties for the expectations of functions:
1) The following inequality (often called Pinsker's) is established in [21]:
\|P_n - P\|_{TV}^2 \le \frac{1}{2 \log e} D(P_n \| P) ,   (129)
where
\|P - Q\|_{TV} \triangleq \sup_A |P(A) - Q(A)| .   (130)
From (129) we have that (128) implies
\int f \, dP_n \to \int f \, dP   (131)
for all bounded functions $f$.
2) In fact, (131) also holds for unbounded $f$ as long as it satisfies Cramér's condition under $P$, i.e.,
\int e^{tf} \, dP < \infty   (132)
for all $t$ in some neighborhood of $0$; see [22, Lemma 3.1].
Together, (131) and (125) show that for a wide class of functions $f : \mathcal{Y} \to \mathbb{R}$, empirical averages over distributions induced by good codes converge to the average over the capacity-achieving output distribution (caod):
E\left[ \frac{1}{n} \sum_{j=1}^{n} f(Y_j) \right] \to \int f \, dP^*_Y .   (133)
From (126) a similar conclusion holds for $k$-th order empirical averages.
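As a quick numerical illustration of how (129) controls the gap in (131) (a sketch only, with arbitrarily generated distributions and assuming NumPy; not part of the paper), one can check Pinsker's inequality directly:

    import numpy as np

    rng = np.random.default_rng(0)

    def tv(p, q):                      # total variation: sup_A |P(A)-Q(A)|
        return 0.5 * np.abs(p - q).sum()

    def kl_bits(p, q):                 # relative entropy D(p||q) in bits
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

    for _ in range(5):
        p = rng.dirichlet(np.ones(8))
        q = rng.dirichlet(np.ones(8))
        # Pinsker (129): ||p - q||_TV^2 <= D(p||q) / (2 log2 e)
        print(tv(p, q)**2, kl_bits(p, q) / (2 * np.log2(np.e)))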
B. Averages of functions of $Y^n$
To go beyond empirical averages, we need to provide some definitions.
Definition 1: The function $F : \mathcal{Y}^n \to \mathbb{R}$ is called $(b, c)$-concentrated with respect to a measure $\mu$ on $\mathcal{Y}^n$ if for all $t \in \mathbb{R}$
\int \exp\{t (F(Y^n) - \bar{F})\} \, d\mu \le b \exp\{c t^2\} , \qquad \bar{F} = \int F \, d\mu .   (134)
A function $F$ is called $(b, c)$-concentrated for the channel if it is $(b, c)$-concentrated for every $P_{Y^n|X^n=x}$ and for $P^*_{Y^n}$.
A couple of simple properties of $(b, c)$-concentrated functions:
1) Gaussian concentration around the mean:
P[|F(Y^n) - E[F(Y^n)]| > t] \le b \exp\left\{ -\frac{t^2}{4c} \right\} .   (135)
2) Small variance:
\mathrm{Var}[F(Y^n)] = \int_0^\infty P[\,|F(Y^n) - E[F(Y^n)]|^2 > t\,] \, dt   (136)
\le \int_0^\infty \min\left\{ b \exp\left\{-\frac{t}{4c}\right\}, 1 \right\} dt   (137)
= 4c \log(2be) .   (138)
Examples of concentrated functions (see [19] for a survey):
• A bounded function $F$ with $\|F\|_\infty \le A$ is $(\exp\{A^2 (4c)^{-1}\}, c)$-concentrated for any $c$ and any measure $\mu$. Moreover, for a fixed $\mu$ and a sufficiently large $c$, any bounded function is $(1, c)$-concentrated.
• If $F$ is $(b, c)$-concentrated then $\lambda F$ is $(b, \lambda^2 c)$-concentrated.
• Let $f : \mathcal{Y} \to \mathbb{R}$ be $(1, c)$-concentrated with respect to $\mu$. Then so is
F(y^n) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} f(y_j)   (139)
with respect to $\mu^n$. In particular, any $F$ defined in this way from a bounded $f$ is $(1, c)$-concentrated for a memoryless channel (for a sufficiently large $c$ independent of $n$).
• If $\mu = \mathcal{N}(0,1)^n$ and $F$ is a Lipschitz function on $\mathbb{R}^n$ with Lipschitz constant $\|F\|_{\rm Lip}$, then $F$ is $(1, \frac{\|F\|_{\rm Lip}^2}{2 \log e})$-concentrated with respect to $\mu$, e.g. [23, Proposition 2.1]:
\int_{\mathbb{R}^n} \exp\{t (F(y^n) - \bar{F})\} \, d\mu(y^n) \le \exp\left\{ \frac{\|F\|_{\rm Lip}^2}{2 \log e} t^2 \right\} .   (140)
Therefore any Lipschitz function is $(1, \frac{(1+P)\|F\|_{\rm Lip}^2}{2 \log e})$-concentrated for the AWGN channel.
• For discrete $\mathcal{Y}^n$ we endow it with the Hamming distance
d(y^n, z^n) = |\{i : y_i \neq z_i\}|   (141)
and define Lipschitz functions in the usual way. In this case, a simpler criterion is: $F : \mathcal{Y}^n \to \mathbb{R}$ is Lipschitz with constant $\ell$ if and only if
\max_{y^n, b, j} |F(y_1, \ldots, y_j, \ldots, y_n) - F(y_1, \ldots, b, \ldots, y_n)| \le \ell .   (142)
Let $\mu$ be any product probability measure $P_1 \times \cdots \times P_n$ on $\mathcal{Y}^n$; then the standard Azuma-Hoeffding estimate shows that
\sum_{y^n \in \mathcal{Y}^n} \exp\{t (F(y^n) - \bar{F})\} \mu(y^n) \le \exp\left\{ \frac{n \|F\|_{\rm Lip}^2}{2 \log e} t^2 \right\}   (143)
and thus any Lipschitz function $F$ is $(1, \frac{n \|F\|_{\rm Lip}^2}{2 \log e})$-concentrated with respect to any product measure on $\mathcal{Y}^n$.
Note that, unlike the Gaussian case, the constant of concentration $c$ worsens linearly with the dimension $n$. Generally, this growth cannot be avoided, as shown by the coefficient $\frac{1}{\sqrt{n}}$ in the exact solution of the Hamming isoperimetric problem [24]. At the same time, this growth does not mean (143) is "weaker" than (140); for example, $F = \sum_{j=1}^{n} \phi(y_j)$ has Lipschitz constant $O(\sqrt{n})$ in Euclidean space and $O(1)$ in Hamming. Surprisingly, however, for convex functions the concentration (140) holds for product measures even under the Euclidean distance [25].
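The dimension-free nature of the Gaussian estimate (140) is easy to see by simulation. The following sketch (an illustration with arbitrarily chosen functions, assuming NumPy; not part of the paper) contrasts a 1-Lipschitz function with a function whose Euclidean Lipschitz constant grows like $\sqrt{n}$:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200

    # F(y) = ||y||_2 is 1-Lipschitz on R^n; under mu = N(0, I_n) the Gaussian
    # concentration (140) predicts fluctuations of order 1, independent of n.
    samples = rng.standard_normal((20000, n))
    F = np.linalg.norm(samples, axis=1)
    print(F.std())                         # O(1) even though E[F] grows like sqrt(n)

    # G(y) = sum_j y_j has Euclidean Lipschitz constant sqrt(n), so (140) only
    # promises fluctuations of order sqrt(n), as observed:
    G = samples.sum(axis=1)
    print(G.std())                         # ~ sqrt(n)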
We now show how to approximate expectations of concentrated functions.
Proposition 8: Suppose that $F : \mathcal{Y}^n \to \mathbb{R}$ is $(b, c)$-concentrated with respect to $P^*_{Y^n}$. Then
|E[F(Y^n)] - E[F(Y^{*n})]| \le 2\sqrt{c\, D(P_{Y^n} \| P^*_{Y^n}) + c \log b} ,   (144)
where
E[F(Y^{*n})] = \int F(y^n) \, dP^*_{Y^n} .   (145)
Proof: Recall the Donsker-Varadhan inequality [26, Lemma 2.1]: for any probability measures $P$ and $Q$ with $D(P\|Q) < \infty$ and a measurable function $g$ such that $\int \exp\{g\} \, dQ < \infty$, the integral $\int g \, dP$ exists (but is perhaps $-\infty$) and moreover
\int g \, dP - \log \int \exp\{g\} \, dQ \le D(P \| Q) .   (146)
Since by (134) the moment generating function of $F$ exists under $P^*_{Y^n}$, applying (146) to $tF$ we get
t\, E[F(Y^n)] - \log E[\exp\{t F(Y^{*n})\}] \le D(P_{Y^n} \| P^*_{Y^n}) .   (147)
From (134) we have
c t^2 - t\, E[F(Y^n)] + t\, E[F(Y^{*n})] + D(P_{Y^n} \| P^*_{Y^n}) + \log b \ge 0   (148)
for all $t$. Thus the discriminant of the parabola in (148) must be non-positive, which is precisely (144).
Note that for empirical averages $F(y^n) = \frac{1}{n} \sum_{j=1}^{n} f(y_j)$ we may either apply the estimate for concentration in the example (139) and then use Proposition 8, or directly apply Proposition 8 to (125) – the result is the same:
\left| \frac{1}{n} \sum_{j=1}^{n} E[f(Y_j)] - E[f(Y^*)] \right| \le 2 \sqrt{\frac{c}{n} D(P_{Y^n} \| P^*_{Y^n})} \to 0 ,   (149)
for any $f$ which is $(1, c)$-concentrated with respect to $P^*_Y$.
For the Gaussian channel, Proposition 8 and (140) yield:
Corollary 9: For any $0 < \epsilon < 1$ there exist two constants $a_1, a_2 > 0$ such that for any $(n, M, \epsilon)_{\rm max,det}$ code for the $AWGN(P)$ channel and for any Lipschitz function $F : \mathbb{R}^n \to \mathbb{R}$ we have
|E[F(Y^n)] - E[F(Y^{*n})]| \le a_1 \|F\|_{\rm Lip} \sqrt{nC - \log M + a_2 \sqrt{n}} ,   (150)
where $C = \frac{1}{2}\log(1+P)$ is the capacity.
Note that in the proof of Corollary 9 concentration of measure is used twice: once for $P_{Y^n|X^n}$ in the form of the Poincaré inequality (proof of Theorem 7) and once in the form of (134) (proof of Proposition 8).
C. Concentration of functions of $Y^n$
Surprisingly, not only can we estimate expectations of $F(Y^n)$ by replacing the unwieldy $P_{Y^n}$ with the simple $P^*_{Y^n}$, but in fact the distribution of $F(Y^n)$ exhibits a sharp peak at its expectation:
Proposition 10: Consider a channel for which (122) holds. Then for any $F$ which is $(b, c)$-concentrated for such a channel, we have for every $(n, M, \epsilon)_{\rm max,det}$ code:
P[|F(Y^n) - E[F(Y^{*n})]| > t] \le 3b \exp\left\{ nC - \log M + a\sqrt{n} - \frac{t^2}{16c} \right\}   (151)
and
\mathrm{Var}[F(Y^n)] \le 16c \left( nC - \log M + a\sqrt{n} + \log(6be) \right) .   (152)
Proof: Denote for convenience
\bar{F} \triangleq E[F(Y^{*n})] ,   (153)
\phi(x^n) = E[F(Y^n) \mid X^n = x^n] .   (154)
Then, as a consequence of $F$ being $(b, c)$-concentrated for $P_{Y^n|X^n=x^n}$, we have
P[|F(Y^n) - \phi(x^n)| > t \mid X^n = x^n] \le b \exp\left\{ -\frac{t^2}{4c} \right\} .   (155)
Consider now a subcode $\mathcal{C}_1$ consisting of all codewords such that $\phi(x^n) > \bar{F} + t$ for $t > 0$. The number $M_1 = |\mathcal{C}_1|$ of codewords in this subcode is
M_1 = M \, P[\phi(X^n) > \bar{F} + t] .   (156)
Let $Q_{Y^n}$ be the output distribution induced by $\mathcal{C}_1$. We have the following chain:
\bar{F} + t \le \frac{1}{M_1} \sum_{x \in \mathcal{C}_1} \phi(x^n)   (157)
= \int F(y^n) \, dQ_{Y^n}   (158)
\le \bar{F} + 2\sqrt{c\, D(Q_{Y^n} \| P^*_{Y^n}) + c \log b}   (159)
\le \bar{F} + 2\sqrt{c (nC - \log M_1 + a\sqrt{n}) + c \log b} ,   (160)
where (157) is by the definition of $\mathcal{C}_1$, (158) is by (154), (159) is by Proposition 8 and the assumption of $(b, c)$-concentration of $F$ under $P^*_{Y^n}$, and (160) is by (122).
Together, (156) and (160) imply
P[\phi(X^n) > \bar{F} + t] \le b \exp\left\{ nC - \log M + a\sqrt{n} - \frac{t^2}{4c} \right\} .   (161)
Applying the same argument to $-F$ we obtain a similar bound on $P[|\phi(X^n) - \bar{F}| > t]$, and thus
P[|F(Y^n) - \bar{F}| > t] \le P[|F(Y^n) - \phi(X^n)| > t/2] + P[|\phi(X^n) - \bar{F}| > t/2]   (162)
\le b \exp\left\{ -\frac{t^2}{16c} \right\} \left( 1 + 2 \exp\{nC - \log M + a\sqrt{n}\} \right)   (163)
\le 3b \exp\left\{ nC - \log M + a\sqrt{n} - \frac{t^2}{16c} \right\} ,   (164)
where (163) is by (155) and (161), and (164) is by (123). Thus (151) is proven. Moreover, (152) follows by (138).
D. Relation to optimal transportation
Since the seminal work of Marton [7], [27], one of the tools for proving $(b, c)$-concentration of Lipschitz functions is optimal transportation theory. Namely, Marton demonstrated that if a probability measure $\mu$ on a metric space satisfies a $T_1$ inequality
W_1(\nu, \mu) \le \sqrt{c' D(\nu \| \mu)} \qquad \forall \nu ,   (165)
then any Lipschitz $f$ is $(b, \|f\|_{\rm Lip}^2 c)$-concentrated with respect to $\mu$ for some $b = b(c, c')$ and any $0 < c < \frac{c'}{4}$. In (165), $W_1(\nu, \mu)$ denotes the linear-cost transportation distance, or Wasserstein-1 distance, defined as
W_1(\nu, \mu) \triangleq \inf_{P_{YY'}} E[d(Y, Y')] ,   (166)
where $d(\cdot, \cdot)$ is the distance on the underlying metric space and the infimum is taken over all couplings $P_{YY'}$ with fixed marginals $P_Y = \mu$, $P_{Y'} = \nu$. Note that according to [28, Theorem 1] we have $\|\nu - \mu\|_{TV} = W_1(\nu, \mu)$ when the underlying distance on $\mathcal{Y}$ is $d(y, y') = 1\{y \neq y'\}$, sometimes called an $L_0$-distance.
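For distributions on the real line $W_1$ has a closed form and is available in SciPy, which makes (165) easy to inspect numerically. The following sketch (illustrative only, assuming NumPy and SciPy; the shifted-Gaussian pair is an arbitrary choice) uses the fact that the standard Gaussian satisfies the stronger $T_2$ inequality (171) with $c' = 2$ in nats (Talagrand's inequality), so that $W_1 \le W_2 \le \sqrt{2 D}$:

    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(2)
    m = 0.7                                   # mean shift; nu = N(m,1), mu = N(0,1)

    x = rng.standard_normal(200000) + m       # samples from nu
    y = rng.standard_normal(200000)           # samples from mu

    W1 = wasserstein_distance(x, y)           # empirical Wasserstein-1 distance
    D_nats = m**2 / 2                         # D(nu || mu) for two unit-variance Gaussians
    print(W1, np.sqrt(2 * D_nats))            # transportation bound; both are about |m|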
In this section we show that (165) in fact directly implies the estimate of Proposition 8, without invoking either Marton's argument or the Donsker-Varadhan inequality. Indeed, assume that $F : \mathcal{Y}^n \to \mathbb{R}$ is a Lipschitz function and observe that for any coupling $P_{Y^n, Y^{*n}}$ we have
|E[F(Y^n)] - E[F(Y^{*n})]| \le \|F\|_{\rm Lip}\, E[d(Y^n, Y^{*n})] ,   (167)
where the distance $d$ is either Hamming or Euclidean depending on the nature of $\mathcal{Y}^n$. Now taking the infimum in the right-hand side of (167) with respect to all couplings we observe
|E[F(Y^n)] - E[F(Y^{*n})]| \le \|F\|_{\rm Lip}\, W_1(P_{Y^n}, P^*_{Y^n})   (168)
and therefore by the transportation inequality (165) we get
|E[F(Y^n)] - E[F(Y^{*n})]| \le \|F\|_{\rm Lip} \sqrt{c' D(P_{Y^n} \| P^*_{Y^n})} ,   (169)
which is precisely what Proposition 8 yields for $(1, \frac{c' \|F\|_{\rm Lip}^2}{4})$-concentrated functions.
Our argument can be turned around and used to prove linear-cost transportation $T_1$ inequalities (165). Indeed, by the Kantorovich-Rubinstein duality [29, Chapter 1] we have
\sup_F |E[F(Y^n)] - E[F(Y^{*n})]| = W_1(P_{Y^n}, P^*_{Y^n}) ,   (170)
where the supremum is over all $F$ with $\|F\|_{\rm Lip} \le 1$. Thus the argument in the proof of Proposition 8 shows that (165) must hold for any $\mu$ for which every 1-Lipschitz $F$ is $(1, c')$-concentrated, demonstrating an equivalence between $T_1$ transportation and Gaussian-like concentration, a result reported in [30, Theorem 3.1].
We also mention that, unlike general iid measures, an iid Gaussian $\mu = \mathcal{N}(0,1)^n$ satisfies a much stronger $T_2$ transportation inequality [31]
W_2(\nu, \mu) \le \sqrt{c' D(\nu \| \mu)} \qquad \forall \nu \ll \mu ,   (171)
where remarkably $c'$ does not depend on $n$, and the Wasserstein-2 distance $W_2$ is defined as
W_2(\nu, \mu) \triangleq \inf_{P_{YY'}} \sqrt{E[d^2(Y, Y')]} ,   (172)
the infimum being over all couplings as in (166).
E. Empirical averages of non-Lipschitz functions
One drawback of proving Proposition 8 via the transportation inequality (165) is that it does not show anything for non-Lipschitz functions. In this section we demonstrate how the proof of Proposition 8 can be extended to functions that do not satisfy the strong concentration assumptions.
Proposition 11: Let $f : \mathcal{Y} \to \mathbb{R}$ be a (single-letter) function such that for some $\theta > 0$ we have
m_1 = E[\exp\{\theta f(Y^*)\}] < \infty   (173)
(one-sided Cramér condition) and
m_2 = E[f^2(Y^*)] < \infty .   (174)
Then there exists $b = b(m_1, m_2, \theta) > 0$ such that for all $n \ge \frac{16}{\theta^4}$ we have
\frac{1}{n} \sum_{j=1}^{n} E[f(Y_j)] \le E[f(Y^*)] + \frac{D(P_{Y^n} \| P^*_{Y^n}) + b\sqrt{n}}{n^{3/4}} ,   (175)
provided that $P^*_{Y^n} = (P^*_Y)^n$.
Remark 8: Substituting the estimate (122) and assuming $\log M = nC + O(\sqrt{n})$, we can see that both Proposition 8 and (175) give the same order $n^{-1/4}$ for the deviation of the empirical average from $E[f(Y^*)]$. Note also that Proposition 11 applies to functions which only have one-sided exponential moments; this will be useful in Theorem 23. Finally, the remainder term in (175) is order-optimal for $D(P_{Y^n} \| P^*_{Y^n}) \sim \sqrt{n}$, but can be improved to $\sqrt{\frac{D(P_{Y^n} \| P^*_{Y^n})}{n}}$ at the expense of adding technical conditions on $n$, $\theta$ and $m_1$.
Proof: It is clear that if the moment-generating function $t \mapsto E[\exp\{t f(Y^*)\}]$ exists for $t = \theta > 0$ then it also exists for all $0 \le t \le \theta$. Notice that since
x^2 \exp\{-x\} \le 4 e^{-2} \log e , \qquad \forall x \ge 0 ,   (176)
we have for all $0 \le t \le \frac{\theta}{2}$:
E[f^2(Y^*) \exp\{t f(Y^*)\}] \le E[f^2(Y^*) 1\{f < 0\}] + \frac{4 e^{-2} \log e}{(\theta - t)^2} E[\exp\{\theta f(Y^*)\} 1\{f \ge 0\}]   (177)
\le m_2 + \frac{4 e^{-2} m_1 \log e}{(\theta - t)^2}   (178)
\le m_2 + \frac{16 e^{-2} m_1 \log e}{\theta^2}   (179)
\triangleq b(m_1, m_2, \theta) \cdot 2 \log e .   (180)
Then the simple estimate
\log E[\exp\{t f(Y^*)\}] \le t\, E[f(Y^*)] + b t^2 , \qquad 0 \le t \le \frac{\theta}{2} ,   (181)
can be obtained by taking the logarithm of the identity
E[\exp\{t f(Y^*)\}] = 1 + \frac{t}{\log e} E[f(Y^*)] + \frac{1}{\log^2 e} \int_0^t ds \int_0^s E[f^2(Y^*) \exp\{u f(Y^*)\}] \, du   (182)
and applying the upper bound (180) and $\log x \le (x-1) \log e$.
Next, we define $F(y^n) = \frac{1}{n} \sum_{j=1}^{n} f(y_j)$ and consider the chain:
t\, E[F(Y^n)] \le \log E[\exp\{t F(Y^{*n})\}] + D(P_{Y^n} \| P^*_{Y^n})   (183)
= n \log E\left[\exp\left\{\frac{t}{n} f(Y^*)\right\}\right] + D(P_{Y^n} \| P^*_{Y^n})   (184)
\le t\, E[f(Y^*)] + \frac{b t^2}{n} + D(P_{Y^n} \| P^*_{Y^n}) ,   (185)
where (183) is by (147), (184) is because $P^*_{Y^n} = (P^*_Y)^n$, and (185) is by (181) assuming $\frac{t}{n} \le \frac{\theta}{2}$. The proof concludes by taking $t = n^{3/4}$ in (185).
A natural extension of Proposition 11 to functions such as
F(y^n) = \frac{1}{n-r+1} \sum_{j=1}^{n-r+1} f(y_j^{j+r-1})   (186)
is made by replacing the step (184) with the estimate
\log E[\exp\{t F(Y^{*n})\}] \le \frac{n-r+1}{r} \log E\left[ \exp\left\{ \frac{tr}{n} f(Y^{*r}) \right\} \right] ,   (187)
which in turn is shown by splitting the sum into $r$ subsums with independent terms and then applying Hölder's inequality:
E[X_1 \cdots X_r] \le \left( E[|X_1|^r] \cdots E[|X_r|^r] \right)^{1/r}   (188)
(for $n$ not a multiple of $r$ a small remainder term needs to be added to (187)).
F. Functions of degraded channel outputs
Notice that if the same code is used over a channel QY |X which is stochastically degraded
with respect to PY |X then by the data-processing for relative entropy, the upper bound (122)
holds for D(QY n ||Q∗Y n ), where QY n is the output of the QY |X channel and Q∗Y n is the output
of QY |X when the input is distributed according to a capacity-achieving distribution of PY |X .
Thus, in all the discussions the pair (PY n , P Y∗ n ) can be replaced with (QY n , Q∗Y n ) without any
change in arguments or constants. This observation can be useful in questions of information
theoretic security, where the wiretapper has access to a degraded copy of the channel output.
G. Input distribution: DMC
As shown in Section IV-A, we have for every $\epsilon$-capacity-achieving code:
\bar{P}_n = \frac{1}{n} \sum_{j=1}^{n} P_{Y_j} \to P^*_Y .   (189)
As noted in [2], convergence of output distributions can be propagated to statements about the input distributions. For example, this is obvious for the case of a DMC with a non-singular (more generally, injective) matrix $P_{Y|X}$. For other DMCs, the following argument extends that of [2, Theorem 4]. By Theorems 4 and 5 we know that
\frac{1}{n} I(X^n; Y^n) \to C .   (190)
By concavity of mutual information, we must necessarily have
I(\bar{X}; \bar{Y}) \to C ,   (191)
where $P_{\bar{X}} = \frac{1}{n} \sum_{j=1}^{n} P_{X_j}$. By compactness of the simplex of input distributions and continuity of the mutual information on that simplex, the distance to the (compact) set of capacity-achieving distributions $\Pi$ must vanish:
d(P_{\bar{X}}, \Pi) \to 0 .   (192)
If the capacity-achieving input distribution $P^*_X$ is unique, then (192) shows the convergence $P_{\bar{X}} \to P^*_X$ in the (strong) sense of total variation.
H. Input distribution: AWGN
In the case of the AWGN, just like in the discrete case, (50) implies that for any capacity-achieving sequence of codes we have
P^{(n)}_{\bar{X}} = \frac{1}{n} \sum_{j=1}^{n} P_{X_j} \to P^*_X \triangleq \mathcal{N}(0, P) ,   (193)
however, in the sense of weak convergence of distributions only. Indeed, first, the collection of all distributions on $\mathbb{R}$ with second moment not exceeding $P$ is weakly compact by Prokhorov's criterion. Thus, the sequence of empirical input marginal distributions $P^{(n)}_{\bar{X}}$ may be assumed to weakly converge to some distribution $Q_X$. On the other hand, for the sequence of induced empirical output distributions $P^{(n)}_{\bar{Y}}$, from (50) we get
D(P^{(n)}_{\bar{Y}} \| P^*_Y) \to 0 .   (194)
Thus, by the weak lower-semicontinuity of relative entropy [32] we have $D(Q_Y \| P^*_Y) = 0$, where $Q_Y = Q_X * \mathcal{N}(0,1)$ is the output distribution induced by $Q_X$ over the AWGN channel $P_{Y|X}$ and $*$ is convolution. Since the map $Q_X \mapsto Q_X * \mathcal{N}(0,1)$ is injective (e.g. by computing characteristic functions), we must also have $Q_X = P^*_X$, and thus (193) holds.
We now discuss whether (193) can be claimed in a stronger topology than the weak one. Since $P_{\bar{X}}$ is purely atomic and $P^*_X$ is purely diffuse, we have
\|P_{\bar{X}} - P^*_X\|_{TV} = 1 ,   (195)
and convergence in total variation (let alone in relative entropy) cannot hold.
On the other hand, it is quite clear (and also follows from a quantitative estimate in (267) below) that the second moment of $\frac{1}{n} \sum P_{X_j}$ necessarily converges to that of $\mathcal{N}(0, P)$. Together, weak convergence and control of second moments imply [29, (12), p. 7]
W_2^2\left( \frac{1}{n} \sum_{j=1}^{n} P_{X_j}, P^*_X \right) \to 0 .   (196)
Therefore (193) holds in the sense of the topology metrized by the $W_2$-distance.
Note that convexity properties of $W_2^2(\cdot, \cdot)$ imply
W_2^2\left( \frac{1}{n} \sum_{j=1}^{n} P_{X_j}, P^*_X \right) \le \frac{1}{n} \sum_{j=1}^{n} W_2^2(P_{X_j}, P^*_X)   (197)
\le \frac{1}{n} W_2^2(P_{X^n}, P^*_{X^n}) ,   (198)
where we denoted
P^*_{X^n} \triangleq (P^*_X)^n = \mathcal{N}(0, P I_n) .   (199)
Comparing (196) and (198), it is natural to conjecture a stronger result: for any capacity-achieving sequence of codes
\frac{1}{n} W_2^2(P_{X^n}, P^*_{X^n}) \to 0 .   (200)
Another reason to conjecture (200) arises from considering the behavior of the Wasserstein distance under convolutions. Indeed, from the $T_2$ transportation inequality (171) and the relative entropy bound (122) we have
\frac{1}{n} W_2^2(P_{X^n} * \mathcal{N}(0, I_n), P^*_{X^n} * \mathcal{N}(0, I_n)) \to 0 ,   (201)
since by definition
P_{Y^n} = P_{X^n} * \mathcal{N}(0, I_n) ,   (202)
P^*_{Y^n} = P^*_{X^n} * \mathcal{N}(0, I_n) ,   (203)
where $*$ denotes convolution of distributions on $\mathbb{R}^n$. Trivially, for any probability measures $P$, $Q$ and $N$ on $\mathbb{R}^n$ it is true that (e.g. [29, Proposition 7.17])
W_2(P * N, Q * N) \le W_2(P, Q) .   (204)
Thus, overall we have
0 \leftarrow \frac{1}{\sqrt{n}} W_2(P_{X^n} * \mathcal{N}(0, I_n), P^*_{X^n} * \mathcal{N}(0, I_n)) \le \frac{1}{\sqrt{n}} W_2(P_{X^n}, P^*_{X^n}) ,   (205)
and (200) simply conjectures that (conversely to (204)) the convolution with a Gaussian kernel is unable to significantly decrease $W_2$.
Despite the foregoing intuitive considerations, the conjecture (200) is false. Indeed, define $D^*(M, n)$ to be the minimum achievable average square distortion among all vector quantizers of the memoryless Gaussian source $\mathcal{N}(0, P)$ for blocklength $n$ and cardinality $M$. In other words,
D^*(M, n) = \frac{1}{n} \inf_Q W_2^2(P^*_{X^n}, Q) ,   (206)
where the infimum is over all probability measures $Q$ supported on $M$ equiprobable atoms in $\mathbb{R}^n$. The standard rate-distortion (converse) lower bound dictates
\frac{1}{n} \log M \ge \frac{1}{2} \log \frac{P}{D^*(M, n)}   (207)
and hence
W_2^2(P_{X^n}, P^*_{X^n}) \ge n D^*(M, n)   (208)
\ge n P \exp\left\{ -\frac{2 \log M}{n} \right\} ,   (209)
which shows that for any sequence of codes with $\log M_n = O(n)$, the normalized transportation distance stays strictly bounded away from zero:
\liminf_{n \to \infty} \frac{1}{\sqrt{n}} W_2(P_{X^n}, P^*_{X^n}) > 0 .   (210)

V. BINARY HYPOTHESIS TESTING $P_{Y^n}$ VS. $P^*_{Y^n}$
We now turn to the question of distinguishing $P_{Y^n}$ from $P^*_{Y^n}$ in the sense of binary hypothesis testing. First, a simple data-processing reasoning yields
d(\alpha \| \beta_\alpha(P_{Y^n}, P^*_{Y^n})) \le D(P_{Y^n} \| P^*_{Y^n}) ,   (211)
where we have denoted the binary relative entropy
d(x \| y) \triangleq x \log \frac{x}{y} + (1-x) \log \frac{1-x}{1-y} .   (212)
From (122) and (211) we conclude: every $(n, M, \epsilon)_{\rm max,det}$ code must satisfy
\beta_\alpha(P_{Y^n}, P^*_{Y^n}) \ge \left( \frac{M}{2} \right)^{\frac{1}{\alpha}} \exp\left\{ -\frac{nC}{\alpha} - \frac{a\sqrt{n}}{\alpha} \right\}   (213)
for all $0 < \alpha \le 1$. Therefore, in particular, we see that the hypothesis testing problem for discriminating $P_{Y^n}$ from $P^*_{Y^n}$ has zero Stein's exponent (and thus a zero Chernoff exponent), provided the sequence of $(n, M_n, \epsilon)_{\rm max,det}$ codes defining $P_{Y^n}$ is capacity-achieving.
The following result shows that (213) is not tight:
Theorem 12: Consider one of the three types of channels introduced in Section II. Then every $(n, M, \epsilon)_{\rm avg}$ code must satisfy
\beta_\alpha(P_{Y^n}, P^*_{Y^n}) \ge M \exp\{-nC - a_2 \sqrt{n}\} , \qquad \epsilon \le \alpha \le 1 ,   (214)
where $a_2 = a_2(\epsilon, a_1) > 0$ depends only on $\epsilon$ and the constant $a_1$ from (24).
To prove Theorem 12 we need the following strengthening of the meta-converse results [3, Theorems 27 and 31]:
Theorem 13: Consider an $(M, \epsilon)_{\rm avg}$ code for an arbitrary random transformation $P_{Y|X}$. Let $P_X$ be the distribution of the encoder output if all codewords are equally likely and let $P_Y$ be the induced output distribution. Then for any $Q_Y$ and $\epsilon \le \alpha \le 1$ we have
\beta_\alpha(P_Y, Q_Y) \ge M \beta_{\alpha - \epsilon}(P_{XY}, P_X Q_Y) .   (215)
If the code is $(M, \epsilon)_{\rm max,det}$ then, additionally, we have for every $\delta > 0$:
\beta_\alpha(P_Y, Q_Y) \ge \frac{\delta}{1 - \alpha + \delta}\, M \inf_{x \in \mathcal{C}} \beta_{\alpha - \epsilon - \delta}(P_{Y|X=x}, Q_Y) , \qquad \epsilon + \delta \le \alpha \le 1 ,   (216)
where the infimum is taken over the codebook $\mathcal{C}$.
Proof: On the probability space corresponding to a given $(M, \epsilon)_{\rm avg}$ code, define the following random variable:
Z = 1\{\hat{W}(Y) = W, \ Y \in E\} ,   (217)
where $E$ is an arbitrary subset satisfying
P_Y[E] \ge \alpha .   (218)
Precisely as in the original meta-converse [3, Theorem 26], the main idea is to use $Z$ as a suboptimal hypothesis test for discriminating $P_{XY}$ against $P_X Q_Y$. Following the same reasoning as in [3, Theorem 27] one notices that
(P_X Q_Y)[Z = 1] \le \frac{Q_Y[E]}{M}   (219)
and
P_{XY}[Z = 1] \ge \alpha - \epsilon .   (220)
Therefore, by definition of $\beta_\alpha$ we must have
\beta_{\alpha - \epsilon}(P_{XY}, P_X Q_Y) \le \frac{Q_Y[E]}{M} .   (221)
To complete the proof of (215) we just need to take the infimum in (221) over all $E$ satisfying (218).
To prove (216), we again consider any set $E$ satisfying (218). Let $c_i$, $i = 1, \ldots, M$, be the codewords of the $(M, \epsilon)_{\rm max,det}$ code. Define
p_i = P_{Y|X=c_i}[E] ,   (222)
q_i = Q_Y[\hat{W} = i, E] , \qquad i = 1, \ldots, M .   (223)
Since the sets $\{\hat{W} = i\}$ are disjoint, the (arithmetic) average of $q_i$ is bounded as
E[q_W] \le \frac{Q_Y[E]}{M} ,   (224)
whereas because of (218) we have
E[p_W] \ge \alpha .   (225)
Thus, the following lower bound holds:
E\left[ \frac{Q_Y[E]}{M} p_W - \delta q_W \right] \ge \frac{Q_Y[E]}{M} (\alpha - \delta) ,   (226)
implying that there must exist $i \in \{1, \ldots, M\}$ such that
\frac{Q_Y[E]}{M} p_i - \delta q_i \ge \frac{Q_Y[E]}{M} (\alpha - \delta) .   (227)
For such $i$ we clearly have
P_{Y|X=c_i}[E] \ge \alpha - \delta ,   (228)
Q_Y[\hat{W} = i, E] \le \frac{Q_Y[E]}{M} \cdot \frac{1 - \alpha + \delta}{\delta} .   (229)
By the maximal probability of error constraint we deduce
P_{Y|X=c_i}[E, \hat{W} = i] \ge \alpha - \epsilon - \delta   (230)
and thus by the definition of $\beta_\alpha$:
\beta_{\alpha - \epsilon - \delta}(P_{Y|X=c_i}, Q_Y) \le \frac{Q_Y[E]}{M} \cdot \frac{1 - \alpha + \delta}{\delta} .   (231)
Taking the infimum in (231) over all $E$ satisfying (218) completes the proof of (216).
Proof of Theorem 12: To show (214) we first notice that as a consequence of (19), (24) and [3, Lemma 59] (see also [14, (2.71)]) we have for any $x^n \in \mathcal{X}_n$:
\beta_\alpha(P_{Y^n|X^n=x^n}, P^*_{Y^n}) \ge \frac{\alpha}{2} \exp\left\{ -nC - \sqrt{\frac{2 a_1 n}{\alpha}} \right\} .   (232)
From [14, Lemma 32] and the fact that the function of $\alpha$ in the right-hand side of (232) is convex, we obtain that for any $P_{X^n}$
\beta_\alpha(P_{X^n Y^n}, P_{X^n} P^*_{Y^n}) \ge \frac{\alpha}{2} \exp\left\{ -nC - \sqrt{\frac{2 a_1 n}{\alpha}} \right\} .   (233)
Finally, (233) and (215) imply (214).
VI. AEP
FOR THE OUTPUT PROCESS
Yn
We say that a sequence of distributions PY n on Y n (with Y a countable set) satisfies the
asymptotic equipartition property (AEP) if
1
1
log
− H (Y n ) → 0
n
n
PY n (Y )
(234)
in probability.
Although the sequence of output distributions induced by codes is far from being (a finite chunk of) a stationary ergodic process, we will show that (234) is satisfied for ǫ-capacity-achieving codes (and other codes). Thus, in particular, if the channel outputs are to be losslessly compressed and stored for later decoding, \frac{1}{n} H(Y^n) bits per sample would suffice (cf. (57)). In fact, \log \frac{1}{P_{Y^n}(Y^n)} concentrates up to O(\sqrt{n}) around the entropy H(Y^n). Such questions are also interesting in other contexts and for other types of distributions, see [33].
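As a purely illustrative numerical sketch (not part of the development above, and written under the simplifying assumption of a memoryless output process rather than the coded one), the following Python fragment estimates \frac{1}{n}\log\frac{1}{P_{Y^n}(Y^n)} by Monte Carlo and exhibits the concentration around the per-letter entropy that definition (234) asks for; the alphabet and pmf are hypothetical.

    # Sanity check of the AEP statement (234) for an i.i.d. toy process;
    # the coded output process studied here is NOT i.i.d., this only
    # illustrates what the definition requires.
    import numpy as np

    rng = np.random.default_rng(0)
    p = np.array([0.5, 0.3, 0.2])        # assumed single-letter output pmf
    H = -np.sum(p * np.log(p))           # per-letter entropy (nats)

    for n in (100, 1000, 10000):
        samples = rng.choice(len(p), size=(200, n), p=p)
        # minus normalized log-probability of each realization
        neg_log = -np.log(p[samples]).sum(axis=1) / n
        print(n, H, neg_log.mean(), neg_log.std())
    # The empirical standard deviation decays like 1/sqrt(n), matching the
    # O(sqrt(n)) concentration of log 1/P_{Y^n}(Y^n) around H(Y^n).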
Theorem 14: Consider a DMC P_{Y|X} with C_1 < ∞ (with or without input constraints) and a capacity-achieving sequence of (n, M_n, ǫ)_{max,det} codes. Then the output AEP (234) holds.

Proof: In the proof of Theorem 5 it was shown that \log P_{Y^n}(y^n) is Lipschitz with Lipschitz constant upper bounded by a_1, see (60). Thus by (143) and Proposition 10 we find that for any capacity-achieving sequence of codes

Var[\log P_{Y^n}(Y^n)] = o(n^2) ,    n → ∞.    (235)

Thus (234) holds by Chebyshev's inequality.
Surprisingly, for many practically interesting DMCs (such as those with additive noise in a finite group), the estimate (235) can be improved to O(n) even without assuming the code to be capacity-achieving.

Theorem 15: Consider a DMC P_{Y|X} with C_1 < ∞ (with or without input constraints) and such that H(Y|X = x) is constant on X. Then for any sequence of (n, M_n, ǫ)_{max,det} codes there exists a constant a = a(ǫ) such that for all n sufficiently large

Var\Big[\log \frac{1}{P_{Y^n}(Y^n)}\Big] ≤ a n .    (236)

In particular, the output AEP (234) holds.
Proof: First, let X be a random variable and A some event (think P[A^c] ≪ 1) such that

|X − E[X]| ≤ L    (237)

if X ∉ A. Then

Var[X] = E[(X − E[X])^2 1_A] + E[(X − E[X])^2 1_{A^c}]    (238)
≤ E[(X − E[X])^2 1_A] + P[A^c] L^2    (239)
= P[A] Var[X|A] + (E[X] − E[X|A])^2 P[A] + P[A^c] L^2    (240)
≤ Var[X|A] + \frac{P[A^c]}{P[A]} L^2 ,    (241)

where (239) is by (237), (240) is after rewriting E[(X − E[X])^2 | A] in terms of Var[X|A] and applying the obvious identity

E[X|A] = \frac{E[X] − P[A^c]\,E[X|A^c]}{P[A]} ,    (242)

and (241) is because (237) implies |E[X|A^c] − E[X]| ≤ L.
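The chain (238)–(241) is elementary; the following hypothetical Monte Carlo check (the toy distribution, event and variable names are mine, not the paper's) illustrates it numerically for a bounded variable and a high-probability event A.

    # Numerical check of Var[X] <= Var[X|A] + (P[A^c]/P[A]) * L^2 on a toy
    # bounded variable; here L is chosen so that |X - E[X]| <= L holds surely.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(-1.0, 1.0, size=1_000_000) + 5.0 * (rng.random(1_000_000) < 0.01)
    L = X.max() - X.min() + abs(X.mean())   # crude but valid bound on |X - E[X]|
    A = X < 4.0                             # "typical" event, P[A^c] is small

    lhs = X.var()
    rhs = X[A].var() + (1 - A.mean()) / A.mean() * L**2
    print(lhs, rhs, lhs <= rhs)             # the bound (241) holds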
Next, fix n and for any codeword x^n ∈ X^n denote for brevity

d(x^n) = D(P_{Y^n|X^n=x^n} || P_{Y^n}) ,    (243)
v(x^n) = E\Big[\log \frac{1}{P_{Y^n}(Y^n)} \,\Big|\, X^n = x^n\Big]    (244)
       = d(x^n) + H(Y^n | X^n = x^n) .    (245)

If we could show that for some a_1 > 0

Var[d(X^n)] ≤ a_1 n ,    (246)

the proof would be completed as follows:

Var\Big[\log \frac{1}{P_{Y^n}(Y^n)}\Big] = E\Big[ Var\Big[\log \frac{1}{P_{Y^n}(Y^n)} \,\Big|\, X^n\Big] \Big] + Var[v(X^n)]    (247)
≤ a_2 n + Var[v(X^n)]    (248)
= a_2 n + Var[d(X^n)]    (249)
≤ (a_1 + a_2) n ,    (250)

where (248) follows for an appropriate constant a_2 > 0 from (62), (249) is by (245) and because H(Y^n | X^n = x^n) does not depend on x^n by assumption^9, and (250) is by (246).

^9 This argument also shows how to construct a counterexample when H(Y|X = x) is non-constant: merge two constant-composition subcodes of types P_1 and P_2 such that H(W|P_1) ≠ H(W|P_2), where W = P_{Y|X} is the channel matrix. In this case one clearly has Var[\log P_{Y^n}(Y^n)] ≥ Var[v(X^n)] = const · n^2.
To show (246), first notice that

i_{X^n;Y^n}(x^n; y^n) = \log \frac{P_{X^n|Y^n}(x^n|y^n)}{P_{X^n}(x^n)} ≤ \log M_n .    (251)

Second, as shown in (65), one may take S_m = a_3 n in Corollary 3. In turn, this implies that one can take ∆ = \sqrt{\frac{2 a_3 n}{1−ǫ}} and δ' = \frac{1−ǫ}{2} in Theorem 2, that is:

\inf_{x^n} P\Big[\log \frac{dP_{Y^n|X^n=x^n}}{dP_{Y^n}}(Y^n) < d(x^n) + ∆ \,\Big|\, X^n = x^n\Big] ≥ \frac{1+ǫ}{2} .    (252)

Then, applying Theorem 1 with ρ(x^n) = d(x^n) + ∆ to the subcode consisting of all codewords x^n with {d(x^n) ≤ \log M_n − 2∆} we get

P[d(X^n) ≤ \log M_n − 2∆] ≤ \frac{2}{1−ǫ} \exp\{−∆\} ,    (253)

since

E[ρ(X^n) \,|\, d(X^n) ≤ \log M_n − 2∆] ≤ \log M_n − ∆ .    (254)

Now, we apply (241) to d(X^n) with L = \log M_n and A = {d(X^n) > \log M_n − 2∆}. Since Var[X|A] ≤ ∆^2, this yields

Var[d(X^n)] ≤ ∆^2 + \frac{2 \log^2 M_n}{1−ǫ} \exp\{−∆\}    (255)

for all n such that \frac{2}{1−ǫ}\exp\{−∆\} ≤ \frac{1}{2}. Since ∆ = O(\sqrt{n}) and \log M_n = O(n), we conclude from (255) that there must be a constant a_1 such that (246) holds.
We also have a similar extension for the AWGN channel, except that this time the output AEP does not have a natural interpretation in terms of fundamental limits of compression.

Theorem 16: Consider the AWGN channel. Then for any sequence of (n, M_n, ǫ)_{max,det} codes there exists a constant a = a(ǫ) such that for all n sufficiently large

Var\Big[\log \frac{1}{p_{Y^n}(Y^n)}\Big] ≤ a n ,    (256)

where p_{Y^n} is the density of P_{Y^n} with respect to the Lebesgue measure Leb on R^n. In particular, the AEP holds for Y^n.

Proof: The argument follows that of Theorem 15 step by step, with (121) used in place of (62).
Corollary 17: If, in the setting of Theorem 16, the codes are spherical (i.e., the energies of all codewords X^n are equal) or, more generally,

Var[\,||X^n||^2\,] = o(n^2),    (257)

then

\frac{1}{n}\Big( \log \frac{dP_{Y^n}}{dP^*_{Y^n}}(Y^n) − D(P_{Y^n} || P^*_{Y^n}) \Big) → 0    (258)

in P_{Y^n}-probability.

Proof: To apply Chebyshev's inequality to \log \frac{dP_{Y^n}}{dP^*_{Y^n}}(Y^n) we need, in addition to (256), to show

Var[\log p^*_{Y^n}(Y^n)] = o(n^2) ,    (259)

where p^*_{Y^n}(y^n) = (2π(1 + P))^{−n/2} e^{−||y^n||^2 / (2(1+P))}. Introducing i.i.d. Z_j ∼ N(0, 1) we have

Var\Big[\log \frac{1}{p^*_{Y^n}}(Y^n)\Big] = \frac{\log^2 e}{4(1 + P)^2}\, Var\Big[\,||X^n||^2 + 2\sum_{j=1}^n X_j Z_j + ||Z^n||^2\,\Big] .    (260)

The variances of the second and third terms are clearly O(n), while the variance of the first term is o(n^2) by assumption (257). Then (260) implies (259) via

Var[A + B + C] ≤ 3\,(Var[A] + Var[B] + Var[C]) .    (261)
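A minimal numerical sketch of the variance estimate (259)–(260), under my own toy setup in which the codewords are replaced by fixed-energy random vectors (so that assumption (257) holds trivially); it shows the O(n) growth of Var[\log p^*_{Y^n}(Y^n)].

    # Toy check of (259): with ||X^n||^2 fixed (spherical "codewords"), the
    # variance of -log p*_{Y^n}(Y^n) grows only linearly in n, as (260) predicts.
    import numpy as np

    rng = np.random.default_rng(2)
    P = 1.0
    for n in (100, 400, 1600):
        reps = 5000
        X = rng.standard_normal((reps, n))
        X *= np.sqrt(n * P) / np.linalg.norm(X, axis=1, keepdims=True)  # ||X||^2 = nP
        Y = X + rng.standard_normal((reps, n))
        # -log of the N(0, (1+P) I_n) density, up to an additive constant
        neg_log_pstar = np.sum(Y**2, axis=1) / (2 * (1 + P))
        print(n, np.var(neg_log_pstar) / n)   # roughly constant in n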
VII. EXPECTATIONS OF NON-LINEAR POLYNOMIALS OF GAUSSIAN CODES

This section contains results special to the AWGN channel. Because of the algebraic structure available on R^n it is natural to ask whether we can provide approximations for polynomials. Since Theorem 7 shows the validity of (122), all the results for Lipschitz (in particular linear) functions from Section IV follow. Polynomials of higher degree, however, do not admit bounded Lipschitz constants. In this section we discuss the case of quadratic polynomials (Section VII-A) and polynomials of higher degree (Section VII-B). We present results directly in terms of polynomials in (X_1, ..., X_n) on the input space. This is strictly stronger than considering polynomials on the output space, since E[q(Y^n)] = E[q(X^n + Z^n)] and thus, by integrating over the distribution of Z^n, the problem reduces to computing the expectation of a (different) polynomial of X^n. The reverse reduction is clearly not possible.
A. Quadratic forms

We denote the canonical inner product on R^n as

(a, b) = \sum_{j=1}^n a_j b_j ,    (262)

and write the quadratic form corresponding to a matrix A as

(Ax, x) = \sum_{j=1}^n \sum_{i=1}^n a_{i,j} x_i x_j .    (263)

Note that when X^n ∼ N(0, P I_n) we have trivially

E[(AX^n, X^n)] = P \tr A ,    (264)

where tr is the trace operator. Therefore, the next result shows that the distribution of good codes must be close to the isotropic Gaussian distribution, at least in the sense of evaluating quadratic forms:
Theorem 18: For any P > 0 and 0 < ǫ < 1 there exists a constant b = b(P, ǫ) > 0 such that for all (n, M, ǫ)_{max,det} codes and all quadratic forms A such that

−I_n ≤ A ≤ I_n    (265)

we have

| E[(AX^n, X^n)] − P \tr A | ≤ \frac{2(1 + P)\sqrt{n}}{\sqrt{\log e}} \sqrt{nC − \log M + b\sqrt{n}}    (266)

and (a refinement for A = I_n)

\Big| \sum_{j=1}^n E[X_j^2] − nP \Big| ≤ \frac{2(1 + P)}{\log e} \big(nC − \log M + b\sqrt{n}\big) .    (267)

Remark 9: By using the same method as in the proof of Proposition 10 one can also show that the estimate (266) holds on a per-codeword basis for an overwhelming majority of the codewords.
Proof: Denote

Σ = E[X^n (X^n)^T] ,    (268)
V = (I_n + Σ)^{−1} ,    (269)
Q_{Y^n} = N(0, I_n + Σ) ,    (270)
R(y|x) = \log \frac{dP_{Y^n|X^n=x}}{dQ_{Y^n}}(y)    (271)
       = \frac{\log e}{2}\big(\ln\det(I_n + Σ) + (Vy, y) − ||y − x||^2\big) ,    (272)
d(x) = E[R(Y^n|x) | X^n = x]    (273)
     = \frac{\log e}{2}\big(\ln\det(I_n + Σ) + (Vx, x) + \tr(V − I_n)\big) ,    (274)
v(x) = Var[R(Y^n|x) | X^n = x] .    (275)
Denote also the spectrum of Σ by {λi , i = 1, . . . , n} and its eigenvectors by {vi , i = 1, . . . , n}.
We have then
| E[(AX^n, X^n)] − P \tr A | = | \tr((Σ − P I_n)A) |    (276)
= \Big| \sum_{i=1}^n (λ_i − P)(Av_i, v_i) \Big|    (277)
≤ \sum_{i=1}^n |λ_i − P| ,    (278)

where (277) follows by computing the trace in the eigenbasis of Σ and (278) is by (265).
From (274), it is straightforward to check that

D(P_{Y^n|X^n} || Q_{Y^n} | P_{X^n}) = E[d(X^n)]    (279)
= \frac{1}{2} \log\det(I_n + Σ)    (280)
= \frac{1}{2} \sum_{j=1}^n \log(1 + λ_j) .    (281)
By using (261) we estimate

v(x) ≤ 3\log^2 e \Big( \tfrac{1}{4}\,Var[\,||Z^n||^2\,] + \tfrac{1}{4}\,Var[(VZ^n, Z^n)] + Var[(Vx, Z^n)] \Big)    (282)
≤ n\Big(\tfrac{9}{4} + 3P\Big)\log^2 e \;\triangleq\; n b_1^2 ,    (283)

where (283) results from applying the following identities and bounds for Z^n ∼ N(0, I_n):

Var[\,||Z^n||^2\,] = 3n ,    (284)
Var[(a, Z^n)] = ||a||^2 ,    (285)
Var[(VZ^n, Z^n)] = 3 \tr V^2 ≤ 3n ,    (286)
||Vx||^2 ≤ ||x||^2 ≤ nP .    (287)
Finally, from Corollary 3 applied with S_m = b_1^2 n and from (281) we have

\frac{1}{2}\sum_{j=1}^n \log(1 + λ_j) ≥ \log M − b_1\sqrt{n} − \log\frac{2}{1−ǫ}    (288)
≥ \log M − b\sqrt{n}    (289)
= \frac{n}{2}\big(\log(1 + P) − δ_n\big) ,    (290)

where we abbreviated

b = \sqrt{\frac{2}{1−ǫ}\Big(\frac{9}{4} + 3P\Big)}\,\log e + \log\frac{2}{1−ǫ} ,    (291)
δ_n = \frac{2}{n}\big(nC + b\sqrt{n} − \log M\big) .    (292)
To derive (267) consider the chain:

−δ_n ≤ \frac{1}{n}\sum_{j=1}^n \log\frac{1 + λ_j}{1 + P}    (293)
≤ \log\Big(\frac{1}{n}\sum_{j=1}^n \frac{1 + λ_j}{1 + P}\Big)    (294)
≤ \frac{\log e}{n(1 + P)}\sum_{j=1}^n (λ_j − P)    (295)
= \frac{\log e}{n(1 + P)}\big(E[\,||X^n||^2\,] − nP\big) ,    (296)

where (293) is (290), (294) is by Jensen's inequality, and (295) is by \log x ≤ (x − 1)\log e. Note that (267) is equivalent to (296). Finally, (266) follows from (278), (293) and the next lemma applied with X equiprobable on \{\frac{1 + λ_i}{1 + P}, i = 1, ..., n\}.
Lemma 19: Let X > 0 and E[X] ≤ 1. Then

E[|X − 1|] ≤ \sqrt{2\,E\Big[\ln\frac{1}{X}\Big]} .    (297)

Proof: Define two distributions on R_+:

P[E] ≜ P[X ∈ E] ,    (298)
Q[E] ≜ E[X · 1\{X ∈ E\}] + (1 − E[X])\,1\{0 ∈ E\} .    (299)

Then we have

2\,||P − Q||_{TV} = 1 − E[X] + E[|X − 1|] ,    (300)
D(P||Q) = E\Big[\log\frac{1}{X}\Big] ,    (301)

and (297) follows by (129).
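A quick hypothetical numerical check of the Pinsker-type bound (297), using a toy lognormal distribution of my own choosing (any positive X with E[X] ≤ 1 would do):

    # Monte Carlo check of Lemma 19: for X > 0 with E[X] <= 1,
    # E|X - 1| <= sqrt(2 E[ln(1/X)]).
    import numpy as np

    rng = np.random.default_rng(3)
    X = np.exp(rng.normal(-0.5, 0.7, size=1_000_000))  # lognormal, E[X] < 1
    print(X.mean())                                     # confirm E[X] <= 1
    lhs = np.abs(X - 1).mean()
    rhs = np.sqrt(2 * np.log(1.0 / X).mean())
    print(lhs, rhs, lhs <= rhs)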
The proof of Theorem 18 relied on a direct application of the main inequality (in the form of Corollary 3) and is independent of the previous estimate (122). At the expense of a more technical proof we could derive an order-optimal form of Theorem 18 starting from (122) and using concentration properties of Lipschitz functions. Indeed, notice that because E[Z^n] = 0 we have

E[(AY^n, Y^n)] = E[(AX^n, X^n)] + \tr A .    (302)

Thus, (266) follows from (122) if we can show

| E[(AY^n, Y^n)] − E[(AY^{*n}, Y^{*n})] | ≤ b\sqrt{n\,D(P_{Y^n} || P^*_{Y^n})} .    (303)

This is precisely what Corollary 9 would imply if the function y ↦ (Ay, y) were Lipschitz with constant O(\sqrt{n}). However, (Ay, y) is generally not Lipschitz when considered on the entire R^n. On the other hand, it is clear that from the point of view of evaluating both E[(AY^n, Y^n)] and E[(AY^{*n}, Y^{*n})] only vectors of norm O(\sqrt{n}) are important, and when restricted to the ball S = \{y : ||y||_2 ≤ b\sqrt{n}\} the quadratic form (Ay, y) does have the required Lipschitz constant of O(\sqrt{n}). This approximation idea can be made precise using Kirzbraun's theorem (see [34] for a short proof) to extend (Ay, y) beyond the ball S, preserving the maximum absolute value and the Lipschitz constant O(\sqrt{n}). Another method of showing (303) is to use the Bobkov–Götze extension of Gaussian concentration (140) to non-Lipschitz functions [30, Theorem 1.2] to estimate the moment generating function of (AY^{*n}, Y^{*n}) and apply (147) with t = \sqrt{\frac{1}{n} D(P_{Y^n} || P^*_{Y^n})}. Both methods yield (303), and hence (266), but with less sharp constants than those in Theorem 18.
B. Behavior of ||x||_q

The next natural question is to consider polynomials of higher degree. The simplest examples of such polynomials are F(x) = \sum_{j=1}^n x_j^q for some power q, to the analysis of which we now proceed. To formalize the problem, consider 1 ≤ q ≤ ∞ and define the q-th norm of the input vector in the usual way:

||x||_q ≜ \Big(\sum_{i=1}^n |x_i|^q\Big)^{1/q} .    (304)

The aim of this section is to investigate the values of ||x||_q for the codewords of good codes for the AWGN channel. Notice that when the coordinates of x are independent Gaussians we expect to have

\sum_{i=1}^n |x_i|^q ≈ n\,E[|Z|^q] ,    (305)

where Z ∼ N(0, 1). In fact, it can be shown that there exists a sequence of capacity-achieving codes and constants B_q, 1 ≤ q ≤ ∞, such that every codeword x at every blocklength n satisfies^{10}

||x||_q ≤ B_q n^{1/q} = O(n^{1/q}) ,    1 ≤ q < ∞,    (306)

and

||x||_∞ ≤ B_∞\sqrt{\log n} = O(\sqrt{\log n}) .    (307)
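To make (305)–(307) concrete, here is a small illustrative sketch (my own, with a single i.i.d. Gaussian vector standing in for a random-coding codeword) showing the n^{1/q} and \sqrt{\log n} scalings; the empirical constants printed are not the B_q of the statement above.

    # Empirical l_q norms of an i.i.d. N(0, P) vector, illustrating (305)-(307).
    import numpy as np

    rng = np.random.default_rng(4)
    P = 1.0
    for n in (100, 1000, 10000):
        x = rng.normal(0.0, np.sqrt(P), size=n)
        for q in (1, 2, 4):
            lq = np.sum(np.abs(x) ** q) ** (1.0 / q)
            print(n, q, lq / n ** (1.0 / q))      # roughly (E|Z|^q)^{1/q} * sqrt(P)
        print(n, "inf", np.max(np.abs(x)) / np.sqrt(np.log(n)))  # roughly sqrt(2P)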
But do (306)–(307) hold (possibly with different constants) for any good code?

It turns out that the answer depends on the range of q and on the degree of optimality of the code. Our findings are summarized in Table I. The precise meaning of each entry will be clear from Theorems 20, 23 and their corollaries. The main observation is that the closer the code size comes to M^*(n, ǫ), the better the ℓ_q-norms reflect those of random Gaussian codewords (306)–(307). Loosely speaking, very little can be said about the ℓ_q-norms of capacity-achieving codes, while O(log n)-achieving codes are almost indistinguishable from random Gaussian ones. In particular, we see that, for example, for capacity-achieving codes it is not possible to approximate expectations of polynomials of degrees higher than 2 (or 4 for dispersion-achieving codes) by assuming Gaussian inputs, since even the asymptotic growth rate with n can be dramatically different. The question of whether we can approximate expectations of arbitrary polynomials for O(log n)-achieving codes remains open.

We proceed to support the statements made in Table I.

In fact, each estimate in Table I, except n^{1/q}\log^{(q−4)/(2q)} n, is tight in the following sense: if the entry is n^α, then there exists a constant B_q and a sequence of O(log n)-, dispersion-, O(\sqrt{n})-, or capacity-achieving (n, M_n, ǫ)_{max,det} codes such that each codeword x ∈ R^n satisfies for all n ≥ 1

||x||_q ≥ B_q n^α .    (308)

If the entry in the table states o(n^α) then there is B_q such that for any sequence τ_n → 0 there exists a sequence of O(log n)-, dispersion-, O(\sqrt{n})-, or capacity-achieving (n, M_n, ǫ)_{max,det} codes such that each codeword satisfies for all n ≥ 1

||x||_q ≥ B_q τ_n n^α .    (309)

^{10} This does not follow from a simple random-coding argument, since we want the property to hold for every codeword, which constitutes exponentially many constraints. However, the claim can indeed be shown by invoking the κβ-bound [3, Theorem 25] with a suitably chosen constraint set F.
TABLE I
BEHAVIOR OF ℓ_q NORMS ||x||_q OF CODEWORDS FROM CODES FOR THE AWGN CHANNEL.

Code                      | 1 ≤ q ≤ 2 | 2 < q ≤ 4  | 4 < q < ∞                   | q = ∞
--------------------------|-----------|------------|-----------------------------|-------------
random Gaussian           | n^{1/q}   | n^{1/q}    | n^{1/q}                     | √(log n)
any O(log n)-achieving    | n^{1/q}   | n^{1/q}    | n^{1/q} log^{(q−4)/(2q)} n  | √(log n)
any dispersion-achieving  | n^{1/q}   | n^{1/q}    | o(n^{1/4})                  | o(n^{1/4})
any O(√n)-achieving       | n^{1/q}   | n^{1/q}    | n^{1/4}                     | n^{1/4}
any capacity-achieving    | n^{1/q}   | o(n^{1/2}) | o(n^{1/2})                  | o(n^{1/2})
any code                  | n^{1/q}   | n^{1/2}    | n^{1/2}                     | n^{1/2}

Note: All estimates, except n^{1/q} log^{(q−4)/(2q)} n, are shown to be tight.
First, notice that a code from any row is an example of a code for the next row, so we only need to consider each entry which is worse than the one directly above it. Thus it suffices to show the tightness of o(n^{1/4}), n^{1/4}, o(n^{1/2}) and n^{1/2}.
To that end, recall that by [3, Theorem 54] the maximal number of codewords M^*(n, ǫ) at a fixed probability of error ǫ for the AWGN channel satisfies

\log M^*(n, ǫ) = nC − \sqrt{nV}\,Q^{−1}(ǫ) + O(\log n) ,    (310)

where V(P) = \frac{\log^2 e}{2}\,\frac{P(P+2)}{(P+1)^2} is the channel dispersion. Next, we fix a sequence δ_n → 0 such that \sqrt{n}\,δ_n → ∞ and construct the following sequence of codes. The first coordinate is x_1 = \sqrt{nδ_n P} for every codeword, and the rest (x_2, ..., x_n) are chosen as the coordinates of an optimal AWGN code for blocklength n − 1 and power constraint (1 − δ_n)P. Following the argument of [3, Theorem 67], the number of codewords M_n in such a code will be at least

\log M_n = (n − 1)\,C((1 − δ_n)P) − \sqrt{(n − 1)\,V((1 − δ_n)P)}\,Q^{−1}(ǫ) + O(1)    (311)
= nC(P) − \sqrt{nV(P)}\,Q^{−1}(ǫ) + O(nδ_n) .    (312)

At the same time, because the coordinate x_1 of each codeword x is abnormally large, we have

||x||_q ≥ \sqrt{nδ_n P} .    (313)
So all the examples are constructed by choosing a suitable δ_n as follows:
• Row 1: see (306)–(307).
• Row 2: nothing to prove.
• Row 3: for entries o(n^{1/4}), taking δ_n = \frac{τ_n^2}{\sqrt{n}} yields a dispersion-achieving code according to (312); the estimate (309) follows from (313).
• Row 4: for entries n^{1/4}, taking δ_n = \frac{1}{\sqrt{n}} yields an O(\sqrt{n})-achieving code according to (312); the estimate (308) follows from (313).
• Row 5: for entries o(n^{1/2}), taking δ_n = τ_n^2 yields a capacity-achieving code according to (312); the estimate (309) follows from (313).
• Row 6: for entries n^{1/2} we can take a codebook with one codeword (\sqrt{nP}, 0, ..., 0).

Remark 10: The proof can be modified to show that in each case there are codes that simultaneously achieve all entries in the respective row of Table I (except n^{1/q}\log^{(q−4)/(2q)} n).
We proceed to proving upper bounds. First, we recall some simple relations between the ℓ_q norms of vectors in R^n. To estimate a lower-q norm in terms of a higher one, we invoke Hölder's inequality:

||x||_q ≤ n^{\frac{1}{q} − \frac{1}{p}} ||x||_p ,    1 ≤ q ≤ p ≤ ∞.    (314)

To provide estimates for q > p, notice that obviously

||x||_∞ ≤ ||x||_p .    (315)

Then, we can extend to q < ∞ via the following chain:

||x||_q ≤ ||x||_∞^{1 − \frac{p}{q}} \, ||x||_p^{\frac{p}{q}}    (316)
≤ ||x||_p ,    q ≥ p.    (317)

Trivially, for q = 2 the answer is given by the power constraint:

||x||_2 ≤ \sqrt{nP} .    (318)

Thus by (314) and (317) we get: each codeword of any code for the AWGN(P) channel must satisfy

||x||_q ≤ \sqrt{P}\,n^{1/q} for 1 ≤ q ≤ 2, and ||x||_q ≤ \sqrt{P}\,n^{1/2} for 2 < q ≤ ∞.    (319)

This proves the entries in the first column and the last row of Table I.
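A brief, purely illustrative check of the norm relations (314) and (316)–(317) on a random vector (any x ∈ R^n would do; the tolerances only guard against floating-point rounding):

    # Verify the Holder-type relation (314) and the interpolation (316)-(317).
    import numpy as np

    rng = np.random.default_rng(5)
    n = 1000
    x = rng.normal(size=n)
    norm = lambda v, q: np.max(np.abs(v)) if np.isinf(q) else np.sum(np.abs(v) ** q) ** (1 / q)

    q, p = 1.5, 4.0
    assert norm(x, q) <= n ** (1 / q - 1 / p) * norm(x, p) + 1e-9        # (314), q <= p
    q, p = 6.0, 4.0
    assert norm(x, q) <= norm(x, np.inf) ** (1 - p / q) * norm(x, p) ** (p / q) + 1e-9  # (316)
    assert norm(x, q) <= norm(x, p) + 1e-9                               # (317)
    print("norm relations hold on this sample")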
Before proceeding to justify the upper bounds for q > 2, we point out an obvious problem with trying to estimate ||x||_q for each codeword. Given any code whose codewords lie exactly on the power sphere, we can always apply an orthogonal transformation to it so that one of the codewords becomes (\sqrt{nP}, 0, 0, ..., 0). For such a codeword we have

||x||_q = \sqrt{nP}    (320)

and the upper bound (319) is tight. Therefore, to improve upon (319) we must necessarily consider subsets of codewords of a given code. For simplicity, below we show estimates for half of all codewords.

The following result, proven in the Appendix, takes care of the sup-norm:

Theorem 20 (q = ∞): For any 0 < ǫ < 1 and P > 0 there exists a constant b = b(P, ǫ) such that for any^{11} n ≥ N(P, ǫ) and any (n, M, ǫ)_{max,det} code for the AWGN(P) channel at least half of the codewords satisfy

||x||_∞^2 ≤ \frac{4(1 + P)}{\log e}\Big(nC − \sqrt{nV}\,Q^{−1}(ǫ) + 2\log n + \log b − \log\frac{M}{2}\Big) ,    (321)

where C and V are the capacity and the dispersion. In particular, the expression in (·) is non-negative for all codes and blocklengths.
^{11} N(P, ǫ) = 8(1 + 2P^{−1})(Q^{−1}(ǫ))^2 for ǫ < 1/2 and N(P, ǫ) = 1 for ǫ ≥ 1/2.

Remark 11: What sets Theorem 20 apart from other results is its sensitivity to whether the code achieves the dispersion term. This is unlike estimates of the form (122), which only sense whether the code is O(\sqrt{n})-achieving or not.

From Theorem 20 the explanation of the entries in the last column of Table I becomes obvious: the more terms the code achieves in the asymptotic expansion of \log M^*(n, ǫ), the closer ||x||_∞ becomes to the O(\sqrt{\log n}) which arises from a random Gaussian codeword (307). To be specific, we give the following exact statements:

Corollary 21 (q = ∞ for O(log n)-codes): For any 0 < ǫ < 1 and P > 0 there exists a constant b = b(P, ǫ) such that for any (n, M_n, ǫ)_{max,det} code for the AWGN(P) channel with

\log M_n ≥ nC − \sqrt{nV}\,Q^{−1}(ǫ) − K\log n    (322)

for some K > 0, we have that at least half of the codewords satisfy

||x||_∞ ≤ \sqrt{(b + K)\log n + b} .    (323)
Corollary 22 (q = ∞ for capacity-achieving codes): For any capacity-achieving sequence of (n, M_n, ǫ)_{max,det} codes there exists a sequence τ_n → 0 such that for at least half of the codewords we have

||x||_∞ ≤ τ_n n^{1/2} .    (324)

Similarly, for any dispersion-achieving sequence of (n, M_n, ǫ)_{max,det} codes there exists a sequence τ_n → 0 such that for at least half of the codewords we have

||x||_∞ ≤ τ_n n^{1/4} .    (325)

Remark 12: By (309), the sequences τ_n in Corollary 22 are necessarily code-dependent.

For q = 4 we have the following estimate (see the Appendix for the proof):

Theorem 23 (q = 4): For any 0 < ǫ < 1/2 and P > 0 there exist constants b_1 > 0 and b_2 > 0, depending on P and ǫ, such that for any (n, M, ǫ)_{max,det} code for the AWGN(P) channel at least half of the codewords satisfy

||x||_4^2 ≤ \frac{2}{b_1}\Big(nC + b_2\sqrt{n} − \log\frac{M}{2}\Big) ,    (326)

where C is the capacity of the channel. In fact, we also have a lower bound

E[\,||x||_4^4\,] ≥ 3nP^2 − \big(nC − \log M + b_3\sqrt{n}\big)\,n^{1/4} ,    (327)

for some b_3 = b_3(P, ǫ) > 0.

Remark 13: Note that E[\,||z||_4^4\,] = 3nP^2 for z ∼ N(0, P)^n.
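A one-line numerical confirmation of the Gaussian fourth-moment identity in Remark 13 (illustrative only; the parameters are arbitrary):

    # Check E[||z||_4^4] = 3 n P^2 for z ~ N(0, P)^n.
    import numpy as np

    rng = np.random.default_rng(6)
    n, P, reps = 50, 2.0, 200000
    z = rng.normal(0.0, np.sqrt(P), size=(reps, n))
    print(np.mean(np.sum(z**4, axis=1)), 3 * n * P**2)   # the two numbers agree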
We can now complete the proof of the results in Table I:
1) Row 2: q = 4 is Theorem 23; 2 < q ≤ 4 follows by (314) with p = 4; q = ∞ is Corollary 21; for 4 < q < ∞ we apply interpolation via (316) with p = 4.
2) Row 3: q ≤ 4 is treated as in Row 2; q = ∞ is Corollary 22; for 4 < q < ∞ apply interpolation (316) with p = 4.
3) Row 4: q ≤ 4 is treated as in Row 2; q ≥ 4 follows from (317) with p = 4.
4) Row 5: q = ∞ is Corollary 22; for 2 < q < ∞ we apply interpolation (316) with p = 2.

The upshot of this section is that we cannot approximate values of non-quadratic polynomials in x (or y) by assuming i.i.d. Gaussian entries, unless the code is O(\sqrt{n})-achieving, in which case we can go up to degree 4 but still have to be content with one-sided (lower) bounds only, cf. (327).^{12}

^{12} Using quite similar methods, (327) can be extended to certain bi-quadratic forms, i.e., 4-th degree polynomials \sum_{i,j} a_{i−j} x_i^2 x_j^2, where A = (a_{i−j}) is a Toeplitz positive semi-definite matrix.
Before closing this discussion we demonstrate the sharpness of the arguments in this section by considering the following example. Suppose that the power of a codeword x from a capacity-dispersion-optimal code is measured by an imperfect tool, so that its reading is described by

E = \frac{1}{n}\sum_{i=1}^n x_i^2 H_i ,    (328)

where the H_i are i.i.d. bounded random variables with expectation and variance both equal to 1. For large blocklengths n we expect E to be Gaussian with mean P and variance \frac{1}{n^2}||x||_4^4. On the one hand, Theorem 23 shows that the variance will not explode; on the other hand, (327) shows that it will be at least as large as that of a Gaussian codebook. Finally, to establish the asymptotic normality rigorously, the usual approach based on checking the Lyapunov condition will fail, as shown by (309), but the Lindeberg condition does hold as a consequence of Corollary 22. If, in addition, the code is O(log n)-achieving then

P[|E − E[E]| > δ] ≤ e^{−\frac{nδ^2}{b_1 + b_2 δ\sqrt{\log n}}} .    (329)
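The following hedged simulation (a toy fixed-energy vector and toy uniform H_i of my own choosing, not the paper's construction) illustrates the setting of (328): the mean of the reading E is close to P and its variance matches \frac{1}{n^2}||x||_4^4.

    # Toy illustration of (328): E = (1/n) sum x_i^2 H_i with bounded H_i,
    # E[H_i] = Var[H_i] = 1; here H_i is shifted uniform and ||x||^2 = nP.
    import numpy as np

    rng = np.random.default_rng(7)
    n, P, reps = 500, 1.0, 20000
    x = rng.normal(0.0, np.sqrt(P), size=n)
    x *= np.sqrt(n * P) / np.linalg.norm(x)                  # enforce ||x||^2 = nP
    a = np.sqrt(3.0)
    H = 1.0 + rng.uniform(-a, a, size=(reps, n))             # mean 1, variance 1, bounded
    E = (x**2 * H).mean(axis=1)
    print(E.mean(), P)                                        # mean close to P
    print(E.var(), np.sum(x**4) / n**2)                       # variance ~ ||x||_4^4 / n^2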
VIII. EXTENSION TO OTHER CHANNELS: TILTING
Let us review the scheme for investigating functions of the output F(Y^n) that was employed in this paper so far. First, the inequality (122) was shown by verifying that Q_Y = P_{Y^n} satisfies the conditions of Theorem 2. Then an approximation of the form

F(Y^n) ≈ E[F(Y^n)] ≈ E[F(Y^{*n})]    (330)

follows by the arguments in Section IV, cf. Propositions 8 and 10. Therefore, in some sense, an application of Theorem 2 with Q_Y = P_{Y^n} implies the approximation (330) simultaneously for all concentrated (e.g. Lipschitz) functions. On one hand, this is very convenient since all the channel-specific work is isolated in proving (122). On the other hand, verifying the conditions of Theorem 2 for Q_Y = P_{Y^n} may not in general be feasible, even for memoryless channels. In this section we show how Theorem 2 can be used to investigate (330) for a specific function F in the absence of the universal estimate (122). A different application of the same technique was previously reported in [8, Section VIII].
In this section we focus on an arbitrary random transformation P_{Y|X} : X → Y. For convenience we introduce Y^♭ distributed according to an auxiliary distribution Q_Y and denote, analogously to (145),

E[F(Y^♭)] ≜ \int F(y)\, dQ_Y ,    (331)

where F is an arbitrary function on Y. Suppose that F is such that

Z_F = \log E[\exp\{F(Y^♭)\}] < ∞ .    (332)

Denote by Q_Y^{(F)} an F-tilting of Q_Y, namely

dQ_Y^{(F)} = \exp\{F − Z_F\}\, dQ_Y .    (333)
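As a minimal sketch of the tilting operation (333) on a finite alphabet (the alphabet, Q and F below are hypothetical, and natural logarithms are used for simplicity), one can form Q^{(F)} and check its normalization through Z_F:

    # Exponential tilting (333) of a discrete Q by a function F:
    # dQ^{(F)} = exp{F - Z_F} dQ, with Z_F = log E[exp F(Y_flat)] as in (332).
    import numpy as np

    Q = np.array([0.2, 0.5, 0.3])        # assumed auxiliary distribution Q_Y
    F = np.array([1.0, -0.5, 2.0])       # assumed function F on a 3-letter alphabet

    Z_F = np.log(np.sum(Q * np.exp(F)))  # (332)
    Q_F = Q * np.exp(F - Z_F)            # (333); sums to 1 by construction
    print(Q_F, Q_F.sum())

    # Tilting shifts mass toward letters where F is large, e.g. E[F] increases:
    print(np.sum(Q * F), np.sum(Q_F * F))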
The core idea of our technique is that if F is sufficiently regular and Q_Y satisfies the conditions of Theorem 2, then Q_Y^{(F)} also does. Consequently, the expectation of F under P_Y (induced by the code) can be investigated in terms of the moment-generating function of F under Q_Y. For brevity we only present a variance-based version (similar to Corollary 3):
Theorem 24: Let Q_Y and F be such that (332) holds and

S = \sup_x Var\Big[\log\frac{dP_{Y|X=x}}{dQ_Y}(Y) \,\Big|\, X = x\Big] < ∞ ,    (334)
S_F = \sup_x Var[F(Y) \,|\, X = x] .    (335)

Then there exists a constant a = a(ǫ, S) > 0 such that for any (M, ǫ)_{max,det} code we have for all 0 ≤ t ≤ 1

t\,E[F(Y)] − \log E[\exp\{tF(Y^♭)\}] ≤ D(P_{Y|X} || Q_Y | P_X) − \log M + a\sqrt{S + t^2 S_F}    (336)
≤ D(P_{Y|X} || Q_Y | P_X) − \log M + a\sqrt{S}\Big(1 + \frac{S_F}{2S}\,t^2\Big) .    (337)

Proof: Note that since

\log\frac{dP_{Y|X}}{dQ_Y^{(F)}} = \log\frac{dP_{Y|X}}{dQ_Y} − F(Y) + Z_F ,    (338)

we have for any 0 ≤ t ≤ 1:

D(P_{Y|X} || Q_Y^{(tF)} | P_X) = D(P_{Y|X} || Q_Y | P_X) − t\,E[F(Y)] + \log E[\exp\{tF(Y^♭)\}] ,    (339)

and

Var\Big[\log\frac{dP_{Y|X}}{dQ_Y^{(tF)}} \,\Big|\, X = x\Big] ≤ 2(S + t^2 S_F) ,    (340)

where we used Var[A + B] ≤ 2(Var[A] + Var[B]). Then from Corollary 3 applied with Q_Y and S replaced by Q_Y^{(tF)} and 2S + 2t^2 S_F, respectively, we obtain (336), while (337) follows by \sqrt{1 + x} ≤ 1 + \frac{x}{2}.

To see that (337) is actually useful, notice that, for example, Corollary 9 can easily be recovered by taking Q_Y = P^*_{Y^n}, applying (19), estimating the moment-generating function via (140) and S_F via the Poincaré inequality:

S_F ≤ b\,||F||^2_{Lip} .    (341)

The advantage of such a direct approach is that in fact a "dimension-dependent" Poincaré inequality would also suffice in this case:

S_F ≤ bn\,||F||^2_{Lip} ,    (342)

whereas the full "dimensionless" Poincaré inequality (341) was required in order to show (122).
APPENDIX

In this appendix we prove the results from Section VII-B.

To prove Theorem 20, the basic intuition is that any codeword which is abnormally peaky (i.e., has a high value of ||x||_∞) is wasteful in terms of allocating its power budget. Thus a good capacity- or dispersion-achieving codebook cannot have too many such wasteful codewords. A non-asymptotic formalization of this intuitive argument is as follows:

Lemma 25: For any ǫ ≤ 1/2 and P > 0 there exists a constant b = b(P, ǫ) such that given any (n, M, ǫ)_{max,det} code for the AWGN(P) channel, we have for any 0 ≤ λ ≤ P:^{13}

P[\,||X^n||_∞ ≥ \sqrt{λn}\,] ≤ \frac{b}{M}\exp\Big\{nC(P − λ) − \sqrt{nV(P − λ)}\,Q^{−1}(ǫ) + 2\log n\Big\} ,    (343)

where C(P) and V(P) are the capacity and the dispersion of the AWGN(P) channel, and X^n is the output of the encoder assuming equiprobable messages.

^{13} For ǫ > 1/2 one must replace V(P − λ) with V(P) in (343). This does not modify any of the arguments required to prove Theorem 20.
Proof: Our method is to apply the meta-converse in the form of [3, Theorem 30] to the subcode of codewords satisfying ||X^n||_∞ ≥ \sqrt{λn}. The application of a meta-converse requires selecting a suitable auxiliary channel Q_{Y^n|X^n}, which we specify now. For any x ∈ R^n let j^*(x) be the first index such that |x_j| = ||x||_∞; then we set

Q_{Y^n|X^n}(y^n|x) = P_{Y|X}(y_{j^*}|x_{j^*}) \prod_{j ≠ j^*(x)} P^*_Y(y_j) .    (344)

We will show below that for some b_1 = b_1(P) any M-code over the Q-channel (344) has average probability of error ǫ' satisfying

1 − ǫ' ≤ \frac{b_1 n^{3/2}}{M} .    (345)

On the other hand, writing out the expression for \log\frac{dP_{Y^n|X^n=x}}{dQ_{Y^n|X^n=x}}(Y^n), we see that it coincides with the expression for \log\frac{dP_{Y^n|X^n=x}}{dP^*_{Y^n}} except that the term corresponding to j^*(x) is missing; compare with [14, (4.29)]. Thus, one can repeat step by step the analysis in the proof of [3, Theorem 65], with the only difference that nP should be replaced by nP − ||x||_∞^2, reflecting the reduction in energy due to the skipping of j^*. Then we obtain, for some b_2 = b_2(ǫ, P):

\log β_{1−ǫ}(P_{Y^n|X^n=x}, Q_{Y^n|X^n=x}) ≥ −nC\Big(P − \frac{||x||_∞^2}{n}\Big) + \sqrt{nV\Big(P − \frac{||x||_∞^2}{n}\Big)}\,Q^{−1}(ǫ) − \frac{1}{2}\log n − b_2 ,    (346)

which holds simultaneously for all x with ||x|| ≤ \sqrt{nP}. Two remarks are in order: first, the analysis in [3, Theorem 64] must be done replacing n with n − 1, but this difference is absorbed into b_2. Second, to see that b_2 can be chosen independently of x, notice that B(P) in [3, (620)] tends to 0 as P → 0 and hence can be bounded uniformly for all P ∈ [0, P_max].
Denote the cardinality of the subcode \{x : ||x||_∞ ≥ \sqrt{λn}\} by

M_λ = M\,P[\,||X^n||_∞ ≥ \sqrt{λn}\,] .    (347)

Then according to [3, Theorem 30] we get

\inf_x β_{1−ǫ}(P_{Y^n|X^n=x}, Q_{Y^n|X^n=x}) ≤ 1 − ǫ' ,    (348)

where the infimum is over the codewords of the M_λ-subcode. Applying both (345) and (346) we get

\inf_x \Big[ −nC\Big(P − \frac{||x||_∞^2}{n}\Big) + \sqrt{nV\Big(P − \frac{||x||_∞^2}{n}\Big)}\,Q^{−1}(ǫ) − \frac{1}{2}\log n − b_2 \Big] ≤ −\log M_λ + \log b_1 + \frac{3}{2}\log n ,    (349)

and further, since the function of ||x||_∞ on the left-hand side of (349) is monotone in ||x||_∞:

−nC(P − λ) + \sqrt{nV(P − λ)}\,Q^{−1}(ǫ) − \frac{1}{2}\log n − b_2 ≤ −\log M_λ + \log b_1 + \frac{3}{2}\log n .    (350)

Thus, overall,

\log M_λ ≤ nC(P − λ) − \sqrt{nV(P − λ)}\,Q^{−1}(ǫ) + 2\log n + b_2 + \log b_1 ,    (351)

which is equivalent to (343) with b = b_1\exp\{b_2\}.
It remains to show (345). Consider an (n, M, ǫ')_{avg,det} code for the Q-channel and let M_j, j = 1, ..., n, denote the cardinality of the set of all codewords with j^*(x) = j. Let ǫ'_j denote the minimum possible average probability of error of each such codebook achievable with the maximum-likelihood (ML) decoder. Since

1 − ǫ' ≤ \frac{1}{M}\sum_{j=1}^n M_j(1 − ǫ'_j) ,    (352)

it suffices to prove

1 − ǫ'_j ≤ \frac{\sqrt{\frac{2nP}{π}} + 2}{M_j}    (353)

for all j. Without loss of generality assume j = 1, in which case the observations Y_2^n are useless for determining the value of the true codeword. Moreover, the ML decoding regions D_i, i = 1, ..., M_j, for each codeword are disjoint intervals in R^1 (so that the decoder outputs message estimate i whenever Y_1 ∈ D_i). Note that for M_j ≤ 2 there is nothing to prove, so assume otherwise. Denote the first coordinates of the M_j codewords by x_i, i = 1, ..., M_j, and assume (without loss of generality) that −\sqrt{nP} ≤ x_1 ≤ x_2 ≤ \cdots ≤ x_{M_j} ≤ \sqrt{nP} and that D_2, ..., D_{M_j−1} are finite intervals. We then have the following chain:
1 − ǫ'_j = \frac{1}{M_j}\sum_{i=1}^{M_j} P_{Y|X}(D_i|x_i)    (354)
≤ \frac{2}{M_j} + \frac{1}{M_j}\sum_{i=2}^{M_j−1} P_{Y|X}(D_i|x_i)    (355)
≤ \frac{2}{M_j} + \frac{1}{M_j}\sum_{i=2}^{M_j−1}\Big(1 − 2Q\Big(\frac{Leb(D_i)}{2}\Big)\Big)    (356)
≤ \frac{2}{M_j} + \frac{M_j − 2}{M_j}\Big(1 − 2Q\Big(\frac{1}{2M_j − 4}\sum_{i=2}^{M_j−1} Leb(D_i)\Big)\Big)    (357)
≤ \frac{2}{M_j} + \frac{M_j − 2}{M_j}\Big(1 − 2Q\Big(\frac{\sqrt{nP}}{M_j − 2}\Big)\Big)    (358)
≤ \frac{2}{M_j} + \frac{\sqrt{\frac{2nP}{π}}}{M_j} ,    (359)

where in (354) P_{Y|X=x} = N(x, 1), (355) follows by upper-bounding the probability of successful decoding for i = 1 and i = M_j by 1, (356) follows since clearly, for a fixed value of the length Leb(D_i), the optimal location of the interval D_i, maximizing the value P_{Y|X}(D_i|x_i), is centered at x_i, (357) is by Jensen's inequality applied to x ↦ 1 − 2Q(x), which is concave for x ≥ 0, (358) is because

\bigcup_{i=2}^{M_j−1} D_i ⊂ [−\sqrt{nP}, \sqrt{nP}]    (360)

and the D_i are disjoint, and (359) is by

1 − 2Q(x) ≤ \sqrt{\frac{2}{π}}\,x ,    x ≥ 0.    (361)

Thus, (359) completes the proof of (353), (345) and the theorem.
Proof of Theorem 20: Notice that for any 0 ≤ λ ≤ P we have

C(P − λ) ≤ C(P) − \frac{λ\log e}{2(1 + P)} .    (362)

On the other hand, by concavity of \sqrt{V(P)} and since V(0) = 0, we have for any 0 ≤ λ ≤ P

\sqrt{V(P − λ)} ≥ \sqrt{V(P)} − \frac{\sqrt{V(P)}}{P}\,λ .    (363)

Thus, taking s = λn in Lemma 25 we get, with the help of (362) and (363),

P[\,||X^n||_∞^2 ≥ s\,] ≤ \exp\Big\{∆_n − (b_1 − b_2 n^{−1/2})\,s\Big\} ,    (364)

where we denoted for convenience

b_1 = \frac{\log e}{2(1 + P)} ,    (365)
b_2 = \frac{\sqrt{V(P)}}{P}\,Q^{−1}(ǫ) ,    (366)
∆_n = nC(P) − \sqrt{nV(P)}\,Q^{−1}(ǫ) + 2\log n − \log M + \log b .    (367)

Note that Lemma 25 only shows the validity of (364) for 0 ≤ s ≤ nP, but since for s > nP the left-hand side is zero, the statement actually holds for all s ≥ 0. Then for n ≥ N(P, ǫ) we have

b_1 − b_2 n^{−1/2} ≥ \frac{b_1}{2}    (368)

and thus, further upper-bounding (364), we get

P[\,||X^n||_∞^2 ≥ s\,] ≤ \exp\Big\{∆_n − \frac{b_1 s}{2}\Big\} .    (369)

Finally, if the code is so large that ∆_n < 0, then (369) would imply that P[\,||X^n||_∞^2 ≥ s\,] < 1 for all s ≥ 0, which is clearly impossible. Thus we must have ∆_n ≥ 0 for any (n, M, ǫ)_{max,det} code. The proof concludes by taking s = \frac{2(\log 2 + ∆_n)}{b_1} in (369).
Proof of Theorem 23: To prove (326) we will show the following statement: there exist two constants b_0 and b_1 such that for any (n, M_1, ǫ) code for the AWGN(P) channel with codewords x satisfying

||x||_4 ≥ b\,n^{1/4}    (370)

we have an upper bound on the cardinality:

M_1 ≤ \frac{4}{1 − ǫ}\exp\Big\{nC + 2\sqrt{\frac{nV}{1 − ǫ}} − b_1(b − b_0)^2\sqrt{n}\Big\} ,    (371)

provided b ≥ b_0(P, ǫ). From here (326) follows by first lower-bounding (b − b_0)^2 ≥ \frac{b^2}{2} − b_0^2 and then verifying easily that the choice

b^2 = \frac{2}{b_1\sqrt{n}}\Big(nC + b_2\sqrt{n} − \log\frac{M}{2}\Big) ,    (372)

with b_2 = b_0^2 b_1 + 2\sqrt{\frac{V}{1−ǫ}} + \log\frac{4}{1−ǫ}, takes the right-hand side of (371) below M/2.
To prove (371), denote

S = b − \Big(\frac{6}{1 + ǫ}\Big)^{1/4}    (373)

and choose b large enough so that

δ ≜ S − 6^{1/4}\sqrt{1 + P} > 0 .    (374)

Then, on one hand, we have

P_{Y^n}[\,||Y^n||_4 ≥ S n^{1/4}\,] = P[\,||X^n + Z^n||_4 ≥ S n^{1/4}\,]    (375)
≥ P[\,||X^n||_4 − ||Z^n||_4 ≥ S n^{1/4}\,]    (376)
≥ P[\,||Z^n||_4 ≤ n^{1/4}(b − S)\,]    (377)
≥ \frac{1 + ǫ}{2} ,    (378)

where (376) is by the triangle inequality for ||·||_4, (377) is by the constraint (370) and (378) is by the Chebyshev inequality applied to ||Z^n||_4^4 = \sum_{j=1}^n Z_j^4. On the other hand, we have

P^*_{Y^n}[\,||Y^n||_4 ≤ S n^{1/4}\,] = P^*_{Y^n}[\,||Y^n||_4 ≤ (6^{1/4}\sqrt{1 + P} + δ)\,n^{1/4}\,]    (379)
≥ P^*_{Y^n}\big[\{||Y^n||_4 ≤ 6^{1/4}\sqrt{1 + P}\,n^{1/4}\} + \{||Y^n||_4 ≤ δ n^{1/4}\}\big]    (380)
≥ P^*_{Y^n}\big[\{||Y^n||_4 ≤ 6^{1/4}\sqrt{1 + P}\,n^{1/4}\} + \{||Y^n||_2 ≤ δ n^{1/4}\}\big]    (381)
≥ 1 − \exp\{−b_1 δ^2\sqrt{n}\} ,    (382)

where (379) is by the definition of δ in (374), (380) is by the triangle inequality for ||·||_4, which implies the inclusion

\{y : ||y||_4 ≤ a + b\} ⊃ \{y : ||y||_4 ≤ a\} + \{y : ||y||_4 ≤ b\}    (383)

with + denoting the Minkowski sum of sets, (381) is by (317) with p = 2, q = 4, and (382) holds for some b_1 = b_1(P) > 0 by the Gaussian isoperimetric inequality [35], which is applicable since

P^*_{Y^n}[\,||Y^n||_4 ≤ 6^{1/4}\sqrt{1 + P}\,n^{1/4}\,] ≥ \frac{1}{2}    (384)

by the Chebyshev inequality applied to \sum_{j=1}^n Y_j^4 (note: Y^n ∼ N(0, 1 + P)^n under P^*_{Y^n}). As a side remark, we add that the estimate of the large deviations of the sum of 4-th powers of i.i.d. Gaussians as \exp\{−O(\sqrt{n})\} is order-optimal.

Together, (378) and (382) imply

β_{\frac{1+ǫ}{2}}(P_{Y^n}, P^*_{Y^n}) ≤ \exp\{−b_1 δ^2\sqrt{n}\} .    (385)
On the other hand, by [3, Lemma 59] we have for any x with ||x||_2 ≤ \sqrt{nP} and any 0 < α < 1:

β_α(P_{Y^n|X^n=x}, P^*_{Y^n}) ≥ \frac{α}{2}\exp\Big\{−nC − \sqrt{\frac{2nV}{α}}\Big\} ,    (386)

where C and V are the capacity and the dispersion of the AWGN(P) channel. Then, by convexity in α of the right-hand side of (386) and [14, Lemma 32], we have for any input distribution P_{X^n}:

β_α(P_{X^n Y^n}, P_{X^n} P^*_{Y^n}) ≥ \frac{α}{2}\exp\Big\{−nC − \sqrt{\frac{2nV}{α}}\Big\} .    (387)

We complete the proof of (371) by invoking Theorem 13 in the form (215) with Q_Y = P^*_{Y^n} and α = \frac{1+ǫ}{2}:

β_{\frac{1+ǫ}{2}}(P_{Y^n}, P^*_{Y^n}) ≥ M_1\,β_{\frac{1−ǫ}{2}}(P_{X^n Y^n}, P_{X^n} P^*_{Y^n}) .    (388)

Applying the bounds (385) and (387) to (388), we conclude that (371) holds with

b_0 = \Big(\frac{6}{1 + ǫ}\Big)^{1/4} + 6^{1/4}\sqrt{1 + P} .    (389)
Next, we proceed to the proof of (327). On one hand, we have

\sum_{j=1}^n E[Y_j^4] = \sum_{j=1}^n E[(X_j + Z_j)^4]    (390)
= \sum_{j=1}^n E[X_j^4 + 6X_j^2 Z_j^2 + Z_j^4]    (391)
≤ E[\,||x||_4^4\,] + 6nP + 3n ,    (392)

where (390) is by the definition of the AWGN channel, (391) is because X^n and Z^n are independent and thus the odd terms vanish, and (392) is by the power constraint \sum_j X_j^2 ≤ nP. On the other hand, applying Proposition 11 with f(y) = −y^4, θ = 2 and using (122) we obtain^{14}

\sum_{j=1}^n E[Y_j^4] ≥ 3n(1 + P)^2 − \big(nC − \log M + b_3\sqrt{n}\big)\,n^{1/4} ,    (393)

for some b_3 = b_3(P, ǫ) > 0. Comparing (393) and (392), statement (327) follows.

^{14} Of course, a similar Gaussian lower bound holds for any cumulative sum, in particular for any power \sum_j E[|Y_j|^q], q ≥ 1.

We remark that by extending Proposition 11 to expectations like \frac{1}{n−1}\sum_{j=1}^{n−1} E[Y_j^2 Y_{j+1}^2], cf. (187), we could provide a lower bound similar to (327) for more general 4-th degree polynomials in x. For example, it is possible to treat the case of p(x) = \sum_{i,j} a_{i−j} x_i^2 x_j^2, where A = (a_{i−j}) is a Toeplitz positive semi-definite matrix. We would proceed as in (392), computing E[p(Y^n)] in two ways, with the only difference that the peeled-off quadratic polynomial would require an application of Theorem 18 instead of the simple power constraint. Finally, we also mention that the method of (392) does not work for estimating E[\,||x||_6^6\,], because we would need an upper bound E[\,||x||_4^4\,] \lesssim 3nP^2, which is not possible to obtain in the context of O(\sqrt{n})-achieving codes, as the counterexamples (308) show.
REFERENCES
[1] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 379–423 and 623–656, Jul./Oct.
1948.
[2] S. Shamai and S. Verdú, “The empirical distribution of good codes,” IEEE Trans. Inf. Theory, vol. 43, no. 3, pp. 836–846,
1997.
[3] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength regime,” IEEE Trans. Inf. Theory,
vol. 56, no. 5, pp. 2307–2359, May 2010.
[4] A. Tchamkerten, V. Chandar, and G. W. Wornell, “Communication under strong asynchronism,” IEEE Trans. Inf. Theory,
vol. 55, no. 10, pp. 4508–4528, Oct. 2009.
[5] R. Ahlswede, P. Gács, and J. Körner, “Bounds on conditional probabilities with applications in multi-user communication,”
Probab. Th. Rel. Fields, vol. 34, no. 2, pp. 157–177, 1976.
[6] R. Ahlswede and G. Dueck, “Every bad code has a good subcode: a local converse to the coding theorem,” Probab. Th.
Rel. Fields, vol. 34, no. 2, pp. 179–182, 1976.
[7] K. Marton, “A simple proof of the blowing-up lemma,” IEEE Trans. Inf. Theory, vol. 32, no. 3, pp. 445–446, 1986.
[8] Y. Polyanskiy and S. Verdú, “Relative entropy at the channel output of a capacity-achieving code,” in Proc. 2011 49th
Allerton Conference, Allerton Retreat Center, Monticello, IL, USA, Oct. 2011.
[9] F. Topsøe, “An information theoretical identity and a problem involving capacity,” Studia Sci. Math. Hungar., vol. 2, pp.
291–292, 1967.
[10] J. H. B. Kemperman, “On the Shannon capacity of an arbitrary channel,” Indagationes Math., vol. 77, no. 2, pp. 101–115,
1974.
[11] J. Wolfowitz, “The coding of messages subject to chance errors,” Illinois J. Math., vol. 1, pp. 591–606, 1957.
[12] U. Augustin, “Gedächtnisfreie kanäle für diskrete zeit,” Z. Wahrscheinlichkeitstheorie und Verw. Geb., vol. 6, pp. 10–61,
1966.
[13] R. Ahlswede, “An elementary proof of the strong converse theorem for the multiple-access channel,” J. Comb. Inform.
Syst. Sci, vol. 7, no. 3, pp. 216–230, 1982.
December 21, 2012
DRAFT
63
63
[14] Y. Polyanskiy, “Channel coding: non-asymptotic fundamental limits,” Ph.D. dissertation, Princeton Univ., Princeton, NJ,
USA, 2010, available: http://people.lids.mit.edu/yp/homepage/.
[15] H. V. Poor and S. Verdú, “A lower bound on the error probability in multihypothesis testing,” IEEE Trans. Inf. Theory,
vol. 41, no. 6, pp. 1992–1993, 1995.
[16] M. V. Burnashev, “Data transmission over a discrete channel with feedback. random transmission time,” Prob. Peredachi
Inform., vol. 12, no. 4, pp. 10–30, 1976.
[17] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. Cambridge University Press, 2011.
[18] S. Bobkov and F. Götze, “Discrete isoperimetric and Poincaré-type inequalities,” Probab. Theory Relat. Fields, vol. 114,
pp. 245–277, 1999.
[19] M. Ledoux, “Concentration of measure and logarithmic Sobolev inequalities,” Seminaire de probabilites XXXIII, pp. 120–
216, 1999.
[20] J. Wolfowitz, Coding Theorems of Information Theory. Englewood Cliffs, NJ: Prentice-Hall, 1962.
[21] I. Csiszár, “Information-type measures of difference of probability distributions and indirect observation,” Studia Sci. Math.
Hungar., vol. 2, pp. 229–318, 1967.
[22] ——, “I -divergence geometry of probability distributions and minimization problems,” Ann. Probab., vol. 3, no. 1, pp.
146–158, Feb. 1975.
[23] M. Ledoux, “Isoperimetry and Gaussian analysis,” Lecture Notes in Math., vol. 1648, pp. 165–294, 1996.
[24] L. Harper, “Optimal numberings and isoperimetric problems on graphs,” J. Comb. Th., vol. 1, pp. 385–394, 1966.
[25] M. Talagrand, “An isoperimetric theorem on the cube and the Khintchine-Kahane inequalities,” Proc. Amer. Math. Soc.,
vol. 104, pp. 905–909, 1988.
[26] M. Donsker and S. Varadhan, “Asymptotic evaluation of certain Markov process expectations for large time. I. II.” Comm. Pure Appl. Math., vol. 28, no. 1, pp. 1–47, 1975.
[27] K. Marton, “Bounding d̄-distance by information divergence: a method to prove measure concentration,” Ann. Probab.,
vol. 24, pp. 857–866, 1990.
[28] V. Strassen, “The existence of probability measures with given marginals,” Ann. Math. Stat., vol. 36, pp. 423–439, 1965.
[29] C. Villani, Topics in Optimal Transportation. Providence, RI: American Mathematical Society, 2003, vol. 58.
[30] S. Bobkov and F. Götze, “Exponential integrability and transportation cost related to logarithmic Sobolev inequalities,” J.
Functional Analysis, vol. 163, pp. 1–28, 1999.
[31] M. Talagrand, “Transportation cost for Gaussian and other product measures,” Geom. Funct. Anal., vol. 6, no. 3, pp.
587–600, 1996.
[32] I. Gelfand, A. Kolmogorov, and A. Yaglom, “On the general definition of the amount of information (in Russian),” Dokl. Akad. Nauk SSSR, vol. 111, no. 4, pp. 745–748, 1956.
[33] S. Bobkov and M. Madiman, “Concentration of the information in data with log-concave distributions,” Ann. Probab.,
vol. 39, no. 4, pp. 1528–1543, 2011.
[34] I. J. Schoenberg, “On a theorem of Kirzbraun and Valentine,” Am. Math. Monthly, vol. 60, no. 9, pp. 620–622, Nov. 1953.
[35] V. Sudakov and B. Tsirelson, “Extremal properties of half-spaces for spherically invariant measures,” Zap. Nauch. Sem.
LOMI, vol. 41, pp. 14–24, 1974.