Empirical distribution of good channel codes with non-vanishing error probability

Yury Polyanskiy and Sergio Verdú
Abstract
This paper studies several properties of channel codes that approach the fundamental limits of a given memoryless channel with a non-vanishing probability of error. The output distribution induced by an ǫ-capacity-achieving code is shown to be close in a strong sense to the capacity-achieving output distribution (for the DMC and the AWGN channel). Relying on the concentration-of-measure (isoperimetry) property enjoyed by the latter, it is shown that regular (Lipschitz) functions of channel outputs can be precisely estimated and turn out to be essentially non-random and independent of the code used. It is also shown that binary hypothesis testing between the output distribution of a good code and the capacity-achieving one cannot be performed with exponential reliability. Using related methods it is shown that quadratic forms and sums of q-th powers, when evaluated at codewords of good AWGN codes, approach the values obtained from a randomly generated Gaussian codeword. The random process produced at the output of the channel is shown to satisfy the asymptotic equipartition property (for the DMC and the AWGN channel).
Index Terms
Shannon theory, discrete memoryless channels, additive white Gaussian noise, information measures,
relative entropy, empirical output statistics, asymptotic equipartition property, concentration of measure,
optimal transportation
Y. Polyanskiy is with the Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA 02139 USA, e-mail: yp@mit.edu. S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA, e-mail: verdu@princeton.edu.
The work was supported in part by the National Science Foundation (NSF) under Grant CCF-1016625 and by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under Grant CCF-0939370. Parts of this work were presented at the 49th and 50th Allerton Conferences on Communication, Control, and Computing, Allerton Retreat Center, IL, 2011-2012.
I. INTRODUCTION
A reliable channel code is a collection of waveforms of fixed duration distinguishable with small probability of error when observed through a noisy channel. Such a code is optimal (extremal) if it possesses the maximal cardinality among all codebooks of equal duration and probability of error. In this paper we characterize several properties of optimal and close-to-optimal channel codes indirectly, i.e., without identifying the best code explicitly. This characterization provides theoretical insight and ultimately may facilitate the search for new good codes.
Shannon [1] was the first to recognize that to maximize information transfer across a memoryless channel the input (output) waveforms must be "noise-like", i.e., resemble a typical sample of a memoryless random process with marginal distribution $P^*_X$ ($P^*_Y$), a capacity-achieving input (output) distribution. The most general formal statement of this property of optimal codes has been put forward in [2], which showed that a capacity-achieving sequence of codes with vanishing probability of error must satisfy [2, Theorem 2]
\frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 ,   (1)
where $P_{Y^n}$ denotes the output distribution induced by the codebook (assuming equiprobable codewords) and
P^*_{Y^n} = P^*_Y \times \cdots \times P^*_Y   (2)
is the $n$-th power of the single-letter capacity-achieving output distribution $P^*_Y$, and $D(\cdot\|\cdot)$ is the relative entropy. Furthermore, [2] shows that (under regularity conditions) the empirical frequency of input letters (or sequential $k$-letter blocks) inside the codebook approaches the capacity-achieving input distribution (or its $k$-th power) in the sense of vanishing relative entropy. Note that (1) establishes a strong sense in which $P_{Y^n}$ is close to the memoryless $P^*_{Y^n}$.
In this paper we extend the result in [2] to the case of non-vanishing probability of error.
Studying this regime as opposed to vanishing probability of error has recently proved to be
fruitful for the non-asymptotic characterization of the maximal achievable rate [3]. Although
for the memoryless channels considered in this paper the ǫ-capacity Cǫ is independent of the
probability of error ǫ, it does not immediately follow that a Cǫ -achieving code necessarily satisfies
the empirical distribution property (1). In fact, somewhat surprisingly, we will show that (1) only
holds under the maximal probability of error criterion.
To illustrate the delicacy of the question of approximating $P_{Y^n}$ with $P^*_{Y^n}$, consider a good, capacity-achieving $k$-to-$n$ code for the binary symmetric channel (BSC) with crossover probability $\delta$ and capacity $C$. First, consider the set of all codewords; its probability under $P_{Y^n}$ is (approximately) $(1-\delta)^n$, while under $P^*_{Y^n}$ it is $2^{k-n}$ – the two being exponentially very different (since for a reliable code $k < n \log 2(1-\delta)$). On the other hand, consider a set $E$ consisting of a union of small Hamming balls surrounding each codeword, whose radius $\approx \delta n$ is chosen such that $P_{Y^n}[E] = \frac{1}{2}$, say. Assuming that the code is decodable with small probability of error, the union will be almost disjoint and hence $P^*_{Y^n}[E] \approx 2^{k-nC}$ – the two becoming exponentially comparable (provided $k \approx nC$). Thus, although $P_{Y^n}$ and $P^*_{Y^n}$ differ dramatically in "small" details, on a "larger" scale they look the same. One of the goals of this paper is to make this intuition precise. We will show that the closer the code approaches the fundamental limits, the better $P_{Y^n}$ approximates $P^*_{Y^n}$.
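To make the two scales concrete, the following short numerical sketch (an illustration only, not part of the paper's argument; the rate and crossover probability are arbitrary choices, assuming Python with the standard library) evaluates the exponents discussed above for a hypothetical rate-0.49 code on BSC(0.11):

    import math

    def h2(p):  # binary entropy in bits
        return -p*math.log2(p) - (1-p)*math.log2(1-p)

    delta, n = 0.11, 1000
    C = 1 - h2(delta)             # capacity of BSC(delta), about 0.5 bits
    k = int(0.49 * n)             # hypothetical reliable code: rate 0.49 < log2(2(1-delta))

    # Probability of the codeword set itself under the two measures (log2 scale)
    log2_P_code     = n * math.log2(1 - delta)   # under P_{Y^n}: ~ (1-delta)^n
    log2_Pstar_code = k - n                      # under P*_{Y^n}: 2^{k-n}
    print(log2_P_code, log2_Pstar_code)          # exponentially different exponents

    # Union of Hamming balls of radius ~ delta*n around the 2^k codewords:
    # each ball has P*-mass ~ 2^{n h(delta) - n}, so the union has ~ 2^{k - nC}
    log2_Pstar_balls = k - n * C
    print(log2_Pstar_balls)                      # close to log2(1/2) only when k ~ nC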
Studying the output distribution $P_{Y^n}$ also becomes important in the context of secure communication, where the output due to the code is required to resemble white noise, and in the problem of asynchronous communication, where the output statistics of the code impose limits on the quality of synchronization [4]. For example, in a multi-terminal communication problem, the channel output of one user may create interference for another. Assessing the average impairment caused by such interference involves the analysis of the expectation of a certain function of the channel output, $E[F(Y^n)]$. We show that under certain regularity assumptions on $F$, not only can one approximate the expectation of $F$ by substituting the unknown $P_{Y^n}$ with $P^*_{Y^n}$, as in
\int F(y^n)\, dP_{Y^n} \approx \int F(y^n)\, dP^*_{Y^n} ,   (3)
but one can also prove that in fact the distribution of F (Y n ) will be tightly concentrated around
its expectation. Thus, we are able to predict with overwhelming probability the random value
of F (Y n ) without any knowledge of the code used to produce Y n (but assuming the code is
ǫ-capacity-achieving).
Besides (1) and (3) we will show that
1) the hypothesis testing problem between $P_{Y^n}$ and $P^*_{Y^n}$ has zero Stein's exponent;
2) a convenient inequality holds for the conditional relative entropy of the channel output in terms of the cardinality of the employed code;
3) codewords of good codes for the additive white Gaussian noise (AWGN) channel become more and more isotropically distributed (in the sense of evaluating quadratic forms) and resemble white Gaussian noise (in the sense of $\ell_q$ norms) as the code approaches the fundamental limits;
4) the output process $Y^n$ enjoys an asymptotic equipartition property.
Throughout the paper we will observe repeated connections with concentration of measure (isoperimetry) and optimal transportation, which were introduced into information theory by the seminal works [5]–[7]. Although some key results are stated for general channels, most of the discussion is specialized to discrete memoryless channels (DMCs) (possibly with a (separable) input cost constraint) and to the AWGN channel.
The organization of the paper is as follows. Section II contains the main definitions and notation. Section III proves a sharp upper bound on the relative entropy $D(P_{Y^n} \| P^*_{Y^n})$. In Section IV we discuss various implications of the bounds on relative entropy and in particular prove approximation (3). Section V considers the hypothesis testing problem of discriminating between $P_{Y^n}$ and $P^*_{Y^n}$. The asymptotic equipartition property of the channel output process is established in Section VI. Section VII discusses results for quadratic forms and $\ell_q$ norms of the codewords of good Gaussian codes. Finally, Section VIII outlines extensions to other channels.
II. DEFINITIONS AND NOTATION
A random transformation $P_{Y|X} : \mathcal{X} \to \mathcal{Y}$ is a Markov kernel acting between a pair of measurable spaces. An $(M, \epsilon)_{\rm avg}$ code for the random transformation $P_{Y|X}$ is a pair of random transformations $f : \{1, \ldots, M\} \to \mathcal{X}$ and $g : \mathcal{Y} \to \{1, \ldots, M\}$ such that
P[\hat{W} \neq W] \le \epsilon ,   (4)
where in the underlying probability space $X = f(W)$ and $\hat{W} = g(Y)$ with $W$ equiprobable on $\{1, \ldots, M\}$, and $W, X, Y, \hat{W}$ forming a Markov chain:
W \xrightarrow{\ f\ } X \xrightarrow{\ P_{Y|X}\ } Y \xrightarrow{\ g\ } \hat{W} .   (5)
In particular, we say that $P_X$ (resp., $P_Y$) is the input (resp., output) distribution induced by the encoder $f$. An $(M, \epsilon)_{\rm max}$ code is defined similarly except that (4) is replaced with the more stringent maximal probability of error criterion:
\max_{1 \le j \le M} P[\hat{W} \neq W \mid W = j] \le \epsilon .   (6)
A code is called deterministic, denoted $(M, \epsilon)_{\rm det}$, if the encoder $f$ is a functional (non-random) mapping.
A channel is a sequence of random transformations $\{P_{Y^n|X^n}, n = 1, \ldots\}$ indexed by the parameter $n$, referred to as the blocklength. An $(M, \epsilon)$ code for the $n$-th random transformation is called an $(n, M, \epsilon)$ code. The non-asymptotic fundamental limit of communication is defined as¹
M^*(n, \epsilon) = \max\{M : \exists\, (n, M, \epsilon)\text{-code}\} .   (7)
To the three types of channels considered below we also associate a special sequence of output distributions $P^*_{Y^n}$, defined as the $n$-th power of a certain single-letter $P^*_Y$:²
P^*_{Y^n} \triangleq (P^*_Y)^n ,   (8)
where $P^*_Y$ is defined as follows:
1) A DMC (without feedback) is built from a single-letter transformation $P_{Y|X} : \mathcal{X} \to \mathcal{Y}$ acting between finite spaces by extending the latter to all $n \ge 1$ in a memoryless way. Namely, the input space of the $n$-th random transformation $P_{Y^n|X^n}$ is given by³
\mathcal{X}_n = \mathcal{X}^n = \underbrace{\mathcal{X} \times \cdots \times \mathcal{X}}_{n \text{ times}}   (9)
and similarly for the output space $\mathcal{Y}^n = \mathcal{Y} \times \cdots \times \mathcal{Y}$, while the transition kernel is set to be
P_{Y^n|X^n}(y^n|x^n) = \prod_{j=1}^{n} P_{Y|X}(y_j|x_j) .   (10)
The capacity $C$ and $P^*_Y$, the unique capacity-achieving output distribution (caod), are found by solving
C = \max_{P_X} I(X; Y) .   (11)
2) A DMC with input constraint $(c, P)$ is a generalization of the previous construction with an additional restriction on the input space $\mathcal{X}_n$:
\mathcal{X}_n = \left\{ x^n \in \mathcal{X}^n : \sum_{j=1}^{n} c(x_j) \le nP \right\} .   (12)
¹ Additionally, one should also specify which probability of error criterion, (4) or (6), is used.
² For general channels, the sequence $\{P^*_{Y^n}\}$ is required to satisfy a quasi-caod property, see [8, Section IV].
³ To unify notation we denote the input space as $\mathcal{X}_n$ (instead of the more natural $\mathcal{X}^n$) even in the absence of cost constraints.
In this case the capacity $C$ and the caod $P^*_Y$ are found from a restricted maximization
C = \max_{P_X} I(X; Y) ,   (13)
where $P_X$ is any distribution such that
E[c(X)] \le P .   (14)
3) The $AWGN(P)$ channel has an input space⁴
\mathcal{X}_n = \left\{ x \in \mathbb{R}^n : \|x\|_2 \le \sqrt{nP} \right\} ,   (15)
the output space $\mathcal{Y}^n = \mathbb{R}^n$ and the transition kernel
P_{Y^n|X^n=x} = \mathcal{N}(x, I_n) ,   (16)
where $\mathcal{N}(x, \Sigma)$ denotes a (multidimensional) normal distribution with mean $x$ and covariance matrix $\Sigma$, and $I_n$ is the $n \times n$ identity matrix. Then⁵
C = \frac{1}{2} \log(1 + P) ,   (17)
P^*_Y = \mathcal{N}(0, 1 + P) .   (18)
As shown in [9], [10], in all three cases $P^*_Y$ is unique and $P^*_{Y^n}$ satisfies the key property
D(P_{Y^n|X^n=x} \| P^*_{Y^n}) \le nC   (19)
for all $x \in \mathcal{X}_n$. Property (19) implies that for every input distribution $P_{X^n}$ the induced output distribution $P_{Y^n}$ satisfies
P_{Y^n} \ll P^*_{Y^n} ,   (20)
P_{Y^n|X^n=x^n} \ll P^*_{Y^n} , \qquad \forall x^n \in \mathcal{X}_n ,   (21)
D(P_{Y^n} \| P^*_{Y^n}) \le nC - I(X^n; Y^n) .   (22)
As a consequence of (21) the information density is well defined:
\imath^*_{X^n;Y^n}(x^n; y^n) \triangleq \log \frac{dP_{Y^n|X^n=x^n}}{dP^*_{Y^n}}(y^n) .   (23)
⁴ For convenience we denote the elements of $\mathbb{R}^n$ as $x, y$ (for non-random vectors) and $X^n, Y^n$ (for the random vectors).
⁵ As usual, all logarithms $\log$ and exponents $\exp$ are taken to an arbitrary fixed base, which also specifies the information units.
Moreover, for every channel considered here there is a constant $a_1 > 0$ such that⁶
\sup_{x^n \in \mathcal{X}_n} \mathrm{Var}\left[ \imath^*_{X^n;Y^n}(X^n; Y^n) \mid X^n = x^n \right] \le n a_1 .   (24)
⁶ For discrete channels (24) is shown, e.g., in [3, Appendix E].
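For the DMC the quantities introduced above are readily computable. The following sketch (illustrative only, assuming NumPy; the channel matrix is an arbitrary choice, not from the paper) computes $C$ and the caod $P^*_Y$ of (11) by the Blahut–Arimoto algorithm and then checks the single-letter form of the key property (19), $D(P_{Y|X=x} \| P^*_Y) \le C$:

    import numpy as np

    def blahut_arimoto(W, iters=2000):
        """W[x, y] = P_{Y|X}(y|x). Returns (capacity in bits, caod P*_Y)."""
        m = W.shape[0]
        p = np.full(m, 1.0 / m)                 # input distribution iterate
        for _ in range(iters):
            q = p @ W                           # induced output distribution
            d = np.sum(W * np.log2(W / q), axis=1)   # D(P_{Y|X=x} || q) per x
            p = p * np.exp2(d)
            p /= p.sum()
        q = p @ W
        C = float(p @ np.sum(W * np.log2(W / q), axis=1))
        return C, q

    # an arbitrary 3-input, 3-output DMC with a strictly positive matrix
    W = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
    C, q_star = blahut_arimoto(W)
    div = np.sum(W * np.log2(W / q_star), axis=1)   # D(P_{Y|X=x} || P*_Y) for each x
    print(C, q_star)
    print(div)    # every entry is <= C, with equality on the support of the optimal input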
In all three cases, the $\epsilon$-capacity $C_\epsilon$ equals $C$ for all $0 < \epsilon < 1$, i.e.,
\log M^*(n, \epsilon) = nC + o(n) , \qquad n \to \infty .   (25)
In fact, see [3]:⁷
\log M^*(n, \epsilon) = nC - \sqrt{nV}\, Q^{-1}(\epsilon) + O(\log n) , \qquad n \to \infty ,   (27)
for any $0 < \epsilon < \frac{1}{2}$ and a certain constant $V \ge 0$, the channel dispersion.
⁷ $Q^{-1}$ is the inverse of the standard Q-function:
Q(x) = \int_x^\infty \frac{e^{-y^2/2}}{\sqrt{2\pi}}\, dy .   (26)
In this paper we consider the following increasing degrees of optimality for sequences of $(n, M_n, \epsilon)$ codes:
1) A code sequence is called $\epsilon$-capacity-achieving, or $o(n)$-achieving, if
\frac{1}{n} \log M_n \to C .   (28)
2) A code sequence is called $O(\sqrt{n})$-achieving if
\log M_n = nC + O(\sqrt{n}) .   (29)
3) A code sequence is called capacity-dispersion achieving, or $o(\sqrt{n})$-achieving, if
\log M_n = nC - \sqrt{nV}\, Q^{-1}(\epsilon) + o(\sqrt{n}) .   (30)
4) A code sequence is called $O(\log n)$-achieving if
\log M_n = nC - \sqrt{nV}\, Q^{-1}(\epsilon) + O(\log n) .   (31)
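As a numerical illustration of these benchmarks (a sketch with assumptions, not taken from the paper: for the BSC($\delta$) the dispersion is the standard expression $V = \delta(1-\delta)\log_2^2\frac{1-\delta}{\delta}$ bits², and only the standard library is assumed), the two leading terms of (27) can be evaluated as follows:

    import math
    from statistics import NormalDist

    def h2(p):
        return -p*math.log2(p) - (1-p)*math.log2(1-p)

    def normal_approx_logM(n, eps, delta):
        """First two terms of (27) for BSC(delta), in bits: nC - sqrt(nV) Q^{-1}(eps)."""
        C = 1 - h2(delta)
        V = delta*(1-delta) * (math.log2((1-delta)/delta))**2   # BSC dispersion
        Qinv = NormalDist().inv_cdf(1 - eps)                    # Q^{-1}(eps)
        return n*C - math.sqrt(n*V)*Qinv

    for n in (200, 1000, 5000):
        print(n, normal_approx_logM(n, eps=1e-3, delta=0.11))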
We also need to introduce the performance of an optimal binary hypothesis test, which is one of the main tools in [3]. Consider an $\mathcal{A}$-valued random variable $B$ which can take probability measures $P$ or $Q$. A randomized test between those two distributions is defined by a random transformation $P_{Z|B} : \mathcal{A} \to \{0, 1\}$, where $0$ indicates that the test chooses $Q$. The best performance achievable among those randomized tests is given by⁸
\beta_\alpha(P, Q) = \min \sum_{a \in \mathcal{A}} Q(a) P_{Z|B}(1|a) ,   (32)
where the minimum is over all probability distributions $P_{Z|B}$ satisfying
\sum_{a \in \mathcal{A}} P(a) P_{Z|B}(1|a) \ge \alpha .   (33)
The minimum in (32) is guaranteed to be achieved by the Neyman-Pearson lemma. Thus, $\beta_\alpha(P, Q)$ gives the minimum probability of error under hypothesis $Q$ if the probability of error under hypothesis $P$ is no larger than $1 - \alpha$.
⁸ We sometimes write summations over alphabets for simplicity of exposition. For arbitrary measurable spaces $\beta_\alpha(P, Q)$ is defined by replacing the summation in (32) by an expectation.
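Since $\beta_\alpha$ is used throughout, the following sketch (an illustration only, assuming NumPy; the distributions are arbitrary) computes it for finite alphabets directly from the Neyman-Pearson lemma, by including outcomes in decreasing order of likelihood ratio and randomizing at the boundary:

    import numpy as np

    def beta_alpha(alpha, P, Q):
        """Neyman-Pearson beta_alpha(P, Q) for distributions on a finite alphabet."""
        P, Q = np.asarray(P, float), np.asarray(Q, float)
        # sort outcomes by decreasing likelihood ratio P/Q (Q = 0 outcomes first)
        lr = np.where(Q > 0, P / np.where(Q > 0, Q, 1), np.inf)
        order = np.argsort(-lr)
        beta, mass = 0.0, 0.0
        for a in order:
            if mass + P[a] < alpha:
                mass += P[a]
                beta += Q[a]
            else:
                # randomize on the boundary outcome to hit P-mass exactly alpha
                frac = 0.0 if P[a] == 0 else (alpha - mass) / P[a]
                beta += frac * Q[a]
                return beta
        return beta

    P = [0.5, 0.3, 0.2]
    Q = [0.2, 0.3, 0.5]
    print(beta_alpha(0.9, P, Q))   # minimal Q-probability of deciding "P" given P-power >= 0.9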
III. UPPER BOUND ON THE OUTPUT RELATIVE ENTROPY
The main goal of this section is to establish (for each of the three types of channels introduced)
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M_n + o(n) ,   (34)
where $P_{Y^n}$ is the sequence of output distributions induced by a sequence of $(n, M_n, \epsilon)$ codes. In fact we show that $o(n) = O(\sqrt{n})$ for all channels except DMCs with zeros in the transition matrix $P_{Y|X}$.
We start by showing an inequality tying the relative entropy of the output distribution to the size of the code (Section III-A); then we prove (34) for the case of the DMC (Sections III-B and III-C) and the AWGN channel (Section III-D).
A. Augustin's strong converse
The following result is a generalization of the strong converse of Wolfowitz [11] to the case of a variable threshold. It first appeared as part of the proofs in [12, Satz 7.3 and 8.2] by Augustin and was formally stated in [13, Section 2].
Theorem 1 ([12], [13]): Consider a random transformation $P_{Y|X}$, a distribution $P_X$ induced by an $(M, \epsilon)_{\rm max,det}$ code, an auxiliary output distribution $Q_Y$ and a function $\rho : \mathcal{X} \to \mathbb{R}$. Then, provided the denominator is positive,
M \le \frac{\exp\{E[\rho(X)]\}}{\inf_x P_{Y|X=x}\left[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \le \rho(x) \right] - \epsilon} ,   (35)
with the infimum taken over the support of $P_X$.
Proof: Fix a code, $Q_Y$, and $\rho$. Denoting by $c_i$ the $i$-th codeword, we have
Q_Y[\hat{W}(Y) = i] \ge \beta_{1-\epsilon}(P_{Y|X=c_i}, Q_Y) , \qquad i = 1, \ldots, M ,   (36)
since $\{\hat{W}(Y) = i\}$ is a suboptimal test to decide between $P_{Y|X=c_i}$ and $Q_Y$, which achieves error probability no larger than $\epsilon$ when $P_{Y|X=c_i}$ is true. Therefore,
\frac{1}{M} \ge \frac{1}{M} \sum_{i=1}^{M} \beta_{1-\epsilon}(P_{Y|X=c_i}, Q_Y)   (37)
\ge \frac{1}{M} \sum_{i=1}^{M} \left( P_{Y|X=c_i}\left[ \log \frac{dP_{Y|X=c_i}}{dQ_Y}(Y) \le \rho(c_i) \right] - \epsilon \right) \exp\{-\rho(c_i)\}   (38)
\ge \left( \inf_x P_{Y|X=x}\left[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \le \rho(x) \right] - \epsilon \right) \frac{1}{M} \sum_{i=1}^{M} \exp\{-\rho(c_i)\}   (39)
\ge \left( \inf_x P_{Y|X=x}\left[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \le \rho(x) \right] - \epsilon \right) \exp\{-E[\rho(X)]\} ,   (40)
where (37) is by taking the arithmetic average of (36) over $i$, (40) is by Jensen's inequality, and (38) is by the standard estimate of $\beta_\alpha$, e.g. [3, (102)],
\beta_{1-\epsilon}(P, Q) \ge \left( P\left[ \log \frac{dP}{dQ}(Z) \le \rho \right] - \epsilon \right) \exp\{-\rho\} ,   (41)
with $Z$ distributed according to $P$.
Remark 1: Unlike many other converse bounds, see [14, Section 2.7.3], we were unable to reduce the proof of Theorem 1 to a binary hypothesis testing argument.
Remark 2: Following the idea of Poor and Verdú [15] we may further strengthen Theorem 1 in the special case of $Q_Y = P_Y$: the maximal probability of error $\epsilon$ for any test of $M$ hypotheses $\{P_j, j = 1, \ldots, M\}$ satisfies
\epsilon \ge \left( 1 - \frac{\exp\{\bar{\rho}\}}{M} \right) \inf_{1 \le j \le M} P[\imath_{W;Y}(W; Y) \le \rho_j \mid W = j] ,   (42)
where $\rho_j \in \mathbb{R}$ are arbitrary, $\bar{\rho} = \frac{1}{M} \sum_{j=1}^{M} \rho_j$, $W$ is equiprobable on $\{1, \ldots, M\}$ and $P_{Y|W=j} = P_j$. Indeed, since $\imath_{W;Y}(a; b) \le \log M$ we get from [14, Lemma 35]
\exp\{\rho_j\} \beta_{1-\epsilon}(P_{Y|W=j}, P_Y) + \left( 1 - \frac{\exp\{\rho_j\}}{M} \right) P[\imath_{W;Y}(W; Y) > \rho_j \mid W = j] \ge 1 - \epsilon .   (43)
Multiplying by $\exp\{-\rho_j\}$ and using the resulting bound in place of (41), we repeat steps (37)-(40) to obtain
\frac{1}{M} \inf_{1 \le j \le M} P[\imath_{W;Y}(W; Y) \le \rho_j \mid W = j] \ge \left( \inf_{1 \le j \le M} P[\imath_{W;Y}(W; Y) \le \rho_j \mid W = j] - \epsilon \right) \exp\{-\bar{\rho}\} ,   (44)
which in turn is equivalent to (42).
Choosing $\rho(x) = D(P_{Y|X=x} \| Q_Y) + \Delta$ we can specialize Theorem 1 in the following convenient form:
Theorem 2: Consider a random transformation $P_{Y|X}$, a distribution $P_X$ induced by an $(M, \epsilon)_{\rm max,det}$ code and an auxiliary output distribution $Q_Y$. Assume that for all $x \in \mathcal{X}$ we have
d(x) \triangleq D(P_{Y|X=x} \| Q_Y) < \infty   (45)
and
\sup_x P_{Y|X=x}\left[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \ge d(x) + \Delta \right] \le \delta' ,   (46)
for some pair of constants $\Delta \ge 0$ and $0 \le \delta' < 1 - \epsilon$. Then we have
D(P_{Y|X} \| Q_Y | P_X) \ge \log M - \Delta + \log(1 - \epsilon - \delta') .   (47)
Remark 3: Note that (46) holding with a small $\delta'$ is a natural non-asymptotic embodiment of information stability of the underlying channel, cf. [8, Section IV].
One way to estimate the upper deviations in (46) is by using Chebyshev's inequality. As an example, we obtain
Corollary 3: If in the conditions of Theorem 2 we replace (46) with
\sup_x \mathrm{Var}\left[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \,\Big|\, X = x \right] \le S_m   (48)
for some constant $S_m \ge 0$, then we have
D(P_{Y|X} \| Q_Y | P_X) \ge \log M - \sqrt{\frac{2 S_m}{1-\epsilon}} + \log \frac{1-\epsilon}{2} .   (49)
Notice that when $Q_Y$ is chosen to be a product distribution, such as $P^*_{Y^n}$, $\log \frac{dP_{Y|X=x}}{dQ_Y}$ becomes a sum of independent random variables. In particular, (24) leads to an equivalent restatement of (1):
Theorem 4: Consider a memoryless channel belonging to one of the three classes in Section II. Then for any $0 < \epsilon < 1$ and any sequence of $(n, M_n, \epsilon)_{\rm max,det}$ capacity-achieving codes we have
\frac{1}{n} I(X^n; Y^n) \to C \iff \frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 ,   (50)
where $X^n$ is the output of the encoder.
Proof: The direction $\Rightarrow$ is trivial from property (22) of $P^*_{Y^n}$. For the direction $\Leftarrow$ we only need to lower-bound $I(X^n; Y^n)$, since $I(X^n; Y^n) \le nC$. To that end, we have from (24) and Corollary 3:
D(P_{Y^n|X^n} \| P^*_{Y^n} | P_{X^n}) \ge \log M_n + O(\sqrt{n}) .   (51)
Then the conclusion follows from (28) and the following identity applied with $Q_{Y^n} = P^*_{Y^n}$:
I(X^n; Y^n) = D(P_{Y^n|X^n} \| Q_{Y^n} | P_{X^n}) - D(P_{Y^n} \| Q_{Y^n}) ,   (52)
which holds for all $Q_{Y^n}$ such that the unconditional relative entropy is finite.
We remark that Theorem 4 can also be derived from a simple extension of the Wolfowitz
converse [11] to an arbitrary output distribution QY n , e.g. [14, Theorem 10], and then choosing
QY n = P Y∗ n . Note that Theorem 4 implies the result in [2] since the capacity-achieving codes
with vanishing error probability are a subclass of those considered in Theorem 4.
B. DMC with $C_1 < \infty$
For a given DMC denote the parameter introduced by Burnashev [16]
C_1 = \max_{a, a'} D(P_{Y|X=a} \| P_{Y|X=a'}) .   (53)
Note that $C_1 < \infty$ if and only if the transition matrix does not contain any zeros. In this section we show (34) for a (regular) class of DMCs with $C_1 < \infty$ by an application of the main inequality (47). We also demonstrate that (1) may not hold for codes with non-deterministic encoders or unconstrained maximal probability of error.
Theorem 5: Consider a DMC $P_{Y|X}$ with capacity $C > 0$ (with or without an input constraint). Then for any $0 < \epsilon < 1$ there exists a constant $a = a(\epsilon) > 0$ such that any $(n, M_n, \epsilon)_{\rm max,det}$ code satisfies
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M_n + a\sqrt{n} ,   (54)
where $P_{Y^n}$ is the output distribution induced by the code. In particular, for any capacity-achieving sequence of such codes we have
\frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 .   (55)
Remark 4: If $P^*_Y$ is equiprobable on $\mathcal{Y}$ (such as for some symmetric channels), (55) is equivalent to
H(Y^n) = n \log |\mathcal{Y}| + o(n) .   (56)
In any case, as we will see in Section IV-A, (55) always implies
H(Y^n) = n H(Y^*) + o(n)   (57)
(by (133) applied to $f(y) = \log P^*_Y(y)$). Note also that traditional combinatorial methods, e.g. [17], are not helpful in dealing with quantities like $H(Y^n)$, $D(P_{Y^n} \| P^*_{Y^n})$ or $P_{Y^n}$-expectations of functions which are not in the form of a cumulative average.
Proof: Fix $y^n, \bar{y}^n \in \mathcal{Y}^n$ which differ in the $j$-th letter only. Then, denoting $y_{\backslash j} = \{y_k, k \neq j\}$, we have
|\log P_{Y^n}(y^n) - \log P_{Y^n}(\bar{y}^n)| = \left| \log \frac{P_{Y_j|Y_{\backslash j}}(y_j|y_{\backslash j})}{P_{Y_j|Y_{\backslash j}}(\bar{y}_j|y_{\backslash j})} \right|   (58)
\le \max_{a, b, b'} \left| \log \frac{P_{Y|X}(b|a)}{P_{Y|X}(b'|a)} \right|   (59)
\triangleq a_1 < \infty ,   (60)
where (59) follows from
P_{Y_j|Y_{\backslash j}}(b|y_{\backslash j}) = \sum_{a \in \mathcal{X}} P_{Y|X}(b|a) P_{X_j|Y_{\backslash j}}(a|y_{\backslash j}) .   (61)
Thus, the function $y^n \mapsto \log P_{Y^n}(y^n)$ is $a_1$-Lipschitz in the Hamming metric on $\mathcal{Y}^n$. Its discrete gradient (see the definition of $D(f)$ in [18, Section 4]) is bounded by $n|a_1|^2$ and thus by the discrete Poincaré inequality [18, Theorem 4.1f] we have
\mathrm{Var}[\log P_{Y^n}(Y^n) \mid X^n = x^n] \le n |a_1|^2 .   (62)
Therefore, for some $0 < a_2 < \infty$ and all $x^n \in \mathcal{X}_n$ we have
\mathrm{Var}[\imath_{X^n;Y^n}(X^n; Y^n) \mid X^n = x^n] = \mathrm{Var}\left[ \log \frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \,\Big|\, X^n = x^n \right]   (63)
\le 2\, \mathrm{Var}[\log P_{Y^n|X^n}(Y^n|X^n) \mid X^n = x^n] + 2\, \mathrm{Var}[\log P_{Y^n}(Y^n) \mid X^n = x^n]   (64)
\le 2 n a_2 + 2 n |a_1|^2 ,   (65)
where (65) follows from (62) and the fact that the random variable in the first variance in (64) is a sum of $n$ independent terms. Applying Corollary 3 with $S_m = 2na_2 + 2n|a_1|^2$ and $Q_Y = P_{Y^n}$ we obtain
D(P_{Y^n|X^n} \| P_{Y^n} | P_{X^n}) \ge \log M_n + O(\sqrt{n}) .   (66)
We can now complete the proof:
D(P_{Y^n} \| P^*_{Y^n}) = D(P_{Y^n|X^n} \| P^*_{Y^n} | P_{X^n}) - D(P_{Y^n|X^n} \| P_{Y^n} | P_{X^n})   (67)
\le nC - D(P_{Y^n|X^n} \| P_{Y^n} | P_{X^n})   (68)
\le nC - \log M_n + O(\sqrt{n}) ,   (69)
where (68) is because $P^*_{Y^n}$ satisfies (19) and (69) follows from (66). This completes the proof of (54).
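The single-letter Lipschitz property (58)-(60) is easy to verify exhaustively for toy codes. The following sketch (an illustration only; the repetition code and the BSC crossover probability are arbitrary choices, assuming NumPy) checks that changing one output letter never changes $\log P_{Y^n}(y^n)$ by more than $a_1$:

    import itertools, math
    import numpy as np

    W = np.array([[0.89, 0.11],      # BSC(0.11): W[x, y]
                  [0.11, 0.89]])
    codebook = [(0, 0, 0), (1, 1, 1)]            # 3-bit repetition code, M = 2

    def P_out(y):                                 # P_{Y^n}(y^n), equiprobable codewords
        return np.mean([np.prod([W[x, b] for x, b in zip(c, y)]) for c in codebook])

    a1 = max(abs(math.log(W[a, b]) - math.log(W[a, bp]))
             for a in range(2) for b in range(2) for bp in range(2))

    worst = 0.0
    for y in itertools.product(range(2), repeat=3):
        for j in range(3):                        # flip coordinate j
            yb = list(y); yb[j] ^= 1
            worst = max(worst, abs(math.log(P_out(y)) - math.log(P_out(tuple(yb)))))
    print(worst, a1)                              # worst-case change never exceeds a1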
Remark 5: (55) need not hold if the maximal probability of error is replaced with the average or if the encoder is allowed to be random. Indeed, for any $0 < \epsilon < 1$ we construct a sequence of $(n, M_n, \epsilon)_{\rm avg}$ capacity-achieving codes which do not satisfy (55). Consider a sequence of $(n, M'_n, \epsilon'_n)_{\rm max,det}$ codes with $\epsilon'_n \to 0$ and
\frac{1}{n} \log M'_n \to C .   (70)
For all $n$ such that $\epsilon'_n < \frac{1}{2}$ this code cannot have repeated codewords, and we can additionally assume (perhaps by reducing $M'_n$ by one) that there is no codeword equal to $(x_0, \ldots, x_0) \in \mathcal{X}_n$, where $x_0$ is some fixed letter in $\mathcal{X}$ such that
D(P_{Y|X=x_0} \| P^*_Y) > 0   (71)
(the existence of such an $x_0$ relies on the assumption $C > 0$). Denote the output distribution induced by this code by $P'_{Y^n}$.
Next, extend this code by adding $\frac{\epsilon - \epsilon'_n}{1-\epsilon} M'_n$ identical codewords $(x_0, \ldots, x_0) \in \mathcal{X}_n$. Then the minimal average probability of error achievable with the extended codebook of size
M_n \triangleq \frac{1 - \epsilon'_n}{1 - \epsilon} M'_n   (72)
is easily seen to be not larger than $\epsilon$. Denote the output distribution induced by the extended code by $P_{Y^n}$ and define a binary random variable
S = 1\{X^n = (x_0, \ldots, x_0)\}   (73)
with distribution
P_S(1) = 1 - P_S(0) = \frac{\epsilon - \epsilon'_n}{1 - \epsilon'_n} .   (74)
We have then
D(P_{Y^n} \| P^*_{Y^n}) = D(P_{Y^n|S} \| P^*_{Y^n} | P_S) - I(S; Y^n)   (75)
\ge D(P_{Y^n|S} \| P^*_{Y^n} | P_S) - \log 2   (76)
= n D(P_{Y|X=x_0} \| P^*_Y) P_S(1) + D(P'_{Y^n} \| P^*_{Y^n}) P_S(0) - \log 2   (77)
= n D(P_{Y|X=x_0} \| P^*_Y) P_S(1) + o(n) ,   (78)
where (75) is by (52), (76) follows since $S$ is binary, (77) is by noticing that $P_{Y^n|S=0} = P'_{Y^n}$ (while $P_{Y^n|S=1} = (P_{Y|X=x_0})^n$), and (78) is by (55) applied to the original code. It is clear that (71) and (78) show the impossibility of (55) for this code.
Similarly, one shows that (55) cannot hold if the assumption of a deterministic encoder is dropped. Indeed, we can again take the very same $(n, M'_n, \epsilon'_n)$ code and make its encoder randomized so that with probability $\frac{\epsilon - \epsilon'_n}{1 - \epsilon'_n}$ it outputs $(x_0, \ldots, x_0) \in \mathcal{X}_n$ and otherwise it outputs the original codeword. The same analysis shows that (78) holds again and thus (55) fails.
Note that the counterexamples constructed above also demonstrate that in Theorem 2 (and hence Theorem 1) the assumptions of maximal probability of error and deterministic encoders are not superfluous, contrary to what is claimed by Ahlswede [13, Remark 1].
C. DMC with $C_1 = \infty$
Next, we show an estimate for $D(P_{Y^n} \| P^*_{Y^n})$ differing by a $\log^{3/2} n$ factor from (34) for DMCs with $C_1 = \infty$.
Theorem 6: For any DMC $P_{Y|X}$ with capacity $C > 0$ (with or without input constraints) and $0 < \epsilon < 1$ there exists a constant $b > 0$ with the property that for any sequence of $(n, M_n, \epsilon)_{\rm max,det}$ codes we have for all $n \ge 1$
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M_n + b \sqrt{n} \log^{3/2} n .   (79)
In particular, for any such sequence achieving capacity we have
\frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 .   (80)
Remark 6: Claim (80) is not true if either the maximal probability of error is replaced with the average, or if we allow the encoder to be stochastic. Counterexamples are constructed exactly as in Remark 5.
Proof: Let $c_i$ and $D_i$, $i = 1, \ldots, M_n$, denote the codewords and the decoding regions of the code. Denote the sequence
\ell_n = b_1 \sqrt{n \log n}   (81)
with $b_1 > 0$ to be further constrained shortly. According to the isoperimetric inequality for Hamming space [17, Corollary I.5.3], there is a constant $a > 0$ such that for every $i = 1, \ldots, M_n$
1 - P_{Y^n|X^n=c_i}[\Gamma^{\ell_n} D_i] \le Q\left( Q^{-1}(\epsilon) + \frac{\ell_n}{a\sqrt{n}} \right)   (82)
\le \exp\left\{ -\frac{b_2 \ell_n^2}{n} \right\}   (83)
= n^{-b_2}   (84)
\le \frac{1}{n} ,   (85)
where
\Gamma^{\ell} D = \{ b^n \in \mathcal{Y}^n : \exists\, y^n \in D \text{ s.t. } |\{j : y_j \neq b_j\}| \le \ell \}   (86)
denotes the $\ell$-th Hamming neighborhood of a set $D$, and we assumed that $b_1$ was chosen large enough so that there is $b_2 \ge 1$ satisfying (85).
Let
M'_n = \frac{M_n}{n \binom{n}{\ell_n} |\mathcal{Y}|^{\ell_n}}   (87)
and consider a subcode $F = (F_1, \ldots, F_{M'_n})$, i.e., an ordered list of $M'_n$ indices from $\{1, \ldots, M_n\}$. Then for every possible choice of $F$ we denote by $P_{X^n(F)}$ and $P_{Y^n(F)}$ the input/output distributions induced by the given subcode, so that for example
P_{Y^n(F)} = \frac{1}{M'_n} \sum_{j=1}^{M'_n} P_{Y^n|X^n = c_{F_j}} .   (88)
We aim to apply the random coding argument over all $M_n^{M'_n}$ equally likely choices of a subcode $F$. Random coding among subcodes was originally invoked in [6] to demonstrate the existence of a good subcode. One easily verifies that the expected induced output distribution is
E[P_{Y^n(F)}] \triangleq \frac{1}{M_n^{M'_n}} \sum_{F_1=1}^{M_n} \cdots \sum_{F_{M'_n}=1}^{M_n} P_{Y^n(F)}   (89)
= P_{Y^n} .   (90)
Next, for every $F$ we denote by $\epsilon'(F)$ the minimal possible average probability of error achieved by an appropriately chosen decoder. With this notation we have, for every possible value of $F$:
D(P_{Y^n(F)} \| P^*_{Y^n}) = D(P_{Y^n|X^n} \| P^*_{Y^n} | P_{X^n(F)}) - I(X^n(F); Y^n(F))   (91)
\le nC - I(X^n(F); Y^n(F))   (92)
\le nC - (1 - \epsilon'(F)) \log M'_n + \log 2   (93)
\le nC - \log M'_n + n \epsilon'(F) \log|\mathcal{X}| + \log 2   (94)
\le nC - \log M_n + n \epsilon'(F) \log|\mathcal{X}| + b_3 \sqrt{n} \log^{3/2} n ,   (95)
where (91) is by (52), (92) is by (19), (93) is by Fano's inequality, (94) is because $\log M'_n \le n \log|\mathcal{X}|$, and (95) holds for some $b_3 > 0$ by the choice of $M'_n$ in (87) and by
\log \binom{n}{\ell_n} \le \ell_n \log n .   (96)
Taking the expectation of both sides of (95), applying convexity of relative entropy and (90), we get
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M_n + n\, E[\epsilon'(F)] \log|\mathcal{X}| + b_3 \sqrt{n} \log^{3/2} n .   (97)
Accordingly, it remains to show that
n\, E[\epsilon'(F)] \le 2 .   (98)
To that end, for every subcode $F$ define the suboptimal randomized decoder
\hat{W}(y) = F_j \ \text{ with probability } \frac{1}{|L(y, F)|} , \qquad \forall F_j \in L(y, F) ,   (99)
where $L(y, F)$ is the list of those indices $i \in F$ for which $y \in \Gamma^{\ell_n} D_i$. Let $F_W$ be equiprobable on $F$; then, averaged over the selection of $F$, we have
E[\,|L(Y^n, F)| \mid F_W \in L(Y^n, F)\,] \le 1 + \binom{n}{\ell_n} |\mathcal{Y}|^{\ell_n} \frac{M'_n - 1}{M_n} ,   (100)
because each $y \in \mathcal{Y}^n$ can belong to at most $\binom{n}{\ell_n} |\mathcal{Y}|^{\ell_n}$ enlarged decoding regions $\Gamma^{\ell_n} D_i$ and since each $F_j$ is chosen independently and equiprobably among all $M_n$ alternatives. The average probability of error for the given decoder, when also averaged over $F$, can be upper-bounded as
E[\epsilon'(F)] \le P[F_W \notin L(Y^n, F)] + E\left[ \frac{|L(Y^n, F)| - 1}{|L(Y^n, F)|}\, 1\{F_W \in L(Y^n, F)\} \right]   (101)
\le P[F_W \notin L(Y^n, F)] + \binom{n}{\ell_n} |\mathcal{Y}|^{\ell_n} \frac{M'_n}{M_n}   (102)
\le \frac{1}{n} + \binom{n}{\ell_n} |\mathcal{Y}|^{\ell_n} \frac{M'_n}{M_n}   (103)
\le \frac{2}{n} ,   (104)
where (102) is by Jensen's inequality applied to $\frac{x-1}{x}$ and (100), (103) is by (85), and (104) is by (87). Since (104) also serves as an upper bound to $E[\epsilon'(F)]$, the proof of (98) is complete.
D. Gaussian channel
Theorem 7: For any $0 < \epsilon < 1$ and $P > 0$ there exists $a = a(\epsilon, P) > 0$ such that the output distribution $P_{Y^n}$ of any $(n, M_n, \epsilon)_{\rm max,det}$ code for the $AWGN(P)$ channel satisfies
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M_n + a\sqrt{n} ,   (105)
where $P^*_{Y^n} = \mathcal{N}(0, 1+P)^n$. In particular, for any capacity-achieving sequence of such codes we have
\frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 .   (106)
Remark 7: (106) need not hold if the maximal probability of error is replaced with the average or if the encoder is allowed to be stochastic. Counterexamples are constructed similarly to those for Remark 5 with $x_0 = 0$. Note also that Theorem 7 need not hold if the power constraint is in the average-over-the-codebook sense; see [14, Section 4.3.3].
Proof: Denote by lower-case $p_{Y^n|X^n=x}$ and $p_{Y^n}$ the densities of $P_{Y^n|X^n=x}$ and $P_{Y^n}$. Then an elementary computation shows
\nabla \log p_{Y^n}(y) = \log e \, \frac{\nabla p_{Y^n}(y)}{p_{Y^n}(y)}   (107)
= \frac{\log e}{p_{Y^n}(y)} \sum_{j=1}^{M} \frac{1}{M (2\pi)^{n/2}} \nabla e^{-\frac{1}{2}\|y - c_j\|^2}   (108)
= \frac{\log e}{p_{Y^n}(y)} \sum_{j=1}^{M} \frac{1}{M (2\pi)^{n/2}} (c_j - y)\, e^{-\frac{1}{2}\|y - c_j\|^2}   (109)
= (E[X^n \mid Y^n = y] - y) \log e .   (110)
For convenience denote
\hat{X}^n = E[X^n \mid Y^n]   (111)
and notice that since $\|X^n\| \le \sqrt{nP}$ we also have
\|\hat{X}^n\| \le \sqrt{nP} .   (112)
Then
\frac{1}{\log^2 e} E[\|\nabla \log p_{Y^n}(Y^n)\|^2 \mid X^n] = E\left[ \|Y^n - \hat{X}^n\|^2 \mid X^n \right]   (113)
\le 2 E[\|Y^n\|^2 \mid X^n] + 2 E[\|\hat{X}^n\|^2 \mid X^n]   (114)
\le 2 E[\|Y^n\|^2 \mid X^n] + 2nP   (115)
= 2 E[\|X^n + Z^n\|^2 \mid X^n] + 2nP   (116)
\le 4\|X^n\|^2 + 4n + 2nP   (117)
\le (6P + 4) n ,   (118)
where (114) is by
\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2 ,   (119)
(115) is by (112), in (116) we introduced $Z^n \sim \mathcal{N}(0, I_n)$ which is independent of $X^n$, (117) is by (119), and (118) is by the power constraint imposed on the codebook.
Conditioned on $X^n$, the random vector $Y^n$ is Gaussian. Thus, from Poincaré's inequality for the Gaussian measure, e.g. [19, (2.16)], we have
\mathrm{Var}[\log p_{Y^n}(Y^n) \mid X^n] \le E[\|\nabla \log p_{Y^n}(Y^n)\|^2 \mid X^n]   (120)
and together with (118) this yields the required estimate
\mathrm{Var}[\log p_{Y^n}(Y^n) \mid X^n] \le a_1 n   (121)
for some $a_1 > 0$. The argument then proceeds step by step as in the proof of Theorem 5, with (121) taking the place of (62) and recalling that property (19) holds for the AWGN channel too.
IV. DISCUSSION
We have shown that there is a constant $a = a(\epsilon)$ independent of $n$ and $M$ such that
D(P_{Y^n} \| P^*_{Y^n}) \le nC - \log M + a\sqrt{n} ,   (122)
where $P_{Y^n}$ is the output distribution induced by an arbitrary $(n, M, \epsilon)_{\rm max,det}$ code. Therefore, any $(n, M, \epsilon)_{\rm max,det}$ code necessarily satisfies
\log M \le nC + a(\epsilon)\sqrt{n} ,   (123)
as is classically known [20]. In particular, (122) implies that any $\epsilon$-capacity-achieving code must satisfy (1). In this section we discuss this and other implications of this result, such as:
1) (122) implies that the empirical marginal output distribution
\bar{P}_n \triangleq \frac{1}{n} \sum_{j=1}^{n} P_{Y_j}   (124)
converges to $P^*_Y$ in a strong sense (Section IV-A);
2) (122) guarantees estimates of the precision in the approximation (3) (Sections IV-B and IV-E);
3) (122) provides estimates for the deviations of $F(Y^n)$ from its average (Section IV-C);
4) relation to optimal transportation (Section IV-D);
5) implications of (1) for the empirical input distribution of the code (Sections IV-G and IV-H).
A. Empirical distributions and empirical averages
Considering the empirical marginal distributions, the convexity of relative entropy and (1) result in
D(\bar{P}_n \| P^*_Y) \le \frac{1}{n} D(P_{Y^n} \| P^*_{Y^n}) \to 0 ,   (125)
where $\bar{P}_n$ is the empirical marginal output distribution (124).
More generally, we have [2, (41)]
D(\bar{P}^{(k)}_n \| P^*_{Y^k}) \le \frac{k}{n-k+1} D(P_{Y^n} \| P^*_{Y^n}) \to 0 ,   (126)
where $\bar{P}^{(k)}_n$ is the $k$-th order empirical output distribution
\bar{P}^{(k)}_n = \frac{1}{n-k+1} \sum_{j=1}^{n-k+1} P_{Y_j^{j+k-1}} .   (127)
Knowing that a sequence of distributions $P_n$ converges in relative entropy to a distribution $P$, i.e.,
D(P_n \| P) \to 0 ,   (128)
implies convergence properties for the expectations of functions:
1) The following inequality (often called Pinsker's) is established in [21]:
\|P_n - P\|_{TV}^2 \le \frac{1}{2 \log e} D(P_n \| P) ,   (129)
where
\|P - Q\|_{TV} \triangleq \sup_A |P(A) - Q(A)| .   (130)
From (129) we have that (128) implies
\int f \, dP_n \to \int f \, dP   (131)
for all bounded functions $f$.
2) In fact, (131) also holds for unbounded $f$ as long as it satisfies Cramér's condition under $P$, i.e.,
\int e^{tf} \, dP < \infty   (132)
for all $t$ in some neighborhood of $0$; see [22, Lemma 3.1].
Together, (131) and (125) show that for a wide class of functions $f : \mathcal{Y} \to \mathbb{R}$, empirical averages over distributions induced by good codes converge to the average over the capacity-achieving output distribution (caod):
E\left[ \frac{1}{n} \sum_{j=1}^{n} f(Y_j) \right] \to \int f \, dP^*_Y .   (133)
From (126) a similar conclusion holds for $k$-th order empirical averages.
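As a quick numerical illustration of how (129) controls the gap in (131) (a sketch only, with arbitrarily generated distributions and assuming NumPy; not part of the paper), one can check Pinsker's inequality directly:

    import numpy as np

    rng = np.random.default_rng(0)

    def tv(p, q):                      # total variation: sup_A |P(A)-Q(A)|
        return 0.5 * np.abs(p - q).sum()

    def kl_bits(p, q):                 # relative entropy D(p||q) in bits
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

    for _ in range(5):
        p = rng.dirichlet(np.ones(8))
        q = rng.dirichlet(np.ones(8))
        # Pinsker (129): ||p - q||_TV^2 <= D(p||q) / (2 log2 e)
        print(tv(p, q)**2, kl_bits(p, q) / (2 * np.log2(np.e)))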
B. Averages of functions of $Y^n$
To go beyond empirical averages, we need to provide some definitions.
Definition 1: The function $F : \mathcal{Y}^n \to \mathbb{R}$ is called $(b, c)$-concentrated with respect to a measure $\mu$ on $\mathcal{Y}^n$ if for all $t \in \mathbb{R}$
\int \exp\{t (F(Y^n) - \bar{F})\} \, d\mu \le b \exp\{c t^2\} , \qquad \bar{F} = \int F \, d\mu .   (134)
A function $F$ is called $(b, c)$-concentrated for the channel if it is $(b, c)$-concentrated for every $P_{Y^n|X^n=x}$ and for $P^*_{Y^n}$.
A couple of simple properties of $(b, c)$-concentrated functions:
1) Gaussian concentration around the mean:
P[|F(Y^n) - E[F(Y^n)]| > t] \le b \exp\left\{ -\frac{t^2}{4c} \right\} .   (135)
2) Small variance:
\mathrm{Var}[F(Y^n)] = \int_0^\infty P[\,|F(Y^n) - E[F(Y^n)]|^2 > t\,] \, dt   (136)
\le \int_0^\infty \min\left\{ b \exp\left\{-\frac{t}{4c}\right\}, 1 \right\} dt   (137)
= 4c \log(2be) .   (138)
Examples of concentrated functions (see [19] for a survey):
• A bounded function $F$ with $\|F\|_\infty \le A$ is $(\exp\{A^2 (4c)^{-1}\}, c)$-concentrated for any $c$ and any measure $\mu$. Moreover, for a fixed $\mu$ and a sufficiently large $c$, any bounded function is $(1, c)$-concentrated.
• If $F$ is $(b, c)$-concentrated then $\lambda F$ is $(b, \lambda^2 c)$-concentrated.
• Let $f : \mathcal{Y} \to \mathbb{R}$ be $(1, c)$-concentrated with respect to $\mu$. Then so is
F(y^n) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} f(y_j)   (139)
with respect to $\mu^n$. In particular, any $F$ defined in this way from a bounded $f$ is $(1, c)$-concentrated for a memoryless channel (for a sufficiently large $c$ independent of $n$).
• If $\mu = \mathcal{N}(0,1)^n$ and $F$ is a Lipschitz function on $\mathbb{R}^n$ with Lipschitz constant $\|F\|_{\rm Lip}$, then $F$ is $(1, \frac{\|F\|_{\rm Lip}^2}{2 \log e})$-concentrated with respect to $\mu$, e.g. [23, Proposition 2.1]:
\int_{\mathbb{R}^n} \exp\{t (F(y^n) - \bar{F})\} \, d\mu(y^n) \le \exp\left\{ \frac{\|F\|_{\rm Lip}^2}{2 \log e} t^2 \right\} .   (140)
Therefore any Lipschitz function is $(1, \frac{(1+P)\|F\|_{\rm Lip}^2}{2 \log e})$-concentrated for the AWGN channel.
• For discrete $\mathcal{Y}^n$ we endow it with the Hamming distance
d(y^n, z^n) = |\{i : y_i \neq z_i\}|   (141)
and define Lipschitz functions in the usual way. In this case, a simpler criterion is: $F : \mathcal{Y}^n \to \mathbb{R}$ is Lipschitz with constant $\ell$ if and only if
\max_{y^n, b, j} |F(y_1, \ldots, y_j, \ldots, y_n) - F(y_1, \ldots, b, \ldots, y_n)| \le \ell .   (142)
Let $\mu$ be any product probability measure $P_1 \times \cdots \times P_n$ on $\mathcal{Y}^n$; then the standard Azuma-Hoeffding estimate shows that
\sum_{y^n \in \mathcal{Y}^n} \exp\{t (F(y^n) - \bar{F})\} \mu(y^n) \le \exp\left\{ \frac{n \|F\|_{\rm Lip}^2}{2 \log e} t^2 \right\}   (143)
and thus any Lipschitz function $F$ is $(1, \frac{n \|F\|_{\rm Lip}^2}{2 \log e})$-concentrated with respect to any product measure on $\mathcal{Y}^n$.
Note that, unlike the Gaussian case, the constant of concentration $c$ worsens linearly with the dimension $n$. Generally, this growth cannot be avoided, as shown by the coefficient $\frac{1}{\sqrt{n}}$ in the exact solution of the Hamming isoperimetric problem [24]. At the same time, this growth does not mean (143) is "weaker" than (140); for example, $F = \sum_{j=1}^{n} \phi(y_j)$ has Lipschitz constant $O(\sqrt{n})$ in Euclidean space and $O(1)$ in Hamming. Surprisingly, however, for convex functions the concentration (140) holds for product measures even under the Euclidean distance [25].
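The dimension-free nature of the Gaussian estimate (140) is easy to see by simulation. The following sketch (an illustration with arbitrarily chosen functions, assuming NumPy; not part of the paper) contrasts a 1-Lipschitz function with a function whose Euclidean Lipschitz constant grows like $\sqrt{n}$:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200

    # F(y) = ||y||_2 is 1-Lipschitz on R^n; under mu = N(0, I_n) the Gaussian
    # concentration (140) predicts fluctuations of order 1, independent of n.
    samples = rng.standard_normal((20000, n))
    F = np.linalg.norm(samples, axis=1)
    print(F.std())                         # O(1) even though E[F] grows like sqrt(n)

    # G(y) = sum_j y_j has Euclidean Lipschitz constant sqrt(n), so (140) only
    # promises fluctuations of order sqrt(n), as observed:
    G = samples.sum(axis=1)
    print(G.std())                         # ~ sqrt(n)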
We now show how to approximate expectations of concentrated functions.
Proposition 8: Suppose that $F : \mathcal{Y}^n \to \mathbb{R}$ is $(b, c)$-concentrated with respect to $P^*_{Y^n}$. Then
|E[F(Y^n)] - E[F(Y^{*n})]| \le 2\sqrt{c\, D(P_{Y^n} \| P^*_{Y^n}) + c \log b} ,   (144)
where
E[F(Y^{*n})] = \int F(y^n) \, dP^*_{Y^n} .   (145)
Proof: Recall the Donsker-Varadhan inequality [26, Lemma 2.1]: for any probability measures $P$ and $Q$ with $D(P\|Q) < \infty$ and a measurable function $g$ such that $\int \exp\{g\} \, dQ < \infty$, the integral $\int g \, dP$ exists (but is perhaps $-\infty$) and moreover
\int g \, dP - \log \int \exp\{g\} \, dQ \le D(P \| Q) .   (146)
Since by (134) the moment generating function of $F$ exists under $P^*_{Y^n}$, applying (146) to $tF$ we get
t\, E[F(Y^n)] - \log E[\exp\{t F(Y^{*n})\}] \le D(P_{Y^n} \| P^*_{Y^n}) .   (147)
From (134) we have
c t^2 - t\, E[F(Y^n)] + t\, E[F(Y^{*n})] + D(P_{Y^n} \| P^*_{Y^n}) + \log b \ge 0   (148)
for all $t$. Thus the discriminant of the parabola in (148) must be non-positive, which is precisely (144).
Note that for empirical averages $F(y^n) = \frac{1}{n} \sum_{j=1}^{n} f(y_j)$ we may either apply the estimate for concentration in the example (139) and then use Proposition 8, or directly apply Proposition 8 to (125) – the result is the same:
\left| \frac{1}{n} \sum_{j=1}^{n} E[f(Y_j)] - E[f(Y^*)] \right| \le 2 \sqrt{\frac{c}{n} D(P_{Y^n} \| P^*_{Y^n})} \to 0 ,   (149)
for any $f$ which is $(1, c)$-concentrated with respect to $P^*_Y$.
For the Gaussian channel, Proposition 8 and (140) yield:
Corollary 9: For any $0 < \epsilon < 1$ there exist two constants $a_1, a_2 > 0$ such that for any $(n, M, \epsilon)_{\rm max,det}$ code for the $AWGN(P)$ channel and for any Lipschitz function $F : \mathbb{R}^n \to \mathbb{R}$ we have
|E[F(Y^n)] - E[F(Y^{*n})]| \le a_1 \|F\|_{\rm Lip} \sqrt{nC - \log M + a_2 \sqrt{n}} ,   (150)
where $C = \frac{1}{2}\log(1+P)$ is the capacity.
Note that in the proof of Corollary 9 concentration of measure is used twice: once for $P_{Y^n|X^n}$ in the form of the Poincaré inequality (proof of Theorem 7) and once in the form of (134) (proof of Proposition 8).
C. Concentration of functions of $Y^n$
Surprisingly, not only can we estimate expectations of $F(Y^n)$ by replacing the unwieldy $P_{Y^n}$ with the simple $P^*_{Y^n}$, but in fact the distribution of $F(Y^n)$ exhibits a sharp peak at its expectation:
Proposition 10: Consider a channel for which (122) holds. Then for any $F$ which is $(b, c)$-concentrated for such a channel, we have for every $(n, M, \epsilon)_{\rm max,det}$ code:
P[|F(Y^n) - E[F(Y^{*n})]| > t] \le 3b \exp\left\{ nC - \log M + a\sqrt{n} - \frac{t^2}{16c} \right\}   (151)
and
\mathrm{Var}[F(Y^n)] \le 16c \left( nC - \log M + a\sqrt{n} + \log(6be) \right) .   (152)
Proof: Denote for convenience
\bar{F} \triangleq E[F(Y^{*n})] ,   (153)
\phi(x^n) = E[F(Y^n) \mid X^n = x^n] .   (154)
Then, as a consequence of $F$ being $(b, c)$-concentrated for $P_{Y^n|X^n=x^n}$, we have
P[|F(Y^n) - \phi(x^n)| > t \mid X^n = x^n] \le b \exp\left\{ -\frac{t^2}{4c} \right\} .   (155)
Consider now a subcode $\mathcal{C}_1$ consisting of all codewords such that $\phi(x^n) > \bar{F} + t$ for $t > 0$. The number $M_1 = |\mathcal{C}_1|$ of codewords in this subcode is
M_1 = M \, P[\phi(X^n) > \bar{F} + t] .   (156)
Let $Q_{Y^n}$ be the output distribution induced by $\mathcal{C}_1$. We have the following chain:
\bar{F} + t \le \frac{1}{M_1} \sum_{x \in \mathcal{C}_1} \phi(x^n)   (157)
= \int F(y^n) \, dQ_{Y^n}   (158)
\le \bar{F} + 2\sqrt{c\, D(Q_{Y^n} \| P^*_{Y^n}) + c \log b}   (159)
\le \bar{F} + 2\sqrt{c (nC - \log M_1 + a\sqrt{n}) + c \log b} ,   (160)
where (157) is by the definition of $\mathcal{C}_1$, (158) is by (154), (159) is by Proposition 8 and the assumption of $(b, c)$-concentration of $F$ under $P^*_{Y^n}$, and (160) is by (122).
Together, (156) and (160) imply
P[\phi(X^n) > \bar{F} + t] \le b \exp\left\{ nC - \log M + a\sqrt{n} - \frac{t^2}{4c} \right\} .   (161)
Applying the same argument to $-F$ we obtain a similar bound on $P[|\phi(X^n) - \bar{F}| > t]$, and thus
P[|F(Y^n) - \bar{F}| > t] \le P[|F(Y^n) - \phi(X^n)| > t/2] + P[|\phi(X^n) - \bar{F}| > t/2]   (162)
\le b \exp\left\{ -\frac{t^2}{16c} \right\} \left( 1 + 2 \exp\{nC - \log M + a\sqrt{n}\} \right)   (163)
\le 3b \exp\left\{ nC - \log M + a\sqrt{n} - \frac{t^2}{16c} \right\} ,   (164)
where (163) is by (155) and (161), and (164) is by (123). Thus (151) is proven. Moreover, (152) follows by (138).
D. Relation to optimal transportation
Since the seminal work of Marton [7], [27], one of the tools for proving $(b, c)$-concentration of Lipschitz functions is optimal transportation theory. Namely, Marton demonstrated that if a probability measure $\mu$ on a metric space satisfies a $T_1$ inequality
W_1(\nu, \mu) \le \sqrt{c' D(\nu \| \mu)} \qquad \forall \nu ,   (165)
then any Lipschitz $f$ is $(b, \|f\|_{\rm Lip}^2 c)$-concentrated with respect to $\mu$ for some $b = b(c, c')$ and any $0 < c < \frac{c'}{4}$. In (165), $W_1(\nu, \mu)$ denotes the linear-cost transportation distance, or Wasserstein-1 distance, defined as
W_1(\nu, \mu) \triangleq \inf_{P_{YY'}} E[d(Y, Y')] ,   (166)
where $d(\cdot, \cdot)$ is the distance on the underlying metric space and the infimum is taken over all couplings $P_{YY'}$ with fixed marginals $P_Y = \mu$, $P_{Y'} = \nu$. Note that according to [28, Theorem 1] we have $\|\nu - \mu\|_{TV} = W_1(\nu, \mu)$ when the underlying distance on $\mathcal{Y}$ is $d(y, y') = 1\{y \neq y'\}$, sometimes called an $L_0$-distance.
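For distributions on the real line $W_1$ has a closed form and is available in SciPy, which makes (165) easy to inspect numerically. The following sketch (illustrative only, assuming NumPy and SciPy; the shifted-Gaussian pair is an arbitrary choice) uses the fact that the standard Gaussian satisfies the stronger $T_2$ inequality (171) with $c' = 2$ in nats (Talagrand's inequality), so that $W_1 \le W_2 \le \sqrt{2 D}$:

    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(2)
    m = 0.7                                   # mean shift; nu = N(m,1), mu = N(0,1)

    x = rng.standard_normal(200000) + m       # samples from nu
    y = rng.standard_normal(200000)           # samples from mu

    W1 = wasserstein_distance(x, y)           # empirical Wasserstein-1 distance
    D_nats = m**2 / 2                         # D(nu || mu) for two unit-variance Gaussians
    print(W1, np.sqrt(2 * D_nats))            # transportation bound; both are about |m|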
In this section we show that (165) in fact directly implies the estimate of Proposition 8, without invoking either Marton's argument or the Donsker-Varadhan inequality. Indeed, assume that $F : \mathcal{Y}^n \to \mathbb{R}$ is a Lipschitz function and observe that for any coupling $P_{Y^n, Y^{*n}}$ we have
|E[F(Y^n)] - E[F(Y^{*n})]| \le \|F\|_{\rm Lip}\, E[d(Y^n, Y^{*n})] ,   (167)
where the distance $d$ is either Hamming or Euclidean depending on the nature of $\mathcal{Y}^n$. Now taking the infimum in the right-hand side of (167) with respect to all couplings we observe
|E[F(Y^n)] - E[F(Y^{*n})]| \le \|F\|_{\rm Lip}\, W_1(P_{Y^n}, P^*_{Y^n})   (168)
and therefore by the transportation inequality (165) we get
|E[F(Y^n)] - E[F(Y^{*n})]| \le \|F\|_{\rm Lip} \sqrt{c' D(P_{Y^n} \| P^*_{Y^n})} ,   (169)
which is precisely what Proposition 8 yields for $(1, \frac{c' \|F\|_{\rm Lip}^2}{4})$-concentrated functions.
Our argument can be turned around and used to prove linear-cost transportation $T_1$ inequalities (165). Indeed, by the Kantorovich-Rubinstein duality [29, Chapter 1] we have
\sup_F |E[F(Y^n)] - E[F(Y^{*n})]| = W_1(P_{Y^n}, P^*_{Y^n}) ,   (170)
where the supremum is over all $F$ with $\|F\|_{\rm Lip} \le 1$. Thus the argument in the proof of Proposition 8 shows that (165) must hold for any $\mu$ for which every 1-Lipschitz $F$ is $(1, c')$-concentrated, demonstrating an equivalence between $T_1$ transportation and Gaussian-like concentration, a result reported in [30, Theorem 3.1].
We also mention that, unlike general iid measures, an iid Gaussian $\mu = \mathcal{N}(0,1)^n$ satisfies a much stronger $T_2$ transportation inequality [31]
W_2(\nu, \mu) \le \sqrt{c' D(\nu \| \mu)} \qquad \forall \nu \ll \mu ,   (171)
where remarkably $c'$ does not depend on $n$, and the Wasserstein-2 distance $W_2$ is defined as
W_2(\nu, \mu) \triangleq \inf_{P_{YY'}} \sqrt{E[d^2(Y, Y')]} ,   (172)
the infimum being over all couplings as in (166).
E. Empirical averages of non-Lipschitz functions
One drawback of proving Proposition 8 via the transportation inequality (165) is that it does not show anything for non-Lipschitz functions. In this section we demonstrate how the proof of Proposition 8 can be extended to functions that do not satisfy the strong concentration assumptions.
Proposition 11: Let $f : \mathcal{Y} \to \mathbb{R}$ be a (single-letter) function such that for some $\theta > 0$ we have
m_1 = E[\exp\{\theta f(Y^*)\}] < \infty   (173)
(one-sided Cramér condition) and
m_2 = E[f^2(Y^*)] < \infty .   (174)
Then there exists $b = b(m_1, m_2, \theta) > 0$ such that for all $n \ge \frac{16}{\theta^4}$ we have
\frac{1}{n} \sum_{j=1}^{n} E[f(Y_j)] \le E[f(Y^*)] + \frac{D(P_{Y^n} \| P^*_{Y^n}) + b\sqrt{n}}{n^{3/4}} ,   (175)
provided that $P^*_{Y^n} = (P^*_Y)^n$.
Remark 8: Substituting the estimate (122) and assuming $\log M = nC + O(\sqrt{n})$, we can see that both Proposition 8 and (175) give the same order $n^{-1/4}$ for the deviation of the empirical average from $E[f(Y^*)]$. Note also that Proposition 11 applies to functions which only have one-sided exponential moments; this will be useful in Theorem 23. Finally, the remainder term in (175) is order-optimal for $D(P_{Y^n} \| P^*_{Y^n}) \sim \sqrt{n}$, but can be improved to $\sqrt{\frac{D(P_{Y^n} \| P^*_{Y^n})}{n}}$ at the expense of adding technical conditions on $n$, $\theta$ and $m_1$.
Proof: It is clear that if the moment-generating function $t \mapsto E[\exp\{t f(Y^*)\}]$ exists for $t = \theta > 0$ then it also exists for all $0 \le t \le \theta$. Notice that since
x^2 \exp\{-x\} \le 4 e^{-2} \log e , \qquad \forall x \ge 0 ,   (176)
we have for all $0 \le t \le \frac{\theta}{2}$:
E[f^2(Y^*) \exp\{t f(Y^*)\}] \le E[f^2(Y^*) 1\{f < 0\}] + \frac{4 e^{-2} \log e}{(\theta - t)^2} E[\exp\{\theta f(Y^*)\} 1\{f \ge 0\}]   (177)
\le m_2 + \frac{4 e^{-2} m_1 \log e}{(\theta - t)^2}   (178)
\le m_2 + \frac{16 e^{-2} m_1 \log e}{\theta^2}   (179)
\triangleq b(m_1, m_2, \theta) \cdot 2 \log e .   (180)
Then the simple estimate
\log E[\exp\{t f(Y^*)\}] \le t\, E[f(Y^*)] + b t^2 , \qquad 0 \le t \le \frac{\theta}{2} ,   (181)
can be obtained by taking the logarithm of the identity
E[\exp\{t f(Y^*)\}] = 1 + \frac{t}{\log e} E[f(Y^*)] + \frac{1}{\log^2 e} \int_0^t ds \int_0^s E[f^2(Y^*) \exp\{u f(Y^*)\}] \, du   (182)
and applying the upper bound (180) and $\log x \le (x-1) \log e$.
Next, we define $F(y^n) = \frac{1}{n} \sum_{j=1}^{n} f(y_j)$ and consider the chain:
t\, E[F(Y^n)] \le \log E[\exp\{t F(Y^{*n})\}] + D(P_{Y^n} \| P^*_{Y^n})   (183)
= n \log E\left[\exp\left\{\frac{t}{n} f(Y^*)\right\}\right] + D(P_{Y^n} \| P^*_{Y^n})   (184)
\le t\, E[f(Y^*)] + \frac{b t^2}{n} + D(P_{Y^n} \| P^*_{Y^n}) ,   (185)
where (183) is by (147), (184) is because $P^*_{Y^n} = (P^*_Y)^n$, and (185) is by (181) assuming $\frac{t}{n} \le \frac{\theta}{2}$. The proof concludes by taking $t = n^{3/4}$ in (185).
A natural extension of Proposition 11 to functions such as
F(y^n) = \frac{1}{n-r+1} \sum_{j=1}^{n-r+1} f(y_j^{j+r-1})   (186)
is made by replacing the step (184) with the estimate
\log E[\exp\{t F(Y^{*n})\}] \le \frac{n-r+1}{r} \log E\left[ \exp\left\{ \frac{tr}{n} f(Y^{*r}) \right\} \right] ,   (187)
which in turn is shown by splitting the sum into $r$ subsums with independent terms and then applying Hölder's inequality:
E[X_1 \cdots X_r] \le \left( E[|X_1|^r] \cdots E[|X_r|^r] \right)^{1/r}   (188)
(for $n$ not a multiple of $r$ a small remainder term needs to be added to (187)).
F. Functions of degraded channel outputs
Notice that if the same code is used over a channel QY |X which is stochastically degraded
with respect to PY |X then by the data-processing for relative entropy, the upper bound (122)
holds for D(QY n ||Q∗Y n ), where QY n is the output of the QY |X channel and Q∗Y n is the output
of QY |X when the input is distributed according to a capacity-achieving distribution of PY |X .
Thus, in all the discussions the pair (PY n , P Y∗ n ) can be replaced with (QY n , Q∗Y n ) without any
change in arguments or constants. This observation can be useful in questions of information
theoretic security, where the wiretapper has access to a degraded copy of the channel output.
G. Input distribution: DMC
As shown in Section IV-A, we have for every $\epsilon$-capacity-achieving code:
\bar{P}_n = \frac{1}{n} \sum_{j=1}^{n} P_{Y_j} \to P^*_Y .   (189)
As noted in [2], convergence of output distributions can be propagated to statements about the input distributions. For example, this is obvious for the case of a DMC with a non-singular (more generally, injective) matrix $P_{Y|X}$. For other DMCs, the following argument extends that of [2, Theorem 4]. By Theorems 4 and 5 we know that
\frac{1}{n} I(X^n; Y^n) \to C .   (190)
By concavity of mutual information, we must necessarily have
I(\bar{X}; \bar{Y}) \to C ,   (191)
where $P_{\bar{X}} = \frac{1}{n} \sum_{j=1}^{n} P_{X_j}$. By compactness of the simplex of input distributions and continuity of the mutual information on that simplex, the distance to the (compact) set of capacity-achieving distributions $\Pi$ must vanish:
d(P_{\bar{X}}, \Pi) \to 0 .   (192)
If the capacity-achieving input distribution $P^*_X$ is unique, then (192) shows the convergence $P_{\bar{X}} \to P^*_X$ in the (strong) sense of total variation.
H. Input distribution: AWGN
In the case of the AWGN, just like in the discrete case, (50) implies that for any capacity-achieving sequence of codes we have
P^{(n)}_{\bar{X}} = \frac{1}{n} \sum_{j=1}^{n} P_{X_j} \to P^*_X \triangleq \mathcal{N}(0, P) ,   (193)
however, in the sense of weak convergence of distributions only. Indeed, first, the collection of all distributions on $\mathbb{R}$ with second moment not exceeding $P$ is weakly compact by Prokhorov's criterion. Thus, the sequence of empirical input marginal distributions $P^{(n)}_{\bar{X}}$ may be assumed to weakly converge to some distribution $Q_X$. On the other hand, for the sequence of induced empirical output distributions $P^{(n)}_{\bar{Y}}$, from (50) we get
D(P^{(n)}_{\bar{Y}} \| P^*_Y) \to 0 .   (194)
Thus, by the weak lower-semicontinuity of relative entropy [32] we have $D(Q_Y \| P^*_Y) = 0$, where $Q_Y = Q_X * \mathcal{N}(0,1)$ is the output distribution induced by $Q_X$ over the AWGN channel $P_{Y|X}$ and $*$ is convolution. Since the map $Q_X \mapsto Q_X * \mathcal{N}(0,1)$ is injective (e.g. by computing characteristic functions), we must also have $Q_X = P^*_X$, and thus (193) holds.
We now discuss whether (193) can be claimed in a stronger topology than the weak one. Since $P_{\bar{X}}$ is purely atomic and $P^*_X$ is purely diffuse, we have
\|P_{\bar{X}} - P^*_X\|_{TV} = 1 ,   (195)
and convergence in total variation (let alone in relative entropy) cannot hold.
On the other hand, it is quite clear (and also follows from a quantitative estimate in (267) below) that the second moment of $\frac{1}{n} \sum P_{X_j}$ necessarily converges to that of $\mathcal{N}(0, P)$. Together, weak convergence and control of second moments imply [29, (12), p. 7]
W_2^2\left( \frac{1}{n} \sum_{j=1}^{n} P_{X_j}, P^*_X \right) \to 0 .   (196)
Therefore (193) holds in the sense of the topology metrized by the $W_2$-distance.
Note that convexity properties of $W_2^2(\cdot, \cdot)$ imply
W_2^2\left( \frac{1}{n} \sum_{j=1}^{n} P_{X_j}, P^*_X \right) \le \frac{1}{n} \sum_{j=1}^{n} W_2^2(P_{X_j}, P^*_X)   (197)
\le \frac{1}{n} W_2^2(P_{X^n}, P^*_{X^n}) ,   (198)
where we denoted
P^*_{X^n} \triangleq (P^*_X)^n = \mathcal{N}(0, P I_n) .   (199)
Comparing (196) and (198), it is natural to conjecture a stronger result: for any capacity-achieving sequence of codes
\frac{1}{n} W_2^2(P_{X^n}, P^*_{X^n}) \to 0 .   (200)
Another reason to conjecture (200) arises from considering the behavior of the Wasserstein distance under convolutions. Indeed, from the $T_2$ transportation inequality (171) and the relative entropy bound (122) we have
\frac{1}{n} W_2^2(P_{X^n} * \mathcal{N}(0, I_n), P^*_{X^n} * \mathcal{N}(0, I_n)) \to 0 ,   (201)
since by definition
P_{Y^n} = P_{X^n} * \mathcal{N}(0, I_n) ,   (202)
P^*_{Y^n} = P^*_{X^n} * \mathcal{N}(0, I_n) ,   (203)
where $*$ denotes convolution of distributions on $\mathbb{R}^n$. Trivially, for any probability measures $P$, $Q$ and $N$ on $\mathbb{R}^n$ it is true that (e.g. [29, Proposition 7.17])
W_2(P * N, Q * N) \le W_2(P, Q) .   (204)
Thus, overall we have
0 \leftarrow \frac{1}{\sqrt{n}} W_2(P_{X^n} * \mathcal{N}(0, I_n), P^*_{X^n} * \mathcal{N}(0, I_n)) \le \frac{1}{\sqrt{n}} W_2(P_{X^n}, P^*_{X^n}) ,   (205)
and (200) simply conjectures that (conversely to (204)) the convolution with a Gaussian kernel is unable to significantly decrease $W_2$.
Despite the foregoing intuitive considerations, the conjecture (200) is false. Indeed, define $D^*(M, n)$ to be the minimum achievable average square distortion among all vector quantizers of the memoryless Gaussian source $\mathcal{N}(0, P)$ for blocklength $n$ and cardinality $M$. In other words,
D^*(M, n) = \frac{1}{n} \inf_Q W_2^2(P^*_{X^n}, Q) ,   (206)
where the infimum is over all probability measures $Q$ supported on $M$ equiprobable atoms in $\mathbb{R}^n$. The standard rate-distortion (converse) lower bound dictates
\frac{1}{n} \log M \ge \frac{1}{2} \log \frac{P}{D^*(M, n)}   (207)
and hence
W_2^2(P_{X^n}, P^*_{X^n}) \ge n D^*(M, n)   (208)
\ge n P \exp\left\{ -\frac{2 \log M}{n} \right\} ,   (209)
which shows that for any sequence of codes with $\log M_n = O(n)$, the normalized transportation distance stays strictly bounded away from zero:
\liminf_{n \to \infty} \frac{1}{\sqrt{n}} W_2(P_{X^n}, P^*_{X^n}) > 0 .   (210)

V. BINARY HYPOTHESIS TESTING $P_{Y^n}$ VS. $P^*_{Y^n}$
We now turn to the question of distinguishing $P_{Y^n}$ from $P^*_{Y^n}$ in the sense of binary hypothesis testing. First, a simple data-processing reasoning yields
d(\alpha \| \beta_\alpha(P_{Y^n}, P^*_{Y^n})) \le D(P_{Y^n} \| P^*_{Y^n}) ,   (211)
where we have denoted the binary relative entropy
d(x \| y) \triangleq x \log \frac{x}{y} + (1-x) \log \frac{1-x}{1-y} .   (212)
From (122) and (211) we conclude: every $(n, M, \epsilon)_{\rm max,det}$ code must satisfy
\beta_\alpha(P_{Y^n}, P^*_{Y^n}) \ge \left( \frac{M}{2} \right)^{\frac{1}{\alpha}} \exp\left\{ -\frac{nC}{\alpha} - \frac{a\sqrt{n}}{\alpha} \right\}   (213)
for all $0 < \alpha \le 1$. Therefore, in particular, we see that the hypothesis testing problem for discriminating $P_{Y^n}$ from $P^*_{Y^n}$ has zero Stein's exponent (and thus a zero Chernoff exponent), provided the sequence of $(n, M_n, \epsilon)_{\rm max,det}$ codes defining $P_{Y^n}$ is capacity-achieving.
The following result shows that (213) is not tight:
Theorem 12: Consider one of the three types of channels introduced in Section II. Then every $(n, M, \epsilon)_{\rm avg}$ code must satisfy
\beta_\alpha(P_{Y^n}, P^*_{Y^n}) \ge M \exp\{-nC - a_2 \sqrt{n}\} , \qquad \epsilon \le \alpha \le 1 ,   (214)
where $a_2 = a_2(\epsilon, a_1) > 0$ depends only on $\epsilon$ and the constant $a_1$ from (24).
To prove Theorem 12 we need the following strengthening of the meta-converse results [3, Theorems 27 and 31]:
Theorem 13: Consider an $(M, \epsilon)_{\rm avg}$ code for an arbitrary random transformation $P_{Y|X}$. Let $P_X$ be the distribution of the encoder output if all codewords are equally likely and let $P_Y$ be the induced output distribution. Then for any $Q_Y$ and $\epsilon \le \alpha \le 1$ we have
\beta_\alpha(P_Y, Q_Y) \ge M \beta_{\alpha - \epsilon}(P_{XY}, P_X Q_Y) .   (215)
If the code is $(M, \epsilon)_{\rm max,det}$ then, additionally, we have for every $\delta > 0$:
\beta_\alpha(P_Y, Q_Y) \ge \frac{\delta}{1 - \alpha + \delta}\, M \inf_{x \in \mathcal{C}} \beta_{\alpha - \epsilon - \delta}(P_{Y|X=x}, Q_Y) , \qquad \epsilon + \delta \le \alpha \le 1 ,   (216)
where the infimum is taken over the codebook $\mathcal{C}$.
Proof: On the probability space corresponding to a given $(M, \epsilon)_{\rm avg}$ code, define the following random variable:
Z = 1\{\hat{W}(Y) = W, \ Y \in E\} ,   (217)
where $E$ is an arbitrary subset satisfying
P_Y[E] \ge \alpha .   (218)
Precisely as in the original meta-converse [3, Theorem 26], the main idea is to use $Z$ as a suboptimal hypothesis test for discriminating $P_{XY}$ against $P_X Q_Y$. Following the same reasoning as in [3, Theorem 27] one notices that
(P_X Q_Y)[Z = 1] \le \frac{Q_Y[E]}{M}   (219)
and
P_{XY}[Z = 1] \ge \alpha - \epsilon .   (220)
Therefore, by definition of $\beta_\alpha$ we must have
\beta_{\alpha - \epsilon}(P_{XY}, P_X Q_Y) \le \frac{Q_Y[E]}{M} .   (221)
To complete the proof of (215) we just need to take the infimum in (221) over all $E$ satisfying (218).
To prove (216), we again consider any set $E$ satisfying (218). Let $c_i$, $i = 1, \ldots, M$, be the codewords of the $(M, \epsilon)_{\rm max,det}$ code. Define
p_i = P_{Y|X=c_i}[E] ,   (222)
q_i = Q_Y[\hat{W} = i, E] , \qquad i = 1, \ldots, M .   (223)
Since the sets $\{\hat{W} = i\}$ are disjoint, the (arithmetic) average of $q_i$ is bounded as
E[q_W] \le \frac{Q_Y[E]}{M} ,   (224)
whereas because of (218) we have
E[p_W] \ge \alpha .   (225)
Thus, the following lower bound holds:
E\left[ \frac{Q_Y[E]}{M} p_W - \delta q_W \right] \ge \frac{Q_Y[E]}{M} (\alpha - \delta) ,   (226)
implying that there must exist $i \in \{1, \ldots, M\}$ such that
\frac{Q_Y[E]}{M} p_i - \delta q_i \ge \frac{Q_Y[E]}{M} (\alpha - \delta) .   (227)
For such $i$ we clearly have
P_{Y|X=c_i}[E] \ge \alpha - \delta ,   (228)
Q_Y[\hat{W} = i, E] \le \frac{Q_Y[E]}{M} \cdot \frac{1 - \alpha + \delta}{\delta} .   (229)
By the maximal probability of error constraint we deduce
P_{Y|X=c_i}[E, \hat{W} = i] \ge \alpha - \epsilon - \delta   (230)
and thus by the definition of $\beta_\alpha$:
\beta_{\alpha - \epsilon - \delta}(P_{Y|X=c_i}, Q_Y) \le \frac{Q_Y[E]}{M} \cdot \frac{1 - \alpha + \delta}{\delta} .   (231)
Taking the infimum in (231) over all $E$ satisfying (218) completes the proof of (216).
Proof of Theorem 12: To show (214) we first notice that as a consequence of (19), (24) and [3, Lemma 59] (see also [14, (2.71)]) we have for any $x^n \in \mathcal{X}_n$:
\beta_\alpha(P_{Y^n|X^n=x^n}, P^*_{Y^n}) \ge \frac{\alpha}{2} \exp\left\{ -nC - \sqrt{\frac{2 a_1 n}{\alpha}} \right\} .   (232)
From [14, Lemma 32] and the fact that the function of $\alpha$ in the right-hand side of (232) is convex, we obtain that for any $P_{X^n}$
\beta_\alpha(P_{X^n Y^n}, P_{X^n} P^*_{Y^n}) \ge \frac{\alpha}{2} \exp\left\{ -nC - \sqrt{\frac{2 a_1 n}{\alpha}} \right\} .   (233)
Finally, (233) and (215) imply (214).
VI. AEP
FOR THE OUTPUT PROCESS
Yn
We say that a sequence of distributions PY n on Y n (with Y a countable set) satisfies the
asymptotic equipartition property (AEP) if
1
1
log
− H (Y n ) → 0
n
n
PY n (Y )
(234)
in probability.
Although the sequence of output distributions induced by codes is far from being (a finite chunk of) a stationary ergodic process, we will show that (234) is satisfied for ǫ-capacity-achieving codes (and other codes). Thus, in particular, if the channel outputs are to be losslessly compressed and stored for later decoding, \frac{1}{n} H(Y^n) bits per sample would suffice (cf. (57)). In fact, \log \frac{1}{P_{Y^n}(Y^n)} concentrates up to O(\sqrt{n}) around the entropy H(Y^n). Such questions are also interesting in other contexts and for other types of distributions, see [33].
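As a purely illustrative numerical sketch (not part of the development above, and written under the simplifying assumption of a memoryless output process rather than the coded one), the following Python fragment estimates \frac{1}{n}\log\frac{1}{P_{Y^n}(Y^n)} by Monte Carlo and exhibits the concentration around the per-letter entropy that definition (234) asks for; the alphabet and pmf are hypothetical.

    # Sanity check of the AEP statement (234) for an i.i.d. toy process;
    # the coded output process studied here is NOT i.i.d., this only
    # illustrates what the definition requires.
    import numpy as np

    rng = np.random.default_rng(0)
    p = np.array([0.5, 0.3, 0.2])        # assumed single-letter output pmf
    H = -np.sum(p * np.log(p))           # per-letter entropy (nats)

    for n in (100, 1000, 10000):
        samples = rng.choice(len(p), size=(200, n), p=p)
        # minus normalized log-probability of each realization
        neg_log = -np.log(p[samples]).sum(axis=1) / n
        print(n, H, neg_log.mean(), neg_log.std())
    # The empirical standard deviation decays like 1/sqrt(n), matching the
    # O(sqrt(n)) concentration of log 1/P_{Y^n}(Y^n) around H(Y^n).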
Theorem 14: Consider a DMC P_{Y|X} with C_1 < ∞ (with or without input constraints) and a capacity-achieving sequence of (n, M_n, ǫ)_{max,det} codes. Then the output AEP (234) holds.

Proof: In the proof of Theorem 5 it was shown that \log P_{Y^n}(y^n) is Lipschitz with Lipschitz constant upper bounded by a_1, see (60). Thus by (143) and Proposition 10 we find that for any capacity-achieving sequence of codes

Var[\log P_{Y^n}(Y^n)] = o(n^2) ,    n → ∞.    (235)

Thus (234) holds by Chebyshev's inequality.
Surprisingly, for many practically interesting DMCs (such as those with additive noise in a finite group), the estimate (235) can be improved to O(n) even without assuming the code to be capacity-achieving.

Theorem 15: Consider a DMC P_{Y|X} with C_1 < ∞ (with or without input constraints) and such that H(Y|X = x) is constant on X. Then for any sequence of (n, M_n, ǫ)_{max,det} codes there exists a constant a = a(ǫ) such that for all n sufficiently large

Var\Big[\log \frac{1}{P_{Y^n}(Y^n)}\Big] ≤ a n .    (236)

In particular, the output AEP (234) holds.
Proof: First, let X be a random variable and A some event (think P[A^c] ≪ 1) such that

|X − E[X]| ≤ L    (237)

if X ∉ A. Then

Var[X] = E[(X − E[X])^2 1_A] + E[(X − E[X])^2 1_{A^c}]    (238)
≤ E[(X − E[X])^2 1_A] + P[A^c] L^2    (239)
= P[A] Var[X|A] + (E[X] − E[X|A])^2 P[A] + P[A^c] L^2    (240)
≤ Var[X|A] + \frac{P[A^c]}{P[A]} L^2 ,    (241)

where (239) is by (237), (240) is after rewriting E[(X − E[X])^2 | A] in terms of Var[X|A] and applying the obvious identity

E[X|A] = \frac{E[X] − P[A^c]\,E[X|A^c]}{P[A]} ,    (242)

and (241) is because (237) implies |E[X|A^c] − E[X]| ≤ L.
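The chain (238)–(241) is elementary; the following hypothetical Monte Carlo check (the toy distribution, event and variable names are mine, not the paper's) illustrates it numerically for a bounded variable and a high-probability event A.

    # Numerical check of Var[X] <= Var[X|A] + (P[A^c]/P[A]) * L^2 on a toy
    # bounded variable; here L is chosen so that |X - E[X]| <= L holds surely.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(-1.0, 1.0, size=1_000_000) + 5.0 * (rng.random(1_000_000) < 0.01)
    L = X.max() - X.min() + abs(X.mean())   # crude but valid bound on |X - E[X]|
    A = X < 4.0                             # "typical" event, P[A^c] is small

    lhs = X.var()
    rhs = X[A].var() + (1 - A.mean()) / A.mean() * L**2
    print(lhs, rhs, lhs <= rhs)             # the bound (241) holds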
Next, fix n and for any codeword x^n ∈ X^n denote for brevity

d(x^n) = D(P_{Y^n|X^n=x^n} || P_{Y^n}) ,    (243)
v(x^n) = E\Big[\log \frac{1}{P_{Y^n}(Y^n)} \,\Big|\, X^n = x^n\Big]    (244)
       = d(x^n) + H(Y^n | X^n = x^n) .    (245)

If we could show that for some a_1 > 0

Var[d(X^n)] ≤ a_1 n ,    (246)

the proof would be completed as follows:

Var\Big[\log \frac{1}{P_{Y^n}(Y^n)}\Big] = E\Big[ Var\Big[\log \frac{1}{P_{Y^n}(Y^n)} \,\Big|\, X^n\Big] \Big] + Var[v(X^n)]    (247)
≤ a_2 n + Var[v(X^n)]    (248)
= a_2 n + Var[d(X^n)]    (249)
≤ (a_1 + a_2) n ,    (250)

where (248) follows for an appropriate constant a_2 > 0 from (62), (249) is by (245) and because H(Y^n | X^n = x^n) does not depend on x^n by assumption^9, and (250) is by (246).

^9 This argument also shows how to construct a counterexample when H(Y|X = x) is non-constant: merge two constant-composition subcodes of types P_1 and P_2 such that H(W|P_1) ≠ H(W|P_2), where W = P_{Y|X} is the channel matrix. In this case one clearly has Var[\log P_{Y^n}(Y^n)] ≥ Var[v(X^n)] = const · n^2.
To show (246), first notice that

i_{X^n;Y^n}(x^n; y^n) = \log \frac{P_{X^n|Y^n}(x^n|y^n)}{P_{X^n}(x^n)} ≤ \log M_n .    (251)

Second, as shown in (65), one may take S_m = a_3 n in Corollary 3. In turn, this implies that one can take ∆ = \sqrt{\frac{2 a_3 n}{1−ǫ}} and δ' = \frac{1−ǫ}{2} in Theorem 2, that is:

\inf_{x^n} P\Big[\log \frac{dP_{Y^n|X^n=x^n}}{dP_{Y^n}}(Y^n) < d(x^n) + ∆ \,\Big|\, X^n = x^n\Big] ≥ \frac{1+ǫ}{2} .    (252)

Then, applying Theorem 1 with ρ(x^n) = d(x^n) + ∆ to the subcode consisting of all codewords x^n with {d(x^n) ≤ \log M_n − 2∆} we get

P[d(X^n) ≤ \log M_n − 2∆] ≤ \frac{2}{1−ǫ} \exp\{−∆\} ,    (253)

since

E[ρ(X^n) \,|\, d(X^n) ≤ \log M_n − 2∆] ≤ \log M_n − ∆ .    (254)

Now, we apply (241) to d(X^n) with L = \log M_n and A = {d(X^n) > \log M_n − 2∆}. Since Var[X|A] ≤ ∆^2, this yields

Var[d(X^n)] ≤ ∆^2 + \frac{2 \log^2 M_n}{1−ǫ} \exp\{−∆\}    (255)

for all n such that \frac{2}{1−ǫ}\exp\{−∆\} ≤ \frac{1}{2}. Since ∆ = O(\sqrt{n}) and \log M_n = O(n), we conclude from (255) that there must be a constant a_1 such that (246) holds.
We also have a similar extension for the AWGN channel, except that this time the output AEP does not have a natural interpretation in terms of fundamental limits of compression.

Theorem 16: Consider the AWGN channel. Then for any sequence of (n, M_n, ǫ)_{max,det} codes there exists a constant a = a(ǫ) such that for all n sufficiently large

Var\Big[\log \frac{1}{p_{Y^n}(Y^n)}\Big] ≤ a n ,    (256)

where p_{Y^n} is the density of P_{Y^n} with respect to the Lebesgue measure Leb on R^n. In particular, the AEP holds for Y^n.

Proof: The argument follows that of Theorem 15 step by step, with (121) used in place of (62).
Corollary 17: If, in the setting of Theorem 16, the codes are spherical (i.e., the energies of all codewords X^n are equal) or, more generally,

Var[\,||X^n||^2\,] = o(n^2),    (257)

then

\frac{1}{n}\Big( \log \frac{dP_{Y^n}}{dP^*_{Y^n}}(Y^n) − D(P_{Y^n} || P^*_{Y^n}) \Big) → 0    (258)

in P_{Y^n}-probability.

Proof: To apply Chebyshev's inequality to \log \frac{dP_{Y^n}}{dP^*_{Y^n}}(Y^n) we need, in addition to (256), to show

Var[\log p^*_{Y^n}(Y^n)] = o(n^2) ,    (259)

where p^*_{Y^n}(y^n) = (2π(1 + P))^{−n/2} e^{−||y^n||^2 / (2(1+P))}. Introducing i.i.d. Z_j ∼ N(0, 1) we have

Var\Big[\log \frac{1}{p^*_{Y^n}}(Y^n)\Big] = \frac{\log^2 e}{4(1 + P)^2}\, Var\Big[\,||X^n||^2 + 2\sum_{j=1}^n X_j Z_j + ||Z^n||^2\,\Big] .    (260)

The variances of the second and third terms are clearly O(n), while the variance of the first term is o(n^2) by assumption (257). Then (260) implies (259) via

Var[A + B + C] ≤ 3\,(Var[A] + Var[B] + Var[C]) .    (261)
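A minimal numerical sketch of the variance estimate (259)–(260), under my own toy setup in which the codewords are replaced by fixed-energy random vectors (so that assumption (257) holds trivially); it shows the O(n) growth of Var[\log p^*_{Y^n}(Y^n)].

    # Toy check of (259): with ||X^n||^2 fixed (spherical "codewords"), the
    # variance of -log p*_{Y^n}(Y^n) grows only linearly in n, as (260) predicts.
    import numpy as np

    rng = np.random.default_rng(2)
    P = 1.0
    for n in (100, 400, 1600):
        reps = 5000
        X = rng.standard_normal((reps, n))
        X *= np.sqrt(n * P) / np.linalg.norm(X, axis=1, keepdims=True)  # ||X||^2 = nP
        Y = X + rng.standard_normal((reps, n))
        # -log of the N(0, (1+P) I_n) density, up to an additive constant
        neg_log_pstar = np.sum(Y**2, axis=1) / (2 * (1 + P))
        print(n, np.var(neg_log_pstar) / n)   # roughly constant in n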
VII. EXPECTATIONS OF NON-LINEAR POLYNOMIALS OF GAUSSIAN CODES

This section contains results special to the AWGN channel. Because of the algebraic structure available on R^n it is natural to ask whether we can provide approximations for polynomials. Since Theorem 7 shows the validity of (122), all the results for Lipschitz (in particular linear) functions from Section IV follow. Polynomials of higher degree, however, do not admit bounded Lipschitz constants. In this section we discuss the case of quadratic polynomials (Section VII-A) and polynomials of higher degree (Section VII-B). We present results directly in terms of polynomials in (X_1, ..., X_n) on the input space. This is strictly stronger than considering polynomials on the output space, since E[q(Y^n)] = E[q(X^n + Z^n)] and thus, by integrating over the distribution of Z^n, the problem reduces to computing the expectation of a (different) polynomial of X^n. The reverse reduction is clearly not possible.
A. Quadratic forms

We denote the canonical inner product on R^n as

(a, b) = \sum_{j=1}^n a_j b_j ,    (262)

and write the quadratic form corresponding to a matrix A as

(Ax, x) = \sum_{j=1}^n \sum_{i=1}^n a_{i,j} x_i x_j .    (263)

Note that when X^n ∼ N(0, P I_n) we have trivially

E[(AX^n, X^n)] = P \tr A ,    (264)

where tr is the trace operator. Therefore, the next result shows that the distribution of good codes must be close to the isotropic Gaussian distribution, at least in the sense of evaluating quadratic forms:
Theorem 18: For any P > 0 and 0 < ǫ < 1 there exists a constant b = b(P, ǫ) > 0 such that for all (n, M, ǫ)_{max,det} codes and all quadratic forms A such that

−I_n ≤ A ≤ I_n    (265)

we have

| E[(AX^n, X^n)] − P \tr A | ≤ \frac{2(1 + P)\sqrt{n}}{\sqrt{\log e}} \sqrt{nC − \log M + b\sqrt{n}}    (266)

and (a refinement for A = I_n)

\Big| \sum_{j=1}^n E[X_j^2] − nP \Big| ≤ \frac{2(1 + P)}{\log e} \big(nC − \log M + b\sqrt{n}\big) .    (267)

Remark 9: By using the same method as in the proof of Proposition 10 one can also show that the estimate (266) holds on a per-codeword basis for an overwhelming majority of the codewords.
Proof: Denote

Σ = E[X^n (X^n)^T] ,    (268)
V = (I_n + Σ)^{−1} ,    (269)
Q_{Y^n} = N(0, I_n + Σ) ,    (270)
R(y|x) = \log \frac{dP_{Y^n|X^n=x}}{dQ_{Y^n}}(y)    (271)
       = \frac{\log e}{2}\big(\ln\det(I_n + Σ) + (Vy, y) − ||y − x||^2\big) ,    (272)
d(x) = E[R(Y^n|x) | X^n = x]    (273)
     = \frac{\log e}{2}\big(\ln\det(I_n + Σ) + (Vx, x) + \tr(V − I_n)\big) ,    (274)
v(x) = Var[R(Y^n|x) | X^n = x] .    (275)
Denote also the spectrum of Σ by {λi , i = 1, . . . , n} and its eigenvectors by {vi , i = 1, . . . , n}.
We have then
| E[(AX^n, X^n)] − P \tr A | = | \tr((Σ − P I_n)A) |    (276)
= \Big| \sum_{i=1}^n (λ_i − P)(Av_i, v_i) \Big|    (277)
≤ \sum_{i=1}^n |λ_i − P| ,    (278)

where (277) follows by computing the trace in the eigenbasis of Σ and (278) is by (265).
From (274), it is straightforward to check that

D(P_{Y^n|X^n} || Q_{Y^n} | P_{X^n}) = E[d(X^n)]    (279)
= \frac{1}{2} \log\det(I_n + Σ)    (280)
= \frac{1}{2} \sum_{j=1}^n \log(1 + λ_j) .    (281)
By using (261) we estimate

v(x) ≤ 3\log^2 e \Big( \tfrac{1}{4}\,Var[\,||Z^n||^2\,] + \tfrac{1}{4}\,Var[(VZ^n, Z^n)] + Var[(Vx, Z^n)] \Big)    (282)
≤ n\Big(\tfrac{9}{4} + 3P\Big)\log^2 e \;\triangleq\; n b_1^2 ,    (283)

where (283) results from applying the following identities and bounds for Z^n ∼ N(0, I_n):

Var[\,||Z^n||^2\,] = 3n ,    (284)
Var[(a, Z^n)] = ||a||^2 ,    (285)
Var[(VZ^n, Z^n)] = 3 \tr V^2 ≤ 3n ,    (286)
||Vx||^2 ≤ ||x||^2 ≤ nP .    (287)
Finally, from Corollary 3 applied with S_m = b_1^2 n and from (281) we have

\frac{1}{2}\sum_{j=1}^n \log(1 + λ_j) ≥ \log M − b_1\sqrt{n} − \log\frac{2}{1−ǫ}    (288)
≥ \log M − b\sqrt{n}    (289)
= \frac{n}{2}\big(\log(1 + P) − δ_n\big) ,    (290)

where we abbreviated

b = \sqrt{\frac{2}{1−ǫ}\Big(\frac{9}{4} + 3P\Big)}\,\log e + \log\frac{2}{1−ǫ} ,    (291)
δ_n = \frac{2}{n}\big(nC + b\sqrt{n} − \log M\big) .    (292)
To derive (267) consider the chain:

−δ_n ≤ \frac{1}{n}\sum_{j=1}^n \log\frac{1 + λ_j}{1 + P}    (293)
≤ \log\Big(\frac{1}{n}\sum_{j=1}^n \frac{1 + λ_j}{1 + P}\Big)    (294)
≤ \frac{\log e}{n(1 + P)}\sum_{j=1}^n (λ_j − P)    (295)
= \frac{\log e}{n(1 + P)}\big(E[\,||X^n||^2\,] − nP\big) ,    (296)

where (293) is (290), (294) is by Jensen's inequality, and (295) is by \log x ≤ (x − 1)\log e. Note that (267) is equivalent to (296). Finally, (266) follows from (278), (293) and the next lemma applied with X equiprobable on \{\frac{1 + λ_i}{1 + P}, i = 1, ..., n\}.
Lemma 19: Let X > 0 and E[X] ≤ 1. Then

E[|X − 1|] ≤ \sqrt{2\,E\Big[\ln\frac{1}{X}\Big]} .    (297)

Proof: Define two distributions on R_+:

P[E] ≜ P[X ∈ E] ,    (298)
Q[E] ≜ E[X · 1\{X ∈ E\}] + (1 − E[X])\,1\{0 ∈ E\} .    (299)

Then we have

2\,||P − Q||_{TV} = 1 − E[X] + E[|X − 1|] ,    (300)
D(P||Q) = E\Big[\log\frac{1}{X}\Big] ,    (301)

and (297) follows by (129).
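A quick hypothetical numerical check of the Pinsker-type bound (297), using a toy lognormal distribution of my own choosing (any positive X with E[X] ≤ 1 would do):

    # Monte Carlo check of Lemma 19: for X > 0 with E[X] <= 1,
    # E|X - 1| <= sqrt(2 E[ln(1/X)]).
    import numpy as np

    rng = np.random.default_rng(3)
    X = np.exp(rng.normal(-0.5, 0.7, size=1_000_000))  # lognormal, E[X] < 1
    print(X.mean())                                     # confirm E[X] <= 1
    lhs = np.abs(X - 1).mean()
    rhs = np.sqrt(2 * np.log(1.0 / X).mean())
    print(lhs, rhs, lhs <= rhs)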
The proof of Theorem 18 relied on a direct application of the main inequality (in the form of Corollary 3) and is independent of the previous estimate (122). At the expense of a more technical proof we could derive an order-optimal form of Theorem 18 starting from (122) and using concentration properties of Lipschitz functions. Indeed, notice that because E[Z^n] = 0 we have

E[(AY^n, Y^n)] = E[(AX^n, X^n)] + \tr A .    (302)

Thus, (266) follows from (122) if we can show

| E[(AY^n, Y^n)] − E[(AY^{*n}, Y^{*n})] | ≤ b\sqrt{n\,D(P_{Y^n} || P^*_{Y^n})} .    (303)

This is precisely what Corollary 9 would imply if the function y ↦ (Ay, y) were Lipschitz with constant O(\sqrt{n}). However, (Ay, y) is generally not Lipschitz when considered on the entire R^n. On the other hand, it is clear that from the point of view of evaluating both E[(AY^n, Y^n)] and E[(AY^{*n}, Y^{*n})] only vectors of norm O(\sqrt{n}) are important, and when restricted to the ball S = \{y : ||y||_2 ≤ b\sqrt{n}\} the quadratic form (Ay, y) does have the required Lipschitz constant of O(\sqrt{n}). This approximation idea can be made precise using Kirzbraun's theorem (see [34] for a short proof) to extend (Ay, y) beyond the ball S, preserving the maximum absolute value and the Lipschitz constant O(\sqrt{n}). Another method of showing (303) is to use the Bobkov–Götze extension of Gaussian concentration (140) to non-Lipschitz functions [30, Theorem 1.2] to estimate the moment generating function of (AY^{*n}, Y^{*n}) and apply (147) with t = \sqrt{\frac{1}{n} D(P_{Y^n} || P^*_{Y^n})}. Both methods yield (303), and hence (266), but with less sharp constants than those in Theorem 18.
B. Behavior of ||x||_q

The next natural question is to consider polynomials of higher degree. The simplest examples of such polynomials are F(x) = \sum_{j=1}^n x_j^q for some power q, to the analysis of which we now proceed. To formalize the problem, consider 1 ≤ q ≤ ∞ and define the q-th norm of the input vector in the usual way:

||x||_q ≜ \Big(\sum_{i=1}^n |x_i|^q\Big)^{1/q} .    (304)

The aim of this section is to investigate the values of ||x||_q for the codewords of good codes for the AWGN channel. Notice that when the coordinates of x are independent Gaussians we expect to have

\sum_{i=1}^n |x_i|^q ≈ n\,E[|Z|^q] ,    (305)

where Z ∼ N(0, 1). In fact, it can be shown that there exists a sequence of capacity-achieving codes and constants B_q, 1 ≤ q ≤ ∞, such that every codeword x at every blocklength n satisfies^{10}

||x||_q ≤ B_q n^{1/q} = O(n^{1/q}) ,    1 ≤ q < ∞,    (306)

and

||x||_∞ ≤ B_∞\sqrt{\log n} = O(\sqrt{\log n}) .    (307)
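To make (305)–(307) concrete, here is a small illustrative sketch (my own, with a single i.i.d. Gaussian vector standing in for a random-coding codeword) showing the n^{1/q} and \sqrt{\log n} scalings; the empirical constants printed are not the B_q of the statement above.

    # Empirical l_q norms of an i.i.d. N(0, P) vector, illustrating (305)-(307).
    import numpy as np

    rng = np.random.default_rng(4)
    P = 1.0
    for n in (100, 1000, 10000):
        x = rng.normal(0.0, np.sqrt(P), size=n)
        for q in (1, 2, 4):
            lq = np.sum(np.abs(x) ** q) ** (1.0 / q)
            print(n, q, lq / n ** (1.0 / q))      # roughly (E|Z|^q)^{1/q} * sqrt(P)
        print(n, "inf", np.max(np.abs(x)) / np.sqrt(np.log(n)))  # roughly sqrt(2P)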
But do (306)–(307) hold (possibly with different constants) for any good code?

It turns out that the answer depends on the range of q and on the degree of optimality of the code. Our findings are summarized in Table I. The precise meaning of each entry will be clear from Theorems 20, 23 and their corollaries. The main observation is that the closer the code size comes to M^*(n, ǫ), the better the ℓ_q-norms reflect those of random Gaussian codewords (306)–(307). Loosely speaking, very little can be said about the ℓ_q-norms of capacity-achieving codes, while O(log n)-achieving codes are almost indistinguishable from random Gaussian ones. In particular, we see that, for example, for capacity-achieving codes it is not possible to approximate expectations of polynomials of degrees higher than 2 (or 4 for dispersion-achieving codes) by assuming Gaussian inputs, since even the asymptotic growth rate with n can be dramatically different. The question of whether we can approximate expectations of arbitrary polynomials for O(log n)-achieving codes remains open.

We proceed to support the statements made in Table I.

In fact, each estimate in Table I, except n^{1/q}\log^{(q−4)/(2q)} n, is tight in the following sense: if the entry is n^α, then there exists a constant B_q and a sequence of O(log n)-, dispersion-, O(\sqrt{n})-, or capacity-achieving (n, M_n, ǫ)_{max,det} codes such that each codeword x ∈ R^n satisfies for all n ≥ 1

||x||_q ≥ B_q n^α .    (308)

If the entry in the table states o(n^α) then there is B_q such that for any sequence τ_n → 0 there exists a sequence of O(log n)-, dispersion-, O(\sqrt{n})-, or capacity-achieving (n, M_n, ǫ)_{max,det} codes such that each codeword satisfies for all n ≥ 1

||x||_q ≥ B_q τ_n n^α .    (309)

^{10} This does not follow from a simple random-coding argument, since we want the property to hold for every codeword, which constitutes exponentially many constraints. However, the claim can indeed be shown by invoking the κβ-bound [3, Theorem 25] with a suitably chosen constraint set F.
TABLE I
BEHAVIOR OF ℓ_q NORMS ||x||_q OF CODEWORDS FROM CODES FOR THE AWGN CHANNEL.

Code                      | 1 ≤ q ≤ 2 | 2 < q ≤ 4  | 4 < q < ∞                   | q = ∞
--------------------------|-----------|------------|-----------------------------|-------------
random Gaussian           | n^{1/q}   | n^{1/q}    | n^{1/q}                     | √(log n)
any O(log n)-achieving    | n^{1/q}   | n^{1/q}    | n^{1/q} log^{(q−4)/(2q)} n  | √(log n)
any dispersion-achieving  | n^{1/q}   | n^{1/q}    | o(n^{1/4})                  | o(n^{1/4})
any O(√n)-achieving       | n^{1/q}   | n^{1/q}    | n^{1/4}                     | n^{1/4}
any capacity-achieving    | n^{1/q}   | o(n^{1/2}) | o(n^{1/2})                  | o(n^{1/2})
any code                  | n^{1/q}   | n^{1/2}    | n^{1/2}                     | n^{1/2}

Note: All estimates, except n^{1/q} log^{(q−4)/(2q)} n, are shown to be tight.
First, notice that a code from any row is an example of a code for the next row, so we only need to consider each entry which is worse than the one directly above it. Thus it suffices to show the tightness of o(n^{1/4}), n^{1/4}, o(n^{1/2}) and n^{1/2}.
To that end, recall that by [3, Theorem 54] the maximal number of codewords M^*(n, ǫ) at a fixed probability of error ǫ for the AWGN channel satisfies

\log M^*(n, ǫ) = nC − \sqrt{nV}\,Q^{−1}(ǫ) + O(\log n) ,    (310)

where V(P) = \frac{\log^2 e}{2}\,\frac{P(P+2)}{(P+1)^2} is the channel dispersion. Next, we fix a sequence δ_n → 0 such that \sqrt{n}\,δ_n → ∞ and construct the following sequence of codes. The first coordinate is x_1 = \sqrt{nδ_n P} for every codeword, and the rest (x_2, ..., x_n) are chosen as the coordinates of an optimal AWGN code for blocklength n − 1 and power constraint (1 − δ_n)P. Following the argument of [3, Theorem 67], the number of codewords M_n in such a code will be at least

\log M_n = (n − 1)\,C((1 − δ_n)P) − \sqrt{(n − 1)\,V((1 − δ_n)P)}\,Q^{−1}(ǫ) + O(1)    (311)
= nC(P) − \sqrt{nV(P)}\,Q^{−1}(ǫ) + O(nδ_n) .    (312)

At the same time, because the coordinate x_1 of each codeword x is abnormally large, we have

||x||_q ≥ \sqrt{nδ_n P} .    (313)
So all the examples are constructed by choosing a suitable δ_n as follows:
• Row 1: see (306)–(307).
• Row 2: nothing to prove.
• Row 3: for entries o(n^{1/4}), taking δ_n = \frac{τ_n^2}{\sqrt{n}} yields a dispersion-achieving code according to (312); the estimate (309) follows from (313).
• Row 4: for entries n^{1/4}, taking δ_n = \frac{1}{\sqrt{n}} yields an O(\sqrt{n})-achieving code according to (312); the estimate (308) follows from (313).
• Row 5: for entries o(n^{1/2}), taking δ_n = τ_n^2 yields a capacity-achieving code according to (312); the estimate (309) follows from (313).
• Row 6: for entries n^{1/2} we can take a codebook with one codeword (\sqrt{nP}, 0, ..., 0).

Remark 10: The proof can be modified to show that in each case there are codes that simultaneously achieve all entries in the respective row of Table I (except n^{1/q}\log^{(q−4)/(2q)} n).
We proceed to proving upper bounds. First, we recall some simple relations between the ℓ_q norms of vectors in R^n. To estimate a lower-q norm in terms of a higher one, we invoke Hölder's inequality:

||x||_q ≤ n^{\frac{1}{q} − \frac{1}{p}} ||x||_p ,    1 ≤ q ≤ p ≤ ∞.    (314)

To provide estimates for q > p, notice that obviously

||x||_∞ ≤ ||x||_p .    (315)

Then, we can extend to q < ∞ via the following chain:

||x||_q ≤ ||x||_∞^{1 − \frac{p}{q}} \, ||x||_p^{\frac{p}{q}}    (316)
≤ ||x||_p ,    q ≥ p.    (317)

Trivially, for q = 2 the answer is given by the power constraint:

||x||_2 ≤ \sqrt{nP} .    (318)

Thus by (314) and (317) we get: each codeword of any code for the AWGN(P) channel must satisfy

||x||_q ≤ \sqrt{P}\,n^{1/q} for 1 ≤ q ≤ 2, and ||x||_q ≤ \sqrt{P}\,n^{1/2} for 2 < q ≤ ∞.    (319)

This proves the entries in the first column and the last row of Table I.
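A brief, purely illustrative check of the norm relations (314) and (316)–(317) on a random vector (any x ∈ R^n would do; the tolerances only guard against floating-point rounding):

    # Verify the Holder-type relation (314) and the interpolation (316)-(317).
    import numpy as np

    rng = np.random.default_rng(5)
    n = 1000
    x = rng.normal(size=n)
    norm = lambda v, q: np.max(np.abs(v)) if np.isinf(q) else np.sum(np.abs(v) ** q) ** (1 / q)

    q, p = 1.5, 4.0
    assert norm(x, q) <= n ** (1 / q - 1 / p) * norm(x, p) + 1e-9        # (314), q <= p
    q, p = 6.0, 4.0
    assert norm(x, q) <= norm(x, np.inf) ** (1 - p / q) * norm(x, p) ** (p / q) + 1e-9  # (316)
    assert norm(x, q) <= norm(x, p) + 1e-9                               # (317)
    print("norm relations hold on this sample")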
Before proceeding to justify the upper bounds for q > 2, we point out an obvious problem with trying to estimate ||x||_q for each codeword. Given any code whose codewords lie exactly on the power sphere, we can always apply an orthogonal transformation to it so that one of the codewords becomes (\sqrt{nP}, 0, 0, ..., 0). For such a codeword we have

||x||_q = \sqrt{nP}    (320)

and the upper bound (319) is tight. Therefore, to improve upon (319) we must necessarily consider subsets of codewords of a given code. For simplicity, below we show estimates for half of all codewords.

The following result, proven in the Appendix, takes care of the sup-norm:

Theorem 20 (q = ∞): For any 0 < ǫ < 1 and P > 0 there exists a constant b = b(P, ǫ) such that for any^{11} n ≥ N(P, ǫ) and any (n, M, ǫ)_{max,det} code for the AWGN(P) channel at least half of the codewords satisfy

||x||_∞^2 ≤ \frac{4(1 + P)}{\log e}\Big(nC − \sqrt{nV}\,Q^{−1}(ǫ) + 2\log n + \log b − \log\frac{M}{2}\Big) ,    (321)

where C and V are the capacity and the dispersion. In particular, the expression in (·) is non-negative for all codes and blocklengths.
^{11} N(P, ǫ) = 8(1 + 2P^{−1})(Q^{−1}(ǫ))^2 for ǫ < 1/2 and N(P, ǫ) = 1 for ǫ ≥ 1/2.

Remark 11: What sets Theorem 20 apart from other results is its sensitivity to whether the code achieves the dispersion term. This is unlike estimates of the form (122), which only sense whether the code is O(\sqrt{n})-achieving or not.

From Theorem 20 the explanation of the entries in the last column of Table I becomes obvious: the more terms the code achieves in the asymptotic expansion of \log M^*(n, ǫ), the closer ||x||_∞ becomes to the O(\sqrt{\log n}) which arises from a random Gaussian codeword (307). To be specific, we give the following exact statements:

Corollary 21 (q = ∞ for O(log n)-codes): For any 0 < ǫ < 1 and P > 0 there exists a constant b = b(P, ǫ) such that for any (n, M_n, ǫ)_{max,det} code for the AWGN(P) channel with

\log M_n ≥ nC − \sqrt{nV}\,Q^{−1}(ǫ) − K\log n    (322)

for some K > 0, we have that at least half of the codewords satisfy

||x||_∞ ≤ \sqrt{(b + K)\log n + b} .    (323)
Corollary 22 (q = ∞ for capacity-achieving codes): For any capacity-achieving sequence of (n, M_n, ǫ)_{max,det} codes there exists a sequence τ_n → 0 such that for at least half of the codewords we have

||x||_∞ ≤ τ_n n^{1/2} .    (324)

Similarly, for any dispersion-achieving sequence of (n, M_n, ǫ)_{max,det} codes there exists a sequence τ_n → 0 such that for at least half of the codewords we have

||x||_∞ ≤ τ_n n^{1/4} .    (325)

Remark 12: By (309), the sequences τ_n in Corollary 22 are necessarily code-dependent.

For q = 4 we have the following estimate (see the Appendix for the proof):

Theorem 23 (q = 4): For any 0 < ǫ < 1/2 and P > 0 there exist constants b_1 > 0 and b_2 > 0, depending on P and ǫ, such that for any (n, M, ǫ)_{max,det} code for the AWGN(P) channel at least half of the codewords satisfy

||x||_4^2 ≤ \frac{2}{b_1}\Big(nC + b_2\sqrt{n} − \log\frac{M}{2}\Big) ,    (326)

where C is the capacity of the channel. In fact, we also have a lower bound

E[\,||x||_4^4\,] ≥ 3nP^2 − \big(nC − \log M + b_3\sqrt{n}\big)\,n^{1/4} ,    (327)

for some b_3 = b_3(P, ǫ) > 0.

Remark 13: Note that E[\,||z||_4^4\,] = 3nP^2 for z ∼ N(0, P)^n.
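A one-line numerical confirmation of the Gaussian fourth-moment identity in Remark 13 (illustrative only; the parameters are arbitrary):

    # Check E[||z||_4^4] = 3 n P^2 for z ~ N(0, P)^n.
    import numpy as np

    rng = np.random.default_rng(6)
    n, P, reps = 50, 2.0, 200000
    z = rng.normal(0.0, np.sqrt(P), size=(reps, n))
    print(np.mean(np.sum(z**4, axis=1)), 3 * n * P**2)   # the two numbers agree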
We can now complete the proof of the results in Table I:
1) Row 2: q = 4 is Theorem 23; 2 < q ≤ 4 follows by (314) with p = 4; q = ∞ is Corollary 21; for 4 < q < ∞ we apply interpolation via (316) with p = 4.
2) Row 3: q ≤ 4 is treated as in Row 2; q = ∞ is Corollary 22; for 4 < q < ∞ apply interpolation (316) with p = 4.
3) Row 4: q ≤ 4 is treated as in Row 2; q ≥ 4 follows from (317) with p = 4.
4) Row 5: q = ∞ is Corollary 22; for 2 < q < ∞ we apply interpolation (316) with p = 2.

The upshot of this section is that we cannot approximate values of non-quadratic polynomials in x (or y) by assuming i.i.d. Gaussian entries, unless the code is O(\sqrt{n})-achieving, in which case we can go up to degree 4 but still have to be content with one-sided (lower) bounds only, cf. (327).^{12}

^{12} Using quite similar methods, (327) can be extended to certain bi-quadratic forms, i.e., 4-th degree polynomials \sum_{i,j} a_{i−j} x_i^2 x_j^2, where A = (a_{i−j}) is a Toeplitz positive semi-definite matrix.
Before closing this discussion we demonstrate the sharpness of the arguments in this section by considering the following example. Suppose that the power of a codeword x from a capacity-dispersion-optimal code is measured by an imperfect tool, so that its reading is described by

E = \frac{1}{n}\sum_{i=1}^n x_i^2 H_i ,    (328)

where the H_i are i.i.d. bounded random variables with expectation and variance both equal to 1. For large blocklengths n we expect E to be Gaussian with mean P and variance \frac{1}{n^2}||x||_4^4. On the one hand, Theorem 23 shows that the variance will not explode; on the other hand, (327) shows that it will be at least as large as that of a Gaussian codebook. Finally, to establish the asymptotic normality rigorously, the usual approach based on checking the Lyapunov condition will fail, as shown by (309), but the Lindeberg condition does hold as a consequence of Corollary 22. If, in addition, the code is O(log n)-achieving then

P[|E − E[E]| > δ] ≤ e^{−\frac{nδ^2}{b_1 + b_2 δ\sqrt{\log n}}} .    (329)
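The following hedged simulation (a toy fixed-energy vector and toy uniform H_i of my own choosing, not the paper's construction) illustrates the setting of (328): the mean of the reading E is close to P and its variance matches \frac{1}{n^2}||x||_4^4.

    # Toy illustration of (328): E = (1/n) sum x_i^2 H_i with bounded H_i,
    # E[H_i] = Var[H_i] = 1; here H_i is shifted uniform and ||x||^2 = nP.
    import numpy as np

    rng = np.random.default_rng(7)
    n, P, reps = 500, 1.0, 20000
    x = rng.normal(0.0, np.sqrt(P), size=n)
    x *= np.sqrt(n * P) / np.linalg.norm(x)                  # enforce ||x||^2 = nP
    a = np.sqrt(3.0)
    H = 1.0 + rng.uniform(-a, a, size=(reps, n))             # mean 1, variance 1, bounded
    E = (x**2 * H).mean(axis=1)
    print(E.mean(), P)                                        # mean close to P
    print(E.var(), np.sum(x**4) / n**2)                       # variance ~ ||x||_4^4 / n^2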
VIII. EXTENSION TO OTHER CHANNELS: TILTING
Let us review the scheme for investigating functions of the output F(Y^n) that was employed in this paper so far. First, the inequality (122) was shown by verifying that Q_Y = P_{Y^n} satisfies the conditions of Theorem 2. Then an approximation of the form

F(Y^n) ≈ E[F(Y^n)] ≈ E[F(Y^{*n})]    (330)

follows by the arguments in Section IV, cf. Propositions 8 and 10. Therefore, in some sense, an application of Theorem 2 with Q_Y = P_{Y^n} implies the approximation (330) simultaneously for all concentrated (e.g. Lipschitz) functions. On one hand, this is very convenient since all the channel-specific work is isolated in proving (122). On the other hand, verifying the conditions of Theorem 2 for Q_Y = P_{Y^n} may not in general be feasible, even for memoryless channels. In this section we show how Theorem 2 can be used to investigate (330) for a specific function F in the absence of the universal estimate (122). A different application of the same technique was previously reported in [8, Section VIII].
In this section we focus on an arbitrary random transformation P_{Y|X} : X → Y. For convenience we introduce Y^♭ distributed according to an auxiliary distribution Q_Y and denote, analogously to (145),

E[F(Y^♭)] ≜ \int F(y)\, dQ_Y ,    (331)

where F is an arbitrary function on Y. Suppose that F is such that

Z_F = \log E[\exp\{F(Y^♭)\}] < ∞ .    (332)

Denote by Q_Y^{(F)} an F-tilting of Q_Y, namely

dQ_Y^{(F)} = \exp\{F − Z_F\}\, dQ_Y .    (333)
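As a minimal sketch of the tilting operation (333) on a finite alphabet (the alphabet, Q and F below are hypothetical, and natural logarithms are used for simplicity), one can form Q^{(F)} and check its normalization through Z_F:

    # Exponential tilting (333) of a discrete Q by a function F:
    # dQ^{(F)} = exp{F - Z_F} dQ, with Z_F = log E[exp F(Y_flat)] as in (332).
    import numpy as np

    Q = np.array([0.2, 0.5, 0.3])        # assumed auxiliary distribution Q_Y
    F = np.array([1.0, -0.5, 2.0])       # assumed function F on a 3-letter alphabet

    Z_F = np.log(np.sum(Q * np.exp(F)))  # (332)
    Q_F = Q * np.exp(F - Z_F)            # (333); sums to 1 by construction
    print(Q_F, Q_F.sum())

    # Tilting shifts mass toward letters where F is large, e.g. E[F] increases:
    print(np.sum(Q * F), np.sum(Q_F * F))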
The core idea of our technique is that if F is sufficiently regular and Q_Y satisfies the conditions of Theorem 2, then Q_Y^{(F)} also does. Consequently, the expectation of F under P_Y (induced by the code) can be investigated in terms of the moment-generating function of F under Q_Y. For brevity we only present a variance-based version (similar to Corollary 3):
Theorem 24: Let Q_Y and F be such that (332) holds and

S = \sup_x Var\Big[\log\frac{dP_{Y|X=x}}{dQ_Y}(Y) \,\Big|\, X = x\Big] < ∞ ,    (334)
S_F = \sup_x Var[F(Y) \,|\, X = x] .    (335)

Then there exists a constant a = a(ǫ, S) > 0 such that for any (M, ǫ)_{max,det} code we have for all 0 ≤ t ≤ 1

t\,E[F(Y)] − \log E[\exp\{tF(Y^♭)\}] ≤ D(P_{Y|X} || Q_Y | P_X) − \log M + a\sqrt{S + t^2 S_F}    (336)
≤ D(P_{Y|X} || Q_Y | P_X) − \log M + a\sqrt{S}\Big(1 + \frac{S_F}{2S}\,t^2\Big) .    (337)

Proof: Note that since

\log\frac{dP_{Y|X}}{dQ_Y^{(F)}} = \log\frac{dP_{Y|X}}{dQ_Y} − F(Y) + Z_F ,    (338)

we have for any 0 ≤ t ≤ 1:

D(P_{Y|X} || Q_Y^{(tF)} | P_X) = D(P_{Y|X} || Q_Y | P_X) − t\,E[F(Y)] + \log E[\exp\{tF(Y^♭)\}] ,    (339)

and

Var\Big[\log\frac{dP_{Y|X}}{dQ_Y^{(tF)}} \,\Big|\, X = x\Big] ≤ 2(S + t^2 S_F) ,    (340)

where we used Var[A + B] ≤ 2(Var[A] + Var[B]). Then from Corollary 3 applied with Q_Y and S replaced by Q_Y^{(tF)} and 2S + 2t^2 S_F, respectively, we obtain (336), while (337) follows by \sqrt{1 + x} ≤ 1 + \frac{x}{2}.

To see that (337) is actually useful, notice that, for example, Corollary 9 can easily be recovered by taking Q_Y = P^*_{Y^n}, applying (19), estimating the moment-generating function via (140) and S_F via the Poincaré inequality:

S_F ≤ b\,||F||^2_{Lip} .    (341)

The advantage of such a direct approach is that in fact a "dimension-dependent" Poincaré inequality would also suffice in this case:

S_F ≤ bn\,||F||^2_{Lip} ,    (342)

whereas the full "dimensionless" Poincaré inequality (341) was required in order to show (122).
APPENDIX

In this appendix we prove the results from Section VII-B.

To prove Theorem 20, the basic intuition is that any codeword which is abnormally peaky (i.e., has a high value of ||x||_∞) is wasteful in terms of allocating its power budget. Thus a good capacity- or dispersion-achieving codebook cannot have too many such wasteful codewords. A non-asymptotic formalization of this intuitive argument is as follows:

Lemma 25: For any ǫ ≤ 1/2 and P > 0 there exists a constant b = b(P, ǫ) such that given any (n, M, ǫ)_{max,det} code for the AWGN(P) channel, we have for any 0 ≤ λ ≤ P:^{13}

P[\,||X^n||_∞ ≥ \sqrt{λn}\,] ≤ \frac{b}{M}\exp\Big\{nC(P − λ) − \sqrt{nV(P − λ)}\,Q^{−1}(ǫ) + 2\log n\Big\} ,    (343)

where C(P) and V(P) are the capacity and the dispersion of the AWGN(P) channel, and X^n is the output of the encoder assuming equiprobable messages.

^{13} For ǫ > 1/2 one must replace V(P − λ) with V(P) in (343). This does not modify any of the arguments required to prove Theorem 20.
Proof: Our method is to apply the meta-converse in the form of [3, Theorem 30] to the subcode of codewords satisfying ||X^n||_∞ ≥ \sqrt{λn}. The application of a meta-converse requires selecting a suitable auxiliary channel Q_{Y^n|X^n}, which we specify now. For any x ∈ R^n let j^*(x) be the first index such that |x_j| = ||x||_∞; then we set

Q_{Y^n|X^n}(y^n|x) = P_{Y|X}(y_{j^*}|x_{j^*}) \prod_{j ≠ j^*(x)} P^*_Y(y_j) .    (344)

We will show below that for some b_1 = b_1(P) any M-code over the Q-channel (344) has average probability of error ǫ' satisfying

1 − ǫ' ≤ \frac{b_1 n^{3/2}}{M} .    (345)

On the other hand, writing out the expression for \log\frac{dP_{Y^n|X^n=x}}{dQ_{Y^n|X^n=x}}(Y^n), we see that it coincides with the expression for \log\frac{dP_{Y^n|X^n=x}}{dP^*_{Y^n}} except that the term corresponding to j^*(x) is missing; compare with [14, (4.29)]. Thus, one can repeat step by step the analysis in the proof of [3, Theorem 65], with the only difference that nP should be replaced by nP − ||x||_∞^2, reflecting the reduction in energy due to the skipping of j^*. Then we obtain, for some b_2 = b_2(ǫ, P):

\log β_{1−ǫ}(P_{Y^n|X^n=x}, Q_{Y^n|X^n=x}) ≥ −nC\Big(P − \frac{||x||_∞^2}{n}\Big) + \sqrt{nV\Big(P − \frac{||x||_∞^2}{n}\Big)}\,Q^{−1}(ǫ) − \frac{1}{2}\log n − b_2 ,    (346)

which holds simultaneously for all x with ||x|| ≤ \sqrt{nP}. Two remarks are in order: first, the analysis in [3, Theorem 64] must be done replacing n with n − 1, but this difference is absorbed into b_2. Second, to see that b_2 can be chosen independently of x, notice that B(P) in [3, (620)] tends to 0 as P → 0 and hence can be bounded uniformly for all P ∈ [0, P_max].
Denote the cardinality of the subcode \{x : ||x||_∞ ≥ \sqrt{λn}\} by

M_λ = M\,P[\,||X^n||_∞ ≥ \sqrt{λn}\,] .    (347)

Then according to [3, Theorem 30] we get

\inf_x β_{1−ǫ}(P_{Y^n|X^n=x}, Q_{Y^n|X^n=x}) ≤ 1 − ǫ' ,    (348)

where the infimum is over the codewords of the M_λ-subcode. Applying both (345) and (346) we get

\inf_x \Big[ −nC\Big(P − \frac{||x||_∞^2}{n}\Big) + \sqrt{nV\Big(P − \frac{||x||_∞^2}{n}\Big)}\,Q^{−1}(ǫ) − \frac{1}{2}\log n − b_2 \Big] ≤ −\log M_λ + \log b_1 + \frac{3}{2}\log n ,    (349)

and further, since the function of ||x||_∞ on the left-hand side of (349) is monotone in ||x||_∞:

−nC(P − λ) + \sqrt{nV(P − λ)}\,Q^{−1}(ǫ) − \frac{1}{2}\log n − b_2 ≤ −\log M_λ + \log b_1 + \frac{3}{2}\log n .    (350)

Thus, overall,

\log M_λ ≤ nC(P − λ) − \sqrt{nV(P − λ)}\,Q^{−1}(ǫ) + 2\log n + b_2 + \log b_1 ,    (351)

which is equivalent to (343) with b = b_1\exp\{b_2\}.
It remains to show (345). Consider an (n, M, ǫ')_{avg,det} code for the Q-channel and let M_j, j = 1, ..., n, denote the cardinality of the set of all codewords with j^*(x) = j. Let ǫ'_j denote the minimum possible average probability of error of each such codebook achievable with the maximum-likelihood (ML) decoder. Since

1 − ǫ' ≤ \frac{1}{M}\sum_{j=1}^n M_j(1 − ǫ'_j) ,    (352)

it suffices to prove

1 − ǫ'_j ≤ \frac{\sqrt{\frac{2nP}{π}} + 2}{M_j}    (353)

for all j. Without loss of generality assume j = 1, in which case the observations Y_2^n are useless for determining the value of the true codeword. Moreover, the ML decoding regions D_i, i = 1, ..., M_j, for each codeword are disjoint intervals in R^1 (so that the decoder outputs message estimate i whenever Y_1 ∈ D_i). Note that for M_j ≤ 2 there is nothing to prove, so assume otherwise. Denote the first coordinates of the M_j codewords by x_i, i = 1, ..., M_j, and assume (without loss of generality) that −\sqrt{nP} ≤ x_1 ≤ x_2 ≤ \cdots ≤ x_{M_j} ≤ \sqrt{nP} and that D_2, ..., D_{M_j−1} are finite intervals. We then have the following chain:
1 − ǫ'_j = \frac{1}{M_j}\sum_{i=1}^{M_j} P_{Y|X}(D_i|x_i)    (354)
≤ \frac{2}{M_j} + \frac{1}{M_j}\sum_{i=2}^{M_j−1} P_{Y|X}(D_i|x_i)    (355)
≤ \frac{2}{M_j} + \frac{1}{M_j}\sum_{i=2}^{M_j−1}\Big(1 − 2Q\Big(\frac{Leb(D_i)}{2}\Big)\Big)    (356)
≤ \frac{2}{M_j} + \frac{M_j − 2}{M_j}\Big(1 − 2Q\Big(\frac{1}{2M_j − 4}\sum_{i=2}^{M_j−1} Leb(D_i)\Big)\Big)    (357)
≤ \frac{2}{M_j} + \frac{M_j − 2}{M_j}\Big(1 − 2Q\Big(\frac{\sqrt{nP}}{M_j − 2}\Big)\Big)    (358)
≤ \frac{2}{M_j} + \frac{\sqrt{\frac{2nP}{π}}}{M_j} ,    (359)

where in (354) P_{Y|X=x} = N(x, 1), (355) follows by upper-bounding the probability of successful decoding for i = 1 and i = M_j by 1, (356) follows since clearly, for a fixed value of the length Leb(D_i), the optimal location of the interval D_i, maximizing the value P_{Y|X}(D_i|x_i), is centered at x_i, (357) is by Jensen's inequality applied to x ↦ 1 − 2Q(x), which is concave for x ≥ 0, (358) is because

\bigcup_{i=2}^{M_j−1} D_i ⊂ [−\sqrt{nP}, \sqrt{nP}]    (360)

and the D_i are disjoint, and (359) is by

1 − 2Q(x) ≤ \sqrt{\frac{2}{π}}\,x ,    x ≥ 0.    (361)

Thus, (359) completes the proof of (353), (345) and the theorem.
Proof of Theorem 20: Notice that for any 0 ≤ λ ≤ P we have

C(P − λ) ≤ C(P) − \frac{λ\log e}{2(1 + P)} .    (362)

On the other hand, by concavity of \sqrt{V(P)} and since V(0) = 0, we have for any 0 ≤ λ ≤ P

\sqrt{V(P − λ)} ≥ \sqrt{V(P)} − \frac{\sqrt{V(P)}}{P}\,λ .    (363)

Thus, taking s = λn in Lemma 25 we get, with the help of (362) and (363),

P[\,||X^n||_∞^2 ≥ s\,] ≤ \exp\Big\{∆_n − (b_1 − b_2 n^{−1/2})\,s\Big\} ,    (364)

where we denoted for convenience

b_1 = \frac{\log e}{2(1 + P)} ,    (365)
b_2 = \frac{\sqrt{V(P)}}{P}\,Q^{−1}(ǫ) ,    (366)
∆_n = nC(P) − \sqrt{nV(P)}\,Q^{−1}(ǫ) + 2\log n − \log M + \log b .    (367)

Note that Lemma 25 only shows the validity of (364) for 0 ≤ s ≤ nP, but since for s > nP the left-hand side is zero, the statement actually holds for all s ≥ 0. Then for n ≥ N(P, ǫ) we have

b_1 − b_2 n^{−1/2} ≥ \frac{b_1}{2}    (368)

and thus, further upper-bounding (364), we get

P[\,||X^n||_∞^2 ≥ s\,] ≤ \exp\Big\{∆_n − \frac{b_1 s}{2}\Big\} .    (369)

Finally, if the code is so large that ∆_n < 0, then (369) would imply that P[\,||X^n||_∞^2 ≥ s\,] < 1 for all s ≥ 0, which is clearly impossible. Thus we must have ∆_n ≥ 0 for any (n, M, ǫ)_{max,det} code. The proof concludes by taking s = \frac{2(\log 2 + ∆_n)}{b_1} in (369).
Proof of Theorem 23: To prove (326) we will show the following statement: there exist two constants b_0 and b_1 such that for any (n, M_1, ǫ) code for the AWGN(P) channel with codewords x satisfying

||x||_4 ≥ b\,n^{1/4}    (370)

we have an upper bound on the cardinality:

M_1 ≤ \frac{4}{1 − ǫ}\exp\Big\{nC + 2\sqrt{\frac{nV}{1 − ǫ}} − b_1(b − b_0)^2\sqrt{n}\Big\} ,    (371)

provided b ≥ b_0(P, ǫ). From here (326) follows by first lower-bounding (b − b_0)^2 ≥ \frac{b^2}{2} − b_0^2 and then verifying easily that the choice

b^2 = \frac{2}{b_1\sqrt{n}}\Big(nC + b_2\sqrt{n} − \log\frac{M}{2}\Big) ,    (372)

with b_2 = b_0^2 b_1 + 2\sqrt{\frac{V}{1−ǫ}} + \log\frac{4}{1−ǫ}, takes the right-hand side of (371) below M/2.
To prove (371), denote

S = b − \Big(\frac{6}{1 + ǫ}\Big)^{1/4}    (373)

and choose b large enough so that

δ ≜ S − 6^{1/4}\sqrt{1 + P} > 0 .    (374)

Then, on one hand, we have

P_{Y^n}[\,||Y^n||_4 ≥ S n^{1/4}\,] = P[\,||X^n + Z^n||_4 ≥ S n^{1/4}\,]    (375)
≥ P[\,||X^n||_4 − ||Z^n||_4 ≥ S n^{1/4}\,]    (376)
≥ P[\,||Z^n||_4 ≤ n^{1/4}(b − S)\,]    (377)
≥ \frac{1 + ǫ}{2} ,    (378)

where (376) is by the triangle inequality for ||·||_4, (377) is by the constraint (370) and (378) is by the Chebyshev inequality applied to ||Z^n||_4^4 = \sum_{j=1}^n Z_j^4. On the other hand, we have

P^*_{Y^n}[\,||Y^n||_4 ≤ S n^{1/4}\,] = P^*_{Y^n}[\,||Y^n||_4 ≤ (6^{1/4}\sqrt{1 + P} + δ)\,n^{1/4}\,]    (379)
≥ P^*_{Y^n}\big[\{||Y^n||_4 ≤ 6^{1/4}\sqrt{1 + P}\,n^{1/4}\} + \{||Y^n||_4 ≤ δ n^{1/4}\}\big]    (380)
≥ P^*_{Y^n}\big[\{||Y^n||_4 ≤ 6^{1/4}\sqrt{1 + P}\,n^{1/4}\} + \{||Y^n||_2 ≤ δ n^{1/4}\}\big]    (381)
≥ 1 − \exp\{−b_1 δ^2\sqrt{n}\} ,    (382)

where (379) is by the definition of δ in (374), (380) is by the triangle inequality for ||·||_4, which implies the inclusion

\{y : ||y||_4 ≤ a + b\} ⊃ \{y : ||y||_4 ≤ a\} + \{y : ||y||_4 ≤ b\}    (383)

with + denoting the Minkowski sum of sets, (381) is by (317) with p = 2, q = 4, and (382) holds for some b_1 = b_1(P) > 0 by the Gaussian isoperimetric inequality [35], which is applicable since

P^*_{Y^n}[\,||Y^n||_4 ≤ 6^{1/4}\sqrt{1 + P}\,n^{1/4}\,] ≥ \frac{1}{2}    (384)

by the Chebyshev inequality applied to \sum_{j=1}^n Y_j^4 (note: Y^n ∼ N(0, 1 + P)^n under P^*_{Y^n}). As a side remark, we add that the estimate of the large deviations of the sum of 4-th powers of i.i.d. Gaussians as \exp\{−O(\sqrt{n})\} is order-optimal.

Together, (378) and (382) imply

β_{\frac{1+ǫ}{2}}(P_{Y^n}, P^*_{Y^n}) ≤ \exp\{−b_1 δ^2\sqrt{n}\} .    (385)
On the other hand, by [3, Lemma 59] we have for any x with ||x||_2 ≤ \sqrt{nP} and any 0 < α < 1:

β_α(P_{Y^n|X^n=x}, P^*_{Y^n}) ≥ \frac{α}{2}\exp\Big\{−nC − \sqrt{\frac{2nV}{α}}\Big\} ,    (386)

where C and V are the capacity and the dispersion of the AWGN(P) channel. Then, by convexity in α of the right-hand side of (386) and [14, Lemma 32], we have for any input distribution P_{X^n}:

β_α(P_{X^n Y^n}, P_{X^n} P^*_{Y^n}) ≥ \frac{α}{2}\exp\Big\{−nC − \sqrt{\frac{2nV}{α}}\Big\} .    (387)

We complete the proof of (371) by invoking Theorem 13 in the form (215) with Q_Y = P^*_{Y^n} and α = \frac{1+ǫ}{2}:

β_{\frac{1+ǫ}{2}}(P_{Y^n}, P^*_{Y^n}) ≥ M_1\,β_{\frac{1−ǫ}{2}}(P_{X^n Y^n}, P_{X^n} P^*_{Y^n}) .    (388)

Applying the bounds (385) and (387) to (388), we conclude that (371) holds with

b_0 = \Big(\frac{6}{1 + ǫ}\Big)^{1/4} + 6^{1/4}\sqrt{1 + P} .    (389)
Next, we proceed to the proof of (327). On one hand, we have

\sum_{j=1}^n E[Y_j^4] = \sum_{j=1}^n E[(X_j + Z_j)^4]    (390)
= \sum_{j=1}^n E[X_j^4 + 6X_j^2 Z_j^2 + Z_j^4]    (391)
≤ E[\,||x||_4^4\,] + 6nP + 3n ,    (392)

where (390) is by the definition of the AWGN channel, (391) is because X^n and Z^n are independent and thus the odd terms vanish, and (392) is by the power constraint \sum_j X_j^2 ≤ nP. On the other hand, applying Proposition 11 with f(y) = −y^4, θ = 2 and using (122) we obtain^{14}

\sum_{j=1}^n E[Y_j^4] ≥ 3n(1 + P)^2 − \big(nC − \log M + b_3\sqrt{n}\big)\,n^{1/4} ,    (393)

for some b_3 = b_3(P, ǫ) > 0. Comparing (393) and (392), statement (327) follows.

^{14} Of course, a similar Gaussian lower bound holds for any cumulative sum, in particular for any power \sum_j E[|Y_j|^q], q ≥ 1.

We remark that by extending Proposition 11 to expectations like \frac{1}{n−1}\sum_{j=1}^{n−1} E[Y_j^2 Y_{j+1}^2], cf. (187), we could provide a lower bound similar to (327) for more general 4-th degree polynomials in x. For example, it is possible to treat the case of p(x) = \sum_{i,j} a_{i−j} x_i^2 x_j^2, where A = (a_{i−j}) is a Toeplitz positive semi-definite matrix. We would proceed as in (392), computing E[p(Y^n)] in two ways, with the only difference that the peeled-off quadratic polynomial would require an application of Theorem 18 instead of the simple power constraint. Finally, we also mention that the method of (392) does not work for estimating E[\,||x||_6^6\,], because we would need an upper bound E[\,||x||_4^4\,] \lesssim 3nP^2, which is not possible to obtain in the context of O(\sqrt{n})-achieving codes, as the counterexamples (308) show.
REFERENCES
[1] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 379–423 and 623–656, Jul./Oct.
1948.
[2] S. Shamai and S. Verdú, “The empirical distribution of good codes,” IEEE Trans. Inf. Theory, vol. 43, no. 3, pp. 836–846,
1997.
[3] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength regime,” IEEE Trans. Inf. Theory,
vol. 56, no. 5, pp. 2307–2359, May 2010.
[4] A. Tchamkerten, V. Chandar, and G. W. Wornell, “Communication under strong asynchronism,” IEEE Trans. Inf. Theory,
vol. 55, no. 10, pp. 4508–4528, Oct. 2009.
[5] R. Ahlswede, P. Gács, and J. Körner, “Bounds on conditional probabilities with applications in multi-user communication,”
Probab. Th. Rel. Fields, vol. 34, no. 2, pp. 157–177, 1976.
[6] R. Ahlswede and G. Dueck, “Every bad code has a good subcode: a local converse to the coding theorem,” Probab. Th.
Rel. Fields, vol. 34, no. 2, pp. 179–182, 1976.
[7] K. Marton, “A simple proof of the blowing-up lemma,” IEEE Trans. Inf. Theory, vol. 32, no. 3, pp. 445–446, 1986.
[8] Y. Polyanskiy and S. Verdú, “Relative entropy at the channel output of a capacity-achieving code,” in Proc. 2011 49th
Allerton Conference, Allerton Retreat Center, Monticello, IL, USA, Oct. 2011.
[9] F. Topsøe, “An information theoretical identity and a problem involving capacity,” Studia Sci. Math. Hungar., vol. 2, pp.
291–292, 1967.
[10] J. H. B. Kemperman, “On the Shannon capacity of an arbitrary channel,” Indagationes Math., vol. 77, no. 2, pp. 101–115,
1974.
[11] J. Wolfowitz, “The coding of messages subject to chance errors,” Illinois J. Math., vol. 1, pp. 591–606, 1957.
[12] U. Augustin, “Gedächtnisfreie kanäle für diskrete zeit,” Z. Wahrscheinlichkeitstheorie und Verw. Geb., vol. 6, pp. 10–61,
1966.
[13] R. Ahlswede, “An elementary proof of the strong converse theorem for the multiple-access channel,” J. Comb. Inform.
Syst. Sci, vol. 7, no. 3, pp. 216–230, 1982.
December 21, 2012
DRAFT
63
63
[14] Y. Polyanskiy, “Channel coding: non-asymptotic fundamental limits,” Ph.D. dissertation, Princeton Univ., Princeton, NJ,
USA, 2010, available: http://people.lids.mit.edu/yp/homepage/.
[15] H. V. Poor and S. Verdú, “A lower bound on the error probability in multihypothesis testing,” IEEE Trans. Inf. Theory,
vol. 41, no. 6, pp. 1992–1993, 1995.
[16] M. V. Burnashev, “Data transmission over a discrete channel with feedback. random transmission time,” Prob. Peredachi
Inform., vol. 12, no. 4, pp. 10–30, 1976.
[17] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. Cambridge University Press, 2011.
[18] S. Bobkov and F. Götze, “Discrete isoperimetric and Poincaré-type inequalities,” Probab. Theory Relat. Fields, vol. 114,
pp. 245–277, 1999.
[19] M. Ledoux, “Concentration of measure and logarithmic Sobolev inequalities,” Seminaire de probabilites XXXIII, pp. 120–
216, 1999.
[20] J. Wolfowitz, Coding Theorems of Information Theory. Englewood Cliffs, NJ: Prentice-Hall, 1962.
[21] I. Csiszár, “Information-type measures of difference of probability distributions and indirect observation,” Studia Sci. Math.
Hungar., vol. 2, pp. 229–318, 1967.
[22] ——, “I -divergence geometry of probability distributions and minimization problems,” Ann. Probab., vol. 3, no. 1, pp.
146–158, Feb. 1975.
[23] M. Ledoux, “Isoperimetry and Gaussian analysis,” Lecture Notes in Math., vol. 1648, pp. 165–294, 1996.
[24] L. Harper, “Optimal numberings and isoperimetric problems on graphs,” J. Comb. Th., vol. 1, pp. 385–394, 1966.
[25] M. Talagrand, “An isoperimetric theorem on the cube and the Khintchine-Kahane inequalities,” Proc. Amer. Math. Soc.,
vol. 104, pp. 905–909, 1988.
[26] M. Donsker and S. Varadhan, “Asymptotic evaluation of certain Markov process expectations for large time. I. II.” Comm. Pure Appl. Math., vol. 28, no. 1, pp. 1–47, 1975.
[27] K. Marton, “Bounding d̄-distance by information divergence: a method to prove measure concentration,” Ann. Probab.,
vol. 24, pp. 857–866, 1990.
[28] V. Strassen, “The existence of probability measures with given marginals,” Ann. Math. Stat., vol. 36, pp. 423–439, 1965.
[29] C. Villani, Topics in Optimal Transportation. Providence, RI: American Mathematical Society, 2003, vol. 58.
[30] S. Bobkov and F. Götze, “Exponential integrability and transportation cost related to logarithmic Sobolev inequalities,” J.
Functional Analysis, vol. 163, pp. 1–28, 1999.
[31] M. Talagrand, “Transportation cost for Gaussian and other product measures,” Geom. Funct. Anal., vol. 6, no. 3, pp.
587–600, 1996.
[32] I. Gelfand, A. Kolmogorov, and A. Yaglom, “On the general definition of the amount of information (in Russian),” Dokl. Akad. Nauk SSSR, vol. 111, no. 4, pp. 745–748, 1956.
[33] S. Bobkov and M. Madiman, “Concentration of the information in data with log-concave distributions,” Ann. Probab.,
vol. 39, no. 4, pp. 1528–1543, 2011.
[34] I. J. Schoenberg, “On a theorem of Kirzbraun and Valentine,” Am. Math. Monthly, vol. 60, no. 9, pp. 620–622, Nov. 1953.
[35] V. Sudakov and B. Tsirelson, “Extremal properties of half-spaces for spherically invariant measures,” Zap. Nauch. Sem.
LOMI, vol. 41, pp. 14–24, 1974.