Lecture 06 Bernstein’s inequality.
18.465
Last time we proved Bennett’s inequality:
Let $\mathbb{E}X = 0$, $\mathbb{E}X^2 = \sigma^2$, $|X| < M = \mathrm{const}$, let $X_1, \ldots, X_n$ be independent copies of $X$, and let $t \ge 0$. Then
$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge t\right) \le \exp\left(-\frac{n\sigma^2}{M^2}\,\phi\!\left(\frac{tM}{n\sigma^2}\right)\right),$$
where $\phi(x) = (1+x)\log(1+x) - x$.
If $x$ is small, $\phi(x) = (1+x)\left(x - \frac{x^2}{2} + \cdots\right) - x = x + x^2 - \frac{x^2}{2} - x + \cdots = \frac{x^2}{2} + \cdots$.
If $x$ is large, $\phi(x) \sim x \log x$.
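A quick numerical look at these two regimes (a small illustrative script, not part of the original notes; the grid of $x$ values is arbitrary):

```python
import numpy as np

def phi(x):
    """Bennett's phi(x) = (1 + x) * log(1 + x) - x."""
    return (1.0 + x) * np.log1p(x) - x

# Small x: phi(x) should be close to x^2 / 2.
for x in [1e-3, 1e-2, 1e-1]:
    print(f"x={x:g}:  phi={phi(x):.3e}  x^2/2={x**2 / 2:.3e}")

# Large x: phi(x) should grow like x * log(x).
for x in [1e2, 1e4, 1e6]:
    print(f"x={x:g}:  phi={phi(x):.3e}  x*log(x)={x * np.log(x):.3e}")
```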
We can weaken the bound by decreasing $\phi(x)$. Take$^1$ $\phi(x) = \frac{x^2}{2 + \frac{2}{3}x}$ to obtain Bernstein's inequality:
$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge t\right) \le \exp\left(-\frac{n\sigma^2}{M^2}\cdot\frac{\left(\frac{tM}{n\sigma^2}\right)^2}{2 + \frac{2}{3}\cdot\frac{tM}{n\sigma^2}}\right) = \exp\left(-\frac{t^2}{2n\sigma^2 + \frac{2}{3}tM}\right) = e^{-u},$$
where $u = \frac{t^2}{2n\sigma^2 + \frac{2}{3}tM}$.
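The substitution only weakens the bound if $\frac{x^2}{2 + \frac{2}{3}x} \le (1+x)\log(1+x) - x$ for all $x \ge 0$. A numerical sanity check of that inequality (an illustrative sketch, not from the notes; the grid is arbitrary):

```python
import numpy as np

x = np.linspace(0.0, 100.0, 200001)[1:]      # skip x = 0, where both sides vanish
bennett = (1.0 + x) * np.log1p(x) - x        # phi(x) from Bennett's inequality
bernstein = x**2 / (2.0 + 2.0 * x / 3.0)     # the weaker phi(x) used for Bernstein

# The difference should be nonnegative over the whole grid.
print("min(bennett - bernstein) =", np.min(bennett - bernstein))
assert np.all(bennett >= bernstein - 1e-12)
```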
Solve for $t$:
$$t^2 - \frac{2}{3}uMt - 2n\sigma^2 u = 0, \qquad t = \frac{uM}{3} + \sqrt{\frac{u^2M^2}{9} + 2n\sigma^2 u}.$$
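One can sanity-check this root by substituting it back into the definition of $u$ (a throwaway check; the values of $n$, $\sigma^2$, $M$, $u$ below are arbitrary):

```python
import math

n, sigma2, M, u = 1000, 0.25, 1.0, 5.0       # arbitrary illustrative values

t = u * M / 3.0 + math.sqrt(u**2 * M**2 / 9.0 + 2.0 * n * sigma2 * u)
u_back = t**2 / (2.0 * n * sigma2 + 2.0 * t * M / 3.0)

print(t, u_back)                             # u_back should reproduce u up to rounding
assert abs(u_back - u) < 1e-9
```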
Substituting,
$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge \sqrt{\frac{u^2M^2}{9} + 2n\sigma^2 u} + \frac{uM}{3}\right) \le e^{-u}$$
or
$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \le \sqrt{\frac{u^2M^2}{9} + 2n\sigma^2 u} + \frac{uM}{3}\right) \ge 1 - e^{-u}.$$
Using the inequality $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$,
$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \le \sqrt{2n\sigma^2 u} + \frac{2uM}{3}\right) \ge 1 - e^{-u}.$$
For non-centered $X_i$, replace $X_i$ with $X_i - \mathbb{E}X$ or $\mathbb{E}X - X_i$. Then $|X_i - \mathbb{E}X| \le 2M$, and so with high probability
$$\sum_{i=1}^{n} (X_i - \mathbb{E}X) \le \sqrt{2n\sigma^2 u} + \frac{4uM}{3}.$$
Normalizing by $n$,
$$\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}X \le \sqrt{\frac{2\sigma^2 u}{n}} + \frac{4uM}{3n} \qquad \text{and} \qquad \mathbb{E}X - \frac{1}{n}\sum_{i=1}^{n} X_i \le \sqrt{\frac{2\sigma^2 u}{n}} + \frac{4uM}{3n}.$$
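A Monte Carlo sketch of the first of these one-sided bounds, using Bernoulli data (the parameters and the simulation itself are our own illustration, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, u, reps = 200, 0.3, 3.0, 20000         # arbitrary illustrative choices
M, sigma2 = 1.0, p * (1.0 - p)               # |X| <= M = 1, Var(X) = p(1 - p)

X = rng.binomial(1, p, size=(reps, n))
deviation = X.mean(axis=1) - p               # (1/n) sum X_i - E X

bound = np.sqrt(2.0 * sigma2 * u / n) + 4.0 * u * M / (3.0 * n)
coverage = np.mean(deviation <= bound)

print(f"empirical coverage = {coverage:.4f}, guaranteed >= {1.0 - np.exp(-u):.4f}")
```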
$^1$ Exercise: show that this is the best approximation.
Whenever $\sqrt{\frac{2\sigma^2 u}{n}} \ge \frac{4uM}{3n}$, we have $u \le \frac{9n\sigma^2}{8M^2}$. So
$$\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}X\right| \lesssim \sqrt{\frac{2\sigma^2 u}{n}} \quad \text{for } u \lesssim n\sigma^2$$
(the range of normal deviations).
This is predicted by the Central Limit Theorem (the condition for the CLT is $n\sigma^2 \to \infty$). If $n\sigma^2$ does not go to infinity, we get Poisson behavior.
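A quick check of where the two deviation terms cross over (illustrative parameters only; $u^{*}$ below denotes the threshold $\frac{9n\sigma^2}{8M^2}$ computed above):

```python
import math

n, sigma2, M = 1000, 0.25, 1.0               # arbitrary illustrative values
u_star = 9.0 * n * sigma2 / (8.0 * M**2)     # point where the two terms are equal

for u in [0.1 * u_star, u_star, 10.0 * u_star]:
    sqrt_term = math.sqrt(2.0 * sigma2 * u / n)
    linear_term = 4.0 * u * M / (3.0 * n)
    print(f"u = {u:9.2f}: sqrt term = {sqrt_term:.4f}, linear term = {linear_term:.4f}")
```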
Recall from the last lecture that we are interested in concentration inequalities because we want to know $\mathbb{P}(f(X) \ne Y)$ while we only observe $\frac{1}{n}\sum_{i=1}^{n} I(f(X_i) \ne Y_i)$.
In Bernstein's inequality take $X_i$ to be $I(f(X_i) \ne Y_i)$. Then, since $2M = 1$ (the centered indicator is bounded by $1$), we get
$$\mathbb{E}\, I(f(X_i) \ne Y_i) - \frac{1}{n}\sum_{i=1}^{n} I(f(X_i) \ne Y_i) \le \sqrt{\frac{2\,\mathbb{P}(f(X_i) \ne Y_i)\bigl(1 - \mathbb{P}(f(X_i) \ne Y_i)\bigr)\,u}{n}} + \frac{2u}{3n}$$
because $\mathbb{E}\, I(f(X_i) \ne Y_i) = \mathbb{P}(f(X_i) \ne Y_i)$, $\mathbb{E} I^2 = \mathbb{E} I$, and therefore $\mathrm{Var}(I) = \sigma^2 = \mathbb{E} I^2 - (\mathbb{E} I)^2 = \mathbb{P}(f(X_i) \ne Y_i)\bigl(1 - \mathbb{P}(f(X_i) \ne Y_i)\bigr)$. Thus, with probability at least $1 - e^{-u}$,
$$\mathbb{P}(f(X_i) \ne Y_i) \le \frac{1}{n}\sum_{i=1}^{n} I(f(X_i) \ne Y_i) + \sqrt{\frac{2\,\mathbb{P}(f(X_i) \ne Y_i)\,u}{n}} + \frac{2u}{3n}.$$
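Since $\mathbb{P}(f(X_i) \ne Y_i)$ appears on both sides, the bound can be made explicit by solving the quadratic in $\sqrt{\mathbb{P}(f(X_i) \ne Y_i)}$. A sketch of that computation (the helper name `explicit_error_bound` and the example numbers are ours, not from the notes):

```python
import math

def explicit_error_bound(train_error, n, u):
    """Solve p <= train_error + sqrt(2*p*u/n) + 2*u/(3*n) for p.

    Viewed as a quadratic in sqrt(p):
        p - sqrt(2u/n) * sqrt(p) - (train_error + 2u/(3n)) <= 0.
    """
    a = math.sqrt(2.0 * u / n)
    c = train_error + 2.0 * u / (3.0 * n)
    sqrt_p = (a + math.sqrt(a * a + 4.0 * c)) / 2.0
    return sqrt_p ** 2

# Example: 5% training error, n = 1000 samples, confidence level 1 - e^{-5}.
print(explicit_error_bound(0.05, n=1000, u=5.0))
```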
When the training error is zero,
$$\mathbb{P}(f(X_i) \ne Y_i) \le \sqrt{\frac{2\,\mathbb{P}(f(X_i) \ne Y_i)\,u}{n}} + \frac{2u}{3n}.$$
If we forget about $2u/3n$ for a second, we obtain $\mathbb{P}(f(X_i) \ne Y_i)^2 \le 2\,\mathbb{P}(f(X_i) \ne Y_i)\,u/n$ and hence
$$\mathbb{P}(f(X_i) \ne Y_i) \le \frac{2u}{n}.$$
The above zero-error rate of order $n^{-1}$ is better than the $n^{-1/2}$ rate predicted by the CLT.
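To see the gap numerically, compare $2u/n$ with a generic $n^{-1/2}$-scale term for the same $n$ and $u$ (constants here are only illustrative):

```python
import math

u = 5.0
for n in [100, 1000, 10000]:
    fast = 2.0 * u / n               # zero-training-error rate from Bernstein
    slow = math.sqrt(2.0 * u / n)    # an n^{-1/2}-scale term for comparison
    print(f"n={n:6d}: 2u/n = {fast:.4f}, sqrt(2u/n) = {slow:.4f}")
```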