Lecture 06 Bernstein’s inequality. 18.465


Last time we proved Bennett’s inequality:

$\mathbb{E} X = 0$, $\mathbb{E} X^2 = \sigma^2$, $|X| < M = \mathrm{const}$, $X_1, \dots, X_n$ independent copies of $X$, and $t \ge 0$.

Then
\[
\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge t\right) \le \exp\left(-\frac{n\sigma^2}{M^2}\,\varphi\!\left(\frac{tM}{n\sigma^2}\right)\right),
\]
where $\varphi(x) = (1+x)\log(1+x) - x$.

If $x$ is small, $\varphi(x) = (1+x)\left(x - \frac{x^2}{2} + \cdots\right) - x = x + x^2 - \frac{x^2}{2} - x + \cdots = \frac{x^2}{2} + \cdots$.

If $x$ is large, $\varphi(x) \sim x \log x$.

We can weaken the bound by decreasing $\varphi(x)$. Take$^1$ $\varphi(x) = \frac{x^2}{2 + \frac{2}{3}x}$ to obtain Bernstein’s inequality:
\[
\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge t\right)
\le \exp\left(-\frac{n\sigma^2}{M^2}\cdot\frac{\left(\frac{tM}{n\sigma^2}\right)^2}{2 + \frac{2}{3}\cdot\frac{tM}{n\sigma^2}}\right)
= \exp\left(-\frac{t^2}{2n\sigma^2 + \frac{2}{3}tM}\right)
= e^{-u},
\]
where $u = \frac{t^2}{2n\sigma^2 + \frac{2}{3}tM}$.
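Not part of the original notes: a quick numerical sketch (Python with NumPy, constants chosen arbitrarily) checking that $\varphi(x) \ge \frac{x^2}{2 + \frac{2}{3}x}$ for $x \ge 0$, which is exactly why substituting the smaller function still gives a valid, if weaker, bound.

```python
import numpy as np

# Numerical check (not part of the notes): phi(x) = (1+x)log(1+x) - x dominates
# x^2 / (2 + 2x/3) for x >= 0, so substituting the smaller function into
# Bennett's exponent can only weaken the bound, giving Bernstein's inequality.
# Near 0 both behave like x^2/2; for large x, phi grows like x log x.
x = np.linspace(0.01, 50, 100_000)
phi = (1 + x) * np.log1p(x) - x
rational = x**2 / (2 + 2 * x / 3)

assert np.all(phi >= rational)
print(f"minimum gap on the grid: {np.min(phi - rational):.2e}")
```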

Solve for $t$:
\[
t^2 - \frac{2}{3}uMt - 2n\sigma^2 u = 0
\quad\Longrightarrow\quad
t = \frac{uM}{3} + \sqrt{\frac{u^2M^2}{9} + 2n\sigma^2 u}.
\]
Substituting,
\[
\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge \sqrt{\frac{u^2M^2}{9} + 2n\sigma^2 u} + \frac{uM}{3}\right) \le e^{-u},
\]
or
\[
\mathbb{P}\left(\sum_{i=1}^{n} X_i \le \sqrt{\frac{u^2M^2}{9} + 2n\sigma^2 u} + \frac{uM}{3}\right) \ge 1 - e^{-u}.
\]
Using the inequality $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$,
\[
\mathbb{P}\left(\sum_{i=1}^{n} X_i \le \sqrt{2n\sigma^2 u} + \frac{2uM}{3}\right) \ge 1 - e^{-u}.
\]
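A small sanity check of the algebra (illustrative values only, not from the notes): the positive root of the quadratic really does invert the definition of $u$.

```python
import numpy as np

# Sanity check with arbitrary values (not from the notes): the positive root
# t = uM/3 + sqrt(u^2 M^2/9 + 2 n sigma^2 u) of t^2 - (2/3)uMt - 2 n sigma^2 u = 0
# indeed gives back u = t^2 / (2 n sigma^2 + (2/3) t M).
n, sigma2, M = 1000, 0.25, 1.0
for u in (0.5, 2.0, 5.0, 10.0):
    t = u * M / 3 + np.sqrt(u**2 * M**2 / 9 + 2 * n * sigma2 * u)
    u_back = t**2 / (2 * n * sigma2 + 2 * t * M / 3)
    assert abs(u_back - u) < 1e-9
print("quadratic inversion verified")
```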

For non-centered $X_i$, replace $X_i$ with $X_i - \mathbb{E} X$ or $\mathbb{E} X - X_i$. Then $|X_i - \mathbb{E} X| \le 2M$, and so with high probability
\[
\sum_{i=1}^{n} (X_i - \mathbb{E} X) \le \sqrt{2n\sigma^2 u} + \frac{4uM}{3}.
\]
Normalizing by $n$,
\[
\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E} X \le \sqrt{\frac{2\sigma^2 u}{n}} + \frac{4uM}{3n}
\quad\text{and}\quad
\mathbb{E} X - \frac{1}{n}\sum_{i=1}^{n} X_i \le \sqrt{\frac{2\sigma^2 u}{n}} + \frac{4uM}{3n},
\]
each with probability at least $1 - e^{-u}$.

$^1$ Exercise: show that this is the best approximation.
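As an optional empirical illustration (arbitrary distribution and constants, assuming NumPy; not in the notes), a Monte Carlo sketch of the normalized one-sided bound:

```python
import numpy as np

# Monte Carlo sketch (arbitrary distribution and constants, not from the notes):
# for bounded non-centered X_i, the event
#   (1/n) sum X_i - E X > sqrt(2 sigma^2 u / n) + 4 u M / (3 n)
# should occur with frequency at most e^{-u}.
rng = np.random.default_rng(0)
n, u, M, trials = 200, 3.0, 1.0, 20_000

X = rng.uniform(0.0, M, size=(trials, n))   # |X| <= M, E X = M/2, Var X = M^2/12
mean_X, sigma2 = M / 2, M**2 / 12
threshold = np.sqrt(2 * sigma2 * u / n) + 4 * u * M / (3 * n)

failure_rate = np.mean(X.mean(axis=1) - mean_X > threshold)
print(f"observed failure rate {failure_rate:.4f}  vs  e^-u = {np.exp(-u):.4f}")
```

For this particular distribution the observed failure frequency sits well below $e^{-u}$, since Bernstein’s inequality is not tight here.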


Whenever $\sqrt{\frac{2\sigma^2 u}{n}} \ge \frac{4uM}{3n}$, we have $u \le \frac{9n\sigma^2}{8M^2}$. So,
\[
\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E} X\right| \lesssim \sqrt{\frac{2\sigma^2 u}{n}} \quad \text{for } u \lesssim n\sigma^2
\]
(the range of normal deviations). This is predicted by the Central Limit Theorem (the condition for the CLT is $n\sigma^2 \to \infty$). If $n\sigma^2$ does not go to infinity, we get Poisson behavior.

Recall from the last lecture that we are interested in concentration inequalities because we want to know $\mathbb{P}(f(X) \ne Y)$ while we only observe $\frac{1}{n}\sum_{i=1}^{n} I(f(X_i) \ne Y_i)$.

In Bernstein’s inequality take $X_i$ to be $I(f(X_i) \ne Y_i)$. Then, since $2M = 1$, we get
\[
\mathbb{P}(f(X_i) \ne Y_i) - \frac{1}{n}\sum_{i=1}^{n} I(f(X_i) \ne Y_i)
\le \sqrt{\frac{2\,\mathbb{P}(f(X_i) \ne Y_i)\left(1 - \mathbb{P}(f(X_i) \ne Y_i)\right)u}{n}} + \frac{2u}{3n}
\]
because $\mathbb{E}\, I(f(X_i) \ne Y_i) = \mathbb{P}(f(X_i) \ne Y_i)$, $\mathbb{E}\, I^2 = \mathbb{E}\, I$, and therefore
\[
\operatorname{Var}(I) = \sigma^2 = \mathbb{E}\, I^2 - (\mathbb{E}\, I)^2
= \mathbb{P}(f(X_i) \ne Y_i)\left(1 - \mathbb{P}(f(X_i) \ne Y_i)\right).
\]
This bound holds with probability at least $1 - e^{-u}$.
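Not from the notes: a small numeric sketch (assuming NumPy; the helper name bernstein_upper_bound and all values are hypothetical) that inverts this implicit bound, i.e. finds the largest true error $p$ consistent with an observed training error $\hat p$ at confidence $1 - e^{-u}$.

```python
import numpy as np

# Hypothetical illustration (all names and values are made up): invert the bound
#   p - p_hat <= sqrt(2 p (1 - p) u / n) + 2 u / (3 n)
# to find the largest true error p consistent with observed training error p_hat.
def bernstein_upper_bound(p_hat, n, u):
    lo, hi = p_hat, 1.0
    for _ in range(60):                       # bisection on [p_hat, 1]
        p = (lo + hi) / 2
        slack = np.sqrt(2 * p * (1 - p) * u / n) + 2 * u / (3 * n)
        if p - p_hat <= slack:
            lo = p                            # p still consistent with the bound
        else:
            hi = p
    return lo

# e.g. 5% training error, n = 1000, confidence 99% (u = log(1/0.01))
print(bernstein_upper_bound(p_hat=0.05, n=1000, u=np.log(100)))
```

Bisection suffices here because, for moderate $u/n$, the gap $p - \hat p - \mathrm{slack}(p)$ changes sign only once on $[\hat p, 1]$.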

When the training error is zero,
\[
\mathbb{P}(f(X_i) \ne Y_i) \le \sqrt{\frac{2\,\mathbb{P}(f(X_i) \ne Y_i)\,u}{n}} + \frac{2u}{3n},
\]
using $1 - \mathbb{P}(f(X_i) \ne Y_i) \le 1$. If we forget about $2u/3n$ for a second, we obtain
\[
\mathbb{P}(f(X_i) \ne Y_i)^2 \le \frac{2\,\mathbb{P}(f(X_i) \ne Y_i)\,u}{n}
\quad\text{and hence}\quad
\mathbb{P}(f(X_i) \ne Y_i) \le \frac{2u}{n}.
\]
The above zero-error rate is better than the $n^{-1/2}$ rate predicted by the CLT.
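For a concrete sense of the gap (illustrative numbers only, not from the notes):

```python
# Illustrative numbers only: the zero-training-error rate 2u/n vs the n^{-1/2}
# scale of CLT-type deviations.
n, u = 10_000, 4.6   # u ~ log(1/delta) for delta = 1%
print(f"2u/n = {2 * u / n:.5f}   vs   n^(-1/2) = {n ** -0.5:.5f}")
```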
