Lecture 06 Bernstein’s inequality.
18.465
Last time we proved Bennett’s inequality:
Let $\mathbb{E}X = 0$, $\mathbb{E}X^2 = \sigma^2$, $|X| < M = \mathrm{const}$, let $X_1, \ldots, X_n$ be independent copies of $X$, and let $t \ge 0$. Then
$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge t\right) \le \exp\left(-\frac{n\sigma^2}{M^2}\,\phi\!\left(\frac{tM}{n\sigma^2}\right)\right),$$
where $\phi(x) = (1+x)\log(1+x) - x$.
If $x$ is small, $\phi(x) = (1+x)\left(x - \frac{x^2}{2} + \cdots\right) - x = x + x^2 - \frac{x^2}{2} - x + \cdots = \frac{x^2}{2} + \cdots$.
If $x$ is large, $\phi(x) \sim x \log x$.
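A quick numerical look at these two regimes (a small illustrative script, not part of the original notes; the grid of $x$ values is arbitrary):

```python
import numpy as np

def phi(x):
    """Bennett's phi(x) = (1 + x) * log(1 + x) - x."""
    return (1.0 + x) * np.log1p(x) - x

# Small x: phi(x) should be close to x^2 / 2.
for x in [1e-3, 1e-2, 1e-1]:
    print(f"x={x:g}:  phi={phi(x):.3e}  x^2/2={x**2 / 2:.3e}")

# Large x: phi(x) should grow like x * log(x).
for x in [1e2, 1e4, 1e6]:
    print(f"x={x:g}:  phi={phi(x):.3e}  x*log(x)={x * np.log(x):.3e}")
```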
We can weaken the bound by decreasing $\phi(x)$. Take$^1$ $\phi(x) = \frac{x^2}{2 + \frac{2}{3}x}$ to obtain Bernstein's inequality:
$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge t\right) \le \exp\left(-\frac{n\sigma^2}{M^2}\cdot\frac{\left(\frac{tM}{n\sigma^2}\right)^2}{2 + \frac{2}{3}\cdot\frac{tM}{n\sigma^2}}\right) = \exp\left(-\frac{t^2}{2n\sigma^2 + \frac{2}{3}tM}\right) = e^{-u},$$
where $u = \frac{t^2}{2n\sigma^2 + \frac{2}{3}tM}$.
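The substitution only weakens the bound if $\frac{x^2}{2 + \frac{2}{3}x} \le (1+x)\log(1+x) - x$ for all $x \ge 0$. A numerical sanity check of that inequality (an illustrative sketch, not from the notes; the grid is arbitrary):

```python
import numpy as np

x = np.linspace(0.0, 100.0, 200001)[1:]      # skip x = 0, where both sides vanish
bennett = (1.0 + x) * np.log1p(x) - x        # phi(x) from Bennett's inequality
bernstein = x**2 / (2.0 + 2.0 * x / 3.0)     # the weaker phi(x) used for Bernstein

# The difference should be nonnegative over the whole grid.
print("min(bennett - bernstein) =", np.min(bennett - bernstein))
assert np.all(bennett >= bernstein - 1e-12)
```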
Solve for $t$:
$$t^2 - \frac{2}{3}uMt - 2n\sigma^2 u = 0, \qquad t = \frac{uM}{3} + \sqrt{\frac{u^2M^2}{9} + 2n\sigma^2 u}.$$
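One can sanity-check this root by substituting it back into the definition of $u$ (a throwaway check; the values of $n$, $\sigma^2$, $M$, $u$ below are arbitrary):

```python
import math

n, sigma2, M, u = 1000, 0.25, 1.0, 5.0       # arbitrary illustrative values

t = u * M / 3.0 + math.sqrt(u**2 * M**2 / 9.0 + 2.0 * n * sigma2 * u)
u_back = t**2 / (2.0 * n * sigma2 + 2.0 * t * M / 3.0)

print(t, u_back)                             # u_back should reproduce u up to rounding
assert abs(u_back - u) < 1e-9
```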
Substituting,
$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge \sqrt{\frac{u^2M^2}{9} + 2n\sigma^2 u} + \frac{uM}{3}\right) \le e^{-u}$$
or
$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \le \sqrt{\frac{u^2M^2}{9} + 2n\sigma^2 u} + \frac{uM}{3}\right) \ge 1 - e^{-u}.$$
Using the inequality $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$,
$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \le \sqrt{2n\sigma^2 u} + \frac{2uM}{3}\right) \ge 1 - e^{-u}.$$
For non-centered $X_i$, replace $X_i$ with $X_i - \mathbb{E}X$ or $\mathbb{E}X - X_i$. Then $|X_i - \mathbb{E}X| \le 2M$, and so with high probability
$$\sum_{i=1}^{n} (X_i - \mathbb{E}X) \le \sqrt{2n\sigma^2 u} + \frac{4uM}{3}.$$
Normalizing by $n$,
$$\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}X \le \sqrt{\frac{2\sigma^2 u}{n}} + \frac{4uM}{3n} \qquad \text{and} \qquad \mathbb{E}X - \frac{1}{n}\sum_{i=1}^{n} X_i \le \sqrt{\frac{2\sigma^2 u}{n}} + \frac{4uM}{3n}.$$
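A Monte Carlo sketch of the first of these one-sided bounds, using Bernoulli data (the parameters and the simulation itself are our own illustration, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, u, reps = 200, 0.3, 3.0, 20000         # arbitrary illustrative choices
M, sigma2 = 1.0, p * (1.0 - p)               # |X| <= M = 1, Var(X) = p(1 - p)

X = rng.binomial(1, p, size=(reps, n))
deviation = X.mean(axis=1) - p               # (1/n) sum X_i - E X

bound = np.sqrt(2.0 * sigma2 * u / n) + 4.0 * u * M / (3.0 * n)
coverage = np.mean(deviation <= bound)

print(f"empirical coverage = {coverage:.4f}, guaranteed >= {1.0 - np.exp(-u):.4f}")
```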
$^1$ Exercise: show that this is the best approximation.
Whenever $\sqrt{\frac{2\sigma^2 u}{n}} \ge \frac{4uM}{3n}$, we have $u \le \frac{9n\sigma^2}{8M^2}$. So
$$\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}X\right| \lesssim \sqrt{\frac{2\sigma^2 u}{n}} \quad \text{for } u \lesssim n\sigma^2$$
(the range of normal deviations).
This is predicted by the Central Limit Theorem (the condition for the CLT is $n\sigma^2 \to \infty$). If $n\sigma^2$ does not go to infinity, we get Poisson behavior.
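A quick check of where the two deviation terms cross over (illustrative parameters only; $u^{*}$ below denotes the threshold $\frac{9n\sigma^2}{8M^2}$ computed above):

```python
import math

n, sigma2, M = 1000, 0.25, 1.0               # arbitrary illustrative values
u_star = 9.0 * n * sigma2 / (8.0 * M**2)     # point where the two terms are equal

for u in [0.1 * u_star, u_star, 10.0 * u_star]:
    sqrt_term = math.sqrt(2.0 * sigma2 * u / n)
    linear_term = 4.0 * u * M / (3.0 * n)
    print(f"u = {u:9.2f}: sqrt term = {sqrt_term:.4f}, linear term = {linear_term:.4f}")
```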
Recall from the last lecture that we are interested in concentration inequalities because we want to know $\mathbb{P}(f(X) \ne Y)$ while we only observe $\frac{1}{n}\sum_{i=1}^{n} I(f(X_i) \ne Y_i)$.
In Bernstein's inequality take $X_i$ to be $I(f(X_i) \ne Y_i)$. Then, since $2M = 1$ (the centered indicator is bounded by $1$), we get
$$\mathbb{E}\, I(f(X_i) \ne Y_i) - \frac{1}{n}\sum_{i=1}^{n} I(f(X_i) \ne Y_i) \le \sqrt{\frac{2\,\mathbb{P}(f(X_i) \ne Y_i)\bigl(1 - \mathbb{P}(f(X_i) \ne Y_i)\bigr)\,u}{n}} + \frac{2u}{3n}$$
because $\mathbb{E}\, I(f(X_i) \ne Y_i) = \mathbb{P}(f(X_i) \ne Y_i)$, $\mathbb{E} I^2 = \mathbb{E} I$, and therefore $\mathrm{Var}(I) = \sigma^2 = \mathbb{E} I^2 - (\mathbb{E} I)^2 = \mathbb{P}(f(X_i) \ne Y_i)\bigl(1 - \mathbb{P}(f(X_i) \ne Y_i)\bigr)$. Thus, with probability at least $1 - e^{-u}$,
$$\mathbb{P}(f(X_i) \ne Y_i) \le \frac{1}{n}\sum_{i=1}^{n} I(f(X_i) \ne Y_i) + \sqrt{\frac{2\,\mathbb{P}(f(X_i) \ne Y_i)\,u}{n}} + \frac{2u}{3n}.$$
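Since $\mathbb{P}(f(X_i) \ne Y_i)$ appears on both sides, the bound can be made explicit by solving the quadratic in $\sqrt{\mathbb{P}(f(X_i) \ne Y_i)}$. A sketch of that computation (the helper name `explicit_error_bound` and the example numbers are ours, not from the notes):

```python
import math

def explicit_error_bound(train_error, n, u):
    """Solve p <= train_error + sqrt(2*p*u/n) + 2*u/(3*n) for p.

    Viewed as a quadratic in sqrt(p):
        p - sqrt(2u/n) * sqrt(p) - (train_error + 2u/(3n)) <= 0.
    """
    a = math.sqrt(2.0 * u / n)
    c = train_error + 2.0 * u / (3.0 * n)
    sqrt_p = (a + math.sqrt(a * a + 4.0 * c)) / 2.0
    return sqrt_p ** 2

# Example: 5% training error, n = 1000 samples, confidence level 1 - e^{-5}.
print(explicit_error_bound(0.05, n=1000, u=5.0))
```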
When the training error is zero,
$$\mathbb{P}(f(X_i) \ne Y_i) \le \sqrt{\frac{2\,\mathbb{P}(f(X_i) \ne Y_i)\,u}{n}} + \frac{2u}{3n}.$$
If we forget about $2u/3n$ for a second, we obtain $\mathbb{P}(f(X_i) \ne Y_i)^2 \le 2\,\mathbb{P}(f(X_i) \ne Y_i)\,u/n$ and hence
$$\mathbb{P}(f(X_i) \ne Y_i) \le \frac{2u}{n}.$$
The above zero-error rate of order $n^{-1}$ is better than the $n^{-1/2}$ rate predicted by the CLT.
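To see the gap numerically, compare $2u/n$ with a generic $n^{-1/2}$-scale term for the same $n$ and $u$ (constants here are only illustrative):

```python
import math

u = 5.0
for n in [100, 1000, 10000]:
    fast = 2.0 * u / n               # zero-training-error rate from Bernstein
    slow = math.sqrt(2.0 * u / n)    # an n^{-1/2}-scale term for comparison
    print(f"n={n:6d}: 2u/n = {fast:.4f}, sqrt(2u/n) = {slow:.4f}")
```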