The PAC model
& Occam’s razor
LECTURER: YISHAY MANSOUR
SCRIBE: TAL SAIAG AND DANIEL LABENSKI
PRESENTATION BY SELLA NEVO
PAC Model - Introduction

Probably Approximately Correct

The goal is to learn a hypothesis such that, with high confidence, it has a small error rate
Part I – Intuitive Example
PAC Model - Example

A ‘typical person’ is a label given to someone within a certain range of heights and weights

Our hypothesis class H is the set of all possible axis-aligned
rectangles

Each sample is a height and weight, along with the label
‘typical’ or not (+/-).
Problem setup

Input: Examples and their label: <(x,y), +/->

Output: R', a good approximation of the true target rectangle R

Assumption: Samples are independently and
identically distributed according to some distribution D
Good hypothesis

R − R' – false negatives

R' − R – false positives

R △ R' – errors (the symmetric difference)

We wish to find a hypothesis such that, with probability of at least 1−δ:

Pr[error] = D(R △ R') ≤ ε
Learning strategy

First, we wish our rectangle to be consistent with the sample


This is possible since we assumed R is in H
We can choose any rectangle between Rmin and Rmax
(Figure: the consistent rectangles Rmin and Rmax.)
Learning strategy

Our algorithm:

Request a “sufficiently large” number of samples

Find the left-most, right-most, top-most and bottom-most positive points to define Rmin, the tightest rectangle containing all positive samples

Set R’ = Rmin
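A minimal Python sketch of this strategy (illustrative only – the sample format and the function names are assumptions, not part of the lecture):

def tightest_rectangle(samples):
    """Return R' = Rmin, the tightest axis-aligned rectangle containing
    all positively labeled points. samples: list of ((height, weight), label),
    where label is +1 or -1."""
    positives = [point for point, label in samples if label == +1]
    if not positives:
        return None  # no positive samples: predict negative everywhere
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))  # (left, right, bottom, top)

def predict(rect, point):
    """Classify a new point using the learned rectangle R'."""
    if rect is None:
        return -1
    left, right, bottom, top = rect
    x, y = point
    return +1 if left <= x <= right and bottom <= y <= top else -1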
Sufficiently large sample size

How many samples are necessary?

Fix the algorithm A, ε and δ. We will now find the number of samples necessary – m.

We know that R' ⊆ R, hence we have only false negatives.

We divide this error region into 4 (overlapping) strips T'_1, ..., T'_4, one between each side of R' and the corresponding side of R.
Using T’i

If D maintains that for each T'_i, D(T'_i) ≤ ε/4, then the error rate of R' is at most:

Pr[Error] = D(R △ R') ≤ Σ_{i=1}^{4} D(T'_i) ≤ 4 · (ε/4) = ε

However, this approach is problematic

Our calculations depend upon T’i, which depend on R’, which
depends on our samples

We want to find a fixed number of samples – m – that can be calculated independently of the specific sample drawn
Defining Ti

We will define a new set of strips that depend only on R and not on R'.

We construct each T_i along a side of R, with width chosen so that D(T_i) = ε/4.

Note – we cannot find these Ti strips (Even after
sampling). Yet we can be sure they exist.
Using Ti

We want to achieve T'_i ⊆ T_i for every i. If that is the case, then:

Pr[Error] = D(R △ R') ≤ Σ_{i=1}^{4} D(T'_i) ≤ Σ_{i=1}^{4} D(T_i) = ε

Note that if there is at least one sampled point in T_i, then T'_i ⊆ T_i

Hence, we want to calculate the probability that no
sample point resides within Ti
Probability of large error

Formally: Pr[error > ε] ≤ Pr[∃ i ∈ {1,...,4} ∀(x, y) ∈ S : (x, y) ∉ T_i]

By definition of T_i: Pr[x ∈ T_i] = ε/4

Since our sample data is i.i.d. from distribution D:

Pr[∀(x, y) ∈ S : (x, y) ∉ T_i] = (1 − ε/4)^m
Probability of large error


Hence: Pr[error > ε] ≤ 4(1 − ε/4)^m

Using the inequality 1 − x ≤ e^{−x}, we obtain:

Pr[error > ε] ≤ 4(1 − ε/4)^m ≤ 4 e^{−εm/4}

Thus, in order to achieve accuracy ε and confidence of at least 1−δ, we need m ≥ (4/ε) ln(4/δ) samples
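As a purely illustrative calculation (the concrete numbers are not from the lecture): for ε = δ = 0.1 this bound gives m ≥ (4/0.1) ln(4/0.1) = 40 ln 40 ≈ 148, so about 150 samples suffice for error 0.1 with confidence 0.9.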


Remarks on the example

The analysis holds for any fixed distribution D (as long as we take i.i.d. samples)

The lower bound on the sample size m(ε, δ) behaves as we would expect – it increases as ε or δ decrease.

The strategy is efficient – we need only search for the
max and min points to define our tightest-fit rectangle
Part II – Formal
Presentation
Preliminaries

The goal is to learn an unknown target function out of
a known hypothesis class

Unlike the Bayesian approach, no prior knowledge is
needed

Examples from the target function are drawn
randomly according to a fixed, unknown probability
distribution and are i.i.d
Preliminaries

We assume the sample and the test data are drawn
from the same unknown distribution

The solution is efficient:

The sample size for error ε and confidence 1−δ is a function of 1/ε and ln(1/δ)

The time to process the sample is polynomial in the
sample size
Definition of basic concepts

Let X be the instance space or example space

Let D be any fixed distribution over X

A concept class over X is a set:
C  {c | c : X  {0,1}}

Let ct ∈ C be the target concept
Definition of basic concepts

Let h be a hypothesis in a concept class H

The error of h with respect to D and ct is:
error_D(h) = Pr_{x∼D}[h(x) ≠ ct(x)] = D(h △ ct)

Let EX(ct,D) be a procedure (oracle) that runs in unit
time, and returns a labeled example <x, ct(x)> drawn
independently from D.

In the PAC model, the oracle is the only source of
examples
Definition of the PAC model

Let C and H be concept classes over X. We say that C is PAC-learnable by H if there exists an algorithm A with the following property:

For every ct ∈ C, for every distribution D on X, and for every 0 < ε, δ < ½, if A is given access to EX(ct, D) and inputs ε, δ, then with probability at least 1−δ, A outputs a hypothesis h ∈ H such that:

If ct ∈ H (realizable case), then h satisfies: error(h) ≤ ε

If ct ∉ H (unrealizable case), then h satisfies: error(h) ≤ ε + min_{h'∈H} error(h')
Definition of the PAC model

We say that C is efficiently PAC learnable, if A runs in
time polynomial in:

1/ε

ln(1/δ)
n – the size of the input

m – the size of the target function (For example, the
number of bits needed to characterize it)
Finite Hypothesis Class



In this section, we will see how to learn a good
hypothesis from a finite hypothesis class H.
We define a hypothesis h to be ε-Bad if error(h) > ε

In order to learn a good hypothesis, we will analyze the bad hypotheses
The Realizable Case
We now look at the case ct ∈ H

After processing m(ε, δ) samples, we find an h that is consistent with our samples


We know one exists since ct is in H
We now try to bound the probability that A returns an h that is ε-Bad

If this probability is below δ, we have succeeded
The Realizable Case

First, we will look at a fixed h that is ε-Bad:

Pr[h is ε-Bad & h(x_i) = ct(x_i) for 1 ≤ i ≤ m] ≤ (1 − ε)^m ≤ e^{−εm}

Now, we bound the failure probability of A:

Pr[A returns an ε-Bad hypothesis]
≤ Pr[∃h ε-Bad such that h(x_i) = ct(x_i) for 1 ≤ i ≤ m]
≤ Σ_{h ε-Bad} Pr[h(x_i) = ct(x_i) for 1 ≤ i ≤ m]
≤ |{h ∈ H : h is ε-Bad}| · (1 − ε)^m
≤ |H| (1 − ε)^m ≤ |H| e^{−εm}
The Realizable Case

In order to satisfy the condition for PAC learning, we must
satisfy: |H| e^{−εm} ≤ δ

which implies a sample size of:

m ≥ (1/ε) ln(|H|/δ)
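A small Python helper that evaluates this bound (a sketch; the function name and the example values are assumptions, not from the lecture):

import math

def realizable_sample_size(h_size, eps, delta):
    """Sample size sufficient in the realizable case: m >= (1/eps) * ln(|H| / delta)."""
    return math.ceil(math.log(h_size / delta) / eps)

# For example, for the class of boolean disjunctions over n = 10 variables
# (|H| = 3**10, discussed later), with eps = delta = 0.1:
# realizable_sample_size(3**10, 0.1, 0.1) == 133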

The Unrealizable Case

We now turn to the case ct ∉ H

We define h* to be the hypothesis with minimal error in H: h* = argmin_{h'∈H} error(h')

We must relax our goal. A hypothesis with error ε does not necessarily exist, so we demand error(h) ≤ error(h*) + ε

The empirical error of h after m(ε, δ) instances is:

êrror(h) = (1/m) Σ_{i=1}^{m} I(h(x_i) ≠ ct(x_i))
The Unrealizable Case

The new algorithm A: sample m(ε, δ) examples and return

ĥ = argmin_{h∈H} êrror(h)

The algorithm is called Empirical Risk Minimization (ERM)

We will bound the sample size required such that the distance
between the true and estimated error is small for every
hypothesis
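A minimal Python sketch of ERM over a finite hypothesis class (the representation of hypotheses as callables and the sample format are assumptions for illustration):

def empirical_error(h, sample):
    """Fraction of labeled examples (x, y) on which h disagrees with the label y."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(hypotheses, sample):
    """Empirical Risk Minimization: return the hypothesis with smallest empirical error."""
    return min(hypotheses, key=lambda h: empirical_error(h, sample))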
The Unrealizable Case

We hope to achieve, w.p. at least 1−δ:

∀h ∈ H : |error(h) − êrror(h)| ≤ ε/2

If that is the case, then we obtain:

error(ĥ) ≤ êrror(ĥ) + ε/2 ≤ êrror(h*) + ε/2 ≤ error(h*) + ε

(the middle inequality uses êrror(ĥ) ≤ êrror(h*), since ĥ minimizes the empirical error)
The Unrealizable Case

In order to bound the necessary sample size to estimate the
error, we use the Chernoff bound:

2( ) m


2
Pr | error (h)  error (h) |    2e
2


2
Hence, for all hypothesis in H:

2( ) 2 m


Pr h  H :| error (h)  error (h) |    2 | H | e 2

2


And so, m( ,  ) 
2
2
ln
2| H |

Example –Boolean Disjunctions

Given a set of boolean variables T = {x1,…, xn}, we need to learn an OR function over the literals, for example: x1 ∨ x3 ∨ x5

C is the set of all possible disjunctions. Note that |C| = 3^n (each variable appears positively, negatively, or not at all)

We will use H = C
Boolean Disjunctions – ELIM algorithm


We maintain a set L, which includes all literals we believe may be in the disjunction

L is initialized to be {x1, ¬x1, ..., xn, ¬xn}

For each negative sample, we remove from L every literal that the sample satisfies
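A minimal Python sketch of ELIM (the encoding of examples as bit tuples and of a literal as a pair (i, b), meaning "x_i has value b", is an assumption for illustration):

def elim(samples, n):
    """Learn a disjunction over n boolean variables.
    samples: list of (x, y) with x a tuple of n bits and y in {0, 1}."""
    # Start with every literal: x_i and its negation, for every variable.
    L = {(i, b) for i in range(n) for b in (0, 1)}
    for x, y in samples:
        if y == 0:
            # The target disjunction is false on a negative example, so no
            # literal satisfied by x can belong to it; remove them all from L.
            L -= {(i, x[i]) for i in range(n)}
    return L

def predict_disjunction(L, x):
    """Evaluate the learned disjunction on a bit vector x."""
    return int(any(x[i] == b for i, b in L))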

Due to the previous analysis, we know the algorithm learns when m is at least:

m ≥ (1/ε) ln(|H|/δ) = (1/ε) ln(3^n/δ) = (n ln 3)/ε + (1/ε) ln(1/δ)
Example – Learning parity

Assume our concept class C is the set of XOR (parity) functions over n variables

For example, x1 ⊕ x7 ⊕ x9

Each sample is a bit vector, together with the target function's value on that vector

For example, (01101, 1)

Our algorithm will use Gaussian elimination over all samples, thus returning a solution consistent with the whole sample
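A minimal Python sketch of this approach, using Gaussian elimination over GF(2) (the dense bit-list representation and the choice of setting free variables to 0 are assumptions for illustration):

def learn_parity(samples, n):
    """Return a 0/1 vector c with XOR_{i: c[i]=1} x_i = y for every sample (x, y),
    or None if the sample is inconsistent (impossible in the realizable case)."""
    # Augmented matrix over GF(2): each row is the example's bits plus its label.
    rows = [list(x) + [y] for x, y in samples]
    pivots = []  # (row, column) of each pivot
    r = 0
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue  # no pivot in this column; x_col becomes a free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivots.append((r, col))
        r += 1
    if any(not any(row[:n]) and row[n] for row in rows):
        return None  # a "0 = 1" row: the sample is inconsistent
    c = [0] * n
    for r_idx, col in pivots:
        c[col] = rows[r_idx][n]  # free variables are set to 0
    return c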
Example – Learning parity

This example fits within the realizable case.

According to our analysis, since |H| = 2^n, the needed sample size m is:

m ≥ (1/ε) ln(|H|/δ) = (n/ε) ln 2 + (1/ε) ln(1/δ)
Part III – Occam’s Razor
Occam’s Razor

“Entities should not be multiplied unnecessarily”
- William of Occam, c. 1320

We will show that, under very general assumptions, an Occam algorithm produces hypotheses that are predictive on future observations
Occam Algorithm - Definition

An (α, β) Occam-algorithm for a function class C, using a hypothesis class H, is an algorithm with two parameters: a constant α ≥ 0 and a compression factor 0 ≤ β < 1.

Given a sample S of size m labeled according to ct (i.e., each example x_i comes with the label ct(x_i)), the algorithm outputs a hypothesis h ∈ H such that:

1. The hypothesis h is consistent with the sample

2. The size of h is at most n^α · m^β, where n is the size of ct and m is the sample size
Occam Algorithm and PAC

We now show that the existence of an Occam algorithm for C implies polynomial PAC learnability

Theorem: Let A be an (α, β) Occam algorithm for C, using H. Then A is PAC with a sample size of

m ≥ max{ ( (2 n^α ln 2) / ε )^{1/(1−β)} , (2/ε) ln(1/δ) }
Occam Algorithm and PAC
Proof: Fix m and n.
 
A returns a hypothesis h s.t. size(h) ≤ n^α m^β. The number of hypotheses of this size is at most 2^{n^α m^β}, so |H| ≤ 2^{n^α m^β}.

For PAC we need m ≥ (1/ε) ln(|H|/δ). And so it suffices that:

m ≥ (1/ε) ( n^α m^β ln 2 + ln(1/δ) )

It is therefore enough to require:

m ≥ 2 max{ (1/ε) n^α m^β ln 2, (1/ε) ln(1/δ) }

From the first term: m ≥ (2/ε) n^α m^β ln 2, i.e. m^{1−β} ≥ (2 n^α ln 2)/ε, which gives

m ≥ ( (2 n^α ln 2) / ε )^{1/(1−β)}
Example – Boolean Or
with k literals

Again, we wish to learn a boolean disjunction over n variables

However, this time we know there are at most k literals, so the hypothesis class size is roughly n^k, much smaller than 3^n

We will show an Occam algorithm that creates a
hypothesis h with size O(k ln n ln m)
Example – Boolean Or
with k literals
Algorithm: Greedy Algorithm for Set Cover
Input: S_1, S_2, ..., S_t ⊆ V = {1,..., m}
Output: S_{i1}, S_{i2}, ..., S_{ik} s.t. ∪_j S_{ij} = V

SetCoverGreedy(V):
(1) S ← ∅, j ← 0, V_0 ← V
(2) while V_j ≠ ∅:
(3)   choose S_i = argmax_{S_r} |S_r ∩ V_j|
(4)   S ← S ∪ {i}
(5)   V_{j+1} ← V_j − S_i
(6)   j ← j + 1
(7) return S
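A Python counterpart of the pseudocode above (a minimal sketch; the dictionary-of-sets input format is an assumption):

def set_cover_greedy(universe, subsets):
    """Greedy set cover: repeatedly pick the subset covering the most
    still-uncovered elements. subsets maps a name to a set of elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("the given subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen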

Set Cover Algorithm - Analysis

Claim: If S_opt covers V with k sets, then the greedy algorithm returns a cover with at most k ln m sets.

Proof:

∀j ∃ S_t ∈ S_opt such that |S_t ∩ V_j| ≥ |V_j| / |S_opt| = |V_j| / k
(the k sets of S_opt cover V_j, so one of them covers at least a 1/k fraction of it)

We bound how fast the sets V_j shrink:

|V_{j+1}| ≤ |V_j| − |V_j|/k = (1 − 1/k) |V_j| ≤ (1 − 1/k)^{j+1} |V_0| ≤ e^{−(j+1)/k} · m

When this bound drops below 1, V_j = ∅. Therefore, after k ln m steps the algorithm will stop.
Solving Boolean Or with k
Literals

We solve the boolean Or problem via a reduction to
Set Cover

We define the set to be covered:

T = { x : <x, +> ∈ S }  (all positive examples)

And the subsets to use:

S_{x_i} = { x ∈ T : the literal x_i is satisfied by x }

(each literal "covers" the positive examples on which it is true)
Solving Boolean Or with k
Literals

We know there exists a set of k literals that covers all the positive examples (the literals of the target disjunction), so the greedy algorithm returns O(k ln m) literals that cover them.

There exist 2n literals, so each literal can be encoded with O(log 2n) bits, hence size(h) = O(k ln n ln m)

Since ln m = O(m^β) and ln n = O(n^α) for any α, β > 0, we get an (α, β) Occam-algorithm where α, β are arbitrarily small.
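A minimal Python sketch of the reduction (illustrative; it reuses set_cover_greedy from the sketch above, and restricting the candidate literals to those falsified by every negative example – e.g. via ELIM – is an added assumption that keeps the output consistent with the whole sample):

def or_via_set_cover(samples, n):
    """Learn a short disjunction by covering the positive examples with literals.
    samples: list of (x, y) with x a tuple of n bits and y in {0, 1}."""
    positives = [x for x, y in samples if y == 1]
    negatives = [x for x, y in samples if y == 0]
    # Candidate literals (i, b), meaning "x_i == b", kept only if no negative
    # example satisfies them (this is exactly what ELIM would leave in L).
    candidates = [(i, b) for i in range(n) for b in (0, 1)
                  if all(x[i] != b for x in negatives)]
    # Each literal "covers" the positive examples it satisfies.
    universe = range(len(positives))
    subsets = {(i, b): {j for j, x in enumerate(positives) if x[i] == b}
               for i, b in candidates}
    return set_cover_greedy(universe, subsets)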
Solving Boolean Or with k
Literals

A tighter bound for the sample size (m) can be
computed directly:
m ≥ (1/ε) ( k ln m · ln(2n) + ln(1/δ) )

Solving this implicit condition on m yields a sample size that is polynomial in k, ln n, 1/ε and ln(1/δ), rather than in n.