The PAC model & Occam's razor
LECTURER: YISHAY MANSOUR
SCRIBE: TAL SAIAG AND DANIEL LABENSKI
PRESENTATION BY SELLA NEVO

PAC Model - Introduction
PAC stands for Probably Approximately Correct.
The goal is to learn a hypothesis such that, with high confidence, we will have a small error rate.

Part I – Intuitive Example

PAC Model - Example
A 'typical person' is a label given to someone within a certain range of height and weight.
Our hypothesis class H is the set of all possible axis-aligned rectangles.
Each sample is a height and weight, along with the label 'typical' or not (+/-).

Problem setup
Input: examples and their labels: $\langle (x,y), \pm \rangle$.
Output: $R'$, a good approximation of the real target rectangle $R$.
Assumption: samples are independently and identically distributed according to some distribution $D$.

Good hypothesis
$R - R'$: false negatives.
$R' - R$: false positives.
$R \triangle R'$: errors.
We wish to find a hypothesis such that, with probability at least $1-\delta$:
$$\Pr[\text{error}] = D(R \triangle R') \le \epsilon.$$

Learning strategy
First, we wish our rectangle to be consistent with the samples.
This is possible since we assumed $R$ is in $H$.
We can choose any rectangle between $R_{\min}$ and $R_{\max}$.

Learning strategy
Our algorithm:
Request a "sufficiently large" number of samples.
Find the left-most, right-most, top-most, and bottom-most positive points to define $R_{\min}$.
Set $R' = R_{\min}$.

Sufficiently large sample size
How many samples are necessary? Fix $\epsilon$ and $\delta$; we will now find the necessary number of samples $m$.
We know that $R' \subseteq R$, hence we have only false negatives.
We divide them into 4 (overlapping) strips $T'_1, \dots, T'_4$, one along each edge of $R'$.

Using $T'_i$
If $D$ maintains that $D(T'_i) \le \frac{\epsilon}{4}$ for each $T'_i$, then the error rate of $R'$ is at most:
$$\Pr[\text{error}] = D(R \triangle R') \le \sum_{i=1}^{4} D(T'_i) \le \epsilon.$$
However, this approach is problematic: the calculation depends on the $T'_i$, which depend on $R'$, which depends on our samples.
We want a fixed number of samples $m$ that can be calculated regardless of the specific results.

Defining $T_i$
We define a new set of strips that depend only on $R$ and not on $R'$.
We construct the $T_i$ along the edges of $R$, with width chosen such that $D(T_i) = \frac{\epsilon}{4}$.
Note: we cannot find these $T_i$ strips (even after sampling), yet we can be sure they exist.

Using $T_i$
We want to achieve $T'_i \subseteq T_i$. If that is the case, then:
$$\Pr[\text{error}] = D(R \triangle R') \le \sum_{i=1}^{4} D(T'_i) \le \sum_{i=1}^{4} D(T_i) \le \epsilon.$$
Note that if there is at least one sampled point in $T_i$, then $T'_i \subseteq T_i$.
Hence, we want to calculate the probability that no sample point resides within some $T_i$.

Probability of large error
Formally:
$$\Pr[\text{error} > \epsilon] \le \Pr[\exists i \in \{1,\dots,4\} \;\; \forall (x,y) \in S, \; (x,y) \notin T_i].$$
By definition of $T_i$: $\Pr[x \in T_i] = \frac{\epsilon}{4}$.
Since our sample data is i.i.d. from distribution $D$:
$$\Pr[\forall (x,y) \in S, \; (x,y) \notin T_i] = \left(1 - \frac{\epsilon}{4}\right)^m.$$

Probability of large error
Hence, by a union bound over the four strips:
$$\Pr[\text{error} > \epsilon] \le 4\left(1 - \frac{\epsilon}{4}\right)^m.$$
Using the inequality $1 - x \le e^{-x}$, we obtain:
$$\Pr[\text{error} > \epsilon] \le 4\left(1 - \frac{\epsilon}{4}\right)^m \le 4 e^{-\epsilon m / 4}.$$
Thus, in order to achieve accuracy $\epsilon$ and confidence at least $1-\delta$, we need
$$m \ge \frac{4}{\epsilon} \ln \frac{4}{\delta}$$
samples.

Remarks on the example
The analysis holds for any fixed distribution $D$ (as long as we take i.i.d. samples).
The required sample size $m(\epsilon, \delta)$ behaves as we would expect: it increases as $\epsilon$ or $\delta$ decrease.
The strategy is efficient: we only need to find the extreme coordinates of the positive points to define our tightest-fit rectangle.
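To make the strategy concrete, here is a minimal Python sketch (not part of the original notes) of the tightest-fit rectangle learner together with the sample size $m \ge \frac{4}{\epsilon}\ln\frac{4}{\delta}$ derived above. The uniform distribution on the unit square, the specific target rectangle, and the function names are illustrative assumptions.

```python
import math
import random

def sample_size(eps, delta):
    # m >= (4/eps) * ln(4/delta), from the four-strip argument above
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

def tightest_fit_rectangle(samples):
    """Return the smallest axis-aligned rectangle containing all positive points.

    samples: list of ((x, y), label) pairs with label in {+1, -1}.
    Returns (x_min, x_max, y_min, y_max), or None if there are no positives.
    """
    positives = [p for p, label in samples if label == +1]
    if not positives:
        return None
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))

if __name__ == "__main__":
    # Illustrative target rectangle R and distribution D (uniform on the unit square).
    R = (0.2, 0.7, 0.3, 0.9)  # (x_min, x_max, y_min, y_max)

    def label(p):
        x, y = p
        return +1 if (R[0] <= x <= R[1] and R[2] <= y <= R[3]) else -1

    eps, delta = 0.05, 0.05
    m = sample_size(eps, delta)
    pts = [(random.random(), random.random()) for _ in range(m)]
    S = [(p, label(p)) for p in pts]
    print("m =", m, " R' =", tightest_fit_rectangle(S))
```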
Part II – Formal Presentation

Preliminaries
The goal is to learn an unknown target function out of a known hypothesis class.
Unlike the Bayesian approach, no prior knowledge is needed.
Examples of the target function are drawn randomly according to a fixed, unknown probability distribution, and are i.i.d.

Preliminaries
We assume the sample and the test data are drawn from the same unknown distribution.
The solution is efficient: the sample size for error $\epsilon$ and confidence $1-\delta$ is a function of $\frac{1}{\epsilon}$ and $\ln\frac{1}{\delta}$, and the time to process the sample is polynomial in the sample size.

Definition of basic concepts
Let $X$ be the instance space (or example space).
Let $D$ be any fixed distribution over $X$.
A concept class over $X$ is a set $C \subseteq \{c \mid c : X \to \{0,1\}\}$.
Let $c_t \in C$ be the target concept.

Definition of basic concepts
Let $h$ be a hypothesis in a concept class $H$.
The error of $h$ with respect to $D$ and $c_t$ is:
$$error(h) = \Pr_D[h(x) \neq c_t(x)] = D(h \triangle c_t).$$
Let $EX(c_t, D)$ be a procedure (oracle) that runs in unit time and returns a labeled example $\langle x, c_t(x) \rangle$ drawn independently from $D$.
In the PAC model, the oracle is the only source of examples.

Definition of the PAC model
Let $C$ and $H$ be concept classes over $X$. We say that $C$ is PAC-learnable by $H$ if there exists an algorithm $A$ with the following property: for every $c_t \in C$, for every distribution $D$ on $X$, and for every $0 < \epsilon, \delta < \frac{1}{2}$, if $A$ is given access to $EX(c_t, D)$ and inputs $\epsilon, \delta$, then with probability at least $1-\delta$, $A$ outputs a hypothesis $h \in H$ such that:
If $c_t \in H$ (the realizable case), then $h$ satisfies $error(h) \le \epsilon$.
If $c_t \notin H$ (the unrealizable case), then $h$ satisfies $error(h) \le \min_{h' \in H} error(h') + \epsilon$.

Definition of the PAC model
We say that $C$ is efficiently PAC-learnable if $A$ runs in time polynomial in:
$\frac{1}{\epsilon}$ and $\ln\frac{1}{\delta}$,
$n$, the size of the input,
and the size of the target function (for example, the number of bits needed to characterize it).

Finite Hypothesis Class
In this section, we will see how to learn a good hypothesis from a finite hypothesis class $H$.
We define a hypothesis $h$ to be $\epsilon$-bad if $error(h) \ge \epsilon$.
In order to learn a good hypothesis, we will analyze the bad hypotheses.

The Realizable Case
We now look at the case $c_t \in H$.
After processing $m(\epsilon, \delta)$ samples, we find an $h$ that is consistent with our samples.
We know one exists, since $c_t$ is in $H$.
We now try to bound the probability that $A$ returns an $h$ that is $\epsilon$-bad; if it is below $\delta$, we have succeeded.

The Realizable Case
First, we look at a fixed $h$ that is $\epsilon$-bad:
$$\Pr[h \text{ is } \epsilon\text{-bad and } h(x_i) = c_t(x_i) \text{ for } 1 \le i \le m] \le (1-\epsilon)^m \le e^{-\epsilon m}.$$
Now, we bound the failure probability of $A$:
$$\Pr[A \text{ returns an } \epsilon\text{-bad hypothesis}] \le \Pr[\exists h \text{ } \epsilon\text{-bad with } h(x_i) = c_t(x_i) \text{ for } 1 \le i \le m]$$
$$\le \sum_{h \text{ } \epsilon\text{-bad}} \Pr[h(x_i) = c_t(x_i) \text{ for } 1 \le i \le m] \le |\{h \in H : h \text{ is } \epsilon\text{-bad}\}| \cdot (1-\epsilon)^m \le |H|(1-\epsilon)^m \le |H| e^{-\epsilon m}.$$

The Realizable Case
In order to satisfy the condition for PAC learning, we must satisfy $|H| e^{-\epsilon m} \le \delta$, which implies a sample size of
$$m \ge \frac{1}{\epsilon} \ln \frac{|H|}{\delta}.$$
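As a sanity check on the realizable-case bound, here is a minimal Python sketch (not part of the original notes) of a generic consistent learner over an explicitly enumerated finite hypothesis class, using the sample size $m \ge \frac{1}{\epsilon}\ln\frac{|H|}{\delta}$. The tiny class of threshold functions in the demo, and the function names, are illustrative assumptions.

```python
import math
import random

def realizable_sample_size(H_size, eps, delta):
    # m >= (1/eps) * ln(|H| / delta), from the analysis above
    return math.ceil((1.0 / eps) * math.log(H_size / delta))

def consistent_hypothesis(H, sample):
    """Return any hypothesis in the finite class H that is consistent with the sample.

    H: list of functions h(x) -> {0, 1}
    sample: list of (x, label) pairs
    """
    for h in H:
        if all(h(x) == y for x, y in sample):
            return h
    return None  # cannot happen in the realizable case

if __name__ == "__main__":
    # Illustrative finite class: thresholds t/100 on [0, 1], so |H| = 101.
    H = [(lambda x, t=t: 1 if x >= t / 100.0 else 0) for t in range(101)]
    target = H[37]

    eps, delta = 0.05, 0.05
    m = realizable_sample_size(len(H), eps, delta)
    sample = [(x, target(x)) for x in (random.random() for _ in range(m))]
    h = consistent_hypothesis(H, sample)
    print("m =", m, " consistent hypothesis found:", h is not None)
```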
The Unrealizable Case
We now turn to the case $c_t \notin H$.
We define $h^*$ to be the hypothesis with minimal error in $H$: $error(h^*) = \min_{h \in H} error(h)$.
We must relax our goal: a hypothesis with error at most $\epsilon$ does not necessarily exist, so we demand $error(h) \le error(h^*) + \epsilon$.
The empirical error of $h$ after $m(\epsilon, \delta)$ instances is
$$\widehat{error}(h) = \frac{1}{m} \sum_{i=1}^{m} I(h(x_i) \neq c_t(x_i)).$$

The Unrealizable Case
The new algorithm $A$: sample $m(\epsilon, \delta)$ examples and return $\hat{h} = \arg\min_{h \in H} \widehat{error}(h)$.
This algorithm is called Empirical Risk Minimization (ERM).
We will bound the sample size required so that the distance between the true and the estimated error is small for every hypothesis.

The Unrealizable Case
We hope to achieve, with probability at least $1-\delta$:
$$\forall h \in H, \quad |error(h) - \widehat{error}(h)| \le \frac{\epsilon}{2}.$$
If that is the case, then we obtain:
$$error(\hat{h}) \le \widehat{error}(\hat{h}) + \frac{\epsilon}{2} \le \widehat{error}(h^*) + \frac{\epsilon}{2} \le error(h^*) + \epsilon.$$

The Unrealizable Case
In order to bound the sample size needed to estimate the error, we use the Chernoff bound:
$$\Pr\left[\,|error(h) - \widehat{error}(h)| \ge \frac{\epsilon}{2}\,\right] \le 2\, e^{-2 (\epsilon/2)^2 m}.$$
Hence, by a union bound over all hypotheses in $H$:
$$\Pr\left[\exists h \in H : |error(h) - \widehat{error}(h)| \ge \frac{\epsilon}{2}\right] \le 2 |H|\, e^{-2 (\epsilon/2)^2 m} = 2|H|\, e^{-\epsilon^2 m / 2}.$$
And so,
$$m(\epsilon, \delta) \ge \frac{2}{\epsilon^2} \ln \frac{2|H|}{\delta}.$$

Example – Boolean Disjunctions
Given a set of boolean variables $T = \{x_1, \dots, x_n\}$, we need to learn an OR function over the literals, for example: $x_1 \vee \bar{x}_3 \vee x_5$.
$C$ is the set of all possible disjunctions. Note that $|C| = 3^n$.
We will use $H = C$.

Boolean Disjunctions – ELIM algorithm
We maintain a set $L$, which includes all literals that may still be in the disjunction.
$L$ is initialized to $\{x_1, \bar{x}_1, \dots, x_n, \bar{x}_n\}$.
For each negative sample, we remove from $L$ every literal that the sample satisfies (such a literal would wrongly make the sample positive).
Due to the previous analysis, we know the algorithm learns when $m$ is at least:
$$m \ge \frac{1}{\epsilon} \ln \frac{|H|}{\delta} = \frac{1}{\epsilon} \ln \frac{3^n}{\delta} = \frac{1}{\epsilon} \left( n \ln 3 + \ln \frac{1}{\delta} \right).$$

Example – Learning parity
Assume our concept class $C$ is the set of XOR (parity) functions over $n$ variables, for example $x_1 \oplus x_7 \oplus x_9$.
Each sample is a bit vector together with the target function's value on that vector, for example $(01101, 1)$.
Our algorithm will use Gaussian elimination over all samples, thus returning a solution consistent with the whole sample.

Example – Learning parity
This example fits within the realizable case. According to our analysis, the needed sample size $m$ is:
$$m \ge \frac{1}{\epsilon} \ln \frac{|H|}{\delta} = \frac{1}{\epsilon} \left( n \ln 2 + \ln \frac{1}{\delta} \right).$$
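To illustrate the parity example above, here is a minimal Python sketch (not part of the original notes) that finds a parity function consistent with a labeled sample via Gaussian elimination over GF(2). The random target, the sample generation, and the function names are illustrative assumptions.

```python
import random

def solve_parity(samples, n):
    """Find w in {0,1}^n with <w, x> mod 2 == y for every (x, y) in samples.

    samples: list of (x, y) where x is a sequence of n bits and y is a bit.
    Returns one consistent w (free variables set to 0), or None if inconsistent.
    """
    # Augmented matrix over GF(2): each row is x followed by y.
    rows = [list(x) + [y] for x, y in samples]
    pivot_cols = []
    r = 0
    for col in range(n):
        # Find a row with a 1 in this column at or below the current row.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] == 1), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate this column from every other row (XOR of rows).
        for i in range(len(rows)):
            if i != r and rows[i][col] == 1:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivot_cols.append(col)
        r += 1
    # Inconsistent system: an all-zero coefficient row with right-hand side 1.
    if any(all(b == 0 for b in row[:n]) and row[n] == 1 for row in rows):
        return None
    w = [0] * n
    for i, col in enumerate(pivot_cols):
        w[col] = rows[i][n]
    return w

if __name__ == "__main__":
    n = 5
    target = [random.randint(0, 1) for _ in range(n)]   # illustrative parity, e.g. x1 xor x3
    sample = []
    for _ in range(40):
        x = [random.randint(0, 1) for _ in range(n)]
        y = sum(a * b for a, b in zip(target, x)) % 2
        sample.append((x, y))
    w = solve_parity(sample, n)
    print("consistent:", all(sum(a * b for a, b in zip(w, x)) % 2 == y for x, y in sample))
```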
Part III – Occam's Razor

Occam's Razor
"Entities should not be multiplied unnecessarily" - William of Occam, c. 1320.
We will show that, under very general assumptions, an Occam algorithm produces hypotheses that are predictive for future observations.

Occam Algorithm - Definition
An $(\alpha, \beta)$ Occam algorithm for a function class $C$, using a hypothesis class $H$, is an algorithm with two parameters: a constant $\alpha \ge 0$ and a compression factor $0 \le \beta < 1$.
Given a sample $S$ of size $m$ which is labeled according to $c_t$, i.e., $\forall i \in \{1, \dots, m\}: c_t(x_i) = b_i$, the algorithm outputs a hypothesis $h \in H$ such that:
1. The hypothesis $h$ is consistent with the sample.
2. The size of $h$ is at most $n^\alpha m^\beta$, where $n$ is the size of $c_t$ and $m$ is the sample size.

Occam Algorithm and PAC
We now show that the existence of an Occam algorithm for $C$ implies polynomial PAC learnability.
Theorem: Let $A$ be an $(\alpha, \beta)$ Occam algorithm for $C$, using $H$. Then $A$ is PAC with
$$m \ge \left( \frac{2\, n^\alpha \ln 2}{\epsilon} \right)^{1/(1-\beta)} + \frac{2}{\epsilon} \ln \frac{1}{\delta}.$$

Occam Algorithm and PAC
Proof: Fix $m$ and $n$. $A$ returns a hypothesis $h$ with $size(h) \le n^\alpha m^\beta$, so the number of hypotheses $A$ may output is at most $|H_m| \le 2^{n^\alpha m^\beta}$.
Since $h$ is consistent with the sample, the realizable-case analysis applies, and for PAC learning we need
$$m \ge \frac{1}{\epsilon} \ln \frac{|H_m|}{\delta}.$$
And so it suffices that
$$m \ge \frac{1}{\epsilon} \left( n^\alpha m^\beta \ln 2 + \ln \frac{1}{\delta} \right),$$
which holds whenever
$$m \ge \frac{2}{\epsilon} \max\left\{ n^\alpha m^\beta \ln 2, \; \ln \frac{1}{\delta} \right\}.$$
The condition $m \ge \frac{2}{\epsilon} n^\alpha m^\beta \ln 2$ is equivalent to $m^{1-\beta} \ge \frac{2\, n^\alpha \ln 2}{\epsilon}$, i.e., to $m \ge \left( \frac{2\, n^\alpha \ln 2}{\epsilon} \right)^{1/(1-\beta)}$; together with $m \ge \frac{2}{\epsilon} \ln \frac{1}{\delta}$, this gives the bound in the theorem.

Example – Boolean Or with k literals
Again, we wish to learn a boolean disjunction over $n$ variables.
However, this time we know there are at most $k$ literals, so the effective hypothesis class size is roughly $n^k \ll 3^n$.
We will show an Occam algorithm that creates a hypothesis $h$ of size $O(k \ln n \ln m)$.

Example – Boolean Or with k literals
Algorithm: greedy algorithm for Set Cover.
Input: $S_1, S_2, \dots, S_t \subseteq V = \{1, \dots, m\}$.
Output: a sub-collection $S_{i_1}, S_{i_2}, \dots$ such that $\bigcup_j S_{i_j} = V$.
SetCoverGreedy(V):
(1) $S \leftarrow \emptyset$, $j \leftarrow 0$, $V_0 \leftarrow V$
(2) while $V_j \neq \emptyset$:
(3)   choose $S_i = \arg\max_{S_r} |S_r \cap V_j|$
(4)   $S \leftarrow S \cup \{i\}$
(5)   $V_{j+1} \leftarrow V_j - S_i$
(6)   $j \leftarrow j + 1$
(7) return $S$

Set Cover Algorithm - Analysis
Claim: if $S_{opt}$ covers $V$ with $k$ sets, then the greedy algorithm returns a cover with at most $k \ln m$ sets.
Proof: for every $j$, the sets of $S_{opt}$ cover $V_j$, so some set covers at least a $\frac{1}{k}$ fraction of it:
$$\exists S_r \in S_{opt} : \; |S_r \cap V_j| \ge \frac{|V_j|}{k}.$$
We bound how fast the sets $V_j$ shrink:
$$|V_{j+1}| \le \left(1 - \frac{1}{k}\right) |V_j|, \quad \text{hence} \quad |V_{j+1}| \le \left(1 - \frac{1}{k}\right)^{j+1} |V_0| \le e^{-(j+1)/k}\, m.$$
When this bound drops below 1, then $V_j = \emptyset$. Therefore, after at most $k \ln m$ steps the algorithm stops.

Solving Boolean Or with k Literals
We solve the boolean OR problem via a reduction to Set Cover; see the sketch after this part.
We define the set to be covered:
$$T = \{x : \langle x, + \rangle \in S\} \quad \text{(all positive examples)},$$
and the subsets to use:
$$S_{x_i} = \{x \in T : x_i \text{ is satisfied by } x\} \quad \text{(each literal covers the positive examples it satisfies)}.$$

Solving Boolean Or with k Literals
We know there exists a set of $k$ literals which satisfies all the positive examples, so the greedy algorithm must return $O(k \ln m)$ literals that satisfy all the positive examples (restricting the candidate literals to those not satisfied by any negative example keeps the resulting disjunction consistent with the whole sample).
There are $2n$ literals, so each literal can be encoded with $O(\log 2n)$ bits, hence
$$size(h) = O(k \ln n \ln m).$$
Since, up to constants, $\ln m \le m^\beta$ and $\ln n \le n^\alpha$ for any $\alpha, \beta > 0$ (for sufficiently large $m$ and $n$), we get an $(\alpha, \beta)$ Occam algorithm where $\alpha, \beta$ are arbitrarily small.

Solving Boolean Or with k Literals
A tighter bound on the sample size $m$ can be computed directly from $size(h) \le k \ln n \ln m$: we need
$$m \ge \frac{1}{\epsilon} \left( k \ln m \ln n \cdot \ln 2 + \ln \frac{1}{\delta} \right),$$
an implicit inequality in $m$ whose solution grows only logarithmically in $n$, rather than linearly as in the generic $\frac{1}{\epsilon}(n \ln 3 + \ln \frac{1}{\delta})$ bound.
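As a concrete companion to the reduction above, here is a minimal Python sketch (not part of the original notes) of the greedy set-cover step applied to learning a small disjunction. Restricting the candidate literals using the negative examples, the random data generation, and the function names are illustrative assumptions.

```python
import random

def greedy_set_cover(universe, subsets):
    """Greedy Set Cover: repeatedly pick the subset covering the most uncovered elements.

    universe: set of elements to cover.
    subsets: dict mapping a name to the set of elements it covers.
    Returns the chosen names (roughly k*ln|universe| of them if a cover of size k exists).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

def learn_k_literal_or(sample, n):
    """Learn a small disjunction consistent with the sample via the Set Cover reduction.

    sample: list of (x, y) with x a tuple of n bits and y in {0, 1}.
    A literal is (i, v): it is satisfied by x when x[i] == v.
    """
    positives = [x for x, y in sample if y == 1]
    negatives = [x for x, y in sample if y == 0]
    # Candidate literals: those not satisfied by any negative example (keeps h consistent).
    literals = [(i, v) for i in range(n) for v in (0, 1)
                if all(x[i] != v for x in negatives)]
    # Each candidate literal covers the positive examples it satisfies.
    universe = set(range(len(positives)))
    subsets = {lit: {j for j, x in enumerate(positives) if x[lit[0]] == lit[1]}
               for lit in literals}
    return greedy_set_cover(universe, subsets)

if __name__ == "__main__":
    n, target = 10, [(0, 1), (3, 0)]   # illustrative target: x_1 OR (not x_4), 0-indexed

    def label(x):
        return 1 if any(x[i] == v for i, v in target) else 0

    xs = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(200)]
    sample = [(x, label(x)) for x in xs]
    print("learned literals:", learn_k_literal_or(sample, n))
```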