The PAC model & Occam's razor
LECTURER: YISHAY MANSOUR
SCRIBE: TAL SAIAG AND DANIEL LABENSKI
PRESENTATION BY SELLA NEVO

PAC Model - Introduction
PAC stands for Probably Approximately Correct.
The goal is to learn a hypothesis such that, with high confidence, we will have a small error rate.

Part I – Intuitive Example

PAC Model - Example
A 'typical person' is a label given to someone within a certain range of height and weight.
Our hypothesis class H is the set of all possible axis-aligned rectangles.
Each sample is a height and weight, along with the label 'typical' or not (+/-).

Problem setup
Input: examples and their labels: $\langle (x,y), \pm \rangle$.
Output: $R'$, a good approximation of the real target rectangle $R$.
Assumption: samples are independently and identically distributed according to some distribution $D$.

Good hypothesis
$R - R'$: false negatives.
$R' - R$: false positives.
$R \triangle R'$: errors.
We wish to find a hypothesis such that, with probability at least $1-\delta$:
$$\Pr[\text{error}] = D(R \triangle R') \le \epsilon.$$

Learning strategy
First, we wish our rectangle to be consistent with the samples.
This is possible since we assumed $R$ is in $H$.
We can choose any rectangle between $R_{\min}$ and $R_{\max}$.

Learning strategy
Our algorithm:
Request a "sufficiently large" number of samples.
Find the left-most, right-most, top-most, and bottom-most positive points to define $R_{\min}$.
Set $R' = R_{\min}$.

Sufficiently large sample size
How many samples are necessary? Fix $\epsilon$ and $\delta$; we will now find the necessary number of samples $m$.
We know that $R' \subseteq R$, hence we have only false negatives.
We divide them into 4 (overlapping) strips $T'_1, \dots, T'_4$, one along each edge of $R'$.

Using $T'_i$
If $D$ maintains that $D(T'_i) \le \frac{\epsilon}{4}$ for each $T'_i$, then the error rate of $R'$ is at most:
$$\Pr[\text{error}] = D(R \triangle R') \le \sum_{i=1}^{4} D(T'_i) \le \epsilon.$$
However, this approach is problematic: the calculation depends on the $T'_i$, which depend on $R'$, which depends on our samples.
We want a fixed number of samples $m$ that can be calculated regardless of the specific results.

Defining $T_i$
We define a new set of strips that depend only on $R$ and not on $R'$.
We construct the $T_i$ along the edges of $R$, with width chosen such that $D(T_i) = \frac{\epsilon}{4}$.
Note: we cannot find these $T_i$ strips (even after sampling), yet we can be sure they exist.

Using $T_i$
We want to achieve $T'_i \subseteq T_i$. If that is the case, then:
$$\Pr[\text{error}] = D(R \triangle R') \le \sum_{i=1}^{4} D(T'_i) \le \sum_{i=1}^{4} D(T_i) \le \epsilon.$$
Note that if there is at least one sampled point in $T_i$, then $T'_i \subseteq T_i$.
Hence, we want to calculate the probability that no sample point resides within some $T_i$.

Probability of large error
Formally:
$$\Pr[\text{error} > \epsilon] \le \Pr[\exists i \in \{1,\dots,4\} \;\; \forall (x,y) \in S, \; (x,y) \notin T_i].$$
By definition of $T_i$: $\Pr[x \in T_i] = \frac{\epsilon}{4}$.
Since our sample data is i.i.d. from distribution $D$:
$$\Pr[\forall (x,y) \in S, \; (x,y) \notin T_i] = \left(1 - \frac{\epsilon}{4}\right)^m.$$

Probability of large error
Hence, by a union bound over the four strips:
$$\Pr[\text{error} > \epsilon] \le 4\left(1 - \frac{\epsilon}{4}\right)^m.$$
Using the inequality $1 - x \le e^{-x}$, we obtain:
$$\Pr[\text{error} > \epsilon] \le 4\left(1 - \frac{\epsilon}{4}\right)^m \le 4 e^{-\epsilon m / 4}.$$
Thus, in order to achieve accuracy $\epsilon$ and confidence at least $1-\delta$, we need
$$m \ge \frac{4}{\epsilon} \ln \frac{4}{\delta}$$
samples.

Remarks on the example
The analysis holds for any fixed distribution $D$ (as long as we take i.i.d. samples).
The required sample size $m(\epsilon, \delta)$ behaves as we would expect: it increases as $\epsilon$ or $\delta$ decrease.
The strategy is efficient: we only need to find the extreme coordinates of the positive points to define our tightest-fit rectangle.
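To make the strategy concrete, here is a minimal Python sketch (not part of the original notes) of the tightest-fit rectangle learner together with the sample size $m \ge \frac{4}{\epsilon}\ln\frac{4}{\delta}$ derived above. The uniform distribution on the unit square, the specific target rectangle, and the function names are illustrative assumptions.

```python
import math
import random

def sample_size(eps, delta):
    # m >= (4/eps) * ln(4/delta), from the four-strip argument above
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

def tightest_fit_rectangle(samples):
    """Return the smallest axis-aligned rectangle containing all positive points.

    samples: list of ((x, y), label) pairs with label in {+1, -1}.
    Returns (x_min, x_max, y_min, y_max), or None if there are no positives.
    """
    positives = [p for p, label in samples if label == +1]
    if not positives:
        return None
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))

if __name__ == "__main__":
    # Illustrative target rectangle R and distribution D (uniform on the unit square).
    R = (0.2, 0.7, 0.3, 0.9)  # (x_min, x_max, y_min, y_max)

    def label(p):
        x, y = p
        return +1 if (R[0] <= x <= R[1] and R[2] <= y <= R[3]) else -1

    eps, delta = 0.05, 0.05
    m = sample_size(eps, delta)
    pts = [(random.random(), random.random()) for _ in range(m)]
    S = [(p, label(p)) for p in pts]
    print("m =", m, " R' =", tightest_fit_rectangle(S))
```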
Part II – Formal Presentation

Preliminaries
The goal is to learn an unknown target function out of a known hypothesis class.
Unlike the Bayesian approach, no prior knowledge is needed.
Examples of the target function are drawn randomly according to a fixed, unknown probability distribution, and are i.i.d.

Preliminaries
We assume the sample and the test data are drawn from the same unknown distribution.
The solution is efficient: the sample size for error $\epsilon$ and confidence $1-\delta$ is a function of $\frac{1}{\epsilon}$ and $\ln\frac{1}{\delta}$, and the time to process the sample is polynomial in the sample size.

Definition of basic concepts
Let $X$ be the instance space (or example space).
Let $D$ be any fixed distribution over $X$.
A concept class over $X$ is a set $C \subseteq \{c \mid c : X \to \{0,1\}\}$.
Let $c_t \in C$ be the target concept.

Definition of basic concepts
Let $h$ be a hypothesis in a concept class $H$.
The error of $h$ with respect to $D$ and $c_t$ is:
$$error(h) = \Pr_D[h(x) \neq c_t(x)] = D(h \triangle c_t).$$
Let $EX(c_t, D)$ be a procedure (oracle) that runs in unit time and returns a labeled example $\langle x, c_t(x) \rangle$ drawn independently from $D$.
In the PAC model, the oracle is the only source of examples.

Definition of the PAC model
Let $C$ and $H$ be concept classes over $X$. We say that $C$ is PAC-learnable by $H$ if there exists an algorithm $A$ with the following property: for every $c_t \in C$, for every distribution $D$ on $X$, and for every $0 < \epsilon, \delta < \frac{1}{2}$, if $A$ is given access to $EX(c_t, D)$ and inputs $\epsilon, \delta$, then with probability at least $1-\delta$, $A$ outputs a hypothesis $h \in H$ such that:
If $c_t \in H$ (the realizable case), then $h$ satisfies $error(h) \le \epsilon$.
If $c_t \notin H$ (the unrealizable case), then $h$ satisfies $error(h) \le \min_{h' \in H} error(h') + \epsilon$.

Definition of the PAC model
We say that $C$ is efficiently PAC-learnable if $A$ runs in time polynomial in:
$\frac{1}{\epsilon}$ and $\ln\frac{1}{\delta}$,
$n$, the size of the input,
and the size of the target function (for example, the number of bits needed to characterize it).

Finite Hypothesis Class
In this section, we will see how to learn a good hypothesis from a finite hypothesis class $H$.
We define a hypothesis $h$ to be $\epsilon$-bad if $error(h) \ge \epsilon$.
In order to learn a good hypothesis, we will analyze the bad hypotheses.

The Realizable Case
We now look at the case $c_t \in H$.
After processing $m(\epsilon, \delta)$ samples, we find an $h$ that is consistent with our samples.
We know one exists, since $c_t$ is in $H$.
We now try to bound the probability that $A$ returns an $h$ that is $\epsilon$-bad; if it is below $\delta$, we have succeeded.

The Realizable Case
First, we look at a fixed $h$ that is $\epsilon$-bad:
$$\Pr[h \text{ is } \epsilon\text{-bad and } h(x_i) = c_t(x_i) \text{ for } 1 \le i \le m] \le (1-\epsilon)^m \le e^{-\epsilon m}.$$
Now, we bound the failure probability of $A$:
$$\Pr[A \text{ returns an } \epsilon\text{-bad hypothesis}] \le \Pr[\exists h \text{ } \epsilon\text{-bad with } h(x_i) = c_t(x_i) \text{ for } 1 \le i \le m]$$
$$\le \sum_{h \text{ } \epsilon\text{-bad}} \Pr[h(x_i) = c_t(x_i) \text{ for } 1 \le i \le m] \le |\{h \in H : h \text{ is } \epsilon\text{-bad}\}| \cdot (1-\epsilon)^m \le |H|(1-\epsilon)^m \le |H| e^{-\epsilon m}.$$

The Realizable Case
In order to satisfy the condition for PAC learning, we must satisfy $|H| e^{-\epsilon m} \le \delta$, which implies a sample size of
$$m \ge \frac{1}{\epsilon} \ln \frac{|H|}{\delta}.$$
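As a sanity check on the realizable-case bound, here is a minimal Python sketch (not part of the original notes) of a generic consistent learner over an explicitly enumerated finite hypothesis class, using the sample size $m \ge \frac{1}{\epsilon}\ln\frac{|H|}{\delta}$. The tiny class of threshold functions in the demo, and the function names, are illustrative assumptions.

```python
import math
import random

def realizable_sample_size(H_size, eps, delta):
    # m >= (1/eps) * ln(|H| / delta), from the analysis above
    return math.ceil((1.0 / eps) * math.log(H_size / delta))

def consistent_hypothesis(H, sample):
    """Return any hypothesis in the finite class H that is consistent with the sample.

    H: list of functions h(x) -> {0, 1}
    sample: list of (x, label) pairs
    """
    for h in H:
        if all(h(x) == y for x, y in sample):
            return h
    return None  # cannot happen in the realizable case

if __name__ == "__main__":
    # Illustrative finite class: thresholds t/100 on [0, 1], so |H| = 101.
    H = [(lambda x, t=t: 1 if x >= t / 100.0 else 0) for t in range(101)]
    target = H[37]

    eps, delta = 0.05, 0.05
    m = realizable_sample_size(len(H), eps, delta)
    sample = [(x, target(x)) for x in (random.random() for _ in range(m))]
    h = consistent_hypothesis(H, sample)
    print("m =", m, " consistent hypothesis found:", h is not None)
```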
The Unrealizable Case
We now turn to the case $c_t \notin H$.
We define $h^*$ to be the hypothesis with minimal error in $H$: $error(h^*) = \min_{h \in H} error(h)$.
We must relax our goal: a hypothesis with error at most $\epsilon$ does not necessarily exist, so we demand $error(h) \le error(h^*) + \epsilon$.
The empirical error of $h$ after $m(\epsilon, \delta)$ instances is
$$\widehat{error}(h) = \frac{1}{m} \sum_{i=1}^{m} I(h(x_i) \neq c_t(x_i)).$$

The Unrealizable Case
The new algorithm $A$: sample $m(\epsilon, \delta)$ examples and return $\hat{h} = \arg\min_{h \in H} \widehat{error}(h)$.
This algorithm is called Empirical Risk Minimization (ERM).
We will bound the sample size required so that the distance between the true and the estimated error is small for every hypothesis.

The Unrealizable Case
We hope to achieve, with probability at least $1-\delta$:
$$\forall h \in H, \quad |error(h) - \widehat{error}(h)| \le \frac{\epsilon}{2}.$$
If that is the case, then we obtain:
$$error(\hat{h}) \le \widehat{error}(\hat{h}) + \frac{\epsilon}{2} \le \widehat{error}(h^*) + \frac{\epsilon}{2} \le error(h^*) + \epsilon.$$

The Unrealizable Case
In order to bound the sample size needed to estimate the error, we use the Chernoff bound:
$$\Pr\left[\,|error(h) - \widehat{error}(h)| \ge \frac{\epsilon}{2}\,\right] \le 2\, e^{-2 (\epsilon/2)^2 m}.$$
Hence, by a union bound over all hypotheses in $H$:
$$\Pr\left[\exists h \in H : |error(h) - \widehat{error}(h)| \ge \frac{\epsilon}{2}\right] \le 2 |H|\, e^{-2 (\epsilon/2)^2 m} = 2|H|\, e^{-\epsilon^2 m / 2}.$$
And so,
$$m(\epsilon, \delta) \ge \frac{2}{\epsilon^2} \ln \frac{2|H|}{\delta}.$$

Example – Boolean Disjunctions
Given a set of boolean variables $T = \{x_1, \dots, x_n\}$, we need to learn an OR function over the literals, for example: $x_1 \vee \bar{x}_3 \vee x_5$.
$C$ is the set of all possible disjunctions. Note that $|C| = 3^n$.
We will use $H = C$.

Boolean Disjunctions – ELIM algorithm
We maintain a set $L$, which includes all literals that may still be in the disjunction.
$L$ is initialized to $\{x_1, \bar{x}_1, \dots, x_n, \bar{x}_n\}$.
For each negative sample, we remove from $L$ every literal that the sample satisfies (such a literal would wrongly make the sample positive).
Due to the previous analysis, we know the algorithm learns when $m$ is at least:
$$m \ge \frac{1}{\epsilon} \ln \frac{|H|}{\delta} = \frac{1}{\epsilon} \ln \frac{3^n}{\delta} = \frac{1}{\epsilon} \left( n \ln 3 + \ln \frac{1}{\delta} \right).$$

Example – Learning parity
Assume our concept class $C$ is the set of XOR (parity) functions over $n$ variables, for example $x_1 \oplus x_7 \oplus x_9$.
Each sample is a bit vector together with the target function's value on that vector, for example $(01101, 1)$.
Our algorithm will use Gaussian elimination over all samples, thus returning a solution consistent with the whole sample.

Example – Learning parity
This example fits within the realizable case. According to our analysis, the needed sample size $m$ is:
$$m \ge \frac{1}{\epsilon} \ln \frac{|H|}{\delta} = \frac{1}{\epsilon} \left( n \ln 2 + \ln \frac{1}{\delta} \right).$$
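To illustrate the parity example above, here is a minimal Python sketch (not part of the original notes) that finds a parity function consistent with a labeled sample via Gaussian elimination over GF(2). The random target, the sample generation, and the function names are illustrative assumptions.

```python
import random

def solve_parity(samples, n):
    """Find w in {0,1}^n with <w, x> mod 2 == y for every (x, y) in samples.

    samples: list of (x, y) where x is a sequence of n bits and y is a bit.
    Returns one consistent w (free variables set to 0), or None if inconsistent.
    """
    # Augmented matrix over GF(2): each row is x followed by y.
    rows = [list(x) + [y] for x, y in samples]
    pivot_cols = []
    r = 0
    for col in range(n):
        # Find a row with a 1 in this column at or below the current row.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] == 1), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate this column from every other row (XOR of rows).
        for i in range(len(rows)):
            if i != r and rows[i][col] == 1:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivot_cols.append(col)
        r += 1
    # Inconsistent system: an all-zero coefficient row with right-hand side 1.
    if any(all(b == 0 for b in row[:n]) and row[n] == 1 for row in rows):
        return None
    w = [0] * n
    for i, col in enumerate(pivot_cols):
        w[col] = rows[i][n]
    return w

if __name__ == "__main__":
    n = 5
    target = [random.randint(0, 1) for _ in range(n)]   # illustrative parity, e.g. x1 xor x3
    sample = []
    for _ in range(40):
        x = [random.randint(0, 1) for _ in range(n)]
        y = sum(a * b for a, b in zip(target, x)) % 2
        sample.append((x, y))
    w = solve_parity(sample, n)
    print("consistent:", all(sum(a * b for a, b in zip(w, x)) % 2 == y for x, y in sample))
```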
Part III – Occam's Razor

Occam's Razor
"Entities should not be multiplied unnecessarily" - William of Occam, c. 1320.
We will show that, under very general assumptions, an Occam algorithm produces hypotheses that are predictive for future observations.

Occam Algorithm - Definition
An $(\alpha, \beta)$ Occam algorithm for a function class $C$, using a hypothesis class $H$, is an algorithm with two parameters: a constant $\alpha \ge 0$ and a compression factor $0 \le \beta < 1$.
Given a sample $S$ of size $m$ which is labeled according to $c_t$, i.e., $\forall i \in \{1, \dots, m\}: c_t(x_i) = b_i$, the algorithm outputs a hypothesis $h \in H$ such that:
1. The hypothesis $h$ is consistent with the sample.
2. The size of $h$ is at most $n^\alpha m^\beta$, where $n$ is the size of $c_t$ and $m$ is the sample size.

Occam Algorithm and PAC
We now show that the existence of an Occam algorithm for $C$ implies polynomial PAC learnability.
Theorem: Let $A$ be an $(\alpha, \beta)$ Occam algorithm for $C$, using $H$. Then $A$ is PAC with
$$m \ge \left( \frac{2\, n^\alpha \ln 2}{\epsilon} \right)^{1/(1-\beta)} + \frac{2}{\epsilon} \ln \frac{1}{\delta}.$$

Occam Algorithm and PAC
Proof: Fix $m$ and $n$. $A$ returns a hypothesis $h$ with $size(h) \le n^\alpha m^\beta$, so the number of hypotheses $A$ may output is at most $|H_m| \le 2^{n^\alpha m^\beta}$.
Since $h$ is consistent with the sample, the realizable-case analysis applies, and for PAC learning we need
$$m \ge \frac{1}{\epsilon} \ln \frac{|H_m|}{\delta}.$$
And so it suffices that
$$m \ge \frac{1}{\epsilon} \left( n^\alpha m^\beta \ln 2 + \ln \frac{1}{\delta} \right),$$
which holds whenever
$$m \ge \frac{2}{\epsilon} \max\left\{ n^\alpha m^\beta \ln 2, \; \ln \frac{1}{\delta} \right\}.$$
The condition $m \ge \frac{2}{\epsilon} n^\alpha m^\beta \ln 2$ is equivalent to $m^{1-\beta} \ge \frac{2\, n^\alpha \ln 2}{\epsilon}$, i.e., to $m \ge \left( \frac{2\, n^\alpha \ln 2}{\epsilon} \right)^{1/(1-\beta)}$; together with $m \ge \frac{2}{\epsilon} \ln \frac{1}{\delta}$, this gives the bound in the theorem.

Example – Boolean Or with k literals
Again, we wish to learn a boolean disjunction over $n$ variables.
However, this time we know there are at most $k$ literals, so the effective hypothesis class size is roughly $n^k \ll 3^n$.
We will show an Occam algorithm that creates a hypothesis $h$ of size $O(k \ln n \ln m)$.

Example – Boolean Or with k literals
Algorithm: greedy algorithm for Set Cover.
Input: $S_1, S_2, \dots, S_t \subseteq V = \{1, \dots, m\}$.
Output: a sub-collection $S_{i_1}, S_{i_2}, \dots$ such that $\bigcup_j S_{i_j} = V$.
SetCoverGreedy(V):
(1) $S \leftarrow \emptyset$, $j \leftarrow 0$, $V_0 \leftarrow V$
(2) while $V_j \neq \emptyset$:
(3)   choose $S_i = \arg\max_{S_r} |S_r \cap V_j|$
(4)   $S \leftarrow S \cup \{i\}$
(5)   $V_{j+1} \leftarrow V_j - S_i$
(6)   $j \leftarrow j + 1$
(7) return $S$

Set Cover Algorithm - Analysis
Claim: if $S_{opt}$ covers $V$ with $k$ sets, then the greedy algorithm returns a cover with at most $k \ln m$ sets.
Proof: for every $j$, the sets of $S_{opt}$ cover $V_j$, so some set covers at least a $\frac{1}{k}$ fraction of it:
$$\exists S_r \in S_{opt} : \; |S_r \cap V_j| \ge \frac{|V_j|}{k}.$$
We bound how fast the sets $V_j$ shrink:
$$|V_{j+1}| \le \left(1 - \frac{1}{k}\right) |V_j|, \quad \text{hence} \quad |V_{j+1}| \le \left(1 - \frac{1}{k}\right)^{j+1} |V_0| \le e^{-(j+1)/k}\, m.$$
When this bound drops below 1, then $V_j = \emptyset$. Therefore, after at most $k \ln m$ steps the algorithm stops.

Solving Boolean Or with k Literals
We solve the boolean OR problem via a reduction to Set Cover; see the sketch after this part.
We define the set to be covered:
$$T = \{x : \langle x, + \rangle \in S\} \quad \text{(all positive examples)},$$
and the subsets to use:
$$S_{x_i} = \{x \in T : x_i \text{ is satisfied by } x\} \quad \text{(each literal covers the positive examples it satisfies)}.$$

Solving Boolean Or with k Literals
We know there exists a set of $k$ literals which satisfies all the positive examples, so the greedy algorithm must return $O(k \ln m)$ literals that satisfy all the positive examples (restricting the candidate literals to those not satisfied by any negative example keeps the resulting disjunction consistent with the whole sample).
There are $2n$ literals, so each literal can be encoded with $O(\log 2n)$ bits, hence
$$size(h) = O(k \ln n \ln m).$$
Since, up to constants, $\ln m \le m^\beta$ and $\ln n \le n^\alpha$ for any $\alpha, \beta > 0$ (for sufficiently large $m$ and $n$), we get an $(\alpha, \beta)$ Occam algorithm where $\alpha, \beta$ are arbitrarily small.

Solving Boolean Or with k Literals
A tighter bound on the sample size $m$ can be computed directly from $size(h) \le k \ln n \ln m$: we need
$$m \ge \frac{1}{\epsilon} \left( k \ln m \ln n \cdot \ln 2 + \ln \frac{1}{\delta} \right),$$
an implicit inequality in $m$ whose solution grows only logarithmically in $n$, rather than linearly as in the generic $\frac{1}{\epsilon}(n \ln 3 + \ln \frac{1}{\delta})$ bound.
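As a concrete companion to the reduction above, here is a minimal Python sketch (not part of the original notes) of the greedy set-cover step applied to learning a small disjunction. Restricting the candidate literals using the negative examples, the random data generation, and the function names are illustrative assumptions.

```python
import random

def greedy_set_cover(universe, subsets):
    """Greedy Set Cover: repeatedly pick the subset covering the most uncovered elements.

    universe: set of elements to cover.
    subsets: dict mapping a name to the set of elements it covers.
    Returns the chosen names (roughly k*ln|universe| of them if a cover of size k exists).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

def learn_k_literal_or(sample, n):
    """Learn a small disjunction consistent with the sample via the Set Cover reduction.

    sample: list of (x, y) with x a tuple of n bits and y in {0, 1}.
    A literal is (i, v): it is satisfied by x when x[i] == v.
    """
    positives = [x for x, y in sample if y == 1]
    negatives = [x for x, y in sample if y == 0]
    # Candidate literals: those not satisfied by any negative example (keeps h consistent).
    literals = [(i, v) for i in range(n) for v in (0, 1)
                if all(x[i] != v for x in negatives)]
    # Each candidate literal covers the positive examples it satisfies.
    universe = set(range(len(positives)))
    subsets = {lit: {j for j, x in enumerate(positives) if x[lit[0]] == lit[1]}
               for lit in literals}
    return greedy_set_cover(universe, subsets)

if __name__ == "__main__":
    n, target = 10, [(0, 1), (3, 0)]   # illustrative target: x_1 OR (not x_4), 0-indexed

    def label(x):
        return 1 if any(x[i] == v for i, v in target) else 0

    xs = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(200)]
    sample = [(x, label(x)) for x in xs]
    print("learned literals:", learn_k_literal_or(sample, n))
```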