A PAC Model for Learning from Labeled and Unlabeled Data
Maria-Florina Balcan & Avrim Blum
Carnegie Mellon University, Computer Science Department

Outline of the talk
• Supervised Learning – PAC Model
  – Sample Complexity
  – Algorithm Design
• Semi-supervised Learning – A PAC-Style Model
  – Examples of results in our model: Sample Complexity; Algorithmic Issues: Co-training of linear separators
• Conclusions – Implications of our Analysis

Usual Supervised Learning Problem
• Imagine you want a computer program to help you decide which email messages are spam and which are important.
• We might represent each message by n features (e.g., return address, keywords, spelling, etc.).
• Take a sample S of data, labeled according to whether each message was or wasn't spam.
• The goal of the algorithm is to use the data seen so far to produce a good prediction rule (a "hypothesis") h for future data.

The concept learning setting
E.g., given the data, some reasonable rules might be:
• Predict SPAM if unknown AND (sex OR sales)
• Predict SPAM if sales + sex – known > 0
• ...

Supervised Learning, Big Questions
• Algorithm Design – How might we automatically generate rules that do well on the observed data?
• Sample Complexity / Confidence Bound – What kind of confidence do we have that they will do well in the future?

Supervised Learning: Formalization (PAC)
• PAC model – a nice/standard model for learning from labeled data.
• X – instance space
• S = {(x, l)} – set of labeled examples
  – examples are assumed to be drawn i.i.d. from some distribution D over X and labeled by some target concept c*
  – labels ∈ {-1, 1} – binary classification
• We want to do optimization over S to find some hypothesis h, but we want h to have small error over D:
  – err(h) = Pr_{x∼D}[h(x) ≠ c*(x)]

Basic PAC Learning Definitions
• Algorithm A PAC-learns concept class C if for any target c* in C, any distribution D over X, and any ε, δ > 0:
  – A uses at most poly(n, 1/ε, 1/δ, size(c*)) examples and running time.
  – With probability ≥ 1−δ, A produces h in C of error at most ε. (δ – confidence, ε – accuracy)
• Notation: err(h) – true error of h; êrr(h) – empirical error of h on the sample S.

Sample Complexity: Uniform Convergence Bounds – Finite Hypothesis Spaces
• Realizable Case
  1. The probability that a fixed bad hypothesis (error > ε) is consistent with m examples is at most (1−ε)^m.
  2. So, the probability that there exists a bad consistent hypothesis is at most |C|(1−ε)^m.
  3. Set this to δ and solve: m ≥ (1/ε)[ln|C| + ln(1/δ)] examples suffice.
• If there are not too many rules to choose from, then it is unlikely that some bad one will fool you just by chance.

Sample Complexity: Uniform Convergence Bounds – Finite Hypothesis Spaces
• Agnostic Case – What if there is no perfect h?
  – By Hoeffding's bound plus a union bound, m ≥ (1/(2ε²))[ln|C| + ln(2/δ)] examples suffice so that, with probability ≥ 1−δ, every h in C has |err(h) − êrr(h)| ≤ ε.
  – This gives hope for local optimization over the training data.

Sample Complexity: Uniform Convergence Bounds – Infinite Hypothesis Spaces
• C[S] – the set of splittings of dataset S using concepts from C.
• C[m] – the maximum number of ways to split m points using concepts in C; i.e., C[m] = max_{|S|=m} |C[S]|.
• C[m, D] – the expected number of splits of m points drawn from D with concepts in C.
• Neat Fact #1: the previous results still hold if we replace |C| with C[2m].
• Neat Fact #2: we can even replace |C| with C[2m, D].
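To make the finite-class bounds concrete, here is a small Python sketch (my own illustration, not from the talk) that evaluates the realizable-case bound m ≥ (1/ε)[ln|C| + ln(1/δ)] and the agnostic-case bound m ≥ (1/(2ε²))[ln|C| + ln(2/δ)]. The function names and the example class (disjunctions over 100 boolean variables, so |C| = 2^100) are assumptions made purely for illustration.

```python
import math

def pac_sample_size_realizable(num_hypotheses, epsilon, delta):
    """m >= (1/eps)(ln|C| + ln(1/delta)) examples suffice so that, w.p. >= 1-delta,
    every h in C that is consistent with the sample has err(h) <= eps."""
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon)

def pac_sample_size_agnostic(num_hypotheses, epsilon, delta):
    """Hoeffding + union bound: m >= (1/(2 eps^2))(ln|C| + ln(2/delta)) examples
    suffice so that |err(h) - empirical error of h| <= eps for all h in C."""
    return math.ceil((math.log(num_hypotheses) + math.log(2.0 / delta)) / (2 * epsilon ** 2))

# e.g., disjunctions over n = 100 boolean variables, so |C| = 2^100:
print(pac_sample_size_realizable(2 ** 100, epsilon=0.1, delta=0.05))  # 724
print(pac_sample_size_agnostic(2 ** 100, epsilon=0.1, delta=0.05))    # 3651
```

Note how the agnostic bound pays a 1/ε² rather than 1/ε dependence, which is the usual price for uniform estimation of errors rather than mere consistency.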
Sample Complexity: Uniform Convergence Bounds – Infinite Hypothesis Spaces
• For instance, Sauer's Lemma gives C[m] = O(m^{VC-dim(C)}), which implies that in the realizable case m = O((1/ε)[VC-dim(C)·ln(1/ε) + ln(1/δ)]) labeled examples suffice.

Outline of the talk
• Supervised Learning – PAC Model
  – Sample Complexity
  – Algorithms
• Semi-supervised Learning – Proposed Model
  – Examples of results in our model: Sample Complexity; Algorithmic Issues: Co-training of linear separators
• Conclusions – Implications of our Analysis

Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning)
• A hot topic in recent years in Machine Learning.
• Many applications have lots of unlabeled data, but labeled data is rare or expensive:
  – web page and document classification
  – OCR, image classification

Combining Labeled and Unlabeled Data
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
  – Transductive SVM
  – Co-training
  – Graph-based methods

Can we extend the PAC model to deal with Unlabeled Data?
• PAC model – a nice/standard model for learning from labeled data.
• Goal – extend it naturally to the case of learning from both labeled and unlabeled data.
  – Different algorithms are based on different assumptions about how the data should behave.
  – Question – how can we capture many of the assumptions typically used?

Example of a "typical" assumption
• The separator goes through low-density regions of the space / has large margin.
  – assume we are looking for a linear separator
  – belief: there should exist one with large separation
  – (figures: an SVM trained on the labeled data only vs. a Transductive SVM that also uses the unlabeled data)

Another Example
• Agreement between two parts: co-training.
  – examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x)
  – for example, if we want to classify web pages ("Prof. Avrim Blum, My Advisor"): x = ⟨x1, x2⟩, where x1 = link info and x2 = text info

Co-training
• (figure: the "My Advisor" page classified from its link view and from its text view, with the two views agreeing on the label)

Proposed Model
• Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
• "Learn C" becomes "learn (C, χ)" (i.e. learn class C under compatibility notion χ).
• Express relationships that one hopes the target function and underlying distribution will possess.
• Goal: use unlabeled data and the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.

Proposed Model, cont.
• Goal: use unlabeled data and our belief to reduce size(C) down to size({highly compatible functions in C}) in the previous bounds.
• We want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.
• Require that the degree of compatibility be something that can be estimated from a finite sample.
• Require χ to be an expectation over individual examples:
  – χ(h, D) = E_{x∼D}[χ(h, x)] – the compatibility of h with D, with χ(h, x) ∈ [0, 1]
  – errunl(h) = 1 − χ(h, D) – the incompatibility of h with D (the unlabeled error rate of h)
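As a minimal sketch of what "estimable from a finite sample" means in practice: since χ(h, D) is an expectation of χ(h, ·) over single examples, averaging over an unlabeled sample gives an estimate of errunl(h). The concrete χ below anticipates the large-margin notion formalized on the next slides; the names chi_margin and gamma, and the choice of a linear separator through the origin, are my own illustrative assumptions.

```python
import numpy as np

def chi_margin(w, x, gamma):
    """chi(h, x) = 1 if x lies at distance >= gamma from the separator sign(w . x), else 0."""
    dist = abs(np.dot(w, x)) / np.linalg.norm(w)
    return 1.0 if dist >= gamma else 0.0

def estimated_unlabeled_error(w, unlabeled_sample, gamma):
    """Empirical unlabeled error rate: err_unl_hat(h) = 1 - average of chi(h, x) over the sample."""
    return 1.0 - float(np.mean([chi_margin(w, x, gamma) for x in unlabeled_sample]))
```

Because each χ(h, x) lies in [0, 1], standard concentration bounds apply to this average, which is exactly what allows the unlabeled sample size to be analyzed in the model.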
Margins and Compatibility
• Margins: the belief is that there should exist a large-margin (highly compatible) separator.
• Incompatibility of h and D (the unlabeled error rate of h) – the probability mass within distance γ of h.
• This can be written as an expectation over individual examples, χ(h, D) = E_{x∼D}[χ(h, x)], where:
  – χ(h, x) = 0 if dist(x, h) < γ
  – χ(h, x) = 1 if dist(x, h) ≥ γ

Margins and Compatibility
• If we do not want to commit to γ in advance, define χ(h, x) to be a smooth function of dist(x, h).
• Illegal notion of compatibility: the largest γ such that D has probability mass exactly zero within distance γ of h (this is not an expectation over individual examples).

Co-training and Compatibility
• Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩.
• The hope is that the two parts of the example are consistent.
• Legal (and natural) notion of compatibility:
  – the compatibility of ⟨h1, h2⟩ and D is Pr_{⟨x1,x2⟩∼D}[h1(x1) = h2(x2)]
  – this can be written as an expectation over examples: χ(⟨h1, h2⟩, ⟨x1, x2⟩) = 1 if h1(x1) = h2(x2), and 0 otherwise.

Examples of results in our model – Sample Complexity: Uniform Convergence Bounds
Finite Hypothesis Spaces, Doubly Realizable Case
• Assume χ(h, x) ∈ {0, 1}; define C_{D,χ}(ε) = {h ∈ C : errunl(h) ≤ ε}.
• Theorem: if c* ∈ C and errunl(c*) = 0, then m_u = (1/ε)[ln|C| + ln(2/δ)] unlabeled examples and m_l = (1/ε)[ln|C_{D,χ}(ε)| + ln(2/δ)] labeled examples suffice so that, with probability ≥ 1−δ, every h ∈ C that is consistent with the labeled data and has zero empirical unlabeled error satisfies err(h) ≤ ε.
• We bound the number of labeled examples as a measure of the helpfulness of D with respect to χ – a helpful distribution is one in which C_{D,χ}(ε) is small.

Semi-Supervised Learning: Natural Formalization (PAC)
• We say an algorithm PAC-learns (C, χ) if it runs in polynomial time and uses numbers of labeled and unlabeled examples that are polynomial in the respective bounds.
• E.g., one can think of ln|C| as the number of bits needed to describe the target without knowing D, and ln|C_{D,χ}(ε)| as the number of bits needed to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.

Examples of results in our model – Sample Complexity: Uniform Convergence Bounds
• Finite Hypothesis Spaces, c* not fully compatible: an analogous theorem holds, with C_{D,χ}(ε) replaced by the set of hypotheses whose unlabeled error rate is not much larger than errunl(c*).

Examples of results in our model – Sample Complexity: Uniform Convergence Bounds, Infinite Hypothesis Spaces
• Assume χ(h, x) ∈ {0, 1} and let χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h, x); the bounds are then stated in terms of distribution-dependent complexity measures of C and of χ(C).

Examples of results in our model – Sample Complexity: ε-Cover-based Bounds
• For algorithms that behave in a specific way:
  – first use the unlabeled data to choose a representative set of compatible hypotheses,
  – then use the labeled sample to choose among these.
• Theorem (informally): the number of labeled examples needed depends only on the size of an ε-cover of the nearly compatible hypotheses.
• This can result in a much better bound than uniform convergence!

Examples of results in our model
• Let's look at some algorithms.

Examples of results in our model – Algorithmic Issues: An algorithm for a simple (C, χ)
• X = {0,1}^n, C – the class of disjunctions, e.g. h = x1 ∨ x2 ∨ x3 ∨ x4 ∨ x7.
• For x ∈ X, let vars(x) be the set of variables set to 1 by x.
• For h ∈ C, let vars(h) be the set of variables disjoined by h.
• χ(h, x) = 1 if either vars(x) ⊆ vars(h) or vars(x) ∩ vars(h) = ∅, and χ(h, x) = 0 otherwise.
• This is a strong notion of margin:
  – every variable is either a positive indicator or a negative indicator;
  – no example should contain both positive and negative indicators.
• We can give a simple PAC-learning algorithm for this pair (C, χ).

Examples of results in our model – Algorithmic Issues: An algorithm for a simple (C, χ)
• Use the unlabeled sample U to build a graph G on n vertices:
  – put an edge between i and j if ∃ x in U with i, j ∈ vars(x).
• Use the labeled data L to label the connected components of G.
• Output h such that vars(h) is the union of the positively-labeled components.
• If c* is fully compatible, then no component will get both positive and negative labels.
• If |U| and |L| are as large as the bounds require, then w.h.p. err(h) ≤ ε.
• Worked example (n = 6): U = {011000, 101000, 001000, 000011, 100100, 100000}, giving two components {x1, x2, x3, x4} and {x5, x6}; with L labeling an example from the first component positive and 000011 negative, the output is h = x1 ∨ x2 ∨ x3 ∨ x4.
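Below is a minimal Python sketch of the graph-based learner just described, using a union-find structure for the connected components. The helper names, the choice to leave components that receive no positive label out of the output disjunction, and the particular assignment of labels in the worked example are my own assumptions for illustration.

```python
from itertools import combinations

def vars_of(x):
    """Indices of variables set to 1 in a boolean example x (a 0/1 tuple)."""
    return [i for i, bit in enumerate(x) if bit == 1]

def learn_disjunction(unlabeled, labeled, n):
    # Union-find over the n variables (vertices of G).
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    # Edge between i and j if some unlabeled example sets both to 1.
    for x in unlabeled:
        for i, j in combinations(vars_of(x), 2):
            union(i, j)

    # Label connected components using the labeled sample; if the target is
    # fully compatible, no component receives conflicting labels.
    positive_roots = set()
    for x, label in labeled:
        if label == +1:
            for i in vars_of(x):
                positive_roots.add(find(i))

    # Output the disjunction of all variables in positively-labeled components.
    return [i for i in range(n) if find(i) in positive_roots]

# The worked example from the slide: two components {1,2,3,4} and {5,6} (1-indexed).
U = [(0,1,1,0,0,0), (1,0,1,0,0,0), (0,0,1,0,0,0),
     (0,0,0,0,1,1), (1,0,0,1,0,0), (1,0,0,0,0,0)]
L = [((0,0,1,0,0,0), +1), ((0,0,0,0,1,1), -1)]
print([i + 1 for i in learn_disjunction(U, L, n=6)])  # [1, 2, 3, 4], i.e. h = x1 v x2 v x3 v x4
```

The unlabeled data does the heavy lifting here: it merges the n variables into a few components, and the labeled data only needs to label components rather than individual variables.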
Examples of results in our model – Algorithmic Issues: An algorithm for a simple (C, χ)
• An especially non-helpful distribution – the uniform distribution over all examples x with |vars(x)| = 1:
  – we get n components, so we still need Ω(n) labeled examples.
• A helpful distribution – one such that w.h.p. the number of components is small:
  – we then need a smaller number of labeled examples.

Examples of results in our model – Algorithmic Issues: Co-training of linear separators
• Examples ⟨x1, x2⟩ ∈ R^n × R^n.
• Target functions c1 and c2 are linear separators; assume c1 = c2 = c*, and that no pair crosses the target plane.
  – for a linear separator f in R^n, errunl(f) is the fraction of pairs that "cross f's boundary".
• Consistency problem: given a set of labeled and unlabeled examples, we want to find a separator that is consistent with the labeled examples and compatible with the unlabeled ones.
• This problem is NP-hard in general.

Examples of results in our model – Algorithmic Issues: Co-training of linear separators
• Assume independence given the label (both points of a pair are drawn from D+ or both from D−).
• [Blum & Mitchell '98] show that one can co-train (in polynomial time) given enough labeled data to produce a weakly-useful hypothesis to begin with.
• We show that one can learn with only a single labeled example.
• Key point: independence given the label implies that the functions with low errunl rate are:
  – close to c*,
  – close to ¬c*,
  – close to the all-positive function, or
  – close to the all-negative function.

Examples of results in our model – Algorithmic Issues: Co-training of linear separators
• Nice tool: a "super simple algorithm" for weakly learning a large-margin separator – pick c at random.
• If the margin is 1/poly(n), then a random c has at least a 1/poly(n) chance of being a weak predictor.

Examples of results in our model – Algorithmic Issues: Co-training of linear separators
• Assume independence given the label, and draw a large unlabeled sample S = {⟨x1^i, x2^i⟩}.
• If we also assume a large margin:
  – run the "super-simple algorithm" poly(n) times;
  – feed each c into the [Blum & Mitchell] booster;
  – examine all the hypotheses produced, and pick one h with small errunl that is far from the all-positive and all-negative functions;
  – use the labeled example to choose either h or ¬h.
  – W.h.p. one random c was a weakly-useful predictor, so on at least one of these runs we end up with a hypothesis h with small err(h), and hence with small errunl(h).
• If we don't assume a large margin:
  – use the Outlier Removal Lemma to make sure that at least a 1/poly fraction of the points in S1 = {x1^i} have margin at least 1/poly; this is sufficient.
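The sketch below covers only the final selection step of this procedure: it draws random directions, keeps candidates that are far from the all-positive and all-negative functions, picks the one with the smallest empirical errunl on the unlabeled pairs, and orients it with the single labeled example. The [Blum & Mitchell] boosting stage and the Outlier Removal step are omitted, and all names and thresholds (the 0.1/0.9 balance filter, 200 trials) are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def errunl(w, pairs):
    """Unlabeled error rate of sign(w . x): fraction of pairs <x1, x2>
    whose two halves fall on opposite sides of w."""
    s1 = np.sign(pairs[:, 0, :] @ w)
    s2 = np.sign(pairs[:, 1, :] @ w)
    return float(np.mean(s1 != s2))

def cotrain_one_label(pairs, labeled_x1, labeled_y, trials=200, rng=None):
    rng = np.random.default_rng(rng)
    n = pairs.shape[-1]
    best = None
    for _ in range(trials):
        w = rng.standard_normal(n)                 # "pick c at random"
        balance = np.mean(np.sign(pairs[:, 0, :] @ w) > 0)
        # keep candidates far from the all-positive and all-negative functions
        if 0.1 < balance < 0.9:
            e = errunl(w, pairs)
            if best is None or e < best[0]:
                best = (e, w)
    if best is None:                               # no balanced candidate found; fall back
        best = (errunl(w, pairs), w)
    e, w = best
    # a single labeled example decides between h and its negation
    if np.sign(labeled_x1 @ w) != labeled_y:
        w = -w
    return w
```

Here `pairs` is an (m, 2, n) array of unlabeled example pairs; in the full algorithm each random direction would first be boosted before this compatibility-based selection is applied.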
Implications of our analysis – Ways in which unlabeled data can help
• If the target is highly compatible with D and we have enough unlabeled data to estimate χ(h, D) over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).
• By providing an estimate of D, unlabeled data can allow a more refined, distribution-specific notion of hypothesis space size (such as Annealed VC-entropy or the size of the smallest ε-cover).

Questions?