A PAC Model for Learning from Labeled and Unlabeled Data

Maria-Florina Balcan & Avrim Blum
Carnegie Mellon University,
Computer Science Department
Outline of the talk
• Supervised Learning
– PAC Model
• Sample Complexity
• Algorithm Design
• Semi-supervised Learning
– A PAC Style Model
– Examples of results in our model
• Sample Complexity
• Algorithmic Issues: Co-training of linear separators
• Conclusions
– Implications of our Analysis
Usual Supervised Learning Problem
• Imagine you want a computer program to help you
decide which email messages are spam and which are
important.
• Might represent each message by n features. (e.g.,
return address, keywords, spelling, etc.).
• Take a sample S of data, labeled according to whether
they were/weren't spam.
• Goal of algorithm is to use data seen so far to produce
good prediction rule (a "hypothesis") h for future data.
The concept learning setting
E.g., represent each email by boolean features such as "unknown sender", "contains 'sex'", "contains 'sales'", "known sender", together with its spam/not-spam label.
Given such data, some reasonable rules might be:
• Predict SPAM if unknown AND (sex OR sales)
• Predict SPAM if sales + sex – known > 0.
• ...
Supervised Learning, Big Questions
• Algorithm Design
– How might we automatically generate rules that do
well on observed data?
• Sample Complexity/Confidence Bound
– What kind of confidence do we have that they will
do well in the future?
Supervised Learning: Formalization (PAC)
• PAC model – nice/standard model for learning from
labeled data.
• X - instance space
• S={(x, l)} - set of labeled examples
– examples - assumed to be drawn i.i.d. from some
distr. D over X and labeled by some target concept c*
– labels ∈ {-1, 1} – binary classification
• Want to do optimization over S to find some hypothesis
h, but we want h to have small error over D.
– err(h) = Pr_{x ∼ D}[h(x) ≠ c*(x)]
Basic PAC Learning Definitions
• Algorithm A PAC-learns concept class C if for any target c* in C, any distribution D over X, and any ε, δ > 0:
– A uses at most poly(n, 1/ε, 1/δ, size(c*)) examples and running time.
– With probability ≥ 1-δ, A produces h in C of error at most ε.
(δ – confidence parameter, ε – accuracy parameter)
• Notation:
– err(h) = Pr_{x ∼ D}[h(x) ≠ c*(x)] – the true error of h
– err_S(h) = the fraction of examples in S on which h disagrees with c* – the empirical error of h
Sample Complexity:
Uniform Convergence
Finite Hypothesis Spaces
• Realizable Case
1. Prob. that a fixed bad hypothesis (one with err > ε) is consistent with m examples is at most (1-ε)^m
2. So, prob. that there exists a bad consistent hypothesis is at most |C|(1-ε)^m
3. Set this to δ and solve: m ≥ (1/ε)[ln(|C|) + ln(1/δ)] examples suffice
• If not too many rules to choose from, then unlikely some bad
one will fool you just by chance.
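As a quick sanity check on this bound, here is a minimal sketch in Python (the class size, ε, and δ below are arbitrary illustrative numbers, not values from the talk):

```python
import math

def realizable_sample_bound(num_hypotheses, epsilon, delta):
    """Realizable-case PAC bound: smallest m with |C|*(1-eps)^m <= delta,
    using (1-eps)^m <= exp(-eps*m)."""
    return math.ceil((1.0 / epsilon) * (math.log(num_hypotheses) + math.log(1.0 / delta)))

# Example: monotone disjunctions over n = 100 variables, so |C| = 2^100,
# with eps = 0.1 and delta = 0.05 -> roughly 724 labeled examples.
print(realizable_sample_bound(2**100, 0.1, 0.05))
```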
Sample Complexity:
Uniform Convergence
Finite Hypothesis Spaces
• Realizable Case: m ≥ (1/ε)[ln(|C|) + ln(1/δ)] labeled examples suffice so that, with probability ≥ 1-δ, every h ∈ C consistent with the sample has err(h) ≤ ε.
• What if there is no perfect h? (Agnostic Case): m ≥ (1/(2ε²))[ln(|C|) + ln(2/δ)] labeled examples suffice so that, with probability ≥ 1-δ, every h ∈ C satisfies |err(h) - err_S(h)| ≤ ε.
– Empirical error then tracks true error uniformly, which gives hope for local optimization over the training data.
Sample Complexity:
Uniform Convergence
Infinite Hypothesis Spaces
• C[S] – the set of splittings of dataset S using concepts from C.
• C[m] - maximum number of ways to split m points using concepts in C; i.e., C[m] = max_{|S|=m} |C[S]|.
• C[m,D] - expected number of splits of m points drawn from D with concepts in C; i.e., C[m,D] = E_{S ∼ D^m}[|C[S]|].
• Neat Fact #1: previous results still hold if we replace |C| with
C[2m].
• Neat Fact #2: can even replace with C[2m,D].
Sample Complexity:
Uniform Convergence
Infinite Hypothesis Spaces
For instance, Sauer's Lemma gives C[m] = O(m^{VC-dim(C)}), which implies that in the realizable case m = O((1/ε)[VC-dim(C) ln(1/ε) + ln(1/δ)]) labeled examples suffice.
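The same calculation, with the VC-dimension bound plugged in for ln|C|, can be sketched as follows (the leading constants here are illustrative placeholders, not the theorem's exact values):

```python
import math

def vc_sample_bound(vc_dim, epsilon, delta):
    """Rough realizable-case bound m = O((1/eps)*(VCdim*ln(1/eps) + ln(1/delta)));
    the constants below are placeholders, not the theorem's exact values."""
    return math.ceil((4.0 / epsilon) * (vc_dim * math.log(2.0 / epsilon) + math.log(2.0 / delta)))

print(vc_sample_bound(vc_dim=10, epsilon=0.1, delta=0.05))  # an estimate for these arbitrary numbers
```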
Outline of the talk
• Supervised Learning
– PAC Model
• Sample Complexity
• Algorithms
• Semi-supervised Learning
– Proposed Model
– Examples of results in our model
• Sample Complexity
• Algorithmic Issues: Co-training of linear separators
• Conclusions
– Implications of our Analysis
Combining Labeled and Unlabeled Data
(a.k.a. Semi-supervised Learning)
• Hot topic in recent years in Machine Learning.
• Many applications have lots of unlabeled data,
but labeled data is rare or expensive:
• Web page, document classification
• OCR, Image classification
Combining Labeled and Unlabeled Data
• Several methods have been developed to try
to use unlabeled data to improve performance,
e.g.:
• Transductive SVM
• Co-training
• Graph-based methods
Can we extend the PAC model to deal
with Unlabeled Data?
• PAC model – nice/standard model for learning
from labeled data.
• Goal – extend it naturally to the case of learning
from both labeled and unlabeled data.
– Different algorithms are based on different assumptions
about how data should behave.
– Question – how to capture many of the assumptions
typically used?
Example of “typical” assumption
• The separator goes through low-density regions of the space (i.e., it has a large margin).
– assume we are looking for a linear separator
– belief: there should exist one with a large separation (margin)
[Figure: SVM using the labeled data only vs. Transductive SVM, which also uses the unlabeled data to find a large-margin separator through a low-density region.]
Another Example
• Agreement between two parts: co-training.
– examples contain two sufficient sets of features, i.e., an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e., ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x)
– for example, if we want to classify web pages: x = ⟨x1, x2⟩
[Figure: an example web page ("Prof. Avrim Blum – My Advisor") with its two views: x1 = link info, x2 = text info]
Co-training
[Figure: the two views (link info, text info) of "My Advisor" pages, each labeled + or -]
Proposed Model
• Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
• "learn C" becomes "learn (C, χ)" (i.e., learn class C under compatibility notion χ)
• Express relationships that one hopes the target
function and underlying distribution will possess.
• Goal: use unlabeled data & the belief that the target
is compatible to reduce C down to just {the highly
compatible functions in C}.
Proposed Model, cont
• Goal: use unlabeled data & our belief to reduce size(C)
down to size(highly compatible functions in C) in the
previous bounds.
• Want to be able to analyze how much unlabeled data is
needed to uniformly estimate compatibilities well.
• Require that the degree of compatibility be something
that can be estimated from a finite sample.
Proposed Model, cont
• Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
• Require that the degree of compatibility be something that can be estimated from a finite sample.
• Require χ to be an expectation over individual examples:
– χ(h,D) = E_{x ∼ D}[χ(h,x)] – compatibility of h with D, with χ(h,x) ∈ [0,1]
– err_unl(h) = 1 - χ(h,D) – incompatibility of h with D (the unlabeled error rate of h)
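Because χ is required to be an expectation over single examples, err_unl(h) can be estimated from a finite unlabeled sample by a simple average; a minimal sketch (function and variable names are illustrative, not from the talk):

```python
def estimated_unlabeled_error(chi, h, unlabeled_sample):
    """Empirical incompatibility of h: 1 minus the average per-example
    compatibility chi(h, x), where chi(h, x) is assumed to lie in [0, 1]."""
    avg_compatibility = sum(chi(h, x) for x in unlabeled_sample) / len(unlabeled_sample)
    return 1.0 - avg_compatibility
```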
Margins, Compatibility
• Margins: the belief is that there should exist a large-margin separator.
[Figure: a highly compatible separator – a linear separator with no data near its boundary]
• Incompatibility of h and D (unlabeled error rate of h) – the probability mass within distance γ of h.
• Can be written as an expectation over individual examples, χ(h,D) = E_{x ∼ D}[χ(h,x)], where:
• χ(h,x) = 0 if dist(x,h) ≤ γ
• χ(h,x) = 1 if dist(x,h) > γ
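For the margin notion above, the per-example compatibility is just a threshold on the distance to the separator; a minimal sketch for a linear separator h(x) = sign(w·x + b) with unit-norm w (an assumption made here for simplicity):

```python
import numpy as np

def margin_compatibility(w, b, x, gamma):
    """chi(h, x) for the margin notion: 0 if x lies within distance gamma
    of the separator, 1 otherwise (w is assumed to have unit norm)."""
    return 0.0 if abs(np.dot(w, x) + b) <= gamma else 1.0
```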
Margins, Compatibility
• Margins: the belief is that there should exist a large-margin separator.
[Figure: a highly compatible (large-margin) separator]
• If we do not want to commit to γ in advance, we can define χ(h,x) to be a smooth function of dist(x,h), e.g., one that increases from 0 to 1 as dist(x,h) grows.
• Illegal notion of compatibility: the largest γ s.t. D has probability mass exactly zero within distance γ of h – this is not a function of individual examples, so it cannot be written as an expectation.
Co-training, Compatibility
• Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩.
• Hope is that the two parts of the example are consistent.
• Legal (and natural) notion of compatibility:
– the compatibility of ⟨h1, h2⟩ and D: χ(⟨h1,h2⟩, D) = Pr_{⟨x1,x2⟩ ∼ D}[h1(x1) = h2(x2)]
– can be written as an expectation over examples: χ(⟨h1,h2⟩, ⟨x1,x2⟩) = 1 if h1(x1) = h2(x2), and 0 otherwise
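A minimal sketch of estimating this co-training compatibility from an unlabeled sample of pairs (names are illustrative):

```python
def cotraining_compatibility(h1, h2, unlabeled_pairs):
    """Fraction of unlabeled pairs <x1, x2> on which the two views agree,
    i.e. the empirical estimate of Pr[h1(x1) = h2(x2)]."""
    agreements = sum(h1(x1) == h2(x2) for x1, x2 in unlabeled_pairs)
    return agreements / len(unlabeled_pairs)
```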
Examples of results in our model
Sample Complexity - Uniform convergence bounds
Finite Hypothesis Spaces, Doubly Realizable Case
• Assume (h,x) 2 {0,1}; define CD,() = {h 2 C : errunl(h) · }.
Theorem
• Bound the number of labeled examples as a measure of the
helpfulness of D w.r.t to 
– a helpful distribution is one in which CD,() is small
Semi-Supervised Learning
Natural Formalization (PAC)
• We say an algorithm PAC-learns (C, χ) if it runs in polynomial time and uses numbers of labeled and unlabeled examples that are polynomial in the respective bounds.
• E.g., can think of ln|C| as the number of bits to describe the target without knowing D, and ln|C_{D,χ}(ε)| as the number of bits to describe the target knowing a good approximation to D, given the assumption that the target has a low unlabeled error rate.
Examples of results in our model
Sample Complexity - Uniform convergence bounds
Finite Hypothesis Spaces – c* not fully compatible:
Theorem (informal): if the target has unlabeled error rate err_unl(c*) = t, an analogous bound holds with |C_{D,χ}(ε)| replaced by |C_{D,χ}(t + 2ε)|, provided the unlabeled sample is large enough to estimate all compatibilities to within ε.
Examples of results in our model
Sample Complexity - Uniform convergence bounds
Infinite Hypothesis Spaces
Assume (h,x) 2 {0,1} and (C) = {h : h 2 C} where h(x) = (h,x).
Examples of results in our model
Sample Complexity, ε-Cover-based bounds
• For algorithms that behave in a specific way:
– first use the unlabeled data to choose a representative set of
compatible hypotheses
– then use the labeled sample to choose among these
Theorem
• Can result in a much better bound than uniform convergence!
Examples of results in our model
• Let’s look at some algorithms.
Examples of results in our model
Algorithmic Issues: Algorithm for a simple (C, χ)
• X = {0,1}^n, C – the class of disjunctions, e.g., h = x1 ∨ x2 ∨ x3 ∨ x4 ∨ x7
• For x ∈ X, let vars(x) be the set of variables set to 1 by x
• For h ∈ C, let vars(h) be the set of variables disjoined by h
• χ(h,x) = 1 if either vars(x) ⊆ vars(h) or vars(x) ∩ vars(h) = ∅ (and χ(h,x) = 0 otherwise)
• Strong notion of margin:
– every variable is either a positive indicator or a negative
indicator
– no example should contain both positive and negative indicators
• Can give a simple PAC-learning algorithm for this pair (C, χ).
Examples of results in our model
Algorithmic Issues: Algorithm for a simple (C, χ)
• Use the unlabeled sample U to build a graph G on n vertices:
– put an edge between i and j if ∃ x ∈ U with i, j ∈ vars(x).
• Use the labeled data L to label the connected components.
• Output h such that vars(h) is the union of the positively-labeled components.
• If c* is fully compatible, then no component will get both positive and negative labels.
• If |U| and |L| are as large as the bounds require, then w.h.p. err(h) ≤ ε.
Example: unlabeled set U = {011000, 101000, 001000, 000011, 100100}; labeled set L = {(100000, +), (000011, -)}. On variables 1..6, the examples in U connect {1,2,3,4} into one component and {5,6} into another; the labeled examples mark the first component positive and the second negative, giving h = x1 ∨ x2 ∨ x3 ∨ x4.
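A minimal Python sketch of this algorithm (union-find for the connected components; the example at the end is the one above, with variables shifted to 0-based indices):

```python
def learn_disjunction(n, unlabeled, labeled):
    """Learn a monotone disjunction assuming the target is fully compatible:
    no example mixes positive-indicator and negative-indicator variables.
    Each example is given as the set vars(x) of variables it sets to 1."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Variables co-occurring in an unlabeled example must share a component.
    for x in unlabeled:
        x = list(x)
        for j in x[1:]:
            union(x[0], j)

    # Use the labeled examples to label whole components.
    positive_roots = set()
    for x, label in labeled:
        if label == +1:
            positive_roots.update(find(i) for i in x)

    # vars(h) = union of the positively labeled components.
    return {i for i in range(n) if find(i) in positive_roots}

U = [{1, 2}, {0, 2}, {2}, {4, 5}, {0, 3}]     # 011000, 101000, 001000, 000011, 100100
L = [({0}, +1), ({4, 5}, -1)]                 # 100000 labeled +, 000011 labeled -
print(sorted(learn_disjunction(6, U, L)))     # [0, 1, 2, 3], i.e. h = x1 v x2 v x3 v x4
```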
Examples of results in our model
Algorithmic Issues: Algorithm for a simple (C, χ)
• Especially non-helpful distribution – the uniform distribution over all examples x with |vars(x)| = 1
– we get n components, so we still need Ω(n) labeled examples
• Helpful distribution – one such that w.h.p. the number of components is small
– then we need far fewer labeled examples
Examples of results in our model
Algorithmic Issues: Co-training of linear separators
• Examples ⟨x1, x2⟩ ∈ R^n × R^n.
• Target functions c1 and c2 are linear separators; assume c1 = c2 = c*, and that no pair crosses the target plane.
– For a linear separator f in R^n, err_unl(f) is the fraction of pairs that "cross f's boundary".
• Consistency problem: given a set of labeled and unlabeled
examples, want to find a separator that is consistent with
labeled examples and compatible with the unlabeled ones.
• It is NP-hard in general (hardness result due to Abie Flaxman).
Examples of results in our model
Algorithmic Issues: Co-training of linear separators
• Assume independence given the label (x1 and x2 are each drawn independently from D+ for positive examples, or from D- for negative examples).
• [Blum & Mitchell ’98] show can co-train (in polynomial time) if
have enough labeled data to produce a weakly-useful
hypothesis to begin with.
• We show that we can learn with only a single labeled example.
• Key point: independence given the label implies that the only functions with a low err_unl rate are those:
– close to c*
– close to ¬c*
– close to the all-positive function
– close to the all-negative function
Examples of results in our model
Algorithmic Issues: Co-training of linear separators
• Nice tool: a "super-simple algorithm" for weakly learning a large-margin separator:
– pick a separator c at random
• If the margin is 1/poly(n), then a random c has at least a 1/poly(n) chance of being a weak predictor.
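A small illustrative experiment (synthetic data, arbitrary parameters; not code from the talk) showing that on large-margin data some random direction already acts as a weak predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 20, 0.1                         # dimension and margin, chosen arbitrarily
w_star = np.zeros(n); w_star[0] = 1.0      # hypothetical target separator: sign(x[0])

# Synthetic data, kept only if it has margin >= gamma w.r.t. the target.
X = rng.uniform(-1, 1, size=(5000, n))
X = X[np.abs(X @ w_star) >= gamma]
y = np.sign(X @ w_star)                    # labels used here only to *measure* weak usefulness

def random_separator():
    c = rng.normal(size=n)
    return c / np.linalg.norm(c)

best = 0.0
for _ in range(200):                       # poly-many random tries
    c = random_separator()
    acc = np.mean(np.sign(X @ c) == y)
    best = max(best, acc, 1 - acc)         # c or -c, whichever agrees more with the target
print(best)                                # typically noticeably above 1/2
```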
Examples of results in our model
Algorithmic Issues: Co-training of linear separators
• Assume independence given the label.
• Draw a large unlabeled sample S = {⟨x1^i, x2^i⟩}.
• If we also assume a large margin:
– run the "super-simple algorithm" poly(n) times
– feed each c into the [Blum & Mitchell] booster
– examine all the hypotheses produced, and pick one h with small err_unl that is far from the all-positive and all-negative functions
– use the labeled example to choose between h and ¬h
• W.h.p. one random c was a weakly-useful predictor, so on at least one of these runs we end up with a hypothesis h with small err(h), and hence with small err_unl(h).
• If we do not assume a large margin:
– use the Outlier Removal Lemma to ensure that at least a 1/poly fraction of the points in S1 = {x1^i} have margin at least 1/poly; this is sufficient.
Implications of our analysis
Ways in which unlabeled data can help
• If the target is highly compatible with D and we have enough unlabeled data to estimate χ(h, D) over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).
• By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as annealed VC-entropy or the size of the smallest ε-cover).
Questions?