Boosting and Semi-supervised Learning
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
String Kernels
Compare two sequences for similarity
K( ACAAGAT GCCATTG TCCCCCG GCCTCCT GCTGCTG , GCATGAC GCCATTG ACCTGCT GGTCCTA ) = 0.7
Exact matching kernel
Counts all matching substrings
Flexible weighting scheme
Does not work well in the noisy case
Successful applications in bioinformatics
Linear-time algorithm using suffix trees
Exact matching string kernels
Bag of Characters
Count single characters; set $w_s = 0$ for $|s| > 1$
Bag of Words
Substrings $s$ are bounded by whitespace
Limited-range correlations
Set $w_s = 0$ for all $|s| > n$, given a fixed $n$
K-spectrum kernel
Account for matching substrings of length $k$; set $w_s = 0$ for all $|s| \neq k$
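To make the k-spectrum kernel concrete, here is a minimal Python sketch (a naive counting version with uniform weights; the suffix trees below are what make it linear time):

```python
from collections import Counter

def k_spectrum_kernel(s, t, k=3):
    """k-spectrum kernel with uniform weights: counts pairs of
    matching length-k substrings between the two strings."""
    phi_s = Counter(s[i:i+k] for i in range(len(s) - k + 1))
    phi_t = Counter(t[i:i+k] for i in range(len(t) - k + 1))
    # Inner product of the two substring-count vectors.
    return sum(c * phi_t[u] for u, c in phi_s.items())

print(k_spectrum_kernel("ACAAGATGCCATTG", "GCATGACGCCATTG", k=3))
```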
Suffix trees
Definition: a compact tree built from all the suffixes of a string.
E.g., the suffix tree of ababc is denoted S(ababc)
Node label = the unique path from the root
Suffix links are used to speed up parsing of strings: if we are at node $ax$, then suffix links help us jump to node $x$
Represents all the substrings of a given string
Can be constructed in linear time and stored in linear space
Each leaf corresponds to a unique suffix
The number of leaves in the subtree below a node gives the number of occurrences of the corresponding substring
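The linear-time construction (e.g., Ukkonen's algorithm) is too long to sketch here, but the leaf-counting property is easy to illustrate naively:

```python
def suffixes(s):
    """All suffixes of s: each leaf of the suffix tree S(s)
    corresponds to exactly one of these."""
    return [s[i:] for i in range(len(s))]

def count_occurrences(s, sub):
    """Occurrences of sub in s = suffixes of s that start with sub
    = leaves in the subtree below node 'sub' in S(s)."""
    return sum(1 for suf in suffixes(s) if suf.startswith(sub))

print(suffixes("ababc"))                 # ['ababc', 'babc', 'abc', 'bc', 'c']
print(count_occurrences("ababc", "ab"))  # 2: 'ab' occurs twice
```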
Graph Kernels
Each data point is itself a graph, and the kernel is a similarity measure between graphs
Use a graph isomorphism test to design the kernel
The similarity score should be high for identical graphs and low for very different graphs
Efficient computation
The resulting similarity measure has to be positive semidefinite
A graph isomorphism test tries to determine whether two graphs are identical (e.g., the Weisfeiler-Lehman algorithm)
Modify a graph isomorphism test to design kernels
Weisfeiler-Lehman algorithm
Use multiset labels to capture neighborhood structure in a
graph
If two graphs are the same, then the multiset labels should be
the same as well
Weisfeiler-Lehman algorithm II
Relabel graph and construct new multiset label
Check whether the new labels are the same
Weisfeiler-Lehman algorithm III
If the new multiset labels are not the same, stop the algorithm and declare that the two graphs are not the same
Effectively, the algorithm unrolls the graph into trees rooted at each node and uses multiset labels to identify these trees
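A minimal sketch of one relabeling step (label compression to short integers is omitted for clarity):

```python
def wl_relabel(adj, labels):
    """One Weisfeiler-Lehman relabeling step: each node's new label
    is its old label together with the sorted multiset of its
    neighbors' labels."""
    return {v: labels[v] + "|" + ",".join(sorted(labels[u] for u in adj[v]))
            for v in adj}

# Toy graph: a path 1-2-3, all nodes initially labeled 'a'.
adj = {1: [2], 2: [1, 3], 3: [2]}
print(wl_relabel(adj, {1: "a", 2: "a", 3: "a"}))
# {1: 'a|a', 2: 'a|a,a', 3: 'a|a'}
```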
Design kernel from isomorphism test
Graph kernel example
Feature vectors are produced in each iteration, and they are concatenated into one long feature vector
The feature vector counts the occurrences of unique labels
Graph kernel example II
Relabel the graph and obtain a new feature vector
When the maximum iteration is reached, take the inner product of the two feature vectors to obtain the kernel value
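Putting the pieces together, a sketch of the resulting kernel computation (assumes `wl_relabel` from the sketch above):

```python
from collections import Counter

def wl_kernel(adj1, labels1, adj2, labels2, h=2):
    """WL subtree kernel sketch: at each of h+1 iterations (iteration 0
    uses the original labels), count label occurrences in each graph
    and add the inner product of the two count vectors."""
    k = 0
    for _ in range(h + 1):
        c1, c2 = Counter(labels1.values()), Counter(labels2.values())
        k += sum(n * c2[l] for l, n in c1.items())   # inner product
        labels1, labels2 = wl_relabel(adj1, labels1), wl_relabel(adj2, labels2)
    return k
```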
Combining classifiers
Ensemble learning: a machine learning paradigm where multiple learners are used to solve a problem
Previously: one problem, one learner
Ensemble: one problem, many learners, combined
The generalization ability of the ensemble is usually
significantly better than that of an individual learner
Boosting is one of the most important families of ensemble
methods
Combining classifiers
Average results from several different models
Bagging
Stacking (meta-learning)
Boosting
Why?
Better classification performance than individual classifiers
More resilience to noise
Concerns
Takes more time to obtain the final model
Overfitting
Bagging
Bagging: Bootstrap aggregating
Generate B bootstrap samples of the training data: uniformly random sampling with replacement
Train a classifier or a regression function on each bootstrap sample
For classification: majority vote on the classification results
For regression: average of the predicted values
(A minimal code sketch follows the example table below.)
Original training set: 1 2 3 4 5 6 7 8
Training set 1:        2 7 8 3 7 6 3 1
Training set 2:        7 8 5 6 4 2 7 1
Training set 3:        3 6 2 7 5 6 2 2
Training set 4:        4 5 1 4 6 4 3 8
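A minimal Python sketch of the bagging procedure above; `learn` is an assumed helper that fits a base classifier on a sample:

```python
import random

def bagging_predict(train, learn, x, B=25):
    """Bagging sketch: train on B bootstrap samples and take a
    majority vote. `learn(sample)` returns a base classifier h,
    with h(x) returning -1 or +1."""
    votes = 0
    for _ in range(B):
        bootstrap = [random.choice(train) for _ in train]  # with replacement
        votes += learn(bootstrap)(x)
    return 1 if votes >= 0 else -1
```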
Stacking classifiers
Level-0 models are based on different learning models and use
original data (level-0 data)
Level-1 models are based on the results of the level-0 models (level-1 data are the outputs of the level-0 models); the level-1 model is also called the "generalizer"
If you have lots of models, you can stack them into deeper hierarchies
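A tiny sketch of the level-0/level-1 split, assuming both levels are already-trained callables:

```python
def stacked_predict(level0_models, level1_model, x):
    """Stacking sketch: the level-1 model (the 'generalizer') takes
    the level-0 models' outputs as its input features."""
    level1_features = [h(x) for h in level0_models]
    return level1_model(level1_features)
```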
Boosting
Boosting: a general method for converting rough rules of thumb into a highly accurate prediction rule
A family of methods which produce a sequence of classifiers
Each classifier is dependent on the previous one and focuses on
the previous one’s errors
Examples that are incorrectly predicted in the previous classifiers
are chosen more often or weighted more heavily when
estimating a new classifier.
Questions:
How to choose “hardest” examples?
How to combine these classifiers?
AdaBoost
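A minimal Python sketch of the standard AdaBoost procedure; `weak_learn` is an assumed helper that fits a weak classifier on the weighted sample:

```python
import math

def adaboost(X, y, weak_learn, T=50):
    """AdaBoost sketch. weak_learn(X, y, w) returns a weak classifier h
    (h(x) in {-1, +1}) trained on the weighted sample; y[i] in {-1, +1}."""
    m = len(X)
    w = [1.0 / m] * m                       # start with uniform weights
    ensemble = []                           # pairs (alpha_t, h_t)
    for _ in range(T):
        h = weak_learn(X, y, w)
        eps = sum(w[i] for i in range(m) if h(X[i]) != y[i])  # weighted error
        if eps == 0 or eps >= 0.5:          # perfect, or no better than random
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Upweight misclassified examples, downweight correct ones.
        w = [w[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        z = sum(w)
        w = [wi / z for wi in w]            # renormalize
        ensemble.append((alpha, h))
    # Final classifier: sign of the weighted combination.
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```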
Adaboost flow chart
Original training set → Data set 1 → Learner 1 → Data set 2 → Learner 2 → ... → Data set T → Learner T → weighted combination
Training instances that are wrongly predicted by Learner 1 play more important roles in the training of Learner 2, and so on down the sequence
Toy Example
Weak classifier (rule of thumb): vertical or horizontal half-planes
Uniform weights on all examples
Boosting round 1
Choose a rule of thumb (weak classifier)
Some data points obtain higher weights because they are
classified incorrectly
Boosting round 2
Choose a new rule of thumb
Reweight again: the weights of incorrectly classified examples are increased
Boosting round 3
Repeat the same process
Now we have 3 classifiers
Boosting aggregate classifier
The final classifier is a weighted combination of the weak classifiers
Theoretical properties
Y. Freund and R. Schapire [JCSS97] proved that the training error of AdaBoost is bounded by
$$\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t (1-\epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2},$$
where $\gamma_t = 1/2 - \epsilon_t$ is the edge of the base classifier at round $t$.
Thus, if each base classifier is slightly better than random, so that $\gamma_t \geq \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast in $T$, since the above bound is at most $e^{-2\gamma^2 T}$.
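A quick numerical check of the bound, with an assumed constant edge of 0.1 per round:

```python
import math

# If every round has edge gamma_t = 0.1 (weighted error 0.4), the
# training error bound after T = 200 rounds is already tiny:
gamma, T = 0.1, 200
print(math.exp(-2 * T * gamma ** 2))        # ~0.018  (exponential bound)
print(math.sqrt(1 - 4 * gamma ** 2) ** T)   # ~0.017  (exact per-round product)
```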
How will train/test error behave?
Expect: training error to continue to drop (or reach zero)
First guess: test error to increase when the combined classifier becomes too complex
Overfitting: hard to know when to stop training
Actual experimental observation
Test error does not increase, even after 1000 rounds
Test error continues to drop even after the training error is zero!
Theoretical properties (cont'd)
Y. Freund and R. Schapire [JCSS97] also bounded the generalization error:
$$\Pr[f(x) \neq y] \leq \tilde{O}\left(\sqrt{\frac{Td}{m}}\right),$$
where $m$ is the number of samples and $d$ is the VC dimension of the weak learner
This suggests AdaBoost will overfit if $T$ is large. However, empirical studies show that AdaBoost often does not overfit.
Classification error only measures whether a classification is right or wrong; we also need to consider the confidence of the classifier
Margin-based bound: for any $\theta > 0$, with high probability,
$$\Pr[\mathrm{margin}_f(x, y) \leq \theta] \leq \tilde{O}\left(\sqrt{\frac{d}{m\theta^2}}\right)$$
Applications of boosting
AdaBoost and its variants have been applied to diverse
domains with great success. Here I only show one example
P. Viola & M. Jones [CVPR’01] combined AdaBoost with a
cascade process for face detection
They regarded rectangular features as weak classifiers
Applications of boosting
By using AdaBoost to weight the weak classifiers, they got two
very intuitive features for face detection
Finally, a very strong face detector: on a 466MHz SUN machine, a 384×288 image requires only 0.067 seconds! (On average, only 8 features need to be evaluated per image.)
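One way such rectangular features are made cheap to evaluate is the integral image from the Viola-Jones paper; a minimal sketch (function names are my own):

```python
def integral_image(img):
    """Integral image: ii[y][x] = sum of img over rows < y and cols < x.
    This is what lets each rectangular feature be evaluated in O(1)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii

def rect_sum(ii, y, x, h, w):
    """Pixel sum of the h-by-w rectangle with top-left corner (y, x)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def two_rect_feature(ii, y, x, h, w):
    """A two-rectangle feature: left half minus right half of a
    2w-wide window -- one of the weak classifiers' raw inputs."""
    return rect_sum(ii, y, x, h, w) - rect_sum(ii, y, x + w, h, w)
```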
Boosting for face detection
Semi-supervised learning
Supervised Learning = learning from labeled data. Dominant
paradigm in Machine Learning.
E.g., say you want to train an email classifier to distinguish spam
from important messages
Take sample S of data, labeled according to whether they
were/weren’t spam.
Basic paradigm has many successes
recognize speech
steer a car
classify documents
classify proteins
recognize faces and objects in images
...
Labeled data can be rare or expensive
Need to pay someone to do it, requires special testing …
Unlabeled data is much cheaper
Can we make use of cheap unlabeled data?
Unlabeled data is missing the most important information (the label)
But maybe it still has useful regularities that we can use
Three semi-supervised methods:
Co-training
Semi-Supervised (Transductive) SVM [Joachims98]
Graph-based methods
Co-training
Many problems have two different sources of info you can use to determine the label.
E.g., classifying webpages: you can use the words on the page, or the words on links pointing to the page.
[Figure: a faculty page for "Prof. Avrim Blum" with an incoming link labeled "My Advisor": x = (link info, text info), split into the views x1 = link info and x2 = text info]
Co-training
Idea: Use small labeled sample to learn initial rules.
E.g., “my advisor” pointing to a page is a good indicator it is a
faculty home page.
E.g., “I am teaching” on a page is a good indicator it is a faculty
home page.
Co-training
Then look for unlabeled examples where one rule is confident
and the other is not. Have it label the example for the other.
Training 2 classifiers, one on each type of info. Using each to
help train the other.
[Figure: a pool of unlabeled examples ⟨x1, x2⟩]
Co-training
Turns out a number of problems can be set up this way.
E.g., [Levin-Viola-Freund03] identifying objects in images. Two
different kinds of preprocessing.
E.g., [Collins&Singer99] named-entity extraction.
– “I arrived in London yesterday”
Co-training
Setting is each example x = (x1,x2), where x1, x2 are two “views”
of the data.
Have separate algorithms running on each view. Use each to
help train the other.
Basic hope is that two views are consistent. Using agreement
as proxy for labeled data.
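A minimal sketch of the co-training loop; `learn1`, `learn2`, and the model interface (`.predict`, `.confidence`) are assumed helpers:

```python
def co_train(labeled, unlabeled, learn1, learn2, rounds=10, threshold=0.9):
    """Co-training sketch. Each example is a pair (x1, x2) of views.
    learn1/learn2 fit a classifier on one view and return a model
    with .predict(view) -> label and .confidence(view) -> [0, 1]."""
    pool1, pool2 = list(labeled), list(labeled)   # labeled pools per view
    for _ in range(rounds):
        h1 = learn1([(x1, y) for (x1, x2), y in pool1])
        h2 = learn2([(x2, y) for (x1, x2), y in pool2])
        for x1, x2 in list(unlabeled):
            if h1.confidence(x1) >= threshold:    # view-1 rule is confident:
                pool2.append(((x1, x2), h1.predict(x1)))  # teach classifier 2
                unlabeled.remove((x1, x2))
            elif h2.confidence(x2) >= threshold:  # view-2 rule is confident:
                pool1.append(((x1, x2), h2.predict(x2)))  # teach classifier 1
                unlabeled.remove((x1, x2))
    return h1, h2
```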
Toy example: intervals
As a simple example, suppose x1, x2 ∈ ℝ. The target function is an interval in each view: an example is positive iff x1 ∈ [a1, b1], equivalently iff x2 ∈ [a2, b2].
[Figure: positive examples lie in the rectangle [a1, b1] × [a2, b2]]
Webpage classification
12 labeled examples, 1000 unlabeled
(sample run)
Image classification
Visual detectors with different kinds of processing
[Levin-Viola-Freund '03]: images with 50 labeled cars, 22,000 unlabeled images; factor 2-3+ improvement
Semi-Supervised SVM (S3VM)
Suppose we believe the decision boundary goes through low-density regions of the space / has a large margin
Aim for classifiers with a large margin with respect to both labeled and unlabeled data (L + U)
[Figure: SVM trained on labeled data only vs. S3VM, whose boundary also avoids the dense regions of unlabeled data]
Semi-Supervised SVM (S3VM)
Unfortunately, the optimization problem is now NP-hard. Algorithms instead do local optimization:
Start with a large margin over the labeled data. This induces labels on U.
Then try flipping labels in a greedy fashion.
Or use branch-and-bound and other methods (Chapelle et al., 2006)
Quite successful on text data.
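A sketch of the greedy label-flipping search; `train_svm` and `objective` are assumed helpers (the objective scores a full labeling, higher meaning larger margin):

```python
def s3vm_local_search(labeled, unlabeled, train_svm, objective, passes=5):
    """S3VM local-search sketch: start from the supervised SVM's labels
    on U, then greedily flip unlabeled labels that improve the margin
    objective."""
    f = train_svm(labeled)                   # large margin on labeled data
    y_u = [f(x) for x in unlabeled]          # labels induced on U
    best = objective(labeled, list(zip(unlabeled, y_u)))
    for _ in range(passes):
        improved = False
        for i in range(len(unlabeled)):
            y_u[i] = -y_u[i]                 # tentatively flip one label
            score = objective(labeled, list(zip(unlabeled, y_u)))
            if score > best:
                best, improved = score, True # keep the flip
            else:
                y_u[i] = -y_u[i]             # undo the flip
        if not improved:                     # local optimum reached
            break
    return y_u
```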
Graph-based methods
Suppose that very similar examples probably have the same
label
If you have a lot of labeled data, this suggests a Nearest-Neighbor type of algorithm
If you have a lot of unlabeled data, perhaps can use them as
“stepping stones”
E.g., handwritten digits [Zhu07]:
Graph-based methods
Idea: construct a graph with edges between very similar
examples.
Unlabeled data can help “glue” the objects of the same class
together.
Suppose just two labels: 0 & 1. Solve for labels f(x) for the unlabeled examples x to minimize:
Minimum cut: $\sum_{e=(u,v)} |f(u) - f(v)|$
Minimum "soft-cut": $\sum_{e=(u,v)} (f(u) - f(v))^2$
Spectral partitioning
Label propagation: average of neighbor labels
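A minimal sketch of label propagation for the soft-cut objective (the harmonic-function iteration; graph representation is my own choice):

```python
def label_propagation(adj, seeds, iterations=100):
    """Label propagation sketch: repeatedly replace each unlabeled
    node's value f(x) by the average of its neighbors' values,
    keeping labeled nodes clamped at their 0/1 seed labels.
    `adj` maps each node to its list of neighbors."""
    f = {v: seeds.get(v, 0.5) for v in adj}      # unlabeled nodes start at 0.5
    for _ in range(iterations):
        for v in adj:
            if v not in seeds:                   # labeled nodes stay clamped
                f[v] = sum(f[u] for u in adj[v]) / len(adj[v])
    return f  # threshold at 0.5 to read off predicted labels
```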