Techniques For Exploiting Unlabeled Data
Thesis Defense
Mugizi Rwebangira
September 8, 2008
Committee:
Avrim Blum, CMU (Co-Chair)
John Lafferty, CMU (Co-Chair)
William Cohen, CMU
Xiaojin (Jerry) Zhu, Wisconsin
Motivation
Supervised Machine Learning:
Labeled examples {(x_i, y_i)}
↓ induction
Model: x → y
Problems: Document classification, image classification, protein
sequence determination.
Algorithms: SVM, Neural Nets, Decision Trees, etc.
2
Motivation
In recent years, there has been growing
interest in techniques for using unlabeled
data:
More data is being collected than ever before.
Labeling examples can be expensive and/or
require human intervention.
3
Examples
Images: abundantly available (digital cameras); labeling requires humans (e.g., CAPTCHAs).
Web pages: easily crawled, but labeling requires human intervention.
Proteins: the sequence can be determined easily, but structure determination is a hard problem.
4
Motivation
Semi-Supervised Machine Learning:
Labeled examples {(x_i, y_i)} + unlabeled examples {x_i}
↓
Model: x → y
5
Motivation
[Figure: a few labeled points (+, -) among many unlabeled points]
6
However…
Techniques for using unlabeled data are not as well developed as supervised techniques:
Best practices for using unlabeled data
Techniques for adapting supervised algorithms to the semi-supervised setting
7
Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Conclusion and Questions
8
Graph Mincut (Blum & Chawla, 2001)
9
Construct an (unweighted) Graph
10
Add auxiliary “super-nodes”
[Figure: the graph with a “+” super-node connected to the positively labeled examples and a “-” super-node connected to the negatively labeled examples]
11
Obtain s-t mincut
[Figure: the minimum s-t cut (labeled “Mincut”) separating the “+” super-node from the “-” super-node]
12
Classification
[Figure: unlabeled nodes classified according to which side of the mincut they fall on]
13
Problem
Plain mincut can give very unbalanced cuts.
[Figure: an example of a very unbalanced cut]
14
Solution
Add random weights to the edges.
Run plain mincut and obtain a classification.
Repeat the above process several times.
For each unlabeled example, take a majority vote (see the sketch below).
15
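A rough sketch of this randomized-mincut procedure (not the thesis code): perturb the unit edge weights, take the s-t mincut between the “+” and “-” super-nodes with networkx, repeat, and classify each unlabeled node by majority vote. The noise level and number of runs here are illustrative choices.

```python
import random
from collections import Counter

import networkx as nx

def randomized_mincut(G, pos_nodes, neg_nodes, unlabeled, runs=20, noise=0.1):
    H = nx.Graph()
    for u, v in G.edges():
        H.add_edge(u, v, capacity=1.0)               # unweighted base graph
    for u in pos_nodes:                              # auxiliary super-nodes tied
        H.add_edge('+', u, capacity=float('inf'))    # to the labeled examples
    for u in neg_nodes:
        H.add_edge('-', u, capacity=float('inf'))

    votes = Counter()
    for _ in range(runs):
        R = H.copy()
        for u, v, data in R.edges(data=True):
            if data['capacity'] != float('inf'):
                data['capacity'] = 1.0 + noise * random.random()  # random weights
        _, (pos_side, _) = nx.minimum_cut(R, '+', '-')
        for u in unlabeled:
            votes[u] += 1 if u in pos_side else -1

    # Majority vote; |votes[u]| / runs can be read as the "margin" of the vote.
    return {u: (+1 if votes[u] > 0 else -1) for u in unlabeled}
```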
Before adding random weights
[Figure: the mincut (labeled “Mincut”) obtained before adding random weights]
16
After adding random weights
[Figure: the mincut (labeled “Mincut”) obtained after adding random weights]
17
PAC-Bayes
• PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance.
• In this case each distinct cut corresponds to
a different hypothesis.
• Hence the average of these cuts will be less
likely to overfit than any single cut.
18
Markov Random Fields
• Ideally we would like to assign a weight to
each cut in the graph (a higher weight to
small cuts) and then take a weighted vote
over all the cuts in the graph.
• This corresponds to a Markov Random
Field model.
• We don’t know how to do this efficiently,
but we can view randomized mincuts as an
approximation.
19
How to construct the graph?
• k-NN
– Graph may not have small balanced cuts.
– How to learn k?
• Connect all points within distance δ
– Can have disconnected components.
– How to learn δ?
• Minimum Spanning Tree (sketched below)
– No parameters to learn.
– Gives connected, sparse graph.
– Seems to work well on most datasets.
20
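A rough sketch of the minimum-spanning-tree graph construction from the list above, assuming Euclidean distances; scipy’s MST routine does the work and there are no parameters to tune.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_graph(X):
    D = squareform(pdist(X))                 # dense pairwise Euclidean distances
    T = minimum_spanning_tree(D).toarray()   # nonzero entries are MST edges
    A = ((T + T.T) > 0).astype(int)          # symmetric, unweighted adjacency
    return A
```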
Experiments
• ONE vs. TWO: 1128 examples (8 × 8 arrays of integers, Euclidean distance).
• ODD vs. EVEN: 4000 examples (16 × 16 arrays of integers, Euclidean distance).
• PC vs. MAC: 1943 examples (20 Newsgroups dataset, TFIDF distance).
21
ONE vs. TWO
22
ODD vs. EVEN
23
PC vs. MAC
24
Summary
Randomization helps plain mincut achieve performance comparable to Gaussian Fields.
We can apply PAC sample complexity analysis and interpret it in
terms of Markov Random Fields.
There is an intuitive interpretation for the confidence of a
prediction in terms of the “margin” of the vote.
“Semi-supervised Learning Using Randomized Mincuts”,
A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy,
ICML 2004
25
Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Conclusion and Questions
26
(Supervised) Linear Regression
[Figure: labeled data points (*) plotted in the (x, y) plane]
27
Semi-Supervised Regression
[Figure: labeled points (*) and unlabeled points (+) plotted in the (x, y) plane]
28
Smoothness assumption
Things that are close together should have similar values
One way of doing this:
Minimize
ξ(f) = ∑_ij w_ij (f_i − f_j)²
where w_ij is the similarity between examples i and j, and f_i and f_j are the predictions for examples i and j.
Gaussian Fields (Zhu, Ghahramani & Lafferty)
29
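For reference, a small sketch of the closed-form minimizer of this objective with the labeled values held fixed (the harmonic solution of Zhu, Ghahramani & Lafferty). The ordering of W with the labeled points first is an assumption of this snippet.

```python
import numpy as np

def harmonic_solution(W, y_l):
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # graph Laplacian
    L_uu = L[l:, l:]                           # unlabeled-unlabeled block
    W_ul = W[l:, :l]                           # unlabeled-labeled similarities
    f_u = np.linalg.solve(L_uu, W_ul @ y_l)    # predictions on the unlabeled points
    return f_u
```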
Local Constancy
The predictions made by Gaussian Fields are locally constant
[Figure: the prediction as a function of x is flat between u and u + Δ]
More formally: m(u + Δ) ≈ m(u)
30
Local Linearity
For many regression tasks we would prefer predictions to be
locally linear.
[Figure: the prediction as a function of x changes linearly between u and u + Δ]
More formally: m(u + Δ) ≈ m(u) + m′(u)Δ
31
Problem
Develop a version of Gaussian Fields which is locally linear, or equivalently a semi-supervised version of linear regression:
Local Linear Semi-supervised Regression (LLSR)
32
Local Linear Semi-supervised Regression
By analogy with
∑_ij w_ij (f_i − f_j)²
[Figure: local linear fits β_i at x_i and β_j at x_j; the marked vertical gap between β_i0 and X_ji^T β_j gives the penalty term (β_i0 − X_ji^T β_j)²]
33
Local Linear Semi-supervised Regression
So we find β to minimize the following objective function:
ξ(β) = ∑_ij w_ij (β_i0 − X_ji^T β_j)²
where w_ij is the similarity between x_i and x_j.
34
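One plausible LaTeX rendering of this objective, spelling out the compressed notation on the slide; the exact definition of X_ji (a local coordinate vector of x_i around x_j) is an assumption here, not something stated on the slide.

```latex
\xi(\beta) \;=\; \sum_{i,j} w_{ij}\,\bigl(\beta_{i0} - X_{ji}^{\top}\beta_j\bigr)^2,
\qquad
X_{ji} \;=\; \begin{pmatrix} 1 \\ x_i - x_j \end{pmatrix},
\qquad
\beta_j \;=\; \begin{pmatrix} \beta_{j0} \\ \beta_{j,1:d} \end{pmatrix}
```

Under that reading, X_ji^T β_j is example j’s local linear fit evaluated at x_i, which the penalty compares against β_i0, the fitted value at x_i itself.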
Synthetic Data: Gong
Gong function: y = (1/x) sin(15/x)
σ² = 0.1 (noise)
35
Experimental Results: GONG
Weighted Kernel Regression, MSE=25.7
36
Experimental Results: GONG
Local Linear Regression, MSE=14.4
37
Experimental Results: GONG
LLSR, MSE=7.99
38
PROBLEM: RUNNING TIME
If we have n examples and dimension d, then to compute a closed-form solution we have to invert an n(d+1) × n(d+1) matrix.
This is prohibitively expensive, especially if d is large.
For example, if n = 1500 and d = 199 then n(d+1) = 300,000, and a 300,000 × 300,000 matrix takes about 720 GB in Matlab’s double-precision format.
39
SOLUTION: ITERATION
It turns out that, because of the form of the equation, we can start from an arbitrary initial guess and do an iterative computation that provably converges to the desired solution.
In the case of n = 1500 and d = 199, instead of dealing with a 720 GB matrix we only have to store about 2.4 MB in memory, which makes the algorithm much more practical.
40
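The slide does not give the exact iteration, but the general idea can be illustrated with a matrix-free iterative solver: supply the system matrix only as a matrix-vector product and never form or invert it. Conjugate gradients is used here on the assumption that the system is symmetric positive definite, which holds for the normal equations of a quadratic objective like LLSR’s; this is a generic sketch, not the thesis implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_iteratively(apply_A, b):
    n = b.shape[0]
    A = LinearOperator((n, n), matvec=apply_A)   # A is only ever applied, never stored
    x, info = cg(A, b)                           # successive approximations x1, x2, ..., xk
    if info != 0:
        raise RuntimeError("conjugate gradients did not converge")
    return x
```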
Experiments on Real Data
We do model selection using leave-one-out cross-validation (a sketch follows below).
We compare:
Weighted Kernel Regression (WKR) – a purely supervised method.
Local Linear Regression (LLR) – another purely supervised method.
Local Learning Regularization (LL-Reg) – an up-to-date semi-supervised method.
Local Linear Semi-supervised Regression (LLSR).
For each algorithm and dataset we give:
1. The mean and standard deviation over 10 runs.
2. The results with an OPTIMAL choice of parameters.
41
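A generic sketch of the leave-one-out model selection referred to above, written for a fit/predict-style regressor; make_model and the candidate bandwidths are placeholder names, not the thesis code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

def loocv_select(make_model, X, y, bandwidths):
    best_h, best_mse = None, np.inf
    for h in bandwidths:
        errs = []
        for train, test in LeaveOneOut().split(X):
            model = make_model(h).fit(X[train], y[train])
            errs.append((model.predict(X[test])[0] - y[test][0]) ** 2)
        mse = float(np.mean(errs))
        if mse < best_mse:
            best_h, best_mse = h, mse
    return best_h, best_mse
```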
Experimental Results
Mean ± standard deviation over 10 runs (n = examples, d = dimension, n_l = labeled examples):

Dataset      n    d   n_l   LLSR      LLSR-OPT   WKR       WKR-OPT
Carbon       58   1   10    27±25     19±11      70±36     37±11
Alligators   25   1   10    288±176   209±162    336±210   324±211
Smoke        25   1   10    82±13     79±13      83±19     80±15
Autompg      392  7   100   50±2      49±1       57±3      57±3

Dataset      n    d   n_l   LLR       LLR-OPT    LL-Reg    LL-Reg-OPT
Carbon       58   1   10    57±16     54±10      162±199   74±22
Alligators   25   1   10    207±140   207±140    289±222   248±157
Smoke        25   1   10    82±12     80±13      82±14     70±6
Autompg      392  7   100   53±3      52±3       53±4      51±2
42
Summary
LLSR is a natural semi-supervised generalization of Linear Regression
While the analysis is not as clear as for semi-supervised classification, semi-supervised regression can perform better than supervised regression when the target function varies smoothly, as with the GONG function.
FUTURE WORK:
Carefully analyzing the assumptions under which unlabeled data can
be useful in regression.
43
Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Conclusion and Questions
44
Kernels
K(x,y): Informally considered as a measure of similarity between x and y
Kernel trick: K(x,y) = Φ(x)∙Φ(y) (Mercer’s theorem)
This allows us to implicitly project non-linearly separable data into a high-dimensional space where a linear separator can be found.
A kernel must satisfy strict mathematical conditions:
1. Continuous
2. Symmetric
3. Positive semi-definite
45
Problems with Kernels
There is a conceptual disconnect between the notion of kernels as
similarity functions and the notion of finding max-margin separators
in possibly infinite dimensional Hilbert spaces.
The properties of kernels, such as positive semi-definiteness, are rather restrictive; in particular, similarity functions used in certain domains, such as the Smith-Waterman score in molecular biology, do not fit in this framework.
WANTED: A method for using similarity functions that is both
easy and general.
46
The Balcan-Blum approach
An approach fitting these requirements was recently proposed by
Balcan and Blum.
Gave a general definition of a good similarity function for learning.
Showed that kernels are a special case of their definition.
Gave an algorithm for learning with good similarity functions.
47
The Balcan-Blum approach
Suppose S(x,y) ∈ [-1,+1] is our similarity function. Then:
1. Draw d examples {x_1, x_2, x_3, …, x_d} uniformly at random from the data set.
2. For each example x compute the mapping x → {S(x,x_1), S(x,x_2), S(x,x_3), …, S(x,x_d)} (see the sketch below).
KEY POINT: This method can make use of
UNLABELED DATA.
48
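A minimal sketch of this mapping: draw d landmark examples (they can come from the unlabeled pool, which is where the unlabeled data enters) and represent every point by its similarities to them. sim is assumed to return values in [-1, +1].

```python
import numpy as np

def similarity_features(X, landmark_idx, sim):
    """sim(a, b) is assumed to return a similarity in [-1, +1]."""
    return np.array([[sim(x, X[j]) for j in landmark_idx] for x in X])

# Example usage (d landmarks drawn uniformly at random from the pool):
# rng = np.random.default_rng(0)
# landmark_idx = rng.choice(len(X), size=d, replace=False)
# F = similarity_features(X, landmark_idx, sim)
```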
Combining Feature-Based and Graph-Based Methods
Feature-based methods operate directly on the native features, e.g. decision trees, MaxEnt, Winnow, Perceptron.
Graph-based methods operate on the graph of similarities between examples, e.g. kernel methods, Gaussian Fields, graph mincut, and most semi-supervised learning methods.
These methods can work well on different datasets; we want to find a way to COMBINE these approaches into one algorithm.
49
SOLUTION: Similarity functions + Winnow
Use the Balcan-Blum approach to generate extra features.
Append the extra features to the original features:
x → {x, S(x,x_1), S(x,x_2), S(x,x_3), …, S(x,x_d)}
Run the Winnow algorithm on the combined features (sketched below).
(Winnow is known to be resistant to irrelevant features.)
50
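A small sketch of the combination step: append the similarity features to the raw features and train a balanced-Winnow-style learner with multiplicative updates on mistakes. The learning rate, number of epochs, and this particular update rule are illustrative choices rather than the thesis settings.

```python
import numpy as np

def winnow_train(F, y, epochs=10, eta=0.1):
    """F: feature matrix, y in {-1, +1}; returns the weight vector w_pos - w_neg."""
    n, m = F.shape
    w_pos, w_neg = np.ones(m), np.ones(m)
    for _ in range(epochs):
        for i in range(n):
            pred = np.sign((w_pos - w_neg) @ F[i])
            if pred != y[i]:                          # update only on mistakes
                w_pos *= np.exp(eta * y[i] * F[i])    # promote/demote multiplicatively
                w_neg *= np.exp(-eta * y[i] * F[i])
    return w_pos - w_neg

# Combined representation: x -> {x, S(x,x_1), ..., S(x,x_d)}
# F = np.hstack([X, similarity_features(X, landmark_idx, sim)])
# w = winnow_train(F[labeled_idx], y_labeled); predictions = np.sign(F @ w)
```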
Our Contributions
Practical techniques for using similarity functions
Combining graph-based and feature-based learning.
51
How to define a good similarity function?
By modifying a distance metric: K(x,y) = 1/(D(x,y)+1)
Problem: we can end up with all similarities close to ZERO (not good).
Solution: scale the similarities as follows (sketched below):
Sort the similarities for example x from most similar to least.
Give the most similar example similarity +1 and the least similar example similarity -1, and interpolate the remaining examples in between.
VERY IMPORTANT: the ranked similarity may not be symmetric, which is a big difference from kernels.
52
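A sketch of the ranking trick just described, applied to a precomputed raw similarity matrix S: for each example, rank all examples by similarity and spread the ranks linearly over [-1, +1]. Note that the output is generally not symmetric.

```python
import numpy as np

def rank_scale(S):
    n = S.shape[0]
    R = np.empty_like(S, dtype=float)
    for i in range(n):
        order = np.argsort(S[i])              # ascending: least similar first
        ranks = np.empty(n)
        ranks[order] = np.arange(n)
        R[i] = 2.0 * ranks / (n - 1) - 1.0    # least similar -> -1, most similar -> +1
    return R
```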
Evaluating a similarity function
K is a strongly (ε,γ)-good similarity function for a learning problem P if at least a (1 − ε) probability mass of examples x satisfy
E_{x'~P}[K(x',x) | l(x') = l(x)] ≥ E_{x'~P}[K(x',x) | l(x') ≠ l(x)] + γ
For a particular similarity function and dataset we can compute the margin γ for each example and then plot the examples by decreasing margin. If the margin is large for most examples, this is an indication that the similarity function may perform well on that dataset.
53
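A small sketch of how these margins can be estimated empirically from a similarity matrix K and the labels, producing the decreasing-margin (compatibility) curves shown on the next two slides.

```python
import numpy as np

def similarity_margins(K, labels):
    labels = np.asarray(labels)
    margins = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False                       # exclude the example itself
        diff = labels != labels[i]
        margins.append(K[i, same].mean() - K[i, diff].mean())
    return np.sort(margins)[::-1]             # plot this curve per dataset
```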
Compatibility of the naïve similarity function on Digits1
54
Compatibility of the ranked similarity function on Digits1
55
Experimental Results
We’ll look at some experimental results on both real and
synthetic datasets.
56
Synthetic Data: Circle
57
Experimental Results: Circle
58
Synthetic Data: Blobs and Lines
Can we create a dataset that needs BOTH the original and the new features to do well?
To answer this we create a dataset we call “Blobs and Lines”.
We generate the data in the following way (see the sketch below):
1. We select k points to be the centers of our “blobs” and assign them labels in {-1,+1}.
2. We flip a coin.
3. If heads, we set x to be a random Boolean vector of dimension d and set the label to be the first coordinate of x.
4. If tails, we pick one of the centers, flip r of its bits, set x equal to the result, and set the label to the label of that center.
59
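A sketch of this generator; k, d, r, the fair coin, and the ±1 encoding of the “first coordinate” label are assumptions filled in for illustration.

```python
import numpy as np

def blobs_and_lines(n, k=5, d=50, r=3, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.integers(0, 2, size=(k, d))        # blob centers
    center_labels = rng.choice([-1, +1], size=k)     # their labels
    X, y = [], []
    for _ in range(n):
        if rng.random() < 0.5:                        # heads: "lines" part
            x = rng.integers(0, 2, size=d)
            label = +1 if x[0] == 1 else -1           # label = first coordinate
        else:                                         # tails: "blobs" part
            j = rng.integers(k)
            x = centers[j].copy()
            flip = rng.choice(d, size=r, replace=False)
            x[flip] ^= 1                              # flip r bits
            label = center_labels[j]
        X.append(x)
        y.append(label)
    return np.array(X), np.array(y)
```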
Synthetic Data: Blobs and Lines
[Figure: the “Blobs and Lines” data, showing labeled + and - points arranged in blobs and along lines]
60
Experimental Results: Blobs and Lines
61
Experimental Results: Real Data
Dataset      n     d     n_l   Winnow   SVM     NN     SIM     Winnow+SVM
Congress     435   16    100   93.79    94.93   90.8   90.90   92.24
Webmaster    582   1406  100   81.97    71.78   72.5   69.90   81.20
Credit       653   46    100   78.50    55.52   61.5   59.10   77.36
Wisc         683   89    100   95.03    94.51   95.3   93.65   94.49
Digit1       1500  241   100   73.26    88.79   94.0   94.21   91.31
USPS         1500  241   100   71.85    74.21   92.0   86.72   88.57
62
Experimental Results: Concatenation
What if we did something halfway between synthetic and real, by concatenating two different datasets? This can be viewed as simulating a dataset that has two different kinds of data.
We concatenated the datasets by padding each of them with a block of ZEROS:
Credit (653 × 46)    Padding (653 × 241)
Padding (653 × 46)   Digit1 (653 × 241)
Dataset          n      d    n_l   Winnow   SVM     NN      SIM     Winnow+SVM
Credit+Digit1    1306   287  100   72.41    75.46   74.25   51.74   83.95
63
Conclusions
Generic similarity functions have a lot of potential for practical applications.
By combining feature-based and graph-based methods we can often get the “best of both worlds”.
FUTURE WORK
Designing similarity functions suited to particular domains.
Theoretically provable guarantees on the quality of a similarity function.
64
QUESTIONS?
65
Back Up Slides
66
References
“Semi-supervised Learning Using Randomized Mincuts”,
A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy,
ICML 2004
67
My Work
Techniques for improving graph mincut
algorithms for semi-supervised classification
Techniques for extending Local Linear
Regression to the semi-supervised setting
Practical techniques for using unlabeled data and generic similarity functions to “kernelize” the Winnow algorithm.
68
Problem
There may be several minimum cuts in the graph.
[Figure: a graph between + and - with several different minimum cuts]
Indeed, there are potentially exponentially many
minimum cuts in the graph.
69
Real Data: CO2
Carbon dioxide concentration in the atmosphere over the last two
centuries.
Source: World Watch Institute
70
Experimental Results: CO2
Local Linear Regression, MSE = 144
71
Experimental Results: CO2
Weighted Kernel Regression, MSE = 660
72
Experimental Results:CO2
LLSR, MSE = 97.4
73
Winnow
A linear-separator learning algorithm, first proposed by Littlestone.
We are particularly interested in Winnow because:
1. It is known to be able to effectively learn in the presence of irrelevant
attributes. Since we will be creating many new features, we expect many
of them will be irrelevant.
2. It is fast and does not require a lot of memory. Since we hope to use
large amounts of unlabeled data, scalability is an important
consideration.
74
PROPOSED WORK: Improving Running Time
Sparsification: Ignore examples which are far away so as to get
a sparser matrix to invert.
Iterative methods for solving linear systems: for a matrix equation Ax = b, we can obtain successive approximations x_1, x_2, …, x_k. This can be significantly faster if the matrix A is sparse.
79
PROPOSED WORK: Improving Running Time
Power series: use the identity (I − A)⁻¹ = I + A + A² + A³ + …
y′ = (Q + γΔ)⁻¹Py = Q⁻¹Py + (−γQ⁻¹Δ)Q⁻¹Py + (−γQ⁻¹Δ)²Q⁻¹Py + …
A few terms may be sufficient to get a good approximation.
Compute the supervised answer first, then “smooth” it to get the semi-supervised solution. This can be combined with iterative methods, as we can use the supervised solution as the starting point for our iterative algorithm.
80
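A schematic sketch of this truncated power series: start from the “supervised” answer Q⁻¹Py and repeatedly apply (−γQ⁻¹Δ) to add correction terms. solve_Q, apply_Delta and apply_P stand in for the actual operators, which are not spelled out on the slide, so they are assumptions of this snippet.

```python
import numpy as np

def power_series_solution(solve_Q, apply_Delta, apply_P, y, gamma, n_terms=5):
    term = solve_Q(apply_P(y))                      # supervised answer Q^{-1} P y
    total = term.copy()
    for _ in range(n_terms - 1):
        term = -gamma * solve_Q(apply_Delta(term))  # next term of the series
        total += term                               # "smooth" toward the SSL answer
    return total
```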
PROPOSED WORK: Experimental Evaluation
Comparison against other proposed semi-supervised regression
algorithms.
Evaluation on a large variety of data sets, especially high dimensional
ones.
81
PROPOSED WORK
Overall goal: Investigate the practical applicability of this theory
and find out what is needed to make it work on real problems.
Two main application areas:
1. Domains which have expert defined similarity functions that are
not kernels (protein homology).
2. Domains which have many irrelevant features and in which the data
may not be linearly separable in the original features (text
classification).
82
PROPOSED WORK: Protein Homology
The Smith-Waterman score is the best-performing measure of similarity, but it does not satisfy the kernel properties.
Machine learning applications have either used other similarity functions or tried to force the SW score into a kernel.
Can we achieve better performance by using SW score directly?
83
PROPOSED WORK: Text Classification
The most popular technique is Bag-of-Words (BOW), where each document is converted into a vector and each position in the vector indicates how many times each word occurred.
The vectors tend to be sparse and there will be many irrelevant features, hence this is well suited to the Winnow algorithm. Our approach makes the Winnow algorithm more powerful.
Within this framework we have strong motivation for investigating “domain specific” similarity functions, e.g. “edit distance” between documents instead of cosine similarity.
Can we achieve better performance than current techniques using
“domain specific” similarity functions?
84
PROPOSED WORK: Domain-Specific Similarity Functions
As mentioned in the previous two slides, designing specific similarity functions for each domain is well motivated in this approach.
What are the “best practice” principles for designing domain specific
similarity functions?
In what circumstances are domain specific similarity functions likely to
be most useful?
We will answer these questions by generalizing from several
different datasets and systematically noting what seems to work best.
85
Proposed Work and Time Line
Summer 2007:
(1) Speeding up LLSR.
(2) Learning with similarity functions in the protein homology and text classification domains.
Fall 2007:
(1) Comparison of LLSR with other semi-supervised regression algorithms.
(2) Investigate principles of domain-specific similarity functions.
Spring 2008: Start writing thesis.
Summer 2008: Finish writing thesis.
86
Kernels
K(x,y) = Φ(x)∙Φ(y)
Allows us to implicitly project non-linearly separable data into a high-dimensional space where a linear separator can be found.
A kernel must satisfy strict mathematical conditions:
1. Continuous
2. Symmetric
3. Positive semi-definite
87
Generic Similarity Functions
What if the best similarity function in a given domain does not satisfy the
properties of a kernel?
Two options:
1. Use a kernel with inferior performance
2. Try to “coerce” the similarity function into a kernel by building a kernel
that has similar behavior.
There is another way …
88