Relaxed Transfer of Different Classes via Spectral Partition

Xiaoxiao Shi 1, Wei Fan 2, Qiang Yang 3, Jiangtao Ren 4
1 University of Illinois at Chicago
2 IBM T. J. Watson Research Center
3 Hong Kong University of Science and Technology
4 Sun Yat-sen University

• 1. Unsupervised.
• 2. Can we use data with different classes to help? How so?
What is Transfer Learning?
Standard Supervised Learning
• Training data (labeled) and test data (unlabeled) both come from the New York Times.
• A classifier trained on the labeled data reaches 85.5% accuracy on the test data.
What is Transfer Learning?
In Reality…
• The labeled training data from the New York Times are insufficient; the test data (unlabeled) are also from the New York Times.
• Accuracy drops to 47.3%. How to improve the performance?
What is Transfer Learning?
• Source domain: training data (labeled), from Reuters.
• Target domain: test data (unlabeled), from the New York Times.
• Transfer: a classifier built with the help of the source domain reaches 82.6% on the target test data.
• The source data need not come from the same domain as the target data, and the two need not follow the same distribution.
Transfer across Different Class Labels
• Source domain: Reuters, training data (labeled). Labels: Markets, Politics, Entertainment, …
• Target domain: New York Times, test data (unlabeled). Labels: World, U. S., Fashion & Style, Travel, Blogs, …
• Transfer: the classifier reaches 82.6% on the target data.
• Since the two corpora come from different domains, their class labels may differ in both number and meaning. How to transfer when the class labels are different?
Two Main Categories of Transfer Learning
• Unsupervised Transfer Learning
– Does not have any labeled data from the target domain.
– Uses the source domain to help learning.
– Question: is it better than clustering?
• Supervised Transfer Learning
– Has a limited number of labeled examples from the target domain.
– Question: is it better than not using any source data examples?
Transfer across Different Class Labels
• Two sub-problems:
– (1) What and how to transfer, since we cannot explicitly use P(x|y) or P(y|x) to build the similarity among tasks (the class labels 'y' have different meanings)?
– (2) How to avoid negative transfer, since the tasks may be from very different domains?
Negative transfer: when the tasks are too different, transfer learning may hurt learning accuracy.
The proposed solution
• (1) What and how to transfer?
– Transfer the eigenspace.
– The dataset exhibits complex cluster shapes, so k-means performs very poorly in the original feature space due to its bias toward dense spherical clusters.
– In the eigenspace (the space given by the eigenvectors), the clusters are trivial to separate.
Eigenspace: the space spanned by a set of eigenvectors.
[Figure: the same two-cluster dataset plotted in the original feature space and in the eigenspace.]
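To make the eigenspace idea concrete, below is a minimal sketch (illustrative only, not the code released with the paper) that clusters a two-moon dataset with k-means in the raw feature space and in the eigenspace of a normalized graph Laplacian; the RBF affinity, the bandwidth sigma = 0.1, and the dataset itself are assumptions for illustration.

# Minimal sketch: clustering in the eigenspace of a graph Laplacian
# (illustrative only; not the code released with the paper).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from scipy.sparse.csgraph import laplacian

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means directly on the raw features: biased toward dense spherical clusters.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Build an RBF affinity graph and its normalized Laplacian.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dists / (2 * 0.1 ** 2))          # sigma = 0.1 is an assumed bandwidth
np.fill_diagonal(W, 0.0)
L = laplacian(W, normed=True)

# The eigenspace: eigenvectors of L with the smallest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, :2]                       # 2-dimensional eigenspace

# K-means in the eigenspace: the two moons become easy to separate.
spec_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)

def accuracy(pred, truth):
    """Clustering accuracy up to label permutation (binary case)."""
    return max(np.mean(pred == truth), np.mean(pred == 1 - truth))

print("k-means on raw features :", accuracy(raw_labels, y))
print("k-means in eigenspace   :", accuracy(spec_labels, y))

On this data, k-means in the raw space typically mixes the two moons, while k-means in the eigenspace separates them almost perfectly.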
The proposed solution
• (2) How to avoid negative transfer?
– A new clustering-based KL divergence to reflect distribution differences.
– If the distributions are too different (KL is large), automatically decrease the effect from the source domain.
To get the clustering-based KL divergence:
(1) Perform clustering on the combined dataset.
(2) Calculate the KL divergence from some basic statistical properties of the clusters. See the example on the next slide and the sketch after it.
The traditional KL divergence, D_KL(P||Q) = sum_x P(x) log(P(x)/Q(x)), needs P(x) and Q(x) for every x, which is normally difficult to obtain.
An Example
• The combined dataset has 15 examples: 8 from P (E(P) = 8/15) and 7 from Q (E(Q) = 7/15).
• Clustering the combined dataset yields two clusters, C1 and C2:
– P'(C1) = 3/15, Q'(C1) = 3/15, P'(C2) = 5/15, Q'(C2) = 4/15
– S(P', C1) = 0.5, S(Q', C1) = 0.5, S(P', C2) = 5/9, S(Q', C2) = 4/9
• For example, S(P', C) means "the portion of examples in P that are contained in cluster C".
• Resulting clustering-based KL divergence: KL = 0.0309
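As a rough illustration of the procedure (cluster the combined data, then compare per-cluster statistics), here is a hedged Python sketch; the function name, the number of clusters, and the exact per-cluster weighting are assumptions and may differ from the formula used in the paper.

# Sketch of a clustering-based KL estimate between source (P) and target (Q)
# distributions; the per-cluster statistics follow the worked example above,
# but the exact weighting in the paper may differ.
import numpy as np
from sklearn.cluster import KMeans

def clustering_based_kl(X_source, X_target, n_clusters=2, eps=1e-12):
    """Cluster the combined data, then compare how source and target
    examples distribute over the clusters."""
    X = np.vstack([X_source, X_target])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    src_labels = labels[:len(X_source)]
    tgt_labels = labels[len(X_source):]

    kl = 0.0
    for c in range(n_clusters):
        # Fraction of source / target examples falling into cluster c.
        p_c = (np.sum(src_labels == c) + eps) / len(src_labels)
        q_c = (np.sum(tgt_labels == c) + eps) / len(tgt_labels)
        kl += p_c * np.log(p_c / q_c)
    return kl

# Usage idea: a large value suggests the two domains are very different,
# so the influence of the source eigenspace should be reduced.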
Objective Function
• Objective: find an eigenspace that well separates the target data.
– Intuition: if the source data is similar to the target data, make good use of the source eigenspace;
– Otherwise, keep the original structure of the target data.
• The objective combines a traditional normalized-cut term with penalty terms: one that prefers the source eigenspace (via TL) and one that prefers the original structure of the target data (via TU), balanced by R(L; U).
• The more similar the two distributions are, the smaller R(L; U) is, and the more the objective relies on the source eigenspace TL.
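For reference, the "traditional normalized cut term" refers to the standard Ncut relaxation shown below in generic form; the paper's full objective additionally includes the penalty terms balanced by R(L; U), which are not reproduced here.

% Standard normalized-cut relaxation (generic form, not the paper's full objective).
% W is the affinity matrix, D the diagonal degree matrix, so D - W is the graph Laplacian.
\min_{v} \; \frac{v^{\top} (D - W)\, v}{v^{\top} D\, v}
\quad \text{subject to} \quad v^{\top} D \mathbf{1} = 0 .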
How to construct the constraints TL and TU?
• Principle:
– To construct TL: it is directly derived from the "must-link" constraint (examples with the same label should be together).
For example, with labeled examples 1–6: 1, 2, 4 should be together (blue); 3, 5, 6 should be together (red).
– To construct TU: (1) perform standard spectral clustering (e.g., Ncut) on U; (2) examples in the same cluster should be together.
For example: 1, 2, 3 should be together; 4, 5, 6 should be together.
How to construct the constraints TL and TU?
• Construct the constraint matrix M = [m1, m2, …, mr]'
For example, for the must-link pairs above, ML =
1, -1, 0, 0, 0, 0   (examples 1 and 2)
1, 0, 0, -1, 0, 0   (examples 1 and 4)
0, 0, 1, 0, -1, 0   (examples 3 and 5)
……
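A minimal sketch of assembling such a constraint matrix from must-link pairs (hypothetical helper functions, not the released code): each pair (i, j) contributes one row with +1 in column i and -1 in column j, so requiring M v = 0 forces linked examples to share the same embedding value. For TL the pairs come from shared labels; for TU they come from spectral clustering of the unlabeled target data.

# Sketch: build a must-link constraint matrix M = [m_1, ..., m_r]^T.
# Each row encodes one pair (i, j): +1 at column i, -1 at column j.
# Hypothetical helpers, mirroring the example matrix ML on the slide.
import numpy as np
from itertools import combinations

def constraint_matrix(pairs, n_examples):
    M = np.zeros((len(pairs), n_examples))
    for row, (i, j) in enumerate(pairs):
        M[row, i] = 1.0
        M[row, j] = -1.0
    return M

def pairs_from_groups(groups):
    """Turn groups of indices that 'should be together' into must-link pairs."""
    return [pair for g in groups for pair in combinations(g, 2)]

# Slide example: examples 1, 2, 4 should be together and 3, 5, 6 should be
# together (0-indexed here as 0, 1, 3 and 2, 4, 5).
pairs = pairs_from_groups([[0, 1, 3], [2, 4, 5]])
M_L = constraint_matrix(pairs, n_examples=6)
print(M_L)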
Experiment Data Sets
[Tables describing the text and image data sets are shown on the slides.]
Text Classification
[Bar charts comparing Full Transfer, No Transfer, and RSP.]
• Target task Comp1 VS Rec1, with source tasks: 1: comp2 VS Rec2; 2: 4 classes (Graphics, etc.); 3: 3 classes (crypt, etc.)
• Target task Org1 VS People1, with source tasks: 1: org2 VS People2; 2: 3 classes (Places, etc.); 3: 3 classes (crypt, etc.)
Image Classification
[Bar charts comparing Full Transfer, No Transfer, and RSP.]
• Target task Homer VS Real Bear, with source tasks: 1: Superman VS Teddy; 2: 3 classes (cartman, etc.); 3: 4 classes (laptop, etc.)
• Target task Cartman VS Fern, with source tasks: 1: Superman VS Bonsai; 2: 3 classes (homer, etc.); 3: 4 classes (laptop, etc.)
Parameter Sensitivity
[Parameter sensitivity plots are shown on the slide.]
Conclusions
• Problem: transfer across tasks with different class labels.
• Two sub-problems:
• (1) What and how to transfer?
• Transfer the eigenspace.
• (2) How to avoid negative transfer?
• Propose an effective clustering-based KL divergence; if KL is large, i.e., the distributions are too different, decrease the effect from the source domain.
Thanks!
Datasets and codes:
http://www.cs.columbia.edu/~wfan/software.htm
# Clusters?
Condition for Lemma 1 to be valid: in each cluster, the expected values of the target and source data are about the same.
[The slide states this condition as a formula: a bound under which the relevant quantity is close to 0.]
Adaptively control the number of clusters to guarantee that Lemma 1 holds:
– Stop bisecting clustering when a cluster contains only target or only source data, or when the condition given by the slide's formula is satisfied.
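A rough sketch of this adaptive bisecting strategy, assuming 2-means for each bisection and a tolerance on the gap between source and target means as a stand-in for the slide's formula (both are assumptions, not the paper's exact test):

# Sketch of adaptive bisecting clustering: keep splitting a cluster until it
# contains only source or only target examples, or until the source and target
# means inside it are close (so the condition for Lemma 1 roughly holds).
# The tolerance test below is an assumption standing in for the slide's formula.
import numpy as np
from sklearn.cluster import KMeans

def bisect_adaptively(X, is_source, tol=0.1):
    """X: (n, d) data; is_source: boolean array marking source examples."""
    clusters, stack = [], [np.arange(len(X))]
    while stack:
        idx = stack.pop()
        src, tgt = idx[is_source[idx]], idx[~is_source[idx]]
        # Stop: cluster is pure (only source or only target data) ...
        if len(src) == 0 or len(tgt) == 0:
            clusters.append(idx)
            continue
        # ... or source and target expected values are already about the same.
        if np.linalg.norm(X[src].mean(axis=0) - X[tgt].mean(axis=0)) < tol:
            clusters.append(idx)
            continue
        # Otherwise bisect the cluster and recurse on both halves.
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        if np.sum(halves == 0) == 0 or np.sum(halves == 1) == 0:
            clusters.append(idx)      # degenerate split, stop here
            continue
        stack.extend([idx[halves == 0], idx[halves == 1]])
    return clusters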
Optimization
[The slide states the optimization steps as formulas ("Let …", "Then, …") and shows the algorithm flow.]