Large Graph Construction for Scalable Semi-Supervised Learning

Wei Liu, Junfeng He, Shih-Fu Chang
Department of Electrical Engineering
Columbia University
New York, NY, USA

International Conference on Machine Learning, 2010
Outline
1. Motivation
2. Related Work
   Mid-Scale SSL
   Large-Scale SSL
3. Anchor-Based Label Prediction
   Anchors
   Anchor-to-Data Nonparametric Regression
4. AnchorGraph
   Coupling Label Prediction and Graph Construction
   AnchorGraph Regularization
5. Experiments
6. Conclusions
Why Semi-Supervised Learning (SSL)?
In practice, labeling data is expensive.
By using both labeled and unlabeled data, SSL generalizes and improves upon supervised learning.
Most SSL methods cannot handle large amounts of unlabeled data because they scale poorly with the data size.
Large-scale SSL is critical nowadays, as massive amounts of unlabeled data (up to hundreds of millions of samples) can easily be collected from the Internet (Facebook, Flickr).
SSL on Large Social Networks
http://www.cmu.edu/joss/content/articles/volume1/Freeman.html
Mid-Scale SSL
Manifold and Graph
Manifold Assumption
Samples that are close under some similarity measure have similar class labels. Graph-based SSL methods including LGC, GFHF, LapSVM, etc. adopt this assumption.
Neighborhood Graph
A sparse graph can reveal the data manifold, where adjacent graph nodes represent nearby data points. The most commonly used graph is the kNN graph (linking graph edges between each data point and its k nearest neighbors), whose construction takes O(kn²) time. This is inefficient at large scale!
Mid-Scale SSL
Manifold and Graph
Dense graphs hardly reveal manifolds.
Sparse graphs are preferred.
Example
[Figure: a toy data manifold consisting of two submanifolds (left) and its 6NN graph (right).]
Large-Scale SSL
Existing Large-Scale SSL Methods
Nonparametric Function Induction, Delalleau et al., 2005
Uses a truncated graph on a sample subset and its connections to the remaining samples.
Harmonic Mixtures, Zhu & Lafferty, 2005
Needs a large sparse graph, but does not explain how to construct one.
Prototype Vector Machine (PVM), Zhang et al., 2009
Uses the Nyström approximation of the fully-connected graph adjacency matrix. There is no guarantee that the graph Laplacian computed from this approximation is positive semidefinite.
Eigenfunction, Fergus et al., 2010
Relies on the dimension-independent data density assumption, which does not always hold.
Large-Scale SSL
Our Aim
Develop a scalable SSL approach that is easy to implement and has closed-form solutions.
Address the critical issue of how to build an entire sparse graph over a large dataset.
Achieve low space and time complexities with respect to the data size. Our goal: linear complexity!
Anchors
How to scale up?
Graph-based SSL methods usually need cubic time complexity O(n³) with respect to the data size n, because an n × n matrix inversion is needed to infer the full-size label prediction.
At large scale, full-size label prediction is infeasible, which motivates the idea of landmark samples, i.e., anchors.
If one instead infers the labels on a much smaller landmark set of size m (m ≪ n), the time complexity can be reduced to O(m³).
Anchors
Anchor Points
Example
[Figure: k-means clustering on toy data. The cluster centers u1, u2, u3, u4 serve as anchor points for the data points (e.g., x1).]
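For concreteness, a minimal sketch of anchor selection: the slides only specify that anchors are k-means cluster centers, so the use of scikit-learn's MiniBatchKMeans and its parameters below are illustrative assumptions.

    # Anchor selection via k-means (a sketch; library choice is an assumption).
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def find_anchors(X, m, seed=0):
        """Return m anchor points as k-means cluster centers of X (n x d)."""
        km = MiniBatchKMeans(n_clusters=m, random_state=seed, n_init=3)
        km.fit(X)
        return km.cluster_centers_  # U: m x d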
Anchors
Anchor-to-Data Relationship
Input data points X = {x_1, ..., x_n} ⊂ R^d and introduced anchor points U = {u_1, ..., u_m}.
Suppose a soft label prediction function f: R^d → R.
Predict the label for each data point as a weighted average of the labels on anchors:

    f(x_i) = \sum_{k=1}^{m} Z_{ik} f(u_k),   i = 1, ..., n        (1)

where the Z_{ik} denote the combination weights.
The matrix form:

    f = Z a,   Z ∈ R^{n×m},   m ≪ n        (2)

where a = [f(u_1), ..., f(u_m)]^T is the label vector that we will solve for.
Anchor-to-Data Nonparametric Regression
How to obtain Z?
We impose the nonnegative normalization constraints \sum_{k=1}^{m} Z_{ik} = 1 and Z_{ik} ≥ 0 in order to maintain a unified range of values for all soft labels predicted via anchors.
The manifold assumption implies that nearby points should have similar labels, and that distant points are very unlikely to take similar labels.
So we further impose Z_{ik} = 0 when anchor u_k is far away from x_i. Specifically, Z_{ik} ≠ 0 only for the s closest anchor points of x_i (whose s indexes constitute ⟨i⟩ ⊂ [1:m]).
Anchor-to-Data Nonparametric Regression
The weights Z_{ik} are sample-adaptive, so the anchor-based label prediction scheme amounts to nonparametric regression: the label of x_i is a locally weighted average of the labels on anchors.
The regression matrix Z ∈ R^{n×m} is nonnegative and sparse.
Example
[Figure: with s = 2, the label of x1 is a weighted average of the labels on its two closest anchors u1 and u2, with weights Z11 and Z12.]
Anchor-to-Data Nonparametric Regression
Predefined Kernel-Based Weights
Nadaraya-Watson kernel regression:

    Z_{ik} = K_h(x_i, u_k) / \sum_{k' ∈ ⟨i⟩} K_h(x_i, u_{k'}),   ∀k ∈ ⟨i⟩        (3)

Typically, we may adopt the Gaussian kernel K_h(x_i, u_k) = exp(−‖x_i − u_k‖² / 2h²).
These predefined weights are sensitive to the hyperparameter h and lack a meaningful interpretation.
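A minimal sketch of Eq. (3) with the Gaussian kernel; the brute-force distance computation and function names are illustrative, not the authors' implementation.

    # Predefined kernel-based weights Z (Eq. 3), Gaussian kernel, s closest anchors.
    import numpy as np

    def kernel_weights(X, U, s=3, h=1.0):
        """Z (n x m): row-normalized Gaussian weights over each point's s closest anchors."""
        n, m = X.shape[0], U.shape[0]
        # squared Euclidean distances between all data points and anchors
        d2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)  # n x m
        Z = np.zeros((n, m))
        for i in range(n):
            idx = np.argsort(d2[i])[:s]               # the s closest anchors <i>
            w = np.exp(-d2[i, idx] / (2.0 * h ** 2))  # K_h(x_i, u_k)
            Z[i, idx] = w / w.sum()                   # normalize: sum_k Z_ik = 1
        return Z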
Anchor-to-Data Nonparametric Regression
Optimized Weights
Local Anchor Embedding (LAE):

    min_{z_i ∈ R^s}  g(z_i) = (1/2) ‖x_i − U_{⟨i⟩} z_i‖²
    s.t.  1^T z_i = 1,  z_i ≥ 0        (4)

where U = [u_1, ..., u_m] and U_{⟨i⟩} ∈ R^{d×s} is the sub-matrix composed of the s nearest anchors of x_i.
Similar to Locally Linear Embedding (LLE), LAE reconstructs each x_i as a convex combination of its nearest anchors. The combination coefficients are kept as the weights Z_{ik} used in nonparametric regression:

    Z_{i,⟨i⟩} = z_i^T,   Z_{ik} = 0 for k ∉ ⟨i⟩,   |⟨i⟩| = s.        (5)
Anchor-to-Data Nonparametric Regression
We propose a Nesterov-accelerated projected gradient algorithm, a first-order optimization procedure, to solve this optimization problem.
In contrast to the predefined kernel-based weights, LAE is more advantageous because it provides optimized regression weights that are also sparser.
Example
[Figure: LAE projects x_i onto the convex hull of its closest anchors U_{⟨i⟩} (s = 2); e.g., Z42 = 0, Z43 = 1 for x4, and Z73 = 0.5, Z74 = 0.5 for x7.]
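A minimal sketch of LAE: the talk proposes a Nesterov-accelerated solver, but for brevity the sketch below uses plain (unaccelerated) projected gradient on the same objective (4), with the standard sorting-based simplex projection.

    # LAE via projected gradient on the simplex (unaccelerated sketch of Eq. 4).
    import numpy as np

    def project_simplex(v):
        """Euclidean projection of v onto {z : z >= 0, 1^T z = 1}."""
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
        theta = (css[rho] - 1.0) / (rho + 1.0)
        return np.maximum(v - theta, 0.0)

    def lae_weights(x, U_i, n_iter=100):
        """Minimize (1/2)||x - U_i z||^2 over the simplex; U_i is d x s."""
        s = U_i.shape[1]
        z = np.full(s, 1.0 / s)                       # start at simplex center
        step = 1.0 / np.linalg.norm(U_i.T @ U_i, 2)   # 1 / Lipschitz constant
        for _ in range(n_iter):
            grad = U_i.T @ (U_i @ z - x)              # gradient of g(z)
            z = project_simplex(z - step * grad)      # gradient step + projection
        return z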
Coupling Label Prediction and Graph Construction
Large Graph Construction
Construct a large graph G(X, W) over the dataset X, where W ∈ R^{n×n} denotes the adjacency matrix.
Exact kNN graph construction (quadratic time complexity) is infeasible at large scale, and it is unrealistic to store the full matrix W in memory.
Our idea is low-rank matrix factorization: represent W as a Gram matrix.
Coupling Label Prediction and Graph Construction
Principles for Designing W
W ≥ 0, which is sufficient to make the resulting graph Laplacian L = D − W (with D = diag(W1)) positive semidefinite. Keeping the graph Laplacian p.s.d. is important to guarantee a global optimum in graph-based SSL (it ensures that the graph regularizer f^T L f is convex).
Sparse W: sparse graphs have far fewer spurious connections between dissimilar points and tend to exhibit high quality.
Coupling Label Prediction and Graph Construction
Nyström Approximation
Using the Nyström method to design the graph adjacency matrix W = K̇ produces an improper, dense graph:

    K̇ = K_{·U} K_{UU}^{-1} K_{U·}

which approximates a predefined kernel matrix K ∈ R^{n×n}.
K̇ ≥ 0 cannot be guaranteed, and K̇ is dense.
Coupling Label Prediction and Graph Construction
Design W Using Z
In anchor-based label prediction we have already obtained the nonnegative, sparse Z ∈ R^{n×m}. Don't waste it.
In fact, each row Z_{i·} of Z is a new representation of the raw sample x_i. The map x_i → Z_{i·}^T is reminiscent of sparse coding with basis U, since x_i ≈ U_{⟨i⟩} z_i = U Z_{i·}^T.
Guess: the Gram matrix W = Z Z^T?
Coupling Label Prediction and Graph Construction
Design W Using Z
Low-Rank Matrix Factorization:

    W = Z Λ^{-1} Z^T        (6)

in which the diagonal matrix Λ ∈ R^{m×m} contains the inbound degrees of the anchors: Λ_{kk} = \sum_{i=1}^{n} Z_{ik}.
W balances the popularity of anchors.
W is nonnegative and empirically sparse.
Theoretically, we can derive this equation by probabilistic means (e.g., employing the bipartite graph on the next slide).
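A minimal sketch of Eq. (6), for illustration only: at large scale W is never materialized, since it is n × n.

    # AnchorGraph adjacency W = Z Λ^{-1} Z^T (explicit form, small data only).
    import numpy as np

    def anchor_graph_adjacency(Z):
        """W = Z diag(Λ)^{-1} Z^T, with Λ_kk = sum_i Z_ik (anchor inbound degrees)."""
        lam = Z.sum(axis=0)          # Λ diagonal, length m
        return (Z / lam) @ Z.T       # column scaling implements Z Λ^{-1}, then Z^T

    # Sanity check: each row of Z sums to 1, so D = diag(W 1) = I and L = I - W.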
Coupling Label Prediction and Graph Construction
Bipartite Data-Anchor Graph
[Figure: a bipartite graph representation B(X, U, Z) of data points x1, ..., x6 and anchors u1, u2; e.g., x2 connects to u1 and u2 with weights Z21 and Z22. Z is defined as the cross adjacency matrix between X and U (Z1 = 1).]
Coupling Label Prediction and Graph Construction
One-Step Transition in Markov Random Walks
[Figure: the bipartite data-anchor graph, with a one-step transition from a data node to an anchor node.]

    p^{(1)}(u_k | x_i) = Z_{ik} / \sum_{k'=1}^{m} Z_{ik'} = Z_{ik}

(since each row of Z sums to 1).
Coupling Label Prediction and Graph Construction
One-Step Transition in Markov Random Walks
[Figure: the bipartite data-anchor graph, with a one-step transition from an anchor node to a data node.]

    p^{(1)}(x_i | u_k) = Z_{ik} / \sum_{j=1}^{n} Z_{jk} = Z_{ik} / Λ_{kk}
Coupling Label Prediction and Graph Construction
Two-Step Transition in Markov Random Walks
[Figure: the bipartite data-anchor graph, with two-step transitions between data nodes passing through the anchors.]

    p^{(2)}(x_j | x_i) = p^{(2)}(x_i | x_j) = \sum_{k=1}^{m} p^{(1)}(x_j | u_k) p^{(1)}(u_k | x_i) = \sum_{k=1}^{m} Z_{ik} Z_{jk} / Λ_{kk}
Coupling Label Prediction and Graph Construction
Interpret W as the two-step transition probabilities in the bipartite data-anchor graph B:

    W_{ij} = p^{(2)}(x_j | x_i) = p^{(2)}(x_i | x_j)  ⟹  W = Z Λ^{-1} Z^T        (7)

We term the large graph G described by such a W an AnchorGraph.
We never need to compute W or the graph Laplacian L = D − W explicitly; we only need to keep Z (O(sn) memory).
AnchorGraph Regularization
With f = Za, the objective function of SSL is

    Q(a) = ‖Z_l a − y_l‖² + γ a^T Z^T L Z a        (8)

where the first term is the label fitting and the second the graph regularization. We compute a "reduced" Laplacian matrix

    L̃ = Z^T L Z = Z^T (D − Z Λ^{-1} Z^T) Z = Z^T Z − (Z^T Z) Λ^{-1} (Z^T Z),

using D = diag(W1) = diag(Z Λ^{-1} Z^T 1) = diag(Z Λ^{-1} Λ 1) = diag(Z1) = diag(1) = I.
Computing L̃ is tractable, taking O(m³ + m²n) time.
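A minimal sketch of computing L̃ directly from Z, never forming the n × n matrices W or L (assuming Z is a dense NumPy array; a sparse matrix would be used in practice):

    # Reduced Laplacian L~ = Z^T Z - (Z^T Z) Λ^{-1} (Z^T Z).
    import numpy as np

    def reduced_laplacian(Z):
        lam = Z.sum(axis=0)      # Λ diagonal (anchor inbound degrees)
        ZtZ = Z.T @ Z            # m x m; the only O(m^2 n) step
        return ZtZ - ZtZ @ ((1.0 / lam)[:, None] * ZtZ)  # row-scale = Λ^{-1} (Z^T Z)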
AnchorGraph Regularization
We can easily obtain the globally optimal solution:

    a* = (Z_l^T Z_l + γ L̃)^{-1} Z_l^T y_l        (9)

As such, we obtain a closed-form solution for large-scale SSL.
Final prediction over multiple classes (Y = {1, ..., c}):

    ŷ_i = arg max_{j ∈ Y}  Z_{i·} a*_j / λ_j,   i = l+1, ..., n        (10)

where a*_j is solved separately for each class, and the normalization factor λ_j = 1^T Z a*_j balances skewed class distributions.
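A minimal sketch combining Eqs. (9) and (10). Assumptions: Y is an l × c 0/1 class-indicator matrix for the labeled points, gamma is the regularization weight, and reduced_laplacian is the sketch above.

    # AnchorGraph Regularization: closed-form solve, then normalized arg-max.
    import numpy as np

    def anchor_graph_reg(Z, Y, gamma=0.01):
        l = Y.shape[0]
        Zl = Z[:l]                                   # labeled rows of Z
        Lt = reduced_laplacian(Z)                    # reduced Laplacian L~
        A = np.linalg.solve(Zl.T @ Zl + gamma * Lt,  # Eq. (9), all classes at once:
                            Zl.T @ Y)                # column A[:, j] is a*_j
        F = Z @ A                                    # soft labels for all n points
        lam = Z.sum(axis=0) @ A                      # λ_j = 1^T Z a*_j, per class
        return np.argmax(F / lam, axis=1)            # Eq. (10); rows l: are unlabeled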
AnchorGraph Regularization
Complexity Analysis
Space complexity: O(n), since we only store Z and L̃ ∈ R^{m×m} in memory.
Time complexity: O(m²n), since n ≫ m.

Table: Time complexity analysis. n is the data size, m is the number of anchor points, s is the number of nearest anchors used in designing Z, and T is the number of LAE iterations (n ≫ m ≫ s).

    Stage                    AnchorGraphReg
    find anchors             O(mn)
    design Z                 O(smn) or O(smn + s²Tn)
    graph regularization     O(m³ + m²n)
    total time complexity    O(m²n)
Compared Methods
LGC, GFHF, PVM, Eigenfunction, etc.
random AGR0: random anchors, predefined Z
random AGR: random anchors, LAE-optimized Z
AGR0: k-means anchors, predefined Z
AGR: k-means anchors, LAE-optimized Z
Mid-sized Dataset
Table: Classification error rates (%) on USPS-Train (7,291 samples) with 100 labeled samples, using 1,000 anchors for the four versions of AGR. The running time of k-means clustering is 7.65 seconds. In error rate, GFHF < AGR < AGR0 < LGC; AGR is much faster than GFHF.

    Method                 Error Rate (%)    Running Time (seconds)
    1NN                    20.15 ± 1.80      0.12
    LGC with 6NN graph     8.79 ± 2.27       403.02
    GFHF with 6NN graph    5.19 ± 0.43       413.28
    random AGR0            11.15 ± 0.77      2.55
    random AGR             10.30 ± 0.75      8.85
    AGR0                   7.40 ± 0.59       10.20
    AGR                    6.56 ± 0.55       16.57
Mid-sized Dataset
[Figure: classification error rate (%) vs. number of anchor points (200 to 1,000) on USPS-Train with 100 labeled samples, comparing LGC, GFHF, random AGR0, random AGR, AGR0, and AGR. Once the anchor size reaches 600, both AGR0 and AGR begin to outperform LGC.]
Large Dataset
Table: Classification error rates (%) on Extended MNIST (630,000 samples) with 100 labeled samples, using 500 anchors for the two versions of AGR. The running time of k-means clustering is 195.16 seconds. AGR achieves the best classification accuracy, roughly halving the error rate of the 1NN baseline.

    Method               Error Rate (%)    Running Time (seconds)
    1NN                  39.65 ± 1.86      5.46
    Eigenfunction        36.94 ± 2.67      44.08
    PVM (square loss)    29.37 ± 2.53      266.89
    AGR0                 24.71 ± 1.92      232.37
    AGR                  19.75 ± 1.83      331.72
Conclusions
We developed a scalable SSL approach built on the proposed large graph construction, the AnchorGraph, which is guaranteed to yield positive semidefinite graph Laplacians.
Both the time and memory needed by AGR grow only linearly with the data size, enabling SSL on even larger datasets with millions of samples.
The AnchorGraph has wider utility, e.g., large-scale nonlinear dimensionality reduction, spectral clustering, and regression.
Thanks! Ask me via wliu@ee.columbia.edu.