Efficient Iterative Semi-Supervised Classification on Manifold

Mehrdad Farajtabar, Hamid R. Rabiee, Amirreza Shaban, Ali Soltani-Farani
Digital Media Lab, AICTC Research Center,
Department of Computer Engineering,
Sharif University of Technology,
Tehran, Iran.
{farajtabar, shaban, a soltani}@ce.sharif.edu, rabiee@sharif.edu
Abstract—Semi-Supervised Learning (SSL) has become a topic of recent research that effectively addresses the problem of limited labeled data. Many SSL methods have been developed based on the manifold assumption; among them, Local and Global Consistency (LGC) is a popular method. The problem with most of these algorithms, and in particular with LGC, is that their naive implementations do not scale well with the size of the data. Time and memory limitations are the major problems faced in large-scale settings. In this paper, we provide theoretical bounds on gradient descent and, to overcome the aforementioned problems, propose a new approximate Newton's method. Moreover, convergence analysis and theoretical bounds for the time complexity of the proposed method are provided. We show that the number of iterations of the proposed methods depends logarithmically on the number of data points, which is a considerable improvement compared to the naive implementations. Experimental results on real world datasets confirm the superiority of the proposed methods over LGC's default iterative implementation and a state of the art factorization method.
Keywords-Semi-supervised learning, Manifold assumption,
Local and global consistency, Iterative method, Convergence
analysis
I. INTRODUCTION
Semi-supervised learning has become a popular approach to the problem of classification with limited labeled data in recent years [1]. To use unlabeled data effectively in the learning process, certain assumptions regarding the possible labeling functions and the underlying geometry need to hold [2]. In many real world classification problems, data points lie on a low dimensional manifold. The manifold assumption states that the labeling function varies smoothly with respect to the underlying manifold [3]. Methods utilizing the manifold assumption have proved effective in many applications, including image segmentation [4], handwritten digit recognition, and text classification [5].
Regularization is essentially the soul of semi-supervised learning based on the manifold assumption. Manifold regularization is commonly formulated as a quadratic optimization problem,
$$\min_x \; \frac{1}{2} x^T A x - b^T x, \qquad (1)$$
where $A \in \mathbb{R}^{n \times n}$ and $b, x \in \mathbb{R}^n$. It is in effect equivalent to solving the system of linear equations Ax = b. Fortunately, A is a sparse symmetric positive definite matrix.
Naive solutions to this problem require O(n^3) operations to solve for x, while methods that take into account the sparse structure of A can cost much less. Taking the inverse of A directly is an obviously bad choice for several reasons. First, taking the inverse requires O(n^3) operations regardless of the sparse structure of A. Second, A may be near singular, in which case the inverse operation is numerically unstable. Finally, the inverse of A is usually not sparse, in which case a large amount of memory is needed to store and process A^{-1}.
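As a small illustration of this point, the following Python/SciPy sketch contrasts the dense explicit inverse with a sparse direct solve; the chain-graph matrix, its size, and the right-hand side are illustrative assumptions standing in for the sparse A of the LGC problem, not taken from the paper.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Small sparse SPD matrix with the same flavor as A = L + C:
# a chain-graph Laplacian plus a positive diagonal shift.
n = 1000
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc") + 0.5 * sp.eye(n, format="csc")
b = np.random.rand(n)

# Bad: the explicit inverse is dense (O(n^2) memory) and costs O(n^3) time.
A_inv = np.linalg.inv(A.toarray())
print("nnz of A:", A.nnz, " nnz of inv(A):", np.count_nonzero(A_inv))

# Better: solve the sparse system directly; no dense n-by-n matrix is ever formed.
x = spla.spsolve(A, b)
print("residual:", np.linalg.norm(A @ x - b))
```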
To elaborate, note that semi-supervised learning is especially advantageous when there is a large amount of unlabeled data, which leads to better utilization of the underlying manifold structure. For example, consider the huge number of unlabeled documents or images on the web which may be used to improve classification results. In these large-scale settings ordinary implementations are not effective, because time and memory limitations are an important concern in SSL methods with the manifold assumption [1].
There are commonly two approaches to overcome this problem. First, one may reformulate the manifold regularization problem in a new form, more suitable for large-scale settings. For example, [6] considers a linear base kernel and thus requires an inverse operation with a much smaller matrix. [7] uses a sparsified manifold regularizer with core vector machines (recently proposed for scaling up kernel methods) to handle large-scale data.
The second approach to this problem (which is the focus of this paper) relies heavily on factorization, optimization, or iterative procedures to solve the original manifold regularization formulation. In particular, iterative methods are of great interest. Label propagation (LP) [8] is an iterative algorithm for computing the harmonic solution [9], which is a variation of the manifold regularization problem. Another naturally iterative manifold regularization algorithm is local and global consistency (LGC) [10], upon which we build our work. Linear neighborhood propagation (LNP) [11] is another iterative method, which differs from other manifold learning methods mostly in the way it constructs the neighborhood graph. The problem with most of these iterative methods is that, although they are claimed to converge quickly, no analytical guarantee or proof is provided for that claim.
In this paper we conduct a theoretical analysis of iterative methods for LGC. We apply gradient descent to LGC and derive an analytical bound on the number of iterations and its dependency on the number of data points. These bounds also hold for other manifold regularization problems such as the harmonic solution and Tikhonov regularization. We then show that LGC's iterative procedure may be improved through an approximation of the inverse Hessian, and present a detailed convergence analysis. Again, a theoretical bound is derived for the number of iterations. We show that these iterative implementations require O(log n) sparse matrix-vector multiplications to compute LGC's solution with sufficient accuracy. It is then proved that LGC's iterative procedure is a special case of our proposed method. Finally, the proposed methods are compared with LGC's iterative procedure and a state of the art factorization method utilizing Cholesky decomposition.
The rest of the paper is organized as follows. In Section II some related work in the domains of optimization, factorization, and iterative methods is introduced. Section III provides a basic overview of LGC and introduces the notation. Section IV provides a detailed analysis of gradient descent applied to LGC. In Section V we then show how LGC's iterative procedure may be improved and derive further theoretical bounds. Section VI gives experimental results validating the derived bounds, after which the paper is concluded in Section VII.
II. RELATED WORKS
Methods such as LQ, LU, or Cholesky factorization
overcome the inverse operation problems by factorizing A
into matrices with special structure that greatly simplify
computations especially when A is sparse. In particular
Cholesky factorization best fits our problem by making use
of the symmetry and positive definiteness properties of A.
It decomposes A as P U T U P T , where P is a permutation
matrix and U is upper triangular with positive diagonal
elements. Heuristics are used to choose a matrix P that leads
to a sparse U . In some instances these heuristics fail and the
resulting algorithm may not be computationally as efficient
as expected [12].
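For concreteness, the following hedged sketch shows the factor-once, solve-many pattern described above, using SciPy's sparse LU (splu) with a fill-reducing column permutation. SciPy itself does not ship a sparse Cholesky, so this is a stand-in for a true Cholesky/CHOLMOD factorization (available, for example, through the scikit-sparse wrapper), and the test matrix is synthetic.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Sparse SPD system (chain Laplacian plus diagonal) as a stand-in for A = L + C.
n = 2000
A = sp.diags([-np.ones(n - 1), 2.5 * np.ones(n), -np.ones(n - 1)],
             offsets=[-1, 0, 1], format="csc")
b = np.random.rand(n)

# Factor once with a fill-reducing column permutation (the role played by P in the
# decomposition P U^T U P^T discussed above), then reuse the factor for any number
# of right-hand sides.  splu is an LU factorization; a true sparse Cholesky such as
# CHOLMOD would roughly halve the work but follows the same pattern.
lu = spla.splu(A, permc_spec="COLAMD")
x = lu.solve(b)
print("residual:", np.linalg.norm(A @ x - b))
```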
Iterative methods are another well studied approach to the problem. Two views of the problem exist. When considering the problem in its optimization form, solutions such as gradient descent, conjugate gradient, steepest descent, and quasi-Newton methods become evident. Taking the machine learning viewpoint leads to more meaningful iterative methods, among them LP, LNP, and LGC, which were introduced in the previous section. LGC's iterative procedure is useful in many other applications, so improving and analyzing it may be helpful. For example, [13] proposed an iterative procedure based on LGC for ranking on the web, and [14] used similar ideas in image retrieval. As stated before, the problem with LGC's or LP's iterative procedure is that no analysis is provided on the number of iterations needed for convergence. Moreover, no explicit stopping criterion is mentioned, which is essential for bounding the number of iterations.
Gradient descent is one of the simplest iterative solutions to any optimization problem; however, beyond this simplicity, its linear convergence rate is strongly dependent on the condition number of the Hessian [15].
Conjugate gradient is a method especially designed to solve large systems of linear equations. A set of directions conjugate with respect to A is chosen, and in each iteration the objective function is minimized along one of the directions. Theoretically the method should converge in at most n iterations, with each iteration costing as much as a sparse matrix-vector multiplication. While this makes conjugate gradient a suitable choice, its inherent numerical instability in finding conjugate directions can make the procedure slower than expected. [16], [1] apply conjugate gradient to the harmonic solution with results both superior and inferior to LP, depending on the dataset in use.
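A minimal conjugate gradient example on a synthetic sparse SPD system is sketched below, counting iterations through SciPy's callback; the matrix and right-hand side are illustrative assumptions, not the systems used in the paper.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Conjugate gradient on a sparse SPD system; each iteration costs one sparse
# matrix-vector product, and at most n iterations are needed in exact arithmetic.
n = 5000
A = sp.diags([-np.ones(n - 1), 2.5 * np.ones(n), -np.ones(n - 1)],
             offsets=[-1, 0, 1], format="csr")
b = np.random.rand(n)

iters = 0
def count(xk):          # callback invoked once per CG iteration
    global iters
    iters += 1

x, info = spla.cg(A, b, callback=count)
print("converged" if info == 0 else "not converged", "after", iters, "iterations")
```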
Quasi-Newton methods exhibit super-linear convergence. At each iteration the inverse Hessian in Newton's method is replaced by an approximation. These methods are not helpful unless the approximation is sparse; however, sparse quasi-Newton methods have an empirically lower convergence rate than low-storage quasi-Newton methods [17], so they are not helpful either. Moreover, for our problem, in which the Hessian is constant, computing an approximation to the inverse Hessian in every iteration is costly. In our proposed algorithm we avoid this cost by computing a sufficiently precise and also sparse approximation of the inverse Hessian at the start.
III. BASICS AND NOTATIONS
Consider the general problem of semi-supervised learning. Let Xu = {x1, . . . , xu} and Xl = {xu+1, . . . , xu+l} be the sets of unlabeled and labeled data points respectively, where n = u + l is the total number of data points. Also let y be a vector of length n with yi = 0 for unlabeled xi, and yi equal to −1 or 1 according to the class label for the labeled data points. Our goal is to predict the labels of X = Xu ∪ Xl as f, where fi is the label associated with xi for i = 1, . . . , n.
It is common to construct a similarity graph of the data using methods like weighted k-NN for better performance and accuracy [1]. Let W be the n × n weight matrix
$$W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma}\right), \qquad (2)$$
where σ is the bandwidth parameter. Define the diagonal matrix D with nonzero entries $D(i, i) = \sum_{j=1}^{n} W_{ij}$. Symmetrically normalize W by $S = D^{-1/2} W D^{-1/2}$. The Laplacian matrix is $L = I - S$.
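A possible construction of W, S, and L in Python is sketched below, assuming scikit-learn's kneighbors_graph for the weighted k-NN step and symmetrizing the directed k-NN graph; the function name build_graph and its default parameters are illustrative, not taken from the paper.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

def build_graph(X, k=5, sigma=1.0):
    """Weighted k-NN graph W, its symmetric normalization S, and the Laplacian L = I - S."""
    # Euclidean distances to the k nearest neighbours (sparse, n x n), then squared.
    D2 = kneighbors_graph(X, n_neighbors=k, mode="distance", include_self=False)
    D2.data = D2.data ** 2
    # Gaussian weights as in (2); symmetrize since the k-NN graph is directed.
    W = D2.copy()
    W.data = np.exp(-W.data / (2.0 * sigma))
    W = W.maximum(W.T)
    # Degree matrix and symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d = np.asarray(W.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    L = sp.eye(W.shape[0]) - S
    return W, S, L

# Example with random points standing in for a real dataset.
X = np.random.rand(200, 10)
W, S, L = build_graph(X, k=5, sigma=0.5)
```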
The family of manifold regularization algorithms can be formulated as the following optimization problem:
$$\min_f \; f^T Q f + (f - y)^T C (f - y), \qquad (3)$$
where Q is a regularization matrix (usually the Laplacian itself) and C is a diagonal matrix with C_ii equal to the importance of the i-th node sticking to its initial value y_i. The first term represents smoothness of the predicted labels with respect to the underlying manifold, and the second term is the squared error of the predicted labels compared with the initial ones, weighted by C.
Choosing different Q and C leads to various manifold classification methods [5], [10], [9], [3].
In LGC, Q = L and C = µI. It may easily be shown that the solution is equal to
$$f^* = (L + C)^{-1} C y = (I - \alpha S)^{-1} y, \qquad (4)$$
where α = 1/(µ + 1). The authors of [10] propose an iterative algorithm to compute this solution:
$$f^{(t+1)} = \alpha S f^{(t)} + (1 - \alpha) y. \qquad (5)$$
Since 0 < α < 1 and the eigenvalues of S are in [0, 1], this iterative algorithm converges to the solution of LGC [10].
In summary, the manifold regularization problem casts into minimizing
$$R(f) = f^T L f + (f - y)^T C (f - y). \qquad (6)$$
Throughout the paper, R^(t) and f^(t) denote the value and the point, respectively, at the t-th iteration of the algorithm, and R* and f* denote the corresponding optimal ones.
IV. ANALYSIS OF GRADIENT DESCENT
The gradient of (6) is ∇R = 2(Lf + C(f − y)), which leads to the gradient descent update rule:
$$f^{(t+1)} = f^{(t)} - 2\alpha\left(L f^{(t)} + C(f^{(t)} - y)\right). \qquad (7)$$
The stopping criterion is ||∇R|| ≤ η. Choosing α appropriately is essential for convergence. Following [15], applying exact line search to our problem ensures linear convergence, and at iteration t we have
$$t \le \frac{\log\left(\frac{R^{(0)} - R^*}{R^{(t)} - R^*}\right)}{\log(1/z)}, \qquad (8)$$
where z is a constant equal to 1 − λ_min(L + C)/λ_max(L + C).
For a deeper analysis of the method we need the following lemma.
Lemma 1 ([18]). If λ_m and λ_M are the smallest and largest eigenvalues of L respectively, then 0 = λ_m < λ_M ≤ 2.
Using the above lemma and the fact that C = µI, we have λ_min(L + C) = µ and λ_max(L + C) = µ + λ_M ≤ µ + 2.
Lemma 2. For any convex function R of f in (6) the following hold:
$$R - R^* \ge \frac{1}{2\lambda_{\max}(\nabla^2 R)} \|\nabla R\|^2, \qquad (9)$$
$$R - R^* \le \frac{1}{2\lambda_{\min}(\nabla^2 R)} \|\nabla R\|^2, \qquad (10)$$
$$R - R^* \le \frac{\lambda_{\max}(\nabla^2 R)}{2} \|f - f^*\|^2, \qquad (11)$$
$$\|f - f^*\| \ge \frac{1}{\lambda_{\max}(\nabla^2 R)} \|\nabla R\|. \qquad (12)$$
Proof: Considering that the Hessian is a constant matrix, the proofs of (9) and (10) can be found in standard optimization texts such as [15]. For (11) we need the following [15]:
$$R(h) \le R(f) + \nabla R(f)^T (h - f) + \frac{\lambda_{\max}(\nabla^2 R)}{2} \|h - f\|^2. \qquad (13)$$
Replacing f* for f and f for h we get
$$R(f) \le R(f^*) + \frac{\lambda_{\max}(\nabla^2 R)}{2} \|f - f^*\|^2, \qquad (14)$$
and the third equation is proved. Combining this with (9), the fourth equation is proved.
Theorem 1. The maximum number of iterations for gradient descent with exact line search and fixed (η, µ) is O(log n).
Proof: Consider the iteration t just before stopping, i.e., when ||∇R^(t)|| > η and ||∇R^(t+1)|| ≤ η. Using equation (9) and Lemma 1:
$$R^{(t)} - R^* \ge \frac{1}{2\lambda_{\max}(L + C)} \|\nabla R^{(t)}\|^2 \ge \frac{1}{2(\lambda_M + \mu)}\,\eta^2. \qquad (15)$$
Inserting this into (8) yields
$$t \le \frac{\log\left(\frac{2(\lambda_M + \mu)(R^{(0)} - R^*)}{\eta^2}\right)}{\log\left(1 + \frac{\mu}{\lambda_M}\right)}. \qquad (16)$$
In order to find an upper bound for R^(0) − R*, inequality (11) is used:
$$R^{(0)} - R^* \le \frac{\lambda_M + \mu}{2}\,\|f^{(0)} - f^*\|^2 \le \frac{(\lambda_M + \mu)\,n}{2}, \qquad (17)$$
where in the last inequality we use the fact that f^(0) = 0 and the elements of f* are in [−1, 1]. Using this in (16) we reach
$$t \le \frac{\log\left(\frac{(\lambda_M + \mu)^2 n}{\eta^2}\right)}{\log\left(1 + \frac{\mu}{\lambda_M}\right)} \le \frac{\log\left(\frac{(2 + \mu)^2 n}{\eta^2}\right)}{\log\left(1 + \frac{\mu}{2}\right)}. \qquad (18)$$
Each iteration of gradient descent in equation (7) consists of two steps. First, α is computed, which takes a fixed number of matrix-vector multiplications. Next, Lf + C(f − y) is computed, which costs the same. Considering that all the matrices involved are sparse, because L is constructed using k-NN and C is diagonal, each iteration amounts to a few sparse matrix-vector multiplications. Thus the total cost of each iteration is O(kn), where k is the neighborhood size used in the construction of the similarity graph.
Putting these together, we arrive at an O(kn log n) time complexity for computing the solution of LGC with gradient descent, i.e., an O(n log n) rate of growth with respect to the number of data points n, which is considerably less than the O(n^3) complexity of the naive inverse implementation, or O(n^2) with sparsity taken into consideration.
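The two steps of such an iteration can be sketched as follows, assuming L has been built as in Section III and C = µI. The exact-line-search step size α = gᵀg / (2 gᵀ(L + C)g) follows from the quadratic form of R, but the code itself, including the function name, is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np
import scipy.sparse as sp

def lgc_gradient_descent(L, y, mu=0.5, eta=0.005, max_iter=10000):
    """Minimize R(f) = f^T L f + mu ||f - y||^2 by gradient descent with exact line search."""
    n = L.shape[0]
    A = L + mu * sp.eye(n)          # the Hessian of R is 2(L + C) with C = mu*I
    f = np.zeros(n)                 # f^(0) = 0, as assumed in the bound (17)
    for t in range(max_iter):
        g = 2.0 * (A @ f - mu * y)  # gradient: 2(Lf + C(f - y))
        if np.linalg.norm(g) <= eta:
            return f, t             # stopping criterion ||grad R|| <= eta
        # Exact line search along -g for a quadratic objective.
        alpha = (g @ g) / (2.0 * (g @ (A @ g)))
        f = f - alpha * g
    return f, max_iter
```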
It is easy to show that the analysis presented above is valid for other Laplacians L and matrices C, i.e., applying gradient descent to other manifold regularization methods, such as the harmonic solution and Tikhonov regularization, leads to the same bound.
An interesting feature of the bound derived in (18) is that it is independent of the dataset in use. Replacing λ_M by its upper bound in (18) eliminates the dependence of the bound on the data. This independence, together with the bound being sufficiently tight, makes it appropriate for data-independent practical implementation.
V. SPARSE APPROXIMATION OF NEWTON'S METHOD
Newton's update rule for our problem is
$$f^{(t+1)} = f^{(t)} - \alpha\,(\nabla^2 R)^{-1} \nabla R. \qquad (19)$$
For our quadratic problem one iteration is sufficient to reach the optimum point with α = 1; however, we wish to find a sparse approximation of the inverse Hessian. We show that using a sparse approximation of the inverse Hessian leads to an iterative method with an acceptable convergence rate. As an interesting result, it may be seen that in a special case our method reduces to LGC. We start by approximating the inverse Hessian:
$$(\nabla^2 R)^{-1} = \frac{1}{2}(L + C)^{-1} = \frac{1}{2}(I - S + C)^{-1} = \frac{1}{2}\left(I - (I + C)^{-1} S\right)^{-1}(I + C)^{-1} = \frac{1}{2}\sum_{i=0}^{\infty}\left((I + C)^{-1} S\right)^i (I + C)^{-1}. \qquad (20)$$
The last equality is obtained because the eigenvalues of (I + C)^{-1}S are all less than one. Using the first m terms of the above summation leads to an approximation of the inverse Hessian:
$$(\nabla^2 R)^{-1} \approx \frac{1}{2}\sum_{i=0}^{m-1}\left((I + C)^{-1} S\right)^i (I + C)^{-1}. \qquad (21)$$
Rewriting Newton's method with the approximated inverse Hessian results in the update rule below:
$$\begin{aligned}
f^{(t+1)} &= f^{(t)} - (\nabla^2 R)^{-1} \times 2\left(L f + C(f - y)\right)\\
&\approx f^{(t)} - \sum_{i=0}^{m-1}\left((I + C)^{-1} S\right)^i (I + C)^{-1}\left(L f + C(f - y)\right)\\
&= f^{(t)} - \sum_{i=0}^{m-1}\left((I + C)^{-1} S\right)^i\left((I + C)^{-1}(I + C - S) f^{(t)} - (I + C)^{-1} C y\right)\\
&= f^{(t)} - \sum_{i=0}^{m-1}\left((I + C)^{-1} S\right)^i\left(I - (I + C)^{-1} S\right) f^{(t)} + \sum_{i=0}^{m-1}\left((I + C)^{-1} S\right)^i (I + C)^{-1} C y\\
&= f^{(t)} - \left(I - \left((I + C)^{-1} S\right)^m\right) f^{(t)} + \sum_{i=0}^{m-1}\left((I + C)^{-1} S\right)^i (I + C)^{-1} C y\\
&= \left((I + C)^{-1} S\right)^m f^{(t)} + \sum_{i=0}^{m-1}\left((I + C)^{-1} S\right)^i (I + C)^{-1} C y. \qquad (22)
\end{aligned}$$
In summary it can be restated as
$$f^{(t+1)} = H^m f^{(t)} + g_m, \qquad (23)$$
where
$$H = (I + C)^{-1} S \qquad (24)$$
and
$$g_m = \left(\sum_{i=0}^{m-1} H^i\right)(I + C)^{-1} C y. \qquad (25)$$
This update rule is performed iteratively from an initial f^(0) until the stopping criterion ||∇R|| ≤ η is reached.
Theorem 2. The approximate Newton's method in (23) converges to the optimal solution of LGC.
Proof: Unfolding the update rule in (23) leads to
$$f^{(t)} = H^{mt} f^{(0)} + \sum_{i=0}^{t-1} H^{mi} g_m = H^{mt} f^{(0)} + \left(\sum_{i=0}^{t-1} H^{mi}\right)\left(\sum_{i=0}^{m-1} H^{i}\right)(I + C)^{-1} C y = H^{mt} f^{(0)} + \left(\sum_{i=0}^{mt-1} H^{i}\right)(I + C)^{-1} C y. \qquad (26)$$
Letting t → ∞ gives the final solution. Since the magnitudes of the eigenvalues of H are less than one, H^{mt} f^{(0)} → 0, and
$$\lim_{t \to \infty} f^{(t)} = (I - H)^{-1}(I + C)^{-1} C y = (L + C)^{-1} C y, \qquad (27)$$
which is equal to f* in (4).
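A minimal sketch of the update (23)-(25) for C = µI is given below. Instead of forming H^m explicitly (which densifies the matrix), it applies H as m successive sparse matrix-vector products per iteration; this is an equivalent, implementation-level choice not prescribed by the paper, and the function name lgc_approx_newton is illustrative.

```python
import numpy as np

def lgc_approx_newton(S, y, mu=0.5, m=2, eta=0.005, max_iter=1000):
    """Approximate Newton iteration f <- H^m f + g_m with H = S/(1+mu), i.e. (I+C)^{-1} S for C = mu*I."""
    n = S.shape[0]
    scale = 1.0 / (1.0 + mu)

    def H(v):                       # apply H = (I + C)^{-1} S once
        return scale * (S @ v)

    # g_m = (sum_{i=0}^{m-1} H^i) (I + C)^{-1} C y, accumulated by repeated application of H.
    base = (mu * scale) * y
    g_m, term = base.copy(), base.copy()
    for _ in range(m - 1):
        term = H(term)
        g_m += term

    f = np.zeros(n)
    for t in range(max_iter):
        grad = 2.0 * ((f - S @ f) + mu * (f - y))   # 2(Lf + C(f - y)) with L = I - S
        if np.linalg.norm(grad) <= eta:
            return f, t
        v = f
        for _ in range(m):          # apply H^m via m sparse matrix-vector products
            v = H(v)
        f = v + g_m                 # update (23): f^(t+1) = H^m f^(t) + g_m
    return f, max_iter
```

With m = 1 this sketch reduces to the LGC update (5), consistent with the special case discussed later in this section.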
Theorem 3. For the approximate Newton's method in (23), the stopping criterion ||∇R|| ≤ η is reached in O(log n) iterations with respect to the number of data points n.
Proof: Since f* is a fixed point of the update rule,
$$f^{(t)} - f^* = (H^m f^{(t-1)} + g_m) - (H^m f^* + g_m) = H^m (f^{(t-1)} - f^*). \qquad (28)$$
H^m is symmetric, so ||H^m x|| ≤ λ_max(H^m)||x||, and therefore
$$\|f^{(t)} - f^*\| \le \lambda_{\max}(H^m)\,\|f^{(t-1)} - f^*\| \le \lambda_{\max}(H^m)^t\,\|f^{(0)} - f^*\| = \lambda_{\max}\left((I + C)^{-1} S\right)^{mt}\|f^{(0)} - f^*\| = \left(\frac{1}{1+\mu}\right)^{mt}\|f^{(0)} - f^*\|. \qquad (29)$$
By rewriting the above inequality one can see that the maximum number of iterations is bounded by
$$t \le \frac{\log\left(\frac{\|f^{(0)} - f^*\|}{\|f^{(t)} - f^*\|}\right)}{m \log(1 + \mu)}. \qquad (30)$$
As in gradient descent, consider the iteration t just before the stopping criterion is met, i.e., when ||∇R^(t)|| > η and ||∇R^(t+1)|| ≤ η. Using equation (12) we have
$$\|f^{(t)} - f^*\| \ge \frac{1}{\lambda_{\max}(L + C)}\,\|\nabla R^{(t)}\| \ge \frac{\eta}{\lambda_M + \mu}. \qquad (31)$$
The maximum number of iterations is thus bounded above by
$$t \le \frac{\log\left(\frac{(\lambda_M + \mu)\|f^{(0)} - f^*\|}{\eta}\right)}{m \log(1 + \mu)} \le \frac{\log\left(\frac{(2 + \mu)\|f^{(0)} - f^*\|}{\eta}\right)}{m \log(1 + \mu)} \le \frac{\log\left(\frac{(2 + \mu)\,n}{\eta}\right)}{m \log(1 + \mu)}. \qquad (32)$$
Similar to gradient descent, an O(log n) dependency on the number of data points is derived for our approximate Newton's method. The sparsity degree of H^m is k^m, so matrix-vector operations with this matrix cost O(k^m n). As the approximation becomes more exact, H^m becomes less sparse. So as m increases the number of iterations decreases, as can be seen from (32); however, the cost of each iteration grows. Empirically it is seen that m should be chosen from 1 to 3, so we can treat it as a constant and achieve an O(k^3 n log n) dependence on the number of data points for the whole algorithm. Also, since k is chosen independently of n and is usually constant, the growth of the algorithm's time complexity is O(n log n) with respect to the number of data points.
Figure 1: Demonstration of the steps taken by gradient descent and the approximate Newton's method for two data points from MNIST. The algorithms start from the top left point and move toward the optimal point, which is located at the bottom right.
Similar to gradient descent, the bound derived in (32) is independent of the dataset, which, together with its tightness, is a good feature for practical implementation. Experiments show that the bound derived here is tighter than the one for gradient descent and, of course, the number of iterations for approximate Newton is much smaller than that for gradient descent.
As a special case, we claim that for m = 1 the algorithm is the same as LGC's iterative procedure. Remembering that C = µI,
$$f^{(t+1)} = H f^{(t)} + g_1 = (I + C)^{-1} S f^{(t)} + (I + C)^{-1} C y = \frac{1}{\mu + 1} S f^{(t)} + \frac{\mu}{\mu + 1}\, y = \alpha S f^{(t)} + (1 - \alpha) y, \qquad (33)$$
which is the same as (5).
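A quick numerical check of this equivalence, on an arbitrary symmetric sparse S and random f and y (all of which are illustrative stand-ins), compares one step of H f + g_1 with one LGC step (5):

```python
import numpy as np
import scipy.sparse as sp

# One step of the approximate Newton update with m = 1 versus one LGC step (5).
np.random.seed(0)
n, mu = 100, 0.5
alpha = 1.0 / (mu + 1.0)

S = sp.random(n, n, density=0.05, format="csr")
S = 0.5 * (S + S.T)                 # any symmetric S suffices for this algebraic check
y = np.random.choice([-1.0, 0.0, 1.0], size=n)
f = np.random.rand(n)

newton_step = (S @ f) / (1.0 + mu) + (mu / (1.0 + mu)) * y   # H f + g_1
lgc_step = alpha * (S @ f) + (1.0 - alpha) * y               # update (5)
print(np.allclose(newton_step, lgc_step))                    # True
```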
Figure 1 shows how increasing m affects steps taken
by the optimization algorithm in contrast to steps taken
by gradient descent for simulations on the MNIST dataset.
Gradient descent is extremely dependent on the condition
number of the Hessian; for high condition numbers gradient
descent usually takes a series of zigzag steps to reach the
optimum point. Approximating the Newton step refines the
search direction and decreases the zigzag effect. Figure 1
shows that the steps form approximately a line at m = 2.
The Newton step for quadratic problems points directly toward the optimal point. The trajectory of the approximate method with m = 2 closely coincides with the true direction to the optimum, indicating how well the inverse Hessian is approximated in the proposed method. This is the reason for the small number of iterations needed for convergence of the approximate method compared with gradient descent. The experiments validating this improvement are presented in the next section.
VI. EXPERIMENTS
For experiments three real world datasets are used:
MNIST for digit recognition, Covertype for forest cover
prediction, and Classic for text categorization. These rather
large datasets are chosen to better simulate a large-scale
setting, for which naive solutions, such as the inverse operation, are not applicable in terms of memory and time.
MNIST is a collection of 60000 handwritten digit samples. For classification we choose 10000 data points from the digits 2 and 8. Each is of dimension 784. No preprocessing is done on the data. The forest Covertype dataset was collected for predicting forest cover type from cartographic variables. It includes seven classes and 581012 samples of dimension 54. We randomly select 20000 samples of types 1 and 2, and normalize them such that each feature is in [0, 1]. The Classic collection is a benchmark dataset in text mining. This dataset consists of 4 different document collections: CACM (3204 documents), CISI (1460 documents), CRAN (1398 documents), and MED (1033 documents). We try to separate the first category from the others. Terms are single words with a minimum length of 3; a term must appear in at least 3 documents and in at most 95% of the documents. Moreover, Porter stemming is applied during preprocessing. Features are weighted with the TF-IDF scheme and normalized to unit length.
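The stated term filtering and weighting can be reproduced, for example, with scikit-learn's TfidfVectorizer; the parameter choices below mirror the description above (Porter stemming, which scikit-learn does not provide, would be applied to the raw text beforehand, e.g., with NLTK), while classic_docs is a hypothetical list of the raw document strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features for the Classic collection, mirroring the stated preprocessing.
vectorizer = TfidfVectorizer(
    token_pattern=r"(?u)\b\w{3,}\b",  # terms are single words of at least 3 characters
    min_df=3,                         # a term appears in at least 3 documents
    max_df=0.95,                      # and in at most 95% of the documents
    norm="l2",                        # features normalized to unit length
)
# X = vectorizer.fit_transform(classic_docs)  # classic_docs: list of raw document strings (hypothetical)
```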
For all the datasets we use the same setting: adjacency matrices are constructed using 5-NN with the bandwidth set to the mean of the standard deviation of the data, 2% of the data points are labeled, and µ is set to 0.5. Choosing η = 0.005 empirically ensures convergence to the optimal solutions. The number of iterations, accuracy, and distance to the optimum are reported as averages over 10 runs with different random labelings. The algorithms are run on the datasets and the results are depicted and discussed in the following.
Figure 2 shows the number of iterations for the three iterative methods with respect to the number of data points. The solutions of the iterative methods have almost converged to the optimum point (as depicted in Figure 3). LGC's default implementation is the worst among the three. Gradient descent is second, and our approximate Newton's method has the fastest convergence rate, consistently across the three diverse datasets. Note that LGC corresponds to the approximate method with m = 1 and, as indicated in Figure 1, chooses a better direction than gradient descent, so it may be surprising that it needs more iterations than gradient descent. The key point is the line search. Although the direction proposed by gradient descent is worse than the one used by LGC, exact line search causes gradient descent to reach the optimum faster. If we incorporate an exact line search into our approximate method we reach even fewer iterations; however, it was empirically observed that, due to the time consumed by the line search, there is no improvement in terms of running time.
Another important point about the diagrams in Figure 2 is the order of growth with respect to the number of data points, which is consistent with the logarithmic growth derived in the previous sections. This makes LGC with an iterative implementation a good choice for large-scale SSL tasks. To illustrate how tight the derived bounds are, we put the parameters into equations (32) and (18) to get 19, 38, and 97 for the approximate method with m = 2, with m = 1, and for gradient descent, respectively, which may be compared with the empirical values in the diagrams of Figure 2. Interestingly, the diagrams show that the derived bounds are quite tight regardless of the dataset.
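These reference values can be reproduced by plugging the stated parameters into (18) and (32), assuming n = 10000 (the size of the MNIST subset) and replacing λ_M by its upper bound 2; the short sketch below is an illustration of that arithmetic, not the authors' code.

```python
import numpy as np

# Evaluate the iteration bounds (18) and (32) with mu = 0.5, eta = 0.005,
# assuming n = 10000 and lambda_M replaced by its upper bound 2.
n, mu, eta = 10_000, 0.5, 0.005

gd_bound = np.log((2 + mu) ** 2 * n / eta ** 2) / np.log(1 + mu / 2)        # bound (18)
newton_bound = lambda m: np.log((2 + mu) * n / eta) / (m * np.log(1 + mu))  # bound (32)

print(f"{gd_bound:.1f} {newton_bound(1):.1f} {newton_bound(2):.1f}")
# roughly 97, 38, and 19, matching the values quoted in the text
```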
Figure 3 shows the accuracy of the iterative methods compared with a factorization method, CHOLMOD [19], which uses Cholesky factorization to solve systems of linear equations fast. Since computing the exact solution via the inverse is impractical, we use a factorization method to solve for the exact solution and compare it with the solutions of the iterative methods. As seen from the diagrams, for all three datasets the solutions of the iterative methods are sufficiently close to the optimal solution, with the numbers of iterations shown in Figure 2.
Figure 4 compares the distance to the optimum for the different methods at each iteration and shows how these methods converge to the optimum. As expected from the previous results, the approximate Newton's method with m = 2 has the fastest convergence, while LGC is the slowest. As stated before, the superiority (in terms of the number of iterations) of gradient descent over LGC is due to its line search, not the direction chosen by the method.
Figure 5 shows the time needed to compute the solution. Figure 5a compares our approximate Newton's method with CHOLMOD, which is the state of the art method for solving large systems of linear equations. The iterative methods are clearly superior to CHOLMOD. Figure 5b compares the running times of the different iterative methods. Again the proposed method with m = 2 is the best; however, this time LGC performs better than gradient descent, because of the overhead imposed by the line search. As the number of data points gets larger, the difference between the methods becomes more evident. The time growth is of order n log(n), as predicted by Theorems 1 and 3.
VII. CONCLUSION AND FUTURE WORKS
In this paper, a novel approximation to Newton's method is proposed for solving the manifold regularization problem, along with a theoretical analysis of the number of iterations. We proved that the number of iterations has a logarithmic dependence on the number of data points. We also applied gradient descent to this problem and proved that its number of iterations also grows logarithmically with the number of data points. The logarithmic dependence makes iterative methods a reasonable approach when a large amount of data is being classified.
[Figure 2 plots: panels (a) MNIST, (b) Covertype, (c) Classic; y-axis: Number of Iterations, x-axis: Number of data; curves: LGC, Gradient Descent, Approx. Newton m = 2.]
Figure 2: Number of iterations for three iterative methods with respect to the number of data.
[Figure 3 plots: panels (a) MNIST, (b) Covertype, (c) Classic; y-axis: Accuracy, x-axis: Number of data; curves: LGC, Gradient Descent, Approx. Newton m = 2, CHOLMOD.]
Figure 3: Accuracy of the iterative methods compared with CHOLMOD.
[Figure 4 plots: panels (a) MNIST, (b) Covertype, (c) Classic; y-axis: ||f(t) − f*||, x-axis: Number of iterations; curves: LGC, Gradient Descent, Approx. Newton m = 2.]
Figure 4: Distance from optimum for the three methods with respect to the iteration number.
It is notable that the derived bounds are empirically tight regardless of the dataset in use, which is practically an important feature of an algorithm. We derived LGC's iterative procedure as a special case of our proposed approximate Newton's method. Our method is based upon an approximation of the inverse Hessian: the more exact the approximation, the better the search direction. Experimental results confirm the improvement of our proposed method over LGC's iterative procedure without any loss in classification accuracy. The improvement of our approximate method over gradient descent is also shown both theoretically and empirically.
A theoretical analysis of robustness against noise, incorporating a low cost line search into the proposed method, and finding lower bounds or tighter upper bounds on the number of iterations, to name a few, are interesting problems that remain as future work.
[Figure 5 plots: panels (a) and (b), both on MNIST; y-axis: Duration (Sec), x-axis: Number of data; (a) compares Approx. Newton m = 2 with CHOLMOD, (b) compares LGC, Gradient Descent, and Approx. Newton m = 2.]
Figure 5: Comparison of time needed to compute the solution for iterative methods and CHOLMOD.
REFERENCES
[1] X. Zhu, “Semi-supervised learning with graphs,” Ph.D. dissertation, Carnegie Mellon University, 2005.
[2] O. Chapelle, B. Scholkopf, and A. Zien, Semi-supervised
learning. MIT press Cambridge, MA, 2006, vol. 2.
[3] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and
unlabeled examples,” Journal of Machine Learning Research,
vol. 7, pp. 2399–2434, 2006.
[4] O. Duchenne, J. Audibert, R. Keriven, J. Ponce, and F. Ségonne, “Segmentation by transduction,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
[5] M. Belkin and P. Niyogi, “Using manifold stucture for
partially labeled classification,” in NIPS, 2002, pp. 929–936.
[6] V. Sindhwani, P. Niyogi, M. Belkin, and S. Keerthi, “Linear
manifold regularization for large scale semi-supervised learning,” in Proc. of the 22nd ICML Workshop on Learning with
Partially Classified Training Data, 2005.
[7] I. Tsang and J. Kwok, “Large-scale sparsified manifold
regularization,” Advances in Neural Information Processing
Systems, vol. 19, p. 1401, 2007.
[8] X. Zhu and Z. Ghahramani, “Learning from labeled and
unlabeled data with label propagation,” School Comput. Sci.,
Carnegie Mellon Univ., Tech. Rep. CMUCALD-02-107, 2002.
[9] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised
learning using gaussian fields and harmonic functions,” in
ICML, 2003, pp. 912–919.
[10] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf,
“Learning with local and global consistency,” in NIPS, 2003.
[11] F. Wang and C. Zhang, “Label propagation through linear
neighborhoods,” in Proceedings of the 23rd international
conference on Machine learning. ACM, 2006, pp. 985–992.
[12] A. George and J. Liu, Computer solution of large sparse positive definite systems, ser. Prentice-Hall series in computational
mathematics. Prentice-Hall, 1981.
[13] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and
B. Scholkopf, “Ranking on data manifolds,” in Advances in
neural information processing systems 16: proceedings of the
2003 conference, vol. 16. The MIT Press, 2004, p. 169.
[14] J. He, M. Li, H. Zhang, H. Tong, and C. Zhang, “Manifold-ranking based image retrieval,” in Proceedings of the 12th annual ACM international conference on Multimedia. ACM, 2004, pp. 9–16.
[15] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
[16] A. Argyriou, “Efficient approximation methods for harmonic semi-supervised learning,” Master’s thesis, University College London, UK, 2004.
[17] J. Nocedal and S. Wright, Numerical optimization. Springer-Verlag, 1999.
[18] F. Chung, Spectral graph theory. American Mathematical Society, 1997, no. 92.
[19] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, “Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate,” ACM Trans. Math. Softw., vol. 35, pp. 22:1–22:14, October 2008.