Efficient Iterative Semi-Supervised Classification on Manifold

Mehrdad Farajtabar, Hamid R. Rabiee, Amirreza Shaban, Ali Soltani-Farani
Digital Media Lab, AICTC Research Center, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
{farajtabar, shaban, a soltani}@ce.sharif.edu, rabiee@sharif.edu

Abstract—Semi-Supervised Learning (SSL) has become a topic of recent research that effectively addresses the problem of limited labeled data. Many SSL methods have been developed based on the manifold assumption; among them, Local and Global Consistency (LGC) is a popular method. The problem with most of these algorithms, and in particular with LGC, is that their naive implementations do not scale well to the size of the data. Time and memory limitations are the major problems faced in large-scale settings. In this paper, we provide theoretical bounds on gradient descent and, to overcome the aforementioned problems, propose a new approximate Newton's method. Moreover, convergence analysis and theoretical bounds on the time complexity of the proposed method are provided. We show that the number of iterations in the proposed methods depends logarithmically on the number of data points, which is a considerable improvement compared to the naive implementations. Experimental results on real-world datasets confirm the superiority of the proposed methods over LGC's default iterative implementation and a state-of-the-art factorization method.

Keywords—Semi-supervised learning, Manifold assumption, Local and global consistency, Iterative method, Convergence analysis

I. INTRODUCTION

Semi-supervised learning has become a popular approach to the problem of classification with limited labeled data in recent years [1]. To use unlabeled data effectively in the learning process, certain assumptions regarding the possible labeling functions and the underlying geometry need to hold [2]. In many real-world classification problems, data points lie on a low-dimensional manifold. The manifold assumption states that the labeling function varies smoothly with respect to the underlying manifold [3]. Methods utilizing the manifold assumption have proved effective in many applications, including image segmentation [4], handwritten digit recognition, and text classification [5].

Regularization is essentially the soul of semi-supervised learning based on the manifold assumption. Manifold regularization is commonly formulated as a quadratic optimization problem,

\min_x \; \frac{1}{2} x^T A x - b^T x,  \qquad (1)

where A \in R^{n \times n} and b, x \in R^n. It is in effect equivalent to solving the system of linear equations Ax = b. Fortunately, A is a sparse symmetric positive definite matrix. Naive solutions to this problem require O(n^3) operations to solve for x, while methods that take into account the sparse structure of A can cost much less. Taking the inverse of A directly is an obviously bad choice for several reasons. First, taking the inverse requires O(n^3) operations regardless of the sparse structure of A. Second, A may be near-singular, in which case the inverse operation is numerically unstable. Lastly, the inverse of A is usually not sparse, in which case a large amount of memory is needed to store and process A^{-1}. To elaborate, note that semi-supervised learning is especially advantageous when there is a large amount of unlabeled data, which leads to better utilization of the underlying manifold structure.
For example, consider the huge number of unlabeled documents or images on the web which may be used to improve classification results. In these large-scale settings ordinary implementations are not effective, because time and memory limitations are an important concern in SSL methods based on the manifold assumption [1].

There are commonly two approaches to overcoming this problem. First, one may reformulate the manifold regularization problem in a new form, more suitable for large-scale settings. For example, [6] considers a linear base kernel and thus requires an inverse operation on a much smaller matrix, and [7] uses a sparsified manifold regularizer with core vector machines (recently proposed for scaling up kernel methods) to handle large-scale data. The second approach (which is the focus of this paper) relies on factorization, optimization, or iterative procedures to solve the original manifold regularization formulation. In particular, iterative methods are of great interest. Label propagation (LP) [8] is an iterative algorithm for computing the harmonic solution [9], which is a variation of the manifold regularization problem. Another naturally iterative manifold regularization algorithm is Local and Global Consistency (LGC) [10], upon which we build our work. Linear neighborhood propagation (LNP) [11] is another iterative method, which differs from other manifold learning methods mostly in the way it constructs the neighborhood graph. The problem with most of these iterative methods is that, although they are claimed to converge quickly, no analytical guarantee or proof is given for that claim.

In this paper we conduct a theoretical analysis of iterative methods for LGC. We apply gradient descent to LGC and derive an analytical bound for the number of iterations and its dependency on the number of data points. These bounds also hold for other manifold regularization problems such as the harmonic solution and Tikhonov regularization. We then show that LGC's iterative procedure may be improved through an approximation of the inverse Hessian and present a detailed convergence analysis. Again, a theoretical bound is derived for the number of iterations. We show that these iterative implementations require O(log n) sparse matrix-vector multiplications to compute LGC's solution with sufficient accuracy. We then prove that LGC's iterative procedure is a special case of our proposed method. Finally, the proposed methods are compared with LGC's iterative procedure and a state-of-the-art factorization method based on Cholesky decomposition.

The rest of the paper is organized as follows. In Section II some related work in the domain of optimization, factorization, and iterative methods is introduced. Section III provides a basic overview of LGC and introduces the notation. Section IV provides a detailed analysis of gradient descent applied to LGC. In Section V we show how LGC's iterative procedure may be improved and derive further theoretical bounds. Section VI gives experimental results validating the derived bounds, after which the paper is concluded in Section VII.

II. RELATED WORKS

Methods such as LQ, LU, or Cholesky factorization overcome the problems of the inverse operation by factorizing A into matrices with special structure that greatly simplify computations, especially when A is sparse. In particular, Cholesky factorization best fits our problem by making use of the symmetry and positive definiteness of A.
It decomposes A as P U^T U P^T, where P is a permutation matrix and U is upper triangular with positive diagonal elements. Heuristics are used to choose a matrix P that leads to a sparse U. In some instances these heuristics fail and the resulting algorithm may not be as computationally efficient as expected [12].

Iterative methods are another well-studied approach to the problem. Two views of the problem exist. Considering the problem in its optimization form, solutions such as gradient descent, conjugate gradient, steepest descent, and quasi-Newton methods become evident. Taking the machine learning viewpoint leads to more meaningful iterative methods, among them LP, LNP, and LGC, which were introduced in the previous section. LGC's iterative procedure is useful in many other applications, so improving and analyzing it may be helpful. For example, [13] proposed an iterative procedure based on LGC for ranking on the web, and [14] used similar ideas in image retrieval. As stated before, the problem with LGC's or LP's iterative procedure is that no analysis of the number of iterations needed for convergence is provided. Moreover, no explicit stopping criterion is mentioned, which is essential for bounding the number of iterations.

Gradient descent is one of the simplest iterative solutions to any optimization problem; however, beyond this simplicity, its linear convergence rate depends strongly on the condition number of the Hessian [15]. Conjugate gradient is a method especially designed to solve large systems of linear equations. A set of directions conjugate with respect to A is chosen, and in each iteration the objective function is minimized along one of these directions. Theoretically the method should converge in at most n iterations, with each iteration costing as much as a sparse matrix-vector multiplication. While this makes conjugate gradient a suitable choice, its inherent numerical instability in finding conjugate directions can make the procedure slower than expected. [16] and [1] apply conjugate gradient to the harmonic solution, with results both superior and inferior to LP depending on the dataset in use. Quasi-Newton methods exhibit super-linear convergence; at each iteration the inverse Hessian in Newton's method is replaced by an approximation. These methods are not helpful unless the approximation is sparse, yet sparse quasi-Newton methods have an empirically lower convergence rate than low-storage quasi-Newton methods [17], so they are of little help here. Moreover, for our problem, in which the Hessian is constant, computing an approximation to the inverse Hessian at every iteration is costly. In our proposed algorithm we avoid this cost by computing a sufficiently precise and also sparse approximation of the inverse Hessian once, at the start.

III. BASICS AND NOTATIONS

Consider the general problem of semi-supervised learning. Let X_u = {x_1, ..., x_u} and X_l = {x_{u+1}, ..., x_{u+l}} be the sets of unlabeled and labeled data points respectively, where n = u + l is the total number of data points. Also let y be a vector of length n with y_i = 0 for unlabeled x_i and y_i equal to −1 or 1, corresponding to the class label, for labeled data points. Our goal is to predict the labels of X = X_u ∪ X_l as f, where f_i is the label associated with x_i for i = 1, ..., n. It is common to construct a similarity graph of the data using methods such as weighted k-NN for better performance and accuracy [1]. Let W be the n × n weight matrix

W_{ij} = \exp\!\left( -\frac{\|x_i - x_j\|^2}{2\sigma} \right),  \qquad (2)

where σ is the bandwidth parameter.
Define the diagonal matrix D with nonzero entries D(i, i) = \sum_{j=1}^{n} W_{ij}, and symmetrically normalize W as S = D^{-1/2} W D^{-1/2}. The Laplacian matrix is L = I − S. The family of manifold regularization algorithms can be formulated as the following optimization problem:

\min_f \; f^T Q f + (f - y)^T C (f - y),  \qquad (3)

where Q is a regularization matrix (usually the Laplacian itself) and C is a diagonal matrix with C_{ii} equal to the importance of the ith node sticking to its initial value y_i. The first term represents the smoothness of the predicted labels with respect to the underlying manifold, and the second term is the squared error of the predicted labels compared with the initial ones, weighted by C. Choosing different Q and C leads to various manifold classification methods [5], [10], [9], [3]. In LGC, Q = L and C = µI. It may easily be shown that the solution is equal to

f^* = (L + C)^{-1} C y = (1 - \alpha)(I - \alpha S)^{-1} y,  \qquad (4)

where α = 1/(µ + 1). The authors of [10] propose an iterative algorithm to compute this solution:

f^{(t+1)} = \alpha S f^{(t)} + (1 - \alpha) y.  \qquad (5)

Since 0 < α < 1 and the eigenvalues of S are in [0, 1], this iterative algorithm converges to the solution of LGC [10]. In summary, the manifold regularization problem reduces to minimizing

R(f) = f^T L f + (f - y)^T C (f - y).  \qquad (6)

Throughout the paper, R^{(t)} and f^{(t)} denote the objective value and the point, respectively, at the tth iteration of the algorithm, and R^* and f^* the corresponding optimal ones.

IV. ANALYSIS OF GRADIENT DESCENT

The gradient of (6) is ∇R = 2(Lf + C(f − y)), which leads to the gradient descent update rule:

f^{(t+1)} = f^{(t)} - 2\alpha \left( L f^{(t)} + C(f^{(t)} - y) \right).  \qquad (7)

The stopping criterion is ||∇R|| ≤ η. Choosing α appropriately is essential for convergence. Following [15], applying exact line search to our problem ensures linear convergence, and at iteration t we have

t \le \frac{\log\!\left( \frac{R^{(0)} - R^*}{R^{(t)} - R^*} \right)}{\log(1/z)},  \qquad (8)

where z is a constant equal to 1 − λ_min(L + C)/λ_max(L + C). For a deeper analysis of the method we need the following lemma.

Lemma 1 ([18]). If λ_m and λ_M are the smallest and largest eigenvalues of L respectively, then 0 = λ_m < λ_M ≤ 2.

Using the above lemma and the fact that C = µI, we have λ_min(L + C) = µ and λ_max(L + C) = µ + λ_M ≤ µ + 2.

Lemma 2. For the convex quadratic function R of f in (6), the following hold:

R - R^* \ge \frac{1}{2\lambda_{\max}(\nabla^2 R)} \|\nabla R\|^2.  \qquad (9)

R - R^* \le \frac{1}{2\lambda_{\min}(\nabla^2 R)} \|\nabla R\|^2.  \qquad (10)

R - R^* \le \frac{\lambda_{\max}(\nabla^2 R)}{2} \|f - f^*\|^2.  \qquad (11)

\|f - f^*\| \ge \frac{1}{\lambda_{\max}(\nabla^2 R)} \|\nabla R\|.  \qquad (12)

Proof: Considering that the Hessian is a constant matrix, the proofs of (9) and (10) can be found in standard optimization texts such as [15]. For (11) we need the following inequality [15]:

R(h) \le R(f) + \nabla R(f)^T (h - f) + \frac{\lambda_{\max}(\nabla^2 R)}{2} \|h - f\|^2.  \qquad (13)

Substituting f^* for f and f for h, we get

R(f) \le R(f^*) + \frac{\lambda_{\max}(\nabla^2 R)}{2} \|f - f^*\|^2,  \qquad (14)

and the third inequality is proved. Combining this with (9), the fourth inequality is proved.

Theorem 1. The maximum number of iterations for gradient descent with exact line search and fixed (η, µ) is O(log n).

Proof: Consider the iteration t just before stopping, i.e., when ||∇R^{(t)}|| > η and ||∇R^{(t+1)}|| ≤ η. Using equation (9) and Lemma 1:

R^{(t)} - R^* \ge \frac{1}{2\lambda_{\max}(L + C)} \|\nabla R^{(t)}\|^2 \ge \frac{\eta^2}{2(\lambda_M + \mu)}.  \qquad (15)

Inserting this into (8) yields

t \le \frac{\log\!\left( \frac{2(\lambda_M + \mu)(R^{(0)} - R^*)}{\eta^2} \right)}{\log\!\left( 1 + \frac{\mu}{\lambda_M} \right)}.  \qquad (16)

In order to find an upper bound for R^{(0)} − R^*, inequality (11) is used:

R^{(0)} - R^* \le \frac{\lambda_M + \mu}{2} \|f^{(0)} - f^*\|^2 \le \frac{(\lambda_M + \mu)\, n}{2},  \qquad (17)

where in the last inequality we use the fact that f^{(0)} = 0 and the elements of f^* are in [−1, 1]. Using this in (16) we reach

t \le \frac{\log\!\left( \frac{(\lambda_M + \mu)^2 n}{\eta^2} \right)}{\log\!\left( 1 + \frac{\mu}{\lambda_M} \right)} \le \frac{\log\!\left( \frac{(2 + \mu)^2 n}{\eta^2} \right)}{\log\!\left( 1 + \frac{\mu}{2} \right)}.  \qquad (18)

Each iteration of gradient descent in equation (7) consists of two steps. First, α is computed, which takes a fixed number of matrix-vector multiplications. Next, Lf + C(f − y) is computed, which costs the same. Considering that all the matrices involved are sparse (L is constructed using k-NN and C is diagonal), each iteration amounts to a few sparse matrix-vector multiplications. Thus the total cost of each iteration is O(kn), where k is the neighborhood size used in the construction of the similarity graph. Putting these together, we arrive at an O(kn log n) time complexity for computing the solution of LGC with gradient descent, i.e., an O(n log n) rate of growth with respect to the number of data points n, which is considerably less than the O(n^3) complexity of the ordinary inverse in naive implementations, or O(n^2) with sparsity taken into consideration.

It is easy to show that the analysis presented above is valid for other Laplacians L and matrices C, i.e., applying gradient descent to other manifold regularization methods, such as the harmonic solution and Tikhonov regularization, leads to the same bound. An interesting feature of the bound derived in (18) is that it is independent of the dataset in use: replacing λ_M by its upper bound in (18) eliminates the dependence of the bound on the data. This independence, together with the bound being sufficiently tight, makes it appropriate for data-independent practical implementation.
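To make Section IV concrete, the following is a minimal sketch of gradient descent with exact line search for the quadratic objective (6), assuming C = µI as in LGC. Since R is quadratic with constant Hessian 2(L + C), the exact line-search step size has the closed form α = gᵀg / (2 gᵀ(L + C)g) for g = ∇R; this is a standard fact used here, and the function and parameter names are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.sparse import identity

def gradient_descent_lgc(L, y, mu=0.5, eta=0.005, max_iter=1000):
    """Minimize R(f) = f^T L f + mu ||f - y||^2 by gradient descent with exact line search.
    L: sparse normalized Laplacian (I - S); y: initial label vector; C = mu * I."""
    n = L.shape[0]
    A = L + mu * identity(n, format="csr")    # A = L + C; the Hessian of R is 2*A
    f = np.zeros(n)                           # f^(0) = 0, as assumed in Theorem 1
    for t in range(max_iter):
        g = 2.0 * (A @ f - mu * y)            # gradient 2(Lf + C(f - y)) with C = mu*I
        if np.linalg.norm(g) <= eta:          # stopping criterion ||grad R|| <= eta
            break
        alpha = (g @ g) / (2.0 * (g @ (A @ g)))   # exact line search for a quadratic
        f = f - alpha * g
    return f, t
```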
V. SPARSE APPROXIMATION OF NEWTON'S METHOD

Newton's update rule for our problem is

f^{(t+1)} = f^{(t)} - \alpha (\nabla^2 R)^{-1} \nabla R.  \qquad (19)

For our quadratic problem one iteration is sufficient to reach the optimum with α = 1; however, we wish to find a sparse approximation of the inverse Hessian. We show that using a sparse approximation of the inverse Hessian leads to an iterative method with an acceptable convergence rate. As an interesting result, it may be seen that in a special case our method reduces to LGC. We start with approximating the inverse Hessian:

(\nabla^2 R)^{-1} = \frac{1}{2}(L + C)^{-1} = \frac{1}{2}(I - S + C)^{-1}
= \frac{1}{2}\left( I - (I + C)^{-1} S \right)^{-1} (I + C)^{-1}
= \frac{1}{2} \sum_{i=0}^{\infty} \left( (I + C)^{-1} S \right)^i (I + C)^{-1}.  \qquad (20)

The last equality is obtained because the eigenvalues of (I + C)^{-1} S are all less than one. Using the first m terms of the above summation leads to an approximation of the inverse Hessian:

(\nabla^2 R)^{-1} \approx \frac{1}{2} \sum_{i=0}^{m-1} \left( (I + C)^{-1} S \right)^i (I + C)^{-1}.  \qquad (21)

Rewriting Newton's method with the approximated inverse Hessian results in the update rule below:

f^{(t+1)} = f^{(t)} - (\nabla^2 R)^{-1} \cdot 2\left( L f^{(t)} + C(f^{(t)} - y) \right)
\approx f^{(t)} - \sum_{i=0}^{m-1} \left( (I + C)^{-1} S \right)^i (I + C)^{-1} \left( L f^{(t)} + C(f^{(t)} - y) \right)
= f^{(t)} - \sum_{i=0}^{m-1} \left( (I + C)^{-1} S \right)^i \left( (I + C)^{-1}(I + C - S) f^{(t)} - (I + C)^{-1} C y \right)
= f^{(t)} - \sum_{i=0}^{m-1} \left( (I + C)^{-1} S \right)^i \left( I - (I + C)^{-1} S \right) f^{(t)} + \sum_{i=0}^{m-1} \left( (I + C)^{-1} S \right)^i (I + C)^{-1} C y
= f^{(t)} - \left( I - \left( (I + C)^{-1} S \right)^m \right) f^{(t)} + \sum_{i=0}^{m-1} \left( (I + C)^{-1} S \right)^i (I + C)^{-1} C y
= \left( (I + C)^{-1} S \right)^m f^{(t)} + \sum_{i=0}^{m-1} \left( (I + C)^{-1} S \right)^i (I + C)^{-1} C y.  \qquad (22)

In summary, this can be restated as

f^{(t+1)} = H^m f^{(t)} + g_m,  \qquad (23)

where

H = (I + C)^{-1} S,  \qquad (24)

g_m = \left( \sum_{i=0}^{m-1} H^i \right) (I + C)^{-1} C y.  \qquad (25)

This update rule is performed iteratively from an initial f^{(0)} until the stopping criterion ||∇R|| ≤ η is reached.
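The update (23)-(25) translates directly into a few sparse matrix-vector products per iteration. Below is a minimal sketch for the LGC case C = µI (so H = S/(1 + µ)); the function name and the max_iter safeguard are assumptions for illustration, not the authors' code.

```python
import numpy as np
from scipy.sparse import identity

def approx_newton_lgc(S, y, mu=0.5, m=2, eta=0.005, max_iter=1000):
    """Approximate Newton's method of Section V for LGC (C = mu * I).
    S: sparse symmetrically normalized similarity D^{-1/2} W D^{-1/2}; y: initial labels."""
    n = S.shape[0]
    L = identity(n, format="csr") - S
    H = S / (1.0 + mu)                        # H = (I + C)^{-1} S since C = mu*I
    # g_m = (sum_{i=0}^{m-1} H^i) (I + C)^{-1} C y, where (I + C)^{-1} C y = (mu/(1+mu)) y
    g_m = np.zeros(n)
    term = (mu / (1.0 + mu)) * y
    for _ in range(m):
        g_m = g_m + term
        term = H @ term
    f = np.zeros(n)
    for t in range(max_iter):
        grad = 2.0 * (L @ f + mu * (f - y))   # gradient of R, used for the stopping rule
        if np.linalg.norm(grad) <= eta:
            break
        v = f
        for _ in range(m):                    # apply H^m without forming the matrix power
            v = H @ v
        f = v + g_m                           # f^(t+1) = H^m f^(t) + g_m, Eq. (23)
    return f, t
```

For m = 1 the loop body becomes f ← αSf + (1 − α)y with α = 1/(µ + 1), i.e., exactly the LGC update (5), in agreement with the special case derived in (33) below.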
Theorem 2. The approximate Newton's method in (23) converges to the optimal solution of LGC.

Proof: Unfolding the update rule in (23) leads to

f^{(t)} = H^{mt} f^{(0)} + \sum_{i=0}^{t-1} H^{mi} g_m
= H^{mt} f^{(0)} + \left( \sum_{i=0}^{t-1} H^{mi} \right)\left( \sum_{i=0}^{m-1} H^i \right)(I + C)^{-1} C y
= H^{mt} f^{(0)} + \left( \sum_{i=0}^{mt-1} H^i \right)(I + C)^{-1} C y.  \qquad (26)

Letting t → ∞ gives the final solution. Since the magnitudes of the eigenvalues of H are less than one, H^{mt} f^{(0)} → 0, and

\lim_{t \to \infty} f^{(t)} = (I - H)^{-1}(I + C)^{-1} C y = (L + C)^{-1} C y,  \qquad (27)

which is equal to f^* in (4).

Theorem 3. For the approximate Newton's method in (23), the stopping criterion ||∇R|| ≤ η is reached in O(log n) iterations with respect to the number of data points n.

Proof:

f^{(t)} - f^* = \left( H^m f^{(t-1)} + g_m \right) - \left( H^m f^* + g_m \right) = H^m \left( f^{(t-1)} - f^* \right).  \qquad (28)

H^m is symmetric, so ||H^m x|| ≤ λ_max(H^m) ||x||, and therefore

\|f^{(t)} - f^*\| \le \lambda_{\max}(H^m) \|f^{(t-1)} - f^*\| \le \lambda_{\max}(H^m)^t \|f^{(0)} - f^*\|
= \lambda_{\max}\!\left( (I + C)^{-1} S \right)^{mt} \|f^{(0)} - f^*\| = \left( \frac{1}{1 + \mu} \right)^{mt} \|f^{(0)} - f^*\|.  \qquad (29)

By rewriting the above inequality one can see that the maximum number of iterations is bounded by

t \le \frac{\log\!\left( \frac{\|f^{(0)} - f^*\|}{\|f^{(t)} - f^*\|} \right)}{m \log(1 + \mu)}.  \qquad (30)

As in gradient descent, consider the iteration t just before the stopping criterion is met, i.e., when ||∇R^{(t)}|| > η and ||∇R^{(t+1)}|| ≤ η. Using equation (12) we have

\|f^{(t)} - f^*\| \ge \frac{1}{\lambda_{\max}(L + C)} \|\nabla R^{(t)}\| \ge \frac{\eta}{\lambda_M + \mu}.  \qquad (31)

The maximum number of iterations is thus bounded above by

t \le \frac{\log\!\left( \frac{(\lambda_M + \mu)\|f^{(0)} - f^*\|}{\eta} \right)}{m \log(1 + \mu)} \le \frac{\log\!\left( \frac{(2 + \mu)\|f^{(0)} - f^*\|}{\eta} \right)}{m \log(1 + \mu)} \le \frac{\log\!\left( \frac{(2 + \mu)\, n}{\eta} \right)}{m \log(1 + \mu)}.  \qquad (32)

Similar to gradient descent, an O(log n) dependency on the number of data points is derived for our approximate Newton's method. The sparsity degree of H^m is k^m, so matrix-vector operations with this matrix cost O(k^m n). As the approximation becomes more exact, H^m becomes less sparse: as m increases the number of iterations decreases, as can be seen from (32), but the cost of each iteration grows. Empirically it is observed that m should be chosen between 1 and 3, so we can treat it as a constant and achieve an O(k^3 n log n) dependence on the number of data for the whole algorithm. Also, since k is chosen independently of n and is usually constant, the growth of the algorithm's time complexity is O(n log n) with respect to the number of data points.

Figure 1: Demonstration of the steps taken by gradient descent and the approximate Newton's method (m = 1 and m = 2) for two data points from MNIST. The algorithms start their movements from the top-left point toward the optimal point located at the bottom right.

Similar to gradient descent, the bound derived in (32) is independent of the dataset, which, together with its tightness, is a good feature for practical implementation. Experiments show that the bound derived here is tighter than that for gradient descent and, of course, the number of iterations for approximate Newton is much smaller than that for gradient descent.

As a special case, we claim that for m = 1 the algorithm is the same as LGC's iterative procedure. Remembering that C = µI,

f^{(t+1)} = H f^{(t)} + g_1 = (I + C)^{-1} S f^{(t)} + (I + C)^{-1} C y = \frac{1}{\mu + 1} S f^{(t)} + \frac{\mu}{\mu + 1} y = \alpha S f^{(t)} + (1 - \alpha) y,  \qquad (33)

which is the same as (5).

Figure 1 shows how increasing m affects the steps taken by the optimization algorithm, in contrast to the steps taken by gradient descent, for simulations on the MNIST dataset. Gradient descent is extremely dependent on the condition number of the Hessian; for high condition numbers it usually takes a series of zigzag steps to reach the optimum. Approximating the Newton step refines the search direction and reduces this zigzag effect. Figure 1 shows that the steps form approximately a straight line at m = 2. The Newton step for quadratic problems points directly toward the optimal point.
The trace of the approximate method with m = 2 coincides closely with the true direction to the optimum, indicating how well the inverse Hessian is approximated in the proposed method. This is the reason for the small number of iterations the approximate method needs to converge compared with gradient descent. The experiments validating this improvement are presented in the next section.

VI. EXPERIMENTS

For the experiments, three real-world datasets are used: MNIST for digit recognition, Covertype for forest cover prediction, and Classic for text categorization. These rather large datasets are chosen to better simulate a large-scale setting, for which naive solutions, such as the inverse operation, are not applicable in terms of memory and time.

MNIST is a collection of 60000 handwritten digit samples. For classification we choose 10000 data points from digits 2 and 8; each sample is of dimension 784. No preprocessing is done on this data. The forest Covertype dataset is collected for predicting forest cover type from cartographic variables. It includes seven classes and 581012 samples of dimension 54. We randomly select 20000 samples of types 1 and 2, and normalize them such that each feature is in [0, 1]. The Classic collection is a benchmark dataset in text mining. It consists of 4 document collections: CACM (3204 documents), CISI (1460 documents), CRAN (1398 documents), and MED (1033 documents). We try to separate the first category from the others. Terms are single words; the minimum term length is 3, a term appears in at least 3 documents and in at most 95% of the documents, and Porter's stemming is applied during preprocessing. Features are weighted with the TF-IDF scheme and normalized to 1.

For all the datasets we use the same setting: adjacency matrices are constructed using 5-NN with the bandwidth set to the mean of the standard deviation of the data, 2% of the data points are labeled, and µ is set to 0.5. Choosing η = 0.005 empirically ensures convergence to the optimal solutions. The number of iterations, accuracy, and distance to the optimum are reported as averages over 10 runs with different random labelings. The algorithms are run on the datasets and the results are depicted and discussed in the following.

Figure 2 shows the number of iterations for the three iterative methods with respect to the number of data points. The solutions of the iterative methods have almost converged to the optimum (as depicted in Figure 3). LGC's default implementation is the worst among the three, gradient descent is second, and our approximate Newton's method has the fastest convergence rate, consistently across the three diverse datasets. Note that LGC corresponds to the approximate method with m = 1 and, as indicated in Figure 1, has a better direction than gradient descent, so it may be surprising that it needs more iterations than gradient descent. The key point is the line search: although the direction proposed by gradient descent is worse than that of LGC, exact line search lets gradient descent reach the optimum faster. If we equip our approximate method with an exact line search we reach even fewer iterations; however, it was empirically observed that, due to the time consumed by the line search, there is no improvement in terms of running time. Another important point about the diagrams in Figure 2 is the order of growth with respect to the number of data points, which is consistent with the logarithmic growth derived in the previous sections.
This makes LGC with an iterative implementation a good choice for large-scale SSL tasks. To illustrate how tight the bounds derived for the iterative methods are, we substitute the parameters into equations (32) and (18) to get 19, 38, and 97 iterations for the approximate method with m = 2, the approximate method with m = 1, and gradient descent, respectively, which may be compared with the empirical values in the diagrams of Figure 2. Interestingly, the diagrams show that the derived bounds are quite tight regardless of the dataset.

Figure 3 shows the accuracy of the iterative methods compared with a factorization method, CHOLMOD [19], which uses Cholesky factorization to solve a system of linear equations quickly. Since computing the exact solution via the inverse is impractical, we use a factorization method to obtain the exact solution and compare it with the solutions of the iterative methods. As seen from the diagrams, for all three datasets the solutions of the iterative methods are sufficiently close to the optimal solution, with the numbers of iterations shown in Figure 2.

Figure 4 compares the distance to the optimum for the different methods at each iteration and shows how these methods converge to the optimum. As expected from the previous results, the approximate Newton's method with m = 2 has the fastest convergence, while LGC is the slowest. As stated before, the superiority (in terms of the number of iterations) of gradient descent over LGC is due to its line search, not the direction chosen by the method.

Figure 5 shows the time needed to compute the solution. Figure 5a compares our approximate Newton's method with CHOLMOD, a state-of-the-art method for solving large systems of linear equations; the iterative methods are clearly superior to CHOLMOD. Figure 5b compares the running times of the different iterative methods. Again the proposed method with m = 2 is the best, but this time LGC performs better than gradient descent because of the overhead imposed by the line search. As the number of data points grows, the difference between the methods becomes more evident. The time growth is of order n log(n), as predicted by Theorems 1 and 3.

Figure 2: Number of iterations for the three iterative methods (LGC, gradient descent, approximate Newton with m = 2) with respect to the number of data points; (a) MNIST, (b) Covertype, (c) Classic.

Figure 3: Accuracy of the iterative methods compared with CHOLMOD; (a) MNIST, (b) Covertype, (c) Classic.

Figure 4: Distance from the optimum for the three methods with respect to the iteration number; (a) MNIST, (b) Covertype, (c) Classic.

Figure 5: Comparison of the time needed to compute the solution for the iterative methods and CHOLMOD; both panels on MNIST.
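As a quick arithmetic check of the quoted values, the bounds (18) and (32) can be evaluated directly. Assuming natural logarithms, λ_M replaced by its upper bound 2, the paper's setting µ = 0.5 and η = 0.005, and n = 10000 (the size of the MNIST subset; this choice of n is our assumption), the sketch below reproduces roughly 97, 38, and 19 iterations for gradient descent and the approximate method with m = 1 and m = 2.

```python
import math

mu, eta, n, lam_M = 0.5, 0.005, 10000, 2.0   # assumed values; lam_M bounded by 2 (Lemma 1)

# Bound (18) for gradient descent with exact line search
t_gd = math.log((lam_M + mu) ** 2 * n / eta ** 2) / math.log(1.0 + mu / lam_M)

# Bound (32) for the approximate Newton's method with parameter m
def t_newton(m):
    return math.log((lam_M + mu) * n / eta) / (m * math.log(1.0 + mu))

print(round(t_gd), round(t_newton(1)), round(t_newton(2)))   # ~97, ~38, ~19
```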
VII. CONCLUSION AND FUTURE WORKS

In this paper, a novel approximation to Newton's method is proposed for solving the manifold regularization problem, along with a theoretical analysis of the number of iterations. We proved that the number of iterations has a logarithmic dependence on the number of data points. We also applied gradient descent to this problem and proved that its number of iterations likewise grows logarithmically with the number of data. This logarithmic dependence makes iterative methods a reasonable approach when a large amount of data is being classified. It is notable that the derived bounds are empirically tight independent of the dataset in use, which is practically an important feature of an algorithm. We derived LGC's iterative procedure as a special case of our proposed approximate Newton's method. Our method is based upon an approximation of the inverse Hessian: the more exact the approximation, the better the chosen search direction. Experimental results confirm the improvement of our proposed method over LGC's iterative procedure without any loss in classification accuracy. The improvement of our approximate method over gradient descent is also shown both theoretically and empirically. A theoretical analysis of robustness against noise, incorporating a low-cost line search into the proposed method, and finding lower bounds on the number of iterations or tighter upper bounds are, to name a few, interesting problems that remain as future work.

REFERENCES

[1] X. Zhu, "Semi-supervised learning with graphs," Ph.D. dissertation, Carnegie Mellon University, 2005.
[2] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006, vol. 2.
[3] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," Journal of Machine Learning Research, vol. 7, pp. 2399-2434, 2006.
[4] O. Duchenne, J. Audibert, R. Keriven, J. Ponce, and F. Ségonne, "Segmentation by transduction," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008). IEEE, 2008, pp. 1-8.
[5] M. Belkin and P. Niyogi, "Using manifold structure for partially labeled classification," in NIPS, 2002, pp. 929-936.
[6] V. Sindhwani, P. Niyogi, M. Belkin, and S. Keerthi, "Linear manifold regularization for large scale semi-supervised learning," in Proc. of the 22nd ICML Workshop on Learning with Partially Classified Training Data, 2005.
[7] I. Tsang and J. Kwok, "Large-scale sparsified manifold regularization," Advances in Neural Information Processing Systems, vol. 19, p. 1401, 2007.
[8] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," School of Computer Science, Carnegie Mellon University, Tech. Rep. CMU-CALD-02-107, 2002.
[9] X. Zhu, Z. Ghahramani, and J. D. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in ICML, 2003, pp. 912-919.
[10] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," in NIPS, 2003.
[11] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 985-992.
[12] A. George and J. Liu, Computer Solution of Large Sparse Positive Definite Systems, ser. Prentice-Hall Series in Computational Mathematics. Prentice-Hall, 1981.
[13] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Scholkopf, "Ranking on data manifolds," in Advances in Neural Information Processing Systems 16, vol. 16. The MIT Press, 2004, p. 169.
[14] J. He, M. Li, H. Zhang, H. Tong, and C. Zhang, "Manifold-ranking based image retrieval," in Proceedings of the 12th Annual ACM International Conference on Multimedia. ACM, 2004, pp. 9-16.
[15] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[16] A. Argyriou, "Efficient approximation methods for harmonic semi-supervised learning," Master's thesis, University College London, UK, 2004.
[17] J. Nocedal and S. Wright, Numerical Optimization. Springer Verlag, 1999.
[18] F. Chung, Spectral Graph Theory. American Mathematical Society, 1997, no. 92.
[19] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, "Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate," ACM Trans. Math. Softw., vol. 35, pp. 22:1-22:14, October 2008.